TWI323422B - Method and apparatus for cooperative multithreading - Google Patents

Method and apparatus for cooperative multithreading

Info

Publication number
TWI323422B
Authority
TW
Taiwan
Prior art keywords
instruction
auxiliary
micro
vliw
thread
Prior art date
Application number
TW95130152A
Other languages
Chinese (zh)
Other versions
TW200811709A (en)
Inventor
Tienfu Chen
Shuhsuan Chou
Chiehjen Cheng
Zhiheng Kang
Original Assignee
Tienfu Chen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tienfu Chen
Priority to TW95130152A
Publication of TW200811709A
Application granted
Publication of TWI323422B

Description

IX. Description of the Invention

[Technical Field]

The present invention relates to a multithreaded processing method and a multithreaded processing architecture using the method, and more particularly to a cooperative multithreading method and a cooperative multithreading architecture using the method.

[Prior Art]

As processing power has grown, central processing units with digital signal processors have found increasing use in the multimedia field. A processor with parallel instruction pipelines can, in principle, process multiple instructions in parallel. Because of data dependencies, however, exploiting instruction-level parallelism alone leads to low utilization of the functional units. Thread-level parallelism is therefore used instead: multiple threads execute simultaneously to raise functional-unit utilization.

The superscalar processors introduced by Intel use dynamic thread creation together with detection circuitry that detects mis-speculation while threads execute. Applying such a multithreaded superscalar design to an embedded processor, however, brings high power consumption and excessive design complexity.

A multithreaded very long instruction word (VLIW) processor also runs into problems when fetching VLIW instructions from multiple threads. In a VLIW architecture, for example, the fixed fetch bandwidth allows only one VLIW instruction to be fetched from one thread at a time, so thread-switching timing becomes critical with respect to cache misses and branch mispredictions.

Low power consumption and small die area are primary considerations for an embedded processor. Other design pressures must also be considered, such as the rapid evolution of algorithms and changes in architecture. Designing an application-specific integrated circuit (ASIC), for example, takes a long time. An ASIC offers low power consumption and small die area, but once the design is finished its algorithm and specification cannot be changed at will. Engineers therefore tend to use processors or reconfigurable tools so that the required changes can be made efficiently in software. In addition, for multimedia applications the processor must be designed to combine functions so as to handle different data types, such as video and audio.

A new cooperative multithreading method and architecture are therefore needed to achieve fast data processing.

[Summary of the Invention]

An object of the present invention is to provide a processor for processing different embedded data types. Another object is to provide a cooperative multithreading architecture. A further object is to provide a cooperative multithreading method. Yet another object is to provide a register-based data-exchange mechanism.

According to the above objects, a cooperative multithreading architecture is provided, comprising: an instruction cache for providing micro-VLIW instructions; a first cluster, connected to the instruction cache to fetch micro-VLIW instructions, for performing routine computation; and a second cluster, connected to the instruction cache to fetch micro-VLIW instructions, for accelerated execution. The second cluster further comprises a second front-end module, connected to the instruction cache, for requesting and dispatching micro-VLIW instructions; an auxiliary dynamic scheduler, connected to the second front-end module, for dispatching micro-VLIW instructions; a non-shared data path, connected to the second front-end module, for providing a processing data path; and a shared data path, connected to the auxiliary dynamic scheduler, for assisting in managing the non-shared data path. The second front-end module dispatches micro-VLIW instructions to the auxiliary dynamic scheduler and to the non-shared data path, and the first and second clusters execute their respective micro-VLIW instructions in parallel.

The shared data path further comprises a plurality of auxiliary functional units, connected to the auxiliary dynamic scheduler, for receiving micro-VLIW instructions; an auxiliary file-exchange register, connected to the auxiliary functional units, for transferring read and write requests; and a plurality of auxiliary register files, connected to the auxiliary file-exchange register, for providing control data.

The non-shared data path further comprises a plurality of acceleration functional units, connected to the second front-end module, for receiving micro-VLIW instructions; an acceleration file-exchange register, connected to the acceleration functional units, for transferring read and write requests; and a plurality of acceleration register files, connected to the acceleration file-exchange register, for providing accelerated computation.

According to the above objects, a cooperative multithreading method is also provided, comprising: executing a main thread in a first cluster; creating a plurality of helper threads; and executing each helper thread in a second cluster. Executing a helper thread further comprises fetching a micro-VLIW instruction from an instruction cache into a second front-end module; dispatching the micro-VLIW instruction from the second front-end module to an auxiliary dynamic scheduler and to a non-shared data path; selecting and dispatching the micro-VLIW instruction from the auxiliary dynamic scheduler to a shared data path; executing the micro-VLIW instruction in the shared data path; and executing the micro-VLIW instruction in the non-shared data path. The main thread and the helper threads execute in parallel.

Executing a micro-VLIW instruction in the shared data path further comprises: an auxiliary functional unit receiving the micro-VLIW instruction issued by the auxiliary dynamic scheduler; transferring read or write requests from the auxiliary functional unit to an auxiliary file-exchange register; and transferring the read or write requests from the auxiliary file-exchange register to an auxiliary register file.

Executing a micro-VLIW instruction in the non-shared data path further comprises: an acceleration functional unit receiving the micro-VLIW instruction issued by the second front-end module; transferring read or write requests from the acceleration functional unit to an acceleration file-exchange register; and transferring the read or write requests from the acceleration file-exchange register to an acceleration register file.

[Embodiments]

Please refer to FIG. 1, which is a schematic diagram of a cooperative multithreading architecture 100 according to an embodiment of the invention. The cooperative multithreading architecture 100 comprises a first cluster 102 and a second cluster 104. The first cluster 102 executes a main thread, and the second cluster 104 executes helper threads.

The first cluster 102 performs control and routine computation. It comprises a first front-end module 110 and a main control data path 132, where the main control data path 132 contains a plurality of functional units 112 and a plurality of register files 114. The first front-end module 110 supports a reduced instruction set computing (RISC) repertoire of branch, load, store, arithmetic, and logic operations. The functional units 112 provide multiply-and-add and single instruction multiple data (SIMD) operations.
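As a rough illustration of the dispatch described in the summary above, in which the second front-end module fans a fetched micro-VLIW word out to the auxiliary dynamic scheduler and to the non-shared data path, the following C sketch splits one word into two slots. The 32-bit word size and the 16-bit slot layout are assumptions made for illustration only; the patent does not specify an instruction encoding.

```c
#include <stdint.h>

/* Sketch of the micro-VLIW fan-out performed by the second front-end
 * module: one fetched word yields a slot for the auxiliary dynamic
 * scheduler (shared data path) and a slot for the acceleration units
 * (non-shared data path). Word and slot widths are illustrative. */
typedef struct {
    uint16_t aux_slot;    /* routed through the auxiliary dynamic scheduler */
    uint16_t accel_slot;  /* routed directly to the non-shared data path */
} micro_vliw_split;

micro_vliw_split dispatch_micro_vliw(uint32_t word)
{
    micro_vliw_split s;
    s.aux_slot = (uint16_t)(word >> 16);        /* upper half: auxiliary op */
    s.accel_slot = (uint16_t)(word & 0xFFFFu);  /* lower half: acceleration op */
    return s;
}
```

The point of the split is that the two halves travel on independent paths and can be consumed in the same cycle by different functional units.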

In addition, the first cluster 102 is responsible for creating the helper threads.

The second cluster 104 is used for accelerated execution. It comprises a second front-end module 116, an auxiliary dynamic scheduler 118, a shared data path 134, and a non-shared data path 136.

The shared data path 134 contains a plurality of auxiliary functional units 120, an auxiliary file-exchange register 122, and a plurality of auxiliary register files 124. The second front-end module 116 is connected to the instruction cache 106, the auxiliary dynamic scheduler 118 is connected to the second front-end module 116, the auxiliary functional units 120 are connected to the auxiliary dynamic scheduler 118, the auxiliary file-exchange register 122 is connected to the auxiliary functional units 120, and the auxiliary register files 124 are connected to the auxiliary file-exchange register 122.

The non-shared data path 136 contains a plurality of acceleration functional units 126, an acceleration file-exchange register 128, and a plurality of acceleration register files 130. The acceleration functional units 126 are connected to the second front-end module 116, the acceleration file-exchange register 128 is connected to the acceleration functional units 126, and the acceleration register files 130 are connected to the acceleration file-exchange register 128.

The acceleration functional units 126 speed up data processing for embedded applications. Every helper thread may use any of the auxiliary functional units 120, which assist in controlling the helper threads. For example, an auxiliary functional unit 120 in the shared data path 134 loads data from the data cache 108 and transfers it to an acceleration register file 130 in the non-shared data path 136. An auxiliary functional unit 120 accesses the data in the auxiliary register files 124 through the auxiliary file-exchange register 122.
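The staging step described above, in which an auxiliary functional unit loads operands from the data cache 108 and forwards them into an acceleration register file 130, can be sketched as follows. The flat-array model of the cache and the eight-word register-file size are illustrative assumptions, not taken from the patent.

```c
/* Sketch of operand staging: an auxiliary functional unit reads a block
 * of operands from the data cache and forwards it into an acceleration
 * register file of the non-shared data path, so that the acceleration
 * unit finds its inputs ready when it starts. */
#define ACCEL_RF_WORDS 8

int stage_operands(const int *data_cache, int base,
                   int accel_rf[ACCEL_RF_WORDS], int count)
{
    if (count > ACCEL_RF_WORDS)
        count = ACCEL_RF_WORDS;  /* clamp to the register-file capacity */
    for (int i = 0; i < count; ++i)
        accel_rf[i] = data_cache[base + i];
    return count;                /* number of words actually staged */
}
```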

Each helper thread is allocated one auxiliary register file 124 for helper-thread flow control. As an example of firmware usage, each helper thread is also allocated two acceleration register files 130 to provide a wider data-processing path: one acceleration register file 130 is used for loading data while the other is used for processing data.

Referring to FIG. 1, the main thread creates the helper threads. When creating a helper thread, the main thread assigns it one auxiliary register file 124 and two acceleration register files 130. The helper thread then accesses the data in its acceleration register files 130 through the acceleration functional units 126 and the acceleration file-exchange register 128.

In one example, a dual-port instruction cache 106 is used, each port 128 bits wide. The data cache 108 is also dual-ported, with one 32-bit port and one 64-bit port, to provide wider data flow.

FIG. 2 is a flowchart of creating a helper thread according to an embodiment of the invention. This embodiment creates helper threads in software, which reduces the logic needed to create a helper thread as well as the detection logic used for speculation detection and recovery. When the main thread 200 detects a start-thread instruction, it uses parameters of the main thread 200, such as a program counter value, to create a helper thread 202. In one example, each helper thread 202 has its own program counter value so that it can fetch the firmware it needs from the memory system. The main thread 200 and the helper threads 202 then execute concurrently in the first cluster 102 and the second cluster 104, respectively. Consistency between the main thread 200 and a helper thread 202 is maintained by having the main thread 200 check whether the helper thread 202 has finished executing its data flow.

In one example, two programs written in the C language provide a user-friendly development environment. The first program, the "helper-thread creation program", detects the start-thread instruction. The second program, the "check-thread program", detects whether a helper thread has finished executing. Both programs use inline assembly language to reduce the overhead incurred when the main thread creates a helper thread or checks a helper thread's status. They combine inline assembly with C to achieve these purposes, although those skilled in the art may use other programming languages or methods.

FIG. 3 lists a helper-thread creation program according to an embodiment of the invention. The user only needs to supply four parameters. The parameter thread_id 33 indicates which helper thread to create. The parameter thread_pc_value 32 is the start address of the helper thread's firmware. The parameter bank_usage 31 configures which auxiliary register file and acceleration register files are used. The parameter thread_parameter_address 30 passes the start address of a parameter table from the main thread to the helper thread. In the helper-thread creation program, an "if" conditional statement restricts and validates the helper thread; inline assembly, following the GCC inline-assembly conventions, then issues the "startt" instruction to create the helper thread.

FIG. 4 lists a check-thread program written in C and inline assembly according to an embodiment of the invention. Its parameter is thread_id 41, and an "if" conditional statement restricts and validates the helper thread. The main thread uses the "msr" instruction 42 to copy a helper thread's status information into a register file 114 of the first cluster 102; the contents of the register file 114 are then masked to obtain the status of the desired helper thread.

FIG. 5 is a schematic diagram of the second front-end module 116 together with the instruction cache 106 according to an embodiment of the invention. The second front-end module 116 comprises a program-counter address generator 502, an instruction cache scheduler (round robin) 504, and a plurality of dispatchers 500. The second front-end module 116 fetches a micro-VLIW instruction from the instruction cache 106 and uses the dispatchers 500 to dispatch it to the auxiliary dynamic scheduler 118 and to the non-shared data path 136.

The program-counter address generator 502 generates an address and uses it to request a micro-VLIW instruction from the instruction cache 106.

Referring to FIG. 5, the instruction cache scheduler 504 sends a request 508 to the instruction cache 106 and receives micro-VLIW instruction data 510. Because of a port constraint, only one helper thread may use the instruction cache 106 at a time. The instruction cache scheduler 504 therefore uses a thread-switch mechanism, based on the state of the helper threads, to select a helper thread.

The thread-switch mechanism uses a round-robin scheduling policy, as claimed in one embodiment of the invention, which treats every helper thread as having the same priority. In one embodiment, four helper threads requesting access to the instruction cache 106 are served by the round-robin policy as follows:

1. The four helper threads HT1, HT2, HT3, and HT4 request access to the instruction cache 106 through the instruction cache scheduler 504.

2. The ID of the helper thread most recently granted access to the instruction cache 106 by the instruction cache scheduler 504 is N.

3. The helper threads HT1, HT2, HT3, and HT4 are granted access to the instruction cache 106 in the order (N+1)%4, (N+2)%4, (N+3)%4, and (N+4)%4.

This thread-switch mechanism simplifies the design, and because every helper thread is served in consecutive order it avoids helper-thread starvation.

Referring again to FIG. 5, a dispatcher 500 receives a helper thread's micro-VLIW instruction from the instruction cache scheduler 504 and stores it in a buffer 506. The dispatchers 500 (for example, N of them) take each micro-VLIW instruction, which is a read or write request, from the scheduler 504 and dispatch it to the auxiliary dynamic scheduler 118 and to the non-shared data path 136.

FIG. 6 shows the buffers 506 of the second front-end module dispatching micro-VLIW instructions according to an embodiment of the invention. Each dispatcher 500 has a buffer 506. In every cycle, the micro-VLIW instructions 610 and 612 in the buffers 506 are delivered to the auxiliary dynamic scheduler 118 and to the non-shared data path 136, respectively. In every cycle, therefore, the auxiliary dynamic scheduler 118 and the non-shared data path 136 receive N micro-VLIW instructions 610 and 612 from the N helper threads created by the main thread.

Another design consideration is the number of auxiliary functional units 120 needed to assist the acceleration functional units 126. Each acceleration functional unit 126 is responsible for accelerated execution, so the data it will operate on must be prepared in advance. Because of area and power constraints, however, the number of auxiliary functional units 120 may be smaller than the number of acceleration functional units 126. Since up to N micro-VLIW instructions 610 may be dispatched to the auxiliary functional units 120 in each cycle, the auxiliary dynamic scheduler 118 must schedule the micro-VLIW instructions 610 onto the auxiliary functional units 120.

Referring to FIG. 1 and FIG. 6, the auxiliary dynamic scheduler 118 is connected between the second front-end module 116 and the auxiliary functional units 120. It uses the round-robin scheduling policy described above, together with the helper-thread ID, to identify a micro-VLIW instruction 610 and deliver it to an auxiliary functional unit 120. Note that while an auxiliary functional unit 120 is executing a repeated instruction, delivery of a micro-VLIW instruction 610 to that unit is stalled; the micro-VLIW instruction 610 retries in every cycle until the auxiliary functional unit 120 has finished the repeated instruction.

The round-robin scheduling policy establishes a priority order among the helper threads (for example, M helper threads), and the helper thread with the highest priority may deliver its micro-instruction, that is, its micro-VLIW instruction, to an auxiliary functional unit 120. The number of helper threads served equals the number of auxiliary functional units 120; M helper threads require M auxiliary functional units. After the auxiliary dynamic scheduler 118 selects the highest-priority helper thread, that thread becomes the lowest-priority thread in the next round, which avoids helper-thread starvation.

The auxiliary functional units 120 assist in managing the helper threads, and each helper thread uses the auxiliary register file 124 allocated to it. Each auxiliary functional unit 120 performs reduced instruction set computing (RISC) operations such as load, store, and arithmetic. When a helper thread needs to read its auxiliary register file 124, the helper-thread ID accompanies the request through the auxiliary functional unit 120, and the auxiliary file-exchange register 122 of FIG. 1 uses that ID to read the required auxiliary register file 124.

The acceleration functional units 126 perform the acceleration. The second cluster of an embodiment of the invention may be arranged as follows. For a multimedia application, for example, acceleration functional units 126 of different types may be included to meet real-time constraints. With the assistance of an acceleration functional unit 126, a single acceleration instruction completes a task that would take a conventional RISC functional unit hundreds of cycles. An MPEG-4 codec, for example, needs four acceleration functional units 126: two vector functional units, a butterfly functional unit, and a variable length coding/variable length decoding (VLC/VLD) functional unit. The vector functional units are responsible for SIMD operations, processing multiple data segments in parallel; SIMD operation accelerates image computation. The butterfly functional unit also handles SIMD data types, but its main work consists of multiply-and-add and matrix-multiply operations, and it can also be used to accelerate DCT/IDCT computation. The VLC/VLD functional unit accelerates the VLC and VLD operations of MPEG-4.

Referring to FIG. 1, the shared data path 134 has N auxiliary register files 124 and the non-shared data path 136 has 2N acceleration register files 130, where N is the number of acceleration functional units 126. If every helper thread could use any two acceleration register files 130, however, the logic of the acceleration file-exchange register 128 would become complex, so a partial mapping mechanism is used.
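The round-robin selection used by the instruction cache scheduler 504, which grants access in the order (N+1)%4, (N+2)%4, (N+3)%4, (N+4)%4 after the last granted thread N, can be sketched in C as follows. The function and variable names are illustrative and do not appear in the patent.

```c
#define NUM_HELPER_THREADS 4

/* Round-robin arbiter for the single instruction-cache port: starting
 * from the thread after the most recently granted one, scan the helper
 * threads in consecutive order and grant the first one with a pending
 * request. Returns the granted thread ID, or -1 if none is requesting. */
int icache_arbitrate(const int requesting[NUM_HELPER_THREADS], int last_granted)
{
    for (int offset = 1; offset <= NUM_HELPER_THREADS; ++offset) {
        int candidate = (last_granted + offset) % NUM_HELPER_THREADS;
        if (requesting[candidate])
            return candidate;
    }
    return -1; /* no pending requests this cycle */
}
```

Serving the candidates in strictly consecutive order is what rules out starvation: every requesting thread is examined within at most four grants.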

The partial mapping mechanism reduces the logic complexity of the acceleration file-exchange register 128 by assigning each acceleration functional unit 126 to a subset of the acceleration register files 130.

FIGS. 7A through 7D are schematic diagrams of the partial mapping mechanism according to an embodiment of the invention. For example, acceleration functional unit 1 (700) and acceleration functional unit 2 (701) may use acceleration register file 1 through acceleration register file 6 (710, 711, 712, 713, 714, and 715).

Acceleration functional unit 3 (702) and acceleration functional unit 4 (703) may use acceleration register file 5 through acceleration register file 8 (714, 715, 716, and 717). Selecting an acceleration register file 130 relies on several multiplexers. FIG. 7B depicts read requests to the acceleration register files 130, FIGS. 7C and 7D depict the return of the requested data, and FIG. 7A depicts write operations to the acceleration register files 130.

FIG. 8 shows the use of firmware code according to an embodiment of the invention. The program counter 81 points into a memory segment 82 that contains the required firmware 83. The second front-end module 116 of the second cluster 104 reads the firmware 83 and dispatches it to the acceleration functional units 126 and, through the auxiliary dynamic scheduler 118, to the auxiliary functional units 120.

FIG. 9 is a flowchart of the main-thread program according to an embodiment of the invention. After the main thread starts, it creates the helper threads used for acceleration. The most important task is arranging the order of the helper threads and their resource dependency 91. When a helper thread is stopped, it writes a message into its allocated auxiliary register file, and this message can be used to check whether the helper thread has stopped 92.
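The partial mapping of FIGS. 7A through 7D can be sketched as a pair of per-unit windows over the register files, which is what lets the exchange register use narrow multiplexers instead of a full any-to-any crossbar. Unit indices 0 to 3 stand for acceleration functional units 700 to 703, and file indices 0 to 7 stand for acceleration register files 710 to 717; this numbering is for illustration only.

```c
/* Partial-mapping sketch: each acceleration unit may reach only a fixed
 * window of acceleration register files. Units 0-1 see files 0-5 and
 * units 2-3 see files 4-7, so files 4 and 5 are reachable from both
 * windows, mirroring the overlap in FIGS. 7A-7D. */
static const int map_lo[4] = {0, 0, 4, 4};
static const int map_hi[4] = {5, 5, 7, 7};

int unit_can_access(int unit, int reg_file)
{
    if (unit < 0 || unit > 3 || reg_file < 0 || reg_file > 7)
        return 0;
    return reg_file >= map_lo[unit] && reg_file <= map_hi[unit];
}
```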

When a helper thread halts, it writes a message into the auxiliary register file allocated to it, and this message can be used to check whether the helper thread has halted 92.

FIG. 10 is a flowchart of a helper-thread program according to an embodiment of the present invention. When a helper thread is generated 10_0, it reads the required firmware code from the instruction cache. If the firmware needs to read and write other acceleration register files, a set-memory instruction is used to change the acceleration register-file port pointer 10_1. After the firmware has executed, the helper thread halts 10_2 and the auxiliary functional unit writes a message into the auxiliary register file.

FIG. 11 is a flowchart of the overall program according to an embodiment of the present invention. It shows the time at which a helper thread starts 11_0, the time at which the helper thread halts 11_1, and the time 11_2 at which the main thread checks whether the helper thread has halted. A checkpoint is the point in time at which the main thread checks whether a helper thread has halted.

Although the present invention has been disclosed above in terms of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make various changes and refinements without departing from the spirit and scope of the invention; the scope of protection of the invention is therefore defined by the appended claims.
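Taken together, FIGS. 9 through 11 describe a simple fork/poll handshake. The C sketch below is a sequential software model of that handshake under assumed names (`fork_helper`, `HELPER_DONE`, mailbox register 0), not the patent's hardware: the main thread forks a helper and clears its mailbox, the helper posts a message into its assigned auxiliary register file when it halts, and at a checkpoint the main thread polls that message to decide whether the helper has stopped.

```c
#include <stdint.h>

#define HELPER_DONE 0xD05Eu  /* illustrative completion token */

/* One auxiliary register file per helper thread; register 0 is modeled
   as the mailbox the helper writes when it halts (FIG. 9, step 92). */
static uint32_t aux_regfile[4][8];

/* Fork (FIG. 10, step 10_0): clear the mailbox and hand the helper a
   program-counter value for its firmware segment (FIG. 8).
   Sequential model only; a real core would start fetching at this PC. */
static void fork_helper(int id, uint32_t firmware_pc)
{
    aux_regfile[id][0] = 0;
    (void)firmware_pc;
}

/* Helper side (step 10_2): after its firmware finishes, it halts and
   posts the completion token into its auxiliary register file. */
static void helper_halt(int id)
{
    aux_regfile[id][0] = HELPER_DONE;
}

/* Checkpoint (FIG. 11, time 11_2): the main thread polls the mailbox
   to learn whether the helper has stopped. */
static int helper_stopped(int id)
{
    return aux_regfile[id][0] == HELPER_DONE;
}
```

The design choice mirrored here is that synchronization is register-based: no interrupt or shared-memory flag is needed, only a read of the helper's auxiliary register file.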
BRIEF DESCRIPTION OF THE DRAWINGS

In order to make the above and other objects and features of the present invention more comprehensible, the accompanying drawings are described in detail as follows:

FIG. 1 is a schematic diagram of a multithreaded cooperative processing architecture according to an embodiment of the present invention;
FIG. 2 is a flowchart of generating a helper thread according to an embodiment of the present invention;
FIG. 3 lists a helper-thread generating program according to an embodiment of the present invention;
FIG. 4 lists a thread-checking program according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the second front-end module according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the dispatcher of the second front-end module according to an embodiment of the present invention;
FIGS. 7A through 7D are schematic diagrams of the partial mapping mechanism according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a software unit according to an embodiment of the present invention;
FIG. 9 is a flowchart of the main-thread program according to an embodiment of the present invention;
FIG. 10 is a flowchart of a helper-thread program according to an embodiment of the present invention; and
FIG. 11 is a flowchart of the overall program according to an embodiment of the present invention.
DESCRIPTION OF MAIN REFERENCE NUMERALS

30, 31, 32, 33, 34, 41, 42: program instructions
81: program counter
82: memory segment
83: firmware
90, 91, 92, 10_0, 10_1, 10_2: steps
11_0, 11_1, 11_2: time points
100: multithreaded cooperative processing architecture
102: first cluster
104: second cluster
106: instruction cache
108: data cache
110: first front-end module
112: functional units
114: register files
116: second front-end module
118: auxiliary dynamic scheduler
120: auxiliary functional units
122: auxiliary file-exchange register
124: auxiliary register files
126, 700, 701, 702, 703: acceleration functional units
128: acceleration file-exchange register
130, 710, 711, 712, 713, 714, 715, 716, 717: acceleration register files
132: main control path
134: shared data path
136: non-shared data path
200: main thread
202: helper threads
502: program-counter address generator
504: instruction cache scheduler (round-robin)
506: buffer
508: instructions
510, 610, 612: micro-VLIW instruction data

Claims (1)

1. A multithreaded cooperative processing device, comprising:
an instruction cache, for providing a micro-VLIW instruction;
a first cluster circuit, connected to the instruction cache to read the micro-VLIW instruction, for performing routine computation; and
a second cluster circuit, connected to the instruction cache to read the micro-VLIW instruction, for performing accelerated execution, wherein the second cluster circuit further comprises:
a second front-end module, connected to the instruction cache, for requesting and dispatching the micro-VLIW instruction;
an auxiliary dynamic scheduler, connected to the second front-end module, for dispatching the micro-VLIW instruction;
a non-shared data path, connected to the second front-end module, for providing a processing data path; and
a shared data path, connected to the auxiliary dynamic scheduler, for assisting in managing the non-shared data path;
wherein the second front-end module dispatches the micro-VLIW instruction to the auxiliary dynamic scheduler and the non-shared data path, and the first cluster circuit and the second cluster circuit execute in parallel;
wherein the micro-VLIW instruction is a very long instruction word (VLIW) in micro-instruction form.

2. The multithreaded cooperative processing device of claim 1, wherein the second front-end module further comprises an instruction cache scheduler to request and dispatch the micro-VLIW instruction.

3. The multithreaded cooperative processing device of claim 2, wherein the instruction cache scheduler uses a round-robin scheduling policy to request the micro-VLIW instruction from the instruction cache.

4. The multithreaded cooperative processing device of claim 1, wherein the auxiliary dynamic scheduler uses a round-robin scheduling policy.

5. The multithreaded cooperative processing device of claim 1, wherein the shared data path further comprises:
at least one auxiliary functional unit, connected to the auxiliary dynamic scheduler, for receiving the micro-VLIW instruction;
an auxiliary file-exchange register, connected to the auxiliary functional unit, for transferring a read or write request; and
at least one auxiliary register file, connected to the auxiliary file-exchange register, for providing control data.

6. The multithreaded cooperative processing device of claim 5, wherein the non-shared data path further comprises:
at least one acceleration functional unit, connected to the second front-end module, for receiving the micro-VLIW instruction;
an acceleration file-exchange register, connected to the acceleration functional unit, for transferring a read or write request; and
at least one acceleration register file, connected to the acceleration file-exchange register, for providing accelerated computation.

7. The multithreaded cooperative processing device of claim 6, wherein the acceleration file-exchange register uses a partial mapping mechanism.

8. A multithreaded cooperative processing method, comprising:
executing a main thread in a first cluster circuit;
generating a plurality of helper threads; and
executing each of the helper threads in a second cluster circuit, further comprising:
reading a micro-VLIW instruction from an instruction cache into a second front-end module;
dispatching the micro-VLIW instruction from the second front-end module to an auxiliary dynamic scheduler and a non-shared data path;
selecting and dispatching the micro-VLIW instruction from the auxiliary dynamic scheduler to a shared data path;
executing the micro-VLIW instruction in the shared data path; and
executing the micro-VLIW instruction in the non-shared data path;
wherein the main thread and the helper threads execute in parallel;
wherein the micro-VLIW instruction is a very long instruction word (VLIW) in micro-instruction form.

9. The multithreaded cooperative processing method of claim 8, wherein generating each of the helper threads further comprises:
detecting a thread-start instruction in the main thread; and
passing a plurality of parameters from the main thread to the helper thread.

10. The multithreaded cooperative processing method of claim 8, wherein the parameters comprise a program counter value.

11. The multithreaded cooperative processing method of claim 8, wherein the second front-end module uses a round-robin scheduling policy.

12. The multithreaded cooperative processing method of claim 8, wherein the auxiliary dynamic scheduler uses a round-robin scheduling policy.

13. The multithreaded cooperative processing method of claim 8, wherein executing the micro-VLIW instruction in the shared data path further comprises:
receiving, by an auxiliary functional unit, the micro-VLIW instruction issued from the auxiliary dynamic scheduler;
transferring at least one read or write request from the auxiliary functional unit to an auxiliary file-exchange register; and
transferring the read or write request from the auxiliary file-exchange register to an auxiliary register file.

14. The multithreaded cooperative processing method of claim 8, wherein executing the micro-VLIW instruction in the non-shared data path further comprises:
receiving, by an acceleration functional unit, the micro-VLIW instruction issued from the second front-end module;
transferring at least one read or write request from the acceleration functional unit to an acceleration file-exchange register; and
transferring the read or write request from the acceleration file-exchange register to an acceleration register file.

15. The multithreaded cooperative processing method of claim 14, wherein the acceleration file-exchange register uses a partial mapping mechanism to transfer the read or write request to the acceleration register file.

16. A multithreaded cooperative processing device, comprising:
an instruction cache, for providing a micro-VLIW instruction;
a first cluster circuit, connected to the instruction cache to read the micro-VLIW instruction, for performing routine computation; and
a second cluster circuit, connected to the instruction cache to read the micro-VLIW instruction, for performing accelerated execution, wherein the second cluster circuit further comprises:
a second front-end module, connected to the instruction cache, for requesting and dispatching the micro-VLIW instruction;
an auxiliary dynamic scheduler, connected to the second front-end module, for dispatching the micro-VLIW instruction;
at least one auxiliary functional unit, connected to the auxiliary dynamic scheduler, for receiving the micro-VLIW instruction;
an auxiliary file-exchange register, connected to the auxiliary functional units, for transferring at least one read or write request;
at least one auxiliary register file, connected to the auxiliary file-exchange register, for providing control data;
at least one acceleration functional unit, connected to the second front-end module, for receiving the micro-VLIW instruction;
an acceleration file-exchange register, connected to the acceleration functional unit, for transferring at least one read or write request; and
at least one acceleration register file, connected to the acceleration file-exchange register, for providing accelerated computation;
wherein the second front-end module dispatches the micro-VLIW instruction to the auxiliary dynamic scheduler and the acceleration functional unit, and the first cluster circuit and the second cluster circuit execute in parallel;
wherein the micro-VLIW instruction is a very long instruction word (VLIW) in micro-instruction form.

17. The multithreaded cooperative processing device of claim 16, wherein the second front-end module further comprises an instruction cache scheduler to request and dispatch the micro-VLIW instruction.

18. The multithreaded cooperative processing device of claim 17, wherein the instruction cache scheduler uses a round-robin scheduling policy to request the micro-VLIW instruction from the instruction cache.

19. The multithreaded cooperative processing device of claim 16, wherein the auxiliary dynamic scheduler uses a round-robin scheduling policy.

20. The multithreaded cooperative processing device of claim 16, wherein the acceleration file-exchange register uses a partial mapping mechanism.
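As a concrete reading of the round-robin scheduling policy recited in claims 3-4, 11-12 and 18-19, a fetch scheduler can be sketched as below. This is an assumed model, not the claimed circuit: each fetch slot is offered to the next thread after the one served last, skipping threads with no pending micro-VLIW request, so the single instruction-cache fetch port is shared fairly among the main thread and the helper threads.

```c
#define MAX_THREADS 4

/* Round-robin fetch scheduler: pick the next ready thread after the one
   served last cycle; return -1 when no thread has a pending fetch. */
struct rr_sched { int last; };

static int rr_pick(struct rr_sched *s, const int ready[MAX_THREADS])
{
    for (int i = 1; i <= MAX_THREADS; i++) {
        int t = (s->last + i) % MAX_THREADS;
        if (ready[t]) {
            s->last = t;     /* remember who was served this cycle */
            return t;
        }
    }
    return -1;
}
```

Because a stalled thread (for example, one waiting on a cache miss) simply drops out of the ready set, it forfeits its slot without blocking the other threads.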
TW95130152A 2006-08-16 2006-08-16 Method and apparatus for cooperative multithreading TWI323422B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW95130152A TWI323422B (en) 2006-08-16 2006-08-16 Method and apparatus for cooperative multithreading


Publications (2)

Publication Number Publication Date
TW200811709A TW200811709A (en) 2008-03-01
TWI323422B true TWI323422B (en) 2010-04-11

Family

ID=44767811


Country Status (1)

Country Link
TW (1) TWI323422B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9489246B2 (en) * 2011-04-02 2016-11-08 Intel Corporation Method and device for determining parallelism of tasks of a program

Also Published As

Publication number Publication date
TW200811709A (en) 2008-03-01

Similar Documents

Publication Publication Date Title
US11494194B2 (en) Processor having multiple cores, shared core extension logic, and shared core extension utilization instructions
TWI628594B (en) User-level fork and join processors, methods, systems, and instructions
US6718457B2 (en) Multiple-thread processor for threaded software applications
US6279100B1 (en) Local stall control method and structure in a microprocessor
US20080046689A1 (en) Method and apparatus for cooperative multithreading
CN100357884C (en) Method, processor and system for processing instructions
US6643763B1 (en) Register pipe for multi-processing engine environment
US6671827B2 (en) Journaling for parallel hardware threads in multithreaded processor
US8069340B2 (en) Microprocessor with microarchitecture for efficiently executing read/modify/write memory operand instructions
US6944850B2 (en) Hop method for stepping parallel hardware threads
JP5666473B2 (en) Multi-threaded data processing system
US7263604B2 (en) Heterogeneous parallel multithread processor (HPMT) with local context memory sets for respective processor type groups and global context memory
US20150074353A1 (en) System and Method for an Asynchronous Processor with Multiple Threading
CN116414464A (en) Method and device for scheduling tasks, electronic equipment and computer readable medium
TWI323422B (en) Method and apparatus for cooperative multithreading
US6625634B1 (en) Efficient implementation of multiprecision arithmetic
US11954491B2 (en) Multi-threading microprocessor with a time counter for statically dispatching instructions
US20230350680A1 (en) Microprocessor with baseline and extended register sets
CN114489793A (en) User timer programmed directly by application

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees