TWI323422B - Method and apparatus for cooperative multithreading - Google Patents

Method and apparatus for cooperative multithreading

Info

Publication number
TWI323422B
Authority
TW
Taiwan
Prior art keywords
instruction
auxiliary
micro
vliw
thread
Prior art date
Application number
TW95130152A
Other languages
Chinese (zh)
Other versions
TW200811709A (en)
Inventor
Tienfu Chen
Shuhsuan Chou
Chiehjen Cheng
Zhiheng Kang
Original Assignee
Tienfu Chen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tienfu Chen
Priority to TW95130152A
Publication of TW200811709A
Application granted
Publication of TWI323422B

Description

IX. Description of the Invention

[Technical Field]

The present invention relates to a multithreaded processing method and a multithreaded processing architecture using the method, and more particularly to a cooperative multithreading method and a cooperative multithreading architecture using the method.

[Prior Art]

As processing power has grown, central processing units with digital signal processors have found increasing use in the multimedia field. A processor with parallel instruction pipelines can, in principle, process multiple instructions in parallel. Because of data dependencies, however, exploiting instruction-level parallelism alone leads to low utilization of the functional units. Thread-level parallelism is therefore used instead: multiple threads execute simultaneously to raise functional-unit utilization.

The superscalar processors introduced by Intel use dynamic thread creation together with detection circuitry that detects mis-speculation while threads execute. Applying such a multithreaded superscalar design to an embedded processor, however, brings high power consumption and excessive design complexity.

A multithreaded very long instruction word (VLIW) processor also runs into problems when fetching VLIW instructions from multiple threads. In a VLIW architecture, for example, the fixed fetch bandwidth allows only one VLIW instruction to be fetched from one thread at a time, so thread-switching timing becomes critical with respect to cache misses and branch mispredictions.

Low power consumption and small die area are primary considerations for an embedded processor. Other design pressures must also be considered, such as the rapid evolution of algorithms and changes in architecture. Designing an application-specific integrated circuit (ASIC), for example, takes a long time. An ASIC offers low power consumption and small die area, but once the design is finished its algorithm and specification cannot be changed at will. Engineers therefore tend to use processors or reconfigurable tools so that the required changes can be made efficiently in software. In addition, for multimedia applications the processor must be designed to combine functions so as to handle different data types, such as video and audio.

A new cooperative multithreading method and architecture are therefore needed to achieve fast data processing.

[Summary of the Invention]

An object of the present invention is to provide a processor for processing different embedded data types. Another object is to provide a cooperative multithreading architecture. A further object is to provide a cooperative multithreading method. Yet another object is to provide a register-based data-exchange mechanism.

According to the above objects, a cooperative multithreading architecture is provided, comprising: an instruction cache for providing micro-VLIW instructions; a first cluster, connected to the instruction cache to fetch micro-VLIW instructions, for performing routine computation; and a second cluster, connected to the instruction cache to fetch micro-VLIW instructions, for accelerated execution. The second cluster further comprises a second front-end module, connected to the instruction cache, for requesting and dispatching micro-VLIW instructions; an auxiliary dynamic scheduler, connected to the second front-end module, for dispatching micro-VLIW instructions; a non-shared data path, connected to the second front-end module, for providing a processing data path; and a shared data path, connected to the auxiliary dynamic scheduler, for assisting in managing the non-shared data path. The second front-end module dispatches micro-VLIW instructions to the auxiliary dynamic scheduler and to the non-shared data path, and the first and second clusters execute their respective micro-VLIW instructions in parallel.

The shared data path further comprises a plurality of auxiliary functional units, connected to the auxiliary dynamic scheduler, for receiving micro-VLIW instructions; an auxiliary file-exchange register, connected to the auxiliary functional units, for transferring read and write requests; and a plurality of auxiliary register files, connected to the auxiliary file-exchange register, for providing control data.

The non-shared data path further comprises a plurality of acceleration functional units, connected to the second front-end module, for receiving micro-VLIW instructions; an acceleration file-exchange register, connected to the acceleration functional units, for transferring read and write requests; and a plurality of acceleration register files, connected to the acceleration file-exchange register, for providing accelerated computation.

According to the above objects, a cooperative multithreading method is also provided, comprising: executing a main thread in a first cluster; creating a plurality of helper threads; and executing each helper thread in a second cluster. Executing a helper thread further comprises fetching a micro-VLIW instruction from an instruction cache into a second front-end module; dispatching the micro-VLIW instruction from the second front-end module to an auxiliary dynamic scheduler and to a non-shared data path; selecting and dispatching the micro-VLIW instruction from the auxiliary dynamic scheduler to a shared data path; executing the micro-VLIW instruction in the shared data path; and executing the micro-VLIW instruction in the non-shared data path. The main thread and the helper threads execute in parallel.

Executing a micro-VLIW instruction in the shared data path further comprises: an auxiliary functional unit receiving the micro-VLIW instruction issued by the auxiliary dynamic scheduler; transferring read or write requests from the auxiliary functional unit to an auxiliary file-exchange register; and transferring the read or write requests from the auxiliary file-exchange register to an auxiliary register file.

Executing a micro-VLIW instruction in the non-shared data path further comprises: an acceleration functional unit receiving the micro-VLIW instruction issued by the second front-end module; transferring read or write requests from the acceleration functional unit to an acceleration file-exchange register; and transferring the read or write requests from the acceleration file-exchange register to an acceleration register file.

[Embodiments]

Please refer to FIG. 1, which is a schematic diagram of a cooperative multithreading architecture 100 according to an embodiment of the invention. The cooperative multithreading architecture 100 comprises a first cluster 102 and a second cluster 104. The first cluster 102 executes a main thread, and the second cluster 104 executes helper threads.

The first cluster 102 performs control and routine computation. It comprises a first front-end module 110 and a main control data path 132, where the main control data path 132 contains a plurality of functional units 112 and a plurality of register files 114. The first front-end module 110 supports a reduced instruction set computing (RISC) repertoire of branch, load, store, arithmetic, and logic operations. The functional units 112 provide multiply-and-add and single instruction multiple data (SIMD) operations.
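As a rough illustration of the dispatch described in the summary above, in which the second front-end module fans a fetched micro-VLIW word out to the auxiliary dynamic scheduler and to the non-shared data path, the following C sketch splits one word into two slots. The 32-bit word size and the 16-bit slot layout are assumptions made for illustration only; the patent does not specify an instruction encoding.

```c
#include <stdint.h>

/* Sketch of the micro-VLIW fan-out performed by the second front-end
 * module: one fetched word yields a slot for the auxiliary dynamic
 * scheduler (shared data path) and a slot for the acceleration units
 * (non-shared data path). Word and slot widths are illustrative. */
typedef struct {
    uint16_t aux_slot;    /* routed through the auxiliary dynamic scheduler */
    uint16_t accel_slot;  /* routed directly to the non-shared data path */
} micro_vliw_split;

micro_vliw_split dispatch_micro_vliw(uint32_t word)
{
    micro_vliw_split s;
    s.aux_slot = (uint16_t)(word >> 16);        /* upper half: auxiliary op */
    s.accel_slot = (uint16_t)(word & 0xFFFFu);  /* lower half: acceleration op */
    return s;
}
```

The point of the split is that the two halves travel on independent paths and can be consumed in the same cycle by different functional units.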

In addition, the first cluster 102 is responsible for creating the helper threads.

The second cluster 104 is used for accelerated execution. It comprises a second front-end module 116, an auxiliary dynamic scheduler 118, a shared data path 134, and a non-shared data path 136.

The shared data path 134 contains a plurality of auxiliary functional units 120, an auxiliary file-exchange register 122, and a plurality of auxiliary register files 124. The second front-end module 116 is connected to the instruction cache 106, the auxiliary dynamic scheduler 118 is connected to the second front-end module 116, the auxiliary functional units 120 are connected to the auxiliary dynamic scheduler 118, the auxiliary file-exchange register 122 is connected to the auxiliary functional units 120, and the auxiliary register files 124 are connected to the auxiliary file-exchange register 122.

The non-shared data path 136 contains a plurality of acceleration functional units 126, an acceleration file-exchange register 128, and a plurality of acceleration register files 130. The acceleration functional units 126 are connected to the second front-end module 116, the acceleration file-exchange register 128 is connected to the acceleration functional units 126, and the acceleration register files 130 are connected to the acceleration file-exchange register 128.

The acceleration functional units 126 speed up data processing for embedded applications. Every helper thread may use any of the auxiliary functional units 120, which assist in controlling the helper threads. For example, an auxiliary functional unit 120 in the shared data path 134 loads data from the data cache 108 and transfers it to an acceleration register file 130 in the non-shared data path 136. An auxiliary functional unit 120 accesses the data in the auxiliary register files 124 through the auxiliary file-exchange register 122.
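The staging step described above, in which an auxiliary functional unit loads operands from the data cache 108 and forwards them into an acceleration register file 130, can be sketched as follows. The flat-array model of the cache and the eight-word register-file size are illustrative assumptions, not taken from the patent.

```c
/* Sketch of operand staging: an auxiliary functional unit reads a block
 * of operands from the data cache and forwards it into an acceleration
 * register file of the non-shared data path, so that the acceleration
 * unit finds its inputs ready when it starts. */
#define ACCEL_RF_WORDS 8

int stage_operands(const int *data_cache, int base,
                   int accel_rf[ACCEL_RF_WORDS], int count)
{
    if (count > ACCEL_RF_WORDS)
        count = ACCEL_RF_WORDS;  /* clamp to the register-file capacity */
    for (int i = 0; i < count; ++i)
        accel_rf[i] = data_cache[base + i];
    return count;                /* number of words actually staged */
}
```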

Each helper thread is allocated one auxiliary register file 124 for helper-thread flow control. As an example of firmware usage, each helper thread is also allocated two acceleration register files 130 to provide a wider data-processing path: one acceleration register file 130 is used for loading data while the other is used for processing data.

Referring to FIG. 1, the main thread creates the helper threads. When creating a helper thread, the main thread assigns it one auxiliary register file 124 and two acceleration register files 130. The helper thread then accesses the data in its acceleration register files 130 through the acceleration functional units 126 and the acceleration file-exchange register 128.

In one example, a dual-port instruction cache 106 is used, each port 128 bits wide. The data cache 108 is also dual-ported, with one 32-bit port and one 64-bit port, to provide wider data flow.

FIG. 2 is a flowchart of creating a helper thread according to an embodiment of the invention. This embodiment creates helper threads in software, which reduces the logic needed to create a helper thread as well as the detection logic used for speculation detection and recovery. When the main thread 200 detects a start-thread instruction, it uses parameters of the main thread 200, such as a program counter value, to create a helper thread 202. In one example, each helper thread 202 has its own program counter value so that it can fetch the firmware it needs from the memory system. The main thread 200 and the helper threads 202 then execute concurrently in the first cluster 102 and the second cluster 104, respectively. Consistency between the main thread 200 and a helper thread 202 is maintained by having the main thread 200 check whether the helper thread 202 has finished executing its data flow.

In one example, two programs written in the C language provide a user-friendly development environment. The first program, the "helper-thread creation program", detects the start-thread instruction. The second program, the "check-thread program", detects whether a helper thread has finished executing. Both programs use inline assembly language to reduce the overhead incurred when the main thread creates a helper thread or checks a helper thread's status. They combine inline assembly with C to achieve these purposes, although those skilled in the art may use other programming languages or methods.

FIG. 3 lists a helper-thread creation program according to an embodiment of the invention. The user only needs to supply four parameters. The parameter thread_id 33 indicates which helper thread to create. The parameter thread_pc_value 32 is the start address of the helper thread's firmware. The parameter bank_usage 31 configures which auxiliary register file and acceleration register files are used. The parameter thread_parameter_address 30 passes the start address of a parameter table from the main thread to the helper thread. In the helper-thread creation program, an "if" conditional statement restricts and validates the helper thread; inline assembly, following the GCC inline-assembly conventions, then issues the "startt" instruction to create the helper thread.

FIG. 4 lists a check-thread program written in C and inline assembly according to an embodiment of the invention. Its parameter is thread_id 41, and an "if" conditional statement restricts and validates the helper thread. The main thread uses the "msr" instruction 42 to copy a helper thread's status information into a register file 114 of the first cluster 102; the contents of the register file 114 are then masked to obtain the status of the desired helper thread.

FIG. 5 is a schematic diagram of the second front-end module 116 together with the instruction cache 106 according to an embodiment of the invention. The second front-end module 116 comprises a program-counter address generator 502, an instruction cache scheduler (round robin) 504, and a plurality of dispatchers 500. The second front-end module 116 fetches a micro-VLIW instruction from the instruction cache 106 and uses the dispatchers 500 to dispatch it to the auxiliary dynamic scheduler 118 and to the non-shared data path 136.

The program-counter address generator 502 generates an address and uses it to request a micro-VLIW instruction from the instruction cache 106.

Referring to FIG. 5, the instruction cache scheduler 504 sends a request 508 to the instruction cache 106 and receives micro-VLIW instruction data 510. Because of a port constraint, only one helper thread may use the instruction cache 106 at a time. The instruction cache scheduler 504 therefore uses a thread-switch mechanism, based on the state of the helper threads, to select a helper thread.

The thread-switch mechanism uses a round-robin scheduling policy, as claimed in one embodiment of the invention, which treats every helper thread as having the same priority. In one embodiment, four helper threads requesting access to the instruction cache 106 are served by the round-robin policy as follows:

1. The four helper threads HT1, HT2, HT3, and HT4 request access to the instruction cache 106 through the instruction cache scheduler 504.

2. The ID of the helper thread most recently granted access to the instruction cache 106 by the instruction cache scheduler 504 is N.

3. The helper threads HT1, HT2, HT3, and HT4 are granted access to the instruction cache 106 in the order (N+1)%4, (N+2)%4, (N+3)%4, and (N+4)%4.

This thread-switch mechanism simplifies the design, and because every helper thread is served in consecutive order it avoids helper-thread starvation.

Referring again to FIG. 5, a dispatcher 500 receives a helper thread's micro-VLIW instruction from the instruction cache scheduler 504 and stores it in a buffer 506. The dispatchers 500 (for example, N of them) take each micro-VLIW instruction, which is a read or write request, from the scheduler 504 and dispatch it to the auxiliary dynamic scheduler 118 and to the non-shared data path 136.

FIG. 6 shows the buffers 506 of the second front-end module dispatching micro-VLIW instructions according to an embodiment of the invention. Each dispatcher 500 has a buffer 506. In every cycle, the micro-VLIW instructions 610 and 612 in the buffers 506 are delivered to the auxiliary dynamic scheduler 118 and to the non-shared data path 136, respectively. In every cycle, therefore, the auxiliary dynamic scheduler 118 and the non-shared data path 136 receive N micro-VLIW instructions 610 and 612 from the N helper threads created by the main thread.

Another design consideration is the number of auxiliary functional units 120 needed to assist the acceleration functional units 126. Each acceleration functional unit 126 is responsible for accelerated execution, so the data it will operate on must be prepared in advance. Because of area and power constraints, however, the number of auxiliary functional units 120 may be smaller than the number of acceleration functional units 126. Since up to N micro-VLIW instructions 610 may be dispatched to the auxiliary functional units 120 in each cycle, the auxiliary dynamic scheduler 118 must schedule the micro-VLIW instructions 610 onto the auxiliary functional units 120.

Referring to FIG. 1 and FIG. 6, the auxiliary dynamic scheduler 118 is connected between the second front-end module 116 and the auxiliary functional units 120. It uses the round-robin scheduling policy described above, together with the helper-thread ID, to identify a micro-VLIW instruction 610 and deliver it to an auxiliary functional unit 120. Note that while an auxiliary functional unit 120 is executing a repeated instruction, delivery of a micro-VLIW instruction 610 to that unit is stalled; the micro-VLIW instruction 610 retries in every cycle until the auxiliary functional unit 120 has finished the repeated instruction.

The round-robin scheduling policy establishes a priority order among the helper threads (for example, M helper threads), and the helper thread with the highest priority may deliver its micro-instruction, that is, its micro-VLIW instruction, to an auxiliary functional unit 120. The number of helper threads served equals the number of auxiliary functional units 120; M helper threads require M auxiliary functional units. After the auxiliary dynamic scheduler 118 selects the highest-priority helper thread, that thread becomes the lowest-priority thread in the next round, which avoids helper-thread starvation.

The auxiliary functional units 120 assist in managing the helper threads, and each helper thread uses the auxiliary register file 124 allocated to it. Each auxiliary functional unit 120 performs reduced instruction set computing (RISC) operations such as load, store, and arithmetic. When a helper thread needs to read its auxiliary register file 124, the helper-thread ID accompanies the request through the auxiliary functional unit 120, and the auxiliary file-exchange register 122 of FIG. 1 uses that ID to read the required auxiliary register file 124.

The acceleration functional units 126 perform the acceleration. The second cluster of an embodiment of the invention may be arranged as follows. For a multimedia application, for example, acceleration functional units 126 of different types may be included to meet real-time constraints. With the assistance of an acceleration functional unit 126, a single acceleration instruction completes a task that would take a conventional RISC functional unit hundreds of cycles. An MPEG-4 codec, for example, needs four acceleration functional units 126: two vector functional units, a butterfly functional unit, and a variable length coding/variable length decoding (VLC/VLD) functional unit. The vector functional units are responsible for SIMD operations, processing multiple data segments in parallel; SIMD operation accelerates image computation. The butterfly functional unit also handles SIMD data types, but its main work consists of multiply-and-add and matrix-multiply operations, and it can also be used to accelerate DCT/IDCT computation. The VLC/VLD functional unit accelerates the VLC and VLD operations of MPEG-4.

Referring to FIG. 1, the shared data path 134 has N auxiliary register files 124 and the non-shared data path 136 has 2N acceleration register files 130, where N is the number of acceleration functional units 126. If every helper thread could use any two acceleration register files 130, however, the logic of the acceleration file-exchange register 128 would become complex, so a partial mapping mechanism is used.
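The round-robin selection used by the instruction cache scheduler 504, which grants access in the order (N+1)%4, (N+2)%4, (N+3)%4, (N+4)%4 after the last granted thread N, can be sketched in C as follows. The function and variable names are illustrative and do not appear in the patent.

```c
#define NUM_HELPER_THREADS 4

/* Round-robin arbiter for the single instruction-cache port: starting
 * from the thread after the most recently granted one, scan the helper
 * threads in consecutive order and grant the first one with a pending
 * request. Returns the granted thread ID, or -1 if none is requesting. */
int icache_arbitrate(const int requesting[NUM_HELPER_THREADS], int last_granted)
{
    for (int offset = 1; offset <= NUM_HELPER_THREADS; ++offset) {
        int candidate = (last_granted + offset) % NUM_HELPER_THREADS;
        if (requesting[candidate])
            return candidate;
    }
    return -1; /* no pending requests this cycle */
}
```

Serving the candidates in strictly consecutive order is what rules out starvation: every requesting thread is examined within at most four grants.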

The partial mapping mechanism reduces the logic complexity of the acceleration file-exchange register 128 by assigning each acceleration functional unit 126 to a subset of the acceleration register files 130.

FIGS. 7A through 7D are schematic diagrams of the partial mapping mechanism according to an embodiment of the invention. For example, acceleration functional unit 1 (700) and acceleration functional unit 2 (701) may use acceleration register file 1 through acceleration register file 6 (710, 711, 712, 713, 714, and 715).

Acceleration functional unit 3 (702) and acceleration functional unit 4 (703) may use acceleration register file 5 through acceleration register file 8 (714, 715, 716, and 717). Selecting an acceleration register file 130 relies on several multiplexers. FIG. 7B depicts read requests to the acceleration register files 130, FIGS. 7C and 7D depict the return of the requested data, and FIG. 7A depicts write operations to the acceleration register files 130.

FIG. 8 shows the use of firmware code according to an embodiment of the invention. The program counter 81 points into a memory segment 82 that contains the required firmware 83. The second front-end module 116 of the second cluster 104 reads the firmware 83 and dispatches it to the acceleration functional units 126 and, through the auxiliary dynamic scheduler 118, to the auxiliary functional units 120.

FIG. 9 is a flowchart of the main-thread program according to an embodiment of the invention. After the main thread starts, it creates the helper threads used for acceleration. The most important task is arranging the order of the helper threads and their resource dependency 91. When a helper thread is stopped, it writes a message into its allocated auxiliary register file, and this message can be used to check whether the helper thread has stopped 92.
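The partial mapping of FIGS. 7A through 7D can be sketched as a pair of per-unit windows over the register files, which is what lets the exchange register use narrow multiplexers instead of a full any-to-any crossbar. Unit indices 0 to 3 stand for acceleration functional units 700 to 703, and file indices 0 to 7 stand for acceleration register files 710 to 717; this numbering is for illustration only.

```c
/* Partial-mapping sketch: each acceleration unit may reach only a fixed
 * window of acceleration register files. Units 0-1 see files 0-5 and
 * units 2-3 see files 4-7, so files 4 and 5 are reachable from both
 * windows, mirroring the overlap in FIGS. 7A-7D. */
static const int map_lo[4] = {0, 0, 4, 4};
static const int map_hi[4] = {5, 5, 7, 7};

int unit_can_access(int unit, int reg_file)
{
    if (unit < 0 || unit > 3 || reg_file < 0 || reg_file > 7)
        return 0;
    return reg_file >= map_lo[unit] && reg_file <= map_hi[unit];
}
```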

When a helper thread halts, it writes a message into the auxiliary register file allocated to it, and this message can be used to check whether the helper thread has halted 92.

FIG. 10 is a flowchart of a helper-thread program according to an embodiment of the present invention. When a helper thread is generated 10_0, it reads the required firmware code from the instruction cache. If the firmware needs to read and write other acceleration register files, a set-memory instruction is used to change the acceleration register-file port pointer 10_1. After the firmware has executed, the helper thread halts 10_2 and the auxiliary functional unit writes a message into the auxiliary register file.

FIG. 11 is a flowchart of the overall program according to an embodiment of the present invention. It shows the time at which a helper thread starts 11_0, the time at which the helper thread halts 11_1, and the time 11_2 at which the main thread checks whether the helper thread has halted. A checkpoint is the point in time at which the main thread checks whether a helper thread has halted.

Although the present invention has been disclosed above in terms of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make various changes and refinements without departing from the spirit and scope of the invention; the scope of protection of the invention is therefore defined by the appended claims.
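Taken together, FIGS. 9 through 11 describe a simple fork/poll handshake. The C sketch below is a sequential software model of that handshake under assumed names (`fork_helper`, `HELPER_DONE`, mailbox register 0), not the patent's hardware: the main thread forks a helper and clears its mailbox, the helper posts a message into its assigned auxiliary register file when it halts, and at a checkpoint the main thread polls that message to decide whether the helper has stopped.

```c
#include <stdint.h>

#define HELPER_DONE 0xD05Eu  /* illustrative completion token */

/* One auxiliary register file per helper thread; register 0 is modeled
   as the mailbox the helper writes when it halts (FIG. 9, step 92). */
static uint32_t aux_regfile[4][8];

/* Fork (FIG. 10, step 10_0): clear the mailbox and hand the helper a
   program-counter value for its firmware segment (FIG. 8).
   Sequential model only; a real core would start fetching at this PC. */
static void fork_helper(int id, uint32_t firmware_pc)
{
    aux_regfile[id][0] = 0;
    (void)firmware_pc;
}

/* Helper side (step 10_2): after its firmware finishes, it halts and
   posts the completion token into its auxiliary register file. */
static void helper_halt(int id)
{
    aux_regfile[id][0] = HELPER_DONE;
}

/* Checkpoint (FIG. 11, time 11_2): the main thread polls the mailbox
   to learn whether the helper has stopped. */
static int helper_stopped(int id)
{
    return aux_regfile[id][0] == HELPER_DONE;
}
```

The design choice mirrored here is that synchronization is register-based: no interrupt or shared-memory flag is needed, only a read of the helper's auxiliary register file.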
BRIEF DESCRIPTION OF THE DRAWINGS

In order to make the above and other objects and features of the present invention more comprehensible, the accompanying drawings are described in detail as follows:

FIG. 1 is a schematic diagram of a multithreaded cooperative processing architecture according to an embodiment of the present invention;
FIG. 2 is a flowchart of generating a helper thread according to an embodiment of the present invention;
FIG. 3 lists a helper-thread generating program according to an embodiment of the present invention;
FIG. 4 lists a thread-checking program according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the second front-end module according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the dispatcher of the second front-end module according to an embodiment of the present invention;
FIGS. 7A through 7D are schematic diagrams of the partial mapping mechanism according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a software unit according to an embodiment of the present invention;
FIG. 9 is a flowchart of the main-thread program according to an embodiment of the present invention;
FIG. 10 is a flowchart of a helper-thread program according to an embodiment of the present invention; and
FIG. 11 is a flowchart of the overall program according to an embodiment of the present invention.
DESCRIPTION OF MAIN REFERENCE NUMERALS

30, 31, 32, 33, 34, 41, 42: program instructions
81: program counter
82: memory segment
83: firmware
90, 91, 92, 10_0, 10_1, 10_2: steps
11_0, 11_1, 11_2: time points
100: multithreaded cooperative processing architecture
102: first cluster
104: second cluster
106: instruction cache
108: data cache
110: first front-end module
112: functional units
114: register files
116: second front-end module
118: auxiliary dynamic scheduler
120: auxiliary functional units
122: auxiliary file-exchange register
124: auxiliary register files
126, 700, 701, 702, 703: acceleration functional units
128: acceleration file-exchange register
130, 710, 711, 712, 713, 714, 715, 716, 717: acceleration register files
132: main control path
134: shared data path
136: non-shared data path
200: main thread
202: helper threads
502: program-counter address generator
504: instruction cache scheduler (round-robin)
506: buffer
508: instructions
510, 610, 612: micro-VLIW instruction data

Claims (1)

1. A multithreaded cooperative processing device, comprising:
an instruction cache, for providing a micro-VLIW instruction;
a first cluster circuit, connected to the instruction cache to read the micro-VLIW instruction, for performing routine computation; and
a second cluster circuit, connected to the instruction cache to read the micro-VLIW instruction, for performing accelerated execution, wherein the second cluster circuit further comprises:
a second front-end module, connected to the instruction cache, for requesting and dispatching the micro-VLIW instruction;
an auxiliary dynamic scheduler, connected to the second front-end module, for dispatching the micro-VLIW instruction;
a non-shared data path, connected to the second front-end module, for providing a processing data path; and
a shared data path, connected to the auxiliary dynamic scheduler, for assisting in managing the non-shared data path;
wherein the second front-end module dispatches the micro-VLIW instruction to the auxiliary dynamic scheduler and the non-shared data path, and the first cluster circuit and the second cluster circuit execute in parallel;
wherein the micro-VLIW instruction is a very long instruction word (VLIW) in micro-instruction form.

2. The multithreaded cooperative processing device of claim 1, wherein the second front-end module further comprises an instruction cache scheduler to request and dispatch the micro-VLIW instruction.

3. The multithreaded cooperative processing device of claim 2, wherein the instruction cache scheduler uses a round-robin scheduling policy to request the micro-VLIW instruction from the instruction cache.

4. The multithreaded cooperative processing device of claim 1, wherein the auxiliary dynamic scheduler uses a round-robin scheduling policy.

5. The multithreaded cooperative processing device of claim 1, wherein the shared data path further comprises:
at least one auxiliary functional unit, connected to the auxiliary dynamic scheduler, for receiving the micro-VLIW instruction;
an auxiliary file-exchange register, connected to the auxiliary functional unit, for transferring a read or write request; and
at least one auxiliary register file, connected to the auxiliary file-exchange register, for providing control data.

6. The multithreaded cooperative processing device of claim 5, wherein the non-shared data path further comprises:
at least one acceleration functional unit, connected to the second front-end module, for receiving the micro-VLIW instruction;
an acceleration file-exchange register, connected to the acceleration functional unit, for transferring a read or write request; and
at least one acceleration register file, connected to the acceleration file-exchange register, for providing accelerated computation.

7. The multithreaded cooperative processing device of claim 6, wherein the acceleration file-exchange register uses a partial mapping mechanism.

8. A multithreaded cooperative processing method, comprising:
executing a main thread in a first cluster circuit;
generating a plurality of helper threads; and
executing each of the helper threads in a second cluster circuit, further comprising:
reading a micro-VLIW instruction from an instruction cache into a second front-end module;
dispatching the micro-VLIW instruction from the second front-end module to an auxiliary dynamic scheduler and a non-shared data path;
selecting and dispatching the micro-VLIW instruction from the auxiliary dynamic scheduler to a shared data path;
executing the micro-VLIW instruction in the shared data path; and
executing the micro-VLIW instruction in the non-shared data path;
wherein the main thread and the helper threads execute in parallel;
wherein the micro-VLIW instruction is a very long instruction word (VLIW) in micro-instruction form.

9. The multithreaded cooperative processing method of claim 8, wherein generating each of the helper threads further comprises:
detecting a thread-start instruction in the main thread; and
passing a plurality of parameters from the main thread to the helper thread.

10. The multithreaded cooperative processing method of claim 8, wherein the parameters comprise a program counter value.

11. The multithreaded cooperative processing method of claim 8, wherein the second front-end module uses a round-robin scheduling policy.

12. The multithreaded cooperative processing method of claim 8, wherein the auxiliary dynamic scheduler uses a round-robin scheduling policy.

13. The multithreaded cooperative processing method of claim 8, wherein executing the micro-VLIW instruction in the shared data path further comprises:
receiving, by an auxiliary functional unit, the micro-VLIW instruction issued from the auxiliary dynamic scheduler;
transferring at least one read or write request from the auxiliary functional unit to an auxiliary file-exchange register; and
transferring the read or write request from the auxiliary file-exchange register to an auxiliary register file.

14. The multithreaded cooperative processing method of claim 8, wherein executing the micro-VLIW instruction in the non-shared data path further comprises:
receiving, by an acceleration functional unit, the micro-VLIW instruction issued from the second front-end module;
transferring at least one read or write request from the acceleration functional unit to an acceleration file-exchange register; and
transferring the read or write request from the acceleration file-exchange register to an acceleration register file.

15. The multithreaded cooperative processing method of claim 14, wherein the acceleration file-exchange register uses a partial mapping mechanism to transfer the read or write request to the acceleration register file.

16. A multithreaded cooperative processing device, comprising:
an instruction cache, for providing a micro-VLIW instruction;
a first cluster circuit, connected to the instruction cache to read the micro-VLIW instruction, for performing routine computation; and
a second cluster circuit, connected to the instruction cache to read the micro-VLIW instruction, for performing accelerated execution, wherein the second cluster circuit further comprises:
a second front-end module, connected to the instruction cache, for requesting and dispatching the micro-VLIW instruction;
an auxiliary dynamic scheduler, connected to the second front-end module, for dispatching the micro-VLIW instruction;
at least one auxiliary functional unit, connected to the auxiliary dynamic scheduler, for receiving the micro-VLIW instruction;
an auxiliary file-exchange register, connected to the auxiliary functional units, for transferring at least one read or write request;
at least one auxiliary register file, connected to the auxiliary file-exchange register, for providing control data;
at least one acceleration functional unit, connected to the second front-end module, for receiving the micro-VLIW instruction;
an acceleration file-exchange register, connected to the acceleration functional unit, for transferring at least one read or write request; and
at least one acceleration register file, connected to the acceleration file-exchange register, for providing accelerated computation;
wherein the second front-end module dispatches the micro-VLIW instruction to the auxiliary dynamic scheduler and the acceleration functional unit, and the first cluster circuit and the second cluster circuit execute in parallel;
wherein the micro-VLIW instruction is a very long instruction word (VLIW) in micro-instruction form.

17. The multithreaded cooperative processing device of claim 16, wherein the second front-end module further comprises an instruction cache scheduler to request and dispatch the micro-VLIW instruction.

18. The multithreaded cooperative processing device of claim 17, wherein the instruction cache scheduler uses a round-robin scheduling policy to request the micro-VLIW instruction from the instruction cache.

19. The multithreaded cooperative processing device of claim 16, wherein the auxiliary dynamic scheduler uses a round-robin scheduling policy.

20. The multithreaded cooperative processing device of claim 16, wherein the acceleration file-exchange register uses a partial mapping mechanism.
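As a concrete reading of the round-robin scheduling policy recited in claims 3-4, 11-12 and 18-19, a fetch scheduler can be sketched as below. This is an assumed model, not the claimed circuit: each fetch slot is offered to the next thread after the one served last, skipping threads with no pending micro-VLIW request, so the single instruction-cache fetch port is shared fairly among the main thread and the helper threads.

```c
#define MAX_THREADS 4

/* Round-robin fetch scheduler: pick the next ready thread after the one
   served last cycle; return -1 when no thread has a pending fetch. */
struct rr_sched { int last; };

static int rr_pick(struct rr_sched *s, const int ready[MAX_THREADS])
{
    for (int i = 1; i <= MAX_THREADS; i++) {
        int t = (s->last + i) % MAX_THREADS;
        if (ready[t]) {
            s->last = t;     /* remember who was served this cycle */
            return t;
        }
    }
    return -1;
}
```

Because a stalled thread (for example, one waiting on a cache miss) simply drops out of the ready set, it forfeits its slot without blocking the other threads.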
TW95130152A 2006-08-16 2006-08-16 Method and apparatus for cooperative multithreading TWI323422B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW95130152A TWI323422B (en) 2006-08-16 2006-08-16 Method and apparatus for cooperative multithreading


Publications (2)

Publication Number Publication Date
TW200811709A TW200811709A (en) 2008-03-01
TWI323422B true TWI323422B (en) 2010-04-11

Family

ID=44767811


Country Status (1)

Country Link
TW (1) TWI323422B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9489246B2 (en) * 2011-04-02 2016-11-08 Intel Corporation Method and device for determining parallelism of tasks of a program

Also Published As

Publication number Publication date
TW200811709A (en) 2008-03-01

Similar Documents

Publication Publication Date Title
US11494194B2 (en) Processor having multiple cores, shared core extension logic, and shared core extension utilization instructions
TWI628594B (en) User-level fork and join processors, methods, systems, and instructions
US6718457B2 (en) Multiple-thread processor for threaded software applications
US6279100B1 (en) Local stall control method and structure in a microprocessor
US20080046689A1 (en) Method and apparatus for cooperative multithreading
CN100357884C (en) Method, processor and system for processing instructions
US6643763B1 (en) Register pipe for multi-processing engine environment
US6671827B2 (en) Journaling for parallel hardware threads in multithreaded processor
US8069340B2 (en) Microprocessor with microarchitecture for efficiently executing read/modify/write memory operand instructions
US6944850B2 (en) Hop method for stepping parallel hardware threads
JP5666473B2 (en) Multi-threaded data processing system
US7263604B2 (en) Heterogeneous parallel multithread processor (HPMT) with local context memory sets for respective processor type groups and global context memory
US20150074353A1 (en) System and Method for an Asynchronous Processor with Multiple Threading
CN116414464A (en) Method and device for scheduling tasks, electronic equipment and computer readable medium
TWI323422B (en) Method and apparatus for cooperative multithreading
US6625634B1 (en) Efficient implementation of multiprecision arithmetic
US11954491B2 (en) Multi-threading microprocessor with a time counter for statically dispatching instructions
US20230350680A1 (en) Microprocessor with baseline and extended register sets
CN114489793A (en) User timer programmed directly by application

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees