TWI307478B - Method for scheduling instructions for clustered digital signal processors and method for allocating registers using the same

Info

Publication number: TWI307478B
Application number: TW094137414A
Authority: TW
Inventors: Chung Lin Tang; Yung Chia Lin; Jenq Kuen Lee
Original assignee: Nat Univ Tsing Hua
Priority date: 2005-10-26
Filing date: 2005-10-26
Publication date: 2009-03-11
Also published as: TW200717320A

Description

[Technical Field]

The present invention relates to a method for allocating registers for a clustered digital signal processor, and more particularly to a method for scheduling instructions for a clustered digital signal processor whose register files have multiple access-port restrictions.

[Prior Art]

Most computers contain a form of high-performance data storage element called a register, which must be used effectively to achieve high performance at runtime. The process of choosing which language elements to allocate to registers, together with the data movement needed to use them, is called "register allocation". Register allocation has a major impact on the final quality and performance of the generated code: a poor allocation can lengthen the code and degrade runtime performance. Finding a truly optimal solution, however, has proven to be computationally intractable.

Several register allocation methods have been disclosed. For example, Chaitin et al., in Computer Languages, Vol. 6, pp. 47-57, and in U.S. Patent No. 4,571,678, disclose a register allocation method based on graph coloring.

Although the prior art offers good solutions for register allocation algorithms, a processor whose register files have multiple access-port restrictions increases the complexity of the register allocation problem.

[Summary of the Invention]

An object of the present invention is to provide methods for instruction scheduling and register allocation for a clustered digital signal processor whose register files have multiple access-port restrictions.

To achieve this object and avoid the problems of the prior art, one embodiment of the invention discloses a method for scheduling instructions for a clustered digital signal processor. The processor comprises a plurality of clusters, each including at least two functional units and a first register file that has a first unit, a second unit, and a single set of access ports shared by the functional units. The instruction scheduling method comprises the following steps: checking whether execution of an instruction requires reading data from both the first unit and the second unit of the first register file; and, if so, generating a copy instruction to transfer the data from the first unit to the second unit of the first register file.

The method further comprises the following steps: generating an additional virtual register, assigned to the second unit of the first register file, to receive the data transferred from the first unit; and scheduling the instruction after the copy instruction. Optionally, the method also comprises: checking whether a prior operation cycle is available in which the copy instruction can execute; and, if so, scheduling the copy instruction in that prior operation cycle.

The method may comprise the following steps: checking whether the two functional units need to access the first register file in the same operation cycle; and, if so, scheduling one of the two functional units to access the first register file before the other according to a predetermined priority, the priority being determined by the type of operation.
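The copy-insertion steps above can be sketched in C as follows. This is only an illustrative model, not the patented implementation: the types `Instr`, `Copy`, and `VregMap`, the function `insert_copy_if_needed`, and the convention that the new virtual register is placed in the unit of the first source operand are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_VREGS 32

/* Hypothetical three-address instruction over virtual registers. */
typedef struct { int dst; int src[2]; } Instr;

/* A generated copy: the value of from_vreg is copied into to_vreg. */
typedef struct { int from_vreg; int to_vreg; } Copy;

/* unit_of[v] is the unit of the shared register file holding virtual register v. */
typedef struct { int unit_of[MAX_VREGS]; int count; } VregMap;

/* If the two sources of ins sit in different units, create a fresh virtual
 * register in the unit of src[0], describe the copy that fills it, and
 * rewrite ins to read the fresh register. Returns true when a copy is needed;
 * the caller must then schedule the copy before ins. */
bool insert_copy_if_needed(Instr *ins, VregMap *map, Copy *copy)
{
    int keep_unit = map->unit_of[ins->src[0]];
    int other_unit = map->unit_of[ins->src[1]];
    if (keep_unit == other_unit)
        return false;                 /* both sources already share one unit */

    assert(map->count < MAX_VREGS);   /* room for one more virtual register */
    int fresh = map->count++;
    map->unit_of[fresh] = keep_unit;  /* the new register lives beside src[0] */

    copy->from_vreg = ins->src[1];
    copy->to_vreg = fresh;
    ins->src[1] = fresh;              /* ins now reads from a single unit */
    return true;
}
```

After a successful call, the rewritten instruction reads only one unit, so a scheduler built on this sketch would place it in any cycle after the one that executes the copy.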

In addition, the method may comprise the following steps: checking whether execution of an instruction requires accessing data from both a first cluster and a second cluster; if so, generating a copy instruction to transfer the data from the first cluster to the second cluster; and scheduling the instruction after the copy instruction. Besides the first register file, each cluster further includes a second register file connected to one functional unit and a third register file connected to the other functional unit.

The invention also discloses a register allocation method comprising the following steps: generating a plurality of instructions from program code; assigning a plurality of virtual registers to the first, second, and third register files to build an initial register file assignment (RFA) map; performing a scheduling process to compute a first operation cycle count for the instructions; performing a simulated annealing (SA) process to modify the RFA map; and allocating the virtual registers to the first, second, and third register files according to the RFA map. The simulated annealing process comprises the following steps: making a change in the RFA map; performing the scheduling process to compute a second operation cycle count for the instructions; checking whether the second operation cycle count is smaller than the first; and, if so, keeping the change in the RFA map.

[Embodiments]

FIG. 1 illustrates the architecture of a Parallel Architecture Core (PAC) digital signal processor 10. The PAC digital signal processor 10 is a five-way-issue very long instruction word (VLIW) processor comprising two arithmetic units (ALUs) 20, two load/store units (LSUs) 30, and a single scalar unit (B unit) 40. The LSUs 30 and ALUs 20 are arranged in two clusters 12A and 12B, each containing one pair made up of the two functional unit (FU) types. The B unit (BU) 40 is placed independently; it is responsible for branch operations and can also perform simple load/store and address arithmetic.

The first register files (D1, D2, D3, and D4) 22 serve as the communication bridge between clusters 12A and 12B. Only the BU 40 can access all of the first register files 22, and it can therefore perform copy operations across clusters 12A and 12B. The second register files (A1, A2) 14, the third register files (AC1, AC2) 16, and the fourth register file (R) 18 are attached directly to the LSUs 30, the ALUs 20, and the BU 40, respectively, and are private registers accessible only by those units. In particular, besides spanning the two clusters 12A and 12B, the first register file 22 uses a so-called "ping-pong" register file design, which reduces power consumption.
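The access rules just described can be captured in a small predicate. The sketch below is a simplified model under stated assumptions: the type and function names are invented for illustration, and the grouping of A1 and AC1 with cluster 12A and of A2 and AC2 with cluster 12B is assumed rather than stated in the text (the walk-through does place D1 and D2 in cluster 12A and D3 and D4 in cluster 12B).

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { CLUSTER_A, CLUSTER_B, NO_CLUSTER } Cluster;
typedef enum { FU_LSU, FU_ALU, FU_BU } FuKind;
typedef enum { RF_D, RF_A, RF_AC, RF_R } RfKind;

typedef struct { FuKind kind; Cluster cluster; } Fu;
typedef struct { RfKind kind; Cluster cluster; } RegFile;

/* Returns true if the functional unit is wired to the register file at all.
 * This ignores the per-cycle port sharing of the D files. */
bool fu_can_access(Fu fu, RegFile rf)
{
    /* In this model only the B unit sits outside the clusters. */
    assert(fu.kind == FU_BU || fu.cluster != NO_CLUSTER);

    switch (rf.kind) {
    case RF_D:  /* shared file: BU reaches every unit, LSU/ALU only their cluster's */
        return fu.kind == FU_BU || fu.cluster == rf.cluster;
    case RF_A:  /* private to the LSU of the same cluster */
        return fu.kind == FU_LSU && fu.cluster == rf.cluster;
    case RF_AC: /* private to the ALU of the same cluster */
        return fu.kind == FU_ALU && fu.cluster == rf.cluster;
    case RF_R:  /* private to the B unit */
        return fu.kind == FU_BU;
    }
    return false;
}
```

A register allocator could use such a predicate to reject assignments that place an operand in a file the executing unit cannot reach, before any cycle-level port conflicts are considered.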

The PAC digital signal processor 10 is characterized by a highly partitioned register file design. Each of the clusters 12A and 12B contains a first register file 22 comprising two units, a second register file 14, and a third register file 16; the second register file 14 and the third register file 16 are directly connected to the LSU 30 and the ALU 20, respectively. Each first register file 22 has only a single set of access ports, shared by the LSU 30 and the ALU 20. Specifically, in each of the clusters 12A and 12B, every VLIW contains at least two instructions and two bit fields that control how the access ports are switched between the first register file 22 and the two FUs (the ALU 20 and the LSU 30). Consequently, in each operation cycle each FU can access only the first register file 22 dedicated to it by the instruction; simultaneous accesses by two different FUs to a single first register file 22 are mutually exclusive.

The rationale for this design is to reduce the number of register file ports, avoiding the slow access and high power consumption of a unified register file, at the cost of an irregular architecture. With this design, the interference between register allocation and instruction scheduling grows substantially, aggravating the classic phase-ordering problem in compiler code generation. Not only does the clustered design make cross-cluster register access an additional problem, but the switched-access nature of the "ping-pong" register file (the first register file 22) also makes the details of register assignment and instruction scheduling highly interdependent. For example, the following code sequence moves two constants into two virtual registers TN1 and TN2:

    mov TN1, 1
    mov TN2, 2

The two instructions can be scheduled in parallel only if TN1 and TN2 are assigned to registers in different first register files 22; if both are assigned to the same first register file, they can only be scheduled and executed sequentially. The solution of the present invention is to add a new pre-register-allocation instruction-scheduling phase, driven by an optimization technique called simulated annealing.

FIGS. 2 through 13 illustrate a method, according to one embodiment of the invention, for scheduling instructions on the PAC digital signal processor 10 to execute the following C procedure:

    int foo(int a, int b) {
        int c = a + b;
        a = (c >> 2) - a * b;
        return a + c & 0xff;
    }

First, as shown in FIGS. 2 and 3 respectively, the C procedure is translated into a pseudo-operation form and a dependency graph. A total of 20 instructions, numbered 101 through 120, are executed to carry out the procedure. As shown in FIG. 4, an initial register file assignment (RFA) map is then generated at random. In the RFA map, TN3 through TN14 in the left column denote virtual registers, and D1, A1, and so on in the right column denote physical register files of the PAC digital signal processor 10.

Referring to FIG. 5, the first three instructions 101, 102, and 103 are scheduled directly in cycle 0, but the next three instructions 104, 105, and 106 cannot all be scheduled directly, because unit D1 of the first register file 22 in cluster 12A is already used by LSU1 to execute instruction 104. In other words, only instruction 104 ("lw TN3, 0") and instruction 105 are scheduled in cycle 1, while instruction 106 ("movi.l TN14, $0") cannot be scheduled in cycle 1 and is instead scheduled in the later cycle 2, because the "load word" (lw) operation of instruction 104 is designed to have a higher priority than the "move" (movi.l) operation of instruction 106. The method checks whether the two functional units (ALU1 and LSU1) need to access the first register file 22 in the same operation cycle and, according to a predetermined priority based on the operation type and the critical-path state of the schedule, schedules one of the two functional units (LSU1 30) to access the first register file 22 before the other (ALU1 20). Note that instructions 104 and 105 take three cycles to complete because they load words from memory. The "movi.h" operation in instruction 103 is used to move an instruction in the high-order bit level of a VLIW containing at least two instructions, while the "movi.l" operation in instruction 106 is used to move another instruction in the low-order bit level of the VLIW.

Referring to FIGS. 6 and 7, an attempt is made to schedule another instruction 107, which writes TN7, but this is not possible: unit D1 of the first register file 22 in cluster 12A and unit D3 of the first register file 22 in cluster 12B cannot be accessed at the same time, because D1 and D3 lie in different clusters. The method enumerates the copy operations that can transfer data between unit D1 and unit D3. A virtual register TN16 is generated and assigned to unit D1 of the first register file 22 in cluster 12A; the value in TN5 is sent by a bdt operation and received into TN16 by a bdr operation in the earlier cycle 5, so that TN5 (in D3) and TN16 (in D1) hold equal values. After TN5 is replaced by TN16, instruction 107 is scheduled in cycle 8 and accesses only D1, rather than D1 and D3 simultaneously, which is not allowed.
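The port-sharing rule illustrated above by the mov TN1/TN2 pair and by instructions 104 and 106 can be sketched as a small arbitration function. This is a hedged illustration only: `PendingOp`, `Decision`, and `arbitrate_port` are hypothetical names, the numeric priorities are invented (the text says only that lw outranks movi.l), and the model assumes the two operations come from the two functional units of one cluster.

```c
#include <assert.h>
#include <stddef.h>

/* One operation waiting to issue in the current cycle. */
typedef struct {
    const char *opcode;  /* mnemonic, for readability only */
    int d_unit;          /* unit of the shared register file it touches, or -1 for none */
    int priority;        /* larger value wins the shared port (assumed scale) */
} PendingOp;

typedef enum { ISSUE_BOTH, ISSUE_A_FIRST, ISSUE_B_FIRST } Decision;

/* Two operations may share a cycle unless both need the same unit of the
 * shared register file; in that case the higher-priority one goes first and
 * the other is deferred to a later cycle. Ties favor the first argument. */
Decision arbitrate_port(const PendingOp *a, const PendingOp *b)
{
    assert(a != NULL && b != NULL);
    if (a->d_unit < 0 || a->d_unit != b->d_unit)
        return ISSUE_BOTH;
    return (a->priority >= b->priority) ? ISSUE_A_FIRST : ISSUE_B_FIRST;
}
```

With lw given the higher priority, the function reproduces the ordering in the walk-through: the load takes the contested unit in the current cycle and the move waits for the next one, while two moves that target different units may issue together.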

In other words, the method checks whether executing instruction 107 requires accessing data from both cluster 12A and cluster 12B, generates a copy instruction to transfer the data from cluster 12B to cluster 12A, and schedules instruction 107 in cycle 8 after the copy instruction has been scheduled in cycle 5. Specifically, the method checks whether a prior operation cycle is available for executing the copy instruction, which consists of the operation pair "bdt" and "bdr", and schedules the copy instruction in the prior operation cycle 5. Similarly, another virtual register TN15 is generated and assigned to unit D3; the value in TN3 is sent by a bdt operation and received into TN15 by a bdr operation in cycle 4, so that TN15 and TN3 hold equal values in register files D3 and D1. One operand TN3 can therefore be replaced by TN15, after which instruction 109 is scheduled in cycle 7. In addition, as shown in FIG. 8, instruction 108 is also scheduled in cycle 8 after the substitution with TN15.

Referring to FIG. 9, instructions 118 and 110 are scheduled in cycle 9, and cycle 10 is left empty because instruction 110, an "fmulus" operation that multiplies TN5 by TN11 and stores the product in TN11, has a latency of one cycle. Instruction 111 is scheduled in cycle 11. Then, as shown in FIGS. 10 and 11, instructions 112, 113, 114, 115, and 116 are scheduled in order in cycles 12, 13, 14, 16, and 17.

Referring to FIG. 12, instruction 117, "sub TN12, TN8, TN9" (TN9 is subtracted from TN8 and the difference is stored in TN12), is to be scheduled. However, TN8 is assigned to unit D1 in cluster 12A, whereas TN9 and TN12 are assigned to unit D4 in cluster 12B; that is, the data sources lie in different clusters. A virtual register TN17 is generated, which could be assigned to unit D3 or D4 of the first register file 22 or to the third register file (AC2) 16. Unit D3 cannot be used for TN17 because it would conflict with D4 (the ping-pong constraint); unit D4 is well suited; and AC2 is also possible, although it would require an additional copy operation. Assuming TN17 is assigned to unit D4, and the value in TN8 is sent by a bdt operation and received into TN17 by a bdr operation in the earlier cycle 10, instruction 117 is scheduled in cycle 18 after TN8 is replaced by TN17.

Next, instruction 119, "add TN13, TN7, TN12" (TN7 and TN12 are added and the sum is stored in TN13), is to be scheduled. However, TN12 is assigned to unit D4 in cluster 12B, whereas TN7 and TN13 are assigned to units D1 and D2 in cluster 12A; again the data sources lie in different clusters. A virtual register TN18 is generated and assigned to D1; the value in TN12 is sent by a bdt operation and received into TN18 by a bdr operation in the earlier cycle 19, and instruction 119 is scheduled in cycle 22 after TN12 is replaced by TN18.

Referring to FIG. 13, the final instruction 120 computes the logical AND of TN13 and TN14 and stores the result in a virtual register assigned to D1. However, TN13 and TN14 are assigned to D2 and D1, respectively, so the instruction cannot execute directly because a conflict would arise (the ping-pong constraint). A virtual register TN19 is generated and assigned to D2, the value in TN14 is transferred to TN19, and instruction 120 is scheduled in cycle 23. In other words, the method checks whether executing instruction 120 requires reading data from two units (D2 and D1) of the first register file 22, generates a copy instruction (executed by ALU1 20 in cycle 22) to copy the data from one unit to the other, and schedules instruction 120 in cycle 23 after the copy instruction has been scheduled in cycle 22. In total, 23 cycles are used, 9 copy operations are inserted, and 5 new virtual registers are generated.

FIGS. 14 and 15 are flowcharts of a method for allocating registers according to one embodiment of the invention. First, step S102 randomly builds an initial register file assignment and records it in a register file assignment (RFA) map, which randomly assigns each virtual register generated from the program code, as shown in FIG. 4. Step S104 then performs modified list scheduling of the instructions, that is, it carries out the scheduling process shown in FIGS. 5 through 13 with respect to the RFA map, and sets a variable sched_len to the total schedule length in cycles. Step S106 then sets initial values for the threshold, the energy, and the probability test value (p_test) used in the simulated annealing (SA) process, where the initial energy is greater than the initial threshold and 0 < p_test < 1.

Step S108 performs the simulated annealing process to adjust the energy, sched_len, and the RFA map, and step S110 then checks whether the energy is greater than the threshold. If it is, the method returns to step S108 and performs the simulated annealing process again; otherwise, step S130 keeps the final schedule and RFA map as the output. Step S132 then formally allocates the hardware registers according to the RFA map, that is, it assigns all of the virtual registers TN3 through TN19 to the hardware register files 22, 14, and 16 of the processor 10.

As the flowcharts show, the SA process of step S108 is divided into several sub-steps. Step S112 randomly selects a virtual register and assigns it to a different register file of the processor 10, so that the RFA map contains at least one change. Based on the changed RFA map, step S114 repeats the modified list scheduling (that is, performs a second scheduling process) and sets new_sched_len to the new total schedule length in cycles. Step S116 checks whether new_sched_len (the operation cycle count of the second scheduling process) is smaller than sched_len (the operation cycle count of the first scheduling process). If so, step S118 decreases the energy by 1, sets sched_len to new_sched_len, and keeps the change to the RFA map. Otherwise, step S120 generates a random number R between 0 and 1, and step S122 checks whether R is greater than p_test. If R is greater than p_test, step S124 decreases the energy by 1, sets sched_len to new_sched_len, keeps the change to the RFA map, and returns to step S110. Otherwise, step S126 increases the energy by 1, reverts the change to the RFA map, and returns to step S110.
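The loop formed by steps S104 through S130 can be sketched in C as below. This is a minimal sketch under several assumptions, not the patented implementation: register files are encoded as integers 0 to 3, the real modified list scheduler is replaced by a toy callback (`toy_schedule`), the random source is the C library's `rand`, and the `max_iters` cap is an added safeguard that does not appear in the flowchart.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_FILES 4  /* hypothetical encoding of four register files as 0..3 */

/* Stand-in for the modified list scheduler (steps S104 and S114): returns
 * the schedule length in cycles for a given RFA map. */
typedef int (*ScheduleFn)(const int *rfa, int n);

/* Toy scheduler, NOT the one in the text: one cycle per virtual register,
 * plus one stall whenever two neighbours share a register file. */
int toy_schedule(const int *rfa, int n)
{
    int len = n;
    for (int i = 0; i + 1 < n; i++)
        if (rfa[i] == rfa[i + 1])
            len++;
    return len;
}

/* Runs the annealing loop on rfa[0..n-1] and returns the final sched_len. */
int anneal_rfa(int *rfa, int n, ScheduleFn schedule,
               int energy, int threshold, double p_test, long max_iters)
{
    assert(n > 0 && energy > threshold && p_test > 0.0 && p_test < 1.0);

    int sched_len = schedule(rfa, n);                       /* S104 */
    for (long it = 0; energy > threshold && it < max_iters; it++) { /* S110 */
        int v = rand() % n;                                 /* S112: pick a register */
        int old_file = rfa[v];
        /* move it to one of the other files (offset 1..NUM_FILES-1) */
        rfa[v] = (old_file + 1 + rand() % (NUM_FILES - 1)) % NUM_FILES;

        int new_sched_len = schedule(rfa, n);               /* S114 */
        int keep = new_sched_len < sched_len;               /* S116 */
        if (!keep) {
            double r = (double)rand() / RAND_MAX;           /* S120 */
            keep = r > p_test;                              /* S122 */
        }
        if (keep) {                                         /* S118 / S124 */
            energy--;
            sched_len = new_sched_len;
        } else {                                            /* S126 */
            energy++;
            rfa[v] = old_file;                              /* revert the change */
        }
    }
    return sched_len;                                       /* S130 */
}
```

Because a non-improving change is also kept whenever R exceeds p_test, the returned length is the length of the final map rather than the best one seen; a variant that remembers the best map would be a departure from the steps as described.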
The technical content and features of the invention have been disclosed above; nevertheless, those skilled in the art may still make various substitutions and modifications based on its teachings and disclosure without departing from its spirit. The scope of protection of the invention should therefore not be limited to the disclosed embodiments, but should include all such substitutions and modifications, as covered by the claims below.

[Brief Description of the Drawings]

FIG. 1 illustrates the architecture of a Parallel Architecture Core (PAC) digital signal processor;
FIGS. 2 through 13 illustrate a method for scheduling instructions for the PAC digital signal processor according to one embodiment of the invention; and
FIGS. 14 and 15 are flowcharts of a register allocation method according to one embodiment of the invention.

[Description of Main Reference Numerals]

10: PAC digital signal processor
12A: cluster
12B: cluster
14: second register file
16: third register file
18: fourth register file
20: arithmetic unit
22: first register file
30: load/store unit
40: scalar unit
TN3, TN4, TN5, TN6, TN7, TN8, TN9, TN10, TN11, TN12, TN13, TN14: virtual registers
101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120: instructions
S102, S104, S106, S108, S110, S112, S114, S116, S118, S120, S122, S124, S126, S130, S132: steps

Claims

1. A method for scheduling instructions for a clustered digital signal processor, the processor comprising a plurality of clusters, each cluster including at least two functional units and a first register file having a first unit, a second unit, and a single set of access ports shared by the functional units, the method comprising the following steps:
checking whether executing an instruction requires reading data from the first unit and the second unit of the first register file;
if so, generating a copy instruction to transfer the data from the first unit of the first register file to the second unit; and
scheduling the instruction after the copy instruction.

2. The method for scheduling instructions for a clustered digital signal processor according to claim 1, further comprising the following steps:
checking whether the two functional units need to access the first register file in the same operation cycle; and
if so, scheduling one of the two functional units to access the first register file before the other according to a predetermined priority.

3. The method for scheduling instructions for a clustered digital signal processor according to claim 1, further comprising the following steps:
checking whether a prior operation cycle is available in which the copy instruction can execute; and
if so, scheduling the copy instruction in the prior operation cycle.

4. The method for scheduling instructions for a clustered digital signal processor according to claim …, wherein each cluster further includes a second register file and a third register file, and the method further comprises the following steps: