TW201106174A

TW201106174A - Discrete cosine transformation circuit and apparatus utilizing the same

Info

Publication number: TW201106174A
Application number: TW98126121A
Authority: TW
Inventors: Ming-Chung Hsu; Yi-Shin Li; Yi-Shin Tung; Chia-Ying Li
Original assignee: Hon Hai Prec Ind Co Ltd
Priority date: 2009-08-03
Filing date: 2009-08-03
Publication date: 2011-02-16
Also published as: TWI398781B

Abstract

A discrete cosine transformation circuit comprises a pipeline with a memory stage and an arithmetic stage. The arithmetic stage comprises a first and a second arithmetic logic unit (ALU). Each ALU receives a set of image data from the memory, performs a first calculation on that set of image data, and outputs the calculation result in a first clock cycle. A path in the circuit directs the result back to the memory stage, so that at least one ALU can selectively receive the result from the path and perform another calculation on it in the clock cycle following the first clock cycle.

Description

VI. Description of the Invention

[Technical Field]

[0001] The present invention relates to discrete cosine transformation (DCT) technology, and in particular to a DCT circuit for performing a two-dimensional DCT.

[Prior Art]

[0002] The DCT is often used to compress image data. The forward DCT uses discrete cosine functions to convert image data into frequency-domain data, and the inverse DCT uses discrete cosine functions to convert frequency-domain data back into the original image data. The term "discrete cosine transform" may refer to either the forward or the inverse DCT.

[0003] A DCT device usually performs a complete two-dimensional transform on a macroblock, for example by first performing a one-dimensional DCT on each row of an 8x8 pixel block and then performing another one-dimensional DCT on each column of the block. Video coding standards such as H.264, VC-1 and MPEG-2 all use block-based transforms and differ only in block size and coefficients. Each video coding method usually has its own dedicated DCT circuit. Integrating these dedicated circuits into one device in order to support several video coding methods can complicate the circuit design and make the circuit harder to miniaturize, and the circuit design has to be changed again whenever the device is to support a new video coding method. Using a general-purpose processor to execute the different coding methods is more flexible but comparatively inefficient.

[Summary of the Invention]

[0004] In view of this, there is a need for a DCT circuit and an image processing apparatus that uses the circuit.

[0005] A DCT circuit includes a butterfly calculation circuit having a pipeline with a fetch stage, a memory stage and an arithmetic stage. The fetch stage receives and decodes a set of butterfly calculation instructions. The memory stage includes a memory bank that stores image data coefficients and the intermediate calculation data output by the arithmetic stage and that, according to at least one decoded butterfly calculation instruction, outputs a first set of data stored in the memory stage in a first clock cycle. The arithmetic stage includes a plurality of registers, a first arithmetic logic unit (ALU) and a second ALU. The registers receive the first set of data from the memory bank as input data of the arithmetic stage. The first ALU and the second ALU receive a set of input data from the registers, perform a first calculation on that input data and, according to at least one decoded butterfly calculation instruction, output the result of the first calculation in a second clock cycle after the first clock cycle. The butterfly calculation circuit includes a path that directs the calculation result from the arithmetic stage to the memory stage in the same clock cycle in which the first calculation completes, so that in a third clock cycle after the second clock cycle at least one register can choose either to receive the calculation result from the path or to receive the next set of data from the memory bank.

[0006] An image processing apparatus includes a DCT circuit having a pipeline with a fetch stage, a memory stage and an arithmetic stage. The fetch stage receives and decodes a set of butterfly calculation instructions. The memory stage includes a memory bank that stores image data coefficients and the intermediate calculation data output by the arithmetic stage and that, according to at least one decoded butterfly calculation instruction, outputs a first set of data stored in the memory stage in a first clock cycle. The arithmetic stage includes a plurality of registers, a first ALU and a second ALU. The registers receive the first set of data from the memory bank as input data of the arithmetic stage. The first ALU and the second ALU receive a set of input data from the registers, perform a first calculation on that input data and, according to at least one decoded butterfly calculation instruction, output the result of the first calculation in a second clock cycle after the first clock cycle. The butterfly calculation circuit includes a path that directs the calculation result from the arithmetic stage to the memory stage in the same clock cycle in which the first calculation completes, so that in a third clock cycle after the second clock cycle at least one register can choose either to receive the calculation result from the path or to receive the next set of data from the memory bank.

[Embodiments]

[0007] Embodiments of the DCT circuit and of the apparatus that uses it are described below.

[0008] 1. System overview

[0009] The DCT circuit can be implemented in a variety of devices that have image processing capability, for example digital cameras and other devices with built-in image processing.

[0010] 1.1 Embodiment of the image processing apparatus

[0011] The DCT circuit 165 is integrated into the central processing unit of the image processing apparatus 100, namely the processor 151. The processor 151 may be packaged as a chip. A power supply provides power to the components of the image processing apparatus 100. A crystal oscillator provides a clock signal to the processor and to the other components of the image processing apparatus 100. Fig. 1A shows how the components of the image processing apparatus 100 are connected; the connections may be serial buses or parallel buses. The input/output devices include control buttons, a seven-segment display, and an infrared receiver or transceiver that communicates with a remote controller. One of the ports 164 can be connected to an external computer for debugging the image processing apparatus 100. The ports 164 may be physical ports conforming to Recommended Standard 232 (RS-232) and/or another recommended standard defined by the Electronic Industries Association (EIA), Serial ATA (SATA) ports, and/or High Definition Multimedia Interface (HDMI) ports. The non-volatile memory 153 stores the operating system and the applications executed by the processor 151. The processor 151 loads running programs and data into the main memory 152 and stores digital content in the mass storage device 154. The main memory 152 may be random access memory (RAM), such as static RAM (SRAM) or dynamic RAM (DRAM). The non-volatile memory 153 may be electrically erasable programmable read-only memory (EEPROM), such as NOR flash memory or NAND flash memory. The content protection unit 155 provides access control for the digital content generated by the image processing apparatus 100. The content protection unit 155 includes the memory and other devices needed to implement the common interface (DVB-CI) and/or conditional access (DVB-CA) of digital video broadcasting. The image processing apparatus 100 can obtain digital content from the digital signals delivered by the antenna 165, the tuner 157 and the demodulator 156. Fig. 1B shows another embodiment, in which the image processing apparatus 101 obtains digital content from a network such as the Internet through a network access interface. The video output unit 162 includes a filter and an amplifier that filter and amplify the video signal output by the processor 151.
The audio output unit 161 includes a digital-to-analog converter that converts the audio signal output by the processor 151 from digital format to analog format.

[0012] 1.2 Embodiment of the DCT circuit

[0013] Fig. 2 is a block diagram of an embodiment of the DCT circuit 200. The DCT circuit 200 is an embodiment of the DCT circuit 165 of Fig. 1A and/or Fig. 1B. The DCT circuit 200 includes a butterfly calculation circuit that executes butterfly algorithms, as described in detail below. The butterfly calculation circuit is designed as a pipeline with three stages: the fetch stage 301, the memory stage 302 and the arithmetic stage 303. As shown in Fig. 2, the components of the DCT circuit 200 are connected by buses. The instruction memory 253 stores the instructions that perform the DCT. The register Reg1 stores the instructions read from the instruction memory 253. In one clock cycle (the first clock cycle), the fetch stage 301 receives and decodes the butterfly calculation instruction obtained from the register Reg1, and in the following clock cycles it controls the memory stage 302 and the arithmetic stage 303. The control lines from the fetch stage 301 to the relevant units of the memory stage 302 and the arithmetic stage 303 are not shown in the figure.

[0014] The memory stage 302 includes the data memory 252, which serves as a memory bank for storing image data coefficients, and the memory 251, which stores the intermediate calculation results output by the arithmetic stage 303. According to the decoded instruction, in another clock cycle (the second clock cycle) data sets are read from the data memory 252 and the memory 251 and steered through the multiplexers 241 and 242 and the multiplexers 231 to 234 to the outputs of the multiplexers 231 to 234.
Buses connect the outputs of the multiplexers 231 to 234 in the memory stage 302 to the corresponding registers Reg2 to Reg5 in the arithmetic stage 303.

[0015] In this example, the arithmetic stage 303 includes four registers, Reg2 to Reg5, and four arithmetic logic units, ALU1 to ALU4. Each of the registers Reg2 to Reg5 receives a set of data from its corresponding multiplexer among the multiplexers 231 to 234 and outputs that data to its corresponding ALU as the input data of that ALU. The multiplexer or ALU that "corresponds" to a register is the multiplexer or ALU connected to that register. ALU1 to ALU4 each perform the same or a different calculation on the input data they receive and, according to at least one decoded instruction, output the calculation result in one clock cycle (the third clock cycle). Each ALU (for example ALU1) contains a shifter (for example the shifter 201) for arithmetic shift operations and an adder/subtractor (for example the adder/subtractor 221) for additions and subtractions.

[0016] The memories 251 and 252 may each consist of one or more memory blocks or chips. The registers comprise edge-triggered flip-flops, such as D-type flip-flops.

[0017] The DCT circuit 200 includes the lines 270 and 271, which carry the calculation results from the arithmetic stage 303 to the memory stage 302 in the same clock cycle in which the calculation completes. The calculation results are then supplied to the multiplexers 231 to 234 as selectable inputs. According to the decoded instruction, the registers Reg2 to Reg5 can choose to receive the calculation results from the lines 270 and 271 through the multiplexers 231 to 234 in the following clock cycle. For example, the multiplexer 233 has a selectable input 31 connected to the line 270, selectable inputs 32 and 33 connected to the memories 251 and 252 through the multiplexers 241 and 242 respectively, and a selectable input 34 that receives a binary zero, for example a 52-bit binary zero whose width follows from the bus bandwidth and the ALU design. According to the decoded instruction, the multiplexer 233 can select input 31 to pass the calculation result to the register Reg4, select input 32 or 33 to read the next set of data from the memory 251 or 252, or select input 34 to receive the 52-bit binary zero. As shown in Fig. 2, the input 44 of the multiplexer 234 can likewise receive the 52-bit binary zero.

[0018] 2. Butterfly calculation architecture embodiments

[0019] 2.1 Butterfly algorithm embodiment

[0020] The image processing apparatus decodes and displays digitized still images or video clips. A digitized image is usually represented as matrices of picture elements (pixels). For example, in the YCbCr color system the three main components are one luminance component Y and two chrominance components Cb and Cr. The values of the luminance and chrominance components describe the brightness and chromaticity of a pixel. The image processing apparatus can also process images in other color systems, such as the RGB color system. Each digitized image can be represented as three rectangular arrays, each of which contains the values of one of the three components of the image.

[0021] A macroblock, that is, a block of pixels within an image frame, may be a 4x4 pixel block, an 8x8 pixel block, a 16x16 pixel block or a block of any other size. For the pixels of a macroblock, one kind of color information (for example one main component of the YCbCr color system) forms an image data coefficient matrix.

[0022] Suppose that the DCT converts a vector x of length N into a new coefficient vector X according to the linear transformation X = Hx, where H is a matrix and x may be a row or a column of the image data coefficient matrix. The DCT converts the image data coefficients from the spatial domain to the frequency domain. In the rest of this description, image data coefficient matrices are written as arrays with two indices. An element of a matrix is written F[i][j], where [i] and [j] are indices and i and j are integer variables; the first (leftmost) index [i] is the vertical index and the second (rightmost) index [j] is the horizontal index. For example, F[3][5] denotes the element of matrix F at vertical position 3 and horizontal position 5. In image processing, the DCT consists of one-dimensional (1D) column transforms and row transforms of the image data coefficient matrix. Both the 1D column transform and the 1D row transform are sequences of matrix products, which increases the complexity of the circuit design. A butterfly algorithm (called a butterfly operation below), which is mathematically equivalent to the matrix product, is therefore well suited to implementing a DCT circuit that has no matrix multiplication circuitry. Different image or video compression specifications use different butterfly operations. For example, the H.264 standard defined by the International Telecommunication Union (ITU), also known as MPEG-4 Part 10 or MPEG-4 Advanced Video Coding (AVC), uses the DCT equation X = Hx, where the matrix H is given by equation (0).

[0023] (Equation (0), the transform matrix H, is not reproduced in this text.)
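The claim that a butterfly operation is mathematically equivalent to the matrix product X = Hx can be illustrated with a small sketch. Because the matrix of equation (0) does not survive in this text, the code below assumes the widely documented 4x4 forward core transform matrix of H.264; that matrix, the function names and the intermediate variable names are all assumptions made for illustration and are not taken from the patent.

```python
# ASSUMPTION: the standard H.264 4x4 forward core transform matrix.
# The patent's own equation (0) is not reproduced in this text.
H = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matrix_transform(x):
    """Reference result: X = H x as an explicit matrix-vector product."""
    return [sum(h * v for h, v in zip(row, x)) for row in H]


def butterfly_transform(x):
    """Same result with eight additions and the multipliers -1, 2 and -2."""
    # First layer: pair (x[0], x[3]) and (x[1], x[2]).
    s03 = x[0] + x[3]
    s12 = x[1] + x[2]
    d12 = x[1] + (-1 * x[2])
    d03 = x[0] + (-1 * x[3])
    # Second layer: combine the sums and the differences into the outputs.
    X0 = s03 + s12
    X2 = s03 + (-1 * s12)
    X1 = (2 * d03) + d12
    X3 = d03 + (-2 * d12)
    return [X0, X1, X2, X3]


if __name__ == "__main__":
    sample = [1, 2, 3, 4]
    print(matrix_transform(sample))
    print(butterfly_transform(sample))
```

Under this assumption the butterfly form needs no general multiplier: every multiplication is by -1, 2 or -2, which a subtractor and a one-bit shift can provide.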

[0025] Fig. 3 shows the butterfly algorithm that corresponds to the DCT equation (0). Fig. 3 contains four input nodes, denoted x[0], x[1], x[2] and x[3], four output nodes, denoted X[0], X[1], X[2] and X[3], and eight operation nodes 121 to 128, each marked with a plus sign "+". The plus sign in a node indicates that the node performs an addition. The nodes in Fig. 3 are connected by directed transfer lines that show the flow of the computation: the algorithm passes the output value of each node along a transfer line to the next node to which that line connects. A constant such as -1, 2 or -2 marked beside a transfer line is the multiplier applied during the transfer. The input values x[0], x[1], x[2] and x[3] can be replaced by the matrix elements of a row vector or a column vector of the image data coefficient matrix. The basic unit of a butterfly algorithm is referred to in this description as a butterfly calculation unit, such as the butterfly calculation unit 111. A butterfly calculation unit contains two input nodes and two output nodes. The butterfly calculation unit 111, for example, contains the two input nodes x[1] and x[2] and the two operation nodes 122 and 123 as its output nodes. The nodes 122 and 123 each represent an addition with two input values. The node 122 adds the input values x[1] and x[2] and produces the sum as its output value.
The node 123 adds the input values x[1] and (-1 × x[2]) and produces the output value x[1] + (-1 × x[2]); the butterfly calculation unit 111 shows the multiplicand x[2] being multiplied by the constant multiplier -1 before it becomes an input value of the node 123. The multiplier is shown in Fig. 3 as the number marked beside the transfer line. The nodes x[0], x[3], 122 and 124 form another butterfly calculation unit. The nodes 121, 122, 125 and 126 form another butterfly calculation unit. The output values of the nodes 122 and 123 are passed along transfer lines to the nodes 125 to 128. The algorithm shown in Fig. 3 is complete only when all butterfly calculation units of Fig. 3 have performed the operations described above. The remaining parts of Fig. 3 can be read in the same way.

[0026] The DCT circuit 200 shown in Fig. 2 is designed to implement a variety of butterfly algorithms. A butterfly calculation unit implemented by the DCT circuit is described below.

[0027] 2.2 Butterfly calculation unit embodiment

[0028] The control unit 261 receives and decodes the butterfly calculation instructions obtained from the instruction memory 253 and, according to the decoded instructions, controls the DCT circuit 200 so as to implement a butterfly calculation unit of a butterfly algorithm. One example of the operation of the DCT circuit 200 follows.

[0029] Referring to Fig. 4, the butterfly calculation unit 112 is expressed as:

$$D[0]' = D[0] \times 9 + D[1] \times 5 \qquad (1)$$

$$D[1]' = D[0] \times 5 - D[1] \times 9 \qquad (2)$$

[0032] The DCT circuit includes barrel shifters that perform bit shifts.
In this description, for a real-valued variable x and a positive integer variable y, x >> y denotes x, represented in two's complement, shifted right by y bits. The value filled into the most significant bit (MSB) after the right shift is the same as the value of the MSB of x before the shift. Likewise, x << y denotes x, represented in two's complement, shifted left by y bits. The value filled into the least significant bit after the left shift is 0. Implementing the multiplications with shift operations, equations (1) and (2) can be rewritten as:

$$
\begin{aligned}
D[0]' &= D[0] \times 9 + D[1] \times 5 \\
      &= D[0] \times (8 + 1) + D[1] \times (4 + 1) \\
      &= (D[0] \times 8 + D[0] \times 1) + (D[1] \times 4 + D[1] \times 1) \\
      &= (D[0] \times 2^{3} + D[0]) + (D[1] \times 2^{2} + D[1]) \\
      &= ((D[0] \ll 3) + D[0]) + ((D[1] \ll 2) + D[1]) \qquad (3)
\end{aligned}
$$

$$
\begin{aligned}
D[1]' &= D[0] \times 5 - D[1] \times 9 \\
      &= D[0] \times (4 + 1) - D[1] \times (8 + 1) \\
      &= (D[0] \times 4 + D[0] \times 1) - (D[1] \times 8 + D[1] \times 1) \\
      &= (D[0] \times 2^{2} + D[0]) - (D[1] \times 2^{3} + D[1]) \\
      &= ((D[0] \ll 2) + D[0]) - ((D[1] \ll 3) + D[1]) \qquad (4)
\end{aligned}
$$

[0043] Suppose that the image data coefficients to which the 1D row transform is applied form the 2x2 matrix C, where:

$$C = \begin{bmatrix} a & c \\ b & d \end{bmatrix}$$

[0045] and a, b, c and d are real numbers.

[0046] When the butterfly calculation unit 112 performs the row transform on the 2x2 matrix C, the matrix elements of each row vector of C are substituted for D[0] and D[1] of the butterfly calculation unit 112. The matrix elements C[0][0] and C[0][1] of the first row vector of C are substituted for D[0] and D[1] in equations (1) and (2), and then the matrix elements C[1][0] and C[1][1] of the second row vector of C are substituted for D[0] and D[1] in equations (1) and (2). The 2x2 matrix C' is an example of the output of the 1D row transform of the 2x2 matrix C by the butterfly calculation unit 112, where:

$$C' = \begin{bmatrix} a' & c' \\ b' & d' \end{bmatrix}$$

and

$$C'[0][0] = a' = C[0][0] \times 9 + C[0][1] \times 5 \qquad (5)$$

$$C'[0][1] = c' = C[1][0] \times 9 + C[1][1] \times 5 \qquad (6)$$

$$C'[1][0] = b' = C[0][0] \times 5 - C[0][1] \times 9 \qquad (7)$$

$$C'[1][1] = d' = C[1][0] \times 5 - C[1][1] \times 9 \qquad (8)$$

[0052] Another 2x2 matrix of image data coefficients, Y, serves as an example of the 1D column transform:

$$Y = \begin{bmatrix} e & g \\ f & h \end{bmatrix}$$

[0054] where e, f, g and h are real numbers.

[0055] When the butterfly calculation unit 112 performs the 1D column transform on the 2x2 matrix Y, the matrix elements of each column vector of Y are substituted for D[0] and D[1] of the butterfly calculation unit. That is, the matrix elements Y[0][0] and Y[1][0] of the first column vector are substituted for D[0] and D[1] in equations (1) and (2), and then Y[0][1] and Y[1][1] are substituted for D[0] and D[1] in equations (1) and (2). The 2x2 matrix Y' is an example of the output of the 1D column transform of the 2x2 matrix Y by the butterfly calculation unit 112, where:

$$Y' = \begin{bmatrix} e' & g' \\ f' & h' \end{bmatrix}$$

and

$$Y'[0][0] = e' = Y[0][0] \times 9 + Y[1][0] \times 5 \qquad (5a)$$

$$Y'[0][1] = g' = Y[0][0] \times 5 - Y[1][0] \times 9 \qquad (6a)$$

$$Y'[1][0] = f' = Y[0][1] \times 9 + Y[1][1] \times 5 \qquad (7a)$$

$$Y'[1][1] = h' = Y[0][1] \times 5 - Y[1][1] \times 9 \qquad (8a)$$

[0061] The 2D DCT of the matrix Y can be completed by the butterfly calculation unit 112 by substituting the matrix Y' for the matrix Y in equations (5a), (6a), (7a) and (8a).

[0062] 3. Operation of the DCT circuit

[0063] As described above, the butterfly calculation unit 112 can be expressed by equations (3) and (4). The DCT circuit can implement the butterfly calculation unit 112 by evaluating equations (3) and (4). That is, the image processing apparatus 100 includes at least three instructions for implementing the butterfly calculation unit 112.
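The shift-and-add forms of equations (3) and (4), and the way equations (5) to (8) and (5a) to (8a) feed rows and columns through the butterfly calculation unit 112, can be checked with a short sketch. The function names are hypothetical, and Python integers stand in for the fixed-width two's-complement values of the hardware, so this is an illustration rather than a model of the circuit.

```python
def butterfly_112(d0, d1):
    """Equations (1) and (2), written with ordinary multiplications."""
    return d0 * 9 + d1 * 5, d0 * 5 - d1 * 9


def butterfly_112_shift_add(d0, d1):
    """Equations (3) and (4): shifts, additions and one subtraction only."""
    out0 = ((d0 << 3) + d0) + ((d1 << 2) + d1)
    out1 = ((d0 << 2) + d0) - ((d1 << 3) + d1)
    return out0, out1


def row_transform_2x2(C):
    """Equations (5)-(8): each row of C goes through the unit.

    As in the equations, the two outputs for row r are written to
    C'[0][r] and C'[1][r].
    """
    Cp = [[0, 0], [0, 0]]
    for r in range(2):
        Cp[0][r], Cp[1][r] = butterfly_112_shift_add(C[r][0], C[r][1])
    return Cp


def column_transform_2x2(Y):
    """Equations (5a)-(8a): each column of Y goes through the unit.

    The two outputs for column c are written to Y'[c][0] and Y'[c][1].
    """
    Yp = [[0, 0], [0, 0]]
    for c in range(2):
        Yp[c][0], Yp[c][1] = butterfly_112_shift_add(Y[0][c], Y[1][c])
    return Yp


if __name__ == "__main__":
    print(butterfly_112(3, -2), butterfly_112_shift_add(3, -2))
    print(row_transform_2x2([[1, 2], [3, 4]]))
    print(column_transform_2x2([[1, 2], [3, 4]]))
```

Python's `<<` on a negative integer behaves like a two's-complement left shift of unbounded width, which is why the sketch also agrees with equations (1) and (2) for negative inputs; a fixed-width implementation would additionally have to consider overflow.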
The first butterfly calculation instruction controls the discrete cosine transform circuit 200 to evaluate ((D[1] << 2) + D[1]) of equation (3) and ((D[0] << 2) + D[0]) of equation (4). That is, the first butterfly calculation instruction implements:
[0064] t1 = (D[1] << 2) + D[1]; and (9)
[0065] t2 = (D[0] << 2) + D[0]. (10)
[0066] t1 and t2 are variables, which can be implemented with registers but are not limited thereto. Equation (9) corresponds to the transfer line from D[1] to node 1121 in the butterfly calculation unit 112. Equation (10) corresponds to the transfer line from D[0] to node 1122 in the butterfly calculation unit 112.
[0067] The second butterfly calculation instruction controls the discrete cosine transform circuit 200 to evaluate ((D[0] << 3) + D[0]) of equation (3) and ((D[1] << 3) + D[1]) of equation (4). That is, the second butterfly calculation instruction implements:
[0068] t3 = (D[0] << 3) + D[0]; and (11)
[0069] t4 = (D[1] << 3) + D[1]. (12)
[0070] Equation (11) corresponds to the transfer line from D[0] to node 1121 in the butterfly calculation unit 112. Equation (12) corresponds to the transfer line from D[1] to node 1122 in the butterfly calculation unit 112.
[0071] The third butterfly calculation instruction controls the discrete cosine transform circuit 200 to complete equations (3) and (4). That is, the third butterfly calculation instruction implements:
[0072] D[0]' = t3 + t1; and (13)
[0073] D[1]' = t2 - t4. (14)
[0074] Equation (13) corresponds to node 1121 in the butterfly calculation unit 112. Equation (14) corresponds to node 1122 in the butterfly calculation unit 112.
[0075] Control information for fetching the operands is also included in the butterfly calculation instructions. The buses in the discrete cosine transform circuit have sufficient bandwidth to transmit the data within one clock cycle. The connections of the control signals are not shown in FIGS. 5 to 8. Each multiplexer of the memory stage 302 selects one of its selectable inputs according to the decoded instruction and outputs it.
[0076] 3.1 j-th clock cycle
[0077] 3.1.1 Fetch stage
[0078] In the j-th clock cycle, where j is an integer, when the discrete cosine transform circuit 200 applies the butterfly calculation unit 112 to a matrix C, the control unit 261 of the fetch stage 301 receives and decodes the first butterfly calculation instruction, in order to control the memory stage 302 and the arithmetic stage 303 in subsequent clock cycles.
[0079] 3.2 (j+1)-th clock cycle

[0080] 3.2.1 Memory stage
[0081] Referring to FIG. 5, in the (j+1)-th clock cycle the memory stage 302 prepares data for the registers Reg2 to Reg5 in the arithmetic stage 303 according to the two calculations of the first butterfly calculation instruction, whose operands are taken from the first row and the second row of the matrix C, respectively. As shown in FIG. 5, the image data coefficients C[0][0] and C[1][0] are read from the data memory 252 and transmitted to selectable input 12 of the multiplexer 231 and selectable input 33 of the multiplexer 233, and the image data coefficients C[0][1] and C[1][1] are read from the data memory 252 and transmitted to selectable input 22 of the multiplexer 232 and selectable input 43 of the multiplexer 234.
[0082] 3.2.2 Fetch stage
[0083] The control unit 261 in the fetch stage 301 receives and decodes the second butterfly calculation instruction in the same clock cycle.
[0084] 3.3 (j+2)-th clock cycle
[0085] The operation of the three stages in the (j+2)-th clock cycle is described in detail with reference to FIG. 6.

[0086] 3.3.1 Arithmetic stage
[0087] Each register in the arithmetic stage receives its data from the corresponding multiplexer of the preceding stage and supplies the data over connections to the arithmetic logic units ALU1 to ALU4. Each arithmetic logic unit obtains two input values from the data supplied by the registers. As shown in FIG. 6, the values input to each arithmetic logic unit are marked beside that unit. A value marked beside a connection to a barrel shifter or an adder/subtractor is the input delivered over that connection to the barrel shifter or adder/subtractor. For example, the shifter 201 and the adder/subtractor 221 each receive C[0][0] as an input operand.
[0088] For the first row of the matrix C under the first butterfly calculation instruction, ALU1 substitutes the input value C[0][0] for D[0] to complete equation (10) and outputs 5C[0][0], and ALU3 substitutes the input value C[0][1] for D[1] to complete equation (9) and outputs 5C[0][1]. ALU1 thus completes, within one clock cycle, the operation of the transfer line from D[0] to node 1122 in the butterfly calculation unit 112. Specifically, the shifter and the adder/subtractor in each arithmetic logic unit perform the shift and the addition or subtraction of the corresponding equation. For example, when ALU1 evaluates equation (10) with C[0][0] substituted for D[0], the shifter 201 shifts C[0][0] left by 2 bits to obtain 4×C[0][0] and outputs the result to the adder/subtractor 221. The adder/subtractor 221 receives and adds the two input values 4×C[0][0] and C[0][0], and outputs 5×C[0][0]. The other arithmetic logic units operate internally in a similar manner.
[0089] For the second row of the matrix C under the first butterfly calculation instruction, ALU2 substitutes the input value C[1][0] for D[0] to complete equation (10) and outputs 5C[1][0], and ALU4 substitutes the input value C[1][1] for D[1] to complete equation (9) and outputs 5C[1][1]. The output values 5C[0][0], 5C[1][0], 5C[0][1] and 5C[1][1] are stored in the memory 251 as intermediate results.
[0090] The arithmetic logic units ALU1 to ALU4 execute the same calculation in parallel on different data among the image data coefficients, realizing a single instruction stream, multiple data streams (SIMD) architecture. Note that the discrete cosine transform circuit is not limited to SIMD. Two of ALU1 to ALU4 can execute one calculation to form one SIMD group while the other two execute another calculation to form another SIMD group, which realizes a multiple instruction streams, multiple data streams (MIMD) architecture.
[0091] 3.3.2 Memory stage
[0092] Referring to FIG. 6, in the (j+2)-th clock cycle the memory stage supplies further data to the registers in the arithmetic stage according to the second butterfly calculation instruction, whose operands are taken from the first row and the second row of the matrix C. As shown in FIG. 6, the image data coefficients C[0][0] and C[1][0] are read from the memory 252 and transmitted to selectable input 12 of the multiplexer 231 and selectable input 33 of the multiplexer 233, and the image data coefficients C[0][1] and C[1][1] are read from the memory 252 and transmitted to selectable input 22 of the multiplexer 232 and selectable input 43 of the multiplexer 234.
[0093] 3.3.3 Fetch stage
[0094] The control unit 261 in the fetch stage 301 receives and decodes the third butterfly calculation instruction in the same clock cycle.
[0095] 3.4 (j+3)-th clock cycle
[0096] The operation of the three stages in the (j+3)-th clock cycle is described in detail with reference to FIG. 7.
[0097] 3.4.1 Arithmetic stage
[0098] For the first row of the matrix C under the second butterfly calculation instruction, ALU1 substitutes C[0][0] for D[0] to complete equation (11) and outputs 9C[0][0], and ALU3 substitutes C[0][1] for D[1] to complete equation (12) and outputs 9C[0][1].
For the second row of the matrix C under the second butterfly calculation instruction, ALU2 substitutes C[1][0] for D[0] to complete equation (11) and outputs 9C[1][0], and ALU4 substitutes C[1][1] for D[1] to complete equation (12) and outputs 9C[1][1]. The discrete cosine transform circuit 200 includes paths 270 and 271 for directing the output values 9C[0][0], 9C[1][0], 9C[0][1] and 9C[1][1] to the preceding memory stage in the (j+3)-th clock cycle. The output values 9C[0][0], 9C[1][0], 9C[0][1], 9C[1][1], 5C[0][0], 5C[1][0], 5C[0][1] and 5C[1][1] are all intermediate results.
[0099] 3.4.2 Memory stage
[0100] In the (j+3)-th clock cycle, the memory stage 302 supplies further data to the registers in the arithmetic stage 303 according to the two calculations of the third butterfly calculation instruction, which operate on intermediate data. The intermediate data comprise the outputs of the arithmetic logic units for the two calculations of the first butterfly calculation instruction and their outputs for the two calculations of the second butterfly calculation instruction. Note that the outputs of ALU1 and ALU2, namely 9C[0][0] and 9C[1][0], are directed over the data path 270 to selectable input 31 of the multiplexer 233 without being stored in memory. The outputs of ALU3 and ALU4, namely 9C[0][1] and 9C[1][1], are directed to selectable input 41 of the multiplexer 234 without being stored in memory. The arithmetic stage 303 can therefore selectively receive 9C[0][0], 9C[1][0], 9C[0][1] and 9C[1][1] in the next clock cycle. The stored values 5C[0][0] and 5C[1][0] are read from the memory 251 and output to the multiplexer 232, and 5C[0][1] and 5C[1][1] are read from the memory 251 and output to the multiplexer 231.
[0101] 3.4.3 Fetch stage
[0102] The control unit 261 in the fetch stage 301 may receive and decode other butterfly calculation instructions in the (j+3)-th clock cycle.
[0103] 3.5 (j+4)-th clock cycle
[0104] The operation of the arithmetic stage in the (j+4)-th clock cycle is described in detail with reference to FIG. 8.
[0105] 3.5.1 Arithmetic stage
[0106] For the intermediate data of the first row of the matrix C under the third butterfly calculation instruction, ALU1 substitutes 5C[0][1] for t1 and 9C[0][0] for t3 to complete equation (13) and outputs 5C[0][1] + 9C[0][0], which corresponds to equation (5). ALU3 substitutes 5C[0][0] for t2 and 9C[0][1] for t4 to complete equation (14) and outputs 5C[0][0] - 9C[0][1], which corresponds to equation (7).
[0107] For the intermediate data of the second row of the matrix C under the third butterfly calculation instruction, ALU2 substitutes 5C[1][1] for t1 and 9C[1][0] for t3 to complete equation (13) and outputs 5C[1][1] + 9C[1][0], which corresponds to equation (6). ALU4 substitutes 5C[1][0] for t2 and 9C[1][1] for t4 to complete equation (14) and outputs 5C[1][0] - 9C[1][1], which corresponds to equation (8). Accordingly, the discrete cosine transform circuit can execute these three butterfly calculation instructions to apply the butterfly calculation unit 112 to the matrix C.
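The three-instruction sequence across clock cycles j+2 to j+4 can be traced with a small behavioral sketch. This is only a model under stated assumptions: `alu`, `memory_251` and `forwarded` are illustrative names, the ALU is reduced to "shift one input left, then add or subtract the other", and the Python loop over rows stands in for four arithmetic logic units working in parallel.

```python
def alu(a, b, shift=0, subtract=False):
    """Hypothetical ALU model: barrel-shift `a` left by `shift`, then add or subtract `b`."""
    shifted = a << shift
    return shifted - b if subtract else shifted + b


def apply_butterfly_112(c):
    """Trace cycles j+2 to j+4 for a 2x2 matrix `c` given as a list of two rows."""
    rows = (0, 1)

    # Cycle j+2, first instruction, equations (9) and (10).
    # The four results are written to memory 251 as intermediate data.
    memory_251 = {}
    for r in rows:
        memory_251["t2", r] = alu(c[r][0], c[r][0], shift=2)  # ALU1 / ALU2: 5*C[r][0]
        memory_251["t1", r] = alu(c[r][1], c[r][1], shift=2)  # ALU3 / ALU4: 5*C[r][1]

    # Cycle j+3, second instruction, equations (11) and (12).
    # The four results are forwarded to the memory stage instead of being stored.
    forwarded = {}
    for r in rows:
        forwarded["t3", r] = alu(c[r][0], c[r][0], shift=3)   # ALU1 / ALU2: 9*C[r][0]
        forwarded["t4", r] = alu(c[r][1], c[r][1], shift=3)   # ALU3 / ALU4: 9*C[r][1]

    # Cycle j+4, third instruction, equations (13) and (14).
    # Each ALU combines one value read from memory 251 with one forwarded value.
    out = []
    for r in rows:
        d0 = alu(forwarded["t3", r], memory_251["t1", r])                 # t3 + t1
        d1 = alu(memory_251["t2", r], forwarded["t4", r], subtract=True)  # t2 - t4
        out.append((d0, d1))
    return out


print(apply_butterfly_112([[1, 2], [3, 4]]))
```

For each row the pair returned is (9·C[r][0] + 5·C[r][1], 5·C[r][0] - 9·C[r][1]), so the first row yields the values of equations (5) and (7) and the second row those of equations (6) and (8).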
[0108] Note that 5C[0][0], 5C[0][1], 5C[1][0], 5C[1][1], 9C[0][0], 9C[0][1], 9C[1][0] and 9C[1][1] are eight different values input to the arithmetic logic units ALU1 to ALU4 in the (j+4)-th clock cycle. Four of these eight values are read from the memory 251, and the remaining values are forwarded from ALU1 to ALU4 over the paths 270 and 271. The memories 251 and 252 may each have two ports, and the bandwidth of these two ports is sufficient to transfer one of the eight values. The memory in the memory stage therefore does not need four ports in order to deliver eight values to four arithmetic logic units at the same time.
[0109] 4. Variations
[0110] FIGS. 9A to 9E show different embodiments of the butterfly calculation unit. Referring to FIG. 9A, because the butterfly calculation unit 113 has no constant multiplier, the operation of node 1131 or node 1132 can be carried out and completed in one clock cycle by one of the arithmetic logic units ALU1 to ALU4.
[0111] Referring to FIG. 9B, a constant n is marked beside the horizontal transfer line from node D[0] to node 1141 and beside the horizontal transfer line from node D[1] to node 1142. If the value of n is the k-th power of the integer 2 or of the fraction 1/2 (that is, n = 2^k or n = (1/2)^k), where k is an integer, the operation represented by the transfer line from node D[0] to node 1141 and the operation represented by the transfer line from node D[1] to node 1142 can be completed by a bit shift in the shifter. Either of node 1141 and node 1142 can be expressed as an addition performed by the adder. The operations represented by node 1141 or node 1142 and by the transfer lines of the related nodes can therefore be completed by one of ALU1 to ALU4 in one clock cycle.
[0112] Similarly, in FIG. 9C, if the value of m is the f-th power of the integer 2 or of the fraction 1/2 (that is, m = 2^f or m = (1/2)^f), where f is an integer, the operation represented by the transfer line from node D[0] to node 1152 and the operation represented by the transfer line from node D[1] to node 1151 can be completed by a bit shift in the shifter. In a first example, where n and m are not integer powers of 2 or of 1/2, the operation represented by each transfer line may require more arithmetic logic units in the discrete cosine transform circuit 200 and more clock cycles to complete. In the first example, the butterfly calculation instructions in the instruction memory 253 can control the discrete cosine transform circuit 200 to execute the operations whose transfer lines carry a constant multiplier before the operations whose transfer lines carry none.
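The power-of-two case can be illustrated with a short Python sketch. The helper names are hypothetical. The node function assumes, for illustration only, that the second line into the node carries no multiplier and that the node adds its two inputs. With a negative exponent the right shift discards low-order bits, so the result is floored rather than exact.

```python
def mul_pow2(x, k):
    """Multiply integer x by 2**k with one shift; a negative k multiplies by (1/2)**(-k)."""
    if k >= 0:
        return x << k
    return x >> -k  # arithmetic right shift: the sign is kept and the result is floored


def node_shift_add(scaled_input, other_input, k):
    """One ALU pass: shift the scaled input, then add the unscaled input."""
    return mul_pow2(scaled_input, k) + other_input


print(mul_pow2(5, 3))           # 5 * 8
print(mul_pow2(40, -3))         # 40 * (1/8)
print(node_shift_add(3, 4, 2))  # 3 * 4 + 4
```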

[0113] Referring to FIG. 9D, if n and m are each an integer power of the integer 2 or of the fraction 1/2, the operation represented by any one transfer line into node 1161 can be completed by a bit shift in the barrel shifter. The operation represented by the other transfer line into node 1161 can be completed by one of the arithmetic logic units ALU1 to ALU4 in one clock cycle. The operation represented by node 1161, together with the operations represented by the transfer lines connected to node 1161, can therefore be completed by one of ALU1 to ALU4 in two clock cycles. The operation corresponding to node 1162 can be understood in the same way.
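A minimal sketch of the two-cycle evaluation, assuming the node sums n times one input and m times the other with n = 2^k_n and m = 2^k_m for non-negative exponents. The function and parameter names are illustrative only.

```python
def node_two_cycles(x_n, x_m, k_n, k_m):
    """Evaluate n*x_n + m*x_m in two ALU passes, with n = 2**k_n and m = 2**k_m."""
    # Cycle 1: the first transfer line needs only the barrel shifter.
    partial = x_n << k_n
    # Cycle 2: shift the second input and add the partial result in the same pass.
    return (x_m << k_m) + partial


print(node_two_cycles(3, 5, 2, 1))  # 4*3 + 2*5
```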

[0114] In a second example, n is an integer power of the integer 2 or of the fraction 1/2, and m differs by 1 from an integer power of the integer 2 or of the fraction 1/2 (that is, m = 2^f ± 1 or m = (1/2)^f ± 1). The operation represented by node 1161, together with the operations represented by the transfer lines connected to node 1161, can still be completed by one of the arithmetic logic units ALU1 to ALU4 in two clock cycles. In this case the arithmetic logic unit executes the operation of the transfer line associated with m before the operation of the transfer line associated with n. In the second example, the butterfly calculation instructions in the instruction memory 253 can control the discrete cosine transform circuit 200 to execute the operation related to m first and the operation related to n afterwards.
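One plausible reading of this ordering rule is that an ALU pass offers a single shift and a single add or subtract. The sketch below follows that assumption. `alu` and `node_value` are hypothetical names, and the constants in the usage lines (n = 8, with m = 9 or m = 7) are illustrative values chosen only to satisfy n = 2^k and m = 2^f ± 1.

```python
def alu(a, b, shift=0, subtract=False):
    """Hypothetical ALU model: barrel-shift `a` left by `shift`, then add or subtract `b`."""
    shifted = a << shift
    return shifted - b if subtract else shifted + b


def node_value(x_n, x_m, k_n, k_m, plus=True):
    """Evaluate n*x_n + m*x_m with n = 2**k_n and m = 2**k_m + 1 (or - 1 if plus is False)."""
    # Cycle 1: m*x_m uses both adder inputs, (x_m << k_m) +/- x_m,
    # so nothing else can be folded into this pass.
    partial = alu(x_m, x_m, shift=k_m, subtract=not plus)
    # Cycle 2: n*x_n is a pure shift, which leaves the adder free for the partial result.
    return alu(x_n, partial, shift=k_n)


print(node_value(3, 5, 3, 3))              # 8*3 + 9*5
print(node_value(3, 5, 3, 3, plus=False))  # 8*3 + 7*5
```

Under this model the reverse order would need a third adder input in the second pass (the shifted value, the unshifted value and the partial result), which is why the pass for the constant that differs by 1 from a power of two comes first.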
[0115] Similarly, in a third example, m is an integer power of the integer 2 or of the fraction 1/2, and n differs by 1 from an integer power of the integer 2 or of the fraction 1/2 (that is, n = 2^k ± 1 or n = (1/2)^k ± 1). The operation represented by node 1161, together with the operations represented by the transfer lines connected to node 1161, can still be completed by one of the arithmetic logic units ALU1 to ALU4 in two clock cycles. In this case the arithmetic logic unit executes the operation of the transfer line associated with n before the operation of the transfer line associated with m. In the third example, the butterfly calculation instructions in the instruction memory 253 can control the discrete cosine transform circuit 200 to execute the operation related to n first and the operation related to m afterwards. The situations of the second and third examples apply equally to node 1162 and the transfer lines connected to node 1162. FIG. 9E is an operation example for the butterfly calculation unit 117. FIGS. 10 to 12 show the butterfly calculation unit 117 in the discrete cosine transform circuit 200 operating on the matrix C over three consecutive clock cycles.
[0116] Referring to FIG. 9E, the transfer lines 1173 and 1174 connected to node 1171 are associated with the constant multipliers 9 and 8, respectively, where 9 = 2^3 + 1 and 8 = 2^3. As shown in FIGS. 10 to 12, when the discrete cosine transform circuit 200 uses the butterfly calculation unit 117 to row-transform the matrix C, the arithmetic logic units ALU1 to ALU4 first execute the operation of the transfer line 1173 associated with the multiplier 9 and then the operation of the transfer line 1174 associated with the multiplier 8. Similarly, the transfer lines 1175 and 1176 connected to node 1172 in the butterfly calculation unit 117 are associated with the constant multipliers 8 and 9, respectively, where 8 = 2^3 and 9 = 2^3 + 1. As shown in FIGS. 10 to 12, the arithmetic logic units ALU1 to ALU4 first execute the operation of the transfer line 1176 associated with the multiplier 9 and then the operation of the transfer line 1175 associated with the multiplier 8. The operation of the arithmetic logic units on the matrix C in the butterfly calculation unit 117 is completed in only two clock cycles.
[0117] 5. Conclusion
[0118] As described above, the image processing apparatus can store different butterfly calculation instructions to carry out different butterfly algorithms, and can thereby implement discrete cosine transform circuits for different image and video compression standards, such as discrete cosine transform circuits suitable for MPEG2 and H.264. As instructions for more compression standards are integrated into the instruction memory, the flexibility and standards compatibility of the image processing apparatus 100 increase. The four arithmetic logic units simultaneously perform different operations on different image data coefficients to realize MIMD and improve the overall efficiency of the discrete cosine transform circuit. In addition, because of the data forwarding paths, the memory of the discrete cosine transform circuit does not need four ports. In summary, the discrete cosine transform circuit proposed by the present invention is applicable to various image processing apparatuses, including but not limited to set-top boxes, media players, televisions and video conferencing devices.
[0119] In summary, the present invention meets the requirements for an invention patent, and a patent application is filed in accordance with the law. The foregoing describes only preferred embodiments of the present invention. Equivalent modifications or variations made by those skilled in the art in accordance with the spirit of the invention shall fall within the scope of the following claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0120] FIG. 1A is a block diagram of an embodiment of the image processing apparatus 100 including the discrete cosine transform circuit 165.
[0121] FIG. 1B is a block diagram of a second embodiment of the image processing apparatus, showing how digital content is received from a network.
[0122] FIG. 2 is a block diagram of an embodiment of the discrete cosine transform circuit.
[0123] FIG. 3 is a schematic diagram of a butterfly algorithm.
[0124] FIG. 4 is a schematic diagram of one unit of the butterfly algorithm.
[0125] FIG. 5 is a schematic diagram of the discrete cosine transform circuit operating in the (j+1)-th clock cycle.
[0126] FIG. 6 is a schematic diagram of the discrete cosine transform circuit operating in the (j+2)-th clock cycle.
[0127] FIG. 7 is a schematic diagram of the discrete cosine transform circuit operating in the (j+3)-th clock cycle.
[0128] FIG. 8 is a schematic diagram of the discrete cosine transform circuit operating in the (j+4)-th clock cycle.
[0129] FIGS. 9A to 9E are schematic diagrams of embodiments of the butterfly calculation unit.
[0130] FIGS. 10 to 12 are schematic diagrams of the butterfly calculation unit in the discrete cosine transform circuit operating on a matrix over three consecutive clock cycles.
DESCRIPTION OF MAIN COMPONENT SYMBOLS
[0131] Image processing apparatus 100, 101
[0132] Processor 151
[0133] Main memory 152
[0134] Non-volatile memory 153
[0135] Mass storage device 154
[0136] Content protection unit 155
[0137] Demodulator 156
[0138] Tuner 157
[0139] Power supply 158

[0140] Crystal oscillator 159
[0141] I/O unit 160
[0142] Audio output unit 161
[0143] Video output unit 162
[0144] Antenna 163
[0145] Port 164
[0146] Discrete cosine transform circuit 165, 200
[0147] Network interface 170
[0148] Shifters 201, 202, 203, 204
[0149] Adder/subtractors 221, 222, 223, 224
[0150] Multiplexers 231, 232, 233, 234,
[0151] 241, 242
[0152] Memory 251
[0153] Data memory 252
[0154] Instruction memory 253
[0155] Control unit 261
[0156] Paths 270, 271
[0157] Fetch stage 301
[0158] Memory stage 302
[0159] Arithmetic stage 303
[0160] Registers Reg1, Reg2, Reg3, Reg4, Reg5
[0161] Arithmetic logic units ALU1, ALU2,
[0162] ALU3, ALU4
[0163] Butterfly calculation units 111, 112, 113, 114, 115, 116, 117


Claims

1. A discrete cosine transform circuit, comprising a butterfly computation circuit having a pipeline with a fetch stage, a memory stage and an arithmetic stage, wherein: the fetch stage receives and decodes a butterfly computation instruction set; the memory stage comprises a memory bank for storing image data coefficients and intermediate calculation data output by the arithmetic stage, and, according to at least one decoded butterfly computation instruction, outputs a first set of data stored in the memory stage in a first clock cycle of the butterfly computation circuit; and the arithmetic stage comprises a plurality of registers, a first arithmetic logic unit and a second arithmetic logic unit, the plurality of registers receiving the first set of data from the memory bank as input data of the arithmetic stage, the first arithmetic logic unit and the second arithmetic logic unit each receiving a set of input data from the plurality of registers, performing a first calculation on the set of input data, and, according to at least one decoded butterfly computation instruction, outputting a calculation result of the first calculation in a second clock cycle after the first clock cycle; wherein the butterfly computation circuit comprises a path that directs the calculation result from the arithmetic stage to the memory stage in the same clock cycle in which the first calculation is completed, such that at least one of the registers can, in a third clock cycle after the second clock cycle, selectively receive the calculation result from the path or receive a next set of data from the memory bank.
2. The discrete cosine transform circuit of claim 1, wherein the arithmetic stage performs a second calculation in the third clock cycle on the calculation result and a second set of data obtained from the memory bank.
3. The discrete cosine transform circuit of claim 1, wherein the memory stage further comprises at least one multiplexer, such that at least one of the registers can, according to at least one decoded butterfly computation instruction, selectively receive the calculation result from the path or obtain the next set of data from the memory bank.
4. The discrete cosine transform circuit of claim 1, wherein the first arithmetic logic unit and the second arithmetic logic unit each perform a calculation substantially equivalent to multiplication on different data, the data comprising at least one set of image data coefficients or the calculation result.
5. The discrete cosine transform circuit of claim 4, wherein the first arithmetic logic unit and the second arithmetic logic unit each comprise a barrel shifter and an adder for performing the calculation substantially equivalent to multiplication.
6. The discrete cosine transform circuit of claim 4, wherein the arithmetic stage further comprises a third arithmetic logic unit and a fourth arithmetic logic unit which, while the first calculation is performed, each perform another, second calculation on different data, the third arithmetic logic unit and the fourth arithmetic logic unit each performing a third calculation according to at least one set of image data coefficients or the calculation result.
7. The discrete cosine transform circuit of claim 1, wherein the arithmetic stage further comprises a third arithmetic logic unit and a fourth arithmetic logic unit which, while the first calculation is performed, each perform a third calculation on different data, the third arithmetic logic unit and the fourth arithmetic logic unit each performing the third calculation according to at least one set of image data coefficients or the calculation result.
8. The discrete cosine transform circuit of claim 1, wherein a butterfly computation unit operation performed by the arithmetic stage comprises multiplying a first image data coefficient by a first constant, multiplying a second image data coefficient by a second constant, and adding the two products, and wherein, if the first constant differs by one from an integer power of 2 or of 1/2 and the second constant is an integer power of 2 or of 1/2, the multiplication of the first image data coefficient is performed before the multiplication of the second image data coefficient.
9. An image processing apparatus, comprising a discrete cosine transform circuit having a pipeline with a fetch stage, a memory stage and an arithmetic stage, wherein: the fetch stage receives and decodes a butterfly computation instruction set; the memory stage comprises a memory bank for storing image data coefficients and intermediate calculation data output by the arithmetic stage, and, according to at least one decoded butterfly computation instruction, outputs a first set of data stored in the memory stage in a first clock cycle of the butterfly computation circuit; and the arithmetic stage comprises a plurality of registers, a first arithmetic logic unit and a second arithmetic logic unit, the plurality of registers receiving the first set of data from the memory bank as input data of the arithmetic stage, the first arithmetic logic unit and the second arithmetic logic unit each receiving a set of input data from the plurality of registers, performing a first calculation on the set of input data, and, according to at least one decoded butterfly computation instruction, outputting a calculation result of the first calculation in a second clock cycle after the first clock cycle; wherein the butterfly computation circuit comprises a path that directs the calculation result from the arithmetic stage to the memory stage in the same clock cycle in which the first calculation is completed, such that at least one of the registers can, in a third clock cycle after the second clock cycle, selectively receive the calculation result from the path or receive a next set of data from the memory bank.
10. The image processing apparatus of claim 9, wherein the arithmetic stage performs a second calculation in the third clock cycle on the calculation result and a second set of data obtained from the memory bank.
11. The image processing apparatus of claim 9, wherein the memory stage further comprises at least one multiplexer, such that at least one of the registers can, according to at least one decoded butterfly computation instruction, selectively receive the calculation result from the path or obtain the next set of data from the memory bank.
12. The image processing apparatus of claim 9, wherein the first arithmetic logic unit and the second arithmetic logic unit each perform a calculation substantially equivalent to multiplication on different data, the data comprising at least one set of image data coefficients or the calculation result.
13. The image processing apparatus of claim 12, wherein the first arithmetic logic unit and the second arithmetic logic unit each comprise a barrel shifter and an adder for performing the calculation substantially equivalent to multiplication.
14. The image processing apparatus of claim 12, wherein the arithmetic stage further comprises a third arithmetic logic unit and a fourth arithmetic logic unit which, while the first calculation is performed, each perform another, second calculation on different data, the third arithmetic logic unit and the fourth arithmetic logic unit each performing a third calculation according to at least one set of image data coefficients or the calculation result.
15. The image processing apparatus of claim 9, wherein the arithmetic stage further comprises a third arithmetic logic unit and a fourth arithmetic logic unit which, while the first calculation is performed, each perform a third calculation on different data, the third arithmetic logic unit and the fourth arithmetic logic unit each performing the third calculation according to at least one set of image data coefficients or the calculation result.
16. The image processing apparatus of claim 9, wherein a butterfly computation unit operation performed by the arithmetic stage comprises multiplying a first image data coefficient by a first constant, multiplying a second image data coefficient by a second constant, and adding the two products, and wherein, if the first constant differs by one from an integer power of 2 or of 1/2 and the second constant is an integer power of 2 or of 1/2, the multiplication of the first image data coefficient is performed before the multiplication of the second image data coefficient.
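Illustrative note (not part of the claims): the shifter-and-adder multiplication of claims 5 and 13 and the ordering of claims 8 and 16 can be sketched in software. The sketch below is a minimal model under stated assumptions, not the patented circuit: the function names `alu` and `butterfly` are hypothetical, only non-negative shift amounts (integer powers of 2, not of 1/2) are modelled, and the reading that ordering the operations this way lets each product cost one shift-and-add step is an interpretation, not a statement from the specification.

```python
def alu(x: int, y: int, k: int, subtract: bool = False) -> int:
    """Hypothetical single ALU step: (x shifted left by k) plus or minus y.

    Models a barrel shifter feeding an adder/subtractor.
    Assumption: k >= 0, so only powers of 2 are covered here.
    """
    if k < 0:
        raise ValueError("this sketch only models non-negative shifts")
    shifted = x << k
    return shifted - y if subtract else shifted + y


def butterfly(a: int, b: int, k1: int, sign: int, k2: int) -> int:
    """Compute a * (2**k1 + sign) + b * 2**k2 in two ALU steps.

    sign is +1 or -1, so the first constant differs by one from a power of 2;
    the second constant is exactly a power of 2.
    """
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    # Step 1: multiply the first coefficient: (a << k1) + a  or  (a << k1) - a.
    first = alu(a, a, k1, subtract=(sign == -1))
    # Step 2: shift the second coefficient and add the first product.
    return alu(b, first, k2)


if __name__ == "__main__":
    # 3 * (2**2 + 1) + 5 * 2**1 = 15 + 10
    print(butterfly(3, 5, 2, +1, 1))
    # 7 * (2**3 - 1) + 2 * 2**2 = 49 + 8
    print(butterfly(7, 2, 3, -1, 2))
```

In this model the product with the power-of-2 constant never needs a step of its own, because its shift is folded into the final addition; performing the other multiplication first is what makes that possible.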
