TWI246255B

TWI246255B - Parallel embedded block encoder

Info

Publication number: TWI246255B
Application number: TW92131980A
Authority: TW
Inventors: Liang-Gee Chen; Hung-Chi Fang; Yu-Wei Chang
Original assignee: Univ Nat Taiwan
Priority date: 2003-11-14
Filing date: 2003-11-14
Publication date: 2005-12-21
Also published as: TW200516870A

Abstract

A parallel embedded block encoder is provided that is fast and saves memory. The encoder processes one discrete wavelet transform (DWT) coefficient per clock cycle without storing any state variables, which removes that memory usage and reduces the external memory bandwidth. With the same area and much lower power consumption, the invention achieves six times the processing speed of other architectures. At an operating frequency of 100 MHz it can process 50,000,000 pixels per second, which is enough to support high-definition television (HDTV) encoding at 30 frames per second.
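The throughput figures in the abstract can be turned into a per-frame pixel budget. The short Python sketch below only does that arithmetic; the 1280 x 720 frame size is an assumption chosen for illustration and is not stated in the patent.

```python
# Figures stated in the abstract.
PIXELS_PER_SECOND = 50_000_000  # throughput at 100 MHz
FRAMES_PER_SECOND = 30          # target frame rate

# Pixels that can be processed within one frame period.
pixel_budget_per_frame = PIXELS_PER_SECOND // FRAMES_PER_SECOND

# Assumption (not from the patent): a 1280 x 720 frame, one sample per pixel.
width, height = 1280, 720
frame_pixels = width * height

print(pixel_budget_per_frame, frame_pixels, frame_pixels <= pixel_budget_per_frame)
```

Other frame formats can be checked by changing `width` and `height`.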

Description

[Technical field of the invention]

The present invention relates to a parallel embedded block encoder, and in particular to an algorithm that processes all bit planes in parallel at once. Processing is six times faster than current common practice, the memory bandwidth requirement drops to one sixth, and the area cost does not increase.

[Prior art]

JPEG 2000 is the latest generation of still image compression standards. Its high quality and rich set of compression tools make it very likely to replace JPEG as the most widely used still image compression standard in products such as digital still cameras and digital video cameras. Within a JPEG 2000 system, the embedded block encoder is the most complex part and the main target of research. The architectures proposed so far all operate serially, one bit plane at a time, which has the following drawbacks: (1) processing is slow; (2) a large on-chip random access memory is required; (3) external memory is read one bit plane at a time, which is inefficient and complicates system integration.

The present invention resolves all of these problems. Its processing speed is six to ten times that of other architectures or more. Because the processing is parallel, state variables never need to be stored, which lowers the memory requirement. In addition, processing one discrete wavelet transform (DWT) coefficient at a time greatly helps both external memory access and integration of the encoder.

[Summary of the invention]

The main object of the invention is to be usable in any product that requires real-time image compression, such as digital still cameras, digital video cameras and real-time surveillance systems, as well as in medical imaging and military image processing that require lossless compression. Another object is to provide, for the latest still image compression standard, an algorithm that processes all bit planes in parallel at once. A further object is a processing speed six times that of current common practice, with the memory bandwidth requirement reduced to one sixth and no increase in area cost.

To achieve these objects, the invention is a parallel embedded block encoder comprising a parallel embedded block encoder algorithm and a parallel embedded block encoder architecture. The algorithm is as follows:

a. Coding pass classification: the relative positions of the most significant bit planes of two coefficients determine the contribution, which in turn determines the coding pass of each bit plane of the center coefficient.
b. State variable calculation: state variables are computed on the fly by comparing the values of two coefficients, so they do not need to be stored.
c. Magnitude coding: the values of the center point and its eight neighbors are compared in parallel to produce the contributions, from which the magnitude context is formed by table lookup.
d. Sign coding: the most significant bit (MSB) positions of the center point and its eight neighbors are compared in parallel to produce the contributions, from which the sign context is formed by table lookup.

The architecture processes one DWT coefficient per clock cycle and outputs an embedded bitstream. It comprises: a Gobang register bank (GRB); a Compute MSB Pass (CMP) module, which generates the MSB state; a Find contribution (FC) module, which generates the coding pass and parameters; a Context formation (CF) module; a Reconfigurable First-in First-out (RFIFO) module; and an Arithmetic encoder (AE) module. One FC module and one CF module are needed for every bit plane, one AE is needed for every two bit planes, and only one each of the GRB, the RFIFO and the CMP is needed.

[Embodiment]

Refer to Figures 1 to 12, which show: the positions of the coefficients; the table from which the sign coding context is obtained once the contributions are computed; the block diagram of the parallel embedded block encoder architecture; the Gobang register bank; the CMP module; the detailed circuit of each processing element (PE) of the CMP module; the architecture of the FC module; the circuit of each processing element of the FC module; a computation circuit; the CF module; the circuit that generates the sign context; a simulation of FIFO length (number of registers) against the average total number of clock cycles needed per block; and the architecture of the arithmetic encoder (AE).

As the figures show, the invention is a parallel embedded block encoder comprising a parallel embedded block encoder algorithm and a parallel embedded block encoder architecture. It processes all bit planes in parallel at once, so that processing is six times faster than current common practice, the memory bandwidth requirement drops to one sixth, and the area cost does not increase. The algorithm is as follows:

a. Coding pass classification: the relative positions of the most significant bit planes of two coefficients determine the contribution, which in turn determines the coding pass of each bit plane of the center coefficient.
Here $\nu_s$ denotes the magnitude of a discrete wavelet transform (DWT) coefficient, where the subscript $s \in \{c, d_0, d_1, d_2, d_3, h_0, h_1, v_0, v_1\}$ indicates the position of the coefficient (see Figure 1) and the center point $c$ is the coefficient currently being coded. $P_c^k$ denotes the coding pass of the $k$-th bit plane of $c$ (bit plane 0 is the lowest). To compute $P_c^k$, the contribution of each neighboring coefficient to the $k$-th bit plane of $c$, written $\rho_s^k$, must be computed first. The eight neighbors are handled in two ways. For a neighbor that is coded later than $c$ (in Figure 1, for example, $h_1$, $v_1$ and $d_3$), the formula is

$$\rho_s^k = \begin{cases} 0, & k \ge m_s \\ 1, & k < m_s \end{cases}$$

where $m_s$ is the MSB position of coefficient $s$, that is, $2^{m_s} \le \nu_s < 2^{m_s+1}$. For a neighbor that is coded earlier than $c$, the formula is

$$\rho_s^k = \begin{cases} 1, & k < m_s \\ 1, & (k = m_s)\ \&\ (P_s^k = 1) \\ 0, & \text{otherwise} \end{cases}$$

Once the contributions $\rho_s^k$ are known, $P_c^k$ follows:

$$P_c^k = \begin{cases} 2, & k < m_c \\ 3, & (k \ge m_c)\ \&\ \left(\sum_s \rho_s^k = 0\right) \\ 1, & \text{otherwise} \end{cases}$$

b. State variable calculation: state variables are computed on the fly by comparing the values of two coefficients, so they do not need to be stored. Before the magnitude and sign coding algorithms are described, two state variables are defined as follows:

$$X_s^k = \begin{cases} 1, & k < m_s \\ 0, & k \ge m_s \end{cases} \qquad K_s^k = \begin{cases} 1, & k = m_s \\ 0, & k \ne m_s \end{cases}$$

Here $X_s^k$ indicates whether the $k$-th bit plane of coefficient $s$ lies below its most significant bit (MSB), and $K_s^k$ indicates whether the $k$-th bit plane of $s$ is its MSB.

c. Magnitude coding: the values of the center point and its eight neighbors are compared in parallel to produce the contributions, from which the magnitude context is formed by table lookup. Here too, the contribution of each neighbor to the center point must be computed first, and again the neighbors are split into two groups according to their coding order relative to the center point. For a neighbor coded after the center point, the contribution $\sigma_s^k$ is given by the following formula:

[equation illegible in the source]

For a neighbor coded before the center point, it is given by:

[equation illegible in the source]

Once all eight contributions are computed, they are accumulated in the horizontal (H), vertical (V) and diagonal (D) directions:

$$H = \sum_{i=0}^{1} \sigma_{h_i}^k, \qquad V = \sum_{i=0}^{1} \sigma_{v_i}^k, \qquad D = \sum_{i=0}^{3} \sigma_{d_i}^k$$

The context is then simply looked up in the tables defined by the JPEG 2000 standard.

d. Sign coding: the MSB positions of the center point and its eight neighbors are compared in parallel to produce the contributions, from which the sign context is formed by table lookup. $\chi_s$ denotes the sign of $s$ (0 means positive), and the following two new variables are defined:

$$\alpha_s = \sum_k X_s^k \,\&\, K_c^k, \qquad \beta_s = \sum_k K_s^k \,\&\, K_c^k$$

$\alpha_s = 1$ means that the MSB of $s$ lies in a higher bit plane than the MSB of $c$; $\beta_s = 1$ means that the two MSBs lie in the same bit plane. Sign coding needs the contributions of only the four neighbors to the left, right, above and below. For a neighbor coded after the center point, the contribution is given by the following formula:

[equation illegible in the source]

Otherwise the following formula is used:

[equation illegible in the source]

Once the contributions are computed, the table in Figure 2 gives HX and VX, from which the sign coding context is obtained.

The parallel embedded block encoder architecture (Figure 3) processes one DWT coefficient per clock cycle and outputs an embedded bitstream. It comprises: a Gobang register bank (GRB) 1; a Compute MSB Pass (CMP) module 2, which generates the MSB state; a Find contribution (FC) module 3, which generates the coding pass and parameters; a Context formation (CF) module 4; a Reconfigurable First-in First-out (RFIFO) module 5; and an Arithmetic encoder (AE) 6. One FC module 3 and one CF module 4 are needed for every bit plane, one AE 6 is needed for every two bit planes, and only one each of the GRB 1, the RFIFO 5 and the CMP 2 is needed.

The Gobang register bank (GRB) 1 (Figure 4) is a two-dimensional shift register bank (2-D shift register bank) that lets the incoming DWT coefficients move in the manner the standard prescribes. On every clock, each register rotates down by one register (e.g. W1→W2, W2→W3, W3→W4, W4→W0, W0→W1); every four clocks, each register rotates toward the lower left by one register (e.g. W1→W7, W2→W8, W3→W9, W4→W5, W0→W6). The labels coCMP and coFC mark the nine (3x3) registers used by the CMP module 2 and the FC module 3.

The Compute MSB Pass (CMP) module 2 (Figure 5) mainly computes a value for the center point and, in addition, two parameters for each coefficient (the symbols are illegible in the source). Once these parameters are available, the modules that follow can process every bit plane independently.
As a result, a bit plane that is empty can be switched off to save power. Figure 6 shows the detailed circuit of each processing element (PE), in which uOR first ORs the corresponding bit planes of its two inputs pairwise and then ORs all of the results together.

The Find contribution (FC) module 3 (Figure 7) computes the coding pass to which the center point belongs and the parameters (P, H, V, D) that the Context formation (CF) module 4 uses to compute the context. The circuit of each processing element is shown in Figure 8; this part of the circuit is fairly simple and follows directly from the algorithm. Figure 9 shows the circuit that computes p; the circuit that computes f is exactly the same.

The Context formation (CF) module 4 (Figure 10) contains ZC, the zero coding module, and MR, the magnitude refinement module. Both are table lookups and conform to the standard. To handle the special run-length coding (RLC), the contexts of the four points in the same column must be buffered before the final context is decided. An alternative is to buffer the parameters (H, V, D) and the coding pass instead, but that takes more area; buffering the contexts saves 2/3 of the register area. Up to four contexts can be produced within a single clock. In addition, the sign context is produced by the circuit of Figure 11. Note that the circuit of Figure 10 is needed for every bit plane, whereas only one copy of the circuit of Figure 11 is needed, and its result is sent to every bit plane.

The Reconfigurable First-in First-out (RFIFO) module 5 uses first-in first-out (FIFO) buffers to smooth the compression process. Although the CF module 4 produces about one context per clock on average, the number it outputs in any given clock varies from 0 to 4, while the arithmetic encoder (AE) 6 that follows can handle only one context at a time. Figure 12 shows a simulation of FIFO length (number of registers) against the average total number of clock cycles needed per block. It shows that a FIFO length of about fifteen is the most efficient, but that requires too many registers. The properties of the embedded block encoder are therefore exploited in the RFIFO 5 architecture to reduce the hardware requirement: two long FIFOs of length fifteen plus eight short FIFOs of length four (the minimum requirement) reach more than 80% of the performance of an architecture in which all ten FIFOs have length fifteen. The two long FIFOs and eight short FIFOs are reconfigured into the most efficient arrangement each time a new block is to be coded. Theoretical analysis shows that it is most efficient to assign the long FIFOs to the third and fourth bit planes of the block, so when coding of a block begins, the RFIFO 5 is reconfigured according to the block's highest bit plane.

In this architecture, one arithmetic encoder (AE) 6 can handle 6 independent embedded bitstreams, mainly to reduce the hardware requirement while raising utilization. Because the architecture supports 11 bit planes (1 sign + 10 magnitude), there are at most 28 (3x9+1) embedded bitstreams, and the most direct approach would be to use 28 AEs.
However, two properties of embedded block coding allow the hardware requirement to be reduced. First, of the three coding passes of a given bit plane, only one occurs in any given clock (the exclusive property), so the number of AEs can be reduced to 10 (the invention has 10 magnitude bit planes). Second, the number of contexts that the AE 6 must process is greatest in the lowest bit plane and falls sharply from the low to the high bit planes, to the point where entire planes are empty. One AE 6 is therefore shared by two bit planes, which reduces the number again to 5. On the surface the area should drop to about 18% of the original, but in fact not all of each AE can be saved: the coding state registers of every independent bitstream must be kept separate, so they cannot be eliminated.

In summary, the invention achieves more than six times the processing speed of other architectures at the same area cost and with much lower power consumption, and it outperforms earlier architectures in every respect. The above describes only preferred embodiments of the invention and must not be used to limit the scope of its implementation; all simple equivalent changes and modifications made in accordance with the scope of the patent application and the content of the description remain within the scope covered by this patent.

Find contribution (FC) module 3; Context formation (CF) module 4; Reconfigurable First-in First-out (RFIFO) module 5; Arithmetic encoder (AE) 6
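The coding pass classification of step a can be modeled in a few lines of software. The Python sketch below is illustrative only and is not the patented circuit: the function names are hypothetical, a zero magnitude is given MSB position -1, and the pass 3 condition is read as applying to every bit plane at or above the MSB of the center coefficient, which is an assumption where the source formula is hard to read.

```python
from typing import Sequence


def msb_position(magnitude: int) -> int:
    """Return m with 2**m <= magnitude < 2**(m + 1); -1 for zero (assumption)."""
    return magnitude.bit_length() - 1


def contribution(k: int, m_s: int, coded_before_c: bool, pass_of_s: int = 0) -> int:
    """Contribution of neighbour s to bit plane k of the centre coefficient c."""
    if k < m_s:
        return 1  # the MSB of s lies above bit plane k
    if coded_before_c and k == m_s and pass_of_s == 1:
        return 1  # s is coded earlier than c and was handled in pass 1 at its MSB
    return 0


def coding_pass(k: int, m_c: int, contributions: Sequence[int]) -> int:
    """Coding pass (1, 2 or 3) of bit plane k of c."""
    if k < m_c:
        return 2  # bit plane below the MSB of c
    if sum(contributions) == 0:
        return 3  # assumption: any k >= m_c with no contributing neighbour
    return 1


if __name__ == "__main__":
    m_c = msb_position(5)  # centre magnitude 5 = 0b101
    # Hypothetical neighbourhood: all neighbours coded later than c.
    later_msbs = [msb_position(v) for v in (12, 0, 0)]
    for k in (3, 2, 1, 0):
        rho = [contribution(k, m, coded_before_c=False) for m in later_msbs]
        print(k, coding_pass(k, m_c, rho))
```

Because each contribution depends only on MSB positions and, for earlier neighbours, on a pass decision, every bit plane `k` can be evaluated independently, which is the property the parallel architecture relies on.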

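The hardware counts quoted for the arithmetic encoders and the FIFOs can be checked with simple arithmetic. The Python sketch below only reproduces numbers given in the description (11 bit planes, 3x9+1 bitstreams, ten FIFOs of length fifteen versus two long and eight short ones); the variable names are illustrative.

```python
MAGNITUDE_PLANES = 10  # 11 bit planes = 1 sign + 10 magnitude
PASSES_PER_PLANE = 3

# "3 x 9 + 1" embedded bitstreams, as counted in the description.
bitstreams = PASSES_PER_PLANE * (MAGNITUDE_PLANES - 1) + 1

# Exclusive property: one AE per magnitude bit plane instead of one per bitstream.
aes_after_exclusive = MAGNITUDE_PLANES

# One AE shared between two bit planes.
aes_after_sharing = aes_after_exclusive // 2

# Apparent area ratio if a whole AE could be saved each time.
apparent_area_ratio = aes_after_sharing / bitstreams

# FIFO registers: ten FIFOs of length 15 versus 2 long (15) + 8 short (4).
uniform_fifo_registers = 10 * 15
reconfigurable_fifo_registers = 2 * 15 + 8 * 4

print(bitstreams, aes_after_exclusive, aes_after_sharing)  # 28 10 5
print(f"{apparent_area_ratio:.1%}")                        # 17.9%
print(uniform_fifo_registers, reconfigurable_fifo_registers)  # 150 62
```

The 17.9% figure matches the roughly 18% quoted in the description; as the text notes, the real saving is smaller because the coding state registers of each bitstream cannot be shared.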
Claims

[...] coefficient, with the output being an embedded bitstream, the architecture comprising: a Gobang register bank (GRB); a Compute MSB Pass (CMP) module; a Find contribution (FC) module; a Context formation (CF) module; a Reconfigurable First-in First-out (RFIFO) module; and an Arithmetic encoder (AE) module; wherein one FC module and one CF module are required for every bit plane, one AE is required for every two bit planes, and only one each of the GRB, the RFIFO and the CMP is required.

10. The parallel embedded block encoder as described in item 9 of the scope of patent application, wherein the Gobang register bank (GRB) is a two-dimensional shift register bank (2-D shift register bank) that lets the input DWT coefficients move in the manner prescribed by the standard; on every clock each register rotates down by one register (e.g. W1→W2, W2→W3, W3→W4, W4→W0, W0→W1), and every four clocks each register rotates toward the lower left by one register (e.g. W1→W7, W2→W8, W3→W9, W4→W5, W0→W6).

11. The parallel embedded block encoder as described in item [...] of the scope of patent application, wherein the Compute MSB Pass (CMP) module computes a value for the center point and also two parameters for each coefficient (the symbols are illegible in the source).

12. The parallel embedded block encoder as described in item 9 of the scope of patent application, wherein the Find contribution (FC) module computes the coding pass of the center point and the parameters (P, H, V, D) that the Context formation (CF) module uses to compute the context.

13. The parallel embedded block encoder as described in item 9 of the scope of patent application, wherein the Context formation (CF) module is used to handle the special run-length coding (RLC).

14. The parallel embedded block encoder as described in item 9 of the scope of patent application, wherein the Reconfigurable First-in First-out (RFIFO) module is used to make the compression process smoother.

The parallel embedded block encoder as described in item [...] of the scope of patent application, wherein the Arithmetic encoder (AE) is mainly used to reduce the hardware requirement and improve utilization.
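The two register movements of the Gobang register bank (per clock: W1→W2, W2→W3, W3→W4, W4→W0, W0→W1; every four clocks: W1→W7, W2→W8, W3→W9, W4→W5, W0→W6) can be expressed as index maps. The Python sketch below is a hypothetical model: it assumes the registers are numbered column by column with five registers per column (W0 to W4, then W5 to W9), which is inferred from those examples only. It does not model how many columns exist or how the two movements interleave in hardware.

```python
COLUMN_SIZE = 5  # assumption: W0..W4 form one column, W5..W9 the next


def rotate_down(index: int) -> int:
    """Per-clock move inside a column: W1->W2, W2->W3, W3->W4, W4->W0, W0->W1."""
    column, row = divmod(index, COLUMN_SIZE)
    return column * COLUMN_SIZE + (row + 1) % COLUMN_SIZE


def shift_lower_left(index: int) -> int:
    """Every-fourth-clock move: W1->W7, W2->W8, W3->W9, W4->W5, W0->W6."""
    column, row = divmod(index, COLUMN_SIZE)
    return (column + 1) * COLUMN_SIZE + (row + 1) % COLUMN_SIZE


if __name__ == "__main__":
    for i in range(COLUMN_SIZE):
        print(f"W{i}: down -> W{rotate_down(i)}, lower left -> W{shift_lower_left(i)}")
```

In this numbering the every-fourth-clock move is the per-clock rotation combined with a step to the next column, which is one way to read "rotates toward the lower left".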