TWI444047B

TWI444047B - Deblocking filter for video decoding, video decoder and graphics processing unit

Info

Publication number: TWI444047B
Application number: TW096120098A
Authority: TW
Inventors: Hussain Zahid
Original assignee: Via Tech Inc
Priority date: 2006-06-16
Filing date: 2007-06-05
Publication date: 2014-07-01
Also published as: TWI482117B; TWI395488B; CN101083764B; CN101068365A; CN101068365B; CN101083764A; TW200803525A; CN101068364B; TWI348654B; TW200821986A; CN101083763B; TW200803528A; TW200816082A; CN101068364A; CN101083763A; TWI383683B; TW200816820A; CN101072351A; TWI350109B; CN101068353B

Description

Deblocking filter for video decoding, video decoder and graphics processing unit

The present invention relates to image compression and decompression, and more particularly to a graphics processing unit having image compression and decompression features.

Personal computers and consumer electronics are used for a variety of entertainment purposes. These can be broadly divided into two categories: those that use computer-generated graphics, such as computer games; and those that use compressed video streams, such as programs pre-recorded onto digital video discs (DVDs), or digital programming delivered to a set-top box by a cable or satellite provider. The second category also includes the encoding of analog video streams, as performed, for example, by a digital video recorder (DVR).

Computer-generated graphics are typically produced by a graphics processing unit (GPU). A GPU is a specialized microprocessor found in computer game consoles and some personal computers. A GPU is optimized to quickly render three-dimensional primitive objects such as triangles and quadrilaterals. These primitives are described by a plurality of vertices, where each vertex has attributes (for example, color), and textures can be applied to the primitives. The result of rendering is a two-dimensional array of pixels, shown on a computer display or monitor.

Encoding and decoding video streams involves several kinds of computation, for example the discrete cosine transform, motion estimation, motion compensation, and deblocking filtering. These computations are typically handled by a general-purpose central processing unit (CPU) in combination with specialized hardware logic such as an application-specific integrated circuit (ASIC). Consumers therefore need multiple computing platforms to meet their entertainment needs. A single computing platform that handles both computer-generated graphics and video encoding and/or decoding is therefore needed.

Embodiments disclosed herein provide systems and methods for video compression deblocking. An exemplary deblocking filter for video decoding comprises: logic configured to determine whether the pixels of a predetermined pixel group among a plurality of pixel groups meet a criterion; logic configured to filter the pixels of the predetermined pixel group first when the criterion is met; and logic configured to, when the criterion is met, sequentially filter the remaining pixel groups of the plurality of pixel groups, each according to a corresponding set of taps among a plurality of sets of taps.

An exemplary video decoder comprises an entropy decoder, a spatial decoder, combining logic, and an in-loop deblocking filter. The entropy decoder receives an incoming encoded bitstream. The spatial decoder receives the output of the entropy decoder and produces a coded picture comprising a plurality of pixels. The combining logic combines a current picture with a predicted picture to produce a combined picture. The in-loop deblocking filter receives the combined picture. The in-loop deblocking filter comprises: logic configured to filter a predetermined pixel group; and logic configured to, when the predetermined pixel group meets a criterion, filter each of the remaining pixel groups of the plurality of pixel groups according to a corresponding set of taps among a plurality of sets of taps.

An exemplary graphics processing unit comprises a host processor interface and a video acceleration unit. The host processor interface receives at least one video acceleration instruction. The video acceleration unit responds to the at least one video acceleration instruction. The video acceleration unit comprises an in-loop deblocking filter. The in-loop deblocking filter comprises: logic configured to determine whether the pixels of a predetermined pixel group among a plurality of pixel groups meet a first criterion; logic configured to filter the pixels of the predetermined pixel group first when the first criterion is met; and logic configured to, when the first criterion is met, sequentially filter the remaining pixel groups of the plurality of pixel groups, each according to a corresponding set of taps among a plurality of sets of taps.

Computing platform for video encoding/decoding

Figure 1 is a block diagram of an exemplary computing platform for graphics and for video encoding and/or decoding. System 100 includes a general-purpose CPU 110 (hereinafter the host processor), a graphics processing unit (GPU) 120, memory 130, and a bus 140. GPU 120 includes a video acceleration unit (VPU) 150 that accelerates video encoding and/or decoding, as described later. The video acceleration functions of GPU 120 are available as instructions that execute on GPU 120.

Software decoder 160 and video acceleration driver 170 reside in memory 130, and at least a portion of decoder 160 and video acceleration driver 170 executes on host processor 110. Through a host processor interface 180 provided by video acceleration driver 170, decoder 160 can also issue video acceleration instructions to GPU 120. In this way, system 100 performs video encoding and/or decoding through host processor software that issues video acceleration instructions to GPU 120, and GPU 120 responds to these instructions by accelerating portions of decoder 160.

In some embodiments, only a small portion of decoder 160 executes on the host processor, while most of decoder 160 is executed by GPU 120 with minimal driver overhead. In this way, frequently executed, computationally intensive blocks are offloaded to GPU 120, while more complex operations are performed by host processor 110. In some embodiments, one computationally intensive function implemented by GPU 120 is in-loop deblocking filter hardware acceleration logic 400, also referred to as in-loop deblocking filter 400 or deblocking filter 400, which is described later in connection with Figure 4. Another example of a computationally intensive function is determining the boundary strength (BS) for each filter.

The architecture described above thus allows flexibility between two modes of operation: executing particular functions of decoder 160 (for example, deblocking or computing boundary strength) on host processor 110 by running a shader program on a macroblock; or executing most of decoder 160 on GPU 120 to take advantage of pipelining and parallelism. In some embodiments in which decoder 160 executes on GPU 120, the deblocking process is a thread that is synchronized with the other aspects of decoder 160.

Several conventional components that are well known to those skilled in the art, and that are not necessary for explaining the video acceleration features of GPU 120, are omitted from Figure 1.

Video decoder

Figure 2 is a block diagram of video decoder 160 of Figure 1. In the particular embodiment shown in Figure 2, decoder 160 implements the ITU H.264 video compression specification. However, those skilled in the art will understand that decoder 160 of Figure 2 is a basic representation of a video decoder, one that also illustrates the operation of other types of decoders similar to H.264, such as those for the SMPTE VC-1 and MPEG-2 specifications. In addition, although decoder 160 is shown as part of GPU 120, those skilled in the art will understand that portions of decoder 160 disclosed herein can also be implemented outside a GPU, for example as standalone logic or as part of an application-specific integrated circuit (ASIC).

The incoming bitstream 205 is first processed by an entropy decoder 210. Entropy coding takes advantage of statistical redundancy: some patterns occur more often than others, so the more common patterns are represented with shorter codes. Entropy coding includes Huffman coding and run-length encoding. After entropy decoding, the data is processed by a spatial decoder 215, which takes advantage of the fact that neighboring pixels in a picture are often identical or related, so that only the differences need to be encoded. In this exemplary embodiment, spatial decoder 215 includes an inverse quantizer 220 and an inverse discrete cosine transform (IDCT) function 230. The output of IDCT function 230 can be regarded as a picture (235) composed of pixels.

Picture 235 is processed in smaller sub-blocks called macroblocks. The H.264 video compression specification uses a macroblock size of 16x16 pixels, while other compression specifications may use other sizes. Macroblocks within picture 235 are combined with information from previously decoded pictures, a process called inter prediction, or with information from other macroblocks of picture 235, a process called intra prediction. The incoming bitstream 205 is decoded by entropy decoder 210, and inter or intra prediction is applied according to the type of each picture.

When inter prediction is applied, entropy decoder 210 produces a motion vector 245 as output. Motion vector 245 is used for temporal coding, which takes advantage of the fact that many pixels usually keep the same value across a series of pictures. The changes from one picture to another are encoded as motion vector 245. Motion compensation block 250 combines one or more previously decoded pictures 255 with motion vector 245 to produce a predicted picture (265). When intra prediction is applied, spatial compensation block 270 combines information from neighboring macroblocks with the macroblocks within picture 235 to produce a predicted picture (275).

Combiner 280 adds picture 235 to the output of mode selector 285. Mode selector 285 uses the entropy-decoded bitstream to determine whether combiner 280 uses the predicted picture (265) produced by motion compensation block 250 or the predicted picture (275) produced by spatial compensation block 270.
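
The combiner and mode selector described above can be sketched in a few lines. This is a minimal illustration only: the function name `combine`, the `use_inter` flag (standing in for the mode selector's decision), and the list-of-rows picture layout are assumptions made for this example and do not come from the patent.

```python
def combine(picture, inter_pred, intra_pred, use_inter):
    """Add a decoded picture to the selected predicted picture.

    use_inter is a hypothetical flag standing in for the decision that
    the mode selector derives from the entropy-decoded bitstream.
    Pictures are modeled as lists of rows of integer pixel values.
    """
    predicted = inter_pred if use_inter else intra_pred
    return [
        [p + q for p, q in zip(pic_row, pred_row)]
        for pic_row, pred_row in zip(picture, predicted)
    ]


picture = [[1, -2], [0, 3]]
inter_pred = [[100, 100], [100, 100]]   # from motion compensation
intra_pred = [[50, 60], [70, 80]]       # from spatial compensation
print(combine(picture, inter_pred, intra_pred, use_inter=True))
print(combine(picture, inter_pred, intra_pred, use_inter=False))
```

A real decoder would also clamp the sums to the valid pixel range; that step is left out here because the text does not describe it.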

The encoding process introduces artifacts such as discontinuities along macroblock edges and along the edges of sub-blocks within a macroblock. As a result, "edges" appear in the decoded frame where none existed originally. Deblocking filter 290 is applied to the combined picture output by combiner 280 to remove these edge artifacts. The decoded picture 295 produced by the deblocking filter is stored for use in decoding subsequent pictures.

As discussed in connection with Figure 1, portions of decoder 160 execute on host processor 110, and decoder 160 also takes advantage of the video acceleration instructions provided by GPU 120. In particular, in some embodiments deblocking filter 290 uses one or more instructions provided by GPU 120 to implement the filter at relatively low computational cost.

Deblocking filter

Deblocking filter 290 is a multi-tap filter that adjusts pixel values at sub-block edges based on neighboring pixel values. Different embodiments of deblocking filter 290 can be used depending on the compression specification implemented by decoder 160. Each specification uses different filter parameters, for example the sub-block size, the number of pixels updated by the filtering operation, and how often the filter is applied (for example, every N rows or every M columns). In addition, each specification uses a different configuration of filter taps. Multi-tap filters are understood by those skilled in the art, so specific tap configurations are not discussed here. An embodiment of the deblocking filter specified by VC-1 is described in connection with Figure 4. First, the sub-block pixel arrangement used by the VC-1 filter is described in connection with Figure 3.

Figure 3 shows two adjacent 4x4 sub-blocks (310, 320), defined by rows R1-R4 and columns C1-C8. The vertical boundary 330 between the two sub-blocks runs along columns C4 and C5. The VC-1 filter operates on each 4x4 sub-block. For the leftmost sub-block, the VC-1 filter examines a predetermined group of pixels (P1, P2, P3) in a predetermined row (R3). If the predetermined group of pixels meets a particular criterion, another pixel in the same predetermined row, P4, is updated. The criterion is determined by a particular set of calculations and comparisons on the pixels in the predetermined group. Those skilled in the art will understand that these calculations and comparisons can also be viewed as a set of taps; the detailed calculations and comparisons are discussed later in connection with Figure 5. The updated value is also based on calculations performed on the pixels in the predetermined group. The VC-1 filter treats the rightmost sub-block in a similar manner, determining whether pixels P6, P7, and P8 meet a criterion and, if so, updating P5. In other words, for the edge pixels P4 and P5 of the predetermined row (R3), the VC-1 filter computes values from the other predetermined pixels in the same row: the value of P4 from P1, P2, and P3, and the value of P5 from P6, P7, and P8.

The VC-1 filter conditionally updates the same predetermined pixels in the remaining rows, depending on the values computed for the predetermined pixels (edge pixels P4, P5) of the predetermined row (R3). Thus, P4 in R1 is updated, based on P1, P2, and P3 in R1, only if P4 and P5 in R3 were updated. Likewise, P5 in R1 is updated, based on P6, P7, and P8 in R1, only if P4 and P5 in R3 were updated. Rows R2 and R4 are handled in a similar manner.

Viewed another way, some pixels in the predetermined third row are filtered, or updated, when other pixels in the third row meet a criterion. The filter involves performing comparisons and calculations on those other pixels. If the other pixels in the third row meet the criterion, the corresponding pixels in the remaining rows are filtered in a similar manner, as described above. Some embodiments of the deblocking filter 290 disclosed herein use an inventive technique in which the third row is filtered first, followed by the other rows. This technique is described in more detail in connection with Figures 4, 5, and 6A-6D.

Although Figure 3 illustrates row-by-row processing of a vertical edge, those skilled in the art will understand that the same figure rotated by 90 degrees illustrates column-by-column processing of a horizontal edge. Those skilled in the art will also understand that, although VC-1 uses the third of four rows as the predetermined row that determines the conditional update of the other rows, the principles disclosed herein also apply to embodiments that use a different predetermined row (for example, the first or second row) and to embodiments in which a sub-block has a different number of rows. Similarly, although VC-1 examines the values of a neighboring group of pixels to set the value of the pixel being updated, the principles disclosed herein also apply to embodiments in which other pixels are examined and other pixels are set. As one example, P2 and P3 may be examined to determine the updated value of P4. As another example, P3 may be set based on the values of P2 and P4.

Video acceleration unit 150 in GPU 120 implements hardware acceleration logic for an in-loop deblocking filter (IDF), for example the in-loop deblocking filter specified by VC-1. A GPU instruction exposes this hardware acceleration logic, as described later. The conventional approach to implementing the VC-1 in-loop deblocking filter processes the rows/columns in parallel, since the same pixel calculations are performed on each row/column of a sub-block. The conventional approach filters two adjacent 4x4 sub-blocks per cycle, but at the cost of an increased gate count. In contrast, the inventive approach used by VC-1 in-loop deblocking filter hardware acceleration logic 400 processes the third row/column of pixels first and then, if those pixels meet the required criterion, processes the remaining three rows/columns sequentially. This approach uses fewer gates than the conventional approach, which replicates the functionality for each row/column. VC-1 in-loop deblocking filter acceleration logic 400 filters two adjacent 4x4 sub-blocks by processing the rows/columns sequentially. This longer filtering time is consistent with the instruction cycle of GPU 120, whereas the conventional approach filters faster than is actually required, which wastes gates.

Figure 4 is a listing of hardware description pseudocode for VC-1 in-loop deblocking filter hardware acceleration logic 400. Although pseudocode is used rather than an actual hardware description language (HDL) such as Verilog or VHDL, those skilled in the art will be familiar with such pseudocode. They will understand that, when expressed in an actual HDL, the code can be compiled and then synthesized into an arrangement of logic gates that forms part of video acceleration unit 150. They will also understand that these gates can be implemented in various technologies, for example an application-specific integrated circuit (ASIC), a programmable gate array (PGA), or a field-programmable gate array (FPGA).

Section 410 of the code is the module definition. VC-1 in-loop deblocking filter hardware acceleration logic 400 has several input parameters. The sub-blocks to be filtered are specified by the Block parameter. If the Vertical parameter is true, acceleration logic 400 treats the Block parameter as a 4x8 block (see Figure 3) and performs vertical edge filtering. If the Vertical parameter is false, acceleration logic 400 treats the Block parameter as an 8x4 block (see Figure 3) and performs horizontal edge filtering.

Section 420 of the code begins an iteration loop and sets the value of the loop parameter variable. On the first pass through the loop, the loop parameter is set to 3, so the third line is processed first. Subsequent iterations set the loop parameter to 1, 2, and then 4. Using this parameter, VC-1 in-loop deblocking filter hardware acceleration logic 400 iterates four times, processing a line of 8 pixels each time, where a line may be a horizontal row or a vertical column. Each line is processed by line acceleration logic 500 (see Figure 5). In some embodiments, line acceleration logic 500 is implemented as an HDL sub-module, which is described in connection with Figure 5.

Section 430 tests the Vertical parameter to determine whether vertical or horizontal edge filtering is performed. Depending on the result, the 8 elements of the line array variable are initialized from a row of the 4x8 input block or from a column of the 8x4 input block.
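
The line initialization in section 430 can be pictured with a short sketch. It assumes the block is stored as a list of rows and that the loop parameter is a 1-based index; the function name `extract_line` and that storage layout are choices made for this illustration, not details taken from Figure 4.

```python
def extract_line(block, vertical, index):
    """Return the 8 pixels P1..P8 that cross the edge for a 1-based line index."""
    if vertical:
        # 4 rows x 8 columns: the vertical edge sits between C4 and C5,
        # so each row of the block is one line.
        return list(block[index - 1])
    # 8 rows x 4 columns: the horizontal edge sits between the 4th and
    # 5th rows, so each column of the block is one line.
    return [row[index - 1] for row in block]


# A 4x8 block whose values encode (row, column), e.g. 31 is R3/C1.
block_4x8 = [[r * 10 + c for c in range(1, 9)] for r in range(1, 5)]
# The same data rotated into an 8x4 block.
block_8x4 = [list(col) for col in zip(*block_4x8)]

print(extract_line(block_4x8, True, 3))
print(extract_line(block_8x4, False, 3))
```

Both calls yield the same eight pixels, which is why a single line filter can serve vertical and horizontal edges.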

Section 440 determines whether the third line is being processed by comparing the loop parameter with 3. If the loop parameter is 3, two control variables, ProcessingPixel3 and FILTER_OTHER_3, are set to true. If the loop parameter is not 3, ProcessingPixel3 is set to false.

Section 450 instantiates another HDL module, VC1_IDC_Filter_Line, which applies the filter to the current line. (As described in connection with Figure 3, the line filter updates the edge pixel values based on neighboring pixel values.) The parameters supplied to this sub-module include the control variables ProcessingPixel3 and FILTER_OTHER_3 and the loop parameter. In one embodiment, VC-1 in-loop deblocking filter hardware acceleration logic 400 has an additional input parameter, a quantization value, and this parameter is also supplied to the sub-module.

After the sub-module processes the line, VC-1 in-loop deblocking filter hardware acceleration logic 400 continues the iteration loop at section 420 with an updated loop parameter value. In this way, the filter is applied to the third line of the input block, then to the first, second, and fourth lines.

Figure 5 is a listing of hardware description pseudocode for line acceleration logic 500, which implements the sub-module described above. Section 510 of the code is the module definition. Line acceleration logic 500 has several input parameters. The line to be filtered is specified by the Line input parameter. ProcessingPixel3 is an input parameter that higher-level logic sets to true if the line is the third row or third column. The FILTER_OTHER_3 parameter is initially set to true by higher-level logic and is adjusted by line acceleration logic 500 according to pixel values.

Section 520 performs the various pixel value calculations specified by VC-1. (Because these calculations can be understood by reference to the VC-1 specification, they are not described in detail.) Section 530 tests the ProcessingPixel3 parameter supplied by the higher-level VC-1 in-loop deblocking filter hardware acceleration logic 400. If ProcessingPixel3 is true, section 530 initializes a control variable, DO_FILTER, to a default value of true. Various results of the intermediate calculations in section 520 are then used to determine whether the other three lines are also to be processed. If the pixel calculation results indicate that the other three lines are not to be processed, DO_FILTER is set to false.

If ProcessingPixel3 is false, section 540 uses the input parameter FILTER_OTHER_3 (set by the higher-level VC-1 in-loop deblocking filter hardware acceleration logic 400) to set the value of DO_FILTER. Section 540 then tests DO_FILTER and, if it is true, updates the edge pixels P4 and P5 of the line variable (see Figure 3).

Section 550 tests the ProcessingPixel3 parameter and updates FILTER_OTHER_3 as appropriate. The FILTER_OTHER_3 variable is used to convey state information between different instantiations of this module. If ProcessingPixel3 is true, section 550 updates the FILTER_OTHER_3 parameter with the value of DO_FILTER. This technique allows the higher-level module that instantiates this module (namely VC1_InloopFilter) to pass the FILTER_OTHER_3 value updated by one instantiation of the lower-level VC_1_INLOOPFILTER_LINE module to another instantiation of VC_1_INLOOPFILTER_LINE.
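
The interplay of ProcessingPixel3, DO_FILTER, and FILTER_OTHER_3 is easier to follow as a small software model. The sketch below mirrors only the control flow described for sections 420 to 450 and 530 to 550. The `criterion` and `update` callables, along with the toy versions used in the example, are hypothetical stand-ins for the VC-1 pixel calculations of section 520, which the text does not spell out.

```python
LINE_ORDER = (3, 1, 2, 4)  # third line first, as in section 420


def filter_line(line, processing_pixel3, filter_other_3, criterion, update):
    """Model of the line sub-module; returns (line, FILTER_OTHER_3)."""
    if processing_pixel3:
        do_filter = criterion(line)      # section 530: decided by pixel values
    else:
        do_filter = filter_other_3       # section 540: reuse the line-3 decision
    if do_filter:
        line = update(line)              # section 540: update edge pixels P4, P5
    if processing_pixel3:
        filter_other_3 = do_filter       # section 550: publish the decision
    return line, filter_other_3


def filter_block(lines, criterion, update):
    """Model of the top-level loop; lines maps a 1-based index to 8 pixels."""
    out = dict(lines)
    filter_other_3 = True
    for index in LINE_ORDER:
        processing_pixel3 = index == 3   # section 440
        if processing_pixel3:
            filter_other_3 = True
        out[index], filter_other_3 = filter_line(
            out[index], processing_pixel3, filter_other_3, criterion, update
        )
    return out


# Toy stand-ins (NOT the VC-1 rules): filter only small steps across the edge.
def toy_criterion(line):
    return abs(line[3] - line[4]) <= 4


def toy_update(line):
    avg = (line[3] + line[4]) // 2
    return line[:3] + [avg, avg] + line[5:]


soft = [10, 10, 10, 10, 12, 12, 12, 12]
hard = [0, 0, 0, 0, 100, 100, 100, 100]
gated_on = filter_block({1: hard, 2: hard, 3: soft, 4: hard}, toy_criterion, toy_update)
gated_off = filter_block({1: soft, 2: soft, 3: hard, 4: soft}, toy_criterion, toy_update)
print(gated_on[1], gated_off[1])
```

In this model the decision made on line 3 alone determines whether lines 1, 2, and 4 are updated, which matches the description of FILTER_OTHER_3 being handed from one instantiation of the line module to the next.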

Those skilled in the art will understand that the pseudocode of Figure 5 can be synthesized in various ways to produce an arrangement of logic gates implementing line acceleration logic 500. One such arrangement is shown in Figures 6A-6D, which together form a block diagram of line acceleration logic 500. Those skilled in the art should be familiar with the VC-1 in-loop deblocking filter algorithm and with logic structures in general. The elements of Figures 6A-6D are therefore not described in detail; instead, selected features of line acceleration logic 500 are described.

Those skilled in the art will appreciate that the operations involved in the VC-1 in-loop deblocking filter include the following, where P1-P8 denote pixel positions in the row/column being processed.

A0 = (2*(P3-P6) - 5*(P4-P5) + 4) >> 3

A1 = (2*(P1-P4) - 5*(P2-P3) + 4) >> 3

A2 = (2*(P5-P8) - 5*(P6-P7) + 4) >> 3

clip = (P4-P5)/2

Each of the first three operations involves three subtractions, two multiplications, one addition, and one right shift. The portion of line acceleration logic 500 in FIG. 6A computes A0, A1, and A2 sequentially using shared logic, rather than using a dedicated logic block for each of A0, A1, and A2. Avoiding duplicated logic blocks, and using multiplexers to feed the inputs through in sequence, reduces gate count and/or power consumption.
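The four expressions above can be written out directly. The following is a minimal Python sketch, not the patent's implementation: the function names `gradient` and `edge_terms` are hypothetical, and the division by 2 for clip is modeled as an arithmetic right shift, on the assumption that it matches the right-shift block described for CLIP.

```python
def gradient(a, b, c, d):
    """Common form shared by A0, A1 and A2: (2*(a-b) - 5*(c-d) + 4) >> 3."""
    return (2 * (a - b) - 5 * (c - d) + 4) >> 3


def edge_terms(pixels):
    """pixels holds P1..P8 in order (pixels[0] is P1, pixels[7] is P8)."""
    p1, p2, p3, p4, p5, p6, p7, p8 = pixels
    a0 = gradient(p3, p6, p4, p5)
    a1 = gradient(p1, p4, p2, p3)
    a2 = gradient(p5, p8, p6, p7)
    clip = (p4 - p5) >> 1  # assumption: "/2" is an arithmetic right shift
    return a0, a1, a2, clip


# Illustrative pixel values only; they are not taken from the source.
print(edge_terms([10, 12, 14, 20, 40, 42, 44, 46]))
```

Because all three A terms share one form, a single helper suffices; that is the same observation the hardware exploits by reusing one datapath.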

Multiplexers 605, 610, and 620 select different inputs from pixel registers P1-P8 in different timing cycles, and these inputs are supplied to the shared logic blocks. Logic blocks 625 and 630 each perform a subtraction. Logic block 635 implements the multiplication by 2 by shifting left one bit. The other multiplication (by 5, per the equations above) is implemented as a left shift followed by adder 645. Adder 650 sums the output of left shifter 635, the constant 4, and the negated output of 645. Finally, logic block 655 shifts right by three bits.

In the first timing cycle, an input T=1 is supplied to each of multiplexers 605, 610, and 615, and the value of A1 is computed and stored in register 660. In the second timing cycle, an input T=2 is supplied to each of multiplexers 605, 610, and 615, and the value of A2 is computed and stored in register 665. In the third timing cycle, an input T=3 is supplied to each of multiplexers 605, 610, and 615, and the value of A0 is computed and stored in register 670. The values A1, A2, and A0 stored in registers 660, 665, and 670 are used by the portion of line acceleration logic 500 in FIG. 6B, described below. The outputs of the P4 register (671) and the P5 register (673) are used by the portion of line acceleration logic 500 in FIG. 6C, also described below.
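The time-multiplexed use of one datapath can be modeled in software. The sketch below is an illustration under stated assumptions, not a description of the actual circuit: the table `MUX_SELECT`, the function names, and the pairing of operands with particular subtractors are hypothetical, and the multiply by 5 is modeled as a two-bit left shift plus an add.

```python
# Pixel positions (1-based, P1..P8) routed to the datapath in each cycle T.
# Order within each tuple: (a, b, c, d) for (2*(a-b) - 5*(c-d) + 4) >> 3.
MUX_SELECT = {
    1: (1, 4, 2, 3),  # T=1 computes A1
    2: (5, 8, 6, 7),  # T=2 computes A2
    3: (3, 6, 4, 5),  # T=3 computes A0
}
RESULT_NAME = {1: "A1", 2: "A2", 3: "A0"}


def shared_datapath(a, b, c, d):
    """One pass through the shared subtract / shift / add chain."""
    outer = a - b                 # first subtractor
    inner = c - d                 # second subtractor
    times2 = outer << 1           # multiply by 2 as a one-bit left shift
    times5 = (inner << 2) + inner # assumption: multiply by 5 as shift-and-add
    total = times2 + 4 - times5   # add constant 4 and the negated 5x term
    return total >> 3             # final right shift by three bits


def run_three_cycles(pixels):
    """pixels holds P1..P8; returns the three result registers by name."""
    registers = {}
    for t in (1, 2, 3):
        a, b, c, d = (pixels[i - 1] for i in MUX_SELECT[t])
        registers[RESULT_NAME[t]] = shared_datapath(a, b, c, d)
    return registers


print(run_three_cycles([10, 12, 14, 20, 40, 42, 44, 46]))
```

The design trade-off is visible in the loop: three cycles of latency are spent to avoid instantiating the arithmetic chain three times.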

Those skilled in the art will also appreciate that the VC-1 in-loop deblocking filter involves the following additional operations:

The portion of line acceleration logic 500 in FIG. 6B receives inputs from the portion in FIG. 6A and computes D (675). Referring again to FIG. 6A, CLIP (677) is produced as follows: logic block 679 subtracts pixels P4 and P5, and logic block 680 shifts the result right (an integer divide by 2) to produce CLIP 677. Returning to FIG. 6B, A1 is available from register 660 in the first cycle, A2 from register 665 in the second cycle, and A0 from register 670 in the third cycle. In the fourth cycle, the portion of line acceleration logic 500 in FIG. 6B therefore computes D (675) according to the equations above.

Line acceleration logic 500 uses D (675) to update the pixel positions P4 and P5. Specifically, P4 = P4 - D and P5 = P5 + D. Although FIGs. 6A and 6B were described above in terms of a single row/column (i.e., a single set of pixel positions P1-P8), the computation for the third row/column of a sub-block affects the behavior of the other three rows/columns of that sub-block. Line acceleration logic 500 implements this behavior with an inventive approach. The individual filter operations are performed up front, in parallel, as described in connection with FIGs. 6A and 6B, and the portion of line acceleration logic 500 shown in FIGs. 6C and 6D conditionally selects which positions are updated. In other words, VC-1 in-loop deblocking filter hardware acceleration logic 400 decides whether the original value or the new value is written back. By contrast, a conventional VC-1 in-loop deblocking filter uses a loop, so the individual filter operations are themselves performed conditionally.

As explained earlier, FIG. 4 shows that the pseudocode for line acceleration logic 500 operates within a loop: an instantiation section 450 appears inside an iteration section 420. In addition, the instantiation of line acceleration logic 500 uses two parameters, ProcessingPixel3 and FILTER_OTHER_3. Line acceleration logic 500 uses these parameters to perform the conditional update of pixels P4 and P5 as follows. Referring to FIG. 6C, register P4 is written with the result of subtractor 681, where one input of subtractor 681 is P4 (671) and the other is either 0 or D (675), depending on the value of DO_FILTER (683). Likewise, register P5 is written with the result of adder 685, where one input of adder 685 is P5 (673) and the other is either 0 or D (675), depending on the value of DO_FILTER (683). The updated value of P4 is thus either the original P4 value (if DO_FILTER is false) or P4-D. Likewise, the updated value of P5 is either the original P5 value (if DO_FILTER is false) or P5+D.
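The select-then-write behavior described above can be sketched in a few lines of Python. The function name `update_edge_pixels` is hypothetical, and D is taken as an input because its derivation is not reproduced here.

```python
def update_edge_pixels(p4, p5, d, do_filter):
    """Always write back P4 and P5; DO_FILTER only selects the operand.

    The second operand of the subtractor and of the adder is D when
    do_filter is true and 0 otherwise, so an unfiltered line simply
    rewrites its original values.
    """
    operand = d if do_filter else 0
    return p4 - operand, p5 + operand


print(update_edge_pixels(20, 40, -3, True))   # filtered values
print(update_edge_pixels(20, 40, -3, False))  # original values rewritten
```

Selecting between 0 and D keeps the write unconditional, which is what lets the filter arithmetic run ahead of the decision.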

Those skilled in the art will appreciate that, when the third row of a sub-block is processed, the criterion for updating P4 with P4-D is: (ABS(A0)<PQUANT) OR (A3<ABS(A0)) OR (CLIP!=0)

DO_FILTER 683 is computed by the portion of line acceleration logic 500 in FIG. 6D, which tests these conditions. Multiplexer 687 provides one input to OR gate 697, selecting a true output if ABS(A0)<PQUANT and a false output otherwise. Multiplexer 689 provides another input to OR gate 697, selecting a true output if A3<ABS(A0) and a false output otherwise. Multiplexer 691 provides a further input to OR gate 697, selecting a true output if CLIP!=0 and a false output otherwise.

DO_FILTER 683 is supplied by multiplexer 693, which uses the control input Processing_Pixel_3 (695) to select either the output of OR gate 697 or the input signal FILTER_OTHER_3 (699). The inputs Processing_Pixel_3 (695) and FILTER_OTHER_3 (699) were described earlier in connection with FIG. 4 and the pseudocode of the higher-level VC-1 in-loop deblocking filter hardware acceleration logic 400, which instantiates line acceleration logic 500. Returning to FIG. 4, Processing_Pixel_3 (695) is set to true when the third row/column is processed (the first loop iteration) and to false otherwise. An intermediate variable DO_FILTER records, based on the conditions on PQUANT, ABS(A0), and CLIP, whether or not P4 and P5 are to be updated. Finally, FILTER_OTHER_3 (699) is set from that intermediate variable DO_FILTER. The net result of the portions of line acceleration logic 500 in FIGs. 6C and 6D is that, every four cycles, the P4 and P5 pixel positions in four adjacent rows/columns are either set to filtered values (according to A0-A3, PQUANT, CLIP, and other variables) or rewritten with their original values.
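The selection of DO_FILTER and the hand-off of FILTER_OTHER_3 between instances can be sketched as follows. This is a model of the structure as the text describes it (three comparisons feeding an OR gate, then a multiplexer); it has not been checked against the VC-1 specification. The function names, the `terms` tuple layout, and the assumption that the third line sits at index 2 and is processed first are all hypothetical.

```python
def do_filter_signal(a0, a3, clip, pquant, processing_pixel_3, filter_other_3):
    """Model of the OR gate followed by the Processing_Pixel_3 multiplexer."""
    or_gate = (abs(a0) < pquant) or (a3 < abs(a0)) or (clip != 0)
    return or_gate if processing_pixel_3 else filter_other_3


def decide_four_lines(terms, pquant, third_line=2):
    """terms: four (a0, a3, clip) tuples, one per row/column of a sub-block.

    The third line is evaluated first; its decision is then passed to the
    remaining lines as FILTER_OTHER_3.
    """
    order = [third_line] + [i for i in range(len(terms)) if i != third_line]
    decisions = [False] * len(terms)
    filter_other_3 = False
    for i in order:
        a0, a3, clip = terms[i]
        is_third = i == third_line
        do_filter = do_filter_signal(a0, a3, clip, pquant, is_third, filter_other_3)
        if is_third:
            filter_other_3 = do_filter  # state carried to the other instances
        decisions[i] = do_filter
    return decisions


print(decide_four_lines([(0, 5, 0), (0, 5, 0), (6, 0, -10), (0, 5, 0)], pquant=4))
```

Only the third line evaluates the comparisons; the other three lines reuse its result, which is why a single bit of state is enough to couple the four instances.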

The VC-1 deblocking acceleration unit 400 combines parallel and sequential processing in an inventive way, as described above. Parallel processing provides faster execution and reduces latency. Although parallelization increases gate count, that increase is offset by the sequential processing described above. Conventional approaches that do not use this sequential processing simply incur the larger gate count.

Some embodiments of graphics processing unit 120 include a hardware acceleration unit for H.264 deblocking, and this deblocking function is made available through graphics processing unit instructions. Graphics processing unit 120 is described in detail in connection with FIG. 8, with emphasis on the particular graphics processing unit instructions that provide the H.264 deblocking acceleration function.

Graphics processor: rationale for multiple deblocking instructions

The instruction set of graphics processing unit 120 includes instructions that the portions of decoder 160 executing in software can use to accelerate a deblocking filter. An inventive technique described here provides not one but multiple graphics processing unit instructions to accelerate a particular deblocking filter. In-loop deblocking filter 290 is inherently sequential, in that a particular filter must filter pixels in a certain order (e.g., H.264 specifies left to right, then top to bottom). Previously filtered, or updated, pixels are therefore used as inputs when later pixels are filtered. A host processor operates on pixel values stored in conventional memory, which allows pixels to be read and written back to back. This sequential nature does not fit well, however, when in-loop deblocking filter 290 uses a graphics processing unit to accelerate part of the filtering process. A conventional graphics processing unit stores pixels in a texture cache, and the graphics processing unit pipeline design does not accommodate back-to-back reads and writes of the texture cache.

Some embodiments of graphics processing unit 120 disclosed here provide multiple graphics processing unit instructions that can be used together to accelerate a particular deblocking filter. Some of these instructions use the texture cache as the source of pixel data, while others use the graphics processing unit execution units as the data source. In-loop deblocking filter 290 uses these different graphics processing unit instructions in appropriate combinations to achieve back-to-back pixel reads and writes. The following sections give an overview of the flow of data through graphics processing unit 120, then explain the deblocking acceleration instructions provided by graphics processing unit 120 and how in-loop deblocking filter 290 uses them.

Graphics processing unit flow

FIG. 7 is a data flow diagram of graphics processing unit 120, in which the instruction flow is shown by the arrows on the left of FIG. 7 and the image or graphics flow by the arrows on the right. FIG. 7 omits a number of elements that are familiar to those skilled in the art and are not necessary for explaining the in-loop deblocking features of graphics processing unit 120. An instruction stream processor 710 receives an instruction 720 from a system bus (not shown) and decodes the instruction, producing instruction data 730, such as vertex data. Graphics processing unit 120 supports conventional graphics processing instructions as well as instructions that accelerate video encoding and/or decoding.

Conventional graphics processing instructions involve tasks such as vertex shading, geometry shading, and pixel shading. Instruction data 730 is therefore supplied to a pool 740 of shader execution units. The shader execution units use a texture filter unit (TFU) 750 as needed to apply a texture to a pixel. Texture data is cached in texture cache 760, which is backed by main memory (not shown).

Some instructions are sent to video acceleration unit 150, whose operation is described below. The resulting data is then processed by a post-packer 770, which compresses the data. After this post-processing, the data produced by the video acceleration unit is supplied to execution unit pool 740.

The execution of video encoding/decoding acceleration instructions, such as the deblocking filter instructions mentioned above, differs in several respects from that of the conventional graphics instructions just described. First, video acceleration instructions are executed by video acceleration unit 150 rather than by the shader execution units. Second, video acceleration instructions do not use texture data as such.

However, both the image data used by video acceleration instructions and the texture data used by graphics instructions are two-dimensional arrays. Graphics processing unit 120 takes advantage of this similarity by using texture filter unit 750 to load image data for video acceleration unit 150, which in turn allows texture cache 760 to cache some of the image data that video acceleration unit 150 operates on. For this reason, as shown in FIG. 7, video acceleration unit 150 is located between texture filter unit 750 and post-packer 770.

Texture filter unit 750 examines the instruction data 730 extracted from instruction 720. Instruction data 730 also provides TFU 750 with the coordinates of the desired image data within texture cache 760. In one embodiment, these coordinates are specified as U,V pairs, which should be familiar to those skilled in the art. When instruction 720 is a video acceleration instruction, the extracted instruction data further directs texture filter unit 750 to bypass the texture filters (not shown) within texture filter unit 750.

In this way, texture filter unit 750 is leveraged by video acceleration instructions to load image data for video acceleration unit 150. Video acceleration unit 150 receives image data from texture filter unit 750 on the data path and instruction data 730 on the command path, and performs an operation on the image data according to instruction data 730. The image data output by video acceleration unit 150 is fed back to execution unit pool 740 after being processed by post-packer 770.

Deblocking instructions

The embodiments of graphics processing unit 120 described here provide hardware acceleration for a VC-1 deblocking filter and an H.264 deblocking filter. The VC-1 deblocking filter is accelerated by one graphics processing unit instruction ("IDF_VC-1"), while the H.264 deblocking filter is accelerated by three graphics processing unit instructions ("IDF_H264_0", "IDF_H264_1", and "IDF_H264_2").

As explained earlier, each graphics processing unit instruction is decoded and parsed into instruction data 730, which can be viewed as a set of parameters specific to each instruction, as shown in Table 1. The IDF_H264_x instructions share some common parameters, while other parameters are unique to each instruction. Those skilled in the art will appreciate that these parameters can be encoded using a variety of opcodes and instruction formats, so those topics are not discussed here.

Several input parameters are used in combination to determine the address of the 4x4 block to be fetched by texture filter unit 750. The BaseAddress parameter indicates the start of the texture data within the texture cache. The coordinates of the top-left block within this region are given to the BaseAddress parameter. The PictureHeight and PictureWidth input parameters are used to determine the extent of the block, i.e., the bottom-left coordinate. Finally, a video picture may be progressive or interlaced. If interlaced, it consists of two fields (top and bottom). Texture filter unit 750 uses FieldFlag and TopFieldFlag to handle interlaced images appropriately.

The 8x4x8-bit deblocked output is provided in a destination register and is also written back to execution unit pool 740. Writing the deblocked output back to execution unit pool 740 is a "modify in place" operation, which is required in some decoder implementations, for example H.264, where pixel values in blocks to the right and below are computed from earlier results. The VC-1 decoder, however, does not have this dependency. In VC-1, every 8x8 boundary is filtered (vertical first, then horizontal). All vertical edges can therefore be processed substantially in parallel, with the 4x4 edges filtered afterwards. Parallelism can be exploited because only two pixels (one on each side of the edge) are updated, and these pixels are not used in computing other edges. Because deblocked data is written back to execution unit pool 740 rather than to texture cache 760, different IDF_H264_x instructions are provided, which fetch sub-blocks from different locations. This can be seen in Table 1, in the descriptions of the BlockAddress, Data Block 1, and Data Block 2 parameters. The IDF_H264_0 instruction fetches the entire 8x4x8-bit sub-block from texture cache 760. The IDF_H264_1 instruction fetches half of the sub-block from texture cache 760 and half from execution unit pool 740.

The use of the IDF_H264_x instructions by decoder 160 is described in detail in connection with FIG. 8. The next section describes how texture filter unit 750 and execution unit pool 740 transform the fetched pixel data before supplying it to video acceleration unit 150.

Conversion of image data

The instruction parameters described above provide texture filter unit 750 with the coordinates of the address of the sub-block to be fetched from texture cache 760 or from execution unit pool 740. The image data contains luminance (Y) and chrominance (Cb, Cr) planes. A YC flag input parameter specifies whether the Y plane or the CbCr plane is to be processed.

When luminance (Y) data is processed, as indicated by the YC flag parameter, texture filter unit 750 fetches the sub-block and supplies the 128 bits as input to VC-1 in-loop deblocking filter hardware acceleration logic 400 (e.g., as the block input parameter of the VC-1 accelerator instantiation of FIG. 4). The resulting data is written to the destination registers as a register quad (i.e., DST, DST+1, DST+2, DST+3).

When chrominance data is processed, as indicated by the YC flag parameter, the Cb and Cr blocks are processed consecutively by VC-1 in-loop deblocking filter hardware acceleration logic 400. The resulting data is written to texture cache 760. In some embodiments, this write occurs on each cycle, with 256 bits written per cycle.

Some video acceleration unit embodiments use interleaved CbCr planes, each stored at half width and half height. In these embodiments, texture filter unit 750 de-interleaves the CbCr sub-block data for video acceleration unit 150 into a buffer used to communicate between texture filter unit 750 and video acceleration unit 150. Specifically, texture filter unit 750 writes two 4x4 Cb blocks into the buffer, then writes two 4x4 Cr blocks into the buffer. The 8x4 Cb block is processed first by VC-1 in-loop deblocking filter hardware acceleration logic 400, and the resulting data is written to texture cache 760. The 8x4 Cr block is then processed by VC-1 in-loop deblocking filter hardware acceleration logic 400, and the resulting data is written to texture cache 760. Video acceleration unit 150 uses the CbCr flag parameter to manage this sequential processing.
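De-interleaving of this kind can be illustrated with a short Python sketch. The sample ordering is an assumption (Cb and Cr alternating within each row, Cb first), since the text does not specify the memory layout, and the function name `deinterleave_cbcr` is hypothetical.

```python
def deinterleave_cbcr(rows):
    """Split rows of alternating Cb/Cr samples into separate Cb and Cr blocks.

    rows: list of rows, each laid out as [Cb0, Cr0, Cb1, Cr1, ...]
    (assumed layout). Returns (cb_rows, cr_rows), so that all Cb data can
    be handed on before any Cr data.
    """
    cb_rows = [row[0::2] for row in rows]
    cr_rows = [row[1::2] for row in rows]
    return cb_rows, cr_rows


# Four interleaved rows of 16 samples yield an 8x4 Cb block and an 8x4 Cr block.
interleaved = [[(r * 16) + c for c in range(16)] for r in range(4)]
cb, cr = deinterleave_cbcr(interleaved)
print(len(cb), len(cb[0]), len(cr), len(cr[0]))
```

Separating the planes up front is what allows the Cb block and then the Cr block to pass through the same filter logic one after the other.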

Use of the deblocking instructions by the software decoder

As described earlier in connection with FIG. 1, decoder 160 executes on host processor 110 but also makes use of the video acceleration instructions provided by graphics processing unit 120. In particular, embodiments of H.264 in-loop deblocking filter 290 use particular combinations of IDF_H264_x instructions to process edges in the order specified by H.264, fetching some sub-blocks from texture cache 760 and others from execution unit pool 740. Used in the appropriate combinations, these IDF_H264_x instructions achieve back-to-back pixel reads and writes.

FIG. 8 is a block diagram of a 16x16 macroblock as used in H.264. The macroblock is divided into sixteen 4x4 sub-blocks, each of which undergoes deblocking. The sub-blocks in FIG. 8 can be identified by row and column (e.g., R1, C2). H.264 specifies that vertical edges are processed first and horizontal edges afterwards, in the edge order (a-h) shown in FIG. 8.

The first pair of sub-blocks is loaded entirely from texture cache 760, because no pixels have yet been changed by the application of a filter. Although the filter for the first vertical edge (a) may change pixel values in (R1, C1), the vertical edge of the second row does not actually share any pixels with the vertical edge of the first row. The second pair of sub-blocks (edge b) is therefore also loaded from texture cache 760. Because the vertical edges of two adjacent rows do not share pixels, the same holds for the third pair (edge c) and the fourth pair (edge d) of sub-blocks.

The particular IDF_H264_x instruction issued by in-loop deblocking filter 290 determines the location from which pixel data is loaded. The sequence of IDF_H264_x instructions that in-loop deblocking filter 290 uses to process the first set of vertical edges (a-d) is:
IDF_H264_0 SRC1 = address of (R1,C1);
IDF_H264_0 SRC1 = address of (R2,C1);
IDF_H264_0 SRC1 = address of (R3,C1);
IDF_H264_0 SRC1 = address of (R4,C1);

Next, in-loop deblocking filter 290 processes the second vertical edge (b), starting with (R1, C2). The leftmost four pixels of the 8x4 sub-block identified as (R1, C2) overlap the rightmost pixels of the (R1, C1) sub-block. These overlapping pixels were processed, and possibly updated, by the vertical edge filter for (R1, C1), and are therefore read from execution unit pool 740 rather than from texture cache 760. The rightmost four pixels of the (R1, C2) sub-block, however, have not yet been filtered and are therefore read from texture cache 760. The same applies to sub-blocks (R2, C2) through (R4, C2). In-loop deblocking filter 290 achieves this result by issuing the following sequence of IDF_H264_x instructions to process the second set of vertical edges:
IDF_H264_1 SRC1 = address of (R1,C2);
IDF_H264_1 SRC1 = address of (R2,C2);
IDF_H264_1 SRC1 = address of (R3,C2);
IDF_H264_1 SRC1 = address of (R4,C2);

Processing of the third set of vertical edges starts with (R1, C3). The leftmost four pixels of the (R1, C3) 8x4 sub-block overlap the rightmost pixels of the (R1, C2) sub-block and are therefore read from execution unit pool 740 rather than from texture cache 760. The rightmost four pixels of the (R1, C3) sub-block, however, have not yet been filtered and are therefore read from texture cache 760. The same applies to sub-blocks (R2, C3) through (R4, C3). A similar situation arises for the last set of vertical edges. In-loop deblocking filter 290 therefore processes the remaining two sets of vertical edges by issuing the following sequence of IDF_H264_x instructions:
IDF_H264_1 SRC1 = address of (R1,C3);
IDF_H264_1 SRC1 = address of (R2,C3);
IDF_H264_1 SRC1 = address of (R3,C3);
IDF_H264_1 SRC1 = address of (R4,C3);
IDF_H264_1 SRC1 = address of (R1,C4);
IDF_H264_1 SRC1 = address of (R2,C4);
IDF_H264_1 SRC1 = address of (R3,C4);
IDF_H264_1 SRC1 = address of (R4,C4);
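The sixteen vertical-edge instructions listed so far follow a regular pattern: IDF_H264_0 for every row of the first column, then IDF_H264_1 for every row of each later column. A minimal Python sketch that regenerates that list is shown below; the function name `vertical_edge_sequence` and the string form of the operands are hypothetical and do not represent an actual instruction encoding.

```python
def vertical_edge_sequence(rows=4, cols=4):
    """Return (opcode, sub-block) pairs in the vertical-edge order above.

    Column 1 has no previously filtered neighbor, so it uses IDF_H264_0;
    later columns overlap an already filtered sub-block and use IDF_H264_1.
    """
    sequence = []
    for col in range(1, cols + 1):
        opcode = "IDF_H264_0" if col == 1 else "IDF_H264_1"
        for row in range(1, rows + 1):
            sequence.append((opcode, f"R{row},C{col}"))
    return sequence


for opcode, block in vertical_edge_sequence():
    print(f"{opcode} SRC1 = address of ({block});")
```

The column-by-column order matters because each IDF_H264_1 relies on the neighboring sub-block to its left having already been written back to the execution unit pool.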

The horizontal edges (e-h) are processed next. By this point the deblocking filter has been applied to every sub-block in the macroblock, so any pixel may already have been updated. Each sub-block submitted for horizontal edge filtering is therefore read from execution unit pool 740 rather than from texture cache 760. In-loop deblocking filter 290 thus processes the horizontal edges by issuing the following sequence of IDF_H264_x instructions:
IDF_H264_2 SRC1 = address of (R1,C1);
IDF_H264_2 SRC1 = address of (R2,C1);
IDF_H264_2 SRC1 = address of (R3,C1);
IDF_H264_2 SRC1 = address of (R4,C1);
IDF_H264_2 SRC1 = address of (R1,C2);
IDF_H264_2 SRC1 = address of (R2,C2);
IDF_H264_2 SRC1 = address of (R3,C2);
IDF_H264_2 SRC1 = address of (R4,C2);
IDF_H264_2 SRC1 = address of (R1,C3);
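Taken together, the three instructions differ only in where each half of a sub-block comes from. The Python sketch below summarizes that, assuming (from the description of the second vertical edge) that the execution-unit-pool half of IDF_H264_1 is the leftmost four pixels; the table `SOURCES`, the function `fetch_row`, and the list-based stand-ins for the two stores are all hypothetical.

```python
# (source of leftmost 4 pixels, source of rightmost 4 pixels) per instruction.
SOURCES = {
    "IDF_H264_0": ("texture_cache", "texture_cache"),  # nothing filtered yet
    "IDF_H264_1": ("eu_pool", "texture_cache"),        # left half already filtered
    "IDF_H264_2": ("eu_pool", "eu_pool"),              # everything may be updated
}


def fetch_row(instruction, texture_row, eu_row):
    """Assemble one 8-pixel row of a sub-block from the two stores."""
    left_src, right_src = SOURCES[instruction]
    stores = {"texture_cache": texture_row, "eu_pool": eu_row}
    return stores[left_src][:4] + stores[right_src][4:]


texture_row = [0] * 8  # stand-in for unfiltered pixels in the texture cache
eu_row = [1] * 8       # stand-in for filtered pixels in the execution unit pool
for name in SOURCES:
    print(name, fetch_row(name, texture_row, eu_row))
```

Choosing the source per instruction, rather than per pixel at run time, keeps the fetch logic simple while still giving each filter the most recently written values.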

Any process description or block in a flowchart should be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps in the process. Those skilled in the software arts will appreciate that alternative implementations are also included within the scope of this disclosure. In such alternative implementations, functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.

The systems and methods disclosed herein can be implemented in software, hardware, or a combination thereof. In some embodiments, the system and/or method is implemented in software that is stored in memory and executed by a suitable processor located in a computing device (including, without limitation, a microprocessor, microcontroller, network processor, reconfigurable processor, or extensible processor). In other embodiments, the system and/or method is implemented in logic circuitry, including, without limitation, a programmable logic device (PLD), a programmable gate array (PGA), a field programmable gate array (FPGA), or an application-specific integrated circuit (ASIC). In still other embodiments, these logic descriptions are implemented in a graphics processor or graphics processing unit (GPU).

The systems and methods disclosed herein can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device. Such an instruction execution system includes any computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system and execute them. In the context of this disclosure, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by, or in connection with, the instruction execution system. The computer-readable medium can be, for example and without limitation, a system or propagation medium based on electronic, magnetic, optical, electromagnetic, infrared, or semiconductor technology.

Specific (non-limiting) examples of a computer-readable medium using electronic technology include: an electrical (electronic) connection having one or more wires; a random access memory (RAM); a read-only memory (ROM); and an erasable programmable read-only memory (EPROM or flash memory). A specific (non-limiting) example of a computer-readable medium using magnetic technology is a portable computer diskette. Specific (non-limiting) examples of a computer-readable medium using optical technology include an optical fiber and a portable compact disc read-only memory (CD-ROM).

Although the invention is illustrated and described herein as embodied in one or more specific examples, it should not be limited to the details shown, since many different modifications and structural changes may be made without departing from the spirit of the invention and within the scope and range of equivalents of the claims. Accordingly, the appended claims should be construed broadly and in a manner consistent with the scope of the invention, as set forth in the claims that follow.

100‧‧‧System

110‧‧‧General-purpose CPU

120‧‧‧Graphics processor (GPU)

130‧‧‧Memory

140‧‧‧Bus

150‧‧‧Video acceleration unit (VPU)

160‧‧‧Software decoder

170‧‧‧Video acceleration driver

205‧‧‧Input bit stream

210‧‧‧Entropy decoder

215‧‧‧Spatial decoder

220‧‧‧Inverse quantizer

230‧‧‧Inverse discrete cosine transform

235‧‧‧Picture

245‧‧‧Motion vector

250‧‧‧Motion compensation

255‧‧‧Previously decoded picture

265‧‧‧Predicted picture

270‧‧‧Spatial compensation

280‧‧‧Adder

290‧‧‧Deblocking filter

295‧‧‧Decoded picture

310, 320‧‧‧Two adjacent 4x4 sub-blocks

330‧‧‧Vertical boundary

400‧‧‧In-loop deblocking filter hardware acceleration logic circuit

410‧‧‧Module definition section

420‧‧‧Iteration loop section

430‧‧‧Test vertical parameter section

440‧‧‧Compare loop parameter with 3 section

450‧‧‧Example section

500‧‧‧Line acceleration logic circuit

510‧‧‧Module definition section

520‧‧‧Pixel value computation section

530‧‧‧Compare loop parameter with 3 section

540‧‧‧Test DO_FILTER section

550‧‧‧Update state section

605, 610, 615, 620‧‧‧Multiplexers

625, 630, 679‧‧‧Subtractors

635, 640, 655, 680‧‧‧Logic circuit blocks

645, 650‧‧‧Adders

660, 665, 670‧‧‧Registers

671‧‧‧P4 register output

673‧‧‧P5 register output

681‧‧‧Subtractor

685‧‧‧Adder

687, 689, 691, 693‧‧‧Multiplexers

697‧‧‧OR gate

710‧‧‧Instruction stream processor

720‧‧‧Instruction

730‧‧‧Instruction data

740‧‧‧Execution unit pool

750‧‧‧Texture filter unit

760‧‧‧Texture cache

770‧‧‧Post-packer

Figure 1 is a block diagram of an exemplary computing platform for graphics and video encoding and/or decoding.

Figure 2 is a block diagram of the video decoder 160 of Figure 1.

Figure 3 illustrates the sub-block pixel arrangement for a VC-1 filter.

Figure 4 is a listing of hardware description pseudocode for the VC-1 in-loop deblocking filter hardware acceleration logic circuit 400 of Figure 1.

Figure 5 is a listing of hardware description language code for the line acceleration logic circuit 500 of Figure 4.

Figures 6A-D together form a block diagram of the line acceleration logic circuit of Figures 4 and 5.

Figure 7 is a data flow diagram of the graphics processing unit 120 of Figure 1.

Figure 8 is a block diagram of the 16x16 macroblock used by H.264.

205‧‧‧Input bit stream

210‧‧‧Entropy decoder

215‧‧‧Spatial decoder

220‧‧‧Inverse quantizer

235‧‧‧Picture

245‧‧‧Motion vector

250‧‧‧Motion compensation

255‧‧‧Previously decoded picture

265‧‧‧Predicted picture

270‧‧‧Spatial compensation

280‧‧‧Adder

285‧‧‧Mode selector

290‧‧‧Deblocking filter

295‧‧‧Decoded picture

Claims

A deblocking filter for video decoding, comprising: a first logic circuit for performing, for each pixel group of a plurality of pixel groups, an associated filtering operation according to a corresponding set of filtering units among a plurality of sets of filtering units; a second logic circuit for determining whether the pixels of a predetermined pixel group of the plurality of pixel groups meet a criterion; a third logic circuit configured to, when the criterion is met, first filter the pixels of the predetermined pixel group according to the associated filtering operation of the predetermined pixel group; and a fourth logic circuit configured to, when the criterion is met, sequentially filter the remaining pixel groups of the plurality of pixel groups according to their corresponding associated filtering operations, wherein, when the criterion is not met, the fourth logic circuit keeps the pixels of the pixel groups unfiltered by the associated filtering operations corresponding to the plurality of pixel groups.

The deblocking filter for video decoding according to claim 1, wherein the plurality of pixel groups form a square pixel block, and each pixel group of the plurality of pixel groups comprises a column of the pixel block.

The deblocking filter for video decoding according to claim 1, wherein the plurality of pixel groups form a square pixel block, and each pixel group of the plurality of pixel groups comprises a row of the pixel block.

The deblocking filter for video decoding according to claim 1, wherein the third logic circuit further comprises: a fifth logic circuit configured to update one predetermined pixel group among the remaining pixel groups based on a predetermined pixel group among the remaining pixel groups.

The deblocking filter for video decoding according to claim 1, wherein the third logic circuit further comprises: a sixth logic circuit configured to, when the criterion is met, first filter the pixels of the predetermined pixel group in parallel.

The deblocking filter for video decoding according to claim 1, wherein the deblocking filter for video decoding is applied to an edge between a pair of sub-blocks to remove edge artifacts.

The deblocking filter for video decoding according to claim 1, wherein the deblocking filter for video decoding uses a combination of a plurality of graphics processing instructions to accomplish a pixel read and write.

The deblocking filter for video decoding according to claim 1, wherein the deblocking filter for video decoding is defined in accordance with the VC-1 standard.

A video decoder, comprising: an entropy decoder for receiving an input encoded bit stream; a spatial decoder for receiving the output of the entropy decoder and generating a coded picture comprising a plurality of pixels; a first logic circuit configured to combine a current picture with a predicted picture to generate a combined picture; and an in-loop deblocking filter for receiving the combined picture, the in-loop deblocking filter comprising: a second logic circuit for performing, for each pixel group of a plurality of pixel groups, an associated filtering operation according to a corresponding set of filtering units among a plurality of sets of filtering units; a third logic circuit configured to, when a predetermined pixel group of the plurality of pixel groups meets a criterion, filter the predetermined pixel group according to the associated filtering operation of the predetermined pixel group; and a fourth logic circuit configured to, when the predetermined pixel group meets the criterion, filter the remaining pixel groups of the plurality of pixel groups according to the associated filtering operation of the predetermined pixel group, wherein, when the predetermined pixel group does not meet the criterion, the fourth logic circuit keeps the pixels of the pixel groups unfiltered by the associated filtering operations corresponding to the plurality of pixel groups.

The video decoder of claim 9, wherein the plurality of pixel groups form a square pixel block, and each pixel group of the plurality of pixel groups comprises a column of the pixel block.

The video decoder of claim 9, wherein the plurality of pixel groups form a square pixel block, and each pixel group of the plurality of pixel groups comprises a row of the pixel block.

The video decoder of claim 9, wherein the third logic circuit further comprises: a fifth logic circuit configured to filter the pixels of the predetermined pixel group in parallel when the criterion is met.

The video decoder of claim 9, wherein the fourth logic circuit further comprises: a sixth logic circuit configured to update a first predetermined pixel group in each of the remaining pixel groups according to a predetermined pixel group in each of the remaining pixel groups.

The video decoder of claim 9, wherein the in-loop deblocking filter is defined in accordance with the VC-1 standard.

A graphics processing unit, comprising: a main processing interface for receiving at least one video acceleration command; and a video acceleration unit for responding to the at least one video acceleration command, the video acceleration unit including an in-loop deblocking filter, the in-loop deblocking filter comprising: a first logic circuit for performing, for each pixel group of a plurality of pixel groups, an associated filtering operation according to a corresponding set of filtering units among a plurality of sets of filtering units; a second logic circuit for determining whether the pixels of a predetermined pixel group of the plurality of pixel groups meet a first criterion; a third logic circuit configured to, when the first criterion is met, first filter the pixels of the predetermined pixel group according to the associated filtering operation of the predetermined pixel group; and a fourth logic circuit configured to, when the first criterion is met, sequentially filter the remaining pixel groups of the plurality of pixel groups according to their corresponding associated filtering operations, wherein, when the predetermined pixel group does not meet the first criterion, the pixels of the pixel groups are kept unfiltered by the associated filtering operations corresponding to the plurality of pixel groups.

The graphics processing unit of claim 15, wherein the plurality of pixel groups form a square pixel block, and each pixel group of the plurality of pixel groups comprises a column of the pixel block.

The graphics processing unit of claim 15, wherein the plurality of pixel groups form a square pixel block, and each pixel group of the plurality of pixel groups comprises a row of the pixel block.

The graphics processing unit of claim 15, wherein the fourth logic circuit further comprises: a fifth logic circuit configured to update a first predetermined pixel group in each of the remaining pixel groups according to a second predetermined pixel group in each of the remaining pixel groups.

The graphics processing unit of claim 15, wherein the in-loop deblocking filter is defined in accordance with the VC-1 standard.

The graphics processing unit of claim 15, wherein the first logic circuit is further configured to perform the associated filtering operation on each of the plurality of pixel groups in parallel.
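
As an informal illustration only, the control flow recited in the independent claims (test a predetermined pixel group, filter it first if it passes, then filter the remaining groups, otherwise leave every group unfiltered) might be sketched in Python as follows. The names `deblock_groups`, `small_step`, and `average` are hypothetical, and the toy criterion and filter are stand-ins chosen for readability. They are not the VC-1 criterion or filter and do not define the claimed logic circuits.

```python
from typing import Callable, List

PixelGroup = List[int]


def deblock_groups(
    groups: List[PixelGroup],
    predetermined: int,
    meets_criterion: Callable[[PixelGroup], bool],
    filter_group: Callable[[PixelGroup], PixelGroup],
) -> List[PixelGroup]:
    """Filter every group only if the predetermined group meets the criterion."""
    result = [list(group) for group in groups]  # work on copies
    if not meets_criterion(groups[predetermined]):
        return result  # criterion not met: all groups stay unfiltered
    # Criterion met: filter the predetermined group first ...
    result[predetermined] = filter_group(groups[predetermined])
    # ... then the remaining groups, in order.
    for index, group in enumerate(groups):
        if index != predetermined:
            result[index] = filter_group(group)
    return result


# Toy stand-ins (assumptions for illustration only).
def small_step(group: PixelGroup) -> bool:
    return abs(group[0] - group[1]) < 5


def average(group: PixelGroup) -> PixelGroup:
    mean = (group[0] + group[1]) // 2
    return [mean, mean]


groups = [[10, 12], [10, 11], [10, 14], [10, 13]]
filtered = deblock_groups(groups, 2, small_step, average)
untouched = deblock_groups(groups, 2, lambda group: False, average)
```

In this sketch the third group (index 2) plays the role of the predetermined pixel group. When the toy criterion fails, `untouched` is simply a copy of the input.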