TW200821986A

TW200821986A - Graphic processing unit and method for computing sum of absolute difference of macroblocks

Info

Publication number: TW200821986A
Application number: TW096122009A
Authority: TW
Inventors: Zahid Hussain; John Brothers; Jiang-Ming Xu
Original assignee: Via Tech Inc
Priority date: 2006-06-16
Filing date: 2007-06-15
Publication date: 2008-05-16
Also published as: CN101068353B; CN101068353A; TWI383683B; TWI482117B; TW200816082A; TW200803525A; CN101083764A; TWI348654B; TWI444047B; CN101072351B; TW200803527A; TW200816820A; CN101083763B; CN101068365B; CN101068365A; TWI395488B; TW200803528A; CN101068364B; CN101068364A; CN101083763A

Abstract

A graphics processing unit (GPU) comprising: an instruction decoder configured to decode a sum-of-absolute-differences (SAD) instruction into a plurality of parameters describing an M x N pixel block and an n x n pixel block in U, V coordinates, where M, N, and n are integers; and SAD acceleration logic configured to receive the plurality of parameters and to compute a plurality of SAD scores, each SAD score corresponding to the n x n pixel block and to one of a plurality of blocks that are contained within the M x N pixel block and are horizontally offset from the n x n pixel block.
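As a rough software illustration of what the abstract describes, the sketch below computes one SAD score per horizontal offset of an n x n block inside a wider reference block. It is a minimal sketch, not the hardware logic: the function names `sad` and `horizontal_sad_scores`, the list-of-rows pixel layout, and the choice to use only the first n rows of the reference block are assumptions made for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def horizontal_sad_scores(ref_block, cur_block):
    """Return one SAD score per horizontal offset of cur_block inside ref_block.

    ref_block: rows of pixel values, at least n rows tall (assumed layout).
    cur_block: n x n rows of pixel values.
    """
    n = len(cur_block)
    width = len(ref_block[0])
    scores = []
    for dx in range(width - n + 1):
        # Candidate block: n columns starting at horizontal offset dx.
        candidate = [row[dx:dx + n] for row in ref_block[:n]]
        scores.append(sad(candidate, cur_block))
    return scores


# Toy sizes: a 2 x 4 reference strip and a 2 x 2 current block.
ref = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
cur = [[2, 3],
       [6, 7]]
scores = horizontal_sad_scores(ref, cur)
print(scores)  # [4, 0, 4]
```

The score of 0 at offset 1 marks the position where the current block matches the reference exactly.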

Description

IX. Description of the Invention

Client's Docket No.: S3U06-0024; TT's Docket No.: 0608-A41261-TW/final/林璟輝/2007/06/13

[Technical Field]

The present disclosure relates to a graphics processing unit, and more particularly to a graphics processing unit with image compression and decompression features.

[Prior Art]

Personal computers and consumer electronics are used for a variety of entertainment purposes. These can be roughly divided into two categories: those that use computer-generated graphics, such as computer games; and those that use compressed video streams, such as programs pre-recorded onto a digital video disc (DVD), or digital programming delivered to a set-top box by a cable or satellite provider. The second category also includes encoding analog video streams, for example as performed by a digital video recorder (DVR).

Computer-generated graphics are usually produced by a graphics processing unit (GPU). A GPU is a specialized microprocessor found in computer game consoles and some personal computers. A GPU is optimized to rapidly render three-dimensional primitive objects such as triangles and quadrilaterals. These primitives are described by vertices, each of which has attributes (such as color), and textures can be applied to the primitives. The result of rendering is a two-dimensional array of pixels, shown on a computer display or monitor.

Encoding and decoding video streams involves different kinds of computation, for example the discrete cosine transform, motion estimation, motion compensation, and deblocking filters. These computations are usually handled by a general-purpose central processing unit (CPU) combined with special hardware logic, such as an application-specific integrated circuit (ASIC). Consumers therefore need multiple computing platforms to meet their entertainment needs. A single computing platform that can handle both computer-generated graphics and video encoding/decoding is therefore needed.

[Summary of the Invention]

One aspect of the invention is a graphics processing unit comprising: an instruction decoder configured to decode a sum-of-absolute-differences (SAD) instruction into a plurality of parameters describing an M x N pixel block and an n x n pixel block in U, V coordinates, where M, N, and n are integers; and SAD acceleration logic configured to receive the plurality of parameters and to compute a plurality of SAD scores, each SAD score corresponding to the n x n pixel block and to one of a plurality of blocks that are contained within the M x N pixel block and are offset from the n x n pixel block.

Another aspect of the invention is a graphics processing unit comprising: a host processor interface that receives video acceleration instructions; and a video acceleration unit responsive to the video acceleration instructions, the video acceleration unit comprising SAD acceleration logic configured to receive a plurality of parameters and to compute a plurality of SAD scores, each SAD score corresponding to an n x n pixel block and to one of a plurality of blocks that are contained within an M x N pixel block and are offset from the n x n pixel block.

Another aspect of the invention is a method of computing a SAD score for an M x M macroblock, where M is an integer. The method comprises: executing a SAD instruction to compute a first SAD score for a first n x n portion of the M x M macroblock, the first portion comprising the top-left portion of the macroblock, where n is an integer; executing the SAD instruction to compute a second SAD score for a second n x n portion of the macroblock, the second portion comprising the top-right portion of the macroblock; accumulating the first and second SAD scores into a sum; executing the SAD instruction to compute a third SAD score for a third n x n portion of the macroblock, the third portion comprising the bottom-left portion of the macroblock; adding the third SAD score to the sum; executing the SAD instruction to compute a fourth SAD score for a fourth n x n portion of the macroblock, the fourth portion comprising the bottom-right portion of the macroblock; and adding the fourth SAD score to the sum.

[Embodiments]

The embodiments disclosed herein provide systems and methods that use a graphics processing unit to improve motion estimation.

1. Computing platform for video encoding

Figure 1 is a block diagram of an exemplary computing platform for graphics and video encoding and/or decoding. System 100 includes a general-purpose CPU 110 (hereinafter the host processor), a graphics processing unit (GPU) 120, memory 130, and a bus 140. GPU 120 includes a video acceleration unit 150, which accelerates video encoding and/or decoding, as described later. The video acceleration functions of GPU 120 are available as instructions that execute on GPU 120.

A software encoder 160 and a video acceleration driver 170 reside in memory 130, and encoder 160 executes on host processor 110. Through an interface provided by video acceleration driver 170, encoder 160 can also issue video acceleration instructions to GPU 120. In this way, system 100 performs video encoding through host processor software that issues video acceleration instructions to GPU 120. Under this approach, the frequently executed, computationally intensive blocks are offloaded to GPU 120, while the more complex operations are performed by host processor 110.

Figure 1 omits a number of conventional components, well known to those skilled in the art, that are not necessary to explain the video acceleration features of GPU 120. An overview of video encoding follows, and then a discussion of how one video encoder component (the motion estimator) uses the video acceleration functions provided by GPU 120.

2. Video encoder

Figure 2 is a functional block diagram of the video encoder 160 of Figure 1. The input picture 205 to encoder 160 is composed of pixels. Encoder 160 operates by exploiting temporal and spatial similarities within picture 205, and encodes those similarities by determining the differences within a frame (spatial) and/or between frames (temporal). Spatial encoding exploits the fact that neighboring pixels within a picture are usually the same or related, so that only the differences are encoded. Temporal encoding exploits the fact that many pixels in consecutive pictures usually have the same value, so that only the changes between pictures are encoded. Encoder 160 also uses entropy coding to exploit statistical redundancy: some patterns occur more often than others, so the more frequent ones are represented by shorter codes. Examples of entropy coding include Huffman coding, run-length encoding, arithmetic coding, and context-adaptive binary arithmetic coding.

In this exemplary embodiment, blocks of the input picture 205 are supplied to a subtractor 210 and a motion estimator 220. Motion estimator 220 compares the blocks in input picture 205 with a previously stored reference picture 230 to find similar blocks. Motion estimator 220 computes a set of motion vectors 245 that represent the displacement between the matching blocks. The motion vectors 245 together with the matching blocks of reference picture 230 are called the prediction block 255, which represents the temporal encoding. Prediction block 255 is supplied to subtractor 210, which subtracts prediction block 255 from input picture 205 to produce a residual picture 260. Residual picture 260 is supplied to a discrete cosine transform (DCT) block 270 and a quantizer 280, which perform the spatial encoding. The output of quantizer 280 (for example, a set of quantized DCT coefficients) is encoded by an entropy encoder 290.

For certain types of pictures, the residual from quantizer 280 is provided to an internal decoder. The decoder combines the residual with the motion vectors produced by motion estimator 220, and the decoded picture is stored in reference picture buffer 295, which supplies motion estimator 220 as described above.

As discussed in connection with Figure 1, encoder 160 executes on host processor 110 but makes use of the video acceleration instructions provided by GPU 120. In particular, motion estimator 220 uses the sum-of-absolute-differences (SAD) instruction provided by GPU 120 to achieve accurate motion estimation at relatively low computational cost. The motion estimation algorithm is described in detail below.

3. Software motion estimation algorithm

a. Search window

As shown in Figures 3A and 3B, motion estimator 220 divides the current picture 205 into non-overlapping sections called macroblocks. The size of a macroblock varies with the standard used by the encoder (for example MPEG-2, H.264, VC-1) and with the size of the picture. In the exemplary embodiment described here, and in various coding standards, a macroblock is 16x16 pixels. A macroblock is further divided into blocks, which may be 4x4, 8x8, 4x8, 16x8, or 8x16 in size.

In MPEG-2, each macroblock can have only one motion vector, so motion estimation is performed on a macroblock basis. H.264 allows up to 32 motion vectors (depending on the level), so in H.264 motion estimation is computed on a 4x4 or 8x8 block basis. In AVS, a variant of H.264, the motion block is always 8x8. In VC-1, it can be 4x4 or 8x8.

Motion estimation algorithm 220 performs motion estimation on each macroblock in the current picture 205, with the goal of finding a block in a previously encoded reference picture 230 that is similar to the macroblock in the current picture 205. The displacement between the macroblock in reference picture 230 and the macroblock in current picture 205 is computed and stored as a motion vector (245, Figure 2).
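For a 16x16 macroblock like those described above, the method aspect in the summary (four n x n SAD computations accumulated into one sum) can be sketched in software as follows. This is an illustrative sketch only: `sad_nxn` is a hypothetical stand-in for a single SAD instruction, and the list-of-rows pixel layout is an assumption, not the GPU's actual interface.

```python
def sad_nxn(cur, ref, top, left, n):
    """Stand-in for one SAD instruction on the n x n portion at (top, left)."""
    return sum(
        abs(cur[top + r][left + c] - ref[top + r][left + c])
        for r in range(n)
        for c in range(n)
    )


def macroblock_sad(cur, ref, n):
    """SAD of a 2n x 2n macroblock accumulated from four n x n portions."""
    total = sad_nxn(cur, ref, 0, 0, n)    # first portion: top-left
    total += sad_nxn(cur, ref, 0, n, n)   # second portion: top-right
    total += sad_nxn(cur, ref, n, 0, n)   # third portion: bottom-left
    total += sad_nxn(cur, ref, n, n, n)   # fourth portion: bottom-right
    return total


# Two 16x16 macroblocks whose pixels differ by exactly 1 everywhere.
cur_mb = [[(r + c) % 256 for c in range(16)] for r in range(16)]
ref_mb = [[(r + c + 1) % 256 for c in range(16)] for r in range(16)]
print(macroblock_sad(cur_mb, ref_mb, 8))  # 256
```

Because the four portions tile the macroblock without overlap, the accumulated sum equals the SAD taken over all 256 pixels at once.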

For ease of explanation, the motion estimation process is described for one particular macroblock (320) in the current picture 310. The macroblock 320 chosen for this example is in the middle of the current picture 310, but the same technique applies to the other macroblocks as well.

The search window (330) is centered on the macroblock in reference picture 230 that corresponds to macroblock 320 in the current picture 310. That is, if macroblock 320 is located at (X, Y), then search window 330 in reference picture 230 is also located at (X, Y), shown as point 340. Other embodiments place the macroblock in a different part of the reference picture, for example the upper left. The search window 330 in the example of Figures 3A and 3B extends two pixels beyond the corresponding macroblock in the horizontal direction and one pixel in the vertical direction. Search window 330 therefore contains 14 different macroblocks: two macroblocks located 1 and 2 pixels to the left of position 340; another two macroblocks to the right of position 340; and the remaining groups above, below, upper left, upper right, lower left, and lower right of position 340.

The block-matching motion algorithm performed by motion estimator 220 uses the sum of absolute differences (SAD) as the criterion for judging the similarity (match) between macroblocks. The SAD takes the absolute value of the difference between two pixel values and sums these absolute differences over all the pixels in a block, as understood by those skilled in the art. Motion estimator 220 combines the SAD criterion with an inventive method of selecting the candidate macroblocks to be tested for similarity, described below.

b. Selecting candidate macroblocks

Motion estimator 220 uses different search methods depending on whether it is producing intra-coded or inter-coded motion vectors for the current picture 205. Motion estimator 220 uses prior knowledge about real-world motion to predict where in search window 320 the matching macroblock should be, which reduces the number of candidate blocks in search window 320 that are actually tested for similarity against macroblock 310 in the current picture 205. In the real world, objects generally move with constant acceleration, which means that the motion of an object across frames (the optical flow) can be expected to be smooth and similar (that is, substantially continuous) in both space and time. In addition, the SAD surface (that is, a plot of SAD values over the search space) is expected to be relatively smooth (that is, to have a relatively small number of local minima).

By using this prior knowledge to direct the search toward where the best match is most likely to be found, the algorithm disclosed here reduces the number of searches that must be performed to find a good minimum. As a result, the algorithm is computationally efficient and also effective at identifying a good match.
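To make the SAD criterion and the cost of an unguided search concrete, the sketch below tests every candidate in a window of two pixels horizontally and one pixel vertically (the center plus the 14 surrounding positions) and keeps the lowest score. This brute-force baseline is what the candidate-selection approach described above aims to cut down; it is not the disclosed algorithm. The function names, the picture layout, and the sign convention of the returned displacement are assumptions for illustration.

```python
def block_sad(cur, ref, cx, cy, rx, ry, size):
    """SAD between the block of cur at (cx, cy) and the block of ref at (rx, ry)."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def exhaustive_search(cur, ref, x, y, size, range_x=2, range_y=1):
    """Test every candidate in the window; return (best_sad, (dx, dy))."""
    best = None
    for dy in range(-range_y, range_y + 1):
        for dx in range(-range_x, range_x + 1):
            rx, ry = x + dx, y + dy
            # Skip candidates that fall outside the reference picture.
            if rx < 0 or ry < 0 or ry + size > len(ref) or rx + size > len(ref[0]):
                continue
            score = block_sad(cur, ref, x, y, rx, ry, size)
            if best is None or score < best[0]:
                best = (score, (dx, dy))
    return best


# Toy pictures: the current picture is the reference shifted left by one pixel,
# so the matching block sits one pixel to the right in the reference.
ref_pic = [[(7 * r + 13 * c) % 251 for c in range(12)] for r in range(8)]
cur_pic = [[ref_pic[r][min(c + 1, 11)] for c in range(12)] for r in range(8)]
print(exhaustive_search(cur_pic, ref_pic, 4, 2, 4))  # (0, (1, 0))
```

Here the returned pair (dx, dy) is the displacement of the best-matching reference block relative to the current block's position, which plays the role of the motion vector.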

第4圖係一示範性實施例移動估測器220用來計算目前圖像205 β目别巨圖塊_之移動向量之演算法流程圖。移動估測程序從步驟41〇開始，其判定由移動估測器22〇為目前圖像205所產生的移動向量將被圖像間預測 (inter-predicted)或圖像内預測（intra—predicted)。若使用圖像内預測則接著進行步驟42Q，在此施行共輛梯度下降搜尋演，法（conjugated gradient descent search alg〇rithm) 以尋找搜尋窗32G内-預測巨圖塊，這與參考巨圖塊（目前圖像205内之目刚巨圖塊310)是較佳的相符。共軛梯度下降搜哥演异法（步驟420)將結合第5、β圖詳細說明。回到步驟410，若使用圖像間預測以產生移動向量，則接著執行步驟430,在此執行，，鄰近的，，或，，鄰近區域，，搜尋。該搜尋包含鄰近於目前圖像2〇5内目前巨圖塊的巨圖塊，以及對應的先㈤編碼參考圖像230内的巨圖塊。鄰近搜尋凉异法（步驟430)將結合第7、8圖詳細說明。、共軛梯度下降搜尋演算法（步驟41〇)與鄰近搜尋演算法（步驟430)各從-大群目標預測巨圖塊中認出了較佳或; 接受的相符。熟悉此項技藝之人士應當瞭解_來判定如何才是-個”較佳的相符”之準則可以是相對的或是絕對的如’在此敘述之鄰近搜尋演算法使用―絕對準則：有最低值 (score)的目標巨圖塊被視為較佳的相符。麩 HClienfs Docket No. :S3U06-0024 …、叩供 TT’s Docket No:0608-A41261-TW/fmal/林璟輝/2007/06/13 200821986 二下降搜尋演算法利用-臨界值，絕對差值加總值低於该臨界值的篦— : 方塊被視為較佳的相符。然而，該臨界值的準則係一設計或實現決定。在處理步驟42〇或43〇之後，以認出一較健選相符。步驟44G更由? - 、、丁一局部區域徹底搜尋（local area exhaustive )乂找到最佳的候選。該搜尋區域係位於步驟420或 r 43\所認出的較佳候選巨圖塊附近。在-些實施例中，在執行步‘ 420 ’共輕梯度下降搜尋演算法之後（即在圖像内預測的狀況下），局部徹底«所搜尋輕域包含步驟420所認出的局部最小值（較佳候選）的外面附近的4個對角。例如，若在梯度下降上個步驟所使用的值是1，則該搜尋限制在離該較佳候選（±1，±1)的點。在一些實施例中，當執行步驟43〇之後（即在圖像間預測的狀況下），局部徹底搜尋（步驟440)所搜尋的包含在較佳候選巨圖塊附近一小區域的候選，通常是 (±2+2 ) 〇步驟440的局部徹底搜尋從一較佳候選巨圖塊限縮至一最佳候選巨圖塊，這是像素調準（pixehaligned)，即具有整數像素解析度。步驟450與460在一分數像素邊& (fractional-pixel boundary)找到一最佳候選巨圖塊調準。習知分數移動搜尋演算法使用特定編解瑪器淚波演算法 (codec-specific filtering algorithm)以内插在分數位置的像素值，根據周圍的整數位置。相對的，步驟45〇建立最佳候選巨圖塊與參考巨圖塊間相符程度為二次表面，而步驟46〇 15Client’s Docket N〇.:S3U06-0024 TT’s Docket No:0608-A41261-TW/fmal/林璟輝/2007/06八3 15 200821986 判=表:的最小值。最小值對應-最佳相符巨圖塊，二刀而I }數解析度。（關性的以分數符巨圖塊猶立模财法將顿面賴落_朗。）分數解析度的相符巨圖塊於步驟45_忍 ::根據該相符巨圖塊計算-分數移動向量丄::: 項技▲者所知悉的技術。接著就完成了程序棚。 ★熟悉此項技藝者應當瞭解到上面的演算法在本質上是連 4貝的，因其使用了鄰近區域的資訊。儘管使用了硬體加速的習知，計通常避免連續演算法，因為許多仙，連續的設計在這裡是適當的。首先’像素資料係以連續水平掃喊的形式 (sequential raster fashi〇n)讀取，因而可被預先接收，，持在-電路緩衝H中。其次，在含有單―騎差值加總加速單元的實施财’魏是_在該單元是雜維持滿載而非連續處理。、絕對差值加總加料元在預測方塊沒有許多快取遺漏下T以維持咼負載。因為遺漏率是快取大小的函式，而hdtv 解析度影像在快取中僅需要192〇/8 = <1Κβ移動向量，低的快取遺漏率是可以預期的。 
c·隻里共軛梯度下_降的圖像内預測蒋動向景第5圖係第4圖共軛梯度步驟44〇的流程圖，由移動估測器220之一實施例所執行。如前所述，步驟係在判定使用圖像内預測將被用來尋找搜尋窗320内巨圖塊係與目前方塊310為一較佳（即可接受的）相符時執行。絕對差值加總值為了一組5個初始候選而計算··目前巨圖塊、與目前巨圖塊 16Client’s Docket No.:S3U06-0024 TT’s Docket No:0608-A41261-TW/fmal/林璟輝/2007/06/13 200821986 上下、左、右的巨圖塊。從這初始組5個絕 ^异兩組幻睡錢。從這兩_度，得到最陡躺方向 =域。若該梯度相對地淺，或5個初始候選巨圖塊有非常接 γ e對差值加總值，則該搜尋延伸遠離目前巨圖塊，因為在 =或内不存在有較佳局部最小機率之條件的肖選。在對共幸厄乐又下降步驟44Q概狀後’該步麟更詳細的說明於下。默佶^驟攸步驟5G5開始’在此初始化—候選方塊^^與步二=、△/。在—實施例中，候選巨圖塊q設為搜尋窗拉+/工上角’而步驟值均設為一小整數值，例如8。 =在步驟51〇，計算闕關塊⑼周的候選巨圖塊的座 “。這四個候駐圖塊是候選巨圖U上、下、左、右四個。即，差值515，在此分別計算5倾駐目塊的絕對，加4 (原本那個與周遭四個）。在步驟52〇，計算梯心虚 c梯度&是左邊與右駐圖塊絕對差值加驗的差。梯心是上面與下面巨圖塊絕對差值加總值的差。如此可 1 ㈣巨圖，間的誤差值是增加或減少，該梯度表示x ;y: :在步驟525 ’該梯度係與—臨界值作比較。若該梯度低於（_梯度相對地淺），這衫在目前搜尋中益局精小值’故該搜尋延伸至新的的: =遠離了原本的候選處理巨圖塊、。在一 et NolLotlf ^ ^ M ^ ^ ^ ^ lg # ^ Μ ^ ^ ^ ^ ^ 爪Docket副608捕2财w/f_林璟輝咖嶋咖 17 200821986 亦《申„亥搜哥。該延伸搜個新候選巨圖塊的座伊 Ί々驟530進仃，在此计异四右距執㈨的地方，序、在^下左、倖四個新候遠巨圖塊以形成原本候選巨圖塊c』圍正方形角落，距離CA): 一义以一〆々ς) 分別執行共輛梯度選巨圖塊（c，TL’现BL，抑） =到步驟咖的梯度比較，若在巨圖塊52〇所計算的梯又…、於或大㈣臨界值（即該梯度係相對地㈣）步驟_在步驟515所計算的絕對差值加總值與-臨界值作= 較。右该絕對差值加總值低於該臨界值，則表示找到較佳相 ^則步驟碰回到啤叫器、（在步驟545)，提供該啤叫器有隶低絕對差值加總值的候選巨圖塊。若在步驟540卿m的親對差值加總值等於或低於 S品界值’表不沒有找到較佳相符，故調整搜尋㉗ 55°，選擇-新的中央候選巨圖塊k新的中央巨以 (：，孔，了1^’候選組中在步驟515中算出有最低絕對『總值的方塊。接著，在步驟555，從梯計算新的步^ 值△,與〜’例如陡山肖的梯度代表可接受的相符巨塊係目前中央候選很遠，故增加(、△，）。相反地，淺的梯声^ 表可接受的相符巨圖塊係目前中央候選很近，故應^少 18Client’s Docket No.:S3U06-0024 TT’s Docket No:0608-A41261-TW/fmal/林環輝/2007/06/13 1 0 200821986 不同的係數可以 (Δ’Λ)。热悉此項技藝之人士應當瞭解到各種從各梯度用來計算(△,，△,)以達成該結果。接著，在步驟560測試疊代迴圈數。若該數目大於一田值，則步驟440於步，驟565 $成，找不到可以接相:最大 ::近㈣擇一組瞧^ 接、於取、冬相付，該梯度下降步驟440回到步驟51〇，4 is an algorithmic flow diagram of an exemplary embodiment motion estimator 220 for computing a motion vector of a current image 205. The motion estimation procedure begins in step 41, where it is determined that the motion vector generated by the motion estimator 22 as the current image 205 will be inter-predicted or intra-predicted. . 
If intra-image prediction is used, then step 42Q is performed, where a conjugated gradient descent search alg〇rithm is performed to find the intra-predictive giant tile in the search window 32G, which is related to the reference giant tile. (Currently the giant tile 310 in the image 205) is a preferred match. The conjugate gradient descent search (step 420) will be described in detail in conjunction with the fifth and beta plots. Returning to step 410, if inter-picture prediction is used to generate a motion vector, then step 430 is performed, where, adjacent, or,, adjacent, search. The search includes a giant tile adjacent to the current giant tile within the current image 2〇5, and a corresponding macroblock within the first (five) encoded reference image 230. The proximity search cool method (step 430) will be described in detail in conjunction with Figures 7 and 8. The conjugate gradient descent search algorithm (step 41A) and the neighbor search algorithm (step 430) recognize the preferred or accepted conformance in each of the large-group target prediction macroblocks. Those familiar with the art should understand that the criteria for determining how to be a "best match" can be relative or absolute as used in the proximity search algorithm described here. - Absolute criteria: lowest value The target giant tile of (score) is considered to be a better match. Bran HClienfs Docket No. :S3U06-0024 ..., 叩 TT's Docket No:0608-A41261-TW/fmal/林璟辉/2007/06/13 200821986 Two descent search algorithm utilization - critical value, absolute difference plus low total value The 篦-: block at the critical value is considered to be a better match. However, the criteria for this threshold are a design or implementation decision. After processing step 42 or 43, to recognize a more healthy match. Step 44G finds the best candidate by the local area exhaustive. The search area is located near the preferred candidate giant block identified in step 420 or r 43\. 
In some embodiments, after performing the step '420' common light gradient descent search algorithm (ie, in the case of intra-image prediction), the partial thorough «searched light field contains the local minimum recognized in step 420. 4 diagonals near the outside of the (preferred candidate). For example, if the value used in the previous step of the gradient descent is 1, the search is limited to a point away from the preferred candidate (±1, ±1). In some embodiments, after performing step 43 (ie, in the case of inter-picture prediction), a partial thorough search (step 440) for searching for a candidate containing a small area near the preferred candidate giant block, typically Yes (±2+2) 局部 The partial thorough search of step 440 is limited from a preferred candidate giant tile block to a best candidate giant tile block, which is pixehaligned, ie has integer pixel resolution. Steps 450 and 460 find an optimal candidate tile block alignment at a fractional-pixel boundary. The conventional fractional motion search algorithm uses a specific codec-specific filtering algorithm to interpolate the pixel values at the fractional position, depending on the surrounding integer position. In contrast, step 45 〇 establishes that the best candidate giant tile and the reference giant tile match the secondary surface, and step 46〇15Client's Docket N〇.:S3U06-0024 TT's Docket No:0608-A41261-TW/fmal /林璟辉/2007/06八3 15 200821986 Judgment = Table: The minimum value. The minimum corresponds to the best matching giant block, the second knife and the I } number resolution. (The relationship between the scores and the giant blocks is still _ 朗.) The matching macro block of the fractional resolution is in step 45_Forbearing:: Calculated according to the matching giant block - Fractional motion vector丄::: The technology that the ▲ 者 knows. Then the program shed was completed. 
★ Those who are familiar with this technique should understand that the above algorithm is essentially 4, because it uses information from nearby areas. Despite the use of hardware acceleration, the meter usually avoids continuous algorithms because many centuries, continuous designs are appropriate here. First, the 'pixel data is read in the form of a sequential raster fashi〇n, and thus can be received in advance, and held in the -circuit buffer H. Secondly, in the implementation of the single-riding difference plus acceleration unit, the implementation of the _ is in the unit is mixed to maintain full load rather than continuous processing. The absolute difference plus the total feed element does not have many cache misses in the prediction block to maintain the load rejection. Since the miss rate is a function of the cache size, the hdtv resolution image only needs 192 〇/8 = <1 Κ β motion vector in the cache, and a low cache miss rate is expected. c. Intra-image prediction within the conjugate gradient only. Figure 5 is a flow chart of the conjugate gradient step 44A of Figure 4, performed by one embodiment of the motion estimator 220. As previously mentioned, the steps are performed when it is determined that the use of intra-image prediction will be used to find that the macroblock in the search window 320 is in a better (i.e., acceptable) manner than the current block 310. The absolute difference plus the total value is a set of 5 initial candidates and is calculated. · The current giant tile, and the current giant tile 16Client's Docket No.: S3U06-0024 TT's Docket No: 0608-A41261-TW/fmal/林璟辉/2007 /06/13 200821986 Large blocks of up, down, left and right. From this initial group of 5 different kinds of magical sleep money. From these two degrees, get the steepest lying direction = domain. 
If the gradient is relatively shallow, or if the five initial candidate macroblocks have similar SAD scores, the search extends farther from the current macroblock, because these conditions indicate that a good local minimum probably does not exist nearby.

With this overview of conjugate gradient descent step 440 complete, the process is now explained in more detail. It begins at step 505, where a candidate macroblock C and two step values, Δx and Δy, are initialized. In one embodiment, candidate macroblock C is set to the top-left corner of the search window and both step values are set to a small integer, such as 8.

At step 510, the coordinates of four candidate macroblocks surrounding candidate macroblock C are computed. The four candidates lie above, below, to the left of, and to the right of C. Step 515 computes the SAD score for these five candidate macroblocks (the original and the four surrounding it).

At step 520, the x and y gradients are computed. The x gradient is the difference between the SAD scores of the left and right macroblocks; the y gradient is the difference between the SAD scores of the upper and lower macroblocks. The gradients therefore indicate whether the error between potentially matching macroblocks is increasing or decreasing in the x or y direction.

At step 525, the gradients are compared with a threshold. If the gradients are below the threshold (that is, relatively shallow), no local minimum exists in the current search area, so the search is extended to new candidate macroblocks farther from the original candidate. In one embodiment, the search is also extended under the second condition given in the overview (similar SAD scores among the candidates).

The extended search continues at step 530, where four new candidate macroblocks are selected. The new candidates are the corners of a square surrounding the original candidate macroblock C, at distance Δ: TL, TR, BL, and BR. At step 535, conjugate gradient descent is performed separately on each of the selected macroblocks (C, TL, TR, BL, BR).

Returning to the gradient comparison of step 525: if the gradient computed at step 520 is at or above the threshold (that is, relatively steep), the SAD score computed at step 515 is compared with a threshold at step 540. If the SAD score is below the threshold, a good match has been found, and step 440 returns (at step 545) the candidate macroblock with the lowest SAD score. If the SAD score tested at step 540 is at or above the threshold, no good match has been found, so the search area is adjusted. At step 550, a new center candidate macroblock is selected: the member of the candidate group with the lowest SAD score computed at step 515. At step 555, new step values (Δx, Δy) are computed from the gradients. A steep gradient indicates that an acceptable match lies far from the current center candidate, so (Δx, Δy) is increased. Conversely, a shallow gradient indicates that an acceptable match lies very close to the center candidate, so (Δx, Δy) is decreased. Those skilled in the art will appreciate that various scaling factors can be used to compute (Δx, Δy) from the gradients to achieve this result. Next, the number of iterations is tested at step 560.
If the number of iterations exceeds a maximum value, step 440 completes at step 565 without having found an acceptable match. Otherwise, gradient descent step 440 returns to step 510, where a new set of four surrounding candidates is generated around the new center candidate using the new step values.
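The candidate selection and gradient test of steps 510 through 530 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `sad` callback, the choice of "above" as y minus Δy, and the rule that both gradients must be small to count as shallow are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[int, int]


def gradient_step(sad: Callable[[int, int], int], center: Point,
                  dx: int, dy: int) -> Tuple[Dict[str, int], int, int]:
    """Score the center candidate and its four neighbors, then form the gradients."""
    x, y = center
    # Assumption: "T" (above) is y - dy, as in raster coordinates.
    candidates = {
        "C": (x, y),
        "T": (x, y - dy),
        "B": (x, y + dy),
        "L": (x - dx, y),
        "R": (x + dx, y),
    }
    scores = {name: sad(px, py) for name, (px, py) in candidates.items()}
    gx = scores["L"] - scores["R"]  # left/right SAD difference (step 520)
    gy = scores["T"] - scores["B"]  # top/bottom SAD difference (step 520)
    return scores, gx, gy


def is_shallow(gx: int, gy: int, threshold: int) -> bool:
    """Step 525 test; treating both gradients jointly is an assumption."""
    return abs(gx) < threshold and abs(gy) < threshold


def expanded_centers(center: Point, delta: int) -> List[Point]:
    """Step 530: corners (TL, TR, BL, BR) of a square around the center."""
    x, y = center
    return [(x - delta, y - delta), (x + delta, y - delta),
            (x - delta, y + delta), (x + delta, y + delta)]
```

A caller would pick the lowest-scoring entry of `scores` as the next center when the gradient is steep, or restart from each point of `expanded_centers` when it is shallow.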
Conjugate gradient descent step 440 completes in either of two cases: a candidate with a SAD score below the threshold is found (step 545), or the maximum number of iterations is reached (step 565).

Figure 6 illustrates an example scenario that uses conjugate gradient descent step 440. The initial candidate macroblock C is the square (610), and the four surrounding candidates are the circles (610T, 610L, 610R, 610B). The x and y gradients (620X, 620Y) are computed from these initial candidates. In this example, the gradient is too shallow and no SAD score is below the threshold. The search is therefore extended using four new center candidate macroblocks, shown as triangles (630TL, 630TR, 630BL, 630BR). These new candidates are the corners of a square surrounding the original candidate macroblock C at distance Δ. The macroblocks surrounding these center candidates, shown as hexagons (the eight blocks numbered 640), are chosen as candidates. In this example, two of the candidates 640 have SAD scores below the threshold and "steep" gradients (650XY, 660XY). A further candidate is chosen based on each "steep" gradient: candidate 670 is chosen based on gradient 650XY, and candidate 680 is chosen based on gradient 660XY. The gradient descent search continues with these new candidates 670 and 680, according to conjugate gradient descent step 440.

Figure 7 is a flowchart of the neighboring search algorithm of Figure 4 (step 430), as performed by one embodiment of motion estimator 220. As described earlier, the candidate macroblocks for this search include macroblocks that are adjacent to current macroblock 310 and have already been encoded. Also included as a candidate is the corresponding macroblock in the previously encoded reference picture.

Computing the candidate macroblock coordinates begins at step 710, where a flag variable TOPVALID is computed by taking the modulus (remainder) of the address of current macroblock 310 and the number of macroblocks per line. If this modulus is non-zero, TOPVALID is true; otherwise TOPVALID is false. At step 720, a flag variable LEFTVALID is computed by taking the integer division of the address of current macroblock 310 by the number of macroblocks per line. If this quotient is non-zero, LEFTVALID is true; otherwise LEFTVALID is false. The TOPVALID and LEFTVALID variables indicate whether current macroblock 310 has a neighboring macroblock above it and to its left, respectively, taking the top and left edges into account.

At step 730, the TOPVALID and LEFTVALID variables are used together to determine the availability, or existence, of four candidate macroblocks adjacent to current macroblock 310. Specifically: a left macroblock L is available if (LEFTVALID); a top macroblock T is available if (TOPVALID); a top-left macroblock TL is available if (TOPVALID && LEFTVALID); and a top-right macroblock TR is available if (TOPVALID && RIGHTVALID). Next, at step 740, availability is determined for a previous candidate macroblock P, which is the macroblock in previously encoded reference picture 230 that corresponds spatially to current macroblock 310. The relative positions of these five candidate macroblocks can be seen in Figures 8A and 8B, where L is 810, T is 820, TL is 830, TR is 840, and P is 850.

Steps 730 and 740 together determine how many candidate macroblocks are available for SAD computation.
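The availability tests of steps 710 through 730 can be sketched as below. The sketch assumes macroblock addresses run in raster order, an assumption the text does not state. Under that assumption the remainder of the address is the column and the integer quotient is the row, so the sketch pairs the remainder with the left-edge test and the quotient with the top-edge test, whereas the flowchart description names them the other way around. `right_valid` is a hypothetical flag, since RIGHTVALID is used above but never defined.

```python
from typing import Dict


def neighbor_candidates(mb_addr: int, mbs_per_line: int) -> Dict[str, bool]:
    """Return which adjacent candidate macroblocks exist for a raster-order address."""
    col = mb_addr % mbs_per_line    # remainder: position within the row
    row = mb_addr // mbs_per_line   # integer quotient: row index
    left_valid = col != 0
    top_valid = row != 0
    # Hypothetical definition: a right neighbor exists unless this is the last column.
    right_valid = col != mbs_per_line - 1
    return {
        "L": left_valid,
        "T": top_valid,
        "TL": top_valid and left_valid,
        "TR": top_valid and right_valid,
    }
```

For example, with 120 macroblocks per line, address 0 has no available neighbors, while address 121 (second row, second column) has all four.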
A SAD score is computed for each available candidate macroblock. If all five candidates are available, SAD scores are computed for L, T, TL, TR, P, and med(T, TL, TR). If not all candidates are available, persons skilled in the art should understand which candidate macroblocks make up the set. Step 430 returns the candidate with the lowest SAD score.

As discussed, once a matching macroblock is found (whether by the neighboring search of Figure 7 or the conjugate gradient descent of Figure 5), the search is narrowed further with a local exhaustive search (Figure 4). A fractional motion vector is then computed from the results of the local exhaustive search. The computation of the fractional motion vector is detailed below.

Fractional motion vector model

Persons skilled in the art should be familiar with plotting macroblocks against search degree to produce an "error surface". In the inventive method, motion estimator 220 models the error surface as a quadratic surface and analytically determines the minimum of that surface with sub-pixel accuracy. Motion estimator 220 first determines the minimum in one direction, which gives a minimum line. It then determines the minimum along that line in the orthogonal direction.

The general equation of a quadratic curve is given by Equation 1:

y = c_1 + c_2 t + c_3 t^2    (Equation 1)

Differentiating the curve gives Equation 2:

dy/dt = c_2 + 2 c_3 t = 0    (Equation 2)

Once the coefficients c_1, c_2, c_3 are known, Equation 2 can be solved for t, the position of the minimum. Motion estimator 220 solves Equation 3, a least-squares fit of Equation 1 to the sampled SAD scores, to determine the coefficients c_1, c_2, c_3:

\[
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix}
=
\begin{bmatrix}
\sum_i 1 & \sum_i t_i & \sum_i t_i^2 \\
\sum_i t_i & \sum_i t_i^2 & \sum_i t_i^3 \\
\sum_i t_i^2 & \sum_i t_i^3 & \sum_i t_i^4
\end{bmatrix}^{-1}
\begin{bmatrix} \sum_i d_i \\ \sum_i t_i d_i \\ \sum_i t_i^2 d_i \end{bmatrix}
\qquad \text{(Equation 3)}
\]

Motion estimator 220 uses the 8x4 SAD instruction provided by GPU 120 to compute Equation 3 efficiently. Each d_i represents a SAD score at position t_i, and the sums over i run over the SAD scores of macroblocks adjacent in the x direction. As described in detail below, the 8x4 SAD instruction efficiently computes the four SAD scores for the adjacent macroblocks (x, y), (x+1, y), (x+2, y), (x+3, y), that is, for i = 0..3. As stated above, once the coefficients are known, solving Equation 2 gives t, the minimum in the x direction.

Equation 3 can also be used to determine the minimum t in the vertical direction. In this case, motion estimator 220 uses the 8x4 SAD instruction to efficiently compute the four SAD scores for the vertically adjacent macroblocks (x, y), (x, y+1), (x, y+2), (x, y+3). Equation 3 is solved for the coefficients c_1, c_2, c_3 computed from these SAD scores. As before, once the coefficients are known, solving Equation 2 gives t, the minimum in the y direction.

The quadratic error surface method used by motion estimator 220 is an improvement over the conventional method of first finding a match on a pixel boundary and then using computationally expensive filtering to find a sub-pixel match.
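To make the fractional-position calculation concrete, the sketch below fits the quadratic of Equation 1 to four SAD samples by least squares and then solves Equation 2 for t. It is an illustration under assumptions: the function names are hypothetical, the fit is the standard normal-equations formulation, and exact rational arithmetic is used only to keep the example self-checking.

```python
from fractions import Fraction
from typing import Sequence, Tuple


def fit_quadratic(ts: Sequence[int], ds: Sequence[int]) -> Tuple[Fraction, Fraction, Fraction]:
    """Least-squares fit of d = c1 + c2*t + c3*t**2 via the normal equations."""
    # s[k] is the sum of t_i**k; r[k] is the sum of t_i**k * d_i.
    s = [sum(Fraction(t) ** k for t in ts) for k in range(5)]
    r = [sum(Fraction(t) ** k * d for t, d in zip(ts, ds)) for k in range(3)]
    # Augmented 3x3 system, solved by Gauss-Jordan elimination.
    m = [[s[i + j] for j in range(3)] + [r[i]] for i in range(3)]
    for col in range(3):
        pivot = next(row for row in range(col, 3) if m[row][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for row in range(3):
            if row != col:
                factor = m[row][col]
                m[row] = [a - factor * b for a, b in zip(m[row], m[col])]
    return m[0][3], m[1][3], m[2][3]


def subpixel_minimum(ts: Sequence[int], ds: Sequence[int]) -> Fraction:
    """Solve c2 + 2*c3*t = 0 for t; the fit must open upward (c3 > 0)."""
    _, c2, c3 = fit_quadratic(ts, ds)
    if c3 <= 0:
        raise ValueError("fitted curve has no minimum")
    return -c2 / (2 * c3)
```

With samples at t = 0, 1, 2, 3 and SAD scores 25, 1, 9, 49, the fitted minimum falls at t = 5/4, between the two lowest integer positions.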

Computation of the minimum

As described earlier, motion estimator 220 uses the SAD score to determine which macroblock in the predicted picture is a better match for a macroblock in the current picture. Motion estimator 220 uses the SAD hardware acceleration provided by GPU 120, which is exposed as a GPU instruction. The SAD instruction takes as input a 4x4 reference block and an 8x4 prediction block, and produces four SAD scores. The sizes of the reference block and the prediction block can be changed as needed.
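As a rough software model of this instruction, the sketch below scores one 4x4 reference block against four 4x4 windows of an 8x4 prediction block. The window offsets of 0 to 3 pixels are an assumption, taken from the abstract's description of blocks that are horizontally offset within the larger block; the function names are hypothetical.

```python
from typing import List, Sequence

Block = Sequence[Sequence[int]]


def sad_4x4(ref: Block, pred: Block, x_off: int) -> int:
    """Sum of absolute differences between ref and the 4x4 window of pred at x_off."""
    return sum(abs(ref[r][c] - pred[r][c + x_off])
               for r in range(4) for c in range(4))


def sad_instruction(ref: Block, pred: Block) -> List[int]:
    """ref has 4 rows of 4 pixels; pred has 4 rows of 8 pixels. Returns four scores."""
    # Assumed window offsets: 0, 1, 2 and 3 pixels from the left edge of pred.
    return [sad_4x4(ref, pred, x_off) for x_off in range(4)]
```

A score of zero at some offset means the reference block matches the prediction data exactly at that horizontal position.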
The 4x4 reference block and the 8x4 prediction block are only examples used to illustrate the invention, and should not limit the sizes of the reference block and the prediction block.

Figures 9A and 9B are block diagrams illustrating the operation of the SAD instruction on the reference and prediction blocks. As shown in Figure 9A, the 8x4 prediction block is made up of multiple horizontally adjacent 4x4 blocks that overlap one another: blocks 910, 920, 930, and 940. The SAD unit takes one input 4x4 reference block 950 and computes the SAD score of that reference block against each of blocks 910-940. That is, the SAD instruction computes four values: the sum of the absolute values of the differences between block 910 and block 950; the same sum for block 920 and block 950; the same sum for block 930 and block 950; and the same sum for block 940 and block 950.

Referring to Figure 9B, the SAD acceleration unit in GPU 120 uses four SAD computation units (960, 970, 980, 990) to implement the SAD instruction. The leftmost 4x4 block 910 is provided to SAD computation unit 960. The next 4x4 block to the right (920) is input to SAD computation unit 970, and the next one to the right (930) is input to SAD computation unit 980. Finally, the rightmost 4x4 block 940 is provided to SAD computation unit 990. GPU 120 uses the independent SAD computation units in parallel, so the SAD instruction produces four SAD scores each cycle. Persons skilled in the art should understand the algorithm for computing the SAD of two pixel blocks of the same size, and the hardware design used to perform it, so these details are not described further.

The 4x4 reference block is aligned both horizontally and vertically on a pixel boundary. However, the 4x4 prediction blocks 910-940 are not required to be vertically aligned. In one embodiment, the data is aligned by rotating the reference block (logic circuit 995). Rotating the reference block, rather than rotating each of the four prediction blocks, reduces gate count. The rotated reference block is provided to each independent SAD computation unit. Each unit produces a 12-bit value, and these values are combined into one 48-bit output. In one embodiment, the values are ordered according to the U texture coordinate of the prediction block (the lowest coordinate in the lowest bit position).

The following code shows that the SAD score of an 8x8 block, i.e., two adjacent 8x4 blocks, can be computed with only four SAD instructions. Registers T1, T2, T3, and T4 hold the four SAD scores. The variable sadS accumulates these SAD scores. The address of the 8x4 reference block is assumed to be in refReg. U and V are the texture coordinates of the 8x8 prediction block. The code produces the total SAD score for the entire 8x8 block, stored in sadS.

    SAD T1, refReg, U, V        ; left-top of 8x8 prediction block
    SAD T2, refReg, U+4, V      ; right-top of 8x8 prediction block
    ADD sadS, T1, T2
    SAD T3, refReg, U, V+4      ; left-bottom of 8x8 prediction block
    ADD sadS, sadS, T3
    SAD T4, refReg, U+4, V+4    ; right-bottom of 8x8 prediction block
    ADD sadS, sadS, T4

However, it is often possible to avoid computing and summing the values of all four sub-blocks, because the computation can stop as soon as the sum reaches the current minimum. The following pseudocode shows how the SAD instruction can be used in a loop that stops when the sum reaches the minimum.
    I := 0; SUM := 0; MIN := currMIN;
    WHILE (I < 4 AND SUM < MIN)
        SUM := SUM + SAD(refReg, U + (I % 2) * 4, V + (I >> 1) * 4);
        I := I + 1;
    IF (SUM < currMIN) currMIN := SUM;
    GO TO next search point;

The 8x4 SAD instruction in GPU 120 is used directly by the advanced search algorithms of motion estimator 220, for example when performing the local exhaustive search. In addition, the texture cache (Figure 10) is block-aligned, whereas the algorithms used by motion estimator 220, as described above, are pixel-aligned. Although multiplexer units could be added to GPU 120 to handle this alignment difference, doing so would increase gate count and power consumption. Instead, GPU 120 spends that budget on four SAD units rather than one. In some embodiments, the SAD instruction offers the advantage of computing the minimum efficiently, which involves computing the SAD scores of adjacent blocks. In some embodiments, the 8x4 SAD instruction offers a further advantage for the exhaustive search (block 440), which computes the SAD score for each diagonal when the step value is 1.

4. Graphics processor

Having discussed the software algorithm implemented by motion estimator 220 and its use of the 8x4 SAD instruction in GPU 120, the SAD instruction and GPU 120 are now described in detail.

a. Graphics processing unit flow

Figure 10 is a data flow diagram of GPU 120, in which the instruction flow is shown by the arrows on the left side of Figure 10 and the image or graphics flow is shown by the arrows on the right.
Figure 10 omits several conventional components, familiar to those skilled in the art, that are not needed to explain the in-loop deblocking features of GPU 120. An instruction stream processor 1010 receives an instruction 1020 from a system bus (not shown) and decodes it, producing instruction data 1030, such as vertex data. GPU 120 supports conventional graphics processing instructions as well as instructions that accelerate video encoding and/or decoding, such as the 8x4 SAD instruction described above.

Conventional graphics processing instructions involve tasks such as vertex shading, geometry shading, and pixel shading. Instruction data 1030 is therefore supplied to a pool 1040 of shader execution units. The shader execution units use a texture filter unit (TFU) 1050 as needed to apply a texture to a pixel. Texture data is cached in texture cache 1060, which is backed by main memory (not shown). Some instructions are passed to video processing unit 1100, whose operation is described below. The resulting data is then processed by post-packer 1070, which compresses it. After post-processing, the data produced by the video acceleration unit is provided to execution unit pool 1040.

Execution of video encoding/decoding acceleration instructions, such as the SAD instruction described above, differs from execution of conventional graphics instructions in several respects. First, video acceleration instructions are executed by video processing unit 1100 rather than by the shader execution units. Second, video acceleration instructions do not use texture data as such.
However, the image data used by video acceleration instructions and the texture data used by graphics instructions are both two-dimensional arrays. GPU 120 takes advantage of this similarity by using texture filter unit 1050 to load image data for video processing unit 1100, which allows texture cache 1060 to cache some of the image data that video processing unit 1100 operates on. For this reason, as shown in Figure 10, video processing unit 1100 is located between texture filter unit 1050 and post-packer 1070.

Texture filter unit 1050 examines the instruction data 1030 extracted from instruction 1020. Instruction data 1030 also gives texture filter unit 1050 the coordinates of the desired image data in main memory (not shown). In one embodiment, these coordinates are specified as U, V pairs, which should be familiar to those skilled in the art. When instruction 1020 is a video acceleration instruction, the extracted instruction data also directs texture filter unit 1050 to bypass any texture filters (not shown) within it. In this way, texture filter unit 1050 is used, under control of the video acceleration instruction, to load image data for video processing unit 1100.

Video processing unit 1100 receives image data from texture filter unit 1050 on the data path and instruction data 1030 on the instruction path, and performs an operation on the image data according to instruction data 1030. The image data output by video processing unit 1100 is fed back to execution unit pool 1040 after processing by post-packer 1070.

b. Instruction parameters

The operation of video processing unit 1100 in executing the SAD video acceleration instruction is now described. As explained earlier, each GPU instruction is decoded and parsed into instruction data 1030, which can be viewed as a set of parameters specific to each instruction. The parameters of the SAD instruction are shown in Table 1.

Table 1: Inputs and outputs of the GPU SAD instruction

Input/Output | Name                | Size                               | Description
Input        | FieldFlag           | 1 bit                              | If FieldFlag == 1, field picture; otherwise frame picture
Input        | TopFieldFlag        | 1 bit                              | If TopFieldFlag == 1, top-field picture; otherwise bottom-field picture (applies if FieldFlag is set)
Input        | PictureWidth        | 16 bits                            | For example, 1920 for HDTV
Input        | PictureHeight       | 16 bits                            | For example, 1080 for 30P HDTV
Input        | BaseAddress         | 32 bits, unsigned                  | Base address of the prediction picture
Input        | BlockAddress        | U: 16 bits signed; V: 16 bits signed | Texture coordinates of the prediction picture (relative to the base address). In the SRC1 opcode: SRC1[0:15] = U, SRC1[31:16] = V. U and V are in 13.3 format; the fractional part is ignored
Input        | RefBlock            | 128 bits                           | Reference picture data. In the SRC2 opcode
Output       | Destination Operand | 4 x 16 bits                        | Least significant 32 bits of the 128-bit register. In the DST opcode
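The BlockAddress packing in Table 1 can be illustrated with a short decoding sketch. It assumes that bit 0 is the least significant bit of SRC1 and that ignoring the fractional part means dropping the low three bits with an arithmetic shift; both are assumptions, and the function name is hypothetical.

```python
from typing import Tuple


def unpack_block_address(src1: int) -> Tuple[int, int]:
    """Split a 32-bit SRC1 word into integer (U, V) pixel coordinates."""
    def to_signed16(value: int) -> int:
        # Interpret a 16-bit field as a two's-complement signed number.
        return value - 0x10000 if value & 0x8000 else value

    u_fixed = to_signed16(src1 & 0xFFFF)          # SRC1[15:0] holds U
    v_fixed = to_signed16((src1 >> 16) & 0xFFFF)  # SRC1[31:16] holds V
    # 13.3 fixed point: drop the 3 fractional bits. For negative values the
    # arithmetic shift rounds toward negative infinity (an assumption).
    return u_fixed >> 3, v_fixed >> 3
```

For instance, a word whose low half encodes 100.625 and whose high half encodes 52.0 decodes to the integer coordinates (100, 52).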

Several of the input parameters are used together to determine the address of the 4x4 block fetched by texture filter unit 1050. The BaseAddress parameter indicates where the texture data starts in the texture cache. The top-left coordinate of the block within this region is given by the BlockAddress parameter. The PictureHeight and PictureWidth input parameters are used to determine the extent of the block, i.e., its bottom-left coordinate. Finally, the video picture may be progressive or interlaced. An interlaced picture consists of two fields (top and bottom). Texture filter unit 1050 uses FieldFlag and TopFieldFlag to handle interlaced images properly.

c. Image data transformation

To execute the SAD instruction, video processing unit 1100 fetches the input pixel blocks from texture filter unit 1050 and transforms them into a format suitable for processing by SAD acceleration units 960-990. The pixel blocks are then provided to SAD acceleration units 960-990, which return SAD scores. The individual SAD scores are then accumulated into a destination register. These functions are described in more detail below.

Video processing unit 1100 receives two input parameters that define the 8x4 blocks for which the SAD scores are computed. The reference block data is specified directly in the SRC2 opcode: an 8x4x8-bit block treated as 128 bits of data.
In contrast, the SRC1 operand specifies the address of the prediction block rather than the data itself. The video processing unit 1100 provides this address to the texture filter unit 1050, which fetches the 128 bits of prediction block data from the texture cache 1060.

Although image data contains luminance (Y) and chrominance (Cb, Cr) planes, motion estimation typically uses only the Y component. Therefore, when the SAD instruction is executed, the pixel blocks operated on by the video processing unit 1100 contain only the Y component. In one embodiment, the video processing unit 1100 generates an inhibit signal that directs the texture filter unit 1050 not to fetch Cr/Cb pixel data from the texture cache 1060.

FIG. 11 is a block diagram of the texture filter unit 1050 and the texture cache 1060. The texture filter unit 1050 is designed to fetch from the texture cache 1060 on texel boundaries and to load 4x4 texel blocks from the texture cache 1060 into the filter input buffers 1110. When fetching data on behalf of the video processing unit 1100, a texel 1120 is treated as four channels (ARGB) of 32 bits each, for a texel size of 128 bits. When fetching data for the SAD instruction, the texture filter unit 1050 loads an 8x4x8-bit block. To handle alignment issues, the 8x4 block is loaded into two 4x4 pixel input buffers (1110A and 1110B).

The image data used by the video processing unit 1100 may be byte-aligned. However, the texture filter unit 1050 is designed to fetch from the cache on texel boundaries.
Therefore, when fetching data for the video processing unit 1100, the texture filter unit 1050 may need to fetch up to four texel-aligned 4x4 blocks surrounding a particular byte-aligned block, for each 4x4 half of the 8x4 block.

This process can be seen in FIG. 11, where the left-half 4x4 block (target block 1130) is not aligned on a texel boundary in either the vertical or the horizontal direction. In other words, the target block 1130 spans texel boundaries. The U, V address of the target block 1130 specifies the upper-left corner of the 4x4x8-bit, byte-aligned block. In this example, the texture filter unit 1050 determines that texels 1140, 1150, 1160, and 1170 should be fetched to obtain the target block 1130. After making this determination, the texture filter unit 1050 fetches blocks 1140-1170 and then combines bitwise-selected rows and columns from blocks 1140-1170, so that the leftmost 4x4 bits of the target block 1130 are written to filter buffer 1110B. A person skilled in the art will understand how multiplexers, shifters, and mask bits can be used to achieve this result regardless of the alignment of the 4x4 target fetched from the texture cache 1060.

In the embodiment shown in FIG. 11, when the target block 1130 spans a vertical texel boundary, the data is not reordered vertically. When this occurs, the data loaded into filter buffers 1110A and 1110B is in a different vertical order than the original order in the cache. In this embodiment, the video processing unit 1100 must vertically reorder (rotate) the 128 bits of reference block data to match the order of the prediction block.
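The gathering of a byte-aligned 4x4 target block from texel-aligned 4x4 blocks, as described for FIG. 11, can be illustrated with a minimal software sketch. It is not the disclosed hardware: the function names, the 2D-list image layout, and the use of Python slicing in place of multiplexers, shifters, and mask bits are all assumptions made for illustration. Unlike the FIG. 11 embodiment, this sketch returns the rows in their natural top-to-bottom order, so no later rotation is needed.

```python
TEXEL = 4  # assumption: the cache serves 4x4-pixel blocks aligned to multiples of 4


def fetch_aligned_block(plane, bu, bv):
    """Hypothetical stand-in for a texture-cache fetch of the aligned 4x4 block at (bu, bv)."""
    return [row[bu:bu + TEXEL] for row in plane[bv:bv + TEXEL]]


def gather_target_block(plane, u, v):
    """Assemble the 4x4 block whose upper-left pixel is (u, v) from aligned fetches.

    Returns the assembled block and the number of aligned blocks fetched (1, 2 or 4).
    """
    base_u, base_v = u - u % TEXEL, v - v % TEXEL
    # Fetch only the aligned blocks that the target block actually overlaps.
    us = [base_u] if u % TEXEL == 0 else [base_u, base_u + TEXEL]
    vs = [base_v] if v % TEXEL == 0 else [base_v, base_v + TEXEL]
    fetched = {(bu, bv): fetch_aligned_block(plane, bu, bv) for bv in vs for bu in us}

    target = []
    for y in range(v, v + TEXEL):
        row = []
        for x in range(u, u + TEXEL):
            bu, bv = x - x % TEXEL, y - y % TEXEL  # aligned block holding pixel (x, y)
            row.append(fetched[(bu, bv)][y - bv][x - bu])
        target.append(row)
    return target, len(fetched)
```

A target block that is misaligned in both directions, as in FIG. 11, causes four aligned fetches in this sketch; a fully aligned target needs only one.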
In another embodiment, before loading into one of the filter buffers 1110, the texture filter unit 1050 vertically reorders the cached texel data to match the original cache order.

Any process descriptions or blocks in the flowcharts should be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions. Those skilled in the software arts will understand that alternative implementations are also included within the scope of the disclosure. In these alternative implementations, functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.

The systems and methods disclosed herein can be implemented in software, hardware, or a combination thereof. In some embodiments, the system and/or method is implemented in software that is stored in a memory and executed by a suitable processor (including but not limited to a microprocessor, microcontroller, network processor, reconfigurable processor, or extensible processor) located in a computing device. In other embodiments, the system and/or method is implemented in logic circuitry, including but not limited to a programmable logic device (PLD), programmable gate array (PGA), field programmable gate array (FPGA), or application-specific integrated circuit (ASIC). In still other embodiments, the logic described herein is implemented within a graphics processor or graphics processing unit (GPU).

The systems and methods disclosed herein can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device.
Such instruction execution systems include any computer-based system, processor-containing system, or other system that can fetch and execute the instructions from the instruction execution system. In the context of this disclosure, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by, or in connection with, the instruction execution system. The computer-readable medium can be, for example but not limited to, a system or propagation medium based on electronic, magnetic, optical, electromagnetic, infrared, or semiconductor technology.

Specific examples of a computer-readable medium using electronic technology include (but are not limited to): an electrical connection (electronic) having one or more wires; a random access memory (RAM); a read-only memory (ROM); and an erasable programmable read-only memory (EPROM or flash memory). A specific example using magnetic technology includes (but is not limited to) a portable computer diskette. Specific examples using optical technology include (but are not limited to) an optical fiber and a portable compact disc read-only memory (CD-ROM).

Although the invention is illustrated and described herein with one or more specific examples as embodiments, it should not be limited to the details shown, since many modifications and structural changes may be made without departing from the spirit of the invention and within the scope and range of equivalents of the claims. Accordingly, the appended claims should be construed broadly and in a manner consistent with the scope of the invention; this statement is made before the claims that follow.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of an exemplary computing platform for graphics and video encoding and/or decoding.
FIG. 2 is a functional block diagram of the video encoder 160 of FIG. 1.
FIGS. 3A and 3B illustrate the division of a current picture into non-overlapping sections called macroblocks.
FIG. 4 is a flowchart of an exemplary embodiment of an algorithm used by the motion estimator of FIG. 2.
FIG. 5 is a flowchart of one embodiment of the conjugate gradient descent step of FIG. 4.
FIG. 6 illustrates an exemplary scenario using the conjugate gradient descent step 440 of FIG. 5.
FIG. 7 is a flowchart of one embodiment of the neighboring search algorithm of FIG. 4.
FIGS. 8A and 8B illustrate the relative positions of the five candidate macroblocks.
FIGS. 9A and 9B are block diagrams illustrating the operation of the SAD instruction on reference and prediction blocks.
FIG. 10 is a data flow diagram of the graphics processing unit of FIG. 1.
FIG. 11 is a block diagram of the texture filter unit and texture cache of FIG. 10.

[Description of Main Reference Numerals]

100: system; 110: host processor; 120: graphics processing unit (GPU); […]: memory; 140: bus; 150: video acceleration unit (VPU); 160: software decoder; 170: video acceleration driver.

205: picture; 210: subtractor; 220: motion estimator; 230: reference picture; 245: motion vector; 255: prediction block; 260: residual picture; 270: discrete cosine transformer; 280: quantizer; 290: entropy decoder; 2100: decoder.

310: current macroblock; 320: macroblock; 330: search window; 340: point.

400: process; 410: determine whether the motion vector will be inter-predicted or intra-predicted; 420: perform the conjugate gradient descent search algorithm; 430: perform a neighboring search; 440: perform an exhaustive search of a local area; […]: model the degree of match between the best candidate macroblock and the reference macroblock as a quadratic surface; 460: find a best candidate macroblock alignment on a fractional-pixel boundary; 470: compute a fractional motion vector from the matching macroblock.

505: initialize a candidate block; 510: compute the coordinates of the candidate macroblocks surrounding the candidate macroblock; 515: compute the SAD score of each of the five candidate macroblocks; 520: compute the X and Y gradients; 525: test whether the gradients are below a threshold; 530: compute the coordinates of four new candidate macroblocks; 535: perform conjugate gradient descent step 440 for each candidate macroblock; 540: compare whether the SAD score is below a threshold; 545: return the candidate macroblock with the lowest SAD score; 550: select a new center candidate macroblock; 555: compute new step values from the gradients; 560: test whether the iteration count exceeds a maximum; 565: return no match.

610C: candidate macroblock; 610L, 610R, 610T, 610B: four surrounding candidates; 620X, 620Y: gradients computed for the initial candidates; 630TL, 630TR, 630BL, 630BR: four new center candidate macroblocks; 640L, 640R, 640T, 640B: candidates; 670-680: candidates.

710: a flag variable TOPVALID is computed from the modulus of the address of the current macroblock 310 and the number of macroblocks per line. If this modulus is non-zero, TOPVALID is true; otherwise, TOPVALID is false.
720: a flag variable LEFTVALID is computed from the integer division of the address of the current macroblock by the number of macroblocks per line. If the result is non-zero, LEFTVALID is true; otherwise, LEFTVALID is false.
730: the TOPVALID and LEFTVALID variables are used in combination to determine the availability of the four candidate macroblocks neighboring the current macroblock.
740: availability is determined for a previous candidate macroblock P.
750: a SAD score is computed for each available candidate macroblock.

810-850: candidate macroblocks.

910-940: 4x4 blocks; 950: 4x4 reference block.

234: rotation logic; 950: prediction block; 960-990: SAD computation units.

1010: command stream processor; 1020: instructions; 1030: instruction data; 1040: execution unit pool; 1050: texture filter unit; 1060: texture cache; 1070: post-packer; 1100: video processing unit.

1120: texel; 1130: target block; 1140-1170: texels; 1110A-B: buffers.
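The SAD instruction data path named in the reference numerals (4x4 blocks 910-940, 4x4 reference block 950, SAD computation units 960-990) can be sketched in software as follows. This is a minimal illustration, not the disclosed logic: it assumes that the four SAD scores correspond to horizontal offsets of 0, 1, 2, and 3 pixels within an 8x4 prediction block, and the function names and list-of-rows block layout are hypothetical.

```python
def sad_4x4(ref, cand):
    """Sum of absolute differences between two 4x4 blocks given as lists of rows."""
    return sum(abs(r - c)
               for ref_row, cand_row in zip(ref, cand)
               for r, c in zip(ref_row, cand_row))


def sad_instruction(ref4x4, pred8x4):
    """Return four SAD scores, one per assumed horizontal offset 0..3.

    Each score compares the 4x4 reference block with the 4x4 block that starts
    `offset` pixels from the left edge of the 8-wide, 4-high prediction block.
    """
    scores = []
    for offset in range(4):
        candidate = [row[offset:offset + 4] for row in pred8x4]
        scores.append(sad_4x4(ref4x4, candidate))
    return scores
```

In hardware the four scores would be produced in parallel, one per computation unit; the loop here only models the result, not the timing.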

Claims

1. A graphics processing unit, comprising: an instruction decoder configured to decode a sum-of-absolute-differences (SAD) instruction into a plurality of parameters describing an M x N pixel block and an n x n pixel block in U, V coordinates, where M, N, and n are integers; and SAD acceleration logic configured to receive the plurality of parameters and to compute a plurality of SAD scores, each SAD score corresponding to the n x n pixel block and to one of a plurality of blocks that are contained within the M x N pixel block and are offset from the n x n pixel block.

2. The graphics processing unit of claim 1, wherein the SAD acceleration logic further comprises: a plurality of SAD computation units, each SAD computation unit configured to receive the n x n pixel block and one of the plurality of blocks contained within the M x N pixel block, and to compute a corresponding one of the plurality of SAD scores.

3. The graphics processing unit of claim 1, wherein the parameter describing the M x N pixel block specifies an address of the M x N pixel block in a texture cache.

4. The graphics processing unit of claim 1, wherein the parameter describing the pixel block specifies a relative pixel address in a texture cache.

5. The graphics processing unit of claim 1, wherein the offset is a horizontal offset.

6. The graphics processing unit of claim […], wherein the M x N pixel block represents a motion estimation prediction block and the n x n pixel block represents a motion estimation reference block.

7. The graphics processing unit of claim 2, wherein the plurality of SAD computation units process data in parallel.

8. The graphics processing unit of claim 2, further comprising first logic for accumulating the plurality of SAD scores into a destination register.

9. The graphics processing unit of claim 8, wherein the first logic stores the plurality of SAD scores into the destination register in an order, the order being determined by the blocks within the M x N pixel block.

10. The graphics processing unit of claim 2, further comprising: a texture cache configured to store pixel data in a texel format having a predetermined number of bits; and a texture filter unit configured to determine whether the M x N pixel block spans a texel boundary and, in response, to fetch from the texture cache one or more texel-aligned n x n blocks surrounding the M x N pixel block, and to combine bitwise-selected rows and columns from the texel-aligned n x n blocks such that the leftmost bits are written to a first filter buffer and the rightmost bits are written to a second filter buffer.

11. A graphics processing unit, comprising: a host processor interface that receives video acceleration instructions; and a video acceleration unit responsive to the video acceleration instructions, the video acceleration unit comprising SAD acceleration logic configured to receive a plurality of parameters and to compute a plurality of SAD scores, each SAD score corresponding to an n x n pixel block and to one of a plurality of blocks that are contained within an M x N pixel block and are offset from the n x n pixel block.

12. The graphics processing unit of claim 11, wherein the SAD acceleration logic further comprises: a plurality of SAD computation units, each SAD computation unit configured to receive the n x n pixel block and one of the plurality of blocks contained within the M x N pixel block, and to compute a corresponding one of the plurality of SAD scores.

13. The graphics processing unit of claim 12, wherein the plurality of SAD computation units process data in parallel.

14. The graphics processing unit of claim 11, further comprising first logic for accumulating the plurality of SAD scores into a destination register.

15. The graphics processing unit of claim […], wherein the first logic stores the plurality of SAD scores into the destination register in an order, the order being determined by the coordinates of each of the blocks within the M x N pixel block.

16. The graphics processing unit of claim 11, further comprising: a texture cache configured to store pixel data in a texel format having a predetermined number of bits; and a texture filter unit configured to determine whether the M x N pixel block spans a texel boundary and, in response, to fetch from the texture cache one or more texel-aligned n x n blocks surrounding the M x N pixel block, and to combine bitwise-selected rows and columns from the texel-aligned n x n blocks such that the leftmost bits are written to a first filter buffer and the rightmost bits are written to a second filter buffer.

17. The graphics processing unit of claim 11, wherein the parameter describing the M x N pixel block specifies a relative address of the M x N pixel block in a texture cache.

18. The graphics processing unit of claim 11, wherein the parameter describing the n x n pixel block directly specifies the pixel data.

19. The graphics processing unit of claim 11, wherein the M x N pixel block represents a motion estimation prediction block and the n x n pixel block represents a motion estimation reference block.

20. A method of computing a sum-of-absolute-differences (SAD) score for an M x M macroblock, where M is an integer, the method comprising: executing a SAD instruction to compute a first SAD score for a first n x n portion of the M x M macroblock, the first portion comprising an upper-left portion of the M x M macroblock, where n is an integer; executing the SAD instruction to compute a second SAD score for a second n x n portion of the M x M macroblock, the second portion comprising an upper-right portion of the M x M macroblock; accumulating the first and second SAD scores into a sum; executing the SAD instruction to compute a third SAD score for a third n x n portion of the M x M macroblock, the third portion comprising a lower-left portion of the M x M macroblock; adding the third SAD score to the sum; executing the SAD instruction to compute a fourth SAD score for a fourth n x n portion of the M x M macroblock, the fourth portion comprising a lower-right portion of the M x M macroblock; and adding the fourth SAD score to the sum.
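The accumulation recited in the method claim can be illustrated with a short sketch. It assumes n = M/2, so that the four n x n portions are the four quadrants of the macroblock, and it models each execution of the SAD instruction with a hypothetical helper `sad_block`; neither the helper name nor the list-of-rows layout comes from the disclosure.

```python
def sad_block(cur, ref, x0, y0, n):
    """Hypothetical stand-in for one SAD instruction over the n x n portion at (x0, y0)."""
    return sum(abs(cur[y][x] - ref[y][x])
               for y in range(y0, y0 + n)
               for x in range(x0, x0 + n))


def macroblock_sad(cur, ref, m=16):
    """SAD score of an m x m macroblock, accumulated quadrant by quadrant."""
    n = m // 2  # assumption: each n x n portion is one quadrant
    total = sad_block(cur, ref, 0, 0, n)   # first portion: upper left
    total += sad_block(cur, ref, n, 0, n)  # second portion: upper right
    total += sad_block(cur, ref, 0, n, n)  # third portion: lower left
    total += sad_block(cur, ref, n, n, n)  # fourth portion: lower right
    return total
```

Because the quadrants do not overlap and together cover the macroblock, the accumulated sum equals the SAD computed over the whole macroblock in one pass.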