CN101068353A

CN101068353A - Graphics processing unit and method for calculating the sum of absolute differences of a macroblock

Info

Publication number: CN101068353A
Application number: CNA2007101101936A
Authority: CN
Inventors: 扎伊尔德·荷圣; 约翰·柏拉勒斯; 徐建明
Original assignee: Via Technologies Inc
Current assignee: Via Technologies Inc
Priority date: 2006-06-16
Filing date: 2007-06-18
Publication date: 2007-11-07
Anticipated expiration: 2027-06-18
Also published as: TW200821986A; TWI482117B; CN101083763A; TW200816082A; TW200816820A; CN101068353B; TW200803525A; CN101068365A; TWI348654B; TW200803527A; CN101072351B; CN101068364B; CN101072351A; TWI383683B; TWI350109B; TWI444047B; CN101083764A; CN101083764B; CN101083763B; TWI395488B

Abstract

A graphics processing unit comprises: an instruction decoder configured to decode a sum-of-absolute-differences (SAD) instruction into a plurality of parameters that describe an M×N pixel block and an n×n pixel block in U, V coordinates, where M, N, and n are integers; and SAD acceleration logic configured to receive the parameters and compute a plurality of SAD values, each SAD value corresponding to the n×n pixel block and to one of a plurality of blocks that are contained in the M×N pixel block and horizontally offset from the n×n pixel block.

Description

Graphics processing unit and method for calculating the sum of absolute differences of a macroblock

Technical field

The present invention relates to a graphics processing unit, and more particularly to a graphics processing unit with image compression and decompression features.

Background technology

Personal computers and consumer electronics products are used for many kinds of entertainment. This entertainment can be roughly divided into two categories: that which uses computer-generated graphics, such as computer games; and that which uses compressed video streams, such as programs pre-recorded on a digital video disc (DVD), or digital programming delivered to a set-top box by a cable or satellite provider. The second category also includes the encoding of analog video streams, for example as performed by a digital video recorder (DVR).

Computer graphics are usually produced by a graphics processing unit (GPU). A GPU is a specialized microprocessor found on computer game consoles and some personal computers. A GPU is optimized to rapidly render three-dimensional primitive objects such as triangles and quadrilaterals. These primitives are described by a plurality of vertices, where each vertex has attributes (such as color), and textures can be applied to the primitives. The result of rendering is a two-dimensional array of pixels, which is shown on a computer display or monitor.

Encoding and decoding video streams involves different kinds of computation, for example discrete cosine transform, motion estimation, motion compensation, and deblocking filtering. These computations are usually handled by a general-purpose central processing unit (CPU) combined with special-purpose hardware logic, such as an application-specific integrated circuit (ASIC). Consumers therefore need multiple computing platforms to satisfy their entertainment needs. A single computing platform that can handle both computer graphics and video encoding and decoding is therefore needed.

Summary of the invention

One aspect of the present invention is a graphics processing unit comprising: an instruction decoder configured to decode a sum-of-absolute-differences (SAD) instruction into a plurality of parameters that describe an M×N pixel block and an n×n pixel block in U, V coordinates, where M, N, and n are integers; and SAD acceleration logic configured to receive the parameters and compute a plurality of SAD values, each SAD value corresponding to the n×n pixel block and to a block that is contained in the M×N pixel block and offset from the n×n pixel block.

Another aspect of the present invention is a graphics processing unit comprising: a host interface that receives video acceleration instructions; and a video acceleration unit that responds to the video acceleration instructions. The video acceleration unit comprises SAD acceleration logic configured to receive a plurality of parameters and compute a plurality of SAD values, each SAD value corresponding to the n×n pixel block and to one of a plurality of blocks that are contained in the M×N pixel block and offset from the n×n pixel block.

Another aspect of the present invention is a method of calculating the sum of absolute differences (SAD) of an M×N macroblock, where M and N are integers. The method comprises: executing a SAD instruction to compute a first SAD value for a first n×n portion of an M×M macroblock, the first portion comprising an upper left portion of the M×M macroblock, where n is an integer; executing the SAD instruction to compute a second SAD value for a second n×n portion of the M×M macroblock, the second portion comprising an upper right portion of the M×M macroblock; accumulating the first and second SAD values into a sum; executing the SAD instruction to compute a third SAD value for a third n×n portion of the M×M macroblock, the third portion comprising a lower left portion of the M×M macroblock; adding the third SAD value to the sum; executing the SAD instruction to compute a fourth SAD value for a fourth n×n portion of the M×M macroblock, the fourth portion comprising a lower right portion of the M×M macroblock; and adding the fourth SAD value to the sum.
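The accumulation described in this method can be sketched in software as follows. This is only an illustrative model, assuming M = 2n so that the four n×n portions tile the macroblock exactly; `sad_block` is a hypothetical stand-in for the SAD instruction, and the images are plain lists of pixel rows.

```python
def sad_block(ref, pred, x0, y0, n):
    """Hypothetical stand-in for the SAD instruction: SAD of one n x n portion
    whose upper left corner is at column x0, row y0."""
    return sum(
        abs(ref[y][x] - pred[y][x])
        for y in range(y0, y0 + n)
        for x in range(x0, x0 + n)
    )


def macroblock_sad(ref, pred, m=8, n=4):
    """Accumulate the SAD of an m x m macroblock from four n x n portions,
    visited in the order upper left, upper right, lower left, lower right."""
    assert m == 2 * n, "this sketch assumes the four portions tile the macroblock"
    total = 0
    for y0, x0 in ((0, 0), (0, n), (n, 0), (n, n)):
        total += sad_block(ref, pred, x0, y0, n)
    return total
```

Under these assumptions the accumulated value equals the SAD computed over the whole macroblock in one pass, which is what makes the portion-by-portion approach usable.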

Description of drawings

Fig. 1 is a block diagram of an exemplary computing platform for graphics and video encoding and/or decoding.

Fig. 2 is a functional block diagram of the video encoder 160 of Fig. 1.

Figs. 3A and 3B illustrate dividing the current image into non-overlapping sections called macroblocks.

Fig. 4 is a flow chart of an exemplary embodiment of the algorithm used by the motion estimator of Fig. 2.

Fig. 5 is a flow chart of an embodiment of the conjugate gradient descent step 440 of Fig. 4.

Fig. 6 illustrates an example scenario using the conjugate gradient descent step 440 of Fig. 5.

Fig. 7 is a flow chart of an embodiment of the neighbor search algorithm of Fig. 4.

Figs. 8A and 8B illustrate the relative positions of the 5 candidate macroblocks used by the neighbor search algorithm of Fig. 7.

Figs. 9A and 9B are block diagrams illustrating the operation of the SAD instruction on reference and prediction blocks.

Fig. 10 is a data flow diagram of the graphics processing unit of Fig. 1.

Fig. 11 is a block diagram of the texture filter unit and texture cache of Fig. 10.

Description of reference numerals

100～system, 110～host processor, 120～graphics processing unit (GPU), 130～memory,

140～bus, 150～video acceleration unit (VPU), 160～software encoder,

170～video acceleration driver.

205～image, 210～subtractor, 220～motion estimator, 230～reference image,

245～motion vector, 255～prediction block, 260～residual image,

270～discrete cosine transformer, 280～quantizer, 290～entropy encoder, 2100～decoder.

310～current macroblock, 320～search window, 330～point.

400～process, 410～determine whether motion vectors are to be inter-predicted or intra-predicted,

420～perform conjugate gradient descent search algorithm, 430～perform neighbor search,

440～perform local area exhaustive search,

450～model the degree of match between the best candidate macroblock and the reference macroblock as a quadratic surface,

460～find a best candidate macroblock aligned on a fractional-pixel boundary,

470～compute a fractional motion vector from the matching macroblock,

505～initialize a candidate block,

510～compute the coordinates of the candidate macroblocks surrounding candidate macroblock C_{x,y},

515～compute the SAD of each of the 5 candidate macroblocks,

520～compute gradients g_x and g_y, 525～is the gradient below a threshold,

530～compute the coordinates of four new candidate macroblocks,

535～perform conjugate gradient descent step 440 on each candidate macroblock,

540～is the SAD value below a threshold,

545～return the candidate macroblock having the lowest SAD value,

550～select a new center candidate macroblock,

555～compute new step values Δ_x and Δ_y from gradients g_x and g_y,

560～is the iteration count greater than a maximum, 565～return no match,

610C～candidate macroblock, 610L-610R-610T-610B～four surrounding candidates,

620X-620Y～gradients computed from the initial candidates,

630TL-630TR-630BL-630BR～four new center candidate macroblocks, 640L-640R-640T-640B～candidates, 670-680～candidates

710～compute a flag variable TOPVALID from the modulus of the address of current macroblock 310 and the number of macroblocks per row; if the modulus is non-zero, TOPVALID is true, otherwise TOPVALID is false.

720～compute a flag variable LEFTVALID from the integer division of the current macroblock address by the number of macroblocks per row; if the quotient is non-zero, LEFTVALID is true, otherwise LEFTVALID is false.

730～use the TOPVALID and LEFTVALID variables together to determine the availability of the 4 candidate macroblocks adjacent to the current macroblock.

740～determine availability for a previous candidate macroblock P.

750～compute the SAD for each available candidate macroblock.

810-850～candidate macroblocks.

910-940～4×4 blocks, 950～4×4 reference block.

234～rotation logic, 950～prediction block, 960-990～SAD computation units,

1010～command stream processor, 1020～instruction, 1030～instruction data,

1040～execution unit pool, 1050～texture filter unit, 1060～texture cache,

1070～post-packer, 1100～video processing unit, 1120～texture image,

1130～target block, 1140-1170～texture image, 1110A-B～buffers.

Embodiment

The embodiments disclosed here provide systems and methods that use a graphics processing unit to accelerate motion estimation.

1. Computing platform for video encoding

Fig. 1 is a block diagram of an exemplary computing platform for graphics and video encoding and/or decoding. System 100 comprises a general-purpose CPU 110 (hereinafter the host processor), a graphics processing unit (GPU) 120, memory 130, and a bus 140. Graphics processing unit 120 includes a video acceleration unit (VPU) 150 that accelerates video encoding and/or decoding, as described later. The video acceleration functions of graphics processing unit 120 are exposed as instructions that execute on graphics processing unit 120.

Software encoder 160 and video acceleration driver 170 reside in memory 130, and encoder 160 executes on host processor 110. Through an interface provided by video acceleration driver 170, encoder 160 can also issue video acceleration instructions to graphics processing unit 120. In this way, system 100 performs video encoding through host processor software that issues video acceleration instructions to graphics processing unit 120. Frequently executed, computationally intensive blocks are thereby offloaded to graphics processing unit 120, while more complex operations are performed by host processor 110.

Fig. 1 omits several conventional elements that are known to those skilled in the art and are not needed to explain the video acceleration features of graphics processing unit 120. A brief overview of video encoding follows, and then a discussion of how one video encoding component (the motion estimator) uses the video acceleration functions provided by graphics processing unit 120.

2. Video encoder

Fig. 2 is a functional block diagram of the video encoder 160 of Fig. 1. The image (205) input to encoder 160 is made up of pixels. Encoder 160 works by exploiting temporal and spatial similarities within image 205, encoding these similarities by determining the differences within a frame (spatial) and/or between frames (temporal). Spatial encoding exploits the fact that neighboring pixels in an image often have the same or related values, so only the differences are encoded. Temporal encoding exploits the fact that many pixels in a sequence of images often have the same value, so only the differences between images are encoded. Encoder 160 also exploits statistical redundancy through entropy coding: some patterns occur more often than others, so the frequent ones are represented with shorter codes. Examples of entropy coding include Huffman coding, run-length encoding, arithmetic coding, and context-adaptive binary arithmetic coding.

In this exemplary embodiment, blocks of input image 205 are provided to a subtractor 210 and a motion estimator 220. Motion estimator 220 compares blocks in input image 205 with a previously stored reference image 230 to find similar blocks. Motion estimator 220 computes a set of motion vectors 245 that represent the displacement between matching blocks. The motion vectors 245, together with the matching blocks of reference image 230, are referred to as prediction block 255 and represent the temporal encoding.

Prediction block 255 is provided to subtractor 210, which subtracts prediction block 255 from input image 205 to produce a residual image 260. Residual image 260 is provided to a discrete cosine transform (DCT) block 270 and a quantizer 280, which perform the spatial encoding. The output of quantizer 280 (for example, a set of quantized DCT coefficients) is encoded by entropy encoder 290.

For certain types of images (intra or I frames, and predicted or P frames), the spatially encoded residual from quantizer 280 is provided to an internal decoder. The decoder uses the spatially encoded residual, in combination with the motion vectors 245 produced by motion estimator 220, to decode the spatially encoded image 205. The reconstructed image is stored in a reference image buffer 295, which is provided to motion estimator 220 as described above.

As discussed in connection with Fig. 1, encoder 160 executes on host processor 110 but also uses the video acceleration instructions provided by graphics processing unit 120. In particular, the algorithm implemented by motion estimator 220 uses the sum-of-absolute-differences (SAD) instruction provided by graphics processing unit 120 to achieve accurate motion estimation at a relatively low computational cost. The motion estimation algorithm is described in detail next.

3. Software motion estimation algorithm

A. Search window

As shown in Fig. 3, motion estimator 220 divides the current image 205 into non-overlapping sections called macroblocks. The size of a macroblock can vary with the standard used by the encoder (for example, MPEG-2, H.264, VC) and with the size of the image.

In the exemplary embodiments described here, as in various coding standards, a macroblock is 16×16 pixels. A macroblock is further divided into blocks, whose size can be 4×4, 8×8, 4×8, 16×8, or 8×16.
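As a rough illustration of this partitioning, the sketch below lists the upper left corner of every non-overlapping 16×16 macroblock in raster order. The function name is hypothetical, and the sketch assumes that partial macroblocks at the right and bottom edges are simply skipped, which the text does not address.

```python
def macroblock_origins(width, height, mb_size=16):
    """Yield the (x, y) upper left corner of each non-overlapping macroblock,
    row by row. Partial macroblocks at the image edges are skipped (assumption)."""
    for y in range(0, height - mb_size + 1, mb_size):
        for x in range(0, width - mb_size + 1, mb_size):
            yield (x, y)


# Example: a 64 x 48 image holds 4 macroblocks per row and 3 rows.
origins = list(macroblock_origins(64, 48))
```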

In MPEG-2, each macroblock can have only one motion vector, so motion estimation is performed per macroblock. H.264 allows up to 32 motion vectors (depending on the level), so in H.264 motion estimation is computed on a 4×4 or 8×8 block basis. In a variant of H.264 called AVS, the motion block is always 8×8. In VC-1, it can be 4×4 or 8×8.

Motion estimation algorithm 220 performs motion estimation on each macroblock in current image 205, with the goal of finding a block in a previously encoded reference image 230 that is similar to the macroblock in current image 205. The displacement between the macroblock in reference image 230 and the macroblock in current image 205 is computed and stored as a motion vector (245, Fig. 2).

For ease of description, the motion estimation process is explained with reference to one particular macroblock (310) in current image 205. The macroblock 310 chosen for this example is in the center of current image 205, but the same technique applies to other macroblocks.

A search window (320) is centered on the macroblock in reference image 230 that corresponds to macroblock 310 of current image 205. That is, if macroblock 310 is located at (X, Y), then search window 320 in reference image 230 is also located at (X, Y), as shown at point 330. Other embodiments place the macroblock in another part of reference image 230, for example the upper left. The search window 320 in the example of Fig. 3 extends two pixels beyond the corresponding macroblock in the horizontal direction and one pixel in the vertical direction. Search window 320 therefore contains 14 different macroblocks: two macroblocks located 1 and 2 pixels, respectively, directly to the left of position 330; another pair to the right of position 330; and the remaining groups above, below, upper left, upper right, lower left, and lower right of position 330.
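The count of 14 can be checked by enumerating the candidate offsets. The sketch below assumes that the 14 macroblocks are the positions reachable by shifting up to 2 pixels horizontally and 1 pixel vertically, not counting the co-located position itself; the function name is illustrative.

```python
def search_window_offsets(h_range=2, v_range=1):
    """List the (dx, dy) displacements of candidate macroblocks in the search
    window, excluding the co-located position (0, 0)."""
    return [
        (dx, dy)
        for dy in range(-v_range, v_range + 1)
        for dx in range(-h_range, h_range + 1)
        if (dx, dy) != (0, 0)
    ]


offsets = search_window_offsets()
# 5 horizontal positions x 3 vertical positions, minus the center, gives 14.
```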

The block-matching motion computation performed by motion estimator 220 uses the sum of absolute differences (SAD) as the criterion for judging similarity (match) between macroblocks. To compute a SAD, the absolute difference of each pair of corresponding pixel values is calculated, and these absolute differences are summed over all pixels in the block, as understood by those skilled in the art. Motion estimator 220 combines the SAD criterion with an inventive method of selecting the target macroblocks whose similarity is to be measured, as described below.
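The SAD criterion itself is simple enough to state in a few lines. The following is a plain software model for illustration, not the hardware instruction; blocks are assumed to be equally sized lists of pixel rows.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


# Pixel-wise differences are 3, 0, 2, 4, so the SAD is 9.
example = sad([[1, 2], [3, 4]], [[4, 2], [1, 8]])
```

A lower SAD means the two blocks are more alike; identical blocks give 0.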

B. Selecting target macroblocks

Motion estimator 220 uses different search methods depending on whether it is producing intra-coded motion vectors or inter-coded motion vectors for current image 205. Motion estimator 220 uses prior knowledge about real-world motion to predict where in search window 320 the matching macroblock is likely to be, which reduces the number of target blocks in search window 320 that are actually tested for similarity against macroblock 310 in current image 205. In the real world, objects typically move with constant acceleration, which means that the motion of objects within a frame (the optical flow) can be expected to be smooth and similar (that is, substantially continuous), both spatially and temporally. In addition, the SAD surface (that is, a plot of SAD values over the search space) is expected to be relatively smooth (that is, to have a relatively small number of local minima).

By using this prior knowledge to direct the search toward the places where a match is most likely to be found, the algorithm disclosed here reduces the number of searches that must be performed to find a good minimum. The algorithm is therefore computationally efficient while still effectively locating a good match.

Fig. 4 is a flow chart of the algorithm used by an exemplary embodiment of motion estimator 220 to compute the motion vector of a current macroblock 310 in current image 205. The motion estimation process begins at step 410, which determines whether the motion vectors produced by motion estimator 220 for current image 205 are to be inter-predicted or intra-predicted. If intra prediction is used, processing continues at step 420, where a conjugate gradient descent search algorithm is performed to find a prediction macroblock in search window 320 that is a good match for the reference macroblock (the current macroblock 310 in current image 205). The conjugate gradient descent search algorithm (step 420) is described in detail in connection with Figs. 5 and 6.

Returning to step 410, if inter prediction is used to produce the motion vectors, processing continues at step 430, where a "neighbor" or "neighborhood" search is performed. This search includes the macroblocks adjacent to current macroblock 310 in current image 205, as well as the corresponding macroblock in the previously encoded reference image 230. The neighbor search algorithm (step 430) is described in detail in connection with Figs. 7 and 8.

The conjugate gradient descent search algorithm (step 420) and the neighbor search algorithm (step 430) each identify a good, or acceptable, match from a larger set of target prediction macroblocks. Those skilled in the art will appreciate that the criterion for deciding what counts as a "good match" can be relative or absolute. For example, the neighbor search algorithm described here uses an absolute criterion: the target macroblock with the lowest score is treated as the good match. The conjugate gradient descent search algorithm described here, by contrast, uses a threshold: the first block whose SAD value is below the threshold is treated as the good match. The choice of threshold criterion is a design or implementation decision.
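The two match criteria can be contrasted with a small sketch. Both helper names are hypothetical, and `score` stands for any function that returns a SAD value for a candidate.

```python
def best_by_lowest_score(candidates, score):
    """Lowest-score criterion: evaluate every candidate and keep the smallest."""
    return min(candidates, key=score)


def first_below_threshold(candidates, score, threshold):
    """Threshold criterion: stop at the first candidate whose score is below
    the threshold; return None if no candidate qualifies."""
    for candidate in candidates:
        if score(candidate) < threshold:
            return candidate
    return None


scores = {"a": 40, "b": 25, "c": 10}
names = ["a", "b", "c"]
lowest = best_by_lowest_score(names, scores.get)        # "c"
early = first_below_threshold(names, scores.get, 30)    # "b"
```

The threshold criterion can stop early and so evaluate fewer candidates, at the cost of possibly accepting a match that is not the lowest-scoring one.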

After step 420 or 430 has identified a good candidate match, step 440 further performs a local area exhaustive search to find the best candidate. The search area lies in the neighborhood of the good candidate macroblock identified by step 420 or 430. In some embodiments, after the conjugate gradient descent search of step 420 (that is, in the intra prediction case), the local exhaustive search covers the 4 diagonal positions immediately outside the local minimum (the good candidate) identified by step 420. For example, if the step value used in the last gradient descent step was 1, the search is limited to the points at (±1, ±1) from the good candidate. In some embodiments, after step 430 (that is, in the inter prediction case), the local exhaustive search (step 440) covers the candidates in a small region around the good candidate macroblock, typically (±2, ±2).
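A minimal sketch of this refinement is given below. The helper names are hypothetical, `cost` stands for the SAD of the candidate at a given position, and the interpretation of (±2, ±2) as a full square region around the good candidate is an assumption.

```python
def diagonal_offsets(step):
    """The 4 diagonal neighbors at (+/-step, +/-step)."""
    return [(-step, -step), (step, -step), (-step, step), (step, step)]


def square_offsets(radius):
    """Every offset within +/-radius in x and y, excluding (0, 0)."""
    return [
        (dx, dy)
        for dy in range(-radius, radius + 1)
        for dx in range(-radius, radius + 1)
        if (dx, dy) != (0, 0)
    ]


def refine(center, offsets, cost):
    """Exhaustively test each offset around center; return the lowest-cost position."""
    best, best_cost = center, cost(center)
    for dx, dy in offsets:
        candidate = (center[0] + dx, center[1] + dy)
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best
```

For example, `refine(good, diagonal_offsets(1), cost)` would model the intra prediction case, and `refine(good, square_offsets(2), cost)` the inter prediction case.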

The local exhaustive search of step 440 narrows the good candidate macroblocks down to a best candidate macroblock that is pixel-aligned, that is, has integer pixel resolution. Steps 450 and 460 find a best candidate macroblock aligned on a fractional-pixel boundary. Conventional fractional motion search algorithms use codec-specific filtering algorithms to interpolate pixel values at fractional positions from the surrounding integer positions. In contrast, step 450 models the degree of match between the best candidate macroblock and the reference macroblock as a quadratic surface, and step 460 analytically determines the minimum of this surface. The minimum corresponds to the best matching macroblock at fractional, non-integer resolution. (The inventive modeling method for determining the best matching macroblock at fractional resolution is explained in a later section.) After the matching macroblock with fractional resolution has been identified in step 450, processing continues at step 470, which computes a fractional motion vector from this matching macroblock using techniques known to those skilled in the art. Process 400 is then complete.
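The text defers the details of the quadratic model to a later section, so the sketch below is not the method of this disclosure. It only illustrates the general idea with a common simplification: fit a one-dimensional parabola through three SAD samples along each axis and take its analytic minimum. All names are hypothetical.

```python
def parabola_min_offset(s_minus, s_center, s_plus):
    """Offset of the minimum of the parabola through (-1, s_minus),
    (0, s_center), (1, s_plus). Returns 0.0 when the samples do not
    curve upward, because no interior minimum exists in that case."""
    denom = s_minus - 2 * s_center + s_plus
    if denom <= 0:
        return 0.0
    return 0.5 * (s_minus - s_plus) / denom


def fractional_minimum(sad_left, sad_right, sad_top, sad_bottom, sad_center):
    """Separable estimate of the sub-pixel offset (dx, dy) from the
    best integer-aligned candidate."""
    return (
        parabola_min_offset(sad_left, sad_center, sad_right),
        parabola_min_offset(sad_top, sad_center, sad_bottom),
    )
```

Because the minimum is obtained in closed form, no pixel interpolation at fractional positions is needed to locate it, which is the contrast with conventional approaches drawn in the text.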

Those skilled in the art will appreciate that the algorithm above is inherently sequential, because it uses information from neighboring regions. Although conventional hardware-accelerated designs usually avoid sequential algorithms, a sequential design is appropriate here for several reasons. First, pixel data is read in sequential raster fashion, so it can be prefetched and held in a line buffer. Second, in embodiments containing a single SAD acceleration unit, performance is limited by whether this unit can be kept fully loaded, not by the sequential processing. The SAD acceleration unit can stay highly loaded as long as there are not many cache misses on the prediction blocks. Because the miss rate is a function of cache size, and an HDTV-resolution image needs only 1920/8 = <1 KB of motion vectors in the cache, a low cache miss rate can be expected.

C. Intra-predicted motion vectors using conjugate gradient descent

Fig. 5 is a flow chart of the conjugate gradient descent step 440 of Fig. 4, as performed by an embodiment of motion estimator 220. As described above, step 440 is performed when intra prediction is used, to find a macroblock in search window 320 that is a good (acceptable) match for current block 310. SAD values are computed for a set of 5 initial candidates: the current macroblock and the macroblocks above, below, left, and right of it. From the 5 SAD values of this initial set, two mutually perpendicular gradients are computed. From these two gradients, the direction of steepest gradient is obtained. If the gradient is relatively shallow, or the 5 initial candidate macroblocks have very similar SAD values, the search is extended farther from the current macroblock, because this region is unlikely to contain a candidate that is a good local minimum. Following this overview, conjugate gradient descent step 440 is described in more detail below.

The process begins at step 505, where a candidate block C_{x,y} and step values Δ_x and Δ_y are initialized. In one embodiment, candidate macroblock C_{x,y} is set to the upper left corner of search window 320, and step values Δ_x and Δ_y are both set to a small integer value, for example 8. Next, at step 510, the coordinates of the candidate macroblocks surrounding C_{x,y} are computed. These four candidate macroblocks lie above, below, left, and right of candidate macroblock C_{x,y}. That is,

T = (C_x, -Δ_y + C_y); R = (Δ_x + C_x, C_y); B = (C_x, Δ_y + C_y); L = (-Δ_x + C_x, C_y)

Processing continues at step 515, where the SAD of each of the 5 candidate macroblocks (the original and the four surrounding it) is computed. At step 520, gradients g_x and g_y are computed. Gradient g_x is the difference between the SAD values of the left and right macroblocks. Gradient g_y is the difference between the SAD values of the top and bottom macroblocks. The gradients thus indicate whether the matching error between candidate macroblocks is increasing or decreasing in the x or y direction. At step 525, the gradient is compared with a threshold. If the gradient is below the threshold (that is, the gradient is relatively shallow), this indicates that there is no local minimum in the current search area, so the search is extended to new candidate macroblocks. These new candidate macroblocks are farther away from the original candidate macroblock C_{x,y}. In some embodiments, the search is also extended when the SAD values computed for the candidate macroblocks at step 515 are similar. The extended search continues at step 530, where the coordinates of four new candidate macroblocks are computed. Whereas the original four candidate macroblocks lie above, below, left, and right of C_{x,y} at distance (Δ_x, Δ_y), the four new candidate macroblocks are chosen to form the corners of a square around the original candidate macroblock C_{x,y}, at distance (Δ_x, Δ_y): TL = (-Δ_x + C_x, -Δ_y + C_y); TR = (Δ_x + C_x, -Δ_y + C_y); BL = (-Δ_x + C_x, Δ_y + C_y); BR = (Δ_x + C_x, Δ_y + C_y)
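Steps 510 to 530 can be sketched as follows. The function names are hypothetical, and `sad_of` stands for whatever computes the SAD of the candidate at a given position. Two details are assumptions because the text does not fix them: the sign convention of the gradients (right minus left, bottom minus top), and treating "the gradient" in step 525 as the larger of the two magnitudes. The corner signs follow the same convention as T, R, B, and L, with y growing downward.

```python
def cross_candidates(cx, cy, dx, dy):
    """Step 510: candidates above, right of, below, and left of C."""
    return {"T": (cx, cy - dy), "R": (cx + dx, cy),
            "B": (cx, cy + dy), "L": (cx - dx, cy)}


def gradients(sad_of, cross):
    """Step 520. Assumed sign convention: right minus left, bottom minus top."""
    gx = sad_of(cross["R"]) - sad_of(cross["L"])
    gy = sad_of(cross["B"]) - sad_of(cross["T"])
    return gx, gy


def is_shallow(gx, gy, threshold):
    """Step 525, assuming the larger gradient magnitude is what gets compared."""
    return max(abs(gx), abs(gy)) < threshold


def corner_candidates(cx, cy, dx, dy):
    """Step 530: corners of the square around C at distance (dx, dy)."""
    return {"TL": (cx - dx, cy - dy), "TR": (cx + dx, cy - dy),
            "BL": (cx - dx, cy + dy), "BR": (cx + dx, cy + dy)}
```

With C = (8, 8) and Δ_x = Δ_y = 8, for instance, the cross candidates are (8, 0), (16, 8), (8, 16), and (0, 8), and the corner candidates are (0, 0), (16, 0), (0, 16), and (16, 16).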

In step 535, to the huge segment of these new candidates (C, TL, TR, BL BR) carries out conjugate gradient decline step 440 respectively.

The gradient ratio of getting back to step 525, if the gradient of being calculated at huge segment 520 is equal to or greater than this critical value (promptly this gradient is relatively precipitous), then the absolute difference that calculates in step 515 in step 540 adds total value and a critical value is made comparisons.If this absolute difference adds total value and is lower than this critical value, then preferable conforming to found in expression, and then step 440 is got back to calling set (in step 545), provides this calling set to have minimum absolute difference to add the huge segment of candidate of total value.

Add total value as if this absolute difference of testing in step 540 and be equal to or less than this critical value, preferable conforming to do not found in expression, so adjust Search Area.In step 550, select a new huge segment C of central candidate _{X, y}New central huge segment is C, TL, and TR, BL calculates the square that minimum absolute difference adds total value in step 515 in the BR candidate set.Then, in step 555, from gradient g _xWith g _yCalculate new step value Δ _xWith Δ _y, Δ for example _x=Δ _x* g _xThe precipitous acceptable huge segment that conforms to of gradient representative is that present central candidate is far, so increase (Δ _x, Δ _y).On the contrary, the shallow acceptable huge segment that conforms to of gradient representative is that present central candidate is very near, so should reduce (Δ _x, Δ _y).The personage who is familiar with this skill should recognize that various coefficient can be used for calculating (Δ from each gradient _x, Δ _y) to reach this result.

Next, the iteration count is tested in step 560. If the count exceeds a maximum, step 440 finishes at step 565 without having found an acceptable match. Otherwise, the error gradient is used to select a new set of candidate macroblocks that is expected to lie closer to the final match, and the gradient descent step 440 returns to step 510, where this new set is processed. The conjugate gradient descent step 440 therefore finishes in one of two cases: when an acceptable match is found (step 545), or when the maximum number of iterations is reached without a match (step 565).
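The decisions taken in steps 520 to 550 can be summarized in a short sketch. Several details below are assumptions rather than part of the description: the sign convention of the gradients (left minus right, top minus bottom), the use of the larger gradient magnitude as the value compared with the threshold, and the names `next_action`, `grad_threshold` and `sad_threshold`.

```python
def next_action(sad, grad_threshold, sad_threshold):
    """Decide the next step of the search from five SAD values.

    sad maps the keys 'C', 'L', 'R', 'T', 'B' to SAD values.
    Returns (action, best_candidate, (g_x, g_y)).
    """
    g_x = sad["L"] - sad["R"]             # step 520, assumed sign convention
    g_y = sad["T"] - sad["B"]
    steepness = max(abs(g_x), abs(g_y))   # assumed scalar measure of the gradient
    best = min(sad, key=sad.get)          # candidate with the lowest SAD
    if steepness < grad_threshold:        # step 525: shallow gradient
        return "expand", best, (g_x, g_y)
    if sad[best] < sad_threshold:         # step 540: good match found
        return "match", best, (g_x, g_y)
    return "recenter", best, (g_x, g_y)   # step 550: move the centre
```

A caller would loop on this function, stopping on "match" or when the iteration limit of step 560 is reached.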

Fig. 6 illustrates an example scenario using the conjugate gradient descent step 440. The initial candidate macroblock C_{x,y} is shown as a square (610C), and the four surrounding candidates as circles (610T, 610L, 610R, 610B). The gradients g_x and g_y (620X, 620Y) are computed from these initial candidates. In this scenario, the gradients are too shallow and no SAD is below the threshold. The search is therefore extended using four new center candidate macroblocks, shown as triangles (630TL, 630TR, 630BL, 630BR). These new candidates lie at the corners around the original candidate macroblock C_{x,y}, at distance Δ.

The macroblocks surrounding these center candidates, shown as hexagons (640L₁, 640T₁, 640T₂, 640R₂, 640L₃, 640B₃, 640B₄, 640R₄), are then selected as candidates. In this scenario, two of the candidates 640 have SADs below the threshold and "steep" gradients (650XY, 660XY). A further candidate is selected along each "steep" gradient: candidate 670 according to gradient 650XY, and candidate 680 according to gradient 660XY. The gradient descent search continues with these new candidates 670 and 680, following the conjugate gradient descent step 440.

D. Using motion vectors predicted from neighboring macroblocks and the previous frame

Fig. 7 is a flow chart of the neighboring search algorithm of Fig. 4 (step 430), as performed by one embodiment of motion estimator 220. As noted earlier, the candidate macroblocks for this search include the macroblocks adjacent to the current macroblock 310 (the one being encoded) in the current picture 205. The corresponding macroblock in the previously coded reference picture 230 is also included as a candidate.

Computation of the candidate macroblock coordinates begins at step 710, where a flag variable TOPVALID is computed from the modulus (remainder) of the address of the current macroblock 310 and the number of macroblocks per row. If the modulus is nonzero, TOPVALID is true; otherwise TOPVALID is false. In step 720, a flag variable LEFTVALID is computed by integer division of the address of the current macroblock 310 by the number of macroblocks per row. If the quotient is nonzero, LEFTVALID is true; otherwise LEFTVALID is false. The TOPVALID and LEFTVALID variables indicate whether the current macroblock 310 has a neighboring macroblock above it and to its left, respectively, taking the top and left edges of the picture into account.

In step 730, the TOPVALID and LEFTVALID variables are used in combination to determine the availability, or existence, of the 4 candidate macroblocks adjacent to the current macroblock 310. Specifically: a macroblock L exists to the left if (LEFTVALID); a macroblock T exists above if (TOPVALID); a macroblock TL exists to the upper left if (TOPVALID & LEFTVALID); and a macroblock TR exists to the upper right if (TOPVALID & RIGHTVALID). Next, in step 740, availability is determined for a previous candidate macroblock P, which is the macroblock in the previously coded reference picture 230 that spatially corresponds to the current macroblock 310. The relative positions of these 5 candidate macroblocks can be seen in Fig. 8, where L is 810, T is 820, TL is 830, TR is 840, and P is 850.
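The availability tests of steps 710 to 730 can be sketched as below. The sketch assumes raster-scan macroblock addresses, which the text does not state. Under that assumption the remainder gives the column and the quotient gives the row, so the sketch derives the left flag from the remainder and the top flag from the quotient. The RIGHTVALID test is also an assumption, since the text uses that flag without defining it, and the function name is hypothetical.

```python
def neighbour_availability(mb_addr, mbs_per_row):
    """Return which of the neighbours L, T, TL, TR exist for a macroblock.

    Assumes raster-scan addressing: address = row * mbs_per_row + column.
    """
    col = mb_addr % mbs_per_row
    row = mb_addr // mbs_per_row
    left_valid = col != 0                   # not in the first column
    top_valid = row != 0                    # not in the first row
    right_valid = col != mbs_per_row - 1    # assumed definition of RIGHTVALID
    return {
        "L": left_valid,
        "T": top_valid,
        "TL": top_valid and left_valid,
        "TR": top_valid and right_valid,
    }
```

For a picture 4 macroblocks wide, address 5 (row 1, column 1) has all four neighbours, while address 7 (row 1, column 3) has no TR neighbour.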

Returning to Fig. 7, steps 730 and 740 determine how many candidate macroblocks are available for comparison (from 1 to 5). Step 750 computes the SAD for each available candidate macroblock. If all 5 candidates are available, the set of SADs is:

\left\{ 0,\ L,\ T,\ P,\ \frac{L + T}{2},\ \operatorname{med}(L, T, TL),\ \frac{L + \operatorname{med}(T, TL, TR)}{2},\ \operatorname{med}(T, TL, TR) \right\}

If some candidates are unavailable, those skilled in the art will recognize that the candidate set is correspondingly smaller. Step 430 then completes and returns the candidate macroblock with the lowest SAD.

As discussed earlier in connection with Fig. 4, once a matching macroblock has been found (whether by the neighboring search or by the conjugate gradient descent of Fig. 5), the search area is narrowed further by a local exhaustive search (Fig. 4, block 440). After the local search, its result is used to compute a fractional motion vector. The computation of the fractional motion vector is described in detail below.

E. Fractional motion vector computation using a quadratic surface model

Those skilled in the art will be familiar with plotting the degree of match between a macroblock and a search window to produce an "error surface". In a novel approach, motion estimator 220 models the error surface as a quadratic surface and determines the minimum of this surface analytically, with sub-pixel accuracy. Motion estimator 220 first determines the minimum in one direction, which yields a line of minima. Motion estimator 220 then determines the minimum in the orthogonal direction along this line.

The general equation of a quadratic curve is given by Equation 1.

y = C_{1} + C_{2} t + C_{3} t^{2}

Equation 1

Differentiating this curve and setting the derivative to zero gives Equation 2:

\frac{dy}{dt} = C_{2} + 2 C_{3} t = 0 \Rightarrow t = \frac{-C_{2}}{2 C_{3}}

Equation 2

Once the coefficients C_{1}, C_{2}, C_{3} are known, Equation 2 can be solved for t, the position of the minimum. Motion estimator 220 solves Equation 3 to determine the coefficients C_{1}, C_{2}, C_{3}.

\begin{pmatrix} C_{1} \\ C_{2} \\ C_{3} \end{pmatrix} = \frac{1}{4} \begin{pmatrix} 31 & -27 & 5 \\ -27 & 25 & -5 \\ 5 & -5 & 1 \end{pmatrix} \times \begin{pmatrix} \sum_{i=1}^{4} d_{i} \\ \sum_{i=1}^{4} d_{i} t_{i} \\ \sum_{i=1}^{4} d_{i} t_{i}^{2} \end{pmatrix}

Equation 3

Motion estimator 220 uses the 8×4 SAD instruction provided by Graphics Processing Unit 120 to compute Equation 3 efficiently. Each d_i is a SAD value, and the sums over i accumulate the SADs of macroblocks adjacent in the x direction. As described in detail below, the 8×4 SAD instruction efficiently computes the 4 SADs of the adjacent macroblocks (x, y), (x+1, y), (x+2, y), (x+3, y), i.e., i = 0...3 with i = j and t = j+1. As noted above, once the coefficients are known, solving Equation 2 yields t, the minimum in the x direction.

Equation 3 can likewise be used to determine the minimum t in the vertical direction. In this case, motion estimator 220 uses the 8×4 SAD instruction to compute efficiently the 4 SADs of the vertically adjacent macroblocks (x, y), (x, y+1), (x, y+2), (x, y+3). Solving Equation 3 gives the coefficients C_{1}, C_{2}, C_{3} from these SADs. As noted above, once the coefficients are known, solving Equation 2 yields t, the minimum in the y direction. The quadratic error surface method used by motion estimator 220 is an improvement over the conventional approach, which first finds the best match on pixel boundaries and then applies computationally expensive filters to look for a better match on sub-pixel boundaries.
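The fit and the minimization can be illustrated with a small sketch that works in exact rational arithmetic. It assumes the four SADs are sampled at t = 1, 2, 3, 4 (from t = j+1) and performs an ordinary least-squares fit. The matrix below is the exact inverse of the normal-equation matrix for those sample points, scaled by 80. It agrees with Equation 3 except for the center entry, which comes out as 516/80 = 129/20 here. The function names are hypothetical.

```python
from fractions import Fraction

def fit_quadratic(d):
    """Least-squares fit of y = c1 + c2*t + c3*t**2 to d[0..3] at t = 1..4."""
    t = [1, 2, 3, 4]
    s0 = sum(Fraction(v) for v in d)
    s1 = sum(Fraction(v) * ti for v, ti in zip(d, t))
    s2 = sum(Fraction(v) * ti * ti for v, ti in zip(d, t))
    # Inverse of [[4, 10, 30], [10, 30, 100], [30, 100, 354]], times 80.
    inv80 = [[620, -540, 100], [-540, 516, -100], [100, -100, 20]]
    return tuple((r[0] * s0 + r[1] * s1 + r[2] * s2) / 80 for r in inv80)

def subpixel_minimum(d):
    """Return t = -c2 / (2*c3), or None if the fitted curve has no minimum."""
    c1, c2, c3 = fit_quadratic(d)
    if c3 <= 0:
        return None
    return -c2 / (2 * c3)
```

For SADs of 9, 1, 1, 9 at t = 1..4 the fitted curve is 4t² - 20t + 25, whose minimum lies midway between the second and third samples.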

F. Using the SAD accelerator on the Graphics Processing Unit to compute the minimum efficiently

As noted earlier, motion estimator 220 determines which macroblock in the predicted picture best matches a reference macroblock in the current picture. Motion estimator 220 uses the hardware SAD acceleration provided by Graphics Processing Unit 120, which is exposed as a Graphics Processing Unit instruction. The SAD instruction takes a 4×4 reference block and an 8×4 prediction block as input and produces 4 SAD values. The sizes of the reference block and prediction block may vary as required. The 4×4 reference block and 8×4 prediction block are only examples used to explain the invention and should not be taken to limit the block sizes. Figs. 9A and 9B are block diagrams illustrating how the SAD instruction operates on the reference and prediction blocks. As shown in Fig. 9A, the 8×4 prediction block is made up of several horizontally adjacent, overlapping 4×4 blocks, namely blocks 910, 920, 930 and 940. The SAD unit takes the 4×4 reference block 950 as input and computes the SAD between this reference block and each of blocks 910-940. That is, the SAD instruction computes 4 values: the sum of the absolute differences between block 910 and block 950, between block 920 and block 950, between block 930 and block 950, and between block 940 and block 950.

Referring to Fig. 9B, the SAD accelerator in Graphics Processing Unit 120 uses 4 SAD computation units (960, 970, 980, 990) to implement the SAD instruction. The leftmost 4×4 block 910 is supplied to SAD computation unit 960. The next 4×4 block to the right (920) is supplied to SAD computation unit 970, and the next (930) to SAD computation unit 980. Finally, the rightmost 4×4 block 940 is supplied to SAD computation unit 990. Graphics Processing Unit 120 uses the independent SAD computation units in parallel, so the SAD instruction produces 4 SAD values per cycle. Those skilled in the art will be familiar with the algorithm for computing the SAD of two pixel blocks of equal size, and with hardware designs for performing this computation, so these details are not described further.

The 4×4 reference block is aligned on pixel boundaries both horizontally and vertically. The 4×4 prediction blocks 910-940, however, need not be vertically aligned. In one embodiment, the data is aligned by rotating the reference block (logic circuit 995). Rotating the reference block, rather than rotating each of the 4 prediction blocks, saves logic gates. The rotated reference block is supplied to each of the individual SAD hardware units. Each unit produces a 12-bit value, and these values are combined into a 48-bit output. In one embodiment, the values are ordered according to the U texture coordinate of the prediction block (the lowest coordinate in the least significant bit positions).
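What the instruction computes can be sketched in software as follows. This is a functional model only, not the hardware design. It assumes that "8×4" means 8 pixels wide by 4 pixels high and that the four overlapping 4×4 blocks start at horizontal offsets 0, 1, 2 and 3, and the function names are hypothetical.

```python
def sad_4x4(ref, pred, x0):
    """SAD between the 4x4 reference and the 4x4 window of pred starting at column x0."""
    return sum(abs(ref[r][c] - pred[r][x0 + c])
               for r in range(4) for c in range(4))

def sad_8x4(ref, pred):
    """ref: 4 rows x 4 columns; pred: 4 rows x 8 columns.

    Returns the four SADs for horizontal offsets 0..3.
    """
    return [sad_4x4(ref, pred, x0) for x0 in range(4)]
```

When the reference block equals the window at offset 2, the third result is zero and the neighbouring offsets give larger values, which is the property the search relies on.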
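The packing of four 12-bit results into one 48-bit value can be sketched as below. A 12-bit field is sufficient because the largest SAD of a 4×4 block of 8-bit pixels is 16 × 255 = 4080. The function names are hypothetical, and the sketch assumes that the result for the lowest U coordinate occupies the lowest 12 bits.

```python
def pack_sads(sads):
    """Pack four 12-bit SAD values into one 48-bit integer, lowest U first."""
    word = 0
    for i, s in enumerate(sads):
        if not 0 <= s < (1 << 12):
            raise ValueError("SAD value does not fit in 12 bits")
        word |= s << (12 * i)
    return word

def unpack_sads(word):
    """Recover the four 12-bit SAD values from a packed 48-bit integer."""
    return [(word >> (12 * i)) & 0xFFF for i in range(4)]
```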

The following code shows that the SAD of an 8×8 block, i.e., two adjacent 8×4 blocks, can be computed using only 4 SAD instructions. Registers T1, T2, T3 and T4 hold the 4 SAD values. The variable sadS accumulates these SAD values. The address of the 8×4 reference block is assumed to be in refReg. U and V are the texture coordinates of the 8×8 prediction block. The code below produces the total SAD of the whole 8×8 block and stores it in sadS.

SAD T1, refReg, U, V        ; left-top of 8×8 prediction block
SAD T2, refReg, U+4, V      ; right-top of 8×8 prediction block
ADD sadS, T1, T2
SAD T3, refReg, U, V+4      ; left-bottom of 8×8 prediction block
ADD sadS, sadS, T3
SAD T4, refReg, U+4, V+4    ; right-bottom of 8×8 prediction block
ADD sadS, sadS, T4

However, it is often unnecessary to compute and add up the values of all 4 sub-blocks, because the computation can stop as soon as the running sum reaches the current minimum. The following pseudocode shows how the SAD instruction is used in a loop that stops when the sum reaches the minimum.

I := 0;
SUM := 0;
MIN := currentMIN;
WHILE (I < 4 && SUM < MIN) {
    SUM := SUM + SAD(refReg, U + (I % 2) * 4, V + (I >> 1) * 4);
    I := I + 1;
}
IF (SUM < MIN) currentMIN := SUM;
Go to next search point;
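The same early-termination idea can be written as a small runnable sketch. Here `quadrant_sad` is a hypothetical callback standing in for the SAD instruction applied to quadrant i of the 8×8 block. The loop stops as soon as the running sum reaches the current minimum, because such a candidate can no longer be the best match.

```python
def sad_8x8_early_exit(quadrant_sad, current_min):
    """Accumulate quadrant SADs, stopping once the sum reaches current_min.

    quadrant_sad(i) returns the SAD of quadrant i (0..3).
    Returns (new_minimum, quadrants_evaluated).
    """
    total = 0
    i = 0
    while i < 4 and total < current_min:
        total += quadrant_sad(i)
        i += 1
    return min(total, current_min), i
```

With quadrant SADs of 10, 20, 30 and 40 and a current minimum of 25, only two quadrants are evaluated before the candidate is rejected.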

The 8×4 SAD instruction in Graphics Processing Unit 120 is used directly by the advanced search algorithms of motion estimator 220, for example when performing the local exhaustive search. In addition, texture cache 1060 (Fig. 10) is block-aligned, whereas the algorithms used by motion estimator 220 are, as described above, pixel-aligned. Although multiplexer blocks could be added to Graphics Processing Unit 120 to handle this alignment mismatch, doing so would increase gate count and power consumption. Instead, Graphics Processing Unit 120 spends that budget on 4 SAD units rather than only 1. In certain embodiments, the 8×4 SAD instruction offers the advantage of computing the minimum efficiently, which involves computing the SADs of adjacent blocks. In certain embodiments, the 8×4 SAD instruction offers a further advantage for the exhaustive search (block 440): when the step value is 1, it computes the SAD of each diagonal.

4. Graphics Processing Unit

Having discussed the software algorithms implemented by motion estimator 220 and their use of the 8×4 SAD instruction in Graphics Processing Unit 120, the SAD instruction and Graphics Processing Unit 120 are now described in detail.

A. Graphics Processing Unit data flow

Fig. 10 is a data flow diagram of Graphics Processing Unit 120, in which the instruction flow is represented by the arrows on the left of Fig. 10 and the image or graphics flow by the arrows on the right. Fig. 10 omits several components that are well known to those skilled in the art and are not essential to explaining the in-loop deblocking feature of Graphics Processing Unit 120.

An instruction stream processor 1010 receives an instruction 1020 from a system bus (not shown), decodes the instruction, and produces instruction data 1030, such as vertex data. Graphics Processing Unit 120 supports conventional graphics processing instructions as well as instructions that accelerate video encoding and/or decoding, such as the 8×4 SAD instruction described above.

Conventional graphics processing instructions involve tasks such as vertex shading, geometry shading and pixel shading. Instruction data 1030 is therefore supplied to a pool 1040 of shader execution units. The shader execution units use a texture filtering unit (TFU) 1050 as necessary to apply a texture to a pixel. Texture data is fetched from texture cache 1060, which is backed by main memory (not shown).

Some instructions are passed to video processing unit 1100, whose operation is explained below. The resulting data is then processed by a post-packer 1070, which compresses the data. After post-processing, the data produced by video processing unit 1100 is supplied to execution unit pool 1040.

The execution of video encode/decode acceleration instructions, such as the SAD instruction described above, differs from that of the conventional graphics instructions in several respects. First, video acceleration instructions are executed by video processing unit 1100 rather than by the shader execution units. Second, video acceleration instructions do not use texture data as such.

However, the image data used by video acceleration instructions and the texture data used by graphics instructions are both 2-dimensional arrays. Graphics Processing Unit 120 takes advantage of this similarity by using texture filtering unit 1050 to load image data for video processing unit 1100, which allows texture cache 1060 to cache some of the image data operated on by video processing unit 1100. For this reason, as shown in Fig. 10, video processing unit 1100 is located between texture filtering unit 1050 and post-packer 1070.

Texture filtering unit 1050 examines the instruction data 1030 extracted from instruction 1020. Instruction data 1030 also supplies texture filtering unit 1050 with the coordinates of the desired image data in main memory (not shown). In one embodiment, these coordinates are specified as U, V pairs, which will be familiar to those skilled in the art. When instruction 1020 is a video acceleration instruction, the extracted instruction data 1030 further directs texture filtering unit 1050 to bypass any texture filters (not shown) within texture filtering unit 1050. Texture filtering unit 1050 thus loads image data for video processing unit 1100 under the control of the video acceleration instruction.

In this way, texture filtering unit 1050 is leveraged to load image data for video processing unit 1100 on behalf of video acceleration instructions. Video processing unit 1100 receives image data from texture filtering unit 1050 on the data path and instruction data 1030 on the instruction path, and performs an operation on the image data according to instruction data 1030. The image data output by video processing unit 1100 is fed to execution unit pool 1040 after being processed by post-packer 1070.

B. Instruction parameters

The operation of video processing unit 1100 in executing the SAD video acceleration instruction is now described. As described earlier, each Graphics Processing Unit instruction is decoded and parsed into instruction data 1030, which can be regarded as a set of parameters specific to each instruction. The parameters of the SAD instruction are shown in Table 1.

Table 1: Parameters of the Graphics Processing Unit SAD instruction

I/O	Name	Size	Description
Input	FieldFlag	1 bit	If FieldFlag == 1 then Field Picture, otherwise Frame Picture
Input	TopFieldFlag	1 bit	If TopFieldFlag == 1 then Top-Field-Picture, otherwise Bottom-Field-Picture, if FieldFlag is set
Input	PictureWidth	16 bits	For example: 1920 for HDTV
Input	PictureHeight	16 bits	For example: 1080 for 30P HDTV
Input	BaseAddress	32 bits, unsigned	Prediction picture base address
Input	BlockAddress	U: 16 bits, signed; V: 16 bits, signed	Prediction picture texture coordinates (relative to the base address), in SRC1 opcode: SRC1[0:15] = U, SRC1[31:16] = V; U and V are in 13.3 format, fractional part ignored
Input	RefBlock	128 bits	Reference picture data, in SRC2 opcode
Output	Destination operand	4 × 16 bits	Least significant 32 bits of the 128-bit register, in DST opcode

Several input parameters are used in combination to determine the address of the 4×4 block to be fetched by texture filtering unit 1050. The BaseAddress parameter points to the start of the texture data in the texture cache. The top-left coordinate of the block within this region is given by the BlockAddress parameter. The PictureHeight and PictureWidth input parameters are used to determine the extent of the block, i.e., its lower-left coordinate. Finally, the video picture may be progressive or interlaced. If interlaced, it consists of two fields (top and bottom). Texture filtering unit 1050 uses FieldFlag and TopFieldFlag to handle interlaced pictures properly.
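Decoding the BlockAddress operand can be sketched as below. The layout (U in bits 0 to 15, V in bits 16 to 31, signed 13.3 fixed point with the fraction ignored) follows the parameter table. The use of two's-complement encoding, and of flooring when the 3 fractional bits are dropped, are assumptions, and the function name is hypothetical.

```python
def decode_block_address(src1):
    """Split a 32-bit SRC1 value into integer (U, V) texture coordinates."""
    def field(raw):
        if raw & 0x8000:      # sign-extend the 16-bit field (two's complement assumed)
            raw -= 0x10000
        return raw >> 3       # drop the 3 fractional bits of the 13.3 format
    u = field(src1 & 0xFFFF)
    v = field((src1 >> 16) & 0xFFFF)
    return u, v
```

For example, U = 5 is encoded as 40 and V = -2 as 0xFFF0, so the operand 0xFFF00028 decodes to (5, -2).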

C. Image data conversion

To execute the SAD instruction, video processing unit 1100 fetches the input pixel blocks from texture filtering unit 1050 and transforms them into a format suitable for processing by the SAD accelerator units 960-990. The pixel blocks are then supplied to the SAD accelerator units 960-990, which return the SAD values. Each SAD value is then accumulated into the destination register. These functions are described in detail below.

Video processing unit 1100 receives two input parameters that define the blocks for which the SAD is computed. The data of the reference block is specified directly by the SRC2 opcode: the 4×4×8-bit block is treated as 128 bits of data. By contrast, the SRC1 opcode specifies the address of the prediction block rather than its data. Video processing unit 1100 supplies this address to texture filtering unit 1050, which fetches the 128 bits of prediction block data from texture cache 1060.

Although image data comprises a luminance (Y) plane and chrominance (Cb, Cr) planes, motion estimation usually uses only the Y component. Therefore, when the SAD instruction is executed, the pixel blocks operated on by video processing unit 1100 contain only the Y component. In one embodiment, video processing unit 1100 generates an inhibit signal that directs texture filtering unit 1050 not to fetch Cr/Cb pixel data from texture cache 1060.

Fig. 11 is a block diagram of texture filtering unit 1050 and texture cache 1060. Texture filtering unit 1050 is designed to fetch from texture cache 1060 on texel boundaries, and to load 4×4 texel blocks from texture cache 1060 into filter input buffers 1110. When data is fetched on behalf of video processing unit 1100, a texel 1120 is treated as 4 channels (ARGB) of 32 bits each, for a texel size of 128 bits. When data is fetched for the SAD instruction, texture filtering unit 1050 loads an 8×4×8-bit block.

To handle alignment, this 8×4 block is loaded into two 4×4 pixel input buffers (1110A and 1110B). The image data used by video processing unit 1100 may be byte-aligned. Texture filtering unit 1050, however, is designed to fetch from the cache on texel boundaries. Therefore, when fetching data for video processing unit 1100, texture filtering unit 1050 may need to fetch up to 4 texel-aligned 4×4 blocks surrounding each 4×4 half of the particular byte-aligned 8×4 block.

This process can be seen in Fig. 11, where the left 4×4 half (target block 1130) is not aligned on a texel boundary in either the vertical or the horizontal direction. In other words, target block 1130 spans texel boundaries. The U, V address of target block 1130 defines the top-left corner of the byte-aligned 4×4×8-bit block. In this example, texture filtering unit 1050 determines that texels 1140, 1150, 1160 and 1170 should be fetched to obtain target block 1130. After this determination, texture filtering unit 1050 fetches blocks 1140-1170 and then combines bitwise-selected rows and columns from blocks 1140-1170, so that the leftmost 4×4 bits of target block 1130 are written into filter buffer 1110B. Those skilled in the art will know how to use multiplexers, shifters and mask bits to achieve this result, regardless of the alignment of the 4×4 target fetched from texture cache 1060.
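The gathering of an unaligned 4×4 block from aligned blocks can be sketched in software as below. This is a functional illustration, not the multiplexer and shifter hardware. It assumes a cache that can only return 4×4 tiles aligned to multiples of 4 in both directions, and the names `fetch_tile` and `fetch_unaligned_4x4` are hypothetical.

```python
TILE = 4

def fetch_tile(image, tx, ty):
    """Return the aligned 4x4 tile whose top-left pixel is (tx*4, ty*4)."""
    return [row[tx * TILE:(tx + 1) * TILE]
            for row in image[ty * TILE:(ty + 1) * TILE]]

def fetch_unaligned_4x4(image, u, v):
    """Assemble the 4x4 block with top-left pixel (u, v) from aligned tiles.

    Returns (block, number_of_tiles_fetched).
    """
    tiles = {}
    block = []
    for y in range(v, v + TILE):
        out_row = []
        for x in range(u, u + TILE):
            key = (x // TILE, y // TILE)
            if key not in tiles:                 # fetch each aligned tile only once
                tiles[key] = fetch_tile(image, *key)
            out_row.append(tiles[key][y % TILE][x % TILE])
        block.append(out_row)
    return block, len(tiles)
```

A block that is misaligned in both directions needs four tiles, while a fully aligned block needs only one.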

In the embodiment shown in Fig. 11, when target block 1130 spans a vertical texel boundary, the data is not rearranged vertically. When this occurs, the data loaded into filter buffers 1110A and 1110B is in a different vertical order from its original order in the cache. In this embodiment, video processing unit 1100 must vertically rearrange (rotate) the 128-bit reference block data to match the order of the prediction block. In another embodiment, texture filtering unit 1050 vertically rearranges the cached texel data to match the original cache order before writing it into one of the filter buffers 1110.
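One way to see why rotating only the reference block is sufficient: the SAD is a sum over corresponding pixel pairs, so applying the same row rotation to both blocks leaves it unchanged. The sketch below illustrates this with hypothetical helper names. The rotation amount `r` stands for however many rows the prediction data arrives out of order.

```python
def rotate_rows(block, r):
    """Rotate the rows of a block upward by r positions."""
    r %= len(block)
    return block[r:] + block[:r]

def sad(a, b):
    """Sum of absolute differences of two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
```

If the prediction rows arrive rotated by r, comparing them with `rotate_rows(ref, r)` gives the same value as comparing the blocks in their original order.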

Any process description or block in a flow chart should be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing a specific logical function or step in the process. Those skilled in the software art will recognize that alternative implementations are also included within the scope of the disclosure. In these alternative implementations, functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.

The systems and methods disclosed herein can be implemented in software, hardware, or a combination of the two. In certain embodiments, the system and/or method is implemented in software that is stored in memory and executed by a suitable processor in a computing device (including but not limited to a microprocessor, microcontroller, network processor, reconfigurable processor or extensible processor). In other embodiments, the system and/or method is implemented in logic circuitry, including but not limited to a programmable logic device (PLD), programmable gate array (PGA), field programmable gate array (FPGA) or application-specific integrated circuit (ASIC). In still other embodiments, the logic is implemented in a graphics processor or Graphics Processing Unit (GPU).

The systems and methods disclosed herein can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device. Such an instruction execution system includes any computer-based system, processor-containing system, or other system that can fetch and execute instructions from the instruction execution system. As used in this disclosure, a "computer-readable medium" can be any means that can contain, store, communicate, propagate or transport the program for use by, or in connection with, the instruction execution system. The computer-readable medium can be, for example and without limitation, a system or propagation medium based on electronic, magnetic, optical, electromagnetic, infrared or semiconductor technology.

Specific (non-limiting) examples of a computer-readable medium using electronic technology include: an electrical connection (electronic) having one or more wires; a random access memory (RAM); a read-only memory (ROM); and an erasable programmable read-only memory (EPROM or flash memory). A specific (non-limiting) example of a computer-readable medium using magnetic technology is a portable computer diskette. Specific (non-limiting) examples of a computer-readable medium using optical technology include an optical fiber and a portable compact disc read-only memory (CD-ROM).

Although the invention is illustrated and described herein as embodied in one or more specific examples, it should not be limited to the details shown, since many modifications and structural changes can be made without departing from the spirit of the invention and within the scope and range of equivalents of the claims. Accordingly, the appended claims should be construed broadly and in a manner consistent with the scope of the invention, as set forth in the claims that follow.

Claims

1. A Graphics Processing Unit, comprising:

an instruction decoder configured to decode a sum-of-absolute-differences (SAD) instruction into a plurality of parameters, the plurality of parameters describing an M×N pixel block and an n×n pixel block in U, V coordinates; and

a SAD acceleration logic circuit configured to receive the plurality of parameters and to compute a plurality of SAD values, each SAD value corresponding to the n×n pixel block and to a block that lies within the M×N pixel block at a positional offset from the n×n pixel block.

2. The Graphics Processing Unit as claimed in claim 1, wherein the SAD acceleration logic circuit further comprises:

a plurality of SAD computation units, each SAD computation unit being configured to receive the n×n pixel block, to receive one of the plurality of blocks contained in the M×N pixel block, and to compute a corresponding one of the plurality of SAD values.

3. The Graphics Processing Unit as claimed in claim 1, wherein the parameter describing the M×N pixel block defines an address of the M×N pixel block in a texture cache.

4. The Graphics Processing Unit as claimed in claim 1, wherein the parameter describing the M×N pixel block defines a base address and a relative address of the M×N pixel block in a texture cache.

5. The Graphics Processing Unit as claimed in claim 1, wherein the positional offset is a horizontal offset.

6. The Graphics Processing Unit as claimed in claim 1, wherein the M×N pixel block represents a motion estimation prediction block and the n×n pixel block represents a motion estimation reference block.

7. The Graphics Processing Unit as claimed in claim 2, wherein the plurality of SAD computation units process data in parallel.

8. The Graphics Processing Unit as claimed in claim 2, further comprising a first logic circuit that accumulates the plurality of SAD values into a destination register.

9. The Graphics Processing Unit as claimed in claim 8, wherein the first logic circuit combines the plurality of SAD values and stores them in the destination register in an order determined by the U coordinates of the respective blocks within the M×N pixel block.

10. The graphics processing unit as claimed in claim 2, further comprising:

a texture cache configured to store pixel data in a texture-image format having a predetermined number of bits; and

a texture filter unit configured to determine whether the M×N pixel block extends across a texture image boundary, to fetch from the texture cache, accordingly, one or more aligned n×n texture-image blocks surrounding the M×N pixel block, and to combine bits selected from the rows and columns of the aligned n×n blocks, such that the leftmost bits are written to a first filter buffer and the rightmost bits are written to a second filter buffer.
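By way of illustration only, the computation recited in claims 1, 2, 5 and 9 can be sketched in software as follows. This is a minimal sketch, not the claimed circuit: the function names `sad` and `shifted_sads` are hypothetical, pixel blocks are modeled as lists of rows of integers, and the assumption that every horizontal offset at which the n×n block fits is evaluated is an illustrative choice rather than something the claims require. The loop body stands in for one SAD computation unit; in hardware these units would operate in parallel.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def shifted_sads(search_block, ref_block):
    """Return one SAD value per horizontal offset u.

    search_block: the larger block (n rows, at least n columns).
    ref_block:    the n x n block compared at each offset.
    Results are ordered by increasing u (the horizontal coordinate).
    """
    n = len(ref_block)
    width = len(search_block[0])
    results = []
    for u in range(width - n + 1):
        # Extract the n x n candidate whose left edge sits at column u.
        candidate = [row[u:u + n] for row in search_block[:n]]
        results.append(sad(candidate, ref_block))
    return results


search = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
ref = [[2, 3],
       [6, 7]]
print(shifted_sads(search, ref))  # [4, 0, 4]
```

The smallest value in the returned list identifies the horizontal offset at which the reference block best matches the larger block.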

11. A graphics processing unit, comprising:

a host interface that receives a video acceleration instruction; and

a video acceleration unit responsive to the video acceleration instruction, the video acceleration unit comprising a sum-of-absolute-differences (SAD) acceleration logic circuit configured to receive a plurality of parameters and to compute a plurality of SAD values, each SAD value corresponding to an n×n pixel block and to one of a plurality of blocks that lie within an M×N pixel block and are offset in position from the n×n pixel block.

12. The graphics processing unit as claimed in claim 11, wherein the SAD acceleration logic circuit further comprises:

13. The graphics processing unit as claimed in claim 12, wherein the plurality of SAD computation units are operable to process data in parallel.

14. The graphics processing unit as claimed in claim 11, further comprising a first logic circuit that adds the plurality of SAD values to a destination buffer.

15. The graphics processing unit as claimed in claim 14, wherein the first logic circuit combines the plurality of SAD values and stores them in the destination buffer in an order determined by the U coordinates of the respective blocks within the M×N pixel block.

16. The graphics processing unit as claimed in claim 11, further comprising:

17. The graphics processing unit as claimed in claim 11, wherein the parameter describing the M×N pixel block defines a base address and a relative address of the M×N pixel block in a texture cache.

18. The graphics processing unit as claimed in claim 11, wherein the parameter describing the M×N pixel block directly defines the pixel data.

19. The graphics processing unit as claimed in claim 11, wherein the M×N pixel block represents a motion estimation prediction block, and the n×n pixel block represents a motion estimation reference block.

20. A method of computing a sum-of-absolute-differences (SAD) value for an M×N macroblock, wherein M and N are integers, the method comprising:

executing a SAD instruction to compute a first SAD value for a first n×n portion of an M×M macroblock, the first portion comprising an upper-left portion of the M×M macroblock, wherein n is an integer;

executing the SAD instruction to compute a second SAD value for a second n×n portion of the M×M macroblock, the second portion comprising an upper-right portion of the M×M macroblock;

accumulating the first and second SAD values into a sum;

executing the SAD instruction to compute a third SAD value for a third n×n portion of the M×M macroblock, the third portion comprising a lower-left portion of the M×M macroblock;

adding the third SAD value to the sum;

executing the SAD instruction to compute a fourth SAD value for a fourth n×n portion of the M×M macroblock, the fourth portion comprising a lower-right portion of the M×M macroblock; and

adding the fourth SAD value to the sum.
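By way of illustration only, the steps of the method of claim 20 can be sketched in software as follows. This is a minimal sketch under stated assumptions: the names `sad`, `quadrant` and `macroblock_sad` are hypothetical, the function `sad` stands in for the hardware SAD instruction, blocks are modeled as lists of rows of integers, and the macroblock is assumed to be square with side 2n so that four n×n portions cover it exactly.

```python
def sad(block_a, block_b):
    """Stand-in for the SAD instruction: sum of absolute pixel differences."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def quadrant(block, top, left, n):
    """Extract the n x n portion whose upper-left pixel is at (top, left)."""
    return [row[left:left + n] for row in block[top:top + n]]


def macroblock_sad(current, reference, n):
    """Accumulate the SAD of a 2n x 2n macroblock from four n x n portions."""
    total = 0
    # Order follows the claim: upper-left, upper-right, lower-left, lower-right.
    for top, left in ((0, 0), (0, n), (n, 0), (n, n)):
        total += sad(quadrant(current, top, left, n),
                     quadrant(reference, top, left, n))
    return total


current = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
reference = [[0] * 4 for _ in range(4)]
print(macroblock_sad(current, reference, 2))  # 136
```

Because the four portions partition the macroblock, the accumulated sum equals the SAD computed over the whole macroblock in a single pass.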