CN104125466A

CN104125466A - GPU (Graphics Processing Unit)-based HEVC (High Efficiency Video Coding) parallel decoding method

Info

Publication number: CN104125466A
Application number: CN201410328646.2A
Authority: CN
Inventors: 梁凡; 罗林
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2014-07-10
Filing date: 2014-07-10
Publication date: 2014-10-29
Anticipated expiration: 2034-07-10
Also published as: CN104125466B

Abstract

The invention discloses a GPU-based HEVC parallel decoding method. In the method, a GPU performs entropy decoding, reordering and inverse quantization on a bitstream file that has been read to obtain a transform coefficient matrix, and parses the bitstream file to obtain motion vectors and reference frames; the GPU processes the transform coefficient matrix with an HEVC inverse-transform parallel algorithm to obtain the residual data of an image, and uses an HEVC motion-compensation parallel algorithm to obtain the predicted pixel values of the image from the reference-frame positions that the motion vectors point to; the GPU then sums the residual data and the predicted pixel values, applies deblocking filtering and sample adaptive offset (SAO) processing in that order to obtain a reconstructed image, and copies the pixel values of the reconstructed image to CPU memory. The method effectively improves decoding speed and efficiency and can be widely used in the field of video coding and decoding.

Description

A GPU-based HEVC parallel decoding method

Technical field

The present invention relates to the field of video coding and decoding, and in particular to a GPU-based HEVC parallel decoding method.

Background art

With the rapid development of the Internet and mobile communication technology, digital video is moving toward higher definition, higher frame rates and higher compression ratios. Video formats have progressed from 720p to 1080p, and ultra-high-definition digital video at 4Kx2K and even 8Kx4K has appeared in some applications. In video applications, transmission bandwidth and storage space are the most critical resources; storing high-definition video in limited space and transmitting it well over bandwidth-constrained networks are major challenges. High-definition video offers a better viewing experience but inevitably involves a huge amount of data. For example, one frame of 1080p video with 1920x1080 pixels in 4:2:0 format occupies 24.88 Mbit. High-definition video therefore causes the video bit rate to rise sharply. Video coding aims to represent video information with as few bits as possible, and the compression efficiency of the widely used H.264 coding standard still cannot fully meet the demands of ultra-high-definition video.
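The 24.88 Mbit figure can be checked with a few lines of arithmetic. The sketch below assumes 8-bit samples, which the text does not state explicitly; the function name `frame_bits` is illustrative only.

```python
def frame_bits(width, height, bit_depth=8):
    """Raw size of one 4:2:0 frame in bits.

    In 4:2:0 sampling each of the two chroma planes has a quarter of the
    luma samples, so chroma adds half the luma sample count in total.
    The 8-bit default is an assumption, not a value given in the text.
    """
    luma_samples = width * height
    chroma_samples = luma_samples // 2
    return (luma_samples + chroma_samples) * bit_depth


bits = frame_bits(1920, 1080)
print(bits, "bits, about", round(bits / 1e6, 2), "Mbit")
```

Under that assumption the result is 24,883,200 bits, which rounds to the 24.88 Mbit quoted above.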

HEVC (High Efficiency Video Coding) is the next-generation video compression standard jointly developed by ISO/IEC MPEG and ITU-T VCEG. HEVC inherits the coding framework of the existing H.264 video coding standard, retains some of its coding techniques and improves others: coding unit sizes are larger, block-based inter/intra prediction modes are more varied, interpolation filters are more complex, and so on. Compared with earlier video coding technologies, HEVC offers higher compression efficiency, better video quality, better robustness and stronger error resilience, and is better suited to transmission over IP networks. Compared with the H.264/AVC coding standard, HEVC doubles the compression efficiency when coding high-definition, high-fidelity video; that is, for the same reconstructed picture quality after decoding, the bit rate of the video stream is reduced by 50%.

However, the bit-rate reduction comes at the cost of greater codec complexity. With more complex and more flexible coding techniques, the complexity of HEVC encoding and decoding software increases greatly, so the time needed to compress and decompress high-definition video increases as well, and the strict real-time decoding and playback requirements of applications such as video conferencing and video telephony cannot be met.

With high-definition video becoming mainstream, a CPU alone clearly cannot decode high-definition video in real time well. A GPU has excellent floating-point performance and powerful parallel computing capability; moving the computation-heavy, high-complexity modules of the decoding algorithm onto the GPU can effectively solve the real-time decoding problem. However, no GPU-based HEVC video codec solution has yet appeared in the industry.

Summary of the invention

To solve the above technical problem, the object of the invention is to provide a GPU-based HEVC parallel decoding method with high decoding speed and high efficiency.

The technical solution adopted by the invention to solve this technical problem is a GPU-based HEVC parallel decoding method, comprising:

A. the GPU performs entropy decoding, reordering and inverse quantization on the bitstream file that has been read, thereby obtaining a transform coefficient matrix, and at the same time the GPU parses the bitstream file, thereby obtaining motion vectors and reference frames;

B. the GPU processes the transform coefficient matrix with an HEVC inverse-transform parallel algorithm, thereby obtaining the residual data of the image, and at the same time the GPU uses an HEVC motion-compensation parallel algorithm to obtain the predicted pixel values of the image from the reference-frame positions that the motion vectors point to;

C. the GPU sums the residual data and the predicted pixel values of the image and then applies deblocking filtering and sample adaptive offset (SAO) processing in that order, thereby obtaining a reconstructed image, and copies the pixel values of the reconstructed image to CPU memory.

Further, in step B, the step in which the GPU processes the transform coefficient matrix with the HEVC inverse-transform parallel algorithm comprises:

B11. initializing the GPU and allocating device-side global memory on the GPU for storing the transform coefficient matrices and the residual data;

B12. setting the thread grid size and the thread block size, and allocating to each transform unit a corresponding number of threads and thread IDs according to the size of the transform unit;

B13. reading from device-side global memory the transform coefficient matrix of each transform unit, then, according to the thread IDs, applying to each transform coefficient matrix a column-parallel one-dimensional IDCT followed by a row-parallel one-dimensional IDCT, thereby obtaining the residual data of the whole image block;

B14. copying the computed residual data of each image block back to CPU memory to obtain the residual data of the whole image, and then freeing the device-side global memory.

Further, step B13 comprises:

B131. reading from device-side global memory the transform coefficient matrix of each transform unit;

B132. according to the thread IDs, applying the one-dimensional IDCT to all columns of each transform coefficient matrix simultaneously to obtain a transformed coefficient matrix, and holding the result temporarily in the shared memory of the thread block;

B133. according to the thread IDs, applying the one-dimensional IDCT simultaneously to every row of the transformed coefficient matrix in shared memory to obtain a residual data matrix, and computing the residual data of the whole image block from the residual data matrices.

Further, in step B, the step in which the GPU uses the HEVC motion-compensation parallel algorithm to obtain the predicted pixel values of the image from the reference-frame positions that the motion vectors point to comprises:

S1. initializing the GPU and allocating memory on the GPU for storing the motion vector, reference frame and predicted pixel value of each pixel in inter prediction mode;

S2. copying the motion vectors and the corresponding reference frame images to the device, and binding the reference frames to texture memory;

S3. configuring the threads by assigning one thread ID to the processing of each predicted pixel value, and allocating device-side global memory for storing the predicted pixel values;

S4. each thread simultaneously performing either a direct texture read or interpolation filtering, according to its own thread ID and the reference-frame position that the motion vector points to, thereby obtaining the pixel prediction value of each thread;

S5. copying the pixel prediction value of each thread back to CPU memory, and then freeing the device-side global memory.

Further, step S4 is specifically:

Each thread simultaneously performs either a direct read or interpolation filtering according to its own thread ID and the reference-frame position that the motion vector points to: if the motion vector of the thread points to an integer-pixel position, the pixel value at the reference-frame position that the motion vector points to is read directly from texture memory and used as the pixel prediction value of the thread; if the motion vector of the thread points to a sub-pixel position, the corresponding luma or chroma pixel interpolation filter formula is selected according to the sub-pixel position and evaluated, thereby obtaining the pixel prediction value of the thread.

Further, the luma pixel interpolation filter formula is an 8-tap interpolation filter formula, and the chroma pixel interpolation filter formula is a 4-tap interpolation filter formula.

The beneficial effects of the invention are as follows: a decoding framework consisting of a CPU and a GPU is built, the inverse transform and motion compensation processing, which have higher decoding complexity, are moved onto the GPU, and GPU-based HEVC inverse-transform and motion-compensation parallel algorithms are designed, which effectively improves decoding speed and decoding efficiency.

Brief description of the drawings

The invention is further described below with reference to the drawings and embodiments.

Fig. 1 is a flow chart of the steps of the GPU-based HEVC parallel decoding method of the invention;

Fig. 2 is a flow chart of the step in step B in which the GPU processes the transform coefficient matrix with the HEVC inverse-transform parallel algorithm;

Fig. 3 is a flow chart of step B13 of the invention;

Fig. 4 is a flow chart of the step in step B in which the GPU uses the HEVC motion-compensation parallel algorithm to obtain the predicted pixel values of the image from the reference-frame positions that the motion vectors point to;

Fig. 5 is a diagram of the HEVC decoding framework of Embodiment 1 of the invention;

Fig. 6 is a schematic diagram of luma sub-pixel interpolation according to the invention.

Detailed description of the embodiments

Referring to Fig. 1, a GPU-based HEVC parallel decoding method comprises steps A to C as set out in the summary above.

Referring to Fig. 2, as a further preferred embodiment, the step in step B in which the GPU processes the transform coefficient matrix with the HEVC inverse-transform parallel algorithm comprises steps B11 to B14 as set out above.

Here, the thread grid size is set to Grid(4,4,1) and the thread block size is set to Block(16,16,1), so that one grid contains 16 blocks and each block contains 256 threads. The number of threads is allocated according to the size of the transform unit. An image block consists of at least one transform unit, and an image consists of at least one image block.
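The block and thread counts quoted for Grid(4,4,1) and Block(16,16,1) follow directly from multiplying the dimensions. The sketch below is a plain-Python illustration, not GPU code; the row-major flattening in `flat_thread_id` is one common convention and is an assumption here, since the text does not say how thread IDs are numbered.

```python
GRID = (4, 4, 1)     # blocks per grid, as configured in the text
BLOCK = (16, 16, 1)  # threads per block, as configured in the text


def count(dim3):
    """Number of elements in a 3-component launch dimension."""
    x, y, z = dim3
    return x * y * z


def flat_thread_id(block_idx, thread_idx, grid=GRID, block=BLOCK):
    """Flatten a 2D block index and a 2D thread index into one global ID.

    Row-major numbering is assumed for illustration only.
    """
    bx, by = block_idx
    tx, ty = thread_idx
    block_id = by * grid[0] + bx
    local_id = ty * block[0] + tx
    return block_id * count(block) + local_id


print(count(GRID), "blocks x", count(BLOCK), "threads =",
      count(GRID) * count(BLOCK), "threads in total")
```

With this configuration one launch provides 4096 threads; how they are divided among transform units depends on the unit size, as the paragraph above notes.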

Referring to Fig. 3, as a further preferred embodiment, step B13 comprises steps B131 to B133 as set out above.

Referring to Fig. 4, as a further preferred embodiment, the step in step B in which the GPU uses the HEVC motion-compensation parallel algorithm to obtain the predicted pixel values of the image from the reference-frame positions that the motion vectors point to comprises steps S1 to S5 as set out above.

As a further preferred embodiment, step S4 is specifically as set out above.

Here, the pixel value at the reference-frame position that the motion vector points to is read from texture memory by calling the texture fetch function tex2D().

As a further preferred embodiment, the luma pixel interpolation filter formula is an 8-tap interpolation filter formula, and the chroma pixel interpolation filter formula is a 4-tap interpolation filter formula.

The invention is described in further detail below with reference to specific embodiments.

Embodiment 1

Referring to Fig. 5, the first embodiment of the invention is as follows.

The HEVC decoding framework is shown in Fig. 5. HEVC decoding is the inverse of the encoding process. The decoder reads the bitstream file and obtains the bitstream from the NAL (network abstraction layer). Decoding proceeds frame by frame in sequence: each frame is divided into a number of largest coding units (LCUs), and in raster-scan order, with the LCU as the basic unit, entropy decoding and then reordering are performed to obtain the residual coefficients of the corresponding coding unit; the residual coefficients are then inverse-quantized and inverse-transformed to obtain the image residual data. Meanwhile, the decoder generates a prediction block from the header information decoded from the bitstream: in inter prediction mode, a prediction block is generated from the motion vector and the reference frame; in intra prediction mode, a prediction block is generated from neighboring prediction units. The prediction block data and the residual block data are then summed to obtain the reconstructed image block data, and finally the image block data is passed through deblocking filtering and sample adaptive offset processing to obtain the reconstructed image for output.

Motion compensation describes the difference between adjacent frames in terms of their coding relationship, that is, how a macroblock of the reference frame moves to a certain position in the current frame; the value of the equal-sized prediction block that the motion vector points to in the reference frame is added to the residual value to obtain the reconstructed value. Video codecs commonly use this method to reduce temporal redundancy in a video sequence. Motion compensation is used for image reconstruction and is an indispensable key module in video encoding and decoding.

Motion compensation works as follows: a frame is partitioned according to image texture into coding units of different sizes, and prediction units are partitioned from the coding units; a prediction unit contains one luma block and two chroma blocks, and each inter-coded block is predicted from a block of the same size in a reference picture. The precision of motion compensation is determined by the precision of the motion vector and directly affects reconstructed image quality and bitstream size. The motion vector is the amount of translation used in prediction and is obtained at the encoder by motion estimation. The higher the precision of the motion vector, the more accurate the motion compensation. Interpolation filtering is a key technique in motion compensation. The H.264 standard uses a six-tap Wiener filter, with a motion compensation precision of 1/4 pixel. HEVC uses a more advanced and efficient interpolation filter based on the discrete cosine transform (DCT). In comparison, sub-pixel generation in the HEVC standard is simpler and more efficient: only one filter formula and one filtering pass are needed. Luma uses a DCT-based 8-tap interpolation filter and chroma uses a DCT-based 4-tap interpolation filter to interpolate pixels. However, the large number of interpolation computations raises complexity correspondingly, which lowers coding efficiency. The luma and chroma samples at sub-pixel positions do not actually exist in the reference frame, so the pixel values at those positions must be obtained by interpolation filtering; this kind of motion compensation is sub-pixel-precision motion compensation.

Embodiment 2

This embodiment describes the HEVC inverse-transform parallel algorithm of the invention.

The inverse transform is the process of converting the transform coefficient matrix of the current block into a matrix of residual sample values in preparation for subsequent reconstruction. It is performed after inverse quantization, likewise uses the transform unit (TU) as its basic unit, and takes the result of inverse quantization as its source data. When the HEVC decoder of the invention performs the two-dimensional IDCT, it first performs the one-dimensional inverse discrete cosine transform (IDCT) in the horizontal direction and then the one-dimensional IDCT in the vertical direction; through these matrix multiplications the transform coefficient matrix is converted into a residual data matrix of the same size, completing the conversion from the frequency domain to the sample domain.

Within one frame, the IDCT operations of different transform blocks are independent of one another. When the one-dimensional IDCT in the horizontal direction is applied to the transform coefficient matrix of one transform unit, the columns are independent of one another, so the columns can be computed in parallel. Similarly, when the one-dimensional IDCT in the vertical direction is applied, there is no data dependency between the rows, so they can also be computed in parallel. The invention allocates a number of threads according to the size of the transform coefficient matrix: one thread is assigned to each column so that all columns undergo the one-dimensional IDCT simultaneously; once this is finished, one thread is assigned to each row so that all rows undergo the one-dimensional IDCT simultaneously. When both passes are complete, the two-dimensional IDCT of the transform coefficient matrix has been computed in parallel.

An HEVC transform unit is 4x4, 8x8, 16x16 or 32x32 in size. The larger the transform unit, the higher the degree of parallelism and the more pronounced the speedup. For example, for the transform coefficient matrix of a 32x32 transform unit, the one-dimensional IDCT of all 32 columns can be performed simultaneously; after it completes, the __syncthreads() function is called for synchronization, and then the one-dimensional IDCT of all 32 rows is performed simultaneously. In addition, the IDCT can be applied to the transform coefficient matrices of all transform units at the same time. For a better speedup, the invention directly uses the transform coefficients left in global memory by the parallel inverse quantization. A transform unit contains one luma transform block and two chroma transform blocks, so the inverse transform must be performed separately for luma and chroma; the steps are the same for both.

The GPU-based inverse transform algorithm of the invention comprises:

(1) At the start of decoding, the GPU is initialized and global memory is allocated on the GPU for storing the residual data produced by the inverse transform; the transform coefficients produced by inverse quantization on the GPU are read directly from global memory.

(2) The number of threads is configured: the thread grid size is Grid(4,4,1) and the thread block size is Block(16,16,1), so that one grid contains 16 blocks and each block contains 256 threads; threads are then allocated according to the size of the transform unit.

(3) The transform coefficient matrix of each transform unit is assigned to one thread block for inverse transform processing: first, each thread applies the one-dimensional IDCT to the column of the transform coefficient matrix that corresponds to its thread ID, all columns being processed simultaneously; the __syncthreads() function is called for synchronization, and the result is held temporarily in the shared memory of the thread block. Then the one-dimensional IDCT is applied simultaneously to every row of the coefficient matrix in shared memory, one thread per row, which completes the two-dimensional IDCT of the transform coefficients and yields the residual data matrix. The transform coefficient matrices of all transform units are processed simultaneously, giving the residual data of the whole image block.

(4) The computed residual data of each image block is copied from GPU global memory to CPU memory, giving the residual data of the whole image.

(5) The global memory allocated during decoding is freed.
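The column pass followed by the row pass in step (3) can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the patent's kernel: it uses a floating-point orthonormal inverse DCT as a stand-in for HEVC's integer transform matrices, and plain Python loops stand in for the per-column and per-row GPU threads. The names `idct_1d` and `idct_2d` are hypothetical.

```python
import math


def idct_1d(coeffs):
    """Orthonormal 1D inverse DCT of one column or one row.

    A floating-point stand-in; HEVC itself uses integer transform matrices.
    """
    n = len(coeffs)
    out = []
    for x in range(n):
        acc = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            acc += scale * c * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
        out.append(acc)
    return out


def idct_2d(block):
    """Separable 2D inverse DCT: all columns first, then all rows."""
    n = len(block)
    # Pass 1: every column is independent, so each list element below
    # corresponds to the work of one thread in the text.
    cols = [idct_1d([block[r][c] for r in range(n)]) for c in range(n)]
    # A GPU version would synchronize here before reading the
    # intermediate results held in shared memory.
    # Pass 2: every row of the intermediate matrix is independent.
    return [idct_1d([cols[c][r] for c in range(n)]) for r in range(n)]


dc_only = [[40.0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
residual = idct_2d(dc_only)
```

A block whose only non-zero coefficient is the DC term yields a flat residual block, which is a convenient sanity check for the two-pass structure.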

Embodiment 3

This embodiment describes the HEVC motion-compensation parallel algorithm of the invention.

Inter motion compensation works, briefly, as follows: using the motion vector obtained by parsing the bitstream, the predicted value is obtained from the position it points to in the reference frame. If it points to an integer-pixel position of the reference frame, the value is read directly; if it points to a sub-pixel position, the sub-pixel predicted value is obtained by pixel interpolation. The predicted value is then added to the image residual value obtained through inverse quantization and inverse transform to give the reconstructed value. Within the motion compensation module, pixel interpolation filtering accounts for roughly 70% of the computation, so the GPU implementation of motion compensation in the invention focuses on pixel interpolation. Motion is inherently continuous, but to improve the accuracy of inter prediction during encoding, the motion vectors used when searching for matching blocks have sub-pixel precision: 1/4 pixel for luma motion vectors and 1/8 pixel for chroma motion vectors. Therefore, when a motion vector points to a sub-pixel position of the reference frame, the pixel value at that position must be interpolated from the neighboring pixel values.

The interpolation filter coefficients used for luma sub-pixel positions are shown in Table 1.

Table 1. Interpolation filter coefficients for luma samples

Sub-pixel position	Interpolation filter coefficients
1/4 pixel	{-1,4,-10,58,17,-5,1,0}
2/4 pixel	{-1,4,-11,40,40,-11,4,-1}
3/4 pixel	{0,1,-5,17,58,-10,4,-1}

Fig. 6 shows the integer-pixel positions of luma and the interpolated sub-pixel positions; uppercase letters denote integer pixels and lowercase letters denote sub-pixel samples.

In the HEVC reference software, the parameters xFracL and yFracL determine the pixel position. xFracL is the fractional part of the horizontal component of the motion vector and yFracL is the fractional part of the vertical component; together they identify the sample position in the HEVC standard. When xFracL and yFracL are both 0, the position is an integer-pixel position; all other combinations are sub-pixel positions. Once the position is determined, the corresponding interpolation coefficients and the neighboring pixels in the reference frame are used to interpolate the pixel value at that position. The correspondence between xFracL, yFracL and the pixel positions in Fig. 6 is shown in Table 2.

Table 2. Mapping between xFracL, yFracL and luma sample positions
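As a small illustration of how xFracL and yFracL relate to a motion vector, the sketch below assumes the luma motion vector components are stored as integers in quarter-sample units, so the low two bits are the fractional phase and the remaining bits are the integer offset. The function names are hypothetical.

```python
def split_luma_mv(mv_x, mv_y):
    """Split a quarter-sample luma motion vector into its integer offset
    and its fractional phase (xFracL, yFracL), each phase in 0..3.

    Assumes integer components in quarter-sample units.
    """
    x_int, y_int = mv_x >> 2, mv_y >> 2
    x_frac, y_frac = mv_x & 3, mv_y & 3
    return (x_int, y_int), (x_frac, y_frac)


def needs_interpolation(mv_x, mv_y):
    """True when the vector points to a sub-pixel position."""
    _, (x_frac, y_frac) = split_luma_mv(mv_x, mv_y)
    return x_frac != 0 or y_frac != 0


print(split_luma_mv(5, -3))
```

A thread can use the phase pair to choose between a direct read (both phases 0) and one of the filters in Table 1.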

Luma sample interpolation selects the interpolation coefficients according to the sub-pixel position. The samples $a_{0,0}$, $b_{0,0}$ and $c_{0,0}$, which lie in the same row as the integer pixel, correspond to the 1/4, 2/4 and 3/4 pixel positions and are computed from the coefficients in Table 1 and the integer pixels $A_{-3,0}$, $A_{-2,0}$, $A_{-1,0}$, $A_{0,0}$, $A_{1,0}$, $A_{2,0}$, $A_{3,0}$ and $A_{4,0}$. The variable shift1 equals (BitDepthY - 8), shift2 is set to 6, and shift3 is set to (14 - BitDepthY). The formulas, in which $\gg$ denotes a right shift, are:

$$a_{0,0} = (-A_{-3,0} + 4A_{-2,0} - 10A_{-1,0} + 58A_{0,0} + 17A_{1,0} - 5A_{2,0} + A_{3,0}) \gg \mathrm{shift1}$$

$$b_{0,0} = (-A_{-3,0} + 4A_{-2,0} - 11A_{-1,0} + 40A_{0,0} + 40A_{1,0} - 11A_{2,0} + 4A_{3,0} - A_{4,0}) \gg \mathrm{shift1}$$

$$c_{0,0} = (A_{-2,0} - 5A_{-1,0} + 17A_{0,0} + 58A_{1,0} - 10A_{2,0} + 4A_{3,0} - A_{4,0}) \gg \mathrm{shift1}$$
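The three horizontal formulas are the same 8-tap filter with different coefficient rows from Table 1. A minimal CPU sketch, assuming 8-bit samples by default and a caller that keeps the eight taps inside the row (a real decoder would pad the reference frame), might look like this; `interp_horizontal` is an illustrative name.

```python
# Coefficient rows from Table 1, keyed by the fractional phase in quarters.
LUMA_TAPS = {
    1: (-1, 4, -10, 58, 17, -5, 1, 0),
    2: (-1, 4, -11, 40, 40, -11, 4, -1),
    3: (0, 1, -5, 17, 58, -10, 4, -1),
}


def interp_horizontal(row, x, frac, bit_depth=8):
    """Intermediate value at phase frac/4 to the right of row[x].

    The taps cover row[x-3] .. row[x+4]; the caller must keep that
    window in bounds. The result still carries the filter gain of 64.
    """
    shift1 = bit_depth - 8
    acc = sum(t * row[x - 3 + i] for i, t in enumerate(LUMA_TAPS[frac]))
    return acc >> shift1


flat = [100] * 16
ramp = list(range(16))
print(interp_horizontal(flat, 5, 1), interp_horizontal(ramp, 5, 2))
```

Because each coefficient row sums to 64, a flat region returns 64 times the sample value, and the half-sample filter on a linear ramp returns 64 times the midpoint.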

The samples $d_{0,0}$, $h_{0,0}$ and $n_{0,0}$ correspond to the 1/4, 2/4 and 3/4 positions in the vertical direction; interpolating them requires the integer pixels in the same column, and they are computed as:

$$d_{0,0} = (-A_{0,-3} + 4A_{0,-2} - 10A_{0,-1} + 58A_{0,0} + 17A_{0,1} - 5A_{0,2} + A_{0,3}) \gg \mathrm{shift1}$$

$$h_{0,0} = (-A_{0,-3} + 4A_{0,-2} - 11A_{0,-1} + 40A_{0,0} + 40A_{0,1} - 11A_{0,2} + 4A_{0,3} - A_{0,4}) \gg \mathrm{shift1}$$

$$n_{0,0} = (A_{0,-2} - 5A_{0,-1} + 17A_{0,0} + 58A_{0,1} - 10A_{0,2} + 4A_{0,3} - A_{0,4}) \gg \mathrm{shift1}$$

The values of $a_{0,0}$, $b_{0,0}$, $c_{0,0}$, $d_{0,0}$, $h_{0,0}$ and $n_{0,0}$ can be derived in a single step from the integer pixels and the interpolation filter coefficients; when the motion vector points to one of these positions, the corresponding pixel value is obtained with the formulas above.

The values of the sub-pixel samples at the remaining positions must be computed in two steps.

Computing $e_{0,0}$, $i_{0,0}$ and $p_{0,0}$: first, using the method above for sub-pixel positions that share a row or column with an integer pixel, obtain the values $a_{0,-3}$, $a_{0,-2}$, $a_{0,-1}$, $a_{0,0}$, $a_{0,1}$, $a_{0,2}$, $a_{0,3}$ and $a_{0,4}$; then compute:

$$e_{0,0} = (-a_{0,-3} + 4a_{0,-2} - 10a_{0,-1} + 58a_{0,0} + 17a_{0,1} - 5a_{0,2} + a_{0,3}) \gg \mathrm{shift2}$$

$$i_{0,0} = (-a_{0,-3} + 4a_{0,-2} - 11a_{0,-1} + 40a_{0,0} + 40a_{0,1} - 11a_{0,2} + 4a_{0,3} - a_{0,4}) \gg \mathrm{shift2}$$

$$p_{0,0} = (a_{0,-2} - 5a_{0,-1} + 17a_{0,0} + 58a_{0,1} - 10a_{0,2} + 4a_{0,3} - a_{0,4}) \gg \mathrm{shift2}$$

Computing $f_{0,0}$, $j_{0,0}$ and $q_{0,0}$ likewise requires first obtaining the values $b_{0,-3}$, $b_{0,-2}$, $b_{0,-1}$, $b_{0,0}$, $b_{0,1}$, $b_{0,2}$, $b_{0,3}$ and $b_{0,4}$, and then applying the interpolation filter coefficients as follows:

$$f_{0,0} = (-b_{0,-3} + 4b_{0,-2} - 10b_{0,-1} + 58b_{0,0} + 17b_{0,1} - 5b_{0,2} + b_{0,3}) \gg \mathrm{shift2}$$

$$j_{0,0} = (-b_{0,-3} + 4b_{0,-2} - 11b_{0,-1} + 40b_{0,0} + 40b_{0,1} - 11b_{0,2} + 4b_{0,3} - b_{0,4}) \gg \mathrm{shift2}$$

$$q_{0,0} = (b_{0,-2} - 5b_{0,-1} + 17b_{0,0} + 58b_{0,1} - 10b_{0,2} + 4b_{0,3} - b_{0,4}) \gg \mathrm{shift2}$$

Computing $g_{0,0}$, $k_{0,0}$ and $r_{0,0}$ requires first obtaining the values $c_{0,-3}$, $c_{0,-2}$, $c_{0,-1}$, $c_{0,0}$, $c_{0,1}$, $c_{0,2}$, $c_{0,3}$ and $c_{0,4}$, and then computing:

$$g_{0,0} = (-c_{0,-3} + 4c_{0,-2} - 10c_{0,-1} + 58c_{0,0} + 17c_{0,1} - 5c_{0,2} + c_{0,3}) \gg \mathrm{shift2}$$

$$k_{0,0} = (-c_{0,-3} + 4c_{0,-2} - 11c_{0,-1} + 40c_{0,0} + 40c_{0,1} - 11c_{0,2} + 4c_{0,3} - c_{0,4}) \gg \mathrm{shift2}$$

$$r_{0,0} = (c_{0,-2} - 5c_{0,-1} + 17c_{0,0} + 58c_{0,1} - 10c_{0,2} + 4c_{0,3} - c_{0,4}) \gg \mathrm{shift2}$$
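The two-step positions all follow one pattern: filter eight neighboring rows horizontally with the phase of the column (a, b or c), then filter those eight intermediate values vertically and shift by shift2 = 6. The sketch below is a CPU illustration of that pattern, assuming 8-bit samples by default and in-bounds accesses; `interp_2d` and `filt` are hypothetical names, and the result is the intermediate value before any final normalization.

```python
LUMA_TAPS = {
    1: (-1, 4, -10, 58, 17, -5, 1, 0),
    2: (-1, 4, -11, 40, 40, -11, 4, -1),
    3: (0, 1, -5, 17, 58, -10, 4, -1),
}


def filt(samples, frac):
    """Apply the 8-tap row of Table 1 for phase frac to eight samples."""
    return sum(t * s for t, s in zip(LUMA_TAPS[frac], samples))


def interp_2d(ref, x, y, x_frac, y_frac, bit_depth=8):
    """Two-step interpolation for positions with both phases non-zero.

    ref is indexed as ref[row][column]; the caller keeps the 8x8
    neighborhood around (x, y) inside the frame.
    """
    shift1 = bit_depth - 8
    shift2 = 6
    # Step 1: horizontal filtering on rows y-3 .. y+4.
    inter = [filt(ref[y + dy][x - 3:x + 5], x_frac) >> shift1
             for dy in range(-3, 5)]
    # Step 2: vertical filtering of the eight intermediate values.
    return filt(inter, y_frac) >> shift2


flat_frame = [[100] * 16 for _ in range(16)]
print(interp_2d(flat_frame, 8, 8, 1, 2))
```

On a flat frame both steps have gain 64 and the shift by 6 removes one of them, so the intermediate result stays at 64 times the sample value for every phase pair.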

When the position that the motion vector (MV) points to is exactly an integer pixel of the reference frame, the value of that pixel is the predicted pixel value. In this way the predicted pixel values of the whole inter prediction block can be computed by the individual threads on the GPU from the corresponding MVs and reference frame.

Chroma pixel interpolation follows the same principle as luma, except that chroma uses a 4-tap interpolation filter; it is not described in further detail here. The coefficients used for chroma interpolation are shown in Table 3.

Table 3. Interpolation filter coefficients for chroma samples

Sub-pixel position	Interpolation filter coefficients
1/8 pixel	{-2,58,10,-2}
2/8 pixel	{-4,54,16,-2}
3/8 pixel	{-6,46,28,-4}
4/8 pixel	{-4,36,36,-4}
5/8 pixel	{-4,28,46,-6}
6/8 pixel	{-2,16,54,-4}
7/8 pixel	{-2,10,58,-2}
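The chroma coefficients can be applied in the same way as the luma ones. The sketch below is illustrative: it assumes the 4 taps cover the samples from one position left to two positions right of the integer sample, and that the first-stage shift mirrors the luma case (bit depth minus 8); neither detail is spelled out in the text above. `interp_chroma` is a hypothetical name.

```python
# Coefficient rows from Table 3, keyed by the fractional phase in eighths.
CHROMA_TAPS = {
    1: (-2, 58, 10, -2),
    2: (-4, 54, 16, -2),
    3: (-6, 46, 28, -4),
    4: (-4, 36, 36, -4),
    5: (-4, 28, 46, -6),
    6: (-2, 16, 54, -4),
    7: (-2, 10, 58, -2),
}


def interp_chroma(row, x, frac, bit_depth=8):
    """Intermediate chroma value at phase frac/8 to the right of row[x].

    The taps are assumed to cover row[x-1] .. row[x+2]; the shift is
    assumed to follow the luma rule. The result carries a gain of 64.
    """
    shift1 = bit_depth - 8
    acc = sum(t * row[x - 1 + i] for i, t in enumerate(CHROMA_TAPS[frac]))
    return acc >> shift1


print(interp_chroma(list(range(16)), 5, 4))
```

Every row of Table 3 sums to 64, so the filters preserve flat regions up to that gain, just as the luma filters do.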

In the HEVC standard, motion compensation is performed in units of image blocks, but the basic unit actually processed is the pixel. There is no interdependence between pixels during motion compensation: the predicted pixel value only needs to be computed from the reference-frame position that the MV points to and then added to the residual pixel value to obtain the reconstructed pixel value. The basic unit processed by each thread is a pixel of the prediction unit (PU) prediction block, and each thread computes one predicted pixel value, so the inter prediction pixel values of the whole image block can be computed simultaneously.

The HEVC motion-compensation parallel algorithm of the invention comprises:

(1) First, at the start of decoding, the GPU is initialized and memory is allocated on the GPU for storing the motion vector, the reference frame and the generated predicted pixel value of each pixel in inter prediction mode.

(2) The motion vectors and the corresponding reference frame images are then copied to the device with the cudaMemcpy function, and the cudaBindTexture function is called to bind the reference frames to texture memory. Reads from texture memory are fast, so their cost is negligible, which further improves efficiency.

(3) The threads are configured: one thread ID is assigned to the processing of each predicted pixel value, and device-side global memory is allocated for storing the generated predicted pixel values. The thread grid size is set to Grid(4,4,1) and the thread block size to Block(16,16,1), so that one grid contains 16 blocks and each block contains 256 threads.

(4) The predicted pixel value is obtained from the reference-frame position that the motion vector points to: if it points to an integer-pixel position, the value read directly from that reference-frame position is the predicted pixel value; if it points to a sub-pixel position, the corresponding pixel interpolation filter formula is selected according to the position and evaluated, and the resulting sub-pixel value is the predicted pixel value. Each thread performs the same steps according to its own thread ID to obtain its pixel prediction value.

(5) The predicted pixel value is added to the corresponding residual pixel value to obtain the reconstructed pixel value, and a clipping check is applied to keep the data within the valid range.

(6) The result is copied back to host memory and the device-side memory is freed.
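Step (5) is a per-pixel operation with no dependence between pixels, which is why one thread per pixel suffices. A minimal CPU sketch follows, assuming the prediction has already been normalized to the sample range and that clipping means clamping to [0, 2^bitDepth - 1]; the 8-bit default and the function names are assumptions for illustration.

```python
def clip_sample(value, bit_depth=8):
    """Clamp a reconstructed value to the valid sample range."""
    return max(0, min(value, (1 << bit_depth) - 1))


def reconstruct(pred, resid, bit_depth=8):
    """Add residual to prediction and clip, element by element.

    Each output element depends only on the inputs at the same
    position, so every element could be handled by its own thread.
    """
    return [[clip_sample(p + r, bit_depth) for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(pred, resid)]


recon = reconstruct([[250, 10], [128, 0]], [[10, -20], [5, 0]])
print(recon)
```

The first two entries in the example overflow and underflow the 8-bit range and are clamped to 255 and 0 respectively.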

Compared with the prior art, the invention builds a decoding framework consisting of a CPU and a GPU, moves the inverse transform and motion compensation processing, which have higher decoding complexity, onto the GPU, and designs GPU-based HEVC inverse-transform and motion-compensation parallel algorithms, which effectively improves decoding speed and decoding efficiency.

The above describes preferred embodiments of the invention, but the invention is not limited to these embodiments. Those of ordinary skill in the art can make various equivalent variations or substitutions without departing from the spirit of the invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of this application.

Claims

1. A GPU-based HEVC parallel decoding method, characterized by comprising:

2. The GPU-based HEVC parallel decoding method according to claim 1, characterized in that, in step B, the step in which the GPU processes the transform coefficient matrix with the HEVC inverse-transform parallel algorithm comprises:

3. The GPU-based HEVC parallel decoding method according to claim 2, characterized in that step B13 comprises:

4. The GPU-based HEVC parallel decoding method according to claim 1, characterized in that, in step B, the step in which the GPU uses the HEVC motion-compensation parallel algorithm to obtain the predicted pixel values of the image from the reference-frame positions that the motion vectors point to comprises:

5. The GPU-based HEVC parallel decoding method according to claim 4, characterized in that step S4 is specifically:

6. The GPU-based HEVC parallel decoding method according to claim 5, characterized in that the luma pixel interpolation filter formula is an 8-tap interpolation filter formula and the chroma pixel interpolation filter formula is a 4-tap interpolation filter formula.