CN104125466B

CN104125466B - A GPU-based HEVC parallel decoding method

Info

Publication number: CN104125466B
Application number: CN201410328646.2A
Authority: CN
Inventors: 梁凡; 罗林
Original assignee: National Sun Yat Sen University
Current assignee: National Sun Yat Sen University
Priority date: 2014-07-10
Filing date: 2014-07-10
Publication date: 2017-10-10
Anticipated expiration: 2034-07-10
Also published as: CN104125466A

Abstract

The invention discloses a GPU-based HEVC parallel decoding method, comprising: the GPU performs entropy decoding, reordering and inverse quantization on the bitstream file that has been read, thereby obtaining a transform coefficient matrix, while the GPU also parses the bitstream file to obtain the motion vectors and reference frames; the GPU processes the transform coefficient matrix with an HEVC inverse-transform parallel algorithm to obtain the residual data of the image, while the GPU uses an HEVC motion-compensation parallel algorithm to compute the predicted pixel values of the image from the reference-frame positions pointed to by the motion vectors; the GPU then, in order, sums the residual data and the predicted pixel values of the image, applies deblocking filtering and applies sample adaptive offset (SAO) processing, thereby obtaining the reconstructed image, and copies the pixel values of the reconstructed image into CPU memory. The invention effectively improves decoding speed and decoding efficiency and can be widely applied in the field of video coding and decoding.

Description

A GPU-based HEVC parallel decoding method

Technical field

The present invention relates to the field of video coding and decoding, and in particular to a GPU-based HEVC parallel decoding method.

Background art

With the rapid development of the Internet and mobile communication technology, digital video is moving toward high definition, high frame rates and high compression ratios. Video formats have progressed from 720P to 1080P, and ultra-high-definition digital video at 4Kx2K and even 8Kx4K has appeared in some settings. In video applications, transmission bandwidth and storage space are the most critical resources; storing high-definition video in limited space and transmitting it well over bandwidth-constrained networks is a major challenge. High-definition video offers a better viewing experience, but it necessarily involves a huge volume of data. For example, for 1080P high-definition video with a resolution of 1920x1080 in 4:2:0 format, a single frame amounts to 24.88 Mbit. Video at this definition therefore comes with a sharply higher bit rate. The goal of video coding is to represent video information with as few bits as possible, and the compression efficiency of the currently widespread H.264 coding standard still cannot fully meet the demands of ultra-high-definition video applications.
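The 24.88 Mbit figure can be reproduced with a few lines of arithmetic. The sketch below assumes 8 bits per sample, which the text does not state explicitly; the function name is hypothetical.

```python
def raw_frame_bits(width, height, bit_depth=8):
    """Bits in one uncompressed 4:2:0 frame (bit_depth=8 is an assumption)."""
    luma_samples = width * height
    # 4:2:0 halves each of the two chroma planes horizontally and vertically.
    chroma_samples = 2 * (width // 2) * (height // 2)
    return (luma_samples + chroma_samples) * bit_depth


bits = raw_frame_bits(1920, 1080)
print(bits / 1e6)  # megabits per frame
```

Under that assumption a 1080P frame holds 1.5 samples per pixel position, which is where the per-frame data volume quoted above comes from.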

HEVC (High Efficiency Video Coding) is the next-generation video compression coding scheme developed jointly by ISO/IEC MPEG and ITU-T VCEG. The HEVC standard inherits the coding framework of the existing H.264 video coding scheme, retains some of its coding techniques and improves on others: coding units are larger, the block-based inter/intra prediction modes are more diverse, the interpolation filters are more sophisticated, and so on. Compared with earlier video coding technology, HEVC offers higher compression efficiency, better video quality, better robustness, stronger error resilience and better suitability for transmission over IP networks. Compared with the H.264/AVC coding standard, HEVC doubles the compression efficiency when coding high-definition, high-fidelity video; that is, for the same reconstructed picture quality after decoding, the bit rate of the video stream is reduced by 50%.

However, the drop in bit rate comes at the cost of greater codec complexity. With more complex and more flexible coding techniques, the complexity of HEVC codec software increases greatly, so the time spent compressing and decompressing high-definition video increases as well, and the real-time decoding and playback requirements of applications such as video conferencing and video telephony cannot be met.

Now that high-definition video has become mainstream, a CPU alone clearly cannot decode high-resolution video in real time. A GPU has excellent floating-point performance and powerful parallel computing capability; if the modules of the decoding algorithm that are computationally heavy and highly complex are moved to the GPU, the real-time decoding problem can be solved effectively. At present, however, no GPU-based HEVC video codec scheme has appeared in the industry.

Summary of the invention

To solve the above technical problem, the object of the present invention is to provide a GPU-based HEVC parallel decoding method that decodes quickly and efficiently.

The technical solution adopted by the present invention to solve the technical problem is a GPU-based HEVC parallel decoding method, comprising:

A. The GPU performs entropy decoding, reordering and inverse quantization on the bitstream file that has been read, thereby obtaining a transform coefficient matrix, while the GPU also parses the bitstream file to obtain the motion vectors and reference frames;

B. The GPU processes the transform coefficient matrix with an HEVC inverse-transform parallel algorithm to obtain the residual data of the image, while the GPU uses an HEVC motion-compensation parallel algorithm to compute the predicted pixel values of the image from the reference-frame positions pointed to by the motion vectors;

C. The GPU, in order, sums the residual data and the predicted pixel values of the image, applies deblocking filtering and applies sample adaptive offset (SAO) processing, thereby obtaining the reconstructed image, and copies the pixel values of the reconstructed image into CPU memory.

Further, the step in step B in which the GPU processes the transform coefficient matrix with the HEVC inverse-transform parallel algorithm comprises:

B11. Initialize the GPU and allocate device global memory on the GPU for storing the transform coefficient matrices and the residual data;

B12. Set the thread grid size and the thread block size, and allocate to each transform unit a number of threads and the corresponding thread IDs according to the size of the transform unit;

B13. Read the transform coefficient matrix of each transform unit from device global memory, then, according to the thread IDs, apply to each transform coefficient matrix first a column-parallel one-dimensional IDCT and then a row-parallel one-dimensional IDCT, thereby obtaining the residual data of the whole image block;

B14. Copy the computed residual data of each image block back to CPU memory to obtain the residual data of the whole image, then release the device global memory.

Further, step B13 comprises:

B131. Read the transform coefficient matrix of each transform unit from device global memory;

B132. According to the thread IDs, perform the one-dimensional IDCT on all columns of each transform coefficient matrix simultaneously, obtain the transformed coefficient matrix and hold the result temporarily in the shared memory of the thread block;

B133. According to the thread IDs, perform the one-dimensional IDCT on all rows of the transformed coefficient matrix in shared memory simultaneously, obtain the residual data matrix, and compute the residual data of the whole image block from the residual data matrices.
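Steps B131 to B133 can be illustrated with a sequential CPU reference in Python. This is only a sketch under stated assumptions, not the patented GPU kernel: it uses the 4x4 HEVC core transform matrix, the rounding shifts of 7 and 12 are assumed values for 8-bit video, each loop iteration stands in for one GPU thread, and the alternative transform HEVC applies to 4x4 intra luma blocks is ignored. The function name is hypothetical.

```python
# HEVC 4x4 core (DCT-like) integer transform matrix; row k is basis function k.
M4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def inverse_transform_4x4(coeff, shift_col=7, shift_row=12):
    """Separable 2-D inverse transform; shifts are assumptions for 8-bit video."""
    n = 4
    # Pass 1 (B132): "thread" j transforms column j; tmp plays the role of
    # the thread block's shared memory.
    tmp = [[0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            acc = sum(M4[k][i] * coeff[k][j] for k in range(n))
            tmp[i][j] = (acc + (1 << (shift_col - 1))) >> shift_col
    # Pass 2 (B133): "thread" i transforms row i of the intermediate matrix.
    res = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = sum(tmp[i][k] * M4[k][j] for k in range(n))
            res[i][j] = (acc + (1 << (shift_row - 1))) >> shift_row
    return res


# A block holding only a DC coefficient yields a flat residual block.
dc_only = [[64, 0, 0, 0]] + [[0] * 4 for _ in range(3)]
residual = inverse_transform_4x4(dc_only)
```

The column pass never reads another column and the row pass never reads another row, which is the independence the parallel algorithm relies on.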

Further, the step in step B in which the GPU uses the HEVC motion-compensation parallel algorithm to compute the predicted pixel values of the image from the reference-frame positions pointed to by the motion vectors comprises:

S1. Initialize the GPU and allocate on the GPU storage for the motion vector of each pixel in inter prediction mode, the reference frames and the predicted pixel values;

S2. Copy the motion vectors and the corresponding reference frame images to the device, and bind the reference frame to texture memory;

S3. Configure the threads: assign one thread ID to the processing of each predicted pixel value, and allocate device global memory for storing the predicted pixel values;

S4. Each thread, according to its own thread ID and the reference-frame position pointed to by the motion vector, simultaneously performs either a direct texture read or interpolation filtering, thereby obtaining the pixel prediction of each thread;

S5. Copy the pixel predictions of all threads back to CPU memory, then release the device global memory.

Further, step S4 is specifically:

Each thread, according to its own thread ID and the reference-frame position pointed to by the motion vector, simultaneously performs a direct read or interpolation filtering: if the motion vector of the thread points to an integer pixel position, the pixel value at the reference-frame position pointed to by the motion vector is read directly from texture memory and used as the pixel prediction of the thread; if the motion vector of the thread points to a fractional pixel position, the corresponding luma or chroma pixel interpolation filter formula is selected according to the fractional position and evaluated, thereby obtaining the pixel prediction of the thread.

Further, the luma pixel interpolation filter formula is an 8-tap interpolation filter formula, and the chroma pixel interpolation filter formula is a 4-tap interpolation filter formula.

The beneficial effects of the invention are as follows: a decoding architecture composed of a CPU and a GPU is constructed, the inverse transform and motion compensation, which have high decoding complexity, are moved to the GPU, and a GPU-based HEVC inverse-transform parallel algorithm and an HEVC motion-compensation parallel algorithm are designed, which effectively improves decoding speed and decoding efficiency.

Brief description of the drawings

The invention is further described below with reference to the drawings and embodiments.

Fig. 1 is a flow chart of the steps of the GPU-based HEVC parallel decoding method of the invention;

Fig. 2 is a flow chart of the GPU processing the transform coefficient matrix with the HEVC inverse-transform parallel algorithm in step B of the invention;

Fig. 3 is a flow chart of step B13 of the invention;

Fig. 4 is a flow chart of the GPU using the HEVC motion-compensation parallel algorithm in step B of the invention to compute the predicted pixel values of the image from the reference-frame positions pointed to by the motion vectors;

Fig. 5 is a diagram of the HEVC decoding framework of embodiment one of the invention;

Fig. 6 is a schematic diagram of luma fractional-pixel interpolation in the invention.

Detailed description

Referring to Fig. 1, a GPU-based HEVC parallel decoding method comprises steps A to C as set out above.

Referring to Fig. 2, as a further preferred embodiment, the step in step B in which the GPU processes the transform coefficient matrix with the HEVC inverse-transform parallel algorithm comprises steps B11 to B14 as set out above.

Here the thread grid size is set to Grid(4,4,1) and the thread block size to Block(16,16,1); that is, one Grid is allocated 16 Blocks and each Block is allocated 256 threads. The number of threads actually used is then allocated according to the size of the transform unit. An image block consists of at least one transform unit, and an image consists of at least one image block.
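The thread counts implied by this configuration can be checked with a short Python sketch. The row-major linear thread ID below is a common convention and an assumption here; the text does not say how thread IDs are derived from block and thread indices, and both function names are hypothetical.

```python
GRID = (4, 4, 1)     # blocks per grid, as stated in the text
BLOCK = (16, 16, 1)  # threads per block, as stated in the text


def total_threads(grid=GRID, block=BLOCK):
    """Number of threads launched by one grid."""
    blocks = grid[0] * grid[1] * grid[2]
    per_block = block[0] * block[1] * block[2]
    return blocks * per_block


def global_thread_id(block_idx, thread_idx, grid=GRID, block=BLOCK):
    """Row-major linear ID from 2-D (x, y) block and thread indices (assumed)."""
    gx = block_idx[0] * block[0] + thread_idx[0]
    gy = block_idx[1] * block[1] + thread_idx[1]
    width = grid[0] * block[0]  # threads per row of the whole grid
    return gy * width + gx
```

With these settings one launch provides 16 x 256 = 4096 threads, so a 32x32 transform unit, which needs 32 threads per pass in the scheme described here, occupies only a small part of one block.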

Referring to Fig. 3, as a further preferred embodiment, step B13 comprises steps B131 to B133 as set out above.

Referring to Fig. 4, as a further preferred embodiment, the step in step B in which the GPU uses the HEVC motion-compensation parallel algorithm to compute the predicted pixel values of the image from the reference-frame positions pointed to by the motion vectors comprises steps S1 to S5 as set out above.

As a further preferred embodiment, step S4 is specifically as set out above.

Here, reading the pixel value at the reference-frame position in texture memory pointed to by the motion vector is done by calling the texture fetch function tex2D().

As a further preferred embodiment, the luma pixel interpolation filter formula is an 8-tap interpolation filter formula, and the chroma pixel interpolation filter formula is a 4-tap interpolation filter formula.

The present invention is described in further detail below with reference to specific embodiments.

Embodiment one

Referring to Fig. 5, the first embodiment of the present invention:

The HEVC decoding framework is shown in Fig. 5. HEVC decoding is the inverse of the encoding process. The decoder reads the bitstream file, obtains the bitstream from the NAL (network abstraction layer) and decodes it frame by frame in order. A frame is divided into several largest coding units (LCUs); in raster-scan order, entropy decoding is performed with the LCU as the basic unit, followed by reordering, to obtain the residual coefficients of the corresponding coding unit. The residual coefficients then undergo inverse quantization and inverse transform to give the image residual data. Meanwhile, the decoder generates a prediction block from the header information decoded from the bitstream: in inter prediction mode, a prediction block is generated from the motion vector and the reference frame; in intra prediction mode, a prediction block is generated from the adjacent prediction units. The prediction block data and the residual block data are then summed to give the reconstructed image block data, and finally the image block data passes through deblocking filtering and sample adaptive offset (SAO) processing to give the reconstructed image, which is output.

Motion compensation describes the difference between adjacent frames in terms of their coding relationship; that is, it describes how a block of the earlier reference frame moves to a position in the current frame. The reconstructed value is obtained from the value of the equally sized prediction block in the reference frame pointed to by the motion vector, plus the residual value. Video codecs commonly use this method to reduce temporal redundancy in video sequences. Motion compensation is used for image reconstruction and is an essential key module in video coding and decoding.

In motion compensation, a frame is divided into coding units of different sizes according to the image texture, and prediction units are partitioned on the basis of the coding units; a prediction unit contains one luma block and two chroma blocks. Each inter-coded block is predicted from a block of the same size in a reference picture. The precision of motion compensation is determined by the precision of the motion vector, and it directly affects the quality of the reconstructed image and the size of the bitstream. The motion vector is the amount of translation in the prediction process and is obtained by motion estimation at the encoder. The higher the precision of the motion vector, the more accurate the motion compensation. Interpolation filtering is a key technique in motion compensation. The H.264 standard uses a six-tap Wiener filter, and its motion-compensation precision is 1/4 pixel. HEVC uses a more advanced and efficient interpolation filter, one based on the discrete cosine transform. By comparison, sub-pixel generation in HEVC is more concise and efficient: it needs only one filter formula and a single filtering pass. The luma signal uses an 8-tap DCT-based interpolation filter and the chroma signal uses a 4-tap DCT-based interpolation filter to interpolate pixels. However, the large amount of interpolation computation raises complexity accordingly and can lower coding efficiency. Luma and chroma pixels at fractional positions do not actually exist in the reference frame, so their values must be obtained by pixel interpolation using the interpolation filtering algorithm; this kind of motion compensation is sub-pixel-precision motion compensation.

Embodiment two

This embodiment describes the HEVC inverse-transform parallel algorithm of the invention.

The inverse transform module converts the transform coefficient matrix of the current block into a matrix of residual samples in preparation for the subsequent reconstruction. The inverse transform is carried out after inverse quantization and likewise uses the transform unit (TU) as its basic unit; its input data is the result of inverse quantization. When the HEVC decoder of the invention performs the two-dimensional IDCT, it first performs the one-dimensional IDCT (inverse discrete cosine transform) in the horizontal direction and then the one-dimensional IDCT in the vertical direction; through these matrix multiplications the transform coefficient matrix is converted into a residual data matrix of the same size, completing the conversion from the frequency domain to the spatial domain.

The IDCT operations of different transform blocks in one frame are mutually independent, and when the transform coefficient matrix of a single transform unit undergoes the one-dimensional IDCT in the horizontal direction, the columns are mutually independent, so the columns can be computed in parallel. Similarly, for the one-dimensional IDCT in the vertical direction there is no data dependence between rows, so the rows can also be computed in parallel. The invention allocates a number of threads according to the size of the transform coefficient matrix: one thread is assigned to each column and all columns undergo the one-dimensional IDCT simultaneously; once that is done, one thread is assigned to each row and all rows undergo the one-dimensional IDCT simultaneously, which completes the parallel two-dimensional IDCT of the transform coefficient matrix.

HEVC transform units are 4x4, 8x8, 16x16 or 32x32 in size. The larger the transform unit, the higher the degree of parallelism and the more pronounced the speed-up. For example, for the transform coefficient matrix of a 32x32 transform unit, the one-dimensional IDCT of all 32 columns can be performed simultaneously first; after that completes, the __syncthreads() function is called to synchronize, and then the one-dimensional IDCT of all 32 rows is performed simultaneously. In addition, the transform coefficient matrices of different transform units can be inverse-transformed at the same time. To obtain a better speed-up, the invention directly uses the transform coefficients left in global memory by the parallel inverse quantization. A transform unit contains one luma transform block and two chroma transform blocks, so the luma and chroma inverse transforms must be performed separately; the steps of the two are identical.
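The two-phase pattern with a synchronization point between the column pass and the row pass can be emulated on a CPU with Python threads. This is a sketch only: threading.Barrier stands in for __syncthreads(), a plain list of lists stands in for shared memory, and the 1-D transform is a floating-point orthonormal inverse DCT rather than HEVC's integer transform. All function names are hypothetical.

```python
import math
import threading


def idct_1d(X):
    """Orthonormal floating-point inverse DCT (DCT-III) of a list."""
    n = len(X)
    out = []
    for i in range(n):
        s = 0.0
        for k in range(n):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            s += scale * X[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(s)
    return out


def idct_2d_threaded(coeff):
    n = len(coeff)
    shared = [[0.0] * n for _ in range(n)]  # stands in for shared memory
    result = [[0.0] * n for _ in range(n)]
    barrier = threading.Barrier(n)          # stands in for __syncthreads()

    def worker(tid):
        # Phase 1: thread tid transforms column tid.
        col = idct_1d([coeff[r][tid] for r in range(n)])
        for r in range(n):
            shared[r][tid] = col[r]
        barrier.wait()  # every column must be finished before any row starts
        # Phase 2: thread tid transforms row tid of the intermediate matrix.
        result[tid] = idct_1d(shared[tid])

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Without the barrier, a row thread could read intermediate values that a slower column thread has not yet written, which is why the synchronization call sits between the two passes.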

The GPU-based inverse transform algorithm of the invention comprises:

(1) At the start of decoding, the GPU is initialized and global memory is allocated on the GPU for the residual data produced by the inverse transform; the transform coefficients produced by inverse quantization on the GPU are read directly from global memory.

(2) The number of threads is configured: the thread grid size is Grid(4,4,1) and the thread block size is Block(16,16,1), that is, one Grid is allocated 16 Blocks and each Block is allocated 256 threads; threads are then allocated according to the size of the transform unit.

(3) The transform coefficient matrix of one transform unit is assigned to one thread block for the inverse transform. First, each thread, according to its own thread ID, performs the one-dimensional IDCT on its corresponding column of the transform coefficient matrix, all columns at the same time; the __syncthreads() function is called to synchronize, and the results are held temporarily in the shared memory of the thread block. Then the one-dimensional IDCT is performed simultaneously on every row of the coefficient matrix in shared memory, one thread per row, which completes the two-dimensional IDCT of the transform coefficients and yields the residual data matrix. Performing the inverse transform on the transform coefficient matrices of all transform units simultaneously yields the residual data of the whole image block.

(4) The computed residual data of each image block is copied from GPU global memory to CPU memory, giving the residual data of the whole image.

(5) The global memory allocated during decoding is released.

Embodiment three

This embodiment describes the HEVC motion-compensation parallel algorithm of the invention.

The principle of inter-frame motion compensation is, in brief, to derive the predicted value from the position on the reference frame pointed to by the motion vector obtained from bitstream parsing. If the vector points to an integer pixel position of the reference frame, the value is read directly; if it points to a fractional pixel position, the fractional pixel prediction is obtained by pixel interpolation. The predicted value is then added to the image residual value obtained through inverse quantization and inverse transform to give the reconstructed image value. Within the motion compensation module, pixel interpolation filtering accounts for roughly 70% of the computation, so the GPU implementation of motion compensation in the invention mainly performs pixel interpolation. Motion is inherently continuous, but to improve the accuracy of inter prediction during encoding, motion vectors have fractional-pixel precision when matching blocks are searched: the precision of luma motion vectors is 1/4 pixel and that of chroma motion vectors is 1/8 pixel. Therefore, when a motion vector points to a fractional pixel position of the reference frame, the pixel value at that position must be obtained by interpolating the surrounding pixel values.

The coefficients used for interpolation filtering of luma fractional pixels are shown in Table 1.

Table 1. Coefficients used for luma pixel interpolation filtering

Fractional pixel position	Interpolation filter coefficients
1/4 pixel	{-1,4,-10,58,17,-5,1,0}
2/4 pixel	{-1,4,-11,40,40,-11,4,-1}
3/4 pixel	{0,1,-5,17,58,-10,4,-1}

Fig. 6 shows the integer luma pixel positions and the fractional pixel positions produced by interpolation; positions denoted by uppercase letters are integer pixels, and positions denoted by lowercase letters are sub-pixels.

In the HEVC reference software, the pixel position is determined by the parameters xFracL and yFracL: xFracL is the fractional part of the horizontal component of the motion vector, and yFracL is the fractional part of its vertical component. Together they identify the pixel position in HEVC; when xFracL and yFracL are both 0 the position is an integer pixel position, and otherwise it is a fractional pixel position. Based on the position, the corresponding interpolation coefficients are selected and applied to the neighbouring pixels in the reference frame to obtain the pixel value at that position. The correspondence between xFracL and yFracL and the pixel positions in Fig. 6 is shown in Table 2.
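A small Python sketch shows how the integer and fractional parts can be separated. It assumes the motion vector components are stored as integers in quarter-sample units, which is the usual HEVC convention but is not spelled out in the text; the function names are hypothetical.

```python
def split_luma_mv(mv_x, mv_y):
    """Split a quarter-sample luma MV into integer and fractional parts."""
    x_int, y_int = mv_x >> 2, mv_y >> 2   # whole-pixel displacement
    x_frac, y_frac = mv_x & 3, mv_y & 3   # xFracL and yFracL, each in 0..3
    return (x_int, y_int), (x_frac, y_frac)


def is_integer_position(x_frac, y_frac):
    """True when the MV points at an integer pixel and can be read directly."""
    return x_frac == 0 and y_frac == 0
```

The arithmetic shift rounds toward negative infinity, so a negative component such as -3 splits into integer part -1 and fraction 1, keeping the fraction non-negative.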

Table 2. Mapping of luma pixel positions

Luma pixel interpolation selects the interpolation coefficients according to the fractional pixel position. The samples $a_{0,0}$, $b_{0,0}$ and $c_{0,0}$, which lie in the same row as the integer pixel, correspond to the 1/4, 2/4 and 3/4 pixel positions and are computed from the coefficients in Table 1 and the integer pixels $A_{-3,0}, A_{-2,0}, A_{-1,0}, A_{0,0}, A_{1,0}, A_{2,0}, A_{3,0}, A_{4,0}$. Here the variable shift1 equals (BitDepthY - 8), shift2 is set to 6 and shift3 is set to (14 - BitDepthY). The formulas are:

$$a_{0,0} = (-A_{-3,0} + 4A_{-2,0} - 10A_{-1,0} + 58A_{0,0} + 17A_{1,0} - 5A_{2,0} + A_{3,0}) \gg \mathit{shift1}$$

$$b_{0,0} = (-A_{-3,0} + 4A_{-2,0} - 11A_{-1,0} + 40A_{0,0} + 40A_{1,0} - 11A_{2,0} + 4A_{3,0} - A_{4,0}) \gg \mathit{shift1}$$

$$c_{0,0} = (A_{-2,0} - 5A_{-1,0} + 17A_{0,0} + 58A_{1,0} - 10A_{2,0} + 4A_{3,0} - A_{4,0}) \gg \mathit{shift1}$$
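The three horizontal formulas share one structure, so a single Python function can evaluate any of them from the Table 1 coefficients. This is a CPU sketch under the text's definition shift1 = BitDepthY - 8; the names LUMA_TAPS and luma_interp_h are hypothetical, and the caller is assumed to supply a row with at least three samples of padding to the left of x and four to the right.

```python
# Table 1 coefficients, keyed by the fractional position in quarter samples.
LUMA_TAPS = {
    1: (-1, 4, -10, 58, 17, -5, 1, 0),    # a: 1/4 position
    2: (-1, 4, -11, 40, 40, -11, 4, -1),  # b: 2/4 position
    3: (0, 1, -5, 17, 58, -10, 4, -1),    # c: 3/4 position
}


def luma_interp_h(row, x, frac, bit_depth=8):
    """Intermediate value of a, b or c to the right of integer column x."""
    shift1 = bit_depth - 8
    taps = LUMA_TAPS[frac]
    # The eight taps apply to A[-3] .. A[4], i.e. row[x-3] .. row[x+4].
    acc = sum(t * row[x - 3 + k] for k, t in enumerate(taps))
    return acc >> shift1
```

Each set of taps sums to 64, so the result is an intermediate value scaled by 64 rather than a final pixel value; for a flat row of 100 it returns 6400.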

The samples $d_{0,0}$, $h_{0,0}$ and $n_{0,0}$ correspond to the pixels at the 1/4, 2/4 and 3/4 positions in the vertical direction; interpolating them requires the integer pixels in the same column, and they are computed as:

$$d_{0,0} = (-A_{0,-3} + 4A_{0,-2} - 10A_{0,-1} + 58A_{0,0} + 17A_{0,1} - 5A_{0,2} + A_{0,3}) \gg \mathit{shift1}$$

$$h_{0,0} = (-A_{0,-3} + 4A_{0,-2} - 11A_{0,-1} + 40A_{0,0} + 40A_{0,1} - 11A_{0,2} + 4A_{0,3} - A_{0,4}) \gg \mathit{shift1}$$

$$n_{0,0} = (A_{0,-2} - 5A_{0,-1} + 17A_{0,0} + 58A_{0,1} - 10A_{0,2} + 4A_{0,3} - A_{0,4}) \gg \mathit{shift1}$$

The values of $a_{0,0}$, $b_{0,0}$, $c_{0,0}$, $d_{0,0}$, $h_{0,0}$ and $n_{0,0}$ can be derived in a single step directly from the integer pixels and the interpolation filter coefficients; when a motion vector points to one of these positions, the corresponding pixel value is obtained with the formulas above.

The fractional pixels at the other positions require two steps.

Calculation of $e_{0,0}$, $i_{0,0}$ and $p_{0,0}$: first, using the method given above for fractional positions that lie in the same row or column as an integer pixel, the values $a_{0,-3}, a_{0,-2}, a_{0,-1}, a_{0,0}, a_{0,1}, a_{0,2}, a_{0,3}, a_{0,4}$ are obtained; the following calculations then give the results:

$$e_{0,0} = (-a_{0,-3} + 4a_{0,-2} - 10a_{0,-1} + 58a_{0,0} + 17a_{0,1} - 5a_{0,2} + a_{0,3}) \gg \mathit{shift2}$$

$$i_{0,0} = (-a_{0,-3} + 4a_{0,-2} - 11a_{0,-1} + 40a_{0,0} + 40a_{0,1} - 11a_{0,2} + 4a_{0,3} - a_{0,4}) \gg \mathit{shift2}$$

$$p_{0,0} = (a_{0,-2} - 5a_{0,-1} + 17a_{0,0} + 58a_{0,1} - 10a_{0,2} + 4a_{0,3} - a_{0,4}) \gg \mathit{shift2}$$

Calculation of $f_{0,0}$, $j_{0,0}$ and $q_{0,0}$: the values $b_{0,-3}, b_{0,-2}, b_{0,-1}, b_{0,0}, b_{0,1}, b_{0,2}, b_{0,3}, b_{0,4}$ are likewise obtained first, and the interpolation filter coefficients are then applied as follows:

$$f_{0,0} = (-b_{0,-3} + 4b_{0,-2} - 10b_{0,-1} + 58b_{0,0} + 17b_{0,1} - 5b_{0,2} + b_{0,3}) \gg \mathit{shift2}$$

$$j_{0,0} = (-b_{0,-3} + 4b_{0,-2} - 11b_{0,-1} + 40b_{0,0} + 40b_{0,1} - 11b_{0,2} + 4b_{0,3} - b_{0,4}) \gg \mathit{shift2}$$

$$q_{0,0} = (b_{0,-2} - 5b_{0,-1} + 17b_{0,0} + 58b_{0,1} - 10b_{0,2} + 4b_{0,3} - b_{0,4}) \gg \mathit{shift2}$$

Calculation of $g_{0,0}$, $k_{0,0}$ and $r_{0,0}$: the values $c_{0,-3}, c_{0,-2}, c_{0,-1}, c_{0,0}, c_{0,1}, c_{0,2}, c_{0,3}, c_{0,4}$ are obtained first, and the following calculations then give the results:

$$g_{0,0} = (-c_{0,-3} + 4c_{0,-2} - 10c_{0,-1} + 58c_{0,0} + 17c_{0,1} - 5c_{0,2} + c_{0,3}) \gg \mathit{shift2}$$

$$k_{0,0} = (-c_{0,-3} + 4c_{0,-2} - 11c_{0,-1} + 40c_{0,0} + 40c_{0,1} - 11c_{0,2} + 4c_{0,3} - c_{0,4}) \gg \mathit{shift2}$$

$$r_{0,0} = (c_{0,-2} - 5c_{0,-1} + 17c_{0,0} + 58c_{0,1} - 10c_{0,2} + 4c_{0,3} - c_{0,4}) \gg \mathit{shift2}$$
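The two-step calculation for positions with a fractional part in both directions can be sketched as a horizontal pass over eight rows followed by one vertical pass. This Python version is a CPU reference under the text's definitions of shift1 and shift2; LUMA_TAPS, filt8 and luma_interp_2d are hypothetical names, and the reference frame is assumed to be padded so that all indices are valid.

```python
# Table 1 coefficients, keyed by the fractional position in quarter samples.
LUMA_TAPS = {
    1: (-1, 4, -10, 58, 17, -5, 1, 0),
    2: (-1, 4, -11, 40, 40, -11, 4, -1),
    3: (0, 1, -5, 17, 58, -10, 4, -1),
}


def filt8(samples, frac):
    """Apply the 8-tap filter for the given fraction to eight samples."""
    return sum(t * s for t, s in zip(LUMA_TAPS[frac], samples))


def luma_interp_2d(ref, x, y, x_frac, y_frac, bit_depth=8):
    """Intermediate value at a position with both fractions non-zero."""
    shift1, shift2 = bit_depth - 8, 6
    # Step 1: horizontal filter on the eight rows y-3 .. y+4 gives the
    # a/b/c-type intermediates at vertical offsets -3 .. 4.
    column = [filt8(ref[y + dy][x - 3:x + 5], x_frac) >> shift1
              for dy in range(-3, 5)]
    # Step 2: vertical filter over those eight intermediates.
    return filt8(column, y_frac) >> shift2
```

For example, x_frac = 2 and y_frac = 2 corresponds to $j_{0,0}$. The result is still scaled by 64; a final right shift by shift3 = 14 - BitDepthY would bring it back to the pixel range, although the text does not describe that normalization step, so treating it this way is an assumption.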

When the position pointed to by the motion vector (MV) is exactly an integer pixel on the reference frame, the value of that pixel is the predicted pixel value. In this way, the predicted pixel values of the whole inter prediction block can be computed on the GPU by the individual threads from the corresponding MVs and reference frame.

Chroma pixel interpolation follows the same principle as luma, but chroma uses a 4-tap interpolation filter; it is not described in detail here. The coefficients used for its interpolation are shown in Table 3.

Table 3. Coefficients used for chroma pixel interpolation filtering

Fractional pixel position	Interpolation filter coefficients
1/8 pixel	{-2,58,10,-2}
2/8 pixel	{-4,54,16,-2}
3/8 pixel	{-6,46,28,-4}
4/8 pixel	{-4,36,36,-4}
5/8 pixel	{-4,28,46,-6}
6/8 pixel	{-2,16,54,-4}
7/8 pixel	{-2,10,58,-2}
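Since the text omits the chroma formulas, the following Python sketch shows one way the Table 3 coefficients could be applied in one dimension. It assumes the chroma motion vector component is stored in eighth-sample units, that the four taps apply to the samples at offsets -1, 0, 1 and 2, and that the same shift of (bit depth - 8) is used as for luma; all names are hypothetical.

```python
# Table 3 coefficients, keyed by the fractional position in eighth samples.
CHROMA_TAPS = {
    1: (-2, 58, 10, -2),
    2: (-4, 54, 16, -2),
    3: (-6, 46, 28, -4),
    4: (-4, 36, 36, -4),
    5: (-4, 28, 46, -6),
    6: (-2, 16, 54, -4),
    7: (-2, 10, 58, -2),
}


def split_chroma_mv(mv):
    """Integer and 1/8-sample fractional parts of one chroma MV component."""
    return mv >> 3, mv & 7


def chroma_interp(samples, x, frac, bit_depth=8):
    """Intermediate chroma value at fraction frac to the right of sample x."""
    shift1 = bit_depth - 8  # assumed to mirror the luma definition
    # The four taps apply to samples[x-1], samples[x], samples[x+1], samples[x+2].
    acc = sum(t * samples[x - 1 + k] for k, t in enumerate(CHROMA_TAPS[frac]))
    return acc >> shift1
```

As with luma, every row of taps sums to 64, so a flat region passes through unchanged apart from the factor of 64.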

In the HEVC coding standard, motion compensation is carried out in units of image blocks, but the basic unit actually processed is the pixel. There is no interdependence between pixels during motion compensation: it is only necessary to compute the predicted pixel value from the reference-frame position pointed to by the MV and add it to the residual pixel value to obtain the reconstructed pixel value. The basic unit processed by each thread is a pixel obtained by decomposing the PU prediction block, and one thread computes one predicted pixel value; in this way, the inter prediction pixel values over the whole image block can be computed simultaneously.

The HEVC motion compensation parallel algorithms of the present invention include：

(1) first, in the decoding incipient stage, GPU is initialized, application is used for storing inter prediction mould in GPU The memory space of the corresponding motion vector of each pixel of formula, reference frame and the predicted pixel values of generation.

(2) and then by cudaMemcpy functions by motion vector and corresponding reference frame image equipment end is copied to, together When call cudaBindTexteure functions that reference frame is tied on Texture memory.Speed when Texture memory reads data It hurry up, can be neglected, further increase the efficiency of operation.

(3) thread configuration is carried out, is that the processing of each predicted pixel values distributes a Thread Id number, and opened up in equipment end For storing the global memory space of the predicted pixel values accordingly generated.It is Grid (4,4,1), thread to set thread sizing grid Block size is Block (16,16,1).I.e. one Grid distributes 16 Block, and each Block distributes 256 threads.

(4) The predicted pixel value is obtained from the reference-frame position pointed to by the motion vector. If the motion vector points to an integer-pixel position, the value read directly at that reference-frame position is the predicted pixel value. If it points to a fractional-pixel position, the corresponding fractional-pixel interpolation filter formula is selected according to the position of the fractional pixel, and the computed fractional-pixel value is the predicted pixel value. Each thread executes the same steps according to its own thread ID to obtain its predicted pixel value.

(5) The predicted pixel value is added to the corresponding residual pixel value to obtain the reconstructed pixel value, and the result is range-checked and clipped to the valid range.

(6) The result is copied back to host memory, and the device memory is released.
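
Steps (3) to (5) can be sketched on the CPU as follows. This is a hedged C++ emulation, not the patent's GPU code: the nested loops stand in for the block and thread indices of a Grid(4,4,1) x Block(16,16,1) launch, which together cover a 64x64 pixel region. The names `MotionVector`, `fetchRef`, `reconstructPixel`, and `motionCompensate` are hypothetical, samples are assumed to be 8-bit, the clamped read only loosely imitates a texture fetch, and only integer-pixel motion vectors are handled. A fractional-pixel vector would go through the interpolation filter instead of the direct read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Launch geometry from step (3): Grid(4,4,1) x Block(16,16,1).
constexpr int kGridDim = 4;
constexpr int kBlockDim = 16;
constexpr int kSide = kGridDim * kBlockDim;  // 64 pixels per row and column
constexpr std::size_t kPixels = static_cast<std::size_t>(kSide) * kSide;

// Assumed representation: integer-pixel displacement only.
struct MotionVector { int dx; int dy; };

// Read with coordinates clamped to the frame (stand-in for a texture fetch).
inline int fetchRef(const std::vector<std::uint8_t>& ref, int x, int y) {
    x = std::max(0, std::min(kSide - 1, x));
    y = std::max(0, std::min(kSide - 1, y));
    return ref[y * kSide + x];
}

// Work of one "thread": predict, add the residual, clip to 8 bits.
inline std::uint8_t reconstructPixel(const std::vector<std::uint8_t>& ref,
                                     const std::vector<MotionVector>& mv,
                                     const std::vector<int>& residual,
                                     int x, int y) {
    const int idx = y * kSide + x;
    const int pred = fetchRef(ref, x + mv[idx].dx, y + mv[idx].dy);
    const int recon = pred + residual[idx];
    return static_cast<std::uint8_t>(std::max(0, std::min(255, recon)));
}

// The four loops emulate blockIdx.y, blockIdx.x, threadIdx.y, threadIdx.x.
// Iterations are independent, which is what allows one GPU thread per pixel.
std::vector<std::uint8_t> motionCompensate(const std::vector<std::uint8_t>& ref,
                                           const std::vector<MotionVector>& mv,
                                           const std::vector<int>& residual) {
    assert(ref.size() == kPixels && mv.size() == kPixels &&
           residual.size() == kPixels);
    std::vector<std::uint8_t> out(kPixels);
    for (int by = 0; by < kGridDim; ++by)
        for (int bx = 0; bx < kGridDim; ++bx)
            for (int ty = 0; ty < kBlockDim; ++ty)
                for (int tx = 0; tx < kBlockDim; ++tx) {
                    const int x = bx * kBlockDim + tx;  // global pixel column
                    const int y = by * kBlockDim + ty;  // global pixel row
                    out[y * kSide + x] =
                        reconstructPixel(ref, mv, residual, x, y);
                }
    return out;
}
```

Because no iteration reads another iteration's output, the loop order can be changed freely without affecting the result, which mirrors the independence between pixels that the algorithm relies on.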

Compared with the prior art, the present invention constructs a decoding architecture composed of a CPU and a GPU, moves the inverse transform and motion compensation, which have higher decoding complexity, onto the GPU, and designs a GPU-based HEVC inverse transform parallel algorithm and HEVC motion compensation parallel algorithm, thereby effectively improving decoding speed and decoding efficiency.

The above describes preferred embodiments of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of this application.

Claims

1. A GPU-based HEVC parallel decoding method, characterized in that it comprises:

A. The GPU performs entropy decoding, reordering, and inverse quantization on the read bitstream file to obtain a transform coefficient matrix, while the GPU parses the acquired bitstream file to obtain motion vectors and reference frames;

B. The GPU processes the transform coefficient matrix using the HEVC inverse transform parallel algorithm to obtain the residual data of the image, while the GPU uses the HEVC motion compensation parallel algorithm to obtain the predicted pixel values of the image from the reference-frame positions pointed to by the motion vectors;

C. The GPU successively performs summation of the residual data and the predicted pixel values of the image, deblocking filtering, and sample adaptive offset processing to obtain the reconstructed image, and copies the pixel values of the reconstructed image to CPU memory;

The step in step B in which the GPU uses the HEVC motion compensation parallel algorithm to obtain the predicted pixel values of the image from the reference-frame positions pointed to by the motion vectors comprises:

S1. Initializing the GPU, and allocating storage space on the GPU for the motion vector of each pixel in inter prediction mode, the reference frame, and the predicted pixel values;

S2. Copying the motion vectors and the corresponding reference frame image to the device, and binding the reference frame to texture memory;

S3. Performing thread configuration: assigning a thread ID to the processing of each predicted pixel value, and allocating global memory on the device for storing the predicted pixel values;

S4. Each thread simultaneously performing a direct texture read or interpolation filtering at the reference-frame position indicated by its own thread ID and motion vector, thereby obtaining the predicted pixel value of each thread.

2. The GPU-based HEVC parallel decoding method according to claim 1, characterized in that the step in step B in which the GPU processes the transform coefficient matrix using the HEVC inverse transform parallel algorithm comprises:

B11. Initializing the GPU, and allocating device global memory on the GPU for storing the transform coefficient matrices and the residual data;

B12. Setting the thread grid size and the thread block size, and assigning each transform unit a corresponding number of threads and thread IDs according to the size of the transform unit;

B13. Reading the transform coefficient matrix corresponding to each transform unit from device global memory, then, according to thread ID, successively performing a column-parallel one-dimensional IDCT and a row-parallel one-dimensional IDCT on each transform coefficient matrix to obtain the residual data of the whole image block;

B14. Copying the computed residual data of each image block back to CPU memory to obtain the residual data of the whole image, then releasing the device global memory.

3. The GPU-based HEVC parallel decoding method according to claim 2, characterized in that step B13 comprises:

B132. Performing a one-dimensional IDCT on every column of each transform coefficient matrix simultaneously according to thread ID to obtain the transformed coefficient matrix, and temporarily storing the result of the transform in the shared memory of the thread block;

B133. Performing a one-dimensional IDCT on every row of the transformed coefficient matrix in shared memory simultaneously according to thread ID to obtain the residual data matrix, and computing the residual data of the whole image block from the residual data matrix.

4. The GPU-based HEVC parallel decoding method according to claim 1, characterized in that step S4 is specifically:

Each thread simultaneously performs a direct read or interpolation filtering at the reference-frame position indicated by its own thread ID and motion vector: if the motion vector of the thread points to an integer-pixel position, the pixel value at the reference-frame position pointed to by the motion vector is read directly from texture memory and used as the predicted pixel value of the thread; if the motion vector of the thread points to a fractional-pixel position, the corresponding luma or chroma fractional-pixel interpolation filter formula is selected according to the position of the fractional pixel and evaluated to obtain the predicted pixel value of the thread.

5. The GPU-based HEVC parallel decoding method according to claim 4, characterized in that the luma fractional-pixel interpolation filter formula is an 8-tap interpolation filter formula, and the chroma fractional-pixel interpolation filter formula is a 4-tap interpolation filter formula.