CN104320668B

CN104320668B - SIMD optimization method for the HEVC/H.265 DCT and inverse DCT

Info

Publication number: CN104320668B
Application number: CN201410608208.1A
Authority: CN
Inventors: 张小云; 黎凌宇; 高志勇; 陈立
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2014-10-31
Filing date: 2014-10-31
Publication date: 2017-08-01
Anticipated expiration: 2034-10-31
Also published as: CN104320668A

Abstract

The invention provides a SIMD optimization method for the HEVC/H.265 DCT and inverse DCT. The input data are first pre-processed: the data are loaded from memory into registers, treated as data vectors, and interleaved and rearranged; for the vertical DCT, the data are additionally right-shifted with rounding to fit the limited register bit width and raise computational parallelism. A butterfly operation is then applied to the pre-processed data, computing the sums and differences of corresponding data stage by stage. Next, a dot-product operation computes the sum of products of the intermediate values from the butterfly operation and the corresponding transform coefficients, which gives the product of the input data and the transform matrix. Finally, the matrix-product result is rounded to satisfy the bit-width requirement of the output data and is output. The invention effectively accelerates the DCT and inverse DCT modules of an HEVC/H.265 video encoder on the Tilera platform and achieves a good speed-up.

Description

SIMD optimization method for the HEVC/H.265 DCT and inverse DCT

Technical field

The invention relates to the technical field of video coding, and in particular to a SIMD acceleration method (based on the Tilera platform) for the DCT and inverse DCT of the HEVC/H.265 video coding standard. The HEVC DCT and inverse DCT modules are implemented with the Tilera SIMD instruction set to increase running speed.

Background art

With the growth of video content and the rapid development of video products, the video content industry chain faces growing pressure. Current AVC (Advanced Video Coding) compression can no longer meet the requirements of video transmission, which has prompted more efficient video compression technology. Moreover, the future video market tends toward requirements beyond the coding capability of AVC, such as 3D TV and 4K TV. For 4K TV, even H.264 encoding needs a bit rate of 24-32 Mbit/s, so AVC has become a bottleneck for 4K TV services. Against this background the new video coding standard High Efficiency Video Coding (HEVC) emerged. Work on HEVC dates back to 2004; after about a decade of development a complete committee draft was produced in February 2012, and HEVC formally became an international standard in January 2013. The goal of HEVC is a 50% improvement in coding efficiency over AVC, at 2 to 10 times the complexity of AVC. Future HEVC services mainly target high-definition, ultra-high-definition and 3D TV, whose video data volumes are far larger than before. HEVC also demands a much higher compression ratio, and high-compression algorithms come at the cost of increased algorithmic complexity. Taking both factors into account, HEVC encoders place higher demands on the computing performance of the system.

To reduce HEVC encoder complexity, common approaches include algorithm optimization, instruction-set optimization and parallel optimization. Instruction-set optimization implements computing modules with the instruction set of the computing platform. SIMD (single instruction, multiple data) processes multiple data items in parallel within one instruction cycle; compared with a conventional implementation it greatly reduces the number of instruction cycles and increases running speed while keeping the results accurate. In video coding, SIMD is widely used for data-intensive computation such as sub-pixel interpolation, SAD, DCT/IDCT and residual calculation.

When implementing an HEVC encoder on the Tilera platform, we ported the DCT/IDCT implementation of HM, the HEVC reference software. The HEVC DCT module is considerably more complex than that of H.264. The H.264 transform coefficients are 1 and 2, so the H.264 transform needs only simple shifts and additions. HEVC supports transform blocks from 4x4 to 32x32, and its DCT coefficients are larger and more complicated, which means that the HEVC DCT requires many multiplications and that its intermediate variables take larger values and need a larger bit width. During the vertical DCT the intermediate variables exceed the 16-bit range, reaching 17-19 bits; storing them in 32 bits would greatly reduce the parallelism of the data processing. In existing SIMD implementations of the HEVC DCT and IDCT on Intel and ARM, the vertical DCT is slower than the horizontal DCT.

Summary of the invention

In view of the defects of the prior art, the object of the invention is to provide a SIMD optimization method for the HEVC/H.265 DCT and inverse DCT. The method addresses the high computational complexity and slow encoding speed of the HEVC DCT and IDCT modules implemented in conventional C on the Tilera platform: these modules are implemented with the Tilera SIMD instruction set, which increases running speed.

To achieve the above object, the invention provides a SIMD optimization method for the HEVC/H.265 DCT and inverse DCT, comprising the following steps:

Step 1: load the one-dimensional DCT input data from memory into registers and treat them as vector data;

Step 2: rearrange and combine the vector data and perform the butterfly operation, adding and subtracting the input data stage by stage to compute the intermediate-variable vectors;

Step 3: if the one-dimensional DCT is a horizontal transform, skip directly to step 5;

Step 4: right-shift the intermediate variables with rounding to limit their bit width;

Step 5: compute the dot product of each intermediate-variable vector with the corresponding coefficient vector;

Step 6: rearrange and combine the dot-product results, right-shift them with rounding, and store the output in the target memory.
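The six steps can be traced end to end on a single 4-point row. The following is a minimal scalar sketch, not the Tilera SIMD implementation: the function names, the use of the standard HEVC 4-point coefficients (64, 83, 36), which the text does not list, and the assumption that the output shift is reduced by exactly MIVO after the pre-rounding of step 4 are all illustrative choices.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding right shift, y = (x + (1 << (s - 1))) >> s.
   Right-shifting a negative value is implementation-defined in C;
   an arithmetic shift (as on mainstream compilers) is assumed. */
static int32_t round_shift(int32_t x, int s)
{
    assert(s >= 1 && s < 31);
    return (x + (1 << (s - 1))) >> s;
}

/* Hypothetical scalar model of one 4-point 1-D forward DCT row.
   mivo == 0 models the horizontal pass (step 4 skipped);
   mivo > 0 models the vertical pass with pre-rounding. */
static void dct4_row(const int16_t src[4], int16_t dst[4], int shift, int mivo)
{
    assert(mivo >= 0 && shift - mivo >= 1);

    /* Step 2: butterfly, sums (even part) and differences (odd part). */
    int32_t e0 = src[0] + src[3], e1 = src[1] + src[2];
    int32_t o0 = src[0] - src[3], o1 = src[1] - src[2];

    /* Steps 3 and 4: pre-round the intermediates only in the vertical pass. */
    if (mivo > 0) {
        e0 = round_shift(e0, mivo); e1 = round_shift(e1, mivo);
        o0 = round_shift(o0, mivo); o1 = round_shift(o1, mivo);
    }

    /* Step 5: dot products with the coefficient vectors. */
    int32_t d0 = 64 * e0 + 64 * e1;
    int32_t d1 = 83 * o0 + 36 * o1;
    int32_t d2 = 64 * e0 - 64 * e1;
    int32_t d3 = 36 * o0 - 83 * o1;

    /* Step 6: final rounding; the shift is reduced by mivo (assumption). */
    int out_shift = shift - mivo;
    dst[0] = (int16_t)round_shift(d0, out_shift);
    dst[1] = (int16_t)round_shift(d1, out_shift);
    dst[2] = (int16_t)round_shift(d2, out_shift);
    dst[3] = (int16_t)round_shift(d3, out_shift);
}
```

In this sketch the butterfly halves the number of multiplications per output compared with a direct 4x4 matrix product, which is the saving the SIMD version then parallelizes across lanes.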

Preferably, in step 2, the input data are added and subtracted in parallel to complete the butterfly operation; several stages of parallel addition and subtraction yield the intermediate-variable vectors used for the dot product.

Preferably, in step 4, the one-dimensional DCT applies a rounding pre-processing to the intermediate variables. To preserve parallelism and achieve a good speed-up, for the vertical one-dimensional DCT the intermediate variables are right-shifted with rounding so that their bit width stays within 16 bits. The right shift is computed as in formula (1):

y = (x + (1 << (MIVO - 1))) >> MIVO    (1)

where x is the data to be right-shifted with rounding, MIVO is the maximum order of the intermediate variables (used as the number of bits shifted), and y is the result after the right shift with rounding.

Meanwhile, the input parameter shift is adjusted according to MIVO to compensate for the right-shift rounding of the intermediate variables; shift is the number of bits by which the data are right-shifted with rounding at the DCT output.

Preferably, in step 5, a single dot-product instruction replaces several multiplications and additions, which effectively accelerates the sum of products of the intermediate variables and the corresponding coefficients. Further, the dot product of four intermediate variables and their coefficients is completed directly by one four-element vector dot-product instruction; the dot product of more than four intermediate variables and their coefficients is completed by several four-element vector dot products; and the dot product of two intermediate variables and their coefficients is completed with parallel multiply and add instructions.

Preferably, in step 6, parallel addition, parallel right shift and data rearrangement are used to apply the right-shift rounding to the dot-product results in parallel. The right shift with rounding is computed as in formula (2):

y = (x + offset) >> shift

offset = 1 << (shift - 1)

error = 2^(shift - 1)    (2)

In the formula, x is the data to be right-shifted with rounding and y is the result of the right shift with rounding. offset is the offset used for rounding to nearest and is derived from the shift amount shift. shift is derived from the DCT block size N (4, 8, 16, 32) and the data bit depth B (8 bit, 10 bit): the shift value for the horizontal direction depends on N and B, whereas the shift value for the vertical direction depends on N, so the shift value can be used to determine whether the one-dimensional DCT is horizontal or vertical. error is the maximum error introduced by the right shift with rounding.

Compared with the prior art, the invention has the following beneficial effects:

The method uses the instruction set of the Tilera platform to apply SIMD optimization to the HEVC DCT and IDCT modules, effectively accelerating them on the Tilera platform with negligible performance loss. Experiments show that, compared with an ordinary C implementation, the invention reduces the instruction cycles of the HEVC DCT and IDCT modules on the Tilera platform by 40%-70% on average, while the BD-PSNR (Bjøntegaard delta PSNR) loss is below 0.003.

Brief description of the drawings

Other features, objects and advantages of the invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:

Fig. 1 is a flow chart of the one-dimensional DCT in a preferred embodiment of the invention;

Fig. 2 is a flow chart of the one-dimensional IDCT in a preferred embodiment of the invention;

Fig. 3 is a schematic diagram of computing the intermediate variables by the butterfly operation in a preferred embodiment of the invention;

Fig. 4 is a flow chart of the pre-processing of the intermediate variables in a preferred embodiment of the invention;

Fig. 5 is a schematic diagram of the vector dot-product operation in the butterfly computation of a preferred embodiment of the invention, in which (a) shows the four-element vector dot product and (b) shows the two-element vector dot product;

Fig. 6 is a schematic diagram of the subsequent rounding applied to the butterfly result in a preferred embodiment of the invention;

Fig. 7 is a schematic diagram of the transpose applied to the data in the one-dimensional IDCT of a preferred embodiment of the invention.

Detailed description of the embodiments

The invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit it in any way. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept; these all fall within the scope of protection of the invention.

As shown in Figs. 1 and 2, this embodiment provides a SIMD optimization method (based on the Tilera platform) for the HEVC/H.265 DCT and inverse DCT. The implementation steps are as follows:

Step (1): load the one-dimensional DCT input data from memory into registers and treat them as vector data. Specifically:

The data are read in batches from the memory block holding the input data; several adjacent low-bit-width input values are loaded into one 64-bit register and assembled into a data vector.
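The packing described in step (1) can be modelled in portable C as follows. This is a sketch under assumptions: the helper names pack4x16 and lane16 are hypothetical, lane 0 is placed in the low 16 bits, and the platform's actual 64-bit load is replaced by explicit shifts so that the lane order does not depend on host endianness.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble four adjacent 16-bit samples into one 64-bit "register",
   lane 0 in bits 0..15, lane 3 in bits 48..63 (assumed layout). */
static uint64_t pack4x16(const int16_t src[4])
{
    uint64_t v = 0;
    for (int i = 0; i < 4; ++i)
        v |= (uint64_t)(uint16_t)src[i] << (16 * i);
    return v;
}

/* Read one signed 16-bit lane back out of the packed word. */
static int16_t lane16(uint64_t v, int lane)
{
    assert(lane >= 0 && lane < 4);
    return (int16_t)(uint16_t)(v >> (16 * lane));
}
```

Once the samples sit in one word, a single lane-wise instruction can act on all four of them, which is the source of the parallelism exploited in the later steps.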

Step (2): rearrange and combine the vector data and perform the butterfly operation, adding and subtracting the input data stage by stage to compute the intermediate-variable vectors, as shown in Fig. 3. Specifically:

The HEVC DCT coefficients have strong odd-even symmetry. Exploiting this symmetry, the data are first added and subtracted stage by stage and only then multiplied by the corresponding coefficients, which effectively reduces the number of multiplications; this is the principle of the butterfly operation. The data vectors loaded from memory into registers cannot be added and subtracted directly: the vector data must first be rearranged and combined so that they match the required operation pattern, after which the subsequent computation is carried out. The Tilera platform has a corresponding instruction that supports arbitrary byte-level rearrangement of two 64-bit values.

In the subsequent steps the data must be rearranged and combined repeatedly to match the corresponding operation patterns.

Parallel add and subtract instructions then complete the vector addition and subtraction, so that a single instruction adds or subtracts several data items at once, reducing the instruction cycles and achieving acceleration. As the DCT block size grows, the number of stages needed to compute the intermediate variables also grows; while the intermediate variables are computed stage by stage, the results must be rearranged and combined to match the subsequent operation pattern.
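The rearrange-then-add/subtract pattern of step (2) can be illustrated for a 4-point row held in one 64-bit word. The sketch below emulates lane-wise 16-bit addition and subtraction with plain 64-bit integer arithmetic (a standard SWAR trick) as a stand-in for the platform's parallel add and subtract instructions; all function names are hypothetical and the lane layout (lane 0 in the low bits) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define LANE_MSB 0x8000800080008000ULL  /* top bit of each 16-bit lane */

/* Lane-wise 16-bit add: add the low 15 bits, then patch the top bits
   with XOR so that carries never cross a lane boundary. */
static uint64_t add4x16(uint64_t a, uint64_t b)
{
    return ((a & ~LANE_MSB) + (b & ~LANE_MSB)) ^ ((a ^ b) & LANE_MSB);
}

/* Lane-wise 16-bit subtract, a - b, with borrows confined to each lane. */
static uint64_t sub4x16(uint64_t a, uint64_t b)
{
    return ((a | LANE_MSB) - (b & ~LANE_MSB)) ^ ((a ^ ~b) & LANE_MSB);
}

/* Rearrangement: reverse the four lanes, [x0,x1,x2,x3] -> [x3,x2,x1,x0]. */
static uint64_t reverse4x16(uint64_t v)
{
    v = (v << 32) | (v >> 32);
    return ((v & 0x0000FFFF0000FFFFULL) << 16) |
           ((v >> 16) & 0x0000FFFF0000FFFFULL);
}

/* One butterfly stage: lanes 0 and 1 of *sum hold x0+x3 and x1+x2,
   lanes 0 and 1 of *diff hold x0-x3 and x1-x2. */
static void butterfly4(uint64_t row, uint64_t *sum, uint64_t *diff)
{
    assert(sum != 0 && diff != 0);
    uint64_t rev = reverse4x16(row);
    *sum  = add4x16(row, rev);
    *diff = sub4x16(row, rev);
}
```

Only the two low lanes of each result are needed for a 4-point transform; for larger blocks the same pattern would be repeated on the sums to form the next stage.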

Step (3): if the one-dimensional DCT is a horizontal transform, skip directly to step (5), as shown in Fig. 4. Specifically:

The input parameter shift differs between the horizontal and the vertical one-dimensional DCT, so the value of shift can be used to determine whether the transform is a horizontal or a vertical one-dimensional DCT.
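The direction test can be sketched as follows. The text does not give the shift values themselves; the sketch assumes the convention of the HM reference software, in which the first (horizontal) pass uses shift = log2(N) + B - 9 and the second (vertical) pass uses shift = log2(N) + 6. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed HM convention, not stated in the text:
   horizontal pass: shift = log2(N) + B - 9   (depends on N and B)
   vertical pass:   shift = log2(N) + 6       (depends on N only) */
static int horizontal_shift(int log2_n, int bit_depth)
{
    return log2_n + bit_depth - 9;
}

static int vertical_shift(int log2_n)
{
    return log2_n + 6;
}

/* For B = 8 or 10 the horizontal shift is log2(N) - 1 or log2(N) + 1,
   which never equals log2(N) + 6, so the shift identifies the pass. */
static bool is_horizontal_pass(int shift, int log2_n)
{
    assert(log2_n >= 2 && log2_n <= 5);   /* N = 4, 8, 16, 32 */
    return shift != vertical_shift(log2_n);
}
```

Under this assumption a 32x32 block with 10-bit data uses shift 6 horizontally and shift 11 vertically, so one comparison is enough to choose whether step (4) is executed.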

Step (4): right-shift the intermediate variables with rounding to limit their bit width, as shown in Fig. 4. Specifically:

The two-dimensional DCT consists of one-dimensional DCTs in two directions: horizontal and vertical. In HEVC the horizontal one-dimensional DCT is performed first; its input is the pixel residual, a 9-bit signed number. While the butterfly operation computes the intermediate variables stage by stage, their bit width is guaranteed to stay within 16 bits. The vertical one-dimensional DCT, however, takes the output of the horizontal DCT as input, which is 16 bits wide, so the intermediate variables of its butterfly operation cannot be stored in 16 bits without serious overflow errors. Storing them in 32 bits would reduce parallelism and prevent a good speed-up. To solve this problem the intermediate variables are right-shifted with rounding, discarding the low-order bits, so that they can be stored in 16 bits; parallelism is gained at the cost of a certain error.

The right shift is computed as in formula (1):

y = (x + (1 << (MIVO - 1))) >> MIVO    (1)

where x is the data to be right-shifted with rounding, MIVO is the maximum order of the intermediate variables (used as the number of bits shifted), and y is the result after the right shift with rounding.

In addition, the input parameter shift is adjusted according to MIVO to compensate for the right-shift rounding of the intermediate variables.
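A small numeric check makes the bit-width argument concrete. The sketch assumes, purely for illustration, that MIVO equals the number of bits by which the worst-case intermediate value exceeds 16 bits, and that the output shift is reduced by the same amount; neither choice is spelled out in the text, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Formula (1): rounding right shift by MIVO bits. An arithmetic shift
   is assumed for negative values (implementation-defined in C). */
static int32_t round_shift_mivo(int32_t x, int mivo)
{
    assert(mivo >= 1 && mivo < 31);
    return (x + (1 << (mivo - 1))) >> mivo;
}

/* True if v can be stored in a signed 16-bit lane. */
static int fits_int16(int32_t v)
{
    return v >= INT16_MIN && v <= INT16_MAX;
}

/* Assumed compensation: the final output shift shrinks by MIVO bits. */
static int adjusted_output_shift(int shift, int mivo)
{
    assert(shift > mivo);
    return shift - mivo;
}
```

For example, the sum of eight inputs of value 32767 is 262136, which needs 19 bits as a signed number; with MIVO = 3 it rounds to 32767 and fits a 16-bit lane again, at the price of the three discarded low-order bits.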

Step (5): compute the dot product of each intermediate-variable vector with the corresponding coefficient vector, as shown in (a) and (b) of Fig. 5. Specifically:

The Tilera platform provides a vector dot-product instruction; one instruction performs the multiply-and-sum of several data items and gives a good speed-up.

The dot product of four intermediate variables and their coefficients is completed directly by one four-element vector dot-product instruction. The dot product of more than four intermediate variables and their coefficients is completed by several four-element vector dot products. For the dot product of two intermediate variables and their coefficients, Tilera has no direct instruction, so the invention uses parallel multiply and add instructions.

The dot product of the intermediate variables is the computational core of the DCT and has high multiplicative complexity; all operations performed before the dot product serve to prepare the data so that they match the operand layout of the dot-product instruction.
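The three dot-product cases can be modelled with scalar stand-ins. The functions below are hypothetical emulations, not Tilera intrinsics: dot4x16 plays the role of the four-element dot-product instruction, dot8x16 chains two of them, and dot2x16 uses separate multiplies and an add for the two-element case. Lane 0 is assumed to be in the low 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Signed 16-bit lane of a packed 64-bit word (lane 0 = low bits). */
static int16_t lane16(uint64_t v, int lane)
{
    assert(lane >= 0 && lane < 4);
    return (int16_t)(uint16_t)(v >> (16 * lane));
}

/* Stand-in for one four-element dot-product instruction:
   sum of four 16x16-bit products, accumulated in 32 bits. */
static int32_t dot4x16(uint64_t a, uint64_t b)
{
    int32_t sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += (int32_t)lane16(a, i) * lane16(b, i);
    return sum;
}

/* More than four elements: combine several four-element dot products. */
static int32_t dot8x16(const uint64_t a[2], const uint64_t b[2])
{
    return dot4x16(a[0], b[0]) + dot4x16(a[1], b[1]);
}

/* Two elements: multiply the two low lanes separately, then add. */
static int32_t dot2x16(uint64_t a, uint64_t b)
{
    int32_t p0 = (int32_t)lane16(a, 0) * lane16(b, 0);
    int32_t p1 = (int32_t)lane16(a, 1) * lane16(b, 1);
    return p0 + p1;
}
```

With the row [1, 2, 3, 4] and the coefficient vector [83, 36, -36, -83], dot4x16 gives 83 + 72 - 108 - 332 = -285, the same value that the butterfly form 83*(1-4) + 36*(2-3) produces with half the multiplications.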

Step (6): rearrange and combine the dot-product results, right-shift them with rounding, and store the output in the target memory, as shown in Fig. 6. Specifically:

According to the HEVC specification, the output of the one-dimensional DCT is 16 bits wide, while the result of the fixed-point multiplication is 32 bits wide; storing it as 16 bits requires a rounding operation. The HEVC rounding is computed as in formula (2):

y = (x + offset) >> shift

offset = 1 << (shift - 1)

error = 2^(shift - 1)    (2)

The dot-product results obtained in step (5) are stored as 32-bit data, so one 64-bit register holds two dot-product results. The offset value offset is first added to the dot-product results in parallel, and the results are then right-shifted by shift in parallel, giving the 16-bit output results. The output results are rearranged and combined and finally stored in the output memory, completing the output.
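The final rounding of two packed dot-product results can be sketched as below. The function name and the lane layout (first result in the low 32 bits) are assumptions, and the two lanes are processed one after the other here, whereas the platform's parallel add and shift instructions would handle both at once.

```c
#include <assert.h>
#include <stdint.h>

/* Apply formula (2) to two 32-bit dot-product results held in one
   64-bit word and narrow each to 16 bits. An arithmetic right shift
   is assumed for negative values (implementation-defined in C). */
static void round_narrow2x32(uint64_t dots, int shift, int16_t out[2])
{
    assert(shift >= 1 && shift < 31);
    int32_t lo = (int32_t)(uint32_t)dots;           /* first result  */
    int32_t hi = (int32_t)(uint32_t)(dots >> 32);   /* second result */
    int32_t offset = 1 << (shift - 1);

    /* Add the offset to both lanes, then shift both lanes. */
    out[0] = (int16_t)((lo + offset) >> shift);
    out[1] = (int16_t)((hi + offset) >> shift);
}
```

Adding offset before the shift turns truncation into rounding to nearest, which is why the discarded low-order bits contribute at most half of the output step.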

The above are the detailed steps of the one-dimensional DCT. The one-dimensional IDCT is the inverse of the one-dimensional DCT and follows a similar flow; the difference is that its input data must be transposed, as shown in Fig. 7. In the one-dimensional DCT, the data used for one dot product are contiguous in memory and can be used directly after being loaded into a register. In the one-dimensional IDCT, the input data used for one dot product are not contiguous in memory and cannot be used directly after loading. The invention interleaves, pairwise and stage by stage, the loaded vectors that are adjacent in memory, which ultimately achieves a transpose: the n-th (n = 1, 2, 3, 4) elements of four four-element vectors form the n-th new four-element vector. The transposed data vectors can then be used directly in the subsequent computation.

The invention effectively accelerates the DCT and inverse DCT modules of an HEVC/H.265 video encoder on the Tilera platform, achieves a good speed-up, and reduces HEVC encoder complexity.

Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to these particular embodiments; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.
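The stage-by-stage pairwise interleave can be written out for four vectors of four 16-bit elements. This is an illustrative model in portable C; the interleave helpers are hypothetical stand-ins for the platform's data-rearrangement instructions, and lane 0 is assumed to be in the low 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Interleave the two low 16-bit lanes of a and b: [a0, b0, a1, b1]. */
static uint64_t interleave_lo16(uint64_t a, uint64_t b)
{
    return (a & 0xFFFFULL) | ((b & 0xFFFFULL) << 16) |
           ((a & 0xFFFF0000ULL) << 16) | ((b & 0xFFFF0000ULL) << 32);
}

/* Interleave the two high 16-bit lanes of a and b: [a2, b2, a3, b3]. */
static uint64_t interleave_hi16(uint64_t a, uint64_t b)
{
    return interleave_lo16(a >> 32, b >> 32);
}

/* Interleave at 32-bit granularity: low halves, then high halves. */
static uint64_t interleave_lo32(uint64_t a, uint64_t b)
{
    return (a & 0xFFFFFFFFULL) | (b << 32);
}

static uint64_t interleave_hi32(uint64_t a, uint64_t b)
{
    return (a >> 32) | (b & 0xFFFFFFFF00000000ULL);
}

/* 4x4 transpose in two interleave stages: out[n] collects the n-th
   element of each of the four input vectors. */
static void transpose4x4(const uint64_t in[4], uint64_t out[4])
{
    assert(in != 0 && out != 0);
    uint64_t t0 = interleave_lo16(in[0], in[1]);   /* a0 b0 a1 b1 */
    uint64_t t1 = interleave_hi16(in[0], in[1]);   /* a2 b2 a3 b3 */
    uint64_t t2 = interleave_lo16(in[2], in[3]);   /* c0 d0 c1 d1 */
    uint64_t t3 = interleave_hi16(in[2], in[3]);   /* c2 d2 c3 d3 */
    out[0] = interleave_lo32(t0, t2);              /* a0 b0 c0 d0 */
    out[1] = interleave_hi32(t0, t2);              /* a1 b1 c1 d1 */
    out[2] = interleave_lo32(t1, t3);              /* a2 b2 c2 d2 */
    out[3] = interleave_hi32(t1, t3);              /* a3 b3 c3 d3 */
}
```

Two stages suffice for four vectors because each stage doubles the distance over which elements are brought together: first across neighbouring vectors at 16-bit granularity, then across vector pairs at 32-bit granularity.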

Claims

1. A SIMD optimization method for the HEVC/H.265 DCT and inverse DCT, characterized by comprising the following steps:

Step 2: rearrange and combine the vector data and perform the butterfly operation, adding and subtracting the input data stage by stage to compute the intermediate-variable vectors;

Step 5: compute the dot product of each intermediate-variable vector with the corresponding coefficient vector; the dot product of four intermediate variables and their coefficients is completed directly by one four-element vector dot-product instruction; the dot product of more than four intermediate variables and their coefficients is completed by several four-element vector dot products; the dot product of two intermediate variables and their coefficients is completed with parallel multiply and add instructions;

Step 6: rearrange and combine the dot-product results, right-shift them with rounding, and store the output in the target memory.

2. The SIMD optimization method for the HEVC/H.265 DCT and inverse DCT according to claim 1, characterized in that, in step 2, the input data are added and subtracted in parallel to complete the butterfly operation, and several stages of parallel addition and subtraction yield the intermediate-variable vectors used for the dot product.

3. The SIMD optimization method for the HEVC/H.265 DCT and inverse DCT according to claim 1, characterized in that, in step 4, the one-dimensional DCT applies a rounding pre-processing to the intermediate variables; for the vertical one-dimensional DCT, the intermediate variables are right-shifted with rounding so that their bit width stays within 16 bits.

4. The SIMD optimization method for the HEVC/H.265 DCT and inverse DCT according to claim 3, characterized in that, in step 4, the right shift is computed as in formula (1):

y = (x + (1 << (MIVO - 1))) >> MIVO    (1)

where x is the data to be right-shifted with rounding, MIVO is the maximum order of the intermediate variables, and y is the result after the right shift with rounding;

meanwhile, the input parameter shift is adjusted according to MIVO to compensate for the right-shift rounding of the intermediate variables, shift being the number of bits by which the data are right-shifted with rounding at the DCT output.

5. The SIMD optimization method for the HEVC/H.265 DCT and inverse DCT according to claim 1, characterized in that, in step 6, parallel addition, parallel right shift and data rearrangement are used to apply the right-shift rounding to the dot-product results in parallel.

6. The SIMD optimization method for the HEVC/H.265 DCT and inverse DCT according to claim 5, characterized in that, in step 6, the right shift with rounding is computed as in formula (2):

y = (x + offset) >> shift

offset = 1 << (shift - 1)

error = 2^(shift - 1)    (2)

In the formula, x is the data to be right-shifted with rounding and y is the result of the right shift with rounding; offset is the offset used for rounding to nearest and is derived from the shift amount shift; shift is derived from the DCT block size N (4, 8, 16, 32) and the data bit depth B (8 bit, 10 bit), the shift value for the horizontal direction depending on N and B and the shift value for the vertical direction depending on N; the direction of the one-dimensional DCT, horizontal or vertical, is determined from the shift value; error is the maximum error introduced by the right shift with rounding.