CN104378641A

CN104378641A - Fast SIMD implementation method for HEVC/H.265 sub-pixel interpolation

Info

Publication number: CN104378641A
Application number: CN201410647903.9A
Authority: CN
Inventors: 张小云; 黎凌宇; 高志勇; 陈立
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2014-11-14
Filing date: 2014-11-14
Publication date: 2015-02-25
Anticipated expiration: 2034-11-14
Also published as: CN104378641B

Abstract

The invention provides a fast single instruction, multiple data (SIMD) implementation method for HEVC/H.265 sub-pixel interpolation. First, during motion search, sub-pixel motion vectors are obtained with a simplified, SIMD-implemented fourth-order (4-tap) sub-pixel interpolation module, which gives a marked speed-up at the cost of a slight loss in coding performance. Second, when the residual between the current pixel block and the reference pixel block is computed in motion compensation, the reference pixel block is obtained with the original eighth-order (8-tap) sub-pixel interpolation, which keeps the encoder and the decoder consistent and prevents the decoded pixels from drifting.

Description

Fast SIMD implementation method for HEVC/H.265 sub-pixel interpolation

Technical field

The present invention relates to the technical field of video coding, and in particular to a fast SIMD (Single Instruction, Multiple Data) implementation method for HEVC/H.265 sub-pixel interpolation.

Background technology

With the growth of video content and the rapid development of video products, the video content industry chain is under increasing pressure. The current AVC (Advanced Video Coding) compression technology can no longer meet the requirements of video transmission, and more efficient video compression technology has emerged in response. Moreover, the future video market is moving towards requirements beyond the capability of AVC, such as 3D TV and 4K TV. For 4K TV, encoding with H.264 still requires a bit rate of 24-32 Mbit/s, so AVC has become a bottleneck for the development of 4K TV services. Against this background, the new video coding standard High Efficiency Video Coding (HEVC) was created. The development of HEVC dates back to 2004; after nearly ten years of work, a complete committee draft was formed in February 2012, and HEVC formally became an international standard in January 2013.

The goal of HEVC is a 50% improvement in coding efficiency over AVC, at a complexity 2 to 10 times that of AVC. Future HEVC services are aimed mainly at high definition, ultra high definition and 3D TV, whose data volumes are much larger than those of earlier video. In addition, HEVC is required to raise the compression ratio substantially, and higher compression comes at the cost of higher algorithmic complexity. Taken together, these two factors place higher demands on the computing performance of the system that runs an HEVC encoder.

Common ways of reducing HEVC encoder complexity include algorithm optimization, instruction-set optimization and parallel optimization. Instruction-set optimization implements computing modules with the instruction set of the computing platform. SIMD (Single Instruction, Multiple Data) instructions process several data elements in parallel within one instruction cycle; compared with a conventional implementation they greatly reduce the number of instruction cycles and raise execution speed, while still producing exact results. In video coding, SIMD techniques are widely used for data-intensive computation, for example in the sub-pixel interpolation, SAD, DCT/IDCT and residual calculation modules.

An HEVC encoder was implemented on the Tilera platform by porting the sub-pixel interpolation module of the HEVC reference software HM. The sub-pixel interpolation module is called repeatedly during motion search in order to obtain more accurate motion vectors: the more accurate the motion vector, the smaller the pixel residual and the higher the compression performance. HEVC luma sub-pixel interpolation uses an eighth-order (8-tap) filter, so its computational complexity is very high. In the HEVC encoder developed on the Tilera platform, the sub-pixel interpolation module accounts for 30%-50% of the encoding time, so a method of accelerating this module is urgently needed.

Summary of the invention

In view of the defects of the prior art, the object of the present invention is to provide a fast SIMD implementation method for HEVC/H.265 sub-pixel interpolation. The method addresses the high computational complexity and low encoding speed of the HEVC sub-pixel interpolation module when it is implemented in conventional C on the Tilera platform. It implements the module with the SIMD instruction set of Tilera, which, for the same computed result, reduces the number of instruction cycles the program executes and raises execution speed.

To achieve the above object, the invention provides a fast SIMD implementation method for HEVC/H.265 sub-pixel interpolation, comprising the following steps:

Step 1: load the integer-pixel data required for the sub-pixel interpolation from memory into registers and treat it as vector data;

Step 2: if the interpolation is in the horizontal direction, skip directly to step 3;

Step 3: if the interpolation is in the vertical direction, interleave a group of data vectors pairwise, stage by stage, to transpose the data;

Step 4: rearrange and recombine the vector data according to the integer pixels required by the current sub-pixel interpolation point;

Step 5: perform a dot product of the vector data with the corresponding coefficients, which sums the products of adjacent integer pixels and their coefficients; the order of the sub-pixel interpolation filter is four (4-tap) or eight (8-tap);

Step 6: rearrange and recombine the dot-product results, perform the rounding right shift in parallel, store the results to the output memory, and return to step 2 until all sub-pixel interpolations in the current pixel block are complete;

Step 7: for sub-pixel interpolation that is neither purely horizontal nor purely vertical, first perform the horizontal sub-pixel interpolation and then perform the vertical sub-pixel interpolation on the intermediate results. Steps 1-7 above yield either a 4-tap or an 8-tap sub-pixel interpolation function;

Step 8: call the 4-tap sub-pixel interpolation function in motion search, and call the 8-tap sub-pixel interpolation function in motion compensation.

Preferably, in step 5 the dot product is computed as follows:

$$dotp = \sum_{i=1}^{n} A_i \times C_i \qquad (1)$$

where dotp is the dot-product result; n is the order of the sub-pixel interpolation filter, 4 or 8; A_i is the i-th integer pixel value; and C_i is the corresponding sub-pixel interpolation coefficient.

Preferably, in step 5 a dot-product instruction replaces the repeated multiply and add operations, which effectively accelerates the multiply-and-sum of the integer pixels and their coefficients.

Preferably, in step 5 a dot-product instruction replaces the repeated multiply and add operations, which effectively accelerates the multiply-and-sum of the intermediate variables and their coefficients.

Preferably, in step 6 parallel addition, parallel right shift and data recombination are used to perform the rounding right shift on the dot-product results in parallel.

Preferably, in step 6 the rounding right shift is computed as follows:

result = (dotp + offset) >> shift    (2)

where result is the output after the right shift; shift is the number of bits shifted; and offset is the rounding offset. For interpolation in the horizontal or vertical direction, shift = 6 and offset = 1 << (shift - 1); for interpolation that is neither purely horizontal nor purely vertical, offset is derived separately.

Preferably, in step 8 the simplified 4-tap sub-pixel interpolation function is called in motion search, and the 8-tap sub-pixel interpolation function is called in motion compensation.

Compared with the prior art, the present invention has the following beneficial effects:

The method provided by the invention uses the instruction set of the Tilera platform to apply SIMD optimization to the sub-pixel interpolation module, which effectively accelerates the HEVC sub-pixel interpolation module on the Tilera platform. Verification shows that, compared with a conventional C implementation, the sub-pixel interpolation results are unchanged and there is no performance loss, while the instruction cycles of the sub-pixel interpolation functions are reduced by 40%-80% on average and the encoding time of the HEVC encoder developed on the Tilera platform is reduced by 30%-40% on average.

Brief description of the drawings

Other features, objects and advantages of the present invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:

Fig. 1 is a schematic diagram of HEVC luma sub-pixel interpolation;

Fig. 2 is the SIMD implementation flow proposed by the invention for sub-pixel interpolation in the horizontal or vertical direction;

Fig. 3 is a schematic diagram of loading a data vector from a memory address that is not 8-byte aligned;

Fig. 4 is a schematic diagram of the transposition of eight 8-element vectors proposed by the invention;

Fig. 5 is a schematic diagram of the rearrangement and recombination of data vectors proposed by the invention;

Fig. 6 is a schematic diagram of the byte dot-product instruction used by the invention;

Fig. 7 is a schematic diagram of the byte dual dot-product instruction used by the invention;

Fig. 8 is a schematic diagram of the parallel rounding right shift proposed by the invention;

Fig. 9 is a schematic diagram of how the sub-pixel interpolation functions proposed by the invention are called.

Embodiment

The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to understand the invention further, but do not limit it in any form. It should be noted that those skilled in the art can make a number of variations and improvements without departing from the concept of the invention; all of these fall within the scope of protection of the invention.

Fig. 1 is a schematic diagram of HEVC luma sub-pixel interpolation.

As shown in Fig. 2, the fast SIMD implementation method for HEVC/H.265 sub-pixel interpolation in one embodiment of the present invention comprises the following steps, which in this embodiment may be carried out as follows:

Step 1: load the integer-pixel data required for the sub-pixel interpolation from memory into registers and treat it as vector data; as shown in Fig. 3, specifically:

The integer-pixel data on which the current sub-pixel interpolation depends must be brought from memory into registers. This data is not necessarily 8-byte aligned in memory, so the conventional load instruction for 8-byte aligned addresses cannot be used; if it were, the underlying system would obtain the correct result through an exception handler, which would make the program slower rather than faster. To load one 64-bit register value from an unaligned memory address, the invention first uses the unaligned-load instruction to load two 64-bit register values from the two aligned memory addresses adjacent to the current unaligned address, and then uses an align instruction to recombine these two register values according to the current unaligned address, which yields the register value loaded from the unaligned address. If the sub-pixel interpolation block has m rows and n columns, this produces m rows by n/8 columns of data vectors, each consisting of 8 horizontally adjacent integer pixels.
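The two-load-plus-align idea of step 1 can be modelled in a few lines of Python. This is only an illustrative sketch, not Tilera code: the function names are hypothetical, memory is modelled as a `bytes` object, a 64-bit register as a Python integer, and little-endian byte order is assumed.

```python
MASK64 = (1 << 64) - 1

def load_aligned64(mem, addr):
    """Model of an aligned 64-bit load: the low three address bits are ignored."""
    base = addr & ~7
    return int.from_bytes(mem[base:base + 8], "little")

def load_unaligned64(mem, addr):
    """Build the 64-bit value at an arbitrary address from two aligned loads."""
    lo = load_aligned64(mem, addr)       # aligned word holding the first byte
    hi = load_aligned64(mem, addr + 7)   # aligned word holding the last byte
    shift = (addr & 7) * 8               # misalignment in bits
    # Model of the align step: upper part of lo joined with lower part of hi.
    return ((lo >> shift) | (hi << (64 - shift))) & MASK64

def load_row_vectors(mem, addr, n):
    """Load one row of n pixels (n a multiple of 8) as n/8 eight-pixel vectors."""
    return [load_unaligned64(mem, addr + 8 * k) for k in range(n // 8)]
```

When the address happens to be aligned, both loads hit the same word and the shift is zero, so the function degenerates to a plain aligned load.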

Step 2: if the interpolation is in the horizontal direction, skip directly to step 3;

Step 3: if the interpolation is in the vertical direction, interleave a group of data vectors pairwise, stage by stage, to transpose the data; as shown in Fig. 4, specifically:

Integer-pixel data is stored contiguously in memory in the horizontal direction, so vertically adjacent integer pixels are separated by a memory address stride, and a group of data separated by such a stride cannot be assembled directly into a data vector. To load vertically adjacent integer pixels into data vectors, the invention takes the data vectors obtained in step 1, treats every eight data vectors in the same column as one group, and transposes each group. The transposition interleaves the data vectors pairwise, stage by stage, so that at the end the n-th elements (n = 1, 2, ..., 8) of the eight 8-element vectors form the n-th new 8-element vector; the n-th new data vector then holds the data of column n across the 8 rows.
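One way to realise an 8x8 byte transposition purely with pairwise interleaves is sketched below. The patent does not spell out the exact interleave pattern of Fig. 4, so the pairing used here (vector i with vector i+4, repeated for three stages) is an assumption chosen because it is easy to verify; vectors are modelled as Python lists and the function names are hypothetical.

```python
def interleave(a, b):
    """Interleave two 8-element vectors; return the low and high halves."""
    mixed = []
    for x, y in zip(a, b):
        mixed += [x, y]
    return mixed[:8], mixed[8:]

def transpose8x8(rows):
    """Transpose eight 8-element vectors using three interleave stages."""
    vectors = [list(r) for r in rows]
    for _ in range(3):                   # log2(8) = 3 stages
        stage = []
        for i in range(4):               # pair vector i with vector i + 4
            lo, hi = interleave(vectors[i], vectors[i + 4])
            stage += [lo, hi]
        vectors = stage
    return vectors
```

Each stage rotates the 6-bit (row, column) index of every element by one bit, so after three stages the row and column indices have swapped places, which is exactly a transposition.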

Step 4: rearrange and recombine the vector data according to the integer pixels required by the current sub-pixel interpolation point; as shown in Fig. 5, specifically:

Consider the first and second sub-pixel interpolation points of the first row in the horizontal direction. After steps 1-3, the integer pixels -3 to 4 on which the 1/4 sub-pixel depends are held in a single data vector, whereas the integer pixels -2 to 5 on which the 1+1/4 sub-pixel depends are spread over two adjacent data vectors. To assemble the integer pixels required by the 1+1/4 sub-pixel into one data vector, the invention uses the align instruction of step 1 together with the memory address of the integer pixel at position -2, and recombines the two adjacent integer-pixel data vectors into a vector that holds integer pixels -2 to 5.
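The recombination in step 4 amounts to sliding an 8-byte window across two adjacent registers. The sketch below models that with Python integers; `pack`, `unpack` and `realign` are hypothetical helper names, and little-endian byte order is assumed.

```python
MASK64 = (1 << 64) - 1

def pack(pixels):
    """Pack eight unsigned 1-byte pixels into one 64-bit value."""
    return int.from_bytes(bytes(pixels), "little")

def unpack(vec):
    """Unpack a 64-bit value into a list of eight pixels."""
    return list(vec.to_bytes(8, "little"))

def realign(v_first, v_second, byte_offset):
    """Extract 8 bytes starting byte_offset bytes into v_first, spilling into v_second."""
    combined = v_first | (v_second << 64)
    return (combined >> (8 * (byte_offset & 7))) & MASK64
```

With a vector that starts at integer pixel -3, offset 0 gives the window -3 to 4 used by the 1/4 sub-pixel and offset 1 gives the window -2 to 5 used by the 1+1/4 sub-pixel.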

Step 5: perform a dot product of the vector data with the corresponding coefficients, which sums the products of adjacent integer pixels and their coefficients; specifically:

The integer-pixel values neighbouring the sub-pixel interpolation point are multiplied by the corresponding interpolation filter coefficients, and the products are summed. HEVC luma sub-pixel interpolation inserts three sub-pixel points between two integer-pixel points, at the 1/4, 1/2 and 3/4 positions; these three positions correspond to three groups of interpolation filter coefficients, and each coefficient is a signed 1-byte number.

The dot product is computed as follows:

$$dotp = \sum_{i=1}^{n} A_i \times C_i \qquad (1)$$

For 8-tap sub-pixel interpolation, one group of 8 coefficients is assembled into a 64-bit constant coefficient vector. Applying the Tilera byte dot-product instruction to the data vector assembled for step 5 and the corresponding constant coefficient vector yields one dot-product result, as shown in Fig. 6.
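A scalar model of the 8-tap case is sketched below. The coefficient table is the HEVC luma interpolation filter as defined in the standard (the patent text itself does not list these values), and `pack_coeffs` and `byte_dot_product` are hypothetical names that only imitate what a byte dot-product instruction would compute.

```python
# HEVC luma 8-tap interpolation coefficients (from the HEVC standard).
HEVC_LUMA_8TAP = {
    "1/4": [-1, 4, -10, 58, 17, -5, 1, 0],
    "1/2": [-1, 4, -11, 40, 40, -11, 4, -1],
    "3/4": [0, 1, -5, 17, 58, -10, 4, -1],
}

def pack_coeffs(coeffs):
    """Pack eight signed 1-byte coefficients into one 64-bit constant."""
    return int.from_bytes(bytes(c & 0xFF for c in coeffs), "little")

def byte_dot_product(data_vec, coeff_vec):
    """Sum of unsigned data bytes times signed coefficient bytes."""
    data = data_vec.to_bytes(8, "little")
    coeffs = coeff_vec.to_bytes(8, "little")
    total = 0
    for a, c in zip(data, coeffs):
        signed_c = c - 256 if c >= 128 else c   # reinterpret byte as signed
        total += a * signed_c
    return total
```

Because each coefficient group sums to 64, a flat area of value 100 produces a dot product of 6400, which the later rounding shift by 6 bits turns back into 100.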

For 4-tap sub-pixel interpolation, three corresponding groups of 4-tap sub-pixel coefficients are derived from the characteristics of the 8-tap interpolation coefficients and with reference to the existing literature (the description of sub-pixel interpolation in the standard document of Microsoft's video coding standard VC-1 (WMV9)). Each group of 4-tap coefficients has four elements; the three groups are {-4, 36, 36, -4}, {-4, 53, 18, -3} and {-3, 18, 53, -4}. The invention loads one group of coefficients into both the high 32 bits and the low 32 bits of the constant coefficient vector. Applying the Tilera byte dual dot-product instruction to the data vector assembled for step 5 and the constant coefficient vector then yields two dot-product results, as shown in Fig. 7. Compared with 8-tap interpolation, one 4-tap instruction computes two dot-product results at once and also saves the associated data preparation and subsequent data processing, so the computation is faster.
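The dual dot product can be modelled in the same scalar style. In this sketch the mapping of the three coefficient groups to the 1/2, 1/4 and 3/4 positions is inferred from their symmetry and is therefore an assumption, as are the function names; the coefficient values themselves are the ones given in the text.

```python
# 4-tap coefficient groups from the text; the position labels are inferred.
FOUR_TAP = {
    "1/2": [-4, 36, 36, -4],
    "1/4": [-4, 53, 18, -3],
    "3/4": [-3, 18, 53, -4],
}

def pack_coeffs_twice(coeffs4):
    """Place the same four signed coefficients in the low and high 32 bits."""
    one = bytes(c & 0xFF for c in coeffs4)
    return int.from_bytes(one + one, "little")

def byte_dual_dot_product(data_vec, coeff_vec):
    """Return two 4-element dot products: low four bytes and high four bytes."""
    data = data_vec.to_bytes(8, "little")
    coeffs = [c - 256 if c >= 128 else c for c in coeff_vec.to_bytes(8, "little")]
    low = sum(data[i] * coeffs[i] for i in range(4))
    high = sum(data[i] * coeffs[i] for i in range(4, 8))
    return low, high
```

For example, with the pixels 10, 20, 30, 40 in the low half and a flat 100 in the high half, the half-pel group gives 1600 and 6400, that is 25 and 100 after the shift by 6 bits.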

Step 6: rearrange and recombine the dot-product results, perform the rounding right shift in parallel, store the results to the output memory, and return to step 2 until all sub-pixel interpolations in the current pixel block are complete; as shown in Fig. 8, specifically:

Integer-pixel values are unsigned 1-byte numbers, and the result of sub-pixel interpolation should also be an unsigned 1-byte number, so the dot-product results obtained in step 5 must be right-shifted with rounding to meet the data width required for sub-pixels.

The rounding right shift is computed as follows:

result = (dotp + offset) >> shift    (2)

where result is the output after the right shift; shift is the number of bits shifted; and offset is the rounding offset. For interpolation in the horizontal or vertical direction, shift = 6 and offset = 1 << (shift - 1). For interpolation that is neither purely horizontal nor purely vertical, the value of offset differs from 1 << (shift - 1) and must be derived separately.
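Formula (2) for the horizontal or vertical case can be checked with a short scalar sketch. The clipping to the range 0-255 is an assumption added here so that the result fits an unsigned byte (the text only states that the result is an unsigned 1-byte number), and the function names are hypothetical.

```python
def round_shift(dotp, shift=6):
    """Formula (2) with offset = 1 << (shift - 1)."""
    offset = 1 << (shift - 1)
    return (dotp + offset) >> shift

def clip_u8(value):
    """Assumed saturation to the unsigned 1-byte range."""
    return max(0, min(255, value))

def finish(dot_products):
    """Round, clip and pack a list of dot-product results into output bytes."""
    return bytes(clip_u8(round_shift(d)) for d in dot_products)
```

A SIMD version would apply the same add and shift to several dot-product results held in one register; the scalar loop here only shows the arithmetic.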

Step 7: for sub-pixel interpolation that is neither purely horizontal nor purely vertical, first perform the horizontal sub-pixel interpolation and then perform the vertical sub-pixel interpolation on the intermediate results; specifically:

First, horizontal interpolation is performed following steps 1 to 6, with the difference that in step 6 the rounding right shift is replaced by a parallel subtraction, and the results are kept as 16-bit unsigned numbers rather than being written to the target memory address;

Then vertical interpolation is performed following steps 1 to 6, with the difference that the data elements being processed are 16-bit unsigned numbers. The instructions used in steps 3, 4, 5 and 6 must be adjusted accordingly, using double-byte SIMD instructions instead of byte SIMD instructions, so the degree of parallelism drops slightly; the shift and rounding constants in step 6 are also different.
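The two-pass scheme can be illustrated for a single output sample. The text does not give the constants for this case, so the values below are assumptions: a bias of 8192 subtracted after the first pass, and a second-pass shift of 12 with an offset that restores the bias and rounds. They resemble what the HM reference software uses for 8-bit video and are chosen so that the two passes compose to a correctly rounded result. All names are hypothetical.

```python
HALF_8TAP = [-1, 4, -11, 40, 40, -11, 4, -1]   # HEVC luma half-pel filter
INTERNAL_OFFSET = 1 << 13                      # assumed bias (8192)

def filter_1d(samples, coeffs):
    """Plain dot product of samples and coefficients."""
    return sum(a * c for a, c in zip(samples, coeffs))

def interpolate_2d(block, h_coeffs, v_coeffs):
    """Interpolate one sample from an 8x8 block of 8-bit pixels."""
    # Pass 1 (horizontal): no shift, only a subtraction; results fit in 16 bits.
    intermediate = [filter_1d(row, h_coeffs) - INTERNAL_OFFSET for row in block]
    # Pass 2 (vertical): shift by 6 + 6 bits; offset rounds and removes the bias.
    shift = 12
    offset = (1 << (shift - 1)) + (INTERNAL_OFFSET << 6)
    value = (filter_1d(intermediate, v_coeffs) + offset) >> shift
    return max(0, min(255, value))
```

Since each filter sums to 64, a flat block of value p yields 64p - 8192 after the first pass and 4096p + 2048 before the final shift, which rounds back to p.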

Step 8: call the simplified 4-tap sub-pixel interpolation function in motion search, and call the 8-tap sub-pixel interpolation function in motion compensation.

Steps 1-7 above describe a complete sub-pixel interpolation process. The only difference between 4-tap and 8-tap interpolation lies in step 5: if step 5 uses the byte dot-product instruction, steps 1-7 describe an 8-tap sub-pixel interpolation function; if step 5 uses the byte dual dot-product instruction, steps 1-7 describe a 4-tap sub-pixel interpolation function.

As shown in Fig. 9, during motion search the simplified, SIMD-implemented 4-tap sub-pixel interpolation function is used to obtain the sub-pixel motion vector, trading a slight loss in coding performance for a marked speed-up. Then, when motion compensation computes the residual between the current pixel block and the reference pixel block, the original 8-tap sub-pixel interpolation is used to obtain the reference pixel block, which keeps the encoder and the decoder consistent and prevents the decoded pixels from drifting.
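The division of labour in step 8 can be sketched on a single row of pixels: the cheap 4-tap filter ranks the fractional positions, and the standard 8-tap filter produces the prediction that is actually subtracted. This is a toy model under stated assumptions: the search covers only the four fractional offsets at one integer position, the cost is a plain SAD, the mapping of the 4-tap groups to positions is inferred, and all names are hypothetical.

```python
# Fractional position in quarter-pel units -> filter coefficients.
TAPS4 = {1: [-4, 53, 18, -3], 2: [-4, 36, 36, -4], 3: [-3, 18, 53, -4]}
TAPS8 = {1: [-1, 4, -10, 58, 17, -5, 1, 0],
         2: [-1, 4, -11, 40, 40, -11, 4, -1],
         3: [0, 1, -5, 17, 58, -10, 4, -1]}

def interpolate(ref, x, frac, taps):
    """Sample ref at position x + frac/4 with the given filter table."""
    if frac == 0:
        return ref[x]
    coeffs = taps[frac]
    start = x - (len(coeffs) // 2 - 1)      # leftmost pixel the filter touches
    dotp = sum(ref[start + i] * c for i, c in enumerate(coeffs))
    return max(0, min(255, (dotp + 32) >> 6))

def predict(ref, x0, width, frac, taps):
    return [interpolate(ref, x0 + k, frac, taps) for k in range(width)]

def sad(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def search_then_compensate(cur, ref, x0):
    """Pick the fractional offset with 4-tap filters, then predict with 8-tap."""
    best_frac = min(range(4),
                    key=lambda f: sad(cur, predict(ref, x0, len(cur), f, TAPS4)))
    prediction = predict(ref, x0, len(cur), best_frac, TAPS8)
    residual = [c - p for c, p in zip(cur, prediction)]
    return best_frac, prediction, residual
```

Because the decoder only ever runs the 8-tap filter, using it for the final prediction keeps both sides in step even though the search itself used the approximate filter.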

The method provided by the invention uses the instruction set of the Tilera platform to apply SIMD optimization to the sub-pixel interpolation module and its simplified variant, which effectively accelerates the HEVC sub-pixel interpolation module on the Tilera platform. Verification shows that, compared with a conventional C implementation, the sub-pixel interpolation results are unchanged and there is no performance loss, while the instruction cycles of the sub-pixel interpolation functions are reduced by 40%-80% on average and the encoding time of the HEVC encoder developed on the Tilera platform is reduced by 30%-40% on average.

Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to these particular embodiments; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. A fast SIMD implementation method for HEVC/H.265 sub-pixel interpolation, characterized by comprising the following steps:

Step 2: if the interpolation is in the horizontal direction, skip directly to step 3;

2. The fast SIMD implementation method for HEVC/H.265 sub-pixel interpolation according to claim 1, characterized in that, in step 1, the integer-pixel data on which the current sub-pixel interpolation depends is brought from memory into registers; to load one 64-bit register value from an unaligned memory address, the unaligned-load instruction is first used to load two 64-bit register values from the two aligned memory addresses adjacent to the current unaligned address, and an align instruction is then used to recombine these two register values according to the current unaligned address, yielding the register value loaded from the unaligned address; if the sub-pixel interpolation block has m rows and n columns, this produces m rows by n/8 columns of data vectors, each consisting of 8 horizontally adjacent integer pixels.

3. The fast SIMD implementation method for HEVC/H.265 sub-pixel interpolation according to claim 2, characterized in that, in step 3, to load vertically adjacent integer pixels into data vectors, the data vectors obtained in step 1 are used, every eight data vectors in the same column are treated as one group, and each group is transposed by interleaving the data vectors pairwise, stage by stage, so that at the end the n-th elements of the eight 8-element vectors form the n-th new 8-element vector, n = 1, 2, ..., 8; the n-th new data vector then holds the data of column n across the 8 rows.

4. The fast SIMD implementation method for HEVC/H.265 sub-pixel interpolation according to claim 1, characterized in that, in step 4, the align instruction of step 1 is used together with the memory address of the integer pixel at the relevant position to recombine two adjacent integer-pixel data vectors into a vector that holds the required integer-pixel data.

5. The fast SIMD implementation method for HEVC/H.265 sub-pixel interpolation according to claim 1, characterized in that, in step 5, the integer-pixel values neighbouring the sub-pixel interpolation point are multiplied by the corresponding interpolation filter coefficients and the products are summed; HEVC luma sub-pixel interpolation inserts three sub-pixel points between two integer-pixel points, at the 1/4, 1/2 and 3/4 positions, these three positions correspond to three groups of interpolation filter coefficients, and each coefficient is a signed 1-byte number; the dot product is computed as follows:

$$dotp = \sum_{i=1}^{n} A_i \times C_i$$

6. The fast SIMD implementation method for HEVC/H.265 sub-pixel interpolation according to claim 5, characterized in that, in step 5, for 8-tap sub-pixel interpolation, one group of 8 coefficients is assembled into a 64-bit constant coefficient vector, and applying the Tilera byte dot-product instruction to the data vector assembled for step 5 and the corresponding constant coefficient vector yields one dot-product result;

for 4-tap sub-pixel interpolation, three corresponding groups of 4-tap sub-pixel coefficients are obtained, each group having four elements, the three groups being {-4, 36, 36, -4}, {-4, 53, 18, -3} and {-3, 18, 53, -4}; one group of coefficients is loaded into both the high 32 bits and the low 32 bits of the constant coefficient vector; applying the Tilera byte dual dot-product instruction to the data vector assembled for step 5 and the constant coefficient vector yields two dot-product results.

7. The fast SIMD implementation method for HEVC/H.265 sub-pixel interpolation according to any one of claims 1-6, characterized in that, in step 6, parallel addition, parallel right shift and data recombination are used to perform the rounding right shift on the dot-product results in parallel, the rounding right shift being computed as follows:

result = (dotp + offset) >> shift