KR20090075335A

KR20090075335A - The technology and method to improve on encoding rates for high complexity coding algorithm which based on hardware acceleration

Info

Publication number: KR20090075335A
Application number: KR1020080001155A
Authority: KR
Inventors: 최중인; 이재규
Original assignee: 최중인; 이재규
Priority date: 2008-01-04
Filing date: 2008-01-04
Publication date: 2009-07-08

Abstract

A technique and method for improving the H.264 codec encoding rate through a hardware-acceleration-based high-complexity coding algorithm are provided in order to implement a high-quality, high-efficiency H.264 encoding solution. In the common part, a more accurate motion vector search reduces the bit rate. Using SEA (Successive Elimination Algorithm) reduces the SAD computation while keeping the accuracy of ESA (Exhaustive Search Algorithm). The trade-off strategy sacrifices some quality to increase parallelism and improve the encoding speed. The best speed strategy is a mode that sacrifices some quality and gives priority to speed; it applies a fast search algorithm suited to parallel processing.

Description

H.264 Codec Encoding Rates for High Complexity Coding Algorithm which based on Hardware Acceleration

The present invention relates to a technique and method for improving the H.264 codec encoding rate through a hardware-acceleration-based high-complexity coding algorithm. More specifically, it relates to a method that analyzes the share of time taken by each encoding submodule through profiling of an H.264 encoder and accelerates the most time-critical parts first using a GPU. By using a board or co-processor that provides highly parallel processing separately from CPU resources, the simple but computation-heavy parts of H.264 encoding are processed efficiently, yielding a high-quality, high-efficiency H.264 encoding solution.

Video compression technology has improved in performance through MPEG1, MPEG2, H.263, MPEG4, and H.264, building on the basic techniques of Motion Estimation, Transform, Quantization, and Entropy Coding established since H.261. International standardization has ensured interoperability in video stream processing.

However, even H.264, the best of the existing video compression technologies, encodes slowly, which limits its use in real-time broadcasting systems and server-based transcoding systems that are very sensitive to encoding speed. Existing H.264 encoders have tried to improve speed by refining the algorithms for complex operations, but this approach has limits, so in practice they gain speed by giving up some image quality. In addition, wireless video services that mobile carriers provide through mobile phones are delivered at small screen sizes such as QCIF (176x144) or QVGA (320x240) because of the limited size of the phone screen. Such small screens demand even better image quality, yet because existing H.264 encoders gain encoding speed by degrading image quality, there is a limit to how far service quality can be improved.

The goal of the present invention is to build a high-quality, high-efficiency H.264 encoding solution by using a board or co-processor that provides highly parallel processing separately from CPU resources to efficiently process the simple but computation-heavy parts of H.264 encoding.

A GPU (Graphics Processing Unit) was chosen as the co-processor that assists the CPU with high-speed parallel processing. The GPU is a processing unit built into a graphics card. It is a highly parallel device that, in addition to 2D rendering, contains many Vertex Processors and Fragment Processors for 3D rendering. A GPU can process many 3D objects at 30 to 50 fps, as in real-time first-person games. Backed by the huge game market, GPU technology is advancing very rapidly. The GPU removes complex control logic and caches and maximizes parallel computation, which makes it well suited to the endlessly repeated simple operations performed on multimedia data. GPUs are also very inexpensive and easy to procure, and they offer further advantages such as SLI (Scalable Link Interface), which connects several GPUs to multiply their processing capacity.

In the present invention, the share of time taken by each encoding submodule is analyzed through profiling of the H.264 encoder, and the most time-critical parts are accelerated first using the GPU. Submodules that are easy to parallelize are also targeted.

All routines produced through the present invention are developed with types and names matched as closely as possible to the x264 code, which is the target of future integration. Porting to x264 can therefore proceed without great difficulty, and the routines can be combined with the many transcoder and encoder solutions that operate in conjunction with x264. This makes them applicable in various areas, such as development of an encoding server solution using SLI, application to the VC-1 codec, application to the H.263 codec, and HDTV-class real-time encoding solutions.

In the case of H.263, the algorithm is simple, so there is much room for parallelization. H.263 is currently used as the mandatory codec for video telephony, and because automated answering services and video call centers based on video telephony are expected to become active, speeding up the H.263 encoder is urgently needed. In the case of an encoding server solution using SLI, it becomes possible to develop a server platform that performs encoding smoothly even when several encoding requests arrive simultaneously on a Tesla Platform equipped with four GPUs connected through SLI.

The VC-1 codec is, along with H.264, currently the most promising video compression technology. Both have been selected as standards for HD DVD and are used as video codecs for IPTV. Applying the GPU to the VC-1 codec can provide an important backend for the new media market, including Silverlight, and allows expansion in various directions. In addition, ultra-fast algorithms can be devised for processing HDTV-class video in real time, and graphics-card-level parallelism can be studied using SLI and similar technologies.

The present invention relates to a technique and method for improving the H.264 codec encoding rate through a hardware-acceleration-based high-complexity coding algorithm, and was carried out in four parts: Common Part, Best Quality Strategy, Trade-Off Strategy, and Best Speed Strategy.

First, in the Common Part: whereas existing video codecs perform Motion Estimation in 1/2-pixel units, H.264 performs it in 1/4-pixel units, which finds a more accurate Motion Vector and thereby reduces the bit rate further. When encoding of one frame is complete, the frame enters the DPB (Decoded Picture Buffer) of reference frames in decoded form. At this point, a quarter-pel resolution frame is generated in advance through interpolation for Sub-pel Motion Estimation. Sub-pel Reference Frame Building is a typical parallelizable problem, and a GPU implementation can perform it more than 10 times faster than SIMD code. In addition, H.264 supports Variable Block Size Motion Estimation, in which each block is subdivided into 16x16, 8x16, 16x8, 8x8, 4x8, 8x4, and 4x4 partitions. Because the SAD values of the sub-blocks are parts of the SAD of the whole block, computing SADs in 4x4 units in advance allows the SADs of larger blocks such as 8x4 and 4x8 to be obtained by combining them, with less computation. This approach is called SAD Reuse.

For the 4x4 SAD Surface, threads are arranged over a 16x16 target region as shown in FIG. 2, and a 40x40 region is copied from the Texture into the Shared Memory of the Block to allow for padding. Each thread is responsible for one position in the 16x16 region and iterates over a 4x4 area to sum the SAD. The SAD Surface obtained this way is stored in Global Memory and reused to obtain the SADs of larger blocks in the manner shown in FIG. 3.
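The SAD Reuse idea can be illustrated with a small CPU-side sketch. It is not the GPU kernel described above. It only shows, for one 16x16 macroblock and one candidate position, how a grid of 4x4 SADs can be summed into the SADs of larger partitions. The function names `sad_4x4_grid`, `block_sad`, and `partition_sads` are hypothetical, and frames are assumed to be plain lists of pixel rows.

```python
def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid holding the SAD of each 4x4 sub-block of a
    16x16 macroblock. cur and ref are 16x16 lists of pixel rows."""
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def block_sad(grid, cell_y, cell_x, cells_h, cells_w):
    """SAD of a larger block, obtained by adding precomputed 4x4 SADs.
    Coordinates and sizes are given in units of 4x4 cells."""
    return sum(grid[y][x]
               for y in range(cell_y, cell_y + cells_h)
               for x in range(cell_x, cell_x + cells_w))


def partition_sads(grid, width, height):
    """SADs of all width x height partitions (in pixels, multiples of 4)
    that tile the macroblock, listed in raster order."""
    cells_w, cells_h = width // 4, height // 4
    return [block_sad(grid, cy, cx, cells_h, cells_w)
            for cy in range(0, 4, cells_h)
            for cx in range(0, 4, cells_w)]
```

With this layout every pixel difference is computed once, and each partition size (16x16, 16x8, 8x16, 8x8, 8x4, 4x8) costs only a few additions per candidate.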

Next, in the Best Quality Strategy: to guarantee the best image quality in H.264 encoding, both the accuracy of Motion Estimation and the accuracy of Mode Decision through Rate-Distortion Optimization must be ensured. The best image quality requires accurate Motion Estimation, but most commercial H.264 encoders apply a Fast Search Algorithm, which finds a sub-optimal solution quickly rather than the optimal one. Finding the optimal solution requires ESA (Exhaustive Search Algorithm), but its computational load is so high that, even though it yields the exact Motion Vector, encoding becomes too slow to be practical. With SEA (Successive Elimination Algorithm), however, the amount of SAD computation can be reduced by as much as 90% or more while keeping the same accuracy as ESA.
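The reason SEA keeps the accuracy of ESA can be sketched as follows. SEA relies on the inequality |sum(current block) - sum(candidate block)| <= SAD(current, candidate), so a candidate whose lower bound already reaches the best SAD found so far can be skipped without changing the result. The code below is a minimal sketch under simplifying assumptions: the function names are hypothetical, block sums are recomputed for each candidate (a practical implementation would precompute them), and candidates outside the frame are ignored.

```python
def block_sum(frame, top, left, n):
    """Sum of the pixels in the n x n block at (top, left)."""
    return sum(frame[top + y][left + x] for y in range(n) for x in range(n))


def sad(cur, ref, cy, cx, ry, rx, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
               for y in range(n) for x in range(n))


def full_search(cur, ref, cy, cx, n, rng, use_sea):
    """Search every displacement within +/- rng. With use_sea=True,
    candidates whose lower bound cannot beat the current best are skipped.
    Returns (best_mv, best_sad, number_of_sads_evaluated)."""
    cur_sum = block_sum(cur, cy, cx, n)
    best_sad, best_mv, evaluated = float("inf"), None, 0
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            ry, rx = cy + dy, cx + dx
            if ry < 0 or rx < 0 or ry + n > len(ref) or rx + n > len(ref[0]):
                continue  # simplification: no padding outside the frame
            if use_sea and abs(cur_sum - block_sum(ref, ry, rx, n)) >= best_sad:
                continue  # lower bound >= best SAD, so this cannot improve it
            evaluated += 1
            s = sad(cur, ref, cy, cx, ry, rx, n)
            if s < best_sad:
                best_sad, best_mv = s, (dy, dx)
    return best_mv, best_sad, evaluated
```

Calling `full_search` with `use_sea=False` gives plain ESA. Because both variants scan candidates in the same order and SEA only discards candidates that could not have lowered the best SAD, the two calls return the same motion vector and SAD, while the SEA call evaluates no more SADs than ESA.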

In addition, achieving the best quality requires that the cost of each Macroblock's Motion Vector be calculated accurately, which starts with accurate Motion Vector Prediction. Accurate Motion Vector Prediction is possible only once the upper, upper-right, and left neighboring Macroblocks have all been decided. However, if consecutive rows are offset by two Macroblocks as shown in FIG. 4, a certain amount of row-wise parallel processing becomes possible. This is called the Staggered Approach, and implementing it increases parallelism somewhat, so a performance gain can be expected.
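One way to picture the Staggered Approach is as a wavefront schedule. The sketch below assumes that the macroblock in row r and column c is processed in wave c + 2r, which is one straightforward reading of the two-macroblock offset. The function name `staggered_schedule` is hypothetical, and the exact scheduling used in the invention may differ.

```python
def staggered_schedule(rows, cols):
    """Group macroblocks (row, col) into waves. All macroblocks in one wave
    can be processed in parallel because their left, upper, and upper-right
    neighbors belong to earlier waves.
    Assumption: wave index = col + 2 * row (two-macroblock stagger)."""
    waves = {}
    for r in range(rows):
        for c in range(cols):
            waves.setdefault(c + 2 * r, []).append((r, c))
    return [waves[t] for t in sorted(waves)]
```

For a frame of `rows` by `cols` macroblocks this gives `cols + 2 * (rows - 1)` waves instead of `rows * cols` sequential steps, which matches the modest gain in parallelism described above.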

The Trade-off Strategy sacrifices a little quality and increases parallelism to improve the encoding speed. It secures row-wise parallelism through TLP Motion Vector Prediction and applies Modified Intra-Prediction so that Intra-Prediction can be performed over the entire frame at once. In the original scheme, Motion Vector Prediction takes the median of the left, upper, and upper-right neighboring Macroblocks. TLP (Thread Level Parallelism) Motion Vector Prediction instead selects the upper-left neighboring Macroblock in place of the left one, as shown in FIG. 5, to increase parallelism. The same prediction rule should also apply at the Sub-block level inside a Macroblock, but the TLP approach simplifies the algorithm by using the same predicted value for all Sub-blocks of the current Macroblock.

With the TLP approach, each row can be processed fully in parallel, which is much faster than Best Quality Mode. Intra Prediction selects the best mode by comparing the block against several patterns generated by fixed rules from the reconstructed current frame. Intra Prediction depends on four reconstructed neighboring Macroblocks (left, upper-left, upper, and upper-right), so, like Inter Prediction, it is very hard to parallelize. However, an Error Term is known for which almost no error arises even when Intra Prediction is performed with the current frame instead of the reconstructed frame. With this method, the dependency of Intra Prediction on previous Macroblocks disappears, and the entire frame can be processed in parallel as shown in FIG. 6.
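The difference between the standard median predictor and the TLP variant can be sketched as below. This is a simplified illustration, not the H.264 reference procedure: neighbors outside the frame are assumed to contribute (0, 0), sub-block handling is omitted, and the names `median3` and `predict_mv` are hypothetical.

```python
def median3(a, b, c):
    """Median of three numbers."""
    return sorted((a, b, c))[1]


def predict_mv(mvs, r, c, mode):
    """Component-wise median motion vector predictor for macroblock (r, c).
    mvs[row][col] is an (mvx, mvy) tuple for macroblocks already decided.
    mode "standard": left, upper, upper-right neighbors.
    mode "tlp":      upper-left, upper, upper-right neighbors, so only the
                     previous row is read.
    Simplification: neighbors outside the frame count as (0, 0)."""
    def get(row, col):
        if 0 <= row < len(mvs) and 0 <= col < len(mvs[0]):
            return mvs[row][col]
        return (0, 0)

    first = get(r, c - 1) if mode == "standard" else get(r - 1, c - 1)
    upper, upper_right = get(r - 1, c), get(r - 1, c + 1)
    return (median3(first[0], upper[0], upper_right[0]),
            median3(first[1], upper[1], upper_right[1]))
```

In "tlp" mode the predictor for any macroblock in row r reads only row r - 1, so every macroblock of a row can be handled by its own thread without waiting for its left neighbor.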

The Best Speed Strategy is a mode that sacrifices some quality and gives priority to speed. It applies a Fast Search Algorithm suited to parallel processing and performs a simplified Rate-Distortion Optimization, among other measures. It is mainly suitable for encoding HDTV-class high-resolution video in real time. Unlike SEA and ESA, a Fast Search Algorithm does not look for the optimal Motion Vector; it finds a sub-optimal Motion Vector through rapid stepping and movement. Among Fast Search Algorithms, a model suited to parallel processing is one that checks as many points as possible at the same time and proceeds in fixed steps without divergence. In this respect the Extended Hexagonal Search Algorithm is well suited to parallel processing. Motion Vector Prediction is not computed separately; the Motion Vector of the previous frame or the origin (0, 0) is used as the reference instead. When Rate-Distortion Optimization is applied, the best Motion Vector tends to concentrate at (0, 0), so the resulting error is smaller than might be expected. Intra-Prediction uses the Modified Intra Prediction from the Trade-off Strategy, so that both Inter and Intra coding can process the entire frame at once.
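As a rough illustration of a fixed-pattern fast search, the sketch below implements a plain hexagon-based search, not the Extended Hexagonal Search named above, whose exact pattern is not given in this text. The `cost` callback, the point patterns, and the `max_steps` limit are assumptions made for the example. The `start` argument corresponds to the reference point mentioned above (the previous frame's Motion Vector or the origin).

```python
# Offsets (dx, dy) of the large hexagon and of the final small refinement.
LARGE_HEX = [(2, 0), (-2, 0), (1, 2), (-1, 2), (1, -2), (-1, -2)]
SMALL = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def hexagon_search(cost, start=(0, 0), max_steps=16):
    """Plain hexagon search. cost(mv) returns the matching cost (e.g. SAD)
    of the motion vector mv = (dx, dy). Returns (best_mv, best_cost)."""
    center, best = start, cost(start)
    for _ in range(max_steps):
        # The six candidates of a step are independent of each other,
        # so they could be evaluated by parallel threads.
        scored = [(cost((center[0] + dx, center[1] + dy)),
                   (center[0] + dx, center[1] + dy)) for dx, dy in LARGE_HEX]
        c, mv = min(scored)
        if c >= best:
            break  # the center is already the best point of the pattern
        center, best = mv, c
    # Final refinement with the small pattern around the last center.
    scored = [(cost((center[0] + dx, center[1] + dy)),
               (center[0] + dx, center[1] + dy)) for dx, dy in SMALL]
    c, mv = min(scored)
    if c < best:
        center, best = mv, c
    return center, best
```

Every step checks the same fixed number of points, which fits the requirement that the search advance in fixed steps without divergence between threads. On a cost surface with several local minima the result may be sub-optimal, as the text notes for fast search in general.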

FIG. 1 is a view showing the overall structure of the H.264 encoder.

FIG. 2 is a view showing the process of obtaining the 4x4 SAD space.

FIG. 3 is a view showing how 4x4 SADs are reused to obtain the SAD of a larger block.

FIG. 4 is a diagram of the Staggered Approach.

FIG. 5 is a diagram of Thread Level Parallelism pattern motion prediction.

FIG. 6 is a diagram of 4x4 luma intra prediction.

FIG. 7 is a diagram of fast search algorithms.

Claims

H.264 Codec Code Rate Enhancement Technique and Method Using High Complexity Coding Algorithm Based on Hardware Acceleration