KR20170059040A

KR20170059040A - Optimal mode decision unit of video encoder and video encoding method using the optimal mode decision

Info

Publication number: KR20170059040A
Application number: KR1020150162408A
Authority: KR
Inventors: 김성제; 김용환
Original assignee: 전자부품연구원
Priority date: 2015-11-19
Filing date: 2015-11-19
Publication date: 2017-05-30
Also published as: KR102309910B1

Abstract

The present invention relates to an optimal mode determination apparatus of a video encoder and a video encoding method using the optimal mode determination. The optimal mode determination apparatus comprises: an image combination apparatus for generating a combined image by combining an input image and a surrounding image based on an encoding time point; and a mode determination apparatus for determining a mode of a video encoder for a current coding tree unit (CTU) based on a convolutional neural network (CNN) with an input of the combined image and a quantization parameter. Therefore, an optimal mode can be determined and the encoding efficiency can be improved without performing a rate-distortion cost calculation process of all the modes that a single CTU can have.

Description

[0001] Optimal mode determination apparatus of a video encoder and video encoding method using the optimal mode determination

The present invention relates to video encoders and, more particularly, to an optimal mode determination apparatus of a video encoder, and a video encoding method using optimal mode determination, that improve encoding efficiency by determining the optimal mode for video encoder processing.

Encoders for the video standards established by ISO/IEC (MPEG-2, MPEG-4, AVC, HEVC, SVC, SHVC, etc.) and for open video standards (XVID, Dirac, Theora, Daala, VP8, VP9, etc.) each have their own mode decision method, in addition to their own coding tools, with the aim of providing maximum picture quality while guaranteeing a minimum bit rate. In particular, the HEVC video standard, established in January 2013, shows better compression performance than earlier video standards because it combines excellent coding tools with an optimal mode decision scheme. Compared with the previous standard, AVC, it improves compression by 30 to 50%, but its computational complexity is about 120 to 200% higher.

As video coding technology has advanced, video standards have adopted a variety of coding modes to raise coding efficiency, and techniques for selecting the optimal mode among the many coding modes have been proposed as a result. A representative technique is mode decision based on rate-distortion optimization (RDO), which computes a cost value for every mode and chooses the mode with the minimum cost as the optimal mode. Because this conventional method computes the cost of every mode, its amount of computation grows with the number of modes. The added computation raises the encoder's computational complexity, which makes the method hard to use for real-time processing that must handle 30 or 60 frames per second at resolutions of FHD (Full High Definition) or higher. Moreover, since the cost of each mode is obtained only after transform and quantization, inverse quantization and inverse transform, and entropy coding, the encoder must perform (number of modes) x (transform and quantization / inverse quantization and inverse transform / entropy coding) operations.
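The exhaustive search described above can be sketched as follows. This is a minimal illustration, not the method of any particular encoder: the Lagrangian cost J = D + lambda * R is the usual form of a rate-distortion cost, and the names `ModeResult`, `rdo_select`, `evaluate`, and the toy numbers are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class ModeResult:
    distortion: float  # e.g. squared error between source and reconstruction
    rate_bits: float   # bits produced by entropy coding this mode


def rdo_select(modes, evaluate, lam):
    """Exhaustive RDO: evaluate every mode and keep the cheapest one.

    `evaluate(mode)` stands in for the full transform / quantization /
    inverse quantization / inverse transform / entropy coding pass, so the
    work grows linearly with the number of modes.
    """
    best_mode, best_cost, evaluations = None, float("inf"), 0
    for mode in modes:
        result = evaluate(mode)
        evaluations += 1
        cost = result.distortion + lam * result.rate_bits  # J = D + lambda * R
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost, evaluations


# Toy numbers, purely illustrative.
toy = {
    "intra_2Nx2N": ModeResult(distortion=100.0, rate_bits=40.0),
    "intra_NxN": ModeResult(distortion=60.0, rate_bits=90.0),
    "inter_2Nx2N": ModeResult(distortion=80.0, rate_bits=30.0),
}
best_mode, best_cost, n_eval = rdo_select(toy, toy.__getitem__, lam=1.0)
```

The evaluation count equals the number of candidate modes, which is the cost that a direct classifier is meant to avoid.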

This is because conventional techniques have evolved by sacrificing computational complexity to improve compression efficiency, and a new approach to improving coding efficiency is therefore needed.

Korean Patent Publication No. 10-2009-0040028 (published April 23, 2009)

To solve the above problems, an object of the present invention is to provide an optimal mode determination apparatus of a video encoder, and a video encoding method using optimal mode determination, that determine the mode at high speed using a convolutional neural network (CNN) in the video encoder while preserving as much coding compression efficiency as possible.

To achieve the above object, an optimal mode determination apparatus of a video encoder according to the present invention comprises: an image combining apparatus that generates a combined image by combining an input image with a surrounding image defined relative to the encoding time point; and a mode determination apparatus that takes the combined image and a quantization parameter as input and determines, based on a convolutional neural network (CNN), the mode of the video encoder for the current coding tree unit (CTU).

In the optimal mode determination apparatus of the video encoder, the image combining apparatus combines a current block corresponding to the input image with a neighboring block that corresponds to the surrounding image and is located around the current block, the neighboring block being determined using a CTU of the previous frame or an already processed CTU.

In the optimal mode determination apparatus of the video encoder, the mode determination apparatus includes a plurality of mode determination apparatuses corresponding to CTU sizes, and the optimal mode is selected either by choosing, among the modes output by the plurality of mode determination apparatuses, the mode with the relatively lower rate-distortion cost, or by using machine learning classification.

To achieve the above object, a video encoding method using optimal mode determination according to the present invention comprises: a step in which a video encoder generates a combined image by combining an input image with a surrounding image defined relative to the encoding time point and, taking the combined image and a quantization parameter as input, determines the mode of the video encoder for the current CTU (Coding Tree Unit) based on a CNN (Convolutional Neural Network); a step in which the video encoder transforms and quantizes the input CTU according to the determined mode; and a step in which the video encoder encodes the transform coefficients obtained through the transform and quantization using an entropy coding engine.

The video encoding method using optimal mode determination further comprises: after the transforming and quantizing step, reconstructing pixels through inverse quantization and inverse transform; storing the reconstructed pixels in a DPB (Decoded Picture Buffer) after deblocking filtering and sample adaptive offset filtering; and using, as the input for CTU transform and quantization, the difference from a prediction block obtained through motion compensation, which predicts motion based on the frames stored in the DPB, and intra compensation.

In the video encoding method using optimal mode determination, the transforming and quantizing step and the encoding step for a specific CTU proceed separately from the step of determining the mode of the video encoder for another CTU.

According to the optimal mode determination apparatus of a video encoder and the video encoding method using optimal mode determination of the present invention, the class to which a given input image belongs is determined directly, so the optimal mode can be determined without computing the rate-distortion cost of every mode that a single CTU (Coding Tree Unit) can take.

In addition, because a CNN gains considerably in computation speed when run on a GPU (Graphics Processing Unit), it can run in parallel with the other encoding operations, which are based on CPU (Central Processing Unit) computation. This greatly reduces the overall encoding computation and enables high-speed encoding.

FIG. 1 is a flowchart illustrating the operation of a video encoder according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an optimal mode determination apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an input configuration of an image combining apparatus according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating an input configuration of an image combining apparatus according to another embodiment of the present invention.
FIG. 5 is a diagram illustrating an input configuration of the image combining apparatus according to the embodiment of FIG. 4.
FIG. 6 is a diagram illustrating an optimal mode determination apparatus according to another embodiment of the present invention.
FIG. 7 is a diagram illustrating an example of the operation of a mode determination apparatus according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating optimal mode determination and encoding proceeding separately according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating a video encoding method according to an embodiment of the present invention.

Note that the following description covers only the parts needed to understand the embodiments of the present invention; other parts are omitted so as not to obscure the gist of the invention.

The terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. On the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way, they should be interpreted with meanings and concepts consistent with the technical idea of the present invention. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely preferred embodiments of the present invention and do not represent all of its technical ideas, so it should be understood that various equivalents and modifications that could replace them may exist at the time of filing.

The present invention relates to mode determination in a video encoder. Embodiments of the present invention are described in more detail below with reference to the accompanying drawings.

FIG. 1 is a flowchart illustrating the operation of a video encoder according to an embodiment of the present invention.

Referring to FIG. 1, the video encoder performs encoding in units of specific blocks, taking memory and hardware processing capability into account. One frame is divided into multiple CTUs (Coding Tree Units) for encoding. The optimal mode determination apparatus (100, 200) combines the CTU of the input frame (Residual CTU) with a CTU of the previous frame or an already processed CTU (Prev. Picture CTU) and, taking into account the quantization parameter (QP) from the quantization controller (1), determines the mode of the video encoder for the current CTU based on a CNN (Convolutional Neural Network). A CNN is a type of feed-forward neural network from the field of machine learning and shows strong recognition performance in speech, image, and video recognition.

The input for CTU transform and quantization is the difference between the input CTU and the prediction block obtained through intra compensation or motion compensation. The transform coefficients obtained through transform and quantization pass through an entropy coding engine such as CABAC (Context-Adaptive Binary Arithmetic Coding) and are converted into a bitstream of 0s and 1s.

After CTU transform and quantization, pixels are reconstructed through inverse quantization and inverse transform, and the reconstructed pixels are stored in the DPB (Decoded Picture Buffer) (2) after deblocking filtering and sample adaptive offset filtering. The difference from the prediction block, obtained through motion compensation that predicts motion based on the frames stored in the DPB and through intra compensation using intra prediction, is then used as the input for CTU transform and quantization. Here intra prediction has multiple modes, and motion estimation finds the optimal motion vector and mode for PUs (Prediction Units) of various sizes.

In this video encoding process, the bit rate and the image quality vary with how efficiently the intra prediction and motion estimation modes are selected and with how the quantization controller selects the quantization.

FIG. 2 is a diagram illustrating an optimal mode determination apparatus (100) according to an embodiment of the present invention.

Referring to FIG. 2, the optimal mode determination apparatus (100) comprises an image combining apparatus (10) and a mode determination apparatus (20).

The optimal mode determination apparatus (100) determines the optimal mode when encoding of a CTU begins, so that encoding proceeds according to that mode.

The image combining apparatus (10) receives the CTU of the input image frame and the CTU of a surrounding image frame defined relative to the encoding time point, and generates a combined image. In doing so, the image combining apparatus (10) combines the current block, which corresponds to the input image, with a neighboring block, which corresponds to the surrounding image. The neighboring block is located around the current block and is determined using a CTU of the previous frame or an already processed CTU.

The mode determination apparatus (20) takes the combined image output by the image combining apparatus (10) and the quantization parameter (QP) as input and determines, based on a CNN, the mode of the video encoder for the current CTU.

The process by which the image combining apparatus (10) of the optimal mode determination apparatus (100) combines images is described with reference to FIGS. 3 to 5.

FIG. 3 is a diagram illustrating an input configuration of an image combining apparatus according to an embodiment of the present invention.

Referring to FIG. 3, a combined image is generated by combining the current block, which corresponds to the input image of the image combining apparatus, with neighboring blocks. The neighboring blocks can be formed either from future information that follows the current block or from past information that has already been encoded, and the image combination can be chosen in consideration of coding efficiency and the amount of compression computation.

FIG. 4 is a diagram illustrating an input configuration of an image combining apparatus according to another embodiment of the present invention, and FIG. 5 is a diagram illustrating an input configuration of the image combining apparatus according to the embodiment of FIG. 4.

FIGS. 4 and 5 show neighboring blocks being formed from the previous frame obtained from the DPB shown in FIG. 1. The neighboring block can be formed using motion vector information (the arrows shown in FIG. 4). In this case the current block and the neighboring block have the same size, and the combined image can be formed by adding or subtracting the current block and the neighboring block of that size.
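The motion-vector-based combination above can be sketched in a few lines. This is an illustrative reading only: the helper names `extract_block`, `motion_compensated_neighbor`, and `combine_blocks`, the `(dy, dx)` motion vector layout, and the toy frame are assumptions, and frames are plain nested lists to keep the example self-contained.

```python
def extract_block(frame, top, left, size):
    """Cut a size x size block out of a frame given as a list of rows."""
    height, width = len(frame), len(frame[0])
    if top < 0 or left < 0 or top + size > height or left + size > width:
        raise ValueError("block lies outside the frame")
    return [row[left:left + size] for row in frame[top:top + size]]


def motion_compensated_neighbor(prev_frame, top, left, size, mv):
    """Fetch the block of the previous frame that the motion vector points to."""
    dy, dx = mv
    return extract_block(prev_frame, top + dy, left + dx, size)


def combine_blocks(current, neighbor, op="sub"):
    """Combine two equally sized blocks by element-wise addition or subtraction."""
    if op not in ("add", "sub"):
        raise ValueError("op must be 'add' or 'sub'")
    if len(current) != len(neighbor) or any(
        len(c) != len(n) for c, n in zip(current, neighbor)
    ):
        raise ValueError("blocks must have the same size")
    sign = 1 if op == "add" else -1
    return [[c + sign * n for c, n in zip(cr, nr)] for cr, nr in zip(current, neighbor)]


# Toy 4x4 previous frame and a 2x2 current block located at (0, 0).
prev_frame = [[r * 4 + c for c in range(4)] for r in range(4)]
current_block = [[5, 6], [9, 10]]
neighbor = motion_compensated_neighbor(prev_frame, top=0, left=0, size=2, mv=(1, 1))
difference = combine_blocks(current_block, neighbor, op="sub")
summed = combine_blocks(current_block, neighbor, op="add")
```

In this toy case the motion vector points exactly at matching content, so the subtracted combination is all zeros.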

FIG. 6 is a diagram illustrating an optimal mode determination apparatus (200) according to another embodiment of the present invention.

Referring to FIG. 6, the optimal mode determination apparatus (200) comprises an image combining apparatus (11), a plurality of mode determination apparatuses (21), and a mode selection apparatus (30).

The optimal mode determination apparatus (200) determines the optimal mode when encoding of a CTU begins, so that encoding proceeds according to that mode.

The image combining apparatus (11) receives the CTU of the input image frame and the CTU of a surrounding image frame defined relative to the encoding time point, and generates a combined image. In doing so, the image combining apparatus (11) combines the current block, which corresponds to the input image, with a neighboring block, which corresponds to the surrounding image. The neighboring block is located around the current block and is determined using a CTU of the previous frame or an already processed CTU.

The plurality of mode determination apparatuses (21) take the combined image output by the image combining apparatus (11) and the quantization parameter (QP) as input and determine, based on a CNN, the mode of the video encoder for the current CTU.

Each of the plurality of mode determination apparatuses (21) determines and outputs a mode according to the size of the combined image. For example, mode determination apparatus 0 may be used when the combined image is 64x64, mode determination apparatus 1 when it is 32x32, and mode determination apparatus N when it is 16x16.

The mode selection apparatus (30) selects the most appropriate mode among the modes output by the plurality of mode determination apparatuses (21).

Taking intra prediction as an example with the combined image shown in FIG. 3, the image combining apparatus (11) sends one 64x64 combined image to mode determination apparatus 0, sends 32x32 combined images to mode determination apparatus 1 four times, and sends 16x16 images to mode determination apparatus N sixteen times. Mode determination apparatus 0 determines the intra direction for the 64x64 unit, mode determination apparatus 1 determines the intra directions for four 32x32 blocks through four iterations, and mode determination apparatus N determines the intra directions for sixteen 16x16 blocks through sixteen iterations. The mode selection apparatus (30) then selects the optimal mode among the modes produced by mode determination apparatuses 0, 1, and N. The optimal mode can be selected by computing the rate-distortion cost and choosing the mode with the smallest value, or through a machine learning classification method.
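A rough sketch of this dispatch, under stated assumptions: `decide` stands in for a size-specific CNN and `rd_cost` for whatever cost the mode selection apparatus uses. Both are hypothetical callables, and the toy decider and cost table below are invented for illustration.

```python
def split_into_blocks(ctu_size, block_size):
    """Top-left (y, x) coordinates of the block_size tiles that cover the CTU."""
    return [
        (y, x)
        for y in range(0, ctu_size, block_size)
        for x in range(0, ctu_size, block_size)
    ]


def decide_per_size(ctu_size, block_sizes, decide):
    """Run the decider for each block size once per tile of that size."""
    return {
        size: [decide(size, pos) for pos in split_into_blocks(ctu_size, size)]
        for size in block_sizes
    }


def select_mode(candidates, rd_cost):
    """Pick the block size whose set of decided modes has the lowest cost."""
    return min(candidates, key=lambda size: rd_cost(size, candidates[size]))


# Toy decider and toy costs, purely illustrative.
candidates = decide_per_size(64, [64, 32, 16], decide=lambda size, pos: "DC")
toy_costs = {64: 300.0, 32: 250.0, 16: 280.0}
chosen = select_mode(candidates, rd_cost=lambda size, modes: toy_costs[size])
calls = {size: len(modes) for size, modes in candidates.items()}
```

The call counts per size are 1, 4, and 16, matching the one, four, and sixteen submissions described above.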

FIG. 7 is a diagram illustrating an example of the operation of a mode determination apparatus according to an embodiment of the present invention.

In the convolutional layers shown on the left of FIG. 7, when a 64x64 input image is fed to the mode determination apparatus, twelve 3x3 convolutional filters are applied to the 64x64 input image while moving three pixels at a time (stride of 3). The twelve filters are divided among three GPUs, four per GPU, and each GPU applies as many convolutional filters as it was given. Each GPU therefore holds a 21x21x4 (width x height x number of filters) image at the second stage. In the next stage, each GPU applies two 3x3x4 filters and then performs max-pooling, which selects the maximum value in each 2x2 pixel region, producing a 10x10x2 image.
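The layer sizes quoted above can be checked with the standard output-size formula for convolution and pooling. The text does not state the padding of the second convolution or the pooling stride; the sketch below assumes a padding of 1 (so 21x21 is preserved) and a pooling stride of 2, which is one combination that reproduces the 10x10 result. The function names are introduced here.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window, stride=None):
    """Output width (or height) of a pooling layer along one dimension."""
    stride = window if stride is None else stride
    return (size - window) // stride + 1


filters_total, gpus = 12, 3
filters_per_gpu = filters_total // gpus  # 4 filters on each GPU

# Stage 1: 3x3 filters, stride 3, no padding, on a 64x64 input.
stage1 = conv_out(64, kernel=3, stride=3)
stage1_shape = (stage1, stage1, filters_per_gpu)

# Stage 2: two 3x3x4 filters (padding 1 assumed), then 2x2 max-pooling
# with stride 2 (assumed).
stage2_conv = conv_out(stage1, kernel=3, stride=1, padding=1)
stage2 = pool_out(stage2_conv, window=2)
stage2_shape = (stage2, stage2, 2)
```

Without padding, the second convolution would give 19x19 and pooling would give 9x9, so some padding or a different pooling rule is needed to arrive at 10x10.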

The 10x10x2 image finally obtained from the convolutional layers is fed as input to the first fully-connected layers, which run on the three GPUs.

The output of the first fully-connected layers is fed into the second fully-connected layers, which run on a single GPU. The output of the second fully-connected layers consists of probability values, and the mode with the highest probability becomes the final mode. In the example on the right of FIG. 7, class 2, in which the CTU is split into four 32x32 CUs (Coding Units) each having a 2Nx2N intra prediction mode, is selected as the optimal mode for the 64x64 CTU with a probability of 0.820.
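Choosing the final mode from the output probabilities amounts to an argmax. The sketch below assumes a softmax turns raw scores into probabilities, which the text does not specify. Only the 0.820 for class 2 comes from the example above; the other probabilities and the raw scores are made up.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_mode(probabilities):
    """Return (class index, probability) of the most likely class."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best, probabilities[best]


# Class 2 at 0.820 follows the FIG. 7 example; the rest is illustrative.
example_probs = [0.05, 0.10, 0.820, 0.03]
final_class, final_prob = pick_mode(example_probs)

# The same selection starting from made-up raw scores.
probs_from_scores = softmax([1.0, 2.0, 5.0, 0.5])
class_from_scores, _ = pick_mode(probs_from_scores)
```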

FIG. 8 is a diagram illustrating optimal mode determination and encoding proceeding separately according to an embodiment of the present invention.

In the encoding process according to the present invention, the optimal mode determination process and the encoding process can be run separately, or the optimal mode determination process can be implemented inside the encoding process. FIG. 8 shows an example in which the two are run separately to achieve fast mode decision for real-time processing.

In FIG. 8, the optimal mode determination apparatus runs on the GPU, separately from the encoding process that generates a bitstream from the CTU-level input image. The video encoder checks the optimal mode determined by the optimal mode determination apparatus and performs CTU transform and quantization and entropy coding in that mode to generate the bitstream.
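One way to picture the separate operation is a producer-consumer pipeline in which mode decisions for later CTUs are produced while earlier CTUs are being encoded. The sketch below uses two Python threads and a bounded queue as a stand-in for the GPU and CPU sides. The function names, the queue depth, and the toy `decide` and `encode` callables are all assumptions.

```python
import queue
import threading


def mode_decision_worker(ctus, out_q, decide):
    """Producer: decide a mode for each CTU and hand it to the encoder."""
    for index, ctu in enumerate(ctus):
        out_q.put((index, ctu, decide(ctu)))
    out_q.put(None)  # sentinel: no more CTUs


def encoding_worker(in_q, encode, bitstream):
    """Consumer: encode each CTU with the mode that was decided for it."""
    while True:
        item = in_q.get()
        if item is None:
            break
        index, ctu, mode = item
        bitstream.append(encode(index, ctu, mode))


def run_pipeline(ctus, decide, encode, depth=4):
    """Run mode decision and encoding concurrently over the CTU list."""
    handoff = queue.Queue(maxsize=depth)  # bounded so the producer cannot run far ahead
    bitstream = []
    producer = threading.Thread(target=mode_decision_worker, args=(ctus, handoff, decide))
    consumer = threading.Thread(target=encoding_worker, args=(handoff, encode, bitstream))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return bitstream


# Toy stand-ins for the CNN decision and the transform/quantize/entropy stage.
result = run_pipeline(
    ["ctu0", "ctu1", "ctu2"],
    decide=lambda ctu: "intra" if int(ctu[-1]) % 2 == 0 else "inter",
    encode=lambda index, ctu, mode: f"{index}:{mode}",
)
```

With a single producer and a single consumer on a FIFO queue, CTUs are encoded in the order their modes were decided.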

The video encoding process according to the present invention is described with reference to FIG. 9.

FIG. 9 is a flowchart illustrating a video encoding method according to an embodiment of the present invention.

Referring to FIG. 9, when CTU encoding begins, the video encoder generates a combined image by combining the input image with a surrounding image defined relative to the encoding time point and, taking the combined image and the quantization parameter as input, determines the mode of the video encoder for the current CTU based on a CNN (S1).

In step S1, the video encoder generates the combined image by combining the current block, which corresponds to the input image, with a neighboring block located around the current block. The neighboring block may be determined using a CTU of the previous frame or an already processed CTU.

The video encoder then transforms and quantizes the CTU according to the determined mode (S2) and performs entropy coding to generate a bitstream (S3).

After step S2, the video encoder may reconstruct pixels through inverse quantization and inverse transform and store the reconstructed pixels in the DPB (Decoded Picture Buffer) after deblocking filtering and sample adaptive offset filtering. The video encoder may then use, as the input for CTU transform and quantization, the difference from the prediction block obtained through motion compensation, which predicts motion based on the frames stored in the DPB, and intra compensation.
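The reconstruction loop can be illustrated on a one-dimensional toy block. This sketch is deliberately simplified: the transform is omitted (treated as identity), deblocking and sample adaptive offset filtering are skipped, and quantization is a plain uniform scalar quantizer with a hypothetical step size `step`. The function name and all numbers are assumptions.

```python
def encode_block(block, prediction, step, dpb):
    """Quantize the prediction residual and rebuild the decoder-side pixels.

    Returns the quantized levels (what would go to entropy coding) and the
    reconstructed block, which is also appended to the DPB for later prediction.
    """
    residual = [x - p for x, p in zip(block, prediction)]          # input minus prediction
    levels = [round(r / step) for r in residual]                   # quantization
    dequantized = [level * step for level in levels]               # inverse quantization
    reconstructed = [p + d for p, d in zip(prediction, dequantized)]
    dpb.append(reconstructed)  # in-loop filtering would happen before this in a real encoder
    return levels, reconstructed


dpb = []
levels, reconstructed = encode_block(
    block=[52, 55, 61, 66],
    prediction=[50, 50, 60, 60],
    step=3,
    dpb=dpb,
)
```

The reconstruction differs slightly from the input because quantization is lossy, which is why the encoder predicts from reconstructed pixels rather than from the original ones.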

The video encoding method using optimal mode determination according to an embodiment of the present invention can be implemented as a program readable by various computer means and recorded on a computer-readable recording medium.

The embodiments disclosed in this specification and the drawings are merely specific examples presented to aid understanding and are not intended to limit the scope of the present invention. It will be apparent to those of ordinary skill in the art to which the present invention pertains that, besides the embodiments disclosed here, other modifications based on the technical idea of the present invention can be practiced. Although specific terms are used in this specification and the drawings, they are used only in a general sense to explain the technical content of the invention and to aid understanding, and are not intended to limit the scope of the invention.

10, 11: image combining apparatus 20, 21: mode determination apparatus
30: mode selection apparatus 100, 200: optimal mode determination apparatus

Claims

1. An optimal mode determination apparatus of a video encoder, comprising:
an image combining apparatus that generates a combined image by combining an input image with a surrounding image defined relative to an encoding time point; and
a mode determination apparatus that takes the combined image and a quantization parameter as input and determines, based on a CNN (Convolutional Neural Network), the mode of the video encoder for a current CTU (Coding Tree Unit).

2. The apparatus according to claim 1,
wherein the image combining apparatus combines a current block corresponding to the input image with a neighboring block that corresponds to the surrounding image and is located around the current block, the neighboring block being determined using a CTU of the previous frame or an already processed CTU.

3. The apparatus according to claim 1,
wherein the mode determination apparatus includes a plurality of mode determination apparatuses corresponding to CTU sizes, and the optimal mode is selected either by choosing, among the modes output by the plurality of mode determination apparatuses, the mode with the relatively lower rate-distortion cost, or by using machine learning classification.

4. A video encoding method using optimal mode determination, comprising:
generating, by a video encoder, a combined image by combining an input image with a surrounding image defined relative to an encoding time point and, taking the combined image and a quantization parameter as input, determining the mode of the video encoder for a current CTU (Coding Tree Unit) based on a CNN (Convolutional Neural Network);
transforming and quantizing, by the video encoder, an input CTU according to the determined mode; and
encoding, by the video encoder, the transform coefficients obtained through the transform and quantization using an entropy coding engine.

5. The method of claim 4, further comprising:
reconstructing pixels through inverse quantization and inverse transform after the transforming and quantizing;
storing the reconstructed pixels in a DPB (Decoded Picture Buffer) after deblocking filtering and sample adaptive offset filtering; and
using, as the input for CTU transform and quantization, the difference from a prediction block obtained through motion compensation, which predicts motion based on the frames stored in the DPB, and intra compensation.

6. The method of claim 4,
wherein the transforming and quantizing step and the encoding step for a specific CTU proceed separately from the step of determining the mode of the video encoder for another CTU.