KR102309910B1

KR102309910B1 - Optimal mode decision unit of video encoder and video encoding method using the optimal mode decision

Info

Publication number: KR102309910B1
Application number: KR1020150162408A
Authority: KR
Inventors: 김성제; 김용환
Original assignee: 한국전자기술연구원
Priority date: 2015-11-19
Filing date: 2015-11-19
Publication date: 2021-10-08
Anticipated expiration: 2035-11-19
Also published as: KR20170059040A

Abstract

The present invention relates to an optimal mode determination apparatus for a video encoder and a video encoding method using optimal mode determination. The optimal mode determination apparatus includes an image combining device that generates a combined image by combining an input image with a neighboring image relative to the encoding time point, and a mode determination device that takes the combined image and a quantization parameter as inputs and determines, based on a CNN (Convolutional Neural Network), the video encoder mode for the current CTU (Coding Tree Unit). As a result, the optimal mode can be determined without computing the rate-distortion cost of every mode that a CTU can take, and encoding efficiency is improved.

Description

{Optimal mode decision unit of video encoder and video encoding method using the optimal mode decision}

The present invention relates to video encoders, and more particularly to an optimal mode determination apparatus for a video encoder, and a video encoding method using optimal mode determination, that can improve encoding efficiency by determining the optimal mode for video encoder processing.

Encoders for the video standards established by ISO/IEC (MPEG-2/MPEG-4/AVC/HEVC/SVC/SHVC, etc.) and for open video standards (XVID/Dirac/Theora/Daala/VP8/VP9, etc.) each have their own mode decision method as well as their own coding tools, all aiming to provide the highest picture quality at the lowest bit rate. In particular, the HEVC video standard, established in January 2013, achieves better compression performance than earlier video standards because it combines strong coding tools with an optimal mode decision scheme. Compared with its predecessor, AVC, it improves compression by 30-50%, but it is about 120-200% more complex in terms of computation.

As video coding technology has advanced, video standards have adopted a variety of coding modes to raise coding efficiency, and techniques for choosing the optimal mode among them have been proposed accordingly. A representative technique is mode decision based on rate-distortion optimization (RDO), which computes the cost of every mode and selects the mode with the minimum cost as the optimal mode. Because this conventional method computes the cost of every mode, its computational load grows with the number of modes. The increased load raises the encoder's computational complexity, which makes the method hard to use for real-time processing that must handle 30 or 60 frames per second at FHD (Full High Definition) or higher resolution. Moreover, because the cost of each mode is obtained by running transform and quantization, inverse quantization and inverse transform, and entropy coding, the encoder must perform (number of modes) x (transform and quantization / inverse quantization and inverse transform / entropy coding) operations.
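The exhaustive search described above can be sketched as follows. This is a minimal illustration, assuming the common Lagrangian cost J = D + lambda * R, which the text does not spell out; the mode names, the `evaluate` callback, and the (distortion, rate) values are hypothetical.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R (assumed form)."""
    return distortion + lam * rate_bits


def exhaustive_rdo(modes, evaluate, lam):
    """Try every mode and return (best_mode, best_cost, evaluations)."""
    best_mode, best_cost = None, float("inf")
    evaluations = 0
    for mode in modes:
        # In a real encoder this call hides a full transform/quantization,
        # inverse quantization/inverse transform, and entropy coding pass.
        distortion, rate_bits = evaluate(mode)
        evaluations += 1
        cost = rd_cost(distortion, rate_bits, lam)
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost, evaluations


# Hypothetical (distortion, rate in bits) pairs for three made-up modes.
TOY_TABLE = {
    "mode_a": (100.0, 10.0),
    "mode_b": (60.0, 40.0),
    "mode_c": (80.0, 15.0),
}

RDO_RESULT = exhaustive_rdo(TOY_TABLE, TOY_TABLE.__getitem__, lam=1.0)
```

The `evaluations` counter equals the number of modes, which is the linear growth in work that the passage identifies as the limitation of RDO.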

This is because the prior art has evolved by sacrificing computational complexity to improve compression efficiency; a new approach to improving encoding efficiency is therefore needed.

Korean Patent Laid-Open Publication No. 10-2009-0040028 (published April 23, 2009)

To solve the above problems, an object of the present invention is to provide an optimal mode determination apparatus for a video encoder, and a video encoding method using optimal mode determination, that use a CNN (Convolutional Neural Network) in the video encoder to determine the mode at high speed while preserving as much compression efficiency as possible.

To achieve the above object, the optimal mode determination apparatus for a video encoder according to the present invention includes an image combining device that generates a combined image by combining an input image with a neighboring image relative to the encoding time point, and a mode determination device that takes the combined image and a quantization parameter as inputs and determines, based on a CNN (Convolutional Neural Network), the video encoder mode for the current CTU (Coding Tree Unit).

In the optimal mode determination apparatus of the present invention, the image combining device combines a current block corresponding to the input image with a neighboring block that corresponds to the neighboring image and is located around the current block, where the neighboring block is determined using a CTU of the previous frame or an already processed CTU.

In the optimal mode determination apparatus of the present invention, the mode determination device includes a plurality of mode determination devices corresponding to CTU sizes, and the optimal mode is selected from among the modes they output either by choosing the mode with the relatively lower rate-distortion cost or by using machine learning classification.

To achieve the above object, the video encoding method using optimal mode determination according to the present invention includes: a step in which the video encoder generates a combined image by combining an input image with a neighboring image relative to the encoding time point and, taking the combined image and a quantization parameter as inputs, determines based on a CNN (Convolutional Neural Network) the video encoder mode for the current CTU (Coding Tree Unit); a step in which the video encoder transforms and quantizes the input CTU according to the determined mode; and a step in which the video encoder encodes the transform coefficients obtained through the transform and quantization using an entropy coding engine.

The video encoding method of the present invention further includes, after the transform and quantization step: reconstructing pixels through inverse quantization and inverse transform; storing the reconstructed pixels in a DPB (Decoded Picture Buffer) after deblocking filtering and sample adaptive offset filtering; and using, as the input to the transform and quantization of the CTU, the difference from a prediction block obtained through intra compensation and through motion compensation that predicts motion based on the frames stored in the DPB.

In the video encoding method of the present invention, the transform and quantization step and the encoding step for a given CTU proceed separately from the step of determining the video encoder mode for another CTU.

According to the optimal mode determination apparatus and the video encoding method of the present invention, the class to which a given input image belongs is determined directly, so the optimal mode can be determined without computing the rate-distortion cost of every mode that a CTU (Coding Tree Unit) can take.

In addition, because a CNN gains considerably in speed when run on a GPU (Graphics Processing Unit), it can run in parallel with the other encoding operations, which are based on CPU (Central Processing Unit) computation. This greatly reduces the overall encoding computation and enables high-speed encoding.

FIG. 1 is a flowchart of the operation of a video encoder according to an embodiment of the present invention.
FIG. 2 is a diagram showing an optimal mode determination apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram showing the input configuration of an image combining device according to an embodiment of the present invention.
FIG. 4 is a diagram showing the input configuration of an image combining device according to another embodiment of the present invention.
FIG. 5 is a diagram showing the input configuration of the image combining device according to the embodiment of FIG. 4.
FIG. 6 is a diagram showing an optimal mode determination apparatus according to another embodiment of the present invention.
FIG. 7 is an exemplary diagram showing the operation of a mode determination device according to an embodiment of the present invention.
FIG. 8 is a diagram showing optimal mode determination and encoding proceeding separately according to an embodiment of the present invention.
FIG. 9 is a flowchart showing the process of a video encoding method according to an embodiment of the present invention.

Note that the following description covers only what is needed to understand the embodiments of the present invention; other details are omitted so as not to obscure the gist of the invention.

The terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. On the principle that an inventor may appropriately define terms in order to describe the invention in the best way, they should be interpreted with meanings and concepts consistent with the technical idea of the present invention. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely preferred embodiments and do not represent the entire technical idea of the invention, so it should be understood that various equivalents and modifications could replace them at the time of filing.

The present invention relates to mode determination in a video encoder. Embodiments of the present invention are described in more detail below with reference to the accompanying drawings.

FIG. 1 is a flowchart of the operation of a video encoder according to an embodiment of the present invention.

Referring to FIG. 1, the video encoder performs encoding in units of blocks, taking into account memory and the processing capability of the hardware. A frame is divided into multiple CTUs (Coding Tree Units) for encoding. The optimal mode determination apparatus (100, 200) combines the CTU of the input frame (Residual CTU) with a CTU of the previous frame or an already processed CTU (Prev. Picture CTU) and, taking into account the quantization parameter (QP) from the quantization controller (1), determines based on a CNN (Convolutional Neural Network) the video encoder mode for the current CTU. A CNN is a type of feed-forward neural network from the field of machine learning and shows excellent recognition performance in speech, image, and video recognition.

The input to the transform and quantization of a CTU is the difference between the input CTU and a prediction block obtained through intra compensation or motion compensation. The transform coefficients obtained through transform and quantization pass through an entropy coding engine, for example CABAC (Context-Adaptive Binary Arithmetic Coding), and are converted into a bitstream of 0s and 1s.

After the transform and quantization of the CTU, pixels are reconstructed through inverse quantization and inverse transform, and the reconstructed pixels are stored in the DPB (Decoded Picture Buffer) (2) after deblocking filtering and sample adaptive offset filtering. The difference from a prediction block, obtained through motion compensation that predicts motion based on the frames stored in the DPB and through intra compensation that uses intra prediction, is then used as the input to the transform and quantization of the CTU. Here intra prediction has multiple modes, and motion estimation finds the optimal motion vector and mode for PUs (Prediction Units) of various sizes.
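The residual, quantization, and reconstruction steps in this loop can be sketched as follows. This is a deliberately simplified illustration: it omits the transform and the in-loop filters, and it assumes the approximate step-size relation Qstep = 2^((QP - 4) / 6) used by AVC/HEVC-style codecs, which the text itself does not state. The sample values are hypothetical.

```python
def qstep(qp):
    """Approximate quantization step size; doubles every 6 QP (assumed relation)."""
    return 2.0 ** ((qp - 4) / 6.0)


def residual(source, prediction):
    """Difference between the input samples and the prediction block."""
    return [s - p for s, p in zip(source, prediction)]


def quantize(values, qp):
    """Uniform scalar quantization to integer levels (transform omitted)."""
    step = qstep(qp)
    return [int(round(v / step)) for v in values]


def dequantize(levels, qp):
    """Inverse quantization: scale the levels back by the step size."""
    step = qstep(qp)
    return [level * step for level in levels]


def reconstruct(prediction, levels, qp):
    """Reconstructed samples = prediction + dequantized residual."""
    return [p + d for p, d in zip(prediction, dequantize(levels, qp))]


# Hypothetical one-dimensional "block" of four samples.
SOURCE = [104, 121, 90, 60]
PREDICTION = [96, 100, 95, 60]
QP = 22  # gives a step size of 8 under the assumed relation

LEVELS = quantize(residual(SOURCE, PREDICTION), QP)
RECON = reconstruct(PREDICTION, LEVELS, QP)
```

The reconstructed samples differ slightly from the source, which is why the encoder must store the reconstruction (not the source) in the DPB to stay in sync with the decoder.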

In this video encoding process, the bit rate and picture quality vary with how well the intra prediction and motion estimation modes are selected and with the quantization selection method of the quantization controller.

FIG. 2 is a diagram showing an optimal mode determination apparatus (100) according to an embodiment of the present invention.

Referring to FIG. 2, the optimal mode determination apparatus (100) includes an image combining device (10) and a mode determination device (20).

The optimal mode determination apparatus (100) determines the optimal mode when encoding of a CTU begins, so that encoding proceeds accordingly.

The image combining device (10) receives the CTU of the input image frame and the CTU of a neighboring image frame relative to the encoding time point, and generates a combined image. In doing so it combines the current block, corresponding to the input image, with a neighboring block, corresponding to the neighboring image. The neighboring block is located around the current block and is determined using a CTU of the previous frame or an already processed CTU.

The mode determination device (20) takes the combined image output by the image combining device (10) and the quantization parameter (QP) as inputs and determines, based on a CNN, the video encoder mode for the current CTU.

The process by which the image combining device (10) of the optimal mode determination apparatus (100) combines images is described with reference to FIGS. 3 to 5.

FIG. 3 is a diagram showing the input configuration of an image combining device according to an embodiment of the present invention.

Referring to FIG. 3, the image combining device generates a combined image by combining the current block, corresponding to the input image, with neighboring blocks. The neighboring blocks can either use future information that comes after the current block or use past information that has already been encoded, and the image combination can be chosen in consideration of coding efficiency and compression computation.

FIG. 4 is a diagram showing the input configuration of an image combining device according to another embodiment of the present invention, and FIG. 5 is a diagram showing the input configuration of the image combining device according to the embodiment of FIG. 4.

FIGS. 4 and 5 show neighboring blocks being constructed from a previous frame obtained from the DPB shown in FIG. 1. The neighboring block can be constructed using motion vector information (the arrows in FIG. 4). In this case the current block and the neighboring block have the same size, and the combined image can be formed by adding or subtracting the current block and the neighboring block of that size.
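The block combination described for FIGS. 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions: frames are plain nested lists of samples, the motion vector is a whole-sample (dy, dx) offset, and the function names are hypothetical rather than taken from the disclosure.

```python
def extract_block(frame, top, left, size):
    """Return the size x size block whose top-left corner is (top, left)."""
    height, width = len(frame), len(frame[0])
    if top < 0 or left < 0 or top + size > height or left + size > width:
        raise ValueError("block lies outside the frame")
    return [row[left:left + size] for row in frame[top:top + size]]


def combine_blocks(current, neighbor, op="sub"):
    """Add or subtract two equally sized blocks sample by sample."""
    sign = -1 if op == "sub" else 1
    return [
        [c + sign * n for c, n in zip(cur_row, nb_row)]
        for cur_row, nb_row in zip(current, neighbor)
    ]


def combined_image(cur_frame, prev_frame, top, left, size, mv, op="sub"):
    """Combine the current block with the block the motion vector points to."""
    dy, dx = mv
    current = extract_block(cur_frame, top, left, size)
    neighbor = extract_block(prev_frame, top + dy, left + dx, size)
    return combine_blocks(current, neighbor, op)


# Hypothetical 4x4 frames: the current frame is the previous frame shifted
# up and left by one sample, so motion vector (1, 1) matches exactly.
PREV_FRAME = [[r * 4 + c for c in range(4)] for r in range(4)]
CUR_FRAME = [
    [PREV_FRAME[min(r + 1, 3)][min(c + 1, 3)] for c in range(4)]
    for r in range(4)
]

DIFF_IMAGE = combined_image(CUR_FRAME, PREV_FRAME, 0, 0, 2, mv=(1, 1), op="sub")
SUM_IMAGE = combined_image(CUR_FRAME, PREV_FRAME, 0, 0, 2, mv=(1, 1), op="add")
```

With a perfectly matching motion vector the subtracted image is all zeros, which gives the CNN a direct view of how well the previous frame predicts the current block.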

FIG. 6 is a diagram showing an optimal mode determination apparatus (200) according to another embodiment of the present invention.

Referring to FIG. 6, the optimal mode determination apparatus (200) includes an image combining device (11), a plurality of mode determination devices (21), and a mode selection device (30).

The optimal mode determination apparatus (200) determines the optimal mode when encoding of a CTU begins, so that encoding proceeds accordingly.

The image combining device (11) receives the CTU of the input image frame and the CTU of a neighboring image frame relative to the encoding time point, and generates a combined image. In doing so it combines the current block, corresponding to the input image, with a neighboring block, corresponding to the neighboring image. The neighboring block is located around the current block and is determined using a CTU of the previous frame or an already processed CTU.

The plurality of mode determination devices (21) take the combined image output by the image combining device (11) and the quantization parameter (QP) as inputs and determine, based on a CNN, the video encoder mode for the current CTU.

Each of the mode determination devices (21) determines and outputs a mode according to the size of the combined image. For example, mode determination device 0 can be used when the combined image is 64x64, mode determination device 1 when it is 32x32, and mode determination device N when it is 16x16.

The mode selection device (30) selects the most appropriate mode from among the modes output by the plurality of mode determination devices (21).

Taking intra prediction as an example with the combined image shown in FIG. 3, the image combining device (11) sends one 64x64 combined image to mode determination device 0, sends 32x32 combined images to mode determination device 1 four times, and sends 16x16 images to mode determination device N sixteen times. Mode determination device 0 determines the intra direction for the 64x64 block; mode determination device 1 determines the intra directions for the four 32x32 blocks over 4 iterations; and mode determination device N determines the intra directions for the sixteen 16x16 blocks over 16 iterations. The mode selection device (30) then selects the optimal mode from among the modes output by mode determination devices 0, 1, and N. The optimal mode can be selected by computing the rate-distortion costs and choosing the mode with the lowest value, or by a method such as machine learning classification.
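The dispatch counts above (1, 4, and 16 calls) follow directly from tiling a 64x64 CTU, as the sketch below shows. It is an illustration only: the device names, intra mode labels, and rate-distortion costs are hypothetical, and the selection shown is the lowest-cost variant, not the machine learning classification alternative.

```python
def tile_offsets(ctu_size, block_size):
    """Top-left (row, col) offsets of the blocks that tile a square CTU."""
    if ctu_size % block_size != 0:
        raise ValueError("block size must divide the CTU size")
    steps = range(0, ctu_size, block_size)
    return [(row, col) for row in steps for col in steps]


def calls_per_device(ctu_size, block_sizes):
    """Number of combined images sent to the device for each block size."""
    return {size: len(tile_offsets(ctu_size, size)) for size in block_sizes}


def select_by_rd_cost(candidates):
    """Pick the candidate whose (modes, rd_cost) entry has the lowest cost."""
    return min(candidates, key=lambda name: candidates[name][1])


CALLS = calls_per_device(64, (64, 32, 16))

# Hypothetical outputs of mode determination devices 0, 1, and N:
# a list of per-block intra modes plus a made-up total rate-distortion cost.
CANDIDATES = {
    "device_0": (["dc"], 950.0),
    "device_1": (["planar"] * 4, 870.0),
    "device_N": (["angular_26"] * 16, 910.0),
}
BEST_DEVICE = select_by_rd_cost(CANDIDATES)
```

Only three candidates reach the cost comparison here, one per block size, rather than every mode the CTU could take.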

FIG. 7 is an exemplary diagram showing the operation of a mode determination device according to an embodiment of the present invention.

In the convolutional layers shown on the left of FIG. 7, when a 64x64 input image is fed to the mode determination device, twelve 3x3 convolutional filters are applied to it, moving 3 pixels at a time (stride of 3). The 12 filters are split across 3 GPUs, 4 per GPU, and each GPU applies its assigned convolutional filters. Each GPU therefore holds a 21x21x4 (width x height x number of filters) image at the second stage. In the next stage, each GPU applies two 3x3x4 filters and then performs max-pooling, which selects the maximum value in each 2x2 pixel region, producing a 10x10x2 image.
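The feature-map sizes quoted above can be checked with the standard output-size formula for convolution and pooling. The sketch below is an arithmetic check only; the text does not state the stride or padding of the second convolution, so stride 1 with one sample of zero padding is an assumption chosen because it reproduces the stated 10x10 result (without padding the same arithmetic gives 9x9).

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window, stride=None):
    """Output width of pooling; the stride defaults to the window size."""
    stride = window if stride is None else stride
    return (size - window) // stride + 1


# 12 filters split evenly over 3 GPUs, as described in the text.
FILTERS_PER_GPU = 12 // 3

# First stage: 3x3 filters applied to a 64x64 input with a stride of 3.
STAGE1 = conv_out(64, kernel=3, stride=3)

# Second stage: 3x3x4 filters (stride 1, padding 1 are assumptions),
# followed by 2x2 max-pooling.
STAGE2_CONV = conv_out(STAGE1, kernel=3, stride=1, padding=1)
STAGE2 = pool_out(STAGE2_CONV, window=2)

# Number of values per GPU handed to the fully-connected layers (10 x 10 x 2).
FLATTENED = STAGE2 * STAGE2 * 2
```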

The 10x10x2 image finally obtained from the convolutional layers is fed into the first fully-connected layers, which run on three GPUs.

The output of the first fully-connected layers is then fed into the second fully-connected layers, which run on a single GPU. The output of the second fully-connected layers is a set of probability values, and the mode with the highest probability becomes the final mode. In the example on the right of FIG. 7, class 2, in which the CTU is split into four 32x32 CUs (Coding Units) each with a 2Nx2N intra prediction mode, is selected as the optimal mode for the 64x64 CTU with a probability of 0.820.
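Choosing the final mode from the output probabilities amounts to an argmax, sketched below. Only the class 2 probability of 0.820 comes from the text; the other probabilities are hypothetical, and the use of a softmax to turn raw scores into probabilities is an assumption, since the text does not say how the probabilities are produced.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (assumed output normalization)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_mode(probabilities):
    """Return (class index, probability) of the most probable class."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best, probabilities[best]


# Class 2 carries the 0.820 from the text; the rest are made-up fillers.
PROBS = [0.050, 0.100, 0.820, 0.030]
CHOSEN = pick_mode(PROBS)

# The same selection applied to hypothetical raw scores.
CHOSEN_FROM_SCORES = pick_mode(softmax([1.0, 2.0, 3.0]))
```

A single forward pass plus this argmax replaces the per-mode cost evaluation of exhaustive RDO.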

FIG. 8 is a diagram showing optimal mode determination and encoding proceeding separately according to an embodiment of the present invention.

In the encoding process according to the present invention, the optimal mode determination process and the encoding process can be run separately, or the optimal mode determination process can be implemented inside the encoding process. FIG. 8 shows an example in which the two are run separately to achieve fast mode decision for real-time processing.

In FIG. 8, the optimal mode determination apparatus runs on the GPU, separately from the encoding process that generates a bitstream from the input image in units of CTUs. The video encoder checks the optimal mode determined by the optimal mode determination apparatus and generates the bitstream by performing transform and quantization and entropy coding of the CTU in that mode.
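The separation of mode determination from encoding can be pictured as a two-stage pipeline, sketched below with two threads and a bounded queue. This is an illustration under stated assumptions, not the disclosed implementation: the thread standing in for the GPU, the `decide_mode` placeholder (which replaces the CNN with a trivial rule), and the queue depth are all hypothetical.

```python
import queue
import threading


def decide_mode(ctu):
    """Placeholder for the CNN-based decision (hypothetical rule)."""
    return "split" if sum(ctu) % 2 else "no_split"


def mode_decision_worker(ctus, out_q):
    """Stage 1: decide a mode per CTU and hand it to the encoder stage."""
    for index, ctu in enumerate(ctus):
        out_q.put((index, ctu, decide_mode(ctu)))
    out_q.put(None)  # sentinel: no more CTUs


def encode_worker(in_q, results):
    """Stage 2: consume decided modes; real encoding is stubbed out."""
    while True:
        item = in_q.get()
        if item is None:
            break
        index, _ctu, mode = item
        # A real encoder would transform, quantize, and entropy-code here.
        results.append((index, mode))


def run_pipeline(ctus, depth=2):
    """Run both stages concurrently; stage 1 may run ahead by `depth` CTUs."""
    handoff = queue.Queue(maxsize=depth)
    results = []
    stages = [
        threading.Thread(target=mode_decision_worker, args=(ctus, handoff)),
        threading.Thread(target=encode_worker, args=(handoff, results)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return results


PIPELINE_RESULT = run_pipeline([[1, 2], [3, 4], [5, 5]])
```

While the encoder stage works on one CTU, the decision stage is already processing the next, which is the overlap the separate operation is meant to provide.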

The video encoding process according to the present invention is described with reference to FIG. 9.

FIG. 9 is a flowchart showing the process of a video encoding method according to an embodiment of the present invention.

Referring to FIG. 9, when CTU encoding begins, the video encoder generates a combined image by combining the input image with a neighboring image relative to the encoding time point and, taking the combined image and the quantization parameter as inputs, determines based on a CNN the video encoder mode for the current CTU (S1).

In step S1, the video encoder generates the combined image by combining the current block, corresponding to the input image, with neighboring blocks located around the current block. The neighboring blocks can be determined using a CTU of the previous frame or an already processed CTU.

The video encoder then transforms and quantizes the CTU according to the determined mode (S2) and performs entropy coding to generate the bitstream (S3).

After step S2, the video encoder can reconstruct pixels through inverse quantization and inverse transform and store the reconstructed pixels in the DPB (Decoded Picture Buffer) after deblocking filtering and sample adaptive offset filtering. The video encoder can then use, as the input to the transform and quantization of the CTU, the difference from a prediction block obtained through intra compensation and through motion compensation that predicts motion based on the frames stored in the DPB.

The video encoding method using optimal mode determination according to an embodiment of the present invention may be implemented in the form of a program readable by various computer means and recorded on a computer-readable recording medium.

Meanwhile, the embodiments disclosed in this specification and the drawings merely present specific examples to aid understanding and are not intended to limit the scope of the present invention. It is obvious to those of ordinary skill in the art to which the present invention pertains that, in addition to the embodiments disclosed herein, other modifications based on the technical idea of the present invention can be implemented. In addition, although specific terms are used in this specification and the drawings, they are used only in a general sense to explain the technical content of the present invention easily and to aid understanding of the invention, and are not intended to limit the scope of the present invention.

10, 11: image combining device    20, 21: mode determination device
30: mode selection device    100, 200: optimal mode determination device

Claims

1. An apparatus for determining an optimal mode of a video encoder, comprising:
an image combining device for generating a combined image by combining an input image and a surrounding image based on an encoding time point;
a plurality of mode determination devices, provided according to the size of the combined image, each determining a mode of the video encoder for the current Coding Tree Unit (CTU) based on a Convolutional Neural Network (CNN) with the combined image and a quantization parameter as inputs; and
a mode selection device for selecting any one of the plurality of modes output from the plurality of mode determination devices,
wherein the mode selection device selects the mode having the lowest rate-distortion cost among the plurality of modes output from the plurality of mode determination devices.

2. The apparatus of claim 1,
wherein the image combining device combines a current block corresponding to the input image with a neighboring block that corresponds to the surrounding image and is located in the vicinity of the current block, and wherein the neighboring block is determined using a CTU of a previous frame or a previously processed CTU.

3. (Deleted)

4. A video encoding method using optimal mode determination, comprising:
generating, by a video encoder, a combined image by combining an input image and a surrounding image based on an encoding time point;
determining, by the video encoder, a mode of the video encoder for the current Coding Tree Unit (CTU) based on a Convolutional Neural Network (CNN) with the combined image and a quantization parameter as inputs, wherein a plurality of modes are determined, one for each size of the combined image;
selecting, by the video encoder, the mode having the lowest rate-distortion cost among the determined plurality of modes;
transforming and quantizing, by the video encoder, the input CTU according to the selected mode; and
encoding, by the video encoder, the transform coefficients obtained through the transformation and quantization using an entropy coding engine.

5. The method of claim 4, further comprising:
reconstructing pixels through inverse quantization and inverse transformation after the transforming and quantizing;
storing the reconstructed pixels in a decoded picture buffer (DPB) after deblocking filtering and sample adaptive offset filtering; and
utilizing, as an input for the transformation and quantization of the CTU, a difference value with respect to a prediction block obtained through motion compensation, which predicts motion based on the frames stored in the DPB, and intra compensation.

6. The method of claim 4,
wherein the transforming and quantizing and the encoding for a specific CTU are performed independently of the determining of a mode of the video encoder for another CTU.