KR20220095693A - GOP selection method based on reinforcement learning and image analysis apparatus

Info

Publication number: KR20220095693A
Application number: KR1020200187458A
Authority: KR
Inventors: 강제원; 김나영; 이정경
Original assignee: 이화여자대학교 산학협력단
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2022-07-07
Also published as: KR102456690B1

Abstract

A group of pictures (GOP) selection method based on reinforcement learning includes the steps of: receiving, by an analysis device, an input image composed of a plurality of frames; determining, by the analysis device, a path of a binary tree for determining a GOP based on the plurality of frames; and selecting, by the analysis device, the GOP of the input image based on a leaf node of the binary tree. The analysis device determines the path of the binary tree using reinforcement learning in which the binary tree is the environment, whether to branch the tree is the action, and the coding efficiency of the selected GOP is the reward.

Description

GOP selection method and analysis device based on reinforcement learning

The technology described below relates to a technique for selecting a group of pictures (GOP) in video encoding.

Video encoding technologies, including High Efficiency Video Coding (HEVC)/H.265 (hereinafter HEVC), divide a video into GOP units and encode each unit. Video encoding encodes the current target using information already encoded in the same frame or in other frames.

Using I-frames, P-frames, and B-frames, HEVC provides three encoding modes: All Intra (AI) mode, which uses only intra prediction; Low Delay (LD) mode; and Random Access (RA) mode. The encoding and decoding structure can therefore be chosen according to the purpose of the application service. In particular, random access mode uses B-frames and can compress high-quality video with few bits. In HEVC, encoding is performed with a reference structure confined within the GOP.

US Registered Patent No. US10523940

The GOP size is tied to the temporal correlation between the frames in that GOP. That is, changing the GOP size changes the GOP's hierarchical reference structure and the reference frames of the frame currently being encoded. For example, for a video with an abrupt scene change or large motion, if the GOP size is large, the texture information of a reference frame can differ from that of the frame being encoded, and encoding efficiency may deteriorate.

The technology described below aims to provide a technique for selecting the GOP size for encoding, and specifically a technique for determining the GOP size based on a learning model.

A GOP selection method based on reinforcement learning includes the steps of: receiving, by an analysis device, an input image composed of a plurality of frames; determining, by the analysis device, a path of a binary tree for determining a GOP (group of pictures) based on the plurality of frames; and selecting, by the analysis device, the GOP of the input image based on a leaf node of the binary tree.

An analysis device that selects a GOP based on reinforcement learning includes: an input device that receives an input image composed of a plurality of frames; a storage device that stores a reinforcement learning model for determining a path of a binary tree used to determine a GOP (group of pictures) based on the frames; and a computing device that applies the plurality of frames to the binary tree and selects the GOP of the input image based on a leaf node of the binary tree.

In the reinforcement learning, the environment is the binary tree, the action is whether to branch the tree, and the reward is the encoding efficiency of the selected GOP.

The technology described below uses reinforcement learning to provide an optimal GOP size for the current video. It therefore contributes to maximizing video encoding efficiency.

FIG. 1 shows a hierarchical coding structure in random access mode.
FIG. 2 is an example of a reinforcement learning environment for GOP selection.
FIG. 3 is an example of an adaptive GOP tree structure.
FIG. 4 is an example of GOP selection scenarios according to GOP size.
FIG. 5 is an example of a neural network model for deciding branching in the GOP binary tree.
FIG. 6 is an example of QP determination using reinforcement learning.
FIG. 7 is an example of an analysis device.

The technology described below may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the technology to particular embodiments, and it should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope.

Terms such as first, second, A, and B may be used to describe various components, but the components are not limited by these terms; the terms are used only to distinguish one component from another. For example, without departing from the scope of the technology described below, a first component may be named a second component, and similarly a second component may be named a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of them.

As used in this specification, singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise. Terms such as "comprises" mean that the described features, numbers, steps, operations, components, parts, or combinations thereof exist, and should not be understood to exclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Before the drawings are described in detail, it should be made clear that the components in this specification are distinguished only by the main function each one performs. That is, two or more components described below may be combined into one, or one component may be divided into two or more components by more finely divided functions. In addition to its own main function, each component described below may also perform some or all of the functions of other components, and some of the main functions of each component may be carried out exclusively by another component.

Also, in performing a method or an operating method, the steps that make up the method may occur in an order different from the stated order unless the context clearly specifies a particular order. That is, the steps may occur in the stated order, may be performed substantially simultaneously, or may be performed in reverse order.

The technology described below selects a GOP using a learning model, specifically using reinforcement learning. Reinforcement learning is first described briefly.

Supervised learning trains on data with known answers to predict values or categories for new data, and unsupervised learning groups data without answers appropriately or finds relationships among the data. Reinforcement learning, in contrast, learns how an agent should act within a given environment. It is a methodology in which an agent defined in some environment recognizes the current state and selects, among the available actions, the action or sequence of actions that maximizes the reward. The goal is for the agent to learn actions or action sequences that maximize the reward.

Reinforcement learning mainly uses a probabilistic model called the Markov Decision Process (MDP). The MDP models decision-making under the assumption that the state at time t is affected only by the state at time t-1, and is expressed by the probability formula in Equation 1 below.

The concept that extends the MDP by adding rewards to states is called the Markov reward process and is expressed as a tuple of the form (χ, A, p, q, p₀). Here χ denotes the states, A the actions, p(·|x,a) the probability of moving to the next state x_{t+1}, q(·|x,a) the probability of the reward R(x_t, a_t) for the action, and p₀ the initial probability distribution.

In reinforcement learning, the agent acts in the current state by considering how much reward each action will bring. Choosing good actions means consecutively taking the actions that yield the highest reward. The total of all rewards eventually received is called the Q-value and is expressed as Equation 2 below. The total reward Q(s,a) obtainable by taking action a in the current state s can be computed as the sum of the immediate reward for the current action and the maximum of the reward obtainable in the future.

Here, r(s,a) is the immediate reward received when action a is taken in the current state s, and s' is the next state reached by taking action a in state s. max_a Q(s',a) is the maximum reward obtainable from the next state s'. γ is the discount rate, which adjusts the importance given to future value: the larger the discount rate, the more weight is placed on future rewards, and the smaller it is, the more weight is placed on the immediate reward. The goal of reinforcement learning is to select the action that maximizes Q(s,a).
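
The relation described in words above can be written compactly. The following is a reconstruction from the prose using the symbols the text defines (r, γ, s, s'); it is not a copy of the patent's Equation 2, and the primed action a' is introduced here only to distinguish the maximization variable from the current action.

```latex
Q(s,a) = r(s,a) + \gamma \max_{a'} Q(s', a')
```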

The Q-table holds the action-value function value for each action taken in the current state. The Q-table is first initialized with arbitrary values and is then updated as in Equation 3 below as learning proceeds.

The method of building a Q-table for actions and continuously updating it is called Q-learning. If the Markov state assumption holds, it has been proven that the recursive nature of Equation 3 can propagate rewards obtainable in the future back to the distant past. A model that, without updating a Q-table, takes the current state as input and predicts the Q-values of the actions available in that state is called a Q-network.
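
As a rough illustration of the table update described above, the following sketch applies one step of the standard tabular Q-learning rule. The learning rate `alpha`, the state names, and the zero initialization are assumptions made for this example; the text does not show Equation 3, so the rule here may differ from it in detail.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action).

    alpha (learning rate) and gamma (discount rate) are illustrative values.
    """
    # Best value reachable from the next state, over all available actions.
    best_next = max(q_table[(next_state, a)] for a in actions)
    # Immediate reward plus discounted future value.
    td_target = reward + gamma * best_next
    # Move the stored estimate a fraction alpha toward the target.
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# The text initializes the table with arbitrary values; zeros are used here
# only to keep the example deterministic.
q_table = defaultdict(float)
actions = ("U", "D")
new_q = q_learning_update(q_table, "s0", "U", reward=1.0,
                          next_state="s1", actions=actions)
```

Repeating this update over many transitions is what lets a reward received late in an episode influence the values of earlier states.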

Training a Q-network with convolutional networks is called DQN (Deep Q-Network). When high-dimensional data (images, video, and so on) is given as the state, DQN can be trained effectively using a convolutional network. A detailed description of how DQN operates is omitted, and the structure of a DQN may vary. The technology described below can also use a DQN to predict the Q-values for GOP determination.

As described above, HEVC can selectively use one of the AI, low-delay, and random access encoding modes according to the purpose of the application service. Random access mode encodes a picture by referring to pictures that were encoded and decoded both earlier and later in time relative to that picture. Because it performs hierarchical encoding with reference pictures in both directions, it can achieve higher compression performance than low-delay mode, which uses only reference pictures from the earlier time direction.

FIG. 1 shows a hierarchical coding structure in random access mode for a GOP size of 8. If the GOP size is set to 16 or 32, the hierarchical structure changes accordingly, and so do the reference frames of the frame currently being encoded.

In FIG. 1, the number in each rectangle is the encoding order. The I-frame is coded first, as number 0. Next, a P (or B) frame is coded with reference to the I-frame, and then the B-frame between them is coded. These three frames are coded with QP=I, QP=I+1, and QP=I+2, respectively. Such frames are said to have the lowest depth in the hierarchy, or Temporal ID = 0.

Next, the frames lying between the frames with Temporal ID = 0 are coded. For example, the frame located midway between the frames coded 0th and 2nd is coded next, and then the frame located midway between the frames coded 1st and 2nd. These frames are assigned Temporal ID = 1. Frames assigned Temporal ID = 2 are coded in the same way.

POC (Picture Order Count) is an index assigned according to temporal order within the GOP. From the first (0th) frame to the 8th frame, POC takes the values 0 through 8 in order.

In HEVC, the QP of each frame is fixed by the encoding scenario, and encoding follows it. For example, in the GOP 8 scenario of FIG. 1, if the QP of the intra frame is I, the next frame to be encoded uses QP I+1, and the frame encoded after that uses QP I+2.

In the following description, the device that selects the GOP is a device that processes video, that is, an image processing device. Because it selects the GOP by analyzing the input video, it may also be called an analysis device. The description below assumes that the analysis device performs the GOP selection process. The analysis device is a device that can receive video, process it, and perform computation on it, such as a computer, a smart device, or a server on a network. The analysis device may also be an encoder that encodes video.

FIG. 2 is an example of a reinforcement learning environment for GOP selection. In reinforcement learning, the agent and the environment exchange actions, states, and rewards.

In FIG. 2, the environment has an adaptive GOP binary tree structure. The GOP tree is a tree structure made up of nodes. Each node n represents the encoding of one GOP or sub-GOP and is defined in the form n(S, L), where S is the start frame and L is the number of frames in that GOP or sub-GOP. FIG. 2 shows an example GOP tree for a video encoded starting from POC 0. As nodes branch, the depth of the tree increases until a leaf node is reached. A leaf node represents the determined structure for one GOP or sub-GOP. Once a GOP or sub-GOP is determined, the encoding order within that interval is determined according to the hierarchical B structure.

In FIG. 2, the actions are U (undetermined) and D (determined). Action U branches from the current state node in the GOP tree and moves to a state at the next depth. Action D fixes the current node as a leaf node and encodes the L frames starting at POC S, that is, frames [S:S+L]. Once an encoding method has been determined for all input frames, GOP selection for the input video ends and the reinforcement learning episode terminates.

In FIG. 2, the reward is based on the encoding efficiency obtained with the selected GOP. Encoding efficiency is measured by computing the rate-distortion (RD) cost J, which is defined as in Equation 4 below.

Here, D is the distortion, R is the number of bits required, and λ is a constant.
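
A minimal sketch of the cost computation follows, assuming the RD cost takes the common Lagrangian form J = D + λR, which the symbol definitions suggest but this text does not show. The function name and all numeric values are hypothetical and chosen only for illustration.

```python
def rd_cost(distortion: float, rate_bits: float, lam: float) -> float:
    """Rate-distortion cost, assuming the Lagrangian form J = D + lambda * R."""
    return distortion + lam * rate_bits


# Hypothetical measurements for two candidate GOP choices on the same frames.
cost_gop8 = rd_cost(distortion=40.0, rate_bits=1200.0, lam=0.5)
cost_gop16 = rd_cost(distortion=45.0, rate_bits=1000.0, lam=0.5)

# The candidate with the lower J is the more efficient one.
better = "GOP16" if cost_gop16 < cost_gop8 else "GOP8"
```

With these made-up numbers the larger GOP has slightly higher distortion but saves enough bits to give the lower overall cost.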

If the selected GOP yields a cost J lower than the current cost, the analysis device assigns a positive reward (r_t > 0). If the selected GOP yields a cost J equal to or greater than the current cost, it assigns a penalty (r_t ≤ 0). The value y_t, for the case where the termination condition is met and for the case where it is not, is determined as in Equation 5 below.

Here, Q is the action-value function with weights θ, r_t is the reward, and φ_t is the sequence at step t.
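
The two cases for y_t can be sketched as follows, assuming the usual DQN target: the reward alone when the episode terminates, and the reward plus the discounted best next Q-value otherwise. The reward magnitudes (+1 and -1), the discount value, and the function names are assumptions for this example; the text states only the sign of the reward.

```python
def reward(new_cost: float, current_cost: float) -> float:
    """Positive reward if the selected GOP lowers the RD cost, else a penalty.

    The magnitudes +1.0 and -1.0 are illustrative assumptions.
    """
    return 1.0 if new_cost < current_cost else -1.0


def dqn_target(r_t: float, next_q_values, terminal: bool, gamma: float = 0.9) -> float:
    """Target y_t, assuming the standard DQN form.

    terminal=True  -> y_t = r_t
    terminal=False -> y_t = r_t + gamma * max over next-step Q-values
    """
    if terminal:
        return r_t
    return r_t + gamma * max(next_q_values)


r = reward(new_cost=545.0, current_cost=640.0)
y_continue = dqn_target(r, next_q_values=[0.5, 2.0], terminal=False, gamma=0.5)
y_stop = dqn_target(r, next_q_values=[0.5, 2.0], terminal=True)
```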

DQN obtains the greatest reward when the value function Q reaches its true value. The analysis device defines the cost function as in Equation 6 below and performs gradient updates on it.

FIG. 3 is an example of an adaptive GOP tree structure. The adaptive GOP tree serves as the environment for determining the GOP of the input video by reinforcement learning. As shown in FIG. 3, the GOP tree consists of nodes and paths, and each node consists of an encoding start POC and the number of frames to encode. The adaptive GOP tree starts from the initial node n₀, which is expressed as in Equation 7 below.

Here, S is the start POC of the input video and L is the length. For example, when S = 0 and L = 32, n₀ = [0,32], and the coding of 32 frames starting from POC 0 is considered. FIG. 3 is an example in which the minimum GOP unit is 8 and the maximum unit is 32. The smaller the minimum GOP unit, the deeper the tree. When the depth d increases at an arbitrary node n, a left node and a right node branch from the current node, expressed as follows.

Here, P_d(S) is the start frame of the parent node of the current branching node. The left node inherits the parent's start POC, and its length is set to half of L. The right node starts at the POC obtained by shifting the parent's S by L/2^1, and its length is also set to half of L. For example, when S = 0 and L = 32, n_{1,l} = [0,16], and the coding of 16 frames starting from POC 0 is considered; n_{1,r} = [16,16], and the coding of 16 frames starting from POC 16 is considered.

A node's length cannot be divided below the minimum unit. That is, the depth d that satisfies L/2^d = L_min is the depth of the adaptive tree.
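
The branching rule and the depth limit can be sketched as follows. Nodes are represented as `(start, length)` tuples, and the helper names `split_node` and `tree_depth` are hypothetical; lengths are assumed to be powers of two times L_min, as in the 32/16/8 example.

```python
def split_node(start: int, length: int):
    """Return the (left, right) children of a node as (start, length) tuples.

    The left child keeps the parent's start POC; the right child starts
    half the parent's length later. Both children have half the length.
    """
    half = length // 2
    return (start, half), (start + half, half)


def tree_depth(max_len: int, min_len: int) -> int:
    """Depth d such that max_len / 2**d == min_len (power-of-two ratio assumed)."""
    return (max_len // min_len).bit_length() - 1


left, right = split_node(0, 32)         # children of n0 = [0, 32]
depth = tree_depth(max_len=32, min_len=8)
```

For the example in the text, the root [0,32] splits into [0,16] and [16,16], and a minimum unit of 8 gives a tree of depth 2.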

In the adaptive GOP tree structure, each node has two possible paths: one that selects the node itself and one that branches to the next depth. The path at each node is determined by the reinforcement learning action, which is either U (undetermined) or D (determined). When action D is selected, the analysis device fixes that node as the final path; when U is selected, it branches to the next depth and determines the path again. The branch consists of a left node and a right node, and an action is selected for each of them.

At depth d = 0 only the single node n₀ exists, so action U or D is selected at that one node. If U is selected at depth d = 0, two nodes n_l and n_r exist at depth d = 1, so action U or D is selected at each of the two nodes.

When the analysis device selects action D and thus takes the path that selects the node itself, it selects as one GOP the frames from that node's start POC up to the set length and codes them.

FIG. 4 is an example of GOP selection scenarios according to GOP size. It shows the possible GOP cases determined through the adaptive GOP tree structure when the minimum GOP unit is 8 and the maximum unit is 32, branching from the GOP tree of FIG. 3.

(i) Case 1: action D is selected at n₀.
(ii) Case 2: action U is selected at n₀, then action D is selected at n_l and action D at n_r.
(iii) Case 3: the nodes branch down to the leaf nodes, and action D is selected at n_{l,l}, n_{l,r}, n_{r,l}, and n_{r,r}.
(iv) Case 4: action U is selected at n₀ and at n_l, then action D is selected at the leaf nodes n_{l,l} and n_{l,r}, and action D at n_r.
(v) Case 5: action D is selected at n_l and action U at n_r, then action D is selected at n_{r,l} and n_{r,r}.

Action determination is complete when D has been selected in every branch or a node with the minimum-unit L has been reached. Once action determination is complete, the analysis device determines the GOP according to the final branching path.
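
The cases above can be checked by enumerating every outcome of the U/D decisions. The sketch below is only an exhaustive enumeration for illustration, not the learned selection procedure; the function name `gop_partitions` and the `(start, length)` tuple representation are assumptions.

```python
def gop_partitions(start: int, length: int, min_len: int):
    """Yield every list of (start, length) leaves reachable by U/D choices."""
    # Action D: this node itself becomes one GOP.
    yield [(start, length)]
    # Action U: allowed only if the children would not fall below min_len.
    half = length // 2
    if half >= min_len:
        for left in gop_partitions(start, half, min_len):
            for right in gop_partitions(start + half, half, min_len):
                yield left + right


# Keep only the GOP lengths of each case for readability.
cases = [[length for _, length in leaves]
         for leaves in gop_partitions(0, 32, 8)]
```

For a maximum unit of 32 and a minimum unit of 8 this produces five partitions, [32], [16, 16], [16, 8, 8], [8, 8, 16], and [8, 8, 8, 8], matching the five cases listed for FIG. 4.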

A convolutional neural network (CNN) can be used to predict the action for a given input video. FIG. 5 is an example of a neural network model (100) for deciding branching in the GOP binary tree. The neural network model (100) includes a first input stage (110), a second input stage (120), and an output stage (130). The first input stage (110) and the second input stage (120) have the same structure, and each extracts feature values from the input video. The input stages (110, 120) may include convolutional layers and pooling layers. The output stage (130) fuses the output of the first input stage (110) with the output of the second input stage (120) and produces the final result through a fully connected layer. The neural network model (100) determines the action (whether to branch) at the current node of the GOP binary tree.

FIG. 5 shows the input video at depth 0 and the network that predicts the corresponding action. The first input stage (110) receives a first stream, which is the concatenation of three frames: n₀, n₈, and n₁₆. The second input stage (120) receives a second stream, which is also the concatenation of three frames: n₈, n₁₆, and n₃₂. The input frames fed to the input stages can be determined by Equation 9 below.

S is the start frame number, L is the number of frames set for the node, d is the tree depth, P_d(S) is the start frame of the current node's parent node, P_d(L) is the number of frames set for the current node's parent node, n_{d,l}(S) is the start frame number of the parent node's left child node, and n_{d,r}(L) is the number of frames set for the parent node's right child node.

In Equation 9, (1) gives the input frames that form the first stream, and (2) gives the input frames that form the second stream.

The output terminal 130 fuses the output of the first input terminal 110 with the output of the second input terminal 120 and finally outputs information about the action for the input data.

Meanwhile, in addition to the input frames, the artificial neural network 100 may also receive information that expresses motion between frames, such as optical flow or motion vectors.

The artificial neural network 100 outputs a Q value for each action. Of the actions U and D, the one with the larger Q value is selected.
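The selection rule above can be sketched in a few lines. This is only an illustration: the function name `select_action` and the dictionary layout of the Q values are assumptions, and the tie-breaking rule is not specified in the description.

```python
def select_action(q_values):
    """Return the action ("U" or "D") that has the larger Q value."""
    # Assumption: ties go to "U"; the description does not say how ties are handled.
    return max(("U", "D"), key=lambda action: q_values[action])


print(select_action({"U": 0.7, "D": 0.2}))
```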

Training updates the network parameters in the direction that maximizes the reward. In adaptive GOP selection reinforcement learning, one episode is defined per input video. An episode ends when the union of the intervals of the nodes whose branching has ended in the adaptive GOP tree matches L. Branching ends at a node either when action D is taken or when the tree cannot go deeper because of L_min, the minimum GOP size. The reward is calculated when the episode ends. If the choices made maximize the reward, so that the best GOP combination has been set for the input video, the next episode proceeds with the next video. The episode for the next video starts from n₀ of the adaptive GOP tree, while the reward keeps the value obtained in the previous episode. If, on the other hand, a penalty is incurred because of an incorrect GOP prediction, the network parameters are updated through Equation 6, and reinforcement learning restarts with a different video after the input video, reward, and state are all reset.
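The branching and termination rule can be illustrated with a minimal sketch. It assumes that each node is a (start, length) pair, that a branching node splits into two equal halves, and that D is the action that ends branching, as the passage above implies. The names `select_gops`, `episode_done`, and the stand-in `policy` are hypothetical; in the described technique the decision would come from the trained network.

```python
def select_gops(start, length, l_min, policy):
    """Walk the GOP binary tree and return the leaf intervals as (start, length)."""
    # Branching ends when the policy returns "D" or when the node is already
    # at the minimum GOP size L_min and cannot be split further.
    if length <= l_min or policy(start, length) == "D":
        return [(start, length)]
    half = length // 2  # assumption: a node splits into two equal halves
    return (select_gops(start, half, l_min, policy)
            + select_gops(start + half, half, l_min, policy))


def episode_done(leaves, total_length):
    """Leaves are disjoint by construction, so their lengths must sum to L."""
    return sum(length for _, length in leaves) == total_length


# Hypothetical stand-in policy: keep branching only in the first 16 frames.
policy = lambda start, length: "U" if start < 16 else "D"
leaves = select_gops(0, 32, 8, policy)
print(leaves)
print(episode_done(leaves, 32))
```

With this stand-in policy, the first half of a 32-frame input is split down to the minimum size and the second half is kept as a single GOP candidate.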

The following describes how the quantization parameter (QP) is determined for the frames of the selected GOP. Conventional QP determination was described with reference to FIG. 1. The technique described below determines the QP with an approach similar to the GOP selection technique above, that is, QP determination using reinforcement learning.

The environment may be a QP binary tree similar to the GOP binary tree. Alternatively, the environment may take the form of a chain. The QP tree or QP chain consists of state nodes that result from actions. A state node represents the frame currently being encoded. The action selects a value α, the difference in QP between the current node and the immediately preceding (or parent) node in the QP tree or chain. The reward is the coding efficiency obtained when the selected QP is used, which can be calculated as the RD cost. In a QP tree or QP chain, depth refers to the position of a state node in the sequence of successive actions within an episode. The depth grows with each successive action until reinforcement learning ends.

If the selected QP yields a lower RD cost, a bonus is given as the reward; otherwise, a penalty is given. The cost J of Equation 4 above can be used as the coding efficiency.
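The bonus-or-penalty rule can be sketched as follows. Equation 4 is not reproduced in this part of the description, so the sketch assumes the usual Lagrangian form J = D + λR. The reference cost used for comparison and the reward magnitudes of +1 and -1 are also assumptions made only for illustration.

```python
def rd_cost(distortion, rate, lam):
    """Rate-distortion cost, assuming the Lagrangian form J = D + lambda * R."""
    return distortion + lam * rate


def qp_reward(cost_selected, cost_reference, bonus=1.0, penalty=-1.0):
    """Bonus if the selected QP lowers the RD cost, penalty otherwise."""
    # Assumption: the reference is the cost of some baseline choice,
    # for example a fixed QP setting.
    return bonus if cost_selected < cost_reference else penalty


j_selected = rd_cost(distortion=10.0, rate=100.0, lam=0.5)
j_reference = rd_cost(distortion=15.0, rate=100.0, lam=0.5)
print(qp_reward(j_selected, j_reference))
```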

Through the reinforcement learning algorithm, the analysis device can use the QP tree or QP chain to select the QP value according to the input frames. FIG. 6 shows an example of QP determination using reinforcement learning.

FIG. 6 illustrates a QP chain for a GOP size of 32. If the QP of POC 0, the intra frame, is q, the QP of the next frame, POC 32, is q + α. In case 1, POC 32 is encoded with QP q + 1 and POC 16 is also encoded with QP q + 1. In case 2, POC 32 has QP q + 1 and POC 16 has QP q + 2, which is a possible combination identical to the fixed QP of conventional HEVC. Various values can be used for α.
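The two cases can be reproduced with a short sketch in which each action α is added to the QP of the preceding node in the chain. The function name `qp_chain` and the base value q = 30 are assumptions chosen only for illustration.

```python
from itertools import accumulate


def qp_chain(q, alphas):
    """Return the QP at each node of the chain, starting from the intra frame."""
    # Each alpha is the QP difference relative to the preceding node,
    # so the QPs are the running sum of the alphas on top of q.
    return list(accumulate(alphas, initial=q))


q = 30  # hypothetical QP of POC 0
print(qp_chain(q, [1, 0]))  # case 1: POC 0, POC 32, POC 16
print(qp_chain(q, [1, 1]))  # case 2: POC 0, POC 32, POC 16
```

Case 1 corresponds to α values of 1 and 0, and case 2 to α values of 1 and 1.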

The adaptive QP selection network may use the same structure as the adaptive GOP selection network. Like the adaptive GOP selection network of FIG. 5, it predicts the QP for a given input video by taking inter-frame correlation into account. During training, the network parameters are updated in the direction that maximizes the reward.

Once the optimal GOP size has been determined, the analysis device can keep the hierarchical structure of that GOP size unchanged and determine the optimal QP for each frame through the QP chain.

Alternatively, encoding may be performed using only the adaptive GOP selection described above. In this case, the QP follows the fixed QP values of the encoding scenario. For example, for a video whose GOP size is determined to be 16, the same QP values that conventional HEVC uses for GOP 16 can be used.

Furthermore, encoding may be performed using only adaptive QP selection. In this case, a fixed GOP size is used. For example, with a fixed GOP size of 8, the optimal QP for the input video frames can be determined while the hierarchical structure of GOP 8 is maintained.

FIG. 7 shows an example of the analysis device 200. The analysis device 200 may be a dedicated device that determines only the GOP and/or the QP. Alternatively, the analysis device 200 may be an image processing device that processes an input image. For example, the analysis device 200 may be an encoder. The analysis device 200 may be implemented as a computer capable of processing and analyzing image data, a network server, a chipset with an embedded program, or the like.

The analysis device 200 includes a storage device 210, a memory 220, a computing device 230, and an interface device 240. The analysis device 200 may further include a communication device 250.

The storage device 210 may store the GOP binary tree and a reinforcement learning model that determines the path of the binary tree for determining a GOP based on frames.

The storage device 210 may store the input image that is entered or received.

The memory 220 may temporarily store information that is generated or needed during image processing and GOP selection.

The interface device 240 is a component that receives data and commands. The interface device 240 may include a physical device and a communication protocol for internal communication. The interface device 240 may receive an input image. The interface device 240 may also receive a command to analyze the input image.

The communication device 250 may receive information from an external object through wired or wireless communication. The communication device 250 may receive an input image. The communication device 250 may transmit the determined GOP and/or QP to an external object.

The communication device 250 and the interface device 240 are devices that receive data or commands from outside. Because they receive data, they may be referred to as input devices.

The computing device 230 is a component that processes given data or information. The computing device 230 may be a device such as a processor, an application processor (AP), or a chip with an embedded program.

The computing device 230 feeds the frames of the input image received from the input device to the root node of the GOP binary tree and determines the path of the GOP binary tree.

The computing device 230 may determine the branching of the GOP binary tree using the reinforcement learning model described above.

The computing device 230 may determine all paths of the GOP binary tree using the reinforcement learning model and select the GOP based on the leaf nodes that are finally determined.

The computing device 230 may determine the action (U or D) at a node based on the Q value output by the reinforcement learning model, which receives a plurality of frames selected from the frames of that node.

The computing device 230 may determine the QP of the selected GOP using reinforcement learning, as described above.

In addition, the image processing method, the GOP selection method, and the QP determination method described above may be implemented as a program (or application) that includes an executable algorithm that can be run on a computer. The program may be provided stored on a transitory or non-transitory computer-readable medium.

A non-transitory readable medium is a medium that stores data semi-permanently and can be read by a device, as opposed to a medium that stores data only briefly, such as a register, a cache, or a memory. Specifically, the various applications or programs described above may be provided stored on a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB device, memory card, ROM (read-only memory), PROM (programmable ROM), EPROM (erasable PROM), EEPROM (electrically erasable PROM), or flash memory.

A transitory readable medium refers to any of various types of RAM, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synclink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).

The embodiments and the accompanying drawings merely illustrate part of the technical idea of the technology described above. Modifications and specific embodiments that a person skilled in the art can readily infer within the scope of the technical idea contained in the specification and drawings are evidently all included in the scope of rights of the technology described above.

Claims

A GOP selection method based on reinforcement learning, comprising:
receiving, by an analysis device, an input image composed of a plurality of frames;
determining, by the analysis device, a path of a binary tree for determining a GOP (Group of Picture) based on the plurality of frames; and
selecting, by the analysis device, the GOP of the input image based on a leaf node of the binary tree,
wherein the analysis device determines the path of the binary tree using reinforcement learning in which the environment is the binary tree, the action is whether to branch the tree, and the reward is the coding efficiency of the selected GOP.

The method of claim 1,
wherein a node of the binary tree represents a GOP candidate and is defined by a start number and a length of the input frames.

The method of claim 1,
wherein the analysis device determines whether to branch at a node of the binary tree based on a Q value output by an artificial neural network that receives a plurality of frames among the frames input to the node.

The method of claim 3,
wherein the artificial neural network comprises:
a first input terminal that receives data obtained by combining a plurality of frames among the frames input to the node;
a second input terminal that receives data obtained by combining a plurality of frames among the frames input to the node; and
an output terminal that fuses the output of the first input terminal and the output of the second input terminal and outputs the Q value based on the fused data.

The method of claim 3,
wherein the artificial neural network comprises an input layer that receives data combining frames determined by Equation (1) below and an input layer that receives data combining frames determined by Equation (2) below.

(P_d(S) is the starting frame of the parent node of the current node, L is the length set for the current node, P_d(L) is the length set for the parent node of the current node, n_{d,l}(S) is the starting frame number of the left child node of the parent node, and n_{d,r}(L) is the length set for the right child node of the parent node.)

The method of claim 1,
further comprising determining, by the analysis device, a QP (Quantization Parameter) of the selected GOP using reinforcement learning,
wherein, in the reinforcement learning, a state node is the frame currently being encoded, the action is the QP difference between the current node and the previous node, and the reward is the coding efficiency obtained when the selected QP is used.

An analysis device for selecting a GOP based on reinforcement learning, comprising:
an input device that receives an input image composed of a plurality of frames;
a storage device that stores a reinforcement learning model for determining a path of a binary tree for determining a GOP (Group of Picture) based on frames; and
a computing device that applies the plurality of frames to the binary tree and selects a GOP of the input image based on leaf nodes of the binary tree,
wherein the computing device determines the path of the binary tree using reinforcement learning in which the environment is the binary tree, the action is whether to branch the tree, and the reward is the coding efficiency of the selected GOP.

The analysis device of claim 7,
wherein a node of the binary tree represents a GOP candidate and is defined by a start number and a length of the input frames.

The analysis device of claim 7,
wherein the computing device determines whether to branch at a node of the binary tree based on a Q value output by an artificial neural network that receives a plurality of frames among the frames input to the node.

The analysis device of claim 9,
wherein the artificial neural network comprises:
a first input terminal that receives data obtained by combining a plurality of frames among the frames input to the node;
a second input terminal that receives data obtained by combining a plurality of frames among the frames input to the node; and
an output terminal that fuses the output of the first input terminal and the output of the second input terminal and outputs the Q value based on the fused data.

The analysis device of claim 9,
wherein the artificial neural network comprises an input layer that receives data combining frames determined by Equation (1) below and an input layer that receives data combining frames determined by Equation (2) below.

The analysis device of claim 7,
wherein the computing device determines a QP (Quantization Parameter) of the selected GOP using reinforcement learning, and
wherein, in the reinforcement learning, a state node is the frame currently being encoded, the action is the QP difference between the current node and the previous node, and the reward is the coding efficiency obtained when the selected QP is used.