KR102472971B1 - Method, system, and computer program to optimize video encoding using artificial intelligence model

Info

Publication number: KR102472971B1
Application number: KR1020210026373A
Authority: KR
Inventors: 박근백; 김재훈; 왕희돈; 장준기; 김성호; 조성택; 강인철; 허정수; 김봉섭
Original assignee: 네이버 주식회사
Priority date: 2020-11-19
Filing date: 2021-02-26
Publication date: 2022-12-02
Also published as: KR20220068880A

Abstract

A video encoding optimization method, system, and computer program using an artificial intelligence model are disclosed. The encoding optimization method includes predicting, using an artificial intelligence model, an encoding option corresponding to a feature of an input video; and encoding the input video according to the encoding option.

Description

Method, system, and computer program to optimize video encoding using artificial intelligence model

The description below relates to video encoding technology.

Recently, multimedia information has been provided in which video and audio are combined with communications and computers and merged into new media. For example, with the spread of high-speed data networks, users can enjoy stereophonic sound and high-definition video and can talk face to face through video calls. Users can also purchase products while viewing product information in real time on a computer or TV, listen to music or watch movies through websites, and take video lectures on a computer.

Such multimedia information has developed on the basis of video compression (that is, encoding) technology. Data that conveys information can be compressed by removing redundant elements (elements not strictly required to reconstruct the data accurately). In lossy compression, the data reconstructed by the decoder is not identical to the original data, but subjectively redundant elements are removed to obtain high compression efficiency. In image or video compression, a subjectively redundant element is one that can be removed without significantly affecting the picture quality perceived by the viewer.

As an example of video compression technology, Korean Patent Registration No. 10-1136858 (registered on April 9, 2012) discloses an encoding technology for a video compression standard.

Provided is an encoding optimization technology that uses artificial intelligence (AI) to find the optimal encoding parameters for each section of a video and thereby improve compression efficiency.

Provided is an encoding optimization technology that can keep video quality high while reducing the data costs borne by users.

Provided is an encoding optimization technology for improving bitrate together with user experience.

Provided is an encoding optimization method executed on a computer device, the computer device including at least one processor configured to execute computer-readable instructions contained in a memory, the encoding optimization method including: predicting, by the at least one processor, an encoding option corresponding to a feature of an input video using an artificial intelligence model; and encoding, by the at least one processor, the input video according to the encoding option.

According to one aspect, the predicting of the encoding option may include predicting, for each section into which the input video is divided, an encoding option corresponding to the feature of that section.

According to another aspect, the artificial intelligence model may be an encoding option prediction model that includes a convolutional neural network (CNN) model for extracting a frame feature from each frame image, a recurrent neural network (RNN) model for extracting a video feature based on the relationships between the frame images, and a classifier for classifying the encoding option corresponding to the video feature, wherein the CNN model, the RNN model, and the classifier may be trained end-to-end (E2E) against a single loss function.

According to another aspect, the artificial intelligence model may be built as a single model using a data set in which, across the serviceable resolutions, encoding options that satisfy the same criterion are grouped under a label of the same category.

According to another aspect, the predicting of the encoding option may include predicting, as the encoding option, a constant rate factor (CRF) that satisfies a target Video Multi-method Assessment Fusion (VMAF) score.
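To make the criterion "a CRF that satisfies a target VMAF score" concrete, the following minimal sketch picks such a CRF from measured values instead of predicting it. This is an illustration only: the patent predicts the value with a model, and the function name `pick_crf_for_target_vmaf`, the candidate CRF values, and the VMAF numbers are all hypothetical. The sketch assumes that the largest qualifying CRF is the preferred one, since in CRF-based rate control a larger CRF means stronger compression.

```python
from typing import Mapping, Optional


def pick_crf_for_target_vmaf(
    vmaf_by_crf: Mapping[int, float], target_vmaf: float
) -> Optional[int]:
    """Return the largest CRF whose measured VMAF still meets the target.

    A larger CRF compresses harder, so the largest qualifying CRF gives the
    smallest output that still satisfies the quality target. Returns None
    when no candidate reaches the target.
    """
    qualifying = [crf for crf, vmaf in vmaf_by_crf.items() if vmaf >= target_vmaf]
    return max(qualifying) if qualifying else None


# Hypothetical measurements for one video section (not taken from the patent).
measured = {20: 97.1, 23: 95.4, 26: 93.2, 29: 90.0, 32: 85.7}
print(pick_crf_for_target_vmaf(measured, target_vmaf=93.0))  # prints 26
```

Measuring every candidate CRF this way requires one trial encode per candidate, which is the cost a predicted CRF is meant to avoid.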

According to another aspect, the predicting of the encoding option may include: predicting, as the encoding option, a first CRF that satisfies a target VMAF score; predicting, as the encoding option, a second CRF that satisfies a target bitrate; and determining, using the first CRF and the second CRF, a third CRF to be actually applied when encoding the input video.

According to another aspect, the determining of the third CRF may include determining the third CRF in consideration of a weight related to a bitrate limit, which is a user input value.

According to another aspect, the determining of the third CRF may include: determining the first CRF as the third CRF if the first CRF is greater than or equal to the second CRF; and determining the third CRF using a weight related to the bitrate limit if the first CRF is smaller than the second CRF.
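The two branches of this decision can be sketched as follows. The first branch follows the text directly: a larger CRF yields a lower bitrate, so when the quality-driven first CRF is already at or above the bitrate-driven second CRF, the quality target alone keeps the bitrate within its target. The second branch is an assumption: the text states only that a weight related to the bitrate limit is used, so the linear blend below, the weight range of 0 to 1, and the function name `decide_third_crf` are hypothetical.

```python
def decide_third_crf(first_crf: float, second_crf: float, bitrate_weight: float) -> float:
    """Combine a quality-driven CRF and a bitrate-driven CRF into the CRF to apply.

    first_crf:      CRF predicted to satisfy the target VMAF score.
    second_crf:     CRF predicted to satisfy the target bitrate.
    bitrate_weight: assumed to lie in [0, 1]; 0 ignores the bitrate limit,
                    1 enforces it fully (hypothetical convention).
    """
    if not 0.0 <= bitrate_weight <= 1.0:
        raise ValueError("bitrate_weight must be between 0 and 1")
    if first_crf >= second_crf:
        # The quality target already keeps the bitrate within the target.
        return first_crf
    # Assumed rule: move from the quality CRF toward the bitrate CRF
    # in proportion to the weight.
    return first_crf + bitrate_weight * (second_crf - first_crf)


print(decide_third_crf(28, 25, 0.5))  # prints 28
print(decide_third_crf(24, 30, 0.5))  # prints 27.0
```

With this convention, a weight of 0 returns the first CRF unchanged and a weight of 1 returns the second CRF, so the weight expresses how strictly the bitrate limit overrides the quality target.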

According to another aspect, the predicting of the encoding option may further include predicting, as the encoding option and based on the third CRF, a QP limit parameter that restricts the quantization parameter (QP) values of the blocks of a frame image to a target level.

According to another aspect, the predicting of the QP limit parameter may include calculating a minimum QP value and a maximum QP value by applying an offset corresponding to the QP limit parameter to the third CRF.
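A minimal sketch of this calculation follows. The text does not say how the offset is applied, so the symmetric form (third CRF minus offset, third CRF plus offset) is an assumption, as is the function name `qp_bounds`. The default clamp of 0 to 51 matches the QP range of 8-bit H.264; other codecs or bit depths would need different limits.

```python
from typing import Tuple


def qp_bounds(
    third_crf: int, offset: int, qp_floor: int = 0, qp_ceiling: int = 51
) -> Tuple[int, int]:
    """Derive (min QP, max QP) from the CRF to be applied and an offset.

    Assumption: the offset is applied symmetrically around the CRF and the
    result is clamped to the codec's valid QP range.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    qp_min = max(qp_floor, third_crf - offset)
    qp_max = min(qp_ceiling, third_crf + offset)
    return qp_min, qp_max


print(qp_bounds(27, 6))  # prints (21, 33)
print(qp_bounds(48, 6))  # prints (42, 51)
```

Bounding QP around the chosen CRF keeps individual blocks from being quantized far more coarsely or finely than the target level, which is the stated purpose of the QP limit parameter.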

Provided is a computer program stored on a computer-readable recording medium to cause the computer device to execute the encoding optimization method.

Provided is a computer device including at least one processor configured to execute computer-readable instructions contained in a memory, wherein the at least one processor predicts an encoding option corresponding to a feature of an input video using an artificial intelligence model and encodes the input video according to the encoding option.

According to embodiments of the present invention, video compression efficiency can be improved by using artificial intelligence (AI) to find the optimal encoding parameters for each section of a video and applying them to encoding.

According to embodiments of the present invention, securing video compression efficiency through encoding options optimized by an AI model makes it possible to improve both user experience and bitrate, to save storage space for the video service, and to reduce users' network costs.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the internal configuration of an electronic device and a server according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of the components of a distributed encoding system according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the basic concept of an encoding option prediction model for encoding optimization according to an embodiment of the present invention.
FIG. 5 illustrates a video encoding optimization method according to an embodiment of the present invention.
FIGS. 6 and 7 are exemplary diagrams illustrating a picture quality metric for each resolution according to an embodiment of the present invention.
FIG. 8 is an exemplary diagram illustrating a process of generating label data for model training according to an embodiment of the present invention.
FIG. 9 is an exemplary diagram illustrating a process of predicting additional encoding options in consideration of service constraints according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Embodiments of the present invention relate to a video encoding optimization technology using an artificial intelligence model.

Embodiments, including those specifically disclosed in this specification, can predict an optimal encoding option using an artificial intelligence model and thereby achieve significant advantages in various respects such as compression efficiency, cost reduction, resource savings, and service quality.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment of the present invention. The network environment of FIG. 1 includes a plurality of electronic devices 110, 120, 130, and 140, a plurality of servers 150 and 160, and a network 170. FIG. 1 is an example for describing the invention, and the number of electronic devices or servers is not limited to that shown in FIG. 1.

The plurality of electronic devices 110, 120, 130, and 140 may be fixed or mobile terminals implemented as computer systems. Examples of the electronic devices 110, 120, 130, and 140 include a smartphone, a mobile phone, a navigation device, a computer, a laptop, a digital broadcast terminal, a personal digital assistant (PDA), a portable multimedia player (PMP), a tablet PC, a game console, a wearable device, an Internet of Things (IoT) device, a virtual reality (VR) device, and an augmented reality (AR) device. Although FIG. 1 depicts the electronic device 110 in the form of a smartphone, in embodiments of the present invention the electronic device 110 may be any of various physical computer systems capable of communicating with the other electronic devices 120, 130, and 140 and/or the servers 150 and 160 over the network 170 using a wireless or wired communication method.

The communication method is not limited and may include not only communication methods that use the communication networks the network 170 may include (for example, a mobile communication network, the wired Internet, the wireless Internet, a broadcasting network, or a satellite network) but also short-range wireless communication between devices. For example, the network 170 may include any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), and the Internet. The network 170 may also include any one or more network topologies, including but not limited to a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network.

Each of the servers 150 and 160 may be implemented as a computer device or a plurality of computer devices that communicate with the electronic devices 110, 120, 130, and 140 over the network 170 to provide instructions, code, files, content, services, and the like. For example, the server 150 may be a system that provides a first service to the electronic devices 110, 120, 130, and 140 connected through the network 170, and the server 160 may likewise be a system that provides a second service to the electronic devices 110, 120, 130, and 140 connected through the network 170. As a more specific example, the server 150 may provide, as the first service, the service targeted by an application (for example, a video service) through that application, which is a computer program installed and run on the electronic devices 110, 120, 130, and 140. As another example, the server 160 may provide, as the second service, a service that distributes the files for installing and running the application to the electronic devices 110, 120, 130, and 140.

FIG. 2 is a block diagram illustrating an example of a computer device according to an embodiment of the present invention. Each of the electronic devices 110, 120, 130, and 140 and each of the servers 150 and 160 described above may be implemented by the computer device 200 shown in FIG. 2.

As shown in FIG. 2, the computer device 200 may include a memory 210, a processor 220, a communication interface 230, and an input/output interface 240. The memory 210 is a computer-readable recording medium and may include random access memory (RAM), read-only memory (ROM), and a permanent mass storage device such as a disk drive. A permanent mass storage device such as ROM or a disk drive may instead be included in the computer device 200 as a separate permanent storage device distinct from the memory 210. The memory 210 may also store an operating system and at least one program code. These software components may be loaded into the memory 210 from a computer-readable recording medium separate from the memory 210, such as a floppy drive, a disk, a tape, a DVD/CD-ROM drive, or a memory card. In other embodiments, the software components may be loaded into the memory 210 through the communication interface 230 rather than from a computer-readable recording medium. For example, the software components may be loaded into the memory 210 of the computer device 200 based on a computer program installed from files received over the network 170.

The processor 220 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processor 220 by the memory 210 or the communication interface 230. For example, the processor 220 may be configured to execute received instructions according to program code stored in a recording device such as the memory 210.

The communication interface 230 may provide a function that allows the computer device 200 to communicate with other devices (for example, the storage devices described above) over the network 170. For example, requests, instructions, data, and files generated by the processor 220 of the computer device 200 according to program code stored in a recording device such as the memory 210 may be transmitted to other devices over the network 170 under the control of the communication interface 230. Conversely, signals, instructions, data, and files from other devices may be received by the computer device 200 over the network 170 through the communication interface 230. Signals, instructions, and data received through the communication interface 230 may be passed to the processor 220 or the memory 210, and files may be stored on a storage medium (the permanent storage device described above) that the computer device 200 may further include.

The input/output interface 240 may be a means of interfacing with an input/output device 250. For example, input devices may include a microphone, a keyboard, or a mouse, and output devices may include a display or a speaker. As another example, the input/output interface 240 may be a means of interfacing with a device that integrates input and output functions, such as a touchscreen. The input/output device 250 may also be built into the computer device 200 as a single device.

In other embodiments, the computer device 200 may include fewer or more components than those shown in FIG. 2. However, most conventional components need not be explicitly illustrated. For example, the computer device 200 may be implemented to include at least some of the input/output devices 250 described above, or may further include other components such as a transceiver or a database.

Specific embodiments of a method and system for optimizing video encoding using an artificial intelligence model are described below.

One class of video compression algorithms raises the compression ratio by taking human visual characteristics into account and introducing losses into the original video that are difficult for a person to perceive. The compression ratio can be controlled by adjusting the degree of loss.

Commercial video compressors (codecs) provide a bitrate control function that uses lossy compression to keep the bitrate (the amount of video data per second) at a level that does not cause streaming problems, without significantly degrading video quality.

Widely used video codecs provide bitrate control that either holds a fixed target bitrate or maintains a constant picture quality. Here, maintaining constant quality means that the quality of the encoded output is uniform; the actual level of that quality cannot be known before encoding.

However, because complexity and characteristics differ from video to video, compressing every video with the same target bitrate or quality setting encodes some videos at a quality beyond what a viewer can perceive, producing output that is larger than necessary.

Moreover, because a video often consists of several scenes that differ in complexity and characteristics, the same problem can arise from one section of a video to another.

The present embodiments aim to optimize the compression ratio of a video by using AI to determine an appropriate bitrate or quality option according to the characteristics of each section of the video.

FIG. 3 is a diagram illustrating an example of the components of a distributed encoding system according to an embodiment of the present invention.

Referring to FIG. 3, the distributed encoding system 300 is an optimizing encoder component for use with the video platform 330 and may include a distributed encoder 310 and an AI serving module 320.

Each of the distributed encoder 310 and the AI serving module 320 may be implemented by the computer device 200 described with reference to FIG. 2.

In a video service, a video to be served may be uploaded to the video platform 330 and encoded through the distributed encoding system 300. The video platform 330 decides whether to optimize the encoding options; when optimization is applied, parameters such as model information may be added to the encoding request.

The distributed encoder 310 may include a distributor 311 and workers 312.

The distributor 311 divides the original video into multiple segments and assigns them to a plurality of workers 312. As encoding proceeds on the workers 312, the distributor 311 receives the compressed bitstreams, which are either stored temporarily in local storage or kept loaded in memory. The distributor 311 may then merge the workers' encoding results for the individual segments to generate the final video file.
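The split, assign, and merge flow of the distributor can be sketched as below. This is a simplified illustration under stated assumptions: segments are fixed-length time ranges, the workers are threads in one process rather than separate machines, and `encode_segment` is a stand-in that returns placeholder bytes instead of running a real transcoder. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def split_into_segments(duration: float, segment_length: float) -> List[Segment]:
    """Divide a video of the given duration into consecutive time ranges."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    segments: List[Segment] = []
    start = 0.0
    while start < duration:
        end = min(start + segment_length, duration)
        segments.append((start, end))
        start = end
    return segments


def encode_segment(segment: Segment) -> bytes:
    """Stand-in for a worker; a real worker would run a transcoder here."""
    start, end = segment
    return f"[{start:.1f}-{end:.1f}]".encode()


def distribute_and_merge(duration: float, segment_length: float, worker_count: int = 4) -> bytes:
    """Assign segments to workers and merge their outputs in segment order."""
    segments = split_into_segments(duration, segment_length)
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # map() returns results in input order, so the merged stream keeps
        # the original segment order even if workers finish out of order.
        encoded = list(pool.map(encode_segment, segments))
    return b"".join(encoded)


print(distribute_and_merge(25.0, 10.0))  # prints b'[0.0-10.0][10.0-20.0][20.0-25.0]'
```

Preserving segment order at merge time is the essential property; a real system would also have to cut segments at points where the streams can be joined cleanly, which this sketch does not model.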

Each worker 312 is a unit transcoder that performs the encoding work, encoding the video segment assigned to it by the distributor 311. The worker 312 may operate in direct cooperation with the AI serving module 320 to optimize the encoding options. If encoding option optimization is enabled, the worker 312 prepares the subset of frame images needed for encoding option prediction and passes them, together with the model information, to the AI serving module 320 to request the optimal encoding options. The worker 312 encodes the segment using the prediction returned by the AI serving module 320 and then checks the quality and bitrate of the result. If the resulting quality or bitrate falls outside the acceptable range, the worker 312 compensates for the prediction error by interpolation based on experimentally obtained values and encodes the segment again. While encoding proceeds, the worker 312 transmits the compressed bitstream to the distributor 311 and stores the model's prediction and the encoding result data on the video platform 330.
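The encode, check, and re-encode loop of the worker can be sketched as follows. The text does not specify the correction rule, so this sketch assumes a single experimentally derived constant, `vmaf_per_crf` (the VMAF change per CRF step), and checks only quality; a bitrate check would follow the same pattern. The `encode` callable, `fake_encode`, and every numeric value are hypothetical.

```python
from typing import Callable, Tuple


def encode_with_correction(
    encode: Callable[[int], float],
    predicted_crf: int,
    target_vmaf: float,
    tolerance: float = 1.0,
    vmaf_per_crf: float = 1.5,
    max_attempts: int = 3,
) -> Tuple[int, float]:
    """Encode with the predicted CRF and re-encode if quality misses the target.

    encode:       callable that encodes at a CRF and returns the measured VMAF.
    vmaf_per_crf: assumed empirical VMAF change per CRF step, used to
                  estimate how far the CRF must move to correct the error.
    Returns the CRF finally used and the VMAF it produced.
    """
    crf = predicted_crf
    vmaf = encode(crf)
    for _ in range(max_attempts - 1):
        error = vmaf - target_vmaf
        if abs(error) <= tolerance:
            break
        # Quality above target: raise CRF (compress more); below: lower it.
        step = round(error / vmaf_per_crf)
        if step == 0:
            step = 1 if error > 0 else -1
        crf = min(51, max(0, crf + step))
        vmaf = encode(crf)
    return crf, vmaf


def fake_encode(crf: int) -> float:
    """Hypothetical encoder model: VMAF falls 1.5 points per CRF step."""
    return 100.0 - 1.5 * (crf - 20)


print(encode_with_correction(fake_encode, predicted_crf=30, target_vmaf=93.0))
# prints (25, 92.5)
```

Capping the number of attempts bounds the extra encoding cost when a prediction is far off, at the price of occasionally accepting a result that is still outside the tolerance.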

The AI serving module 320 operates in cooperation with the distributed encoder 310 and includes an encoding option prediction model 321. When the AI serving module 320 receives from a worker 312 a request for encoding options for a video segment, it predicts the optimal encoding options using the encoding option prediction model 321 specified in the request and returns the prediction to the worker 312.

The video platform 330 may collect logs on the prediction accuracy of the encoding option prediction model 321 using a cloud search stack such as ELK (Elasticsearch, Logstash, Kibana). In other words, the video platform 330 logs the encoding results of the workers 312 to monitor the accuracy of the encoding option prediction model 321 and, based on the monitoring results and values entered by a user 340 (for example, an administrator), may support further training of the model on new videos and improvement of its prediction accuracy.

In this embodiment, the encoding option prediction model 321 is an AI-based prediction model that determines the characteristics of each segment of a video and predicts the optimal encoding options for it.

An encoding option is a parameter that controls the encoding rate, that is, the video compression ratio. For example, the constant rate factor (CRF) may be used as the encoding option to be optimized.

CRF는 CQP(constant quantization parameter)와 대비하여 시각적으로 보다 균일한 화질을 보장하며, 사람의 지각적인 인지 특성을 반영할 수 있다.Compared to CQP (constant quantization parameter), CRF guarantees a visually more uniform picture quality and can reflect human perceptual characteristics.

인코딩 옵션 예측 모델(321)의 기본 컨셉은 도 4와 같다.The basic concept of the encoding option prediction model 321 is as shown in FIG. 4 .

Referring to FIG. 4, the encoding option prediction model 321 may receive images 401 of the frames constituting a video and extract features of the video through a deep learning model. Here, the deep learning model may include a convolutional neural network (CNN) model and a recurrent neural network (RNN) model. The CNN model extracts features from each frame image, and the RNN model learns the relationships within the sequence of data (the frame image features).

The encoding option prediction model 321 is trained by supervised learning to classify each feature extracted through the deep learning model into the optimal CRF class.

When a frame image of a new video is input to the encoding option prediction model 321 trained in this way, the optimal CRF category for the characteristics of that image can be predicted.

FIG. 5 illustrates a video encoding optimization method according to an embodiment of the present invention.

Referring to FIG. 5, the distributed encoding system 300 performs a training process (S510) and an inference process (S520) for the encoding option prediction model 321.

The encoding option prediction model 321 includes a CNN model, an RNN model, and a classifier.

First, the distributed encoding system 300 performs pre-processing. Pre-processing includes extracting frame images from a video segment. The distributed encoding system 300 may optimize the frame image size, the number of frame images, and the like to prevent memory problems in the training process (S510) and to increase accuracy.

The data set for training the encoding option prediction model 321 may be a video segment data set produced from videos uploaded to the video platform.

As one example, CRF is used as the encoding optimization option, and the range of CRF values to be learned may be determined in consideration of the distribution of the data set and the range that is valid in actual encoding.

The frame image size may be increased from an initial 224×224 to 336×336, 448×448, 560×560, or larger. At 336×336 or smaller, the original video may be resized after an 80% center crop. The image is resized so that as much of the original information as possible is retained, although the aspect ratio of the original may change.
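The center-crop step can be illustrated with a small sketch. This is only an illustration under stated assumptions: the function name `center_crop_box` is hypothetical, and "80%" is assumed to mean 80% of each dimension, which the description does not specify.

```python
def center_crop_box(width, height, ratio=0.8):
    """Return (left, top, right, bottom) of a centered crop.

    Assumption: `ratio` is applied to each side independently.
    """
    crop_w = int(round(width * ratio))
    crop_h = int(round(height * ratio))
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


# Crop box for a 1920x1080 frame; the cropped region would then be
# resized to the model input size (for example 336x336).
box = center_crop_box(1920, 1080)
print(box)  # (192, 108, 1728, 972)
```

Resizing the 1536×864 crop to a square input changes the aspect ratio from 16:9 to 1:1, which matches the statement that the original aspect ratio may change.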

When a frame image of the original video (1920×1080) is resized to one of the above sizes, some information is inevitably lost in the resizing. To reduce this loss, making the image size as large as possible helps improve accuracy.

As for the number of frame images, 15 images taken at 250 msec intervals from a 4-second segment video may be used as the model input to improve accuracy. As another example, the distributed encoding system 300 may select 10 images at 500 msec intervals from a 5-second segment video and then resize the selected images to 336×336, a size the CNN can process. In other words, 10 × (336×336×3) values of data may be provided as the input to the encoding option prediction model 321 for one segment video.

This is merely an example, and the selection of images to be used as the model input may be changed freely. For example, it is also possible to select 20 images at 200 msec intervals from a 4-second segment video.
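The frame selection and the resulting input size can be sketched as follows. The helper names are hypothetical, and the sketch assumes that sampling starts at 0 msec of the segment, which the description does not state.

```python
def sample_timestamps_ms(interval_ms, count, start_ms=0):
    """Timestamps (in msec) of the frames picked from one segment.

    Assumption: the first frame is taken at `start_ms`.
    """
    return [start_ms + i * interval_ms for i in range(count)]


def input_shape(count, size=336, channels=3):
    """Shape of the model input for one segment: frames x H x W x C."""
    return (count, size, size, channels)


stamps = sample_timestamps_ms(interval_ms=200, count=20)
print(stamps[0], stamps[-1])  # 0 3800, all within a 4-second segment
print(input_shape(10))        # (10, 336, 336, 3)
```

With 10 frames of 336×336×3, one segment contributes 3,386,880 input values, which is why the frame count and image size are tuned against memory limits during training.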

By repeatedly encoding each video, the CRF option that satisfies an appropriate picture quality under the VMAF (Video Multi-method Assessment Fusion) criterion is searched for, and the 4-second segment videos and ground truth (GT) data set collected in this way may be used in the training process (S510) of the encoding option prediction model 321.

Next, the distributed encoding system 300 performs video feature extraction. The CNN may extract a frame feature for each frame image given as input to the encoding option prediction model 321, and the frame features extracted by the CNN are input, in image order, to the RNN (LSTM), which extracts a video feature based on the relationships within the feature sequence.

Finally, the distributed encoding system 300 performs classification. As one example, a softmax classifier may be used to classify the video feature extracted by the RNN.
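The final classification step can be sketched in plain Python. This is an illustration only: the logits stand in for the output of a linear layer on top of the RNN video feature, and the names `softmax`, `predict_crf`, and the CRF class list are assumptions rather than part of the description.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_crf(logits, crf_classes):
    """Return the CRF class with the highest probability and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return crf_classes[best], probs[best]


# Hypothetical scores for three CRF classes.
crf, prob = predict_crf([0.1, 2.0, 0.3], crf_classes=[22, 23, 24])
print(crf)  # 23
```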

Accordingly, the distributed encoding system 300 may fine-tune the model down to the CNN model and train the entire model, including the CNN model, end-to-end (E2E) against a single loss function.

For image classification performance, the weights of the CNN trained in the training process (S510) may be optimized to distinguish video characteristics well. By learning, with a supervised learning method, the relationship between the characteristics of a video (that is, its video features) and the encoding option (CRF), a model can be built that classifies a video into the optimal option for its characteristics.

When a frame image of a new video is input to the encoding option prediction model 321 trained through the above training process (S510), the optimal CRF category for the characteristics of that image can be predicted as the inference process (S520).

In particular, in the present embodiments, the encoding option prediction model 321 may be built as a single model that predicts the parameters for achieving the target picture quality for videos of multiple resolutions.

A streaming service provides one video at multiple resolutions, and each resolution must be encoded to the target picture quality. The encoding parameters for encoding to the target picture quality can be derived through AI-based video analysis, and the derived result may differ for each resolution. In such a case, it is common to operate a separate model for each resolution.

For example, when encoding 1080p and 720p videos, a model for 1080p and a model for 720p are built separately, and each video is inferred with its corresponding model. As the number of models grows in a service environment that supports multiple resolutions, developing, managing, and maintaining the models becomes difficult.

VMAF is commonly used as the picture quality metric for streaming services. VMAF is a video quality metric that takes into account both compression artifacts and scaling artifacts.

FIG. 6 shows the result of the common VMAF measurement method, in which every encoded resolution is resized to 1080p and compared with the original 1080p.

By obtaining the convex hull of the overlapping VMAF curves and selecting a point on the convex hull, the optimal resolution for a given bitrate can be selected. However, because the VMAF range differs for each resolution, the same VMAF criterion cannot be applied to every resolution.

FIG. 7 shows the result of comparing the encoding result at each resolution with the pre-encoding original video at the same resolution, so that only compression artifacts are considered and scaling artifacts are excluded.

In this case a convex hull cannot be formed, but the same VMAF criterion can be applied to all resolutions. In other words, when obtaining the labels used to train the encoding option prediction model 321, the same picture quality score (for example, VMAF 93) can be used as the GT value for all resolutions.

In this way, if the value of a picture quality encoding parameter (a fixed-quality encoding parameter such as CRF or QP (quantization parameter)) that satisfies the same criterion (VMAF score) at every resolution is used as the label, CRF values can be grouped into the same category.

In other words, if the CRF values that satisfy the same VMAF criterion are obtained as labels for each resolution, videos with the same label can be grouped as data of the same label even if their resolutions differ. For example, because 1080p CRF 23 and 720p CRF 23 can be grouped into the same category, the data set for the CRF 23 category can be composed of a mixture of 1080p and 720p videos when the training data set is built.

If each label category is composed of a data set that mixes multiple resolutions and the model is trained on it, a single model can be used for encoding at multiple resolutions. That is, the CRF for both 1080p and 720p encoding can be predicted with one model.
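Building label categories that mix resolutions can be sketched as a simple grouping by CRF label. The sample tuples and the function name `group_by_crf_label` are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def group_by_crf_label(samples):
    """Group (segment_id, resolution, crf_label) samples by CRF label.

    Resolution is kept as metadata only; it does not split the category.
    """
    groups = defaultdict(list)
    for segment_id, resolution, crf_label in samples:
        groups[crf_label].append((segment_id, resolution))
    return dict(groups)


# Hypothetical labeled segments at two resolutions.
samples = [
    ("seg-a", "1080p", 23),
    ("seg-b", "720p", 23),
    ("seg-c", "1080p", 24),
]
dataset = group_by_crf_label(samples)
print(dataset[23])  # [('seg-a', '1080p'), ('seg-b', '720p')]
```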

Label data is generated as follows.

The optimal CRF value of each segment video is used as the label.

Referring to FIG. 8, the distributed encoding system 300 repeatedly encodes a segment video 801 with different CRF values (S81), measures the picture quality of each encoding result (S82), and determines whether a given level of picture quality is satisfied (S83).

Through the above steps (S81 to S83), the distributed encoding system 300 may find the CRF value that satisfies the given level of picture quality and obtain the label 802.

VMAF, which best reflects human perception, is used as the picture quality criterion; for example, VMAF 93, which is suitable for streaming services, may be used as the reference. Each segment video 801 is encoded with varying CRF until its VMAF value reaches 93, and the CRF value at which the VMAF value reaches 93 is selected as the label 802.
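The label search can be sketched as a loop over candidate CRF values. This is a minimal sketch under stated assumptions: `measure_vmaf` is a hypothetical callback that encodes the segment at a given CRF and returns its VMAF score, candidates are tried in increasing order, and VMAF is assumed to fall as CRF rises. The toy quality curve in the example is made up.

```python
def find_crf_label(measure_vmaf, crf_values, target_vmaf=93.0):
    """Return the candidate CRF whose VMAF is closest to the target.

    measure_vmaf(crf) is a hypothetical callback: encode, then score.
    """
    best_crf, best_gap = None, float("inf")
    for crf in crf_values:  # assumed to be in increasing order
        vmaf = measure_vmaf(crf)
        gap = abs(vmaf - target_vmaf)
        if gap < best_gap:
            best_crf, best_gap = crf, gap
        if vmaf < target_vmaf:
            # Quality is assumed to only drop further as CRF grows.
            break
    return best_crf


# Toy stand-in for "encode and measure": VMAF falls linearly with CRF.
toy_vmaf = lambda crf: 104.5 - 0.5 * crf
print(find_crf_label(toy_vmaf, range(18, 31)))  # 23
```

Stopping as soon as the score drops below the target avoids encoding the remaining candidates, which matters because each measurement is a full encode.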

For example, the label generation process for 1080p and 720p encoding is as follows.

For the 1080p encoding label, the label CRF is obtained with the original 1080p video as the base and VMAF 93 as the criterion.

For the 720p encoding label, the original 1080p video is first resized to 720p, and the label CRF is obtained by repeatedly encoding the resized 720p video with different CRF values. The picture quality criterion is VMAF 93, the same as for 1080p. Here, the reference video for the VMAF comparison is the video obtained by resizing 1080p to 720p. That is, VMAF is measured considering only compression artifacts, not resizing artifacts, and the matching CRF is obtained. A CRF that satisfies a specific picture quality (VMAF) considering only compression artifacts is obtained for each resolution, and because a CRF category can then span multiple resolutions, the data set for each category can be generated by combining 1080p and 720p videos.

Accordingly, a single model that supports both 1080p and 720p can be built from a data set in which 1080p and 720p are mixed, and the optimal CRF can be obtained from it as the inference result for both 1080p segment videos and 720p segment videos.

In generating the label data, if encoding proceeds while changing the CRF in steps of 1, for example, the CRF closest to VMAF 93 is selected, so the accuracy of the label is reduced.

To address this, the CRF step may be tested and the label classes defined with a CRF step smaller than 1, for example 0.5, which improves the accuracy of the labels.
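The mapping between fractional CRF values and label classes can be sketched as follows. The lower bound `crf_min=18.0` and the function names are assumptions for illustration; the description only gives the 0.5 step as an example.

```python
def crf_to_class(crf, crf_min=18.0, step=0.5):
    """Map a CRF value on the step grid to its class index."""
    return int(round((crf - crf_min) / step))


def class_to_crf(index, crf_min=18.0, step=0.5):
    """Map a class index back to the CRF value it represents."""
    return crf_min + index * step


print(crf_to_class(23.5))  # 11
print(class_to_crf(11))    # 23.5
```

Halving the step doubles the number of classes over the same CRF range, and the label can then sit at most 0.25 CRF away from the value that would hit the target score exactly, instead of 0.5.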

Accordingly, using the encoding option prediction model 321 built for CRF prediction makes it possible to optimize encoding while minimizing the iterative process.

Furthermore, the present embodiments can predict parameters that efficiently achieve the target picture quality under service constraints.

When compressing video for a streaming service, a bitrate constraint is needed to ensure uninterrupted viewing while minimizing degradation of picture quality. The resulting video must therefore be produced within the bitrate constraint, and both the constraint itself and its importance differ from service to service. It is also important to produce the highest possible picture quality at a given bitrate.

FIG. 9 illustrates an example of a process of deriving the picture quality parameters to be applied to actual encoding in consideration of service constraints, according to an embodiment of the present invention.

Referring to FIG. 9, the distributed encoding system 300 may extract features of an input video 901 (S900) and then predict encoding options optimized for the extracted features (S901 to S904).

In the process of predicting the encoding options, the steps other than step S903 (that is, S901, S902, and S904) use the AI-based encoding option prediction model 321.

Specifically, the distributed encoding system 300 may predict, as an encoding option corresponding to the features of the input video 901, a first CRF (CRF1), which is a picture quality encoding parameter that satisfies the target picture quality (for example, VMAF 93) (S901). The method of predicting the first CRF is the same as described above, so a detailed description is omitted.

The distributed encoding system 300 may predict, as an encoding option corresponding to the features of the input video 901, a second CRF (CRF2), which is a picture quality encoding parameter that satisfies the target bitrate (S902). CRF, which yields even picture quality across sections, is generally used as the picture quality control parameter, but the bitrate of the resulting video is not known in advance when encoding with CRF. Finding a CRF that satisfies the target bitrate requires repeatedly encoding while adjusting the CRF value and checking the resulting bitrate, which incurs a very high encoding cost. To satisfy both the target picture quality and the service bitrate limit at a low encoding cost, it is necessary to predict the second CRF that satisfies the target bitrate. Predicting the second CRF together with the first CRF, which satisfies the target picture quality, reduces cost and improves service quality.

To predict the second CRF, the CRF value that satisfies a specific bitrate for each segment video is used as the label. Each segment video may be encoded with different CRF values to find the CRF value that satisfies the target bitrate, and that CRF value may be selected as the label. The target bitrate may differ for each application. Label data may be generated such that, for example, category 1 corresponds to a second CRF of 23, category 2 to a second CRF of 23.5, and category 3 to a second CRF of 24.
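The search for the second-CRF label can be sketched as follows. The sketch assumes that "satisfies the target bitrate" means the lowest CRF (that is, the highest quality) whose measured bitrate does not exceed the target; the description does not spell this out. `measure_bitrate_kbps` is a hypothetical callback that encodes the segment and reports the resulting bitrate, and the toy bitrate curve is made up.

```python
def find_bitrate_crf_label(measure_bitrate_kbps, crf_values, target_kbps):
    """Return the lowest CRF whose bitrate is within the target, or None.

    measure_bitrate_kbps(crf) is a hypothetical callback: encode, then
    report the resulting bitrate in kbps.
    """
    for crf in sorted(crf_values):
        if measure_bitrate_kbps(crf) <= target_kbps:
            return crf
    return None  # no candidate met the bitrate target


# Toy stand-in for "encode and measure": bitrate falls linearly with CRF.
toy_bitrate = lambda crf: 8000 - 250 * crf
print(find_bitrate_crf_label(toy_bitrate, range(18, 31), target_kbps=2000))  # 24
```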

Depending on the embodiment, it is also possible to generate label data by combining the first CRF and the second CRF. For example, label data may be generated such that category 1 corresponds to a first CRF of 23 and a second CRF of 24, category 2 to a first CRF of 23 and a second CRF of 25, and category 3 to a first CRF of 23 and a second CRF of 26. When label data combining the first CRF and the second CRF is used, steps S902 and S903 in FIG. 9 are omitted, and the CRF predicted in step S901 may be used as the final CRF to be applied to encoding.

The encoding option prediction model 321 is trained by supervised learning to receive the images of the frames constituting the input video 901, extract features of the video through deep learning (CNN and RNN), and use the extracted features to classify the video into the class corresponding to each CRF. Here, the ground-truth label of the CRF class is the second CRF generated as label data above. When the label data is generated by combining the first CRF and the second CRF, the ground-truth label of the CRF class consists of the combination of the first CRF and the second CRF. When the frames of a video to be analyzed are input to the encoding option prediction model 321 trained with such a label data set, the model can output the class corresponding to the CRF that satisfies the bitrate.

The distributed encoding system 300 may determine a third CRF (CRF3), the final CRF to be applied to actual encoding, in consideration of the first CRF, which satisfies the target picture quality, and the second CRF, which satisfies the target bitrate (S903). As one example, when the first CRF is greater than or equal to the second CRF, the distributed encoding system 300 may determine the first CRF as the third CRF. When the first CRF is less than the second CRF, the distributed encoding system 300 may determine the third CRF in consideration of a bitrate-limit compliance weight, which is a user input value related to model training. In an application where the bitrate constraint must be strict, the second CRF may be determined as the third CRF when the first CRF is less than the second CRF. In an application where the bitrate constraint is relatively less strict, a value closer to the first CRF may be selected for a gain in picture quality. In this case, the distributed encoding system 300 may determine the third CRF by correcting the CRF value using the bitrate-limit compliance weight w, which is a user input value (Equation 1).

[Equation 1]

CRF3 = CRF1 × (1 - w) + CRF2 × w

The bitrate-limit compliance weight may be entered as a value from 0 to 1.0. When the weight is 0, the first CRF is determined as the third CRF without regard to the bitrate limit; when the weight is 1, the second CRF is determined as the third CRF so that the bitrate limit is strictly observed.
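The decision rule of step S903 together with Equation 1 can be sketched as a small function. The function name `decide_crf3` and the range check are illustrative additions; the branching and the weighted blend follow the description.

```python
def decide_crf3(crf1, crf2, w):
    """Final CRF from the quality CRF (crf1) and the bitrate CRF (crf2).

    w is the bitrate-limit compliance weight in [0, 1].
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be between 0 and 1.0")
    if crf1 >= crf2:
        # The quality target already keeps the bitrate within the limit.
        return crf1
    # Equation 1: blend toward crf2 as compliance becomes stricter.
    return crf1 * (1 - w) + crf2 * w


print(decide_crf3(25, 23, 0.5))  # 25
print(decide_crf3(23, 25, 0.5))  # 24.0
print(decide_crf3(23, 25, 1.0))  # 25.0
```

The result may be fractional, so an encoder or label grid that accepts only certain CRF steps would need the value rounded to that grid.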

The distributed encoding system 300 may predict QP limit parameters as an additional encoding option corresponding to the features of the input video (S904).

CRF, one of the picture quality encoding parameters, determines the frame QP. Encoders include various block-level QP decision algorithms for improving perceived picture quality, such as adaptive quantization (AQ) and macroblock tree (MB-tree), and CRF is commonly used in combination with these encoding options. A QP decision algorithm determines each block QP by applying a per-block offset to the frame QP determined from the CRF, and in doing so some blocks may be assigned an excessively low or high QP. Encoders provide options such as minimum QP (minqp) and maximum QP (maxqp) for limiting QP; using them to keep the block QP values within an appropriate range makes it possible to achieve additional bitrate savings while maintaining perceived picture quality.

The distributed encoding system 300 may derive the minimum QP value and maximum QP value that yield the best picture quality for the bitrate when applied together with the third CRF. The minimum QP prevents excessively high picture quality and the maximum QP prevents excessively low picture quality; applied at the block level, they achieve visually more even picture quality. As one example, the distributed encoding system 300 may derive the minimum and maximum QP values by applying an offset to the third CRF (minqp = CRF3 - offset, maxqp = CRF3 + offset). For example, when the CRF offset is 2, the value obtained by subtracting 2 from CRF3 may be set as the minimum QP value, and the value obtained by adding 2 to CRF3 may be set as the maximum QP value.

The distributed encoding system 300 may apply the minimum QP value and the maximum QP value as encoding options together with the third CRF. Depending on the embodiment, it is also possible to omit the process of predicting the QP limit parameters (S904) to reduce inference time and apply only the third CRF to actual encoding.

To predict the QP limit parameters, the CRF offset that yields the optimal cost in combination with the encoding CRF (CRF3) determined for each segment is used as the label. Each segment video is encoded with different CRF values and, for each CRF value, with different minimum QP and maximum QP (or offset) settings. After the bitrate and picture quality (VMAF) of each encoding result are obtained, a cost is computed from them, and the offset value of the encoding with the lowest cost may be selected as the label.

The cost may be defined as in Equation 2.

[Equation 2]

Cost = rate + λ × d

Here, the value of λ (lambda) is determined as a user input value at model training time, depending on whether bitrate reduction or picture quality preservation is to be prioritized.
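Selecting the offset label by minimum cost can be sketched as follows. The description does not define d, so the sketch assumes d = 100 - VMAF as a distortion measure; that choice, the function names, and the trial numbers are all illustrative assumptions.

```python
def cost(rate, d, lam):
    """Equation 2: Cost = rate + lambda * d."""
    return rate + lam * d


def pick_offset_label(trials, lam):
    """Return the offset of the trial with the lowest cost.

    trials: list of (offset, rate_kbps, vmaf).
    Assumption: distortion d is taken as 100 - VMAF.
    """
    best = min(trials, key=lambda t: cost(t[1], 100.0 - t[2], lam))
    return best[0]


# Hypothetical encoding results for one segment at a fixed CRF3.
trials = [(2, 2000, 93.0), (4, 1900, 92.5), (6, 1850, 91.0)]
print(pick_offset_label(trials, lam=100))  # 4
print(pick_offset_label(trials, lam=500))  # 2
```

A larger λ weights distortion more heavily and therefore favors the trial that preserves quality, while a smaller λ favors bitrate savings.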

The features of the video and the third CRF determined in the previous step are the input values. Label data may be generated such that, for example, category 1 corresponds to a CRF offset of 2, category 2 to a CRF offset of 4, and category 3 to a CRF offset of 6. The parameters finally applied to encoding are the third CRF, the minimum QP (CRF3 - offset), and the maximum QP (CRF3 + offset).

In general, the lower the QP, the smaller the gain in picture quality relative to bitrate, so the offset may be applied asymmetrically. In other words, the minimum QP may be determined as (CRF3 - offset + 1) and the maximum QP as (CRF3 + offset). For example, when the third CRF value is 26 and the offset is 4, the minimum QP value may be 23 and the maximum QP value may be 30.
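The derivation of the QP limits from the third CRF and the offset can be sketched as follows. The function name `qp_limits` and the `asymmetric` flag are illustrative; the arithmetic follows the symmetric and asymmetric forms given in the description.

```python
def qp_limits(crf3, offset, asymmetric=False):
    """Return (min_qp, max_qp) derived from CRF3 and the offset.

    With asymmetric=True the lower bound is raised by 1, giving
    min_qp = CRF3 - offset + 1 while max_qp stays CRF3 + offset.
    """
    min_qp = crf3 - offset + (1 if asymmetric else 0)
    max_qp = crf3 + offset
    return min_qp, max_qp


print(qp_limits(26, 2))                   # (24, 28)
print(qp_limits(26, 4, asymmetric=True))  # (23, 30)
```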

The encoding option prediction model 321 is trained by supervised learning to receive the images of the frames constituting the input video, extract features of the video through deep learning (CNN and RNN), and classify the features and the encoding CRF into the optimal cost class. Here, the ground-truth label of the CRF class consists of the CRF offset generated as label data above. When the frame images of a video to be analyzed and the third CRF value are input to the encoding option prediction model 321 trained with such a label data set, the CRF offset value can be obtained as the QP limit parameter corresponding to the resulting category.

Accordingly, in the present embodiment, when the AI-based encoding option prediction model 321 is used to predict the optimal encoding options to apply to encoding, it is possible not only to predict the picture quality encoding parameter that satisfies the target picture quality but also to additionally predict, under service constraints, the picture quality encoding parameter that satisfies the target bitrate and/or the QP limit parameters that satisfy a picture quality balance condition.

As described above, according to embodiments of the present invention, video compression efficiency can be improved by using AI technology to find the optimal encoding parameters for each video segment and applying them to encoding. According to embodiments of the present invention, securing video compression efficiency through encoding options optimized by an AI-based learning model can guarantee improvements in user experience and bitrate, secure storage space for the video service, and reduce users' network costs.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable through various computer means and recorded on a computer-readable medium. The medium may store a computer-executable program persistently, or store it temporarily for execution or download. The medium may also be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples of the medium include recording or storage media managed by app stores that distribute applications, and by sites and servers that supply or distribute various other software.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

An encoding optimization method executed on a computer device,
wherein the computer device includes at least one processor configured to execute computer-readable instructions contained in a memory,
the encoding optimization method comprising:
predicting, by the at least one processor, an encoding option corresponding to a feature of an input video using an artificial intelligence model; and
encoding, by the at least one processor, the input video according to the encoding option,
wherein predicting the encoding option comprises:
predicting, through the artificial intelligence model, a first constant rate factor (CRF) satisfying a target video multi-method assessment fusion (VMAF) score and a second CRF satisfying a target bitrate as the encoding option;
if the first CRF is greater than or equal to the second CRF, determining the first CRF as a third CRF to be actually applied to encoding of the input video; and
if the first CRF is smaller than the second CRF, determining the third CRF using the first CRF, the second CRF, and a bitrate limit compliance weight, which is a user input value related to model training.

The encoding optimization method of claim 1,
wherein predicting the encoding option comprises
predicting, for each section divided from the input video, an encoding option corresponding to a feature of that section.

The encoding optimization method of claim 1,
wherein the artificial intelligence model is an encoding option prediction model including a convolutional neural network (CNN) model for extracting frame features from each frame image, a recurrent neural network (RNN) model for extracting video features based on relationships between the frame images, and a classifier for classifying encoding options corresponding to the video features, and
wherein the CNN model, the RNN model, and the classifier are trained in an end-to-end (E2E) manner on a single loss function.

The encoding optimization method of claim 1,
wherein the artificial intelligence model is composed of a single model using a data set in which encoding options satisfying the same standard for serviceable resolutions are bundled under labels of the same category.

delete

The encoding optimization method of claim 1,
wherein predicting the encoding option further comprises
predicting, as the encoding option, a QP limiting parameter for limiting the quantization parameter (QP) values of blocks of a frame image to a target level based on the third CRF.

The encoding optimization method of claim 9,
wherein predicting the QP limiting parameter comprises
calculating a minimum QP value and a maximum QP value by applying an offset corresponding to the QP limiting parameter to the third CRF.

A computer program stored in a computer-readable recording medium to cause the computer device to execute the encoding optimization method of any one of claims 1 to 4, 9, and 10.

A computer device comprising:
at least one processor configured to execute computer-readable instructions contained in a memory,
wherein the at least one processor:
predicts an encoding option corresponding to a feature of an input video using an artificial intelligence model; and
encodes the input video according to the encoding option,
and wherein the at least one processor:
predicts, through the artificial intelligence model, a first CRF satisfying a target VMAF score and a second CRF satisfying a target bitrate as the encoding option;
if the first CRF is greater than or equal to the second CRF, determines the first CRF as a third CRF to be actually applied to encoding of the input video; and
if the first CRF is smaller than the second CRF, determines the third CRF using the first CRF, the second CRF, and a bitrate limit compliance weight, which is a user input value related to model training.

The computer device of claim 12,
wherein the at least one processor
predicts, for each section divided from the input video, an encoding option corresponding to a feature of that section.

The computer device of claim 12,
wherein the artificial intelligence model is an encoding option prediction model including a CNN model for extracting frame features from each frame image, an RNN model for extracting video features based on relationships between the frame images, and a classifier for classifying encoding options corresponding to the video features, and
wherein the CNN model, the RNN model, and the classifier are trained in an E2E manner on a single loss function.

The computer device of claim 12,
wherein the artificial intelligence model is composed of a single model using a data set in which encoding options satisfying the same standard for serviceable resolutions are bundled under labels of the same category.

delete

The computer device of claim 12,
wherein the at least one processor
predicts, as the encoding option, a QP limiting parameter for limiting the QP values of blocks of a frame image to a target level based on the third CRF.

The computer device of claim 19,
wherein the at least one processor
calculates a minimum QP value and a maximum QP value by applying an offset corresponding to the QP limiting parameter to the third CRF.