KR20210109719A

KR20210109719A - Method and Apparatus for Video Colorization

Info

Publication number: KR20210109719A
Application number: KR1020200024503A
Authority: KR
Inventors: 나태영; 오지형; 김수예; 김문철
Original assignee: 에스케이텔레콤 주식회사; 한국과학기술원
Priority date: 2020-02-27
Filing date: 2020-02-27
Publication date: 2021-09-07
Also published as: KR102342526B1

Abstract

Disclosed are a video colorization method and device. An embodiment obtains multiple black-and-white images and inputs the multiple black-and-white images to a deep learning-based inference model trained in advance based on diverse losses. Provided are a video colorization method and device for automatically generating a colored video by an inference model including feature extraction, adaptive fusion transform (AFT), and feature enhancement. Therefore, it is possible to secure a diversity of colors.

Description

Method and Apparatus for Video Colorization

The present invention relates to a video colorization method and apparatus. More specifically, it relates to a video colorization method and apparatus that automatically colorize higher-resolution black-and-white video using a deep learning model.

The content described below merely provides background information related to the present invention and does not constitute prior art.

Since cameras were first introduced in the 19th century, a vast amount of old footage has been produced in black and white. Colorization of this material is in demand for various reasons, including its historical or artistic significance. However, manual colorization is highly labor-intensive. It also requires a high degree of expertise, because plausibly diverse colors that give a natural and visually appealing result must be inferred from black-and-white information.

With the development of deep learning-based algorithms, image colorization using such algorithms is being actively studied. To support the colorization task effectively, various methods use a reference image or user-guided information. Such additional clues help improve the quality of the colorized result. However, the criteria for selecting a high-quality reference image and the expertise needed to choose appropriate guidance vary widely, and this variability can seriously affect the colorization result. In particular, most reference-based methods must collect a large number of reference images to build a training dataset, which in turn requires non-trivial pre-processing.

Meanwhile, when the monochrome material to be colorized is a video, applying an image colorization method to each frame independently often produces flickering artifacts or a visible lack of temporal coherence. Because temporal coherence must be maintained between consecutive color frames, video colorization is a very difficult task. The existing Automatic Video Colorization (AVC) method (see Non-Patent Document 1) generates a diverse set of colorized videos from each pair of two consecutive gray frames, and thereby achieves higher temporal coherence than earlier methods that applied image colorization frame by frame.

However, in the process of improving temporal coherence, the existing AVC method mainly generates brown or blue tones for grayscale frames. In addition, the existing AVC method uses relatively lower-resolution videos (256p and 480p) as its training and validation datasets; such videos contain a limited number of objects compared with higher-resolution videos (720p high-definition (HD) or 2160p 4K ultra HD (UHD)).

Therefore, there is a need for an automated video colorization method that reduces the time and cost incurred by additional clues and pre-processing while, for high-resolution black-and-white video containing diverse colors and objects, maintaining temporal coherence, securing color diversity, and reducing the effect of artifacts generated during colorization.

Non-Patent Document 1: Chenyang Lei and Qifeng Chen. Fully automatic video colorization with self-regularization and diversity. In CVPR, pages 3753-3761, 2019.
Non-Patent Document 2: Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234-241. Springer, 2015.
Non-Patent Document 3: Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In CVPR, pages 2472-2481, 2018.
Non-Patent Document 4: Yanyun Qu, Yizi Chen, Jingying Huang, and Yuan Xie. Enhanced pix2pix dehazing network. In CVPR, pages 8160-8168, 2019.
Non-Patent Document 5: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
Non-Patent Document 6: Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 40(4):834-848, 2017.
Non-Patent Document 7: Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
Non-Patent Document 8: Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard gan. In ICLR, 2019.
Non-Patent Document 9: Dae Young Park and Kwang Hee Lee. Arbitrary style transfer with style-attentional networks. In CVPR, pages 5880-5888, 2019.
Non-Patent Document 10: William K Pratt. Digital image processing, 2001. Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
Non-Patent Document 11: Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019.

The main object of the present disclosure is to provide a video colorization apparatus and method that acquire multiple black-and-white images and input them to a deep learning-based inference model trained in advance on the basis of diverse losses, so that the inference model, which includes feature extraction, adaptive fusion transform (AFT), and feature enhancement functions, automatically generates a colorized video.

According to an embodiment of the present invention, there is provided a video colorization method used by a video colorization apparatus, the method comprising: extracting a segmentation map from an indicated frame, which is one of multiple black-and-white images, using a segmentation extraction model, and generating an adaptive local parameter (ALP) from the segmentation map using a pre-trained deep learning-based ALP extraction unit; extracting a global feature map from the indicated frame using a global feature extraction model, and generating an adaptive global parameter (AGP) from the global feature map using a pre-trained deep learning-based AGP extraction unit; and generating a colorized frame from the multiple black-and-white images using a pre-trained deep learning-based inference model based on an adaptive fusion transform (AFT) that uses the ALP and the AGP.

According to another embodiment of the present invention, there is provided a learning method for a video colorization apparatus, the method comprising: generating a colorized frame from multiple black-and-white images using a generator, which is a deep learning-based inference model based on an adaptive fusion transform (AFT) that uses an adaptive local parameter (ALP) and an adaptive global parameter (AGP); discriminating between the colorized frame and a ground truth (GT) frame using a deep learning-based first discriminator; discriminating temporal coherence between multiple color images that include the colorized frame and multiple GT color images using a deep learning-based second discriminator; and calculating a total loss using the outputs of the generator, the first discriminator, and the second discriminator.

According to another embodiment of the present invention, there is provided a computer program stored in a computer-readable recording medium to execute each step of the video colorization method.

According to another embodiment of the present invention, there is provided a computer program stored in a computer-readable recording medium to execute each step of the learning method for the video colorization apparatus.

As described above, this embodiment provides a video colorization apparatus and method in which multiple black-and-white images are acquired and a deep learning-based inference model including an adaptive fusion transform (AFT) function automatically generates a colorized video. This makes it possible to maintain temporal coherence and secure color diversity for higher-resolution black-and-white images.

In addition, this embodiment provides a video colorization apparatus and method in which multiple black-and-white images are acquired and a deep learning-based inference model trained in advance on the basis of diverse losses automatically generates a colorized video. This makes it possible to mitigate problems that arise during colorization, such as color bleeding, block artifacts, and boundary leakage.

FIG. 1 is an exemplary diagram of a video colorization apparatus according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram of an inference model that is a component of the video colorization apparatus according to an embodiment of the present invention.
FIG. 3 is a block diagram of a DB according to an embodiment of the present invention.
FIG. 4 is a block diagram of an EH according to an embodiment of the present invention.
FIG. 5 is a block diagram of an ALP extraction unit according to an embodiment of the present invention.
FIG. 6 is a block diagram of an AGP extraction unit according to an embodiment of the present invention.
FIG. 7 is an exemplary diagram of a learning model according to an embodiment of the present invention.
FIG. 8 is a block diagram of a discriminator according to an embodiment of the present invention.
FIG. 9 is a flowchart of a video colorization method according to an embodiment of the present invention.
FIG. 10 is a flowchart of a learning method for the learning model according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the exemplary drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the embodiments, detailed descriptions of related well-known configurations or functions are omitted when they could obscure the gist of the embodiments.

In describing the components of the embodiments, terms such as first, second, A, B, (a), and (b) may be used. These terms only distinguish one component from another and do not limit the nature, sequence, or order of the components. Throughout the specification, when a part is said to 'include' or 'comprise' a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise. In addition, terms such as '... unit' and 'module' used in the specification refer to a unit that processes at least one function or operation, which may be implemented as hardware, software, or a combination of hardware and software.

The detailed description set forth below together with the accompanying drawings is intended to describe exemplary embodiments of the present invention and is not intended to represent the only embodiments in which the present invention may be practiced.

This embodiment discloses a video colorization method and the structure and operation of a video colorization apparatus. More specifically, multiple black-and-white images are acquired and input to a deep learning-based inference model trained in advance on the basis of diverse losses. The embodiment provides a video colorization apparatus and method in which the inference model, which includes an adaptive fusion transform (AFT) function, automatically generates a colorized video.

Hereinafter, the terms black-and-white, gray, grayscale, and monochrome all have the same meaning and refer to black, white, and the intermediate shades between the two.

A feature, or feature map, refers to an intermediate result generated by an internal block of the video colorization apparatus. An internal block of the video colorization apparatus transforms its input or an intermediate feature map to generate another intermediate feature map or an output.

Hereinafter, the video colorization apparatus is described with reference to FIGS. 1 and 2.

FIG. 1 is an exemplary diagram of a video colorization apparatus according to an embodiment of the present invention.

FIG. 2 is an exemplary diagram of an inference model that is a component of the video colorization apparatus according to an embodiment of the present invention.

In an embodiment according to the present invention, the video colorization apparatus 100 acquires multiple black-and-white images and automatically generates a colorized video using a deep learning-based inference model that includes an adaptive fusion transform (AFT) function. The video colorization apparatus 100 includes all or some of an inference model 101, a segmentation extraction unit 111, an adaptive local parameter (ALP) extraction unit 112, a global feature extraction unit 113, and an adaptive global parameter (AGP) extraction unit 114.

The inference model 101 according to this embodiment automatically generates a colorized video from multiple black-and-white images using the adaptive fusion transform (AFT) function. The inference model 101 includes all or some of a dense feature extraction unit 102, an encoder 103, a bottleneck unit 104, a decoder 105, and a feature enhancement unit 106.

The inference model 101 according to this embodiment is implemented as a U-net-based deep learning model that includes at least one convolution layer (see Non-Patent Document 2), but it is not necessarily limited thereto. For example, any deep learning model capable of implementing image processing, such as a recurrent neural network (RNN) or a convolutional neural network (CNN), may be used. The inference model 101 may be trained in advance using a learning model. The structure of the learning model and its training process are described later.

The multiple black-and-white images input to the inference model 101 consist of five consecutive grayscale video frames centered on the central frame at time t. The five frames are concatenated along the channel axis, as if they formed a single five-channel input. The inference model 101 produces one output frame at time t. In the notation used for the input and output frames, superscripts distinguish luminance from chrominance (ab) in the LAB color space.
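
As a minimal sketch of the input construction described above, the five-frame window around time t may be assembled as follows. The function name, the nested-list frame layout, and the clamping of indices at the start and end of the clip are assumptions made for illustration; the text does not state how clip boundaries are handled.

```python
def stack_frame_window(frames, t, radius=2):
    """Return an H x W x (2*radius + 1) nested list centered on frame t.

    frames: list of grayscale frames, each an H x W nested list of floats.
    Indices outside the clip are clamped to the nearest valid frame; this
    boundary rule is an assumption, not something the text specifies.
    """
    last = len(frames) - 1
    indices = [min(max(t + k, 0), last) for k in range(-radius, radius + 1)]
    height, width = len(frames[0]), len(frames[0][0])
    # Each pixel becomes a channel vector holding that pixel from all frames.
    return [
        [[frames[i][r][c] for i in indices] for c in range(width)]
        for r in range(height)
    ]


# Seven tiny 2x2 frames whose pixel values equal the frame index.
clip = [[[float(n)] * 2 for _ in range(2)] for n in range(7)]
window = stack_frame_window(clip, t=3)
first_window = stack_frame_window(clip, t=0)
```

With radius 2, each pixel of the result holds five values, one per frame, which corresponds to the five-channel input described in the text.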

Hereinafter, iConvj and iDcnvj denote a convolution layer and a deconvolution layer, respectively, each with an i×i filter and stride j. The number of channels is denoted by c, where c is a natural number.
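
To make the iConvj and iDcnvj notation concrete, the following sketch computes the spatial output size of each layer type. The padding of i // 2 and the output padding of j - 1 are assumptions chosen so that stride 1 preserves the size and stride 2 halves or doubles it exactly; the text does not specify padding.

```python
def conv_out_size(n, kernel, stride, padding=None):
    """Spatial output size of a kernel x kernel convolution (iConvj)."""
    if padding is None:
        padding = kernel // 2  # assumed "same"-style padding
    return (n + 2 * padding - kernel) // stride + 1


def deconv_out_size(n, kernel, stride, padding=None, output_padding=None):
    """Spatial output size of a transposed convolution (iDcnvj)."""
    if padding is None:
        padding = kernel // 2  # assumed, mirrors conv_out_size
    if output_padding is None:
        output_padding = stride - 1  # assumed, so stride 2 doubles exactly
    return (n - 1) * stride - 2 * padding + kernel + output_padding


same_1conv1 = conv_out_size(256, kernel=1, stride=1)      # 1Conv1
halved_3conv2 = conv_out_size(256, kernel=3, stride=2)    # 3Conv2
doubled_3dcnv2 = deconv_out_size(128, kernel=3, stride=2)  # 3Dcnv2
```

Under these assumptions a 3Conv2 layer followed by a 3Dcnv2 layer returns to the original spatial size, which is the property a symmetric encoder and decoder rely on.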

The dense feature extraction unit 102 according to this embodiment uses hierarchical features to fuse global features effectively. The dense feature extraction unit 102 includes a 1Conv1 layer, a leaky ReLU (LR) layer, and a dense block (DB; see Non-Patent Document 3). Here, LR (leaky rectified linear unit) is an activation function. From its input, the dense feature extraction unit 102 generates a dense feature with c channels.

FIG. 3 is a block diagram of a DB according to an embodiment of the present invention.

Using hierarchical features, the DB fuses global features effectively. As shown in FIG. 3(b), the DB includes D residual dense blocks (RDBs), where D is a natural number. The global residues generated by the RDBs are hierarchically concatenated along the channel axis and then used as the input of the layer that generates the output of the DB. As shown in FIG. 3(a), the d-th RDB includes e sub-blocks, where e is a natural number, and each sub-block includes a 1Conv1 layer and an LR layer. The local residues generated by the sub-blocks are hierarchically concatenated along the channel axis and then used as the input of the layer that generates the output of the d-th RDB.
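
The concatenate-then-fuse structure of the DB can be sketched on a single pixel, because a 1Conv1 layer acts on each pixel's channel vector independently. In the sketch below the sequential chaining of sub-blocks and of residual dense blocks, the leaky ReLU slope of 0.2, and all weights are assumptions made for illustration; FIG. 3 may define the connections differently.

```python
def leaky_relu(v, slope=0.2):
    """Leaky ReLU on a channel vector; the slope value is assumed."""
    return [x if x > 0 else slope * x for x in v]


def conv1x1(v, weight):
    """1Conv1 on one pixel: weight is out_channels x in_channels."""
    return [sum(w * x for w, x in zip(row, v)) for row in weight]


def residual_dense_block(v, sub_weights, fuse_weight):
    """Run e sub-blocks (1Conv1 + LR), concatenate their outputs, fuse."""
    feats, h = [], v
    for w in sub_weights:
        h = leaky_relu(conv1x1(h, w))
        feats.append(h)
    concat = [x for f in feats for x in f]  # channel-wise concatenation
    return conv1x1(concat, fuse_weight)


def dense_block(v, rdb_params, fuse_weight):
    """Run D residual dense blocks, concatenate their outputs, fuse."""
    outs, h = [], v
    for sub_weights, rdb_fuse in rdb_params:
        h = residual_dense_block(h, sub_weights, rdb_fuse)
        outs.append(h)
    return conv1x1([x for o in outs for x in o], fuse_weight)


# Hypothetical weights: identity sub-blocks and a fusion that sums inputs.
identity = [[1.0, 0.0], [0.0, 1.0]]
sum_fuse = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
rdb_out = residual_dense_block([1.0, -1.0], [identity, identity], sum_fuse)
db_out = dense_block([1.0, -1.0], [([identity, identity], sum_fuse)] * 2, sum_fuse)
```

Here e = 2 and D = 2, so each fusion layer receives a concatenation of two c-channel vectors and maps it back to c channels.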

The encoder 103 according to this embodiment takes the dense feature as input and generates the encoder output, which is a feature map with 8c channels. The encoder 103 may include a plurality of pairs of a residual block (RB) and a residual down block (RDB); the example of FIG. 2 includes three such pairs.
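
The shape of the feature map through the encoder can be traced with a short sketch. The text gives c input channels, 8c output channels, and three pairs; the doubling of the channel count at every down block is an assumption that is consistent with those figures but is not stated, and the frame size and c = 64 are hypothetical.

```python
def encoder_shapes(height, width, c, num_pairs=3):
    """Trace (height, width, channels) after each RB and down-block pair."""
    shapes = [(height, width, c)]
    h, w, ch = height, width, c
    for _ in range(num_pairs):
        # The RB keeps the shape. The down block halves height and width
        # through its 3Conv2 layer, assuming a padding of 1.
        h, w = (h + 1) // 2, (w + 1) // 2
        ch *= 2  # assumed channel doubling per pair
        shapes.append((h, w, ch))
    return shapes


# Hypothetical 720p input with c = 64.
trace = encoder_shapes(720, 1280, 64)
```

For a 720p frame the trace ends at a 90 × 160 feature map with 8c channels, which is the input size seen by the bottleneck unit under these assumptions.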

For an input x, the output of the i-th RB can be expressed as Equation 1. Equation 1 also defines the intermediate residue between the input and the output of the RB, which may be passed to the decoder 105 through a skip connection. The composition operator used in Equation 1 denotes the composite operation of functions. As expressed in Equation 1, the RB includes a convolution-based residue generation function.
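
The two ideas in this passage, function composition and a residue that is both added to the input and exposed for a skip connection, can be sketched generically. This is not a reconstruction of Equation 1: the layer stack inside the RB is not given in the text, so `double` and `relu` below are hypothetical stand-ins for its convolution and activation layers.

```python
from functools import reduce


def compose(*funcs):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs)


def residual_block(body):
    """Wrap a layer stack so it returns (input + residue, residue)."""
    def block(x):
        residue = body(x)
        out = [a + b for a, b in zip(x, residue)]
        return out, residue  # the residue can feed a skip connection
    return block


# Hypothetical stand-ins for the layers inside the block.
def double(v):
    return [2.0 * a for a in v]


def relu(v):
    return [max(0.0, a) for a in v]


block = residual_block(compose(relu, double))
out, residue = block([1.0, -3.0])
```

Returning the residue alongside the output mirrors the statement that the intermediate residue is delivered to the decoder 105.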

Meanwhile, for an input x, the output of the i-th residual down block (RDB) can be expressed as Equation 2. As expressed in Equation 2, the RDB internally includes a convolution-based residue generation function. In addition, the RDB performs down-sampling through the operation of its 3Conv2 layer.

The bottleneck unit 104 according to this embodiment takes the encoder output as input and generates the bottleneck output, which is a feature map with 8c channels. The bottleneck unit 104 may include a plurality of RBs that perform residue generation; the example of FIG. 2 includes three RBs.

The decoder 105 according to this embodiment takes the bottleneck output as input and generates the decoder output, which is a feature map with c channels. The decoder 105 may include a plurality of pairs of a residual up block (RUB) and a residual skip block (RSB); the example of FIG. 2 includes three such pairs. The number of RUB and RSB pairs equals the number of RB and RDB pairs in the encoder 103. The decoder 105 also includes an AFT layer after each RSB. The AFT is a form of feature map transform (FMT) that transforms the output of each RSB using the ALP and the AGP. The AFT is described in detail later.
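
The ordering of stages in the decoder, with an AFT layer after each RSB, can be listed with a short sketch. It assumes that each RUB doubles the spatial size and halves the channel count and that the RSB and AFT stages keep the shape; only the 8c input and c output channel counts come from the text.

```python
def decoder_plan(height, width, channels, num_pairs=3):
    """List (stage name, height, width, channels) through the decoder."""
    plan = []
    h, w, ch = height, width, channels
    for i in range(1, num_pairs + 1):
        # Assumed: the up block doubles H and W and halves the channels.
        h, w, ch = h * 2, w * 2, ch // 2
        for stage in ("RUB", "RSB", "AFT"):
            plan.append((f"{stage}{i}", h, w, ch))
    return plan


# Hypothetical bottleneck output of 90 x 160 with 8c = 512 channels.
plan = decoder_plan(90, 160, 512)
stage_names = [name for name, _, _, _ in plan]
```

With three pairs the plan contains nine stages and ends at the input resolution with c channels, matching the encoder in reverse under these assumptions.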

For an input x, the output of the i-th RUB can be expressed as Equation 3. As expressed in Equation 3, the RUB internally includes a convolution-based residue generation function. In addition, the RUB performs up-sampling through the operation of its 3Dcnv2 layer.

Meanwhile, for an input x, the output of the i-th RSB can be expressed as Equation 4. Here, the operator [a, b] denotes the concatenation of two feature maps a and b. Equation 4 also uses the intermediate residue delivered from the encoder 103 through the skip connection. As expressed in Equation 4, the RSB includes a convolution-based residue generation function.
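
The concatenation operator [a, b] can be illustrated on small feature maps stored as height × width × channels nested lists. The function name and the example values are hypothetical; the sketch only shows that the two maps must share a spatial size and that their channel vectors are joined at every pixel.

```python
def concat_channels(a, b):
    """Concatenate two H x W x C feature maps along the channel axis."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("feature maps must share height and width")
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


# A 1 x 2 map with two channels joined with a 1 x 2 map with one channel.
decoder_feature = [[[1.0, 2.0], [3.0, 4.0]]]
skip_residue = [[[9.0], [8.0]]]
joined = concat_channels(decoder_feature, skip_residue)
```

The result has three channels per pixel, so any layer that follows the concatenation must accept the summed channel count.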

The feature enhancement unit 106 according to this embodiment takes the decoder output as input and generates a colorized frame with enhanced features. The feature enhancement unit 106 includes an enhancer (EH; see Non-Patent Document 4) and a Tanh layer. Here, Tanh is the hyperbolic tangent activation function.

FIG. 4 is a block diagram of an EH according to an embodiment of the present invention.

Exploiting global context information at various scales is important for improving the performance of the inference model 101. The EH enhances the features of the decoder output by making full use of complementary multi-scale spatial information. The EH includes a plurality of branches, one per scale; the example of FIG. 4 includes four branches. Each branch average-pools the spatial information of the feature map at its scale and then spatially up-samples the result using nearest-neighbor interpolation. For example, when average pooling uses an i×i window, i×i up-sampling is performed so that each average is spread back over its window. The outputs of the branches are concatenated along the channel axis and then used as the input of the layer that generates the output of the EH.
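
One EH branch, average pooling with an i×i window followed by i×i nearest-neighbor up-sampling, can be sketched on a single-channel feature map. The scales (1, 2, 4) and the requirement that the map size be divisible by every scale are assumptions made for illustration; the actual scales of the four branches in FIG. 4 are not given in the text.

```python
def average_pool(fmap, k):
    """Non-overlapping k x k average pooling on an H x W nested list."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            sum(fmap[r + dr][c + dc] for dr in range(k) for dc in range(k))
            / (k * k)
            for c in range(0, w, k)
        ]
        for r in range(0, h, k)
    ]


def nearest_upsample(fmap, k):
    """Repeat every value k times along both axes."""
    return [[v for v in row for _ in range(k)] for row in fmap for _ in range(k)]


def enhancer_branches(fmap, scales):
    """Return one pooled and up-sampled map per scale."""
    return [nearest_upsample(average_pool(fmap, k), k) for k in scales]


# A 4 x 4 map holding the values 0..15, processed at assumed scales.
fmap = [[float(4 * r + c) for c in range(4)] for r in range(4)]
branches = enhancer_branches(fmap, scales=(1, 2, 4))
```

Every branch output has the same spatial size as the input, which is what allows the branch outputs to be concatenated along the channel axis before the final layer of the EH.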

A semantic object in an image or video may have its own characteristic color tone. Existing video colorization methods have tended to overlook the color-related information that a segmentation map or a global feature can provide. By using a segmentation map or global features that can be generated from the input frames, the AFT according to this embodiment replaces reference images or user guidance, and compensates for the shortcoming of existing FMT methods, which depend only on internal feature maps. The AFT is a self-guided FMT: it uses segmentation-related features extracted from the input frame as a local hint, and global features as a global hint, to transform the intermediate output generated by the RSB, a component of the decoder 105.

The segmentation extraction unit 111 according to this embodiment inputs the center frame of the multiple input frames into a pre-trained segmentation extraction model to generate a segmentation map. In this embodiment, DeepLab v2 (see Non-Patent Document 6) with ResNet-101 as its backbone (see Non-Patent Document 5) is used as the segmentation extraction model. However, the model is not limited thereto and may be any deep learning model capable of object segmentation.

FIG. 5 is a block diagram of an ALP extraction unit according to an embodiment of the present invention.

The ALP extraction unit 112 according to this embodiment takes the segmentation map as input and generates ALPs. The ALP extraction unit 112 includes a common part that extracts a shared feature and a plurality of independent parts that generate the ALPs from the shared feature. FIG. 5 shows three independent parts; the number of independent parts equals the number of RUB-RSB pairs included in the decoder 105. Each ALP includes a scale parameter and a bias parameter. The index k denotes the level of spatial resolution: it identifies a spatial size whose resolution is a k-dependent multiple of that of the input. Here, k takes the values 0, 1, and 2.

The global feature extraction unit 113 according to this embodiment inputs the center frame of the multiple input frames into a pre-trained global feature extraction model to generate a global feature map. In this embodiment, VGG19 is used as the global feature extraction model (see Non-Patent Document 7). However, the model is not limited thereto and may be any deep learning model capable of extracting global features.

FIG. 6 is a block diagram of an AGP extraction unit according to an embodiment of the present invention.

The AGP extraction unit 114 according to this embodiment takes the global feature map as input and generates AGPs. The AGP extraction unit 114 includes a common part that extracts a shared feature and a plurality of independent parts that generate the AGPs from the shared feature. FIG. 6 shows three independent parts; the number of independent parts equals the number of RUB-RSB pairs included in the decoder 105. Each AGP includes a scale parameter and a bias parameter. As shown in FIG. 6, the GAP (Global Average Pooling) layer, which is the first stage of each independent part, condenses the global spatial information of the shared feature into 1x1 scalar information. The remaining stages of the independent part generate the AGP.
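
The GAP step described above reduces each channel of a feature map to a single scalar. A minimal sketch in plain Python, assuming the feature map is given as a list of 2-D channels (the function name and the data layout are illustrative choices, not taken from the specification):

```python
def global_average_pool(feature_map):
    """Collapse each 2-D channel to one scalar by averaging all of its
    spatial positions, giving a 1x1 value per channel."""
    pooled = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        pooled.append(total / count)
    return pooled
```

The per-channel scalars carry no spatial layout, which is why the parameters derived from them act as a global hint rather than a local one.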

For each level k, the output of the AFT for an input I, which corresponds to the output of the RSB, is expressed by Equation 5.

Here, the operator symbol in Equation 5 denotes element-wise multiplication, and the trainable weight represents the ratio in which the ALP and the AGP are reflected. As expressed in Equation 5, the AFT adaptively fuses the segmentation-related features extracted as local hints with the VGG19-related features extracted as global hints.
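
Equation 5 is not reproduced in this text, so the following is only one plausible reading of the description: each parameter set applies a scale-and-bias transform to the input, and the trainable weight w blends the local (ALP) and global (AGP) results as a convex combination. The w and 1 - w weighting, the single-channel layout, and all names are assumptions made for this sketch.

```python
def adaptive_fusion_transform(inp, gamma_l, beta_l, gamma_g, beta_g, w):
    """Blend a local and a global scale-and-bias transform of one channel.

    inp, gamma_l, beta_l: 2-D lists of equal size (ALP is per position).
    gamma_g, beta_g: scalars (AGP is a 1x1 value for the channel).
    w: blending weight in [0, 1]; assumed convex combination.
    """
    out = []
    for r, row in enumerate(inp):
        out_row = []
        for c, x in enumerate(row):
            local = gamma_l[r][c] * x + beta_l[r][c]   # element-wise scale and bias
            global_ = gamma_g * x + beta_g             # same scale and bias everywhere
            out_row.append(w * local + (1 - w) * global_)
        out.append(out_row)
    return out
```

With w = 1 the transform uses only the segmentation-derived parameters, and with w = 0 only the global ones; intermediate values mix the two, which matches the "reflection ratio" role the text assigns to the weight.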

The ALP extraction unit 112, the AGP extraction unit 114, and the fusion weight may be trained together when the inference model 101 is trained. The segmentation extraction unit 111 and the global feature extraction unit 113, on the other hand, use pre-trained deep learning models as described above.

FIGS. 1 and 2 illustrate exemplary configurations according to this embodiment. Depending on the form of the input, the structure of the inference model, and the training method, implementations with different components or different connections between components are possible.

The inference model 101 according to this embodiment is a deep learning-based model trained in advance to generate colorized video. In this embodiment, the inference model 101 of the video colorizing apparatus 100 serves as the generator, and the inference model 101 may be trained using a GAN (Generative Adversarial Network)-based learning model 700 that includes the generator and two discriminators. Adopting the GAN-based learning model 700 compensates for the shortcomings of training that relies on norm-based losses and can improve the perceptual quality of the colorized results.

The training process of the learning model 700 is described below with reference to FIGS. 7 and 8.

FIG. 7 is an exemplary diagram of a learning model according to an embodiment of the present invention.

In this embodiment, the inference model 101 of the video colorizing apparatus 100 is trained using the GAN-based learning model 700. The learning model 700 includes all or some of the following: the video colorizing apparatus 100 including the generator (inference model 101), a color conversion unit 701, a first discriminator 702, and a second discriminator 703. The components of the learning model 700 are not necessarily limited to these. For example, the learning model 700 may further include a training unit (not shown) for training the video colorizing apparatus 100, or may be implemented to operate in conjunction with an external training unit. The learning model 700 may also include a Sobel operator, which is used in computing a loss.

The generator of the GAN-based learning model generates a colorized frame from the multiple black-and-white images.

The segmentation extraction unit 111, the ALP extraction unit 112, the global feature extraction unit 113, and the AGP extraction unit 114 generate the ALP and AGP and provide them to the AFT layers included in the generator 101.

The color conversion unit 701 combines the output of the generator with the center frame of the multiple black-and-white images to generate a frame in RGB space. The resulting RGB frame may be used to generate the video input.

The first discriminator 702 distinguishes between two image inputs: the colorized RGB frame and the GT frame. Here, the GT frame is the GT (Ground Truth) RGB image at time t.

The second discriminator 703 distinguishes the temporal self-attention between its two inputs, that is, it distinguishes between two video inputs. In one of the video inputs, only the center frame is a predicted image, and the remaining frames are GT RGB images. The second discriminator 703 drives the video colorizing apparatus 100 to infer the colorized frame while taking temporal consistency into account.

FIG. 8 is a block diagram of a discriminator according to an embodiment of the present invention.

The first discriminator 702 and the second discriminator 703 according to this embodiment are both implemented as the deep learning-based model shown in FIG. 8, but are not necessarily limited thereto. Any deep learning model capable of distinguishing two image inputs may be used as a discriminator. The first discriminator 702 and the second discriminator 703 may also be implemented as deep learning models with different structures.

The discriminator includes IN (Instance Normalization) layers for stable training (see Non-Patent Document 10). The discriminator also includes a self-attention layer (see Non-Patent Document 11), which captures the long-range dependencies among the intermediate feature maps of the discriminator and can thereby help improve the performance of the generator, that is, the inference model 101. The discriminator also produces feature maps. As shown in FIG. 8, these feature maps are the outputs of the three LR layers and the self-attention layer.

When training the generator and the discriminators, the training unit may use various norm-based losses in addition to the loss based on the GAN structure.

The training unit according to this embodiment uses the RaHinge (Relativistic Average Hinge) GAN loss (see Non-Patent Document 8) as the adversarial loss. The adversarial loss is expressed by Equation 6.

Here, the two loss terms in Equation 6 are the GAN losses of the discriminator D (the first discriminator 702 and the second discriminator 703) and of the generator G, respectively. Equation 6 also relies on two auxiliary definitions. Y denotes a GT frame or GT multi-color frames, and P denotes a colorized frame or multi-color frames that include the colorized frame.
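
Equation 6 is not reproduced in this text. The sketch below follows the commonly published form of the relativistic average hinge loss, in which each raw discriminator score is measured relative to the mean score of the opposite class; the notation of the specification may differ. The function name and the list-of-scores input are assumptions made for the example.

```python
def rahinge_losses(scores_y, scores_p):
    """Relativistic average hinge losses (standard published form).

    scores_y: raw discriminator scores for GT samples Y.
    scores_p: raw discriminator scores for generated samples P.
    Returns (discriminator_loss, generator_loss).
    """
    def mean(values):
        return sum(values) / len(values)

    mean_y, mean_p = mean(scores_y), mean(scores_p)
    rel_y = [s - mean_p for s in scores_y]   # GT score relative to average fake
    rel_p = [s - mean_y for s in scores_p]   # fake score relative to average GT

    loss_d = (mean([max(0.0, 1.0 - d) for d in rel_y])
              + mean([max(0.0, 1.0 + d) for d in rel_p]))
    loss_g = (mean([max(0.0, 1.0 - d) for d in rel_p])
              + mean([max(0.0, 1.0 + d) for d in rel_y]))
    return loss_d, loss_g
```

The two losses mirror each other: the discriminator is rewarded when GT samples score well above generated ones, and the generator is rewarded for the opposite ordering.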

A feature-matching loss may be used for stable training of the GAN. The feature-matching loss is expressed by Equation 7.

The feature-matching loss is the L1 loss between the feature maps that the discriminator D (the first discriminator 702 and the second discriminator 703) produces for Y and P.

The global feature map generated by the global feature extraction unit 113 may be used to compute additional loss terms. This embodiment uses the global feature maps generated by VGG19, the global feature extraction model.

A style loss computed from the VGG19 feature maps may be used (see Non-Patent Document 9). The style loss is an L1 loss based on the feature maps for Y and P, and is expressed by Equation 8.

Here, i denotes the ReLU_i_1 layer of VGG19, and the two statistics in Equation 8 are the mean and the standard deviation, respectively. In Equation 8, i = 4 and 5 are used, but the choice of layers is not necessarily limited thereto.
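
Equation 8 is not reproduced in this text. As a hedged illustration of a style loss built from means and standard deviations, the sketch below sums the L1 differences of per-channel statistics over the selected layers. Summing over channels (rather than averaging), using the population standard deviation, and the dictionary layout of the features are all assumptions made for the example.

```python
import statistics


def channel_stats(channels):
    """Per-channel (mean, population standard deviation) of flat activation lists."""
    return [(statistics.fmean(ch), statistics.pstdev(ch)) for ch in channels]


def style_loss(feats_y, feats_p):
    """L1 distance between channel statistics of Y and P features.

    feats_y, feats_p: dict mapping a layer index (e.g. 4, 5) to a list of
    channels, each channel being a flat list of activations.
    """
    total = 0.0
    for layer in feats_y:
        stats_y = channel_stats(feats_y[layer])
        stats_p = channel_stats(feats_p[layer])
        for (mu_y, sd_y), (mu_p, sd_p) in zip(stats_y, stats_p):
            total += abs(mu_y - mu_p) + abs(sd_y - sd_p)
    return total
```

Because only first- and second-order statistics are compared, two feature maps with different spatial arrangements but the same channel statistics incur no style penalty.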

A content loss computed from the VGG19 feature maps may be used (see Non-Patent Document 9). The content loss is an L1 loss based on the feature maps for Y and P, and is expressed by Equation 9.

Here, the map used in Equation 9 is the feature map normalized channel by channel in terms of mean and variance.
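
Equation 9 is not reproduced in this text. The sketch below shows one way a channel-wise mean-variance normalization followed by an L1 comparison could look; the small epsilon added to the standard deviation, the averaging over all elements, and the function names are assumptions made for the example.

```python
import statistics


def normalize_channel(channel, eps=1e-5):
    """Shift a flat channel to zero mean and scale it to (roughly) unit variance."""
    mu = statistics.fmean(channel)
    sd = statistics.pstdev(channel)
    return [(x - mu) / (sd + eps) for x in channel]


def content_loss(feat_y, feat_p):
    """Mean absolute difference between channel-normalized feature maps.

    feat_y, feat_p: lists of channels, each a flat list of activations.
    """
    diffs = []
    for ch_y, ch_p in zip(feat_y, feat_p):
        norm_y, norm_p = normalize_channel(ch_y), normalize_channel(ch_p)
        diffs.extend(abs(a - b) for a, b in zip(norm_y, norm_p))
    return sum(diffs) / len(diffs)
```

Normalizing first removes per-channel offset and contrast, so the loss responds to differences in spatial structure rather than to the statistics already covered by the style loss.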

A perceptual loss based on VGG19 may be used. The perceptual loss is an L1 loss based on the difference between the feature maps for Y and P, and is expressed by Equation 10.

The Sobel operator can detect the edges in an image by using derivatives. In this embodiment, an edge loss may be used, namely the L2 loss between the edge maps of Y and P generated by the Sobel operator. The edge loss is expressed by Equation 11.

Here, v and h denote the vertical and horizontal components of the edge map, respectively.
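
Equation 11 is not reproduced in this text. As an illustration of an edge loss built from Sobel responses, the sketch below filters both images with the two standard 3x3 Sobel kernels and averages the squared differences of the responses. Reading "L2 loss" as a mean squared difference, filtering without padding, and the single-channel image layout are assumptions made for the example.

```python
# Standard 3x3 Sobel kernels.
SOBEL_H = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # responds to change along a row
SOBEL_V = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # responds to change along a column


def filter3x3(img, kernel):
    """Slide a 3x3 kernel over a 2-D image without padding (valid region only)."""
    h, w = len(img), len(img[0])
    return [[sum(kernel[a][b] * img[r + a][c + b]
                 for a in range(3) for b in range(3))
             for c in range(w - 2)]
            for r in range(h - 2)]


def edge_loss(img_y, img_p):
    """Mean squared difference between the Sobel responses of Y and P,
    accumulated over the vertical and horizontal components."""
    total, count = 0.0, 0
    for kernel in (SOBEL_V, SOBEL_H):
        edges_y = filter3x3(img_y, kernel)
        edges_p = filter3x3(img_p, kernel)
        for row_y, row_p in zip(edges_y, edges_p):
            for a, b in zip(row_y, row_p):
                total += (a - b) ** 2
                count += 1
    return total / count
```

A loss of this kind penalizes colorized output whose boundaries are displaced or blurred relative to the GT, which is the failure the text associates with boundary leakage.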

A reconstruction loss based on the difference between chrominance components may be used. The reconstruction loss is an L1 loss and is expressed by Equation 12.

Combining the above losses, the total loss of the GAN-based learning model can be expressed by Equations 13 and 14.

Here, the subscripts i and v denote the losses associated with the first discriminator 702 and the second discriminator 703, respectively. All of the weighting coefficients in Equations 13 and 14 are hyperparameters of the loss.

A 4K (3840x2160) dataset collected from YouTube™ is used as the GT video for training. Compared with that of the existing method (see Non-Patent Document 1), the training GT video has high resolution and contains rich colors and diverse objects.

To train the learning model 700 effectively using high-resolution black-and-white video and the training GT video, and thereby reduce the influence of the various artifacts that can arise during the colorization performed by the inference model 101, the training unit may proceed as follows.

The training unit according to this embodiment updates the parameters of the generator 101, the first discriminator 702, and the second discriminator 703 in the direction that reduces the total loss.

Alternatively, the parameters of the generator 101, the first discriminator 702, and the second discriminator 703 may be updated in the direction that reduces all or some of the loss terms included in the total loss.

Alternatively, the parameters of at least one of the generator 101, the first discriminator 702, and the second discriminator 703 may be updated in the direction that reduces all or some of the loss terms included in the total loss.

The training unit trains the learning model 700 in two processes.

In the first process, the deep learning models included in the segmentation extraction unit 111 and the global feature extraction unit 113 are pre-trained.

In the second process, the generator 101 and the two discriminators are trained. The training unit updates the parameters of the generator 101, the first discriminator 702, and the second discriminator 703 in the direction that reduces the total loss expressed in Equations 13 and 14. As described above, when the generator 101 is trained, the ALP extraction unit 112 and the AGP extraction unit 114 may be trained together with it.

Training GAN-based deep learning models is known to be difficult. In particular, stable training can be hard to achieve in the early stage of learning. Therefore, in the second training process, the training unit according to this embodiment can improve the training efficiency of the learning model 700 by changing the setting of each loss hyperparameter. In the early stage of training, the training unit may set the hyperparameters of some terms of the total loss expressed in Equations 13 and 14 to zero. For example, the style loss, reconstruction loss, content loss, and/or perceptual loss terms may be enabled, while the remaining terms, including the adversarial loss, feature-matching loss, and edge loss, are disabled.

In the later stage, once the operation of the generator 101 has stabilized, the training unit sets the hyperparameters that had been zero to non-zero values, so that the parameters of the generator 101 and the two discriminators are updated using all loss terms. The AFT is likewise activated only in the later stage, so that the training unit concentrates on stabilizing the inference model 101 in the early stage and fine-tunes its performance in the later stage. Here, activating the AFT means training the ALP extraction unit 112 and the AGP extraction unit 114, together with the weight that determines the ratio in which the ALP and AGP are reflected.
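
The two-stage weighting described above can be sketched as a simple schedule. The term names, the step-count switch between stages, and the base weights are hypothetical; the specification states only which terms may be enabled early and that the zeroed hyperparameters become non-zero later.

```python
# Terms the text lists as enabled in the early stage (names are illustrative).
EARLY_TERMS = {"style", "reconstruction", "content", "perceptual"}
# Terms the text lists as disabled early and enabled in the later stage.
LATE_ONLY_TERMS = {"adversarial", "feature_matching", "edge"}


def loss_weights(step, early_steps, base_weights):
    """Return the weight of every loss term at a given training step.

    step: current training step.
    early_steps: hypothetical length of the early stage, in steps.
    base_weights: dict of non-zero weights used once a term is active.
    """
    in_early_stage = step < early_steps
    weights = {}
    for name in EARLY_TERMS | LATE_ONLY_TERMS:
        if in_early_stage and name in LATE_ONLY_TERMS:
            weights[name] = 0.0          # term disabled during the early stage
        else:
            weights[name] = base_weights[name]
    return weights


def total_loss(terms, weights):
    """Weighted sum of the individual loss values."""
    return sum(weights[name] * terms[name] for name in weights)
```

In practice the switch point would be chosen by observing when the generator output has stabilized, rather than fixed in advance as in this sketch.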

As described above, this embodiment provides a video colorizing apparatus that acquires multiple black-and-white images and automatically generates colorized video using a deep learning-based inference model trained in advance on diverse losses. This makes it possible to mitigate problems that arise during colorization, such as color bleeding, block artifacts, and boundary leakage.

The device (not shown) on which the video colorizing apparatus 100 according to this embodiment is mounted may be an information processing device such as a programmable computer or a smartphone, and includes at least one communication interface capable of connecting to a server (not shown).

The training of the inference model described above may be performed on the device on which the video colorizing apparatus 100 is mounted, using the computing power of that device.

Alternatively, the training of the inference model 101 of the video colorizing apparatus 100 may be performed on a server. The training unit of the server may train a deep learning model having the same structure as the inference model 101, which is a component of the video colorizing apparatus 100 mounted on the device. The server transmits the parameters of the trained deep learning model to the device over the communication interface connected to the device, and the video colorizing apparatus 100 may set the parameters of the inference model 101 using the received parameters. The parameters of the inference model 101 may also be set when the device is shipped or when the video colorizing apparatus 100 is mounted on the device.

FIG. 9 is a flowchart of a video colorizing method according to an embodiment of the present invention.

FIG. 9(a) is a flowchart of the video colorizing method performed by the video colorizing apparatus 100.

The video colorizing apparatus 100 extracts a segmentation map from the center frame of the multiple black-and-white images using the segmentation extraction model, and generates an ALP (Adaptive Local Parameter) from the segmentation map using the pre-trained, deep learning-based ALP extraction unit (S901).

The ALP includes a scale parameter and a bias parameter.

The video colorizing apparatus 100 extracts a global feature map from the center frame of the multiple black-and-white images using the global feature extraction model, and generates an AGP (Adaptive Global Parameter) from the global feature map using the pre-trained, deep learning-based AGP extraction unit (S902).

The AGP includes a scale parameter and a bias parameter.

The video colorizing apparatus 100 generates a colorized frame from the multiple black-and-white images using a pre-trained, deep learning-based inference model that is based on the Adaptive Fusion Transform (AFT) using the ALP and AGP (S903).

The AFT adaptively fuses segmentation-related features, which serve as a local hint, with global features, which serve as a global hint.

The segmentation extraction model and the global feature extraction model are implemented as deep learning models that are pre-trained before the inference model 101 is trained.

The ALP extraction unit 112 and the AGP extraction unit 114, on the other hand, may be trained together with the inference model 101 when it is trained.

FIG. 9(b) is a flowchart showing in detail step S903 executed by the inference model 101.

The inference model 101 acquires the multiple black-and-white images and generates a dense feature in which global features are fused (S911). The inference model 101 may use hierarchical features to generate a dense feature in which the global features are effectively fused.

The inference model 101 inputs the dense feature to the encoder and, using a convolution-based residual generation function, generates an encoder output in which the dense feature is down-sampled (S912). The inference model 101 may use the encoder 103 to generate the encoder output, which is a feature map for the multiple black-and-white images.

The inference model 101 generates a bottleneck output from the encoder output using the residual generation function (S913).

The inference model 101 inputs the bottleneck output to the decoder and, using the AFT and the residual generation function, generates a decoder output in which the bottleneck output is up-sampled (S914). Using skip connections, the inference model 101 may pass the intermediate residuals generated in the encoder 103 to the decoder 105.

The inference model 101 may use the decoder 105 to generate the decoder output, a preliminary inference result, from the feature map for the multiple black-and-white images.

The inference model 101 enhances the features of the decoder output to generate a colorized frame (S915). The inference model 101 may use multi-scale spatial information to generate a colorized frame in which the features of the decoder output, the preliminary inference result, are enhanced.

도 10은 본 발명의 일 실시예에 따른 학습 모델에 대한 학습방법의 순서도이다. 10 is a flowchart of a learning method for a learning model according to an embodiment of the present invention.

트레이닝부는 다중 흑백 영상의 중앙 화면으로부터 분할추출 모델을 이용하여 분할 맵을 추출하고, ALP 추출부를 이용하여 분할 맵으로부터 ALP를 생성한다(S1001).The training unit extracts the segmentation map from the central screen of the multiple black and white image using the segmentation extraction model, and generates an ALP from the segmentation map using the ALP extractor (S1001).

트레이닝부는 중앙 화면으로부터 전역특성 추출 모델을 이용하여 전역특성 맵을 추출하고, AGP 추출부를 이용하여 전역특성 맵으로부터 AGP를 생성한다(S1002).The training unit extracts the global characteristic map from the central screen using the global characteristic extraction model, and generates an AGP from the global characteristic map using the AGP extraction unit (S1002).

트레이닝부는 ALP 및 AGP를 이용하는 적응적 융합 변환(Adaptive Fusion Transform: AFT)에 기반하는, 딥러닝 기반 추론 모델인 생성기를 이용하여 다중 흑백 영상으로부터 컬러화된 화면을 생성한다(S1003).The training unit generates a colored screen from multiple black-and-white images using a generator that is a deep learning-based inference model, based on adaptive fusion transform (AFT) using ALP and AGP (S1003).

AFT는 국부적인 힌트인 분할 관련 특성 및 전역적인 힌트인 전역 특성을 적응적으로 융합(fusion)한다.AFT adaptively fuses a partition-related feature that is a local hint and a global feature that is a global hint.

트레이닝부는 제1 구별기를 이용하여 컬러화된 화면과 GT(Ground Truth) 화면을 구별한다(S1004). The training unit distinguishes the colored screen from the GT (Ground Truth) screen using the first discriminator (S1004).

트레이닝부는 제2 구별기를 이용하여 컬러화된 화면이 중앙 화면으로 포함된 다중 컬러 화면과 GT 다중 컬러 화면 간의 시간적 일관성을 구별한다(S1005).The training unit distinguishes the temporal consistency between the multi-color screen including the colored screen as the center screen and the GT multi-color screen by using the second discriminator (S1005).

제1 구별기(702) 및 제2 구별기(703)는 딥러닝 기반 모델로 구현되며, 두 개의 영상 입력을 구별할 수 있는 어느 형태의 딥러닝 모델이든 구별기로 이용될 수 있다. The first discriminator 702 and the second discriminator 703 are implemented as a deep learning-based model, and any type of deep learning model capable of distinguishing two image inputs may be used as a discriminator.

트레이닝부는 생성기, 제1 구별기 및 제2 구별기의 출력을 이용하여 총손실(total loss)을 산정한다(S1006).The training unit calculates a total loss by using the outputs of the generator, the first classifier, and the second classifier (S1006).

총손실을 구성하는 각각의 손실 항목에 대한 내용은 이미 설명되었으므로, 더 이상의 자세한 설명은 생략한다. Since the content of each loss item constituting the total loss has already been described, further detailed description will be omitted.

트레이닝부는 총손실에 포함된 손실 항목의 전부 또는 일부가 감소되는 방향으로 생성기, 제1 구별기 및 제2 구별기 중 적어도 하나의 파라미터를 업데이트한다(S1007).The training unit updates at least one parameter among the generator, the first discriminator, and the second discriminator in a direction in which all or part of the loss items included in the total loss are reduced ( S1007 ).

Although each flowchart according to the present embodiment describes the processes as being executed sequentially, the present embodiment is not necessarily limited thereto. In other words, the processes described in a flowchart may be executed in a different order, or one or more of the processes may be executed in parallel, so the flowcharts are not limited to a time-series order.

Various implementations of the systems and techniques described herein may be realized in digital electronic circuitry, integrated circuits, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs executable on a programmable system. The programmable system includes at least one programmable processor (which may be a special-purpose processor or a general-purpose processor) coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. Computer programs (also known as programs, software, software applications, or code) contain instructions for a programmable processor and are stored on a "computer-readable recording medium".

The computer-readable recording medium includes all types of recording devices in which data readable by a computer system is stored. The computer-readable recording medium may be a non-volatile or non-transitory medium such as a ROM, CD-ROM, magnetic tape, floppy disk, memory card, hard disk, magneto-optical disk, or storage device, and may further include a transitory medium such as a carrier wave (for example, transmission over the Internet) or a data transmission medium. In addition, the computer-readable recording medium may be distributed over network-connected computer systems so that computer-readable code is stored and executed in a distributed manner.

Various implementations of the systems and techniques described herein may be implemented by a programmable computer. Here, the computer includes a programmable processor, a data storage system (including volatile memory, non-volatile memory, another type of storage system, or a combination thereof), and at least one communication interface. For example, the programmable computer may be one of a server, a network appliance, a set-top box, an embedded device, a computer expansion module, a personal computer, a laptop, a Personal Data Assistant (PDA), a cloud computing system, or a mobile device.

The above description merely illustrates the technical idea of the present embodiment, and those of ordinary skill in the art to which the present embodiment belongs may make various modifications and variations without departing from its essential characteristics. Accordingly, the present embodiments are intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be interpreted according to the following claims, and all technical ideas within a scope equivalent thereto should be interpreted as falling within the scope of rights of the present embodiment.

100: video colorization apparatus 101: inference model
102: dense feature extraction unit 103: encoder
104: bottleneck unit 105: decoder
106: feature enhancement unit
111: segmentation extraction unit 112: global feature extraction unit
113: ALP extraction unit 114: AGP extraction unit
700: learning model 701: color conversion unit
702: first discriminator 703: second discriminator

Claims

A video colorization method performed by a video colorization apparatus, the method comprising:
extracting a segmentation map from a designated frame, which is one of multiple black-and-white images, using a segmentation extraction model, and generating an adaptive local parameter (ALP) from the segmentation map using a pre-trained deep learning-based ALP extraction unit;
extracting a global feature map from the designated frame using a global feature extraction model, and generating an adaptive global parameter (AGP) from the global feature map using a pre-trained deep learning-based AGP extraction unit; and
generating a colored frame from the multiple black-and-white images using a pre-trained deep learning-based inference model based on an adaptive fusion transform (AFT) using the ALP and the AGP.

The method of claim 1,
wherein the AFT
generates a local-feature-enhanced frame in which local features of the designated frame are reflected using the ALP and a global-feature-enhanced frame in which global features of the designated frame are reflected using the AGP, and fuses the local features and the global features by an adaptively weighted sum of the local-feature-enhanced frame and the global-feature-enhanced frame.

The method of claim 2,
wherein generating the colored frame by the inference model comprises:
obtaining the multiple black-and-white images and generating a dense feature fused with global features;
inputting the dense feature to an encoder and generating an encoder output in which the dense feature is down-sampled, using a convolution-based residual generation function;
generating a bottleneck output from the encoder output using the residual generation function;
inputting the bottleneck output to a decoder and generating a decoder output in which the bottleneck output is up-sampled, using the AFT and the residual generation function; and
generating the colored frame by enhancing features of the decoder output.

The method of claim 3,
wherein the decoder
comprises at least one pair of a residual up block (RUB) and a residual skip block (RSB) and generates the decoder output using the RUB and RSB pair, and wherein each RSB includes, at its rear end, a layer that performs the AFT.

The method of claim 1,
wherein each of the segmentation extraction model and the global feature extraction model
is implemented as a deep learning model and is pre-trained before training of the inference model is performed.

A learning method of a video colorization apparatus, the learning method comprising:
generating a colored frame from multiple black-and-white images using a generator, which is a deep learning-based inference model based on an adaptive fusion transform (AFT) using an adaptive local parameter (ALP) and an adaptive global parameter (AGP);
discriminating the colored frame from a ground truth (GT) frame using a deep learning-based first discriminator;
discriminating temporal coherence between a plurality of color images including the colored frame and a plurality of GT color images using a deep learning-based second discriminator; and
calculating a total loss using outputs of the generator, the first discriminator, and the second discriminator.

The learning method of claim 6, further comprising:
extracting a segmentation map from a designated frame, which is one of the multiple black-and-white images, using a segmentation extraction model, and generating the ALP from the segmentation map using a deep learning-based ALP extraction unit; and
extracting a global feature map from the designated frame using a global feature extraction model, and generating the AGP from the global feature map using a deep learning-based AGP extraction unit.

The learning method of claim 7,
wherein the total loss comprises:
an adversarial loss generated based on the output of the first discriminator for the colored frame and the GT frame, the output of the second discriminator for the plurality of color images and the plurality of GT color images, and the colored frame;
a feature-matching loss based on a difference between feature maps generated by the first discriminator for the colored frame and the GT frame, and a difference between feature maps generated by the second discriminator for the plurality of color images and the plurality of GT color images; and
a reconstruction loss based on a difference between the colored frame and the GT frame.

The learning method of claim 8,
wherein the total loss further comprises:
a style loss based on a difference between the means of the global feature map generated from the colored frame and the global feature map generated from the GT frame, and a difference between the standard deviations of the global feature map generated from the colored frame and the global feature map generated from the GT frame;
a content loss based on a difference between a normalized global feature map generated from the colored frame and a normalized global feature map generated from the GT frame; and
a perceptual loss based on a difference between the global feature maps generated from the colored frame and from the GT frame.

The learning method of claim 8,
wherein the total loss
further comprises an edge loss calculated using a Sobel operator, the edge loss being based on a difference between an edge map of the colored frame and an edge map of the GT frame.

The learning method of claim 7,
wherein each of the segmentation extraction model and the global feature extraction model
is implemented as a deep learning-based neural network and is pre-trained before training of the video colorization apparatus.

The learning method of claim 6,
further comprising updating the parameters of at least one of the generator, the first discriminator, and the second discriminator in a direction that reduces all or some of the loss items included in the total loss.

The learning method of claim 7,
wherein each of the ALP extraction unit and the AGP extraction unit
is implemented as a deep learning model and is trained together with the generator, the first discriminator, and the second discriminator.

A computer program stored in a computer-readable recording medium to execute each step included in the video colorizing method according to any one of claims 1 to 5.

A computer program stored in a computer-readable recording medium to execute each step included in the learning method of the video colorizing apparatus according to any one of claims 6 to 13.