KR102512018B1

KR102512018B1 - Image-to-Image Translation Apparatus and Method based on Class-aware Memory Network

Info

Publication number: KR102512018B1
Application number: KR1020210168365A
Authority: KR
Inventors: 손광훈; 정소미
Original assignee: 연세대학교 산학협력단
Priority date: 2021-11-30
Filing date: 2021-11-30
Publication date: 2023-03-17

Abstract

The invention provides an image conversion apparatus and method that obtain a natural target-domain image with minimal semantic distortion of each object by converting the image into a style corresponding to the target domain while preserving the semantic characteristics of each object. The apparatus comprises: an encoding unit that receives an input image obtained in a source domain, performs a neural network operation to obtain a content descriptor representing the structural features of the input image, and divides the content descriptor according to the class of each object included in the input image to obtain a plurality of class-wise content descriptors; a memory storing a plurality of items, at least one for each of a plurality of classes, each item containing a matched pair of an item key representing the representative structural characteristics of objects of the corresponding class and a style value representing the representative stylistic characteristics of that class in a target domain different from the source domain; a target style generation unit that calculates read weights according to the similarity between each class-wise content descriptor and the item keys of the items of the corresponding class among the items stored in the memory, and obtains a target style descriptor by weighting the style values matched to the item keys with the calculated read weights; and an output image generation unit that receives the content descriptor and the target style descriptor and performs a neural network operation to generate an output image in the target domain.

Description

Image conversion apparatus and method based on class-aware memory network {Image-to-Image Translation Apparatus and Method based on Class-aware Memory Network}

The present invention relates to an image conversion apparatus and method, and more particularly to an image conversion apparatus and method based on a class-aware memory network.

Image conversion (image-to-image translation) is a technique for converting an image of one domain into an image of another, required domain. It belongs to the field of generative models and aims to transform an input image of a source domain into an output image of a target domain. Here, the domain can be designated in various ways depending on the field in which image conversion is applied, such as the image acquisition conditions or the surrounding environment. For example, when an image captured by one camera is to be converted into an image as captured by another camera, the camera characteristics can be the domain; image conversion applies, for instance, when a black-and-white or infrared image is converted into an RGB image. When a daytime image is to be converted into a nighttime image, or an image taken in clear weather into an image taken in cloudy or rainy weather, the surrounding environment can be the domain. Image conversion can also be used to generate a realistic product image from a label image in which only outlines are given.

Such image conversion technology can be used in various fields, but recently it has been in particular demand in deep learning, which requires large amounts of training data. It is well known that a deep learning model does not reach the required performance when training data is insufficient, yet it is very difficult to collect large amounts of training data in the domain of each of the many deep learning models used across different domains. In such cases, image conversion can turn a large amount of image data already acquired in another domain into training data of the domain suited to the deep learning model, and can thereby readily relieve the shortage of training data.

However, existing image conversion techniques convert an image by switching the style of the entire image to a style corresponding to the target domain, without considering the class-dependent characteristics of each object included in the image obtained in the source domain. As a result, semantic distortion occurs for each object in the image, and an unnatural target-domain image is obtained. Such semantic distortion degrades the training of a deep learning model and, in some cases, can cause object recognition errors.

Korean Registered Patent No. 10-2229572 (registered on March 12, 2021)

An object of the present invention is to provide an image conversion apparatus and method capable of obtaining a natural target-domain image by minimizing the semantic distortion of each object included in the converted image.

Another object of the present invention is to provide an image conversion apparatus and method capable of obtaining an image in the style of the target domain while maintaining the characteristics of each object in the image, by applying to each object the corresponding local style among a plurality of local styles that are stored in a memory in advance, divided by class, according to the class of that object.

To achieve the above object, an image conversion apparatus according to an embodiment of the present invention includes: an encoding unit that receives an input image obtained in a source domain, performs a neural network operation to obtain a content descriptor representing the structural features of the input image, and obtains a plurality of class-wise content descriptors by dividing the content descriptor according to the class of each object included in the input image; a memory storing a plurality of items, at least one for each of a plurality of classes, each item containing a matched pair of one item key representing the representative structural characteristics of objects of the corresponding class and a style value representing the representative stylistic characteristics of that class in a target domain different from the source domain; a target style generation unit that calculates read weights according to the similarity between each class-wise content descriptor and the item keys of the items of the corresponding class among the plurality of items stored in the memory, and obtains a target style descriptor by weighting the style values matched to the item keys with the calculated read weights; and an output image generation unit that receives the content descriptor and the target style descriptor and performs a neural network operation to generate an output image in the target domain.

The encoding unit may include: an object identification unit, implemented as a pre-trained artificial neural network, that performs a neural network operation on the input image to obtain an object identification descriptor identifying the object region and class of each object included in the input image; a content descriptor acquisition unit, implemented as a pre-trained artificial neural network, that performs a neural network operation on the input image to obtain the content descriptor representing the structural features of the input image; and a content clustering unit that uses the object identification descriptor to divide the object regions of the content descriptor by class and thereby obtain the plurality of class-wise content descriptors.
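The division performed by the content clustering unit can be pictured with a minimal sketch. It assumes, purely for illustration, that the content descriptor is an H x W grid of feature vectors and that the object identification descriptor has already been turned into an H x W grid of class ids; the function name `cluster_by_class` and the plain-list data layout are hypothetical and do not come from the specification.

```python
def cluster_by_class(content, labels):
    """Group per-pixel content vectors by the class id at the same position.

    content: H x W grid (list of rows) of feature vectors.
    labels:  H x W grid of class ids (assumed layout, same shape as content).
    Returns a dict mapping class id -> list of (row, col, vector), so each
    vector can later be placed back at its original pixel position.
    """
    clusters = {}
    for r, (feature_row, label_row) in enumerate(zip(content, labels)):
        for c, (vector, class_id) in enumerate(zip(feature_row, label_row)):
            clusters.setdefault(class_id, []).append((r, c, vector))
    return clusters


if __name__ == "__main__":
    content = [[[1.0], [2.0]], [[3.0], [4.0]]]
    labels = [[0, 1], [1, 1]]
    for class_id, members in sorted(cluster_by_class(content, labels).items()):
        print(class_id, members)
```

Keeping the pixel coordinates with each vector is a design choice of this sketch: it lets a later stage write per-pixel results back to the correct location.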

The target style generation unit may include: a read similarity calculation unit that reads, from among the plurality of items stored in the memory, the items of the class corresponding to each of the plurality of class-wise content descriptors, and calculates, in a predetermined manner, the similarity between each pixel content descriptor (the per-pixel element of a class-wise content descriptor) and the item key of each read item; a read weight calculation unit that calculates, in a predetermined manner and according to the calculated similarity, read weights indicating the importance of each of the at least one item key of the corresponding class to the pixel content descriptor; a target style descriptor acquisition unit that obtains a pixel target style descriptor, representing the target style per pixel, as the sum of the at least one style value matched to the item keys of the items of the corresponding class weighted by the corresponding read weights, and obtains the target style descriptor by placing the pixel target style descriptors for all classes at their pixel positions; and a target style combining unit that combines the content descriptor with the target style descriptor and outputs the combined target style descriptor to the output image generation unit.
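The read operation described above can be sketched as follows. The text only says that the similarity and the read weights are calculated "in a predetermined manner"; this sketch assumes cosine similarity and a softmax over the items of the class, which is a common choice for memory networks but is not confirmed here. The names `cosine`, `softmax` and `read_style` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_style(pixel_content, items):
    """Read a pixel target style from the items of the pixel's class.

    pixel_content: content vector of one pixel.
    items: list of (item_key, style_value) pairs of the matching class.
    Returns (read_weights, pixel_target_style).
    """
    similarities = [cosine(pixel_content, key) for key, _ in items]
    weights = softmax(similarities)  # read weights over the class's items
    dim = len(items[0][1])
    style = [
        sum(w * value[d] for w, (_, value) in zip(weights, items))
        for d in range(dim)
    ]
    return weights, style


if __name__ == "__main__":
    items = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
    print(read_style([1.0, 0.0], items))
```

Because only the items of the pixel's own class are passed in, a pixel belonging to one class never mixes in style values stored for another class, which is the class-aware aspect of the read.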

The encoding unit may further include a class style descriptor extraction unit that, during training, receives at least one of the input images obtained in the source domain and the target domain, performs a neural network operation on each received input image to obtain a style descriptor representing the style features of the source domain of that input image, and uses the object identification descriptor to divide the object regions of the style descriptor by class, thereby obtaining a plurality of class-wise style descriptors.

The image conversion apparatus may further include an update unit that updates the item key and the style value included in each of the plurality of items stored in the memory, using the plurality of class-wise content descriptors and the plurality of class-wise style descriptors obtained for each of the at least one input image received during training.

The update unit may include: an update similarity calculation unit that reads, from among the plurality of items stored in the memory, the items of the class corresponding to each of the plurality of class-wise content descriptors of each received input image, and calculates, in a predetermined manner, the similarity between each pixel content descriptor of the class-wise content descriptor and the item key of each read item; an update weight calculation unit that calculates, in a predetermined manner and according to the calculated similarity, update weights indicating the importance of each of the at least one item key of the corresponding class to the pixel content descriptor; and an update value calculation unit that calculates an updated item key by adding the pixel content descriptors, weighted by the update weights and summed, to each item key of the at least one item of the corresponding class, and calculates an updated style value by adding the pixel target style descriptors, weighted by the update weights and summed, to the corresponding style value.
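A minimal sketch of this update follows. It assumes cosine similarity and, for each item, update weights normalised with a softmax over the pixels of the class; both are illustrative assumptions, since the text leaves the calculation "predetermined". Any renormalisation of the updated keys and values is also omitted. The names `update_items`, `cosine` and `softmax` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_items(pixel_contents, pixel_styles, items):
    """Return updated (item_key, style_value) pairs for one class.

    pixel_contents: content vectors of the pixels belonging to the class.
    pixel_styles:   style vectors of the same pixels, in the same order.
    items:          current (item_key, style_value) pairs of the class.
    """
    updated = []
    for key, value in items:
        similarities = [cosine(c, key) for c in pixel_contents]
        betas = softmax(similarities)  # update weights over pixels (assumption)
        new_key = [
            k + sum(b * c[d] for b, c in zip(betas, pixel_contents))
            for d, k in enumerate(key)
        ]
        new_value = [
            v + sum(b * s[d] for b, s in zip(betas, pixel_styles))
            for d, v in enumerate(value)
        ]
        updated.append((new_key, new_value))
    return updated


if __name__ == "__main__":
    items = [([1.0, 0.0], [0.0, 0.0])]
    print(update_items([[1.0, 0.0]], [[0.5, 0.5]], items))
```

The same update weights are applied to the key and to the style value, so a key and its matched style value drift together toward the pixels that resemble the key.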

The image conversion apparatus may further include a training unit that, during training: calculates a key loss (L_k) and a style loss (L_v) such that the pixel content descriptors (c_p^x, c_p^y) and the pixel target style descriptors (s_p^x, s_p^y) corresponding to the input images (I^x, I^y) of the different domains move closer to the item key (k_p+) and the style values (v_p+^x, v_p+^y) of one of the items stored in the memory, while moving farther from the item keys (k_n) and the style values (v_n^x, v_n^y) of the remaining items; calculates a self-reconstruction loss (L^self) and a cycle reconstruction loss (L^cyc) for training the output image generation unit, which is implemented as an artificial neural network; calculates a total loss as a weighted sum, in a predetermined manner, of the key loss (L_k), the style loss (L_v), the self-reconstruction loss (L^self) and the cycle reconstruction loss (L^cyc); and performs training by backpropagating the calculated total loss.
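The behaviour required of the key loss and the style loss (closer to one item, farther from the rest) matches that of a contrastive loss. The sketch below uses a softmax-based contrastive form with a dot-product score and a temperature; this specific form, the temperature, and the unit loss weights are assumptions chosen for illustration, not the formulas of the specification. The names `contrastive_loss` and `total_loss` are hypothetical.

```python
import math


def contrastive_loss(query, positive, negatives, temperature=1.0):
    """Softmax-based contrastive loss (assumed form).

    Small when `query` scores higher against `positive` than against
    every vector in `negatives`; large otherwise.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    pos = math.exp(dot(query, positive) / temperature)
    neg = sum(math.exp(dot(query, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def total_loss(l_key, l_style, l_self, l_cyc, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four losses; the weights are hypothetical."""
    w_key, w_style, w_self, w_cyc = weights
    return w_key * l_key + w_style * l_style + w_self * l_self + w_cyc * l_cyc


if __name__ == "__main__":
    # Key loss for one pixel: query is c_p, positive is k_p+, negatives are k_n.
    l_key = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
    # Style loss for the same pixel: query is s_p, positive is v_p+.
    l_style = contrastive_loss([0.0, 1.0], [0.0, 1.0], [[1.0, 0.0]])
    print(total_loss(l_key, l_style, l_self=0.2, l_cyc=0.3))
```

In a real training loop these scalars would be tensors in an automatic-differentiation framework so that the total loss can be backpropagated; plain floats are used here only to keep the sketch self-contained.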

To achieve the above object, an image conversion method according to another embodiment of the present invention includes: receiving an input image obtained in a source domain, performing a neural network operation to obtain a content descriptor representing the structural features of the input image, and obtaining a plurality of class-wise content descriptors by dividing the content descriptor according to the class of each object included in the input image; calculating read weights according to the similarity between each class-wise content descriptor and the item keys of the items of the corresponding class among a plurality of items that are stored in a memory in advance, at least one for each of a plurality of classes, each item containing a matched pair of one item key representing the representative structural characteristics of objects of the corresponding class and a style value representing the representative stylistic characteristics of that class in a target domain different from the source domain, and obtaining a target style descriptor by weighting the style values matched to the item keys with the calculated read weights; and receiving the content descriptor and the target style descriptor and performing a neural network operation to generate an output image in the target domain.

Accordingly, the image conversion apparatus and method according to the embodiments of the present invention recognize the class of each object included in an input image obtained in the source domain and, according to the recognized class, apply to each object the corresponding local style among a plurality of local styles stored in the memory divided by class, thereby converting the image into an image of the target domain. An image converted into the style of the target domain can thus be obtained while the semantic characteristics of each object remain well expressed. Semantic distortion of each object included in the image is therefore minimized, and a natural converted target-domain image can be obtained.

FIG. 1 is a diagram for explaining the conceptual difference between an existing image conversion scheme and the image conversion scheme according to this embodiment.
FIG. 2 shows a schematic structure of an image conversion apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining the operation of the image conversion apparatus of FIG. 2.
FIG. 4 shows an example of a detailed configuration of the first encoder of FIG. 2.
FIG. 5 is a diagram for explaining the operation of the content clustering unit of FIG. 4.
FIG. 6 shows an example of a detailed configuration of the first target style generation unit of FIG. 2.
FIG. 7 is a diagram for explaining the operation of the target style generation unit and the update unit.
FIG. 8 shows an example of a detailed configuration of the update unit of FIG. 2.
FIG. 9 shows a result of comparing the performance of the image conversion apparatus of this embodiment with that of existing image conversion apparatuses.
FIG. 10 shows an image conversion method according to an embodiment of the present invention.
FIG. 11 shows an example of a training step for the image conversion method of FIG. 10.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference must be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the contents described in those drawings.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and is not limited to the described embodiments. To describe the present invention clearly, parts irrelevant to the description are omitted, and the same reference numerals in the drawings denote the same members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "... device", "module" and "block" used in the specification denote a unit that processes at least one function or operation, which may be implemented as hardware, software, or a combination of hardware and software.

FIG. 1 is a diagram for explaining the conceptual difference between an existing image conversion scheme and the image conversion scheme according to this embodiment.

In FIG. 1, (a) shows an existing image conversion scheme and (b) shows the image conversion scheme of this embodiment. Looking first at (a): when an input image (I) in the source domain is received, the encoder (E_a) encodes the entire input image (I) at once, without distinguishing the content it contains, that is, the objects (o₁, o₂). An output image (I_a) is then obtained by applying to the encoded feature map (f_a) a global style that represents the overall style of the target domain. Because the global style is applied uniformly without distinguishing the objects (o₁, o₂) in the input image (I), the characteristics of the objects (o_a1, o_a2) are not well expressed in the converted output image (I_a), unlike in the input image (I). For example, suppose the source domain of the input image (I) is daytime and the target domain of the output image (I_a) is nighttime. If a single common style is applied to the various objects in the input image (I) that have different characteristics, such as buildings, roads, vehicles, trees, sky and people in clear weather, the characteristics of each object are lost and semantic distortion can occur. That is, objects may not be easy to identify in the output image (I_a) obtained in the target domain.

Such semantic distortion in the output image (I_a) can cause problems: for example, a device trained on the output image (I_a), such as an autonomous driving device, may fail to recognize surrounding objects properly and cause an accident.

In (b), by contrast, when the input image (I) is received, the encoder (E_b) recognizes and distinguishes the content of the input image (I), that is, the class of each object (o₁, o₂), when extracting features. An output image (I_b) is then obtained by applying to the region of each object (o₁, o₂), divided by class, a different local style obtained in advance for that class. Because every object (o₁, o₂) is converted with a style specific to its class, semantic distortion is minimized and the characteristics of each object (o_b1, o_b2) in the output image (I_b) are well expressed. A realistic, natural output image (I_b) can therefore be obtained, and because the semantic characteristics of each object (o_b1, o_b2) are well preserved in it, another deep learning model can be trained accurately even when it uses the target-domain output image (I_b) as training data.

FIG. 2 shows a schematic structure of an image conversion apparatus according to an embodiment of the present invention, and FIG. 3 is a diagram for explaining the operation of the image conversion apparatus of FIG. 2.

In actual use, the image conversion apparatus according to this embodiment may be configured to receive a single input image of a predetermined source domain and output an output image of the target domain. In some cases, however, it may be configured to receive at least one of two images of mutually different domains as an input image and convert it into an image of the opposite domain. Accordingly, FIG. 2 illustrates, as an example, an image conversion apparatus that can receive images of two different domains and convert each into an image of the opposite domain.

As shown in FIG. 2, the image conversion apparatus according to this embodiment may include an image acquisition unit (100), an encoding unit (200), a target style generation unit (300), an output image generation unit (400), and a memory (500).

The image acquisition unit acquires input images (I^x, I^y) of the source domains. In FIG. 2, the first image acquisition unit (110) acquires a first input image (I^x) of a first source domain, and the second image acquisition unit (120) acquires a second input image (I^y) of a second source domain. Here, the first source domain is the target domain of the second output image, into which the second input image (I^y) is converted, and the second source domain is the target domain of the first output image, into which the first input image (I^x) is converted. That is, the image acquisition unit acquires input images (I^x, I^y) whose domains are to be converted crosswise into each other.

As an example, the first input image (I^x) may be a daytime image to be converted into a nighttime image, in which case the second input image (I^y) may be a nighttime image to be converted into a daytime image.

The encoding unit (200) performs a neural network operation, in a pre-trained manner, on the input images (I^x, I^y) acquired by the image acquisition unit, and obtains class-wise content descriptors (c_k^x, c_k^y) representing the structural features of each object in the input images (I^x, I^y) according to its class, and class-wise style descriptors (s_k^x, s_k^y) representing the domain features according to class.

Specifically, the encoding unit (200) obtains content descriptors (c^x, c^y) representing the structural features of the input images (I^x, I^y) and style descriptors (s^x, s^y) representing their domain features, and identifies the object region and class of each object included in the input images (I^x, I^y). It then divides the content descriptors (c^x, c^y) and the style descriptors (s^x, s^y) according to the object region of each object of each class, obtaining a plurality of class-wise content descriptors (c_k^x, c_k^y) and a plurality of class-wise style descriptors (s_k^x, s_k^y).

The encoding unit (200) may include a first encoder (210) for obtaining first class-wise content descriptors (c_k^x) and first class-wise style descriptors (s_k^x) from the first input image (I^x), and a second encoder (220) for obtaining second class-wise content descriptors (c_k^y) and second class-wise style descriptors (s_k^y) from the second input image (I^y).

Here, the first encoder (210) and the second encoder (220) differ only in that they receive the first input image (I^x) and the second input image (I^y), which are obtained in different source domains; they may otherwise be implemented with the same configuration.

FIG. 4 shows an example of a detailed configuration of the first encoder of FIG. 2, and FIG. 5 is a diagram for explaining the operation of the content clustering unit of FIG. 4.

As described above, the first encoder 210 and the second encoder 220 have the same configuration. For convenience of explanation, FIG. 4 therefore shows only the configuration of the first encoder 210, which receives the first input image (I^x) and obtains the first class-wise content descriptors (c_k^x) and the first class-wise style descriptors (s_k^x). In the following, identifiers such as "first" and "second", which distinguish the components and signals of the first encoder 210 from those of the second encoder 220, are omitted for convenience.

Referring to FIG. 4, the first encoder 210 may include an object identification unit 211, a content descriptor acquisition unit 212, a content clustering unit 213, a style descriptor acquisition unit 214, and a style clustering unit 215.

The object identification unit 211, implemented as a pre-trained artificial neural network, receives the first input image (I^x) and performs a neural network operation on it to determine the object region containing each object in the first input image (I^x) and to identify the class of each object, thereby obtaining object identification descriptors (f_k^x). Here, an object identification descriptor (f_k^x) specifies the class of an object in the first input image (I^x) together with the position and size of a bounding box indicating the region that contains the object of that class.

The content descriptor acquisition unit 212 is implemented as a pre-trained artificial neural network. It receives the first input image (I^x) and performs a neural network operation on it to obtain a content descriptor (c^x) representing the overall structural features of the first input image (I^x).

The content clustering unit 213 then separates and clusters the pixels of the content descriptor (c^x) that fall within the object region of each class specified by the object identification descriptors (f_k^x), thereby obtaining as many first class-wise content descriptors (c_k^x), where k ∈ {1, …, K}, as the number of identified classes (K). As shown in FIG. 5, when the content descriptor (c^x) is supplied together with the object identification descriptors (f_k^x) indicating the object region of each class, the content clustering unit 213 groups the pixels of the object regions within the content descriptor (c^x) into K classes (class-wise clustering) and thus obtains K first class-wise content descriptors (c_k^x).
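The class-wise clustering described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the bounding boxes have already been converted into a per-pixel class label, and the names `cluster_by_class`, `content`, and `labels` are hypothetical.

```python
from collections import defaultdict

def cluster_by_class(features, class_map):
    """Group per-pixel feature vectors by class label.

    features  : list of N feature vectors (one per pixel, each of length d)
    class_map : list of N class indices; None marks a pixel outside every
                object region (an assumption made for this sketch)
    Returns {class index: [(pixel index, feature vector), ...]}.
    """
    clusters = defaultdict(list)
    for p, (vec, k) in enumerate(zip(features, class_map)):
        if k is not None:
            clusters[k].append((p, vec))
    return dict(clusters)

# Toy example: 5 pixels, d = 2 channels, K = 2 classes.
content = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4], [0.5, 0.5]]
labels = [0, 0, 1, 1, None]
class_content = cluster_by_class(content, labels)
```

Keeping the pixel index alongside each vector makes it possible to place per-pixel results back at their original positions later.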

Meanwhile, the style descriptor acquisition unit 214 is also implemented as a pre-trained artificial neural network. It receives the first input image (I^x) and performs a neural network operation on it to obtain a style descriptor (s^x) representing the overall domain features of the first input image (I^x).

The style clustering unit 215 then separates and clusters the pixels of the style descriptor (s^x) that fall within the object region of each class specified by the object identification descriptors (f_k^x), thereby obtaining as many first class-wise style descriptors (s_k^x) as the number of identified classes (K).

Each of the K first class-wise content descriptors (c_k^x) and the K first class-wise style descriptors (s_k^x) is obtained with dimensions determined by the number of pixels of the content descriptor (c^x) and the style descriptor (s^x) that belong to the corresponding class (k).

For ease of understanding, the content descriptor acquisition unit 212 and the content clustering unit 213 are shown separately here, as are the style descriptor acquisition unit 214 and the style clustering unit 215. However, the content descriptor acquisition unit 212 and the content clustering unit 213 may be integrated into a class content descriptor extraction unit, and the style descriptor acquisition unit 214 and the style clustering unit 215 may be integrated into a class style descriptor extraction unit.

In addition, the style descriptor acquisition unit 214 and the style clustering unit 215 serve to extract, during training, the style of each object class that is to be stored in the memory 500 of the image conversion apparatus, and may be omitted once training is complete.

Referring again to FIGS. 2 and 3, the memory 500 stores in advance a plurality of items, at least one for each of the K classes. Each item contains, matched to one another, one item key (k) representing the structural representative feature of objects of the class and two style values (v^x, v^y) representing the stylistic representative features of that class in the two different domains.

The memory 500 may store M items, and M_k items may be allocated to each class k. That is, M_k items may be allocated to each of the K classes. The number of items (M_k) allocated to a class may differ from class to class. In FIG. 3, for example, three items are allocated to the first class (Class 1), whereas two items each are allocated to the second class (Class 2) and the K-th class (Class K). In other words, a total of M items may be stored in the memory 500, with at least one item allocated to each of the K classes.

In each of the M items, the item key (k) is a content representative value that represents the structure of the corresponding class, that is, the representative feature of the content of that class. Of the two style values (v^x, v^y), the first style value (v^x) is a style representative value representing the first domain of the corresponding class, that is, the first-style representative feature of that class, and the second style value (v^y) is a style representative value representing the second-style representative feature of that class.

In other words, the memory 500 stores M items, each containing an item key (k) representing the structural representative feature of the corresponding class and two style values (v^x, v^y) representing the different style representative features of the two domains. Here, the item key (k) and each of the two style values (v^x, v^y) is a one-dimensional vector whose length (d) equals the channel length (d) of the content descriptors (c^x, c^y) and the style descriptors (s^x, s^y).
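The layout of the memory can be pictured with the following sketch. The class `MemoryItem` and the helpers `build_memory` and `items_of_class` are hypothetical names introduced for illustration, and the zero initialisation is only a placeholder, since the actual keys and values are obtained during training.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MemoryItem:
    class_id: int          # class the item is allocated to
    key: List[float]       # item key k, length d
    value_x: List[float]   # style value v^x (first domain), length d
    value_y: List[float]   # style value v^y (second domain), length d

def build_memory(items_per_class, d):
    """Create M = sum(items_per_class) items; items_per_class[k] is M_k."""
    memory = []
    for class_id, m_k in enumerate(items_per_class):
        for _ in range(m_k):
            memory.append(MemoryItem(class_id, [0.0] * d, [0.0] * d, [0.0] * d))
    return memory

def items_of_class(memory, class_id):
    """Return the M_k items allocated to one class."""
    return [item for item in memory if item.class_id == class_id]

# Allocation as in the FIG. 3 example: 3, 2 and 2 items for three classes.
memory = build_memory([3, 2, 2], d=4)
```

Storing the class index on each item keeps the read and update steps restricted to the items of the matching class.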

In this embodiment, M_k items are allocated to each class because, depending on the class, a single structural representative feature or a single style representative feature is frequently insufficient to express all the varied objects that the class contains. In a vehicle class, for example, objects with very different shapes, such as trucks, buses, and passenger cars, may all be classified into the same class. Even though the shapes of the objects in the vehicle class differ, the fact that each vehicle object was identified as the same class means that a common structural feature was detected, so no serious problem arises as long as the objects are processed within a single domain. However, even for objects from which a common structural feature would be detected in another domain, so that they would be identified as the same class, that feature may not be derived correctly when the domain changes, and the objects may then fail to be identified as belonging to the same class.

To prevent this problem, this embodiment allows a plurality of items to be allocated to each class, so that the varied features of different objects within the same class can each be represented. The number of items allocated to each class may be specified in advance in consideration of the characteristics of the objects belonging to that class.

The item key (k) and the two style values (v^x, v^y) of each item stored in the memory 500 are obtained and stored in advance during training by the update unit 600, which is described later.

The target style generation unit 300 receives the class-wise content descriptors (c_k^x, c_k^y) from the encoding unit 200. Among the plurality of items stored in the memory 500, it takes the items of the class corresponding to each received class-wise content descriptor, calculates read weights (α_{p,n}^x, α_{p,n}^y) according to the similarity with the item keys (k) of those items, and obtains target style descriptors by weighting the target style values (v^y, v^x) matched to the item keys with the calculated read weights. It then combines the content descriptors (c^x, c^y) with the obtained target style descriptors and outputs combined target style descriptors, each consisting of a content descriptor paired with its target style descriptor.

The target style generation unit 300 may include a first target style generation unit 310, which receives the first class-wise content descriptors (c_k^x) and outputs a first combined target style descriptor (the content descriptor c^x paired with its target style descriptor), and a second target style generation unit 320, which receives the second class-wise content descriptors (c_k^y) and outputs a second combined target style descriptor (the content descriptor c^y paired with its target style descriptor).

FIG. 6 shows an example of a detailed configuration of the first target style generation unit of FIG. 2, and FIG. 7 is a diagram for explaining the operations of the target style generation unit and the update unit.

For convenience of explanation, FIG. 6 also shows only the first target style generation unit 310; the second target style generation unit 320 has the same structure.

Referring to FIGS. 6 and 7, the first target style generation unit 310 may include a read similarity calculation unit 311, a read weight calculation unit 312, a target style descriptor acquisition unit 313, and a target style combining unit 314.

The read similarity calculation unit 311 receives the K class-wise content descriptors (c_k^x) and, according to the class of each received descriptor, reads the item key (k) contained in each of the M_k items allocated to the corresponding class among the plurality of items stored in the memory 500. In doing so, the read similarity calculation unit 311 extracts pixel-level pixel content descriptors (c_{k,p}^x) from each of the K class-wise content descriptors (c_k^x). In the following, the subscript (k) identifying the class is dropped for convenience. Each class-wise content descriptor thus contains as many pixel content descriptors (c_p^x) as there are pixels in that class (N_k). The read similarity calculation unit 311 calculates, according to Equation 1, the similarity d(c_p^x, k_n) between each pixel content descriptor (c_p^x) of the class-wise content descriptor and each item key (k_n) allocated to that class.

Here, ‖·‖ denotes the L2 norm, and the superscript T denotes the transpose.

The read similarity calculation unit 311 therefore calculates N_k × M_k similarities for each class.
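The N_k × M_k similarity table can be illustrated with the sketch below. Equation 1 is not reproduced in this text, so the cosine similarity used here is an assumption, suggested only by the mention of the L2 norm and the transpose; the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Assumed form of d(c_p, k_n): dot product divided by the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def read_similarities(pixel_descriptors, item_keys):
    """Return an N_k x M_k table: one row per pixel, one column per item key."""
    return [[cosine_similarity(c, k) for k in item_keys] for c in pixel_descriptors]

# Toy example: N_k = 3 pixel content descriptors, M_k = 2 item keys, d = 2.
pixels = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
sim = read_similarities(pixels, keys)
```

With this assumed measure, a pixel whose descriptor points in the same direction as a key scores 1 regardless of magnitude.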

Based on the M_k similarities d(c_p^x, k_n) that the read similarity calculation unit 311 calculated for each of the N_k pixel content descriptors (c_p^x), the read weight calculation unit 312 calculates, according to Equation 2, a read weight (α_{p,n}^x) for each of the M_k item keys (k_n) of the class.

According to Equation 2, the read weight (α_{p,n}^x) expresses, by means of the similarity, the importance of each of the one or more item keys of the class corresponding to the pixel content descriptor.

Once the M_k read weights (α_{p,n}^x) for the M_k item keys (k_n) of the class have been calculated, the target style descriptor acquisition unit 313 reads from the memory 500, out of the two style values (v_n^x, v_n^y) matched to each of the M_k item keys (k_n), the M_k style values for the target domain (here v_n^y). It then forms the weighted sum of the M_k read style values (v_n^y) with the corresponding M_k read weights (α_{p,n}^x), as in Equation 3, to obtain a pixel target style descriptor for each pixel of the class.
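The read step can be sketched as follows. The weighted sum of style values follows the description of Equation 3, whereas the softmax over the items of a class is only an assumed form of Equation 2, which is not reproduced in this text; `softmax` and `read_target_style` are hypothetical names.

```python
import math

def softmax(scores):
    """Numerically stable softmax (assumed form of the read weights)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read_target_style(similarities, target_values):
    """similarities: N_k x M_k table; target_values: M_k vectors of length d.

    Returns one pixel target style descriptor (length d) per pixel.
    """
    d = len(target_values[0])
    styles = []
    for row in similarities:
        alpha = softmax(row)  # read weights of this pixel over the M_k items
        styles.append([sum(a * v[j] for a, v in zip(alpha, target_values))
                       for j in range(d)])
    return styles

# Toy example: 2 pixels, M_k = 2 items, d = 2.
sim = [[1.0, 0.0], [0.5, 0.5]]
values_y = [[1.0, 0.0], [0.0, 1.0]]
pixel_styles = read_target_style(sim, values_y)
```

Under this assumption the weights of a pixel sum to one, so each pixel target style descriptor is a convex combination of the style values of its class.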

Once the pixel-level pixel target style descriptors have been obtained for each of the classes, the target style combining unit 314 places the pixel target style descriptors of all classes at their respective pixel positions to obtain the target style descriptor, and concatenates the target style descriptor with the content descriptor (c^x) to output the first combined target style descriptor.
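The placement and concatenation performed by the combining unit might look like the sketch below. It assumes each pixel target style descriptor carries its pixel index, and it fills pixels outside every object region with zeros, which the text does not specify; the function names are hypothetical.

```python
def assemble_target_style(num_pixels, d, per_class_styles):
    """Place per-class pixel styles back at their pixel positions.

    per_class_styles: {class index: [(pixel index, style vector), ...]}
    Pixels not covered by any class keep a zero vector (an assumption).
    """
    style_map = [[0.0] * d for _ in range(num_pixels)]
    for entries in per_class_styles.values():
        for p, vec in entries:
            style_map[p] = list(vec)
    return style_map

def concatenate(content, style_map):
    """Channel-wise concatenation of content and target style per pixel."""
    return [c + s for c, s in zip(content, style_map)]

# Toy example: 3 pixels, d = 2; pixel 1 belongs to no class.
content = [[0.1, 0.9], [0.7, 0.3], [0.5, 0.5]]
per_class = {0: [(0, [1.0, 0.0])], 1: [(2, [0.0, 1.0])]}
style_map = assemble_target_style(len(content), 2, per_class)
combined = concatenate(content, style_map)
```

Each combined entry has 2d channels, content first and target style second.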

In other words, the target style generation unit 300 takes the class-wise content descriptors (c_k^x), which are the structural features of the objects extracted from the input image (I^x) of the source domain, and the item keys (k_n), which are the representative structural features of the corresponding class stored in advance in the memory. It applies weights derived from the pixel-level similarity between them to the style values (v_n^y), which are the representative target styles of that class, and thereby obtains the target style descriptor, that is, the style of the target domain that corresponds to the structural features of the objects in the source domain.

Referring again to FIGS. 2 and 3, the output image generation unit 400 receives the combined target style descriptors and generates output images in the style corresponding to the target domain.

The output image generation unit 400 may include a first output image generation unit 410 and a second output image generation unit 420. The first output image generation unit 410 is implemented as a pre-trained artificial neural network. It receives the first combined target style descriptor and performs a neural network operation on it to generate a first output image in the style corresponding to the second domain, which is its target domain.

Likewise, the second output image generation unit 420 is implemented as a pre-trained artificial neural network. It performs a neural network operation on the received second combined target style descriptor to generate a second output image in the style corresponding to the first domain, which is its target domain.

As a result, the image conversion apparatus according to this embodiment shown in FIG. 2 can obtain a first output image in the style of the second domain from the first input image (I^x) acquired in the first domain, and a second output image in the style of the first domain from the second input image (I^y) acquired in the second domain. That is, input images (I^x, I^y) acquired in a source domain can be converted into output images of the target domain.

In particular, by using the target-domain styles stored in advance in the memory 500 separately for each class, a different style is applied to each object in the input images (I^x, I^y) according to its class, so that highly realistic output images of the target domain can be obtained.

As described above, FIG. 2 shows an image conversion apparatus capable of converting images between different domains in both directions. If images of one specific domain are only to be converted into images of another designated domain, every component provided solely for generating the second output image from the second input image (I^y) may be omitted. That is, the second image acquisition unit 120, the second encoder 220, the second target style generation unit 320, and the second output image generation unit 420 may be omitted.

Meanwhile, because the image conversion apparatus of this embodiment includes a plurality of artificial neural networks, it must be trained before it can be used. Moreover, because the apparatus converts the objects of each class into the target style on the basis of the items stored in the memory, the items stored in the memory 500 must be updated in advance through training. The image conversion apparatus of this embodiment therefore includes an update unit 600 for updating the items stored in the memory 500. The update unit 600 may also be retained after training so that the items stored in the memory 500 can continue to be updated.

Similarly to the target style generation unit 300, the update unit 600 receives at least one of the first and second class-wise content descriptors (c_k^x, c_k^y) from the encoding unit 200. Among the plurality of items stored in the memory 500, it takes the items of the class corresponding to each received class-wise content descriptor and calculates update weights (β_{p,n}^x, β_{p,n}^y) according to the similarity with the item keys (k_n) of those items. Based on the calculated update weights (β_{p,n}^x, β_{p,n}^y) and the class-wise style descriptors (s_k^x, s_k^y) supplied by the encoding unit 200, it obtains update values for each item key (k_n) and for the target style values (v_n^y, v_n^x) matched to that item key.

FIG. 8 shows an example of a detailed configuration of the update unit of FIG. 2.

Referring to FIGS. 7 and 8, the update unit 600 may include an update similarity calculation unit 611, an update weight calculation unit 612, and an update value calculation unit 613.

Like the read similarity calculation unit 311, the update similarity calculation unit 611 receives at least one of the K first and second class-wise content descriptors (c_k^x, c_k^y), identifies the class of each received descriptor, and reads the item key (k) contained in each of the M_k items allocated to the corresponding class among the plurality of items stored in the memory 500. The update similarity calculation unit 611 extracts pixel-level pixel content descriptors (c_p^x, c_p^y) from each of the K class-wise content descriptors (c_k^x, c_k^y).

It then calculates, according to Equation 4, the similarities d(c_p^x, k_n) and d(c_p^y, k_n) between each pixel content descriptor (c_p^x, c_p^y) and each item key (k_n) allocated to the class.

Based on the similarities d(c_p^x, k_n) and d(c_p^y, k_n) calculated by the update similarity calculation unit 611, the update weight calculation unit 612 calculates, according to Equation 5, update weights (β_{p,n}^x, β_{p,n}^y) for each of the M_k item keys (k_n) of the class in each domain.

Once the M_k update weights (β_{p,n}^x, β_{p,n}^y) for the M_k item keys (k_n) of the class have been calculated in each domain, the update value calculation unit 613 updates the M_k item keys (k_n) in the memory 500 according to Equation 6 to obtain updated keys.

It then reads the two style values (v_n^x, v_n^y) matched to each item key (k_n) and receives the class-wise style descriptors (s_k^x, s_k^y) from the encoding unit 200. Thereafter, as in Equation 7, the pixel target style descriptors (s_p^x, s_p^y) extracted pixel by pixel from each class-wise style descriptor (s_k^x, s_k^y) are weighted by the corresponding update weights (β_{p,n}^x, β_{p,n}^y) and summed, and the result is added to the two read style values (v_n^x, v_n^y) to obtain updated style values.

When the updated key and the updated style values have been obtained and the item has thus been updated, the updated item is stored back in the memory 500.

In other words, when input images (I^x, I^y) of different domains are supplied and the class-wise content descriptors (c_k^x, c_k^y), which are the structural features of the objects in the respective domains, are obtained, the update unit 600 calculates weights (β_{p,n}^x, β_{p,n}^y) according to the similarity between the content descriptors (c_k^x, c_k^y) obtained in each domain and the item keys (k_n), which are the representative structural features of the corresponding class already stored in the memory. It applies these weights pixel by pixel to the currently obtained class-wise content descriptors (c_k^x, c_k^y) and adds the result to the existing item key (k_n) to obtain the updated item key. Similarly, it applies the weights (β_{p,n}^x, β_{p,n}^y) pixel by pixel to the currently obtained class-wise style descriptors (s_k^x, s_k^y) and adds the result to each of the existing style values (v_n^x, v_n^y) to obtain the updated style values.
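The additive update summarised above can be sketched as follows. Equations 5 to 7 are not reproduced in this text, so the update weights are simply taken as given, and any normalisation those equations may include is omitted; `update_item` and the toy values are hypothetical.

```python
def update_item(key, value, pixel_contents, pixel_styles, beta_column):
    """Additively update one item key and one of its style values.

    beta_column[p] is the update weight of pixel p for this item (given).
    The weighted sums over the pixels are added to the stored key and value.
    """
    d = len(key)
    new_key = [key[j] + sum(b * c[j] for b, c in zip(beta_column, pixel_contents))
               for j in range(d)]
    new_value = [value[j] + sum(b * s[j] for b, s in zip(beta_column, pixel_styles))
                 for j in range(d)]
    return new_key, new_value

# Toy example: one item of a class with 2 pixels in domain x, d = 2.
key = [1.0, 0.0]
value_x = [0.0, 0.0]
contents = [[1.0, 0.0], [0.0, 1.0]]   # pixel content descriptors c_p^x
styles = [[2.0, 2.0], [4.0, 0.0]]     # pixel style descriptors s_p^x
beta = [0.75, 0.25]                   # update weights for this item
new_key, new_value_x = update_item(key, value_x, contents, styles, beta)
```

The same call with the descriptors of the other domain would update v^y, while the key is shared between both domains.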

The update unit 600 described above can update the items stored in the memory 500, but it does not train the artificial neural networks. In addition, for the items stored in the memory 500 to have mutually different structural and stylistic features within their class and to represent each of those features, the encoding unit 200 must be able to extract the class-wise content descriptors (c_k^x, c_k^y) and the class-wise style descriptors (s_k^x, s_k^y) accordingly.

The image conversion apparatus of this embodiment may therefore further include a learning unit (not shown) for training. The learning unit may be removed once training of the image conversion apparatus is complete.

The learning unit calculates a key loss (L_k) and a style loss (L_v) according to Equations 8 and 9, respectively. These losses are designed so that the N pixel content descriptors (c_p^x, c_p^y) of the class-wise content descriptors (c_k^x, c_k^y) extracted by the encoding unit 200, and the N pixel target style descriptors (s_p^x, s_p^y) of the class-wise style descriptors (s_k^x, s_k^y), each become closer to the item key (k_{p+}) and style values (v_{p+}^x, v_{p+}^y) of one of the M items stored in the memory 500, while becoming more distant from the item keys (k_n) and style values (v_n^x, v_n^y) of the remaining items.

Here, k_{p+}, v_{p+}^x, and v_{p+}^y denote the positive samples, that is, the item key and style values that, among the item keys (k_n) and style values (v_n^x, v_n^y) of the M items stored in the memory, are most similar to the pixel content descriptors (c_p^x, c_p^y) and the pixel target style descriptors (s_p^x, s_p^y).
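Equations 8 and 9 are not reproduced in this text, so their exact form is unknown. The sketch below shows one common contrastive formulation that has the stated behaviour (pulling a descriptor toward its most similar item and pushing it away from the others). The cosine similarity, the temperature `tau`, and the function names are all assumptions, not the patented loss.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def contrastive_loss(query, candidates, tau=0.1):
    """Negative log-softmax score of the most similar candidate.

    The most similar candidate plays the role of the positive sample;
    all other candidates act as negatives. tau is a hypothetical temperature.
    """
    scores = [cosine(query, c) / tau for c in candidates]
    positive = max(scores)
    log_sum = positive + math.log(sum(math.exp(s - positive) for s in scores))
    return log_sum - positive

# Toy example with two item keys (d = 2).
keys = [[1.0, 0.0], [0.0, 1.0]]
loss_close = contrastive_loss([1.0, 0.0], keys)      # matches one key exactly
loss_ambiguous = contrastive_loss([1.0, 1.0], keys)  # equally close to both
```

The same function could be applied to pixel style descriptors and style values to obtain a style loss of the same shape; a descriptor that sits between two items is penalised more than one that clearly matches a single item.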

In addition, to determine whether the target style descriptors are obtained correctly and whether the output image generator 400 generates output images in the required target domain style, the learning unit may calculate a self-reconstruction loss (L^self) and a cyclic reconstruction loss (L^cyc) according to Equations 10 and 11.

Here, G^x() and G^y() are functions representing the neural network operations of the first output image generator 410 and the second output image generator 420, respectively.

As shown in Equation 10, the self-reconstruction loss (L^self) expresses that, when the content descriptors (c^x, c^y) and the target style descriptors of the same domain are input to the first output image generator 410 and the second output image generator 420, the self-reconstructed images output by the two generators must be identical to the input images (I^x, I^y).

The cyclic reconstruction loss (L^cyc) expresses that, when the images obtained by converting the input images (I^x, I^y) into the target domain are fed back as input images, the outputs of the first output image generator 410 and the second output image generator 420 must be identical to the input images (I^x, I^y).
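The two reconstruction constraints can be illustrated with a toy example. This is a sketch under stated assumptions: Equations 10 and 11 are not reproduced here, so the mean absolute (L1) difference is an assumed distance, and `encode`, `to_source`, and `to_target` are hypothetical stand-ins for the encoder and the two output image generators, not the patent's networks.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def self_reconstruction_loss(image, own_style, encode, to_source):
    # Decoding an image's content with a style of its own domain
    # should give back the image itself.
    return l1(to_source(encode(image), own_style), image)

def cyclic_reconstruction_loss(image, own_style, target_style,
                               encode, to_source, to_target):
    translated = to_target(encode(image), target_style)   # source -> target
    restored = to_source(encode(translated), own_style)   # target -> source
    return l1(restored, image)

# Toy stand-ins (assumptions): content = mean-removed pixels, style = brightness.
def encode(image):
    mean = sum(image) / len(image)
    return [p - mean for p in image]

def generate(content, style):
    return [c + style for c in content]

image = [0.2, 0.4, 0.9]
loss_self = self_reconstruction_loss(image, 0.5, encode, generate)
loss_cyc = cyclic_reconstruction_loss(image, 0.5, 0.1, encode, generate, generate)
```

With these idealized stand-ins both losses are essentially zero; with real networks they are the quantities that training drives down.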

Once the key loss (L_k), the style loss (L_v), the self-reconstruction loss (L^self), and the cyclic reconstruction loss (L^cyc) have been calculated, the learning unit computes a total loss (L) as a weighted sum of these losses in a predetermined manner, and backpropagates the total loss to train the artificial neural networks included in the image conversion apparatus of this embodiment.
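As a small illustration of the weighted sum, the sketch below combines four named loss terms. The weight values and the `total_loss` helper are hypothetical; the text only states that the weighting is predetermined.

```python
def total_loss(losses, weights):
    """Weighted sum of the named loss terms."""
    return sum(weights[name] * value for name, value in losses.items())

# Example loss values and weights are made up for illustration.
losses = {"key": 0.5, "style": 0.25, "self": 1.0, "cyc": 2.0}
weights = {"key": 1.0, "style": 1.0, "self": 10.0, "cyc": 10.0}
total = total_loss(losses, weights)
```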

Here, the update unit 600 and the learning unit have been described separately for convenience, but the update unit 600 may also be included in the learning unit.

In the image conversion apparatus of this embodiment, training may be performed such that, for each of the input images (I^x, I^y) continuously applied during the training period, the update unit 600 first updates the memory and the learning unit then backpropagates the loss.

FIG. 9 shows a comparison of the performance of the image conversion apparatus of this embodiment with that of existing image conversion apparatuses.

FIG. 9 shows the results of converting an input image captured in sunny weather into rainy-weather and cloudy-weather images. As shown in FIG. 9, with the conventional image conversion apparatuses the shape of the vehicle object is not rendered properly because of the domain change, whereas the image conversion apparatus of this embodiment applies an individual style to each object according to its characteristics, so the objects remain clearly rendered in the converted image.

FIG. 10 shows an image conversion method according to an embodiment of the present invention.

Referring to FIGS. 2 to 8, the image conversion method of the present invention shown in FIG. 10 first acquires an input image (I^x) of the source domain that is to be converted into an image of the target domain (S11).

Then, a neural network operation is performed on the acquired input image (I^x) using a pre-trained artificial neural network to obtain a plurality of class-specific content descriptors (c_k^x) representing the structural characteristics of each object according to its class (S12). Here, the class-specific content descriptors (c_k^x) may be obtained by first obtaining a content descriptor (c^x) representing the overall structural characteristics of the input image (I^x, I^y), and then dividing the object region of each object in the content descriptor (c^x) by object class.

Once the class-specific content descriptors (c_k^x) are obtained, all items of the class corresponding to each class-specific content descriptor (c_k^x) are read from among the items stored in the memory 500, in which at least one item is allocated to each of the plurality of classes (S13). Each item contains a matched pair of an item key (k), a vector representing the representative structural characteristics of objects of the class, and a style value (v^y), a vector representing the representative style characteristics of those objects in the target domain. In some cases, the item may also contain a matched style value (v^x), a vector representing the representative style characteristics of the objects in the source domain.
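The item layout described above can be pictured as a small data structure. This is a minimal sketch, assuming plain Python lists as vectors; `Item`, `ClassAwareMemory`, and the class labels are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Item:
    key: List[float]                       # representative structural feature (k)
    style_y: List[float]                   # target-domain style value (v^y)
    style_x: Optional[List[float]] = None  # optional source-domain style value (v^x)

class ClassAwareMemory:
    """Items grouped by object class, with at least one item per class."""

    def __init__(self):
        self.items: Dict[str, List[Item]] = {}

    def add(self, label, item):
        self.items.setdefault(label, []).append(item)

    def read(self, label):
        """Return all items allocated to the given class (as in step S13)."""
        return list(self.items.get(label, []))

memory = ClassAwareMemory()
memory.add("car", Item(key=[1.0, 0.0], style_y=[0.3, 0.7]))
memory.add("car", Item(key=[0.0, 1.0], style_y=[0.9, 0.1], style_x=[0.5, 0.5]))
memory.add("sky", Item(key=[0.5, 0.5], style_y=[0.2, 0.2]))
```

Reading by class label keeps the later similarity computation restricted to items of the same class as the pixel being processed.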

Once all items of the class are read from the memory 500, pixel-level pixel content descriptors (c_p^x) are extracted from each class-specific content descriptor (c_k^x), and the similarity (d(c_p^x, k_n)) between each pixel content descriptor (c_p^x) and the item key (k_n) of each read item is calculated as in Equation 1 (S14). Then, based on the calculated similarity (d(c_p^x, k_n)), a read weight (α_p,n^x) for each item key (k_n) is calculated according to Equation 2 (S15).
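Steps S14 and S15 can be sketched as follows. Equations 1 and 2 are not reproduced in this text, so the cosine similarity and the softmax normalization over the item keys of the class are assumptions chosen for illustration; `cosine` and `read_weights` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed for d)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def read_weights(pixel_content, item_keys):
    """Softmax over the similarities to the item keys of one class."""
    sims = [cosine(pixel_content, key) for key in item_keys]
    peak = max(sims)                           # subtract the max for stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

# One pixel content descriptor compared against two item keys of its class.
weights = read_weights([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]])
```

Under these assumptions the weights are positive, sum to one, and are largest for the item key closest to the pixel content descriptor.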

Once the read weights (α_p,n^x) are calculated, the at least one style value (v_n^y) matched to each item key (k_n) of the at least one item is weighted by the corresponding read weight (α_p,n^x) and summed as in Equation 3 to obtain a pixel target style descriptor representing the target style of each pixel (S16).
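The weighted sum of step S16 can be written out directly. In this sketch the read weights are taken as given, and `pixel_target_style` is a hypothetical helper name.

```python
def pixel_target_style(read_weights, style_values):
    """Weighted sum of the style values of one class for a single pixel."""
    dim = len(style_values[0])
    return [
        sum(w * v[d] for w, v in zip(read_weights, style_values))
        for d in range(dim)
    ]

# Two items of the class, with read weights 0.25 and 0.75.
style = pixel_target_style([0.25, 0.75], [[4.0, 0.0], [0.0, 4.0]])
```

Because the style is a blend of the stored values, a pixel that resembles several items receives a mixture of their styles rather than a single hard assignment.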

Once the pixel-level pixel target style descriptors for each of the classes are obtained, the pixel target style descriptors of all classes are placed at their corresponding pixel positions to obtain the target style descriptor. The content descriptor (c^x), which represents the structural characteristics of the input image (I^x), is then combined with the target style descriptor to obtain a combined target style descriptor. That is, the structural characteristics of the source-domain input image (I^x) are combined with the style characteristics of the target domain.
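The placement and combination steps can be sketched as below. This is an illustration under assumptions: pixels are indexed in raster order, the per-class results are stored in dictionaries keyed by pixel index, and "combining" is shown as simple pairing; the patent does not specify these details, and both function names are hypothetical.

```python
def assemble_target_style(class_map, styles_by_class):
    """Place each pixel's class-specific target style at its pixel position.

    class_map: class label of every pixel, in raster order.
    styles_by_class: {label: {pixel_index: style_vector}}.
    """
    return [styles_by_class[label][index] for index, label in enumerate(class_map)]

def combine(content, target_style):
    """Pair the content vector and target style vector of every pixel."""
    return list(zip(content, target_style))

class_map = ["car", "road", "car"]
styles_by_class = {"car": {0: [1.0], 2: [3.0]}, "road": {1: [2.0]}}
target_style = assemble_target_style(class_map, styles_by_class)
combined = combine([[0.1], [0.2], [0.3]], target_style)
```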

Then, the combined target style descriptor is input to a pre-trained artificial neural network, which performs a neural network operation to generate an output image in which the source-domain input image (I^x) is reconstructed as an image of the target domain.

To perform the image conversion method of FIG. 10, however, the items stored in the memory 500 must be updated in advance so that the item key (k) of each item represents the structural characteristics of objects of the corresponding class and the style values (v^y, v^x) represent the style characteristics of those objects in the respective domain. The artificial neural networks must also be trained in advance.

Accordingly, a learning stage may be performed before the image conversion method of FIG. 10 in order to update the items stored in the memory 500 and train the artificial neural networks.

FIG. 11 shows an example of the learning stage for the image conversion method of FIG. 10.

Referring to FIG. 11, in the learning stage, at least one of two input images (I^x, I^y) of different domains is first acquired (S21). Then, a neural network operation is performed on the acquired input images (I^x, I^y) using a pre-trained artificial neural network to obtain a plurality of class-specific content descriptors (c_k^x, c_k^y), which represent the structural characteristics of each object according to its class, together with a plurality of class-specific style descriptors (s_k^x, s_k^y), which represent the style characteristics according to class (S22).

Next, from among the items stored in the memory 500, all items of the class corresponding to each of the obtained class-specific content descriptors (c_k^x, c_k^y) and class-specific style descriptors (s_k^x, s_k^y) are read, and the similarities (d(c_p^x, k_n), d(c_p^y, k_n)) between the pixel content descriptors (c_p^x, c_p^y) of each class-specific content descriptor (c_k^x, c_k^y) and the item key (k_n) of each read item are calculated (S23).

Once the similarities (d(c_p^x, k_n), d(c_p^y, k_n)) are calculated, the read weights (α_p,n^x, α_p,n^y) are calculated from them, the target style descriptors are obtained according to the calculated read weights, and output images of a domain different from that of the input images (I^x, I^y) are generated (S24).

Meanwhile, the calculated similarities (d(c_p^x, k_n), d(c_p^y, k_n)) are also used to calculate update weights (β_p,n^x, β_p,n^y) for updating the item keys (k_n) (S25).

Once the update weights (β_p,n^x, β_p,n^y) are calculated, updated item keys are calculated according to Equation 6 using the update weights (β_p,n^x, β_p,n^y) and the pixel content descriptors (c_p^x, c_p^y) (S26).

Then, updated style values, in which the style values (v^y, v^x) of each domain are updated, are obtained using the update weights (β_p,n^x, β_p,n^y) and the pixel target style descriptors (s_p^x, s_p^y) extracted pixel by pixel from each of the class-specific style descriptors (s_k^x, s_k^y) (S27).

The items obtained by matching the updated item keys with the corresponding updated style values are stored in the memory 500 (S28). That is, the item key (k) and style values (v^y, v^x) of each previously stored item are replaced with the updated item key and the corresponding updated style values.
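The update of steps S26 to S28 can be sketched as an additive correction. This follows the description that the update weights are applied to the pixel descriptors and the result is added to the stored vector; the exact forms of Equations 6 and 7 (including any normalization) are not reproduced here, so this is an assumption-laden illustration, and `updated_vector` is a hypothetical name.

```python
def updated_vector(stored, pixel_vectors, update_weights):
    """Return stored + sum over pixels of (update weight * pixel vector).

    Works for an item key with pixel content descriptors, or for a style
    value with pixel style descriptors. The stored vector is not modified.
    """
    result = list(stored)
    for weight, vector in zip(update_weights, pixel_vectors):
        for d, component in enumerate(vector):
            result[d] += weight * component
    return result

old_key = [1.0, 0.0]
new_key = updated_vector(old_key, [[0.0, 1.0], [2.0, 0.0]], [0.5, 0.5])
```

The new vector then replaces the old one in the memory item, as described for step S28.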

Next, a key loss (L_k) and a style loss (L_v) are calculated according to Equations 8 and 9, respectively, such that the difference between the pixel content descriptors (c_p^x, c_p^y) and pixel target style descriptors (s_p^x, s_p^y) and the item key (k_p+) and style values (v_p+^x, v_p+^y) of one of the items stored in the memory 500 decreases, while the difference from the item keys (k_n) and style values (v_n^x, v_n^y) of the remaining items increases (S29).

In addition, the self-reconstruction loss (L^self) is calculated as the difference between the input images (I^x, I^y) and the self-reconstructed images that are output when a combined target style descriptor, in which the content descriptors (c^x, c^y) and the target style descriptors of the same domain are combined, is applied to the artificial neural network that receives combined target style descriptors and generates output images in the style of the target domain. The cyclic reconstruction loss (L^cyc) is calculated as the difference between the input images (I^x, I^y) and the cyclically reconstructed images obtained by converting the images translated from the input images (I^x, I^y) into the target domain back into the source domain (S30).

Then, the total loss (L) is calculated as a weighted sum, in a predetermined manner, of the key loss (L_k), the style loss (L_v), the self-reconstruction loss (L^self), and the cyclic reconstruction loss (L^cyc), and training is performed by backpropagating the calculated total loss (S31).

Here, training may be repeated a predetermined number of times or until the total loss falls to or below a predetermined reference loss.
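The overall control flow of the learning stage, with the memory update preceding backpropagation and the two stopping conditions, can be sketched as follows. The callbacks `update_memory` and `train_step` are hypothetical placeholders for steps S21 to S28 and S29 to S31, and the loss values are made up.

```python
def train(batches, update_memory, train_step, max_steps, reference_loss):
    """Run until max_steps is reached or the loss is at or below reference_loss."""
    steps, loss = 0, float("inf")
    for batch in batches:
        if steps >= max_steps:
            break
        update_memory(batch)       # memory items are updated first
        loss = train_step(batch)   # then the total loss is backpropagated
        steps += 1
        if loss <= reference_loss:
            break
    return steps, loss

log = []
fake_losses = iter([1.0, 0.5, 0.1, 0.05])

def update_memory(batch):
    log.append(("update", batch))

def train_step(batch):
    log.append(("backprop", batch))
    return next(fake_losses)

steps, final_loss = train(range(10), update_memory, train_step,
                          max_steps=100, reference_loss=0.2)
```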

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

Therefore, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

100: image acquisition unit 110: first image acquisition unit
120: second image acquisition unit 200: encoding unit
210: first encoder 211: object identification unit
212: content descriptor obtaining unit 213: content clustering unit
214: style descriptor obtaining unit 215: style clustering unit
220: second encoder 300: target style generator
310: first target style generator 311: read similarity calculation unit
312: read weight calculation unit 313: target style descriptor obtaining unit
314: target style combiner 320: second target style generator
400: output image generator 500: memory
600: update unit 610: update similarity calculation unit
620: update weight calculation unit 630: update value calculation unit

Claims

1. An image conversion apparatus comprising:
an encoding unit that receives an input image obtained from a source domain, performs a neural network operation to obtain a content descriptor representing the structural characteristics of the input image, and divides the content descriptor according to the class of each object included in the input image to obtain class-specific content descriptors;
a memory storing a plurality of items, at least one of which corresponds to each of a plurality of classes, each item containing an item key that represents the representative structural characteristics of objects of the corresponding class, matched with a style value that represents the representative style characteristics of that class in a target domain different from the source domain;
a target style generator that calculates read weights according to the similarity between each class-specific content descriptor and the item keys of the items of the corresponding class among the plurality of items stored in the memory, and obtains a target style descriptor by weighting the style values matched to the item keys with the calculated read weights; and
an output image generator that receives the content descriptor and the target style descriptor and performs a neural network operation to generate an output image in the target domain,
wherein the encoding unit comprises:
an object identification unit implemented as a pre-trained artificial neural network that performs a neural network operation on the input image to obtain an object identification descriptor identifying the object region and class of each object included in the input image;
a content descriptor obtaining unit implemented as a pre-trained artificial neural network that performs a neural network operation on the input image to obtain the content descriptor representing the structural characteristics of the input image; and
a content clustering unit that divides each object region in the content descriptor by class using the object identification descriptor to obtain the plurality of class-specific content descriptors.

2. (Deleted)

3. The apparatus of claim 1, wherein the target style generator comprises:
a read similarity calculation unit that reads, from among the plurality of items stored in the memory, the items of the class corresponding to each of the class-specific content descriptors, and calculates in a predetermined manner the similarity between each pixel-level pixel content descriptor in the class-specific content descriptor and the item key of each read item;
a read weight calculation unit that calculates, in a predetermined manner according to the calculated similarity, a read weight indicating the importance of each of the at least one item key of the class corresponding to the pixel content descriptor;
a target style descriptor obtaining unit that obtains a pixel target style descriptor representing the target style of each pixel by weighting the at least one style value matched to each item key of the at least one item of the corresponding class with the corresponding read weight and summing the results, and obtains the target style descriptor by placing the pixel target style descriptors of all classes at their corresponding pixel positions; and
a target style combiner that combines the content descriptor with the target style descriptor and outputs the combined target style descriptor to the output image generator.

4. The apparatus of claim 3, wherein the read similarity calculation unit calculates the similarity (d(c_p^x, k_n)) between each pixel content descriptor (c_p^x) in each class-specific content descriptor (c_k^x) and each item key (k_n) assigned to that class according to the equation

the read weight calculation unit calculates, using the M_k similarities (d(c_p^x, k_n)) calculated for each of the N_k pixel content descriptors (c_p^x), the read weight (α_p,n^x) for each of the M_k item keys (k_n) corresponding to the class of the class-specific content descriptor (c_k^x) according to the equation

and the target style descriptor obtaining unit calculates the pixel target style descriptor according to the equation

5. The apparatus of claim 4, wherein the encoding unit, during training, receives at least one of the input images obtained from the source domain and the target domain, and further comprises a class style descriptor extraction unit that performs a neural network operation on each received input image to obtain a style descriptor representing the style characteristics of the source domain of that input image, and divides each object region in the style descriptor by class using the object identification descriptor to obtain a plurality of class-specific style descriptors,
and wherein the image conversion apparatus further comprises an update unit that updates the item key and style value contained in each of the plurality of items stored in the memory, using the plurality of class-specific content descriptors and class-specific style descriptors of each input image applied during training.

6. The apparatus of claim 5, wherein the update unit comprises:
an update similarity calculation unit that reads, from among the plurality of items stored in the memory, the items of the class corresponding to each of the class-specific content descriptors of each applied input image, and calculates in a predetermined manner the similarity between the pixel content descriptors of the class-specific content descriptor and the item key of each read item;
an update weight calculation unit that calculates, in a predetermined manner according to the calculated similarity, an update weight indicating the importance of each of the at least one item key of the class corresponding to the pixel content descriptor; and
an update value calculation unit that calculates an updated item key by weighting the pixel content descriptors with the update weights and adding the result to each item key of the at least one item of the corresponding class, and calculates an updated style value by weighting the pixel target style descriptors with the update weights and adding the result to the corresponding style value.

7. The apparatus of claim 6, wherein the update similarity calculation unit calculates the similarities (d(c_p^x, k_n), d(c_p^y, k_n)) according to the equation

and the update weight calculation unit calculates the update weights (β_p,n^x, β_p,n^y) for each item key (k_n) in each domain according to the equation

8. The apparatus of claim 7, wherein the update value calculation unit calculates the updated item key for each of the M_k item keys (k_n) of the corresponding class according to the equation

and calculates the updated style values, by weighting the pixel target style descriptors (s_p^x, s_p^y) with the corresponding update weights (β_p,n^x, β_p,n^y) and adding the results to the corresponding style values (v_n^x, v_n^y), according to the equation

9. The apparatus of claim 5, further comprising a learning unit that, during training, calculates a key loss (L_k) and a style loss (L_v) that reduce the difference between the pixel content descriptors (c_p^x, c_p^y) and pixel target style descriptors (s_p^x, s_p^y) corresponding to the input images (I^x, I^y) of different domains and the item key (k_p+) and style values (v_p+^x, v_p+^y) of one of the items stored in the memory, while increasing the difference from the item keys (k_n) and style values (v_n^x, v_n^y) of the remaining items; calculates a self-reconstruction loss (L^self) and a cyclic reconstruction loss (L^cyc); calculates a total loss as a weighted sum, in a predetermined manner, of the key loss (L_k), the style loss (L_v), the self-reconstruction loss (L^self), and the cyclic reconstruction loss (L^cyc); and performs training by backpropagating the calculated total loss.

10. The apparatus of claim 9, wherein the learning unit calculates the key loss (L_k) according to the equation

calculates the style loss (L_v) according to the equation

(where k_p+, v_p+^x, and v_p+^y denote positive samples, namely the item key and style values most similar to the pixel content descriptors (c_p^x, c_p^y) and the pixel target style descriptors (s_p^x, s_p^y) among the item keys (k_n) and style values (v_n^x, v_n^y) of the M items stored in the memory),
calculates the self-reconstruction loss, as the difference between the input images (I^x, I^y) and the self-reconstructed images generated from the content descriptors (c^x, c^y) and target style descriptors of the same domain, according to the equation

(where G^x() and G^y() are functions representing the neural network operations of the output image generator),
and calculates the cyclic reconstruction loss, as the difference between the input images (I^x, I^y) and the cyclically reconstructed images obtained by feeding the images converted from the input images (I^x, I^y) into the target domain back as input images, according to the equation

11. An image conversion method comprising:
receiving an input image obtained from a source domain, performing a neural network operation to obtain a content descriptor representing the structural characteristics of the input image, and dividing the content descriptor according to the class of each object included in the input image to obtain class-specific content descriptors;
calculating read weights according to the similarity between each class-specific content descriptor and the item keys of the items of the corresponding class among a plurality of items pre-stored in a memory, at least one of which corresponds to each of a plurality of classes, each item containing an item key that represents the representative structural characteristics of objects of the corresponding class, matched with a style value that represents the representative style characteristics of that class in a target domain different from the source domain, and obtaining a target style descriptor by weighting the style values matched to the item keys with the calculated read weights; and
receiving the content descriptor and the target style descriptor and performing a neural network operation to generate an output image in the target domain,
wherein obtaining the class-specific content descriptors comprises:
performing, with a pre-trained artificial neural network, a neural network operation on the input image to obtain an object identification descriptor identifying the object region and class of each object included in the input image;
performing, with a pre-trained artificial neural network, a neural network operation on the input image to obtain the content descriptor representing the structural characteristics of the input image; and
dividing each object region in the content descriptor by class using the object identification descriptor to obtain the plurality of class-specific content descriptors.

12. (Deleted)

13. The method of claim 11, wherein obtaining the target style descriptor comprises:
reading, from among the plurality of items stored in the memory, the items of the class corresponding to each of the class-specific content descriptors, and calculating a read similarity between each pixel-level pixel content descriptor in the class-specific content descriptor and the item key of each read item;
calculating, according to the calculated read similarity, a read weight indicating the importance of each of the at least one item key of the class corresponding to the pixel content descriptor;
obtaining a pixel target style descriptor representing the target style of each pixel by weighting the at least one style value matched to each item key of the at least one item of the corresponding class with the corresponding read weight and summing the results, and obtaining the target style descriptor by placing the pixel target style descriptors of all classes at their corresponding pixel positions; and
combining the content descriptor with the target style descriptor and outputting a combined target style descriptor.

14. The method of claim 13, wherein calculating the read similarity comprises calculating the read similarity (d(c_p^x, k_n)) between each pixel content descriptor (c_p^x) in each class-specific content descriptor (c_k^x) and each item key (k_n) assigned to that class according to the equation

calculating the read weight comprises calculating, using the M_k read similarities (d(c_p^x, k_n)) calculated for each of the N_k pixel content descriptors (c_p^x), the read weight (α_p,n^x) for each of the M_k item keys (k_n) corresponding to the class of the class-specific content descriptor (c_k^x) according to the equation

and obtaining the target style descriptor comprises calculating the pixel target style descriptor according to the equation

15. The method of claim 14, further comprising a learning step performed before the step of obtaining the plurality of content descriptors for each class,
wherein the learning step comprises:
receiving at least one input image obtained from the source domain and the target domain, performing a neural network operation on each of the at least one input image to obtain the plurality of content descriptors for each class together with a style descriptor representing style characteristics of the source domain, and classifying each object region in the style descriptor by class using the object identification descriptor to obtain a plurality of style descriptors for each class;
obtaining the target style descriptor according to the plurality of content descriptors for each class and the corresponding style values stored in the memory, and generating an output image in a domain different from that of the input image by performing a neural network operation on the content descriptor and the target style descriptor; and
updating the item key and the style value included in each of the plurality of items stored in the memory, using the plurality of content descriptors for each class and the plurality of style descriptors for each class obtained for each of the at least one input image.

16. The method of claim 15, wherein the updating step comprises:
reading, from among the plurality of items stored in the memory, the items of the class corresponding to each of the plurality of content descriptors for each class obtained for each of the at least one input image, and calculating an update similarity between each pixel content descriptor of the content descriptor for each class and the item key of each read item;
calculating an update weight indicating the importance of each of the at least one item key of the class corresponding to the pixel content descriptor according to the calculated update similarity; and
obtaining an updated item by weighting the pixel content descriptor with the update weight and adding the result to each item key of the at least one item of the corresponding class to calculate an updated item key, and by weighting the pixel target style descriptor with the update weight and adding the result to the corresponding style value to calculate an updated style value.
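
The update described in this claim can be sketched as below for the items of a single class. The additive form (weighted descriptors added to the existing key and style value) follows the claim wording; the cosine similarity, the softmax normalization across the pixels of the class, and the name `update_memory` are assumptions made only for this illustration.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero when normalizing


def update_memory(pixel_content, pixel_style, item_keys, style_values):
    """Update the items of ONE class from the pixels of one input image.

    pixel_content: (N_k, C) pixel content descriptors
    pixel_style:   (N_k, S) pixel style descriptors at the same pixels
    item_keys:     (M_k, C) current item keys
    style_values:  (M_k, S) current style values
    """
    # ASSUMPTION: cosine similarity plays the role of the update similarity.
    c = pixel_content / (np.linalg.norm(pixel_content, axis=1, keepdims=True) + EPS)
    k = item_keys / (np.linalg.norm(item_keys, axis=1, keepdims=True) + EPS)
    similarity = c @ k.T  # (N_k, M_k)

    # ASSUMPTION: for each key, the weights are a softmax over the pixels,
    # so every column of `weights` sums to one.
    weights = np.exp(similarity - similarity.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)

    # Add the weighted descriptors to the existing keys and style values.
    new_keys = item_keys + weights.T @ pixel_content
    new_values = style_values + weights.T @ pixel_style
    return new_keys, new_values, weights
```

Calling the function once per domain, with that domain's style descriptors and style values, would update the shared keys and the domain-specific style values; any renormalization of the updated keys is left out of this sketch.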

17. The method of claim 16, wherein calculating the update similarity comprises
calculating the update similarities (d(c_p^x, k_n), d(c_p^y, k_n)) according to the equation

and wherein calculating the update weight comprises
calculating the update weights (β_{p,n}^x, β_{p,n}^y) for each item key (k_n) in each domain according to the equation


18. The method of claim 17, wherein obtaining the updated item comprises:
calculating an updated item key for each of the M_k item keys (k_n) of the corresponding class according to the equation

and
weighting the pixel target style descriptors (s_p^x, s_p^y) with the corresponding update weights (β_{p,n}^x, β_{p,n}^y) and adding the result to the corresponding style values (v_n^x, v_n^y) to calculate the updated style values according to the equation


19. The method of claim 15, wherein the learning step further comprises:
calculating a key loss (L_k) and a style loss (L_v) under which, during learning, the difference between the pixel content descriptors (c_p^x, c_p^y) and pixel target style descriptors (s_p^x, s_p^y), corresponding respectively to the input images (I^x, I^y) of different domains, and the item key (k_{p+}) and style values (v_{p+}^x, v_{p+}^y) of one of the items stored in the memory becomes small, while the difference from the item keys (k_n) and style values (v_n^x, v_n^y) of the other items becomes large, and calculating a self-reconstruction loss (L^self) and a cyclic reconstruction loss (L^cyc); and
calculating a total loss by weighting the key loss (L_k), the style loss (L_v), the self-reconstruction loss (L^self), and the cyclic reconstruction loss (L^cyc) in a predetermined manner, and backpropagating the calculated total loss.
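
The behaviour described for the key and style losses, and the weighted combination into a total loss, might look like the sketch below. The loss equations themselves are not reproduced in this text, so the softmax cross-entropy (contrastive) form, the temperature `tau`, the default choice of the most similar item as the positive sample, and the weight values are all assumptions for illustration, as are the names `contrastive_memory_loss` and `total_loss`.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero when normalizing


def contrastive_memory_loss(queries, items, tau=0.1, pos=None):
    """Pull each query toward one item (the positive) and away from the rest.

    queries: (N, C) pixel descriptors (content for a key loss, style for a style loss)
    items:   (M, C) item keys or style values of the memory
    pos:     optional (N,) indices of the positive item per query;
             ASSUMPTION: defaults to the most similar item
    """
    q = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + EPS)
    m = items / (np.linalg.norm(items, axis=1, keepdims=True) + EPS)
    logits = (q @ m.T) / tau  # (N, M)
    pos = logits.argmax(axis=1) if pos is None else np.asarray(pos)

    # Numerically stable log-softmax over the M items.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(q)), pos].mean())


def total_loss(l_key, l_style, l_self, l_cyc, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four losses; the weights here are placeholders."""
    w_key, w_style, w_self, w_cyc = weights
    return w_key * l_key + w_style * l_style + w_self * l_self + w_cyc * l_cyc
```

Passing the positive indices found for the item keys as `pos` when computing the style loss would make both losses refer to the same memory item, which is one way to read the shared positive sample described above.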

20. The method of claim 19, wherein calculating the losses comprises:
calculating the key loss (L_k) according to the equation

calculating the style loss (L_v) according to the equation

(where k_{p+}, v_{p+}^x, and v_{p+}^y denote the positive sample, namely the item key and style values that are most similar to the pixel content descriptors (c_p^x, c_p^y) and the pixel target style descriptors (s_p^x, s_p^y) among the item keys (k_n) and style values (v_n^x, v_n^y) of the M items stored in the memory);
calculating the self-reconstruction loss between the input images (I^x, I^y) and the self-reconstructed images generated from the content descriptors (c^x, c^y) and the target style descriptors for the same domain, according to the equation

(where G^x() and G^y() are functions expressing the neural network operations of the output image generator); and
calculating the cyclic reconstruction loss between the input images (I^x, I^y) and the cyclic reconstruction images obtained by applying, as input images, the images converted from the input images (I^x, I^y), according to the equation

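
The two reconstruction losses of this claim can be illustrated as follows. The sketch assumes an L1 distance between images, which is not stated in this text, and it treats the encoder, the memory read, and the generators G^x and G^y as caller-supplied functions; the names `encode`, `read_style`, `gen_x`, `gen_y`, and `reconstruction_losses` are hypothetical.

```python
import numpy as np


def l1(a, b):
    """Mean absolute difference between two images (ASSUMED distance)."""
    return float(np.abs(a - b).mean())


def reconstruction_losses(img_x, img_y, encode, read_style, gen_x, gen_y):
    """Return (self-reconstruction loss, cyclic reconstruction loss).

    encode(img)                 -> content descriptor
    read_style(content, domain) -> target style descriptor for "x" or "y"
    gen_x / gen_y(content, style) -> image in domain x / y
    """
    c_x, c_y = encode(img_x), encode(img_y)

    # Self-reconstruction: content combined with the style of its own domain.
    self_x = gen_x(c_x, read_style(c_x, "x"))
    self_y = gen_y(c_y, read_style(c_y, "y"))
    loss_self = l1(self_x, img_x) + l1(self_y, img_y)

    # Translation to the other domain.
    fake_y = gen_y(c_x, read_style(c_x, "y"))
    fake_x = gen_x(c_y, read_style(c_y, "x"))

    # Cyclic reconstruction: translate the converted images back again.
    c_fake_y, c_fake_x = encode(fake_y), encode(fake_x)
    cyc_x = gen_x(c_fake_y, read_style(c_fake_y, "x"))
    cyc_y = gen_y(c_fake_x, read_style(c_fake_x, "y"))
    loss_cyc = l1(cyc_x, img_x) + l1(cyc_y, img_y)
    return loss_self, loss_cyc
```

With identity stand-ins for all four callables both losses are zero, which gives a quick sanity check before real networks are plugged in.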