KR102051597B1

KR102051597B1 - Apparatus and method for retargeting images based on content-awareness

Info

Publication number: KR102051597B1
Application number: KR1020170170238A
Authority: KR
Inventors: 권인소; 조동현
Original assignee: 한국과학기술원
Priority date: 2017-12-12
Filing date: 2017-12-12
Publication date: 2019-12-03
Also published as: KR20190069893A

Abstract

An image resizing apparatus operated by at least one processor includes an encoder-decoder and a shifter. The encoder-decoder receives an original image, extracts high-level features of the original image through a plurality of encoding convolution layers, and decodes the high-level features with a plurality of decoding convolution layers to output a first attention map containing location information on meaningful regions. The shifter generates a second attention map by resizing the first attention map to a target aspect ratio, generates, based on the second attention map, a shift map that moves pixels of the original image to an output image, and warps the original image with the shift map to output a retargeted image.

Description

Content-based image resizing device and method {APPARATUS AND METHOD FOR RETARGETING IMAGES BASED ON CONTENT-AWARENESS}

The present invention relates to a retargeting method for adjusting image size.

Various image retargeting methods have been studied to date, and they can be broadly classified into seam carving-based methods and warping-based methods. Seam carving-based retargeting changes the aspect ratio of an image by repeatedly removing or inserting seams in unimportant regions. Warping-based retargeting continuously transforms the input image into an image of the target size. It iteratively computes optimal scaling factors for local regions and updates a warped image guided by an edge map and a saliency map.

As described above, the prior art obtains a saliency map from low-level hand-crafted features, such as an edge map, in order to find important and semantic regions in an image, and resizes the image based on that saliency map. However, low-level hand-crafted features cannot accurately identify the meaningful regions of an image, so the image ends up retargeted without its semantically important parts properly preserved. For example, FIG. 1 shows image retargeting results obtained with low-level features, in which key parts of the image are lost or distorted. This is a fundamental limitation of techniques that resize images using low-level features.

An object of the present invention is to provide an apparatus and method that extract high-level features of an input image through a neural network, use the high-level features to obtain an attention map containing location information on meaningful regions, and resize the input image using a shift map generated from the attention map.

Another object of the present invention is to provide an apparatus and method that compute the content loss and structure loss of the resized output image and train the convolution weights involved in generating the output image so as to minimize the loss of the output image.

According to one embodiment, an image resizing apparatus operated by at least one processor includes: an encoder-decoder that receives an original image, extracts high-level features of the original image through a plurality of encoding convolution layers, and decodes the high-level features with a plurality of decoding convolution layers to output a first attention map containing location information on meaningful regions; and a shifter that generates a second attention map by resizing the first attention map to a target aspect ratio, generates, based on the second attention map, a shift map that moves pixels of the original image to an output image, and warps the original image with the shift map to output a retargeted image.

The shifter may generate a third attention map that is uniform along each column by applying 1D duplicate convolution to the second attention map, generate a final attention map as a weighted sum of the third attention map and the second attention map, and generate the shift map by cumulative normalization of the final attention map.

The shifter may convolve the second attention map with a column vector to generate a one-dimensional row vector, and repeatedly duplicate the row vector to generate the third attention map.

The column vector may have weights trained to minimize the loss of the retargeted image.

The plurality of decoding convolution layers may be inversely symmetric to (a mirror image of) the plurality of encoding convolution layers.

The plurality of decoding convolution layers may have weights trained to minimize the loss of the retargeted image.

The image resizing apparatus may further include a learning device that computes the loss of the retargeted image and trains the convolution weights involved in outputting the retargeted image in the encoder-decoder and the shifter so that the loss is minimized.

The loss may include a content loss, and the learning device may compute the content loss based on a class score obtained by inputting the retargeted image to an image classifier.

The loss may further include a structure loss, and the learning device may compute the structure loss by comparing the neighborhoods of corresponding pixels in the retargeted image and the original image.

According to another embodiment, an image resizing apparatus operated by at least one processor includes: an encoder-decoder that receives an original image, extracts high-level features of the original image through a plurality of encoding convolution layers, and decodes the high-level features with a plurality of decoding convolution layers to output a first attention map containing location information on meaningful regions; a shifter that receives a target aspect ratio, generates a second attention map by resizing the first attention map to the target aspect ratio, and outputs, based on the second attention map, a retargeted image obtained by resizing the original image to the target aspect ratio; and a learning device that computes the loss of the retargeted image and trains the weights of the plurality of decoding convolution layers so that the loss is minimized.

The loss may include a content loss and a structure loss. The learning device may compute the content loss based on a class score obtained by inputting the retargeted image to an image classifier, and may compute the structure loss by comparing the neighborhoods of corresponding pixels in the retargeted image and the original image.

The image classifier may be composed of the plurality of encoding convolution layers.

The learning device may input each of the original image and the retargeted image to the plurality of encoding convolution layers, and then compute the structure loss by comparing the features of the original image with the features of the retargeted image output from a lower convolution layer among the plurality of encoding convolution layers.

According to yet another embodiment, an image resizing method of an image resizing apparatus operated by at least one processor includes: receiving a target aspect ratio and a first attention map containing location information on meaningful regions of an original image; generating a second attention map by resizing the first attention map to the target aspect ratio; generating a third attention map that is uniform along each column by applying 1D duplicate convolution to the second attention map; generating a shift map using a final attention map obtained as a weighted sum of the third attention map and the second attention map; and outputting a retargeted image obtained by resizing the original image using the shift map.

The generating of the shift map may include generating the shift map by cumulative normalization of the final attention map.

The generating of the third attention map may include convolving the second attention map with a column vector to generate a one-dimensional row vector, and repeatedly duplicating the row vector to generate the third attention map.

The first attention map may be the result of decoding high-level features of the original image with a plurality of decoding convolution layers.

The image resizing method may further include: computing a content loss of the retargeted image based on a class score obtained by inputting the retargeted image to an image classifier; computing a structure loss of the retargeted image by comparing the neighborhoods of corresponding pixels in the retargeted image and the original image; and training the convolution weights involved in outputting the retargeted image so that the content loss and the structure loss are minimized.

According to embodiments of the present invention, the image is resized by stretching or shrinking relatively less important regions while preserving the regions considered important, so the loss of main content and objects in the image can be minimized. Embodiments of the present invention can be applied to various systems that require image resizing.

FIG. 1 shows image retargeting results obtained with conventional low-level features.
FIG. 2 is a block diagram of an image resizing apparatus according to an embodiment.
FIG. 3 shows the detailed configuration of an image resizing apparatus according to an embodiment.
FIG. 4 is a diagram illustrating a method of generating a final attention map using 1D duplicate convolution according to an embodiment.
FIG. 5 is a diagram illustrating the performance of a learning method according to an embodiment.
FIG. 6 is a flowchart of an image resizing method according to an embodiment.
FIG. 7 is a flowchart of a learning method according to an embodiment.
FIGS. 8 to 11 are comparative views illustrating the performance of the present invention.
FIG. 12 shows the results of evaluating image classification accuracy as the image size changes.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice it. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here. In the drawings, parts unrelated to the description are omitted in order to describe the invention clearly, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "... -er", and "module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.

FIG. 2 is a block diagram of an image resizing apparatus according to an embodiment, FIG. 3 shows the detailed configuration of the image resizing apparatus according to an embodiment, and FIG. 4 is a diagram illustrating a method of generating a final attention map using 1D duplicate convolution according to an embodiment.

Referring to FIGS. 2 and 3, the image resizing apparatus 10 receives an original image and a target aspect ratio, resizes the original image to the target aspect ratio, and outputs a retargeted image. The image obtained by resizing the original image to the target aspect ratio is called the retargeted image. The original image is the input image of the image resizing apparatus 10, and the retargeted image is its output image.

The image resizing apparatus 10 is operated by at least one processor and includes an encoder-decoder 100 that generates an attention map for the original image, and a shifter 200 that converts the attention map to the target aspect ratio.

The learning device 300 trains the convolution weights involved in generating the output image so that the retargeted image output from the shifter 200 is transformed naturally while preserving meaningful regions well. The learning device 300 may be included in the image resizing apparatus 10, but it is described separately here so that image resizing and training can be explained independently.

The encoder-decoder 100 receives the original image and, using high-level features of the original image, outputs an attention map containing location information on semantic regions. The encoder-decoder 100 consists of an encoder 110 that extracts high-level features of the original image using a neural network, and a decoder 130 that progressively upsamples the features output from the encoder 110 to generate the attention map.

The encoder 110 may be a neural network composed of a plurality of convolution layers and pooling layers that pool their outputs. The neural network may be a deep convolutional network, and a pre-trained deep learning model (for example, VGG16) may be used. In this case, the weights of the convolution layers constituting the encoder 110 are assumed to be already fixed.

The decoder 130 has convolution layers arranged inversely symmetric to those of the encoder and generates the attention map from the features output by the encoder 110 through upsampling. Although the decoder 130 mirrors the structure of the encoder 110, some layers may be changed as needed. For example, the ReLU (Rectified Linear Unit) layers of the encoder 110 may be replaced with ELU (Exponential Linear Unit) layers in the decoder 130. The weights of the convolution layers constituting the decoder 130 are trained by the learning device 300.
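To make the ReLU-to-ELU substitution concrete, the following minimal sketch shows the two activation functions on scalar inputs. The function names and the default `alpha = 1.0` are assumptions for illustration; the description does not state which ELU parameter is used.

```python
import math


def relu(x: float) -> float:
    """Rectified Linear Unit: passes positive inputs, clamps negatives to zero."""
    return max(0.0, x)


def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential Linear Unit: identity for positive inputs, smooth saturation
    toward -alpha for negative inputs. alpha = 1.0 is an assumed default."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
```

Unlike ReLU, ELU keeps a non-zero output and gradient for negative inputs, which is one commonly cited reason for choosing it in a trainable decoder.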

The shifter 200 receives from the decoder 130 the attention map generated from the high-level features of the original image. The shifter 200 can therefore generate a content-aware shift map using an attention map that contains location information on meaningful regions. Because the shifter 200 can thus accurately identify meaningful regions, it can perform retargeting that preserves the regions considered important in the image while stretching or shrinking relatively less important regions (for example, the background).

The shifter 200 generates a shift map based on the attention map of the original image output from the decoder 130, and warps the original image based on the shift map to output the retargeted image. The shift map S(x, y) defines how far each pixel must be moved from the input image I of size W x H to the output image O of size W' x H. For ease of explanation, the case in which only the width of the image changes is described, but the height is changed in the same way as the width. The relationship between the input image I and the output image O through the shift map S(x, y) is given by Equation 1, where (x, y) are the spatial coordinates of the output image. When only the width of the image changes, the value of S lies in the range from 0 to the change in width (W - W').
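Equation 1 itself is not reproduced in this text. One form consistent with the definitions above (output coordinates (x, y), a horizontal-only shift, and S bounded by W - W') is sketched below; it is an assumed reconstruction, not a quotation of the patent's equation.

```latex
O(x, y) = I\bigl(x + S(x, y),\; y\bigr), \qquad 0 \le S(x, y) \le W - W'
```

Under this reading, each output pixel samples the input image at a column displaced to the right by S(x, y), so an output column x in [0, W' - 1] always maps into the input range [0, W - 1].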

Because the attention map A_d output from the decoder 130 contains location information on meaningful regions, it could be used directly for content-aware image retargeting. However, to maintain the overall shape of the image, pixels in similar columns must have similar shift values; otherwise the overall shape of the image may be distorted. Therefore, instead of using the attention map A_d of the original image output from the decoder 130 directly for retargeting, the shifter 200 uses an attention map whose shape is constrained to be uniform along each column (vertically) by means of 1D duplicate convolution.

To this end, the shifter 200 includes a shift map generator 210 and a warping unit 230.

First, the shift map generator 210 resizes the attention map A_d output from the decoder 130 to the size of the retargeted image. The shift map generator 210 may resize the attention map A_d to the retargeted image size based on the target aspect ratio. The resized attention map A_r is expressed by Equation 2, where A_r is the attention map resized to the retargeted image size, A_d is the attention map output from the decoder 130, and R is the output image size, that is, the size of the retargeted image.
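The resizing step can be sketched as follows for the width-only case. The interpolation scheme (linear, with the first and last columns aligned) and the function name `resize_width` are assumptions, since the description only states that the attention map is resized to the retargeted image size. Plain Python lists are used so that the sketch has no dependencies.

```python
def resize_width(att, new_w):
    """Resize each row of an H x W attention map to new_w columns
    using linear interpolation (assumed scheme, endpoints aligned)."""
    old_w = len(att[0])
    resized = []
    for row in att:
        new_row = []
        for j in range(new_w):
            # Position of output column j on the input column axis.
            pos = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            left = int(pos)
            right = min(left + 1, old_w - 1)
            frac = pos - left
            new_row.append((1 - frac) * row[left] + frac * row[right])
        resized.append(new_row)
    return resized
```

For example, `resize_width([[0.0, 1.0, 2.0, 3.0, 4.0]], 3)` samples input columns 0, 2, and 4.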

Next, the shift map generator 210 applies 1D duplicate convolution to the resized attention map A_r so that it becomes uniform along each column (vertically), thereby correcting distortion.

Referring to FIG. 4, the shift map generator 210 convolves the resized attention map A_r (H x W) with an H-dimensional column vector (H x 1) to convert it into a one-dimensional row vector (1 x W). The weights of the H-dimensional column vector are trained by the learning device 300. The shift map generator 210 duplicates the one-dimensional row vector H times to generate the attention map A_1D (H x W) produced by 1D duplicate convolution. The 1D duplicate convolution is expressed by Equation 3.

In Equation 3, A_r is the attention map resized to the retargeted image size, and A_1D is the attention map generated uniformly along each column (vertically) by 1D duplicate convolution. The convolution operator in Equation 3 denotes convolution with an H-dimensional column vector, and the duplication operator denotes repeating the one-dimensional row vector H times.
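A minimal sketch of 1D duplicate convolution as described here: an H x W map is collapsed to a 1 x W row by an H-dimensional column vector, and that row is then copied H times. The function name and the uniform example weights are assumptions; in the described apparatus the column-vector weights are learned.

```python
def duplicate_conv_1d(att_r, col_weights):
    """Collapse an H x W map to a 1 x W row with an H-dimensional column
    vector, then duplicate that row H times so every column is constant."""
    h, w = len(att_r), len(att_r[0])
    if len(col_weights) != h:
        raise ValueError("column vector must have one weight per row")
    # Weighted sum down each column gives one value per column.
    row = [sum(col_weights[i] * att_r[i][j] for i in range(h)) for j in range(w)]
    # Repeat the row H times to restore the H x W shape.
    return [list(row) for _ in range(h)]


# Example with assumed uniform weights (a learned vector would replace these).
example = duplicate_conv_1d([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5])
```

Every row of the result is identical, which is what forces pixels in the same column to share one attention value.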

Because the attention map A_1D generated by 1D duplicate convolution contains only a single uniform value along each column (vertically), the shift map generator 210 generates the final attention map A as a weighted sum of A_1D and the attention map A_r resized to the retargeted image size. The final attention map A is expressed by Equation 4, where λ is a weighting variable.
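The weighted sum can be sketched as below. The text only says that the two maps are combined with a weighting variable λ; applying λ to A_r (rather than to A_1D) is an assumption made for this illustration, as is the function name.

```python
def final_attention(att_1d, att_r, lam):
    """Element-wise weighted sum A = A_1D + lam * A_r (assumed form)."""
    return [
        [a1 + lam * ar for a1, ar in zip(row_1d, row_r)]
        for row_1d, row_r in zip(att_1d, att_r)
    ]
```

With a small `lam`, the result stays close to the column-uniform map while retaining some per-pixel detail from the resized attention map.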

The shift map generator 210 generates the shift map S(x, y) by cumulative normalization of the final attention map A. The shift map S(x, y) may be computed as in Equation 5, where W is the width of the input image and W' is the width of the output image.

The shift map generator 210 performs cumulative normalization so that the shift map increases monotonically. The input image is warped to the output image by the shift map. If the order of the input pixels changed during warping, the output image would not be rendered correctly, so the pixel order must not be reversed. The shift map values are therefore cumulatively normalized so that they increase monotonically along the spatial axes (x and y directions). Cumulative normalization maps the values to the range 0 to 1, and multiplying by (W - W') gives the shift map values the range 0 to (W - W'). Where the final attention map has low values, the shift map increases only slightly, so that part occupies a large area in the output image; conversely, where the final attention map has high values, the shift map increases sharply, so that part occupies a small area in the output image.
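Cumulative normalization along the horizontal axis can be sketched as a running sum divided by the row total and scaled by (W - W'). The function name is hypothetical, and the sketch assumes non-negative attention values with a positive sum per row, which is what makes the result monotonically non-decreasing.

```python
def shift_map_from_attention(att, in_w):
    """Build an H x W' shift map from an H x W' final attention map.
    Each row is the running sum of attention, normalized to [0, 1] and
    scaled to [0, in_w - out_w]. Assumes non-negative values and a
    positive row sum (an assumption of this sketch)."""
    out_w = len(att[0])
    shifts = []
    for row in att:
        total = sum(row)
        running = 0.0
        shift_row = []
        for value in row:
            running += value
            shift_row.append((in_w - out_w) * running / total)
        shifts.append(shift_row)
    return shifts


# Shrinking a 5-pixel-wide row to 3 pixels: shifts end at W - W' = 2.
example_shifts = shift_map_from_attention([[1.0, 1.0, 2.0]], 5)
```

In the example, the largest attention value (2.0) produces the largest jump in the shift map, matching the statement that high-attention parts occupy less of the output.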

The warping unit 230 warps the original image (input image) using the shift map generated by the shift map generator 210 to produce the retargeted image (output image). Retargeting is performed according to Equation 1.

Because the shift map has sub-pixel precision, linear interpolation is performed using four neighboring pixels. A differentiable loss can be applied to the linearly warped image. The retargeted image is also used to train the neural network. For this purpose, the retargeted image may be zero-padded to match the input size of the neural network model.
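The warping and padding steps can be sketched for the width-only case as follows. With a purely horizontal shift the source row is an integer, so interpolation over four neighbors reduces to interpolation between the two horizontal neighbors; that simplification, the clamping at the image border, the right-side padding, and both function names are assumptions of this sketch.

```python
def warp_horizontal(image, shifts):
    """Sample an H x W grayscale image at columns x + S(x, y) to produce
    an H x W' output, interpolating linearly between adjacent columns."""
    in_w = len(image[0])
    output = []
    for img_row, shift_row in zip(image, shifts):
        out_row = []
        for x, s in enumerate(shift_row):
            src = min(max(x + s, 0.0), in_w - 1.0)  # clamp to valid columns
            left = int(src)
            right = min(left + 1, in_w - 1)
            frac = src - left
            out_row.append((1 - frac) * img_row[left] + frac * img_row[right])
        output.append(out_row)
    return output


def zero_pad_width(image, target_w):
    """Pad each row with zeros on the right up to target_w columns."""
    return [row + [0.0] * (target_w - len(row)) for row in image]
```

Because each output value is a linear blend of input pixels weighted by the fractional shift, the result varies smoothly with the shift map, which is what allows a loss on the warped image to be differentiated.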

The learning device 300 computes the loss of the output image from the shifter 200 and trains the convolution weights involved in generating the output image so as to minimize that loss. The convolution weights are the weights used in convolution operations in the encoder-decoder 100 and the shifter 200, and may be the weights of the convolution layers constituting the decoder 130 and the H-dimensional column vector of the shifter 200.

The learning device 300 uses the loss of the output image, which may include a content loss and a structure (shape) loss. The content loss helps preserve the important content of the input image. The structure loss E_s helps pixels in similar columns of the attention map take similar values.

The content loss indicates whether the main objects in the image are well preserved in the output image, and is used as training data for learning to preserve the main objects of the original image in the output image. If the output image resized from the original image is classified well by an image classifier, it can be inferred that the main parts of the original image are well preserved in the output image. Accordingly, the learning device 300 inputs the output image of the shifter 200 to an image classifier and defines the content loss based on the class scores output by the image classifier. The higher the class score (that is, the better the classification), the lower the content loss can be considered to be (that is, the main parts of the original image are well preserved in the output image). The learning device 300 may use a pre-trained deep learning model (for example, VGG16) as the image classifier. This may be the same model that the encoder 110 uses to output the features of the original image.

The learning apparatus 300 may calculate the content loss (E_c) of the output image as in Equation 6. In Equation 6, C and N are the number of classes of the image classifier and the number of training samples, respectively, and the two remaining symbols denote the ground-truth label and the sigmoid output of the classifier, respectively.
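Equation 6 itself is not reproduced in this text. As a minimal sketch, assuming it takes the form of a binary cross-entropy averaged over the N samples and C classes (an assumption that fits the ground-truth labels and sigmoid outputs described above, not a confirmed form of the equation), the content loss could be computed as follows. The function name `content_loss` and the `eps` clamp are illustrative and do not come from the patent.

```python
import math


def content_loss(labels, scores, eps=1e-12):
    """Mean binary cross-entropy over N samples and C classes.

    labels: N x C nested list of 0/1 ground-truth labels.
    scores: N x C nested list of sigmoid outputs in (0, 1).
    Assumed form only; the patent's Equation 6 may differ in detail.
    """
    n = len(labels)
    c = len(labels[0])
    total = 0.0
    for y_row, p_row in zip(labels, scores):
        for y, p in zip(y_row, p_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / (n * c)
```

Under this assumed form, a retargeted image that the classifier scores confidently and correctly yields a smaller loss than one it scores ambiguously, which matches the statement that a higher class score corresponds to a lower content loss.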

The structure loss indicates whether the resized output image looks unnatural, and is used as training data so that unnatural parts of the resized output image are minimized. If the shapes around corresponding pixels in the retargeted image and the original image are similar, the resizing can be considered natural.

The learning apparatus 300 may calculate the structure loss (E_s) of the output image as in Equation 7, by comparing low-level features of the output image and the input image. In Equation 7, the feature term is the output of the first convolution layer of a pre-trained deep learning model (for example, VGG16). The output of the first convolution layer represents low-level features such as an edge map.
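Equation 7 is not reproduced in this text, so the following is only a rough sketch of the idea of comparing low-level features at corresponding pixels. It makes several assumptions that are not in the patent: a horizontal finite difference stands in for the first convolution layer's output, the correspondence `corr` between output and input columns is given, and the comparison is a mean squared difference. All function and parameter names are hypothetical.

```python
def edge_features(img):
    """Horizontal finite difference; a hypothetical stand-in for conv1 features."""
    return [[row[min(x + 1, len(row) - 1)] - row[x] for x in range(len(row))]
            for row in img]


def structure_loss(original, retargeted, corr):
    """Mean squared difference of low-level features at corresponding pixels.

    original, retargeted: nested lists of grayscale values with equal height.
    corr[y][x]: column in `original` that output pixel (x, y) corresponds to
    (assumed to be known, for example derived from the shift map).
    """
    f_in = edge_features(original)
    f_out = edge_features(retargeted)
    total, count = 0.0, 0
    for y, row in enumerate(f_out):
        for x, value in enumerate(row):
            total += (value - f_in[y][corr[y][x]]) ** 2
            count += 1
    return total / count
```

With this stand-in, an output identical to the input gives a loss of zero, while an output that introduces an edge where the corresponding input pixels had none gives a positive loss.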

The learning apparatus 300 trains the convolution weights involved in generating the output image so that the content loss (E_c) and the structure loss (E_s) of the output image are minimized.

FIG. 5 is a diagram illustrating the performance of a learning method according to an embodiment.

Referring to FIG. 5, (a) is an input image, and (b) to (d) each show the attention map of the input image (left) and the retargeted image obtained by resizing the input image (right). Here, (b) is the retargeted image output by a model trained with the content loss only, (c) is the retargeted image output by a model trained with the content loss and the structure loss, and (d) is the retargeted image obtained from an attention map generated using the 1D duplicate convolution layer, with the model trained with the content loss and the structure loss.

Comparing (b) and (c) shows that training with the structure loss in addition to the content loss gives a better result than training with the content loss alone. Comparing (c) and (d) shows that the best retargeted image is output when the attention map is generated using the 1D duplicate convolution layer.

That is, the content loss helps preserve the important content of the input image. In the present invention, the structure loss and the 1D duplicate convolution layer help reduce image artifacts by making the pixels in a column of the image take similar values.

FIG. 6 is a flowchart of an image resizing method according to an embodiment.

Referring to FIG. 6, the image resizing apparatus 10 receives an original image and a target aspect ratio (S110).

The image resizing apparatus 10 encodes the original image through a neural network composed of a plurality of convolution layers to extract high-level features (S120). The plurality of convolution layers used for encoding may be a pre-trained deep learning model.

The image resizing apparatus 10 decodes the features of the original image through a neural network composed of a plurality of convolution layers and outputs an attention map (A_d) containing location information about meaningful regions (S130). The plurality of convolution layers used for decoding may have a structure that is inversely symmetric to the plurality of convolution layers used for encoding.

Based on the target aspect ratio, the image resizing apparatus 10 generates an attention map (A_r) by converting the attention map (A_d) to the size of the retargeted image (S140).

The image resizing apparatus 10 applies a one-dimensional duplicate convolution to the resized attention map (A_r) to generate an attention map (A_1D) that is uniform along each column (vertical direction) (S150).
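The one-dimensional duplicate convolution of step S150 can be sketched as follows, following the description in the claims: the attention map is convolved with a column vector to give a single row vector, and that row vector is replicated down the rows. This is a minimal sketch; the function name and the fixed example weights are assumptions, whereas in the patent the column vector's weights are learned.

```python
def duplicate_convolution_1d(attention, column_weights):
    """Collapse each column with a column vector, then replicate the row.

    attention: H x W nested list (the resized attention map A_r).
    column_weights: length-H list (learned in the patent; fixed here).
    Returns an H x W map whose rows are all identical (A_1D).
    """
    height = len(attention)
    width = len(attention[0])
    if len(column_weights) != height:
        raise ValueError("column_weights must have one weight per row")
    # One weighted sum per column gives a 1 x W row vector.
    row = [sum(column_weights[y] * attention[y][x] for y in range(height))
           for x in range(width)]
    # Replicate the row vector H times so every column is constant.
    return [list(row) for _ in range(height)]
```

Because every row of the result is the same, all pixels in a given column receive the same attention value, which is the uniformity along columns described above.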

The image resizing apparatus 10 generates the final attention map (A) as a weighted sum of the attention map (A_1D) generated by the one-dimensional duplicate convolution and the attention map (A_r) converted to the size of the retargeted image (S160).

The image resizing apparatus 10 generates a shift map using the final attention map (A) (S170). The shift map defines how far each pixel must be moved from the input image to the resized output image. The image resizing apparatus 10 may generate the shift map by cumulative normalization of the final attention map (A), as in Equation 5.
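The weighted sum of step S160 and the cumulative normalization of step S170 can be sketched as follows. Equation 5 is not reproduced in this section, so the details are assumptions: the mixing weight `lam`, the `max_shift` parameter (for example, the width difference between input and output), and the choice to rescale the running sum of each row so that its last value equals `max_shift`. Whether the attention values themselves or some transform of them are accumulated is determined by Equation 5 and is not settled here.

```python
def weighted_sum(a_1d, a_r, lam=0.5):
    """Final attention map A = lam * A_1D + (1 - lam) * A_r (lam is illustrative)."""
    return [[lam * p + (1.0 - lam) * q for p, q in zip(row_1d, row_r)]
            for row_1d, row_r in zip(a_1d, a_r)]


def shift_map(attention, max_shift):
    """Running sum along each row, rescaled so the last column equals max_shift.

    attention: H x W nested list of non-negative values.
    """
    shifts = []
    for row in attention:
        total = sum(row)
        running, out = 0.0, []
        for value in row:
            running += value
            out.append(max_shift * running / total if total else 0.0)
        shifts.append(out)
    return shifts
```

For non-negative attention values the resulting shifts increase monotonically along each row, so pixels keep their left-to-right order after warping.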

The image resizing apparatus 10 warps the input original image using the shift map and outputs the retargeted image (S180).
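The warping of step S180 can be sketched as follows for a horizontal resize. This is a sketch under assumptions not stated in this section: each output pixel (x, y) is sampled from the input at position x + shift, sampling uses linear interpolation between the two nearest input columns, and the image is a single grayscale channel. The function name is hypothetical.

```python
def warp_rows(image, shifts):
    """Sample output pixel (x, y) from input position x + shifts[y][x].

    image: H x W_in nested list of grayscale values.
    shifts: H x W_out nested list of horizontal shifts.
    Positions are clamped to the image and linearly interpolated.
    """
    out = []
    for src_row, shift_row in zip(image, shifts):
        last = len(src_row) - 1
        dst_row = []
        for x, s in enumerate(shift_row):
            pos = min(max(x + s, 0.0), float(last))  # clamp to image bounds
            left = int(pos)
            right = min(left + 1, last)
            frac = pos - left
            dst_row.append((1.0 - frac) * src_row[left] + frac * src_row[right])
        out.append(dst_row)
    return out
```

For example, a four-pixel row warped with shifts of 0 and 1 yields a two-pixel row that takes the first and third input pixels, so the region where the shift grows is the region that gets compressed.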

In this way, the image resizing apparatus 10 performs retargeting through a shift map, which is a pixel-wise mapping between the input image and the output image. Because the attention map obtained from high-level features is used for the content-aware shift map, important content can be preserved while low-importance background is stretched or shrunk.

FIG. 7 is a flowchart of a learning method according to an embodiment.

Referring to FIG. 7, the learning apparatus 300 calculates the loss of the output image output by the image resizing apparatus 10 and trains the image resizing apparatus 10 to minimize that loss. Through training by the learning apparatus 300, the convolution weights involved in generating the output image in the image resizing apparatus 10 are determined.

The learning apparatus 300 receives the original image input to the image resizing apparatus 10 and the retargeted image resized and output by the image resizing apparatus 10 (S210).

The learning apparatus 300 calculates the content loss based on the class score obtained by inputting the retargeted image to the image classifier (S220).

The learning apparatus 300 calculates the structure loss by comparing the shapes around corresponding pixels in the retargeted image and the original image (S230). As in Equation 7, the learning apparatus 300 may calculate the structure loss by comparing features output by a lower convolution layer (for example, low-level features such as an edge map) of a neural network composed of a plurality of convolution layers that extract image features.

The learning apparatus 300 trains the convolution weights in the image resizing apparatus 10 so that the content loss and the structure loss of the retargeted image are minimized (S240).

FIGS. 8 to 11 are comparative drawings illustrating the performance of the present invention.

Referring to FIG. 8, (a) is an input image, and (b) is an image obtained by linear interpolation (linear scaling) of the width of the image to the width of the retargeted image. (c) is a retargeted image according to the present invention; even though the width of the image is reduced to the width of the retargeted image, the main content of the image (the car) is preserved almost unchanged.

Referring to FIG. 9, the retargeted images are obtained by fixing the height of the input image (leftmost) and successively reducing the aspect ratio from 0.9 to 0.3.

(a) is the input image. (b) shows retargeted images produced by linear interpolation (linear scaling); because the width is reduced linearly from 0.9 to 0.3 without regard to content, the important content of the image (the person) is not preserved. (c) shows retargeted images produced by the present invention; even as the width of the image decreases, the main content of the image is preserved almost unchanged.

Referring to FIG. 10, the retargeted images are obtained by fixing the height of the image and enlarging it.

(a) is the input image. (b) is the image enlarged 1.5 times by linear interpolation, and (c) is the image enlarged 1.5 times by the present invention. (d) is the image enlarged 1.7 times by linear interpolation, and (e) is the image enlarged 1.7 times by the present invention.

As with image reduction, when the present invention enlarges an image, the background is stretched while the main content of the image is preserved almost unchanged.

Referring to FIG. 11, (a) shows input images. (b) shows retargeted images produced by linear interpolation. (c) shows retargeted images produced by manual cropping. (d) shows retargeted images produced by seam carving, a conventional retargeting method. (e) shows retargeted images produced by the present invention.

As shown, even when the vertical size of the image is reduced, the retargeted images produced by the present invention preserve the main content almost unchanged.

FIG. 12 shows the result of evaluating image classification accuracy as the image size changes.

Referring to FIG. 12, the horizontal axis is the ratio by which the size of the input image is reduced, and the vertical axis is the ratio of the classification result of the retargeted image to the classification result of the original image.

The present invention ("our") shows a markedly smaller drop in the classification accuracy of the retargeted image than the other methods, even as the size is reduced. That is, with the present invention the main part of the image remains well preserved after retargeting.

The embodiments of the present invention described above are not implemented only through an apparatus and a method; they may also be implemented through a program that realizes functions corresponding to the configuration of the embodiments, or through a recording medium on which such a program is recorded.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also fall within the scope of the present invention.

Claims

An image resizing apparatus operated by at least one processor, comprising:
an encoder-decoder that receives an original image, extracts high-level features of the original image through a plurality of convolution layers for encoding, and decodes the high-level features with a plurality of convolution layers for decoding to output a first attention map containing location information about meaningful regions; and
a shifter that generates a second attention map by resizing the first attention map to a target aspect ratio, generates, based on the second attention map, a shift map that moves pixels of the original image to an output image, and warps the original image with the shift map to output a retargeted image,
wherein the plurality of convolution layers for decoding are inversely symmetric to the plurality of convolution layers for encoding.

The image resizing apparatus of claim 1,
wherein the shifter
generates a third attention map that is uniform along each column by applying a 1D duplicate convolution to the second attention map, generates a final attention map as a weighted sum of the third attention map and the second attention map, and generates the shift map by cumulative normalization of the final attention map.

The image resizing apparatus of claim 2,
wherein the shifter
convolves the second attention map with a column vector to generate a one-dimensional row vector, and repeatedly replicates the row vector to generate the third attention map.

The image resizing apparatus of claim 3,
wherein the column vector has weights learned to minimize the loss of the retargeted image.

(Deleted)

The image resizing apparatus of claim 1,
wherein the plurality of convolution layers for decoding have weights learned to minimize the loss of the retargeted image.

The image resizing apparatus of claim 1,
further comprising a learning apparatus that calculates the loss of the retargeted image and trains the convolution weights involved in outputting the retargeted image in the encoder-decoder and the shifter so as to minimize the loss.

The image resizing apparatus of claim 7,
wherein the loss includes a content loss, and
the learning apparatus
calculates the content loss based on a class score obtained by inputting the retargeted image to an image classifier.

The image resizing apparatus of claim 8,
wherein the loss further includes a structure loss, and
the learning apparatus
calculates the structure loss by comparing the shapes around corresponding pixels in the retargeted image and the original image.

An image resizing apparatus operated by at least one processor, comprising:
an encoder-decoder that receives an original image, extracts high-level features of the original image through a plurality of convolution layers for encoding, and decodes the high-level features with a plurality of convolution layers for decoding to output a first attention map containing location information about meaningful regions;
a shifter that receives a target aspect ratio, generates a second attention map by resizing the first attention map to the target aspect ratio, and outputs, based on the second attention map, a retargeted image in which the original image is resized to the target aspect ratio; and
a learning apparatus that calculates the loss of the retargeted image and trains the weights of the plurality of convolution layers for decoding so as to minimize the loss,
wherein the loss includes a content loss and a structure loss, and
the learning apparatus
calculates the content loss based on a class score obtained by inputting the retargeted image to an image classifier, and calculates the structure loss by comparing the shapes around corresponding pixels in the retargeted image and the original image.

(Deleted)

The image resizing apparatus of claim 10,
wherein the image classifier is composed of the plurality of convolution layers for encoding.

The image resizing apparatus of claim 12,
wherein the learning apparatus
inputs each of the original image and the retargeted image to the plurality of convolution layers for encoding, and calculates the structure loss by comparing the features of the original image and the features of the retargeted image output by a lower convolution layer among the plurality of convolution layers for encoding.

An image resizing method of an image resizing apparatus operated by at least one processor, the method comprising:
receiving a target aspect ratio and a first attention map containing location information about meaningful regions of an original image;
generating a second attention map by resizing the first attention map to the target aspect ratio;
generating a third attention map that is uniform along each column by applying a 1D duplicate convolution to the second attention map;
generating a shift map using a final attention map obtained as a weighted sum of the third attention map and the second attention map; and
outputting a retargeted image by resizing the original image using the shift map.

The method of claim 14,
wherein generating the shift map
comprises generating the shift map by cumulative normalization of the final attention map.

The method of claim 14,
wherein generating the third attention map
comprises convolving the second attention map with a column vector to generate a one-dimensional row vector, and repeatedly replicating the row vector to generate the third attention map.

The method of claim 14,
wherein the first attention map is the result of decoding high-level features of the original image with a plurality of convolution layers for decoding.

The method of claim 17, further comprising:
calculating a content loss of the retargeted image based on a class score obtained by inputting the retargeted image to an image classifier;
calculating a structure loss of the retargeted image by comparing the shapes around corresponding pixels in the retargeted image and the original image; and
training the convolution weights involved in outputting the retargeted image so that the content loss and the structure loss are minimized.