KR20190069893A

KR20190069893A - Apparatus and method for retargeting images based on content-awareness

Info

Publication number: KR20190069893A
Application number: KR1020170170238A
Authority: KR
Inventors: 권인소; 조동현
Original assignee: 한국과학기술원
Priority date: 2017-12-12
Filing date: 2017-12-12
Publication date: 2019-06-20
Also published as: KR102051597B1

Abstract

An image size adjustment device operated by at least one processor comprises: an encoder-decoder receiving an original image, extracting high-level features of the original image through a plurality of convolutional layers for encoding, decoding the high-level features by a plurality of convolutional layers for decoding, and outputting a first attention map including positional information for a meaningful region; and a shifter generating a second attention map in which the first attention map is scaled by a target aspect ratio, generating a shift map for shifting pixels of the original image to an output image based on the second attention map, and warping the original image with the shift map to output a retargeted image. Thus, the loss of main content and objects in the image is minimized.

Description

APPARATUS AND METHOD FOR RETARGETING IMAGES BASED ON CONTENT-AWARENESS

The present invention relates to a retargeting method for adjusting image size.

Many image retargeting methods have been studied so far; they can broadly be classified into seam carving based methods and warping based methods. Seam carving based retargeting changes the aspect ratio of an image by repeatedly removing or inserting seams in unimportant regions. Warping based retargeting continuously transforms an input image into an image of the target size. It iteratively computes optimal scaling factors for local regions and updates a warped image, guided by an edge map and a saliency map.

Thus, to find important and semantic regions within an image, the prior art computes a saliency map from low-level hand-crafted features such as an edge map, and adjusts the image size based on it. However, low-level hand-crafted features cannot accurately identify the meaningful regions of an image, so the image ends up retargeted without properly retaining its semantically important parts. For example, FIG. 1 shows image retargeting results obtained with low-level features, in which key parts of the image are lost or distorted. This is a fundamental limitation of techniques that adjust image size using low-level features.

An object of the present invention is to provide an apparatus and method that extract high-level features of an input image through a neural network, use the high-level features to obtain an attention map containing positional information on meaningful regions, and adjust the size of the input image using a shift map generated from the attention map.

Another object of the present invention is to provide an apparatus and method that calculate the content loss and structure loss of a resized output image and train the convolution weights involved in generating the output image so as to minimize the loss of the output image.

An image size adjustment apparatus operated by at least one processor according to one embodiment includes: an encoder-decoder that receives an original image, extracts high-level features of the original image through a plurality of convolutional layers for encoding, and decodes the high-level features with a plurality of convolutional layers for decoding to output a first attention map containing positional information on meaningful regions; and a shifter that generates a second attention map by resizing the first attention map to a target aspect ratio, generates, based on the second attention map, a shift map that moves pixels of the original image to an output image, and warps the original image with the shift map to output a retargeted image.

The shifter may generate a third attention map that is uniform along each column by applying a 1D duplicate convolution to the second attention map, form a final attention map as a weighted sum of the third attention map and the second attention map, and generate the shift map by cumulative normalization of the final attention map.

The shifter may convolve the second attention map with a column vector to generate a one-dimensional row vector, and repeatedly duplicate the row vector to generate the third attention map.

The column vector may have weights trained to minimize the loss of the retargeted image.

The plurality of convolutional layers for decoding may be inversely symmetric to (that is, mirror the order of) the plurality of convolutional layers for encoding.

The plurality of convolutional layers for decoding may have weights trained to minimize the loss of the retargeted image.

The image size adjustment apparatus may further include a learning device that calculates the loss of the retargeted image and trains the convolution weights involved in outputting the retargeted image, in the encoder-decoder and the shifter, so that the loss is minimized.

The loss may include a content loss, and the learning device may calculate the content loss based on a classification score obtained by inputting the retargeted image into an image classifier.

The loss may further include a structure loss, and the learning device may calculate the structure loss by comparing the shapes around mutually corresponding pixels in the retargeted image and the original image.

An image size adjustment apparatus operated by at least one processor according to another embodiment includes: an encoder-decoder that receives an original image, extracts high-level features of the original image through a plurality of convolutional layers for encoding, and decodes the high-level features with a plurality of convolutional layers for decoding to output a first attention map containing positional information on meaningful regions; a shifter that receives a target aspect ratio, generates a second attention map by resizing the first attention map to the target aspect ratio, and outputs, based on the second attention map, a retargeted image in which the original image is resized to the target aspect ratio; and a learning device that calculates the loss of the retargeted image and trains the weights of the plurality of convolutional layers for decoding so that the loss is minimized.

The loss may include a content loss and a structure loss. The learning device may calculate the content loss based on a classification score obtained by inputting the retargeted image into an image classifier, and may calculate the structure loss by comparing the shapes around mutually corresponding pixels in the retargeted image and the original image.

The image classifier may be composed of the plurality of convolutional layers for encoding.

The learning device may input each of the original image and the retargeted image into the plurality of convolutional layers for encoding, and then calculate the structure loss by comparing the features of the original image with the features of the retargeted image, both output from a lower convolutional layer among the plurality of convolutional layers for encoding.

An image size adjustment method of an image size adjustment apparatus operated by at least one processor according to yet another embodiment includes: receiving a target aspect ratio and a first attention map containing positional information on meaningful regions of an original image; generating a second attention map by resizing the first attention map to the target aspect ratio; generating a third attention map that is uniform along each column by applying a 1D duplicate convolution to the second attention map; generating a shift map using a final attention map obtained as a weighted sum of the third attention map and the second attention map; and outputting a retargeted image in which the original image is resized using the shift map.

The generating of the shift map may generate the shift map by cumulative normalization of the final attention map.

The generating of the third attention map may convolve the second attention map with a column vector to generate a one-dimensional row vector, and repeatedly duplicate the row vector to generate the third attention map.

The first attention map may be the result of decoding high-level features of the original image with a plurality of convolutional layers for decoding.

The image size adjustment method may further include: calculating a content loss of the retargeted image based on a classification score obtained by inputting the retargeted image into an image classifier; calculating a structure loss of the retargeted image by comparing the shapes around mutually corresponding pixels in the retargeted image and the original image; and training the convolution weights involved in outputting the retargeted image so that the content loss and the structure loss are minimized.

According to embodiments of the present invention, the image size is converted by stretching or shrinking relatively less important regions while preserving regions considered important, so the loss of main content and objects in the image can be minimized. Embodiments of the present invention can be applied to various systems that require image size conversion.

FIG. 1 shows image retargeting results using conventional low-level features.
FIG. 2 is a configuration diagram of an image size adjustment apparatus according to one embodiment.
FIG. 3 is a detailed configuration of an image size adjustment apparatus according to one embodiment.
FIG. 4 is a diagram illustrating a method of generating a final attention map using 1D duplicate convolution according to one embodiment.
FIG. 5 is a diagram showing the performance of a learning method according to one embodiment.
FIG. 6 is a flowchart of an image size adjustment method according to one embodiment.
FIG. 7 is a flowchart of a learning method according to one embodiment.
FIGS. 8 to 11 are comparative diagrams illustrating the performance of the present invention.
FIG. 12 shows the result of evaluating image classification accuracy as the image size changes.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are denoted by similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise. In addition, terms such as "unit", "-er", and "module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.

FIG. 2 is a configuration diagram of an image size adjustment apparatus according to one embodiment, FIG. 3 is a detailed configuration of the image size adjustment apparatus according to one embodiment, and FIG. 4 is a diagram illustrating a method of generating a final attention map using 1D duplicate convolution according to one embodiment.

Referring to FIGS. 2 and 3, the image size adjustment apparatus 10 receives an original image and a target aspect ratio, resizes the original image to the target aspect ratio, and outputs a retargeted image. The image obtained by resizing the original image to the target aspect ratio is called the retargeted image. The original image is the input image of the image size adjustment apparatus 10, and the retargeted image is its output image.

The image size adjustment apparatus 10 is operated by at least one processor and includes an encoder-decoder 100 that generates an attention map for the original image, and a shifter 200 that converts the attention map to the target aspect ratio.

The learning device 300 trains the convolution weights involved in generating the output image so that the retargeted image output from the shifter 200 is converted naturally while preserving meaningful regions well. The learning device 300 may be included in the image size adjustment apparatus 10, but it is described separately here in order to explain image size adjustment and learning separately.

The encoder-decoder 100 receives the original image and, using high-level features of the original image, outputs an attention map containing positional information on semantic regions. The encoder-decoder 100 consists of an encoder 110 that extracts high-level features of the original image using a neural network, and a decoder 130 that generates the attention map by progressively upsampling the features output from the encoder 110.

The encoder 110 may be a neural network composed of a plurality of convolutional layers and pooling layers that pool their outputs. The neural network may be a deep convolutional network, and a pre-trained deep learning model (for example, VGG16) may be used. In this case, the weights of the convolutional layers constituting the encoder 110 are assumed to be already fixed.

The decoder 130 has convolutional layers arranged inversely symmetric to those of the encoder, and generates the attention map from the features output by the encoder 110 through upsampling. Although the decoder 130 has a structure inversely symmetric to the encoder 110, some layers may be changed as needed. For example, the ReLU (Rectified Linear Unit) layers of the encoder 110 may be replaced by ELU (Exponential Linear Unit) layers in the decoder 130. The weights of the convolutional layers constituting the decoder 130 are trained by the learning device 300.
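For reference, the two activations named above can be sketched with their standard textbook definitions. The value `alpha=1.0` is an assumed default for illustration; the source does not state which ELU parameter is used.

```python
import math

def relu(x):
    """Rectified Linear Unit: passes positive values, clamps the rest to zero."""
    return x if x > 0.0 else 0.0

def elu(x, alpha=1.0):
    """Exponential Linear Unit: identity for positive inputs, smooth negative tail.

    alpha is an assumed default, not a value taken from the patent.
    """
    return x if x > 0.0 else alpha * (math.exp(x) - 1.0)
```

Unlike ReLU, ELU keeps a non-zero output and gradient for negative inputs.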

The shifter 200 receives from the decoder 130 the attention map generated from the high-level features of the original image. The shifter 200 can therefore generate a content-aware shift map using the attention map, which contains positional information on meaningful regions. Because this lets the shifter 200 accurately identify meaningful regions, it can perform retargeting that stretches or shrinks relatively less important regions (for example, the background) while preserving regions considered important in the image.

The shifter 200 generates a shift map based on the attention map of the original image output from the decoder 130, and warps the original image based on the shift map to output the retargeted image. The shift map S(x,y) defines how far each pixel must move from the input image I of size W x H to the output image of size W' x H. For explanation, the case where only the width of the image changes is used as an example, but the height is changed in the same way as the width. The relationship between the input image I and the output image O through the shift map S(x,y) is given by Equation 1. In Equation 1, (x,y) are the spatial coordinates of the output image. When only the width of the image changes, S takes values in the range from 0 to the width change (W - W').
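Equation 1 itself is not reproduced in this text. One form consistent with the description (output coordinates (x,y), a horizontal-only shift, and S bounded by the width change) would be the following; it is offered as an assumed reading, not as the patent's exact formula.

```latex
O(x, y) = I\bigl(x + S(x, y),\; y\bigr), \qquad 0 \le S(x, y) \le W - W'
```

Under this reading, an output column x in [0, W'] samples the input at x + S(x,y), which stays within [0, W].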

Because the attention map A_d output from the decoder 130 contains positional information on meaningful regions, it could be used directly for content-aware image retargeting. However, to maintain the overall shape of the image, pixels in similar columns must have similar shift values; otherwise the overall shape of the image may be distorted. Therefore, instead of using the attention map A_d of the original image directly for retargeting, the shifter 200 uses an attention map whose shape is constrained to be uniform along each column (vertically) by means of 1D duplicate convolution.

To this end, the shifter 200 includes a shift map generator 210 and a warping unit 230.

First, the shift map generator 210 resizes the attention map A_d output from the decoder 130 to the retargeted image size. The shift map generator 210 may resize the attention map A_d to the retargeted image size based on the target aspect ratio. The attention map A_r resized to the retargeted image size is expressed by Equation 2. In Equation 2, A_r is the attention map adjusted to the size of the retargeted image, A_d is the attention map output from the decoder 130, and R is the output image size, that is, the size of the retargeted image.
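The resizing step can be sketched as follows for the width-only case. The source does not say which interpolation is used, so the linear resampling here is an assumption, and `resize_width` is a hypothetical helper name.

```python
def resize_width(attention, out_width):
    """Linearly resample each row of an H x W map to out_width columns.

    Linear interpolation is an assumption; the patent only says "resize".
    """
    in_width = len(attention[0])
    resized = []
    for row in attention:
        new_row = []
        for x in range(out_width):
            # Map output column x onto the input axis so both endpoints align.
            pos = x * (in_width - 1) / (out_width - 1) if out_width > 1 else 0.0
            left = int(pos)
            right = min(left + 1, in_width - 1)
            frac = pos - left
            new_row.append((1.0 - frac) * row[left] + frac * row[right])
        resized.append(new_row)
    return resized

# A_d with W = 5 columns, resized to a target width W' = 3.
a_d = [[0.0, 1.0, 2.0, 3.0, 4.0]]
a_r = resize_width(a_d, 3)
```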

Next, the shift map generator 210 applies 1D duplicate convolution to the resized attention map A_r so that it becomes uniform along each column (vertically), thereby correcting distortion.

Referring to FIG. 4, the shift map generator 210 convolves the resized attention map A_r (H x W) with an H-dimensional column vector (H x 1) to convert it into a one-dimensional row vector (1 x W). The weights of the H-dimensional column vector are trained by the learning device 300. The shift map generator 210 duplicates the one-dimensional row vector H times to generate the attention map A_1D (H x W) by 1D duplicate convolution. The 1D duplicate convolution is expressed by Equation 3.

In Equation 3, A_r is the attention map resized to the size of the retargeted image, and A_1D is the attention map generated uniformly along each column (vertically) by 1D duplicate convolution. The convolution operator in Equation 3 denotes convolution with the H-dimensional column vector, and the repeat operator denotes the operation of repeating the one-dimensional row vector H times.
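The 1D duplicate convolution described above can be sketched in plain Python. The function name `duplicate_conv_1d` and the fixed example weights are illustrative assumptions; in the patent the column-vector weights are learned.

```python
def duplicate_conv_1d(attention, column_weights):
    """Collapse an H x W map to a 1 x W row with an H-dim column vector,
    then duplicate that row H times so every column holds one value."""
    height = len(attention)
    width = len(attention[0])
    if len(column_weights) != height:
        raise ValueError("column vector must have one weight per row")
    # Convolving an H x W map with an H x 1 kernel (no padding) yields 1 x W.
    row = [
        sum(column_weights[y] * attention[y][x] for y in range(height))
        for x in range(width)
    ]
    # Repeat the row H times to restore the H x W shape.
    return [list(row) for _ in range(height)]

a_r = [[0.0, 2.0, 4.0],
       [2.0, 2.0, 0.0]]
weights = [0.5, 0.5]  # stand-in for the learned H-dimensional column vector
a_1d = duplicate_conv_1d(a_r, weights)
```

Every row of `a_1d` is identical, which is what makes the map uniform along each column.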

Because the attention map A_1D generated by 1D duplicate convolution is filled only with values that are uniform along each column (vertically), the shift map generator 210 generates the final attention map A as a weighted sum of A_1D and the attention map A_r resized to the retargeted image size. The final attention map A is expressed by Equation 4, in which λ is a weighting parameter.
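A minimal sketch of the weighted sum follows. Equation 4 is not reproduced in this text, so which term λ multiplies is an assumption; here it is applied to A_1D, and `combine_attention` is a hypothetical name.

```python
def combine_attention(a_r, a_1d, lam):
    """Element-wise weighted sum of the resized map and the column-uniform map.

    Assumed form: A = A_r + lam * A_1D (the placement of lam is a guess).
    """
    return [
        [r + lam * d for r, d in zip(row_r, row_d)]
        for row_r, row_d in zip(a_r, a_1d)
    ]

final_attention = combine_attention([[1.0, 2.0]], [[4.0, 4.0]], 0.5)
```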

The shift map generator 210 generates the shift map S(x,y) by cumulative normalization of the final attention map A. The shift map S(x,y) may be calculated as in Equation 5, in which W is the width of the input image and W' is the width of the output image.

The shift map generator 210 performs the cumulative normalization so that the shift map increases monotonically. The input image is warped to the output image by the shift map. If the order of the input pixels were to change, the output image would not be rendered properly, so the order of the pixels must not be reversed. That is, cumulative normalization is performed so that the values of the shift map increase monotonically along the spatial axis (x, y direction). Cumulative normalization maps the values into the range 0 to 1, and multiplying by the width change (W - W') makes the range of the shift map values 0 to (W - W'). Where the final attention map has low values, the shift map increases only slightly, so that part occupies a large area in the output image; conversely, where the final attention map has high values, the shift map increases sharply, so that part occupies a small area in the output image.
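Cumulative normalization can be sketched as a running sum along each row, divided by the row total and scaled by the width change. This assumes non-negative attention values and an inclusive running sum; neither detail is confirmed by the text, and `shift_map_from_attention` is a hypothetical name.

```python
def shift_map_from_attention(attention, in_width):
    """Build a monotonically non-decreasing shift map from an H x W' attention map.

    Assumes non-negative attention values with a positive sum in every row.
    """
    out_width = len(attention[0])
    scale = in_width - out_width  # width change W - W'
    shift = []
    for row in attention:
        total = sum(row)
        if total <= 0:
            raise ValueError("each row needs a positive attention sum")
        running = 0.0
        shift_row = []
        for value in row:
            running += value  # cumulative sum along x
            shift_row.append(scale * running / total)  # in [0, W - W']
        shift.append(shift_row)
    return shift

# Input width W = 8, output width W' = 4, so shifts lie in [0, 4].
shift = shift_map_from_attention([[1.0, 1.0, 2.0, 0.0]], in_width=8)
```

The shift grows fastest where the attention value is largest (the third column here) and stays flat where it is zero.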

The warping unit 230 warps the original image (input image) using the shift map generated by the shift map generator 210 to produce the retargeted image (output image). Retargeting is performed according to Equation 1.

Because the shift map has sub-pixel precision, linear interpolation is performed using four neighboring pixels. A differentiable loss can be applied to the linearly warped image. The retargeted image is also used to train the neural network; for this purpose, the retargeted image may be zero-padded to match the input size of the neural network model.
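A minimal warping sketch for the width-only case is shown below. It assumes output pixel (x, y) samples the input at (x + S(x, y), y); with a purely horizontal shift and integer y, the four-neighbor interpolation reduces to the two neighbors on the same row. The function names are hypothetical.

```python
import math

def warp_row(src_row, shift_row):
    """Sample one input row at sub-pixel positions x + S(x) with linear interpolation."""
    last = len(src_row) - 1
    out = []
    for x, s in enumerate(shift_row):
        pos = min(max(x + s, 0.0), float(last))  # clamp to the valid input range
        left = int(math.floor(pos))
        right = min(left + 1, last)
        frac = pos - left
        out.append((1.0 - frac) * src_row[left] + frac * src_row[right])
    return out

def warp_image(image, shift):
    """Warp an H x W image to H x W' using an H x W' shift map."""
    return [warp_row(row, shift_row) for row, shift_row in zip(image, shift)]

image = [[0, 10, 20, 30, 40, 50]]   # W = 6
shift = [[0.0, 0.5, 1.5, 2.0]]      # W' = 4, values in [0, W - W'] = [0, 2]
retargeted = warp_image(image, shift)
```

Because each output value is a linear blend of input pixels, the result is differentiable with respect to the shift values almost everywhere, which is what allows a loss on the warped image to drive training.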

The learning device 300 calculates the loss of the output image from the shifter 200 and trains the convolution weights involved in generating the output image so as to minimize that loss. The convolution weights are the weights used in convolution operations in the encoder-decoder 100 and the shifter 200, and may be the weights of the convolutional layers constituting the decoder 130 and the H-dimensional column vector of the shifter 200.

The learning device 300 uses the loss of the output image, which may include a content loss and a structure (shape) loss. The content loss helps preserve the important content of the input image. The structure loss E_s helps pixels in similar columns of the attention map take similar values.

The content loss is information indicating whether the main objects in the image are well preserved in the output image, and serves as training data for learning to preserve the main objects of the original image in the output image. If the output image resized from the original image is classified well by an image classifier, it can be inferred that the main parts of the original image are well preserved in the output image. Accordingly, the learning device 300 inputs the output image of the shifter 200 into the image classifier and defines the content loss based on the class score output by the classifier. The higher the class score (that is, the better the classification), the smaller the content loss (that is, the better the main parts of the original image are preserved in the output image). The learning device 300 may use a pre-trained deep learning model (for example, VGG16) as the image classifier. This may be the same model that outputs the features of the original image in the encoder 110.

The learning apparatus 300 may calculate the content loss (E_c) of the output image as in Equation 6. In Equation 6, C and N are the number of classes of the image classifier and the number of training samples, and the two remaining symbols denote the ground-truth label and the sigmoid output of the classifier, respectively.
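The following is a minimal sketch of a content loss consistent with the definitions above. It assumes a sigmoid cross-entropy summed over the C classes and averaged over the N samples; the exact form of Equation 6 is not reproduced here, so this form, the function name `content_loss`, and the clipping constant `eps` are illustrative assumptions.

```python
import numpy as np

def content_loss(labels, scores, eps=1e-12):
    """Sigmoid cross-entropy (assumed form, not necessarily Equation 6).

    labels: (N, C) array of ground-truth labels in {0, 1}.
    scores: (N, C) array of classifier sigmoid outputs in (0, 1).
    """
    labels = np.asarray(labels, dtype=float)
    # Clip to avoid log(0); eps is a hypothetical numerical guard.
    scores = np.clip(np.asarray(scores, dtype=float), eps, 1.0 - eps)
    per_class = labels * np.log(scores) + (1.0 - labels) * np.log(1.0 - scores)
    # Sum over the C classes, then average over the N samples.
    return float(-per_class.sum(axis=1).mean())
```

Under this assumed form, a retargeting image that the classifier scores confidently and correctly yields a smaller loss than one it scores ambiguously, which matches the statement that a higher class score corresponds to a lower content loss.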

The structure loss indicates whether the resized output image looks unnatural, and serves as a training signal for minimizing unnatural regions in the resized output image. If the shapes around corresponding pixels in the retargeting image and the original image are similar, the resizing can be presumed to be natural.

The learning apparatus 300 may calculate the structure loss (E_s) of the output image as in Equation 7. The learning apparatus 300 may calculate the structure loss (E_s) by comparing low-level features of the output image and the input image. In Equation 7, the feature term denotes the output of the first convolutional layer of a pretrained deep learning model (e.g., VGG16). The output of this first convolutional layer represents low-level features such as an edge map.
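As a rough illustration of comparing low-level features at corresponding pixels, the sketch below uses simple image gradients as a stand-in for the first convolutional layer of a pretrained network and a mean squared difference as the comparison. Both choices, the integer correspondence array `src_cols`, and the function names are assumptions for illustration; the exact form of Equation 7 is not reproduced here.

```python
import numpy as np

def low_level_features(img):
    """Stand-in for first-layer features: horizontal and vertical gradients."""
    img = np.asarray(img, dtype=float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    return np.stack([gx, gy], axis=-1)  # shape (H, W, 2)

def structure_loss(original, retargeted, src_cols):
    """Mean squared feature difference at corresponding pixels (assumed form).

    src_cols[y, x] is the integer column of the original image that
    corresponds to output pixel (x, y); shape (H, W_out).
    """
    f_in = low_level_features(original)
    f_out = low_level_features(retargeted)
    rows = np.arange(f_out.shape[0])[:, None]
    f_in_at_src = f_in[rows, src_cols]  # gather, shape (H, W_out, 2)
    return float(np.mean((f_out - f_in_at_src) ** 2))
```

With an identity correspondence and identical images the loss is zero; it grows as the local edge structure of the output departs from that of the corresponding input pixels.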

The learning apparatus 300 trains the convolution weights involved in generating the output image so that the content loss (E_c) and the structure loss (E_s) of the output image are minimized.

FIG. 5 is a diagram illustrating the performance of a learning method according to an embodiment.

Referring to FIG. 5, (a) is an input image, and (b) to (d) each show the attention map of the input image (left) and the retargeting image obtained by resizing the input image (right). Here, (b) is the retargeting image output by a model trained with the content loss only, (c) is the retargeting image output by a model trained with the content loss and the structure loss, and (d) is the retargeting image obtained from an attention map generated using the 1D duplicate convolution layer, with the model trained with the content loss and the structure loss.

Comparing (b) and (c) shows that training with the structure loss in addition to the content loss gives better results than training with the content loss alone. Comparing (c) and (d) shows that the best retargeting image is output when the attention map is generated using the 1D duplicate convolution layer.

That is, the content loss contributes to preserving the important content of the input image. In the present invention, the structure loss and the 1D duplicate convolution layer contribute to reducing image artifacts by making the pixels in a column of the image take similar values.

FIG. 6 is a flowchart of an image size adjustment method according to an embodiment.

Referring to FIG. 6, the image size adjustment device 10 receives an original image and a target aspect ratio (S110).

The image size adjustment device 10 encodes the original image through a neural network composed of a plurality of convolutional layers to extract high-level features (S120). The plurality of convolutional layers used for encoding may be a pretrained deep learning model.

The image size adjustment device 10 decodes the features of the original image through a neural network composed of a plurality of convolutional layers and outputs an attention map (A_d) including positional information on meaningful regions (S130). The plurality of convolutional layers used for decoding may have an inversely symmetric (mirrored) structure with respect to the plurality of convolutional layers used for encoding.

The image size adjustment device 10 generates an attention map (A_r) by converting the attention map (A_d) to the retargeting image size based on the target aspect ratio (S140).

The image size adjustment device 10 applies 1D duplicate convolution to the resized attention map (A_r) to generate an attention map (A_1D) that is uniform along each column (vertical direction) (S150).

The image size adjustment device 10 generates a final attention map (A) by taking a weighted sum of the attention map (A_1D) generated by the 1D duplicate convolution and the attention map (A_r) converted to the size of the retargeting image (S160).
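Steps S150 and S160 can be sketched as follows, assuming the 1D duplicate convolution collapses each column of A_r with a single length-H column vector into a row vector and then replicates that row H times. The function names, the weight vector `col_weights`, and the mixing coefficients `w_1d` and `w_r` are hypothetical; in the described device the column-vector weights would be learned.

```python
import numpy as np

def duplicate_conv_1d(attn, col_weights):
    """Collapse each column of attn (H, W) with a length-H column vector,
    then replicate the resulting row vector H times."""
    attn = np.asarray(attn, dtype=float)
    row = np.asarray(col_weights, dtype=float) @ attn  # shape (W,)
    return np.tile(row, (attn.shape[0], 1))            # shape (H, W)

def final_attention(attn_r, col_weights, w_1d=1.0, w_r=0.5):
    """Weighted sum of the column-uniform map and the resized map.
    w_1d and w_r are illustrative values, not taken from the source."""
    attn_1d = duplicate_conv_1d(attn_r, col_weights)
    return w_1d * attn_1d + w_r * np.asarray(attn_r, dtype=float)
```

Because every row of the replicated map is identical, all pixels in a column share one value, which is the column-wise uniformity the step is meant to enforce; the weighted sum then reintroduces some of the per-pixel detail of A_r.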

The image size adjustment device 10 generates a shift map using the final attention map (A) (S170). The shift map defines how far each pixel should move from the input image to the resized output image. The image size adjustment device 10 may generate the shift map by cumulative normalization of the final attention map (A), as in Equation 5.
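One way to read "cumulative normalization" is sketched below: accumulate the map along each row, divide by the row total, and scale by the width difference between input and output. This particular form, the function name `shift_map`, and the requirement that each row of the map has a positive sum are assumptions; Equation 5 is not reproduced here and may differ.

```python
import numpy as np

def shift_map(attn, in_width):
    """Cumulative normalization along each row (assumed form).

    attn: (H, W_out) non-negative map with a positive sum in every row.
    Returns a horizontal shift, in input pixels, for every output pixel.
    """
    attn = np.asarray(attn, dtype=float)
    out_width = attn.shape[1]
    cum = np.cumsum(attn, axis=1)   # running sum along x
    total = cum[:, -1:]             # per-row total, shape (H, 1)
    return (in_width - out_width) * cum / total
```

With a non-negative map the shift is monotonic along each row, so pixel order is preserved, and the shift of the last output column equals the full width difference, so the right edge of the output lands on the right edge of the input.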

The image size adjustment device 10 warps the input original image using the shift map and outputs the retargeting image (S180).
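The warping step can be illustrated with a backward warp: each output pixel reads the input at its own x-coordinate plus the shift, with linear interpolation between neighboring columns. The sketch assumes a single-channel image and a purely horizontal shift; the function name `warp_rows` and the boundary clamping are illustrative choices.

```python
import numpy as np

def warp_rows(image, shift):
    """Backward-warp a grayscale image (H, W_in) with a horizontal
    shift map (H, W_out), using linear interpolation."""
    image = np.asarray(image, dtype=float)
    shift = np.asarray(shift, dtype=float)
    h, w_in = image.shape
    w_out = shift.shape[1]
    # Source x-coordinate in the input for every output pixel.
    src = np.clip(np.arange(w_out)[None, :] + shift, 0, w_in - 1)
    x0 = np.floor(src).astype(int)
    x1 = np.minimum(x0 + 1, w_in - 1)
    frac = src - x0
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * image[rows, x0] + frac * image[rows, x1]
```

Linear interpolation keeps the output a smooth function of the shift values, which is what allows a loss on the output image to be propagated back to the weights that produced the shift map.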

In this way, the image size adjustment device 10 performs retargeting through a shift map, which is a pixel-wise mapping between the input image and the output image. Because the attention map obtained from high-level features is used to build a content-aware shift map, important content can be preserved while the less important background is stretched or shrunk.

FIG. 7 is a flowchart of a learning method according to an embodiment.

Referring to FIG. 7, the learning apparatus 300 calculates the loss of the output image output from the image size adjustment device 10 and trains the image size adjustment device 10 to minimize that loss. Through training by the learning apparatus 300, the convolution weights involved in generating the output image in the image size adjustment device 10 are determined.

The learning apparatus 300 receives the original image input to the image size adjustment device 10 and the retargeting image resized and output by the image size adjustment device 10 (S210).

The learning apparatus 300 calculates the content loss based on the class score obtained by inputting the retargeting image to the image classifier (S220).

The learning apparatus 300 calculates the structure loss by comparing the shapes around corresponding pixels in the retargeting image and the original image (S230). As in Equation 7, the learning apparatus 300 may calculate the structure loss by comparing features output from a lower convolutional layer (e.g., low-level features such as an edge map) of a neural network composed of a plurality of convolutional layers that extract image features.

The learning apparatus 300 trains the convolution weights in the image size adjustment device 10 so that the content loss and the structure loss of the retargeting image are minimized (S240).

FIGS. 8 to 11 are comparative diagrams illustrating the performance of the present invention.

Referring to FIG. 8, (a) is an input image, and (b) is an image whose width has been linearly scaled to the width of the retargeting image. (c) is a retargeting image according to the present invention; even though the width is reduced to that of the retargeting image, the main content of the image (the car) remains almost unchanged.

Referring to FIG. 9, retargeting images are shown in which the height of the input image (leftmost) is fixed and the aspect ratio is reduced successively from 0.9 to 0.3.

(a) is the input image. (b) shows retargeting images obtained by linear scaling; because the width is reduced linearly from 0.9 to 0.3 without regard to content, the important content of the image (the person) is not preserved. (c) shows retargeting images according to the present invention; even as the width of the image decreases, the main content remains almost unchanged.

Referring to FIG. 10, retargeting images are shown in which the height of the image is fixed and the image is enlarged.

(a) is the input image. (b) is an image enlarged 1.5 times by linear scaling, and (c) is an image enlarged 1.5 times by the present invention. (d) is an image enlarged 1.7 times by linear scaling, and (e) is an image enlarged 1.7 times by the present invention.

As with reduction, when the present invention enlarges an image, the main content remains almost unchanged while the background is stretched.

Referring to FIG. 11, (a) shows input images. (b) shows retargeting images obtained by linear scaling. (c) shows retargeting images obtained by manual cropping. (d) shows retargeting images obtained by seam carving, a conventional retargeting method. (e) shows retargeting images according to the present invention.

Thus, even when the height of the image is reduced, the retargeting images according to the present invention preserve the main content almost unchanged.

FIG. 12 shows the results of evaluating image classification accuracy as the image size changes.

Referring to FIG. 12, the horizontal axis is the ratio by which the input image is reduced, and the vertical axis is the ratio of the classification result of the retargeting image to the classification result of the original image.

The present invention (labeled "our") shows a markedly smaller drop in the classification accuracy of the retargeting image than the other methods, even as the size is reduced. That is, the main parts of the image are well preserved even after retargeting.

The embodiments of the present invention described above are not implemented only through the apparatus and method; they may also be implemented through a program that realizes the functions corresponding to the configurations of the embodiments, or through a recording medium on which the program is recorded.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also fall within the scope of the present invention.

Claims

1. An image size adjustment device operated by at least one processor, comprising:
an encoder-decoder that receives an original image, extracts high-level features of the original image through a plurality of convolutional layers for encoding, decodes the high-level features with a plurality of convolutional layers for decoding, and outputs a first attention map including positional information on a meaningful region; and
a shifter that generates a second attention map in which the first attention map is scaled to a target aspect ratio, generates, based on the second attention map, a shift map for shifting pixels of the original image to an output image, and warps the original image with the shift map to output a retargeting image.

2. The device of claim 1,
wherein the shifter applies 1D duplicate convolution to the second attention map to generate a third attention map that is uniform along the columns, generates a final attention map by a weighted sum of the third attention map and the second attention map, and generates the shift map by cumulative normalization of the final attention map.

3. The device of claim 2,
wherein the shifter generates a one-dimensional row vector by convolving the second attention map with a column vector, and repeatedly replicates the row vector to generate the third attention map.

4. The device of claim 3,
wherein the column vector has weights learned to minimize a loss of the retargeting image.

5. The device of claim 1,
wherein the plurality of convolutional layers for decoding are symmetric to the plurality of convolutional layers for encoding.

6. The device of claim 1,
wherein the plurality of convolutional layers for decoding have weights learned to minimize a loss of the retargeting image.

7. The device of claim 1,
further comprising a learning apparatus that calculates a loss of the retargeting image and trains convolution weights involved in outputting the retargeting image in the encoder-decoder and the shifter so that the loss is minimized.

8. The device of claim 7,
wherein the loss includes a content loss, and
the learning apparatus calculates the content loss based on a class score obtained by inputting the retargeting image to an image classifier.

9. The device of claim 8,
wherein the loss further includes a structure loss, and
the learning apparatus calculates the structure loss by comparing shapes around corresponding pixels in the retargeting image and the original image.

10. An image size adjustment device operated by at least one processor, comprising:
an encoder-decoder that extracts high-level features of an original image through a plurality of convolutional layers for encoding, decodes the high-level features with a plurality of convolutional layers for decoding, and outputs a first attention map;
a shifter that generates a second attention map in which the first attention map is scaled to a target aspect ratio, and outputs, based on the second attention map, a retargeting image in which the original image is resized to the target aspect ratio; and
a learning apparatus that calculates a loss of the retargeting image and trains weights of the plurality of convolutional layers for decoding so that the loss is minimized.

11. The device of claim 10,
wherein the loss includes a content loss and a structure loss, and
the learning apparatus calculates the content loss based on a class score obtained by inputting the retargeting image to an image classifier, and calculates the structure loss by comparing shapes around corresponding pixels in the retargeting image and the original image.

12. The device of claim 11,
wherein the image classifier comprises the plurality of convolutional layers for encoding.

13. The device of claim 12,
wherein the learning apparatus inputs the original image and the retargeting image to the plurality of convolutional layers for encoding, and calculates the structure loss by comparing a feature of the original image and a feature of the retargeting image output from a lower convolutional layer among the plurality of convolutional layers for encoding.

14. A method of adjusting an image size by an image size adjustment device operated by at least one processor, the method comprising:
receiving a first attention map including positional information on a meaningful region of an original image, and a target aspect ratio;
generating a second attention map in which the first attention map is scaled to the target aspect ratio;
applying 1D duplicate convolution to the second attention map to generate a third attention map that is uniform along the columns;
generating a shift map using a final attention map obtained by a weighted sum of the third attention map and the second attention map; and
outputting a retargeting image in which the size of the original image is adjusted using the shift map.

15. The method of claim 14,
wherein generating the shift map comprises generating the shift map by cumulative normalization of the final attention map.

16. The method of claim 14,
wherein generating the third attention map comprises generating a one-dimensional row vector by convolving the second attention map with a column vector, and repeatedly replicating the row vector to generate the third attention map.

17. The method of claim 14,
wherein the first attention map is a result of decoding high-level features of the original image with a plurality of convolutional layers for decoding.

18. The method of claim 17, further comprising:
calculating a content loss of the retargeting image based on a class score obtained by inputting the retargeting image to an image classifier;
calculating a structure loss of the retargeting image by comparing shapes around corresponding pixels in the retargeting image and the original image; and
training convolution weights involved in outputting the retargeting image so that the content loss and the structure loss are minimized.