KR102567128B1

KR102567128B1 - Enhanced adversarial attention networks system and image generation method using the same

Info

Publication number: KR102567128B1
Application number: KR1020230040630A
Authority: KR
Inventors: 이도훈; 벨트란 호세 크루즈 카스텔로; 김민지; 제강 등; 노요환
Original assignee: 인빅 주식회사
Priority date: 2023-01-30
Filing date: 2023-03-28
Publication date: 2023-08-17

Abstract

The present invention relates to an adversarial attention network system, and an image generation method using the same, that uses a cross-domain attention-sharing mechanism in the image translation process to reconstruct the translated image, produce more consistent and realistic output, and improve image translation capability.

Description

Improved adversarial attention network system and image generation method using the same

The present invention relates to image generation technology using an improved adversarial attention network. More specifically, it relates to an adversarial attention network system, and an image generation method using the same, that uses a cross-domain attention-sharing mechanism in the image translation process to reconstruct the translated image, produce more consistent and realistic output, and improve image translation capability.

In general, the recognition accuracy of an image recognition model based on an artificial neural network can be improved by training the model on data that includes both normal images and defective images.

In practice, however, diverse training data is relatively easy to obtain for normal images but considerably harder to obtain for defective images or rarely occurring images (images that are not entirely different from normal ones but contain, in some part, an error that deviates from normal).

One remedy is for a person to manually create defective images from normal images, but this is markedly inefficient in terms of time and cost.

To address this, approaches that generate defective images with a GAN model are emerging.

A GAN (Generative Adversarial Network) is a learning network that may consist of a generator, a model responsible for image generation, and a discriminator, a model responsible for classification. In a GAN, the generator and the discriminator compete adversarially, each improving the other's performance.

In a GAN, training the discriminator first and then training the generator is repeated in alternation. GAN training consists of two steps. First, the discriminator learns to classify real images and fake images that the generator produces from a random vector. Second, the generator draws a new random noise vector and generates a fake image.

That is, the generator can be trained so that the fake images it produces from random vectors resemble real data closely enough for the discriminator to classify them as real.

As this training process is repeated, the discriminator and the generator treat each other as adversarial competitors and both improve. As a result, the generator becomes able to produce fake data that resembles real data, and the discriminator becomes unable to distinguish real data from fake data. In other words, the generator tries to lower the probability that classification succeeds while the discriminator tries to raise it, so that each competitively drives the other's improvement.

Introducing a self-attention mechanism into GAN-based image translation has been difficult. Self-attention has shown excellent results in fields such as NLP and computer vision, but computing similarity through the dot product between every pair of tokens incurs a memory cost that grows quadratically with the number of tokens. This memory complexity can be prohibitively expensive for certain applications.
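To make the memory concern concrete, the following sketch counts the bytes needed for one dense pairwise similarity matrix when every pixel of a feature map is treated as a token. The feature-map sizes and the 4-byte value size are illustrative assumptions, not figures from this document.

```python
def attention_map_bytes(height, width, bytes_per_value=4):
    """Bytes for one dense (n x n) similarity matrix, with n = height * width tokens."""
    n = height * width
    return n * n * bytes_per_value

# Assumed square feature maps; each doubling of the side multiplies the cost by 16.
for side in (32, 64, 128):
    mib = attention_map_bytes(side, side) / 2**20
    print(f"{side}x{side} feature map: {mib:.0f} MiB per attention map")
```

Doubling the side length of the feature map multiplies the number of tokens by 4 and the size of the similarity matrix by 16, which is why the cost becomes prohibitive at image resolutions.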

In addition, because a CNN processes input data with local receptive fields, it can only process information that is spatially close to the region currently being processed. The farther apart two pieces of information are in the input image, the harder it is for the CNN to learn their relationship. An attention mechanism, by contrast, lets the model focus selectively on the most relevant features and capture long-range dependencies between different parts of the input, so it is effective at learning from distant information in an image. Therefore, if the memory complexity of self-attention is resolved, introducing it into a GAN makes it possible to learn from distant information in the image as well, improving performance.

An embodiment of the present invention provides an adversarial attention network system and method that use multiple self-attention blocks that are mathematically equivalent to self-attention.

An object of the present invention is to provide an adversarial attention network system and method that share dependencies between a source image and a target image using cross-domain attention-sharing mechanisms.

An adversarial attention network system according to an embodiment of the present invention includes a generator and a discriminator that discriminates the output of the generator. The generator includes: an encoder that extracts meaningful features from an input image through a series of convolutions; a translator that changes the extracted features from a source domain to a target domain using a series of learned residual blocks; and a decoder that uses the translated features to construct a translated image with the same resolution as the image input to the encoder, upscales the features to the desired dimensions using a series of transposed convolutions, and maps the output to a multi-channel RGB image in its last convolution layer. The encoder may include an attention module that computes attention maps for resolving long-range dependencies.

To share the computed attention maps between them, the encoder and the decoder may use skip connections between the attention module on the i-th layer of the encoder and the attention module on the (i-L)-th layer of the decoder.

The attention modules may be placed after the down-sampling convolution of the encoder and before the transposed convolution of the decoder, respectively.

The attention modules may be placed before the convolution of the decoder and after the transposed convolution of the decoder, respectively.

In the initial training stage of dependency learning, the network system may introduce the learnable scale parameter of the equation for y_i (Equation 1), so that it relies more on local dependencies at first and waits until long-range dependencies become more meaningful.

To learn the input image from a semantic point of view, the attention module interprets the key (K) of each channel as a single attention map, and a series of the skip connections may be placed between the key vectors of the respective attention modules.

A refinement network for refining the attention maps coming from the source domain may additionally be placed between the series of skip connections.

The output of the encoder-side attention module is concatenated with the input of the decoder-side attention module, and the decoder attention module so connected may operate on the concatenated feature map of the previous layer and the output of the corresponding encoder attention module.

The network system learns two mapping functions to perform image translation between domain A and domain B, and the two generators and two discriminators may be trained simultaneously to enforce cycle consistency between the domains.

An image generation method according to another embodiment of the present invention, for an adversarial attention network system including an encoder, a translator, and a decoder, includes: the encoder extracting meaningful features from an input image through a series of convolutions; the translator changing the extracted features from a source domain to a target domain using a series of learned residual blocks; the decoder constructing, from the translated features, a translated image with the same resolution as the image input to the encoder; the decoder upscaling the features to the desired dimensions using a series of transposed convolutions; and the decoder mapping the output to a multi-channel RGB image in its last convolution layer. The step of extracting the features may include the attention module of the encoder computing attention maps for resolving long-range dependencies.

To share the computed attention maps between them, the encoder and the decoder may use skip connections between the attention module on the i-th layer of the encoder and the attention module on the (i-L)-th layer of the decoder.

To learn the input image from a semantic point of view, the attention module interprets the key (K) of each channel as a single attention map, and a series of the skip connections may be placed between the key vectors of the respective attention modules.

According to an embodiment of the present invention, using multiple self-attention blocks that are mathematically equivalent to self-attention overcomes the limitation of excessive memory use.

According to an embodiment of the present invention, the long-range dependency problem can be solved by sharing dependencies between a source image and a target image using cross-domain attention-sharing mechanisms.

According to an embodiment of the present invention, image translation capability can be improved by using the attention-sharing mechanism in the image-to-image translation process when reconstructing the translated image and producing more consistent and realistic output.

FIG. 1 is a diagram showing the EAGAN architecture of Embodiment 1.
FIG. 2 (A) and (B) are diagrams showing placements of the attention module.
FIG. 3 is a table comparing the performance of the attention module placement options of FIG. 2.
FIG. 4 is a diagram showing the attention-sharing mechanism of the EAGAN of Embodiment 1.
FIG. 5 is a diagram showing the UEAGAN architecture of Embodiment 2.
FIG. 6 shows sample images of image-to-image object translation by EAGAN and UEAGAN.
FIG. 7 shows sample images of image-to-image scene translation by EAGAN and UEAGAN.
FIG. 8 (A) and (B) are tables comparing the performance of task-specific architectures on scene translation tasks.
FIG. 9 is a diagram for visually comparing scene translation across architectures.
FIG. 10 is a diagram for visually comparing object translation across architectures.
FIG. 11 is a diagram for close visual inspection of the object translation models produced by the architectures.
FIG. 12 is a diagram for close visual inspection of the scene translation models produced by the architectures.
FIG. 13 is a diagram showing attention maps of the EAGAN architecture.
FIG. 14 is a flowchart of the image generation method of the adversarial attention network system according to Embodiment 3.

Hereinafter, several embodiments of the present invention are described in detail with reference to the drawings. This is not intended to limit the present invention to any particular embodiment, and all transformations, equivalents, and substitutions that fall within the technical idea of the present invention should be understood to be included in its scope.

In this specification, singular expressions include plural expressions unless the context clearly indicates otherwise.

In this specification, when a component is described as "having" or "comprising" a sub-component, this means that, unless specifically stated otherwise, other components are not excluded and may additionally be included.

In this specification, the terms "...unit", "...module", and "component" mean a unit that processes at least one function or operation, and may be implemented in hardware, software, or a combination of hardware and software.

<Embodiment 1>

Embodiment 1 relates to an adversarial attention network system that uses the EAGAN (Efficient Attention Generative Adversarial Network) architecture.

FIG. 1 is a diagram showing the EAGAN architecture of Embodiment 1.

The adversarial attention network system of Embodiment 1 uses the EAGAN architecture to perform image-to-image translation between unpaired domains A and B.

The EAGAN architecture of the adversarial attention network system of Embodiment 1 is based on a GAN.

The EAGAN architecture integrates attention mechanisms into the translation process in a way that allows long-range dependencies to be used.

Long-range dependency is a phenomenon that can arise in the analysis of spatial or time-series data. It concerns the rate at which the statistical dependence between two points decays as the time interval or spatial distance between them increases. A process is generally considered to have long-range dependency if the dependence decays more slowly than exponentially, typically as a power law. Long-range dependency is associated with self-similar processes or fields.

EAGAN overcomes the limitation of excessive memory use by using multiple self-attention blocks that are mathematically equivalent to self-attention. In addition, unlike a CNN, for which information becomes harder to learn the farther apart it lies in the input image, EAGAN uses multiple self-attention to learn effectively from distant information in the image. This is because the model can focus selectively on the most relevant features and capture long-range dependencies between different parts of the input. For reference, self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence.
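The document does not give the formula behind this equivalence. One common way to obtain it, shown here purely as an assumption, is to reorder the matrix products: without the softmax normalization, (Q Kᵀ) V equals Q (Kᵀ V) by associativity, and the second form never builds the n x n matrix. The shapes and variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 1024, 16, 16          # n tokens (pixels); channel sizes are assumed
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

# Dot-product ordering: builds an (n x n) pairwise similarity matrix first.
pairwise = (Q @ K.T) @ V

# Reordered: builds a small (d_k x d_v) context matrix first.
context = K.T @ V
reordered = Q @ context

print(np.allclose(pairwise, reordered))   # both orderings agree up to rounding
print(n * n, "vs", context.size, "intermediate values")
```

In this reordered form, each of the d_k rows of Kᵀ acts as one map over all positions, which matches the later description of each key channel being interpreted as a single attention map.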

EAGAN solves the long-range dependency problem by sharing dependencies between a source image and a target image using cross-domain attention-sharing mechanisms.

The EAGAN of the adversarial attention network system includes a generator and a discriminator.

The generator is responsible for generating images, and the discriminator discriminates the output of the generator.

The adversarial attention network system is based on a GAN, and training the discriminator first and then training the generator may be repeated in alternation.

The generator includes an encoder, a translator, and a decoder.

The encoder extracts meaningful features from the input image through a series of convolutions. For example, the image may be transformed into a latent feature space through a series of convolutions that down-sample the dimensions of the input.

The encoder includes an attention module that computes attention maps for resolving long-range dependencies.

The attention module is implemented as an efficient-attention module using a CycleGAN-based generator. For reference, CycleGAN is a GAN (Generative Adversarial Network) used for image style transfer between domains. A CycleGAN can be trained to translate images from one domain into another, and its training is unsupervised. CycleGAN does not require a one-to-one pairing of images between the two domains.

The attention module solves the long-range dependency problem by sharing dependencies between a source image and a target image using cross-domain attention-sharing mechanisms.

The translator changes the extracted features from the source domain to the target domain using a series of learned residual blocks.

The translator uses a series of residual blocks that learn how to change the extracted features from the source domain to the target domain. The dimensions of the features do not change at this stage, and the number of residual blocks can be used as a hyperparameter to balance speed against the quality of the result.

The decoder uses the features translated by the translator to construct a translated image with the same resolution as the image input to the encoder.

The decoder upscales the features to the desired dimensions using a series of transposed convolutions, and its last convolution layer maps the output to a multi-channel RGB image.
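The following sketch traces how the spatial size could evolve through such an encoder, translator, and decoder, using the standard output-size formulas for strided and transposed convolutions. The input resolution, the number of layers, and the kernel, stride, and padding values are assumptions chosen for illustration; the document does not specify them.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a strided convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride, padding, output_padding=0):
    """Output length of a transposed convolution along one spatial axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

size = 256                       # assumed input resolution
for _ in range(2):               # encoder: two assumed stride-2 down-sampling convolutions
    size = conv_out(size, kernel=3, stride=2, padding=1)
bottleneck = size                # the residual blocks keep this size unchanged
for _ in range(2):               # decoder: two mirrored transposed convolutions
    size = conv_transpose_out(size, kernel=3, stride=2, padding=1, output_padding=1)
print(bottleneck, size)
```

With these assumed settings the decoder returns to the input resolution, which is the property the description requires of the translated image.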

When the adversarial attention network system uses the EAGAN architecture, the encoder and the decoder of the generator use skip connections between the attention module on the i-th layer of the encoder and the attention module on the (i-L)-th layer of the decoder in order to share the computed attention maps between them. Here, i is the layer index of the encoder and L is the layer number of the decoder. For reference, a skip connection adds the output of one layer to the input of a later layer, skipping several layers in between.

The attention blocks can be positioned closer to the input and output layers. As a result, when the attention mechanism is introduced, the generator can learn long-range dependencies in the source domain and then share them with the target domain, improving the consistency of the image translation.

There, the features lie closer to the characteristics of each image domain. The encoder and the decoder are therefore ideal places to compute long-range dependencies, and the attention maps can be shared through skip connections between the attention modules of layers i and i-L.

Meanwhile, in the initial training stage of dependency learning, the adversarial attention network system introduces the learnable scale parameter γ of the equation below, so that it relies more on local dependencies at first and waits until long-range dependencies become more meaningful.

The scale parameter multiplies the output of the attention module, and the input features are then added back.

[Equation 1]

y_i = γ o_i + x_i

Here, o_i is the output of the i-th attention module, x_i is the i-th input feature, and the weight γ is initially set to 0 and then takes values greater than 0 as long-range dependencies are learned from meaningful features.
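A minimal sketch of Equation 1 follows, with toy arrays standing in for the attention output and the input features. The function name and the numeric values are hypothetical, and in a real network γ would be a trainable parameter updated by the optimizer rather than a fixed argument.

```python
import numpy as np

def scaled_residual(attention_out, features, gamma):
    """Equation 1: y_i = gamma * o_i + x_i, applied element-wise."""
    return gamma * attention_out + features

x = np.array([1.0, 2.0, 3.0])       # input features x_i (toy values)
o = np.array([0.5, -1.0, 4.0])      # attention output o_i (toy values)

print(scaled_residual(o, x, gamma=0.0))   # gamma = 0: output equals the input features
print(scaled_residual(o, x, gamma=0.5))   # gamma > 0: attention output is mixed in
```

With γ at its initial value of 0 the attention branch contributes nothing, so the network starts from purely local, convolutional behaviour.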

FIG. 2 is a diagram showing placements of the attention module.

As shown in FIG. 2 (A), the attention modules are placed after the down-sampling convolution of the encoder and before the transposed convolution of the decoder, respectively.

As shown in FIG. 2 (B), the attention modules are placed before the convolution of the decoder and after the transposed convolution of the decoder, respectively.

FIG. 3 is a table comparing the performance of the attention module placement options of FIG. 2.

FIG. 3 shows performance results for architectures in which the attention module is placed before (FIG. 2 (A)) and after (FIG. 2 (B)) the convolution block. The values are KID (Kernel Inception Distance) scores scaled by a factor of 100; lower values indicate better performance.

As the results of FIG. 3 show, performance improves on most tasks when, as in FIG. 2 (B), the attention modules are placed before the convolution of the decoder and after the transposed convolution of the decoder.

Meanwhile, to learn the input image from a semantic point of view, the attention module interprets the key (K) of each channel as a single attention map, and a series of skip connections is placed between the key vectors of the respective attention modules.

Rather than encoding attention between two points of the image, the attention module learns to represent semantic aspects of the image. Because the input and output images are expected to be closely related in content, it can be inferred that the attention among features in an early layer i of the generator can be reused in the corresponding late layer i-L.

Because the attention maps are computed from real samples, sharing them can help reconstruct the translated image and produce more consistent and realistic output. A series of skip connections is then placed between the key vectors of the mirrored attention module blocks, so that long-range dependencies are shared between the two ends of the architecture. The attention maps of the key (K) network are shared between each pair of attention blocks.

FIG. 4 is a diagram showing the attention-sharing mechanism of the EAGAN of Embodiment 1.

Referring to FIG. 4, a refinement network for refining the attention maps coming from the source domain may additionally be placed between the series of skip connections that run between the key vectors of the respective attention modules.

Given that the attention module interprets each channel as a single attention map, the refinement network is implemented as a depth-wise convolution with a 3 × 3 kernel. This special type of convolution divides the input channels into a given number of groups and uses a different kernel for each group.

That is, the attention module interprets the key output as attention maps shared between the corresponding layers i and i-L, and the attention maps are passed through a group convolution so that information is not mixed across channels.
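The channel-separation property described above can be illustrated with a small NumPy sketch of a depth-wise 3 × 3 convolution, in which the number of groups equals the number of channels. The function name, the zero padding, and the random kernels are assumptions for illustration; the actual refinement network would use learned kernels.

```python
import numpy as np

def depthwise_conv3x3(maps, kernels):
    """Apply one 3x3 kernel per channel with zero padding; channels are never mixed.

    maps:    (C, H, W) attention maps, one per channel
    kernels: (C, 3, 3) one kernel per channel
    """
    channels, height, width = maps.shape
    padded = np.pad(maps, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(maps, dtype=float)
    for c in range(channels):
        for dy in range(3):
            for dx in range(3):
                out[c] += kernels[c, dy, dx] * padded[c, dy:dy + height, dx:dx + width]
    return out

rng = np.random.default_rng(0)
maps = rng.random((4, 8, 8))
kernels = rng.standard_normal((4, 3, 3))
baseline = depthwise_conv3x3(maps, kernels)

# Changing channel 0 of the input only changes channel 0 of the output.
perturbed = maps.copy()
perturbed[0] += 1.0
changed = depthwise_conv3x3(perturbed, kernels)
print(np.allclose(baseline[1:], changed[1:]))
```

Because every output channel depends only on the matching input channel, each attention map is refined independently of the others.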

Meanwhile, the adversarial attention network system learns two mapping functions to perform image translation between domain A and domain B. Two generators, GA and GB, and two discriminators, DA and DB, are trained simultaneously to enforce cycle consistency between the domains, and thereby learn the mapping functions.

For reference, the cycle consistency loss is a type of loss used in generative adversarial networks that perform unpaired image-to-image translation.
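As a rough illustration of cycle consistency, the sketch below computes an L1 cycle loss for two toy mapping functions. The use of the L1 distance is an assumption borrowed from common CycleGAN practice, since the specification does not state the distance here, and the toy "generators" are hypothetical stand-ins for GA and GB.

```python
def l1_distance(u, v):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(p - q) for p, q in zip(u, v)) / len(u)


def cycle_consistency_loss(g_ab, g_ba, batch_a, batch_b):
    """A -> B -> A and B -> A -> B should both return the original image."""
    forward = sum(l1_distance(g_ba(g_ab(a)), a) for a in batch_a) / len(batch_a)
    backward = sum(l1_distance(g_ab(g_ba(b)), b) for b in batch_b) / len(batch_b)
    return forward + backward


# Hypothetical toy mappings acting on flat pixel lists.
def brighten(img):         # stands in for the A -> B generator
    return [p + 0.25 for p in img]


def darken(img):           # exact inverse of brighten (B -> A)
    return [p - 0.25 for p in img]


def darken_too_much(img):  # an imperfect inverse
    return [p - 0.5 for p in img]


batch_a = [[0.0, 0.5], [0.25, 0.75]]
batch_b = [[0.5, 1.0]]

perfect = cycle_consistency_loss(brighten, darken, batch_a, batch_b)
imperfect = cycle_consistency_loss(brighten, darken_too_much, batch_a, batch_b)
```

The loss is zero only when the two mappings invert each other, which is why it can supervise translation without paired samples.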

<Embodiment 2>

Embodiment 2 relates to an adversarial attention network system that uses the UEAGAN (U-net Efficient Attention Generative Adversarial Network) architecture.

FIG. 5 is a diagram showing the UEAGAN architecture of Embodiment 2.

The adversarial attention network system uses the UEAGAN architecture to perform image-to-image translation between unpaired domains A and B.

The EAGAN architecture of the adversarial attention network system is based on GAN.

Since GAN was described in Embodiment 1, a redundant description is omitted.

The UEAGAN architecture integrates attention mechanisms into the translation process in a way that can exploit long-range dependencies.

The UEAGAN architecture overcomes the limitation of excessive memory use by utilizing multiple self-attention blocks that are mathematically equivalent to self-attention. In addition, unlike a CNN, for which information becomes harder to learn the farther away it lies in the input image, the architecture uses multiple self-attention to learn distant information in the image effectively. This is because the model can selectively focus on the most relevant features and capture long-range dependencies between different parts of the input.
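One way to see how an attention block can match self-attention while using less memory is the associativity of matrix products. The sketch below is only an illustration under a simplifying assumption: normalization such as softmax is omitted, and the specification does not say here how its blocks normalize. The matrices Q, K, and V are hypothetical toy values.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


# n = 4 positions, d = 2 channels; integer values keep the arithmetic exact.
Q = [[1, 0], [0, 1], [1, 1], [2, 1]]
K = [[1, 2], [0, 1], [1, 0], [1, 1]]
V = [[1, 0], [2, 1], [0, 3], [1, 1]]

# Dot-product ordering: builds an n x n map over all pairs of positions.
pairwise = matmul(Q, transpose(K))
standard = matmul(pairwise, V)

# Reordered product: builds only a d x d context matrix first.
context = matmul(transpose(K), V)
efficient = matmul(Q, context)
```

Both orderings give the same output, but the intermediate grows with the number of positions in the first case (n × n) and only with the number of channels in the second (d × d).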

The UEAGAN architecture addresses the long-range dependency problem by sharing dependencies between the source image and the target image through cross-domain attention-sharing mechanisms.

The UEAGAN architecture of the adversarial attention network system includes a generator and a discriminator.

Since the generator and the discriminator are the same as in Embodiment 1, a redundant description is omitted.

The output of the attention module on the encoder side is concatenated with the input of the attention module on the decoder side, and the attention module of the decoder connected in this way operates on the concatenated feature map of the previous layer and the output of the corresponding attention module of the encoder.
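The concatenation described above can be sketched as a channel-wise join of two feature tensors. The sketch assumes that feature maps are stored as lists of H × W channels and that concatenation happens along the channel axis; the helper names and the channel counts are hypothetical.

```python
def concat_channels(decoder_input, encoder_attention_out):
    """Concatenate two feature tensors (lists of H x W channels) along channels."""
    h, w = len(decoder_input[0]), len(decoder_input[0][0])
    for channel in decoder_input + encoder_attention_out:
        if len(channel) != h or any(len(row) != w for row in channel):
            raise ValueError("spatial sizes must match before concatenation")
    return decoder_input + encoder_attention_out


def make_channels(count, height, width, fill):
    """Helper that builds `count` constant channels for the example."""
    return [[[fill] * width for _ in range(height)] for _ in range(count)]


# Hypothetical sizes: 2 decoder channels and 3 encoder attention channels.
decoder_input = make_channels(2, 4, 4, 0.0)
encoder_attention_out = make_channels(3, 4, 4, 1.0)
merged = concat_channels(decoder_input, encoder_attention_out)
```

The decoder-side attention module then sees both its own previous-layer features and the encoder's attention output, provided the two share the same spatial size.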

FIG. 6 is a diagram showing sample images for image-to-image object translation by EAGAN and UEAGAN.

Referring to FIG. 6, the sample images illustrate the sub-task in which EAGAN and UEAGAN transform a specific object shown in an image.

In FIG. 6, datasets are built for apple-to-orange and horse-to-zebra translation. For example, a dataset may consist of unpaired image samples of oranges and apples that differ in position, quantity, aspect, and environment. Each dataset is used for training and evaluating the model according to its labels.

FIG. 7 is a diagram showing sample images for image-to-image scene translation by EAGAN and UEAGAN.

Referring to FIG. 7, scene translation is used when the translation target is distributed over the whole image rather than a specific region (or a specific object). This requires the model to learn a domain translation for every element that can appear in a given image.

The ACDC (Adverse Conditions Dataset with Correspondences) dataset may be used for the scene translation task. This dataset contains images of driving scenes under various environmental conditions such as night, snow, rain, and fog. For each adverse-condition image, a corresponding image of the same scene taken under normal conditions is available. Although the dataset thus provides an approximation of paired samples, it can be used in an unpaired fashion.

The ACDC dataset provides separate splits for training, validation, and testing, but to use most of the data for training it is preferable to also use the images labeled "validation" and "test" in the training process.

FIG. 8(A) and (B) are tables comparing the performance of the architectures on the translation tasks.

Referring to FIG. 8, different models of each architecture are trained in order to compare the CycleGAN, AGGAN (Attention-Guided GAN), ASGIT (Attention-Based Spatial Guidance for Image-to-Image Translation), EAGAN, and UEAGAN architectures. The weights of each model are initialized from a random seed, and a model is trained for each method. The lowest KID (Kernel Inception Distance) score obtained by each model during training is taken as its performance, and the three KID scores are then averaged for each method.
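The aggregation rule described above (best score per run, then the mean over runs) can be sketched as follows. The KID histories are made-up placeholder numbers chosen only to make the arithmetic easy to follow; they are not results from FIG. 8.

```python
from statistics import mean


def aggregate_kid(runs):
    """Take the lowest KID seen in each run, then average over the runs."""
    best_per_run = [min(history) for history in runs]
    return mean(best_per_run)


# Placeholder KID histories for three differently seeded runs of one method.
# Each inner list holds the KID measured at successive checkpoints.
example_runs = [
    [0.5, 0.25, 0.375],
    [0.5, 0.125, 0.25],
    [0.75, 0.375, 0.5],
]
score = aggregate_kid(example_runs)
```

Lower KID is better, so taking the minimum per run selects each run's best checkpoint before the seeds are averaged.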

For reference, CycleGAN is a GAN (Generative Adversarial Network) used for image style transfer between domains.

AGGAN (Attention-Guided GAN) uses the same training process as CycleGAN and a similar architecture, and differs from EAGAN in that, after the residual blocks, it branches into two decoders that use the intermediate feature map produced by the last residual block.

ASGIT (Attention-Based Spatial Guidance for Image-to-Image Translation) uses the same generator and classifier networks as CycleGAN, but builds an attention map for each input image based on the second convolutional layer of the classifier.

FIG. 8(A) and FIG. 8(B) compare the different architectures in terms of KID (Kernel Inception Distance) scores for the scene and object translation tasks, respectively.

FIG. 8(A) is a table showing the performance results for scene translation, and FIG. 9 is a diagram for visually comparing scene translation across the architectures. As shown in FIG. 8(A) and FIG. 9, the EAGAN proposed in Embodiment 1 shows the best performance, except on the night-related task, where the CycleGAN model performs best.

FIG. 8(B) is a table showing the performance results for object translation, and FIG. 10 is a diagram for visually comparing object translation across the architectures. As shown in FIG. 8(B) and FIG. 10, the EAGAN proposed in Embodiment 1 and the UEAGAN show excellent performance.

The EAGAN architecture showed the best performance on the "apple to orange" and "horse to zebra" tasks, whereas the UEAGAN architecture performed well on the "orange to apple" task and matched the EAGAN architecture on the "horse to zebra" task.

FIG. 11 is a diagram for close visual inspection of the object translation results produced by the architectures, and FIG. 12 is a diagram for close visual inspection of the scene translation results produced by the architectures.

As shown in FIG. 11, a specific region of the "horse to zebra" result can be enlarged for close visual inspection. All images show the characteristic black-and-white stripes of a zebra, but in the CycleGAN output the eyebrow color of the original image still remains, and in the AGGAN output the stripes are too thin to look realistic. AGGAN also transformed the tail, which should not have a striped pattern.

ASGIT shows some of the essential characteristics of a zebra, but the colors of its output appear to differ.

EAGAN shows a relatively consistent stripe pattern, although with some artifacts in the tail region.

UEAGAN shows noise at the boundary of the enlarged region and artifacts in the tail region.

As shown in FIG. 12, in the "SNOW to SNOW" task, CycleGAN and AGGAN tend to blend the torso of the subject (a person) into the surroundings, and they preserve the subject's feet poorly.

ASGIT gives the subject's torso a more noticeable color and preserves the subject's feet better than CycleGAN and AGGAN. However, the ASGIT output is noisier over the whole image than the other outputs.

EAGAN shows comparatively little noise and preserves the subject's feet and torso better than the other outputs.

The UEAGAN model better preserves most of the details, such as the subject's arms and face.

FIG. 13 is a diagram showing attention maps of the EAGAN architecture.

As shown in FIG. 13, the attention modules of EAGAN generate attention maps that focus on different regions of the image. Each channel of the key vector in an attention module passes through a softmax function, and each resulting attention map is then stretched to the range 0 to 1 by subtracting its minimum value and dividing by its maximum value. This allows the attention maps to be visualized as grayscale images. For reference, the brighter a pixel is, the more strongly the corresponding input region is attended to.
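The visualization step can be sketched in a few lines. Two points are assumptions, not statements from the specification: the softmax is taken over the spatial positions of one key channel, and "dividing by the maximum" is read as dividing by the maximum that remains after the minimum has been subtracted, so that the result spans exactly 0 to 1. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def to_grayscale(key_channel):
    """Turn one H x W key channel into an H x W grid of 0..255 gray levels."""
    w = len(key_channel[0])
    flat = softmax([v for row in key_channel for v in row])
    lo = min(flat)
    shifted = [v - lo for v in flat]                 # subtract the minimum
    hi = max(shifted)
    scaled = [v / hi if hi > 0 else 0.0 for v in shifted]  # stretch to [0, 1]
    pixels = [round(255 * v) for v in scaled]
    return [pixels[r:r + w] for r in range(0, len(pixels), w)]


# Hypothetical 2 x 2 key channel; larger values mean stronger attention.
gray = to_grayscale([[0.0, 1.0], [2.0, 4.0]])
```

The position with the largest key value becomes white (255) and the smallest becomes black (0), matching the reading that brighter pixels mark more strongly attended regions.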

For example, in the second row, that is, for apple-to-orange object translation, the first attention map focuses on the apple, the second on the bush, the third on the hand, and the fourth on the sky while EAGAN performs the apple-to-orange translation.

<Embodiment 3>

Embodiment 3 relates to an image generation method using the adversarial attention network system of Embodiment 1 or Embodiment 2.

First, a method of generating an image using the EAGAN (Efficient Attention Generative Adversarial Network) architecture of Embodiment 1 is described.

As shown in FIG. 14, the method of Embodiment 3 includes extracting features (S110), changing the features from a source domain to a target domain (S120), constructing a translated image (S130), upscaling the features to a desired dimension (S140), and mapping the output to a multi-channel RGB image (S150).

In the step of extracting features (S110), an encoder extracts meaningful features from an input image through a series of convolutions.

The attention modules may be placed after the down-sample convolution of the encoder and before the transposed convolution of the decoder, respectively (see FIG. 2(A)), or before the convolution of the decoder and after the transposed convolution of the decoder, respectively (see FIG. 2(B)).

The performance of the architectures in which the attention modules are placed before (FIG. 2(A)) and after (FIG. 2(B)) the convolution blocks can be seen in FIG. 3. When the attention modules are placed before the convolution of the decoder and after the transposed convolution of the decoder, as in FIG. 2(B), performance improves on most tasks.

A refinement network for refining the attention map from the original domain may additionally be placed along the series of skip connections located between the key vectors of the attention modules.

In the step of changing the features from the source domain to the target domain (S120), a translator changes the extracted features from the source domain to the target domain using a series of learned residual blocks.

The translator uses a series of residual blocks that learn how to change the extracted features from the source domain to the target domain. The dimensions of the features do not change at this stage, and the number of residual blocks can be used as a hyperparameter to balance the speed and the quality of the result.

In the step of constructing the translated image (S130), a decoder uses the features translated by the translator to construct a translated image having the same resolution as the image input to the encoder.

In the step of upscaling the features to a desired dimension (S140), the decoder uses a series of transposed convolutions to upscale the features to the desired dimension.

In the step of mapping the output to a multi-channel RGB image (S150), the last convolution layer of the decoder maps the output to a multi-channel RGB image to generate the translated image. Examples of generated images are shown in FIGS. 9 to 12.
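To make the flow of steps S110 to S150 concrete, the sketch below only traces tensor shapes through an encoder, translator, and decoder. Every number in it is an assumption for illustration (stride-2 down- and up-sampling, channel doubling, 64 base channels, two down-sampling layers, nine residual blocks); the specification fixes none of these, and `trace_generator_shapes` is a hypothetical helper.

```python
def trace_generator_shapes(height, width, base_channels=64, n_down=2, n_res=9):
    """Return (stage, (channels, height, width)) pairs for a generator pass."""
    shapes = [("input", (3, height, width))]
    c, h, w = base_channels, height, width
    shapes.append(("S110 encoder stem", (c, h, w)))
    for i in range(n_down):
        # Assumed stride-2 convolution: halves H and W, doubles the channels.
        c, h, w = c * 2, h // 2, w // 2
        shapes.append((f"S110 encoder down {i + 1}", (c, h, w)))
    for i in range(n_res):
        # S120: residual blocks translate features without changing their shape.
        shapes.append((f"S120 residual block {i + 1}", (c, h, w)))
    for i in range(n_down):
        # S140: assumed stride-2 transposed convolution, mirroring the encoder.
        c, h, w = c // 2, h * 2, w * 2
        shapes.append((f"S140 decoder up {i + 1}", (c, h, w)))
    # S130/S150: the last layer maps to RGB at the input resolution.
    shapes.append(("S150 RGB output", (3, h, w)))
    return shapes


shapes = trace_generator_shapes(256, 256)
residual_shapes = [s for name, s in shapes if name.startswith("S120")]
```

Changing `n_res` alters only the number of shape-preserving blocks, which is the speed and quality trade-off mentioned for the translator.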

[Equation 2]

$$y_i = \gamma\, o_i + x_i$$

Here, $o_i$ is the output of the i-th attention, $x_i$ is the i-th input feature, and the weight $\gamma$ is initially set to 0 and then takes values greater than 0 as long-range dependencies are learned from meaningful features.
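A minimal numeric sketch of Equation 2 follows. It treats the scale parameter as a plain float, whereas in training it would be a learnable parameter updated by the optimizer; the function name and the sample values are hypothetical.

```python
def gated_attention_residual(attention_out, features, gamma):
    """Element-wise y_i = gamma * o_i + x_i."""
    return [gamma * o + x for o, x in zip(attention_out, features)]


x = [1.0, 2.0, 3.0]   # input features x_i
o = [0.5, -1.0, 2.0]  # attention outputs o_i

y_start = gated_attention_residual(o, x, gamma=0.0)  # attention is ignored
y_later = gated_attention_residual(o, x, gamma=0.5)  # attention contributes
```

With the scale at 0 the block passes its input through unchanged, so the network initially relies on local features and brings in the attention output only as the scale grows.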

Although the present invention has been described above with reference to several embodiments, those of ordinary skill in the art will understand that the invention can be modified and changed in various ways without departing from the spirit and scope of the invention set forth in the claims below.

In addition, among the embodiments described above, the invention relating to the method may be implemented as a program or as a computer-readable recording medium in which the program is stored.

That is, the present invention may be implemented in the form of an application: as a software program running on a mobile terminal such as a smartphone or tablet PC based on Google's Android or Apple's iOS; as a software program running on a wearable device such as Google Glass, Apple Watch, Samsung Galaxy Watch, or another smart watch; or as a software program running on a laptop PC, desktop PC, or the like based on Microsoft's Windows or Google's Chrome OS.

In addition, some functions of the device or system described above may be provided on a computer-readable recording medium that tangibly embodies a program of instructions for implementing them. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, flash memory, and USB memory.

S110: Extracting features
S120: Changing the features from the source domain to the target domain
S130: Constructing a translated image
S140: Upscaling the features to the desired dimension
S150: Mapping the output to a multi-channel RGB image

Claims

An improved adversarial attention network system comprising:
a generator including an encoder that extracts meaningful features from an input image through a series of convolutions,
a translator that changes the extracted features from a source domain to a target domain using a series of learned residual blocks, and
a decoder that constructs, using the translated features, a translated image having the same resolution as the image input to the encoder, upscales the features to a desired dimension using a series of transposed convolutions, and maps the output of the last convolution layer to a multi-channel RGB image; and
a discriminator that discriminates the result of the generator,
wherein the encoder includes an attention module that computes attention maps for resolving long-range dependencies, and
wherein, in order to share the computed attention maps between the encoder and the decoder, skip connections are used between the attention module on the i-th layer of the encoder and the attention module on the (i-L)-th layer of the decoder.

(Deleted)

According to claim 1,
The attention module,
An improved adversarial attention network system disposed after down-sample convolution of the encoder and before transposed-convolution of the decoder, respectively.

According to claim 1,
The attention module,
An improved adversarial attention network system disposed before the decoder's convolution and after the transposed-convolution of the decoder, respectively.

According to claim 1,
The network system,
In the initial training phase for dependency learning, an improved adversarial attention network system that introduces a learnable scale parameter (yi) in the equation below to rely more on local dependencies and wait until long-term dependencies are more meaningful.

According to claim 1,
The attention module,
In order to learn the input image from a semantic point of view, the key (K) of each channel is interpreted as a single attention map,
An improved adversarial attention network system in which a series of the skip connections is located between the key vectors of the attention modules.

According to claim 6,
The improved adversarial attention network system further includes a refinement network for refining the attention map from the original domain between the series of skip connections.

According to claim 1,
The output of the attention module on the encoder side is concatenated with the input of the attention module on the decoder side,
The improved adversarial attention network system in which the attention module of the connected decoder operates based on the concatenated feature map of the previous layer and the output of the corresponding attention module of the encoder.

According to claim 1,
The network system,
Learn two mapping functions to perform image translation between domain A and domain B,
An improved adversarial attention network system in which two generators and two discriminators are trained simultaneously to enforce cycle consistency between the domains.

An image generation method of an adversarial attention network system including an encoder, a translator and a decoder,
extracting, by the encoder, significant features from an input image through a series of convolutions;
changing the extracted features from a source domain to a target domain using a series of learned residual blocks by the translator;
constructing, by the decoder, a translated image having the same resolution as an image input to the encoder by utilizing the translated features;
the decoder utilizing a series of transposed convolutions to upscale the features to a desired dimension; and
Mapping, by the decoder, an output from a last convolution layer to a multi-channel RGB image;
The step of extracting the features includes calculating, by an attention module of the encoder, attention maps for resolving long-range dependencies;
The encoder and decoder,
In order to share the calculated attention maps between them, skip connections are used between the attention module on the i-th layer of the encoder and the attention module on the (i-L)-th layer of the decoder.
Image generation method of adversarial attention network system.

(Deleted)

According to claim 10,
The attention module,
The image generation method of the adversarial attention network system, which is disposed after the down-sample convolution of the encoder and before the transposed-convolution of the decoder, respectively.

According to claim 10,
The attention module,
The image generation method of the adversarial attention network system, which is respectively disposed before the decoder's convolution and after the transposed-convolution of the decoder.

According to claim 10,
The network system,
In the initial training stage for dependency learning, an image generation method of an adversarial attention network system that relies more on local dependencies and introduces the learnable scale parameter (yi) of the equation below to wait until long-term dependencies are more meaningful.

According to claim 10,
The attention module,
In order to learn the input image from a semantic point of view, the key (K) of each channel is interpreted as a single attention map,
An image generation method of an adversarial attention network system in which a series of the skip connections is located between the key vectors of the attention modules.

According to claim 15,
The method of generating an image of an adversarial attention network system, wherein a refinement network for refining the attention map from the original domain is further located between the series of skip connections.

According to claim 10,
The output of the attention module on the encoder side is concatenated with the input of the attention module on the decoder side,
The method of generating an image of an adversarial attention network system in which the attention module of the connected decoder operates based on the concatenated feature map of the previous layer and the output of the corresponding attention module of the encoder.

According to claim 10,
The network system,
Learn two mapping functions to perform image translation between domain A and domain B,
An image generation method of an adversarial attention network system in which two generators and two discriminators are trained simultaneously to enhance cycle consistency between domains.