KR20230122974A

KR20230122974A - System and method for High Fidelity Face Swapping Using Single/Multiple Source Images Based on Attention Mechanism

Info

Publication number: KR20230122974A
Application number: KR1020230000493A
Authority: KR
Inventors: 김대식; 이재혁
Original assignee: 한국과학기술원
Priority date: 2022-02-15
Filing date: 2023-01-03
Publication date: 2023-08-22

Abstract

Provided is a high-fidelity face swapping system, based on an attention mechanism, that can use single or multiple source images. The system comprises: a feature extraction unit (100) that extracts features from a source image; an identity transformer unit (IDTR, 200) that outputs, from the features received from the feature extraction unit, a result used to perform the face swap; and an image generation unit (300) that generates the swapped face image from the output of the identity transformer unit.

Description

System and method for High Fidelity Face Swapping Using Single/Multiple Source Images Based on Attention Mechanism

The present invention relates to a high-fidelity face swapping system and method, based on an attention mechanism, that can use single or multiple source images.

Face swapping is the task of transferring the features that identify the source image (its identity) to a target image without affecting the target's attributes (e.g., pose and expression). It has recently attracted considerable attention for applications such as entertainment, the film industry, and privacy protection. Despite this interest, research on face swapping for high-resolution images is still at an early stage. Advances in face swapping are nevertheless drawing attention because they are expected to improve face forgery detection in modern data-driven technology development.

Conventional face swapping methods require a high-quality source image (here, a neutral expression and a frontal gaze) to swap faces effectively. When a low-quality source image (an unusual expression or pose) is used, the swap result is also of low quality.

FIG. 1 shows face swapping results of the prior art when the source image is limited.

Referring to FIG. 1, when a low-quality source image (an unusual expression or pose) is used, a low-quality swap result is obtained, as in the lower-right image of FIG. 1. Because the range of expressions and poses a subject can take in real life is very wide, this problem can occur frequently, and a fundamental solution is needed.

As prior art 1 on face swapping, "SimSwap: An Efficient Framework For High Fidelity Face Swapping" is disclosed (Chen et al., "SimSwap: An Efficient Framework For High Fidelity Face Swapping," ACMMM, 2020).

Prior art 1 extracts an ID vector of the source with a pre-trained identity extractor and, between a deep-network encoder and decoder, fuses the extracted ID vector into the target features so that the generated face resembles the source identity.

FIG. 2 shows a schematic diagram of the face swapping method of prior art 1 and its results.

Referring to FIG. 2, the output of prior art 1 is limited to a low resolution of at most 256x256. In addition, because the ID vector is extracted by the pre-trained identity extractor independently of the target image, the information contained in the source cannot be used adaptively, which degrades the quality of the result.

As prior art 2, "One Shot Face Swapping on Megapixels (MegaFS)" is disclosed (Zhu et al., "One Shot Face Swapping on Megapixels," CVPR, 2021).

Prior art 2 encodes the source and target images with a deep network, fuses the target features responsible for identity with the source features, and feeds the result to a generator to produce the face. It was proposed at a top computer vision conference (CVPR 2021). Prior art 2 is the only subject-agnostic high-resolution (1024x1024) face swapping technique, but because it transfers identity using only some of the source and target features rather than all of them, the quality of the result is degraded.

FIG. 3 shows a schematic diagram of the face swapping method of prior art 2 and its results.

Referring to FIG. 3, the face swap actually takes place only in the part marked by the blue box, so not all features are used, and quality suffers accordingly.

Prior art 3 is Korean Registered Patent No. 10-2188991. This prior art is a subject-specific method and cannot be applied to arbitrary photographs. It also discloses neither a solution for high-resolution, high-fidelity face swapping nor a solution for using low-quality source face images.

Prior art 4 is Korean Patent Publication No. 10-2017-01098512188991.

This prior art is likewise a subject-specific method and cannot be applied to arbitrary photographs. It also discloses neither a solution for high-resolution, high-fidelity face swapping nor a solution for using low-quality source face images. In addition, it is not based on deep learning or machine learning and relies on techniques such as noise information extraction, so the quality of its results is also poor.

Therefore, a new technique is needed that can generate high-quality images from low-quality images in a subject-agnostic manner.

The present invention addresses the above problems and provides a method and system that can generate high-quality images from low-quality images in a subject-agnostic manner.

To solve the above problem, the present invention provides an attention-based high-fidelity face swapping system comprising: a feature extraction unit (100) that extracts features from a source image; an identity transformer unit (IDTR, 200) that outputs, from the features received from the feature extraction unit, a result used to perform the face swap; and an image generation unit (300) that generates the swapped face image from the output of the identity transformer unit.

In one embodiment of the present invention, the identity transformer unit (IDTR, 200) outputs its result using a soft attention algorithm and a hard attention algorithm together.

In one embodiment of the present invention, the soft attention algorithm outputs an attention value according to the following equation.

(In the above equation, A_soft ∈ R^{C×HW} is the soft attention output, h(·) is a 1 × 1 convolution layer, and M ∈ R^{C×HW}, where each point of M is the sum of all points of V weighted by A.)

In one embodiment of the present invention, the hard attention algorithm outputs an attention value according to the following equation.

(In the equation, h_i is the index of the source-image position most relevant to the i-th position of the target image, and A_hard(i, j) is the attention value at position (i, j) of A_hard.)
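
The hard attention equation itself is not reproduced in this text. A form consistent with the description of h_i would be h_i = argmax_j A(i, j) and A_hard(i, j) = V(i, h_j), where V stands for the source feature values. The NumPy sketch below illustrates that reading; the function name, the array shapes, and the toy values are assumptions made for this example.

```python
import numpy as np

def hard_attention(V, A):
    """V: source values (C, HW); A: attention map (HW, HW), row i = target position i."""
    h = A.argmax(axis=1)        # h[i]: index of the most relevant source position
    A_hard = V[:, h]            # column i copies the source feature at position h[i]
    return A_hard, h

# Toy example with C = 3 channels and HW = 4 positions.
V = np.arange(12, dtype=float).reshape(3, 4)
A = np.array([[0.1, 0.7, 0.1, 0.1],
              [0.6, 0.2, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.7],
              [0.2, 0.2, 0.5, 0.1]])
A_hard, h = hard_attention(V, A)
```

Unlike a weighted average, this selection copies exactly one source feature per target position, so no blending across source positions takes place.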

In one embodiment of the present invention, the identity transformer unit (IDTR, 200) generates the attention map A ∈ R^{HW×HW} used by the soft and hard attention algorithms according to the following equation.

(Q_u ∈ R^{C×HW} and K_u ∈ R^{C×HW} are the unfolded forms of Q ∈ R^{C×H×W} and K ∈ R^{C×H×W}.)

The present invention also provides an attention-based high-fidelity face swapping system comprising: a feature extraction unit (100) that extracts features from at least two source images; an identity transformer unit (IDTR, 200) that outputs, from the features received from the feature extraction unit, a result used to perform the face swap; and an image generation unit (300) that generates the swapped face image from the output of the identity transformer unit.

In one embodiment of the present invention, the identity transformer unit (IDTR, 200) generates the attention map for the soft and hard attention algorithms according to the following equation.

(Here, A_multi is the attention map for the case of two or more source images, and ⊙ denotes batch matrix multiplication.)
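
The multi-source equation is not reproduced in this text, so the following is only one plausible reading: the target query is multiplied against the keys of each of the N source images with a batched matrix product, and the N score blocks are then normalized jointly so that each target position distributes its attention over all N·HW source positions. The joint softmax, the resulting shape (HW, N·HW), and all names are assumptions of this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_source_attention_map(Q, Ks):
    """Q: target query (C, H, W); Ks: keys of N source images (N, C, H, W)."""
    N, C = Ks.shape[0], Ks.shape[1]
    Q_u = Q.reshape(C, -1)                       # (C, HW)
    K_u = Ks.reshape(N, C, -1)                   # (N, C, HW)
    scores = np.matmul(Q_u.T, K_u)               # batched product: (N, HW, HW)
    HW = Q_u.shape[1]
    scores = scores.transpose(1, 0, 2).reshape(HW, N * HW)
    return softmax(scores, axis=1)               # each row sums to 1 over all sources

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 4, 4))
Ks = rng.standard_normal((3, 8, 4, 4))           # three source images
A_multi = multi_source_attention_map(Q, Ks)
```

Under this reading, a source image whose pose or expression matches a given target region poorly simply receives little weight for that region, which is how several low-quality sources could complement one another.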

The present invention also provides a high-fidelity face swapping system comprising the face swapping system described above and a training unit that performs training with a loss function that includes an ideality preserving loss, an identity loss, a Learned Perceptual Image Patch Similarity (LPIPS) loss, a self-reconstruction loss, and a regularization loss.

The present invention also provides a high-fidelity face swapping system comprising the face swapping system described above and a training unit that performs training with a loss function that includes all of the ideality preserving loss, the identity loss, the LPIPS loss, the self-reconstruction loss, and the regularization loss.

In one embodiment of the present invention, the loss function (L_total) is expressed by the following equation.

(Here, L_ip is the ideality preserving loss, L_id is the identity loss, L_LPIPS is the Learned Perceptual Image Patch Similarity (LPIPS) loss, L_self is the self-reconstruction loss, and L_reg is the regularization loss.)
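
The equation for L_total is not reproduced in this text. A common way to combine such terms is a weighted sum, L_total = λ_ip·L_ip + λ_id·L_id + λ_LPIPS·L_LPIPS + λ_self·L_self + λ_reg·L_reg. The sketch below assumes that form; the weights, their default of 1.0, and the dictionary keys are hypothetical and are not taken from the patent.

```python
LOSS_NAMES = ("ip", "id", "lpips", "self", "reg")

def total_loss(losses, weights=None):
    """Weighted sum of the five loss terms; `weights` defaults to 1.0 for each term."""
    if weights is None:
        weights = {name: 1.0 for name in LOSS_NAMES}
    return sum(weights[name] * losses[name] for name in LOSS_NAMES)

# Toy scalar values standing in for the individual loss terms.
example_losses = {"ip": 1.0, "id": 2.0, "lpips": 0.5, "self": 0.25, "reg": 0.25}
unweighted = total_loss(example_losses)
weighted = total_loss(
    example_losses,
    {"ip": 1.0, "id": 2.0, "lpips": 1.0, "self": 1.0, "reg": 1.0},
)
```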

The present invention provides an attention-based high-fidelity face swapping method comprising: mapping features of a source image to a feature space; outputting, by an attention algorithm, a result for performing the face swap from the mapped features; and performing the face swap from the output result, wherein the attention algorithm outputs the result using a soft attention algorithm and a hard attention algorithm together.

The present invention also provides an attention-based high-fidelity face swapping method comprising: mapping features of at least two source images to a feature space; outputting, by an attention algorithm, a result for performing the face swap from the mapped features; and performing the face swap from the output result, wherein the attention algorithm outputs the result using a soft attention algorithm and a hard attention algorithm together.

In one embodiment of the present invention, the step of outputting by the attention algorithm includes generating the attention map for the soft and hard attention algorithms according to the following equation.

According to the present invention, a high-quality image can be generated from low-quality images in a subject-agnostic manner, and the quality of the result can be improved by using two or more low-quality images in a mutually complementary way.

FIG. 1 shows face swapping results of the prior art when the source image is limited.
FIG. 2 shows a schematic diagram of the face swapping method of prior art 1 and its results.
FIG. 3 shows a schematic diagram of the face swapping method of prior art 2 and its results.
FIG. 4 is a schematic diagram of the overall framework of the face swapping method according to the present invention.
FIG. 5 shows the detailed structure of the IDTR according to the present invention.
FIGS. 6 and 7 show realistic high-resolution (1024x1024), high-fidelity results.
FIGS. 8 and 9 show comparisons with the prior art.
FIG. 10 shows a comparison that also includes existing face swapping techniques for low-resolution images.
FIG. 11 shows results obtained with low-quality multiple source images.
FIG. 12 shows a qualitative analysis of multi-source face swapping results on VGGFace2-HQ.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

Before the present invention is described in detail, note that the terms and words used in this specification should not be construed as limited to their ordinary or dictionary meanings; the inventor may appropriately define the concepts of terms in order to describe the invention in the best way.

Furthermore, these terms and words should be interpreted with meanings and concepts consistent with the technical idea of the present invention.

That is, the terms used in this specification serve only to describe preferred embodiments and are not intended to specifically limit the content of the present invention.

These terms are defined in consideration of the various possibilities of the present invention.

In this specification, a singular expression may include the plural unless the context clearly indicates otherwise.

Similarly, an expression in the plural may include the singular.

Throughout this specification, when a component is described as "including" another component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Furthermore, when a component is described as "present inside, or connected to and installed in" another component, the component may be directly connected to, or installed in contact with, the other component.

The present invention provides a method of improving the quality of the result by using two or more low-quality images in a mutually complementary way.

FIG. 4 is a schematic diagram of the overall framework of the face swapping method according to the present invention.

Referring to FIG. 4, the attention-based high-fidelity face swapping system according to the present invention, which can use single or multiple source images, comprises a feature extraction unit (100) that uses a ResNet backbone to extract features, an identity transformer unit (IDTR, 200) for face swapping, and an image generation unit (300) that uses StyleGAN2. The IDTR performs the face swap through a dual attention process that uses hard and soft attention together, and it can also swap faces using multiple source images.

To use the W+ ∈ R^{18×512} space of StyleGAN2 in the image generation unit (see Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110-8119, 2020), the features of the source and target images are extracted with the pre-trained backbone of pSp (see Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a StyleGAN encoder for image-to-image translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021).

That is, previous methods mainly extracted the identity directly from the image at this stage. To prevent the information loss that this causes, the present invention maps as much information as possible, not only the identity in the image, to the feature space. In one embodiment of the present invention, the image features are extracted at three levels (coarse, medium, and fine), following the conventional approach in GAN inversion research.

As shown in FIG. 4, the extracted features F_{s,t} are input to the IDTR (Identity Transformer) together with G_{s,t}, the features fused with hierarchical information so that hierarchical input information is taken into account. The IDTR at the coarsest level receives only F_{s,t}, which takes the place of G_{s,t}. In FIG. 4, F denotes the features extracted by the backbone, and G denotes the hierarchically fused (hierarchical fusion, HF) features, which contain hierarchical information that considers the coarse, medium, and fine levels together.

In one embodiment of the present invention, the hierarchical fusion (HF) information in G helps encode hierarchical information by fusing the features of the coarse and fine levels. If only a small part of the face region in a photograph (e.g., skin) is cut out as a patch, it is difficult to infer which region of the face the patch comes from. Using the fused hierarchical information effectively enlarges the patch so that it covers a wider region, which makes it easier to identify which facial region the patch belongs to.

The IDTR then performs the face swap at the feature level, and its output is mapped to the W+ space of StyleGAN2 through the hierarchical fusion module and the feature-to-style module.

In one embodiment of the present invention, a total of 18 style vectors are obtained in this way, and the style vectors are finally input to StyleGAN2 to generate the swapped face image.

In other words, the attention-based high-fidelity face swapping method according to the present invention comprises: mapping features of a source image to a feature space; outputting, by an attention algorithm, a result for performing the face swap from the mapped features; and performing the face swap from the output result. The attention algorithm outputs the result using a soft attention algorithm and a hard attention algorithm together; here the IDTR generates the attention map and outputs the soft and hard attention values. The attention map can be extended differently depending on the number of source images. The invention is described in more detail below, component by component.

1. IDTR (Identity Transformer)

FIG. 5 shows the detailed structure of the IDTR according to the present invention.

Referring to FIG. 5, the IDTR according to the present invention performs the face swap through two processes: 1) generating an attention map (A) from the relevance between the source and target representations, and 2) applying soft/hard attention using the generated attention map A.

1.1 Attention map generation

The attention map measures the similarity between the source and target images in order to capture the relevance between them. To measure similarity, in one embodiment of the present invention the key (K) and the query (Q) are constructed as in the following equation.

Here, f(·) and g(·) are 1 × 1 convolution layers, and Norm denotes instance normalization. G_s and G_t are used for K and Q so that hierarchical information at the coarse level is reflected.
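
The equation for K and Q is not reproduced in this text. Because A(i, j) relates the i-th target position to the j-th source position, one consistent reading is K = f(Norm(G_s)) for the source and Q = g(Norm(G_t)) for the target. The NumPy sketch below follows that reading; the order of normalization and convolution, the weight matrices W_f and W_g, and the random inputs are assumptions made for illustration.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    # x: (C, H, W); each channel is normalized over its own spatial positions.
    mu = x.mean(axis=(1, 2), keepdims=True)
    sd = x.std(axis=(1, 2), keepdims=True)
    return (x - mu) / (sd + eps)

def conv1x1(x, weight):
    # A 1 x 1 convolution is a per-position linear map over channels.
    # x: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W)
    return np.einsum("oc,chw->ohw", weight, x)

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
G_s = rng.standard_normal((C, H, W))   # source features (random stand-in)
G_t = rng.standard_normal((C, H, W))   # target features (random stand-in)
W_f = rng.standard_normal((C, C))      # hypothetical weights of f
W_g = rng.standard_normal((C, C))      # hypothetical weights of g

K = conv1x1(instance_norm(G_s), W_f)   # key from the source (assumed)
Q = conv1x1(instance_norm(G_t), W_g)   # query from the target (assumed)
```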

The attention map A ∈ R^{HW×HW} is constructed by the following equation.
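
The equation is not reproduced in this text. A form that matches the stated shapes, with softmax normalization over the source positions taken as an assumption, is:

```latex
A = \operatorname{Softmax}\left(Q_u^{\top} K_u\right), \qquad A \in \mathbb{R}^{HW \times HW}
```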

In the above equation, Q_u ∈ R^{C×HW} and K_u ∈ R^{C×HW} are the unfolded forms of Q ∈ R^{C×H×W} and K ∈ R^{C×H×W}.

Each element A_(i,j) (i, j ∈ [1, HW]) of A represents the relevance between the i-th feature of the target and the j-th feature of the source. That is, for each specific region of the target image, the attention map of one embodiment holds relevance information distributed over the entire source image. Through the soft/hard attention processes described below, the face swap according to the present invention is performed at the feature level using this property of the attention map.
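
As a minimal sketch of the attention map, the following NumPy code unfolds Q and K from (C, H, W) to (C, HW) and forms an (HW, HW) map whose row i describes how target position i relates to every source position. The row-wise softmax is an assumption, since the normalization is not stated in this text, and the random inputs are stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(Q, K):
    """Q (target) and K (source): arrays of shape (C, H, W). Returns A of shape (HW, HW)."""
    C = Q.shape[0]
    Q_u = Q.reshape(C, -1)               # unfold to (C, HW)
    K_u = K.reshape(C, -1)               # unfold to (C, HW)
    return softmax(Q_u.T @ K_u, axis=1)  # row i: target position i over all source positions

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 4, 4))
K = rng.standard_normal((8, 4, 4))
A = attention_map(Q, K)
```

Note that the map grows with (HW)², which is one practical reason to compute it on feature maps rather than on full-resolution images.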

1.2 Soft attention

When the features extracted from the target and the source both have size [C, H, W], the present invention builds an attention map (A) of size [HW, HW] as in Section 1.1 and uses it to fuse the features.

Here, the value of A at position (i, j) is the correlation between the feature at the i-th position of the target and the j-th position of the source; the soft attention algorithm considers the relationships between all positions of the source and the target.

The soft attention method proposed in the present invention is derived from adaptive instance normalization (AdaIN), as used by existing face swapping methods that include a face recognition network in their framework. AdaIN is defined as follows.

To transfer the style of y to x, AdaIN changes the statistics of the representation of x using the statistics of y.
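
AdaIN is commonly written as AdaIN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y), where μ and σ are the per-channel mean and standard deviation over spatial positions. The sketch below implements that standard form in NumPy; the eps term, the (C, H, W) layout, and the random inputs are choices made for this illustration.

```python
import numpy as np

def adain(x, y, eps=1e-5):
    # x, y: feature maps of shape (C, H, W); statistics are computed per channel.
    mu_x = x.mean(axis=(1, 2), keepdims=True)
    sd_x = x.std(axis=(1, 2), keepdims=True)
    mu_y = y.mean(axis=(1, 2), keepdims=True)
    sd_y = y.std(axis=(1, 2), keepdims=True)
    # Normalize x, then rescale and shift it with the statistics of y.
    return sd_y * (x - mu_x) / (sd_x + eps) + mu_y

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))               # content features
y = 3.0 * rng.standard_normal((4, 8, 8)) + 1.0   # style features with different statistics
out = adain(x, y)
```

After the call, each channel of `out` keeps the spatial pattern of x but carries (up to eps) the mean and standard deviation of the matching channel of y.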

The present invention uses the attention map and the source image representation to change the statistics of the target representation so that they match those of the source. As a result, just as the style of y is transferred to x in AdaIN, the source identity is transferred to the target by the soft attention algorithm.

As shown in FIG. 4, the soft attention method according to an embodiment of the present invention takes as inputs the attention map (A) and the value (V) obtained from the source image, as in the following equation.

Here, h(·) is a 1 × 1 convolution layer. The attention-weighted mean (M) of V, which plays the role of μ(y) in AdaIN, is then given by the following equation.

Here, M ∈ R^C×HW. Each point of M is interpreted as the sum of all points of V weighted by A. Since the variance of a random variable equals the expectation of its square minus the square of its expectation, the attention-weighted standard deviation (S) of V is obtained by the following equation.

Here, V² and M² are the element-wise squares of V and M. The statistics of the target image representation are changed using the obtained M and S ∈ R^C×HW.

Here, A_soft ∈ R^C×HW is the output of soft attention.
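The soft attention steps described above can be sketched as follows. The mean and standard deviation follow the verbal definitions in the text (a sum of V weighted by A, and the expectation of the square minus the square of the expectation). The final combination `S * Norm(F_t) + M` is assumed by analogy with AdaIN, and the per-channel normalization of F_t is also an assumption, since the equations themselves are not reproduced in this section.

```python
import numpy as np

def soft_attention(A, V, F_t, eps=1e-5):
    """A: [HW, HW] attention map (row i = weights for target position i).
    V: [C, HW] source values. F_t: [C, HW] target features. Returns [C, HW]."""
    M = V @ A.T                               # attention-weighted mean of V
    var = (V ** 2) @ A.T - M ** 2             # E[V^2] - (E[V])^2
    S = np.sqrt(np.clip(var, 0.0, None))      # clip guards against tiny negative round-off
    mu = F_t.mean(axis=1, keepdims=True)
    sd = F_t.std(axis=1, keepdims=True)
    F_norm = (F_t - mu) / (sd + eps)          # per-channel normalization (assumed)
    return S * F_norm + M                     # AdaIN-style combination (assumed form)

rng = np.random.default_rng(0)
hw, c = 16, 8
logits = rng.normal(size=(hw, hw))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
V = rng.normal(size=(c, hw))
F_t = rng.normal(size=(c, hw))
A_soft = soft_attention(A, V, F_t)
```

A quick sanity check of the sketch: if A is the identity matrix, each target position attends to exactly one source position, so the weighted standard deviation is zero and the output reduces to V.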

To summarize the soft attention process according to an embodiment of the present invention: in a manner similar to AdaIN, face replacement is defined as a statistical change in the image representation, and the mean and standard deviation are formulated from the attention map to realize that statistical change in the target image representation.

1.3 Hard Attention

In the soft attention algorithm described above, M is obtained for each query point as the sum of V weighted by A. However, this process can alter the distribution of the source features, causing blurring or inaccurate identity transfer. In other words, whereas soft attention treats the information as a distribution and considers every possible j ∈ [1, HW] when fusing the information for the i-th position of the target, hard attention selects only the single j with the highest correlation and considers that one alone.

That is, in order to transfer, for each query point, only the most relevant feature of V, the present invention uses the hard attention output A_hard, which can be expressed as follows.

Here, h_i is the index of the position in the source image that is most relevant to the i-th position of the target image.

In an embodiment of the present invention, A(i,j) denotes the attention score of the j-th key for the i-th query, and A_hard(i,j) is the attention value at position (i, j) of A_hard. The A_soft and A_hard produced by the proposed soft/hard attention are concatenated with the normalized target feature F_t and passed to the next stage.
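A minimal sketch of the hard attention selection follows. It assumes that A_hard simply copies, for each target position i, the source value at the index h_i with the highest attention score. The patent's own equation is not reproduced here and may differ, for example by also weighting the selected value.

```python
import numpy as np

def hard_attention(A, V):
    """A: [HW, HW] attention map; V: [C, HW] source values.
    Returns (A_hard of shape [C, HW], index vector h of shape [HW])."""
    h = A.argmax(axis=1)   # h[i]: most relevant source position for target position i
    return V[:, h], h      # one source feature per target position, no averaging

A = np.array([[0.1, 0.7, 0.2],
              [0.6, 0.3, 0.1],
              [0.2, 0.2, 0.6]])
V = np.array([[10.0, 20.0, 30.0],
              [1.0, 2.0, 3.0]])
A_hard, h = hard_attention(A, V)
# h is [1, 0, 2]; A_hard is [[20, 10, 30], [2, 1, 3]]
```

Because each output column is an unmodified source feature, this selection avoids the averaging that the text identifies as a possible cause of blur in soft attention.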

In more detail, the soft attention output, the hard attention output, and the target feature F_t extracted from the backbone are concatenated along the channel dimension. That is, if A_soft, A_hard, and F_t each have shape [w × h × c], the output of 'cat' in FIG. 5 has shape [w × h × 3c].

A 1 × 1 convolution then refines the [w × h × 3c] feature, which contains the information from all three sources.
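The concatenation and refinement step can be sketched as below. The sketch uses a channel-first layout ([C, H, W]) rather than the [w × h × c] layout of the text, and it assumes that the 1 × 1 convolution maps 3C channels back to C channels; the output width, the random weights, and the name `fuse` are illustrative assumptions.

```python
import numpy as np

def fuse(a_soft, a_hard, f_t, weight, bias):
    """a_soft, a_hard, f_t: [C, H, W] each; weight: [C_out, 3C]; bias: [C_out]."""
    cat = np.concatenate([a_soft, a_hard, f_t], axis=0)   # [3C, H, W]
    # A 1x1 convolution applies the same linear map over channels at every pixel.
    out = np.einsum('oc,chw->ohw', weight, cat)
    return out + bias[:, None, None]

rng = np.random.default_rng(0)
c, h, w = 4, 5, 6
maps = [rng.normal(size=(c, h, w)) for _ in range(3)]
weight = rng.normal(size=(c, 3 * c))
bias = np.zeros(c)
refined = fuse(*maps, weight, bias)
```

Writing the 1 × 1 convolution as an `einsum` makes explicit that it mixes the three sources only across channels and never across spatial positions.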

2. Multi-source face replacement

The IDTR, which performs face replacement using the soft/hard attention processes, can be extended to so-called multi-source face replacement, which uses two or more source images.

More specifically, when there are N source images, K, Q, and V can all be assumed to lie in R^N×C×H×W (Q is computed from a single target and is therefore repeated N times along the batch dimension).

K and Q are then unfolded into K_u and Q_u of size R^N×C×HW, and the present invention defines A_multi ∈ R^HW×NHW as follows.

Here, ⊙ is batch matrix multiplication. Face replacement is then performed through the soft/hard attention processes in the same way as in the single-source case.
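The multi-source shapes can be traced with the sketch below. It assumes that the batched product of the unfolded query and keys is rearranged so that all N·HW source positions lie along one axis, followed by a softmax over that axis. The softmax, the sqrt(C) scaling, and the function name are assumptions, since the defining equation is not reproduced in this section.

```python
import numpy as np

def multi_source_attention(q, k):
    """q: target query [C, H, W]; k: source keys [N, C, H, W] -> A_multi [HW, N*HW]."""
    n, c = k.shape[0], k.shape[1]
    hw = q.shape[1] * q.shape[2]
    q_u = np.broadcast_to(q.reshape(c, hw), (n, c, hw))   # Q repeated N times: [N, C, HW]
    k_u = k.reshape(n, c, hw)                              # [N, C, HW]
    scores = np.matmul(q_u.transpose(0, 2, 1), k_u)        # batch matmul: [N, HW, HW]
    scores = scores.transpose(1, 0, 2).reshape(hw, n * hw) # [HW, N*HW]
    scores = scores / np.sqrt(c)                           # scaling is an assumption
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)                # softmax over all N*HW positions

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 4))
k0 = rng.normal(size=(8, 4, 4))
A_multi = multi_source_attention(q, np.stack([k0, k0]))
```

With this layout, V would be unfolded to [C, N·HW] in the same source order, after which single-source soft and hard attention routines could be applied unchanged.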

3. Training method

The face replacement system according to the present invention further includes a learning unit that trains the system using the five loss functions below, which enables more realistic and accurate face replacement.

3.1 Ideality preserving loss

The ideality preserving loss guides the IDTR to extract features that are robust to the choice of source. Specifically, it makes the IDTR extract the same latent vector w whether the input is a single ideal source or M_ip non-ideal sources. A person who is shown several non-ideal faces, each of which has some ideal parts, can draw the ideal face; that is, a person can selectively extract and gather information even from non-ideal faces. Similarly, through the ideality preserving loss (L_ip) below, the IDTR can selectively gather the ideal identity information that is scattered across multiple non-ideal source images.

Here, the two terms are the latent vectors extracted from a single ideal source and from multiple non-ideal sources, respectively.

3.2 Identity loss

This loss function constrains the identity between the source image x and the swapped result. In the present invention, cosine similarity is used to compute the distance between the two, and the identity is extracted with a pre-trained ArcFace as in the following equation.
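A cosine-based identity loss can be sketched as follows. The embeddings are taken as given vectors; in the described system they would come from a pre-trained ArcFace network, which is not included here. The exact form `1 - cos` and the function name are assumptions for illustration.

```python
import numpy as np

def identity_loss(emb_source, emb_swapped, eps=1e-8):
    """emb_source, emb_swapped: 1-D identity embeddings of the source and the swapped result."""
    cos = np.dot(emb_source, emb_swapped) / (
        np.linalg.norm(emb_source) * np.linalg.norm(emb_swapped) + eps)
    return 1.0 - cos   # 0 when the identities align, up to 2 when they are opposite

same = identity_loss(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
orthogonal = identity_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Because cosine similarity ignores vector length, the loss depends only on the direction of the embeddings, so `same` is close to 0 even though the two vectors differ in scale.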

3.3 LPIPS (Learned Perceptual Image Patch Similarity) loss

This loss function captures fine details and improves realism. It uses a perceptual feature extractor F(·), as in the following equation.

3.4 Self-reconstruction loss

This loss function constrains the pixel-wise difference between the target y and the swapped result. It is applied when the target y is a randomly flipped copy of the source x.
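The condition under which the self-reconstruction loss applies can be sketched as below. The use of a mean absolute (L1) pixel difference and of a horizontal flip are assumptions, since the section does not give the norm or the flip axis, and `perfect` stands in for a hypothetical swapped output.

```python
import numpy as np

def self_reconstruction_loss(target, swapped, target_is_flipped_source):
    """target, swapped: images [C, H, W]. The loss is active only for self-swap pairs."""
    if not target_is_flipped_source:
        return 0.0
    return float(np.mean(np.abs(target - swapped)))   # L1 distance (assumed norm)

rng = np.random.default_rng(0)
x = rng.random((3, 8, 8))            # source image
flip = rng.random() < 0.5            # random horizontal flip (assumed axis)
y = x[:, :, ::-1] if flip else x     # target derived from the source
perfect = y.copy()                   # hypothetical swap that reproduces the target exactly
loss_perfect = self_reconstruction_loss(y, perfect, True)
loss_offset = self_reconstruction_loss(y, perfect + 0.5, True)
```

When source and target share an identity, the ideal swapped result is the target itself, which is why a pixel-wise penalty is meaningful only in this case.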

3.5 Regularization loss

This loss function encourages the feature-to-style module to output latent style vectors that are closer to the mean latent vector of the pre-trained StyleGAN2, and is expressed by the following equation.

In summary, the total loss function according to the present invention can be written as the following equation.

Here, λ1 = 0.0003, λ2 = 1, λ3 = 3, λ4 = 1, and λ5 = 0.01.
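The total loss can be read as a weighted sum of the five terms, sketched below. The text ties λ1 to the ideality preserving loss; the assignment of λ2 to λ5 to the identity, LPIPS, self-reconstruction, and regularization losses follows the order in which the losses are listed and is an assumption. The per-term values in `example` are made-up numbers used only to exercise the arithmetic.

```python
# Weights from the text; the mapping of lambda2..lambda5 to loss names is assumed.
LAMBDAS = {"ip": 0.0003, "id": 1.0, "lpips": 3.0, "self": 1.0, "reg": 0.01}

def total_loss(losses, lambdas=LAMBDAS):
    """losses: dict with one scalar per loss name in `lambdas`."""
    return sum(lambdas[name] * losses[name] for name in lambdas)

example = {"ip": 100.0, "id": 0.5, "lpips": 0.2, "self": 0.1, "reg": 2.0}
L_total = total_loss(example)   # 0.03 + 0.5 + 0.6 + 0.1 + 0.02
```

The very small λ1 suggests that the ideality preserving term is numerically much larger than the others before weighting, although the section does not state its scale.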

The table below shows the effect of the ideality preserving loss (L_ip) as the change in PSNR/SSIM with M_ip and λ1.

As the table shows, the model trained with the ideality preserving loss (L_ip) clearly outperforms the model trained without it (λ1 = 0) in all cases: a single source, multiple sources without an ideal image, and multiple sources including an ideal image. In other words, when non-ideal sources are given to a model trained with L_ip, more ideal information can be selectively extracted.

The improvement is larger when λ1 is large than when it is small, and the best performance was observed when M_ip was 7. Accordingly, in an embodiment of the present invention, λ1 = 0.0003 and M_ip = 5 are used in the remainder of this description.

4. Results

FIGS. 6 and 7 show realistic high-resolution (1024x1024), high-definition results.

Referring to FIGS. 6 and 7, the face in the target image is replaced with the face from the source image effectively and realistically.

FIGS. 8 and 9 are images comparing the present invention with the prior art.

Referring to FIGS. 8 and 9, compared with Prior Art 2, a subject-agnostic high-resolution (1024x1024) face replacement technique, the present invention obtains far more natural results by making effective use of the information inherent in the source and the target. In particular, the present invention is a "Subject-agnostic Face Swapping" method, which can replace a face regardless of whether the person appears in the training dataset of the deep network, and it has the advantage of enabling highly effective and realistic face replacement. By contrast, a "Subject-specific Face Swapping" method can replace faces only for people present in the training dataset, so replacing the face of a particular person requires building a training dataset and retraining the deep network. The present invention avoids this burden and can perform realistic, natural face replacement from an arbitrary face image, and is therefore superior to the prior art.

FIG. 10 shows a comparison that also includes existing face replacement techniques for low-resolution images.

Referring to FIG. 10, the result of the present invention (ours) is far more natural than those of the six prior-art methods, and the source image is well reflected in the target image.

FIG. 11 shows results obtained using multiple low-quality source images.

In FIG. 11, (a) is the result when a high-quality source image (neutral expression, frontal face) is used; (b) is the result when only a low-quality source image (unusual expression, profile view) is used; and (c) is the result when several low-quality source images ((b) plus an image with closed eyes) are used complementarily by the proposed method.

Referring to FIG. 11, the present invention provides the first method and system that can use several low-quality source images (unusual expressions, angles, etc.) at once (see (c)). Given that the range of expressions and poses a subject can take in real life is very wide, the complementary use of information scattered across several source images is a clear advantage over the prior art.

FIG. 12 shows a qualitative analysis of multi-source face replacement results on VGGFace2-HQ.

In FIG. 12, the first row shows one target image, one ideal source, and three non-ideal sources, where all sources come from the same identity in VFHQ.

The lower row of FIG. 12 shows the image generated from the ideal source and the images generated from sources with various attributes. For the conventional SimSwap [16] and MegaFS [17], the results differ perceptibly as the attributes of the source change (in particular, (d) and (e)).

By contrast, the multi-source face replacement technique according to the present invention, using two images, shows no significant difference in color or in the magnified view from the result obtained with the ideal source.

As described above, most existing methods could handle only images up to 512x512 resolution, and even the few methods applicable to 1024x1024 resolution still produced poor-quality results. The present invention enables high-quality face replacement on 1024x1024 high-resolution images and can therefore meet the demand created by the recent spread of high-resolution imaging devices. It can also generate results with the highest level of sharpness and naturalness compared with existing methods, and it can produce good-quality results by using several (two or more) low-quality source images (unusual expressions, angles, etc.) at the same time.

Claims

A high-definition face replacement system based on an attention algorithm, comprising:
a feature point extraction unit (100) for extracting feature points from a source image;
an identity conversion unit (IDTR, 200) for outputting a result value for performing face replacement according to the feature points input from the feature point extraction unit; and
an image generating unit (300) for generating a replaced face image according to an output value from the identity conversion unit.

According to claim 1,
The high-definition face replacement system, characterized in that the identity conversion unit (IDTR, 200) outputs a result value by using a soft attention algorithm and a hard attention algorithm at the same time.

According to claim 2,
The high-definition face replacement system, characterized in that the soft attention algorithm outputs an attention value according to the following formula.

(In the above formula, A_soft ∈ R^C×HW is the output of soft attention, h(·) is a 1 × 1 convolution layer, and M ∈ R^C×HW, where each point of M is the sum of all points of V weighted by A)

According to claim 2,
The high-definition face replacement system, characterized in that the hard attention algorithm outputs an attention value according to the following equation.

(In the equation, h_i is the index of the position in the source image that is most relevant to the i-th position of the target image, and A_hard(i, j) is the attention value at position (i, j) of A_hard)

According to claim 1,
The high-definition face replacement system, characterized in that the identity conversion unit (IDTR, 200) generates A ∈ R^HW×HW, which is the attention map for the soft and hard attention algorithms, according to the following equation.

(Q_u ∈ R^C×HW and K_u ∈ R^C×HW are the unfolded forms of Q ∈ R^C×H×W and K ∈ R^C×H×W)

A high-definition face replacement system based on an attention algorithm, comprising:
a feature point extraction unit (100) for extracting feature points from at least two source images;
an identity conversion unit (IDTR, 200) for outputting a result value for performing face replacement according to the feature points input from the feature point extraction unit; and
an image generating unit (300) for generating a replaced face image according to an output value from the identity conversion unit.

According to claim 6,
The high-definition face replacement system, characterized in that the identity conversion unit (IDTR, 200) outputs a result value by using a soft attention algorithm and a hard attention algorithm at the same time.

According to claim 7,
The high-definition face replacement system, characterized in that the soft attention algorithm outputs an attention value according to the following formula.

(In the above formula, A_soft ∈ R^C×HW is the output of soft attention, h(·) is a 1 × 1 convolution layer, and M ∈ R^C×HW, where each point of M is the sum of all points of V weighted by A)

According to claim 7,
The high-definition face replacement system, characterized in that the hard attention algorithm outputs an attention value according to the following formula.

(In the equation, h_i is the index of the position in the source image that is most relevant to the i-th position of the target image, and A_hard(i, j) is the attention value at position (i, j) of A_hard)

According to claim 7,
The high-definition face replacement system, characterized in that the identity conversion unit (IDTR, 200) generates attention maps for the soft and hard attention algorithms according to the following equation.

(Here, A_multi is the attention map when there are at least two source images, and ⊙ is batch matrix multiplication)

A high-definition face replacement system comprising: the high-definition face replacement system according to any one of claims 1 to 5; and
a learning unit that performs learning through a loss function including an ideality preserving loss function, an identity loss function, an LPIPS (Learned Perceptual Image Patch Similarity) loss function, a self-reconstruction loss function, and a regularization loss function.

A high-definition face replacement system comprising: the high-definition face replacement system according to any one of claims 6 to 10; and
a learning unit that performs learning through a loss function including all of an ideality preserving loss function, an identity loss function, an LPIPS (Learned Perceptual Image Patch Similarity) loss function, a self-reconstruction loss function, and a regularization loss function.

According to claim 11,
The high-definition face replacement system, characterized in that the loss function (L_total) is expressed by the following formula.

(Where L_ip is the ideality preserving loss function, L_id is the identity loss function, L_LPIPS is the LPIPS (Learned Perceptual Image Patch Similarity) loss function, L_self is the self-reconstruction loss function, and L_reg is the regularization loss function)

According to claim 11,
The high-definition face replacement system, characterized in that the loss function is expressed by the following formula.

(Where L_ip is the ideality preserving loss function, L_id is the identity loss function, L_LPIPS is the LPIPS (Learned Perceptual Image Patch Similarity) loss function, L_self is the self-reconstruction loss function, and L_reg is the regularization loss function)

A high-definition face replacement method based on an attention algorithm, comprising:
mapping feature points of a source image to a feature space;
outputting, by an attention algorithm, a result value for performing face replacement from the mapped feature points; and
performing face replacement from the output result value,
wherein the attention algorithm outputs the result value by using a soft attention algorithm and a hard attention algorithm at the same time.

According to claim 15,
The high-definition face replacement method, characterized in that the soft attention algorithm outputs an attention value according to the following formula.

(In the above formula, A_soft ∈ R^C×HW is the output of soft attention, h(·) is a 1 × 1 convolution layer, and M ∈ R^C×HW, where each point of M is the sum of all points of V weighted by A)

According to claim 15,
The high-definition face replacement method, characterized in that the hard attention algorithm outputs an attention value according to the following equation.

(In the equation, h_i is the index of the position in the source image that is most relevant to the i-th position of the target image, and A_hard(i, j) is the attention value at position (i, j) of A_hard)

A high-definition face replacement method based on an attention algorithm, comprising:
mapping feature points of at least two source images to a feature space;
outputting, by an attention algorithm, a result value for performing face replacement from the mapped feature points; and
performing face replacement from the output result value,
wherein the attention algorithm outputs the result value by using a soft attention algorithm and a hard attention algorithm at the same time.

According to claim 18,
The high-definition face replacement method, characterized in that the outputting by the attention algorithm includes generating attention maps for the soft and hard attention algorithms according to the following formula.

(Here, A_multi is the attention map when there are at least two source images, and ⊙ is batch matrix multiplication)

According to claim 18,
The high-definition face replacement method, characterized in that the soft attention algorithm outputs an attention value according to the following formula.

(In the above formula, A_soft ∈ R^C×HW is the output of soft attention, h(·) is a 1 × 1 convolution layer, and M ∈ R^C×HW, where each point of M is the sum of all points of V weighted by A)

According to claim 18,
The high-definition face replacement method, characterized in that the hard attention algorithm outputs an attention value according to the following equation.

(In the equation, h_i is the index of the position in the source image that is most relevant to the i-th position of the target image, and A_hard(i, j) is the attention value at position (i, j) of A_hard)