KR20220145792A

KR20220145792A - Method and apparatus for face image reconstruction using video identity clarification model

Info

Publication number: KR20220145792A
Application number: KR1020220050392A
Authority: KR
Inventors: 이영기; 이주헌
Original assignee: 서울대학교산학협력단
Priority date: 2021-04-22
Filing date: 2022-04-22
Publication date: 2022-10-31
Also published as: KR102613887B1

Abstract

A method and apparatus for reconstructing a high-quality facial image from a low-quality facial image using an identity clarification model are provided. According to an embodiment, a face image reconstruction method includes the steps of: acquiring learning data; and learning a video identity clarification network (VICN). Accordingly, face recognition accuracy can be improved.

Description

FACE IMAGE RECONSTRUCTION METHOD AND APPARATUS USING VIDEO IDENTITY CLARIFICATION MODEL

The present invention relates to a method and apparatus for reconstructing a face image, and more particularly to a method and apparatus for reconstructing a high-quality face image from a low-quality face image using an identity restoration model and/or a video identity restoration model.

The following is provided only as background information related to embodiments of the present invention, and does not necessarily constitute prior art.

Face recognition in a crowded urban space requires accurately recognizing low-quality faces that were captured from a long distance and appear in the input image. Recent deep neural network (DNN)-based face recognition techniques achieve high accuracy, but their accuracy on low-quality images is significantly lower.

Meanwhile, DNN-based research on reconstructing low-quality face images into high-quality images is also active, but it concentrates on producing visually plausible images and does not help improve recognition accuracy.

To raise the limited accuracy of such face recognition technology, a technique is needed that can reconstruct a high-quality face image from a small, low-quality face image captured from a long distance.

In addition, existing DNN models assume that a single low-quality image is given as input. As a result, when a target face is captured across consecutive frames of a video, they cannot use that information for quality reconstruction.

Meanwhile, the background art described above is technical information that the inventors possessed for deriving the present invention or acquired in the course of deriving it, and is not necessarily known art disclosed to the general public before the filing of the present invention.

An object of the present invention is to provide a face image reconstruction method and apparatus that reconstruct a high-quality face image by restoring the identity of a low-quality face image.

An object of the present invention is to provide an identity restoration model (Identity Clarification Network; ICN) that restores the identity of a low-quality input image.

An object of the present invention is to provide a video identity restoration model (Video Identity Clarification Network; VICN) that reconstructs a high-quality image of a target face from a series of low-quality image frames captured from consecutive frames of a video, and a face image reconstruction method and apparatus using the same.

The objects of the present invention are not limited to those mentioned above. Other objects and advantages of the present invention that are not mentioned can be understood from the following description and will be more clearly understood from the embodiments of the present invention. It will also be appreciated that the objects and advantages of the present invention can be realized by the means indicated in the claims and combinations thereof.

A face image reconstruction method according to an embodiment of the present invention includes obtaining training data including a face image and a ground-truth face image for the face image, and training an identity restoration model (Identity Clarification Network; ICN) based on the training data. The training may include a generation step of executing a generator of the identity restoration model to generate a reconstructed face image in which the identity of the face shown in the face image is restored, and a discrimination step of executing a discriminator of the identity restoration model, which competes with the generator in a generative adversarial network (GAN), to discriminate the reconstructed face image based on the ground-truth face image.

A face image reconstruction apparatus according to an embodiment of the present invention includes a memory that stores an identity restoration model including a generator and a discriminator that competes with the generator in a generative adversarial network, and a processor configured to train the identity restoration model based on training data including a face image and a ground-truth face image for the face image. To perform the training, the processor may be configured to perform a generation operation of executing the generator to generate a reconstructed face image in which the identity of the face shown in the face image is restored, and a discrimination operation of executing the discriminator to discriminate the reconstructed face image based on the ground-truth face image.

A face image reconstruction method according to an embodiment of the present invention, executed by a face image reconstruction apparatus including a processor, includes obtaining training data including at least one face image tracked from a series of frames of an input video and a ground-truth face image for the at least one face image, and training a video identity restoration model (Video Identity Clarification Network; VICN) based on the training data. The training may include a generation step of executing a generator of the video identity restoration model to generate a reconstructed face image in which the identity of the face shown in the at least one face image is restored, and a discrimination step of executing a discriminator of the identity restoration model, which competes with the generator in a generative adversarial network (GAN), to discriminate the reconstructed face image based on the ground-truth face image.

A face image reconstruction apparatus according to an embodiment of the present invention includes a memory that stores a video identity restoration model including a generator and a discriminator that competes with the generator in a generative adversarial network, and a processor configured to train the video identity restoration model based on training data including at least one face image tracked from a series of frames of an input video and a ground-truth face image for the at least one face image. To perform the training, the processor may be configured to perform a generation operation of executing the generator to generate a reconstructed face image in which the identity of the face shown in the at least one face image is restored, and a discrimination operation of executing the discriminator to discriminate the reconstructed face image based on the ground-truth face image.

Other aspects, features, and advantages besides those described above will become apparent from the following drawings, claims, and detailed description of the invention.

According to the embodiment, a high-quality face image can be reconstructed by restoring the identity of a low-quality face image.

According to the embodiment, the accuracy of detecting a search target from a low-quality face image is improved.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a schematic illustration of an operating environment of a face image reconstruction apparatus according to an embodiment.
FIG. 2 is a block diagram of a face image reconstruction apparatus according to an embodiment.
FIG. 3 is a flowchart of a face image reconstruction method according to an embodiment.
FIG. 4 is a diagram for explaining an identity restoration model and its training structure according to an embodiment.
FIG. 5 is a flowchart of the training process of a face image reconstruction method according to an embodiment.
FIG. 6 is a diagram for explaining the network structure of a generator of an identity restoration model according to an embodiment.
FIG. 7 is a diagram showing example results of a face image reconstruction process according to an embodiment.
FIG. 8 is a flowchart of a face image reconstruction method using a video identity restoration model according to an embodiment.
FIG. 9 is a diagram for explaining the face tracking process of face image reconstruction using a video identity restoration model according to an embodiment.
FIG. 10 is a diagram illustrating an example of the face tracking process of face image reconstruction using a video identity restoration model according to an embodiment.
FIG. 11 is a diagram for explaining a video identity restoration model and its training structure according to an embodiment.
FIG. 12 is a diagram for explaining the network structure of a multi-frame face quality enhancer of a generator of an identity restoration model according to an embodiment.

Hereinafter, the present invention is described in more detail with reference to the drawings. The present invention may be embodied in many different forms and is not limited to the embodiments described herein. In the following embodiments, parts not directly related to the description are omitted in order to describe the present invention clearly, but this does not mean that the omitted components are unnecessary when implementing a device or system to which the spirit of the present invention is applied. In addition, the same reference numerals are used for the same or similar elements throughout the specification.

In the following description, terms such as first and second may be used to describe various components, but the components should not be limited by these terms; the terms are used only to distinguish one component from another. Also, in the following description, a singular expression includes the plural unless the context clearly indicates otherwise.

In the following description, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, the present invention is described in detail with reference to the drawings.

FIG. 1 is a schematic illustration of an operating environment of a face image reconstruction apparatus according to an embodiment.

The face image reconstruction process according to the embodiment is a technique for high-precision face recognition. It can be applied to a deep neural network (DNN)-based face recognition algorithm to improve the accuracy of image-based face recognition. A high-complexity space here is a space crowded with many people, for example a downtown area with heavy foot traffic, a transfer station during rush hour, a sports stadium filled with spectators, or a shopping mall.

The face image reconstruction process according to the embodiment can reconstruct high-quality face images so that the many small, distant faces contained in an image acquired in a high-complexity space by the camera of a terminal, such as a smartphone, wearable glasses, or a closed-circuit television (CCTV), can be accurately recognized.

The face image reconstruction apparatus 100 according to the embodiment may execute the face image reconstruction method according to the embodiment to generate a reconstructed face image from an input face image.

The face image reconstruction apparatus 100 may improve the quality of a low-quality face image and reconstruct it into a high-quality face image so that a face recognition algorithm can accurately recognize a small face captured from a long distance.

To this end, the face image reconstruction apparatus 100 according to the embodiment may provide a deep neural network (DNN)-based identity restoration model (Identity Clarification Network; ICN).

In one example, the identity restoration model introduces a model structure and a training objective function (training loss function) for reconstructing face images so as to improve the accuracy of face recognition by a face recognition algorithm.

In one example, the face image reconstruction apparatus 100 may train the identity restoration model using training data provided from the server 200 through the network 300. In one example, the face image reconstruction apparatus 100 may transmit the trained identity restoration model to the server 200 or another terminal device through the network 300.

In one example, the face image reconstruction apparatus 100 may receive a pre-trained identity restoration model through the network 300. For example, the face image reconstruction apparatus 100 may receive, through the network 300, an identity restoration model trained on the server 200 or another terminal device.

The face image reconstruction apparatus 100 may execute the trained identity restoration model to reconstruct a low-quality face image contained in an input image into a high-quality face image. Here, the face image reconstruction apparatus 100 may capture the input image directly or receive it from the server 200 or another terminal device through the network 300.

The face image reconstruction apparatus 100 may be implemented in a terminal or in the server 200. Here, the terminal may be, but is not limited to, a desktop computer, smartphone, notebook, tablet PC, smart TV, mobile phone, personal digital assistant (PDA), laptop, digital camera, home appliance, or other mobile or non-mobile computing device operated by a user. The terminal may also be a wearable device with communication and data processing functions, such as a watch, glasses, a hair band, or a ring.

In one example, the terminal or the server 200 may reconstruct a face image contained in an input image by running an application (app) that executes the face image reconstruction method according to the embodiment.

The server 200 may analyze training data to train the identity restoration model and provide the trained identity restoration model to the face image reconstruction apparatus 100 through the network 300. In another example, the face image reconstruction apparatus 100 may train the identity restoration model on-device, without a connection to the server 200.

The network 300 may be any suitable communication network, including wired and wireless networks such as a local area network (LAN), a wide area network (WAN), the Internet, an intranet, and an extranet, and mobile networks such as cellular, 3G, LTE, 5G, and WiFi networks, ad hoc networks, and combinations thereof.

The network 300 may include connections of network elements such as hubs, bridges, routers, switches, and gateways. The network 300 may include one or more connected networks, for example a multi-network environment, including public networks such as the Internet and private networks such as a secure corporate private network. Access to the network 300 may be provided through one or more wired or wireless access networks.

Hereinafter, the face image reconstruction method and apparatus according to the embodiment are described in more detail with reference to FIGS. 2 to 7.

FIG. 2 is a block diagram of a face image reconstruction apparatus according to an embodiment.

The face image reconstruction apparatus 100 according to the embodiment may include a memory 120 and a processor 110. This configuration is exemplary: the face image reconstruction apparatus 100 may include only some of the components shown in FIG. 2, or may additionally include components not shown in FIG. 2 that are necessary for the operation of the apparatus.

The processor 110 is a kind of central processing unit and may control the operation of the face image reconstruction apparatus 100 by executing one or more instructions stored in the memory 120.

The processor 110 may include any kind of device capable of processing data. The processor 110 may refer to, for example, a data processing device embedded in hardware that has circuitry physically structured to perform functions expressed as code or instructions contained in a program.

Examples of such a hardware-embedded data processing device include, but are not limited to, a microprocessor, a central processing unit (CPU), a graphics processing unit (GPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA). The processor 110 may include one or more processors.

The face image reconstruction apparatus 100 may include the memory 120, which stores an identity restoration model including a generator and a discriminator that competes with the generator in a generative adversarial network, and the processor 110, which is configured to train the identity restoration model based on training data including a face image and a ground-truth face image for the face image.

To train the identity restoration model, the processor 110 may perform a generation operation of executing the generator to generate a reconstructed face image in which the identity of the face shown in the face image is restored.

To train the identity restoration model, the processor 110 may be configured to perform a discrimination operation of executing the discriminator to discriminate the face image reconstructed by the generator based on the ground-truth face image.

In one example, the generator may include a face landmark predictor and a face upsampler. To perform the generation operation in the generator, the processor 110 may be configured to execute the face landmark predictor to predict a plurality of face landmarks based on the face image, and to execute the face upsampler to upsample the face image using the plurality of face landmarks.

In one example, the generator may further include an intermediate image generator including a plurality of residual blocks. To perform the generation operation in the generator, the processor 110 may be configured to generate, using the intermediate image generator, an intermediate image in which the quality of the face image is improved, to execute the face landmark predictor to predict a plurality of face landmarks based on the intermediate image, and to execute the face upsampler to upsample the intermediate image using the plurality of face landmarks predicted from the intermediate image.
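
The data flow of the generator described above (intermediate image generator, then face landmark predictor, then face upsampler) can be sketched as follows. This is only an illustration of the ordering of the three stages under stated assumptions: the class name `Generator`, the type aliases, and the placeholder function `nearest_upsample` are hypothetical, and the placeholder stages stand in for the neural networks that the patent describes but whose layers are not specified here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[float]]               # grayscale image as rows of pixel values
Landmarks = List[Tuple[float, float]]   # (x, y) landmark coordinates


@dataclass
class Generator:
    """Hypothetical container for the three generator stages."""
    intermediate: Callable[[Image], Image]           # residual-block stage
    predict_landmarks: Callable[[Image], Landmarks]  # face landmark predictor
    upsample: Callable[[Image, Landmarks], Image]    # face upsampler

    def __call__(self, face: Image) -> Tuple[Image, Landmarks]:
        mid = self.intermediate(face)         # quality-improved intermediate image
        marks = self.predict_landmarks(mid)   # landmarks predicted from the intermediate image
        return self.upsample(mid, marks), marks


def nearest_upsample(img: Image, marks: Landmarks, factor: int = 2) -> Image:
    # Placeholder only: repeats each pixel `factor` times in both directions.
    # A real face upsampler would condition on `marks`; this one ignores them.
    return [[p for p in row for _ in range(factor)]
            for row in img for _ in range(factor)]


# Usage with trivial stand-ins for the learned stages.
gen = Generator(
    intermediate=lambda img: img,                 # identity stand-in
    predict_landmarks=lambda img: [(0.5, 0.5)],   # single dummy landmark
    upsample=nearest_upsample,
)
reconstructed, landmarks = gen([[1.0, 2.0], [3.0, 4.0]])
```

Returning the landmarks together with the reconstructed image is a design choice of this sketch; it keeps the predicted landmarks available for a landmark accuracy term during training.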

In one example, the identity restoration model may further include a face feature extractor. To train the identity restoration model, the processor 110 may be configured to execute the face feature extractor to extract a feature map of the face image reconstructed by the generator and a feature map of the ground-truth face image.

In one example, to train the identity restoration model, the processor 110 may be configured to compute a training objective function and to train the generator and the discriminator alternately so as to minimize the value of the training objective function.

Here, the training objective function may include a first objective function that includes a GAN loss function for the generator, and a second objective function based on a GAN loss function for the discriminator.

The first objective function may further include a pixel reconstruction accuracy function between the reconstructed face image and the ground-truth face image, a prediction accuracy function for the face landmarks predicted in the generation operation of the reconstructed face image, and a face feature similarity function between the reconstructed face image and the ground-truth face image.
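
The two objective functions and the alternating training described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the text names the terms but not their formulas, so the cross-entropy form of the GAN losses, the mean squared error used for the pixel and landmark terms, the cosine distance used for the feature term, the weights `w_pixel`, `w_landmark`, `w_feature`, the discriminator-first update order, and all function names are assumptions.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 when the feature vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def first_objective(d_fake: float,
                    recon: Sequence[float], truth: Sequence[float],
                    marks_pred: Sequence[float], marks_true: Sequence[float],
                    feat_recon: Sequence[float], feat_true: Sequence[float],
                    w_pixel: float = 1.0, w_landmark: float = 1.0,
                    w_feature: float = 1.0) -> float:
    """Generator objective: GAN loss plus pixel, landmark, and feature terms."""
    gan = -math.log(d_fake)  # assumed non-saturating generator GAN loss
    return (gan
            + w_pixel * mse(recon, truth)
            + w_landmark * mse(marks_pred, marks_true)
            + w_feature * cosine_distance(feat_recon, feat_true))


def second_objective(d_real: float, d_fake: float) -> float:
    """Discriminator objective: assumed binary cross-entropy GAN loss."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def train_alternating(d_step: Callable, g_step: Callable,
                      batches: Iterable) -> List[Tuple[float, float]]:
    """Alternate one discriminator update and one generator update per batch."""
    history = []
    for batch in batches:
        d_loss = d_step(batch)  # update the discriminator, generator held fixed
        g_loss = g_step(batch)  # update the generator, discriminator held fixed
        history.append((d_loss, g_loss))
    return history
```

Here `d_fake` and `d_real` are the discriminator's probabilities (strictly between 0 and 1) that the reconstructed and ground-truth images are real, and images, landmarks, and features are flattened to plain vectors for simplicity.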

In one example, the processor 110 may be configured to perform second training that fine-tunes the identity restoration model based on second training data including a face image of a search target and a reference face image for the face image of that search target.

To perform the second training, the processor 110 may be configured to perform the generation operation and the discrimination operation based on the second training data.
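
The two-phase schedule described above, general training followed by fine-tuning on the search target, can be sketched as follows. This is a schematic outline under stated assumptions: `run_phase`, `train_then_finetune`, the epoch counts, and the representation of a training pair as two identifiers are all hypothetical, and `step` stands in for one combined generation and discrimination update.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (input face image id, ground-truth or reference image id)


def run_phase(step: Callable[[Pair], float],
              data: Sequence[Pair], epochs: int) -> List[float]:
    """Apply the same generation/discrimination update to every pair."""
    losses = []
    for _ in range(epochs):
        for pair in data:
            losses.append(step(pair))
    return losses


def train_then_finetune(step: Callable[[Pair], float],
                        training_data: Sequence[Pair],
                        second_training_data: Sequence[Pair],
                        epochs: int = 2,
                        finetune_epochs: int = 1) -> Tuple[List[float], List[float]]:
    """First training on general data, then second training on the search target."""
    first = run_phase(step, training_data, epochs)
    second = run_phase(step, second_training_data, finetune_epochs)
    return first, second
```

The point of the sketch is that the second training reuses the same update step and only swaps in the second training data.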

The memory 120 may store a program including one or more instructions for executing the face image reconstruction process according to the embodiment. The processor 110 may execute the face image reconstruction process according to the embodiment based on the program and instructions stored in the memory 120.

The memory 120 may further store the identity restoration model (ICN), as well as intermediate data and results produced during the computation for face image reconstruction by the identity restoration model (ICN).

The memory 120 may include internal memory and/or external memory, and may include volatile memory such as DRAM, SRAM, or SDRAM; non-volatile memory such as one time programmable ROM (OTPROM), PROM, EPROM, EEPROM, mask ROM, flash ROM, NAND flash memory, or NOR flash memory; a flash drive such as an SSD, compact flash (CF) card, SD card, Micro-SD card, Mini-SD card, xD card, or memory stick; or a storage device such as an HDD. The memory 120 may include, but is not limited to, magnetic storage media or flash storage media.

The face image reconstruction apparatus 100 according to the embodiment may further include a communication unit 130.

The communication unit 130 includes a communication interface for transmitting and receiving data of the face image reconstruction apparatus 100. The communication unit 130 may provide the face image reconstruction apparatus 100 with wired and wireless communication paths of various types, connecting it to the network 300 shown in FIG. 1.

The face image reconstruction apparatus 100 may transmit and receive input images, training data, second training data, intermediate images, reconstructed images, and the like through the communication unit 130. The communication unit 130 may be configured to include, for example, at least one of various wireless Internet modules, a short-range communication module, a GPS module, a modem for mobile communication, and the like.

The face image reconstruction apparatus 100 may further include a bus 140 that provides a physical/logical connection path between the processor 110, the memory 120, and the communication unit 130.

FIG. 3 is a flowchart of a face image reconstruction method according to an embodiment.

The face image reconstruction method according to the embodiment may include a step (S1) of obtaining training data that includes a face image and a ground truth face image for that face image, and a step (S2) of training an identity clarification model (Identity Clarification Network; ICN) based on the training data.

In step S1, the processor 110 obtains training data including a face image and a ground truth face image for the face image.

Here, the face image is the input data for the ICN, and the ground truth face image for the input image serves as the ground truth data for the reconstructed face image that the ICN generates from that face image.

For example, the face image input to the ICN may be a low-quality face image, and the ground truth face image may be a face image of higher quality than the low-quality face image.

In one example, the processor 110 may generate the face image to be input to the ICN by downsampling the ground truth face image.

For example, the processor 110 may downsample high-quality face photos of people with various identities to build a training dataset of <high-quality ground truth face image, low-quality face image> pairs. For example, the processor 110 may use the FFHQ dataset, which consists of about 70,000 high-quality faces (see T. Karras et al., "A Style-Based Generator Architecture for Generative Adversarial Networks," CVPR 2019).
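The pairing of ground truth and downsampled images described above can be illustrated with a minimal sketch. It assumes grayscale images stored as nested lists and simple average pooling; the description does not specify the downsampling filter or factor, and the names `downsample` and `build_training_pairs` are hypothetical.

```python
def downsample(image, factor):
    """Average-pool a 2D grayscale image (a list of rows) by an integer factor.

    Average pooling is an assumption made for illustration; the description
    only states that the ground truth image is downsampled.
    """
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by factor")
    low = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            # Collect the factor x factor block and replace it by its mean.
            block = [image[top + dy][left + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        low.append(row)
    return low


def build_training_pairs(high_quality_faces, factor=4):
    """Return a list of (ground_truth, low_quality) pairs."""
    return [(face, downsample(face, factor)) for face in high_quality_faces]
```

The same helper could be reused for the second training data of step S3, since the description states that the data may be obtained in the same manner as in step S1.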

In one example, the processor 110 may receive a training dataset of <high-quality ground truth face image, low-quality face image> pairs from the server 200 or from another terminal through the network 300 of FIG. 1.

In step S2, the processor 110 trains the ICN based on the training data obtained in step S1.

Step S2 includes a generation step (S21 of FIG. 5) of executing the generator of the ICN to generate a reconstructed face image in which the identity of the face shown in the face image is restored, and a discrimination step (S22 of FIG. 5) of executing the discriminator of the ICN, which competes with the generator within a generative adversarial network (GAN), to discriminate the reconstructed face image based on the ground truth face image.

In step S2, the processor 110 trains the ICN on the training dataset built in step S1. The ICN trained through this process is able to reconstruct an arbitrary low-quality input into a high-quality face while preserving identity information. Identity information means the identity conveyed by the visual features of the subject's face. Step S2 is described in detail with reference to FIG. 5.

The face image reconstruction method according to the embodiment may further include a step (S3) of obtaining second training data that includes a face image of a search target and a reference face image for that face image, and a second training step (S4) of fine-tuning, based on the second training data, the ICN trained in step S2.

In step S3, the processor 110 may build a second training dataset of <high-quality reference face image (probe), low-quality face image> pairs for the search target. For example, the processor 110 may build the second training data from a plurality of reference face images per search target and a low-quality face image for each reference face image.

In one example, the processor 110 may obtain the second training data in step S3 in the same manner as it obtained the training data in step S1.

In step S4, the processor 110 may perform second training that fine-tunes, based on the second training data, the ICN trained in step S2.

In one example, the second training step may include executing the generation step (S21) and the discrimination step (S22), described later with reference to FIG. 5, based on the second training data obtained in step S3.

In step S4, the processor 110 uses the second training data, which is based on the reference images of the search target, to execute second training that fine-tunes the ICN trained in step S2. The ICN after the second training in step S4 is specialized for the search target and can reconstruct a low-quality face of the search target so that it more closely resembles the search target. Step S4 is described with reference to FIG. 7.

Additionally, the face image reconstruction method according to the embodiment may further include a step (S5) of recognizing the search target in an input video by using the ICN that underwent the second training in step S4.

In step S5, the processor 110 may execute the trained ICN with a low-quality face image extracted from the input video as input data, thereby reconstructing a high-quality face image from the low-quality face image.

Also in step S5, the processor 110 may determine, based on the high-quality face image reconstructed by the ICN, the similarity between the search target and at least one face region found in the input video, and may determine, based on that similarity, whether the search target is among the face regions found in the input video.
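The similarity decision of step S5 can be sketched as follows. This is an illustration only: the description does not name the similarity measure, so cosine similarity between face feature vectors, the threshold value, and the function names are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def find_target(face_features, probe_feature, threshold=0.5):
    """Return (index, similarity) of the best face at or above the threshold.

    face_features: features of the reconstructed faces found in one frame.
    probe_feature: feature of the search target's reference image.
    Returns None when no face reaches the (hypothetical) threshold.
    """
    best = None
    for index, feature in enumerate(face_features):
        score = cosine_similarity(feature, probe_feature)
        if score >= threshold and (best is None or score > best[1]):
            best = (index, score)
    return best
```

In practice the feature vectors would come from a face recognition network applied to the reconstructed high-quality faces; here they are plain lists of floats.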

The identity clarification model and the training structure used in the face image reconstruction method according to the embodiment are described below with reference to FIG. 4.

FIG. 4 is a diagram for explaining the identity clarification model and the training structure according to an embodiment.

In the embodiment, to accurately recognize small faces captured from a long distance, an Identity Clarification Network (ICN) was designed: a deep neural network (DNN) that improves the image quality of a low-quality input face and reconstructs it into a high-quality face.

The ICN includes a deep neural network based on a generative adversarial network structure. The ICN includes a generator (G) that generates a reconstructed face image (Reconstructed ŷ) from an input face image (LR), and a discriminator (D) that determines, based on the ground truth face image (Ground Truth y) for the input face image (LR), whether the reconstructed face image (Reconstructed ŷ) generated by the generator (G) corresponds to the ground truth face image (Ground Truth y).

The generator (G) and the discriminator (D) compete with each other within the generative adversarial network (GAN); a GAN loss function (L_GAN) is defined for the generator (G), and a GAN loss function (L_{Discriminator}) is defined for the discriminator (D).

The generator (G) includes a face landmark estimator (G_FLE). The face landmark estimator (G_FLE) can extract at least one face landmark (landmark ẑ) from the low-quality face image (LR) input to the generator (G). Here, for example, the face landmarks include information on the contour of the face and on its features (eyes, nose, and mouth).

A landmark accuracy function (L_landmark), described later, is defined based on the face landmarks predicted by the face landmark estimator (G_FLE) and the face landmarks extracted from the ground truth face image.

Introducing the face landmark estimator (G_FLE) into the ICN can improve the reconstruction accuracy of the high-quality face image (Reconstructed ŷ) when the generator (G) reconstructs a high-quality face.

The generator (G) includes a face upsampler (G_FUP). The face upsampler (G_FUP) generates a high-quality face image (Reconstructed ŷ) reconstructed from the low-quality face image (LR), based on at least one face landmark extracted by the face landmark estimator (G_FLE).

A pixel accuracy function (L_pixel), described later, is defined based on the pixel values of the face image reconstructed by the face upsampler (G_FUP) and those of the ground truth face image.

The structure of the generator (G) is described later with reference to FIG. 6. The discriminator (D) may be configured with a residual block based structure for GAN training.

Additionally, the ICN includes a face feature extractor (φ). The face feature extractor (φ) extracts a feature map of the reconstructed face image (Reconstructed ŷ) and a feature map of the ground truth face image (Ground Truth y).

A face feature similarity function (L_face), described later, is defined based on the feature map of the reconstructed face image (Reconstructed ŷ) and the feature map of the ground truth face image (Ground Truth y) extracted by the face feature extractor (φ).

The face feature extractor (φ) may, for example, be a face recognition network with a residual block based ArcFace network structure, but is not limited thereto and may include various neural network structures for face recognition.

The face image reconstruction apparatus 100 according to the embodiment trains the ICN by alternately minimizing a first objective function (L_total), which makes the generator (G) reconstruct realistic faces, and a second objective function (L_{Discriminator}), which enables the discriminator (D) to distinguish well between the faces reconstructed by the generator (G) and the ground truth faces, so that the ICN can achieve high reconstruction accuracy. This is described in steps S24 and S25 with reference to FIG. 5.

FIG. 5 is a flowchart of the training process of the face image reconstruction method according to an embodiment.

In one example, steps S21 to S25 shown in FIG. 5 may be executed in the training step S2 or in the second training step S4 of FIG. 3.

In step S21, the processor 110 may execute the generator (G) of the ICN, described with reference to FIG. 4, to generate a reconstructed face image (Reconstructed ŷ) in which the identity of the face shown in the input face image (LR) is restored.

The network structure of the generator (G) is described in more detail below with reference to FIG. 6.

FIG. 6 is a diagram for explaining the network structure of the generator of the ICN according to an embodiment.

The generator (G) may generate an intermediate image (IN) in which the image quality of the low-quality face image (LR) is first improved, predict face landmarks from the intermediate image (IN), and then use them to output the final high-quality face (HR).

In one example, the generator (G) includes an intermediate image generator (G_IN) including a neural network that generates the intermediate image (IN) from the low-quality input face image (LR); a face landmark estimator (G_FLE) including a neural network that predicts face landmarks from the intermediate image (IN); and a face upsampler (G_FUP) including a neural network that performs face upsampling based on the intermediate image (IN) and the face landmarks to generate the output face image (HR). Here, the output face image (HR) corresponds to the reconstructed face image (Reconstructed ŷ) of FIG. 4.

The generator (G) may use, as its basic block structure, the residual block, which achieves high accuracy in a variety of image processing tasks (see K. He et al., "Deep residual learning for image recognition," CVPR 2016).

In one example, the intermediate image generator (G_IN) may include a plurality of residual blocks. For example, the intermediate image generator (G_IN) may include 12 residual blocks.

In one example, the face landmark estimator (G_FLE) may be designed based on at least one stacked hourglass block structure. For example, the face landmark estimator (G_FLE) may include four stacked hourglass blocks.

In one example, the face upsampler (G_FUP) may include a plurality of sets, each consisting of a plurality of residual blocks. For example, the face upsampler (G_FUP) may include two residual block sets, each consisting of three residual blocks.

The intermediate image (IN) generated by the intermediate image generator (G_IN) is input to the face upsampler (G_FUP) and passes through at least one residual block set. The face landmark information predicted by the face landmark estimator (G_FLE) is then introduced, and the remaining residual block sets of the face upsampler (G_FUP) are executed to generate the output face image (HR).
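The data flow through the generator described above can be sketched as a composition of stages. This is a structural illustration, not the network itself: each stage is a caller-supplied callable standing in for a trained sub-network, the face upsampler is assumed to have exactly two residual block sets (the example configuration), and the function and parameter names are hypothetical.

```python
def run_generator(lr_image, g_in, g_fle, fup_first_set, fup_second_set):
    """Compose the generator stages in the order given in the description.

    g_in:           G_IN, maps the low-quality image (LR) to the intermediate image (IN)
    g_fle:          G_FLE, predicts face landmarks from IN
    fup_first_set:  first residual block set of G_FUP, applied to IN alone
    fup_second_set: remaining residual block set of G_FUP, which also
                    receives the predicted landmarks
    """
    intermediate = g_in(lr_image)
    landmarks = g_fle(intermediate)
    features = fup_first_set(intermediate)
    hr_image = fup_second_set(features, landmarks)
    # IN and the landmarks are returned as well, because the pixel and
    # landmark loss terms are defined on them.
    return intermediate, landmarks, hr_image


if __name__ == "__main__":
    # String stubs make the order of operations visible.
    outputs = run_generator(
        "LR",
        g_in=lambda lr: f"IN({lr})",
        g_fle=lambda im: f"LM({im})",
        fup_first_set=lambda im: f"F1({im})",
        fup_second_set=lambda feat, lm: f"HR({feat},{lm})",
    )
    print(outputs)
```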

Returning to FIG. 5, steps S21 to S25 are described.

In step S21, the generator (G) may predict face landmarks from the input face image (LR) and then use them to generate the output face image (HR).

That is, in one example, step S21 may include executing the face landmark estimator (G_FLE) of the generator (G) to predict a plurality of face landmarks based on the input face image (LR), and executing the face upsampler (G_FUP) of the generator (G) to upsample the input face image (LR) using the predicted face landmarks.

In step S21, the generator (G) may generate an intermediate image (IN) in which the image quality of the low-quality face image (LR) is first improved, predict face landmarks from the intermediate image (IN), and then use them to output the final high-quality face (HR).

That is, in one example, step S21 may include generating, using the intermediate image generator (G_IN) that includes a plurality of residual blocks, an intermediate image (IN) in which the image quality of the input face image (LR) is improved; executing the face landmark estimator (G_FLE) of the generator (G) to predict a plurality of face landmarks based on the intermediate image (IN); and executing the face upsampler (G_FUP) of the generator (G) to upsample the intermediate image (IN) using the face landmarks predicted by the face landmark estimator (G_FLE).

In step S22, the processor 110 may execute the discriminator (D) of the ICN, which competes with the generator (G) within the generative adversarial network, to discriminate the reconstructed face image (Reconstructed ŷ) based on the ground truth face image (Ground Truth y).

In step S23, the processor 110 may execute the face feature extractor (φ) of the ICN to extract the feature map of the reconstructed face image (Reconstructed ŷ) and the feature map of the ground truth face image (Ground Truth y).

Step S2 of FIG. 3 may further include a step (S24) of computing a training objective function (training loss function) and a step (S25) of alternately training the generator (G) and the discriminator (D) so as to minimize the value of the training objective function.
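The alternating schedule of steps S24 and S25 can be outlined with a small skeleton. It is only a sketch of the control flow: the two update functions are hypothetical callables supplied by the caller, and updating the generator before the discriminator within each iteration is an assumption, since the description only states that the two are trained alternately.

```python
def train_alternating(batches, generator_step, discriminator_step):
    """Alternate one generator update and one discriminator update per batch.

    batches:            iterable of (low_quality, ground_truth) pairs
    generator_step:     callable that updates G to lower L_total and returns the loss
    discriminator_step: callable that updates D to lower L_Discriminator and
                        returns the loss
    Returns the list of (generator_loss, discriminator_loss) per batch.
    """
    history = []
    for low_quality, ground_truth in batches:
        # Generator update; the discriminator is held fixed inside this call.
        g_loss = generator_step(low_quality, ground_truth)
        # Discriminator update; the generator is held fixed inside this call.
        d_loss = discriminator_step(low_quality, ground_truth)
        history.append((g_loss, d_loss))
    return history
```

Keeping the two updates behind separate callables makes it straightforward to reuse the same loop for the second training of step S4 with a different dataset.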

In step S24, the processor 110 may compute the training objective function of the ICN.

In one example, the training objective function of the ICN may include a first objective function (L_total) that includes the GAN loss function (L_GAN) for the generator (G), and a second objective function based on the GAN loss function (L_{Discriminator}) for the discriminator (D).

In one example, the first objective function (L_total) may further include a pixel reconstruction accuracy function (L_pixel) between the reconstructed face image (HR) and the ground truth face image (Ground Truth y), a prediction accuracy function (L_landmark) for the face landmarks predicted in the generation step (S21) of the reconstructed face image (HR), and a face feature similarity function (L_face) between the reconstructed face image (HR) and the ground truth face image (Ground Truth y).

The first objective function (L_total) may be defined based on the GAN loss function (L_GAN) for the generator (G), the pixel reconstruction accuracy function (L_pixel) between the reconstructed face image (HR) and the ground truth face image (Ground Truth y), the prediction accuracy function (L_landmark) for the face landmarks predicted in the generation step (S21) of the reconstructed face image (HR), and the face feature similarity function (L_face) between the reconstructed face image (HR) and the ground truth face image (Ground Truth y).

The first objective function (L_total) and the second objective function (L_{Discriminator}) are described in more detail below.

The face image reconstruction method according to the embodiment introduces several training objective functions so that the ICN reconstructs a high-quality face image (HR) while preserving the identity information of the subject shown in the input face image.

(1) Pixel reconstruction accuracy (L_pixel): the L2 distance function between the pixel values of the face (HR) reconstructed by the generator (G) and those of the original face, that is, the ground truth face image (Equation 1).

In Equation 1, H and W denote the height and width of the ground truth face image, y_{i,j} denotes the (i,j)-th pixel value of the ground truth image (Ground Truth y), and ŷ^{IN}_{i,j} and ŷ_{i,j} denote the (i,j)-th pixel values of the intermediate image (IN) and the final reconstructed face image (HR), respectively.
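One form of Equation 1 that is consistent with these definitions is sketched below. The symbols \hat{y}^{IN} (intermediate image) and \hat{y} (final reconstructed image) are notation chosen here, and the normalization by HW and the equal weighting of the two squared terms are assumptions, since the equation itself is not reproduced in this text.

```latex
L_{pixel} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W}
  \left[ \left( y_{i,j} - \hat{y}^{IN}_{i,j} \right)^{2}
       + \left( y_{i,j} - \hat{y}_{i,j} \right)^{2} \right]
```

Penalizing the intermediate image as well as the final output supervises the intermediate image generator (G_IN) directly, rather than only through the later stages.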

(2) Face landmark prediction accuracy (L_landmark): the L2 distance function between the landmark coordinates predicted while the generator (G) generates the reconstructed face image and the ground truth (Equation 2).

In Equation 2, N denotes the total number of face landmarks, and z^{n}_{i,j} and ẑ^{n}_{i,j} denote the ground truth and the predicted probability, respectively, of the n-th landmark at the (i,j)-th pixel. For example, a total of 68 landmarks corresponding to the eyes, nose, mouth, and face contour may be used.
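A form of Equation 2 consistent with these definitions is sketched below, with z^{n}_{i,j} for the ground truth and \hat{z}^{n}_{i,j} for the predicted probability of the n-th landmark at pixel (i,j). The symbol names and the normalization by N are assumptions; the equation itself is not reproduced in this text.

```latex
L_{landmark} = \frac{1}{N} \sum_{n=1}^{N} \sum_{i,j}
  \left( z^{n}_{i,j} - \hat{z}^{n}_{i,j} \right)^{2}
```

With the example of 68 landmarks, N = 68 and each landmark contributes one per-pixel probability map to the sum.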

(3) Face recognition feature similarity (L_face): the L2 distance function between the face recognition network output features of the reconstructed face (HR) and those of the original face (Ground Truth) (Equation 3).

In Equation 3, φ(y) and φ(ŷ) denote the output features of the face feature extractor (φ) for the ground truth face (Ground Truth y) and for the high-quality face reconstructed by the ICN (Reconstructed ŷ), respectively. d is the number of output features, for example 512 in total.
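A form of Equation 3 consistent with these definitions is sketched below. The symbol \phi for the face feature extractor, the index k over feature dimensions, and the normalization by d are assumptions; the equation itself is not reproduced in this text.

```latex
L_{face} = \frac{1}{d} \sum_{k=1}^{d}
  \left( \phi(y)_{k} - \phi(\hat{y})_{k} \right)^{2}
```

Because this term compares features from a face recognition network rather than raw pixels, it pushes the reconstruction toward preserving identity rather than only visual sharpness.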

(4) Generative adversarial network (GAN) training objective function: a competing function between the generator (G) and the discriminator (D) that makes the reconstructed face (Reconstructed ŷ) look realistic (Equation 4).

In Equation 4, G denotes the generator and D denotes the discriminator.
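Equation 4 is not reproduced in this text. As an illustration only, the widely used non-saturating GAN formulation is shown below, with x denoting the low-quality input (LR) and y the ground truth face; whether the embodiment uses exactly this form is an assumption.

```latex
L_{GAN} = -\log D\big(G(x)\big), \qquad
L_{Discriminator} = -\log D(y) - \log\Big(1 - D\big(G(x)\big)\Big)
```

Under this formulation the generator lowers L_{GAN} by producing faces that the discriminator scores as real, while the discriminator lowers L_{Discriminator} by scoring ground truth faces high and reconstructed faces low.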

Overall, the objective functions defined in Equations 1 to 4 are combined as in Equation 5 and used to train the ICN.

Training alternately minimizes L_total (the objective function that makes the generator (G) reconstruct realistic faces) and L_{Discriminator} (the objective function that enables the discriminator (D) to distinguish well between the face reconstructed by the generator (G), ŷ, and the ground truth face, y), so that the ICN can achieve high reconstruction accuracy.
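Equation 5 is not reproduced in this text. A weighted sum is one natural way to combine the four terms, as sketched below; the weights \lambda_{1}, \lambda_{2}, \lambda_{3} are hypothetical placeholders, and neither their values nor the exact form of the combination is given in the description.

```latex
L_{total} = L_{pixel} + \lambda_{1} L_{landmark}
          + \lambda_{2} L_{face} + \lambda_{3} L_{GAN}
```

In such a combination the weights balance pixel fidelity, landmark accuracy, identity preservation, and realism against one another.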

For example, training with these training objective functions on a dataset of 70,000 images in total takes about one day on an NVIDIA RTX 2080 Ti GPU.

FIG. 7 is a diagram showing exemplary results of the face image reconstruction process according to an embodiment.

(a) is a given ground truth face image (Ground Truth y); (b) is an input face image (LR) generated by downsampling (a); (c) is a baseline image obtained by executing the generator (G) of the ICN that has completed the training of step S2 of FIG. 3; and (d) is a fine-tuned image obtained by executing the generator (G) of the ICN that has completed the second training of step S4 of FIG. 3.

The second training process of steps S3 and S4 of FIG. 3 provides a fine-tuning technique that uses reference images (probes) of a given search target.

An analysis of why face recognition accuracy drops significantly for low-quality faces showed that the dominant factor is the accuracy of judging two faces of the same person to be the same (true positive), which is lower than the accuracy of judging two faces of different people to be different (true negative).
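The two accuracies compared above can be computed from face-pair verification results as in the following minimal sketch. The function name verification_rates and the input layout (a list of (same_person, predicted_same) pairs) are assumptions made for illustration; the source does not describe how the analysis was implemented.

```python
def verification_rates(results):
    """Return (true positive rate, true negative rate) for face-pair verification.

    results: iterable of (same_person, predicted_same) boolean pairs.
    """
    tp = fn = tn = fp = 0
    for same_person, predicted_same in results:
        if same_person and predicted_same:
            tp += 1          # same person, judged the same
        elif same_person:
            fn += 1          # same person, judged different
        elif predicted_same:
            fp += 1          # different people, judged the same
        else:
            tn += 1          # different people, judged different
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr
```

A low first value together with a high second value corresponds to the situation described above, in which same-person pairs are the main source of errors.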

To address this, in steps S3 and S4 of FIG. 3, high-quality reference face images (probes) of the search target are collected and downsampled to build second training data consisting of <high-quality ground truth, low-quality input> pairs. This data is then used for the second training, which fine-tunes the identity reconstruction model based on the training objective function described above for step S24 of FIG. 5.
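The construction of <high-quality ground truth, low-quality input> pairs can be sketched as follows. The source does not specify the downsampling method or factor, so this sketch assumes average pooling by an integer factor on a grayscale image stored as a list of rows; the names downsample and build_pairs are hypothetical.

```python
def downsample(image, factor):
    """Average-pool a 2-D grayscale image (list of rows) by an integer factor."""
    height, width = len(image), len(image[0])
    if factor < 1 or height % factor or width % factor:
        raise ValueError("image size must be a multiple of the factor")
    pooled = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def build_pairs(probes, factor=4):
    """Return (high-quality ground truth, low-quality input) pairs for fine-tuning."""
    return [(hr, downsample(hr, factor)) for hr in probes]
```

The default factor of 4 is only a placeholder for this sketch.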

제 2 학습은, 탐색 대상에 특화되어 학습되며, 학습할 데이터셋 수가 상대적으로 작으므로 짧은 시간 안에 학습이 진행된다(예: NVIDIA RTX 2080Ti GPU에서 1시간 이내).The second training is specialized for the search target, and since the number of datasets to be trained is relatively small, training proceeds in a short time (eg, within 1 hour on NVIDIA RTX 2080 Ti GPU).

탐색 대상의 기준 이미지(probe)를 활용한 제 2 학습 기법을 통해 정답 탐지율(true positive)이 약 78% 향상되었다.Through the second learning technique using the reference image (probe) of the search target, the correct answer detection rate (true positive) was improved by about 78%.

The proposed identity reconstruction model introduces a model structure and a training objective function for reconstruction that improves face recognition accuracy, and proposes a second training that applies a fine-tuning technique using reference images (probes) of the search target.

이하에서는 비디오 신원 복원 모델(Video Identity Clarification Model; VICN)을 이용한 얼굴 이미지 재구성에 대하여 살펴본다.Hereinafter, face image reconstruction using a Video Identity Clarification Model (VICN) will be described.

The face image reconstruction method using the video identity reconstruction model (VICN) according to an embodiment extends the method described above with reference to FIG. 3, which reconstructs a face image contained in an input image using the identity reconstruction model (ICN), so that it can process multiple face images contained in an input video.

That is, the face image reconstruction method using the video identity reconstruction model (VICN) according to an embodiment additionally includes a configuration for reconstructing low-quality face images contained in a series of image frames in high quality. The following description, with reference to FIGS. 8 to 11, focuses on these added and extended configurations.

실시예에 따른 비디오 신원 복원 모델(Video Identity Clarification Model; VICN)을 이용한 얼굴 이미지 재구성 방법은 도 2를 참조하여 전술한 프로세서(110)를 포함한 얼굴 이미지 재구성 장치(100)에 의해 실행될 수 있다.The facial image reconstruction method using the Video Identity Clarification Model (VICN) according to the embodiment may be executed by the facial image reconstruction apparatus 100 including the processor 110 described above with reference to FIG. 2 .

도 8은 실시예에 따른 비디오 신원 복원 모델을 이용한 얼굴 이미지 재구성 방법의 흐름도이다.8 is a flowchart of a face image reconstruction method using a video identity reconstruction model according to an embodiment.

실시예에 따른 비디오 신원 복원 모델을 이용한 얼굴 이미지 재구성 방법은, 프로세서(110)에 의해, 입력 비디오의 일련의 프레임으로부터 트래킹(tracking)된 적어도 하나의 얼굴 이미지 및 이와 같은 적어도 하나의 얼굴 이미지에 대한 정답 얼굴 이미지를 포함하는 학습 데이터를 획득하는 단계(SS1) 및 획득된 학습 데이터에 기반하여 비디오 신원 복원 모델(VICN)을 학습하는 단계(SS2)를 포함할 수 있다.The method for reconstructing a face image using a video identity reconstruction model according to an embodiment includes, by the processor 110, at least one face image tracked from a series of frames of an input video and the at least one face image It may include acquiring training data including a correct answer face image (SS1) and learning a video identity reconstruction model (VICN) based on the acquired training data (SS2).

In step SS1, the processor 110 acquires training data including at least one face image tracked from a series of frames of the input video and a ground-truth face image for the at least one face image. This is described later with reference to FIGS. 9 and 10.

단계(SS2)는 단계(SS1)에서 획득된 학습 데이터 셋을 기반으로 기본 VICN 모델을 학습한다. 이 과정을 거쳐 학습된 VICN은 임의의 저화질 얼굴 입력 시퀀스를 신원 정보를 보존하면서 고화질 얼굴로 재구성할 수 있는 능력을 가지게 된다.Step SS2 learns the basic VICN model based on the training data set obtained in step SS1. VICN learned through this process has the ability to reconstruct an arbitrary low-quality face input sequence into a high-quality face while preserving identity information.

Step SS2 includes a generating step (corresponding to step S21 of FIG. 5) of executing the generator (G) of the video identity reconstruction model (VICN) to generate a reconstructed face image in which the identity of the face appearing in the at least one face image is restored, and a discriminating step (corresponding to step S22 of FIG. 5) of executing the discriminator (D) of the video identity reconstruction model (VICN), which competes with the generator (G) in a generative adversarial network (GAN), to discriminate the reconstructed face image based on the ground-truth face image.

실시예에 따른 얼굴 이미지 재구성 방법은, 탐색 대상의 적어도 하나의 얼굴 이미지 및 해당 탐색 대상의 적어도 하나의 얼굴 이미지에 대한 기준 얼굴 이미지를 포함하는 제 2 학습 데이터를 획득하는 단계(SS3) 및 제 2 학습 데이터에 기반하여 단계(SS2)에서 학습된 비디오 신원 복원 모델(VICN)을 미세튜닝(fine-tuning)하는 제 2 학습 단계(SS4)를 더 포함할 수 있다.The method for reconstructing a face image according to an embodiment includes the steps of: acquiring second learning data including at least one face image of a search target and a reference face image for at least one face image of the corresponding search target (SS3); It may further include a second learning step (SS4) of fine-tuning (fine-tuning) the video identity reconstruction model (VICN) learned in step (SS2) based on the training data.

단계(SS3) 및 단계(SS4)는 도 3을 참조하여 전술한 단계(S3) 및 단계(S4)를 탐색 대상의 적어도 하나의 이미지를 처리하도록 변형한 것에 대응한다.Steps SS3 and SS4 correspond to modifications of steps S3 and S4 described above with reference to FIG. 3 to process at least one image of a search target.

단계(SS3)은 탐색 대상이 찍힌 비디오로부터 탐색 대상의 기준 이미지들(probes)을 모으고, 단계(SS1)과 마찬가지 과정을 통해 <고화질 정답 얼굴 시퀀스, 저화질 입력 얼굴>로 구성된 학습 데이터 셋을 구성한다.Step SS3 collects reference images (probes) of the search target from the video in which the search target is captured, and constructs a training data set consisting of <high-quality correct answer face sequence, low-quality input face> through the same process as in step SS1. .

단계(SS4)는 단계(SS3)에서 획득한 학습 데이터 셋을 활용하여 비디오 신원 복원 모델(VICN)의 파인-튜닝(fine-tuning) 학습 과정을 실행한다. 이와 같이 학습된 비디오 신원 복원 모델(VICN)은 탐색 대상에 특화되어, 탐색 대상의 저화질 얼굴을 더욱 잘 재구성할 수 있는 능력을 가지게 된다.Step SS4 executes a fine-tuning learning process of the video identity reconstruction model VICN by using the training data set obtained in step SS3. The trained video identity reconstruction model (VICN) is specialized for the search target, and has the ability to better reconstruct the low-quality face of the search target.

실시예에 따른 얼굴 이미지 재구성 방법은 학습된 비디오 신원 복원 모델(VICN)을 이용하여 입력 비디오에서 탐색 대상을 인식하는 단계(SS5)를 더 포함할 수 있다. 단계(SS5)는 도 3을 참조하여 전술한 단계(S5)를 비디오 신원 복원 모델(VICN)을 이용하여 수행한다.The facial image reconstruction method according to the embodiment may further include recognizing a search target in the input video using the learned video identity reconstruction model (VICN) (SS5). Step SS5 performs the above-described step S5 with reference to FIG. 3 using a video identity reconstruction model (VICN).

단계(SS5)에 의해, 실시간 비디오 입력으로부터 얼굴 탐지 및 도 9를 참조하여 후술한 단계(SS1)의 얼굴 특징점 트래킹 기법을 활용하여 실시간 비디오 입력으로부터 탐지된 저화질 얼굴 시퀀스를 고화질 얼굴로 재구성한다.By step SS5, the low-quality face sequence detected from the real-time video input is reconstructed into a high-quality face by using face detection from the real-time video input and the facial feature point tracking technique of step SS1 described later with reference to FIG. 9 .

도 9는 실시예에 따른 비디오 신원 복원 모델을 이용한 얼굴 이미지 재구성의 얼굴 트래킹 과정을 설명하기 위한 도면이다.9 is a view for explaining a face tracking process of reconstructing a face image using a video identity reconstruction model according to an embodiment.

Referring to FIG. 8, in step SS1 the processor 110 performs landmark-based face tracking on the input video and acquires a training dataset.

예를 들어, 단계(SS1)에서 프로세서(110)는 도심 공간을 캡쳐한 비디오 프레임들로부터 다양한 신원(identity)를 가진 사람들의 얼굴을 탐지한다. 여기서, 카메라 움직임, 장면 내 얼굴 움직임 등이 포함된 연속된 프레임들로부터 탐지된 얼굴 이미지들을 같은 신원끼리 매핑하기 위해, 프로세서(110)는 얼굴 특징점 기반 트래킹 기법을 사용할 수 있다.For example, in step SS1, the processor 110 detects faces of people having various identities from video frames captured in an urban space. Here, in order to map face images detected from successive frames including a camera motion, a face motion within a scene, etc. between the same identities, the processor 110 may use a facial feature point-based tracking technique.

프로세서(110)는 얼굴 특징점 기반 트래킹 기법을 통해 얻어진 얼굴 이미지를 다운샘플링하여 <고화질 정답 얼굴 시퀀스, 저화질 입력 얼굴>로 구성된 학습 데이터셋을 구성할 수 있다. 예를 들어, 학습 데이터 셋으로는 고화질 비디오 데이터셋 WILDTRACK dataset(Tatjana Chavdarova et al., “WILDTRACK: A Multi-Camera HD Dataset for Dense Unscripted Pedestrian Detection,” CVPR2018.) 등을 활용할 수 있다.The processor 110 may downsample the face image obtained through the facial feature point-based tracking technique to configure a training dataset including <high-quality correct face sequence, low-quality input face>. For example, as a training dataset, the high-definition video dataset WILDTRACK dataset (Tatjana Chavdarova et al., “WILDTRACK: A Multi-Camera HD Dataset for Dense Unscripted Pedestrian Detection,” CVPR2018.) can be used.

도 11을 참조하여 후술할 비디오 신원 복원 모델(VICN)은 동일 신원을 가진 사람의 얼굴 프레임 시퀀스를 입력으로 받는다. 이를 위해, 입력 비디오의 프레임별로 얼굴을 탐지하지만, 연속된 프레임 간에는 카메라 움직임, 장면 내 대상의 움직임 등으로 인해 동일 신원을 가진 사람의 얼굴이 시간에 따라 다른 위치에 나타날 수 있다.A video identity reconstruction model (VICN), which will be described later with reference to FIG. 11 , receives a face frame sequence of a person having the same identity as an input. To this end, a face is detected for each frame of the input video, but the face of a person with the same identity may appear at different positions over time due to camera movement or movement of an object in a scene between successive frames.

따라서, 실시예에 따른 얼굴 인식 방법은 단계(SS1)에서 입력 비디오의 연속된 프레임 간 얼굴의 움직임을 고려하여 프레임별로 탐지된 얼굴들을 같은 신원끼리 매핑하기 위해, 특징점 기반 얼굴 트래킹 기법을 사용한다. 도 9는 이와 같은 얼굴 트래킹 과정의 전체 동작 구조도를 나타낸다.Therefore, the face recognition method according to the embodiment uses a feature point-based face tracking technique to map faces detected for each frame with the same identity in consideration of the movement of the face between successive frames of the input video in step SS1. 9 shows the overall operation structure of the face tracking process as described above.

얼굴 트래킹 과정의 동작 순서는 다음과 같다.The operation sequence of the face tracking process is as follows.

(i) Step SS11 - Face Detection: First, faces are detected in each of two consecutive input frames (frame t and frame t+1).

(ii) Step SS12 - Landmark Estimation: Landmarks (for example, the positions of the eyes, nose, and mouth) are extracted as feature points from each detected face. For this purpose, for example, the RetinaFace detector (J. Deng et al., "RetinaFace: Single-stage Dense Face Localisation in the Wild," CVPR 2020), which can perform face detection and feature point extraction simultaneously, may be used.

(iii) Step SS13 - Optical Flow Tracking: Next, optical flow is calculated to find corresponding landmarks between frame t and frame t+1. For example, the Lucas-Kanade optical flow tracker (B. D. Lucas, T. Kanade et al., "An iterative image registration technique with an application to stereo vision," Vancouver, British Columbia, 1981) may be used.

(iv) Step SS14 - Motion Compensation: The optical flows of the landmark feature points (for example, eyes, nose, and mouth) calculated in step SS13 are averaged. This average is assumed to be the movement of the object between frames, and the face bounding box coordinates of frame t are transformed accordingly.

(v) Step SS15 - IoU-Based Bounding Box Matching: The IoU (Intersection over Union) between the bounding boxes of frame t that have undergone step SS14 and the bounding boxes of frame t+1 is calculated to measure how much the regions overlap. When two bounding boxes whose IoU is equal to or greater than a certain value are found, the two faces are determined to have the same identity.
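Steps SS14 and SS15 above can be illustrated with a minimal sketch. It assumes that each bounding box is an (x1, y1, x2, y2) tuple, that the per-landmark optical flow vectors from step SS13 are already available as (dx, dy) tuples, and that matching is done greedily in box order; the 0.5 threshold and the names mean_flow, shift_box, iou, and match_faces are hypothetical, since the source only refers to "a certain value".

```python
def mean_flow(flows):
    """Average the (dx, dy) optical flow vectors of one face's landmarks."""
    if not flows:
        raise ValueError("at least one landmark flow vector is required")
    n = len(flows)
    return (sum(dx for dx, _ in flows) / n, sum(dy for _, dy in flows) / n)


def shift_box(box, flow):
    """Motion compensation: translate a frame-t box by the mean landmark flow."""
    x1, y1, x2, y2 = box
    dx, dy = flow
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_faces(boxes_t, flows_t, boxes_t1, threshold=0.5):
    """Return (i, j) index pairs judged to be the same identity.

    Greedy one-to-one matching is an assumption of this sketch.
    """
    matches, used = [], set()
    for i, (box, flows) in enumerate(zip(boxes_t, flows_t)):
        moved = shift_box(box, mean_flow(flows))
        best_j, best_score = None, threshold
        for j, candidate in enumerate(boxes_t1):
            if j in used:
                continue
            score = iou(moved, candidate)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            matches.append((i, best_j))
    return matches
```

Averaging the landmark flows gives a single translation per face, which matches the assumption in step SS14 that the average represents the movement of the object between frames.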

도 10은 실시예에 따른 비디오 신원 복원 모델을 이용한 얼굴 이미지 재구성의 얼굴 트래킹 과정을 예시적으로 설명하기 위한 도면이다.FIG. 10 is a diagram for exemplarily explaining a face tracking process of face image reconstruction using a video identity reconstruction model according to an embodiment.

In step SS11 described above with reference to FIG. 9, faces are detected in two consecutive frames (frame t and frame t+1), yielding bounding boxes such as bbox_{t,i} and bbox_{t+1,j}. In step SS12, the landmarks of each bounding box are extracted.

In step SS13, optical flow is calculated to find corresponding landmarks between frame t and frame t+1. In step SS14, the optical flows of the landmark feature points are averaged, the average is assumed to be the movement of the object between frames, and the coordinates of the bounding boxes of frame t (for example, bbox_{t,i}) are transformed accordingly.

In step SS15, the IoU between the bounding boxes of frame t and the bounding boxes of frame t+1 is calculated to measure how much the regions overlap. For example, when the IoU of bbox_{t,i} and bbox_{t+1,j} is equal to or greater than a certain value, the face images represented by the two bounding boxes (bbox_{t,i} and bbox_{t+1,j}) are determined to have the same identity.

도 11은 실시예에 따른 비디오 신원 복원 모델 및 학습 구조를 설명하기 위한 도면이다.11 is a diagram for explaining a video identity reconstruction model and a learning structure according to an embodiment.

도 11은 비디오 신원 복원 모델(VICN)의 예시적인 학습 구조를 보여준다.11 shows an exemplary learning structure of a video identity reconstruction model (VICN).

The generator (G) receives a low-quality face sequence (FRM_SEQ) as input and reconstructs it into a high-quality face (F_R). Specifically, face reconstruction by the generator (G) includes a first stage performed by a multi-frame face resolution enhancer and a second stage performed by a landmark-guided face upsampler.

(i) Stage 1 - Multi-Frame Face Resolution Enhancement: The multi-frame face resolution enhancer (G_MFRE) fuses the low-quality face image sequence (FRM_SEQ) obtained from the input video (for example, frame 1, frame 2, frame 3, and so on) with respect to a reference frame (FRM_REF) to obtain an intermediate-reconstructed face image (y_int) (F_IR) whose image quality is initially improved.

The multi-frame face resolution enhancer (G_MFRE) includes motion estimation (G_ME), warping (G_W), and a multi-frame fuser (G_MFF). Its detailed structure is described later with reference to FIG. 12.

(ii) Stage 2 - Landmark-Guided Face Upsampling: This stage is performed by a face landmark estimator (G_FLE) and a face upsampler (G_FUP). It is carried out as described above with reference to FIG. 4, with the intermediate-reconstructed face image (F_IR) taking the place of the low-quality image (LR) of FIG. 4.

이로써 재구성된 얼굴 이미지(F_R)이 출력되고, 정답 이미지(F_GT)를 이용하여 학습이 이루어진다.Accordingly, the reconstructed face image F_R is output, and learning is performed using the correct answer image F_GT.

도 12는 실시예에 따른 신원 복원 모델의 생성기의 다중 프레임 얼굴 화질 개선기의 네트워크 구조를 설명하기 위한 도면이다.12 is a diagram for explaining a network structure of a multi-frame face quality improver of a generator of an identity reconstruction model according to an embodiment.

First, to correct differences in pose and angle among the per-frame face images of the frame sequence (FRM_SEQ), inter-frame motion prediction (G_ME) and warping (G_W) are performed with respect to a reference frame (FRM_REF).

예를 들어, 프레임 간 움직임이 너무 크지 않도록 프레임 시퀀스(FRM_SEQ)의 중앙 프레임을 기준 프레임(FRM_REF)으로 정할 수 있으나, 이에 제한되는 것은 아니며, 다양한 방식으로 기준 프레임(FRM_REF)을 정할 수 있다.For example, the central frame of the frame sequence (FRM_SEQ) may be determined as the reference frame (FRM_REF) so that inter-frame motion is not too large, but the present invention is not limited thereto, and the reference frame (FRM_REF) may be determined in various ways.

예를 들어, 세 개의 프레임(frame 1, frame 2, frame 3)가 있다고 가정하면, 중앙 프레임인 frame 2를 기준 프레임(FRM_REF)로 정하고, frame 1과 frame 2에 대하여 프레임 간 움직임 예측(G_ME) 및 워핑(G_W) 과정을 수행하고, frame 2와 frame 3에 대하여 프레임 간 움직임 예측(G_ME) 및 워핑(G_W) 과정을 수행할 수 있다.For example, assuming there are three frames (frame 1, frame 2, frame 3), frame 2, which is the central frame, is set as the reference frame (FRM_REF), and inter-frame motion prediction (G_ME) for frame 1 and frame 2 and a warping (G_W) process, and inter-frame motion prediction (G_ME) and warping (G_W) processes for frame 2 and frame 3 may be performed.

For example, given five frames (frame 1, frame 2, frame 3, frame 4, and frame 5), the central frame, frame 3, is set as the reference frame (FRM_REF), and inter-frame motion prediction (G_ME) and warping (G_W) may be performed for frame 1 and frame 3, frame 2 and frame 3, frame 4 and frame 3, and frame 5 and frame 3.

For example, given four frames (frame 1, frame 2, frame 3, and frame 4), either frame 2 or frame 3 may be arbitrarily chosen as the reference frame (FRM_REF).

As another example, given four frames (frame 1, frame 2, frame 3, and frame 4), inter-frame motion prediction (G_ME) and warping (G_W) may be performed for frame 1 and frame 2, and for frame 3 and frame 4.
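The central-reference-frame examples above can be sketched as follows. This sketch uses 0-based indices (so frame 1 in the text is index 0) and, for an even number of frames, picks the lower of the two middle frames, which is only one of the arbitrary choices the text allows. The names reference_index and alignment_pairs are hypothetical, and the alternative pairing of adjacent frames is not covered.

```python
def reference_index(num_frames):
    """0-based index of the central frame.

    For an even count, the lower of the two middle frames is chosen
    (an assumption; the text allows either one).
    """
    if num_frames < 1:
        raise ValueError("at least one frame is required")
    return (num_frames - 1) // 2


def alignment_pairs(num_frames):
    """(frame, reference) index pairs on which motion prediction and warping run."""
    ref = reference_index(num_frames)
    return [(i, ref) for i in range(num_frames) if i != ref]
```

For five frames this gives the pairs (0, 2), (1, 2), (3, 2), and (4, 2), which correspond to frames 1, 2, 4, and 5 each being aligned to frame 3.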

Here, the motion prediction network (G_ME) may adopt the VESPCN structure (J. Caballero et al., "Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation," CVPR 2017), which is widely used in related tasks.

In this case, the residual block (K. He et al., "Deep residual learning for image recognition," CVPR 2016) may be used as the basic block structure so that the network can be trained to extract effective feature points for predicting motion between face frames.

The input frames warped with respect to the reference frame (FRM_REF) are concatenated along the channel axis of the image (concat operation) and fed to the multi-frame fuser (G_MFF). For example, the warped input frames may be concatenated with one another, or the warped input frames and the reference frame (FRM_REF) may be concatenated together.
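Channel-axis concatenation can be illustrated with a small sketch that uses nested lists in height x width x channel layout instead of a tensor library. The function name concat_channels and the data layout are assumptions made for illustration only.

```python
def concat_channels(frames):
    """Concatenate H x W x C frames (nested lists) along the channel axis."""
    if not frames:
        raise ValueError("at least one frame is required")
    height, width = len(frames[0]), len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must share the same height and width")
    # For every pixel, append the channel values of each frame in order.
    return [[[value for frame in frames for value in frame[r][c]]
             for c in range(width)]
            for r in range(height)]
```

With three RGB frames of the same size, each output pixel holds nine channel values, so the fuser sees all aligned frames at once.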

일 예에서 프로세서(110)는, N 개의 프레임 단위로 입력 프레임 시퀀스(FRM_SEQ) 상에 슬라이딩 윈도우를 이동하면서 전술한 움직임 예측(G_ME)-워핑(G_W)-다중 프레임 융합(G_MFF)을 수행하여, N 개의 프레임을 한번에 합쳐서 중간-재구성된 얼굴 이미지(F_IR)을 출력할 수 있다.In one example, the processor 110 performs the aforementioned motion prediction (G_ME)-warping (G_W)-multi-frame fusion (G_MFF) while moving the sliding window on the input frame sequence (FRM_SEQ) in units of N frames, By merging N frames at once, an intermediate-reconstructed face image (F_IR) can be output.
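The sliding-window processing described above can be sketched as follows. The window contents would be passed to the motion prediction, warping, and fusion stages; the stride parameter and the name sliding_windows are assumptions, since the source only states that the window moves over the sequence in units of N frames.

```python
def sliding_windows(frames, n, stride=1):
    """Yield consecutive windows of n frames from the input sequence."""
    if n < 1 or stride < 1:
        raise ValueError("n and stride must be positive")
    for start in range(0, len(frames) - n + 1, stride):
        yield frames[start:start + n]
```

Each yielded window would produce one intermediate-reconstructed face image (F_IR); a sequence shorter than N frames yields no window in this sketch.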

일 예에서 프로세서(110)는 입력 프레임 시퀀스(FRM_SEQ)의 연속된 두 프레임씩 전술한 움직임 예측(G_ME)-워핑(G_W)-다중 프레임 융합(G_MFF)을 수행하여 점진적으로 화질을 개선하는 방식도 가능하다.In one example, the processor 110 performs the aforementioned motion prediction (G_ME)-warping (G_W)-multi-frame fusion (G_MFF) by two consecutive frames of the input frame sequence (FRM_SEQ) to gradually improve the picture quality. It is possible.

예시적인 다중 프레임 융합 네트워크(G_MFF)는 다양한 이미지 처리에서 높은 정확도를 달성하는 Residual block을 기본 블록 구조로 활용하여, 화질 개선을 위해 중요한 특징(feature)을 뽑도록 학습될 수 있다.An exemplary multi-frame fusion network (G_MFF) can be learned to extract important features for image quality improvement by utilizing residual blocks that achieve high accuracy in various image processing as a basic block structure.

실시예에 의한 얼굴 이미지 재구성 방법 및 장치는 장거리 저화질 얼굴 인식 기술을 제공하며, 대화 상황에서 근거리의 1-2명의 얼굴을 인식하는데 한정된 기존의 모바일 얼굴 인식 응용의 정확도를 혁신적으로 개선할 수 있다.The face image reconstruction method and apparatus according to the embodiment provides a long-range low-quality face recognition technology, and can innovatively improve the accuracy of the existing mobile face recognition application, which is limited to recognizing faces of 1-2 people in a short distance in a conversation situation.

실시예에 의한 얼굴 이미지 재구성 방법 및 장치는 비디오 속 연속적인 프레임에 걸쳐 캡쳐된 일련의 저화질 얼굴 이미지로부터 고화질의 얼굴 이미지를 재구성할 수 있다.실시예에 의한 얼굴 이미지 재구성 방법은 안드로이드 기반 스마트폰에서 동작할 수 있는 소프트웨어로 개발되어, 상용 스마트폰에 탑재 후 실행시킬 수 있다. 신원 복원 모델을 포함한 DNN 기반 얼굴 인식 기술은 Google TensorFlow로 구현되어, 안드로이드용 Google TensorFlow-Lite로 변환하여 실행할 수 있다.The face image reconstruction method and apparatus according to the embodiment may reconstruct a high-quality face image from a series of low-quality face images captured over successive frames in a video. The face image reconstruction method according to the embodiment is performed in an Android-based smartphone It is developed as software that can be operated, and it can be installed on a commercial smartphone and then executed. DNN-based face recognition technology including identity restoration model is implemented with Google TensorFlow, which can be converted and executed with Google TensorFlow-Lite for Android.

실시예에 의한 얼굴 이미지 재구성 기술은 얼굴 인식 기반의 여러 유용한 모바일 AR 응용(예: 실종 아동 찾기, 범인 추적)에 활용가능하다.The facial image reconstruction technology according to the embodiment can be utilized for many useful mobile AR applications (eg, finding missing children, tracking criminals) based on facial recognition.

The method according to an embodiment of the present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium includes all kinds of recording devices that store data readable by a computer system. Examples of computer-readable media include a hard disk drive (HDD), solid state disk (SSD), silicon disk drive (SDD), ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage device.

실시예에 따른 얼굴 이미지 재구성 방법은 이를 실행하기 위한 하나 이상의 명령어를 포함하는 컴퓨터 프로그램을 기록한 컴퓨터 판독가능한 비-일시적인 기록매체에 저장될 수 있다.The facial image reconstruction method according to the embodiment may be stored in a computer-readable non-transitory recording medium in which a computer program including one or more instructions for executing the same is recorded.

이상 설명된 본 발명의 실시 예에 대한 설명은 예시를 위한 것이며, 본 발명이 속하는 기술분야의 통상의 지식을 가진 자는 본 발명의 기술적 사상이나 필수적인 특징을 변경하지 않고서 다른 구체적인 형태로 쉽게 변형이 가능하다는 것을 이해할 수 있을 것이다. 그러므로 이상에서 기술한 실시 예들은 모든 면에서 예시적인 것이며 한정적이 아닌 것으로 이해해야만 한다. 예를 들어, 단일형으로 설명되어 있는 각 구성 요소는 분산되어 실시될 수도 있으며, 마찬가지로 분산된 것으로 설명되어 있는 구성 요소들도 결합된 형태로 실시될 수 있다.The description of the embodiment of the present invention described above is for illustration, and those of ordinary skill in the art to which the present invention pertains can easily transform into other specific forms without changing the technical spirit or essential features of the present invention. you will be able to understand that Therefore, it should be understood that the embodiments described above are illustrative in all respects and not restrictive. For example, each component described as a single type may be implemented in a distributed manner, and likewise components described as distributed may also be implemented in a combined form.

본 발명의 범위는 상기 상세한 설명보다는 후술하는 청구범위에 의하여 나타내어지며, 청구범위의 의미 및 범위 그리고 그 균등 개념으로부터 도출되는 모든 변경 또는 변형된 형태가 본 발명의 범위에 포함되는 것으로 해석되어야 한다.The scope of the present invention is indicated by the following claims rather than the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as being included in the scope of the present invention.

100: face image reconstruction apparatus
200: server
300: network

Claims

1. A face image reconstruction method executed by a face image reconstruction apparatus including a processor, the method comprising:
acquiring training data including at least one face image tracked from a series of frames of an input video and a correct answer face image for the at least one face image; and
training a video identity reconstruction model (Video Identity Clarification Network; VICN) based on the training data,
wherein the training comprises:
a generating step of executing a generator of the video identity reconstruction model to generate a reconstructed face image in which an identity of a face appearing in the at least one face image is restored; and
a determining step of executing a discriminator of the video identity reconstruction model, the discriminator competing with the generator in a generative adversarial network (GAN), to determine the reconstructed face image based on the correct answer face image.

2. The method of claim 1, wherein the acquiring of the training data comprises extracting the at least one face image from the series of frames based on tracking information of facial feature points between consecutive frames of the series of frames.

3. The method of claim 1, wherein the generating step comprises executing a multi-frame face resolution enhancer of the generator to generate an intermediate-reconstructed face image from the at least one face image.

4. The method of claim 3, wherein the generating step comprises:
executing a face landmark estimator of the generator to predict a plurality of face landmarks based on the intermediate-reconstructed face image; and
executing a face upsampler of the generator to upsample the intermediate-reconstructed face image using the plurality of face landmarks.

5. The method of claim 4, wherein the generating step further comprises generating, using an intermediate image generator including a plurality of residual blocks, an intermediate image in which the image quality of the intermediate-reconstructed face image is improved,
wherein the predicting comprises predicting the plurality of face landmarks based on the intermediate image, and
wherein the upsampling comprises upsampling the intermediate image using the predicted plurality of face landmarks.

6. The method of claim 1, wherein the training further comprises an extracting step of executing a face feature extractor of the video identity reconstruction model to extract a feature map of the reconstructed face image and a feature map of the correct answer face image.

7. The method of claim 1, wherein the training further comprises:
calculating a learning objective function; and
training the generator and the discriminator alternately so as to minimize a function value of the learning objective function.

8. The method of claim 7, wherein the learning objective function comprises:
a first objective function including a GAN loss function for the generator; and
a second objective function based on the GAN loss function for the discriminator.

9. The method of claim 8,
wherein the first objective function comprises
a pixel reconstruction accuracy function between the reconstructed face image and the correct answer face image, a prediction accuracy function of a face landmark predicted in the generating step, and a face feature similarity function between the reconstructed face image and the correct answer face image.
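
One possible way to compose the two objective functions of claims 8 and 9 is sketched below. The claims do not fix the form of any term, so everything concrete here is an assumption: a binary cross-entropy style GAN loss, mean squared error for the pixel and landmark terms, cosine distance for the feature term, reference landmarks as the comparison target, and the weights `w_pixel`, `w_landmark`, and `w_feature`.

```python
import math
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 when the feature vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def generator_objective(
    d_fake: float,                    # discriminator score for the reconstruction, in (0, 1]
    recon_pixels: Sequence[float],
    target_pixels: Sequence[float],
    pred_landmarks: Sequence[float],  # flattened landmark coordinates
    ref_landmarks: Sequence[float],   # assumed reference landmarks
    recon_features: Sequence[float],
    target_features: Sequence[float],
    w_pixel: float = 1.0,             # hypothetical weights
    w_landmark: float = 1.0,
    w_feature: float = 1.0,
) -> float:
    """First objective: GAN term plus pixel, landmark, and feature terms."""
    gan_term = -math.log(d_fake)
    return (
        gan_term
        + w_pixel * mse(recon_pixels, target_pixels)
        + w_landmark * mse(pred_landmarks, ref_landmarks)
        + w_feature * cosine_distance(recon_features, target_features)
    )


def discriminator_objective(d_real: float, d_fake: float) -> float:
    """Second objective: GAN loss only; scores must lie strictly in (0, 1)."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))
```

With a perfect reconstruction that fully fools the discriminator, every term of the first objective vanishes, which is a quick sanity check on the composition.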

10. The method of claim 1, further comprising
a second learning step of fine-tuning the video identity reconstruction model based on second training data including at least one face image of a search target and a reference face image for the at least one face image of the search target.

11. The method of claim 10,
wherein the second learning step comprises
executing the generating step and the determining step based on the second training data.
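
The two-stage schedule of claims 10 and 11, in which the second learning step reuses the same training procedure on data of the search target, can be sketched as below. The toy `toy_train_step` is an assumption that stands in for the generating and determining steps: it fits a single scalar gain by gradient descent instead of training a GAN, purely to show that the same step runs in both stages.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Pair = Tuple[float, float]  # (low-quality value, reference value)


def toy_train_step(state: State, low_quality: float, reference: float,
                   lr: float = 0.1) -> State:
    """Stand-in for one generate-and-determine update: fit w so w * input ~ reference."""
    w = state["w"]
    grad = 2.0 * (w * low_quality - reference) * low_quality
    return {"w": w - lr * grad}


def learning_stage(state: State, pairs: List[Pair],
                   train_step: Callable[[State, float, float], State],
                   epochs: int) -> State:
    """Run the same training step over a data set for a number of epochs."""
    for _ in range(epochs):
        for low_quality, reference in pairs:
            state = train_step(state, low_quality, reference)
    return state


# First learning on general training data, then fine-tuning on the
# search target's data, starting from the already trained state.
pretrained = learning_stage({"w": 0.0}, [(1.0, 2.0)], toy_train_step, epochs=50)
fine_tuned = learning_stage(pretrained, [(1.0, 3.0)], toy_train_step, epochs=50)
```

The fine-tuned state starts from the pretrained parameters rather than from scratch, which is the essential property of the second learning step.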

12. A face image reconstruction device comprising:
a memory storing a video identity reconstruction model comprising a generator and a discriminator that competes with the generator to form a generative adversarial network; and
a processor configured to execute training of the video identity reconstruction model based on training data comprising at least one face image tracked from a series of frames of an input video and a correct answer face image for the at least one face image,
wherein, to execute the training, the processor is configured to perform:
a generating operation of executing the generator to generate a reconstructed face image in which an identity of a face appearing in the at least one face image is reconstructed; and
a determining operation of executing the discriminator to discriminate the reconstructed face image based on the correct answer face image.

13. The device of claim 12,
wherein, to obtain the training data, the processor is configured to extract the at least one face image from the series of frames based on inter-frame face feature tracking information of the series of frames.
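
Extraction of tracked face images from a series of frames can be sketched as follows. The representation of the tracking information as `(frame index, track id, box)` tuples, the box layout, and the function names are assumptions for illustration; the claim does not specify how the tracking information is encoded.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Frame = List[List[int]]             # image as rows of pixel values
Box = Tuple[int, int, int, int]     # (top, left, height, width)
Detection = Tuple[int, int, Box]    # (frame index, track id, box)


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the face region out of a frame."""
    top, left, height, width = box
    return [row[left:left + width] for row in frame[top:top + height]]


def extract_face_tracks(frames: List[Frame],
                        detections: List[Detection]) -> Dict[int, List[Frame]]:
    """Group face crops by track id, ordered by frame index."""
    tracks: Dict[int, List[Frame]] = defaultdict(list)
    for frame_idx, track_id, box in sorted(detections, key=lambda d: d[0]):
        tracks[track_id].append(crop(frames[frame_idx], box))
    return dict(tracks)
```

Each resulting list holds the face images of one tracked person across frames, which is the multi-frame input the generator consumes.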

14. The device of claim 12,
wherein the generator comprises a multi-frame face image quality enhancer, and
wherein, to perform the generating operation, the processor is configured to execute the multi-frame face image quality enhancer to generate an intermediate-reconstructed face image from the at least one face image.

15. The device of claim 14,
wherein the generator further comprises a face landmark predictor and a face upsampler, and
wherein, to perform the generating operation, the processor is configured to:
execute the face landmark predictor to predict a plurality of face landmarks based on the intermediate-reconstructed face image; and
execute the face upsampler to upsample the intermediate-reconstructed face image using the predicted plurality of face landmarks.

16. The device of claim 15,
wherein the generator further comprises an intermediate image generator including a plurality of residual blocks, and
wherein, to perform the generating operation, the processor is configured to:
generate, using the intermediate image generator, an intermediate image in which the image quality of the intermediate-reconstructed face image is improved;
execute the face landmark predictor to predict the plurality of face landmarks based on the intermediate image; and
execute the face upsampler to upsample the intermediate image using the plurality of face landmarks predicted based on the intermediate image.

17. The device of claim 12,
wherein the video identity reconstruction model further comprises a face feature extractor, and
wherein, to execute the training, the processor is configured to execute the face feature extractor to extract a feature map of the reconstructed face image and a feature map of the correct answer face image.

18. The device of claim 12,
wherein, to execute the training, the processor is configured to:
calculate a learning objective function; and
alternately train the generator and the discriminator to minimize a function value of the learning objective function,
wherein the learning objective function comprises:
a first objective function comprising a GAN loss function for the generator; and
a second objective function based on a GAN loss function for the discriminator, and
wherein the first objective function comprises a pixel reconstruction accuracy function between the reconstructed face image and the correct answer face image, a prediction accuracy function of a face landmark predicted in the generating operation, and a face feature similarity function between the reconstructed face image and the correct answer face image.

19. The device of claim 12,
wherein the processor is further configured to execute second learning of fine-tuning the video identity reconstruction model based on second training data including at least one face image of a search target and a reference face image for the at least one face image of the search target.

20. The device of claim 19,
wherein, to execute the second learning, the processor is configured to perform the generating operation and the determining operation based on the second training data.