KR102380333B1 - Image Reenactment Apparatus, Method and Computer Readable Recording Medium Thereof

Info

Publication number: KR102380333B1
Application number: KR1020200022795A
Authority: KR
Inventors: 안상일; 하성주; 김동영; 마틴 커스너; 김범수; 서석준
Original assignee: 주식회사 하이퍼커넥트
Priority date: 2020-02-25
Filing date: 2020-02-25
Publication date: 2022-04-01
Also published as: KR20210108529A

Abstract

An image reenactment apparatus, an image reenactment method, and a computer-readable recording medium are disclosed. An image reenactment method according to an embodiment of the present invention includes: obtaining landmark information from a face image of a user; generating a user feature map from pose information of the user's face image; receiving a face image of a target and generating a target feature map and a pose-normalized target feature map from style information and pose information corresponding to the target's face image; generating a mixed feature map using the user feature map and the target feature map; and generating a reenacted image of the target's face image using the mixed feature map and the pose-normalized target feature map.

Description

Image Reenactment Apparatus, Method and Computer Readable Recording Medium Thereof

The present invention relates to an image reenactment apparatus, an image reenactment method, and a computer-readable recording medium, and more particularly to an apparatus, method, and recording medium capable of generating an image that is transformed naturally according to the characteristics of another image.

Facial landmark analysis extracts the key points of the main elements of a face, or the contours drawn by connecting those key points. Facial landmarks underlie, at the lowest level, techniques for the analysis, synthesis, morphing, reenactment, and classification of facial images, such as facial expression classification and pose analysis.

Existing facial image analysis and application techniques based on facial landmarks do not distinguish between a subject's appearance characteristics and characteristics caused by emotion, such as facial expression, when processing the landmarks, and their performance degrades as a result. For example, when classifying the emotion of a person whose eyebrows sit higher than most people's, the person may be misclassified as looking surprised even when the face is actually expressionless.

An object of the present invention is to provide an image reenactment apparatus, method, and computer-readable recording medium that, given a target image to be transformed, can use a user image different from the target image to generate an image that follows the user image but retains the characteristics of the target image.

An image reenactment method according to an embodiment of the present invention includes: obtaining landmark information from a face image of a user; generating a user feature map from pose information of the user's face image; receiving a face image of a target and generating a target feature map and a pose-normalized target feature map from style information and pose information corresponding to the target's face image; generating a mixed feature map using the user feature map and the target feature map; and generating a reenacted image of the target's face image using the mixed feature map and the pose-normalized target feature map.

In addition, the pose information may include movement information and expression information of the face image, and in the generating of the user feature map, the user feature map may be generated by inputting the pose information corresponding to the user's face image into an artificial neural network.

In addition, the landmark information may include position information corresponding to at least one of the eyes, nose, mouth, eyebrows, and ears, and the target feature map may include style information and pose information of the target's face image.

In addition, the pose-normalized target feature map may correspond to the output of an artificial neural network for the style information input to it.

In addition, in the generating of the mixed feature map, the mixed feature map may be generated by inputting the pose information of the user's face image and the style information of the target's face image into an artificial neural network.

In addition, the style information may include at least one of texture information, color information, and shape information corresponding to the target's face image.

In addition, the mixed feature map may be generated so that the landmarks of the target have pose information corresponding to the landmarks of the user.

In addition, the reenacted image may have the identity of the target's face and the pose of the user's face.

Meanwhile, a computer-readable recording medium on which a program for performing the method according to the present invention is recorded is provided.

Meanwhile, an image reenactment apparatus according to an embodiment of the present invention includes: a landmark acquisition unit that receives face images of a user and a target and obtains landmark information from each face image; a first encoder that generates a user feature map from pose information of the user's face image; a second encoder that generates a target feature map and a pose-normalized target feature map from style information and pose information of the target's face image; a blender that generates a mixed feature map using the user feature map and the target feature map; and a decoder that generates a reenacted image of the target's face image using the mixed feature map and the pose-normalized target feature map.
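The data flow among the two encoders, the blender, and the decoder can be sketched in Python as follows. This is a minimal illustration only: the function names are hypothetical, and the arithmetic inside each function is a placeholder standing in for the artificial neural networks the embodiment describes, not the actual models.

```python
import numpy as np

# Hypothetical stand-ins for the components. Each real component would be a
# trained neural network; only the wiring between them follows the text.

def first_encoder(user_pose):
    """User pose information -> user feature map."""
    return np.tanh(user_pose)

def second_encoder(target_style, target_pose):
    """Target style and pose -> (target feature map, pose-normalized target feature map)."""
    target_map = np.tanh(target_style + target_pose)
    pose_normalized_map = np.tanh(target_style)  # derived from style information only
    return target_map, pose_normalized_map

def blender(user_map, target_map):
    """User and target feature maps -> mixed feature map."""
    return 0.5 * (user_map + target_map)

def decoder(mixed_map, pose_normalized_map):
    """Mixed and pose-normalized target feature maps -> reenacted image."""
    return mixed_map + pose_normalized_map

def reenact(user_pose, target_style, target_pose):
    user_map = first_encoder(user_pose)
    target_map, pose_normalized_map = second_encoder(target_style, target_pose)
    mixed_map = blender(user_map, target_map)
    return decoder(mixed_map, pose_normalized_map)

rng = np.random.default_rng(0)
shape = (8, 8)  # arbitrary feature-map size chosen for this sketch
image = reenact(rng.normal(size=shape), rng.normal(size=shape), rng.normal(size=shape))
print(image.shape)
```

The point of the sketch is the dependency structure: the decoder receives both the blender output and the pose-normalized map, which is produced from the target alone.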

In addition, the pose information may include movement information and expression information of the face image, and the first encoder may generate the user feature map by inputting the pose information corresponding to the user's face image into an artificial neural network.

In addition, the pose-normalized target feature map may correspond to the output of an artificial neural network for the style information input to it, and the blender may generate the mixed feature map by inputting the pose information of the user's face image and the style information of the target's face image into an artificial neural network.

In addition, the mixed feature map may be generated so that the landmarks of the target have pose information corresponding to the landmarks of the user.

The present invention can provide an image reenactment apparatus, method, and computer-readable recording medium that, given a target image to be transformed, can use a user image different from the target image to generate an image that follows the user image but retains the characteristics of the target image.

FIG. 1 is a diagram schematically illustrating an environment in which an image reenactment apparatus and an image reenactment method according to the present invention operate.
FIG. 2 is a flowchart schematically illustrating an image reenactment method according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of the result of performing an image reenactment method according to an embodiment of the present invention.
FIG. 4 is a diagram schematically illustrating the configuration of an image reenactment apparatus according to an embodiment of the present invention.
FIG. 5 is a diagram schematically illustrating the configuration of a landmark acquisition unit according to an embodiment of the present invention.
FIG. 6 is a diagram schematically illustrating the configuration of a second encoder according to an embodiment of the present invention.
FIG. 7 is a diagram schematically illustrating the structure of a blender according to an embodiment of the present invention.
FIG. 8 is a diagram schematically illustrating the structure of a decoder according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the present invention pertains is fully informed of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Although terms such as "first" and "second" are used to describe various components, the components are not limited by these terms. The terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may be a second component within the technical idea of the present invention.

The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural unless the context specifically states otherwise. As used in this specification, "comprises" or "comprising" means that the stated components or steps do not exclude the presence or addition of one or more other components or steps.

Unless otherwise defined, all terms used in this specification may be interpreted with the meanings commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are clearly and specifically defined.

FIG. 1 is a diagram schematically illustrating an environment in which an image reenactment apparatus and an image reenactment method according to the present invention operate. Referring to FIG. 1, the environment in which a first terminal 10 and a second terminal 20 operate may include a server 100 and the first terminal 10 and second terminal 20 connected to the server 100. For convenience of description, FIG. 1 shows only two terminals, the first terminal 10 and the second terminal 20, but more than two terminals may be included. Except where specifically noted, the description of the first terminal 10 and the second terminal 20 applies to any additional terminals.

The server 100 may be connected to a communication network. The server 100 may be connected to other external devices through the communication network. The server 100 may transmit data to, or receive data from, the devices connected to it.

The communication network connected to the server 100 may include a wired communication network, a wireless communication network, or a combined communication network. The communication network may include a mobile communication network such as 3G, LTE, or LTE-A. The communication network may include a wired or wireless communication network such as Wi-Fi, UMTS/GPRS, or Ethernet. The communication network may include a short-range communication network such as Magnetic Secure Transmission (MST), Radio Frequency Identification (RFID), Near Field Communication (NFC), ZigBee, Z-Wave, Bluetooth, Bluetooth Low Energy (BLE), or infrared (IR) communication. The communication network may include a local area network (LAN), a metropolitan area network (MAN), or a wide area network (WAN).

The server 100 may receive data from at least one of the first terminal 10 and the second terminal 20. The server 100 may perform an operation using the data received from at least one of the first terminal 10 and the second terminal 20. The server 100 may transmit the result of the operation to at least one of the first terminal 10 and the second terminal 20.

The server 100 may receive a mediation request from at least one of the first terminal 10 and the second terminal 20. The server 100 may select the terminals that transmitted a mediation request. For example, the server 100 may select the first terminal 10 and the second terminal 20.

The server 100 may mediate a communication connection between the selected first terminal 10 and second terminal 20. For example, the server 100 may mediate a video call connection, or a text transmission and reception connection, between the first terminal 10 and the second terminal 20. The server 100 may transmit connection information for the first terminal 10 to the second terminal 20, and may transmit connection information for the second terminal 20 to the first terminal 10.

The connection information for the first terminal 10 may include, for example, the IP address and port number of the first terminal 10. The first terminal 10, having received the connection information for the second terminal 20, may attempt to connect to the second terminal 20 using the received connection information.
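The exchange of connection information can be illustrated with a small Python sketch. The `ConnectionInfo` class, the `mediate` function, the terminal identifiers, and the addresses are all hypothetical names chosen for illustration; the text only states that connection information may include an IP address and a port number and that the server forwards each terminal's information to the other.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConnectionInfo:
    ip: str    # IP address of the terminal
    port: int  # port number of the terminal

def mediate(requests):
    """Given the connection info of two terminals that sent a mediation
    request, return for each terminal the connection info of its peer."""
    (id_a, info_a), (id_b, info_b) = requests.items()
    return {id_a: info_b, id_b: info_a}

# Addresses from the 192.0.2.0/24 documentation range, used as placeholders.
peers = mediate({
    "terminal_10": ConnectionInfo("192.0.2.10", 5000),
    "terminal_20": ConnectionInfo("192.0.2.20", 5001),
})
print(peers["terminal_10"])
```

Each terminal can then attempt a connection to the address it received, which corresponds to the connection attempt described in the text.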

When the first terminal 10 succeeds in connecting to the second terminal 20, or the second terminal 20 succeeds in connecting to the first terminal 10, a video call session may be established between the first terminal 10 and the second terminal 20. Through the video call session, the first terminal 10 may transmit video or sound to the second terminal 20. The first terminal 10 may encode the video or sound into a digital signal and transmit the encoded result to the second terminal 20.

In addition, through the video call session, the first terminal 10 may receive video or sound from the second terminal 20. The first terminal 10 may receive video or sound encoded as a digital signal and decode the received video or sound.

Through the video call session, the second terminal 20 may transmit video or sound to the first terminal 10. In addition, through the video call session, the second terminal 20 may receive video or sound from the first terminal 10. In this way, the user of the first terminal 10 and the user of the second terminal 20 can make a video call with each other.

The first terminal 10 and the second terminal 20 may each be, for example, a desktop computer, a laptop computer, a smartphone, a smart tablet, a smart watch, a mobile terminal, a digital camera, a wearable device, or a portable electronic device. The first terminal 10 and the second terminal 20 may execute programs or applications. The first terminal 10 and the second terminal 20 may be devices of the same type or devices of different types.

FIG. 2 is a flowchart schematically illustrating an image reenactment method according to an embodiment of the present invention.

Referring to FIG. 2, an image reenactment method according to an embodiment of the present invention includes obtaining landmark information of a user's face (S110), generating a user feature map (S120), generating a target feature map (S130), generating a mixed feature map (S140), and generating a reenacted image (S150).

In step S110, landmark information is obtained from a face image of a user. A landmark is a facial part that characterizes the user's face and may include, for example, the user's eyes, eyebrows, nose, mouth, ears, or jawline. The landmark information may include information about the position, size, or shape of the main elements of the user's face. The landmark information may also include information about the color or texture of the main elements of the user's face.

The user may be any user of a terminal on which the image reenactment method according to the present invention is performed. In step S110, the user's face image is received and landmark information corresponding to the face image is obtained. The landmark information can be obtained using known techniques, and any known method may be used. The present invention is not limited by the method used to obtain the landmark information.

In step S110, a transformation matrix corresponding to the landmark information may be estimated. The transformation matrix, together with predetermined unit vectors, may constitute the landmark information. For example, first landmark information may be computed as the product of the unit vectors and a first transformation matrix, and second landmark information may be computed as the product of the unit vectors and a second transformation matrix.

The transformation matrix converts high-dimensional landmark information into low-dimensional data and may be used in principal component analysis (PCA). PCA is a dimensionality reduction technique that converts variables in a high-dimensional space into variables in a low-dimensional space by finding new, mutually orthogonal axes while preserving as much of the variance of the data as possible. PCA first finds the hyperplane closest to the data and then reduces the dimension of the data by projecting the data onto that low-dimensional hyperplane.
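As a concrete illustration of the dimensionality reduction described above, the following is a minimal PCA sketch using NumPy's singular value decomposition. The function names, the 136-dimensional landmark vectors (68 points with two coordinates each), and the synthetic data are assumptions made for the example and are not taken from the embodiment.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the data mean and the top principal axes (rows are unit vectors)."""
    mean = X.mean(axis=0)
    # Rows of vt are mutually orthogonal unit vectors ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_project(X, mean, components):
    """Project high-dimensional rows of X onto the low-dimensional axes."""
    return (X - mean) @ components.T

def pca_reconstruct(Y, mean, components):
    """Map low-dimensional coefficients back to the high-dimensional space."""
    return Y @ components + mean

rng = np.random.default_rng(0)
# Synthetic "landmark" vectors that lie on a 2-dimensional subspace of a 136-dimensional space.
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 136))
mean, components = pca_fit(X, n_components=2)
Y = pca_project(X, mean, components)
X_hat = pca_reconstruct(Y, mean, components)
print(Y.shape, np.allclose(X, X_hat))
```

Because the synthetic data has only two degrees of freedom, two principal axes are enough to reconstruct it; real landmark data would be reconstructed only approximately.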

In PCA, the unit vector that defines the i-th axis is called the i-th principal component (PC), and high-dimensional data can be converted into low-dimensional data by linearly combining these axes.
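The relation that the following definitions refer to is not reproduced in this text. One form consistent with the earlier statement that landmark information is computed as the product of the predetermined unit vectors and a transformation matrix, offered here as an assumption rather than as the original notation, is:

```latex
X = \alpha \, Y
```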

Here, X denotes the high-dimensional landmark information, Y denotes the low-dimensional principal components, and α denotes the transformation matrix.

As described above, the unit vectors, that is, the principal components, may be predetermined. Accordingly, when new landmark information is received, a transformation matrix corresponding to it can be determined. A plurality of transformation matrices may exist for a single piece of landmark information.

Meanwhile, in step S110, a learning model trained to estimate the transformation matrix may be used. The learning model can be understood as a model trained to estimate a PCA transformation matrix from an arbitrary face image and the landmark information corresponding to that face image.

The learning model may be trained to estimate the transformation matrix from face images of different people and the landmark information corresponding to each face image. Several transformation matrices may correspond to one piece of high-dimensional landmark information, and the learning model may be trained to output only one of them.

The landmark information used as input to the learning model may be obtained by a known method of extracting landmarks from a face image and visualizing them.

Accordingly, in step S110, the user's face image and the landmark information corresponding to the face image are received as input, and one transformation matrix is estimated from them and output.

Meanwhile, the learning model may be trained to classify the landmark information into a plurality of semantic groups corresponding to the right eye, left eye, nose, and mouth, respectively, and to output PCA transform coefficients corresponding to each of the semantic groups.

The semantic groups need not correspond to the right eye, left eye, nose, and mouth. The landmark information may instead be classified into groups corresponding to the eyebrows, eyes, nose, mouth, and jawline, or into groups corresponding to the eyebrows, right eye, left eye, nose, mouth, jawline, and ears. In step S110, the landmark information may be classified into such finer-grained semantic groups according to the learning model, and the PCA transform coefficients corresponding to each classified semantic group may be estimated.
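The grouping into semantic groups with per-group PCA coefficients can be sketched as follows. Two points are assumptions of this example and not part of the text: the index ranges follow the widely used 68-point facial landmark layout, and the coefficients are obtained by direct projection onto per-group bases, whereas the embodiment uses a trained learning model to output them. The function names are hypothetical.

```python
import numpy as np

# Assumed index ranges in a 68-point landmark layout (not specified in the text).
SEMANTIC_GROUPS = {
    "right_eye": range(36, 42),
    "left_eye": range(42, 48),
    "nose": range(27, 36),
    "mouth": range(48, 68),
}

def fit_group_bases(train, n_basis):
    """train: (N, 68, 2) landmarks. Returns {group: (mean, basis)} where
    basis has shape (n_basis, 2 * points_in_group)."""
    bases = {}
    for name, idx in SEMANTIC_GROUPS.items():
        flat = train[:, list(idx), :].reshape(len(train), -1)
        mean = flat.mean(axis=0)
        _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
        bases[name] = (mean, vt[:n_basis])
    return bases

def group_coefficients(landmarks, bases):
    """landmarks: (68, 2). Returns one PCA coefficient vector per semantic group."""
    coeffs = {}
    for name, idx in SEMANTIC_GROUPS.items():
        mean, basis = bases[name]
        flat = landmarks[list(idx), :].reshape(-1)
        coeffs[name] = basis @ (flat - mean)
    return coeffs

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 68, 2))  # synthetic stand-in for real landmark data
bases = fit_group_bases(train, n_basis=4)
coeffs = group_coefficients(train[0], bases)
print({name: c.shape for name, c in coeffs.items()})
```

Handling each group separately means that, for example, a mouth movement changes only the mouth coefficients and leaves the eye coefficients untouched.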

Meanwhile, the expression landmarks of the user are calculated using the transformation matrix. Landmark information can be decomposed into a plurality of pieces of sub-landmark information, and the present invention assumes that the landmark information can be expressed as follows.
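Written with the symbols defined next, this decomposition is a sum of three terms. The form below is a reconstruction from the surrounding description, since the equation itself is not reproduced in this text:

```latex
l(c, t) = l_{m} + l_{id}(c) + l_{exp}(c, t)
```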

Here, l(c, t) is the landmark information in the t-th frame of a video containing person c, l_m is the mean facial landmark information of people in general, l_id(c) is the identity landmark information (facial landmark of identity geometry) unique to person c, and l_exp(c, t) is the expression landmark information (facial landmark of expression geometry) of person c in the t-th frame of the video containing person c.

That is, the landmark information of a specific person in a specific frame can be expressed as the sum of the mean landmark information over all people's faces, the landmark information unique to that person, and the expression and movement information of that person in that frame.

The mean landmark information can be defined by the following equation and can be calculated from a large amount of video collected in advance.
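A form consistent with the description of this average, which runs over every person and every frame, is sketched below. The symbol C for the number of persons in the collected videos is introduced for this reconstruction, and each video is assumed to contain T frames:

```latex
l_{m} = \frac{1}{C\,T} \sum_{c=1}^{C} \sum_{t=1}^{T} l(c, t)
```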

Here, T is the total number of frames in a video, so l_m is the average of the landmarks l(c, t) of all persons appearing in the videos collected in advance.
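The mean landmark and the three-term decomposition can be illustrated numerically. In the sketch below, the mean is taken over all persons and frames as described. The identity term is estimated as each person's temporal mean minus the global mean, which is an assumption of this example (it presumes that a person's expression term averages to zero over their frames) and is not stated in the text. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def decompose_landmarks(videos):
    """videos: {person: array of shape (T, D)} with one flattened landmark
    vector per frame; every video is assumed to have the same T."""
    stacked = np.concatenate(list(videos.values()), axis=0)
    l_m = stacked.mean(axis=0)  # mean over all persons and all frames
    # Assumed estimator: identity = per-person temporal mean minus the global mean.
    l_id = {c: frames.mean(axis=0) - l_m for c, frames in videos.items()}
    # Expression = what remains after removing the mean and identity terms.
    l_exp = {c: frames - l_m - l_id[c] for c, frames in videos.items()}
    return l_m, l_id, l_exp

rng = np.random.default_rng(0)
videos = {"a": rng.normal(size=(5, 136)), "b": rng.normal(size=(5, 136))}
l_m, l_id, l_exp = decompose_landmarks(videos)
# The three terms sum back to the original landmarks of each frame.
print(np.allclose(l_m + l_id["a"] + l_exp["a"], videos["a"]))
```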

Meanwhile, the expression landmarks can be calculated using the following equation.
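Based on the symbols the text uses for this equation (n_exp expression bases, basis vectors b_exp, and PCA coefficients α), the equation presumably takes the usual PCA expansion form shown below. The per-basis index k follows the α_k(c, t) notation that appears later in the text:

```latex
l_{exp}(c, t) = \sum_{k=1}^{n_{exp}} \alpha_{k}(c, t) \, b_{exp,k}
```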

위 수학식은 인물 c 의 시맨틱 그룹 각각에 대한 PCA 수행 결과를 나타낸다. n_exp는 모든 시맨틱 그룹의 expression basis 수의 합, b _exp는 PCA의 basis인 expression basis, α는 PCA의 계수를 의미한다.The above equation represents the PCA performance result for each semantic group of person c. n _exp is the sum of the number of expression basis of all semantic groups, b _exp is the expression basis that is the basis of PCA, and α is the coefficient of PCA.

In other words, b_exp corresponds to the eigenvectors described above, and a high-dimensional expression landmark can be defined as a combination of low-dimensional eigenvectors. n_exp is the total number of expressions and motions that person c can express through the right eye, left eye, nose, mouth, and so on.

Accordingly, the expression landmark of the first person can be defined as a set of expression information for each main part of the face, namely the right eye, left eye, nose, and mouth. A coefficient α_k(c, t) may exist for each eigenvector.
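The linear combination described here can be sketched as follows. The per-basis index k on b_exp is an assumed notation, chosen to pair each coefficient α_k(c, t) with one eigenvector; the formula is reconstructed from the definitions of n_exp, b_exp, and α in the text.

```latex
l_{exp}(c, t) = \sum_{k=1}^{n_{exp}} \alpha_k(c, t) \, b_{exp,k}
```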

The learning model described above can be trained to estimate the PCA coefficient α(c, t), taking as input the picture x(c, t) and the landmark information l(c, t) of a person c whose landmark information is to be decomposed as in Equation 2. Through this training, the learning model can estimate the PCA coefficients from an image of a specific person and the corresponding landmark information, and can estimate the low-dimensional eigenvectors.

When the trained neural network is applied, the picture x(c', t) and the landmark information l(c', t) of a person c' on whom landmark decomposition is to be performed are given to the neural network as input, and the PCA transformation matrix is estimated. Here, b_exp takes the value obtained from the training data, and the expression landmark can be estimated from the predicted (estimated) PCA coefficients and b_exp as follows.

Next, the identity landmark of the first person is computed using the expression landmark. As described with reference to Equation 2, landmark information can be defined as the sum of the mean landmark information, the identity landmark information, and the expression landmark information, and the expression landmark information can be estimated through Equation 5.

Accordingly, the identity landmark can be computed as follows.

The above equation can be derived from Equation 2; once the expression landmark has been computed, the identity landmark can be computed through Equation 6. The mean landmark information l_m can be computed from a large number of videos collected in advance.

Accordingly, given a face image of an arbitrary person, landmark information can be obtained from it, and the expression landmark information and identity landmark information can be computed from the face image and the landmark information.
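The decomposition can be illustrated with a small numeric sketch. The function names, the toy landmark vectors, and the hand-picked basis and coefficients below are all hypothetical; in the described method the coefficients would come from the trained model and the basis from PCA on training data.

```python
from typing import List

Vector = List[float]


def mean_landmark(frames: List[Vector]) -> Vector:
    """Average landmark vector over all collected frames (l_m in the text)."""
    total = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / total for i in range(dim)]


def expression_landmark(alphas: Vector, bases: List[Vector]) -> Vector:
    """Linear combination of expression bases: sum over k of alpha_k * b_k."""
    dim = len(bases[0])
    return [sum(a * b[i] for a, b in zip(alphas, bases)) for i in range(dim)]


def identity_landmark(l: Vector, l_m: Vector, l_exp: Vector) -> Vector:
    """Identity landmark obtained by rearranging l = l_m + l_id + l_exp."""
    return [x - m - e for x, m, e in zip(l, l_m, l_exp)]


# Toy data: two collected frames with four landmark coordinates each.
frames = [[0.0, 2.0, 4.0, 6.0], [2.0, 4.0, 6.0, 8.0]]
l_m = mean_landmark(frames)

# Hypothetical expression bases and estimated coefficients.
bases = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
alphas = [0.5, -0.25]

l = [2.0, 3.0, 5.5, 7.5]  # observed landmark of the person in one frame
l_exp = expression_landmark(alphas, bases)
l_id = identity_landmark(l, l_m, l_exp)
```

Adding `l_m`, `l_id`, and `l_exp` element by element recovers `l`, which is the property the decomposition relies on.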

In step S120, a user feature map is generated from pose information of the user's face image. The pose information may include motion information and expression information of the face image. In step S120, the pose information corresponding to the user's face image may be input to an artificial neural network to generate the user feature map. The pose information can be understood as corresponding to the expression landmark information obtained in step S110.

The user feature map generated in step S120 includes information representing the features of the expression the user is making and of the movement of the user's face. The artificial neural network used in step S120 may be a convolutional neural network (CNN), but other types of artificial neural networks may also be used.

In step S130, a face image of a target is received, and a target feature map and a pose-normalized target feature map are generated from style information and pose information corresponding to the target's face image.

The target refers to the person to be reenacted by the present invention; the user and the target may be different persons, but are not necessarily limited thereto. The reenacted image produced by the present invention is transformed from the target's face image and may show the target imitating the user's movements and expressions.

The target feature map includes information representing the features of the expression the target is making and of the movement of the target's face.

The pose-normalized target feature map may correspond to the output of an artificial neural network for the style information input to it. Alternatively, the pose-normalized target feature map may include information corresponding to the unique features of the target's face, excluding the target's pose information.

As in step S120, the artificial neural network used in step S130 may be a CNN, and the structure of the network used in step S120 may differ from that of the network used in step S130.

The style information means information representing a person's unique characteristics as they appear in the face. For example, the style information may include innate features visible on the target's face and the size, shape, and position of the landmarks. Alternatively, the style information may include at least one of texture information, color information, and shape information corresponding to the target's face image.

The target feature map can be understood as including data corresponding to the expression landmark information obtained from the target's face image, and the pose-normalized target feature map as including data corresponding to the identity landmark information obtained from the target's face image.

In step S140, a mixed feature map is generated using the user feature map and the target feature map; the mixed feature map may be generated by inputting the pose information of the user's face image and the style information of the target's face image into an artificial neural network.

The mixed feature map may be generated so that the target's landmarks have pose information corresponding to the user's landmarks.

As in steps S120 and S130, the artificial neural network used in step S140 may be a CNN, and its structure may differ from that of the networks used in the preceding steps.

In step S150, a reenacted image of the target's face image is generated using the mixed feature map and the pose-normalized target feature map.

As described above, the pose-normalized target feature map includes data corresponding to the identity landmark information obtained from the target's face image. The identity landmark information corresponds to the unique features of a person, independent of the expression information that corresponds to the person's motion or facial expression.

While the mixed feature map generated in step S140 yields target motion that naturally follows the user's motion, step S150 reflects the target's unique features, producing the effect of the actual target moving and making expressions on its own.

FIG. 3 is a diagram illustrating an example result of performing the image reenactment method according to an embodiment of the present invention. FIG. 3 shows a target image, a user image, and a reenacted image; the reenacted image retains the facial features of the target but has the facial movement and expression of the user.

Comparing the target image and the reenacted image in FIG. 3, the two images show the same person and differ only in expression. The eyes, nose, mouth, and hairstyle of the target image are the same as those of the reenacted image.

Meanwhile, the expression of the person in the reenacted image is substantially the same as the user's expression. For example, if the user's mouth is open in the user image, the reenacted image shows the target with an open mouth. Likewise, if the user's head is turned to the right or left in the user image, the reenacted image shows the target with the head turned to the right or left.

When an image of the user that changes in real time is received and the reenacted image is generated from it, the target image can be transformed in response to the user's movements and expressions as they change in real time.

FIG. 4 is a diagram schematically showing the configuration of an image reenactment apparatus according to an embodiment of the present invention. Referring to FIG. 4, the image reenactment apparatus 30 according to an embodiment of the present invention includes a landmark acquirer 31, a first encoder 32, a second encoder 33, a blender 34, and a decoder 35.

The landmark acquirer 31 receives face images of a user and a target and acquires landmark information from each face image. A landmark is a part of the face that characterizes the user's face and may include, for example, the user's eyes, eyebrows, nose, mouth, ears, or jawline. The landmark information may include information on the position, size, or shape of the main elements of the user's face, and may also include information on the color or texture of those elements.

The user may be any user of a terminal on which the image reenactment apparatus according to the present invention runs. The landmark acquirer 31 receives the user's face image and acquires the corresponding landmark information. The landmark information can be obtained with known techniques, and any known method may be used. The present invention is not limited by the method of acquiring the landmark information.

The landmark acquirer 31 may estimate a transformation matrix corresponding to the landmark information. The transformation matrix, together with a predetermined unit vector, may constitute the landmark information. For example, first landmark information may be computed as the product of the unit vector and a first transformation matrix, and second landmark information as the product of the unit vector and a second transformation matrix.
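As a minimal sketch of this product, the snippet below multiplies one shared unit vector by two different matrices to obtain two landmark vectors. The multiplication order (row vector times matrix), the dimensions, and all numeric values are assumptions chosen for illustration; the text does not specify them.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def vector_times_matrix(v: Vector, m: Matrix) -> Vector:
    """Row vector (length n) times an n x d matrix, giving a length-d vector."""
    columns = len(m[0])
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(columns)]


# A predetermined vector of unit length: 4 * 0.5**2 == 1.
unit_vector = [0.5, 0.5, 0.5, 0.5]

# Two hypothetical transformation matrices, one per landmark vector.
first_matrix = [[2.0, 0.0], [0.0, 4.0], [2.0, 0.0], [0.0, 4.0]]
second_matrix = [[1.0, 1.0], [1.0, 1.0], [3.0, 5.0], [3.0, 5.0]]

first_landmark = vector_times_matrix(unit_vector, first_matrix)
second_landmark = vector_times_matrix(unit_vector, second_matrix)
```

Because the unit vector is fixed, all variation between the two landmark vectors is carried by the matrices, which is why estimating the matrix is enough to recover the landmark information.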

The landmark acquirer 31 may use a learning model trained to estimate the transformation matrix. The learning model can be understood as a model trained to estimate a PCA transformation matrix from an arbitrary face image and the landmark information corresponding to that image.

Accordingly, the landmark acquirer 31 receives the user's face image and the corresponding landmark information as input, and estimates and outputs one transformation matrix from them.

Here, the semantic groups are not necessarily divided into the right eye, left eye, nose, and mouth; they may instead be divided into the eyebrows, eyes, nose, mouth, and jawline, or into the eyebrows, right eye, left eye, nose, mouth, jawline, and ears. The landmark acquirer 31 may classify the landmark information into finer-grained semantic groups according to the learning model and estimate the PCA transformation coefficients corresponding to each classified semantic group.
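Splitting landmarks into semantic groups can be sketched as a simple index lookup. The index ranges below follow the widely used 68-point face annotation layout, which is an assumption: the text does not state how many landmarks are used or how they are ordered, and the names `SEMANTIC_GROUPS` and `split_into_groups` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]

# Assumed 68-point layout; one possible grouping among those the text allows.
SEMANTIC_GROUPS: Dict[str, range] = {
    "jawline": range(0, 17),
    "eyebrows": range(17, 27),
    "nose": range(27, 36),
    "right_eye": range(36, 42),
    "left_eye": range(42, 48),
    "mouth": range(48, 68),
}


def split_into_groups(landmarks: Sequence[Point]) -> Dict[str, List[Point]]:
    """Return the landmark points belonging to each semantic group."""
    return {
        name: [landmarks[i] for i in indices]
        for name, indices in SEMANTIC_GROUPS.items()
    }


# Placeholder coordinates standing in for detected landmarks.
landmarks = [(float(i), float(i)) for i in range(68)]
groups = split_into_groups(landmarks)
```

PCA coefficients would then be estimated per group, so that, for example, mouth motion and eye motion are described by separate sets of bases.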

Meanwhile, the user's expression landmark can be computed using the transformation matrix. Landmark information can be decomposed into a plurality of pieces of sub-landmark information; in the present invention, landmark information is defined as the sum of the mean facial landmark information, the facial landmark of identity geometry of an individual person, and the facial landmark of expression geometry of that person.

The expression landmark corresponds to the pose information of the user's face image, and the identity landmark corresponds to the style information of the target's face image.

In summary, the landmark acquirer 31 receives the user's face image and the target's face image, and generates from each a plurality of pieces of landmark information including expression landmark information and identity landmark information.

The first encoder 32 generates a user feature map from pose information of the user's face image. The pose information corresponds to the expression landmark information and may include motion information and expression information of the face image. The first encoder 32 may generate the user feature map by inputting the pose information corresponding to the user's face image into an artificial neural network.

The user feature map generated by the first encoder 32 includes information representing the features of the expression the user is making and of the movement of the user's face. The artificial neural network used in the first encoder 32 may be a convolutional neural network (CNN), but other types of artificial neural networks may also be used.

The second encoder 33 generates a target feature map and a pose-normalized target feature map from style information and pose information of the target's face image.

The target feature map generated by the second encoder 33 can be understood as the counterpart of the user feature map generated by the first encoder 32; it includes information representing the features of the expression the target is making and of the movement of the target's face.

As with the first encoder 32, the artificial neural network used in the second encoder 33 may be a CNN, and the structures of the networks used in the first encoder 32 and the second encoder 33 may differ.

The blender 34 generates a mixed feature map using the user feature map and the target feature map; the mixed feature map may be generated by inputting the pose information of the user's face image and the style information of the target's face image into an artificial neural network.

The mixed feature map may be generated so that the target's landmarks have pose information corresponding to the user's landmarks. As with the first encoder 32 and the second encoder 33, the artificial neural network used in the blender 34 may be a CNN, and its structure may differ from that of the networks used in the first or second encoder 32, 33.

The user feature map and the target feature map input to the blender 34 include landmark information of the user's face and of the target's face, respectively. The blender 34 may match the landmarks of the user's face to the landmarks of the target's face so as to generate a target face that corresponds to the movement and expression of the user's face while preserving the unique features of the target's face.

For example, to control the movement of the target's face according to the movement of the user's face, landmarks such as the user's eyes, eyebrows, nose, mouth, and jawline can be linked to the corresponding landmarks of the target.

Alternatively, to control the expression of the target's face according to the expression of the user's face, landmarks such as the user's eyes, eyebrows, nose, mouth, and jawline can be linked to the corresponding landmarks of the target.

The decoder 35 generates a reenacted image of the target's face image using the mixed feature map and the pose-normalized target feature map.

While the mixed feature map generated by the blender 34 yields target motion that naturally follows the user's motion, the decoder 35 reflects the target's unique features, producing the effect of the actual target moving and making expressions on its own.
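The data flow between the five components can be sketched as plain function composition. The class `ReenactmentApparatus`, its field names, and the string-returning stand-ins are hypothetical; in the described apparatus each stage would be a trained neural network rather than a lambda.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class ReenactmentApparatus:
    landmark_acquirer: Callable[[Any], Any]              # landmark acquirer 31
    first_encoder: Callable[[Any], Any]                  # first encoder 32
    second_encoder: Callable[[Any, Any], Tuple[Any, Any]]  # second encoder 33
    blender: Callable[[Any, Any], Any]                   # blender 34
    decoder: Callable[[Any, Any], Any]                   # decoder 35

    def reenact(self, user_image: Any, target_image: Any) -> Any:
        # Pose information of the user drives the user feature map.
        user_pose = self.landmark_acquirer(user_image)
        user_map = self.first_encoder(user_pose)
        # The target yields a feature map and a pose-normalized feature map.
        target_landmarks = self.landmark_acquirer(target_image)
        target_map, normalized_map = self.second_encoder(
            target_image, target_landmarks
        )
        # Mix user pose with target features, then decode the final image.
        mixed_map = self.blender(user_map, target_map)
        return self.decoder(mixed_map, normalized_map)


# Stand-in stages that only record which inputs reached them.
apparatus = ReenactmentApparatus(
    landmark_acquirer=lambda image: f"landmarks({image})",
    first_encoder=lambda pose: f"user_map({pose})",
    second_encoder=lambda image, lm: (
        f"target_map({image})",
        f"normalized_map({image})",
    ),
    blender=lambda user_map, target_map: f"mixed({user_map}, {target_map})",
    decoder=lambda mixed, normalized: f"reenacted({mixed}; {normalized})",
)
result = apparatus.reenact("user.png", "target.png")
```

The trace in `result` makes the key point visible: the user contributes only through the mixed feature map, while the target reaches the decoder through both the mixed map and the pose-normalized map.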

FIG. 5 is a diagram schematically showing the configuration of a landmark acquirer according to an embodiment of the present invention. Referring to FIG. 5, the landmark acquirer according to an embodiment of the present invention may include an artificial neural network that receives a person's face image (input image) as input. Any of several known artificial neural networks may be used; in one embodiment, the artificial neural network may be ResNet. ResNet is a type of convolutional neural network (CNN), and the present invention is not limited to a specific type of artificial neural network.

A multi-layer perceptron (MLP) is a type of artificial neural network in which multiple layers of perceptrons are stacked to overcome the limitations of a single-layer perceptron. Referring to FIG. 5, the MLP receives as input the output of the artificial neural network and the landmark information corresponding to the face image, and outputs a transformation matrix. The artificial neural network and the MLP can also be understood as together constituting a single trained artificial neural network.
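A forward pass of such an MLP head can be sketched in a few lines. Everything here is a toy assumption: the image features stand in for the output of the convolutional network, the two-layer shape, ReLU activation, and hand-written weights are illustrative only, and the output is shown as a flat list of coefficients rather than a full matrix.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def estimate_coefficients(
    image_features: Vector,
    landmarks: Vector,
    w1: Matrix, b1: Vector,
    w2: Matrix, b2: Vector,
) -> Vector:
    """Concatenate image features with landmarks and apply a two-layer MLP."""
    x = image_features + landmarks  # list concatenation
    hidden = relu(linear(x, w1, b1))
    return linear(hidden, w2, b2)


# Hypothetical inputs and weights.
image_features = [1.0, -1.0]
landmarks = [0.5, 0.5]
w1 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b1 = [0.0, 0.0]
w2 = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.5, -1.0]

coefficients = estimate_coefficients(image_features, landmarks, w1, b1, w2, b2)
```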

Once the transformation matrix has been estimated by the trained artificial neural network, the expression landmark information and identity landmark information can be computed as described with reference to FIG. 4. The image reenactment apparatus according to the present invention can be applied even when only a very small number of face images, or only a single frame of a face image, is available.

The trained artificial neural network has been trained to estimate low-dimensional eigenvectors and transformation coefficients from a large number of face images and their corresponding landmark information; a network trained in this way can estimate the eigenvectors and transformation coefficients even when given only a single frame of a face image.

Separating the expression landmark and the identity landmark of an arbitrary person in this way can improve the quality of facial-landmark-based face image processing techniques such as face reenactment, face classification, and face morphing.

FIG. 6 is a diagram schematically showing the configuration of a second encoder according to an embodiment of the present invention.

Referring to FIG. 6, the second encoder 33 according to an embodiment of the present invention may adopt a U-Net structure. U-Net is a U-shaped network with a symmetric shape that basically performs a segmentation function.

f_y denotes the normalized flow map used to normalize the target feature map, and T denotes the warping function that performs warping. S_j, j = 1, ..., n_y, denotes the target feature map encoded at each convolutional layer.

The second encoder 33 receives a rendered target landmark and a target image as input and generates from them the encoded target feature maps and the normalized flow map f_y. It then applies the warping function to the generated target feature maps S_j and the normalized flow map f_y to produce warped target feature maps.

Here, the warped target feature map can be understood as the same as the pose-normalized target feature map described above. Accordingly, the warping function T can be understood as a function that removes the target's expression landmark information and produces data consisting only of the target's own style information, that is, the identity landmark information.
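A warping function driven by a flow map can be sketched as follows. This is a generic backward warp with bilinear sampling and border clamping on a single-channel 2D map, which is an assumption; the text does not state how T samples the feature map, and the function name `warp` and the per-pixel `(dx, dy)` flow layout are hypothetical.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[float]]
FlowMap = List[List[Tuple[float, float]]]


def warp(feature: FeatureMap, flow: FlowMap) -> FeatureMap:
    """Sample feature at (x + dx, y + dy) for every output position (x, y)."""
    height, width = len(feature), len(feature[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Clamp the sampling position to the map borders.
            sx = min(max(x + dx, 0.0), width - 1.0)
            sy = min(max(y + dy, 0.0), height - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
            ax, ay = sx - x0, sy - y0
            # Bilinear interpolation between the four neighbouring cells.
            top = feature[y0][x0] * (1 - ax) + feature[y0][x1] * ax
            bottom = feature[y1][x0] * (1 - ax) + feature[y1][x1] * ax
            out[y][x] = top * (1 - ay) + bottom * ay
    return out


feature = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
zero_flow = [[(0.0, 0.0)] * 3 for _ in range(2)]
shift_flow = [[(1.0, 0.0)] * 3 for _ in range(2)]

unchanged = warp(feature, zero_flow)
shifted = warp(feature, shift_flow)
```

A zero flow leaves the map unchanged, while a constant flow of one cell to the right makes each output cell take the value of its right-hand neighbour. A learned flow map such as f_y would instead vary per position so that the pose is removed from the target features.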

FIG. 7 is a diagram schematically showing the structure of a blender according to an embodiment of the present invention.

As described above, the blender 34 generates a mixed feature map from the user feature map and the target feature map; the mixed feature map may be generated by inputting the pose information of the user's face image and the style information of the target's face image into an artificial neural network.

FIG. 7 shows one user feature map and three target feature maps, but there may be one target feature map, two, or more than three. The small region inside each feature map shown in FIG. 6 represents information about an arbitrary landmark, and all of these regions represent information about the same landmark.

For example, the eye may be located in the user feature map and then in the target feature map, and the mixed feature map may be generated so that the eye in the target feature map follows the movement of the eye in the user feature map. Substantially the same operation may be performed by the blender 34 for the other landmarks.

FIG. 8 is a diagram schematically showing the structure of a decoder according to an embodiment of the present invention.

Referring to FIG. 8, the decoder 35 according to an embodiment of the present invention takes as input the pose-normalized target feature map generated by the second encoder 33 and the mixed feature map z_xy generated by the blender 34, and applies the user's expression landmark information to the target image.

In FIG. 8, the data input to each block of the decoder 35 is the pose-normalized target feature map generated by the second encoder 33, and f_u denotes the flow map that applies the user's expression landmark information to the pose-normalized target feature map.

In addition, the warp-alignment block of the decoder 35 performs a warping function, taking as input the output u of the previous block of the decoder 35 and the pose-normalized target feature map. The warping function performed in the decoder 35 is intended to generate a reenacted image that follows the movement and pose of the user while maintaining the unique characteristics of the target, and it differs from the warping function performed in the second encoder 33.
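As a rough illustration of applying a flow map such as f_u to a feature map, the sketch below backward-warps a single-channel grid with bilinear interpolation. The description does not specify how the flow map is estimated from the previous block output u or which interpolation is used, so the function name `warp`, the (dy, dx) offset convention, and the border clamping are all assumptions.

```python
import math


def warp(feature, flow):
    """Backward-warp a 2D feature map with a per-position flow map.

    feature: H x W grid of floats (one channel of a feature map)
    flow:    H x W grid of (dy, dx) offsets (hypothetical convention);
             output[y][x] samples feature at (y + dy, x + dx).
    """
    height, width = len(feature), len(feature[0])

    def sample(y, x):
        # Clamp the sampling point so it stays inside the grid.
        y = min(max(y, 0.0), height - 1.0)
        x = min(max(x, 0.0), width - 1.0)
        y0, x0 = int(math.floor(y)), int(math.floor(x))
        y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
        wy, wx = y - y0, x - x0
        # Bilinear interpolation between the four neighbouring cells.
        top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
        bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
        return top * (1 - wy) + bottom * wy

    return [
        [sample(y + flow[y][x][0], x + flow[y][x][1]) for x in range(width)]
        for y in range(height)
    ]
```

With an all-zero flow map the output equals the input, so under this convention the target's features are left unchanged wherever no movement is requested.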

The embodiments described above may also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. The computer-readable medium may be any available medium that can be accessed by a computer, and may include both volatile and non-volatile media and both removable and non-removable media.

In addition, the computer-readable medium may include a computer storage medium. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be practiced in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

100: server 10: first terminal
20: second terminal 30: image modifying device
31: feature point acquisition unit 32: first encoder
33: second encoder 34: blender
35: decoder

Claims

An image transformation method comprising:
obtaining feature point (landmark) information from a face image of a user;
generating a user feature map from pose information including motion information and expression information of the face image;
generating, based on a face image of a target, a target feature map including pose information that includes motion information and expression information of the target, and a pose-normalized target feature map including information corresponding to unique features of the target excluding the pose information;
generating a mixed feature map by using the user feature map and the target feature map; and
generating a transformed image of the face image of the target by using the mixed feature map and the pose-normalized target feature map.

The method of claim 1,
wherein the pose information includes motion information and expression information of the face image, and
wherein the generating of the user feature map comprises generating the user feature map by inputting pose information corresponding to the face image of the user into an artificial neural network.

The method of claim 1,
wherein the feature point information includes location information corresponding to at least one of eyes, nose, mouth, eyebrows, and ears.

delete

The method of claim 1,
wherein the pose-normalized target feature map corresponds to an output obtained by inputting style information of the target into an artificial neural network.

The method of claim 1,
wherein the generating of the mixed feature map comprises generating the mixed feature map by inputting pose information of the face image of the user and style information of the face image of the target into an artificial neural network.

The method of claim 1,
wherein the style information includes at least one of texture information, color information, and shape information corresponding to the face image of the target.

The method of claim 1,
wherein the mixed feature map is generated such that a feature point of the target has pose information corresponding to a feature point of the user.

The method of claim 1,
wherein the transformed image has the identity of the target's face and the pose of the user's face.

A computer-readable recording medium in which a program for performing the method according to any one of claims 1 to 3 and 5 to 9 is recorded.

An image transforming device comprising:
a feature point acquisition unit that receives face images of a user and a target and acquires feature point information from each face image;
a first encoder that generates a user feature map from pose information including motion information and expression information of the user's face image;
a second encoder that generates, based on the face image of the target, a target feature map including pose information that includes motion information and expression information of the target, and a pose-normalized target feature map including information corresponding to unique features of the target excluding the pose information;
a blender that generates a mixed feature map by using the user feature map and the target feature map; and
a decoder that generates a transformed image of the face image of the target by using the mixed feature map and the pose-normalized target feature map.

The image transforming device of claim 11,
wherein the pose information includes motion information and expression information of the face image, and
wherein the first encoder generates the user feature map by inputting pose information corresponding to the user's face image into an artificial neural network.

The image transforming device of claim 11,
wherein the feature point information includes location information corresponding to at least one of eyes, nose, mouth, eyebrows, and ears.

delete

The image transforming device of claim 11,
wherein the pose-normalized target feature map corresponds to an output obtained by inputting style information of the target into an artificial neural network.

The image transforming device of claim 11,
wherein the blender generates the mixed feature map by inputting pose information of the user's face image and style information of the target's face image into an artificial neural network.

The image transforming device of claim 11,
wherein the style information includes at least one of texture information, color information, and shape information corresponding to the face image of the target.

The image transforming device of claim 11,
wherein the mixed feature map is generated such that a feature point of the target has pose information corresponding to a feature point of the user.

The image transforming device of claim 11,
wherein the transformed image has the identity of the target's face and the pose of the user's face.