KR102470866B1

KR102470866B1 - Retargeting method of 3D character facial expression and neural network learning method thereof

Info

Publication number: KR102470866B1
Application number: KR1020200183483A
Authority: KR
Inventors: 노준용; 김성현; 정선진; 서광균
Original assignee: 한국과학기술원
Priority date: 2020-12-24
Filing date: 2020-12-24
Publication date: 2022-11-28
Also published as: KR20220092086A

Abstract

The embodiment relates to a method for retargeting facial expressions of a 3D character and a method for training neural networks for this purpose. The retargeting method may include: receiving source blendshape weights corresponding to a first expression of a source face model; generating a source face image corresponding to the first expression of the source face model by rendering the source face model based on the source blendshape weights; generating a target face image corresponding to the first expression of the source face model by applying the source face image to a first neural network that generates images corresponding to a target face model; and estimating target blendshape weights corresponding to the first expression for the target face model by applying the target face image to a second neural network that estimates blendshape weights corresponding to expressions of the target face model.

Description

Method for retargeting facial expressions of a 3D character and method for training a neural network therefor

The embodiment relates to a method for retargeting facial expressions of a 3D character and a method for training a neural network for this purpose.

As entertainment industries such as film, animation, games, and broadcasting develop, computer graphics that enable realistic depiction are becoming increasingly important. Virtual characters created with computer graphics have reached a level at which they are difficult to distinguish from real people. Because natural facial animation is one of the most important elements in portraying a realistic character, several techniques have been introduced to make facial animation easier to produce.

There are two main ways to produce facial animation: motion capture and keyframing. In motion capture, data is acquired with equipment that tracks the body and facial movements of a real person, then refined and used as the facial animation values of a virtual character. In keyframing, by contrast, a skilled expert creates the animation by specifying facial blendshape weight values in dedicated animation software.

Because the facial blendshape structures of different character models generally do not match, animation data obtained in these ways can be applied to only one model. Applying the data to another model therefore requires either building a new model with the same blendshape structure or producing the animation again. Both options are costly: motion capture requires very expensive equipment and specialists who can refine the data, and creating new keyframe animation or a new character model likewise demands considerable time and expertise.

To solve these problems, retargeting techniques, which apply the animation data of a source character to a target model in a semantically consistent way, have been actively studied. However, existing approaches either require paired animation data for the source and target models or generate a new mesh and thus fail to preserve the existing blendshape structure.

Blendshapes are a technique that creates new facial expressions by linearly combining a neutral face mesh with face meshes of various expressions. Producing animation for a blendshape-based face model generally requires an expert or expensive capture equipment and takes a long time. Moreover, because different blendshape models usually have different structures, animation already created for one blendshape model is difficult to reuse on another.

Deformation-transfer-based retargeting transfers the expression changes of the source model to the target model so that the target model makes an expression that semantically matches that of the source model. However, because deformation transfer rebuilds the model mesh, the target model cannot keep its existing blendshape structure, which makes the transferred expression difficult to edit afterwards.

Given two different faces, face reenactment, which synthesizes a target face making the same expression as a source face, is applicable in fields such as VFX, dubbing, and virtual reality.

The following papers are related prior art.

J. Song, B. Choi, Y. Seol, and J. Noh, "Characteristic facial retargeting," Computer Animation and Virtual Worlds, vol. 22, no. 2-3, pp. 187-194, 2011.

J. Naruniec, L. Helminger, C. Schroers, and R. Weber, "High-resolution neural face swapping for visual effects," in Computer Graphics Forum, vol. 39, no. 4. Wiley Online Library, 2020, pp. 173-184.

The invention according to the embodiment aims to provide a new retargeting system that uses an autoencoder, so as to be applicable to facial reenactment, and that preserves the blendshape structure without requiring manually produced data pairs.

Through the invention according to the embodiment, first, retargeting is performed automatically, preserving the blendshape structure and without animation data pairs that have been manually matched for semantic correspondence; second, a new method is presented for training a blendshape inference network from images of a model by introducing a differentiable renderer; and third, animation retargeting is made possible by using face reenactment to compare rendered images of two faces in the 2D image domain.

By applying the invention according to the embodiment, even beginners without expert knowledge of facial animation can perform retargeting easily.

FIG. 1 is a flowchart illustrating the retargeting method according to an embodiment.
FIG. 2 is an overview of the pipeline to which the retargeting method is applied according to an embodiment.
FIG. 3 is a flowchart illustrating a method of training the neural networks for retargeting according to an embodiment.
FIG. 4 is an overview of the method of training the first neural network according to an embodiment.
FIG. 5 is an overview of the method of training the second neural network according to an embodiment.
FIG. 6 is a diagram showing retargeting results according to an embodiment.
FIG. 7 is a diagram illustrating the effect of the presence or absence of the rendering loss on the rendering result according to an embodiment.
FIG. 8 is a diagram showing the results of a comparative experiment according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. Because various changes may be made to the embodiments, the scope of rights of the patent application is not restricted or limited by them. All changes, equivalents, and substitutes of the embodiments should be understood to fall within the scope of rights.

The terms used in the embodiments are for description only and should not be construed as limiting. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art and, unless explicitly defined in this application, are not to be interpreted in an idealized or overly formal sense.

In the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure number, and redundant descriptions are omitted. Detailed descriptions of related known technology are omitted where they would unnecessarily obscure the gist of the embodiments.

Terms such as first, second, A, B, (a), and (b) may be used in describing the components of the embodiments. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. When a component is described as being "connected", "coupled", or "joined" to another component, it may be directly connected or joined to that component, but it should be understood that another component may also be "connected", "coupled", or "joined" between them.

A component having a function in common with a component of one embodiment is described using the same name in other embodiments. Unless stated otherwise, the description of one embodiment applies to other embodiments, and detailed descriptions are omitted where they would overlap.

FIG. 1 is a flowchart illustrating the retargeting method according to an embodiment.

In step 110, the device receives source blendshape weights corresponding to a first expression of the source face model.

In step 120, the device generates a source face image corresponding to the first expression of the source face model by rendering the source face model based on the source blendshape weights.

In an embodiment, the source face image generated in this way may be used to estimate the target face image and the target blendshape weights.

In an embodiment, to retarget an expression of the source face model to the target face model, a first neural network that generates a target face image from a source face image and a second neural network that infers target blendshape weights from a target face image may each be trained. With the trained first and second neural networks, semantically matching target blendshape weights can be obtained from the source blendshape weights.

In step 130, the device generates a target face image corresponding to the first expression of the source face model by applying the source face image to the first neural network, which generates images corresponding to the target face model.

In an embodiment, the source face image may be converted into the target face image using an autoencoder-based face reenactment method. The target face image output by the first neural network may therefore have the same expression as the source face image.

The first neural network according to the embodiment may include an autoencoder, a source decoder, and a target decoder trained so that, for each of the source face images and the target face images, the output equals the input.

In step 140, the device estimates target blendshape weights corresponding to the first expression for the target face model by applying the target face image to the second neural network, which estimates blendshape weights corresponding to expressions of the target face model.

The blendshape models of the source face model and the target face model may contain different numbers of blendshape weights. The trained second neural network allows the blendshape weights of the source face image and the blendshape weights of the generated target face image to be semantically equivalent.

In an embodiment, the second neural network may be trained using, as training data, target face images of the target face model and the target blendshape weights corresponding to the expression of each target face image. Depending on the embodiment, it may be trained so that the target blendshape weights inferred from images of the target face model equal the given target blendshape weights, and so that the target face images rendered with the inferred target blendshape weights equal the original target face images of the target face model.

FIG. 2 is an overview of the pipeline to which the retargeting method is applied according to an embodiment.

The source face image I_s is rendered from the source expression weights w_s, the texture T_s, and the rendering parameters p_s, and the generated target face image Î_t is obtained from I_s through the autoencoder. P takes Î_t as input and infers the blendshape weights ŵ_t = P(Î_t). Through this process, target expression weights ŵ_t that semantically match the source expression weights w_s can be obtained.
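The data flow of steps 120 to 140 in FIG. 1 can be sketched as one function composition. This is a minimal Python sketch under stated assumptions, not the patent's implementation: `render_source`, `encoder`, `target_decoder`, and `weight_net` are hypothetical callables standing in for the renderer R, the encoder E, the target decoder D_t, and the network P, and the toy stand-ins at the bottom exist only to show that the source and target weight vectors may differ in length.

```python
from typing import Callable
import numpy as np

Array = np.ndarray

def retarget(w_s: Array,
             render_source: Callable[[Array], Array],
             encoder: Callable[[Array], Array],
             target_decoder: Callable[[Array], Array],
             weight_net: Callable[[Array], Array]) -> Array:
    """Map source blendshape weights to target blendshape weights."""
    I_s = render_source(w_s)      # step 120: render the source face image
    z = encoder(I_s)              # encode with the shared encoder E
    I_t_hat = target_decoder(z)   # step 130: generated target face image
    return weight_net(I_t_hat)    # step 140: inferred target weights

# Toy stand-ins (hypothetical): 3 source weights, 2 target weights.
toy_render = lambda w: np.outer(w, np.ones(4))     # (3, 4) "image"
toy_encoder = lambda img: img.ravel()              # (12,) latent code
toy_decoder = lambda z: z.reshape(3, 4)            # back to a (3, 4) "image"
toy_weight_net = lambda img: img.mean(axis=1)[:2]  # 2 target weights

w_t_hat = retarget(np.array([0.2, 0.5, 1.0]),
                   toy_render, toy_encoder, toy_decoder, toy_weight_net)
```

In a real system each callable would be a trained network or a renderer; the point of the sketch is only that no paired source/target weight data appears anywhere in the inference path.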

The retargeting method proposed in the embodiment is divided broadly into a face reenactment step (130 in FIG. 1) and a blendshape weight inference step (140 in FIG. 1).

First, the source face image of the source face model may be rendered using the source blendshape weights given as input. The first neural network, which generates the target face image, may generate a target face image with the same expression from the rendered source face image.

In an embodiment, in estimating the target blendshape weights, the second neural network may be used to obtain, from the target face image generated by the first neural network, target blendshape weights that semantically match the blendshape weights of the source face model.

In an embodiment, an arbitrary expression in one frame may be represented by a blendshape weight vector w. The blendshape weights of an expression are denoted w_s for the source face model and w_t for the target face model. The elements of w_s and w_t do not correspond semantically to each other, and the two vectors may differ in length.

With blendshape weights, a vertex model containing vertex positions, normal vectors, and texture UV coordinates can be generated through V(w), a linear combination of the blendshapes of a model. The source and target vertex models are obtained as V_s(w_s) and V_t(w_t), respectively; for all symbols used in the embodiment, the subscripts s and t denote the source and the target.
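The linear combination V(w) can be illustrated for vertex positions. The sketch below assumes the common delta formulation, in which each blendshape contributes its offset from the neutral mesh scaled by its weight; the text only states that V(w) is a linear combination, so this particular form, the function name `blend_vertices`, and the array layout are assumptions. Normals and UV coordinates are left out.

```python
import numpy as np

def blend_vertices(neutral: np.ndarray,
                   targets: np.ndarray,
                   w: np.ndarray) -> np.ndarray:
    """Vertex positions for weights w (delta blendshape form, assumed).

    neutral: (N, 3) neutral-face vertex positions
    targets: (K, N, 3) vertex positions of the K expression meshes
    w:       (K,) blendshape weights
    """
    deltas = targets - neutral[None, :, :]            # (K, N, 3) offsets
    return neutral + np.tensordot(w, deltas, axes=1)  # (N, 3)

# Two-vertex toy model with two blendshapes.
neutral = np.zeros((2, 3))
targets = np.stack([np.full((2, 3), 1.0), np.full((2, 3), 4.0)])
half_first = blend_vertices(neutral, targets, np.array([0.5, 0.0]))
```

Because a source model with K_s blendshapes and a target model with K_t blendshapes have differently shaped `targets` arrays, a weight vector for one cannot simply be copied to the other, which is the mismatch the retargeting method addresses.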

In an embodiment, frontal images of the source face model and the target face model may be used to train the first neural network, and these may be obtained through a differentiable renderer R(·). The differentiable rendering process is expressed by Equation 1 below.

[Equation 1]

I = R(V(w), T, p)

The rendering process uses the vertex model V for the blendshape expression data w, the texture T, and the rendering parameters p. The texture T is mapped through the UV coordinates carried by the model vertices V. The rendering parameter p is a 10-dimensional vector consisting of the camera distance, elevation angle, and azimuth angle, the 3D position of the model, the scale of the model, and the position of the light. T and p each take a single value per model. Rendering yields the frontal face image I_s ∈ S of the source model and the frontal face image I_t ∈ T of the target character, where S is the domain of source face images and T is the domain of target face images (a set, distinct from the texture T). S and T are subsets of the domain F of humanoid face character images.
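The six listed quantities add up to the stated dimension if the camera distance, elevation, azimuth, and model scale are scalars and the two positions are 3D: 1 + 1 + 1 + 3 + 1 + 3 = 10. A small sketch of such a parameter container follows; the class name, field names, and the ordering of the entries in the vector are assumptions, since the text gives only the list of quantities.

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class RenderParams:
    """Per-model rendering parameters p (field order is an assumption)."""
    cam_distance: float
    cam_elevation: float
    cam_azimuth: float
    model_position: Tuple[float, float, float]
    model_scale: float
    light_position: Tuple[float, float, float]

    def to_vector(self) -> np.ndarray:
        # 3 camera scalars + 3 position + 1 scale + 3 light = 10 entries
        return np.array([self.cam_distance,
                         self.cam_elevation,
                         self.cam_azimuth,
                         *self.model_position,
                         self.model_scale,
                         *self.light_position], dtype=np.float64)

p_t = RenderParams(2.5, 0.0, 0.0, (0.0, 0.0, 0.0), 1.0, (0.0, 1.0, 2.0))
p_vec = p_t.to_vector()
```

Since T and p take a single value per model, one such object per face model (one for the source, one for the target) is enough for every frame.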

FIG. 3 is a flowchart illustrating a method of training the neural networks for retargeting according to an embodiment.

The training method according to the embodiment is divided into training the first neural network and training the second neural network.

In step 310, the device obtains a first training data set.

In an embodiment, the first training data set may include source face images corresponding to the source face model and target face images corresponding to the target face model, and it may be used to train the first neural network.

In step 320, the device trains the first neural network, using the first training data set, to output images of the target face model.

The method of training the first neural network is described in detail with reference to FIG. 4.

FIG. 4 is an overview of the method of training the first neural network according to an embodiment.

The first neural network according to the embodiment may be trained to convert a source face image I_s into a target face image Î_t using an autoencoder-based face reenactment method.

In an embodiment, the encoder E: F -> Z of the autoencoder may encode the input images I_s and I_t into the latent space Z. D_s: Z -> S is a decoder that takes a latent code z, encoded into Z by E, and reconstructs the source face image I_s; D_t: Z -> T is a decoder that takes z and reconstructs the target face image I_t.

In an embodiment, the encoder may include seven convolutional layers, two fully connected layers, and one pixel shuffle layer, and each decoder may include five convolutional layers and four pixel shuffle layers. The overall structure of the first neural network is shown in Table 1.

[Table 1]

Table 1 is an example of the structure of the first neural network, in which convolution filters are written as "k(#kernel size)s(#stride)". PS2 denotes a pixel shuffle layer with an upscale factor of 2. The two decoders, the source decoder D_s and the target decoder D_t, have the same structure.
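A pixel shuffle layer with upscale factor r rearranges a tensor of shape (C·r², H, W) into one of shape (C, H·r, W·r), trading channels for spatial resolution. The NumPy sketch below follows the usual sub-pixel convolution channel layout; the patent text does not state the channel ordering, so that layout and the function name `pixel_shuffle` are assumptions.

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r).

    Output pixel [c, h*r + i, w*r + j] is taken from
    input channel c*r*r + i*r + j at position [h, w].
    """
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    x = x.reshape(c_out, r, r, h, w)   # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)     # reorder to (c, h, i, w, j)
    return x.reshape(c_out, h * r, w * r)

x = np.arange(4 * 2 * 3).reshape(4, 2, 3)
y = pixel_shuffle(x, r=2)   # PS2: four channels become one 4x6 map
```

Each PS2 layer doubles height and width, so the four PS2 layers of a decoder alone contribute a factor of 2^4 = 16 per spatial dimension, independent of whatever the convolution strides in Table 1 do.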

As shown in FIG. 4, in the training stage the first neural network may be trained, like an ordinary autoencoder, with a reconstruction loss L_ae that drives the output to equal the input. Given face images I_s and I_t that need not correspond to each other, L_ae is the loss for reconstructing each of them and may be defined as in Equation 2 below.

[Equation 2]

In an embodiment, the autoencoder and the source decoder may be trained on the source face images, and the autoencoder and the target decoder may be trained on the target face images.
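The shared-encoder, two-decoder reconstruction objective can be sketched as follows. This is an illustration under assumptions: the mean absolute error used here is a stand-in, since the norm of L_ae is not given in the text, and `E`, `D_s`, and `D_t` are passed in as plain callables rather than trained networks.

```python
import numpy as np

def mean_abs_error(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean(np.abs(a - b)))

def reconstruction_loss(I_s, I_t, E, D_s, D_t) -> float:
    """Sum of the two self-reconstruction errors (L1 form is assumed).

    The source image goes through E and D_s, the target image through
    E and D_t; I_s and I_t do not have to show the same expression.
    """
    loss_source = mean_abs_error(D_s(E(I_s)), I_s)
    loss_target = mean_abs_error(D_t(E(I_t)), I_t)
    return loss_source + loss_target

# Toy check with identity maps in place of trained networks.
I_s = np.random.default_rng(0).random((8, 8))
I_t = np.random.default_rng(1).random((8, 8))
identity = lambda x: x
perfect = reconstruction_loss(I_s, I_t, identity, identity, identity)
biased = reconstruction_loss(I_s, I_t, identity, identity, lambda z: z + 1.0)
```

Because both decoders read from the same latent space Z, swapping the decoder at inference time (encoding I_s and decoding with D_t) is what produces a target face with the source expression.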

As shown in FIG. 2, the first neural network may use the autoencoder and the target decoder to retarget the expression of a source face image to the target face model. When a source face image I_s is input to the first neural network composed of E and D_t, trained on the source and target face images, it can output a target face image Î_t of the target face model with the same expression.

Returning to FIG. 3, in step 330 the device obtains a second training data set.

In an embodiment, the second training data set may include target face images corresponding to the target face model and target blendshape weights corresponding to the expression of each target face image.

In step 340, the device trains the second neural network, using the second training data set, to output the blendshape weights corresponding to an image of the target face model.

In an embodiment, the second neural network may be trained so that the input target blendshape weights equal the inferred target blendshape weights that the second neural network infers from the target face images, and so that the target face images equal the inferred target face images rendered from the inferred target blendshape weights.

The method of training the second neural network is described in detail with reference to FIG. 5.

In step 350, the device may build the neural network for retargeting by connecting the first neural network and the second neural network.

FIG. 5 is an overview of the method of training the second neural network according to an embodiment.

In an embodiment, the target face image Î_t obtained through the first neural network is an image, so obtaining blendshape weights ŵ_t applicable to the target face model requires inferring blendshape weights from the target face image.

Depending on the embodiment, a neural network P that infers blendshape weights may be trained in order to obtain the blendshape weights from the target face image Î_t.

The second neural network according to the embodiment consists of six convolutional layers and five fully connected layers. Its overall structure is shown in Table 2.

[Table 2]

In Table 2, the convolution filters in the structure of the second neural network are written as "k(#kernel size)s(#stride)".

As shown in FIG. 5, the image rendered using the blendshape weights ŵ_t inferred from the target image I_t can be expressed as R(V_t(ŵ_t), T_t, p_t).

In an embodiment, the target face images I_t of the target face model and the target blendshape weights w_t may be used as training data in the training stage of the second neural network. The goal of the second neural network is to infer w_t from the target face image I_t of the target model rendered with w_t.

As shown in FIG. 5, the loss of the second neural network is the blendshape weight loss, obtained from the difference between the blendshape weights w_t and the value ŵ_t = P(I_t) inferred by the network, and defined as in Equation 3 below:

[Equation 3]

L_w is the blendshape weight loss obtained from the difference between the blendshape weights w_t corresponding to the input target face image I_t and the value ŵ_t = P(I_t) inferred by the second neural network.

When training uses only $L_w$, it can proceed solely toward reducing the difference between the inferred blendshape weight $\hat{w}_t$ and the ground-truth value $w_t$, regardless of how much each weight actually affects the frontal face image. Therefore, a rendering loss $L_r$ may be defined by introducing a differentiable renderer $R(\cdot)$, so that the second neural network according to the embodiment focuses on learning the weight values that affect the frontal face image.

During training of the second neural network, $R(\cdot)$ receives the inferred weight $\hat{w}_t$ as input and renders the image $R(\hat{w}_t)$. The rendered image is compared with the ground-truth image $I_t$ to form the loss $L_r$. Using $L_r$, the second neural network can be guided so that the target face image of the target face model rendered with the inferred blendshape weight $\hat{w}_t$ matches the target face image $I_t$.
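The motivation for the rendering loss can be seen in a toy setting. The sketch below is not the embodiment's renderer: `render` is a hypothetical linear stand-in for a differentiable renderer (a neutral "image" plus weighted blendshape offsets), the mean absolute difference is an assumed norm, and combining the two losses as a weighted sum with `lambda_r` is also an assumption. The second blendshape is deliberately given no effect on the image, as if it moved geometry that is not visible from the front.

```python
def render(weights, neutral, deltas):
    """Toy linear 'renderer': neutral image plus weighted blendshape offsets."""
    image = list(neutral)
    for w, delta in zip(weights, deltas):
        for i, d in enumerate(delta):
            image[i] += w * d
    return image


def mean_abs(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def losses(w_true, w_pred, neutral, deltas, lambda_r=1.0):
    """Return (L_w, L_r, L_w + lambda_r * L_r) for one training sample."""
    l_w = mean_abs(w_true, w_pred)
    target_image = render(w_true, neutral, deltas)    # plays the role of I_t
    rendered_image = render(w_pred, neutral, deltas)  # plays the role of R(w_hat)
    l_r = mean_abs(target_image, rendered_image)
    return l_w, l_r, l_w + lambda_r * l_r


neutral = [0.0, 0.0, 0.0, 0.0]
deltas = [
    [1.0, 1.0, 0.0, 0.0],  # blendshape 0 changes two visible pixels
    [0.0, 0.0, 0.0, 0.0],  # blendshape 1 changes nothing visible (assumed)
]
w_true = [0.5, 0.5]

invisible_error = losses(w_true, [0.5, 0.0], neutral, deltas)
visible_error = losses(w_true, [0.0, 0.5], neutral, deltas)
```

Both predictions have the same weight loss, but only the error on the visible blendshape is penalized by the rendering term, so the combined loss ranks the visually wrong prediction as worse.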

FIG. 6 is a diagram illustrating retargeting results in an embodiment.

In an embodiment, the source face image (Rendered source) denotes the image of the source face model rendered using the source blendshape weight $w_s$, and $R(\hat{w}_t)$ and the target face image (Rendered target) each denote an image of the target face model rendered using the inferred blendshape weight $\hat{w}_t$.

In an embodiment, the translated image $D_t(E(I_s))$ can be obtained using the encoder $E$ and the decoder $D_t$ of the first neural network trained with $I_s$ and $I_t$. As shown in FIG. 6, the first neural network converts the input source face image $I_s$ into a target face image that has semantically the same expression.
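The routing through a shared encoder and two decoders can be sketched as follows. This is a structural illustration only: the class `SharedEncoderTranslator`, its method names, and the toy encoder and decoders are hypothetical, and real networks would be trained convolutional models rather than the hand-written functions used here so the example runs.

```python
class SharedEncoderTranslator:
    """One shared encoder E with a source decoder and a target decoder D_t."""

    def __init__(self, encoder, decoder_s, decoder_t):
        self.encoder = encoder
        self.decoder_s = decoder_s
        self.decoder_t = decoder_t

    def reconstruct_source(self, image_s):
        # Training route 1: a source image should be reproduced by the source decoder.
        return self.decoder_s(self.encoder(image_s))

    def reconstruct_target(self, image_t):
        # Training route 2: a target image should be reproduced by the target decoder.
        return self.decoder_t(self.encoder(image_t))

    def translate(self, image_s):
        # Inference route: D_t(E(I_s)), a source expression drawn as the target face.
        return self.decoder_t(self.encoder(image_s))


# Toy stand-ins (assumptions): an "image" is [base, base + expression],
# where base encodes the character's appearance.
def toy_encoder(image):
    return image[1] - image[0]  # expression code, independent of the character


def toy_decoder_s(code):
    return [0.25, 0.25 + code]  # source character appearance


def toy_decoder_t(code):
    return [0.75, 0.75 + code]  # target character appearance


net = SharedEncoderTranslator(toy_encoder, toy_decoder_s, toy_decoder_t)
source_image = [0.25, 0.5]
translated = net.translate(source_image)
```

Because both reconstruction routes pass through the same encoder, the code it produces has to describe the expression in a way that either decoder can use, which is what allows the source expression to be decoded as the target character.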

For training according to the embodiment, 14,532 source face images and 16,050 target face images were used as training data, and the image size is 128 × 128 × 3.

In an embodiment, the second neural network may be trained to infer the blendshape weight $w_t$ from $I_t$. Because the target face model used in the embodiment has a total of 52 blendshapes, the inference result $\hat{w}_t$ is a 52-dimensional vector. The translated image in the fourth column from the left of FIG. 6 is used as the input to $P$, which outputs a blendshape weight that can render a result such as the fifth column from the left. The results rendered by applying the output blendshape weight to the target face model are shown in the sixth and seventh columns from the left.

FIG. 7 is a diagram for explaining the effect of the presence or absence of the rendering loss on the rendering result in an embodiment.

FIG. 7 shows, from the left, the target face image used as input, the target face image rendered from the blendshape weight $\hat{w}_t$ predicted by the second neural network, and the difference between the input target face image and the predicted target face image.

To demonstrate how the presence or absence of the rendering loss $L_r$ affects the inference result of the neural network, the blendshape weight inference network was used to infer blendshape weights from 780 target model images that were not used for training. Images rendered with the inferred weights are compared with the input images for quantitative evaluation, and the results are shown in Table 3 below.

[Table 3]

Table 3 and FIG. 5 show how $L_r$ and $L_w$ affect the inference result when training the blendshape weight inference network. As Table 3 shows, good quantitative results are obtained when both $L_w$ and $L_r$ are used. In addition, FIG. 7 shows that using $L_w$ and $L_r$ together yields a smaller difference from the input around the eyes and mouth of the face model.

FIG. 8 is a diagram showing the results of a comparative experiment in an embodiment.

In an embodiment, the first neural network may be compared with UNIT, a widely used image translation method, to evaluate its performance. Like the first neural network used in the embodiment, UNIT shares a latent space, but it does not fully share the encoder. 14,532 source face images and 16,050 target face images of size 128 × 128 × 3 are used for training. UNIT was trained for 250,000 iterations with the default settings provided by its authors.

FIG. 8 shows the image translation results of the first neural network and UNIT. From the left, the first column shows the source face images, the second column shows the results of the first neural network proposed in the embodiment, and the third column shows the results of UNIT. Unlike the embodiment, the UNIT results fail to preserve the meaning of the expression in the regions marked in red.

In an embodiment, the retargeting method may be executed by a device (not shown). The device according to the embodiment may include a memory and a processor, and may include a program executed by the processor for retargeting. In an embodiment, the program may include the retargeting method described with reference to FIGS. 1 and 2.

In addition, an apparatus (not shown) for performing the method of training a neural network for retargeting may be provided. The apparatus according to the embodiment may include a memory and a processor, and may include a program executed by the processor to train the neural network. In an embodiment, the program may include the neural network training method.

The retargeting method according to the embodiment provides a new type of facial animation retargeting pipeline that can retarget without expression data in one-to-one correspondence between the source face model and the target face model. Using deep-learning-based face reenactment technology, a source face image is rendered from the input source blendshape weights and converted into a target face image, and a neural network trained for blendshape weight inference then infers blendshape weights from that image. In training the neural network for blendshape weight inference, a differentiable renderer is introduced so that the network focuses on the weights that strongly affect the front of the face, which improves the visual quality of the result. Through this training process, a pipeline can be designed that generates blendshape weights of the target face model from blendshape weight data of the source face model.
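The overall pipeline is a composition of three stages. The sketch below only shows that wiring, under stated assumptions: `retarget` and the three `toy_*` callables are hypothetical names, and the toy functions are placeholders that make the example runnable, not a real renderer or trained networks. The 52-element output mirrors the 52 blendshapes of the target model used in the embodiment.

```python
from typing import Callable, List

Weights = List[float]
Image = List[float]


def retarget(
    w_s: Weights,
    render_source: Callable[[Weights], Image],
    first_network: Callable[[Image], Image],
    second_network: Callable[[Image], Weights],
) -> Weights:
    """Map source blendshape weights to target blendshape weights."""
    source_image = render_source(w_s)           # render the source face model
    target_image = first_network(source_image)  # translate to a target face image
    return second_network(target_image)         # infer target blendshape weights


# Toy stand-ins so the sketch runs; none of these is a real renderer or network.
def toy_render_source(w: Weights) -> Image:
    return [sum(w)]


def toy_first_network(image: Image) -> Image:
    return [2.0 * pixel for pixel in image]


def toy_second_network(image: Image) -> Weights:
    return [image[0]] * 52  # one value per target blendshape


w_t_hat = retarget(
    [0.25, 0.5], toy_render_source, toy_first_network, toy_second_network
)
```

Passing the stages as callables keeps the source and target models decoupled: the only data exchanged between them are images, so no paired expression data between the two models is needed at this interface.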

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiment, and vice versa.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may command the processing device independently or collectively. Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

Although the embodiments have been described above with reference to a limited set of drawings, those of ordinary skill in the art can apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

A retargeting method performed by a device for retargeting, the method comprising:
receiving a source blendshape weight corresponding to a first expression of a source face model;
generating a source face image corresponding to the first expression of the source face model by rendering the source face model based on the source blendshape weight;
generating a target face image corresponding to the first expression of the source face model by applying the source face image to a first neural network that generates an image corresponding to a target face model; and
estimating a target blendshape weight corresponding to the first expression of the target face model by applying the target face image to a second neural network that estimates a blendshape weight corresponding to an expression of the target face model,
wherein an encoder and a first decoder are trained to receive source face images of the source face model as training data and to output the source face images of the source face model, and the encoder and a second decoder are trained to receive target face images of the target face model as training data and to output the target face images of the target face model, and
wherein the first neural network includes the encoder and the second decoder.

The retargeting method of claim 1,
wherein the first neural network is trained to receive target face images of the target face model as training data and to output the target face images of the target face model.

The retargeting method of claim 1,
wherein the second neural network is trained using, as training data, target face images of the target face model and target blendshape weights corresponding to the respective expressions of the target face images.

The retargeting method of claim 3,
wherein the second neural network is trained so that target blendshape weights inferred from the target face images of the target face model are the same as the target blendshape weights, and
so that target face images of the target face model rendered using the inferred target blendshape weights are the same as the target face images of the target face model.

The retargeting method of claim 1,
wherein generating the source face image comprises:
generating a vertex model by linearly combining blendshape weight vectors for expressions of the predetermined source face model; and
rendering the source face model using the vertex model.

The retargeting method of claim 1,
wherein the source face model and the target face model include a frontal image of a face.

A method of training a neural network, performed by an apparatus for training a neural network for retargeting, the method comprising:
obtaining a first training data set;
training a first neural network to output an image of a target face model using the first training data set;
obtaining a second training data set;
training a second neural network to output a blendshape weight corresponding to an image of the target face model using the second training data set; and
connecting the first neural network and the second neural network,
wherein the first training data set includes:
source face images corresponding to a source face model; and
target face images corresponding to the target face model.

(Deleted)

The method of claim 7,
wherein training the first neural network comprises:
training an autoencoder and a source decoder so that the input and output source face images are the same; and
training the autoencoder and a target decoder so that the input and output target face images are the same.

The method of claim 7,
wherein the second training data set includes target face images corresponding to the target face model and target blendshape weights corresponding to the expression of each of the target face images.

The method of claim 10,
wherein training the second neural network comprises:
training the second neural network so that the input target blendshape weights and inferred target blendshape weights, inferred from the target face images by the second neural network, are the same; and
training the second neural network so that the target face images and inferred target face images rendered from the inferred target blendshape weights are the same.

A computer program stored in a computer-readable medium and combined with hardware to execute the method of any one of claims 1 to 7 and 9 to 11.

A device for retargeting, comprising:
one or more processors;
a memory; and
one or more programs stored in the memory and configured to be executed by the one or more processors,
wherein the one or more programs perform:
receiving a source blendshape weight corresponding to a first expression of a source face model;
generating a source face image corresponding to the first expression of the source face model by rendering the source face model based on the source blendshape weight;
generating a target face image corresponding to the first expression of the source face model by applying the source face image to a first neural network that generates an image corresponding to a target face model; and
estimating a target blendshape weight corresponding to the first expression of the target face model by applying the target face image to a second neural network that estimates a blendshape weight corresponding to an expression of the target face model,
wherein an encoder and a first decoder are trained to receive source face images of the source face model as training data and to output the source face images of the source face model, and the encoder and a second decoder are trained to receive target face images of the target face model as training data and to output the target face images of the target face model, and
wherein the first neural network includes the encoder and the second decoder.

The device of claim 13,
wherein the first neural network is trained to receive target face images of the target face model as training data and to output the target face images of the target face model.

The device of claim 13,
wherein the second neural network is trained using, as training data, target face images of the target face model and target blendshape weights corresponding to the respective expressions of the target face images.

The device of claim 15,
wherein the second neural network is trained so that target blendshape weights inferred from the target face images of the target face model are the same as the target blendshape weights, and
so that target face images of the target face model rendered using the inferred target blendshape weights are the same as the target face images of the target face model.

An apparatus for training a neural network for retargeting, comprising:
one or more processors;
a memory; and
one or more programs stored in the memory and configured to be executed by the one or more processors,
wherein the one or more programs perform:
obtaining a first training data set;
training a first neural network to output an image of a target face model using the first training data set;
obtaining a second training data set;
training a second neural network to output a blendshape weight corresponding to an image of the target face model using the second training data set; and
connecting the first neural network and the second neural network,
wherein the first training data set includes:
source face images corresponding to a source face model; and
target face images corresponding to the target face model.

(Deleted)

The apparatus of claim 17,
wherein training the first neural network comprises:
training an autoencoder and a source decoder so that the input and output source face images are the same; and
training the autoencoder and a target decoder so that the input and output target face images are the same.

The apparatus of claim 17,
wherein the second training data set includes target face images corresponding to the target face model and target blendshape weights corresponding to the expression of each of the target face images.

The apparatus of claim 20,
wherein training the second neural network comprises:
training the second neural network so that the input target blendshape weights and inferred target blendshape weights, inferred from the target face images by the second neural network, are the same; and
training the second neural network so that the target face images and inferred target face images rendered from the inferred target blendshape weights are the same.