KR20210070169A

KR20210070169A - Method for generating a head model animation from a speech signal and electronic device implementing the same

Info

Publication number: KR20210070169A
Application number: KR1020200089852A
Authority: KR
Inventors: 이반 빅토로비취 글라지스토브; 일리아 이고리비취 크로토브; 자크시리크 너라노비취 너라노브; 이반 올리고비취 카라차로브; 알렉시 브로니슬라보비취 대니레비취; 알렉산드르 블래디스라보비취 시뮤틴
Original assignee: 삼성전자주식회사
Priority date: 2019-12-02
Filing date: 2020-07-20
Publication date: 2021-06-14
Also published as: RU2721180C1

Abstract

Disclosed are a method of generating an animated head model from a speech signal using an artificial intelligence (AI) model, and an electronic device implementing the same. The method, performed by the disclosed electronic device, comprises the steps of: obtaining feature information of a speech signal from the speech signal; using the AI model to obtain, from the feature information, a phoneme stream corresponding to the speech signal and a viseme stream corresponding to the phoneme stream; using the AI model to obtain animation curves of visemes included in the viseme stream; combining the phoneme stream and the viseme stream; and applying the animation curves to the combined phoneme and viseme streams to generate an animated head model. According to the present invention, an animated head model can be provided from a speech signal in real time with low latency and high quality.

Description

METHOD FOR GENERATING A HEAD MODEL ANIMATION FROM A SPEECH SIGNAL AND ELECTRONIC DEVICE IMPLEMENTING THE SAME

The present disclosure relates generally to a method of generating computer graphics and, more particularly, to a method of generating a head model animation from a speech signal using an artificial intelligence model, and to an electronic device implementing the method.

Today, augmented and virtual reality, which can produce an effect similar to the presence of a real person by animating characters that correspond to a user's avatar, are increasingly used. For example, a personalized three-dimensional (3D) head model can be created and used in a phone call or virtual chat, or displayed when speech is dubbed into another language.

This requires a technical solution for generating a head model animation from a speech signal. Such a solution should provide high-quality animation in real time and reduce the delay between reception of the speech signal and movement of the head model. It should also reduce the computing resources that these tasks consume. Such a solution can be provided using an artificial intelligence model.

In general, conventional head model animation techniques have the following problems.

- Training an artificial intelligence model generally requires heavy computation or large amounts of data that are difficult to obtain.

- Methods that describe facial movement based on two-dimensional landmarks generally produce very flat-looking animation because they lack three-dimensional information.

- Obtaining high-quality animation of a virtual character from human facial movement requires heavy computation because of differences in face shape.

- It is difficult to generalize animation data to the voice of an arbitrary user rather than that of a specific person.

- Animation models with high image quality have high latency.

Accordingly, there is a need for a technique that solves the above problems while providing at least one of the advantages described below.

An object of the present disclosure is to provide a method of generating a head model animation from a speech signal using an artificial intelligence model, which can provide the head model animation in real time with low latency and high quality, and an electronic device implementing the method.

In addition, some embodiments of the present disclosure may provide a method that uses widely available data to train the artificial intelligence model, and an electronic device implementing the method.

In addition, some embodiments of the present disclosure may provide a method of generating a head model animation from a speech signal using an artificial intelligence model, which generates a head model animation for an arbitrary voice or for an arbitrary character, and an electronic device implementing the method.

As a technical means for achieving the above technical objects, a first aspect of the present disclosure may provide a method of generating a head model animation from a speech signal. The method may include: obtaining feature information of the speech signal from the speech signal; obtaining, using an artificial intelligence model, a phoneme stream corresponding to the speech signal and a viseme stream corresponding to the phoneme stream from the feature information; obtaining, using the artificial intelligence model, animation curves of visemes included in the viseme stream; merging the phoneme stream and the viseme stream; and generating a head model animation by applying the animation curves to the visemes of the merged phoneme and viseme stream.

In addition, a second aspect of the present disclosure may provide a method of training an artificial intelligence model for generating a head model animation from a speech signal. The method may include: obtaining a training data set including a speech signal, text corresponding to the speech signal, and a video signal corresponding to the speech signal; inputting the speech signal into the artificial intelligence model to obtain a first phoneme stream output from the artificial intelligence model, a viseme stream corresponding to the first phoneme stream, and animation curves of visemes included in the viseme stream; calculating a phoneme stream formation function for the artificial intelligence model using the first phoneme stream and the text of the speech signal; calculating a viseme stream formation function and an animation curve formation function for the artificial intelligence model using the viseme stream, the animation curves, and the video signal; calculating a phoneme normalization function for the artificial intelligence model based on the first phoneme stream, the viseme stream, and the animation curves; and updating the artificial intelligence model using the phoneme stream formation function, the viseme stream formation function, the animation curve formation function, and the phoneme normalization function.

In addition, a third aspect of the present disclosure may provide an electronic device that includes a memory storing one or more instructions and at least one processor, wherein the at least one processor executes the one or more instructions to perform the method of generating an animated head model from a speech signal.

In addition, a fourth aspect of the present disclosure may provide an electronic device that includes a memory storing one or more instructions and at least one processor, wherein the at least one processor executes the one or more instructions to perform the method of training an artificial intelligence model for generating a head model animation from a speech signal.

FIG. 1 is a schematic diagram of a system for generating a head model animation from a speech signal, according to various embodiments.
FIG. 2 is a schematic diagram of an artificial intelligence model for generating a head model animation from a speech signal, according to various embodiments.
FIG. 3 is a block diagram of a first artificial intelligence model and a second artificial intelligence model, according to various embodiments.
FIG. 4A illustrates the structure of a spatial feature extraction layer according to an embodiment.
FIG. 4B illustrates the structures of a phoneme prediction layer and a viseme prediction layer according to an embodiment.
FIG. 5 is a flowchart of a method of generating a head model animation from a speech signal, according to various embodiments.
FIG. 6 is a schematic diagram of a learning unit that trains an artificial intelligence model for generating an animated head model from a speech signal, according to various embodiments.
FIG. 7 is a schematic diagram of calculating a viseme stream formation function and an animation curve formation function from a speech signal and a video signal corresponding to the speech signal, according to various embodiments.
FIG. 8 is a flowchart of a method of training an artificial intelligence model for generating an animated head model from a speech signal, according to various embodiments.
FIG. 9 is a block diagram of an electronic device configured to generate a head model animation from a speech signal, according to various embodiments.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily practice them. The present disclosure may, however, be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and like reference numerals denote like parts throughout the specification.

To aid understanding, the following description includes various specific details, but these details are to be regarded as merely illustrative. Accordingly, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described below without departing from the scope of the present disclosure. In addition, descriptions of well-known functions and structures may be omitted for clarity and conciseness.

The terms used in this specification are briefly described first, and the present disclosure is then described in detail.

The terms used in the present disclosure are, as far as possible, general terms in wide current use, selected in view of their functions in the present disclosure; however, they may vary according to the intention of those skilled in the art, precedent, the emergence of new technology, and the like. In certain cases a term has been arbitrarily selected by the applicant, and in such cases its meaning is set out in detail in the corresponding part of the description. Therefore, the terms used in the present disclosure should be defined based on their meaning and on the overall content of the present disclosure, not simply on their names.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise. In addition, terms such as "…unit" and "module" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Functions related to artificial intelligence according to the present disclosure operate through a processor and a memory. The processor may consist of one or more processors. The one or more processors may be a general-purpose processor such as a CPU, an AP, or a digital signal processor (DSP), a graphics-dedicated processor such as a GPU or a vision processing unit (VPU), or an AI-dedicated processor such as an NPU. The one or more processors control the processing of input data according to a predefined operation rule or an artificial intelligence model stored in the memory. Alternatively, when the one or more processors are AI-dedicated processors, they may be designed with a hardware structure specialized for processing a specific artificial intelligence model.

In the method of generating an animated head model from a speech signal according to the present disclosure, an artificial intelligence model may be used to infer or predict the head model animation corresponding to the speech signal. The processor may preprocess the speech signal data to convert it into a form suitable for use as input to the artificial intelligence model.

The artificial intelligence model may be processed by an AI-dedicated processor designed with a hardware structure specialized for processing artificial intelligence models. The artificial intelligence model may be created through training. Here, being created through training means that a basic artificial intelligence model is trained by a learning algorithm on a large amount of training data, so that a predefined operation rule or artificial intelligence model configured to perform a desired characteristic (or purpose) is created.

The artificial intelligence model may consist of a plurality of neural network layers. Each layer has a plurality of weight values and performs its neural network operation by combining the output of the previous layer with those weights. The weights may be optimized by training the artificial intelligence model. For example, the weights may be updated so that a loss value or cost value obtained from the artificial intelligence model during training is reduced or minimized. The artificial neural network may include a deep neural network (DNN); examples include, but are not limited to, a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), and deep Q-networks.
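The weight update described above can be illustrated with a minimal Python sketch. It is not the disclosed model: it assumes a single linear unit with two weights, a mean squared error loss, and plain gradient descent. The names `train_linear` and `mse`, the learning rate, and the sample data are hypothetical choices made only for illustration.

```python
def mse(samples, w, b):
    """Mean squared error of the prediction w * x + b over (x, y) samples."""
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)


def train_linear(samples, lr=0.05, epochs=200):
    """Update the weights w and b step by step so that the loss decreases."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        # Move each weight against its gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for this example).
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
initial_loss = mse(samples, 0.0, 0.0)
w, b = train_linear(samples)
final_loss = mse(samples, w, b)
```

After training, `final_loss` is far smaller than `initial_loss`, which is the sense in which the weights are "updated so that a loss value is reduced or minimized".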

Inference and prediction is a technique for judging information and logically inferring and predicting from it, and includes knowledge-based or probability-based reasoning, optimization prediction, preference-based planning, recommendation, and the like.

A phoneme is the smallest unit of sound, as perceived by a user, that distinguishes one word from another. A viseme is a unit representing a distinguishable lip shape associated with one or more phonemes. In general, phonemes and visemes do not correspond one-to-one, which means that different speech signals may correspond to the same face shape.
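The many-to-one relation between phonemes and visemes can be illustrated with a short Python sketch. The table below is a hypothetical simplification: the phoneme symbols and viseme labels are invented for this example, and a fixed lookup table stands in for the prediction that the disclosure assigns to the artificial intelligence model. It relies only on the familiar observation that the sounds /p/, /b/, and /m/ are all produced with closed lips.

```python
# Hypothetical phoneme-to-viseme table (labels are illustrative only).
PHONEME_TO_VISEME = {
    "p": "lips_closed",
    "b": "lips_closed",
    "m": "lips_closed",
    "f": "lip_to_teeth",
    "v": "lip_to_teeth",
    "aa": "jaw_open",
}


def to_viseme_stream(phonemes):
    """Map each phoneme in a sequence to its viseme label."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


# Two different phoneme sequences that produce the same viseme sequence.
stream_a = to_viseme_stream(["b", "aa", "m"])
stream_b = to_viseme_stream(["p", "aa", "b"])
```

Here `stream_a` and `stream_b` are identical even though the phoneme sequences differ, which is why lip shape alone cannot identify the sound being spoken.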

Hereinafter, various embodiments of the present disclosure are described in more detail with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of a system for generating a head model animation from a speech signal, according to various embodiments.

Referring to FIG. 1, the system for generating a head model animation from a speech signal may include an electronic device 100.

The electronic device 100 may be a device for generating a head model animation from a speech signal using an artificial intelligence model 110. The electronic device 100 may be a device for training the artificial intelligence model 110 using a training data set 200. According to various embodiments, the electronic device 100 may include the artificial intelligence model 110, an animation generator 120, and a learning unit 150 that trains the artificial intelligence model.

The electronic device 100 may receive a speech signal and pass it to the artificial intelligence model 110 as input. The speech signal may be received from any available source, such as the Internet, TV or radio broadcasts, smartphones, mobile phones, voice recorders, desktop computers, and laptops. In an embodiment, the speech signal may be received in real time by an input unit (not shown), such as a microphone, included in the electronic device 100. In another embodiment, the speech signal may be received from an external electronic device over a network by a communication unit (not shown) included in the electronic device 100. In another embodiment, the speech signal may be obtained from audio data stored in a memory or storage device of the electronic device 100.

The electronic device 100 may receive the training data set 200 and pass it to the learning unit 150 as input. The training data set 200 may consist of a speech signal, text corresponding to the speech signal, and a video signal corresponding to the speech signal. The speech signal, the text, and the video signal may include recordings of various people with different face shapes. The training data set 200 may be provided to the learning unit 150 so that the learning unit 150 can train the artificial intelligence model 110. The training data set 200 may be stored in a memory or storage device within the electronic device 100. Alternatively, the training data set 200 may be stored in a storage device external to the electronic device 100.

The artificial intelligence model 110 may derive, from the speech signal, parameters for generating the head model animation.

The artificial intelligence model 110 may preprocess the speech signal and convert it into feature information that represents the characteristics of the speech signal. In various embodiments, the artificial intelligence model 110 may obtain, from the speech signal, feature coefficients that represent the characteristics of the speech signal.
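The preprocessing step can be illustrated with a minimal Python sketch that turns a raw sample sequence into one coefficient per short frame. This passage does not name the type of feature coefficients, so the sketch assumes a deliberately simple one, the log energy of each frame. The function names, the 16 kHz sampling rate, the 25 ms frame length, and the 10 ms hop are all assumptions made for this example.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into overlapping fixed-length frames."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def log_energy(frame):
    """Log of the mean squared amplitude; a small floor avoids log(0)."""
    return math.log(sum(s * s for s in frame) / len(frame) + 1e-10)


def extract_features(samples, frame_len=400, hop=160):
    """Return one coefficient per frame (assumed: 25 ms frames, 10 ms hop at 16 kHz)."""
    return [log_energy(f) for f in frame_signal(samples, frame_len, hop)]


# One second of a 440 Hz tone sampled at 16 kHz, used as a stand-in for speech.
SAMPLE_RATE = 16000
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
features = extract_features(tone)
```

A practical system would normally compute a richer vector per frame, but the framing structure (a fixed window advanced by a fixed hop) stays the same.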

The artificial intelligence model 110 may receive the feature information extracted from the speech signal and output a phoneme stream corresponding to the speech signal and a viseme stream corresponding to the phoneme stream.

The artificial intelligence model 110 may output animation curves of the visemes included in the viseme stream. An animation curve represents the variation over time of an animation parameter related to the movement of the head model. In an embodiment, the animation curve may specify the movement of facial landmarks in each viseme animation and the duration of the viseme animation.
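An animation curve of this kind can be illustrated with a minimal Python sketch. It assumes that a curve is stored as a list of (time, value) keyframes with strictly increasing times and that values between keyframes are obtained by linear interpolation. The representation, the function name `sample_curve`, and the example numbers are hypothetical; the disclosure does not fix a curve format in this passage.

```python
from bisect import bisect_right


def sample_curve(keyframes, t):
    """Linearly interpolate (time, value) keyframes at time t.

    Times must be strictly increasing. Outside the keyframe range the
    first or last value is held.
    """
    times = [time for time, _ in keyframes]
    if t <= times[0]:
        return keyframes[0][1]
    if t >= times[-1]:
        return keyframes[-1][1]
    i = bisect_right(times, t)  # index of the first keyframe after t
    (t0, v0), (t1, v1) = keyframes[i - 1], keyframes[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical curve for one viseme: the parameter rises to full strength
# in 0.1 s and falls back to rest by 0.3 s, so the duration is 0.3 s.
curve = [(0.0, 0.0), (0.1, 1.0), (0.3, 0.0)]
duration = curve[-1][0] - curve[0][0]
rising = sample_curve(curve, 0.05)
peak = sample_curve(curve, 0.1)
falling = sample_curve(curve, 0.2)
```

In this representation the keyframe values describe how strongly the parameter is applied, and the span between the first and last keyframe gives the duration of the viseme animation.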

In various embodiments, the artificial intelligence model 110 may include a first artificial intelligence model that obtains, from the coefficients, a phoneme stream corresponding to the speech signal and a viseme stream corresponding to the phoneme stream, and a second artificial intelligence model that obtains animation curves of the visemes included in the viseme stream.

In various embodiments, the artificial intelligence model 110 may include one or more numerical parameters and functions for deriving the phoneme stream, the viseme stream, and the animation curves from the speech signal. The numerical parameters may be the weights of the neural network layers that make up the artificial intelligence model 110. In various embodiments, the numerical parameters and functions may be determined or updated based on the data on which the artificial intelligence model 110 is trained.

In an embodiment, the artificial intelligence model 110 may predict the phoneme stream from the speech signal based on a phoneme stream formation function. The artificial intelligence model 110 may predict the viseme stream from the speech signal based on a viseme stream formation function. The artificial intelligence model 110 may select one viseme from among a plurality of visemes corresponding to a phoneme based on a phoneme normalization function. The artificial intelligence model 110 may derive the animation curves of the visemes in the viseme stream based on an animation curve formation function.

The artificial intelligence model 110 may post-process the phoneme stream and the viseme stream. In an embodiment, the artificial intelligence model 110 may merge the phoneme stream and the viseme stream by overlaying them, taking the animation curves into account.
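One possible reading of the overlay step is sketched below in Python. This passage does not specify how the two streams are overlaid, so the sketch is an assumption throughout: each stream is taken to be a list of (start, end, label) segments on a common time axis, and merging attaches to every viseme segment the phonemes whose time spans overlap it. The function name `merge_streams` and the segment timings are hypothetical, and the role of the animation curves in the merge is not modeled.

```python
def merge_streams(phonemes, visemes):
    """Attach to each viseme segment the phonemes that overlap it in time.

    Both inputs are lists of (start, end, label) tuples in seconds.
    """
    merged = []
    for v_start, v_end, viseme in visemes:
        overlapping = [
            label
            for p_start, p_end, label in phonemes
            # Two half-open intervals overlap if each starts before the other ends.
            if p_start < v_end and p_end > v_start
        ]
        merged.append(
            {"start": v_start, "end": v_end, "viseme": viseme, "phonemes": overlapping}
        )
    return merged


# Hypothetical streams; the viseme boundaries need not match the phoneme boundaries.
phoneme_stream = [(0.0, 0.1, "b"), (0.1, 0.3, "aa"), (0.3, 0.4, "m")]
viseme_stream = [(0.0, 0.12, "lips_closed"), (0.12, 0.3, "jaw_open"), (0.3, 0.4, "lips_closed")]
merged = merge_streams(phoneme_stream, viseme_stream)
```

The merged result keeps one entry per viseme segment, so an animation curve can later be applied per viseme while the associated phonemes remain available.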

The artificial intelligence model 110 may be obtained from any available source, such as the Internet, a desktop computer, or a laptop, and may be stored in the memory of the electronic device 100. In an embodiment, the artificial intelligence model 110 may be pre-trained using at least part of the data included in the training data set 200. The artificial intelligence model 110 may be updated using the training data set 200 according to the learning algorithm of the learning unit 150.

The animation generator 120 may apply the parameters obtained from the artificial intelligence model 110 to a head model to generate a head model animation corresponding to the speech signal.

The animation generator 120 may obtain the merged phoneme and viseme stream and the animation curves from the artificial intelligence model 110. The animation generator 120 may generate the head model animation by applying the animation curves to the visemes included in the merged phoneme and viseme stream.

In various embodiments, the animation generator 120 may generate the head model animation based on a predefined head model. In an embodiment, the predefined head model may be any 3D character model based on the Facial Action Coding System (FACS). FACS is a system for classifying human facial movements. Using FACS, any facial expression can be encoded by decomposing it into specific action units and their temporal segments. For example, in the predefined head model, each viseme may be defined by FACS coefficients.
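Defining a viseme by FACS coefficients can be illustrated with a minimal Python sketch. It assumes that a viseme is a mapping from action-unit codes to intensities between 0 and 1 and that an intermediate face pose is a weighted mix of two such mappings. The action-unit codes used here (AU24 lip pressor, AU25 lips part, AU26 jaw drop) are standard FACS codes, but the viseme labels, the intensity values, and the `blend` function are hypothetical.

```python
# Hypothetical viseme definitions as FACS action-unit intensities in [0, 1].
VISEMES = {
    "lips_closed": {"AU24": 0.8},
    "jaw_open": {"AU25": 0.9, "AU26": 0.7},
}


def blend(viseme_a, viseme_b, weight):
    """Mix two visemes: weight 0.0 gives viseme_a, weight 1.0 gives viseme_b.

    Action units missing from one viseme are treated as intensity 0.0.
    """
    units = sorted(set(viseme_a) | set(viseme_b))
    return {
        unit: (1.0 - weight) * viseme_a.get(unit, 0.0) + weight * viseme_b.get(unit, 0.0)
        for unit in units
    }


# Pose halfway through the transition from closed lips to an open jaw.
halfway = blend(VISEMES["lips_closed"], VISEMES["jaw_open"], 0.5)
```

Because the pose is expressed in action units rather than in vertices of a particular mesh, the same coefficients can in principle drive any character model that exposes FACS-based controls.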

In an embodiment, the animation generator 120 may determine a viseme set of the predefined head model based on the merged phoneme and viseme stream. The animation generator 120 may generate the head model animation by applying the animation curves to the viseme set of the predefined head model.

The learning unit 150 may train the artificial intelligence model 110 using the training data set 200.

The learning unit 150 may obtain the training data set 200, which includes a speech signal, text corresponding to the speech signal, and a video signal corresponding to the speech signal. The learning unit 150 may input the training data set 200 into the artificial intelligence model 110 to obtain the phoneme stream, the viseme stream, and the animation curves output from the artificial intelligence model 110.

학습부(150)는 인공지능 모델(110)에 의해 학습 데이터 세트(200)의 음성 신호로부터 생성된 음소 스트림을 학습 데이터 세트(200)의 텍스트와 비교하여 평가하고, 평가에 기초하여 인공지능 모델(110)을 갱신할 수 있다. 다양한 실시예들에서, 학습부(150)는 상기 제1 음소 스트림과 상기 텍스트를 이용하여, 인공지능 모델(110)을 위한 음소 스트림 형성 함수를 계산할 수 있다. The learning unit 150 compares and evaluates the phoneme stream generated from the speech signal of the training data set 200 by the artificial intelligence model 110 with the text of the training data set 200, and based on the evaluation, the AI model (110) can be updated. In various embodiments, the learning unit 150 may calculate a phoneme stream forming function for the artificial intelligence model 110 using the first phoneme stream and the text.

학습부(150)는 인공지능 모델(110)에 의해 학습 데이터 세트(200)의 음성 신호로부터 생성된 3D 헤드 모델 애니메이션을 학습 데이터 세트(200)의 비디오 신호와 비교하여 평가하고, 평가에 기초하여 인공지능 모델(110)을 갱신할 수 있다. 다양한 실시예들에서, 학습부(150)는 상기 비짐 스트림, 상기 애니메이션 곡선 및 상기 비디오 신호를 이용하여, 인공지능 모델(110)을 위한 비짐 스트림 형성 함수 및 애니메이션 곡선 형성 함수를 계산할 수 있다. 다양한 실시예들에서, 학습부(150)는 상기 제1 음소 스트림, 상기 비짐 스트림, 및 상기 애니메이션 곡선에 기초하여, 인공지능 모델(110)의 음소 정규화 함수를 계산할 수 있다. The learning unit 150 may evaluate the 3D head model animation that the artificial intelligence model 110 generates from the voice signal of the training data set 200 by comparing it with the video signal of the training data set 200, and may update the artificial intelligence model 110 based on the evaluation. In various embodiments, the learning unit 150 may calculate a viseme stream forming function and an animation curve forming function for the artificial intelligence model 110 by using the viseme stream, the animation curve, and the video signal. In various embodiments, the learning unit 150 may calculate a phoneme normalization function of the artificial intelligence model 110 based on the first phoneme stream, the viseme stream, and the animation curve.

다양한 실시예들에서, 학습부(150)는 상기 음소 스트림 형성 함수, 상기 비짐 스트림 형성 함수, 상기 애니메이션 곡선 형성 함수, 및 상기 음소 정규화 함수를 이용하여, 상기 인공지능 모델을 갱신할 수 있다.In various embodiments, the learning unit 150 may update the AI model by using the phoneme stream formation function, the viseme stream formation function, the animation curve formation function, and the phoneme normalization function.
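One way to picture an update that uses all four functions together is a weighted sum of four scalar loss terms. This is only a sketch under that assumption: the disclosure does not state how the four functions are combined, and the function name `total_loss` and the default weights are hypothetical.

```python
from typing import Sequence


def total_loss(
    phoneme_loss: float,
    viseme_loss: float,
    curve_loss: float,
    phoneme_reg: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0, 0.1),  # hypothetical weights
) -> float:
    """Combine the four training terms into one scalar objective.

    Assumption: a plain weighted sum; the patent does not specify the rule.
    """
    terms = (phoneme_loss, viseme_loss, curve_loss, phoneme_reg)
    return sum(w * t for w, t in zip(weights, terms))
```

In a gradient-based setup, a single scalar of this kind would be what the optimizer minimizes when updating the shared layers.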

전자 장치(100)는 스마트폰, 태블릿 PC, PC, 스마트 TV, 휴대폰, PDA(personal digital assistant), 랩톱, 미디어 플레이어, 마이크로 서버, GPS(global positioning system) 장치, 전자책 단말기, 디지털방송용 단말기, 네비게이션, 키오스크, MP3 플레이어, 디지털 카메라, 가전기기 및 기타 모바일 또는 비모바일 컴퓨팅 장치일 수 있으나, 이에 제한되지 않는다. The electronic device 100 may be, but is not limited to, a smartphone, a tablet PC, a PC, a smart TV, a mobile phone, a personal digital assistant (PDA), a laptop, a media player, a micro server, a global positioning system (GPS) device, an e-book terminal, a digital broadcasting terminal, a navigation device, a kiosk, an MP3 player, a digital camera, a home appliance, or another mobile or non-mobile computing device.

예시적으로 전자 장치(100)가 하나의 장치로써 도 1에 도시되어 있으나, 반드시 이에 한정되는 것은 아니다. 전자 장치(100)는 기능적으로 연결되어 상술한 동작들을 수행하는 하나 이상의 물리적으로 분리된 장치들의 집합일 수 있다.For example, although the electronic device 100 is illustrated in FIG. 1 as one device, it is not necessarily limited thereto. The electronic device 100 may be a set of one or more physically separated devices that are functionally connected and perform the above-described operations.

도 2는 다양한 실시예들에 따른, 음성 신호로부터 헤드 모델 애니메이션을 생성하기 위한 인공지능 모델의 개요도이다. FIG. 2 is a schematic diagram of an artificial intelligence model for generating a head model animation from a voice signal, according to various embodiments.

도 2를 참조하면, 인공지능 모델(110)은 전처리부(210), 제1 인공지능 모델(220), 제2 인공지능 모델(230), 및 후처리부(240)를 포함할 수 있다.Referring to FIG. 2 , the artificial intelligence model 110 may include a preprocessor 210 , a first artificial intelligence model 220 , a second artificial intelligence model 230 , and a postprocessor 240 .

전처리부(210)는 헤드 모델 애니메이션의 생성을 위해 음성 신호가 이용될 수 있도록 음성 신호를 전처리할 수 있다. 전처리부(210)는 제1 인공지능 모델(220)이 헤드 모델 애니메이션을 생성하기 위하여 획득된 음성 신호를 이용할 수 있도록, 획득된 음성 신호를 기 설정된 포맷으로 가공할 수 있다. The preprocessor 210 may preprocess the voice signal so that the voice signal can be used to generate the head model animation. The preprocessor 210 may process the acquired voice signal into a preset format so that the first artificial intelligence model 220 may use the acquired voice signal to generate the head model animation.

다양한 실시예들에서, 전처리부(210)는 음성 신호를 전처리하여 음성 신호의 특성을 나타내는 특성 계수들로 변환할 수 있다. 상기 특성 계수들은 제1 인공지능 모델(220)에 입력되어 음성 신호에 대응되는 음소 스트림 및 비짐 스트림을 예측하기 위하여 사용될 수 있다.In various embodiments, the preprocessor 210 may preprocess the voice signal and convert it into characteristic coefficients indicating characteristics of the voice signal. The characteristic coefficients may be input to the first artificial intelligence model 220 and used to predict a phoneme stream and a viseme stream corresponding to a voice signal.

일 실시예에서, 전처리부(210)는 MFCC(Mel-Frequency Cepstral Coefficients) 방법에 의하여 음성 계수를 변환하여 상기 특성 계수들을 획득할 수 있다. MFCC는 소리의 단구간 스펙트럼을 분석하여 특징을 추출하는 기법으로써, 특성 계수들은 로그 파워 스펙트럼을 주파수의 비선형 Mel 스케일로 선형 코사인 변환하여 획득될 수 있다. MFCC 방법은 발화자 및 녹음 조건에 따른 변동성에 크게 영향을 받지 않으며, 별도의 학습 과정을 필요로 하지 않고 계산 속도가 빠르다. MFCC 방법은 기술분야에서 알려져 있으므로, 이에 대한 상세한 설명은 생략한다. 다른 실시예에서, 상기 특성 계수들은 다른 음성 특성 추출 방법, 예를 들어 지각적 선형 예측 (Perceptual Linear Prediction) 또는 Body Linear Predictive Codes 등의 방법을 사용하여 획득될 수 있다. In an embodiment, the preprocessor 210 may obtain the characteristic coefficients by transforming the voice signal using the Mel-Frequency Cepstral Coefficients (MFCC) method. MFCC is a technique that extracts features by analyzing the short-term spectrum of a sound; the characteristic coefficients may be obtained by applying a linear cosine transform to the log power spectrum on a nonlinear mel frequency scale. The MFCC method is not significantly affected by variability across speakers and recording conditions, requires no separate training process, and is fast to compute. Since the MFCC method is known in the art, a detailed description is omitted. In another embodiment, the characteristic coefficients may be obtained using another voice feature extraction method, for example Perceptual Linear Prediction or Body Linear Predictive Codes.
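The last stage of a typical MFCC computation can be sketched as follows. This is not the patent's implementation: it assumes the mel filterbank energies of one frame have already been computed (framing, windowing, FFT, and filterbank are omitted), uses one common variant of the Hz-to-mel formula, and all function names are hypothetical.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """Map a frequency in Hz to the mel scale (one common formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def dct_ii(x: List[float], n_coeffs: int) -> List[float]:
    """Unnormalized DCT-II, the 'linear cosine transform' step of MFCC."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
        for k in range(n_coeffs)
    ]


def mfcc_from_mel_energies(mel_energies: List[float], n_coeffs: int = 13) -> List[float]:
    """Take the log of the per-band mel energies of one frame, then the DCT."""
    log_energies = [math.log(e + 1e-10) for e in mel_energies]
    return dct_ii(log_energies, n_coeffs)
```

The choice of 13 coefficients is a frequent default in speech processing, not a value taken from the disclosure.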

다른 실시예에서, 전처리부(210)는 다른 사전 학습된 인공지능 모델에 상기 음성 신호를 입력하여 상기 특성 계수들을 획득할 수 있다. 상기 추가 사전 학습된 인공지능 수단은 순환 신경망, 장단기 메모리 (Long Short-Term Memory, LSTM), 게이트 순환 유닛 (gated recurrent unit, GRU), 이들의 변형, 및 이들의 임의의 조합 중 적어도 하나일 수 있다. In another embodiment, the preprocessor 210 may obtain the characteristic coefficients by inputting the voice signal to another pre-trained artificial intelligence model. The other pre-trained artificial intelligence model may be at least one of a recurrent neural network, a long short-term memory (LSTM), a gated recurrent unit (GRU), a variant thereof, and any combination thereof.

제1 인공지능 모델(220)은 전처리부(210)로부터 제공받은 전처리된 음성 신호로부터, 상기 음성 신호에 대응하는 음소 스트림 및 상기 음소 스트림에 대응하는 비짐(viseme) 스트림을 출력할 수 있다. 일 실시예에서, 제1 인공지능 모델(220)은 컨볼루션 신경망, 순환 신경망, 장단기 메모리 (Long Short-Term Memory, LSTM), 게이트 순환 유닛 (gated recurrent unit, GRU), 이들의 변형, 또는 이들의 임의의 조합 중 적어도 하나일 수 있다. The first artificial intelligence model 220 may output, from the preprocessed voice signal provided by the preprocessor 210, a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream. In an embodiment, the first artificial intelligence model 220 may be at least one of a convolutional neural network, a recurrent neural network, a long short-term memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or any combination thereof.

다양한 실시예들에서, 제1 인공지능 모델(220)은 음소 스트림 형성 함수에 기초하여 음성 신호의 특성 계수로부터 음소 스트림을 예측할 수 있다. 다양한 실시예들에서, 제1 인공지능 모델(220)은 비짐 스트림 형성 함수에 기초하여 음성 신호의 특성 계수로부터 음소 스트림에 대응되는 비짐 스트림을 예측할 수 있다. 다양한 실시예들에서, 제1 인공지능 모델(220)은 음소 정규화 함수에 기초하여 음소에 대응되는 복수의 비짐들 중 하나의 비짐을 선택할 수 있다. In various embodiments, the first artificial intelligence model 220 may predict a phoneme stream from characteristic coefficients of a speech signal based on a phoneme stream forming function. In various embodiments, the first artificial intelligence model 220 may predict the viseme stream corresponding to the phoneme stream from the characteristic coefficients of the speech signal based on the viseme stream forming function. In various embodiments, the first AI model 220 may select one viseme from among a plurality of visemes corresponding to a phoneme based on a phoneme normalization function.

제2 인공지능 모델(230)은 제1 인공지능 모델(220)에 의해 생성된 비짐 스트림을 입력받아, 상기 비짐 스트림에 포함된 비짐들의 애니메이션 곡선을 출력할 수 있다. 애니메이션 곡선은 헤드 모델의 움직임에 관련된 애니메이션 파라미터의 시간적 변화를 나타낸다. 일 실시예에서, 애니메이션 곡선은 각 비짐 애니메이션에서의 얼굴 랜드마크의 움직임 및 비짐 애니메이션의 지속 시간을 지정할 수 있다. 상기 애니메이션 곡선은 애니메이션 생성부(120)에 입력되어 헤드 모델에 적용됨으로써 헤드 모델 애니메이션을 생성하기 위하여 사용될 수 있다.The second artificial intelligence model 230 may receive the viseme stream generated by the first artificial intelligence model 220 and output animation curves of visemes included in the viseme stream. The animation curve represents the temporal change of animation parameters related to the movement of the head model. In one embodiment, the animation curve may specify the movement of the facial landmark in each viseme animation and the duration of the viseme animation. The animation curve may be input to the animation generator 120 and applied to the head model to generate a head model animation.
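An animation curve in this sense can be pictured as a function of time that yields the value of one animation parameter. The sketch below assumes a keyframe representation with linear interpolation, which is an illustrative choice rather than something the disclosure specifies; `sample_curve` and the keyframe values are hypothetical.

```python
from typing import List, Tuple

# A keyframe is (time in seconds, parameter value).
Keyframe = Tuple[float, float]


def sample_curve(keyframes: List[Keyframe], t: float) -> float:
    """Linearly interpolate a curve at time t.

    Assumes keyframes are sorted by strictly increasing time. Outside the
    keyframe range the nearest end value is held.
    """
    if t <= keyframes[0][0]:
        return keyframes[0][1]
    if t >= keyframes[-1][0]:
        return keyframes[-1][1]
    for (t0, v0), (t1, v1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return keyframes[-1][1]  # not reached for sorted input


# A viseme that ramps up over 0.1 s and decays over the next 0.2 s;
# the last keyframe time doubles as the viseme's duration.
curve = [(0.0, 0.0), (0.1, 1.0), (0.3, 0.0)]
```

Sampling such a curve at the renderer's frame times gives the per-frame parameter values that drive the head model.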

일 실시예에서, 제2 인공지능 모델(230)은 컨볼루션 신경망(Convolutional Neural Network, CNN), 순환 신경망(Recurrent Neural Network, RNN), 장단기 메모리 (Long Short-Term Memory, LSTM), 게이트 순환 유닛 (gated recurrent unit, GRU), 이들의 변형, 또는 이들의 임의의 조합 중 적어도 하나일 수 있다. In an embodiment, the second artificial intelligence model 230 may be at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or any combination thereof.

다양한 실시예들에서, 제2 인공지능 모델(230)은 애니메이션 곡선 형성 함수에 기초하여 비짐 스트림의 비짐들의 애니메이션 곡선을 도출할 수 있다. 일 실시예에서, 상기 애니메이션 곡선은 얼굴 움직임 부호화 시스템(Facial Action Coding System, FACS)을 사용하여 계산될 수 있다. FACS를 이용하여 계산된 애니메이션 곡선은 임의의 FACS 기반 헤드 모델에 적용되어 헤드 모델 애니메이션을 생성할 수 있다.In various embodiments, the second artificial intelligence model 230 may derive animation curves of visemes of the viseme stream based on the animation curve forming function. In an embodiment, the animation curve may be calculated using a Facial Action Coding System (FACS). An animation curve calculated using FACS can be applied to any FACS-based head model to generate a head model animation.

후처리부(240)는 음소 스트림 및 비짐 스트림을 후처리하여 병합할 수 있다. 후처리부(240)에서 출력된 병합된 음소 및 비짐 스트림은 애니메이션 생성부(120)에 입력되어 헤드 모델 애니메이션을 생성하기 위하여 사용될 수 있다.The post-processing unit 240 may post-process the phoneme stream and the viseme stream and merge them. The merged phoneme and viseme stream output from the post-processing unit 240 may be input to the animation generating unit 120 and used to generate a head model animation.

다양한 실시예들에서, 후처리부(240)는 상기 음소 스트림 및 비짐 스트림을 상기 애니메이션 곡선을 고려하여 오버레이함으로써 병합할 수 있다. 병합된 음소 및 비짐 스트림에서, 각 음소는 대응되는 비짐과 연관될 수 있다. 병합된 음소 및 비짐 스트림의 각 음소 및 연관되는 대응 비짐의 지속시간은 상기 비짐의 애니메이션 곡선에 의해 지정될 수 있다. 일 실시예에서, 후처리부(240)는 병합을 위하여 두 개의 입력을 받아 하나의 출력을 반환하는 임의의 함수를 사용할 수 있다. In various embodiments, the post-processing unit 240 may merge the phoneme stream and the viseme stream by overlaying them in consideration of the animation curve. In the merged phoneme and viseme stream, each phoneme may be associated with its corresponding viseme. The duration of each phoneme and its associated viseme in the merged stream may be specified by the animation curve of that viseme. In an embodiment, the post-processing unit 240 may use, for merging, any function that receives two inputs and returns one output.
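A minimal sketch of such a merge is shown below. It assumes the two streams are already aligned one-to-one and that a duration per unit has been read off each viseme's animation curve; the type `MergedUnit`, the function `merge_streams`, and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class MergedUnit(NamedTuple):
    phoneme: str
    viseme: str
    duration_s: float  # assumed to come from the viseme's animation curve


def merge_streams(
    phonemes: List[str], visemes: List[str], durations_s: List[float]
) -> List[MergedUnit]:
    """Overlay two aligned streams into one stream of (phoneme, viseme) units."""
    if not (len(phonemes) == len(visemes) == len(durations_s)):
        raise ValueError("streams must have the same length")
    return [MergedUnit(p, v, d) for p, v, d in zip(phonemes, visemes, durations_s)]


merged = merge_streams(["m", "a"], ["M", "AA"], [0.08, 0.15])
```

Real streams may need a time-based alignment step before the overlay; that step is outside this sketch.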

한편, 인공지능 모델(110)에 포함된 전처리부(210), 제1 인공지능 모델(220), 제2 인공지능 모델(230), 및 후처리부(240) 중 적어도 하나는, 적어도 하나의 하드웨어 칩 형태로 제작되어 전자 장치에 탑재될 수 있다. 예를 들어, 전처리부(210), 제1 인공지능 모델(220), 제2 인공지능 모델(230), 및 후처리부(240) 중 적어도 하나는 인공 지능(AI; artificial intelligence)을 위한 전용 하드웨어 칩 형태로 제작될 수도 있고, 또는 기존의 범용 프로세서(예: CPU 또는 application processor) 또는 그래픽 전용 프로세서(예: GPU)의 일부로 제작되어 전술한 각종 전자 장치에 탑재될 수도 있다. Meanwhile, at least one of the preprocessor 210, the first artificial intelligence model 220, the second artificial intelligence model 230, and the postprocessor 240 included in the artificial intelligence model 110 may be manufactured in the form of at least one hardware chip and mounted in an electronic device. For example, at least one of the preprocessor 210, the first artificial intelligence model 220, the second artificial intelligence model 230, and the postprocessor 240 may be manufactured as a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics-only processor (e.g., a GPU) and mounted in the various electronic devices described above.

또한, 전처리부(210), 제1 인공지능 모델(220), 제2 인공지능 모델(230), 및 후처리부(240)는 하나의 전자 장치(100)에 탑재될 수도 있으며, 또는 별개의 전자 장치들에 각각 탑재될 수도 있다. 예를 들어, 전처리부(210), 제1 인공지능 모델(220), 제2 인공지능 모델(230), 및 후처리부(240) 중 일부는 전자 장치(100)에 포함되고, 나머지 일부는 서버에 포함될 수 있다. In addition, the preprocessor 210, the first artificial intelligence model 220, the second artificial intelligence model 230, and the postprocessor 240 may be mounted in one electronic device 100, or may each be mounted in separate electronic devices. For example, some of the preprocessor 210, the first artificial intelligence model 220, the second artificial intelligence model 230, and the postprocessor 240 may be included in the electronic device 100, and the rest may be included in a server.

또한, 전처리부(210), 제1 인공지능 모델(220), 제2 인공지능 모델(230), 및 후처리부(240) 중 적어도 하나는 소프트웨어 모듈로 구현될 수 있다. 전처리부(210), 제1 인공지능 모델(220), 제2 인공지능 모델(230), 및 후처리부(240) 중 적어도 하나가 소프트웨어 모듈(또는, 인스트럭션(instruction)을 포함하는 프로그램 모듈)로 구현되는 경우, 소프트웨어 모듈은 컴퓨터로 읽을 수 있는 판독 가능한 비일시적 판독 가능 기록매체(non-transitory computer readable media)에 저장될 수 있다. 또한, 이 경우, 적어도 하나의 소프트웨어 모듈은 OS(Operating System)에 의해 제공되거나, 소정의 애플리케이션에 의해 제공될 수 있다. 또는, 적어도 하나의 소프트웨어 모듈 중 일부는 OS(Operating System)에 의해 제공되고, 나머지 일부는 소정의 애플리케이션에 의해 제공될 수 있다. In addition, at least one of the preprocessor 210, the first artificial intelligence model 220, the second artificial intelligence model 230, and the postprocessor 240 may be implemented as a software module. When at least one of the preprocessor 210, the first artificial intelligence model 220, the second artificial intelligence model 230, and the postprocessor 240 is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable recording medium. In this case, the at least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, part of the at least one software module may be provided by the OS, and the remaining part may be provided by a predetermined application.

도 3은 다양한 실시예들에 따른, 제1 인공지능 모델 및 제2 인공지능 모델의 블록도이다. FIG. 3 is a block diagram of a first artificial intelligence model and a second artificial intelligence model, according to various embodiments.

도 3을 참조하면, 제1 인공지능 모델(220)은 공간적 특성 추출 레이어(310), 시간적 특성 추출 레이어(320), 음소 예측 레이어(330), 음소 스트림 형성 함수(340), 비짐 예측 레이어(350), 비짐 스트림 형성 함수(360), 및 음소 정규화 함수(370)를 포함할 수 있다. 제2 인공지능 모델(230)은 애니메이션 곡선 예측 레이어(380) 및 애니메이션 곡선 형성 함수(390)를 포함할 수 있다. Referring to FIG. 3, the first artificial intelligence model 220 may include a spatial feature extraction layer 310, a temporal feature extraction layer 320, a phoneme prediction layer 330, a phoneme stream forming function 340, a viseme prediction layer 350, a viseme stream forming function 360, and a phoneme normalization function 370. The second artificial intelligence model 230 may include an animation curve prediction layer 380 and an animation curve forming function 390.

공간적 특성 추출 레이어(310), 시간적 특성 추출 레이어(320), 음소 예측 레이어(330), 비짐 예측 레이어(350), 및 애니메이션 곡선 예측 레이어(380)는 특정 기능을 수행하는 신경망(neural network)의 적어도 일부일 수 있다. 음소 스트림 형성 함수(340), 비짐 스트림 형성 함수(360), 음소 정규화 함수(370) 및 애니메이션 곡선 형성 함수(390)는 인공지능 모델에 포함된 하나 이상의 레이어에서 결과를 도출하거나 도출된 결과를 평가하기 위하여 사용되는 함수일 수 있다. The spatial feature extraction layer 310, the temporal feature extraction layer 320, the phoneme prediction layer 330, the viseme prediction layer 350, and the animation curve prediction layer 380 may each be at least part of a neural network that performs a specific function. The phoneme stream forming function 340, the viseme stream forming function 360, the phoneme normalization function 370, and the animation curve forming function 390 may be functions used to derive results from one or more layers included in the artificial intelligence model or to evaluate the derived results.

공간적 특성 추출 레이어(310) 및 시간적 특성 추출 레이어(320)는 입력된 음성 신호의 특성 정보로부터 상기 음성 신호의 특성을 추출할 수 있다. 상기 음성 신호의 특성 정보는 전처리부(210)에서 출력된 음성 신호의 특성 계수일 수 있다. The spatial feature extraction layer 310 and the temporal feature extraction layer 320 may extract features of the voice signal from the characteristic information of the input voice signal. The characteristic information of the voice signal may be the characteristic coefficients of the voice signal output from the preprocessor 210.

공간적 특성 추출 레이어(310)는 입력된 음성 신호의 특성 정보를 처리하여, 공간적(spatial) 특성을 추출할 수 있다. 일 실시예에서, 공간적 특성 추출 레이어(310)에는 완전히 연결된(fully connected) 레이어들과 비선형성을 가진 컨볼루션 신경망(Convolutional Neural Network, CNN) 또는 순환 신경망(Recurrent Neural Network, RNN)이 사용될 수 있다. 예를 들어, 도 4a에 도시된 것과 같은 레이어 구조가 사용될 수 있다. 그러나 이에 한정되지 않고, 임의의 미분 가능한(differentiable) 레이어가 추가될 수 있다. 일 실시예에서, 공간적 특성 추출 레이어(310)는 미리 학습된 것일 수 있다. The spatial feature extraction layer 310 may process the characteristic information of the input voice signal to extract spatial features. In an embodiment, a convolutional neural network (CNN) or a recurrent neural network (RNN) with fully connected layers and nonlinearities may be used for the spatial feature extraction layer 310. For example, a layer structure such as the one shown in FIG. 4A may be used. However, the disclosure is not limited thereto, and any differentiable layer may be added. In an embodiment, the spatial feature extraction layer 310 may be pre-trained.

시간적 특성 추출 레이어(320)는 상기 추출된 공간적 특성을 처리하여, 시간적(temporal) 특성을 추출할 수 있다. 일 실시예에서, 시간적 특성 추출 레이어(320)에는 완전히 연결된 레이어들과 비선형성을 가진 순환 신경망(RNN)이 사용될 수 있다. 예를 들어, 드롭아웃(dropout)이 있는 3단계 장단기 메모리 (Long Short-Term Memory, LSTM)가 사용될 수 있다. The temporal feature extraction layer 320 may process the extracted spatial feature to extract a temporal feature. In an embodiment, a recurrent neural network (RNN) with fully connected layers and nonlinearity may be used for the temporal feature extraction layer 320 . For example, a three-level Long Short-Term Memory (LSTM) with dropout may be used.

공간적 특성 추출 레이어(310) 및 시간적 특성 추출 레이어(320)를 거쳐 추출된 음성 신호의 특성에 기초하여, 두 개의 독립적인 스트림들, 음소 스트림과 비짐 스트림이 예측될 수 있다. 음소 예측 레이어(330)는 추출된 음성 신호의 특성으로부터 음성 신호에 대응되는 음소 스트림을 도출할 수 있다. 비짐 예측 레이어(350)는 상기 음소 스트림에 포함되는 각 음소에 대응되는 비짐을 선택하여, 상기 음성 신호에 대응되는 비짐 스트림을 도출할 수 있다. Two independent streams, a phoneme stream and a viseme stream, may be predicted based on the features of the voice signal extracted through the spatial feature extraction layer 310 and the temporal feature extraction layer 320. The phoneme prediction layer 330 may derive a phoneme stream corresponding to the voice signal from the extracted features. The viseme prediction layer 350 may select a viseme corresponding to each phoneme included in the phoneme stream to derive a viseme stream corresponding to the voice signal.

다양한 실시예들에서, 음소 예측 레이어(330)는 음소 스트림 형성 함수(340)에 기초하여 음성 신호의 특성으로부터 음소 스트림을 예측할 수 있다. 일 실시예에서, 음소 스트림 형성 함수(340)는 제1 인공지능 모델(220)이 예측한 음소 스트림이 실제의 올바른 값과 얼마나 유사한지를 측정하는 손실 함수(loss function)로부터 계산될 수 있다. 일 실시예에서, 음소 스트림 형성 함수(340)는 임의의 음성 신호 및 상기 임의의 음성 신호에 대응하는 텍스트를 포함하는 학습 데이터 세트에 의해 학습된 것일 수 있다.In various embodiments, the phoneme prediction layer 330 may predict a phoneme stream from a characteristic of a speech signal based on the phoneme stream forming function 340 . In an embodiment, the phoneme stream forming function 340 may be calculated from a loss function that measures how similar the phoneme stream predicted by the first AI model 220 is to an actual correct value. In an embodiment, the phoneme stream forming function 340 may be learned from a training data set including an arbitrary voice signal and text corresponding to the arbitrary voice signal.

다양한 실시예들에서, 비짐 예측 레이어(350)는 비짐 스트림 형성 함수(360)에 기초하여 음성 신호의 특성으로부터 음소 스트림에 대응되는 비짐 스트림을 예측할 수 있다. 일 실시예에서, 비짐 스트림 형성 함수(360)는 제1 인공지능 모델(220)이 예측한 비짐 스트림이 실제의 올바른 값과 얼마나 유사한지를 측정하는 손실 함수로부터 계산될 수 있다. 일 실시예에서, 비짐 스트림 형성 함수(360)는 임의의 음성 신호 및 상기 임의의 음성 신호에 대응되는 비디오 신호를 포함하는 학습 데이터 세트에 의해 학습된 것일 수 있다.In various embodiments, the viseme prediction layer 350 may predict the viseme stream corresponding to the phoneme stream from the characteristics of the speech signal based on the viseme stream forming function 360 . In one embodiment, the viseme stream forming function 360 may be calculated from a loss function that measures how similar the viseme stream predicted by the first artificial intelligence model 220 is to the actual correct value. In an embodiment, the viseme stream forming function 360 may be learned from a training data set including an arbitrary voice signal and a video signal corresponding to the arbitrary voice signal.

하나의 음소에 부분적으로 대응할 수 있는 복수의 비짐들이 존재하는 경우, 비짐 예측 레이어(350)는 그 중 적합한 비짐을 선택할 수 있다. 다양한 실시예들에서, 비짐 예측 레이어(350)는 음소 정규화 함수(370)에 기초하여 음소에 대응되는 복수의 비짐들 중 하나의 비짐을 선택할 수 있다. 일 실시예에서, 상기 음소 정규화 함수(370)는 음소의 확률 분포를 예측하고, 사용될 가능성이 낮은 음소에 대응되는 기본 형태에 페널티를 주는 함수일 수 있다. 일 실시예에서, 상기 음소 정규화 함수(370)는 임의의 음성 신호를 포함하는 학습 데이터 세트에 의해 정규화(regularization) 방법에 의해 계산된 것일 수 있다.When there are a plurality of visemes that may partially correspond to one phoneme, the viseme prediction layer 350 may select an appropriate viseme from among them. In various embodiments, the viseme prediction layer 350 may select one viseme from among a plurality of visemes corresponding to a phoneme based on the phoneme normalization function 370 . In an embodiment, the phoneme normalization function 370 may be a function that predicts a probability distribution of a phoneme and gives a penalty to a basic shape corresponding to a phoneme that is unlikely to be used. In an embodiment, the phoneme normalization function 370 may be calculated by a regularization method based on a training data set including an arbitrary speech signal.
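The selection among several candidate visemes can be sketched as a penalized argmax. This is one possible reading of the penalty described above, not the disclosed formula: each candidate's score is reduced by the negative log-probability of the phoneme it corresponds to. The function `select_viseme`, its parameters, and the sample values are hypothetical.

```python
import math
from typing import Dict


def select_viseme(
    candidate_scores: Dict[str, float],   # raw score per candidate viseme
    viseme_to_phoneme: Dict[str, str],    # phoneme each candidate stands for
    phoneme_probs: Dict[str, float],      # predicted phoneme distribution
    strength: float = 1.0,                # hypothetical penalty weight
) -> str:
    """Pick the candidate with the best score after penalizing unlikely phonemes."""

    def penalized(viseme: str) -> float:
        p = phoneme_probs.get(viseme_to_phoneme[viseme], 0.0)
        # log(p) is strongly negative when the phoneme is unlikely.
        return candidate_scores[viseme] + strength * math.log(p + 1e-9)

    return max(candidate_scores, key=penalized)


scores = {"V_open": 0.6, "V_round": 0.7}
mapping = {"V_open": "a", "V_round": "o"}
probs = {"a": 0.9, "o": 0.05}
chosen = select_viseme(scores, mapping, probs)
```

With the penalty switched off (`strength=0.0`) the higher raw score wins; with it on, the candidate tied to the likely phoneme is preferred.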

일 실시예에서, 음소 예측 레이어(330) 및 비짐 예측 레이어(350)에는 선형 레이어, 비선형성 레이어, 및 다른 미분 가능한 레이어들의 스택을 포함한 임의의 가능한 레이어 구조가 사용될 수 있다. 예를 들어, 도 4b에 도시된 것과 같은, 정류된 선형 유닛(Rectified Linear Unit, ReLU)과 완전히 연결된 (fully connected) 2개의 선형 레이어들이 예측자로써 사용될 수 있다. In an embodiment, any feasible layer structure, including a stack of linear layers, nonlinear layers, and other differentiable layers, may be used for the phoneme prediction layer 330 and the viseme prediction layer 350. For example, two fully connected linear layers with a rectified linear unit (ReLU), as shown in FIG. 4B, may be used as the predictor.
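The forward pass of a predictor built from two fully connected layers and a ReLU can be written out in a few lines. The sketch assumes the ReLU sits between the two layers and that a softmax turns the outputs into class probabilities; neither detail is stated in the text, and all names and weights are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Fully connected layer: one output per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def softmax(x: Vector) -> Vector:
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Linear -> ReLU -> Linear -> softmax over phoneme (or viseme) classes."""
    return softmax(linear(relu(linear(x, w1, b1)), w2, b2))


identity = [[1.0, 0.0], [0.0, 1.0]]
probs = predict([1.0, -1.0], identity, [0.0, 0.0], identity, [0.0, 0.0])
```

In practice the same structure would be expressed with a deep learning framework so that the weights can be trained by backpropagation.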

일 실시예에서, 음소 예측 레이어(330) 및 비짐 예측 레이어(350)는 시간적 특성 추출 레이어(320)와 가중치를 공유할 수 있다. 두 스트림이 모두 이전 레이어와 가중치를 공유함에 따라, 예측되는 파라미터들의 특성에 의하여 모델을 정규화하는 효과를 얻을 수 있다. In an embodiment, the phoneme prediction layer 330 and the viseme prediction layer 350 may share weights with the temporal feature extraction layer 320. Because both streams share weights with the preceding layer, the model is effectively regularized by the characteristics of the predicted parameters.

애니메이션 곡선 예측 레이어(380)는 비짐 스트림으로부터 비짐 스트림에 포함된 비짐들의 애니메이션 곡선을 예측할 수 있다. 애니메이션 곡선은 헤드 모델의 움직임에 관련된 애니메이션 파라미터의 시간적 변화를 나타낸다. 일 실시예에서, 애니메이션 곡선은 각 비짐 애니메이션에서의 얼굴 랜드마크의 움직임 및 비짐 애니메이션의 지속 시간을 지정할 수 있다.The animation curve prediction layer 380 may predict animation curves of visemes included in the viseme stream from the viseme stream. The animation curve represents the temporal change of animation parameters related to the movement of the head model. In one embodiment, the animation curve may specify the movement of the facial landmark in each viseme animation and the duration of the viseme animation.

다양한 실시예들에서, 애니메이션 곡선 예측 레이어(380)는 애니메이션 곡선 형성 함수(390)에 기초하여 비짐 스트림의 비짐들의 애니메이션 곡선을 도출할 수 있다. 일 실시예에서, 애니메이션 곡선 형성 함수(390)는 제2 인공지능 모델(230)이 도출한 애니메이션 곡선이 나타내는 움직임이 실제 인물의 움직임과 얼마나 유사한지를 측정하는 손실 함수로부터 계산될 수 있다. 일 실시예에서, 애니메이션 곡선 형성 함수(390)는 임의의 음성 신호 및 상기 임의의 음성 신호에 대응되는 비디오 신호를 포함하는 학습 데이터 세트에 의해 학습된 것일 수 있다. In various embodiments, the animation curve prediction layer 380 may derive animation curves of the visemes of the viseme stream based on the animation curve forming function 390. In an embodiment, the animation curve forming function 390 may be calculated from a loss function that measures how similar the motion represented by the animation curve derived by the second artificial intelligence model 230 is to the motion of a real person. In an embodiment, the animation curve forming function 390 may be learned from a training data set including an arbitrary voice signal and a video signal corresponding to that voice signal.

일 실시예에서, 상기 애니메이션 곡선은 얼굴 움직임 부호화 시스템(Facial Action Coding System, FACS)을 사용하여 계산될 수 있다. 애니메이션 곡선 예측 레이어(380)는 각 비짐에 대하여, FACS 계수를 이용하여 정의된 애니메이션 곡선을 도출할 수 있다. FACS 계수를 사용하여 정의된 애니메이션 곡선은 임의의 FACS 기반 헤드 모델에 적용될 수 있다. 그러나, 애니메이션 곡선은 반드시 FACS를 사용하여 계산되는 것에만 한정되지 않으며, 애니메이션 곡선을 계산하기 위하여 임의의 적절한 방법을 사용할 수 있다.In an embodiment, the animation curve may be calculated using a Facial Action Coding System (FACS). The animation curve prediction layer 380 may derive an animation curve defined by using FACS coefficients for each viseme. Animation curves defined using FACS coefficients can be applied to any FACS-based head model. However, the animation curve is not necessarily limited to being calculated using FACS, and any suitable method may be used for calculating the animation curve.

도 5는 다양한 실시예들에 따른, 음성 신호로부터 헤드 모델 애니메이션을 생성하는 방법의 흐름도이다. 도 5의 각 동작들은 도 1 에 도시된 전자 장치(100), 또는 도 9에 도시된 전자 장치(100) 또는 전자 장치(100)의 프로세서(910)에 의해 수행될 수 있다. FIG. 5 is a flowchart of a method of generating a head model animation from a voice signal, according to various embodiments. Each operation of FIG. 5 may be performed by the electronic device 100 shown in FIG. 1, or by the electronic device 100 shown in FIG. 9 or the processor 910 of the electronic device 100.

도 5를 참조하면, 동작 S510에서, 전자 장치(100)는 음성 신호로부터 음성 신호의 특성 정보를 획득할 수 있다. 다양한 실시예들에서, 전자 장치(100)는 음성 신호를 전처리하여 음성 신호의 특성을 나타내는 특성 정보, 예를 들어 특성 계수로 변환할 수 있다. 일 실시예에서, 전자 장치(100)는 MFCC(Mel-Frequency Cepstral Coefficients) 방법에 의하여 음성 계수를 변환하여 상기 특성 계수들을 획득할 수 있다. 다른 실시예에서, 전자 장치(100)는 다른 사전 학습된 인공지능 모델에 상기 음성 신호를 입력하여 상기 특성 계수들을 획득할 수 있다. Referring to FIG. 5, in operation S510, the electronic device 100 may obtain characteristic information of a voice signal from the voice signal. In various embodiments, the electronic device 100 may preprocess the voice signal and convert it into characteristic information indicating characteristics of the voice signal, for example, characteristic coefficients. In an embodiment, the electronic device 100 may obtain the characteristic coefficients by transforming the voice signal using the Mel-Frequency Cepstral Coefficients (MFCC) method. In another embodiment, the electronic device 100 may obtain the characteristic coefficients by inputting the voice signal to another pre-trained artificial intelligence model.

동작 S520에서, 전자 장치(100)는 인공지능 모델을 이용하여, 상기 특성 정보로부터 상기 음성 신호에 대응하는 음소 스트림 및 상기 음소 스트림에 대응하는 비짐 스트림을 획득할 수 있다. In operation S520 , the electronic device 100 may obtain a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream from the characteristic information using the artificial intelligence model.

다양한 실시예들에서, 전자 장치(100)는 음소 스트림 형성 함수에 기초하여 음성 신호의 특성 계수로부터 음소 스트림을 예측할 수 있다. 상기 음소 스트림 형성 함수는, 임의의 음성 신호 및 상기 임의의 음성 신호에 대응하는 텍스트를 포함하는 학습 데이터 세트에 의해 학습되는 것일 수 있다.In various embodiments, the electronic device 100 may predict a phoneme stream from characteristic coefficients of a speech signal based on a phoneme stream forming function. The phoneme stream forming function may be learned from a training data set including an arbitrary voice signal and a text corresponding to the arbitrary voice signal.

다양한 실시예들에서, 전자 장치(100)는 비짐 스트림 형성 함수에 기초하여 음성 신호의 특성 계수로부터 음소 스트림에 대응되는 비짐 스트림을 예측할 수 있다. 상기 비짐 스트림 형성 함수는, 임의의 음성 신호 및 상기 임의의 음성 신호에 대응되는 비디오 신호를 포함하는 학습 데이터 세트에 의해 학습되는 것일 수 있다.In various embodiments, the electronic device 100 may predict the viseme stream corresponding to the phoneme stream from the characteristic coefficients of the speech signal based on the viseme stream forming function. The viseme stream forming function may be learned from a training data set including an arbitrary voice signal and a video signal corresponding to the arbitrary voice signal.

다양한 실시예들에서, 전자 장치(100)는 음소 정규화 함수에 기초하여 음소에 대응되는 복수의 비짐들 중 하나의 비짐을 선택할 수 있다.In various embodiments, the electronic device 100 may select one viseme from among a plurality of visemes corresponding to a phoneme based on a phoneme normalization function.

동작 S530에서, 전자 장치(100)는 인공지능 모델을 이용하여, 상기 비짐 스트림에 포함된 비짐들의 애니메이션 곡선을 획득할 수 있다. 다양한 실시예들에서, 전자 장치(100)는 애니메이션 곡선 형성 함수에 기초하여 비짐 스트림의 비짐들의 애니메이션 곡선을 도출할 수 있다. 일 실시예에서, 상기 애니메이션 곡선은 얼굴 움직임 부호화 시스템(Facial Action Coding System, FACS)을 사용하여 계산될 수 있다.In operation S530, the electronic device 100 may obtain an animation curve of visemes included in the viseme stream using the artificial intelligence model. In various embodiments, the electronic device 100 may derive animation curves of visemes of the viseme stream based on the animation curve forming function. In an embodiment, the animation curve may be calculated using a Facial Action Coding System (FACS).

동작 S540에서, 전자 장치(100)는 상기 음소 스트림 및 상기 비짐 스트림을 병합할 수 있다. 일 실시예에서, 인공지능 모델(110)은 상기 음소 스트림 및 비짐 스트림을 상기 애니메이션 곡선을 고려하여 오버레이함으로써 병합할 수 있다.In operation S540, the electronic device 100 may merge the phoneme stream and the viseme stream. In an embodiment, the AI model 110 may merge the phoneme stream and the viseme stream by overlaying them in consideration of the animation curve.

동작 S550에서, 전자 장치(100)는 상기 애니메이션 곡선을 상기 병합된 음소 및 비짐 스트림의 비짐들에 적용하여 헤드 모델 애니메이션을 생성할 수 있다. 다양한 실시예들에서, 전자 장치(100)는 미리 정의된 헤드 모델에 기초하여 헤드 모델 애니메이션을 생성할 수 있다. 일 실시예에서, 상기 미리 정의된 헤드 모델은 얼굴 움직임 부호화 시스템(FACS)에 기초한 임의의 3D 캐릭터 모델일 수 있다. In operation S550 , the electronic device 100 may generate a head model animation by applying the animation curve to visemes of the merged phoneme and viseme stream. In various embodiments, the electronic device 100 may generate a head model animation based on a predefined head model. In an embodiment, the predefined head model may be any 3D character model based on a facial motion coding system (FACS).

일 실시예에서, 전자 장치(100)는 상기 병합된 음소 및 비짐 스트림에 기초하여, 미리 정의된 헤드 모델의 비짐 세트를 결정하고, 상기 비짐 세트에 상기 애니메이션 곡선을 적용하여 헤드 모델 애니메이션을 생성할 수 있다.In an embodiment, the electronic device 100 determines a viseme set of a predefined head model based on the merged phoneme and viseme stream, and applies the animation curve to the viseme set to generate a head model animation. can
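Operations S510 to S550 can be summarized as a small pipeline. The sketch below only fixes the order of the steps and the data passed between them; every stage is supplied by the caller as a function, and all parameter names are hypothetical placeholders rather than components named in the disclosure.

```python
from typing import Any, Callable, Sequence, Tuple


def generate_head_animation(
    speech: Any,
    extract_features: Callable[[Any], Any],
    predict_streams: Callable[[Any], Tuple[Sequence[str], Sequence[str]]],
    predict_curves: Callable[[Sequence[str]], Any],
    merge: Callable[[Sequence[str], Sequence[str], Any], Any],
    animate: Callable[[Any, Any], Any],
) -> Any:
    """Run the five operations of the flowchart in order."""
    features = extract_features(speech)             # S510: characteristic information
    phonemes, visemes = predict_streams(features)   # S520: phoneme and viseme streams
    curves = predict_curves(visemes)                # S530: animation curves
    merged = merge(phonemes, visemes, curves)       # S540: merged stream
    return animate(merged, curves)                  # S550: head model animation
```

Keeping the stages as injected callables mirrors the point made elsewhere in the description that the components may run on one device or be split between a device and a server.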

도 6은 다양한 실시예들에 따른, 음성 신호로부터 애니메이션 헤드 모델을 생성하기 위한 인공지능 모델을 학습시키는 학습부의 개요도이다. FIG. 6 is a schematic diagram of a learning unit that trains an artificial intelligence model for generating an animated head model from a voice signal, according to various embodiments.

도 6을 참조하면, 음성 신호로부터 애니메이션 헤드 모델을 생성하기 위한 인공지능 모델(110)을 학습시키는 학습부(150)는, 음소 검출부(610), 음소 스트림 형성 함수 계산부 (620), 애니메이션 생성부(630), 제1 움직임 패턴 검출부 (640), 제2 움직임 패턴 검출부 (650), 비짐 스트림 형성 함수 계산부 (660), 애니메이션 곡선 형성 함수 계산부 (670), 및 음소 정규화 함수 계산부 (680)를 포함할 수 있다. Referring to FIG. 6, the learning unit 150 that trains the artificial intelligence model 110 for generating an animated head model from a voice signal may include a phoneme detection unit 610, a phoneme stream forming function calculation unit 620, an animation generation unit 630, a first movement pattern detection unit 640, a second movement pattern detection unit 650, a viseme stream forming function calculation unit 660, an animation curve forming function calculation unit 670, and a phoneme normalization function calculation unit 680.

학습부(150)는 음성 신호, 상기 음성 신호에 대응하는 텍스트, 및 상기 음성 신호에 대응하는 비디오 신호를 포함하는 학습 데이터 세트(200)를 획득할 수 있다. 상기 음성 신호, 상기 텍스트, 및 상기 비디오 신호는 다른 얼굴 형태를 가진 다양한 인물의 기록을 포함할 수 있다. 상기 음성 신호, 상기 텍스트, 및 상기 비디오 신호는 다중 목표 학습을 위하여 각각 별도로 처리될 수 있다. 상기 음성 신호의 처리, 상기 텍스트의 처리, 및 상기 비디오 신호의 처리는 전자 장치(100)의 구성 및 그 계산 능력에 따라 병렬적으로 수행되거나 또는 순차적으로 수행될 수 있다. The learning unit 150 may acquire the training data set 200 including a voice signal, text corresponding to the voice signal, and a video signal corresponding to the voice signal. The voice signal, the text, and the video signal may include recordings of various people with different face shapes. The voice signal, the text, and the video signal may each be processed separately for multi-objective learning. The processing of the voice signal, the text, and the video signal may be performed in parallel or sequentially, depending on the configuration of the electronic device 100 and its computing capability.

인공지능 모델(110)은 학습 데이터 세트(200)의 음성 신호를 입력받아, 상기 음성 신호에 대응되는 제1 음소 스트림, 상기 제1 음소 스트림에 대응되는 비짐 스트림, 및 상기 비짐 스트림에 포함된 비짐들의 애니메이션 곡선을 획득할 수 있다. The artificial intelligence model 110 may receive the voice signal of the training data set 200 and obtain a first phoneme stream corresponding to the voice signal, a viseme stream corresponding to the first phoneme stream, and animation curves of the visemes included in the viseme stream.

다양한 실시예들에서, 인공지능 모델(110)은 전술된 것과 동일한 과정에 의하여 학습 데이터 세트(200)의 음성 신호로부터 제1 음소 스트림, 비짐 스트림, 및 애니메이션 곡선을 획득할 수 있다. 예를 들어, 인공지능 모델(110)은 음성 신호를 전처리하여 음성 신호의 특성을 나타내는 특성 정보를 획득할 수 있다. 인공지능 모델(110)은 상기 특성 정보를 입력 받아, 상기 음성 신호에 대응하는 음소 스트림 및 상기 음소 스트림에 대응하는 비짐(viseme) 스트림을 출력할 수 있다. 인공지능 모델(110)은 상기 비짐 스트림에 포함된 비짐들의 애니메이션 곡선을 출력할 수 있다.In various embodiments, the artificial intelligence model 110 may obtain the first phoneme stream, the viseme stream, and the animation curve from the speech signal of the training data set 200 by the same process as described above. For example, the artificial intelligence model 110 may pre-process the voice signal to obtain characteristic information indicating the characteristics of the voice signal. The artificial intelligence model 110 may receive the characteristic information and output a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream. The artificial intelligence model 110 may output animation curves of visemes included in the viseme stream.

음소 검출부(610)는 학습 데이터 세트(200)의 음성 신호에 대응되는 텍스트를 입력받아, 제2 음소 스트림을 검출할 수 있다. 텍스트로부터 음소를 검출하는 동작은 알려진 임의의 방법으로 실행될 수 있다.The phoneme detector 610 may receive a text corresponding to the voice signal of the training data set 200 and detect the second phoneme stream. The operation of detecting phonemes from text may be performed by any known method.

음소 스트림 형성 함수 계산부(620)는 상기 제1 음소 스트림과 상기 제2 음소 스트림을 비교하여, 인공지능 모델(110)에서 음소 스트림을 예측하는 데 사용하는 음소 스트림 형성 함수(340)를 계산할 수 있다. 일 실시예에서, 상기 음소 스트림 형성 함수(340)는 상기 제1 음소 스트림과 상기 제2 음소 스트림을 비교하여 손실 함수(loss function)를 이용하여 계산될 수 있다. The phoneme stream forming function calculation unit 620 may compare the first phoneme stream with the second phoneme stream to calculate the phoneme stream forming function 340 that the artificial intelligence model 110 uses to predict a phoneme stream. In an embodiment, the phoneme stream forming function 340 may be calculated using a loss function that compares the first phoneme stream with the second phoneme stream.
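The disclosure does not specify which loss function compares the two phoneme streams. As a minimal sketch only, assuming the first phoneme stream is available as per-frame logits and the second phoneme stream has already been aligned to the same frames as integer phoneme IDs, a cross-entropy loss could be computed as follows. The function name `phoneme_stream_loss` and the frame-level alignment are assumptions for illustration and are not taken from the disclosure.

```python
import numpy as np

def phoneme_stream_loss(logits, target_ids):
    """Mean cross-entropy between predicted phoneme logits and target phoneme IDs.

    logits:     (frames, phonemes) scores for the first (predicted) phoneme stream.
    target_ids: (frames,) integer IDs of the second (text-derived) phoneme stream,
                assumed here to be already aligned frame by frame.
    """
    logits = np.asarray(logits, dtype=float)
    target_ids = np.asarray(target_ids, dtype=int)
    # Numerically stable log-softmax over the phoneme axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the target phoneme in each frame.
    return float(-log_probs[np.arange(len(target_ids)), target_ids].mean())
```

A lower value indicates that the predicted stream agrees more closely with the stream derived from the text.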

애니메이션 생성부(630)는 인공지능 모델(110)로부터 출력된 애니메이션 곡선을 3D 템플릿 모델에 적용하여 비짐 애니메이션을 생성할 수 있다. 다양한 실시예들에서, 애니메이션 생성부(630)는 애니메이션 생성부(120)와 동일한 방법으로, 비짐 스트림에 포함된 비짐들에 애니메이션 곡선을 적용하여 비짐 애니메이션을 획득할 수 있다. The animation generator 630 may generate a viseme animation by applying the animation curve output from the artificial intelligence model 110 to the 3D template model. In various embodiments, the animation generator 630 may obtain a viseme animation by applying an animation curve to visemes included in the viseme stream in the same manner as the animation generator 120 .

다양한 실시예들에서, 애니메이션 생성부(630)는 미리 정의된 헤드 모델에 기초하여 비짐 애니메이션을 획득할 수 있다. 예를 들어, 애니메이션 생성부(630)는 3D 템플릿 헤드 모델에 애니메이션 곡선을 적용하여 각 비짐의 비짐 애니메이션을 획득할 수 있다.In various embodiments, the animation generator 630 may acquire a viseme animation based on a predefined head model. For example, the animation generator 630 may obtain a viseme animation of each viseme by applying an animation curve to the 3D template head model.

제1 움직임 패턴 검출부(640)는 획득된 비짐 애니메이션에서 얼굴 랜드마크의 움직임 패턴을 검출하여, 제1 움직임 패턴을 획득할 수 있다. 일 실시예에서, 얼굴 랜드마크는 상기 미리 정의된 헤드 모델에 미리 정의되어 있을 수 있다. 애니메이션 곡선이 나타내는 움직임 파라미터는 상기 정의된 얼굴 랜드마크의 움직임 패턴을 지정할 수 있다.The first movement pattern detector 640 may acquire the first movement pattern by detecting the movement pattern of the facial landmark in the obtained viseme animation. In an embodiment, the facial landmark may be predefined in the predefined head model. The motion parameter indicated by the animation curve may designate a motion pattern of the defined facial landmark.

제2 움직임 패턴 검출부(650)는 학습 데이터 세트(200)의 음성 신호에 대응되는 비디오 신호를 입력받아, 비디오 신호에서 얼굴 랜드마크를 검출할 수 있다. 일 실시예에서, 얼굴 랜드마크는 랜드마크 검출기에 의해 검출될 수 있다. 상기 랜드마크 검출기는 임의의 종래의 랜드마크 검출 방법을 수행할 수 있다.The second movement pattern detector 650 may receive a video signal corresponding to the voice signal of the training data set 200 and detect a facial landmark from the video signal. In one embodiment, the facial landmark may be detected by a landmark detector. The landmark detector may perform any conventional landmark detection method.

제2 움직임 패턴 검출부(650)는 검출된 얼굴 랜드마크의 움직임 변위를 측정하여 제2 움직임 패턴을 획득할 수 있다. 일 실시예에서, 상기 움직임 변위는 학습 데이터 세트(200)의 평균 또는 학습 중 선택한 특정한 얼굴을 기준으로 측정할 수 있다. 일 실시예에서, 얼굴 모양에 독립적으로 인공지능 모델(110)을 학습시키기 위하여, 제2 움직임 패턴 검출부(650)는 비디오 신호의 첫 프레임으로부터 획득한 얼굴 모양 또는 추정한 얼굴 모양에 기초하여 얼굴 랜드마크의 움직임 변위를 측정할 수 있다. The second movement pattern detector 650 may obtain a second movement pattern by measuring the movement displacement of the detected facial landmarks. In an embodiment, the movement displacement may be measured relative to the average of the training data set 200 or relative to a specific face selected during training. In an embodiment, in order to train the artificial intelligence model 110 independently of face shape, the second movement pattern detector 650 may measure the movement displacement of the facial landmarks based on the face shape obtained from the first frame of the video signal or on an estimated face shape.
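The displacement measurement described above can be sketched as a subtraction of a reference face shape from the landmarks of every frame. The sketch below is illustrative only: it assumes 2D landmarks stored as an array of shape (frames, landmarks, 2), and the function name `landmark_displacements` is hypothetical. When no reference is given, the first frame is used, which corresponds to one of the options the text mentions.

```python
import numpy as np

def landmark_displacements(frames, reference=None):
    """Displacement of each landmark in each frame relative to a reference shape.

    frames:    (T, K, 2) landmark coordinates detected in T video frames.
    reference: (K, 2) reference face shape; defaults to the first frame.
    """
    frames = np.asarray(frames, dtype=float)
    ref = frames[0] if reference is None else np.asarray(reference, dtype=float)
    # Broadcasting subtracts the same reference shape from every frame.
    return frames - ref
```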

비짐 스트림 형성 함수 계산부(660)는 상기 제1 움직임 패턴과 상기 제2 움직임 패턴을 비교하여, 인공지능 모델(110)에서 비짐 스트림을 예측하는 데 사용하는 비짐 스트림 형성 함수(360)를 계산할 수 있다. 일 실시예에서, 상기 비짐 스트림 형성 함수(360)는 상기 제1 움직임 패턴과 상기 제2 움직임 패턴을 비교하여 손실 함수(loss function)를 이용하여 계산될 수 있다. The viseme stream forming function calculation unit 660 may compare the first movement pattern with the second movement pattern to calculate the viseme stream forming function 360 that the artificial intelligence model 110 uses to predict a viseme stream. In an embodiment, the viseme stream forming function 360 may be calculated using a loss function that compares the first movement pattern with the second movement pattern.

애니메이션 곡선 형성 함수 계산부(670)는 상기 제1 움직임 패턴과 상기 제2 움직임 패턴을 비교하여, 인공지능 모델(110)에서 애니메이션 곡선을 예측하는 데 사용하는 애니메이션 곡선 형성 함수(390)를 계산할 수 있다. 일 실시예에서, 상기 애니메이션 곡선 형성 함수(390)는 상기 제1 움직임 패턴과 상기 제2 움직임 패턴을 비교하여 손실 함수(loss function)를 이용하여 계산될 수 있다. The animation curve forming function calculation unit 670 may compare the first movement pattern with the second movement pattern to calculate the animation curve forming function 390 that the artificial intelligence model 110 uses to predict animation curves. In an embodiment, the animation curve forming function 390 may be calculated using a loss function that compares the first movement pattern with the second movement pattern.

음소 정규화 함수 계산부(680)는 인공지능 모델(110)로부터 출력된 제1 음소 스트림, 비짐 스트림, 및 애니메이션 곡선에 기초하여, 인공지능 모델(110)에서 음소에 대응되는 복수의 비짐들 중 하나의 비짐을 선택하기 위해 사용되는 음소 정규화 함수(370)를 계산할 수 있다. 일 실시예에서, 음소 정규화 함수 계산부(680)는 정규화(regularization) 방법에 의해 음소 정규화 함수(370)를 계산할 수 있다. Based on the first phoneme stream, the viseme stream, and the animation curves output from the artificial intelligence model 110, the phoneme normalization function calculation unit 680 may calculate the phoneme normalization function 370, which the artificial intelligence model 110 uses to select one viseme from among a plurality of visemes corresponding to a phoneme. In an embodiment, the phoneme normalization function calculation unit 680 may calculate the phoneme normalization function 370 by a regularization method.

학습부(150)는 계산된 음소 스트림 형성 함수, 비짐 스트림 형성 함수, 애니메이션 곡선 형성 함수, 및 음소 정규화 함수를 이용하여, 인공지능 모델(110)을 갱신할 수 있다. The learning unit 150 may update the artificial intelligence model 110 using the calculated phoneme stream forming function, viseme stream forming function, animation curve forming function, and phoneme normalization function.
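The disclosure does not state how the four calculated functions are combined when the model is updated. One common arrangement in multi-objective training, shown here purely as an assumption, is to form a weighted sum of the individual loss terms and minimize that sum. The function name `total_training_loss` and the default weights are hypothetical.

```python
def total_training_loss(phoneme_loss, viseme_loss, curve_loss, regularizer,
                        weights=(1.0, 1.0, 1.0, 0.1)):
    """Weighted sum of the four training terms (the weights are illustrative only)."""
    terms = (phoneme_loss, viseme_loss, curve_loss, regularizer)
    return sum(w * t for w, t in zip(weights, terms))
```

In such a scheme the weights would be tuned so that no single term dominates the update.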

도 7은 다양한 실시예들에 따른, 음성 신호 및 음성 신호에 대응되는 비디오 신호로부터 비짐 스트림 형성 함수 및 애니메이션 곡선 형성 함수를 계산하는 개요도이다. FIG. 7 is a schematic diagram of calculating a viseme stream forming function and an animation curve forming function from a voice signal and a video signal corresponding to the voice signal, according to various embodiments of the present disclosure.

학습 데이터 세트(200)는 얼굴 모양이 다른 다양한 사람들의 기록을 포함하며, 또한 실제 인물의 얼굴 형태는 인공적으로 제작된 애니메이션 캐릭터 모델과 차이가 있다. 따라서, 인공지능 모델(110)에 의해 생성된 3D 헤드 모델 애니메이션과, 비디오 신호에서 나타나는 얼굴 움직임을 직접적으로 비교할 수 없다. 그러므로 인공지능 모델(110)이 발화되는 음성 신호에 따른 임의의 얼굴 모양의 움직임을 학습하기 위해서는, 얼굴 모양의 차이와 관련된 오류를 제거할 필요가 있다.The training data set 200 includes records of various people with different face shapes, and the face shape of a real person is different from an artificially created animated character model. Therefore, it is not possible to directly compare the 3D head model animation generated by the artificial intelligence model 110 with the facial movement appearing in the video signal. Therefore, in order for the artificial intelligence model 110 to learn the movement of an arbitrary face shape according to the spoken voice signal, it is necessary to remove an error related to the difference in the face shape.

도 7을 참조하면, 제1 움직임 패턴 검출부(640)는 3D 템플릿 모델의 비짐 애니메이션을 획득할 수 있다. 상기 비짐 애니메이션은 인공지능 모델(110)로부터 출력된 애니메이션 곡선에 기초하여 생성된 것이다. Referring to FIG. 7 , the first movement pattern detector 640 may acquire a viseme animation of the 3D template model. The viseme animation is generated based on the animation curve output from the artificial intelligence model 110 .

일 실시예에서, 제1 움직임 패턴 검출부(640)는 3D 헤드 모델 애니메이션의 움직임과 비디오 신호로부터 검출된 얼굴 움직임을 비교하기 위하여, 상기 비짐 애니메이션의 얼굴 랜드마크를 2D 평면에 투영할 수 있다. 제1 움직임 패턴 검출부(640)는 상기 투영된 얼굴 랜드마크의 움직임을 2D 평면 상에서 계산하여, 제1 움직임 패턴을 획득할 수 있다. In an embodiment, the first movement pattern detector 640 may project the facial landmark of the viseme animation on a 2D plane in order to compare the movement of the 3D head model animation with the facial movement detected from the video signal. The first movement pattern detector 640 may obtain a first movement pattern by calculating the movement of the projected facial landmark on a 2D plane.
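The projection step can be sketched as follows. The disclosure does not name a projection model, so this sketch assumes a simple orthographic projection that drops the depth coordinate, and it assumes the movement is measured relative to the neutral pose of the template. Both choices, and the name `first_movement_pattern`, are illustrative assumptions.

```python
import numpy as np

def first_movement_pattern(landmarks_3d, neutral_3d):
    """Project animated 3D landmarks onto the x-y plane and return 2D displacements.

    landmarks_3d: (T, K, 3) landmark positions of the animated template per frame.
    neutral_3d:   (K, 3) landmark positions of the template in its neutral pose.
    """
    landmarks_3d = np.asarray(landmarks_3d, dtype=float)
    neutral_3d = np.asarray(neutral_3d, dtype=float)
    # Orthographic projection (assumed): keep x and y, discard z.
    projected = landmarks_3d[..., :2]
    neutral_2d = neutral_3d[:, :2]
    return projected - neutral_2d
```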

제2 움직임 패턴 검출부(650)는 학습 데이터 세트(200)의 비디오 신호로부터 얼굴 랜드마크를 검출할 수 있다. The second movement pattern detector 650 may detect a facial landmark from the video signal of the training data set 200 .

일 실시예에서, 제2 움직임 패턴 검출부(650)는 비디오 신호로부터 검출된 임의의 얼굴 움직임과 3D 헤드 모델 애니메이션의 움직임을 비교하기 위하여, 상기 비디오 신호로부터 검출된 얼굴 랜드마크를 미리 정의된 중립 얼굴에 오버레이하여 정렬할 수 있다. 일 실시예에서, 얼굴 랜드마크의 정렬은 Procrustes 분석 또는 아핀(Affine) 변환을 사용하여 수행될 수 있다. 일 실시예에서, 최적의 변환 행렬을 찾기 위해 Kabsch 알고리즘이 사용될 수 있다. In an embodiment, in order to compare arbitrary facial movement detected from the video signal with the movement of the 3D head model animation, the second movement pattern detector 650 may align the facial landmarks detected from the video signal by overlaying them on a predefined neutral face. In an embodiment, the alignment of the facial landmarks may be performed using Procrustes analysis or an affine transformation. In an embodiment, the Kabsch algorithm may be used to find the optimal transformation matrix.
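As an illustration of the alignment step, the following sketch applies the Kabsch algorithm to 2D landmarks. It estimates only a rotation and a translation; full Procrustes analysis would additionally estimate a scale factor, which is omitted here for brevity. The function name `align_to_neutral` and the array layout are assumptions and are not taken from the disclosure.

```python
import numpy as np

def align_to_neutral(landmarks, neutral):
    """Rigidly align detected landmarks (K, 2) to a neutral face (K, 2) via Kabsch."""
    p = np.asarray(landmarks, dtype=float)
    q = np.asarray(neutral, dtype=float)
    p_centroid, q_centroid = p.mean(axis=0), q.mean(axis=0)
    pc, qc = p - p_centroid, q - q_centroid
    # Cross-covariance between the centered point sets.
    h = pc.T @ qc
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, d]) @ u.T
    # Rotate the centered landmarks and move them onto the neutral centroid.
    return pc @ rotation.T + q_centroid
```

After alignment, differences between the aligned landmarks and the neutral face reflect facial movement rather than head pose.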

일 실시예에서, 제2 움직임 패턴 검출부(650)는 상기 정렬된 랜드마크의 움직임을 계산하여 제2 움직임 패턴을 획득할 수 있다. 일 실시예에서, 제2 움직임 패턴 검출부(650)는 상기 미리 정의된 중립 얼굴을 기준으로 정렬된 얼굴 랜드마크의 움직임 변위를 측정하여 제2 움직임 패턴을 획득할 수 있다. In an embodiment, the second movement pattern detector 650 may obtain a second movement pattern by calculating the movement of the aligned landmarks. In an embodiment, the second movement pattern detector 650 may obtain the second movement pattern by measuring the movement displacement of the aligned facial landmarks relative to the predefined neutral face.

학습부(150)는 상기 획득된 제1 움직임 패턴 및 제2 움직임 패턴에 기초하여 비짐 스트림 형성 함수 및 애니메이션 곡선 형성 함수를 계산할 수 있다. 일 실시예에서, 비짐 스트림 형성 함수 및 애니메이션 곡선 형성 함수는 상기 제1 움직임 패턴과 제2 움직임 패턴의 차이를 나타내는 손실 함수에 기초하여 계산될 수 있다. 상기 손실 함수에 의하여, 인공지능 모델(110)이 예측한 움직임이 실제 얼굴의 움직임과 얼마나 유사한지가 측정될 수 있다.The learner 150 may calculate a viseme stream forming function and an animation curve forming function based on the obtained first and second movement patterns. In an embodiment, the viseme stream forming function and the animation curve forming function may be calculated based on a loss function representing a difference between the first movement pattern and the second movement pattern. Using the loss function, it can be measured how similar the movement predicted by the artificial intelligence model 110 is to the actual movement of the face.
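The form of the loss function that expresses the difference between the two movement patterns is not given in the disclosure. A minimal sketch, assuming both patterns are arrays of 2D landmark displacements with identical shape and landmark ordering, is the mean squared Euclidean distance between corresponding landmarks. The name `movement_pattern_loss` is hypothetical.

```python
import numpy as np

def movement_pattern_loss(first_pattern, second_pattern):
    """Mean squared distance between predicted and observed landmark displacements.

    Both inputs are assumed to have shape (T, K, 2) with matching landmark order.
    """
    first = np.asarray(first_pattern, dtype=float)
    second = np.asarray(second_pattern, dtype=float)
    # Squared Euclidean distance per landmark, averaged over frames and landmarks.
    return float(np.mean(np.sum((first - second) ** 2, axis=-1)))
```

A value of zero means the movement predicted for the template exactly matches the movement observed in the video after alignment.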

상술한 방법에 따르면, 얼굴 형태의 차이를 제외한 상대적인 움직임만을 학습에 사용하므로, 3D 헤드 모델과 다른 임의의 얼굴 형태를 학습 데이터로 이용하여 인공지능 모델(110)을 학습시킬 수 있고, 2D 움직임에 기초하여 인공지능 모델(110)을 학습시킬 수 있다. 따라서, 쉽게 구할 수 있는 비디오 데이터를 학습 데이터로 이용할 수 있다. According to the method described above, only relative movement, excluding differences in face shape, is used for training. The artificial intelligence model 110 can therefore be trained using, as training data, arbitrary face shapes that differ from the 3D head model, and it can be trained based on 2D movement. Accordingly, easily obtainable video data can be used as training data.

도 8은 다양한 실시예들에 따른, 음성 신호로부터 애니메이션 헤드 모델을 생성하기 위한 인공지능 모델을 학습시키는 방법의 흐름도이다. 도 8의 각 동작들은 도 1 에 도시된 전자 장치(100), 또는 도 9에 도시된 전자 장치(100) 또는 전자 장치(100)의 프로세서(910)에 의해 수행될 수 있다. FIG. 8 is a flowchart of a method of training an artificial intelligence model for generating an animated head model from a voice signal, according to various embodiments of the present disclosure. Each operation of FIG. 8 may be performed by the electronic device 100 shown in FIG. 1, by the electronic device 100 shown in FIG. 9, or by the processor 910 of the electronic device 100.

도 8을 참조하면, 동작 S810에서, 전자 장치(100)는 음성 신호, 상기 음성 신호에 대응하는 텍스트, 및 상기 음성 신호에 대응하는 비디오 신호를 포함하는 학습 데이터 세트를 획득할 수 있다. 상기 음성 신호, 상기 텍스트, 및 상기 비디오 신호는 다른 얼굴 형태를 가진 다양한 인물의 기록을 포함할 수 있다.Referring to FIG. 8 , in operation S810 , the electronic device 100 may obtain a training data set including a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal. The voice signal, the text, and the video signal may include recordings of various people with different face shapes.

동작 S820에서, 전자 장치(100)는 상기 음성 신호를 인공지능 모델(110)에 입력하여, 상기 인공지능 모델로부터 출력되는 제1 음소 스트림, 상기 제1 음소 스트림에 대응되는 비짐 스트림, 및 상기 비짐 스트림에 포함된 비짐들의 애니메이션 곡선을 획득할 수 있다. In operation S820, the electronic device 100 may input the voice signal to the artificial intelligence model 110 and obtain, as output from the artificial intelligence model, a first phoneme stream, a viseme stream corresponding to the first phoneme stream, and animation curves of the visemes included in the viseme stream.

다양한 실시예들에서, 인공지능 모델(110)은 음성 신호를 전처리하여 음성 신호의 특성을 나타내는 특성 정보를 획득할 수 있다. 인공지능 모델(110)은 상기 특성 정보를 입력 받아, 상기 음성 신호에 대응하는 음소 스트림 및 상기 음소 스트림에 대응하는 비짐(viseme) 스트림을 출력할 수 있다. 인공지능 모델(110)은 상기 비짐 스트림에 포함된 비짐들의 애니메이션 곡선을 출력할 수 있다.In various embodiments, the artificial intelligence model 110 may pre-process the voice signal to obtain characteristic information indicating the characteristics of the voice signal. The artificial intelligence model 110 may receive the characteristic information and output a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream. The artificial intelligence model 110 may output animation curves of visemes included in the viseme stream.

동작 S830에서, 전자 장치(100)는 상기 제1 음소 스트림과 상기 음성 신호의 텍스트를 이용하여, 상기 인공지능 모델(110)을 위한 음소 스트림 형성 함수를 계산할 수 있다.In operation S830 , the electronic device 100 may calculate a phoneme stream forming function for the artificial intelligence model 110 by using the first phoneme stream and the text of the voice signal.

다양한 실시예들에서, 전자 장치(100)는 학습 데이터 세트(200)의 음성 신호에 대응되는 텍스트를 입력받아, 제2 음소 스트림을 검출할 수 있다. 전자 장치(100)는 상기 제1 음소 스트림과 상기 제2 음소 스트림을 비교하여, 인공지능 모델(110)에서 음소 스트림을 예측하는 데 사용하는 음소 스트림 형성 함수(340)를 계산할 수 있다.In various embodiments, the electronic device 100 may receive a text corresponding to the voice signal of the training data set 200 and detect the second phoneme stream. The electronic device 100 may compare the first phoneme stream and the second phoneme stream to calculate the phoneme stream forming function 340 used to predict the phoneme stream in the artificial intelligence model 110 .

동작 S840에서, 전자 장치(100)는 상기 비짐 스트림, 상기 애니메이션 곡선 및 상기 비디오 신호를 이용하여, 상기 인공지능 모델(110)을 위한 비짐 스트림 형성 함수 및 애니메이션 곡선 형성 함수를 계산할 수 있다.In operation S840 , the electronic device 100 may calculate a viseme stream forming function and an animation curve forming function for the artificial intelligence model 110 by using the viseme stream, the animation curve, and the video signal.

다양한 실시예들에서, 전자 장치(100)는 상기 애니메이션 곡선을 3D 템플릿 모델에 적용하여 비짐 애니메이션을 생성할 수 있다. 전자 장치(100)는 상기 비짐 애니메이션에서 얼굴 랜드마크의 움직임 패턴을 검출하여, 제1 움직임 패턴을 획득할 수 있다. 일 실시예에서, 전자 장치(100)는 상기 3D 템플릿 모델의 얼굴 랜드마크를 2D 평면에 투영할 수 있다. 전자 장치(100)는 상기 2D 평면에 투영된 얼굴 랜드마크의 움직임에 기초하여 제1 움직임 패턴을 획득할 수 있다.In various embodiments, the electronic device 100 may generate a viseme animation by applying the animation curve to a 3D template model. The electronic device 100 may acquire a first movement pattern by detecting a movement pattern of a facial landmark in the viseme animation. In an embodiment, the electronic device 100 may project the facial landmark of the 3D template model on a 2D plane. The electronic device 100 may acquire the first movement pattern based on the movement of the facial landmark projected on the 2D plane.

다양한 실시예들에서, 전자 장치(100)는 상기 비디오 신호에서 얼굴 랜드마크의 움직임 패턴을 검출하여, 제2 움직임 패턴을 획득할 수 있다. 일 실시예에서, 전자 장치(100)는 상기 비디오 신호에서 얼굴 랜드마크를 검출할 수 있다. 전자 장치(100)는 상기 비디오 신호의 얼굴 랜드마크를 중립 얼굴에 정렬할 수 있다. 전자 장치(100)는 상기 중립 얼굴에 정렬된 얼굴 랜드마크의 움직임에 기초하여 제2 움직임 패턴을 획득할 수 있다.In various embodiments, the electronic device 100 may obtain a second movement pattern by detecting a movement pattern of a facial landmark from the video signal. In an embodiment, the electronic device 100 may detect a facial landmark from the video signal. The electronic device 100 may align the face landmark of the video signal to the neutral face. The electronic device 100 may acquire the second movement pattern based on the movement of the facial landmark aligned with the neutral face.

다양한 실시예들에서, 전자 장치(100)는 상기 제1 움직임 패턴과 상기 제2 움직임 패턴을 비교하여, 비짐 스트림 형성 함수 및 애니메이션 곡선 형성 함수를 계산할 수 있다.In various embodiments, the electronic device 100 may calculate a viseme stream forming function and an animation curve forming function by comparing the first movement pattern with the second movement pattern.

동작 S850에서, 전자 장치(100)는 상기 제1 음소 스트림, 상기 비짐 스트림, 및 상기 애니메이션 곡선에 기초하여, 상기 인공지능 모델(110)을 위한 음소 정규화 함수를 계산할 수 있다.In operation S850 , the electronic device 100 may calculate a phoneme normalization function for the artificial intelligence model 110 based on the first phoneme stream, the viseme stream, and the animation curve.

동작 S860에서, 전자 장치(100)는 상기 음소 스트림 형성 함수, 상기 비짐 스트림 형성 함수, 상기 애니메이션 곡선 형성 함수, 및 상기 음소 정규화 함수를 이용하여, 상기 인공지능 모델(110)을 갱신할 수 있다. In operation S860, the electronic device 100 may update the artificial intelligence model 110 using the phoneme stream forming function, the viseme stream forming function, the animation curve forming function, and the phoneme normalization function.

도 9는 다양한 실시예들에 따른, 음성 신호로부터 헤드 모델을 애니메이션 생성하도록 구성된 전자 장치의 블록도이다. FIG. 9 is a block diagram of an electronic device configured to generate a head model animation from a voice signal, according to various embodiments of the present disclosure.

도 9를 참조하면, 전자 장치(100)는 적어도 하나의 프로세서(910) 및 메모리(920)를 포함할 수 있다. Referring to FIG. 9 , the electronic device 100 may include at least one processor 910 and a memory 920 .

메모리(920)는, 프로세서(910)의 처리 및 제어를 위한 프로그램을 저장할 수 있고, 전자 장치(100)로 입력되거나 전자 장치(100)로부터 출력되는 데이터를 저장할 수 있다. 다양한 실시예들에서, 메모리(920)는 적어도 하나의 학습된 인공지능 모델을 위한 수치 파라미터들 및 함수들을 저장할 수 있다. 다양한 실시예들에서, 메모리(920)는 적어도 하나의 인공지능 모델을 학습시키기 위한 학습 데이터를 저장할 수 있다.The memory 920 may store a program for processing and controlling the processor 910 , and may store data input to or output from the electronic device 100 . In various embodiments, memory 920 may store numerical parameters and functions for at least one trained artificial intelligence model. In various embodiments, the memory 920 may store training data for training at least one artificial intelligence model.

다양한 실시예들에서, 메모리(920)는 적어도 하나의 프로세서(910)에 의해 실행될 때, 적어도 하나의 프로세서(910)가 음성 신호로부터 헤드 모델 애니메이션을 생성하는 방법을 실행하게 하는 인스트럭션을 저장할 수 있다. 다양한 실시예들에서, 메모리(920)는 적어도 하나의 프로세서(910)에 의해 실행될 때, 적어도 하나의 프로세서(910)가 음성 신호로부터 헤드 모델 애니메이션을 생성하기 위한 인공지능 모델을 학습시키는 방법을 실행하게 하는 인스트럭션을 저장할 수 있다. In various embodiments, the memory 920 may store instructions that, when executed by the at least one processor 910, cause the at least one processor 910 to execute a method of generating a head model animation from a voice signal. In various embodiments, the memory 920 may store instructions that, when executed by the at least one processor 910, cause the at least one processor 910 to execute a method of training an artificial intelligence model for generating a head model animation from a voice signal.

프로세서(910)는, 통상적으로 전자 장치(100)의 전반적인 동작을 제어한다. 예를 들어, 프로세서(910)는, 메모리(920)에 저장된 프로그램들을 실행함으로써, 메모리(920), 통신부(미도시), 입력부(미도시), 출력부 (미도시) 등을 전반적으로 제어할 수 있다. 프로세서(910)는, 메모리(920), 통신부(미도시), 입력부(미도시), 출력부 (미도시) 등을 제어함으로써, 본 개시에서의 전자 장치(100)의 동작을 제어할 수 있다. The processor 910 generally controls the overall operation of the electronic device 100. For example, by executing the programs stored in the memory 920, the processor 910 may control the memory 920, a communication unit (not shown), an input unit (not shown), an output unit (not shown), and the like. By controlling the memory 920, the communication unit (not shown), the input unit (not shown), the output unit (not shown), and the like, the processor 910 may control the operation of the electronic device 100 according to the present disclosure.

구체적으로, 프로세서(910)는, 메모리(920), 통신부(미도시), 또는 입력부(미도시) 등을 통하여, 음성 신호를 획득할 수 있다. 프로세서(910)는 음성 신호로부터 상기 음성 신호의 특성 정보를 획득할 수 있다. Specifically, the processor 910 may acquire a voice signal through the memory 920, a communication unit (not shown), an input unit (not shown), or the like. The processor 910 may obtain characteristic information of the voice signal from the voice signal.

프로세서(910)는 인공지능 모델을 이용하여, 상기 특성 정보로부터 상기 음성 신호에 대응하는 음소 스트림 및 상기 음소 스트림에 대응하는 비짐 스트림을 획득할 수 있다. 일 실시예에서, 프로세서(910)는 음소 스트림 형성 함수에 기초하여 음성 신호로부터 음소 스트림을 예측할 수 있다. 프로세서(910)는 비짐 스트림 형성 함수에 기초하여 음성 신호로부터 비짐 스트림을 예측할 수 있다. 프로세서(910)는 인공지능 모델을 이용하여, 상기 비짐 스트림에 포함된 비짐들의 애니메이션 곡선을 획득할 수 있다. 일 실시예에서, 프로세서(910)는 애니메이션 곡선 형성 함수에 기초하여 비짐 스트림의 비짐들의 애니메이션 곡선을 도출할 수 있다.The processor 910 may obtain a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream from the characteristic information using the artificial intelligence model. In an embodiment, the processor 910 may predict a phoneme stream from a speech signal based on a phoneme stream forming function. The processor 910 may predict a viseme stream from the voice signal based on the viseme stream forming function. The processor 910 may obtain an animation curve of visemes included in the viseme stream by using an artificial intelligence model. In an embodiment, the processor 910 may derive an animation curve of visemes of the viseme stream based on the animation curve forming function.

프로세서(910)는 상기 음소 스트림 및 상기 비짐 스트림을 병합할 수 있다. 프로세서(910)는 상기 애니메이션 곡선을 상기 병합된 음소 및 비짐 스트림의 비짐들에 적용하여 헤드 모델 애니메이션을 생성할 수 있다.The processor 910 may merge the phoneme stream and the viseme stream. The processor 910 may generate a head model animation by applying the animation curve to visemes of the merged phoneme and viseme stream.

한편, 프로세서(910)는, 메모리(920), 통신부(미도시), 또는 입력부(미도시) 등을 통하여, 음성 신호, 상기 음성 신호에 대응하는 텍스트, 및 상기 음성 신호에 대응하는 비디오 신호를 포함하는 학습 데이터 세트를 획득할 수 있다. Meanwhile, the processor 910 may obtain, through the memory 920, a communication unit (not shown), an input unit (not shown), or the like, a training data set including a voice signal, text corresponding to the voice signal, and a video signal corresponding to the voice signal.

프로세서(910)는 상기 음성 신호를 상기 인공지능 모델에 입력하여, 상기 인공지능 모델로부터 출력되는 제1 음소 스트림, 상기 제1 음소 스트림에 대응되는 비짐 스트림, 및 상기 비짐 스트림에 포함된 비짐들의 애니메이션 곡선을 획득할 수 있다. The processor 910 may input the voice signal to the artificial intelligence model and obtain, as output from the artificial intelligence model, a first phoneme stream, a viseme stream corresponding to the first phoneme stream, and animation curves of the visemes included in the viseme stream.

프로세서(910)는 상기 제1 음소 스트림과 상기 음성 신호의 텍스트를 이용하여, 상기 인공지능 모델을 위한 음소 스트림 형성 함수를 계산할 수 있다. 프로세서(910)는 상기 비짐 스트림, 상기 애니메이션 곡선 및 상기 비디오 신호를 이용하여, 상기 인공지능 모델을 위한 비짐 스트림 형성 함수 및 애니메이션 곡선 형성 함수를 계산할 수 있다. 프로세서(910)는 상기 제1 음소 스트림, 상기 비짐 스트림, 및 상기 애니메이션 곡선에 기초하여, 상기 인공지능 모델을 위한 음소 정규화 함수를 계산할 수 있다.The processor 910 may calculate a phoneme stream forming function for the AI model by using the first phoneme stream and the text of the voice signal. The processor 910 may calculate a viseme stream forming function and an animation curve forming function for the artificial intelligence model by using the viseme stream, the animation curve, and the video signal. The processor 910 may calculate a phoneme normalization function for the AI model based on the first phoneme stream, the viseme stream, and the animation curve.

프로세서(910)는 상기 음소 스트림 형성 함수, 상기 비짐 스트림 형성 함수, 상기 애니메이션 곡선 형성 함수, 및 상기 음소 정규화 함수를 이용하여, 상기 인공지능 모델을 갱신함으로써, 인공지능 모델을 학습시킬 수 있다.The processor 910 may train the AI model by updating the AI model using the phoneme stream formation function, the viseme stream formation function, the animation curve formation function, and the phoneme normalization function.

본 개시의 일 양태는 음성 신호로부터 애니메이션 헤드 모델을 생성하는 방법을 제공하며, 상기 방법은 하나 이상의 프로세서에 의해 실행되며, 음성 신호를 수신하는 단계; 상기 음성 신호를 음성 신호 특징 세트로 변환하는 단계; 상기 음성 신호 특징 세트로부터 음성 신호 특징들을 추출하는 단계; 학습된 인공지능 수단으로 상기 음성 신호 특징들을 처리함으로써 음소 스트림 및 상기 음소 스트림의 음소들에 대응되는 비짐 스트림을 도출하는 단계; 상기 학습된 인공지능 수단에 의해, 상기 도출된 비짐 스트림의 비짐에 대한 애니메이션 곡선을 대응되는 음소들에 기초하여 계산하는 단계; 상기 계산된 애니메이션 곡선을 고려하여 상기 도출된 음소 스트림 및 상기 도출된 비짐 스트림을 서로 오버레이함으로써 상기 도출된 음소 스트림 및 상기 도출된 비짐 스트림을 병합하는 단계; 및 상기 계산된 애니메이션 곡선을 사용하여 상기 병합된 음소 및 비짐 스트림의 비짐을 애니메이팅함으로써 헤드 모델의 애니메이션을 형성하는 단계를 포함한다.One aspect of the present disclosure provides a method for generating an animated head model from a speech signal, the method being executed by one or more processors, the method comprising: receiving a speech signal; converting the speech signal into a speech signal feature set; extracting speech signal features from the speech signal feature set; deriving a phoneme stream and a viseme stream corresponding to the phonemes of the phoneme stream by processing the speech signal features with a learned artificial intelligence means; calculating, by the learned artificial intelligence means, an animation curve for the viseme of the derived viseme stream based on corresponding phonemes; merging the derived phoneme stream and the derived viseme stream by overlaying the derived phoneme stream and the derived viseme stream in consideration of the calculated animation curve; and forming an animation of the head model by animating the merged phoneme and viseme of the viseme stream using the calculated animation curve.

추가적인 양태에서, 인공지능 수단의 학습은, 음성 신호, 상기 음성 신호에 대한 대본, 및 상기 음성 신호에 대응하는 비디오 신호를 포함하는 학습 데이터 세트를 수신하는 단계; 상기 음성 신호에 대한 대본으로부터 음소 스트림을 도출하는 단계; 상기 음성 신호를 음성 신호 특징 세트로 변환하는 단계; 상기 음성 신호 특징 세트로부터 음성 신호 특징들을 추출하는 단계; 상기 음성 신호 특징들에 기초하여 음소 스트림 및 상기 음소 스트림의 음소들에 대응하는 비짐 스트림을 도출하는 단계; 상기 음성 신호에 대한 대본으로부터 도출된 음소 스트림과 상기 음성 신호 특징들에 기초하여 도출된 음소 스트림을 비교함으로써 상기 음소 스트림을 형성하는 함수를 계산하는 단계; 상기 음성 신호 특징들에 기초하여 도출된 비짐 스트림의 비짐에 대한 애니메이션 곡선을 계산하는 단계; 상기 계산된 애니메이션 곡선을 미리 정해진 비짐 세트에 적용하는 단계; 상기 계산된 애니메이션 곡선을 적용한 상기 미리 정해진 비짐 세트에서 얼굴 랜드마크의 움직임 패턴을 결정하는 단계; 상기 음성 신호에 대응하는 상기 비디오 신호에서 얼굴 랜드마크의 움직임 패턴을 결정하는 단계; 상기 음성 신호에 대응하는 상기 비디오 신호에서의 얼굴 랜드마크의 움직임 패턴을 미리 결정된 중립 얼굴에 오버레이하는 단계; 상기 미리 결정된 중립 얼굴에 오버레이된 상기 음성 신호에 대응하는 상기 비디오 신호에서의 얼굴 랜드마크의 움직임 패턴과, 상기 미리 정해진 비짐 세트에서 결정된 상기 얼굴 랜드마크의 움직임 패턴을 비교함으로써, 상기 비짐 스트림을 형성하는 함수 및 애니메이션 곡선들을 계산하는 함수를 계산하는 단계; 및 상기 음성 신호 특징들에 기초하여 도출된 음소 스트림, 상기 음성 신호 특징들에 기초하여 도출된 비짐 스트림, 및 상기 계산된 애니메이션 곡선에 기초하여 비짐을 선택하는 함수를 계산하는 단계를 포함한다. In a further aspect, training the artificial intelligence means comprises: receiving a training data set comprising a speech signal, a transcript of the speech signal, and a video signal corresponding to the speech signal; deriving a phoneme stream from the transcript of the speech signal; converting the speech signal into a speech signal feature set; extracting speech signal features from the speech signal feature set; deriving, based on the speech signal features, a phoneme stream and a viseme stream corresponding to the phonemes of the phoneme stream; calculating a function for forming the phoneme stream by comparing the phoneme stream derived from the transcript with the phoneme stream derived based on the speech signal features; calculating animation curves for the visemes of the viseme stream derived based on the speech signal features; applying the calculated animation curves to a predetermined viseme set; determining a movement pattern of facial landmarks in the predetermined viseme set to which the calculated animation curves are applied; determining a movement pattern of facial landmarks in the video signal corresponding to the speech signal; overlaying the movement pattern of the facial landmarks in the video signal on a predetermined neutral face; calculating a function for forming the viseme stream and a function for calculating the animation curves by comparing the movement pattern of the facial landmarks in the video signal, overlaid on the predetermined neutral face, with the movement pattern of the facial landmarks determined in the predetermined viseme set; and calculating a function for selecting a viseme based on the phoneme stream derived based on the speech signal features, the viseme stream derived based on the speech signal features, and the calculated animation curves.

다른 추가적인 양태에서, 상기 음성 신호를 음성 신호 특징 세트로 변환하는 단계 및 상기 음성 신호 특징 세트로부터 음성 신호 특징들을 추출하는 단계는 Mel-Frequency Cepstral Coefficients (MFCC) 방법 또는 추가 사전 학습된 인공지능 수단 중 하나에 의해 수행된다. In another further aspect, converting the speech signal into a speech signal feature set and extracting speech signal features from the speech signal feature set are performed by one of a Mel-Frequency Cepstral Coefficients (MFCC) method or an additional pre-trained artificial intelligence means.

또 다른 추가적인 양태에서, 상기 추가 사전 학습된 인공지능 수단은 순환 신경망, 장단기 메모리 (Long Short-Term Memory, LSTM), 게이트 순환 유닛 (gated recurrent unit, GRU), 이들의 변형, 또는 이들의 조합 중 적어도 하나이다. In a still further aspect, the additional pre-trained artificial intelligence means is at least one of a recurrent neural network, a long short-term memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or a combination thereof.

또 다른 추가적인 양태에서, 상기 학습된 인공지능 수단은 적어도 2개의 블록들을 포함하고, 상기 학습된 인공지능 수단의 적어도 2개의 블록들 중 제1 블록은 상기 음성 신호 특징들을 처리함으로써 음소 스트림 및 상기 음소 스트림의 음소들에 대응하는 비짐 스트림을 도출하는 단계를 수행하고, 상기 학습된 인공지능 수단의 적어도 2개의 블록들 중 제2 블록은 상기 학습된 인공지능 수단에 의해 상기 도출된 비짐 스트림의 비짐을 위한 애니메이션 곡선을 대응되는 음소에 기초하여 계산하는 단계를 수행한다. In a still further aspect, the learned artificial intelligence means comprises at least two blocks, wherein a first block of the at least two blocks performs the step of deriving a phoneme stream and a viseme stream corresponding to the phonemes of the phoneme stream by processing the speech signal features, and a second block of the at least two blocks performs the step of calculating, based on the corresponding phonemes, the animation curves for the visemes of the derived viseme stream.

또 다른 추가적인 양태에서, 상기 학습된 인공지능 수단의 적어도 2개의 블록들 중 제1 블록은 순환 신경망, 장단기 메모리 (Long Short-Term Memory, LSTM), 게이트 순환 유닛 (gated recurrent unit, GRU), 이들의 변형, 또는 이들의 조합 중 적어도 하나이다. In a still further aspect, the first block of the at least two blocks of the learned artificial intelligence means is at least one of a recurrent neural network, a long short-term memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or a combination thereof.

In yet another further aspect, the second block of the at least two blocks of the trained artificial intelligence means is at least one of a recurrent neural network, a long short-term memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or a combination thereof.

In yet another further aspect, computing the animation curves for the visemes in the viseme stream derived from the speech signal features is performed using a Facial Action Coding System (FACS).

Another aspect of the present disclosure provides an electronic computing device comprising: at least one processor; and a memory storing numerical parameters of at least one trained artificial intelligence means and instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of generating an animated head model from a speech signal.

Various embodiments of the present disclosure may be implemented as software including one or more instructions stored in a storage medium (e.g., memory 920) readable by a machine (e.g., electronic device 100). For example, a processor (e.g., processor 910) of the machine (e.g., electronic device 100) may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one called instruction. The one or more instructions may include code generated by a compiler or code executable by an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory' only means that the storage medium is a tangible device and does not include a signal (e.g., an electromagnetic wave); the term does not distinguish between data stored semi-permanently in the storage medium and data stored temporarily.

According to an embodiment, a method according to the various embodiments disclosed herein may be provided as part of a computer program product. The computer program product may be traded as a commodity between a seller and a buyer. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc read only memory (CD-ROM)), or distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a part of the computer program product may be at least temporarily stored in, or temporarily created in, a machine-readable storage medium such as a memory of a manufacturer's server, an application store's server, or a relay server.

According to various embodiments, each of the above-described components (e.g., a module or a program) may include a single entity or a plurality of entities. According to various embodiments, one or more of the above-described components or operations may be omitted, or one or more other components or operations may be added. Alternatively or additionally, a plurality of components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the plurality of components in the same or a similar manner as the corresponding component performed them before the integration. According to various embodiments, operations performed by a module, a program, or another component may be executed sequentially, in parallel, repeatedly, or heuristically; one or more of the operations may be executed in a different order or omitted; or one or more other operations may be added.

Claims

A method for generating a head model animation from a speech signal, the method comprising:
obtaining characteristic information of the speech signal from the speech signal;
obtaining, from the characteristic information, a phoneme stream corresponding to the speech signal and a viseme stream corresponding to the phoneme stream by using an artificial intelligence model;
obtaining animation curves of visemes included in the viseme stream by using the artificial intelligence model;
merging the phoneme stream and the viseme stream; and
generating the head model animation by applying the animation curves to the visemes of the merged phoneme and viseme streams.
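The merging and curve-application steps of this claim can be sketched as follows. The data layout is an assumption made for illustration only: the phoneme stream is taken to be a list of timed labels, the viseme stream a one-to-one aligned list of viseme labels, and each animation curve a function of normalized time in [0, 1] that returns a blend weight. The names `Segment`, `merge_streams`, and `apply_curves` are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Segment:
    phoneme: str
    viseme: str
    start: float  # seconds
    end: float    # seconds

def merge_streams(phonemes: List[Tuple[str, float, float]],
                  visemes: List[str]) -> List[Segment]:
    """Pair each timed phoneme with its viseme (assumes one-to-one alignment)."""
    if len(phonemes) != len(visemes):
        raise ValueError("phoneme and viseme streams must have equal length")
    return [Segment(p, v, s, e) for (p, s, e), v in zip(phonemes, visemes)]

def apply_curves(segments: List[Segment],
                 curves: Dict[str, Callable[[float], float]],
                 fps: int = 30) -> List[Dict[str, float]]:
    """Sample a viseme weight for every output frame of the animation."""
    if not segments:
        return []
    n_frames = int(round(segments[-1].end * fps))
    frames = []
    for i in range(n_frames):
        t = i / fps
        weights = {}
        for seg in segments:
            if seg.start <= t < seg.end:
                # Normalized position inside the segment, in [0, 1).
                u = (t - seg.start) / (seg.end - seg.start)
                weights[seg.viseme] = curves[seg.viseme](u)
        frames.append(weights)
    return frames

def triangle(u: float) -> float:
    """Hypothetical stand-in curve: ramps from 0 up to 1 and back to 0."""
    return 1.0 - abs(2.0 * u - 1.0)
```

Each returned frame is a mapping from viseme label to weight, which a renderer could use to drive the corresponding blend shape of the head model. In the claimed method the curves come from the artificial intelligence model; `triangle` only stands in for them here.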

The method of claim 1, wherein the artificial intelligence model comprises a first artificial intelligence model for obtaining, from the characteristic information, the phoneme stream corresponding to the speech signal and the viseme stream corresponding to the phoneme stream, and a second artificial intelligence model for obtaining the animation curves of the visemes included in the viseme stream.

The method of claim 1, wherein the artificial intelligence model is trained using at least one of a machine learning algorithm, a neural network algorithm, a genetic algorithm, a deep learning algorithm, or a classification algorithm.

The method of claim 1, wherein the artificial intelligence model is trained using only a speech signal, text corresponding to the speech signal, and a video signal corresponding to the speech signal.

The method of claim 1, wherein obtaining the characteristic information of the speech signal from the speech signal is performed using one of a Mel-Frequency Cepstral Coefficients (MFCC) method or another artificial intelligence model.

The method of claim 1, wherein obtaining, from the characteristic information, the phoneme stream corresponding to the speech signal and the viseme stream corresponding to the phoneme stream comprises:
extracting features of the speech signal from the characteristic information;
obtaining the phoneme stream from the features of the speech signal; and
obtaining the viseme stream by selecting a viseme corresponding to each phoneme included in the phoneme stream.
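The last step of this claim, selecting a viseme for each phoneme, can be pictured with a simple lookup. In the claimed method the selection is performed by a learned viseme stream forming function; the fixed table below is only a stand-in, and its phoneme and viseme labels are hypothetical examples rather than the inventory used in the disclosure.

```python
from typing import Dict, List

# Hypothetical many-to-one table: several phonemes share one mouth shape.
PHONEME_TO_VISEME: Dict[str, str] = {
    "P": "PP", "B": "PP", "M": "PP",   # lips closed
    "F": "FF", "V": "FF",              # lower lip against upper teeth
    "AA": "aa", "IY": "I", "UW": "U",  # vowels
    "sil": "sil",                      # silence / neutral face
}

def phonemes_to_visemes(phoneme_stream: List[str],
                        table: Dict[str, str] = PHONEME_TO_VISEME,
                        fallback: str = "sil") -> List[str]:
    """Return one viseme per phoneme; unknown phonemes map to the fallback."""
    return [table.get(p, fallback) for p in phoneme_stream]
```

Because the mapping is many-to-one, the resulting viseme stream has the same length as the phoneme stream but a smaller label set.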

The method of claim 6, wherein extracting the features of the speech signal from the characteristic information is performed using at least one of a convolutional neural network or a recurrent neural network.

The method of claim 6, wherein obtaining the phoneme stream from the features of the speech signal is performed based on a phoneme stream forming function of the artificial intelligence model,
wherein the phoneme stream forming function is learned from a training data set including an arbitrary speech signal and text corresponding to the arbitrary speech signal.

The method of claim 6, wherein obtaining the viseme stream by selecting a viseme corresponding to each phoneme included in the phoneme stream is performed based on a viseme stream forming function of the artificial intelligence model,
wherein the viseme stream forming function is learned from a training data set including an arbitrary speech signal and a video signal corresponding to the arbitrary speech signal.

The method of claim 1, wherein obtaining, from the characteristic information, the phoneme stream corresponding to the speech signal and the viseme stream corresponding to the phoneme stream comprises selecting, by using a phoneme normalization function of the artificial intelligence model, one viseme from among a plurality of visemes that partially correspond to a phoneme.

The method of claim 1, wherein obtaining the animation curves of the visemes included in the viseme stream is performed using a Facial Action Coding System (FACS).

A method for training an artificial intelligence model for generating a head model animation from a speech signal, the method comprising:
obtaining a training data set including a speech signal, text corresponding to the speech signal, and a video signal corresponding to the speech signal;
inputting the speech signal to the artificial intelligence model to obtain a first phoneme stream output from the artificial intelligence model, a viseme stream corresponding to the first phoneme stream, and animation curves of visemes included in the viseme stream;
calculating a phoneme stream forming function for the artificial intelligence model by using the first phoneme stream and the text corresponding to the speech signal;
calculating a viseme stream forming function and an animation curve forming function for the artificial intelligence model by using the viseme stream, the animation curves, and the video signal;
calculating a phoneme normalization function for the artificial intelligence model based on the first phoneme stream, the viseme stream, and the animation curves; and
updating the artificial intelligence model using the phoneme stream forming function, the viseme stream forming function, the animation curve forming function, and the phoneme normalization function.

The method of claim 12, wherein obtaining the first phoneme stream, the viseme stream corresponding to the first phoneme stream, and the animation curves of the visemes included in the viseme stream comprises:
obtaining characteristic information of the speech signal from the speech signal;
inputting the characteristic information to the artificial intelligence model to obtain the first phoneme stream output from the artificial intelligence model and the viseme stream corresponding to the first phoneme stream; and
inputting the viseme stream to the artificial intelligence model to obtain the animation curves of the visemes included in the viseme stream, output from the artificial intelligence model.

The method of claim 12, wherein calculating the phoneme stream forming function comprises:
obtaining a second phoneme stream from the text corresponding to the speech signal; and
calculating the phoneme stream forming function by comparing the first phoneme stream with the second phoneme stream.
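The claim does not name the measure used to compare the first (predicted) and second (text-derived) phoneme streams. One possible illustration, offered purely as an assumption, is a phoneme error rate based on Levenshtein edit distance. A trainable system would more likely need a differentiable loss; this sketch only shows what comparing two phoneme streams can mean in practice.

```python
from typing import Sequence

def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

def phoneme_error_rate(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Edit distance normalized by the length of the reference stream."""
    if not reference:
        raise ValueError("reference phoneme stream must not be empty")
    return edit_distance(predicted, reference) / len(reference)
```

Here `predicted` plays the role of the first phoneme stream output by the model, and `reference` the second phoneme stream obtained from the text.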

The method of claim 12, wherein calculating the viseme stream forming function and the animation curve forming function comprises:
generating a viseme animation by applying the animation curves to a 3D template model;
detecting a movement pattern of facial landmarks in the viseme animation to obtain a first movement pattern;
detecting a movement pattern of facial landmarks in the video signal to obtain a second movement pattern; and
calculating the viseme stream forming function and the animation curve forming function by comparing the first movement pattern with the second movement pattern.

The method of claim 15,
wherein obtaining the first movement pattern comprises:
projecting the facial landmarks of the 3D template model onto a 2D plane; and
obtaining the first movement pattern based on movement of the facial landmarks projected onto the 2D plane, and
wherein obtaining the second movement pattern comprises:
detecting facial landmarks in the video signal;
aligning the facial landmarks in the video signal to a neutral face; and
obtaining the second movement pattern based on movement of the facial landmarks aligned to the neutral face.
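The two movement patterns and their comparison can be sketched under strong simplifying assumptions: an orthographic projection (dropping the depth coordinate) stands in for the projection onto a 2D plane, and a per-frame translation-and-scale normalization stands in for alignment to a neutral face. Rotation is ignored, and normalizing by the scale of all landmarks slightly distorts mouth movement; a practical system would more likely estimate the alignment from stable landmarks only. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Point2 = Tuple[float, float]
Point3 = Tuple[float, float, float]

def project_orthographic(points: List[Point3]) -> List[Point2]:
    """Project 3D landmarks onto the x-y plane by dropping depth (assumption)."""
    return [(x, y) for x, y, _z in points]

def normalize(points: List[Point2]) -> List[Point2]:
    """Move the centroid to the origin and scale to unit RMS radius."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    scale = math.sqrt(sum(x * x + y * y for x, y in centered) / n) or 1.0
    return [(x / scale, y / scale) for x, y in centered]

def movement_pattern(frames: List[List[Point2]],
                     neutral: List[Point2]) -> List[List[Point2]]:
    """Per-frame landmark displacement relative to the neutral face."""
    ref = normalize(neutral)
    return [[(x - rx, y - ry) for (x, y), (rx, ry) in zip(normalize(f), ref)]
            for f in frames]

def pattern_distance(p1: List[List[Point2]], p2: List[List[Point2]]) -> float:
    """Mean squared difference between two movement patterns."""
    total, count = 0.0, 0
    for f1, f2 in zip(p1, p2):
        for (ax, ay), (bx, by) in zip(f1, f2):
            total += (ax - bx) ** 2 + (ay - by) ** 2
            count += 1
    return total / count if count else 0.0
```

In this sketch the first movement pattern would be computed from projected landmarks of the animated 3D template model and the second from landmarks detected in the video; `pattern_distance` is one possible way to compare them, not the measure specified by the claims.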

An electronic device comprising:
a memory storing one or more instructions; and
at least one processor,
wherein the at least one processor executes the one or more instructions to perform the method of generating a head model animation from a speech signal according to any one of claims 1 to 11.