WO2021112365A1 - Method for generating head model animation from voice signal, and electronic device for implementing same

Info

Publication number: WO2021112365A1
Application number: PCT/KR2020/009663
Authority: WO
Inventors: 빅토로비치 글라지스토브이반; 이고레비치 크로토브일리야; 눌라노비치 눌라노브자크시리크; 올레고비치 카라차로브이반; 블라디스라보비치 시뮤틴알렉산드르; 브로니스라보비치 다니레비치알렉시
Original assignee: 삼성전자 주식회사
Priority date: 2019-12-02
Filing date: 2020-07-22
Publication date: 2021-06-10

Abstract

Disclosed are: a method for generating a head model animation from a voice signal using an artificial intelligence model; and an electronic device for implementing same. The disclosed method for generating a head model animation from a voice signal, carried out by the electronic device, comprises the steps of: acquiring characteristics information of a voice signal from the voice signal; by using the artificial intelligence model, acquiring, from the characteristics information, a phoneme stream corresponding to the voice signal, and a viseme stream corresponding to the phoneme stream; by using the artificial intelligence model, acquiring an animation curve of visemes included in the viseme stream; merging the phoneme stream with the viseme stream; and generating a head model animation by applying the animation curve to the visemes of the merged phoneme and viseme stream.

Description

Method for generating head model animation from speech signal and electronic device implementing the same

The present disclosure relates generally to a method of generating computer graphics, and more particularly, to a method of generating a head model animation from a voice signal using an artificial intelligence model, and an electronic device implementing the method.

Augmented and virtual reality applications increasingly animate characters that serve as a user's avatar so that they behave like a real person. For example, a personalized three-dimensional (3D) head model may be created and used in a phone call or virtual chat, or displayed when the user's voice is dubbed into another language.

To this end, a technical solution for generating a head model animation from a voice signal is required. Such a solution should provide high-quality animation in real time and reduce the delay between reception of the voice signal and movement of the head model. It should also reduce the computing resources consumed by these tasks. Such a solution can be provided using an artificial intelligence model.

In general, the conventional head model animation technique has the following problems.

- In order to train an AI model, large amounts of data that are generally difficult to compute or acquire are required.

- Methods of depicting facial movements based on two-dimensional landmarks generally provide very flat animation results due to the lack of three-dimensional information.

- In order to obtain high-quality animation of a virtual character based on human facial movement, many calculations are required due to differences in facial shape.

- It is difficult to generalize animation data to an arbitrary user voice other than a specific person.

- Animated models with high-quality images have high latency.

Accordingly, there is a need for a technology that provides at least one or more of the advantages described below while solving the above problems.

An object of the present disclosure is to provide a method for generating a head model animation from a voice signal using an artificial intelligence model, which can provide the head model animation in real time with low latency and high quality, and an electronic device implementing the method.

In addition, some embodiments of the present disclosure may provide a method of using widely available data for learning the artificial intelligence model, and an electronic device implementing the method.

In addition, some embodiments of the present disclosure may provide a method for generating a head model animation from a voice signal using an artificial intelligence model, which generates a head model animation for an arbitrary voice or for an arbitrary character, and an electronic device implementing the method.

As a technical means for achieving the above technical problem, the first aspect of the present disclosure may propose a method of generating a head model animation from a voice signal. A method of generating a head model animation from the voice signal may include: acquiring characteristic information of the voice signal from the voice signal; obtaining a phoneme stream corresponding to the speech signal and a viseme stream corresponding to the phoneme stream from the characteristic information using an artificial intelligence model; obtaining animation curves of visemes included in the viseme stream by using the artificial intelligence model; merging the phoneme stream and the viseme stream; and generating a head model animation by applying the animation curve to visemes of the merged phoneme and viseme stream.

In addition, the second aspect of the present disclosure may propose a method of training an artificial intelligence model for generating a head model animation from a voice signal. The method of training the artificial intelligence model may include: acquiring a training data set including a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal; inputting the voice signal into the artificial intelligence model to obtain a first phoneme stream output from the artificial intelligence model, a viseme stream corresponding to the first phoneme stream, and animation curves of visemes included in the viseme stream; calculating a phoneme stream forming function for the artificial intelligence model by using the first phoneme stream and the text; calculating a viseme stream forming function and an animation curve forming function for the artificial intelligence model by using the viseme stream, the animation curves, and the video signal; calculating a phoneme normalization function for the artificial intelligence model based on the first phoneme stream, the viseme stream, and the animation curves; and updating the artificial intelligence model using the phoneme stream forming function, the viseme stream forming function, the animation curve forming function, and the phoneme normalization function.

In addition, a third aspect of the present disclosure may provide an electronic device including a memory storing one or more instructions and at least one processor, wherein the at least one processor executes the one or more instructions to perform the method of generating a head model animation from a voice signal.

Further, a fourth aspect of the present disclosure may provide an electronic device including a memory storing one or more instructions and at least one processor, wherein the at least one processor executes the one or more instructions to perform the method of training an artificial intelligence model for generating a head model animation from a voice signal.

FIG. 1 is a schematic diagram of a system for generating a head model animation from a voice signal, according to various embodiments.

FIG. 2 is a schematic diagram of an artificial intelligence model for generating a head model animation from a voice signal, according to various embodiments.

FIG. 3 is a block diagram of a first artificial intelligence model and a second artificial intelligence model, according to various embodiments.

FIG. 4A illustrates a structure of a spatial feature extraction layer according to an embodiment.

FIG. 4B illustrates the structures of a phoneme prediction layer and a viseme prediction layer according to an embodiment.

FIG. 5 is a flowchart of a method of generating a head model animation from a voice signal, according to various embodiments.

FIG. 6 is a schematic diagram of a learning unit for training an artificial intelligence model for generating a head model animation from a voice signal, according to various embodiments.

FIG. 7 is a schematic diagram of calculating a viseme stream forming function and an animation curve forming function from a voice signal and a video signal corresponding to the voice signal, according to various embodiments.

FIG. 8 is a flowchart of a method of training an artificial intelligence model for generating a head model animation from a voice signal, according to various embodiments.

FIG. 9 is a block diagram of an electronic device configured to animate a head model from a voice signal, according to various embodiments.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can easily implement them. However, the present disclosure may be implemented in several different forms and is not limited to the embodiments described herein. And in order to clearly explain the present disclosure in the drawings, parts irrelevant to the description are omitted, and similar reference numerals are attached to similar parts throughout the specification.

In order to facilitate understanding, the following description includes various specific details, but these details are to be regarded as illustrative only. Accordingly, it will be apparent to those skilled in the art that various changes and modifications may be applied to the various embodiments described below without departing from the scope of the present disclosure. In addition, descriptions of well-known functions and structures may be omitted for clarity and conciseness.

Terms used in this specification will be briefly described, and the present disclosure will be described in detail.

The terms used in the present disclosure have been selected, as far as possible, from general terms that are currently in wide use, in consideration of their functions in the present disclosure, but they may vary depending on the intention of a person skilled in the art, precedent, the emergence of new technology, and the like. In specific cases, a term may have been arbitrarily selected by the applicant, in which case its meaning will be described in detail in the corresponding part of the description. Therefore, the terms used in the present disclosure should be defined based on their meaning and the overall contents of the present disclosure, rather than on the simple name of the term.

In the entire specification, when a part "includes" a certain component, it means that other components may be further included, rather than excluding other components, unless otherwise stated. In addition, terms such as “…unit” and “module” described in the specification mean a unit that processes at least one function or operation, which may be implemented as hardware or software, or a combination of hardware and software.

Functions related to artificial intelligence according to the present disclosure may be operated through a processor and a memory. The processor may consist of one or a plurality of processors. The one or more processors may be a general-purpose processor such as a CPU, an application processor (AP), or a digital signal processor (DSP); a graphics-only processor such as a GPU or a vision processing unit (VPU); or an artificial intelligence-only processor such as an NPU. The one or more processors may control input data to be processed according to a predefined operation rule or an artificial intelligence model stored in the memory. Alternatively, when the one or more processors are AI-only processors, they may be designed with a hardware structure specialized for processing a specific AI model.

In the method of generating a head model animation from a voice signal according to the present disclosure, an artificial intelligence model may be used to infer or predict the head model animation corresponding to the voice signal. The processor may preprocess the voice signal data to convert it into a form suitable for use as an input of the artificial intelligence model.

The artificial intelligence model may be processed by an AI-only processor designed with a hardware structure specialized for processing the model. The artificial intelligence model may be created through learning. Here, being created through learning means that a basic artificial intelligence model is trained on a plurality of training data by a learning algorithm, so that a predefined operation rule or artificial intelligence model set to perform a desired characteristic (or purpose) is created.

The artificial intelligence model may be composed of a plurality of neural network layers. Each of the plurality of neural network layers has a plurality of weight values, and a neural network operation is performed through an operation between the operation result of the previous layer and the plurality of weight values. The weights of the plurality of neural network layers may be optimized by the learning result of the artificial intelligence model. For example, the weights may be updated so that a loss value or a cost value obtained from the artificial intelligence model during the learning process is reduced or minimized. The artificial neural network may include a deep neural network (DNN), for example, a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), or a deep Q-network, but is not limited to these examples.

Inference/prediction is a technology for judging information and logically reasoning about and predicting it, and includes knowledge-based reasoning, optimization prediction, preference-based planning, recommendation, and the like.

A phoneme is the smallest unit of sound perceived by a user that distinguishes a word from other words. A viseme is a unit representing the shape of a lip that can be distinguished from others, associated with one or more phonemes. In general, phonemes and visemes do not correspond one-to-one, which means that different voice signals may correspond to the same face shape.
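The many-to-one relationship between phonemes and visemes can be illustrated with a small lookup table. The following is a minimal sketch; the phoneme symbols, the viseme labels, and the `visemes_for` helper are hypothetical and are not taken from the present disclosure.

```python
# Hypothetical phoneme-to-viseme table (all labels are illustrative assumptions).
PHONEME_TO_VISEME = {
    "p": "lips_closed",
    "b": "lips_closed",
    "m": "lips_closed",
    "f": "lip_to_teeth",
    "v": "lip_to_teeth",
    "aa": "jaw_open",
    "uw": "lips_rounded",
}


def visemes_for(phonemes):
    """Map each phoneme to its viseme; unknown phonemes map to 'neutral'."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]


# Three different phonemes share one lip shape, so the mapping is not one-to-one.
print(visemes_for(["p", "b", "m"]))
```

Because several phonemes collapse onto one viseme, the viseme stream alone cannot recover the phoneme stream, which is one reason the two streams are handled separately and merged later.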

Hereinafter, various embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings.

Referring to FIG. 1 , a system for generating a head model animation from the voice signal may include an electronic device 100 .

The electronic device 100 may be a device for generating a head model animation from a voice signal using the artificial intelligence model 110 . The electronic device 100 may be a device for learning the artificial intelligence model 110 using the training data set 200 . According to various embodiments, the electronic device 100 may include an artificial intelligence model 110 , an animation generator 120 , and a learning unit 150 for learning the artificial intelligence model.

The electronic device 100 may receive a voice signal and transmit it as an input of the artificial intelligence model 110 . The voice signal may be received from any available source, such as the Internet, TV or radio broadcasts, smart phones, cell phones, voice recorders, desktop computers, laptops, and the like. In an embodiment, the voice signal may be received in real time by an input unit (not shown) such as a microphone included in the electronic device 100 . In another embodiment, the voice signal may be received from an external electronic device through a network by a communication unit (not shown) included in the electronic device 100 . In another embodiment, the voice signal may be obtained from audio data stored in a memory or a storage device of the electronic device 100 .

The electronic device 100 may receive the training data set 200 and transmit it as an input of the learning unit 150 . The training data set 200 may include a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal. The voice signal, the text, and the video signal may include recordings of various people with different face shapes. The training data set 200 may be provided to the learning unit 150 so that the learning unit 150 can train the artificial intelligence model 110 . The training data set 200 may be stored in a memory or a storage device in the electronic device 100 . Alternatively, the training data set 200 may be stored in a storage device external to the electronic device 100 .

The artificial intelligence model 110 may derive parameters for generating the head model animation from the voice signal.

The artificial intelligence model 110 may pre-process the voice signal and convert it into characteristic information indicating the characteristics of the voice signal. In various embodiments, the artificial intelligence model 110 may obtain characteristic coefficients representing the characteristics of the voice signal from the voice signal.

The artificial intelligence model 110 may receive characteristic information of a voice signal extracted from the voice signal, and may output a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream.

The artificial intelligence model 110 may output animation curves of visemes included in the viseme stream. The animation curve represents the temporal change of animation parameters related to the movement of the head model. In one embodiment, the animation curve may specify the movement of the facial landmark in each viseme animation and the duration of the viseme animation.
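As an illustration of what an animation curve might look like as data, the sketch below stores one animation parameter as keyframes and evaluates it by linear interpolation. The `AnimationCurve` type, the keyframe representation, and the choice of linear interpolation are assumptions made for this example; the disclosure does not prescribe a particular representation.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class AnimationCurve:
    """Piecewise-linear curve of one animation parameter (hypothetical type)."""
    times: list   # keyframe times in seconds, strictly increasing
    values: list  # parameter value at each keyframe, e.g. a viseme weight in [0, 1]

    @property
    def duration(self):
        """Length of the viseme animation covered by the keyframes."""
        return self.times[-1] - self.times[0]

    def evaluate(self, t):
        """Linearly interpolate the parameter at time t, clamping outside the range."""
        if t <= self.times[0]:
            return self.values[0]
        if t >= self.times[-1]:
            return self.values[-1]
        i = bisect_right(self.times, t)  # index of the first keyframe after t
        t0, t1 = self.times[i - 1], self.times[i]
        v0, v1 = self.values[i - 1], self.values[i]
        return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# A viseme that ramps up over 0.1 s and relaxes over the next 0.2 s.
curve = AnimationCurve(times=[0.0, 0.1, 0.3], values=[0.0, 1.0, 0.0])
print(curve.duration, curve.evaluate(0.05), curve.evaluate(0.2))
```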

In various embodiments, the artificial intelligence model 110 may include a first artificial intelligence model that obtains, from the characteristic coefficients, a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream, and a second artificial intelligence model that acquires animation curves of the visemes included in the viseme stream.

In various embodiments, the artificial intelligence model 110 may include one or more numerical parameters and functions for deriving a phoneme stream, a viseme stream, and an animation curve from a speech signal. The numerical parameter may be a weight of each of the plurality of neural network layers constituting the artificial intelligence model 110 . In various embodiments, the numerical parameters and functions may be determined or updated based on data learned by the artificial intelligence model 110 .

In an embodiment, the artificial intelligence model 110 may predict a phoneme stream from a speech signal based on a phoneme stream forming function. The artificial intelligence model 110 may predict a viseme stream from a voice signal based on a viseme stream forming function. The artificial intelligence model 110 may select one viseme from among a plurality of visemes corresponding to a phoneme based on the phoneme normalization function. The artificial intelligence model 110 may derive animation curves of visemes of the viseme stream based on the animation curve forming function.
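The selection of one viseme among several candidates for a phoneme can be sketched as follows. This passage does not specify how the phoneme normalization function ranks candidates, so the example simply assumes that a score per viseme is available and picks the best-scoring candidate; the table `CANDIDATE_VISEMES`, the labels, and the scores are all hypothetical.

```python
# Hypothetical candidates: the visemes that may realize each phoneme.
CANDIDATE_VISEMES = {
    "t": ["tongue_up", "teeth_together"],
    "s": ["teeth_together", "lips_spread"],
}


def select_viseme(phoneme, viseme_scores):
    """Return the candidate viseme of `phoneme` with the highest score.

    `viseme_scores` maps viseme label -> score (assumed to come from the model);
    candidates missing from it are treated as having score 0.0.
    """
    candidates = CANDIDATE_VISEMES[phoneme]
    return max(candidates, key=lambda v: viseme_scores.get(v, 0.0))


scores = {"tongue_up": 0.2, "teeth_together": 0.7, "lips_spread": 0.1}
print(select_viseme("t", scores), select_viseme("s", scores))
```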

The artificial intelligence model 110 may post-process the phoneme stream and the viseme stream. In an embodiment, the AI model 110 may merge the phoneme stream and the viseme stream by overlaying them in consideration of the animation curve.

The artificial intelligence model 110 may be obtained from any available source, such as the Internet, a desktop computer, a laptop, etc., and may be stored in the memory of the electronic device 100 . In an embodiment, the artificial intelligence model 110 may be pre-trained using at least a portion of the data included in the training data set 200 . The artificial intelligence model 110 may be updated using the training data set 200 according to the learning algorithm of the learning unit 150 .

The animation generator 120 may generate a head model animation corresponding to the voice signal by applying the parameters obtained from the artificial intelligence model 110 to the head model.

The animation generator 120 may acquire the merged phoneme and viseme stream and animation curve from the artificial intelligence model 110 . The animation generator 120 may generate a head model animation by applying an animation curve to visemes included in the merged phoneme and viseme stream.

In various embodiments, the animation generator 120 may generate a head model animation based on a predefined head model. In an embodiment, the predefined head model may be any 3D character model based on a Facial Action Coding System (FACS). FACS is a system for classifying human facial movements. Using FACS, arbitrary facial expressions can be coded by decomposing them into specific action units and their temporal divisions. For example, in the predefined head model, each viseme may be defined as a FACS coefficient.

In an embodiment, the animation generator 120 may determine a viseme set of a predefined head model based on the merged phoneme and viseme stream. The animation generator 120 may generate the head model animation by applying the animation curve to the viseme set of the predefined head model.
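A minimal sketch of applying an animation curve to a FACS-based viseme set is given below. The action unit (AU) numbers are standard FACS codes (AU18 lip pucker, AU24 lip pressor, AU26 jaw drop), but the coefficient values, the `triangle` curve, the frame rate, and the `animate` helper are assumptions made purely for illustration.

```python
# Hypothetical viseme definitions as FACS action-unit (AU) coefficients.
# The coefficient values are made up for illustration.
VISEME_FACS = {
    "lips_closed": {"AU24": 1.0},
    "jaw_open": {"AU26": 0.8},
    "lips_rounded": {"AU18": 0.9, "AU26": 0.2},
}


def triangle(u):
    """Toy animation curve on normalized time u in [0, 1]: ramp up, then down."""
    return 1.0 - abs(2.0 * u - 1.0)


def animate(segments, fps=10, curve=triangle):
    """Turn (viseme, start_s, end_s) segments into per-frame AU weights."""
    frames = []
    for viseme, start, end in segments:
        n = max(1, round((end - start) * fps))  # frames in this segment
        for k in range(n):
            u = (k + 0.5) / n  # sample the curve at frame centers
            w = curve(u)
            frames.append({au: c * w for au, c in VISEME_FACS[viseme].items()})
    return frames


frames = animate([("lips_closed", 0.0, 0.2), ("jaw_open", 0.2, 0.6)])
print(len(frames), frames[0])
```

Each frame is a dictionary of AU weights that a FACS-based head model could consume directly, which is why the same output could drive any character rigged with the same action units.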

The learning unit 150 may train the artificial intelligence model 110 by using the training data set 200 .

The learning unit 150 may acquire the training data set 200 including a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal. The learning unit 150 may obtain a phoneme stream, a viseme stream, and an animation curve output from the AI model 110 by inputting the training data set 200 into the AI model 110 .

The learning unit 150 may compare the phoneme stream generated by the artificial intelligence model 110 from the voice signal of the training data set 200 with the text of the training data set 200, evaluate the result, and update the artificial intelligence model 110 based on the evaluation. In various embodiments, the learning unit 150 may calculate a phoneme stream forming function for the artificial intelligence model 110 using the first phoneme stream and the text.

The learning unit 150 may compare the 3D head model animation generated by the artificial intelligence model 110 from the voice signal of the training data set 200 with the video signal of the training data set 200, evaluate the result, and update the artificial intelligence model 110 based on the evaluation. In various embodiments, the learning unit 150 may calculate a viseme stream forming function and an animation curve forming function for the artificial intelligence model 110 by using the viseme stream, the animation curve, and the video signal. In various embodiments, the learning unit 150 may calculate a phoneme normalization function of the artificial intelligence model 110 based on the first phoneme stream, the viseme stream, and the animation curve.

In various embodiments, the learning unit 150 may update the artificial intelligence model 110 by using the phoneme stream forming function, the viseme stream forming function, the animation curve forming function, and the phoneme normalization function.
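One way to picture the update step is to treat each of the four functions named above as a scalar loss term and to combine them into a single training objective. This interpretation, the use of cross-entropy for the phoneme term, the weighted sum, and every numeric value below are assumptions for illustration only; the disclosure does not state them in this passage.

```python
import math


def cross_entropy(predicted_probs, target_indices):
    """Mean negative log-likelihood of the target class over frames."""
    eps = 1e-12  # avoid log(0)
    total = 0.0
    for probs, target in zip(predicted_probs, target_indices):
        total -= math.log(probs[target] + eps)
    return total / len(target_indices)


def total_loss(terms, weights=None):
    """Weighted sum of named loss terms; weights default to 1.0 (an assumption)."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())


# Two frames, three phoneme classes; targets would come from the aligned text.
phoneme_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
phoneme_targets = [0, 1]

terms = {
    "phoneme_stream": cross_entropy(phoneme_probs, phoneme_targets),
    "viseme_stream": 0.30,          # placeholder value, not a real measurement
    "animation_curve": 0.12,        # placeholder value, not a real measurement
    "phoneme_normalization": 0.05,  # placeholder value, not a real measurement
}
print(total_loss(terms, weights={"animation_curve": 2.0}))
```

In a real training loop the combined value would be minimized by a gradient-based optimizer, consistent with the earlier statement that weights are updated so that a loss or cost value is reduced.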

The electronic device 100 may be, but is not limited to, a smart phone, a tablet PC, a PC, a smart TV, a mobile phone, a personal digital assistant (PDA), a laptop, a media player, a micro server, a global positioning system (GPS) device, an e-book terminal, a digital broadcasting terminal, a navigation device, a kiosk, an MP3 player, a digital camera, a consumer electronics device, or another mobile or non-mobile computing device.

Although the electronic device 100 is illustrated in FIG. 1 as a single device, it is not necessarily limited thereto. The electronic device 100 may be a set of one or more physically separated devices that are functionally connected and perform the above-described operations.

Referring to FIG. 2 , the artificial intelligence model 110 may include a preprocessor 210 , a first artificial intelligence model 220 , a second artificial intelligence model 230 , and a postprocessor 240 .

The preprocessor 210 may preprocess the voice signal so that the voice signal can be used to generate the head model animation. The preprocessor 210 may process the acquired voice signal into a preset format so that the first artificial intelligence model 220 may use the acquired voice signal to generate the head model animation.

In various embodiments, the preprocessor 210 may preprocess the voice signal and convert it into characteristic coefficients indicating characteristics of the voice signal. The characteristic coefficients may be input to the first artificial intelligence model 220 and used to predict a phoneme stream and a viseme stream corresponding to a voice signal.

In an embodiment, the preprocessor 210 may obtain the characteristic coefficients by transforming the voice signal using a Mel-Frequency Cepstral Coefficients (MFCC) method. MFCC is a technique for extracting features by analyzing the short-term spectrum of a sound; the characteristic coefficients may be obtained by a linear cosine transform of the log power spectrum on a nonlinear mel scale of frequency. The MFCC method is not significantly affected by variability across speakers and recording conditions, does not require a separate learning process, and is fast to compute. Since the MFCC method is known in the art, a detailed description thereof is omitted. In another embodiment, the characteristic coefficients may be obtained using another voice feature extraction method, for example, Perceptual Linear Prediction or Linear Predictive Coding.
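For readers unfamiliar with MFCC, the following standard-library sketch computes coefficients for a single frame: windowing, power spectrum, triangular mel filterbank, logarithm, and a DCT-II. The frame length, number of filters, number of coefficients, and the naive DFT are choices made for brevity here, not parameters from the disclosure; practical code would normally rely on an FFT from a numerical library.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum of a Hamming-windowed frame via a naive DFT (O(n^2))."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(abs(s) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mels = [i * top / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):   # rising edge of the triangle
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):   # falling edge of the triangle
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank


def mfcc(frame, sample_rate, n_filters=8, n_coeffs=5):
    """MFCCs of one frame: log mel-band energies followed by a DCT-II."""
    power = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    log_energies = [math.log(sum(f * p for f, p in zip(filt, power)) + 1e-10)
                    for filt in bank]
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energies))
            for c in range(n_coeffs)]


# Toy input: a 64-sample frame of a 440 Hz tone sampled at 8 kHz.
sr = 8000
frame = [math.sin(2 * math.pi * 440 * i / sr) for i in range(64)]
coeffs = mfcc(frame, sr)
print(len(coeffs))
```

Running `mfcc` over successive overlapping frames of a recording would yield the sequence of characteristic coefficients that the text describes as the input to the first artificial intelligence model 220.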

In another embodiment, the preprocessor 210 may obtain the characteristic coefficients by inputting the voice signal to another pre-trained artificial intelligence model. The additional pre-trained artificial intelligence model may be at least one of a recurrent neural network (RNN), a Long Short-Term Memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or any combination thereof.

The first artificial intelligence model 220 may output a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream from the preprocessed voice signal provided by the preprocessor 210 . In one embodiment, the first artificial intelligence model 220 may be at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), a Long Short-Term Memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or any combination thereof.

In various embodiments, the first artificial intelligence model 220 may predict a phoneme stream from characteristic coefficients of a speech signal based on a phoneme stream forming function. In various embodiments, the first artificial intelligence model 220 may predict the viseme stream corresponding to the phoneme stream from the characteristic coefficients of the speech signal based on the viseme stream forming function. In various embodiments, the first AI model 220 may select one viseme from among a plurality of visemes corresponding to a phoneme based on a phoneme normalization function.

The second artificial intelligence model 230 may receive the viseme stream generated by the first artificial intelligence model 220 and output animation curves of visemes included in the viseme stream. The animation curve represents the temporal change of animation parameters related to the movement of the head model. In one embodiment, the animation curve may specify the movement of the facial landmark in each viseme animation and the duration of the viseme animation. The animation curve may be input to the animation generator 120 and applied to the head model to generate a head model animation.

In one embodiment, the second artificial intelligence model 230 may be at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), a Long Short-Term Memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or any combination thereof.

In various embodiments, the second artificial intelligence model 230 may derive animation curves of visemes of the viseme stream based on the animation curve forming function. In an embodiment, the animation curve may be calculated using a Facial Action Coding System (FACS). An animation curve calculated using FACS can be applied to any FACS-based head model to generate a head model animation.

The postprocessor 240 may post-process the phoneme stream and the viseme stream and merge them. The merged phoneme and viseme stream output from the postprocessor 240 may be input to the animation generator 120 and used to generate a head model animation.

In various embodiments, the postprocessor 240 may merge the phoneme stream and the viseme stream by overlaying them in consideration of the animation curve. In the merged phoneme and viseme stream, each phoneme may be associated with a corresponding viseme. The duration of each phoneme and its associated viseme in the merged stream may be specified by the animation curve of the viseme. In an embodiment, the postprocessor 240 may use, for merging, an arbitrary function that receives two inputs and returns one output.
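The merge can be pictured as pairing each phoneme with its viseme and laying the pairs out on a timeline whose segment lengths come from the animation curves. The sketch below assumes that the two streams are already aligned index by index and that one duration per viseme is available; the `Segment` type and `merge_streams` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    phoneme: str
    viseme: str
    start: float  # seconds
    end: float    # seconds


def merge_streams(phonemes, visemes, durations):
    """Pair the i-th phoneme with the i-th viseme and lay the pairs out in time.

    `durations[i]` is assumed to be the duration taken from the animation
    curve of the i-th viseme.
    """
    if not (len(phonemes) == len(visemes) == len(durations)):
        raise ValueError("streams must have the same length")
    merged, t = [], 0.0
    for p, v, d in zip(phonemes, visemes, durations):
        merged.append(Segment(p, v, t, t + d))
        t += d
    return merged


for seg in merge_streams(["m", "aa"], ["lips_closed", "jaw_open"], [0.08, 0.20]):
    print(seg)
```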

Meanwhile, at least one of the preprocessor 210 , the first artificial intelligence model 220 , the second artificial intelligence model 230 , and the postprocessor 240 included in the artificial intelligence model 110 may be manufactured in the form of at least one hardware chip and mounted in an electronic device. For example, at least one of them may be manufactured in the form of a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as a part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics-only processor (e.g., a GPU) and mounted in the various electronic devices described above.

In addition, the preprocessor 210 , the first artificial intelligence model 220 , the second artificial intelligence model 230 , and the postprocessor 240 may be mounted in one electronic device 100 , or may each be mounted in separate electronic devices. For example, some of the preprocessor 210 , the first artificial intelligence model 220 , the second artificial intelligence model 230 , and the postprocessor 240 may be included in the electronic device 100 , and the rest may be included in a server.

In addition, at least one of the preprocessor 210 , the first artificial intelligence model 220 , the second artificial intelligence model 230 , and the postprocessor 240 may be implemented as a software module. When at least one of them is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable medium. In this case, the at least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, a part of the at least one software module may be provided by the OS, and the remaining part may be provided by a predetermined application.

Referring to FIG. 3 , the first artificial intelligence model 220 may include a spatial feature extraction layer 310 , a temporal feature extraction layer 320 , a phoneme prediction layer 330 , a phoneme stream forming function 340 , a viseme prediction layer 350 , a viseme stream forming function 360 , and a phoneme normalization function 370 . The second artificial intelligence model 230 may include an animation curve prediction layer 380 and an animation curve forming function 390 .

The spatial feature extraction layer 310, the temporal feature extraction layer 320, the phoneme prediction layer 330, the viseme prediction layer 350, and the animation curve prediction layer 380 may each be at least a part of a neural network that performs a specific function. The phoneme stream forming function 340, the viseme stream forming function 360, the phoneme normalization function 370, and the animation curve forming function 390 may be functions used to derive results from one or more layers included in the artificial intelligence model or to evaluate the derived results.

The spatial feature extraction layer 310 and the temporal feature extraction layer 320 may extract features of the voice signal from the characteristic information of the input voice signal. The characteristic information of the voice signal may be characteristic coefficients of the voice signal output from the pre-processing unit 210.

The spatial feature extraction layer 310 may process the characteristic information of the input voice signal to extract spatial features. In an embodiment, a convolutional neural network (CNN) or a recurrent neural network (RNN) having fully connected layers and nonlinearity may be used for the spatial feature extraction layer 310. For example, a layer structure as shown in FIG. 4A may be used. However, the present disclosure is not limited thereto, and any differentiable layer may be added. In an embodiment, the spatial feature extraction layer 310 may be pre-trained.

The temporal feature extraction layer 320 may process the extracted spatial features to extract temporal features. In an embodiment, a recurrent neural network (RNN) with fully connected layers and nonlinearity may be used for the temporal feature extraction layer 320. For example, a three-layer Long Short-Term Memory (LSTM) with dropout may be used.

Two independent streams, a phoneme stream and a viseme stream, may be predicted based on the features of the voice signal extracted through the spatial feature extraction layer 310 and the temporal feature extraction layer 320. The phoneme prediction layer 330 may derive a phoneme stream corresponding to the voice signal from the extracted features of the voice signal. The viseme prediction layer 350 may select a viseme corresponding to each phoneme included in the phoneme stream to derive a viseme stream corresponding to the voice signal.

In various embodiments, the phoneme prediction layer 330 may predict a phoneme stream from a characteristic of a speech signal based on the phoneme stream forming function 340 . In an embodiment, the phoneme stream forming function 340 may be calculated from a loss function that measures how similar the phoneme stream predicted by the first AI model 220 is to an actual correct value. In an embodiment, the phoneme stream forming function 340 may be learned from a training data set including an arbitrary voice signal and text corresponding to the arbitrary voice signal.

In various embodiments, the viseme prediction layer 350 may predict the viseme stream corresponding to the phoneme stream from the characteristics of the speech signal based on the viseme stream forming function 360 . In one embodiment, the viseme stream forming function 360 may be calculated from a loss function that measures how similar the viseme stream predicted by the first artificial intelligence model 220 is to the actual correct value. In an embodiment, the viseme stream forming function 360 may be learned from a training data set including an arbitrary voice signal and a video signal corresponding to the arbitrary voice signal.

When there are a plurality of visemes that may partially correspond to one phoneme, the viseme prediction layer 350 may select an appropriate viseme from among them. In various embodiments, the viseme prediction layer 350 may select one viseme from among a plurality of visemes corresponding to a phoneme based on the phoneme normalization function 370 . In an embodiment, the phoneme normalization function 370 may be a function that predicts a probability distribution of a phoneme and gives a penalty to a basic shape corresponding to a phoneme that is less likely to be used. In an embodiment, the phoneme normalization function 370 may be calculated by a regularization method based on a training data set including an arbitrary speech signal.
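
The selection described above can be illustrated with a minimal sketch. It assumes each candidate viseme has a model score and an estimated probability of use, and it subtracts a penalty that grows as that probability shrinks. The names `select_viseme`, `usage_prob`, and `penalty_weight` are hypothetical and do not come from the disclosure, which does not specify the exact form of the penalty.

```python
import math

def select_viseme(scores, usage_prob, penalty_weight=1.0):
    """Pick one viseme among the candidates for a phoneme.

    scores: dict viseme -> model score, higher is better (hypothetical).
    usage_prob: dict viseme -> estimated probability of use (hypothetical).
    A candidate with a low usage probability receives a larger penalty.
    """
    best, best_value = None, -math.inf
    for viseme, score in scores.items():
        # -log(p) grows as the candidate becomes less likely to be used.
        penalty = -math.log(max(usage_prob.get(viseme, 0.0), 1e-9))
        value = score - penalty_weight * penalty
        if value > best_value:
            best, best_value = viseme, value
    return best
```

With `penalty_weight=0.0` the sketch reduces to picking the highest-scoring candidate, so the weight controls how strongly rarely used shapes are discouraged.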

In one embodiment, any possible layer structure may be used for the phoneme prediction layer 330 and the viseme prediction layer 350 , including a stack of linear layers, non-linear layers, and other differentiable layers. For example, two linear layers fully connected to a Rectified Linear Unit (ReLU) as shown in FIG. 4B may be used as predictors.
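
As a rough illustration of such a predictor, the sketch below runs a forward pass through two fully connected layers with a ReLU between them. Placing the ReLU between the two layers is an assumption, since FIG. 4B is not reproduced here, and the function names and weight layout are illustrative only.

```python
def linear(x, weights, bias):
    """Fully connected layer: y[j] = sum_i x[i] * weights[j][i] + bias[j]."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]

def predictor(x, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear; returns unnormalized class scores."""
    return linear(relu(linear(x, w1, b1)), w2, b2)
```

In a phoneme or viseme predictor, the output vector would hold one score per phoneme or viseme class for the current frame.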

In an embodiment, the phoneme prediction layer 330 and the viseme prediction layer 350 may share weights with the temporal feature extraction layer 320. Because both streams share weights with the preceding layer, a regularizing effect on the model can be obtained according to the characteristics of the predicted parameters.

The animation curve prediction layer 380 may predict animation curves of visemes included in the viseme stream from the viseme stream. The animation curve represents the temporal change of animation parameters related to the movement of the head model. In one embodiment, the animation curve may specify the movement of the facial landmark in each viseme animation and the duration of the viseme animation.

In various embodiments, the animation curve prediction layer 380 may derive an animation curve of the visemes of the viseme stream based on the animation curve forming function 390. In an embodiment, the animation curve forming function 390 may be calculated from a loss function that measures how similar the motion represented by the animation curve derived by the second artificial intelligence model 230 is to the motion of a real person. In an embodiment, the animation curve forming function 390 may be learned from a training data set including an arbitrary voice signal and a video signal corresponding to the arbitrary voice signal.

In an embodiment, the animation curve may be calculated using a Facial Action Coding System (FACS). The animation curve prediction layer 380 may derive an animation curve defined by using FACS coefficients for each viseme. Animation curves defined using FACS coefficients can be applied to any FACS-based head model. However, the animation curve is not necessarily limited to being calculated using FACS, and any suitable method may be used for calculating the animation curve.
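
One way to picture an animation curve defined with FACS coefficients is as a set of keyframes per action unit that is sampled over time. The sketch below is an illustration under that assumption: the keyframe values, the dictionary layout, and the helper names `sample_curve` and `pose_at` are hypothetical. AU26 (jaw drop) and AU12 (lip corner puller) are standard FACS action units used here only as examples.

```python
from bisect import bisect_right

def sample_curve(keyframes, t):
    """Linearly interpolate a curve given as sorted (time, value) keyframes."""
    times = [k[0] for k in keyframes]
    if t <= times[0]:
        return keyframes[0][1]
    if t >= times[-1]:
        return keyframes[-1][1]
    i = bisect_right(times, t)
    (t0, v0), (t1, v1) = keyframes[i - 1], keyframes[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# Hypothetical curve for one viseme: FACS action unit -> keyframes.
viseme_curve = {
    "AU26_jaw_drop": [(0.0, 0.0), (0.1, 0.8), (0.2, 0.0)],
    "AU12_lip_corner_puller": [(0.0, 0.0), (0.2, 0.2)],
}

def pose_at(curve, t):
    """FACS coefficients of the head model at time t (seconds)."""
    return {au: sample_curve(kf, t) for au, kf in curve.items()}
```

Because the pose is expressed only in FACS coefficients, the same sampled values could drive any head model that exposes those action units.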

FIG. 5 is a flowchart of a method of generating a head model animation from a voice signal, according to various embodiments. Each operation of FIG. 5 may be performed by the electronic device 100 shown in FIG. 1, or by the electronic device 100 shown in FIG. 9 or the processor 910 of the electronic device 100.

Referring to FIG. 5, in operation S510, the electronic device 100 may obtain characteristic information of a voice signal from the voice signal. In various embodiments, the electronic device 100 may pre-process the voice signal and convert it into characteristic information indicating characteristics of the voice signal, for example, characteristic coefficients. In an embodiment, the electronic device 100 may obtain the characteristic coefficients by transforming the voice signal by a Mel-Frequency Cepstral Coefficients (MFCC) method. In another embodiment, the electronic device 100 may obtain the characteristic coefficients by inputting the voice signal to another pre-trained artificial intelligence model.
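
As a small illustration of the mel scale that underlies the MFCC method, the sketch below converts between hertz and mels using one common convention, 2595 * log10(1 + f / 700), and lists filter centre frequencies spaced evenly in mels. It is not a full MFCC implementation. The function names and the 8000 Hz default upper bound are assumptions; the disclosure does not specify filterbank parameters.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (one common convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_min=0.0, f_max=8000.0):
    """Centre frequencies (Hz) of n_filters filters spaced evenly in mels."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]
```

The centres are denser at low frequencies, which is why mel-based coefficients resolve the lower part of the speech spectrum more finely.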

In operation S520 , the electronic device 100 may obtain a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream from the characteristic information using the artificial intelligence model.

In various embodiments, the electronic device 100 may predict a phoneme stream from characteristic coefficients of a speech signal based on a phoneme stream forming function. The phoneme stream forming function may be learned from a training data set including an arbitrary voice signal and a text corresponding to the arbitrary voice signal.

In various embodiments, the electronic device 100 may predict the viseme stream corresponding to the phoneme stream from the characteristic coefficients of the speech signal based on the viseme stream forming function. The viseme stream forming function may be learned from a training data set including an arbitrary voice signal and a video signal corresponding to the arbitrary voice signal.

In various embodiments, the electronic device 100 may select one viseme from among a plurality of visemes corresponding to a phoneme based on a phoneme normalization function.

In operation S530, the electronic device 100 may obtain an animation curve of visemes included in the viseme stream using the artificial intelligence model. In various embodiments, the electronic device 100 may derive animation curves of visemes of the viseme stream based on the animation curve forming function. In an embodiment, the animation curve may be calculated using a Facial Action Coding System (FACS).

In operation S540, the electronic device 100 may merge the phoneme stream and the viseme stream. In an embodiment, the AI model 110 may merge the phoneme stream and the viseme stream by overlaying them in consideration of the animation curve.
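
The overlay of the two streams can be sketched as an intersection of time segments. The sketch assumes each stream is a list of `(start, end, label)` segments on a shared timeline, which is a representation chosen for illustration. It does not model the influence of the animation curve on the merge, which the disclosure mentions but does not detail.

```python
def merge_streams(phonemes, visemes):
    """Overlay two streams of (start, end, label) segments on one timeline.

    Returns (start, end, phoneme, viseme) tuples for every interval in
    which both streams carry a label.
    """
    merged = []
    for p_start, p_end, phoneme in phonemes:
        for v_start, v_end, viseme in visemes:
            start, end = max(p_start, v_start), min(p_end, v_end)
            if start < end:  # the two segments overlap in time
                merged.append((start, end, phoneme, viseme))
    return sorted(merged)
```

For example, a phoneme segment that spans a viseme boundary is split into two merged segments, one per viseme.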

In operation S550, the electronic device 100 may generate a head model animation by applying the animation curve to visemes of the merged phoneme and viseme stream. In various embodiments, the electronic device 100 may generate a head model animation based on a predefined head model. In an embodiment, the predefined head model may be any 3D character model based on the Facial Action Coding System (FACS).

In an embodiment, the electronic device 100 may determine a viseme set of a predefined head model based on the merged phoneme and viseme stream, and may generate a head model animation by applying the animation curve to the viseme set.

Referring to FIG. 6, the learning unit 150 for training the artificial intelligence model 110 for generating an animated head model from a voice signal may include a phoneme detection unit 610, a phoneme stream forming function calculation unit 620, an animation generation unit 630, a first movement pattern detection unit 640, a second movement pattern detection unit 650, a viseme stream forming function calculation unit 660, an animation curve forming function calculation unit 670, and a phoneme normalization function calculation unit 680.

The learning unit 150 may acquire the training data set 200 including a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal. The voice signal, the text signal, and the video signal may include recordings of various people with different face shapes. The voice signal, the text signal, and the video signal may each be separately processed for multi-objective learning. The processing of the voice signal, the processing of the text, and the processing of the video signal may be performed in parallel or sequentially according to the configuration of the electronic device 100 and its calculation capability.

The artificial intelligence model 110 may receive the voice signal of the training data set 200 and obtain a first phoneme stream corresponding to the voice signal, a viseme stream corresponding to the first phoneme stream, and animation curves of the visemes included in the viseme stream.

In various embodiments, the artificial intelligence model 110 may obtain the first phoneme stream, the viseme stream, and the animation curve from the speech signal of the training data set 200 by the same process as described above. For example, the artificial intelligence model 110 may pre-process the voice signal to obtain characteristic information indicating the characteristics of the voice signal. The artificial intelligence model 110 may receive the characteristic information and output a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream. The artificial intelligence model 110 may output animation curves of visemes included in the viseme stream.

The phoneme detector 610 may receive a text corresponding to the voice signal of the training data set 200 and detect the second phoneme stream. The operation of detecting phonemes from text may be performed by any known method.

The phoneme stream forming function calculation unit 620 may compare the first phoneme stream and the second phoneme stream to calculate the phoneme stream forming function 340 used to predict the phoneme stream in the artificial intelligence model 110. In an embodiment, the phoneme stream forming function 340 may be calculated using a loss function by comparing the first phoneme stream and the second phoneme stream.
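
A minimal sketch of such a comparison is given below, using cross-entropy as the loss. The choice of cross-entropy and the assumption that both streams are already aligned frame by frame are illustrative; the disclosure only states that a loss function compares the two streams. The name `phoneme_stream_loss` is hypothetical.

```python
import math

def phoneme_stream_loss(predicted, target):
    """Mean cross-entropy between predicted distributions and labels.

    predicted: list of dicts phoneme -> probability (first phoneme stream).
    target: list of phoneme labels from the text (second phoneme stream).
    Both lists are assumed to be aligned frame by frame.
    """
    if len(predicted) != len(target):
        raise ValueError("streams must be frame-aligned")
    total = 0.0
    for dist, label in zip(predicted, target):
        # Clamp to avoid log(0) when the label got zero probability.
        total += -math.log(max(dist.get(label, 0.0), 1e-9))
    return total / len(target)
```

The loss is zero when the model assigns probability 1 to every phoneme detected from the text and grows as the predictions diverge from it.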

The animation generator 630 may generate a viseme animation by applying the animation curve output from the artificial intelligence model 110 to the 3D template model. In various embodiments, the animation generator 630 may obtain a viseme animation by applying an animation curve to visemes included in the viseme stream in the same manner as the animation generator 120 .

In various embodiments, the animation generator 630 may acquire a viseme animation based on a predefined head model. For example, the animation generator 630 may obtain a viseme animation of each viseme by applying an animation curve to the 3D template head model.

The first movement pattern detector 640 may acquire the first movement pattern by detecting the movement pattern of the facial landmark in the obtained viseme animation. In an embodiment, the facial landmark may be predefined in the predefined head model. The movement parameter indicated by the animation curve may designate the movement pattern of the defined facial landmark.

The second movement pattern detector 650 may receive a video signal corresponding to the voice signal of the training data set 200 and detect a facial landmark from the video signal. In one embodiment, the facial landmark may be detected by a landmark detector. The landmark detector may perform any conventional landmark detection method.

The second movement pattern detector 650 may obtain a second movement pattern by measuring the movement displacement of the detected facial landmark. In an embodiment, the movement displacement may be measured based on an average of the training data set 200 or a specific face selected during training. In an embodiment, in order to train the artificial intelligence model 110 independently of the face shape, the second movement pattern detector 650 may measure the movement displacement of the facial landmark based on the face shape acquired from the first frame of the video signal or on an estimated face shape.
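
The displacement measurement can be sketched as a subtraction of a reference shape from every frame. The sketch assumes landmarks are given as 2D `(x, y)` positions per frame and uses the first frame as the default reference, matching one of the options above. The function name and data layout are illustrative.

```python
def movement_pattern(frames, reference=None):
    """Per-frame landmark displacement relative to a reference shape.

    frames: list of frames, each a list of (x, y) landmark positions.
    reference: reference shape; defaults to the first frame.
    """
    ref = frames[0] if reference is None else reference
    return [[(x - rx, y - ry) for (x, y), (rx, ry) in zip(frame, ref)]
            for frame in frames]
```

Because only differences from the reference are kept, two speakers with different face shapes but similar mouth motion yield similar patterns.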

The viseme stream forming function calculation unit 660 may calculate the viseme stream forming function 360 used to predict the viseme stream in the artificial intelligence model 110 by comparing the first movement pattern with the second movement pattern. In an embodiment, the viseme stream forming function 360 may be calculated using a loss function by comparing the first movement pattern with the second movement pattern.

The animation curve forming function calculation unit 670 may calculate the animation curve forming function 390 used to predict the animation curve in the artificial intelligence model 110 by comparing the first movement pattern with the second movement pattern. In an embodiment, the animation curve forming function 390 may be calculated using a loss function by comparing the first movement pattern with the second movement pattern.

The phoneme normalization function calculator 680 may calculate the phoneme normalization function 370, which is used in the artificial intelligence model 110 to select one viseme from among a plurality of visemes corresponding to a phoneme, based on the first phoneme stream, the viseme stream, and the animation curve output from the artificial intelligence model 110. In an embodiment, the phoneme normalization function calculator 680 may calculate the phoneme normalization function 370 by a regularization method.

The learner 150 may update the artificial intelligence model 110 by using the calculated phoneme stream formation function, viseme stream formation function, animation curve formation function, and phoneme normalization function.

The training data set 200 includes records of various people with different face shapes, and the face shape of a real person is different from an artificially created animated character model. Therefore, it is not possible to directly compare the 3D head model animation generated by the artificial intelligence model 110 with the facial movement appearing in the video signal. Therefore, in order for the artificial intelligence model 110 to learn the movement of an arbitrary face shape according to the spoken voice signal, it is necessary to remove an error related to the difference in the face shape.

Referring to FIG. 7 , the first movement pattern detector 640 may acquire a viseme animation of the 3D template model. The viseme animation is generated based on the animation curve output from the artificial intelligence model 110 .

In an embodiment, the first movement pattern detector 640 may project the facial landmark of the viseme animation on a 2D plane in order to compare the movement of the 3D head model animation with the facial movement detected from the video signal. The first movement pattern detector 640 may obtain a first movement pattern by calculating the movement of the projected facial landmark on a 2D plane.
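
The projection step can be sketched in two simple variants: an orthographic projection that drops the depth coordinate, and a pinhole projection. The disclosure does not state which projection is used, so both functions, their names, and the `focal_length` parameter are assumptions for illustration.

```python
def project_orthographic(landmarks_3d):
    """Drop the depth coordinate: (x, y, z) -> (x, y)."""
    return [(x, y) for x, y, _z in landmarks_3d]

def project_perspective(landmarks_3d, focal_length=1.0):
    """Pinhole projection: (x, y, z) -> (f * x / z, f * y / z), with z > 0."""
    projected = []
    for x, y, z in landmarks_3d:
        if z <= 0:
            raise ValueError("landmark must lie in front of the camera")
        projected.append((focal_length * x / z, focal_length * y / z))
    return projected
```

Applying either function to every frame of the viseme animation yields 2D landmark tracks that can be compared with landmarks detected in video.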

The second movement pattern detector 650 may detect a facial landmark from the video signal of the training data set 200 .

In an embodiment, in order to compare an arbitrary facial movement detected from the video signal with the movement of the 3D head model animation, the second movement pattern detector 650 may align the facial landmark detected from the video signal by overlaying it on a predefined neutral face. In an embodiment, the alignment of facial landmarks may be performed using Procrustes analysis or affine transforms. In an embodiment, the Kabsch algorithm may be used to find the optimal transformation matrix.
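
For 2D landmarks, a Procrustes-style alignment has a closed-form solution, sketched below. It finds the translation, rotation, and uniform scale (without reflection) that best map the detected landmarks onto the neutral face in the least-squares sense. This sketch uses the closed-form 2D solution rather than the SVD-based Kabsch algorithm named above, and the function name `align_to_neutral` is hypothetical.

```python
import math

def align_to_neutral(points, neutral):
    """Similarity-align 2D landmarks to a neutral face (no reflection)."""
    n = len(points)
    pcx = sum(x for x, _ in points) / n
    pcy = sum(y for _, y in points) / n
    qcx = sum(x for x, _ in neutral) / n
    qcy = sum(y for _, y in neutral) / n
    # Centre both shapes on their centroids.
    p = [(x - pcx, y - pcy) for x, y in points]
    q = [(x - qcx, y - qcy) for x, y in neutral]
    a = sum(px * qx + py * qy for (px, py), (qx, qy) in zip(p, q))
    b = sum(px * qy - py * qx for (px, py), (qx, qy) in zip(p, q))
    norm = sum(px * px + py * py for px, py in p)
    if norm == 0:
        raise ValueError("landmarks are degenerate (all identical)")
    theta = math.atan2(b, a)          # optimal rotation angle
    scale = math.hypot(a, b) / norm   # optimal uniform scale
    c, s = scale * math.cos(theta), scale * math.sin(theta)
    return [(c * px - s * py + qcx, s * px + c * py + qcy) for px, py in p]
```

In practice the transform would typically be estimated once per frame, so that head pose and face size are removed while the remaining landmark motion is preserved.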

In an embodiment, the second movement pattern detector 650 may obtain a second movement pattern by calculating the movement of the aligned landmarks. In an embodiment, the second movement pattern detector 650 may acquire the second movement pattern by measuring the movement displacement of the facial landmarks aligned with respect to the predefined neutral face.

The learner 150 may calculate a viseme stream forming function and an animation curve forming function based on the obtained first and second movement patterns. In an embodiment, the viseme stream forming function and the animation curve forming function may be calculated based on a loss function representing a difference between the first movement pattern and the second movement pattern. Using the loss function, it can be measured how similar the movement predicted by the artificial intelligence model 110 is to the actual movement of the face.
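
One simple choice for a loss that represents the difference between the two movement patterns is the mean squared distance between corresponding landmark displacements, sketched below. The use of a squared-error loss and the name `movement_loss` are assumptions; the disclosure does not name a specific loss.

```python
def movement_loss(first_pattern, second_pattern):
    """Mean squared distance between two movement patterns.

    Each pattern is a list of frames; each frame is a list of (dx, dy)
    landmark displacements. Both patterns must have the same shape.
    """
    total, count = 0.0, 0
    for f1, f2 in zip(first_pattern, second_pattern):
        for (dx1, dy1), (dx2, dy2) in zip(f1, f2):
            total += (dx1 - dx2) ** 2 + (dy1 - dy2) ** 2
            count += 1
    if count == 0:
        raise ValueError("patterns are empty")
    return total / count
```

A value of zero means the animated template moves exactly like the aligned face in the video; larger values indicate a larger mismatch.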

According to the method described above, since only the relative movement excluding the difference in face shape is used for learning, the artificial intelligence model 110 can be trained using an arbitrary face shape different from the 3D head model as training data, and the artificial intelligence model 110 can be trained based on 2D movement. Accordingly, easily obtainable video data can be used as training data.

FIG. 8 is a flowchart of a method of training an artificial intelligence model for generating an animated head model from a voice signal, according to various embodiments of the present disclosure. Each operation of FIG. 8 may be performed by the electronic device 100 shown in FIG. 1, or by the electronic device 100 shown in FIG. 9 or the processor 910 of the electronic device 100.

Referring to FIG. 8 , in operation S810 , the electronic device 100 may obtain a training data set including a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal. The voice signal, the text signal, and the video signal may include recordings of various people with different face shapes.

In operation S820, the electronic device 100 may input the voice signal to the artificial intelligence model 110 and obtain a first phoneme stream output from the artificial intelligence model 110, a viseme stream corresponding to the first phoneme stream, and an animation curve of visemes included in the viseme stream.

In various embodiments, the artificial intelligence model 110 may pre-process the voice signal to obtain characteristic information indicating the characteristics of the voice signal. The artificial intelligence model 110 may receive the characteristic information and output a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream. The artificial intelligence model 110 may output animation curves of visemes included in the viseme stream.

In operation S830 , the electronic device 100 may calculate a phoneme stream forming function for the artificial intelligence model 110 using the first phoneme stream and the text of the voice signal.

In various embodiments, the electronic device 100 may receive a text corresponding to the voice signal of the training data set 200 and detect the second phoneme stream. The electronic device 100 may compare the first phoneme stream and the second phoneme stream to calculate the phoneme stream forming function 340 used to predict the phoneme stream in the artificial intelligence model 110 .

In operation S840 , the electronic device 100 may calculate a viseme stream forming function and an animation curve forming function for the artificial intelligence model 110 by using the viseme stream, the animation curve, and the video signal.

In various embodiments, the electronic device 100 may generate a viseme animation by applying the animation curve to a 3D template model. The electronic device 100 may acquire a first movement pattern by detecting a movement pattern of a facial landmark in the viseme animation. In an embodiment, the electronic device 100 may project the facial landmark of the 3D template model on a 2D plane. The electronic device 100 may acquire the first movement pattern based on the movement of the facial landmark projected on the 2D plane.

In various embodiments, the electronic device 100 may obtain a second movement pattern by detecting a movement pattern of a facial landmark from the video signal. In an embodiment, the electronic device 100 may detect a facial landmark from the video signal. The electronic device 100 may align the face landmark of the video signal to the neutral face. The electronic device 100 may acquire the second movement pattern based on the movement of the facial landmark aligned with the neutral face.

In various embodiments, the electronic device 100 may calculate a viseme stream forming function and an animation curve forming function by comparing the first movement pattern with the second movement pattern.

In operation S850 , the electronic device 100 may calculate a phoneme normalization function for the artificial intelligence model 110 based on the first phoneme stream, the viseme stream, and the animation curve.

In operation S860, the electronic device 100 may update the artificial intelligence model 110 using the phoneme stream formation function, the viseme stream formation function, the animation curve formation function, and the phoneme normalization function.

Referring to FIG. 9 , the electronic device 100 may include at least one processor 910 and a memory 920 .

The memory 920 may store a program for processing and controlling the processor 910 , and may store data input to or output from the electronic device 100 . In various embodiments, memory 920 may store numerical parameters and functions for at least one trained artificial intelligence model. In various embodiments, the memory 920 may store training data for training at least one artificial intelligence model.

In various embodiments, the memory 920 may store instructions that, when executed by the at least one processor 910, cause the at least one processor 910 to execute a method of generating a head model animation from a voice signal. In various embodiments, the memory 920 may store instructions that, when executed by the at least one processor 910, cause the at least one processor 910 to execute a method of training an artificial intelligence model for generating a head model animation from a voice signal.

The processor 910 generally controls the overall operation of the electronic device 100. For example, the processor 910 may control the memory 920, the communication unit (not shown), the input unit (not shown), the output unit (not shown), and the like as a whole by executing the programs stored in the memory 920. The processor 910 may control the operation of the electronic device 100 according to the present disclosure by controlling the memory 920, the communication unit (not shown), the input unit (not shown), the output unit (not shown), and the like.

Specifically, the processor 910 may acquire a voice signal through the memory 920, a communication unit (not shown), or an input unit (not shown). The processor 910 may obtain characteristic information of the voice signal from the voice signal.

The processor 910 may obtain a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream from the characteristic information using the artificial intelligence model. In an embodiment, the processor 910 may predict a phoneme stream from a speech signal based on a phoneme stream forming function. The processor 910 may predict a viseme stream from the voice signal based on the viseme stream forming function. The processor 910 may obtain an animation curve of visemes included in the viseme stream by using an artificial intelligence model. In an embodiment, the processor 910 may derive an animation curve of visemes of the viseme stream based on the animation curve forming function.

The processor 910 may merge the phoneme stream and the viseme stream. The processor 910 may generate a head model animation by applying the animation curve to visemes of the merged phoneme and viseme stream.

Meanwhile, the processor 910 may obtain a training data set including a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal through the memory 920, a communication unit (not shown), or an input unit (not shown).

The processor 910 may input the voice signal to the artificial intelligence model and obtain a first phoneme stream output from the artificial intelligence model, a viseme stream corresponding to the first phoneme stream, and an animation curve of visemes included in the viseme stream.

The processor 910 may calculate a phoneme stream forming function for the AI model by using the first phoneme stream and the text of the voice signal. The processor 910 may calculate a viseme stream forming function and an animation curve forming function for the artificial intelligence model by using the viseme stream, the animation curve, and the video signal. The processor 910 may calculate a phoneme normalization function for the AI model based on the first phoneme stream, the viseme stream, and the animation curve.

The processor 910 may train the AI model by updating the AI model using the phoneme stream formation function, the viseme stream formation function, the animation curve formation function, and the phoneme normalization function.

One aspect of the present disclosure provides a method of generating an animated head model from a speech signal, the method being executed by one or more processors, the method comprising: receiving a speech signal; converting the speech signal into a set of speech signal features; extracting speech signal features from the speech signal feature set; deriving a phoneme stream and a viseme stream corresponding to the phonemes of the phoneme stream by processing the speech signal features with a learned artificial intelligence means; calculating, by the learned artificial intelligence means, an animation curve for the viseme of the derived viseme stream based on the corresponding phonemes; merging the derived phoneme stream and the derived viseme stream by overlaying the derived phoneme stream and the derived viseme stream with each other in consideration of the calculated animation curve; and forming an animation of the head model by animating the merged phoneme and viseme of the viseme stream using the calculated animation curve.

In a further aspect, the learning of the artificial intelligence means comprises: receiving a training data set comprising a voice signal, a transcript for the voice signal, and a video signal corresponding to the voice signal; deriving a phoneme stream from the transcript for the voice signal; converting the voice signal into a set of speech signal features; extracting speech signal features from the speech signal feature set; deriving a phoneme stream and a viseme stream corresponding to the phonemes of the phoneme stream based on the speech signal features; calculating a function for forming the phoneme stream by comparing the phoneme stream derived from the transcript for the voice signal with the phoneme stream derived based on the speech signal features; calculating an animation curve for a viseme of the viseme stream derived based on the speech signal features; applying the calculated animation curve to a predetermined viseme set; determining a movement pattern of a facial landmark in the predetermined viseme set to which the calculated animation curve is applied; determining a movement pattern of a facial landmark in the video signal corresponding to the voice signal; overlaying the movement pattern of the facial landmark in the video signal corresponding to the voice signal on a predetermined neutral face; calculating a function for forming the viseme stream and a function for calculating the animation curve by comparing the movement pattern of the facial landmark in the video signal overlaid on the predetermined neutral face with the movement pattern of the facial landmark determined in the predetermined viseme set; and calculating a function for selecting a viseme based on the phoneme stream derived based on the speech signal features, the viseme stream derived based on the speech signal features, and the calculated animation curve.

In another further aspect, the steps of converting the speech signal into a speech signal feature set and extracting speech signal features from the speech signal feature set are performed by one of a Mel-Frequency Cepstral Coefficients (MFCC) method or an additional pre-trained artificial intelligence means.

In a still further aspect, the additional pre-trained artificial intelligence means comprises at least one of a recurrent neural network, a Long Short-Term Memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or a combination thereof.

In yet a further aspect, the learned artificial intelligence means comprises at least two blocks, wherein a first block of the at least two blocks processes the speech signal features to perform the step of deriving a phoneme stream and a viseme stream corresponding to the phonemes of the phoneme stream, and a second block of the at least two blocks performs the step of calculating an animation curve for the viseme of the derived viseme stream based on the corresponding phonemes.

In a still further aspect, the first block of the at least two blocks of the learned artificial intelligence means is at least one of a recurrent neural network, a Long Short-Term Memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or a combination thereof.

In a still further aspect, the second block of the at least two blocks of the learned artificial intelligence means is at least one of a recurrent neural network, a Long Short-Term Memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or a combination thereof.
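The division of work between the two blocks described in the preceding aspects can be sketched in Python as a simple data flow. This is a hedged illustration only: `run_pipeline`, `toy_first_block`, `toy_second_block`, and the phoneme-to-viseme table are hypothetical stand-ins, whereas in the disclosure both blocks are learned networks (for example recurrent networks), not lookup rules.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Features = Sequence[Sequence[float]]  # one feature vector per audio frame
Curve = List[float]                   # viseme weight over time, assumed in [0, 1]

# First block: speech features -> (phoneme stream, viseme stream).
FirstBlock = Callable[[Features], Tuple[List[str], List[str]]]
# Second block: viseme stream -> one animation curve per viseme.
SecondBlock = Callable[[List[str]], List[Curve]]


def run_pipeline(features: Features,
                 first_block: FirstBlock,
                 second_block: SecondBlock) -> List[Tuple[str, str, Curve]]:
    """Chain the two blocks and pair each phoneme with its viseme and curve."""
    phonemes, visemes = first_block(features)
    curves = second_block(visemes)
    return list(zip(phonemes, visemes, curves))


# Hypothetical table; a real system would learn this correspondence.
TOY_PHONEME_TO_VISEME: Dict[str, str] = {"a": "jaw_open", "sil": "neutral"}


def toy_first_block(features: Features) -> Tuple[List[str], List[str]]:
    """Stand-in rule: frames with positive total energy are voiced as 'a'."""
    phonemes = ["a" if sum(frame) > 0 else "sil" for frame in features]
    visemes = [TOY_PHONEME_TO_VISEME[p] for p in phonemes]
    return phonemes, visemes


def toy_second_block(visemes: List[str]) -> List[Curve]:
    """Stand-in rule: a 0 -> 1 -> 0 ramp for every non-neutral viseme."""
    return [[0.0, 0.0, 0.0] if v == "neutral" else [0.0, 1.0, 0.0] for v in visemes]


result = run_pipeline([[1.0, 2.0], [-1.0, 0.0]], toy_first_block, toy_second_block)
```

Keeping the blocks behind separate interfaces mirrors the description: either block could be replaced by a different network type without changing how the streams are passed along.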

In a still further aspect, calculating the animation curves for a viseme of the viseme stream derived based on the speech signal features is performed using a Facial Action Coding System (FACS).

Another aspect of the present disclosure provides an electronic computing device comprising: at least one processor; and a memory storing numerical parameters of at least one learned artificial intelligence means and instructions that, when executed by the at least one processor, cause the at least one processor to perform a method of generating an animated head model from a speech signal.

Various embodiments of the present disclosure may be implemented as software including one or more instructions stored in a storage medium (e.g., memory 920) readable by a machine (e.g., electronic device 100). For example, a processor (e.g., processor 910) of the machine (e.g., electronic device 100) may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one called instruction. The one or more instructions may include code generated by a compiler or code executable by an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory' only means that the storage medium is a tangible device and does not contain a signal (e.g., an electromagnetic wave); the term does not distinguish between a case where data is stored semi-permanently in the storage medium and a case where data is stored temporarily.

According to an embodiment, the method according to various embodiments disclosed in the present disclosure may be included and provided in a computer program product. The computer program product may be traded between sellers and buyers as a commodity. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc read only memory (CD-ROM)), or distributed online (e.g., downloaded or uploaded) via an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a part of the computer program product may be temporarily stored or temporarily created in a machine-readable storage medium such as a memory of a server of a manufacturer, a server of an application store, or a relay server.

According to various embodiments, each component (e.g., a module or a program) of the above-described components may include a single entity or a plurality of entities. According to various embodiments, one or more components or operations among the above-described components may be omitted, or one or more other components or operations may be added. Alternatively or additionally, a plurality of components (e.g., modules or programs) may be integrated into one component. In this case, the integrated component may perform one or more functions of each of the plurality of components in the same or a similar manner as they were performed by the corresponding component among the plurality of components prior to the integration. According to various embodiments, operations performed by a module, program, or other component may be executed sequentially, in parallel, repeatedly, or heuristically; one or more of the operations may be executed in a different order or omitted; or one or more other operations may be added.

Claims

A method for generating a head model animation from a voice signal, the method comprising:

obtaining characteristic information of the voice signal from the voice signal;

obtaining a phoneme stream corresponding to the speech signal and a viseme stream corresponding to the phoneme stream from the characteristic information using an artificial intelligence model;

obtaining animation curves of visemes included in the viseme stream by using the artificial intelligence model;

merging the phoneme stream and the viseme stream; and

generating a head model animation by applying the animation curve to visemes of the merged phoneme and viseme stream.
The method of claim 1, wherein the artificial intelligence model comprises a first artificial intelligence model for obtaining, from the characteristic information, a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream, and a second artificial intelligence model for obtaining an animation curve of visemes included in the viseme stream.
The method of claim 1, wherein the artificial intelligence model is an artificial intelligence model trained using at least one of machine learning, a neural network, a genetic algorithm, deep learning, and a classification algorithm.
The method of claim 1, wherein the artificial intelligence model is an artificial intelligence model trained using only a voice signal, a text for the voice signal, and a video signal corresponding to the voice signal.
The method of claim 1, wherein obtaining the characteristic information of the voice signal from the voice signal is performed using one of a Mel-Frequency Cepstral Coefficients (MFCC) method or another artificial intelligence model.
The method of claim 1, wherein obtaining a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream from the characteristic information comprises:

extracting a characteristic of the voice signal from the characteristic information;

obtaining the phoneme stream from the characteristics of the speech signal; and

obtaining the viseme stream by selecting a viseme corresponding to each phoneme included in the phoneme stream.
The method of claim 6, wherein extracting the characteristic of the voice signal from the characteristic information is performed using at least one of a convolutional neural network and a recurrent neural network.
The method of claim 6, wherein the obtaining of the phoneme stream from the characteristics of the speech signal is performed based on a phoneme stream forming function of the artificial intelligence model,

wherein the phoneme stream forming function is learned from a training data set including an arbitrary speech signal and text corresponding to the arbitrary speech signal.
The method of claim 6, wherein the obtaining of the viseme stream by selecting a viseme corresponding to each phoneme included in the phoneme stream is performed based on a viseme stream forming function of the artificial intelligence model,

wherein the viseme stream forming function is learned from a training data set including an arbitrary speech signal and a video signal corresponding to the arbitrary speech signal.
The method of claim 1, wherein obtaining a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream from the characteristic information comprises selecting, using a phoneme normalization function of the artificial intelligence model, one viseme from among a plurality of visemes that partially correspond to a phoneme.
The method of claim 1, wherein the obtaining of animation curves of visemes included in the viseme stream is performed using a Facial Action Coding System (FACS).
A method for training an artificial intelligence model for generating a head model animation from a voice signal, the method comprising:

obtaining a training data set including a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal;

inputting the voice signal into the artificial intelligence model to obtain a first phoneme stream output from the artificial intelligence model, a viseme stream corresponding to the first phoneme stream, and animation curves of visemes included in the viseme stream;

calculating a phoneme stream forming function for the artificial intelligence model by using the first phoneme stream and the text of the speech signal;

calculating a viseme stream forming function and an animation curve forming function for the artificial intelligence model by using the viseme stream, the animation curve, and the video signal;

calculating a phoneme normalization function for the AI model based on the first phoneme stream, the viseme stream, and the animation curve; and

updating the artificial intelligence model by using the phoneme stream forming function, the viseme stream forming function, the animation curve forming function, and the phoneme normalization function.
The method of claim 12, wherein the obtaining of the first phoneme stream, the viseme stream corresponding to the first phoneme stream, and the animation curves of visemes included in the viseme stream comprises:

obtaining characteristic information of the voice signal from the voice signal;

obtaining the first phoneme stream output from the AI model and a viseme stream corresponding to the first phoneme stream by inputting the characteristic information into the AI model; and

inputting the viseme stream to the artificial intelligence model to obtain, as output from the artificial intelligence model, animation curves of visemes included in the viseme stream.
The method of claim 12, wherein calculating the phoneme stream forming function comprises:

obtaining a second phoneme stream from the text corresponding to the speech signal; and

calculating the phoneme stream forming function by comparing the first phoneme stream and the second phoneme stream.
The method of claim 12, wherein calculating the viseme stream forming function and the animation curve forming function comprises:

generating a viseme animation by applying the animation curve to a 3D template model;

detecting a movement pattern of a facial landmark in the viseme animation to obtain a first movement pattern;

detecting a movement pattern of a facial landmark in the video signal to obtain a second movement pattern; and

calculating the viseme stream forming function and the animation curve forming function by comparing the first movement pattern with the second movement pattern.
The method of claim 15, wherein obtaining the first movement pattern comprises:

projecting the facial landmarks of the 3D template model onto a 2D plane; and

obtaining the first movement pattern based on the movement of the facial landmarks projected onto the 2D plane,

and wherein obtaining the second movement pattern comprises:

detecting a facial landmark in the video signal;

aligning the facial landmarks in the video signal to a neutral face; and

obtaining the second movement pattern based on the movement of the facial landmarks aligned to the neutral face.
An electronic device comprising:

a memory storing one or more instructions; and

at least one processor,

wherein the at least one processor executes the one or more instructions to perform the method of generating a head model animation from a voice signal according to any one of claims 1 to 11.