WO2021112365A1 - Method for generating a head model animation from a voice signal and electronic device for implementing the same - Google Patents
- Publication number: WO2021112365A1
- Application number: PCT/KR2020/009663 (KR2020009663W)
- Authority: WIPO (PCT)
- Prior art keywords: stream, viseme, phoneme, animation, artificial intelligence
Classifications
- G06N3/08—Computing arrangements based on biological models; neural networks; learning methods
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
- G06T7/20—Image analysis; analysis of motion
- G06T7/33—Determination of transform parameters for the alignment of images (image registration) using feature-based methods
- G10L15/02—Feature extraction for speech recognition; selection of recognition unit
- G10L15/26—Speech to text systems
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation
- G10L25/30—Speech or voice analysis techniques using neural networks
Definitions
- The present disclosure relates generally to a method of generating computer graphics, and more particularly, to a method of generating a head model animation from a voice signal using an artificial intelligence model, and an electronic device implementing the method.
- Augmented and virtual reality applications that achieve an effect similar to a real person by animating characters corresponding to a user's avatar are increasingly common. For example, a personalized three-dimensional (3D) head model can be created and used in a phone call or virtual chat, or displayed when a voice is dubbed into another language.
- A technical solution for generating a head model animation from a voice signal is therefore required.
- Such a solution should be able to provide high-quality animation in real time and reduce the delay between reception of the voice signal and movement of the head model.
- It should also be possible to reduce the consumption of computing resources required for these tasks.
- Such a solution can be provided using an artificial intelligence model.
- Conventional head model animation techniques, however, suffer from several problems.
- An object of the present disclosure is to provide a method for generating a head model animation from a voice signal using an artificial intelligence model, which can provide the head model animation in real time with low latency and high quality, and an electronic device implementing the method.
- Some embodiments of the present disclosure may provide a method of using widely available data for training the artificial intelligence model, and an electronic device implementing the method.
- Some embodiments of the present disclosure may provide a method for generating a head model animation from a voice signal using an artificial intelligence model, which can generate a head model animation for an arbitrary voice or for an arbitrary character, and an electronic device implementing the method.
- A first aspect of the present disclosure may provide a method of generating a head model animation from a voice signal.
- According to the first aspect, a method of generating a head model animation from the voice signal may include: acquiring characteristic information of the voice signal from the voice signal; obtaining a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream from the characteristic information using an artificial intelligence model; obtaining animation curves of visemes included in the viseme stream by using the artificial intelligence model; merging the phoneme stream and the viseme stream; and generating a head model animation by applying the animation curves to visemes of the merged phoneme and viseme stream.
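- As a rough illustration of how these operations might be composed in code, the following Python sketch outlines the inference pipeline; all function and class names (extract_features, predict_streams, merge_streams, animate, etc.) are hypothetical placeholders, not an implementation disclosed in the patent.

```python
# Hypothetical pipeline sketch for the first aspect; names and interfaces are illustrative only.
def generate_head_animation(voice_signal, sample_rate, model, head_model):
    # 1. Acquire characteristic information (e.g., MFCC-like coefficients) from the voice signal.
    features = model.extract_features(voice_signal, sample_rate)

    # 2. Obtain a phoneme stream and the corresponding viseme stream from the features.
    phoneme_stream, viseme_stream = model.predict_streams(features)

    # 3. Obtain animation curves for the visemes included in the viseme stream.
    animation_curves = model.predict_animation_curves(viseme_stream)

    # 4. Merge the phoneme stream and the viseme stream (overlaid, guided by the curves).
    merged_stream = model.merge_streams(phoneme_stream, viseme_stream, animation_curves)

    # 5. Apply the animation curves to the visemes of the merged stream on the head model.
    return head_model.animate(merged_stream, animation_curves)
```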
- According to a second aspect of the present disclosure, a method for training an artificial intelligence model for generating a head model animation from the voice signal may include: acquiring a training data set including a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal; inputting the voice signal into the artificial intelligence model to obtain a first phoneme stream output from the AI model, a viseme stream corresponding to the first phoneme stream, and animation curves of visemes included in the viseme stream; calculating a phoneme stream forming function for the artificial intelligence model by using the first phoneme stream and the text of the voice signal; calculating a viseme stream forming function and an animation curve forming function for the artificial intelligence model by using the viseme stream, the animation curves, and the video signal; calculating a phoneme normalization function for the AI model based on the first phoneme stream, the viseme stream, and the animation curves; and updating the artificial intelligence model by using the phoneme stream forming function, the viseme stream forming function, the animation curve forming function, and the phoneme normalization function.
- A third aspect of the present disclosure may provide an electronic device including a memory storing one or more instructions and at least one processor, wherein the at least one processor executes the one or more instructions to perform the method of generating a head model animation from a voice signal.
- A fourth aspect of the present disclosure may provide an electronic device including a memory storing one or more instructions and at least one processor, wherein the at least one processor executes the one or more instructions to perform the method of training an artificial intelligence model for generating a head model animation from a voice signal.
- FIG. 1 is a schematic diagram of a system for generating a head model animation from a speech signal, in accordance with various embodiments.
- FIG. 2 is a schematic diagram of an artificial intelligence model for generating a head model animation from a voice signal, in accordance with various embodiments.
- FIG. 3 is a block diagram of a first artificial intelligence model and a second artificial intelligence model, according to various embodiments.
- FIG. 4A illustrates a structure of a spatial feature extraction layer according to an embodiment.
- FIG. 4B illustrates the structures of a phoneme prediction layer and a viseme prediction layer according to an embodiment.
- FIG. 5 is a flowchart of a method of generating a head model animation from a voice signal, according to various embodiments.
- FIG. 6 is a schematic diagram of a learning unit for training an artificial intelligence model for generating a head model animation from a voice signal, according to various embodiments of the present disclosure.
- FIG. 7 is a schematic diagram of calculating a viseme stream forming function and an animation curve forming function from a voice signal and a video signal corresponding to the voice signal, according to various embodiments of the present disclosure.
- FIG. 8 is a flowchart of a method of training an artificial intelligence model for generating a head model animation from a voice signal, according to various embodiments of the present disclosure.
- FIG. 9 is a block diagram of an electronic device configured to animate a head model from a voice signal, according to various embodiments of the present disclosure.
- the processor may consist of one or a plurality of processors.
- one or more processors may be a general-purpose processor such as a CPU, an AP, a digital signal processor (DSP), or the like, a graphics-only processor such as a GPU, a vision processing unit (VPU), or an artificial intelligence-only processor such as an NPU.
- The one or more processors may control the processing of input data according to a predefined operation rule or artificial intelligence model stored in the memory.
- the AI-only processor may be designed with a hardware structure specialized for processing a specific AI model.
- an artificial intelligence model may be used to infer or predict the head model animation corresponding to the voice signal.
- the processor may perform a preprocessing process on the voice signal data to convert it into a form suitable for use as an input of an artificial intelligence model.
- the AI model can be processed by an AI-only processor designed with a hardware structure specialized for processing the AI model.
- AI models can be created through learning.
- Being made through learning means that a basic artificial intelligence model is trained on a plurality of training data by a learning algorithm, so that a predefined operation rule or artificial intelligence model set to perform a desired characteristic (or purpose) is created.
- the artificial intelligence model may be composed of a plurality of neural network layers.
- Each of the plurality of neural network layers has a plurality of weight values, and a neural network operation is performed through an operation between an operation result of a previous layer and a plurality of weight values.
- the plurality of weights of the plurality of neural network layers may be optimized by the learning result of the artificial intelligence model. For example, a plurality of weights may be updated so that a loss value or a cost value obtained from the artificial intelligence model during the learning process is reduced or minimized.
- The artificial neural network may include a deep neural network (DNN), for example, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Restricted Boltzmann Machine (RBM), a Deep Belief Network (DBN), a Bidirectional Recurrent Deep Neural Network (BRDNN), or a Deep Q-Network, but is not limited to the above-described examples.
- Inference/prediction is a technology for logically reasoning about and predicting information by judging information, and includes knowledge-based reasoning, optimization prediction, preference-based planning, recommendation, and the like.
- A phoneme is the smallest unit of sound, perceived by a listener, that distinguishes one word from another.
- A viseme is a unit representing a lip shape that can be distinguished from others, associated with one or more phonemes. In general, phonemes and visemes do not correspond one-to-one, which means that different voice signals may correspond to the same face shape.
- FIG. 1 is a schematic diagram of a system for generating a head model animation from a speech signal, in accordance with various embodiments.
- a system for generating a head model animation from the voice signal may include an electronic device 100 .
- the electronic device 100 may be a device for generating a head model animation from a voice signal using the artificial intelligence model 110 .
- the electronic device 100 may be a device for learning the artificial intelligence model 110 using the training data set 200 .
- the electronic device 100 may include an artificial intelligence model 110 , an animation generator 120 , and a learning unit 150 for learning the artificial intelligence model.
- the electronic device 100 may receive a voice signal and transmit it as an input of the artificial intelligence model 110 .
- the voice signal may be received from any available source, such as the Internet, TV or radio broadcasts, smart phones, cell phones, voice recorders, desktop computers, laptops, and the like.
- the voice signal may be received in real time by an input unit (not shown) such as a microphone included in the electronic device 100 .
- the voice signal may be received from an external electronic device through a network by a communication unit (not shown) included in the electronic device 100 .
- the voice signal may be obtained from audio data stored in a memory or a storage device of the electronic device 100 .
- the electronic device 100 may receive the training data set 200 and transmit it as an input of the learning unit 150 .
- the training data set 200 may include a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal.
- the voice signal, the text signal, and the video signal may include recordings of various people with different face shapes.
- the training data set 200 may be provided to the learning unit 150 in order for the learning unit 150 to learn the artificial intelligence model 110 .
- The training data set 200 may be stored in a memory or a storage device in the electronic device 100. Alternatively, the training data set 200 may be stored in a storage device external to the electronic device 100.
- the artificial intelligence model 110 may derive parameters for generating the head model animation from the voice signal.
- the artificial intelligence model 110 may pre-process the voice signal and convert it into characteristic information indicating the characteristics of the voice signal. In various embodiments, the artificial intelligence model 110 may obtain characteristic coefficients representing the characteristics of the voice signal from the voice signal.
- the artificial intelligence model 110 may receive characteristic information of a voice signal extracted from the voice signal, and may output a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream.
- the artificial intelligence model 110 may output animation curves of visemes included in the viseme stream.
- the animation curve represents the temporal change of animation parameters related to the movement of the head model.
- the animation curve may specify the movement of the facial landmark in each viseme animation and the duration of the viseme animation.
- The artificial intelligence model 110 may include a first artificial intelligence model that obtains, from the characteristic coefficients, a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream, and a second artificial intelligence model that obtains the animation curves of the visemes included in the viseme stream.
- the artificial intelligence model 110 may include one or more numerical parameters and functions for deriving a phoneme stream, a viseme stream, and an animation curve from a speech signal.
- the numerical parameter may be a weight of each of the plurality of neural network layers constituting the artificial intelligence model 110 .
- the numerical parameters and functions may be determined or updated based on data learned by the artificial intelligence model 110 .
- the artificial intelligence model 110 may predict a phoneme stream from a speech signal based on a phoneme stream forming function.
- the artificial intelligence model 110 may predict a viseme stream from a voice signal based on a viseme stream forming function.
- the artificial intelligence model 110 may select one viseme from among a plurality of visemes corresponding to a phoneme based on the phoneme normalization function.
- the artificial intelligence model 110 may derive animation curves of visemes of the viseme stream based on the animation curve forming function.
- the artificial intelligence model 110 may post-process the phoneme stream and the viseme stream.
- the AI model 110 may merge the phoneme stream and the viseme stream by overlaying them in consideration of the animation curve.
- the artificial intelligence model 110 may be obtained from any available source, such as the Internet, a desktop computer, a laptop, etc., and may be stored in the memory of the electronic device 100 . In an embodiment, the artificial intelligence model 110 may be pre-trained using at least a portion of data included in the training data set 200 . The artificial intelligence model 110 may be updated using the learning data set 200 according to the learning algorithm of the learning unit 150 .
- the animation generator 120 may generate a head model animation corresponding to the voice signal by applying the parameters obtained from the artificial intelligence model 110 to the head model.
- the animation generator 120 may acquire the merged phoneme and viseme stream and animation curve from the artificial intelligence model 110 .
- the animation generator 120 may generate a head model animation by applying an animation curve to visemes included in the merged phoneme and viseme stream.
- the animation generator 120 may generate a head model animation based on a predefined head model.
- the predefined head model may be any 3D character model based on a Facial Action Coding System (FACS).
- FACS is a system for classifying human facial movements. Using FACS, arbitrary facial expressions can be coded by decomposing them into specific action units and their temporal divisions. For example, in the predefined head model, each viseme may be defined as a FACS coefficient.
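- For illustration only, a viseme of such a predefined head model could be represented as a small set of FACS action-unit (AU) weights; the specific action units and values in the sketch below are hypothetical examples, not coefficients taken from the disclosure.

```python
# Hypothetical example: a viseme expressed as FACS action-unit (AU) activations in [0, 1].
# AU identifiers follow the standard FACS numbering; the chosen values are illustrative.
viseme_oh = {
    "AU25": 0.7,  # lips part
    "AU26": 0.4,  # jaw drop
    "AU18": 0.6,  # lip pucker
}

def apply_viseme(head_model_weights, viseme, intensity):
    """Scale the viseme's AU activations by an intensity sampled from its animation curve."""
    for au, value in viseme.items():
        head_model_weights[au] = value * intensity
    return head_model_weights
```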
- the animation generator 120 may determine a viseme set of a predefined head model based on the merged phoneme and viseme stream.
- the animation generator 120 may generate the head model animation by applying the animation curve to the viseme set of the predefined head model.
- the learning unit 150 may train the artificial intelligence model 110 by using the training data set 200 .
- the learning unit 150 may acquire the training data set 200 including a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal.
- the learning unit 150 may obtain a phoneme stream, a viseme stream, and an animation curve output from the AI model 110 by inputting the training data set 200 into the AI model 110 .
- The learning unit 150 may compare and evaluate the phoneme stream generated by the artificial intelligence model 110 from the voice signal of the training data set 200 with the text of the training data set 200, and may update the artificial intelligence model 110 based on the evaluation.
- the learning unit 150 may calculate a phoneme stream forming function for the artificial intelligence model 110 using the first phoneme stream and the text.
- The learning unit 150 may compare and evaluate the 3D head model animation generated by the artificial intelligence model 110 from the voice signal of the training data set 200 with the video signal of the training data set 200, and may update the artificial intelligence model 110 based on the evaluation.
- the learner 150 may calculate a viseme stream forming function and an animation curve forming function for the artificial intelligence model 110 by using the viseme stream, the animation curve, and the video signal.
- the learner 150 may calculate a phoneme normalization function of the artificial intelligence model 110 based on the first phoneme stream, the viseme stream, and the animation curve.
- the learning unit 150 may update the AI model by using the phoneme stream formation function, the viseme stream formation function, the animation curve formation function, and the phoneme normalization function.
- The electronic device 100 may be, but is not limited to, a smart phone, a tablet PC, a PC, a smart TV, a mobile phone, a personal digital assistant (PDA), a laptop, a media player, a micro server, a global positioning system (GPS) device, an e-book terminal, a digital broadcasting terminal, a navigation device, a kiosk, an MP3 player, a digital camera, a consumer electronics device, or another mobile or non-mobile computing device.
- Although the electronic device 100 is illustrated in FIG. 1 as one device, it is not necessarily limited thereto.
- the electronic device 100 may be a set of one or more physically separated devices that are functionally connected and perform the above-described operations.
- FIG. 2 is a schematic diagram of an artificial intelligence model for generating a head model animation from a voice signal, in accordance with various embodiments.
- the artificial intelligence model 110 may include a preprocessor 210 , a first artificial intelligence model 220 , a second artificial intelligence model 230 , and a postprocessor 240 .
- the preprocessor 210 may preprocess the voice signal so that the voice signal can be used to generate the head model animation.
- the preprocessor 210 may process the acquired voice signal into a preset format so that the first artificial intelligence model 220 may use the acquired voice signal to generate the head model animation.
- the preprocessor 210 may preprocess the voice signal and convert it into characteristic coefficients indicating characteristics of the voice signal.
- the characteristic coefficients may be input to the first artificial intelligence model 220 and used to predict a phoneme stream and a viseme stream corresponding to a voice signal.
- The preprocessor 210 may obtain the characteristic coefficients by transforming the voice signal using a Mel-Frequency Cepstral Coefficients (MFCC) method.
- MFCC is a technique for extracting features by analyzing the short-term spectrum of a sound; the characteristic coefficients may be obtained by a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
- the MFCC method is not significantly affected by the variability according to the speaker and recording conditions, does not require a separate learning process, and has a fast calculation speed. Since the MFCC method is known in the art, a detailed description thereof will be omitted.
- The characteristic coefficients may also be obtained using another voice characteristic extraction method, for example, Perceptual Linear Prediction (PLP) or Linear Predictive Coding (LPC).
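- A minimal sketch of this preprocessing step using the librosa library is shown below; the frame length, hop length, and number of coefficients are assumptions, since the disclosure does not fix them.

```python
import librosa

def voice_to_mfcc(path, n_mfcc=13, frame_ms=25, hop_ms=10):
    # Load the voice signal; sr=None keeps the original sampling rate.
    signal, sr = librosa.load(path, sr=None)
    n_fft = int(sr * frame_ms / 1000)
    hop_length = int(sr * hop_ms / 1000)
    # Short-term log-power mel spectrum followed by a discrete cosine transform.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
    return mfcc.T  # shape: (num_frames, n_mfcc), one coefficient vector per frame
```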
- the preprocessor 210 may obtain the characteristic coefficients by inputting the speech signal to another pre-trained AI model.
- The additional pre-trained artificial intelligence model may be at least one of a recurrent neural network, a Long Short-Term Memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or any combination thereof.
- the first artificial intelligence model 220 may output a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream from the preprocessed voice signal provided from the preprocessor 210 .
- The first artificial intelligence model 220 may be at least one of a convolutional neural network, a recurrent neural network, a Long Short-Term Memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or any combination thereof.
- the first artificial intelligence model 220 may predict a phoneme stream from characteristic coefficients of a speech signal based on a phoneme stream forming function. In various embodiments, the first artificial intelligence model 220 may predict the viseme stream corresponding to the phoneme stream from the characteristic coefficients of the speech signal based on the viseme stream forming function. In various embodiments, the first AI model 220 may select one viseme from among a plurality of visemes corresponding to a phoneme based on a phoneme normalization function.
- the second artificial intelligence model 230 may receive the viseme stream generated by the first artificial intelligence model 220 and output animation curves of visemes included in the viseme stream.
- the animation curve represents the temporal change of animation parameters related to the movement of the head model.
- the animation curve may specify the movement of the facial landmark in each viseme animation and the duration of the viseme animation.
- the animation curve may be input to the animation generator 120 and applied to the head model to generate a head model animation.
- The second artificial intelligence model 230 may be at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), a Long Short-Term Memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or any combination thereof.
- the second artificial intelligence model 230 may derive animation curves of visemes of the viseme stream based on the animation curve forming function.
- the animation curve may be calculated using a Facial Action Coding System (FACS).
- An animation curve calculated using FACS can be applied to any FACS-based head model to generate a head model animation.
- the post-processing unit 240 may post-process the phoneme stream and the viseme stream and merge them.
- the merged phoneme and viseme stream output from the post-processing unit 240 may be input to the animation generating unit 120 and used to generate a head model animation.
- the post-processing unit 240 may merge the phoneme stream and the viseme stream by overlaying them in consideration of the animation curve.
- In the merged phoneme and viseme stream, each phoneme may be associated with a corresponding viseme.
- The duration of each phoneme and its associated viseme in the merged phoneme and viseme stream may be specified by the animation curve of the viseme.
- the post-processing unit 240 may use an arbitrary function that receives two inputs and returns one output for merging.
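- A minimal sketch of such a merging function is shown below; the per-entry representation of the streams and the overlay rule are assumptions made for illustration, not the specific function used in the disclosure.

```python
def merge_streams(phoneme_stream, viseme_stream, animation_curves):
    """Overlay the two streams entry by entry, pairing each phoneme with its viseme.

    Assumes both streams are lists of (label, start_time) tuples of equal length and
    that animation_curves[i] carries the duration of the i-th viseme animation.
    """
    merged = []
    for (phoneme, start), (viseme, _), curve in zip(phoneme_stream, viseme_stream, animation_curves):
        merged.append({
            "phoneme": phoneme,
            "viseme": viseme,
            "start": start,
            "duration": curve["duration"],  # duration specified by the viseme's animation curve
        })
    return merged
```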
- At least one of the preprocessor 210, the first artificial intelligence model 220, the second artificial intelligence model 230, and the postprocessor 240 included in the artificial intelligence model 110 may be manufactured in the form of at least one hardware chip and mounted in an electronic device.
- At least one of the preprocessor 210, the first artificial intelligence model 220, the second artificial intelligence model 230, and the postprocessor 240 may be manufactured in the form of a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as a part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics-only processor (e.g., a GPU) and mounted in the various electronic devices described above.
- The preprocessor 210, the first artificial intelligence model 220, the second artificial intelligence model 230, and the postprocessor 240 may be mounted in one electronic device 100, or may each be mounted in separate electronic devices.
- Some of the preprocessor 210, the first artificial intelligence model 220, the second artificial intelligence model 230, and the postprocessor 240 may be included in the electronic device 100, and the rest may be included in a server.
- At least one of the preprocessor 210 , the first artificial intelligence model 220 , the second artificial intelligence model 230 , and the postprocessor 240 may be implemented as a software module.
- When at least one of the preprocessor 210, the first artificial intelligence model 220, the second artificial intelligence model 230, and the postprocessor 240 is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable medium.
- at least one software module may be provided by an operating system (OS) or may be provided by a predetermined application.
- Alternatively, a part of the at least one software module may be provided by an operating system (OS), and the other part may be provided by a predetermined application.
- FIG. 3 is a block diagram of a first artificial intelligence model and a second artificial intelligence model, according to various embodiments.
- The first artificial intelligence model 220 may include a spatial feature extraction layer 310, a temporal feature extraction layer 320, a phoneme prediction layer 330, a phoneme stream forming function 340, a viseme prediction layer 350, a viseme stream forming function 360, and a phoneme normalization function 370.
- the second AI model 230 may include an animation curve prediction layer 380 and an animation curve forming function 390 .
- The spatial feature extraction layer 310, the temporal feature extraction layer 320, the phoneme prediction layer 330, the viseme prediction layer 350, and the animation curve prediction layer 380 may each be at least a part of a neural network that performs a specific function.
- The phoneme stream forming function 340, the viseme stream forming function 360, the phoneme normalization function 370, and the animation curve forming function 390 may be functions used to derive results from one or more layers included in the AI model or to evaluate the derived results.
- the spatial feature extraction layer 310 and the temporal extraction layer 320 may extract the characteristics of the voice signal from the characteristic information of the input voice signal.
- the characteristic information of the voice signal may be a characteristic coefficient of the voice signal output from the preprocessor 210 .
- the spatial characteristic extraction layer 310 may process characteristic information of the input voice signal to extract spatial characteristics.
- a convolutional neural network (CNN) or a recurrent neural network (RNN) having fully connected layers and nonlinearity may be used for the spatial feature extraction layer 310 .
- a layer structure as shown in FIG. 4A may be used.
- the present invention is not limited thereto, and any differentiable layer may be added.
- the spatial feature extraction layer 310 may be pre-learned.
- the temporal feature extraction layer 320 may process the extracted spatial feature to extract a temporal feature.
- a recurrent neural network (RNN) with fully connected layers and nonlinearity may be used for the temporal feature extraction layer 320 .
- For example, a three-level Long Short-Term Memory (LSTM) may be used as the temporal feature extraction layer 320.
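- The sketch below shows one way such a spatial (convolutional) plus temporal (three-level LSTM) feature extractor could look in PyTorch; the layer sizes and kernel choices are assumptions, since FIG. 4A is not reproduced here.

```python
import torch.nn as nn

class SpeechFeatureExtractor(nn.Module):
    """Spatial feature extraction (1D convolutions over MFCC frames) followed by
    temporal feature extraction (a three-level LSTM). Dimensions are illustrative."""

    def __init__(self, n_mfcc=13, spatial_dim=128, temporal_dim=256):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv1d(n_mfcc, spatial_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(spatial_dim, spatial_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.temporal = nn.LSTM(spatial_dim, temporal_dim, num_layers=3, batch_first=True)

    def forward(self, mfcc):                      # mfcc: (batch, frames, n_mfcc)
        x = self.spatial(mfcc.transpose(1, 2))    # convolve over time: (batch, spatial_dim, frames)
        x, _ = self.temporal(x.transpose(1, 2))   # (batch, frames, temporal_dim)
        return x
```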
- Two independent streams, a phoneme stream and a viseme stream, may be predicted based on the characteristics of the speech signal extracted through the spatial feature extraction layer 310 and the temporal extraction layer 320 .
- the phoneme prediction layer 330 may derive a phoneme stream corresponding to the voice signal from the characteristics of the extracted voice signal.
- the viseme prediction layer 350 may select a viseme corresponding to each phoneme included in the phoneme stream to derive a viseme stream corresponding to the voice signal.
- the phoneme prediction layer 330 may predict a phoneme stream from a characteristic of a speech signal based on the phoneme stream forming function 340 .
- the phoneme stream forming function 340 may be calculated from a loss function that measures how similar the phoneme stream predicted by the first AI model 220 is to an actual correct value.
- the phoneme stream forming function 340 may be learned from a training data set including an arbitrary voice signal and text corresponding to the arbitrary voice signal.
- the viseme prediction layer 350 may predict the viseme stream corresponding to the phoneme stream from the characteristics of the speech signal based on the viseme stream forming function 360 .
- the viseme stream forming function 360 may be calculated from a loss function that measures how similar the viseme stream predicted by the first artificial intelligence model 220 is to the actual correct value.
- the viseme stream forming function 360 may be learned from a training data set including an arbitrary voice signal and a video signal corresponding to the arbitrary voice signal.
- Because a plurality of visemes may correspond to one phoneme, the viseme prediction layer 350 may select an appropriate viseme from among them.
- the viseme prediction layer 350 may select one viseme from among a plurality of visemes corresponding to a phoneme based on the phoneme normalization function 370 .
- the phoneme normalization function 370 may be a function that predicts a probability distribution of a phoneme and gives a penalty to a basic shape corresponding to a phoneme that is less likely to be used.
- the phoneme normalization function 370 may be calculated by a regularization method based on a training data set including an arbitrary speech signal.
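- A minimal sketch of one possible regularization term of this kind is shown below; the exact penalty used in the disclosure is not specified, so the formulation, tensor shapes, and the phoneme-to-viseme mapping matrix are assumptions made for illustration.

```python
def phoneme_normalization_penalty(phoneme_probs, viseme_activations, phoneme_to_viseme):
    """Penalize viseme activations whose associated phonemes are unlikely.

    phoneme_probs:      (batch, frames, num_phonemes) softmax outputs
    viseme_activations: (batch, frames, num_visemes)
    phoneme_to_viseme:  (num_phonemes, num_visemes) 0/1 mapping matrix (assumed known)
    """
    # Probability mass each viseme inherits from its phonemes.
    viseme_support = phoneme_probs @ phoneme_to_viseme        # (batch, frames, num_visemes)
    # Activations of poorly supported visemes contribute more to the penalty.
    return ((1.0 - viseme_support) * viseme_activations.abs()).mean()
```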
- any possible layer structure may be used for the phoneme prediction layer 330 and the viseme prediction layer 350 , including a stack of linear layers, non-linear layers, and other differentiable layers.
- For example, two fully connected linear layers with a Rectified Linear Unit (ReLU) nonlinearity, as shown in FIG. 4B, may be used as predictors.
- The phoneme prediction layer 330 and the viseme prediction layer 350 may share weights with the temporal feature extraction layer 320. Because both streams share weights with the previous layer, a regularizing effect on the model can be obtained according to the characteristics of the predicted parameters.
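- The two prediction heads described above might look like the following PyTorch sketch, sitting on top of the shared temporal trunk; the class and dimension choices are assumptions for illustration.

```python
import torch.nn as nn

class StreamPredictionHeads(nn.Module):
    """Phoneme and viseme prediction heads on top of a shared temporal feature trunk.
    Each head is two fully connected linear layers with a ReLU, as suggested for FIG. 4B."""

    def __init__(self, temporal_dim=256, num_phonemes=40, num_visemes=20, hidden=128):
        super().__init__()
        self.phoneme_head = nn.Sequential(
            nn.Linear(temporal_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_phonemes))
        self.viseme_head = nn.Sequential(
            nn.Linear(temporal_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_visemes))

    def forward(self, temporal_features):         # (batch, frames, temporal_dim)
        return self.phoneme_head(temporal_features), self.viseme_head(temporal_features)
```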
- the animation curve prediction layer 380 may predict animation curves of visemes included in the viseme stream from the viseme stream.
- the animation curve represents the temporal change of animation parameters related to the movement of the head model.
- the animation curve may specify the movement of the facial landmark in each viseme animation and the duration of the viseme animation.
- the animation curve prediction layer 380 may derive an animation curve of the visemes of the viseme stream based on the animation curve forming function 390 .
- The animation curve forming function 390 may be calculated from a loss function that measures how similar the motion represented by the animation curve derived by the second artificial intelligence model 230 is to the motion of a real person.
- the animation curve forming function 390 may be learned from a training data set including an arbitrary voice signal and a video signal corresponding to the arbitrary voice signal.
- the animation curve may be calculated using a Facial Action Coding System (FACS).
- the animation curve prediction layer 380 may derive an animation curve defined by using FACS coefficients for each viseme. Animation curves defined using FACS coefficients can be applied to any FACS-based head model. However, the animation curve is not necessarily limited to being calculated using FACS, and any suitable method may be used for calculating the animation curve.
- FIG. 5 is a flowchart of a method of generating a head model animation from a voice signal, according to various embodiments. Each operation of FIG. 5 may be performed by the electronic device 100 shown in FIG. 1 , or the electronic device 100 shown in FIG. 9 or the processor 910 of the electronic device 100 .
- the electronic device 100 may obtain characteristic information of a voice signal from the voice signal.
- the electronic device 100 may pre-process a voice signal and convert it into characteristic information indicating characteristics of the voice signal, for example, a characteristic coefficient.
- The electronic device 100 may obtain the characteristic coefficients by transforming the voice signal using a Mel-Frequency Cepstral Coefficients (MFCC) method.
- the electronic device 100 may obtain the characteristic coefficients by inputting the speech signal to another pre-trained AI model.
- the electronic device 100 may obtain a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream from the characteristic information using the artificial intelligence model.
- the electronic device 100 may predict a phoneme stream from characteristic coefficients of a speech signal based on a phoneme stream forming function.
- the phoneme stream forming function may be learned from a training data set including an arbitrary voice signal and a text corresponding to the arbitrary voice signal.
- the electronic device 100 may predict the viseme stream corresponding to the phoneme stream from the characteristic coefficients of the speech signal based on the viseme stream forming function.
- the viseme stream forming function may be learned from a training data set including an arbitrary voice signal and a video signal corresponding to the arbitrary voice signal.
- the electronic device 100 may select one viseme from among a plurality of visemes corresponding to a phoneme based on a phoneme normalization function.
- the electronic device 100 may obtain an animation curve of visemes included in the viseme stream using the artificial intelligence model.
- the electronic device 100 may derive animation curves of visemes of the viseme stream based on the animation curve forming function.
- the animation curve may be calculated using a Facial Action Coding System (FACS).
- the electronic device 100 may merge the phoneme stream and the viseme stream.
- the AI model 110 may merge the phoneme stream and the viseme stream by overlaying them in consideration of the animation curve.
- the electronic device 100 may generate a head model animation by applying the animation curve to visemes of the merged phoneme and viseme stream.
- the electronic device 100 may generate a head model animation based on a predefined head model.
- The predefined head model may be any 3D character model based on the Facial Action Coding System (FACS).
- the electronic device 100 determines a viseme set of a predefined head model based on the merged phoneme and viseme stream, and applies the animation curve to the viseme set to generate a head model animation.
- FIG. 6 is a schematic diagram of a learning unit for training an artificial intelligence model for generating a head model animation from a voice signal, according to various embodiments of the present disclosure.
- The learning unit 150 for training the artificial intelligence model 110 for generating a head model animation from a voice signal may include a phoneme detection unit 610, a phoneme stream forming function calculation unit 620, an animation generation unit 630, a first movement pattern detection unit 640, a second movement pattern detection unit 650, a viseme stream forming function calculation unit 660, an animation curve forming function calculation unit 670, and a phoneme normalization function calculation unit 680.
- the learning unit 150 may acquire the training data set 200 including a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal.
- the voice signal, the text signal, and the video signal may include recordings of various people with different face shapes.
- the voice signal, the text signal, and the video signal may each be separately processed for multi-objective learning.
- the processing of the voice signal, the processing of the text, and the processing of the video signal may be performed in parallel or sequentially according to the configuration of the electronic device 100 and its calculation capability.
- The artificial intelligence model 110 may receive the voice signal of the training data set 200 and obtain a first phoneme stream corresponding to the voice signal, a viseme stream corresponding to the first phoneme stream, and animation curves of the visemes included in the viseme stream.
- the artificial intelligence model 110 may obtain the first phoneme stream, the viseme stream, and the animation curve from the speech signal of the training data set 200 by the same process as described above.
- the artificial intelligence model 110 may pre-process the voice signal to obtain characteristic information indicating the characteristics of the voice signal.
- the artificial intelligence model 110 may receive the characteristic information and output a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream.
- the artificial intelligence model 110 may output animation curves of visemes included in the viseme stream.
- the phoneme detector 610 may receive a text corresponding to the voice signal of the training data set 200 and detect the second phoneme stream.
- the operation of detecting phonemes from text may be performed by any known method.
- The phoneme stream forming function calculation unit 620 may compare the first phoneme stream and the second phoneme stream to calculate the phoneme stream forming function 340 used to predict the phoneme stream in the artificial intelligence model 110.
- the phoneme stream forming function 340 may be calculated using a loss function by comparing the first phoneme stream and the second phoneme stream.
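- The disclosure does not name a specific loss; one common choice for comparing a frame-level phoneme prediction (first phoneme stream) with a text-derived phoneme sequence of unknown alignment (second phoneme stream) is the CTC loss. The sketch below uses PyTorch's CTCLoss and is an assumption, not the patented function.

```python
import torch.nn as nn

# Assumed setup: log_probs are frame-level phoneme log-probabilities from the AI model
# (first phoneme stream), targets is the phoneme index sequence detected from the text
# (second phoneme stream). CTC handles the unknown frame-to-phoneme alignment.
ctc_loss = nn.CTCLoss(blank=0)

def phoneme_stream_loss(log_probs, targets, input_lengths, target_lengths):
    # log_probs: (frames, batch, num_phonemes); targets: (batch, max_target_len)
    return ctc_loss(log_probs, targets, input_lengths, target_lengths)
```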
- the animation generator 630 may generate a viseme animation by applying the animation curve output from the artificial intelligence model 110 to the 3D template model.
- the animation generator 630 may obtain a viseme animation by applying an animation curve to visemes included in the viseme stream in the same manner as the animation generator 120 .
- the animation generator 630 may acquire a viseme animation based on a predefined head model. For example, the animation generator 630 may obtain a viseme animation of each viseme by applying an animation curve to the 3D template head model.
- the first movement pattern detector 640 may acquire the first movement pattern by detecting the movement pattern of the facial landmark in the obtained viseme animation.
- the facial landmark may be predefined in the predefined head model.
- the movement parameter indicated by the animation curve may designate the movement pattern of the defined facial landmark.
- the second movement pattern detector 650 may receive a video signal corresponding to the voice signal of the training data set 200 and detect a facial landmark from the video signal.
- the facial landmark may be detected by a landmark detector.
- the landmark detector may perform any conventional landmark detection method.
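- For example, a conventional detector such as dlib's 68-point shape predictor could be used; the sketch below assumes the standard pretrained model file has been downloaded separately and shows only one of many possible detectors.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Standard 68-landmark model distributed for dlib examples (assumed to be available locally).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float64)  # (68, 2)
```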
- the second movement pattern detector 650 may obtain a second movement pattern by measuring the movement displacement of the detected facial landmark.
- the movement displacement may be measured based on an average of the training data set 200 or a specific face selected during training.
- In order to train the artificial intelligence model 110 independently of the face shape, the second movement pattern detection unit 650 may measure the movement displacement of the facial landmarks based on the face shape acquired from the first frame of the video signal or an estimated face shape.
- The viseme stream forming function calculation unit 660 may calculate the viseme stream forming function 360 used to predict the viseme stream in the artificial intelligence model 110 by comparing the first movement pattern with the second movement pattern.
- The viseme stream forming function 360 may be calculated using a loss function by comparing the first movement pattern with the second movement pattern.
- The animation curve forming function calculation unit 670 may calculate the animation curve forming function 390 used to predict the animation curve in the artificial intelligence model 110 by comparing the first movement pattern with the second movement pattern.
- the animation curve forming function 390 may be calculated using a loss function by comparing the first movement pattern with the second movement pattern.
- The phoneme normalization function calculation unit 680 may calculate the phoneme normalization function 370, which is used in the AI model 110 to select one viseme from among a plurality of visemes corresponding to a phoneme, based on the first phoneme stream, the viseme stream, and the animation curve output from the AI model 110.
- the phoneme normalization function calculator 680 may calculate the phoneme normalization function 370 by a regularization method.
- the learner 150 may update the artificial intelligence model 110 by using the calculated phoneme stream formation function, viseme stream formation function, animation curve formation function, and phoneme normalization function.
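- One way these four functions could be combined into a single training update is sketched below; the weighting coefficients, the optimizer, and the contents of the loss_fns dictionary are assumptions, as the disclosure does not specify how the functions are combined.

```python
def training_step(model, optimizer, batch, loss_fns, weights=(1.0, 1.0, 1.0, 0.1)):
    """Single update combining the four objectives. loss_fns is a dict of callables
    implementing the phoneme stream, viseme stream, animation curve, and phoneme
    normalization terms; its exact members are placeholders for illustration."""
    phonemes, visemes, curves = model(batch["voice_features"])
    loss = (weights[0] * loss_fns["phoneme_stream"](phonemes, batch["text_phonemes"])
            + weights[1] * loss_fns["viseme_stream"](visemes, curves, batch["video_landmarks"])
            + weights[2] * loss_fns["animation_curve"](curves, batch["video_landmarks"])
            + weights[3] * loss_fns["phoneme_normalization"](phonemes, visemes, curves))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```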
- FIG. 7 is a schematic diagram of calculating a viseme stream forming function and an animation curve forming function from a voice signal and a video signal corresponding to the voice signal, according to various embodiments of the present disclosure.
- The training data set 200 includes recordings of various people with different face shapes, and the face shape of a real person differs from that of an artificially created animated character model. It is therefore not possible to directly compare the 3D head model animation generated by the artificial intelligence model 110 with the facial movement appearing in the video signal. Accordingly, in order for the artificial intelligence model 110 to learn the movement of an arbitrary face shape according to the spoken voice signal, it is necessary to remove errors related to the difference in face shape.
- the first movement pattern detector 640 may acquire a viseme animation of the 3D template model.
- the viseme animation is generated based on the animation curve output from the artificial intelligence model 110 .
- the first movement pattern detector 640 may project the facial landmark of the viseme animation on a 2D plane in order to compare the movement of the 3D head model animation with the facial movement detected from the video signal.
- the first movement pattern detector 640 may obtain a first movement pattern by calculating the movement of the projected facial landmark on a 2D plane.
- the second movement pattern detector 650 may detect a facial landmark from the video signal of the training data set 200 .
- In order to compare an arbitrary facial movement detected from the video signal with the movement of the 3D head model animation, the second movement pattern detection unit 650 may align the facial landmarks detected from the video signal by overlaying them on a predefined neutral face.
- Alignment of the facial landmarks may be performed using Procrustes analysis or affine transforms.
- The Kabsch algorithm may be used to find the optimal transformation matrix.
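- A compact NumPy implementation of the Kabsch algorithm for the optimal rotation between two landmark sets is sketched below; scale handling beyond centering is omitted, and the variable names are illustrative.

```python
import numpy as np

def kabsch_rotation(P, Q):
    """Return the rotation matrix that optimally aligns point set P onto Q (both (N, d)),
    after centering both sets, using the Kabsch algorithm."""
    P_c = P - P.mean(axis=0)
    Q_c = Q - Q.mean(axis=0)
    H = P_c.T @ Q_c                          # covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # correct for a possible reflection
    D = np.diag([1.0] * (P.shape[1] - 1) + [d])
    return Vt.T @ D @ U.T

# Usage sketch: align detected landmarks onto a neutral face template.
# R = kabsch_rotation(detected_landmarks, neutral_landmarks)
# aligned = (detected_landmarks - detected_landmarks.mean(axis=0)) @ R.T + neutral_landmarks.mean(axis=0)
```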
- The second movement pattern detection unit 650 may obtain a second movement pattern by calculating the movement of the aligned landmarks. In an embodiment, the second movement pattern detection unit 650 may acquire the second movement pattern by measuring the movement displacement of the facial landmarks aligned with respect to the predefined neutral face.
- the learner 150 may calculate a viseme stream forming function and an animation curve forming function based on the obtained first and second movement patterns.
- the viseme stream forming function and the animation curve forming function may be calculated based on a loss function representing a difference between the first movement pattern and the second movement pattern. Using the loss function, it can be measured how similar the movement predicted by the artificial intelligence model 110 is to the actual movement of the face.
- In this way, the artificial intelligence model 110 can be trained using, as training data, arbitrary face shapes different from the 3D head model, and can be trained based on the 2D movement of facial landmarks. Accordingly, easily obtainable video data can be used as training data.
- FIG. 8 is a flowchart of a method of training an artificial intelligence model for generating a head model animation from a voice signal, according to various embodiments of the present disclosure. Each operation of FIG. 8 may be performed by the electronic device 100 shown in FIG. 1, or the electronic device 100 shown in FIG. 9 or the processor 910 of the electronic device 100.
- the electronic device 100 may obtain a training data set including a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal.
- the voice signal, the text signal, and the video signal may include recordings of various people with different face shapes.
- The electronic device 100 may input the voice signal to the artificial intelligence model 110 and obtain a first phoneme stream output from the AI model, a viseme stream corresponding to the first phoneme stream, and animation curves of the visemes included in the viseme stream.
- the artificial intelligence model 110 may pre-process the voice signal to obtain characteristic information indicating the characteristics of the voice signal.
- the artificial intelligence model 110 may receive the characteristic information and output a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream.
- the artificial intelligence model 110 may output animation curves of visemes included in the viseme stream.
- the electronic device 100 may calculate a phoneme stream forming function for the artificial intelligence model 110 using the first phoneme stream and the text of the voice signal.
- the electronic device 100 may receive a text corresponding to the voice signal of the training data set 200 and detect the second phoneme stream.
- the electronic device 100 may compare the first phoneme stream and the second phoneme stream to calculate the phoneme stream forming function 340 used to predict the phoneme stream in the artificial intelligence model 110 .
- the electronic device 100 may calculate a viseme stream forming function and an animation curve forming function for the artificial intelligence model 110 by using the viseme stream, the animation curve, and the video signal.
- the electronic device 100 may generate a viseme animation by applying the animation curve to a 3D template model.
- the electronic device 100 may acquire a first movement pattern by detecting a movement pattern of a facial landmark in the viseme animation.
- the electronic device 100 may project the facial landmark of the 3D template model on a 2D plane.
- the electronic device 100 may acquire the first movement pattern based on the movement of the facial landmark projected on the 2D plane.
- the electronic device 100 may obtain a second movement pattern by detecting a movement pattern of a facial landmark from the video signal. In an embodiment, the electronic device 100 may detect a facial landmark from the video signal. The electronic device 100 may align the face landmark of the video signal to the neutral face. The electronic device 100 may acquire the second movement pattern based on the movement of the facial landmark aligned with the neutral face.
- the electronic device 100 may calculate a viseme stream forming function and an animation curve forming function by comparing the first movement pattern with the second movement pattern.
- the electronic device 100 may calculate a phoneme normalization function for the artificial intelligence model 110 based on the first phoneme stream, the viseme stream, and the animation curve.
- the electronic device 100 may update the artificial intelligence model 110 using the phoneme stream formation function, the viseme stream formation function, the animation curve formation function, and the phoneme normalization function.
- FIG. 9 is a block diagram of an electronic device configured to animate a head model from a voice signal, according to various embodiments of the present disclosure.
- the electronic device 100 may include at least one processor 910 and a memory 920 .
- the memory 920 may store a program for processing and controlling the processor 910 , and may store data input to or output from the electronic device 100 .
- memory 920 may store numerical parameters and functions for at least one trained artificial intelligence model.
- the memory 920 may store training data for training at least one artificial intelligence model.
- the memory 920 may store instructions that, when executed by the at least one processor 910, cause the at least one processor 910 to execute a method of generating a head model animation from a voice signal. In various embodiments, the memory 920 may store instructions that, when executed by the at least one processor 910, cause the at least one processor 910 to execute a method of training an artificial intelligence model for generating a head model animation from a voice signal.
- the processor 910 generally controls the overall operation of the electronic device 100 .
- the processor 910 may generally control the memory 920, a communication unit (not shown), an input unit (not shown), an output unit (not shown), and the like by executing programs stored in the memory 920.
- the processor 910 may control the operation of the electronic device 100 according to the present disclosure by controlling the memory 920, the communication unit (not shown), the input unit (not shown), the output unit (not shown), and the like.
- the processor 910 may acquire a voice signal through the memory 920, the communication unit (not shown), or the input unit (not shown).
- the processor 910 may obtain characteristic information of the voice signal from the voice signal.
- the processor 910 may obtain a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream from the characteristic information using the artificial intelligence model. In an embodiment, the processor 910 may predict a phoneme stream from a speech signal based on a phoneme stream forming function. The processor 910 may predict a viseme stream from the voice signal based on the viseme stream forming function. The processor 910 may obtain an animation curve of visemes included in the viseme stream by using an artificial intelligence model. In an embodiment, the processor 910 may derive an animation curve of visemes of the viseme stream based on the animation curve forming function.
- the processor 910 may merge the phoneme stream and the viseme stream.
- the processor 910 may generate a head model animation by applying the animation curve to visemes of the merged phoneme and viseme stream.
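- Read together, the operations of the processor 910 described above amount to the pipeline sketched below. Every callable is passed in as a parameter because the disclosure does not name concrete implementations for these steps; the sketch only fixes the order of operations.

```python
def generate_head_model_animation(extract_features, model, merge_streams,
                                  apply_curves, voice_signal, sample_rate,
                                  head_model):
    """Sketch of the generation path of FIG. 9 with placeholder callables."""
    features = extract_features(voice_signal, sample_rate)   # characteristic information
    phonemes, visemes, curves = model(features)               # phoneme/viseme streams + curves
    merged = merge_streams(phonemes, visemes)                 # overlay the two streams
    return apply_curves(head_model, merged, curves)           # animated head model
```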
- the processor 910 may obtain a training data set including a voice signal, a text corresponding to the voice signal, and a video signal corresponding to the voice signal through the memory 920, the communication unit (not shown), or the input unit (not shown).
- the processor 910 may input the voice signal to the artificial intelligence model and obtain a first phoneme stream output from the artificial intelligence model, a viseme stream corresponding to the first phoneme stream, and an animation curve of visemes included in the viseme stream.
- the processor 910 may calculate a phoneme stream forming function for the AI model by using the first phoneme stream and the text of the voice signal.
- the processor 910 may calculate a viseme stream forming function and an animation curve forming function for the artificial intelligence model by using the viseme stream, the animation curve, and the video signal.
- the processor 910 may calculate a phoneme normalization function for the AI model based on the first phoneme stream, the viseme stream, and the animation curve.
- the processor 910 may train the AI model by updating the AI model using the phoneme stream formation function, the viseme stream formation function, the animation curve formation function, and the phoneme normalization function.
- One aspect of the present disclosure provides a method of generating an animated head model from a speech signal, the method being executed by one or more processors, the method comprising: receiving a speech signal; converting the speech signal into a set of speech signal features; extracting speech signal features from the speech signal feature set; deriving a phoneme stream and a viseme stream corresponding to the phonemes of the phoneme stream by processing the speech signal features with a learned artificial intelligence means; calculating, by the learned artificial intelligence means, an animation curve for the visemes of the derived viseme stream based on the corresponding phonemes; merging the derived phoneme stream and the derived viseme stream by overlaying the derived phoneme stream and the derived viseme stream with each other in consideration of the calculated animation curve; and forming an animation of the head model by animating the visemes of the merged phoneme and viseme stream using the calculated animation curve.
- the learning of the artificial intelligence means comprises: receiving a training data set comprising a voice signal, a transcript for the voice signal, and a video signal corresponding to the voice signal; deriving a phoneme stream from the transcript for the voice signal; converting the voice signal into a set of speech signal features; extracting speech signal features from the speech signal feature set; deriving a phoneme stream and a viseme stream corresponding to the phonemes of the phoneme stream based on the speech signal characteristics; calculating a function for forming the phoneme stream by comparing the phoneme stream derived from the transcript for the voice signal with the phoneme stream derived based on the speech signal characteristics; calculating an animation curve for visemes of the viseme stream derived based on the speech signal characteristics; applying the calculated animation curve to a predetermined viseme set; determining a movement pattern of a facial landmark in the predetermined viseme set to which the calculated animation curve is applied; determining a movement pattern of a facial landmark in the video signal; and calculating a function for forming the viseme stream and a function for forming the animation curve by comparing the determined movement patterns.
- the step of converting the speech signal into a speech signal feature set and extracting speech signal features from the speech signal feature set is performed by one of a Mel-Frequency Cepstral Coefficients (MFCC) method or an additional pre-trained artificial intelligence means.
- the additional pre-trained artificial intelligence means comprises one of a recurrent neural network, a Long Short-Term Memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or a combination thereof.
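- For the MFCC option, a minimal sketch using the librosa library might look as follows; the number of coefficients is an arbitrary choice for illustration, and the pre-trained recurrent network alternative mentioned above is not shown.

```python
import librosa

def extract_mfcc_features(voice_signal, sample_rate, n_mfcc=13):
    """Convert the speech signal into a feature set and return one
    MFCC feature vector per audio frame ([frames, n_mfcc])."""
    mfcc = librosa.feature.mfcc(y=voice_signal, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.T
```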
- the learned artificial intelligence means comprises at least two blocks, wherein a first block of the at least two blocks of the learned artificial intelligence means processes the speech signal features, thereby performing the step of deriving the phoneme stream and the viseme stream corresponding to the phonemes of the phoneme stream, and a second block of the at least two blocks of the learned artificial intelligence means performs the step of calculating the animation curve for the visemes of the derived viseme stream based on the corresponding phonemes.
- a first block of the at least two blocks of the learned artificial intelligence means comprises at least one of a recurrent neural network, a Long Short-Term Memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or a combination thereof.
- a second block of the at least two blocks of the learned artificial intelligence means comprises at least one of a recurrent neural network, a Long Short-Term Memory (LSTM), a gated recurrent unit (GRU), a variant thereof, or a combination thereof.
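- The two-block layout described above could be realized, for example, as the PyTorch module sketched below, with a GRU in each block. The layer sizes, vocabulary sizes, and the way the second block consumes the first block's outputs are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class TwoBlockSpeechAnimationModel(nn.Module):
    def __init__(self, n_features=13, n_phonemes=40, n_visemes=20,
                 n_curves=20, hidden=128):
        super().__init__()
        # First block: derives the phoneme stream and the viseme stream
        # from the speech signal features.
        self.block1 = nn.GRU(n_features, hidden, batch_first=True)
        self.phoneme_head = nn.Linear(hidden, n_phonemes)
        self.viseme_head = nn.Linear(hidden, n_visemes)
        # Second block: calculates animation curves for the derived visemes.
        self.block2 = nn.GRU(hidden + n_visemes, hidden, batch_first=True)
        self.curve_head = nn.Linear(hidden, n_curves)

    def forward(self, features):          # features: [batch, frames, n_features]
        h1, _ = self.block1(features)
        phonemes = self.phoneme_head(h1)  # per-frame phoneme logits
        visemes = self.viseme_head(h1)    # per-frame viseme logits
        h2, _ = self.block2(torch.cat([h1, visemes], dim=-1))
        curves = self.curve_head(h2)      # per-frame animation-curve values
        return phonemes, visemes, curves
```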
- the calculating of animation curves for visemes of the viseme stream derived based on the speech signal characteristics is performed using the Facial Action Coding System (FACS).
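- When FACS is used, an animation curve can be represented as time-varying intensities of facial action units, as in the illustrative snippet below; the specific action units and values are not taken from the disclosure.

```python
# Illustrative only: per-frame intensities (0..1) of a few FACS action units
# commonly associated with mouth shapes, one value per animation frame.
animation_curve = {
    "AU25_lips_part":  [0.0, 0.4, 0.8, 0.6, 0.2],
    "AU26_jaw_drop":   [0.0, 0.3, 0.7, 0.5, 0.1],
    "AU18_lip_pucker": [0.0, 0.0, 0.1, 0.3, 0.0],
}
```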
- an electronic computing device comprising: at least one processor; and a memory storing numerical parameters of at least one learned artificial intelligence means and instructions that, when executed by the at least one processor, cause the at least one processor to perform a method of generating an animated head model from a speech signal.
- Various embodiments of the present disclosure may be implemented as software including one or more instructions stored in a storage medium (eg, memory 920) readable by a machine (eg, electronic device 100).
- the device may invoke at least one of the one or more instructions stored in the storage medium and execute it. This enables the device to be operated to perform at least one function according to the at least one instruction invoked.
- the one or more instructions may include code generated by a compiler or code executable by an interpreter.
- the device-readable storage medium may be provided in the form of a non-transitory storage medium.
- 'non-transitory' only means that the storage medium is a tangible device and does not contain a signal (e.g., an electromagnetic wave); this term does not distinguish between a case where data is semi-permanently stored in the storage medium and a case where data is temporarily stored therein.
- the method according to various embodiments disclosed in the present disclosure may be included and provided in a computer program product.
- Computer program products may be traded between sellers and buyers as commodities.
- the computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read only memory (CD-ROM)), or may be distributed (e.g., downloaded or uploaded) online, either via an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones).
- a part of the computer program product may be temporarily stored or temporarily created in a machine-readable storage medium such as a memory of a server of a manufacturer, a server of an application store, or a relay server.
- each component (e.g., a module or a program) of the above-described components may include a single entity or a plurality of entities.
- one or more components or operations among the above-described corresponding components may be omitted, or one or more other components or operations may be added.
- a plurality of components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each component of the plurality of components identically or similarly to those performed by the corresponding component among the plurality of components prior to the integration.
- operations performed by a module, a program, or another component may be executed sequentially, in parallel, repeatedly, or heuristically; one or more of the operations may be executed in a different order or omitted; or one or more other operations may be added.
Abstract
Disclosed are a method of generating a head model animation from a voice signal using an artificial intelligence model, and an electronic device for implementing it. The disclosed method of generating a head model animation from a voice signal, performed by the electronic device, comprises the steps of: acquiring characteristic information of a voice signal from the voice signal; acquiring, using the artificial intelligence model, from the characteristic information, a phoneme stream corresponding to the voice signal and a viseme stream corresponding to the phoneme stream; acquiring, using the artificial intelligence model, an animation curve of visemes included in the viseme stream; merging the phoneme stream with the viseme stream; and generating a head model animation by applying the animation curve to the visemes of the merged phoneme and viseme stream.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
RU2019139078A RU2721180C1 (ru) | 2019-12-02 | 2019-12-02 | Способ генерации анимационной модели головы по речевому сигналу и электронное вычислительное устройство, реализующее его |
RU2019139078 | 2019-12-02 | ||
KR1020200089852A KR20210070169A (ko) | 2019-12-02 | 2020-07-20 | 음성 신호에서 헤드 모델 애니메이션을 생성하는 방법 및 이를 구현하는 전자 장치 |
KR10-2020-0089852 | 2020-07-20 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021112365A1 true WO2021112365A1 (fr) | 2021-06-10 |
Family
ID=76222049
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2020/009663 WO2021112365A1 (fr) | 2019-12-02 | 2020-07-22 | Procédé de génération d'une animation de modèle de tête à partir d'un signal vocal et dispositif électronique pour son implémentation |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021112365A1 (fr) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20060090687A (ko) * | 2003-09-30 | 2006-08-14 | 코닌클리케 필립스 일렉트로닉스 엔.브이. | 시청각 콘텐츠 합성을 위한 시스템 및 방법 |
US20090135177A1 (en) * | 2007-11-20 | 2009-05-28 | Big Stage Entertainment, Inc. | Systems and methods for voice personalization of video content |
US20170011745A1 (en) * | 2014-03-28 | 2017-01-12 | Ratnakumar Navaratnam | Virtual photorealistic digital actor system for remote service of customers |
US20170243387A1 (en) * | 2016-02-18 | 2017-08-24 | Pinscreen, Inc. | High-fidelity facial and speech animation for virtual reality head mounted displays |
KR20190084260A (ko) * | 2016-11-11 | 2019-07-16 | 매직 립, 인코포레이티드 | 전체 얼굴 이미지의 안구주위 및 오디오 합성 |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113689531A (zh) * | 2021-08-02 | 2021-11-23 | 北京小米移动软件有限公司 | 动画显示方法、动画显示装置、终端及存储介质 |
CN115830171A (zh) * | 2023-02-17 | 2023-03-21 | 深圳前海深蕾半导体有限公司 | 基于人工智能绘画的图像生成方法、显示设备及存储介质 |
CN115830171B (zh) * | 2023-02-17 | 2023-05-09 | 深圳前海深蕾半导体有限公司 | 基于人工智能绘画的图像生成方法、显示设备及存储介质 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20897262; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | EP: PCT application non-entry in European phase | Ref document number: 20897262; Country of ref document: EP; Kind code of ref document: A1