
JP7237196B2 - Method, device and computer program using Duration Informed Attention Network (DurIAN) for audio-visual synthesis

Info

Publication number: JP7237196B2
Application number: JP2021560105A
Authority: JP
Inventors: リュ，ホン; ユィ，チョンジュ; ユィ，ドン
Original assignee: テンセント・アメリカ・エルエルシー
Priority date: 2019-08-23
Filing date: 2020-08-06
Publication date: 2023-03-10
Anticipated expiration: 2040-08-06
Also published as: WO2021040989A1; CN114041183A; JP2022526668A; US20210375259A1; US20210056949A1; US11670283B2; EP3942548A4; EP3942548A1; US11151979B2

Description

Related Applications
This application claims priority to U.S. Patent Application No. 16/549,068, filed in the U.S. Patent and Trademark Office on August 23, 2019, the disclosure of which is incorporated herein by reference in its entirety.

Background

Technical Field
Embodiments described herein relate to methods and apparatus for generating audio and video information from an input.

Related Application
U.S. Application No. 16/397,349, filed on April 29, 2019, is incorporated herein by reference in its entirety.

Description of the Related Art
Recently, end-to-end speech synthesis systems such as Tacotron have shown impressive text-to-speech (TTS) results in terms of the naturalness and prosody of the synthesized speech. However, such systems have a significant drawback: some words in the input text may be skipped or repeated when speech is synthesized. This problem stems from their end-to-end nature, in which an uncontrollable attention mechanism is used to generate the speech.

Effects and Advantages of Certain Embodiments
Embodiments described herein relate to methods and apparatus for modeling and generating both speech and talking-face video information, in some embodiments simultaneously. These embodiments are based on a new model, the Duration Informed Attention Network (DurIAN), which is described herein and is also described in the above-mentioned U.S. Application No. 16/397,349, which is incorporated into this disclosure in its entirety.

End-to-end attention-based models have shown improvements over traditional non-end-to-end TTS frameworks. However, end-to-end attention-based models also suffer from omitting or repeating words of the raw input text, a defect commonly seen in end-to-end attention frameworks.

Embodiments of the present disclosure introduce independent phone duration modeling into the end-to-end attention framework and successfully resolve the problems of conventional end-to-end attention frameworks. Embodiments of the present disclosure use the newly proposed Duration Informed Attention Network (DurIAN) framework to model speech and talking-face video information simultaneously. Embodiments of the present disclosure show better performance than conventional audio-visual modeling methods. Embodiments of the present disclosure also support modeling and synthesizing voices and faces in various styles, for example happy, sad, annoying, and natural. Embodiments of the present disclosure also show better duration control and system controllability than conventional frameworks.

Embodiments of the present disclosure can also be applied to virtual persons, virtual faces, and the like.

Embodiments of the present disclosure provide a better and more synchronized audio-visual modeling and synthesis method using the DurIAN model.

Embodiments of the present disclosure support multi-style audio-visual modeling and synthesis.

Embodiments of the present disclosure provide better controllability of audio-visual modeling and synthesis than conventional methods.

Embodiments of the present disclosure can also be applied to audio features only or to visual features only, or can model both jointly as multi-task training.

Summary
According to some possible implementations, a method may include: receiving, by a device, a text input that includes a sequence of text components; determining, by the device and using a duration model, respective temporal durations of the text components; generating, by the device, a first set of spectra based on the sequence of text components; generating, by the device, a second set of spectra based on the first set of spectra and the respective temporal durations of the sequence of text components; generating, by the device, a spectrogram frame based on the second set of spectra; generating, by the device, an audio waveform based on the spectrogram frame; generating, by the device, video information corresponding to the audio waveform; and providing, by the device and based on the video information, the audio waveform and a corresponding video as an output of the device.
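As a non-limiting illustration, the order of the steps recited above can be sketched as a small data-flow example. Every function name below (`predict_durations`, `encode`, `expand`, `synthesize`) is hypothetical, and each stage is a toy placeholder rather than a trained model; the sketch only shows which output feeds which step.

```python
from typing import List, Sequence, Tuple

Spectrum = List[float]


def predict_durations(components: Sequence[str]) -> List[int]:
    """Placeholder duration model (assumption): frames per text component."""
    return [2 if c in {"AE", "IH"} else 1 for c in components]


def encode(components: Sequence[str]) -> List[Spectrum]:
    """Placeholder encoder (assumption): one toy spectrum per text component."""
    return [[float(i)] for i, _ in enumerate(components)]


def expand(first_set: List[Spectrum], durations: List[int]) -> List[Spectrum]:
    """Second set of spectra: each spectrum repeated for its duration."""
    return [s for s, d in zip(first_set, durations) for _ in range(d)]


def synthesize(components: Sequence[str]) -> Tuple[List[Spectrum], List[float], List[str]]:
    durations = predict_durations(components)      # temporal durations
    first_set = encode(components)                 # first set of spectra
    second_set = expand(first_set, durations)      # second set of spectra
    frames = second_set                            # stand-in for the decoder output
    waveform = [v for frame in frames for v in frame]          # stand-in vocoder
    video = [f"face_frame_{i}" for i in range(len(frames))]    # stand-in video info
    return frames, waveform, video


frames, waveform, video = synthesize(["K", "AE", "T"])
```

Because the video placeholder is derived from the same frame sequence as the waveform, the two outputs have matching lengths, which is the property the duration model is meant to provide.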

According to some possible implementations, the method may include training the duration model.

According to some possible implementations, in the method, the text input may be obtained by: receiving, as an input, an input video that includes a corresponding input audio waveform; generating, by the device, input video information corresponding to the input audio waveform; generating, by the device, an input spectrogram frame based on the input audio waveform; generating, by the device, a first input set of spectra based on the input spectrogram frame; generating, by the device, a second input set of spectra based on the first input set of spectra; and determining, by the device and using the duration model, the text input.

According to some possible implementations, the text components may be phonemes or characters.

According to some possible implementations, the method may further include receiving, by the device, information corresponding to an emotional state associated with the text input, and the audio waveform and the corresponding video provided as the output may be based on the information corresponding to the emotional state.

According to some possible implementations, in the method, the audio waveform and the corresponding video, which may be based on the video information, may be provided simultaneously as the output.

According to some possible implementations, in the method, training the duration model may include multi-task training.

According to some possible implementations, in the method, the output audio waveform and the corresponding output video may be applied to a virtual person.

According to some possible implementations, in the method, the second set of spectra may include mel-frequency cepstrum spectra.

According to some possible implementations, in the method, training the duration model may include using a set of prediction frames and training text components.

According to some possible implementations, a device may include: at least one memory configured to store program code; and at least one processor configured to read the program code and operate as instructed by the program code, the program code including: receiving code configured to cause the at least one processor to receive a text input that includes a sequence of text components; determining code configured to cause the at least one processor to determine, using a duration model, respective temporal durations of the text components; generating code configured to cause the at least one processor to generate a first set of spectra based on the sequence of text components, generate a second set of spectra based on the first set of spectra and the respective temporal durations of the sequence of text components, generate a spectrogram frame based on the second set of spectra, generate an audio waveform based on the spectrogram frame, and generate video information corresponding to the audio waveform; and providing code configured to cause the at least one processor to provide the audio waveform and a corresponding video as an output.

According to some possible implementations, the program code may further include training code configured to train the duration model.

According to some possible implementations, the text input that the receiving code causes the at least one processor to receive may be obtained by program code that further includes: input receiving code configured to cause the at least one processor to receive, as an input, an input video that includes a corresponding input audio waveform; input generating code configured to cause the at least one processor to generate input video information corresponding to the input audio waveform, generate an input spectrogram frame based on the input audio waveform, generate a first input set of spectra based on the input spectrogram frame, and generate a second input set of spectra based on the first input set of spectra; and input determining code configured to provide the text input by using the duration model on the second input set of spectra.

According to some possible implementations, the text components may be phonemes or characters.

According to some possible implementations, the receiving code may be further configured to cause the at least one processor to receive information corresponding to an emotional state associated with the text input, and the providing code may be further configured to provide the audio waveform and the corresponding video as the output based on the information corresponding to the emotional state.

According to some possible implementations, the providing code may be further configured to provide the audio waveform and the corresponding video simultaneously as the output.

According to some possible implementations, the training code may be configured to train the duration model using multi-task training.

According to some possible implementations, the providing code may be further configured to provide the audio waveform and the corresponding video as an output applied to a virtual person.

According to some possible implementations, the training code may be configured to train the duration model using a set of prediction frames and training text components.

According to some possible implementations, a non-transitory computer-readable medium may be provided that stores instructions, the instructions including one or more instructions that, when executed by one or more processors of a device, cause the one or more processors to: receive a text input that includes a sequence of text components; determine, using a duration model, respective temporal durations of the text components; generate a first set of spectra based on the sequence of text components; generate a second set of spectra based on the first set of spectra and the respective temporal durations of the sequence of text components; generate a spectrogram frame based on the second set of spectra; generate an audio waveform based on the spectrogram frame; generate video information corresponding to the audio waveform; and provide the audio waveform and a corresponding video as an output.

FIG. 1 is a diagram of an overview of an embodiment described herein.

FIG. 2 is a diagram of an example environment in which systems and/or methods described herein may be implemented.

FIG. 3 is a diagram of example components of one or more devices of FIG. 2.

FIG. 4 is a flowchart of an example process for generating an audio waveform and a corresponding video according to an embodiment.

Further figures are diagrams including inputs and outputs of a duration model according to an embodiment.

TTS systems have diverse applications. However, the commercial systems mainly in use are mostly based on parametric systems, which show a large gap compared to natural human speech. Tacotron is a TTS synthesis system that differs significantly from conventional parametric TTS systems and can produce highly natural speech. The whole system can be trained end to end, and it replaces the conventional, complex linguistic feature extraction part with an encoder built on a convolution bank, highway network, and bidirectional gated recurrent unit (CBHG) module.

The duration model used in conventional parametric systems is replaced with an end-to-end attention mechanism, in which the alignment between the input text (or phoneme sequence) and the speech signal is learned from an attention model rather than from hidden Markov model (HMM)-based alignment. Another major difference of the Tacotron system is that it directly predicts mel/linear spectra, which advanced vocoders such as WaveNet and WaveRNN can use directly to synthesize high-quality speech.

Tacotron-based systems can generate more accurate and natural-sounding speech. However, Tacotron systems exhibit instabilities such as skipping and/or repeating input text, which is an inherent drawback when synthesizing speech waveforms.

Some implementations herein address the above-described skipping and repetition of input text in Tacotron-based systems while preserving their superior synthesis quality. Furthermore, some implementations herein address these instability issues and achieve significantly improved naturalness in the synthesized speech.

Tacotron's instability is mainly caused by its uncontrollable attention mechanism, which gives no guarantee that each input text component is synthesized in order without skipping or repetition.

Some implementations herein replace this unstable and uncontrollable attention mechanism with a duration-based attention mechanism, which guarantees that the input text is synthesized in order, with no skipping or repetition. The main reason attention is needed in Tacotron-based systems is that alignment information between the source text and the target spectrogram is missing.

Typically, the input text is much shorter than the generated spectrogram. A single character or phoneme of the input text may generate multiple spectrogram frames, and this one-to-many information is needed to model the input/output relationship in any neural network architecture.

Tacotron-based systems mainly address this problem with an end-to-end mechanism in which spectrogram generation relies on attention learned over the source input text. However, such an attention mechanism is fundamentally unstable because the attention is highly uncontrollable. Some implementations herein replace the end-to-end attention mechanism of the Tacotron system with a duration model that predicts how long a single input character and/or phoneme lasts. In other words, the alignment between the output spectrogram and the input text is achieved by replicating each input character and/or phoneme for a predetermined duration. The ground-truth durations of the input text, from which the system learns, are obtained with HMM-based forced alignment. Using the predicted durations, each target frame in the spectrogram can be matched with one character or phoneme of the input text. The overall model architecture is depicted in the following figure.
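As a non-limiting illustration of how predicted durations let each target frame be matched with exactly one character or phoneme, the lookup can be sketched with cumulative sums. The function name `frame_to_component` and the duration values are assumptions made for this example only; durations are expressed as frame counts.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence


def frame_to_component(durations: Sequence[int], frame_index: int) -> int:
    """Return the index of the text component that owns the given frame.

    durations[i] is the number of spectrogram frames assigned to component i.
    """
    ends: List[int] = list(accumulate(durations))  # exclusive end frame of each component
    if not ends or not 0 <= frame_index < ends[-1]:
        raise IndexError("frame index outside the aligned spectrogram")
    # The owning component is the first one whose end frame exceeds frame_index.
    return bisect_right(ends, frame_index)


phonemes = ["DH", "IH", "S"]
durations = [3, 1, 2]  # hypothetical frame counts
alignment = [phonemes[frame_to_component(durations, t)] for t in range(sum(durations))]
```

Since the component index never decreases as the frame index grows and every component with a nonzero duration owns at least one frame, such an alignment cannot skip or repeat a component, unlike a learned attention map.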

FIG. 1 is a diagram of an overview of an embodiment described herein. As shown by reference number 110 in FIG. 1, a platform (e.g., a server) may receive a text input that includes a sequence of text components. As shown, the text input may include a phrase such as "This is a cat." The text input may include a sequence of text components, shown as "DH," "IH," "S," "IH," "Z," "AX," "K," "AE," and "T."

As further shown in FIG. 1 by reference number 120, the platform may determine respective temporal durations of the text components using a duration model. The duration model may include a model that receives input text components and determines their temporal durations. As an example, the phrase "this is a cat" may have an overall temporal duration of one second when output audibly. The individual text components of the phrase may have different temporal durations that together make up the overall temporal duration.

As an example, the word "this" may have a temporal duration of 400 milliseconds, the word "is" a temporal duration of 200 milliseconds, the word "a" a temporal duration of 100 milliseconds, and the word "cat" a temporal duration of 300 milliseconds. The duration model may determine the temporal duration of each constituent text component.
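As a non-limiting illustration, the example durations above can be checked against the one-second total and converted into spectrogram frame counts. The 10 ms frame shift and the helper name `durations_to_frames` are assumptions for this sketch; the disclosure does not state a frame shift.

```python
from typing import Dict


def durations_to_frames(durations_ms: Dict[str, int], frame_shift_ms: float = 10.0) -> Dict[str, int]:
    """Convert per-word durations in milliseconds to frame counts.

    frame_shift_ms is an assumed hop between consecutive spectrogram frames.
    """
    return {word: round(ms / frame_shift_ms) for word, ms in durations_ms.items()}


word_durations_ms = {"this": 400, "is": 200, "a": 100, "cat": 300}
total_ms = sum(word_durations_ms.values())      # 400 + 200 + 100 + 300
frames = durations_to_frames(word_durations_ms)
total_frames = sum(frames.values())
```

Under the assumed frame shift, the four words map to 40, 20, 10, and 30 frames, so the one-second phrase corresponds to 100 spectrogram frames.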

As further shown in FIG. 1 by reference number 130, the platform may generate a first set of spectra based on the sequence of text components. For example, the platform may input the text components into a model that generates output spectra based on input text components. As shown, the first set of spectra may include a respective spectrum for each text component (e.g., shown as "1," "2," "3," "4," "5," "6," "7," "8," and "9").

As further shown in FIG. 1 by reference number 140, the platform may generate a second set of spectra based on the first set of spectra and the respective temporal durations of the sequence of text components. The platform may generate the second set of spectra by replicating each spectrum according to its respective temporal duration. As an example, spectrum "1" may be replicated so that the second set of spectra includes three spectral components corresponding to spectrum "1," and so on. The platform may use the output of the duration model to determine how to generate the second set of spectra.
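As a non-limiting illustration, the replication step can be sketched as follows. The labels "1" through "9" and the three copies of spectrum "1" follow the description of FIG. 1; the function name `expand_by_duration` and the remaining repeat counts are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def expand_by_duration(first_set: Sequence[T], repeats: Sequence[int]) -> List[T]:
    """Build the second set by repeating each entry of the first set.

    repeats[i] is the number of frames the duration model assigns to entry i.
    """
    if len(first_set) != len(repeats):
        raise ValueError("one repeat count is required per spectrum")
    if any(r < 0 for r in repeats):
        raise ValueError("repeat counts must be non-negative")
    second_set: List[T] = []
    for spectrum, count in zip(first_set, repeats):
        second_set.extend([spectrum] * count)
    return second_set


first_set = ["1", "2", "3", "4", "5", "6", "7", "8", "9"]
repeats = [3, 1, 2, 1, 2, 1, 2, 3, 2]  # only the leading 3 comes from the text
second_set = expand_by_duration(first_set, repeats)
```

The length of the second set equals the sum of the repeat counts, which is also the number of spectrogram frames to be predicted.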

As further shown in FIG. 1 by reference number 140, the platform may generate a spectrogram frame based on the second set of spectra. The spectrogram frame may be formed from the respective spectral components of the second set of spectra. As shown in FIG. 1, the spectrogram frame may align with a prediction frame. In other words, the spectrogram frame generated by the platform may accurately align with the intended audio output of the text input.

As shown in FIG. 1, a phoneme duration model may be introduced into the end-to-end attention framework to align the input linguistic text with the output acoustic features. As also shown in FIG. 1, both audio and visual features may be used as autoregressive outputs. In addition, style and emotion types may be added to the encoded linguistic features for audio-visual style control.
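As a non-limiting illustration, one way to add a style or emotion type to the encoded linguistic features is to append a fixed style vector to every encoded frame. Concatenation, the table `STYLE_EMBEDDINGS`, its values, and the function name `add_style` are all assumptions for this sketch; the disclosure states only that style and emotion types may be added to the encoded features.

```python
from typing import Dict, List

# Hypothetical style table; a trained system would learn these vectors.
STYLE_EMBEDDINGS: Dict[str, List[float]] = {
    "happy": [1.0, 0.0],
    "sad": [0.0, 1.0],
    "natural": [0.0, 0.0],
}


def add_style(encoded: List[List[float]], style: str) -> List[List[float]]:
    """Append the style vector to each encoded linguistic feature vector."""
    if style not in STYLE_EMBEDDINGS:
        raise KeyError(f"unknown style: {style}")
    embedding = STYLE_EMBEDDINGS[style]
    return [list(vector) + embedding for vector in encoded]


encoded_features = [[0.1, 0.2], [0.3, 0.4]]
styled = add_style(encoded_features, "sad")
```

Because the same style vector conditions both the audio and the visual outputs, a single control is enough to keep the synthesized voice and face in the same style.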

The platform may use various techniques to generate an audio waveform based on the spectrogram frame and provide the audio waveform as an output. Likewise, the platform may generate and output a corresponding video.

In this way, some implementations herein enable the generation of more accurate audio and video output for text-to-speech synthesis by using a duration model that determines the respective temporal durations of the input text components.

FIG. 2 is a diagram of an example environment 200 in which systems and/or methods described herein may be implemented. As shown in FIG. 2, environment 200 may include a user device 210, a platform 220, and a network 230. Devices of environment 200 may interconnect via wired connections, wireless connections, or a combination of wired and wireless connections.

User device 210 includes one or more devices capable of receiving, generating, storing, processing, and/or providing information associated with platform 220. For example, user device 210 may include a computing device (e.g., a desktop computer, a laptop computer, a tablet computer, a handheld computer, a smart speaker, a server, etc.), a mobile phone (e.g., a smart phone, a radiotelephone, etc.), a wearable device (e.g., a pair of smart glasses or a smart watch), or a similar device. In some implementations, user device 210 may receive information from and/or transmit information to platform 220.

Platform 220 includes one or more devices capable of generating an audio waveform using a duration informed attention network for text-to-speech synthesis, as described elsewhere herein. In some implementations, platform 220 may include a cloud server or a group of cloud servers. In some implementations, platform 220 may be designed to be modular so that certain software components can be swapped in or out depending on a particular need. As such, platform 220 may be easily and/or quickly reconfigured for different uses.

In some implementations, as shown, platform 220 may be hosted in cloud computing environment 222. Notably, while implementations described herein describe platform 220 as being hosted in cloud computing environment 222, in some implementations platform 220 is not cloud-based (i.e., it may be implemented outside of a cloud computing environment) or may be partially cloud-based.

Cloud computing environment 222 includes an environment that hosts platform 220. Cloud computing environment 222 may provide computation, software, data access, storage, and other services that do not require an end user (e.g., user device 210) to know the physical location and configuration of the system(s) and/or device(s) that host platform 220. Accordingly, cloud computing environment 222 may include a group of computing resources 224 (referred to collectively as "computing resources 224" and individually as "computing resource 224").

Computing resource 224 includes one or more personal computers, workstation computers, server devices, or other types of computation and/or communication devices. In some implementations, computing resource 224 may host platform 220. The cloud resources may include compute instances executing in computing resource 224, storage devices provided in computing resource 224, data transfer devices provided by computing resource 224, and the like. In some implementations, computing resource 224 may communicate with other computing resources 224 via wired connections, wireless connections, or a combination of wired and wireless connections.

As further shown in FIG. 2, computing resource 224 includes a group of cloud resources, such as one or more applications ("APPs") 224-1, one or more virtual machines ("VMs") 224-2, virtualized storage ("VS") 224-3, one or more hypervisors ("HYPs") 224-4, and the like.

Application 224-1 includes one or more software applications that may be provided to or accessed by user device 210 and/or sensor device 220. Application 224-1 may eliminate the need to install and execute the software applications on user device 210. For example, application 224-1 may include software associated with platform 220 and/or any other software capable of being provided via cloud computing environment 222. In some implementations, one application 224-1 may send information to and receive information from one or more other applications 224-1 via virtual machine 224-2.

Virtual machine 224-2 includes a software implementation of a machine (e.g., a computer) that executes programs like a physical machine. Virtual machine 224-2 may be either a system virtual machine or a process virtual machine, depending on its use and its degree of correspondence to any real machine. A system virtual machine may provide a complete system platform that supports execution of a complete operating system ("OS"). A process virtual machine may execute a single program and may support a single process. In some implementations, virtual machine 224-2 may execute on behalf of a user (e.g., user device 210) and may manage infrastructure of cloud computing environment 222, such as data management, synchronization, or long-duration data transfers.

Virtualized storage 224-3 includes one or more storage systems and/or one or more devices that use virtualization techniques within the storage systems or devices of computing resource 224. In some implementations, within the context of a storage system, types of virtualization may include block virtualization and file virtualization. Block virtualization may refer to the abstraction (or separation) of logical storage from physical storage so that the storage system can be accessed without regard to physical storage or heterogeneous structure. The separation may give administrators of the storage system flexibility in how they manage storage for end users. File virtualization may eliminate dependencies between data accessed at a file level and the location where files are physically stored. This may enable optimization of storage use, server consolidation, and/or the performance of ongoing file migrations.

Hypervisor 224-4 may provide hardware virtualization techniques that allow multiple operating systems (e.g., "guest operating systems") to execute concurrently on a host computer, such as computing resource 224. Hypervisor 224-4 may present a virtual operating platform to the guest operating systems and may manage the execution of the guest operating systems. Multiple instances of a variety of operating systems may share virtualized hardware resources.

Network 230 includes one or more wired and/or wireless networks. For example, network 230 may include a cellular network (e.g., a fifth generation (5G) network, a long-term evolution (LTE) network, a third generation (3G) network, a code division multiple access (CDMA) network, etc.), a public land mobile network (PLMN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), a private network, an ad hoc network, an intranet, the Internet, a fiber optic-based network, or the like, and/or a combination of these or other types of networks.

The number and arrangement of devices and networks shown in FIG. 2 are provided as an example. In practice, there may be additional devices and/or networks, fewer devices and/or networks, different devices and/or networks, or differently arranged devices and/or networks than those shown in FIG. 2. Furthermore, two or more devices shown in FIG. 2 may be implemented within a single device, or a single device shown in FIG. 2 may be implemented as multiple, distributed devices. Additionally or alternatively, a set of devices (e.g., one or more devices) of environment 200 may perform one or more functions described as being performed by another set of devices of environment 200.

FIG. 3 is a diagram of example components of device 300. Device 300 may correspond to user device 210 and/or platform 220. As shown in FIG. 3, device 300 may include a bus 310, a processor 320, a memory 330, a storage component 340, an input component 350, an output component 360, and a communication interface 370.

Bus 310 includes a component that permits communication among the components of device 300. Processor 320 is implemented in hardware, firmware, or a combination of hardware and software. Processor 320 is a central processing unit (CPU), a graphics processing unit (GPU), an accelerated processing unit (APU), a microprocessor, a microcontroller, a digital signal processor (DSP), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), or another type of processing component. In some implementations, processor 320 includes one or more processors capable of being programmed to perform a function. Memory 330 includes a random access memory (RAM), a read-only memory (ROM), and/or another type of dynamic or static storage device (e.g., a flash memory, a magnetic memory, and/or an optical memory) that stores information and/or instructions for use by processor 320.

Storage component 340 stores information and/or software related to the operation and use of device 300. For example, storage component 340 may include a hard disk (e.g., a magnetic disk, an optical disk, a magneto-optic disk, and/or a solid state disk), a compact disc (CD), a digital versatile disc (DVD), a floppy disk, a cartridge, a magnetic tape, and/or another type of non-transitory computer-readable medium, along with a corresponding drive.

Input component 350 includes a component that permits device 300 to receive information, such as via user input (e.g., a touch screen display, a keyboard, a keypad, a mouse, a button, a switch, and/or a microphone). Additionally or alternatively, input component 350 may include a sensor for sensing information (e.g., a global positioning system (GPS) component, an accelerometer, a gyroscope, and/or an actuator). Output component 360 includes a component that provides output information from device 300 (e.g., a display, a speaker, and/or one or more light-emitting diodes (LEDs)).

Communication interface 370 includes a transceiver-like component (e.g., a transceiver and/or a separate receiver and transmitter) that enables device 300 to communicate with other devices, such as via a wired connection, a wireless connection, or a combination of wired and wireless connections. Communication interface 370 may permit device 300 to receive information from another device and/or provide information to another device. For example, communication interface 370 may include an Ethernet interface, an optical interface, a coaxial interface, an infrared interface, a radio frequency (RF) interface, a universal serial bus (USB) interface, a Wi-Fi interface, a cellular network interface, or the like.

Device 300 may perform one or more processes described herein. Device 300 may perform these processes in response to processor 320 executing software instructions stored by a non-transitory computer-readable medium, such as memory 330 and/or storage component 340. A computer-readable medium is defined herein as a non-transitory memory device. A memory device includes memory space within a single physical storage device or memory space spread across multiple physical storage devices.

Software instructions may be read into memory 330 and/or storage component 340 from another computer-readable medium or from another device via communication interface 370. When executed, software instructions stored in memory 330 and/or storage component 340 may cause processor 320 to perform one or more processes described herein.

Additionally or alternatively, hardwired circuitry may be used in place of or in combination with software instructions to perform one or more processes described herein. Thus, implementations described herein are not limited to any specific combination of hardware circuitry and software.

The number and arrangement of components shown in FIG. 3 are provided as an example. In practice, device 300 may include additional components, fewer components, different components, or differently arranged components than those shown in FIG. 3. Additionally or alternatively, a set of components (e.g., one or more components) of device 300 may perform one or more functions described as being performed by another set of components of device 300.

FIG. 4 is a flowchart of an example process 400 for generating an audio waveform and a corresponding video using a duration informed attention network for text-to-speech synthesis. In some implementations, one or more process blocks of FIG. 4 may be performed by platform 220. In some implementations, one or more process blocks of FIG. 4 may be performed by another device or a group of devices separate from or including platform 220, such as user device 210.

As shown in FIG. 4, the process may include receiving, by a device, a text input that includes a sequence of text components (block 410).

For example, platform 220 may receive a text input that is to be converted to an audio output. The text components may include characters, phonemes, n-grams, words, letters, and/or the like. The sequence of text components may form a sentence, a phrase, and/or the like.
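As an illustration only, the following Python sketch shows one way a text input might be split into a sequence of text components. The function name `to_components` and its `unit` and `n` parameters are assumptions made for this example; the specification does not prescribe a particular tokenization.

```python
def to_components(text, unit="char", n=2):
    """Split a text input into a sequence of text components.

    unit: "char"  -> individual characters
          "word"  -> whitespace-separated words
          "ngram" -> overlapping character n-grams of length n
    (These unit names are hypothetical and chosen for this sketch.)
    """
    if unit == "char":
        return list(text)
    if unit == "word":
        return text.split()
    if unit == "ngram":
        return [text[i:i + n] for i in range(len(text) - n + 1)]
    raise ValueError(f"unknown unit: {unit}")


chars = to_components("hi there")            # one component per character
words = to_components("hi there", "word")    # one component per word
bigrams = to_components("hello", "ngram")    # overlapping character bigrams
```

A phoneme sequence would require a pronunciation lexicon or a grapheme-to-phoneme model, which is outside the scope of this sketch.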

As further shown in FIG. 4, the process may include determining, by the device and using a duration model, respective temporal durations of the text components (block 420).

The duration model may include a model that receives an input text component and determines a temporal duration of the input text component. Platform 220 may train the duration model. For example, platform 220 may use machine learning techniques to analyze data (e.g., training data, such as historical data) and create the duration model. The machine learning techniques may include supervised and/or unsupervised techniques, such as artificial neural networks, Bayesian statistics, learning automata, hidden Markov modeling, linear classifiers, quadratic classifiers, decision trees, and association rule learning.

Platform 220 may train the duration model by aligning spectrogram frames of known duration with a sequence of text components. For example, platform 220 may use HMM-based forced alignment to determine ground-truth durations for an input text sequence of text components. Platform 220 may train the duration model by using a known input text sequence that includes text components together with predicted or target spectrogram frames of known duration.
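The following Python sketch illustrates how ground-truth durations might be read off an alignment once it is available. It assumes, purely for illustration, that a forced aligner has already assigned each spectrogram frame the index of the text component it belongs to; the function name `durations_from_alignment` and the sample data are hypothetical, and the HMM-based alignment itself is not shown.

```python
from collections import Counter


def durations_from_alignment(num_components, frame_to_component):
    """Count how many frames the alignment assigns to each text component.

    frame_to_component[t] is the index of the component aligned to frame t
    (an assumed output format for the forced aligner).
    Returns one duration, in frames, per component.
    """
    counts = Counter(frame_to_component)
    return [counts.get(i, 0) for i in range(num_components)]


components = ["h", "e", "l", "l", "o"]
# Hypothetical alignment of nine frames to the five components above.
frame_to_component = [0, 0, 1, 1, 1, 2, 3, 3, 4]
durations = durations_from_alignment(len(components), frame_to_component)
# durations == [2, 3, 1, 2, 1]
```

Indexing by component position rather than by symbol keeps repeated symbols (the two "l" components here) distinct, and the durations always sum to the number of frames.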

Platform 220 may input the text components into the duration model and, based on an output of the model, determine information that identifies or is associated with the respective temporal durations of the text components. The information that identifies or is associated with the respective temporal durations may be used to generate a second set of spectra, as described below.
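To make the input and output of a duration model concrete, the following Python sketch uses a deliberately simple stand-in: a lookup table of the mean training duration of each text component. This is not the trained model of the specification; the class name `MeanDurationModel`, its methods, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


class MeanDurationModel:
    """Toy stand-in for a duration model: predicts, for each text
    component, the mean duration (in frames) seen during training."""

    def __init__(self, default=1):
        self.default = default  # used for components never seen in training
        self.mean = {}

    def fit(self, sequences):
        # sequences: iterable of lists of (component, duration_in_frames)
        totals = defaultdict(lambda: [0, 0])  # component -> [sum, count]
        for seq in sequences:
            for comp, dur in seq:
                totals[comp][0] += dur
                totals[comp][1] += 1
        self.mean = {c: s / n for c, (s, n) in totals.items()}
        return self

    def predict(self, components):
        # Round to whole frames and never predict fewer than one frame.
        return [max(1, round(self.mean.get(c, self.default)))
                for c in components]


training = [[("h", 2), ("e", 3)], [("h", 4), ("o", 2)]]
model = MeanDurationModel().fit(training)
predicted = model.predict(["h", "e", "x"])
# predicted == [3, 3, 1]
```

A learned model would condition on context around each component rather than on the component alone, but the interface is the same: text components in, one duration per component out.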

As further shown in FIG. 4, the process may include generating a first set of spectra based on the sequence of text components (block 430).

For example, platform 220 may generate output spectra that correspond to the text components of the input sequence of text components. Platform 220 may use a CBHG module to generate the output spectra. The CBHG module may include a bank of 1-D convolutional filters, a set of highway networks, a bidirectional gated recurrent unit (GRU) recurrent neural network (RNN), and/or other components.
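The bank of 1-D convolutional filters can be pictured with the following pure-Python sketch, which applies filters of width 1 through `max_width` to a sequence and stacks the results per time step. It is a simplification under stated assumptions: the input is a single-channel sequence of scalars, the filter weights are placeholder averaging kernels rather than learned weights, and the highway networks and bidirectional GRU are omitted. The function names are hypothetical.

```python
def conv1d_same(seq, kernel):
    """1-D convolution with zero padding so the output has the same
    length as the input (computed as a sliding dot product)."""
    k = len(kernel)
    left = (k - 1) // 2
    padded = [0.0] * left + list(seq) + [0.0] * (k - 1 - left)
    return [sum(padded[t + j] * kernel[j] for j in range(k))
            for t in range(len(seq))]


def conv_bank(seq, max_width):
    """Apply filters of width 1..max_width and return, for each time
    step, a feature vector with one value per filter width."""
    outputs = []
    for width in range(1, max_width + 1):
        kernel = [1.0 / width] * width  # placeholder; real filters are learned
        outputs.append(conv1d_same(seq, kernel))
    # Transpose from (filter, time) to (time, filter).
    return [list(step) for step in zip(*outputs)]


features = conv_bank([1.0, 2.0, 3.0, 4.0], max_width=3)
```

Each width captures context over a different span of neighboring components, which is the intuition behind using a bank rather than a single filter.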

In some implementations, the output spectra may be mel-frequency cepstrum (MFC) spectra. The output spectra may include any type of spectra used to generate spectrogram frames.
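For readers unfamiliar with the mel scale that underlies mel-frequency representations, the following Python sketch converts between hertz and mels using one widely used convention, mel = 2595 * log10(1 + f / 700). The specification does not state which mel formula is used, so this choice, like the function names, is an assumption.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in hertz to mels (one common convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


mel_1k = hz_to_mel(1000.0)       # close to 1000 mels by construction
round_trip = mel_to_hz(mel_1k)   # recovers approximately 1000 Hz
```

The scale is roughly linear at low frequencies and logarithmic at high frequencies, so equal steps in mels correspond to progressively wider bands in hertz.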

As further shown in FIG. 4, the process may include generating a second set of spectra based on the first set of spectra and the respective temporal durations of the sequence of text components (block 440).

For example, platform 220 may generate the second set of spectra using the first set of spectra and the information that identifies or is associated with the respective temporal durations of the text components.

As an example, platform 220 may replicate various spectra of the first set of spectra based on the respective temporal durations of the underlying text components to which the spectra correspond. In some cases, platform 220 may replicate a spectrum based on a replication factor, a temporal factor, and/or the like. In other words, the output of the duration model may be used to determine a factor by which to replicate a particular spectrum, generate additional spectra, and/or the like.
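The replication described above can be sketched in a few lines of Python. The sketch assumes that each spectrum is a list of numbers and that each duration is already expressed as a whole number of frames, which serves as the replication factor; the function name `expand_by_duration` is hypothetical.

```python
def expand_by_duration(spectra, durations):
    """Repeat each spectrum of the first set as many times as the
    duration (in frames) of its text component, yielding the second set."""
    if len(spectra) != len(durations):
        raise ValueError("need exactly one duration per spectrum")
    expanded = []
    for spectrum, n_frames in zip(spectra, durations):
        # Copy the spectrum so later edits to one frame do not affect others.
        expanded.extend(list(spectrum) for _ in range(n_frames))
    return expanded


first_set = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # one spectrum per component
durations = [2, 1, 3]                               # frames per component
second_set = expand_by_duration(first_set, durations)
```

The second set has one entry per output frame, so its length equals the sum of the durations (six in this example). Because every component receives at least its predicted number of frames, no component is skipped or repeated arbitrarily.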

As further shown in FIG. 4, the process may include generating a spectrogram frame based on the second set of spectra (block 450).

For example, platform 220 may generate a spectrogram frame based on the second set of spectra. Collectively, the second set of spectra forms a spectrogram frame. As noted elsewhere herein, a spectrogram frame generated using the duration model may more accurately resemble a target or predicted frame. In this way, some implementations herein improve the accuracy of TTS synthesis, the naturalness of the generated speech, the prosody of the generated speech, and/or the like.

As further shown in FIG. 4, the process may include generating an audio waveform based on the spectrogram frame (block 460).

For example, platform 220 may generate an audio waveform based on the spectrogram frame and provide the audio waveform as an output. As examples, platform 220 may provide the audio waveform to an output component (e.g., a speaker), provide the audio waveform to another device (e.g., user device 210), transmit the audio waveform to a server or another terminal, and/or the like.

As further shown in FIG. 4, the process may include generating, by the device, video information corresponding to the audio waveform.

Finally, as shown in FIG. 4, the process may include providing the audio waveform and the corresponding video as an output.

Although FIG. 4 shows example blocks of process 400, in some implementations, process 400 may include additional blocks, fewer blocks, different blocks, or differently arranged blocks than those shown in FIG. 4. Additionally or alternatively, two or more of the blocks of process 400 may be performed in parallel.
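The order of operations in process 400 can be summarized with the following Python sketch, which only wires the stages together. Every stage is passed in as a function, and all of the parameter names (`predict_durations`, `encode`, `decode_frames`, `vocode`, `render_video`) are hypothetical placeholders rather than components named in the specification; the trivial stand-ins in the usage lines exist only so the sketch runs.

```python
def run_process(text, predict_durations, encode, decode_frames,
                vocode, render_video):
    """Data flow of process 400, with each stage supplied by the caller."""
    components = list(text)                        # block 410: text components
    durations = predict_durations(components)      # block 420: duration model
    first_set = encode(components)                 # block 430: first set of spectra
    second_set = [spectrum                         # block 440: replicate by duration
                  for spectrum, d in zip(first_set, durations)
                  for _ in range(d)]
    frames = decode_frames(second_set)             # block 450: spectrogram frame
    audio = vocode(frames)                         # block 460: audio waveform
    video = render_video(audio)                    # video information for the audio
    return audio, video                            # provided together as output


# Trivial stand-in stages, for illustration only.
audio, video = run_process(
    "ab",
    predict_durations=lambda comps: [2, 1],
    encode=lambda comps: [[ord(c)] for c in comps],
    decode_frames=lambda spectra: spectra,
    vocode=lambda frames: [v for frame in frames for v in frame],
    render_video=lambda samples: len(samples),
)
```

Blocks 420 and 430 depend only on the text components, which is one place where two blocks could run in parallel.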

The foregoing disclosure provides illustration and description, but is not intended to be exhaustive or to limit the implementations to the precise forms disclosed. Modifications and variations are possible in light of the above disclosure or may be acquired from practice of the implementations.

As used herein, the term "component" is intended to be broadly construed as hardware, firmware, or a combination of hardware and software.

It will be apparent that the systems and/or methods described herein may be implemented in different forms of hardware, firmware, or a combination of hardware and software. The actual specialized control hardware or software code used to implement these systems and/or methods does not limit the implementations. Thus, the operation and behavior of the systems and/or methods are described herein without reference to specific software code, it being understood that software and hardware may be designed to implement the systems and/or methods based on the description herein.

Even though particular combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the disclosure of possible implementations. In fact, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed below may directly depend on only one claim, the disclosure of possible implementations includes each dependent claim in combination with every other claim in the claim set.

No element, act, or instruction used herein should be construed as critical or essential unless explicitly described as such. Also, as used herein, the articles "a" and "an" are intended to include one or more items and may be used interchangeably with "one or more." Furthermore, as used herein, the term "set" is intended to include one or more items (e.g., related items, unrelated items, a combination of related and unrelated items, etc.) and may be used interchangeably with "one or more." Where only one item is intended, the term "one" or similar language is used. Also, as used herein, the terms "includes," "has," "having," and the like are intended to be open-ended terms. Further, the phrase "based on" is intended to mean "based, at least in part, on" unless explicitly stated otherwise.

＜付記＞
（付記1）
テキスト構成要素のシーケンスを含むテキスト入力を、デバイスにより受信するステップ；
前記テキスト構成要素の個々のテンポラル継続時間を、継続時間モデルを利用して前記デバイスにより決定するステップ；
前記テキスト構成要素のシーケンスに基づいて、スペクトルの第1セットを前記デバイスにより生成するステップ；
前記スペクトルの第1セットと前記テキスト構成要素のシーケンスの前記個々のテンポラル継続時間とに基づいて、スペクトルの第2セットを前記デバイスにより生成するステップ；
前記スペクトルの第2セットに基づいて、スペクトログラム・フレームを前記デバイスにより生成するステップ；
前記スペクトログラム・フレームに基づいて、オーディオ波形を前記デバイスにより生成するステップ；
前記オーディオ波形に対応するビデオ情報を、前記デバイスにより生成するステップ；及び
前記ビデオ情報に基づいて、前記オーディオ波形及び対応するビデオを前記デバイスの出力として、前記デバイスにより提供するステップ；
を含む方法。
（付記2）
前記継続時間モデルを訓練するステップ；
を更に含む付記1に記載の方法。
（付記3）
前記テキスト入力は：
対応する入力オーディオ波形を含む入力ビデオを、入力として受信するステップ；
前記入力オーディオ波形に対応する入力ビデオ情報を、前記デバイスにより生成するステップ；
前記入力オーディオ波形に基づいて、入力スペクトログラム・フレームを前記デバイスにより生成するステップ；
前記入力スペクトログラム・フレームに基づいて、スペクトルの第1入力セットを前記デバイスにより生成するステップ；
前記スペクトルの第1入力セットに基づいて、スペクトルの第2入力セットを前記デバイスにより生成するステップ；及び
前記テキスト入力を、前記継続時間モデルを利用して前記デバイスにより決定するステップ；
によって取得される、付記1に記載の方法。
（付記4）
前記テキスト構成要素は音素又は文字である、付記1に記載の方法。
（付記5）
前記テキスト入力に関連付けられる感情状態に対応する情報を、前記デバイスにより受信するステップ；
を更に含み、前記出力として提供される前記オーディオ波形及び対応するビデオは、前記感情状態に対応する前記情報に基づいている、付記1に記載の方法。
（付記6）
前記ビデオ情報に基づいて、前記オーディオ波形及び前記対応するビデオを前記デバイスの出力として、前記デバイスにより提供する前記ステップは、同時に実行される、付記1に記載の方法。
（付記7）
前記継続時間モデルを訓練する前記ステップは、マルチ・タスク・トレーニングを含む、付記2に記載の方法。
（付記8）
出力の前記オーディオ波形及び出力の前記対応するビデオは、仮想的な人物に適用される、付記1に記載の方法。
（付記9）
前記スペクトルの第2セットは、メル周波数ケプストラム・スペクトルを含む、付記1に記載の方法。
（付記10）
前記継続時間モデルを訓練する前記ステップは、予測フレームと訓練テキスト構成要素のセットを利用するステップを含む、付記2に記載の方法。
（付記11）
デバイスであって：
プログラム・コードを記憶するように構成された少なくとも1つのメモリ；及び
前記プログラム・コードを読み込み、前記プログラム・コードにより指示されるように動作するように構成された少なくとも1つのプロセッサ；
を含み、前記プログラム・コードは：
テキスト構成要素のシーケンスを含むテキスト入力を受信することを、前記少なくとも1つのプロセッサに行わせるように構成された受信コード；
前記テキスト構成要素の個々のテンポラル継続時間を、継続時間モデルを利用して決定することを、前記少なくとも1つのプロセッサに行わせるように構成された決定コード；
前記テキスト構成要素のシーケンスに基づいて、スペクトルの第1セットを生成すること；前記スペクトルの第1セットと前記テキスト構成要素のシーケンスの前記個々のテンポラル継続時間とに基づいて、スペクトルの第2セットを生成すること；前記スペクトルの第2セットに基づいて、スペクトログラム・フレームを生成すること；前記スペクトログラム・フレームに基づいて、オーディオ波形を生成すること；及び前記オーディオ波形に対応するビデオ情報を生成することを、前記少なくとも1つのプロセッサに行わせるように構成された生成コード；及び
前記オーディオ波形及び対応するビデオを出力として提供することを、前記少なくとも1つのプロセッサに行わせるように構成された提供コード；
を含む、デバイス。
（付記12）
前記プログラム・コードは、前記継続時間モデルを訓練するように構成された訓練コードを更に含む、付記11に記載のデバイス。
（付記13）
前記受信コードが前記少なくとも1つのプロセッサに受信させる前記テキスト入力は：
対応する入力オーディオ波形を含む入力ビデオを入力として受信することを、前記少なくとも1つのプロセッサに行わせるように構成された入力受信コード；
前記入力オーディオ波形に対応する入力ビデオ情報を生成すること；前記入力オーディオ波形に基づいて、入力スペクトログラム・フレームを生成すること；前記入力スペクトログラム・フレームに基づいて、スペクトルの第1入力セットを生成すること；及び前記スペクトルの第1入力セットに基づいて、スペクトルの第2入力セットを生成することを前記少なくとも1つのプロセッサに行わせるように構成された入力生成コード；及び
前記スペクトルの第2入力セットに関して前記継続時間モデルを使用することによって、前記テキスト入力を提供するように構成された入力決定コード；
を更に含む前記プログラム・コードによって取得される、付記11に記載のデバイス。
（付記14）
前記テキスト構成要素は音素又は文字である、付記11に記載のデバイス。
（付記15）
前記受信コードは、前記テキスト入力に関連付けられる感情状態に対応する情報を受信することを、前記少なくとも1つのプロセッサに行わせるように更に構成されており、
前記提供コードは、前記感情状態に対応する前記情報に基づいて、前記オーディオ波形及び前記対応するビデオを前記出力として提供するように更に構成されている、付記11に記載のデバイス。
（付記16）
前記提供コードは、前記オーディオ波形及び前記対応するビデオを前記出力として同時に提供するように更に構成されている、付記11に記載のデバイス。
（付記17）
前記訓練コードは、マルチ・タスク・トレーニングを用いて前記継続時間モデルを訓練するように構成されている、付記12に記載のデバイス。
（付記18）
前記提供コードは、前記オーディオ波形及び前記対応するビデオを、仮想的な人物に適用される前記出力として提供するように更に構成されている、付記11に記載のデバイス。
（付記19）
前記訓練コードは、予測フレームと訓練テキスト構成要素のセットを利用して前記継続時間モデルを訓練するように構成されている、付記12に記載のデバイス。
（付記20）
1つ以上の命令を含む命令を記憶する非一時的なコンピュータ読み取り可能な媒体であって、前記命令は、デバイスの1つ以上のプロセッサにより実行されると、前記1つ以上のプロセッサに：
テキスト構成要素のシーケンスを含むテキスト入力を受信するステップ；
前記テキスト構成要素の個々のテンポラル継続時間を、継続時間モデルを利用して決定するステップ；
前記テキスト構成要素のシーケンスに基づいて、スペクトルの第1セットを生成するステップ；
前記スペクトルの第1セットと前記テキスト構成要素のシーケンスの前記個々のテンポラル継続時間とに基づいて、スペクトルの第2セットを生成するステップ；
前記スペクトルの第2セットに基づいて、スペクトログラム・フレームを生成するステップ；
前記スペクトログラム・フレームに基づいて、オーディオ波形を生成するステップ；
前記オーディオ波形に対応するビデオ情報を生成するステップ；及び
前記オーディオ波形及び対応するビデオを出力として提供するステップ；
を実行させる、記憶媒体。

<Appendix>
(Appendix 1)
receiving by the device a text input comprising a sequence of text components;
determining individual temporal durations of said text constituents by said device utilizing a duration model;
generating by the device a first set of spectra based on the sequence of textual components;
generating by the device a second set of spectra based on the first set of spectra and the respective temporal durations of the sequences of text components;
generating spectrogram frames by the device based on the second set of spectra;
generating an audio waveform by the device based on the spectrogram frames;
generating, by the device, video information corresponding to the audio waveform; and providing, by the device, the audio waveform and corresponding video as an output of the device based on the video information;
method including.
(Appendix 2)
training the duration model;
The method of Clause 1, further comprising:
(Appendix 3)
Said text input is:
receiving as input an input video including a corresponding input audio waveform;
generating by the device input video information corresponding to the input audio waveform;
generating an input spectrogram frame by the device based on the input audio waveform;
generating by said device a first input set of spectra based on said input spectrogram frame;
generating by the device a second input set of spectra based on the first input set of spectra; and determining the text input by the device utilizing the duration model;
The method of Appendix 1, obtained by
(Appendix 4)
2. The method of clause 1, wherein the text components are phonemes or letters.
(Appendix 5)
receiving by the device information corresponding to an emotional state associated with the text input;
and wherein the audio waveform and corresponding video provided as the output are based on the information corresponding to the emotional state.
(Appendix 6)
2. The method of Claim 1, wherein the steps of providing, by the device, the audio waveform and the corresponding video as outputs of the device based on the video information are performed concurrently.
(Appendix 7)
3. The method of clause 2, wherein the step of training the duration model comprises multi-task training.
(Appendix 8)
2. The method of clause 1, wherein the output audio waveform and the output corresponding video are applied to a virtual person.
(Appendix 9)
2. The method of Clause 1, wherein the second set of spectra comprises Mel frequency cepstrum spectra.
(Appendix 10)
3. The method of clause 2, wherein the step of training the duration model includes utilizing a set of prediction frames and training text components.
(Appendix 11)
A device comprising:
at least one memory configured to store program code; and
at least one processor configured to read the program code and operate as directed by the program code,
the program code comprising:
receiving code configured to cause the at least one processor to receive a text input including a sequence of text components;
determination code configured to cause the at least one processor to determine individual temporal durations of the text components utilizing a duration model; and
providing code configured to cause the at least one processor to:
generate a first set of spectra based on the sequence of text components;
generate a second set of spectra based on the first set of spectra and the respective temporal durations of the sequence of text components;
generate a spectrogram frame based on the second set of spectra;
generate an audio waveform based on the spectrogram frame;
generate video information corresponding to the audio waveform; and
provide the audio waveform and corresponding video as an output.
(Appendix 12)
The device of Appendix 11, wherein the program code further comprises training code configured to train the duration model.
(Appendix 13)
The device of Appendix 11, wherein the text input that the receiving code causes the at least one processor to receive is obtained by the program code further comprising:
input receiving code configured to cause the at least one processor to receive as input an input video including a corresponding input audio waveform;
input generating code configured to cause the at least one processor to generate input video information corresponding to the input audio waveform, generate input spectrogram frames based on the input audio waveform, generate a first input set of spectra based on the input spectrogram frames, and generate a second input set of spectra based on the first input set of spectra; and
input determination code configured to provide the text input by using the duration model for the second input set of spectra.
(Appendix 14)
The device of Appendix 11, wherein the text components are phonemes or characters.
(Appendix 15)
The device of Appendix 11, wherein:
the receiving code is further configured to cause the at least one processor to receive information corresponding to an emotional state associated with the text input; and
the providing code is further configured to provide the audio waveform and the corresponding video as the output based on the information corresponding to the emotional state.
(Appendix 16)
The device of Appendix 11, wherein the providing code is further configured to simultaneously provide the audio waveform and the corresponding video as the output.
(Appendix 17)
The device of Appendix 12, wherein the training code is configured to train the duration model using multi-task training.
(Appendix 18)
The device of Appendix 11, wherein the providing code is further configured to provide the audio waveform and the corresponding video as the output to be applied to a virtual person.
(Appendix 19)
The device of Appendix 12, wherein the training code is configured to utilize a set of prediction frames and training text components to train the duration model.
(Appendix 20)
A non-transitory computer-readable storage medium storing instructions, the instructions comprising one or more instructions which, when executed by one or more processors of a device, cause the one or more processors to:
receive a text input comprising a sequence of text components;
determine individual temporal durations of the text components using a duration model;
generate a first set of spectra based on the sequence of text components;
generate a second set of spectra based on the first set of spectra and the respective temporal durations of the sequence of text components;
generate a spectrogram frame based on the second set of spectra;
generate an audio waveform based on the spectrogram frame;
generate video information corresponding to the audio waveform; and
provide the audio waveform and corresponding video as an output.

Claims

A method comprising:
receiving, by a device, a text input comprising a sequence of text components;
determining, by the device, individual temporal durations of the text components utilizing a duration model;
generating, by the device, a first set of spectra based on the sequence of text components;
generating, by the device, a second set of spectra based on the first set of spectra and the respective temporal durations of the sequence of text components, the second set of spectra being generated by duplicating the first set of spectra based on the respective temporal durations of the first set of spectra;
generating, by the device, spectrogram frames based on the second set of spectra;
generating, by the device, an audio waveform based on the spectrogram frames;
generating, by the device, video information corresponding to the audio waveform; and
providing, by the device, the audio waveform and corresponding video as an output of the device based on the video information,
wherein the duration model is a model obtained by multi-task training of audio features and visual features.
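
The duplication step recited above can be illustrated by the following minimal sketch. It is not the claimed implementation. It assumes, purely for illustration, that each element of the first set of spectra is a short list of floats and that each temporal duration is a non-negative integer number of frames; the function name expand_by_duration and the sample values are hypothetical.

```python
from typing import List, Sequence


def expand_by_duration(
    spectra: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Duplicate each spectrum according to its duration in frames.

    Hypothetical helper: spectra[i] is the spectrum for the i-th text
    component and durations[i] is the number of frames it should span.
    """
    if len(spectra) != len(durations):
        raise ValueError("need exactly one duration per spectrum")

    expanded: List[List[float]] = []
    for spectrum, frames in zip(spectra, durations):
        if frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the spectrum once per frame so later edits to one frame
        # do not alias the others.
        expanded.extend(list(spectrum) for _ in range(frames))
    return expanded


# Assumed example: three text components (e.g. phonemes), two spectral bins each.
first_set = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]  # frames per component, as a duration model might predict

second_set = expand_by_duration(first_set, durations)
print(len(second_set))  # total frame count equals sum(durations)
```

In this sketch the length of the second set equals the sum of the durations, so every text component contributes at least the frames assigned to it and none is skipped or repeated beyond its duration.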

The method of claim 1, further comprising training the duration model.

The method of claim 1 or 2, wherein the text components are phonemes or characters.

receiving, by the device, information corresponding to an emotional state associated with the text input,
wherein the audio waveform and corresponding video provided as the output are based on the information corresponding to the emotional state.

The method of any one of claims 1-4, wherein the providing, by the device, of the audio waveform and the corresponding video as outputs of the device based on the video information is performed concurrently.

The method of any one of claims 1-5, wherein the output audio waveform and the output corresponding video are applied to a virtual person.

The method of claim 1, wherein the second set of spectra comprises mel-frequency cepstrum spectra.

The method of claim 2, wherein the step of training the duration model comprises utilizing a set of prediction frames and training text components.

A device comprising:
at least one memory configured to store program code; and
at least one processor configured to read the program code and operate as directed by the program code,
wherein the program code causes the at least one processor to perform the method of any one of claims 1-8.

A computer program causing one or more processors of a device to perform the method of any one of claims 1-8.