JP4052561B2

JP4052561B2 - Video-Accompanying Audio Data Recording Method, Video-Accompanying Audio Data Recording Device, and Video-Accompanying Audio Data Recording Program

Info

Publication number: JP4052561B2
Application number: JP2002226790A
Authority: JP
Inventors: 斉周浜口; 守道家; 林　　正樹; 寛之世木
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2002-08-05
Filing date: 2002-08-05
Publication date: 2008-02-27
Anticipated expiration: 2022-08-05
Also published as: JP2004071013A

Description

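[0001]
[Technical Field of the Invention]
The present invention relates to a video-accompanying audio data recording method, a video-accompanying audio data recording apparatus, and a video-accompanying audio data recording program for recording audio data that accompanies video such as television programs, movies, and animation, in particular CG animation.
[0002]
[Prior Art]
Conventionally, in after-recording, where audio data is recorded to match a video after the video has been produced, and particularly where a voice actor or the like reads aloud and records the lines of a CG character in a CG animation, the voice actor must utter the voice (audio data) in time with the movement of the CG character's mouth (the movement of the lips).
[0003]
In pre-recording, where audio data is recorded before the video is produced, the video must be produced so as to match the recorded audio data. In particular, for the lines of a CG character in a CG animation, the movement of the CG character's lips must be adjusted so that it matches (lip-syncs with) the dialogue audio data, that is, the recorded audio data.
[0004]
[Problems to Be Solved by the Invention]
However, in conventional after-recording the speaker must speak in time with the movement of the CG character's lips, which makes the audio recording work cumbersome. Moreover, matching the movement of the CG character's lips (lip-syncing) is difficult without sufficient experience, and in some cases the result is an unnatural CG animation.
[0005]
Furthermore, if the CG animation (video) is produced after the dialogue audio data has been pre-recorded, lip sync is possible; but because the CG animation has not yet been completed, the speaker cannot speak while watching the CG animation (video) as in after-recording. As a result, the voice actor or the like who utters the dialogue audio data cannot picture the video scene and therefore cannot put emotion into the dialogue audio data, which makes this approach inconvenient to use.
[0006]
Accordingly, an object of the present invention is to solve the above problems of the prior art and to provide a video-accompanying audio data recording method, a video-accompanying audio data recording apparatus, and a video-accompanying audio data recording program that are easy to use, allow audio data to be recorded easily to match the video, and can generate more natural CG animation.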
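[0007]
[Means for Solving the Problems]
To achieve the above object, the present invention is configured as follows.
The video-accompanying audio data recording method according to claim 1 is a method for recording, on the basis of video data, information indicating the insertion point of audio data to accompany the video data, and text data of that audio data, the audio data uttered by a speaker while the video data is displayed on a display screen, the method comprising: a video stop control step of stopping playback of the video data displayed on the display screen on the basis of the information indicating the insertion point of the audio data; a video/super/synthesized-speech output step of outputting, in correspondence with the video data, a text super in which the text data of the audio data is rendered as a subtitle and synthesized speech data obtained by speech synthesis of the text data of the audio data; an audio data recording step of recording, in association with the text data, the audio data that the speaker utters on the basis of the text super and the synthesized speech data output in the video/super/synthesized-speech output step; and a video resumption control step of resuming playback of the video data after the audio data has been recorded in the audio data recording step.
[0008]
According to this method, first, in the video stop control step, the video data displayed on the display screen is stopped. Then, in the video/super/synthesized-speech output step, the text super and the synthesized speech data obtained by speech synthesis of the text data of the audio data are output in association with the video data. In the audio data recording step, the audio data that the speaker utters on the basis of the text super and the synthesized speech data is recorded. Finally, in the video resumption control step, playback of the video data is resumed after the audio data has been recorded.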
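The four steps of the method (stop, output the text super and synthesized speech, record, resume) can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the patented implementation: the `Cue` record and the injected callables (`play_until`, `pause`, `show_super`, `speak`, `record`, `resume`) are hypothetical names introduced here so that the control flow can be shown without a real player, speech synthesizer, or microphone.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Cue:
    """Hypothetical record: insertion point of one line and its text data."""
    time_s: float  # position in the video where the audio is inserted
    text: str      # text data shown as the text super and sent to speech synthesis


def run_session(
    cues: List[Cue],
    play_until: Callable[[float], None],
    pause: Callable[[], None],
    show_super: Callable[[str], None],
    speak: Callable[[str], None],
    record: Callable[[], bytes],
    resume: Callable[[], None],
) -> List[Tuple[str, bytes]]:
    """Pause at each cue, prompt the speaker, record the take, then resume."""
    takes = []
    for cue in sorted(cues, key=lambda c: c.time_s):
        play_until(cue.time_s)
        pause()                             # video stop control step
        show_super(cue.text)                # text super output
        speak(cue.text)                     # synthesized speech output
        takes.append((cue.text, record()))  # audio data recording step
        resume()                            # video resumption control step
    return takes
```

Each recorded take is returned together with its text, which mirrors the requirement that the audio data be recorded in association with the text data.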
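[0009]
The text super of the audio data is overlaid on (composited with) the video. The synthesized speech data is output in synchronization with the display of the video when the voice actor or the like who utters the audio data watches the video. Headphones or the like may be used so that only the voice actor hears the synthesized speech data and it does not become noise in the audio data being recorded. In this case, playback (display) of the video can also be stopped until the audio data uttered by the voice actor or the like has been recorded. The video data refers to video that has already been produced, namely animation (CG animation) in television programs, movies, and packaged media.
[0010]
The video-accompanying audio data recording apparatus according to claim 2 is an apparatus for recording, on the basis of video data, information indicating the insertion point of audio data to accompany the video data, and text data of that audio data, the audio data uttered by a speaker while the video data is displayed on a display screen, the apparatus comprising: video recording means for reading in and recording the video data; video/super/synthesized-speech output means for outputting, in correspondence with the video data, a text super in which the text data of the audio data is rendered as a subtitle and synthesized speech data obtained by speech synthesis of the text data of the audio data; super/synthesized-speech output control means for stopping playback of the video data displayed on the display screen on the basis of the information indicating the insertion point of the audio data, and for resuming playback of the video data after the audio data has been recorded; and audio data recording means for recording the audio data as one file corresponding to the text data.
[0011]
With this configuration, the video/super/synthesized-speech output means outputs the text super and the synthesized speech data obtained by speech synthesis of the text data of the audio data in association with the video data. The super/synthesized-speech output control means stops the video data displayed on the display screen and resumes playback of the video data after the audio data has been recorded. The audio data recording means records the audio data that the speaker utters on the basis of the text super and the synthesized speech data as one file corresponding to the text data.
[0012]
The video-accompanying audio data recording apparatus according to claim 3 is the apparatus according to claim 2, further comprising super deletion means for deleting the text super, when a text super has been added to the video data by the super/synthesized-speech output control means, at the time video/audio data in which the video data and the audio data are combined is output from the apparatus.
[0013]
With this configuration, when a text super has been added by the super/synthesized-speech output control means, the super deletion means deletes the text super at the time the video/audio data in which the video data and the audio data are combined is output from the apparatus, so that only the audio data uttered by the speaker is added to the video data that is finally generated.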
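[0014]
The video-accompanying audio data recording apparatus according to claim 4 is an apparatus that records audio data of the lines spoken by a CG character while displaying, on a display screen, the CG character of video data produced by CG synthesis, the apparatus comprising: video creation means for creating video data on the basis of video scene data that includes information on the video scenes in which the CG character speaks lines; text super generation means for generating, on the basis of dialogue text data (the text data of the lines), a text super to be composited with the video data; video/super compositing means for compositing the video data and the text super on the basis of time information included in the video scene data to generate video/super composite data; speech synthesis means for performing speech synthesis on the basis of the dialogue text data to generate synthesized speech data; composite display output control means for controlling the display output of the video/super composite data and the synthesized speech data; audio data capture means for capturing dialogue audio data, that is, audio data uttered with reference to the synthesized speech data and the text super; and recording means for recording the video scene data and the dialogue text data, and for recording the dialogue audio data captured by the audio data capture means in association with the dialogue text data.
[0015]
With this configuration, the video creation means creates video data on the basis of the video scene data. The video scene data is written in, for example, TVML (TV program Making Language), and sets the order of the video scenes and the video frames that make up each video scene. Next, the text super generation means generates a text super, that is, a subtitle super, on the basis of the text data of the CG character's lines. The video/super compositing means then composites the video data and the text super on the basis of the time information included in the video scene data, producing video/super composite data. The speech synthesis means generates synthesized speech data on the basis of the text data of the CG character's lines. The composite display output control means controls the display output of the video/super composite data and the synthesized speech data, and the audio data capture means captures the dialogue audio data, that is, the audio data uttered by a speaker such as a voice actor with reference to the synthesized speech data and the text super. The captured dialogue audio data and the dialogue text data are then recorded in the recording means in association with each other.
[0016]
Identification information that identifies the CG character is attached to each CG character; for example, this identification information is written in TVML. The control performed by the composite display output control means is, for example, as follows: when a video scene in which the CG character has a line is reached, playback of the video/super composite data (the video data with the text super composited) is paused, a signal prompting the speaker to read aloud the text super of that video scene (video frame) is output (an audio data capture request signal), and the synthesized speech data is output as an example. Further, when the text super has been read aloud, that is, when capture of the audio data is complete, the control resumes playback of the paused video/super composite data.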
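[0017]
The video-accompanying audio data recording apparatus according to claim 5 is the apparatus according to claim 4, further comprising: phoneme analysis means that has a pronunciation dictionary of words and an acoustic model describing the features of each phoneme, and that analyzes the dialogue text data with reference to the dialogue audio data and converts it into dialogue phoneme data including time-series information of the dialogue text data; and video/audio data combining means for combining lip-sync video data with the dialogue audio data, the lip-sync video data being video in which the utterance of the dialogue audio data and the movement of the CG character's lips match, generated by the video creation means on the basis of the dialogue phoneme data analyzed by the phoneme analysis means and the video scene data.
[0018]
With this configuration, the phoneme analysis means analyzes the dialogue text data and converts it into dialogue phoneme data. The phoneme analysis performed by the phoneme analysis means is, for example, analysis of each word and each phoneme of the dialogue audio data. The video creation means then generates, on the basis of the dialogue phoneme data and the video scene data, lip-sync video data in which the utterance of the dialogue audio data and the movement of the CG character's lips match, and the video/audio data combining means combines the lip-sync video data with the dialogue audio data.
[0019]
In other words, because the video data is CG animation, the shape of the CG character's lips can be fine-tuned (changed) after the audio data uttered by a voice actor or the like in time with the lip movement of the CG character in the CG animation has been recorded.
[0020]
Unlike animation composed of multiple cels, CG animation allows the shape, color, and texture of a specified part, such as a CG character, to be changed easily just by changing the data input to the device (usually a computer) that renders the CG animation.
[0021]
The video-accompanying audio data recording program according to claim 6 causes a computer to function as the following means in order to record, on the basis of video data, information indicating the insertion point of audio data to accompany the video data, and text data of that audio data, the audio data uttered by a speaker while the video data is displayed on a display screen. The means are: video recording means for reading in and recording the video data; video/super/synthesized-speech output means for outputting, in correspondence with the video data, a text super in which the text data of the audio data is rendered as a subtitle and synthesized speech data obtained by speech synthesis of the text data of the audio data; super/synthesized-speech output control means for stopping playback of the video data displayed on the display screen on the basis of the information indicating the insertion point of the audio data, and for resuming playback of the video data after the audio data has been recorded; and audio data recording means for recording the audio data as one file corresponding to the text data.
[0022]
With this configuration, the video/super/synthesized-speech output means outputs the text super and the synthesized speech data obtained by speech synthesis of the text data of the audio data in association with the video data. The super/synthesized-speech output control means stops the video data displayed on the display screen and resumes playback of the video data after the audio data has been recorded. The audio data recording means records the audio data that the speaker utters on the basis of the text super and the synthesized speech data as one file corresponding to the text data.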
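[0023]
[Embodiments of the Invention]
An embodiment of the present invention is described in detail below with reference to the drawings.
(Configuration of the video-accompanying audio data recording apparatus)
FIG. 1 is a block diagram of the video-accompanying audio data recording apparatus. As shown in FIG. 1, the video-accompanying audio data recording apparatus 1 comprises a video scene data input unit 3, a text data input unit 5, a recording unit 7, a video generation unit 9, a super generation unit 11, a video/super compositing unit 13, a speech synthesis unit 15, a display output unit 17, an audio data input unit 19, an audio capture unit 21, a phoneme analysis unit 23, and a video/audio data combining unit 25.
[0024]
The video-accompanying audio data recording apparatus 1 records audio data that accompanies video, in particular audio data that is the lines of CG characters (CG actors, CG actresses) appearing in CG animation. The apparatus 1 also has a lip-sync function that analyzes the phonemes of the recorded audio data and, on the basis of the result, matches the movement of the CG character's lips (the video of the lips) to the output sound of the audio data. In this embodiment, the apparatus 1 is based on a general-purpose computer on which a "TVML player" is installed.
[0025]
The video scene data input unit 3 is an interface for recording externally input data (video scene data) in the recording unit 7. It consists of a disk drive into which a disk holding video scene data is inserted, an input terminal through which video scene data can be input, and the like.
[0026]
Video scene data sets the placement and so on of CG characters and other object images within the video frames that make up a video scene, and sets the order in which the video scenes are arranged; it corresponds to the "scenario" of the CG animation. In this embodiment the video scene data is written in TVML (TeleVision program Making Language). TVML is described in detail later with reference to an actual script example.
[0027]
The text data input unit 5 is an interface for inputting dialogue text data (the lines of a CG character) and recording it in the recording unit 7. It consists of a general keyboard, mouse, and the like.
Dialogue text data is the lines of a CG character written in text form. In this embodiment, the lines written in text form are defined in TVML in combination with identification information (a character name) that identifies the CG character.
[0028]
The recording unit 7 consists of a general hard disk or the like and records the video scene data, the dialogue text data, and the dialogue audio data (described later). The recording unit 7 corresponds to the recording means recited in the claims.
[0029]
The video generation unit 9 generates video data on the basis of the video scene data. In this embodiment the video data is CG animation. CG characters rendered on the basis of the definitions in the TVML video scene data (for example, a CG character's model data (shape data), its initial position and orientation, and the voice quality used when the speech synthesis unit 15 synthesizes speech) appear (perform) in this CG animation, and conversations, stories, and the like are expressed through the actions (acting) of those CG characters. The video generation unit 9 corresponds to the video creation means recited in the claims.
[0030]
The super generation unit 11 generates a text super (subtitle super) on the basis of the dialogue text data. The display color of the characters of the text super changes in step with a generally assumed reading speed for the line. The super generation unit 11 corresponds to the text super generation means recited in the claims.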
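The color-changing text super can be pictured as a split of the line into an "already read" part and a "still to read" part that advances with time. The sketch below assumes a constant reading rate in characters per second; the function names and the default rate of 6.0 are hypothetical, since the source only says that the color changes in step with a generally assumed reading speed.

```python
from typing import Tuple


def highlighted_chars(text: str, elapsed_s: float, chars_per_s: float = 6.0) -> int:
    """Number of leading characters to draw in the 'already read' color."""
    if chars_per_s <= 0:
        raise ValueError("chars_per_s must be positive")
    # Clamp to the valid range so early or late timestamps are harmless.
    return max(0, min(len(text), int(elapsed_s * chars_per_s)))


def split_super(text: str, elapsed_s: float, chars_per_s: float = 6.0) -> Tuple[str, str]:
    """Split the text super into (highlighted part, remaining part)."""
    n = highlighted_chars(text, elapsed_s, chars_per_s)
    return text[:n], text[n:]
```

A renderer would call `split_super` once per frame and draw the two parts in different colors.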
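[0031]
The video/super compositing unit 13 composites the video data generated by the video generation unit 9 and the text super generated by the super generation unit 11 on the basis of the time information (included in the video scene data) at which the CG character speaks its lines, producing video/super composite data. The video/super compositing unit 13 corresponds to the video/super compositing means recited in the claims.
[0032]
The video-accompanying audio data recording apparatus 1 includes a main control unit (not shown) that governs control of the apparatus 1. The main control unit controls the display output of the video/super composite data and the synthesized speech data to the display output unit 17 on the basis of control signals from the switches (described later) provided on the audio data input unit 19. The main control unit corresponds to the composite display output control means recited in the claims. The apparatus 1 also includes super deletion means (not shown) for deleting the text super composited with the video data by the video/super compositing unit 13. Because of the super deletion means, the video/audio data (described later) finally output from the apparatus 1 contains no text super.
[0033]
In this embodiment, the video generation unit 9, the super generation unit 11, and the video/super compositing unit 13 have been described as separate components in order to make their roles clear. However, they can also be configured as a single block, for example a "CG rendering unit" that renders CG animation including the text super (subtitle super) on the basis of TVML. In that case, the video generation unit 9, the super generation unit 11, and the video/super compositing unit 13 can be regarded as a program written in a general-purpose computer language.
[0034]
The speech synthesis unit 15 performs speech synthesis on the basis of the dialogue text data input via the text data input unit 5 and generates synthesized speech data. In this embodiment, the speech synthesis unit 15 uses the method (apparatus) disclosed in Japanese Unexamined Patent Application Publication No. H2-47700 to synthesize speech from the dialogue text data. The speech synthesis unit 15 outputs (sends) the synthesized speech data to the display output unit 17 on the basis of the time information, included in the video scene data, at which the CG character speaks its lines, and a control signal from the synthesized speech replay request switch (not shown; described later) attached to the audio data input unit 19. The speech synthesis unit 15 corresponds to the speech synthesis means recited in the claims.
[0035]
The display output unit 17 consists of a display 17a with a display screen (CRT, liquid crystal, plasma, or the like) and a speaker 17b. It displays the CG animation and outputs the synthesized speech data.
[0036]
The audio data input unit 19 consists of a microphone or the like that picks up the voice (dialogue audio data) uttered by a voice actor or the like, and has attached to it a dialogue audio capture start switch, a dialogue audio capture end switch, and a synthesized speech replay request switch (none of which are shown). These switches send control signals to the main control unit (not shown) of the apparatus 1 to control the timing of the playback of the video/super composite data and the output of the synthesized speech data on the display output unit 17.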
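[0037]
The dialogue audio capture start switch (not shown) is pressed when the voice actor or the like is about to utter the dialogue audio data on the basis of the text super (subtitle super) displayed on the display 17a of the display output unit 17. When this switch is pressed, the audio capture unit 21 of the apparatus 1 starts capturing the dialogue audio data input from the audio data input unit 19.
[0038]
The dialogue audio capture end switch (not shown) is pressed after the voice actor or the like has uttered the dialogue audio data on the basis of the text super (subtitle super) displayed on the display 17a of the display output unit 17. When this switch is pressed, the audio capture unit 21 of the apparatus 1 ends capture of the dialogue audio data input from the audio data input unit 19.
[0039]
The synthesized speech replay request switch (not shown) is pressed to request that the synthesized speech data output from the speaker 17b of the display output unit 17 be played again.
[0040]
The audio capture unit 21 is an interface for recording the dialogue audio data input via the audio data input unit 19 in the recording unit 7, and consists of an input terminal through which dialogue audio data can be input, and the like. The dialogue audio data captured by the audio capture unit 21 is recorded in the recording unit 7 in association with the dialogue text data recorded there. That is, the audio capture unit 21 detects the end (break) of each item of dialogue text data and successively records each pair of dialogue text data and dialogue audio data in the recording unit 7 as one dialogue file. As a result, multiple dialogue files are recorded in the recording unit 7 for each video scene. The audio capture unit 21 corresponds to the audio data capture means recited in the claims.
[0041]
The phoneme analysis unit 23 has a pronunciation dictionary of words and an acoustic model describing the features of each phoneme (neither of which is shown), and converts the dialogue text data recorded in the recording unit 7, with reference to the dialogue audio data, into dialogue phoneme data that includes time-series information. That is, dialogue phoneme data is the words and phonemes of the dialogue text data divided according to time-series information (pronunciation times). For example, the dialogue text data "いい天気ですね" ("Nice weather, isn't it?") is divided along the lines of "いい天気ですね", and time-series information (pronunciation time) is attached to each word, as in "いい: 0-20 ms". The dialogue phoneme data is referred to when the video generation unit 9 generates lip-sync video data. That is, on the basis of the dialogue phoneme data and the CG character lip information (information on the movement of the CG character's lips defined for each phoneme, included in the video scene data), the video generation unit 9 generates lip-sync video data in which the movement of the CG character's lips is matched to the utterance of the dialogue audio data.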
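Dialogue phoneme data of this kind can be represented as a list of labelled time intervals. The following is a minimal sketch, assuming half-open intervals in milliseconds; the `TimedUnit` type and `unit_at` helper are hypothetical names, and only the first interval ("いい", 0 to 20 ms) comes from the source, while the other two are made-up values for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TimedUnit:
    """One word or phoneme with its pronunciation time."""
    label: str
    start_ms: int  # inclusive
    end_ms: int    # exclusive


def unit_at(units: List[TimedUnit], t_ms: float) -> Optional[TimedUnit]:
    """Return the unit being pronounced at time t_ms, or None during silence."""
    for unit in units:
        if unit.start_ms <= t_ms < unit.end_ms:
            return unit
    return None


# "いい: 0-20 ms" is the example given in the text; the rest is hypothetical.
line = [
    TimedUnit("いい", 0, 20),
    TimedUnit("天気", 20, 60),
    TimedUnit("ですね", 60, 110),
]
```

Looking up `unit_at(line, t)` for each frame time `t` is one way a renderer could decide which lip shape to draw.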
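[0042]
The video/audio data combining unit 25 combines the lip-sync video data, generated by the video generation unit 9 on the basis of the phoneme data and the CG character lip information, with the dialogue audio data on the basis of the time information included in the video scene data, and outputs the resulting lip-synced video and dialogue audio data to the display output unit 17.
[0043]
With the video-accompanying audio data recording apparatus 1, the video generation unit 9 creates video data on the basis of the video scene data, and the super generation unit 11 generates a text super, that is, a subtitle super, on the basis of the CG character's dialogue text data. The speech synthesis unit 15 generates synthesized speech data on the basis of the CG character's dialogue text data. The video/super compositing unit 13 composites the video data and the text super on the basis of the time information included in the video scene data, producing video/super composite data; the main control unit (not shown) controls the display output of the video/super composite data and the synthesized speech data; and the audio capture unit 21 captures the dialogue audio data, that is, the audio data uttered by a speaker such as a voice actor with reference to the synthesized speech data and the text super. The captured dialogue audio data and the dialogue text data are then recorded in the recording unit 7 in association with each other.
[0044]
Consequently, a speaker such as a voice actor can utter the dialogue audio data with reference to the synthesized speech data and the text super, and the uttered dialogue audio data is recorded in association with the dialogue text data, so dialogue audio data that fits the video of the CG character can be recorded easily.
[0045]
In addition, with the apparatus 1, the phoneme analysis unit 23 analyzes the phonemes of the dialogue text data and converts it into dialogue phoneme data. The video generation unit 9 generates, on the basis of the dialogue phoneme data and the video scene data, lip-sync video data in which the utterance of the dialogue audio data and the movement of the CG character's lips match, and the video/audio data combining unit 25 combines the lip-sync video data with the dialogue audio data. Consequently, lip-synced video and dialogue audio data in which the movement of the lips of the CG character in the CG animation matches the dialogue audio data can be generated, and more natural CG animation can be generated (rendered).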
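[0046]
(Operation of the video-accompanying audio data recording apparatus [during dialogue audio capture])
Next, the operation of the video-accompanying audio data recording apparatus 1 during dialogue audio capture is described with reference to the flowchart in FIG. 2.
First, video scene data is input via the video scene data input unit 3 of the apparatus 1 (S1). The CG character's dialogue text data is input via the text data input unit 5 (S2). The video scene data and the dialogue text data are recorded in the recording unit 7.
[0047]
The video generation unit 9 then generates video data on the basis of the video scene data recorded in the recording unit 7 (S3). This video data is CG animation rendered from the video scene data written in TVML. The super generation unit 11 generates a text super (subtitle super) on the basis of the dialogue text data recorded in the recording unit 7 (S4).
[0048]
Next, the video/super compositing unit 13 composites the text super (subtitle super) generated by the super generation unit 11 onto the video data generated by the video generation unit 9 on the basis of the time information in the video scene data, producing video/super composite data (S5). Although operations S3 to S5 have been described as a sequence, in practice they are processed concurrently by multi-stack processing in the main control unit (not shown) of the apparatus 1.
[0049]
Further, the speech synthesis unit 15 performs speech synthesis on the basis of the dialogue text data recorded in the recording unit 7 and generates synthesized speech data (S6). With the video/super composite data and the synthesized speech data generated, the apparatus waits until the user of the apparatus 1 (a speaker such as a voice actor) requests playback output of the video/super composite data and the synthesized speech data (until the playback start switch for the "video/super composite data", not shown, is pressed). When the user of the apparatus 1 requests playback output, playback of the video/super composite data first starts on the display output unit 17 (display 17a) (S7). Naturally, in video scenes where the CG character has no lines, the video/super composite data contains no text super (subtitle super), and no text super is shown on the display 17a.
[0050]
The main control unit (not shown) determines whether the current video scene is one in which the CG character has a line. Playback of the video/super composite data continues as it is until a video scene with a line is reached (S8, No). When the scene is determined to be one in which the CG character has a line, the video is stopped, the text super (subtitle super) is displayed on the display 17a of the display output unit 17, and the synthesized speech data is output from the speaker 17b (S9).
[0051]
The user of the apparatus 1 (a speaker such as a voice actor) then, while looking at the text super, presses the dialogue audio capture start switch (not shown) of the audio data input unit 19 and utters the dialogue audio data. On finishing, the user presses the dialogue audio capture end switch (not shown). If the user is unsure how to deliver the line, the user can press the synthesized speech replay request switch (not shown) to hear the synthesized speech data again for reference. The dialogue audio data uttered by the user is captured by the audio capture unit 21 of the apparatus 1 and recorded in the recording unit 7, in association with the dialogue text data, as dialogue files, one file at a time (S10).
[0052]
When the main control unit (not shown) of the apparatus 1 detects the end of the dialogue audio data uttered by the user, or determines that the dialogue audio capture end switch (not shown) has been pressed, at least one dialogue file is generated and the user is asked whether to end capture of dialogue audio data or to continue playback of the video/super composite data. To that end, a message asking whether to end capture of dialogue audio data is first displayed on the display 17a of the display output unit 17, prompting the user for a reply (S11). If the user decides to end the operation of the apparatus 1 (capture of dialogue audio data) (S11, Yes), capture of dialogue audio data ends.
[0053]
If the user does not decide to end the operation of the apparatus 1 (S11, No), a message asking whether to continue playback of the video/super composite data is displayed on the display 17a of the display output unit 17, prompting the user for a reply (S12). If the user decides to continue playback (S12, Yes), the flow returns to S7 and playback of the video/super composite data continues. If the user does not decide to continue playback (S12, No), the flow returns to the beginning (S1) and operation of the apparatus 1 continues.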
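The branching in this flow (S8, S11, S12) can be summarized as a small transition function. This is only a sketch of the step order described in the text: the function name `next_step`, the `"END"` marker, and the modelling of "S8, No" as staying at S8 while playback continues are assumptions made here, and S3 to S5 are shown in sequence even though the text says they actually run concurrently.

```python
def next_step(step: str, answer: bool = False) -> str:
    """Return the step that follows `step`; `answer` is the Yes/No outcome
    at the decision steps S8, S11 and S12 and is ignored elsewhere."""
    linear = {
        "S1": "S2", "S2": "S3", "S3": "S4", "S4": "S5", "S5": "S6",
        "S6": "S7", "S7": "S8", "S9": "S10", "S10": "S11",
    }
    if step in linear:
        return linear[step]
    if step == "S8":   # is this a scene in which the CG character has a line?
        return "S9" if answer else "S8"
    if step == "S11":  # end capture of dialogue audio data?
        return "END" if answer else "S12"
    if step == "S12":  # continue playback of the video/super composite data?
        return "S7" if answer else "S1"
    raise ValueError(f"unknown step: {step}")
```

Walking the function from `"S1"` with a scripted list of answers reproduces one pass through the flowchart, which can be handy for checking a UI against the described behavior.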
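[0054]
(Operation of the video-accompanying audio data recording apparatus [during lip-sync video data combining])
Next, the operation of the video-accompanying audio data recording apparatus 1 when lip-sync video data is generated and combined with the dialogue audio data is described with reference to the flowchart in FIG. 3.
[0055]
First, the phoneme analysis unit 23 performs phoneme analysis on the dialogue text data recorded in the recording unit 7 (with reference to the dialogue audio data) and converts it into dialogue phoneme data (S21). The dialogue phoneme data is output to the video generation unit 9. In the dialogue phoneme data, time-series information (pronunciation times) is attached to the words and phonemes into which the dialogue text data has been divided.
[0056]
The video generation unit 9 generates lip-sync video data, in which the movement of the CG character's lips is matched to the dialogue audio data, on the basis of the dialogue phoneme data (the divided words and phonemes with time-series information attached) and the time information included in the video scene data, and outputs it to the video/audio data combining unit 25 (S22). The video/audio data combining unit 25 then combines the dialogue audio data with the lip-sync video data on the basis of the time information included in the video scene data and outputs the result to the display output unit 17 as lip-synced video and dialogue audio data (S23). The lip-synced video and dialogue audio data is displayed and output by the display output unit 17 (S24).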
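Step S22 can be illustrated by sampling the timed phonemes at the video frame rate and choosing one mouth shape per frame. The sketch below makes several assumptions that are not in the source: phonemes are given as romanized symbols, the shape labels `fig5a` to `fig5f` simply name the shapes of FIG. 5(a) to (f), the default frame rate is 30 fps, and every non-vowel falls back to the silence/plosive shape (the embodiment also defines per-consonant shapes, which are omitted here).

```python
from typing import List, Tuple

# Vowel -> mouth shape of FIG. 5(a)-(e); labels are hypothetical identifiers.
MOUTH_SHAPE = {"a": "fig5a", "i": "fig5b", "u": "fig5c", "e": "fig5d", "o": "fig5e"}
REST = "fig5f"  # silence / plosive shape of FIG. 5(f), used as the fallback


def mouth_track(
    phonemes: List[Tuple[str, int, int]],  # (symbol, start_ms, end_ms), end exclusive
    duration_ms: int,
    fps: int = 30,
) -> List[str]:
    """Return one mouth-shape label per video frame."""
    n_frames = duration_ms * fps // 1000
    track = []
    for frame in range(n_frames):
        t_ms = frame * 1000 / fps
        shape = REST
        for symbol, start_ms, end_ms in phonemes:
            if start_ms <= t_ms < end_ms:
                shape = MOUTH_SHAPE.get(symbol, REST)
                break
        track.append(shape)
    return track
```

Because the track is indexed by frame, it can be combined with the recorded dialogue audio using the same time base, which is the role the text assigns to the video/audio data combining unit 25.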
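[0057]
(Concrete operation example of the video-accompanying audio data recording apparatus)
Next, a concrete operation example of the video-accompanying audio data recording apparatus 1 is described with reference to FIG. 4. FIG. 4 is an explanatory diagram showing four frames (a) to (d) of the CG animation displayed on the display 17a of the display output unit 17, in display order from the top of the figure.
[0058]
FIG. 4(a) shows a video scene in which one CG character (called the horned character) stands against a background with a view of the horizon. Since no line is set for the horned character in this video scene, the video/super composite data is played back normally (as is).
[0059]
FIG. 4(b) shows a video scene in which a new CG character (a CG actor in the description of FIG. 4(b), called the hornless character) appears where the horned character is standing and greets the horned character with "いい天気ですねー" ("Nice weather, isn't it?"). That is, a line is set for the hornless character in this video scene, and when this scene is reached, playback of the video/super composite data is paused under the control of the main control unit (not shown) of the apparatus 1. The line spoken by the hornless character is then displayed as a text super on the display 17a of the display output unit 17, and the synthesized speech data is output from the speaker 17b of the display output unit 17.
[0060]
FIG. 4(c) shows the video scene and the speaker (a voice actor or the like) at the time the speaker utters the dialogue audio data, that is, the line spoken by the CG character (the hornless character), while watching the video/super composite data, and the uttered dialogue audio data is captured. The speaker faces the audio data input unit 19, which consists of a microphone or the like, and inputs the "dialogue audio data" by reading aloud the dialogue text data, that is, the CG character's line.
[0061]
In this case, the speaker presses the dialogue audio capture start switch (not shown) of the audio data input unit 19 and then reads the dialogue text data aloud. "●収録開始" ("● Recording started") is then displayed on the display 17a, as shown at the upper left of FIG. 4(c). The speaker can therefore confirm visually that dialogue audio data is being captured. The "●収録開始" indication can also be made to appear on the display 17a automatically, without the speaker pressing the dialogue audio capture start switch of the audio data input unit 19, by describing it in the video scene data.
[0062]
FIG. 4(d) shows a video scene in which the hornless character's line has ended and the hornless character and the horned character face each other. That is, in this video scene, the dialogue text data (the line spoken by the hornless character) is compared with the dialogue audio data uttered by the speaker, and the end of the dialogue audio data is detected by the main control unit (not shown) of the apparatus 1. Through this sequence of operations, one item of dialogue audio data is generated for one item of dialogue text data, and the audio capture unit 21 associates the dialogue text data with the dialogue audio data and records them in the recording unit 7 as one dialogue file.
[0063]
(Example of the lips of a CG character)
Next, the lip-sync video data generated by the video generation unit 9 (an example of the lips of a CG character) is described with reference to FIG. 5. FIGS. 5(a) to 5(f) illustrate the relationship between the shape of the CG character's lips and the vowels and silence or plosives being pronounced.
[0064]
FIG. 5(a) shows the shape of the CG character's lips when the vowel "あ" (a) is pronounced. FIG. 5(b) shows the shape when the vowel "い" (i) is pronounced. FIG. 5(c) shows the shape when the vowel "う" (u) is pronounced. FIG. 5(d) shows the shape when the vowel "え" (e) is pronounced. FIG. 5(e) shows the shape when the vowel "お" (o) is pronounced. FIG. 5(f) shows the shape for silence or a plosive. Although not illustrated, the shape of the CG character's lips when each consonant is pronounced is also set in this embodiment.
[0065]
Because the shape of the CG character's lips is set precisely for each vowel, as shown in FIGS. 5(a) to 5(f), the lip-sync video data that the video generation unit 9 generates on the basis of the dialogue phoneme data and the video scene data is free of the "awkwardness" and "unnaturalness" of the CG character's lips found in conventional CG animation (video data). That is, the apparatus 1 makes it possible to produce CG animation containing realistic CG characters that appear to speak their lines as a human would.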
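[0066]
(Example of video scene data and dialogue text data in TVML)
Next, an example of video scene data and dialogue text data in TVML is described with reference to FIG. 6. FIG. 6 is a TVML script (TVML scenario) in which the video scene data and the dialogue text data are written in TVML. The video generation unit 9, the super generation unit 11, and the video/super compositing unit 13 turn this TVML script into video/super composite data (that is, render it as CG animation).
[0067]
This TVML script can be interpreted and executed one line at a time, from the top of FIG. 6 downward (interpreter operation), by a general TVML player (not shown), as annotated in FIG. 6. "set:change" on line "A" defines the data of the set (stage and background) used in the CG animation program (animation program) being produced. In this example, model data named (fuji) is loaded.
[0068]
"character:casting" on line "B" defines the name of a CG character (character) that appears in the CG animation program (animation program) being produced. "character:bindmodel" on line "C" assigns model data to the defined CG character (character).
[0069]
"character:position" on line "D" sets the initial position and orientation of the CG character (character) in three-dimensional coordinates. "character:setvoice" on line "E" assigns to the CG character (character) the voice quality used when the speech synthesis unit 15 generates synthesized speech data.
[0070]
"light:assign" on line "F" defines the name of a light used in the CG animation program (animation program) being produced. "light:model" on line "G" specifies the type of light source, its position and orientation in three-dimensional coordinates, its brightness, its color, and so on.
[0071]
"camera:movement" on line "H" sets the position, orientation, angle of view, and so on of the camera (the viewpoint in CG space) used in the CG animation program (animation program) being produced. "character:walk" on line "I" defines that the CG character (character) walks to a specified coordinate point.
[0072]
"character:turn" on line "J" defines that the CG character (character) turns to face a specified angle. "character:look" on line "K" defines that the CG character (character) directs its gaze (face) toward a target object.
[0073]
"character:talk" on line "L" defines that the character string of the dialogue text data (text) is displayed as a text super (subtitle super) and that, at the same time, the utterance is exemplified by the synthesized speech data synthesized by the speech synthesis unit 15.
[0074]
When line "L" with "character:talk" is reached, playback of the video/super composite data is paused, a message indicating that capture of the dialogue audio data uttered by the speaker (a voice actor or the like) is starting is displayed on the display 17a, and capture of the dialogue audio data starts. The dialogue audio data input via the audio data input unit 19 is continually matched against the dialogue text data, and capture of the dialogue audio data ends when the utterance has finished (when the end of the dialogue audio data is detected).
[0075]
For example, with "character:talk(name=MARY,text="いい天気ですねー")" as shown on line "L" of FIG. 6, "いい天気ですねー" is displayed on the display 17a as a text super (subtitle super) and is at the same time turned into synthesized speech data by the speech synthesis unit 15 and spoken (output) from the speaker 17b. When output of the synthesized speech data ends, the capture start message "●収録開始" is displayed on the display 17a, and capture of the dialogue audio data "いい天気ですねー" starts.
[0076]
The dialogue audio data "いい天気ですねー" input via the audio data input unit 19 is matched against the dialogue text data "いい天気ですねー", and capture of the dialogue audio data ends when the end of the utterance is detected. The captured dialogue audio data is associated with the TVML script command "character:talk(name=MARY,text="いい天気ですねー")" and recorded in the recording unit 7 as an audio data file (dialogue file) named, for example, selif1.wav.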
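The association between each "character:talk" command and its dialogue file can be sketched as a small index built from the script text. The regular expression below only covers the single command form quoted in the text, with ASCII quotation marks assumed; it is not a TVML parser. The function name `index_talk_lines`, the sequential numbering scheme, and the second and third script lines in the usage example are hypothetical; only the file name `selif1.wav` for the first line follows the example in the text.

```python
import re
from typing import Dict, List

# Matches: character:talk(name=MARY,text="...") with optional spaces.
TALK = re.compile(
    r'character:\s*talk\(\s*name\s*=\s*(\w+)\s*,\s*text\s*=\s*"([^"]*)"\s*\)'
)


def index_talk_lines(script: str, prefix: str = "selif") -> List[Dict[str, str]]:
    """Pair every talk command in the script with a numbered WAV file name."""
    entries = []
    for number, match in enumerate(TALK.finditer(script), start=1):
        entries.append({
            "name": match.group(1),            # CG character identifier
            "text": match.group(2),            # dialogue text data
            "file": f"{prefix}{number}.wav",   # dialogue file for the recorded take
        })
    return entries


script = "\n".join([
    'character:talk(name=MARY,text="いい天気ですねー")',
    "character:walk(name=MARY)",                # hypothetical arguments
    'character:talk(name=BOB, text="Hello")',   # hypothetical second line
])
entries = index_talk_lines(script)
```

Keeping the index separate from the audio files makes it easy to re-record a single line without touching the rest of the script.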
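[0077]
The present invention has been described above on the basis of one embodiment, but the present invention is not limited to it.
For example, the processing of each component of the video-accompanying audio data recording apparatus 1 can be regarded as individual steps of a video-accompanying audio data recording method, or as a video-accompanying audio data recording program in which the processing of each component is written in a general computer language. In these cases the same effects as with the apparatus 1 are obtained, and in the case of the program, it can be stored on a storage medium and distributed, or used over a network or the like.
[0078]
[Effects of the Invention]
According to the inventions of claims 1, 2, and 6, a text super of the audio data and synthesized speech data obtained by speech synthesis of the text of the audio data are added to the video and output, and the audio data read aloud by the speaker is recorded. The speaker can therefore utter the audio data with reference to the synthesized speech data and the text super, and audio data that fits the video can be recorded easily.
[0079]
According to the invention of claim 3, when a text super has been added, the super deletion means deletes it, so that video/audio data in which the video data and the audio data read aloud by the speaker are combined is finally obtained.
[0080]
According to the invention of claim 4, video data is created on the basis of the video scene data, and a text super is generated on the basis of the text data of the CG character's lines. The video data and the text super are composited on the basis of the time information included in the video scene data, producing video/super composite data. Synthesized speech data is generated on the basis of the text data of the CG character's lines. The display output of the video/super composite data and the synthesized speech data is then controlled, and the dialogue audio data, that is, the audio data uttered by a speaker such as a voice actor with reference to the synthesized speech data and the text super, is captured. The captured dialogue audio data and the dialogue text data are then recorded in association with each other. Consequently, a speaker such as a voice actor can utter the dialogue audio data with reference to the synthesized speech data and the text super, and the uttered dialogue audio data is recorded in association with the dialogue text data, so dialogue audio data that fits the video of the CG character can be recorded easily.
[0081]
According to the invention of claim 5, the dialogue text data is analyzed and converted into dialogue phoneme data. Lip-sync video data, in which the utterance of the dialogue audio data and the movement of the CG character's lips match, is generated on the basis of the dialogue phoneme data and the video scene data, and the lip-sync video data is combined with the dialogue audio data. Consequently, lip-synced video and dialogue audio data in which the movement of the lips of the CG character in the CG animation matches the dialogue audio data can be generated, and more natural CG animation can be generated (rendered).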
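[Brief Description of the Drawings]
[FIG. 1] A block diagram of a video-accompanying audio data recording apparatus according to an embodiment of the present invention.
[FIG. 2] A flowchart explaining the operation of the video-accompanying audio data recording apparatus shown in FIG. 1 (during dialogue audio capture).
[FIG. 3] A flowchart explaining the operation of the video-accompanying audio data recording apparatus shown in FIG. 1 (during lip-sync video data combining).
[FIG. 4] An explanatory diagram of a concrete operation example of the video-accompanying audio data recording apparatus.
[FIG. 5] A diagram showing the relationship between the shape of a CG character's lips and the vowels and silence or plosives being pronounced.
[FIG. 6] A diagram showing an example of video scene data and dialogue text data in TVML.
[Explanation of Reference Signs]
1 Video-accompanying audio data recording apparatus
3 Video scene data input unit
5 Text data input unit
7 Recording unit
9 Video generation unit
11 Super generation unit
13 Video/super compositing unit
15 Speech synthesis unit
17 Display output unit
17a Display
17b Speaker
19 Audio data input unit
21 Audio capture unit
23 Phoneme analysis unit
25 Video/audio data combining unit
[0001]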
[Technical Field of the Invention]
The present invention relates to a video-accompanying audio data recording method, a video-accompanying audio data recording apparatus, and a video-accompanying audio data recording program for recording audio data that accompanies video such as television programs, movies, and animation, in particular CG animation.
[0002]
[Prior Art]
Conventionally, in after-recording, where audio data is recorded to match a video after the video has been produced, and particularly where a voice actor or the like reads aloud and records the lines of a CG character in a CG animation, the voice actor must utter the voice (audio data) in time with the movement of the CG character's mouth (the movement of the lips).
[0003]
In pre-recording, where audio data is recorded before the video is produced, the video must be produced so as to match the recorded audio data. In particular, for the lines of a CG character in a CG animation, the movement of the CG character's lips must be adjusted so that it matches (lip-syncs with) the dialogue audio data, that is, the recorded audio data.
[0004]
[Problems to Be Solved by the Invention]
However, in conventional after-recording the speaker must speak in time with the movement of the CG character's lips, which makes the audio recording work cumbersome. Moreover, matching the movement of the CG character's lips (lip-syncing) is difficult without sufficient experience, and in some cases the result is an unnatural CG animation.
[0005]
Furthermore, if the CG animation (video) is produced after the dialogue audio data has been pre-recorded, lip sync is possible; but because the CG animation has not yet been completed, the speaker cannot speak while watching the CG animation (video) as in after-recording. As a result, the voice actor or the like who utters the dialogue audio data cannot picture the video scene and therefore cannot put emotion into the dialogue audio data, which makes this approach inconvenient to use.
[0006]
Accordingly, an object of the present invention is to solve the above problems of the prior art and to provide a video-accompanying audio data recording method, a video-accompanying audio data recording apparatus, and a video-accompanying audio data recording program that are easy to use, allow audio data to be recorded easily to match the video, and can generate more natural CG animation.
[0007]
[Means for Solving the Problems]
To achieve the above object, the present invention is configured as follows.
The video-accompanying audio data recording method according to claim 1 is a method for recording, on the basis of video data, information indicating the insertion point of audio data to accompany the video data, and text data of that audio data, the audio data uttered by a speaker while the video data is displayed on a display screen, the method comprising: a video stop control step of stopping playback of the video data displayed on the display screen on the basis of the information indicating the insertion point of the audio data; a video/super/synthesized-speech output step of outputting, in correspondence with the video data, a text super in which the text data of the audio data is rendered as a subtitle and synthesized speech data obtained by speech synthesis of the text data of the audio data; an audio data recording step of recording, in association with the text data, the audio data that the speaker utters on the basis of the text super and the synthesized speech data output in the video/super/synthesized-speech output step; and a video resumption control step of resuming playback of the video data after the audio data has been recorded in the audio data recording step.
[0008]
According to this method, first, in the video stop control step, the video data displayed on the display screen is stopped. Then, in the video/super/synthesized-speech output step, the text super and the synthesized speech data obtained by speech synthesis of the text data of the audio data are output in association with the video data. In the audio data recording step, the audio data that the speaker utters on the basis of the text super and the synthesized speech data is recorded. Finally, in the video resumption control step, playback of the video data is resumed after the audio data has been recorded.
[0009]
The text super of the audio data is overlaid on (composited with) the video. The synthesized speech data is output in synchronization with the display of the video when the voice actor or the like who utters the audio data watches the video. Headphones or the like may be used so that only the voice actor hears the synthesized speech data and it does not become noise in the audio data being recorded. In this case, playback (display) of the video can also be stopped until the audio data uttered by the voice actor or the like has been recorded. The video data refers to video that has already been produced, namely animation (CG animation) in television programs, movies, and packaged media.
[0010]
The video-accompanying audio data recording apparatus according to claim 2 is an apparatus for recording, on the basis of video data, information indicating the insertion point of audio data to accompany the video data, and text data of that audio data, the audio data uttered by a speaker while the video data is displayed on a display screen, the apparatus comprising: video recording means for reading in and recording the video data; video/super/synthesized-speech output means for outputting, in correspondence with the video data, a text super in which the text data of the audio data is rendered as a subtitle and synthesized speech data obtained by speech synthesis of the text data of the audio data; super/synthesized-speech output control means for stopping playback of the video data displayed on the display screen on the basis of the information indicating the insertion point of the audio data, and for resuming playback of the video data after the audio data has been recorded; and audio data recording means for recording the audio data as one file corresponding to the text data.
[0011]
With this configuration, the video/super/synthesized-speech output means outputs the text super and the synthesized speech data obtained by speech synthesis of the text data of the audio data in association with the video data. The super/synthesized-speech output control means stops the video data displayed on the display screen and resumes playback of the video data after the audio data has been recorded. The audio data recording means records the audio data that the speaker utters on the basis of the text super and the synthesized speech data as one file corresponding to the text data.
[0012]
The video-accompanying audio data recording apparatus according to claim 3 is the apparatus according to claim 2, further comprising super deletion means for deleting the text super, when a text super has been added to the video data by the super/synthesized-speech output control means, at the time video/audio data in which the video data and the audio data are combined is output from the apparatus.
[0013]
With this configuration, when a text super has been added by the super/synthesized-speech output control means, the super deletion means deletes the text super at the time the video/audio data in which the video data and the audio data are combined is output from the apparatus, so that only the audio data uttered by the speaker is added to the video data that is finally generated.
[0014]
The video-accompanying audio data recording apparatus according to claim 4 is an apparatus that records audio data of the lines spoken by a CG character while displaying, on a display screen, the CG character of video data produced by CG synthesis, the apparatus comprising: video creation means for creating video data on the basis of video scene data that includes information on the video scenes in which the CG character speaks lines; text super generation means for generating, on the basis of dialogue text data (the text data of the lines), a text super to be composited with the video data; video/super compositing means for compositing the video data and the text super on the basis of time information included in the video scene data to generate video/super composite data; speech synthesis means for performing speech synthesis on the basis of the dialogue text data to generate synthesized speech data; composite display output control means for controlling the display output of the video/super composite data and the synthesized speech data; audio data capture means for capturing dialogue audio data, that is, audio data uttered with reference to the synthesized speech data and the text super; and recording means for recording the video scene data and the dialogue text data, and for recording the dialogue audio data captured by the audio data capture means in association with the dialogue text data.
[0015]
With this configuration, the video creation means creates video data on the basis of the video scene data. The video scene data is written in, for example, TVML (TV program Making Language), and sets the order of the video scenes and the video frames that make up each video scene. Next, the text super generation means generates a text super, that is, a subtitle super, on the basis of the text data of the CG character's lines. The video/super compositing means then composites the video data and the text super on the basis of the time information included in the video scene data, producing video/super composite data. The speech synthesis means generates synthesized speech data on the basis of the text data of the CG character's lines. The composite display output control means controls the display output of the video/super composite data and the synthesized speech data, and the audio data capture means captures the dialogue audio data, that is, the audio data uttered by a speaker such as a voice actor with reference to the synthesized speech data and the text super. The captured dialogue audio data and the dialogue text data are then recorded in the recording means in association with each other.
[0016]
Identification information that identifies the CG character is attached to each CG character; for example, this identification information is written in TVML. The control performed by the composite display output control means is, for example, as follows: when a video scene in which the CG character has a line is reached, playback of the video/super composite data (the video data with the text super composited) is paused, a signal prompting the speaker to read aloud the text super of that video scene (video frame) is output (an audio data capture request signal), and the synthesized speech data is output as an example. Further, when the text super has been read aloud, that is, when capture of the audio data is complete, the control resumes playback of the paused video/super composite data.
[0017]
The video-accompanying audio data recording device according to claim 5 is the video-accompanying audio data recording device according to claim 4, comprising a word pronunciation dictionary and an acoustic model in which feature values of each phoneme are recorded, and the speech The speech text data is analyzed with reference to data, and the phoneme analyzing means for analyzing the speech text data and converting it into speech phonological data including time series information of the speech text data, and the video creating means are analyzed by the phoneme analyzing means. Lip sync video data that is a video in which the speech of the speech audio data and the movement of the lip portion of the CG character are matched based on the line phonological data and the video scene data, And audio / video data synthesizing means for synthesizing the dialogue audio data.
[0018]
According to this configuration, the phoneme analyzing means analyzes the dialogue text data and converts it into speech phonological data. The phonological analysis performed by the phoneme analyzing means consists of, for example, analyzing each word and each phoneme of the speech audio data. The video creation means then generates, based on the speech phonological data and the video scene data, lip sync video data, which is video in which the utterance of the speech audio data and the movement of the lip portion of the CG character are matched. The audio/video data synthesizing means synthesizes the lip sync video data with the speech audio data.
[0019]
That is, because the video data is a CG animation, the shape of the lip portion of the CG character can be finely adjusted (changed) after the audio data uttered by a voice actor or the like has been recorded in accordance with the movement of the lip portion of the CG character in the CG animation.
[0020]
Note that, unlike animation composed of a plurality of cel images, CG animation allows the shape, color, and texture of a designated portion such as a CG character to be changed easily, simply by changing the data input to the device (usually a computer) that draws the CG animation.
[0021]
The video-accompanying audio data recording program according to claim 6 causes a computer to function as the following means in order to record audio data uttered by a speaker, with video data displayed on a display screen, based on the video data, information indicating the insertion location of the audio data to be attached to the video data, and text data of the audio data. The computer has means for reading and recording the video data. The means are: video super synthesized voice output means for outputting, in correspondence with the video data, a text super in which the text data of the audio data is rendered as a caption and synthesized voice data obtained by speech synthesis of the text data of the audio data; super synthesized voice output control means for stopping reproduction of the video data displayed on the display screen based on the information indicating the insertion location of the audio data and, after the audio data has been recorded, resuming reproduction of the video data; and audio data recording means for recording the audio data as one file associated with the text data.
[0022]
According to this configuration, the video super synthesized voice output means outputs the text super and the synthesized voice data, obtained by speech synthesis of the text data of the audio data, in association with the video data. The super synthesized voice output control means stops reproduction of the video data displayed on the display screen and, after the audio data has been recorded, resumes reproduction of the video data. The audio data recording means records the audio data that the speaker utters on the basis of the text super and the synthesized voice data as one file associated with the text data.
[0023]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.
(Configuration of audio data recording device with video)
FIG. 1 is a block diagram of a video-accompanying audio data recording apparatus. As shown in FIG. 1, the video-accompanying audio data recording apparatus 1 includes a video scene data input unit 3, a text data input unit 5, a recording unit 7, a video generation unit 9, a super generation unit 11, a video super synthesis unit 13, a voice synthesis unit 15, a display output unit 17, a voice data input unit 19, a voice recording unit 21, a phonological analysis unit 23, and a video/audio data synthesizing unit 25.
[0024]
The video-accompanying audio data recording apparatus 1 records audio data to be attached to video, in particular audio data that is the dialogue of a CG character (CG actor or CG actress) appearing in a CG animation. The apparatus 1 also has a lip sync function that analyzes the phonemes of the recorded audio data and, based on the result of the analysis, matches the movement of the lip portion of the CG character (the video of the lip portion) with the output audio of the audio data. In this embodiment, the video-accompanying audio data recording apparatus 1 is realized by installing a “TVML player” on a general-purpose computer.
[0025]
The video scene data input unit 3 is an interface for recording externally input data (video scene data) in the recording unit 7. It consists of a disk drive into which a disc holding the video scene data is inserted, an input terminal through which video scene data can be input, or the like.
[0026]
The video scene data sets the arrangement position of the CG characters and other object videos in a plurality of video frames constituting the video scene, and sets the order in which the video scenes are arranged. It is equivalent. In this embodiment, the video scene data is described in TVML (Television program Making Language), and detailed description of the TVML will be made later with reference to an actual description example.
[0027]
The text data input unit 5 is an interface for inputting dialogue text data, which is a dialogue of the CG character, and recording this dialogue text data in the recording unit 7, and is configured by a general keyboard, mouse, and the like.
The dialogue text data is a dialogue of the CG character and is described in a text format. In this embodiment, dialogue described in text format is defined in TVML in combination with identification information (character name) for identifying a CG character.
[0028]
The recording unit 7 is configured by a general hard disk or the like, and records video scene data, dialogue text data, and dialogue audio data (described later). The recording unit 7 corresponds to the recording means described in the claims.
[0029]
The video generation unit 9 generates video data based on the video scene data. In this embodiment, the video data is a CG animation generated from video scene data described in TVML. A CG character drawn on the basis of, for example, the model data (shape data) of the CG character, its initial position and orientation, and the definition of the voice quality used when the voice synthesis unit 15 performs speech synthesis appears in the animation, and conversations, stories, and the like are expressed through the actions (acting) of the CG character. The video generation unit 9 corresponds to the video creation means described in the claims.
[0030]
The super generation unit 11 generates a text super (caption super) based on the dialogue text data. In this text super, the display color of the characters is changed in accordance with the speed at which the line is generally assumed to be read aloud. The super generation unit 11 corresponds to the text super generation means described in the claims.
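The color change described above can be pictured as splitting the caption into an already-read part and a not-yet-read part. The following is only a minimal sketch under stated assumptions: the function name `split_caption` and the constant reading speed in characters per second are hypothetical and do not come from the embodiment, which does not state how the reading speed is estimated.

```python
def split_caption(text: str, elapsed_s: float, chars_per_s: float = 6.0) -> tuple[str, str]:
    """Split a caption into (already read, still to read).

    Assumption: the reading speed is a constant number of characters
    per second. A renderer would draw the first part in the
    highlight color and the second part in the normal color.
    """
    if chars_per_s <= 0:
        raise ValueError("chars_per_s must be positive")
    # Number of characters assumed to have been read so far,
    # clamped to the valid range for the caption.
    n = int(elapsed_s * chars_per_s)
    n = max(0, min(len(text), n))
    return text[:n], text[n:]
```

For example, at 4 characters per second, one second into the caption “good weather” the first four characters would be drawn in the highlight color.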
[0031]
Based on the time information (included in the video scene data) at which the CG character utters dialogue, the video super synthesis unit 13 combines the video data generated by the video generation unit 9 with the text super generated by the super generation unit 11 to produce video super composite data. The video super synthesis unit 13 corresponds to the video super synthesis means described in the claims.
[0032]
The video-accompanying audio data recording apparatus 1 includes a main control unit (not shown) that controls the apparatus 1. The main control unit controls the display output of the video super composite data and the synthesized voice data to the display output unit 17 based on control signals from various switches (described later). This main control unit corresponds to the composite display output control means described in the claims. The apparatus 1 is also provided with super deletion means (not shown) for deleting the text super that the video super synthesis unit 13 combined with the video data. Because of this super deletion means, no text super remains in the video/audio data (described later) that is finally output from the apparatus 1.
[0033]
In this embodiment, the video generation unit 9, the super generation unit 11, and the video super synthesis unit 13 are described as separate components so that the role of each is clear. However, they can also be configured as a single block, such as a “CG rendering unit” that draws a CG animation including the text super (subtitle super) based on TVML. In this case, the video generation unit 9, the super generation unit 11, and the video super synthesis unit 13 can be regarded as programs written in a general-purpose computer language.
[0034]
The voice synthesis unit 15 performs speech synthesis on the dialogue text data input through the text data input unit 5 to generate synthesized voice data. In this embodiment, the voice synthesis unit 15 synthesizes the dialogue text data using the method (apparatus) disclosed in Japanese Patent Laid-Open No. 2-47700. The voice synthesis unit 15 then outputs (sends) the synthesized voice data to the display output unit 17 based on the time information, included in the video scene data, at which the CG character utters the dialogue, and on a control signal from the synthesized voice data reproduction request switch (not shown, described later) attached to the voice data input unit 19. The voice synthesis unit 15 corresponds to the speech synthesis means described in the claims.
[0035]
The display output unit 17 is composed of a display 17a such as a CRT, liquid crystal, plasma, or the like having a display screen and a speaker 17b, and displays a CG animation and outputs synthesized voice data.
[0036]
The voice data input unit 19 is composed of a microphone or the like for inputting (picking up) the voice (speech audio data) uttered by a voice actor or the like, and is additionally provided with a speech audio data recording start switch, a speech audio data recording end switch, and a synthesized voice data reproduction request switch (none shown). These switches send control signals to the main control unit (not shown) of the video-accompanying audio data recording apparatus 1 to control the reproduction of the video super composite data on the display output unit 17 and the output timing of the synthesized voice data.
[0037]
A speech voice data recording start switch (not shown) is pressed when a voice actor or the like utters speech voice data based on the text super (subtitle super) displayed on the display 17a of the display output unit 17. When the dialogue voice data recording start switch is pressed, the recording of dialogue voice data input from the voice data input unit 19 is started by the voice recording unit 21 of the video-attached voice data recording apparatus 1.
[0038]
A speech voice data recording end switch (not shown) is pressed after a voice actor or the like utters speech voice data based on a text super (subtitle super) displayed on the display 17a of the display output unit 17. When the dialogue voice data recording end switch is pressed, the recording of the dialogue voice data input from the voice data input unit 19 is finished in the voice recording unit 21 of the video-attached voice data recording apparatus 1.
[0039]
A synthetic voice data reproduction request switch (not shown) is pressed to request again the reproduction of the synthetic voice data output from the speaker 17b of the display output unit 17.
[0040]
The voice recording unit 21 is an interface for recording the speech audio data input through the voice data input unit 19 in the recording unit 7, and includes an input terminal or the like through which the speech audio data can be input. The speech audio data is recorded in the recording unit 7 in association with the dialogue text data recorded there. Specifically, the voice recording unit 21 detects the end (break) of each item of dialogue text data and records each item of dialogue text data together with its speech audio data in the recording unit 7 as one dialogue file. A plurality of dialogue files is thus recorded in the recording unit 7 for each video scene. The voice recording unit 21 corresponds to the audio data recording means described in the claims.
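The association between dialogue text and recorded audio, with several dialogue files per video scene, can be sketched as a small in-memory store. This is an illustration only: the names `DialogueFile` and `RecordingStore`, the scene identifiers, and the use of a file path to stand for the recorded audio are all assumptions, and the embodiment does not specify the on-disk layout of a dialogue file.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DialogueFile:
    scene: str       # video scene the line belongs to (hypothetical identifier)
    text: str        # dialogue text data
    audio_path: str  # location of the recorded utterance (hypothetical)


class RecordingStore:
    """Keeps one DialogueFile per recorded line, grouped by video scene."""

    def __init__(self) -> None:
        self._by_scene: dict[str, list[DialogueFile]] = defaultdict(list)

    def add(self, scene: str, text: str, audio_path: str) -> DialogueFile:
        # One line of text plus its recording forms one dialogue file.
        entry = DialogueFile(scene, text, audio_path)
        self._by_scene[scene].append(entry)
        return entry

    def files_for(self, scene: str) -> list[DialogueFile]:
        # Return a copy so callers cannot alter the stored order.
        return list(self._by_scene.get(scene, []))
```

Keeping the text and the audio location in one record means a later stage, such as phonological analysis, can fetch both from a single lookup.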
[0041]
The phonological analysis unit 23 has a word pronunciation dictionary (not shown) and an acoustic model in which the feature values of each phoneme are recorded. It analyzes the dialogue text data recorded in the recording unit 7 with reference to the speech audio data and converts it into speech phonological data that includes time-series information. In other words, the speech phonological data consists of the words and phonemes of the dialogue text data, each delimited by time-series information (pronunciation time). For example, the dialogue text data “Good weather” is divided into words, and time-series information (pronunciation time) is attached to each word, such as “Good: 0 to 20 ms”. The speech phonological data is referred to when the video generation unit 9 generates the lip sync video data. That is, lip sync video data in which the movement of the lip portion of the CG character is matched to the utterance of the speech audio data is generated on the basis of this speech phonological data and the CG character lip information, which is included in the video scene data and defines the lip movement of the CG character for each phoneme.
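Speech phonological data of this kind can be pictured as a list of labels with start and end times. The sketch below is an assumption-laden illustration: the type `TimedUnit` and the function `unit_at` are hypothetical names, and only the “Good: 0 to 20 ms” interval comes from the text; any other timing used with it would be invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TimedUnit:
    label: str     # a word or phoneme from the dialogue text data
    start_ms: int  # pronunciation start time
    end_ms: int    # pronunciation end time (exclusive)


def unit_at(units: Sequence[TimedUnit], t_ms: int) -> Optional[TimedUnit]:
    """Return the unit being pronounced at time t_ms, or None during a gap."""
    for unit in units:
        if unit.start_ms <= t_ms < unit.end_ms:
            return unit
    return None
```

A lookup like `unit_at` is what a frame-by-frame lip sync stage would need: given the timestamp of a video frame, it answers which word or phoneme is being spoken.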
[0042]
Based on the time information included in the video scene data, the video/audio data synthesizing unit 25 synthesizes the speech audio data with the lip sync video data that the video generation unit 9 generated from the speech phonological data and the CG character lip information, and outputs the result to the display output unit 17 as lip sync video speech audio data.
[0043]
According to this video-accompanying audio data recording apparatus 1, the video generation unit 9 creates video data based on the video scene data, and the super generation unit 11 generates a text super, that is, a subtitle super, based on the dialogue text data of the CG character. The voice synthesis unit 15 also generates synthesized voice data based on the dialogue text data of the CG character. The video super synthesis unit 13 then combines the video data and the text super based on the time information included in the video scene data to obtain video super composite data, the main control unit (not shown) controls the display output of the video super composite data and the synthesized voice data, and the voice recording unit 21 records the speech audio data, which is the audio data uttered by a speaker such as a voice actor with reference to the synthesized voice data and the text super. The recorded speech audio data and the dialogue text data are then recorded in the recording unit 7 in association with each other.
[0044]
For this reason, a speaker such as a voice actor can utter the speech audio data with reference to the synthesized voice data and the text super, and the uttered speech audio data is recorded in association with the dialogue text data, so speech audio data suited to the video of the CG character can be recorded easily.
[0045]
In addition, according to the video-accompanying audio data recording apparatus 1, the phonological analysis unit 23 analyzes the dialogue text data phonologically and converts it into speech phonological data. Based on the speech phonological data and the video scene data, the video generation unit 9 generates lip sync video data, which is video in which the utterance of the speech audio data and the movement of the lip portion of the CG character are matched, and the video/audio data synthesizing unit 25 synthesizes the lip sync video data with the speech audio data. It is therefore possible to generate lip sync video speech audio data in which the lip movement of the CG character in the CG animation matches the speech audio data, and thus to generate (draw) a more natural CG animation.
[0046]
(Operation of video-attached audio data recording device [when recording speech audio data])
Next, with reference to the flowchart shown in FIG. 2, the operation of the video-accompanying audio data recording apparatus 1 when recording speech audio data will be described.
First, video scene data is input from the video scene data input unit 3 of the video-accompanying audio data recording apparatus 1 (S1). Further, dialogue text data of the CG character is inputted by the text data input unit 5 (S2). These video scene data and dialogue text data are recorded in the recording unit 7.
[0047]
Then, the video generation unit 9 generates video data based on the video scene data recorded in the recording unit 7 (S3). This video data is a CG animation drawn based on video scene data described in TVML. Also, a text super (caption super) is generated based on the dialogue text data recorded in the recording unit by the super generation unit 11 (S4).
[0048]
Subsequently, the video super synthesis unit 13 combines the text super (caption super) generated by the super generation unit 11 with the video data generated by the video generation unit 9, based on the time information of the video scene data, to produce video super composite data (S5). Although operations S3 to S5 have been described as a sequence, in practice the main control unit (not shown) of the video-accompanying audio data recording apparatus 1 performs them simultaneously and in parallel by multitasking.
[0049]
Next, the voice synthesis unit 15 performs speech synthesis based on the dialogue text data recorded in the recording unit 7 to generate synthesized voice data (S6). With the video super composite data and the synthesized voice data generated, the apparatus waits until the user of the apparatus 1 (a speaker such as a voice actor) requests reproduction output of the video super composite data and the synthesized voice data, that is, until a reproduction start switch (not shown) for the video super composite data is pressed. When the user of the apparatus 1 makes the reproduction output request, reproduction of the video super composite data first starts on the display output unit 17 (display 17a) (S7). Naturally, in a video scene with no CG character dialogue, the video super composite data contains no text super (caption super), and no text super is displayed on the display 17a.
[0050]
The main control unit (not shown) then determines whether the current video scene contains dialogue of the CG character, and reproduction of the video super composite data continues until a video scene with dialogue is reached (S8, No). When it is determined that the video scene contains dialogue, the video is stopped, the text super (subtitle super) is displayed on the display 17a of the display output unit 17, and the synthesized voice data is output from the speaker 17b (S9).
[0051]
The user of the apparatus 1 (a speaker such as a voice actor) then presses the speech audio data recording start switch (not shown) of the voice data input unit 19 while watching the text super, and utters the speech audio data. When the utterance is finished, the user presses the speech audio data recording end switch (not shown). If the user cannot get the knack of uttering the speech audio data (does not know how the line should be delivered), the user can press the synthesized voice data reproduction request switch (not shown) to hear the synthesized voice data again for reference. The speech audio data uttered by the user is recorded by the voice recording unit 21 of the video-accompanying audio data recording apparatus 1 and stored in the recording unit 7 in association with the dialogue text data, one dialogue file at a time (S10).
[0052]
When the main control unit (not shown) of the video-accompanying audio data recording apparatus 1 detects the end of the speech audio data uttered by the user of the apparatus 1 (a speaker such as a voice actor), or determines that the speech audio data recording end switch (not shown) has been pressed, at least one dialogue file has been generated, and the user is asked whether to finish recording speech audio data or to continue reproducing the video super composite data. To this end, a message asking whether to finish recording the speech audio data is first displayed on the display 17a of the display output unit 17, prompting the user to reply (S11). When it is determined that the user is finishing the operation of the apparatus 1 (recording of speech audio data) (S11, Yes), recording of the speech audio data ends.
[0053]
If it is not determined that the user of the apparatus 1 (a speaker such as a voice actor) is finishing the operation of the video-accompanying audio data recording apparatus 1 (S11, No), a message asking whether to continue reproduction of the video super composite data is displayed on the display 17a of the display output unit 17, prompting the user to reply (S12). When it is determined that reproduction of the video super composite data is to be continued (S12, Yes), the process returns to S7 and reproduction continues. When it is not (S12, No), the process returns to the beginning (S1) and the operation of the apparatus 1 continues.
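The playback and recording flow of steps S7 to S12 can be sketched as a simple loop. This is a simplified illustration rather than the control logic of the main control unit: the function `run_recording_session`, its callbacks, and the scene representation are hypothetical, caption display and synthesized voice output in S9 are omitted, and the S12 “No” branch (returning to S1) is not modeled.

```python
from typing import Callable, Optional, Sequence

# A scene is (scene name, dialogue text or None when the scene has no line).
Scene = tuple[str, Optional[str]]


def run_recording_session(
    scenes: Sequence[Scene],
    record: Callable[[str], str],
    finish_after: Callable[[str], bool],
) -> list[tuple[str, str]]:
    """Play scenes in order, pausing to record each line of dialogue.

    record(text) stands for the speaker's take and returns a handle to
    the recorded audio; finish_after(text) stands for the user's answer
    to the "finish recording?" prompt.
    """
    recorded: list[tuple[str, str]] = []
    for _name, dialogue in scenes:      # S7: reproduction proceeds scene by scene
        if dialogue is None:            # S8 No: no dialogue, keep playing
            continue
        # S9: reproduction is paused here (caption and example voice omitted).
        audio = record(dialogue)        # S10: record and associate with the text
        recorded.append((dialogue, audio))
        if finish_after(dialogue):      # S11 Yes: the user ends the session
            break
        # S11 No and S12 Yes: resume reproduction with the next scene.
    return recorded
```

Passing the recording and the prompt as callbacks keeps the loop testable without a microphone or a display.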
[0054]
(Operation of audio data recording device with video [when lip sync video data is synthesized])
Next, the operation of the video-accompanying audio data recording apparatus 1 when generating lip sync video data and synthesizing it with speech audio data will be described with reference to the flowchart shown in FIG.
[0055]
First, the phonological analysis unit 23 analyzes the dialogue text data recorded in the recording unit 7 phonologically, with reference to the speech audio data, and converts it into speech phonological data (S21). This speech phonological data is output to the video generation unit 9. In the speech phonological data, time-series information (pronunciation time) is attached to the words and phonemes obtained by dividing the dialogue text data.
[0056]
Based on the speech phonological data (in which time-series information is attached to the divided words and phonemes) and the time information included in the video scene data, the video generation unit 9 generates lip sync video data in which the lip movement of the CG character is matched to the speech audio data, and outputs it to the video/audio data synthesizing unit 25 (S22). The video/audio data synthesizing unit 25 then synthesizes the speech audio data with the lip sync video data based on the time information included in the video scene data, and outputs the result to the display output unit 17 as lip sync video speech audio data (S23). The lip sync video speech audio data is displayed and output by the display output unit 17 (S24).
[0057]
(Specific operation example of video-attached audio data recording device)
Next, with reference to FIG. 4, a specific operation example of the video-accompanying audio data recording apparatus 1 will be described. FIG. 4 is an explanatory diagram illustrating the CG animation displayed on the display 17a of the display output unit 17 for four frames (a) to (d) in the order of display from the top in FIG.
[0058]
FIG. 4A shows a video scene in which a single CG character (here, a character with horns) is standing against a background overlooking the horizon. Since no dialogue is set for the horned character in this video scene, the video super composite data is reproduced normally (as is).
[0059]
FIG. 4B shows a video scene in which a new CG character (in the description of FIG. 4B, a CG actor, a character without horns) appears where the horned character is standing and greets the horned character with “It's nice weather”. That is, dialogue is set for the hornless character in this video scene, and when this video scene is reached, reproduction of the video super composite data is paused under the control of the main control unit (not shown) of the video-accompanying audio data recording apparatus 1. The dialogue spoken by the hornless character is then displayed on the display 17a of the display output unit 17 as a text super, and the synthesized voice data is output from the speaker 17b of the display output unit 17.
[0060]
FIG. 4C shows the video scene, together with the speaker such as a voice actor, at the time when the speaker who voices the dialogue of the CG character (the hornless character) utters the speech audio data while watching the video super composite data and the uttered speech audio data is recorded. The speaker inputs the speech audio data, that is, the dialogue text data of the CG character read aloud, into the voice data input unit 19 composed of a microphone or the like.
[0061]
In this case, the speaker such as a voice actor reads the dialogue text data aloud after pressing the speech audio data recording start switch (not shown) of the voice data input unit 19. Then, as shown at the upper left of FIG. 4C, “● start recording” is displayed on the display 17a, so the speaker can confirm visually that the speech audio data is being recorded. The “● start recording” indication can also be displayed on the display 17a automatically, without the speaker pressing the speech audio data recording start switch of the voice data input unit 19, by describing it in the video scene data.
[0062]
FIG. 4D shows a video scene in which the hornless character has finished speaking and the hornless character and the horned character face each other. In this video scene, the dialogue text data, which is the dialogue spoken by the hornless character, is compared with the speech audio data uttered by the voice actor or the like, and the main control unit (not shown) of the video-accompanying audio data recording apparatus 1 detects the end of the speech audio data. Through this series of operations, one item of speech audio data is generated for one item of dialogue text data, and the voice recording unit 21 associates the two and records them in the recording unit 7 as one dialogue file.
[0063]
(Example of lip portion of CG character)
Next, lip sync video data (an example of a lip portion of a CG character) generated by the video generation unit 9 will be described with reference to FIG. 5. FIGS. 5A to 5F illustrate the relationship between the shape of the lip portion of the CG character and the vowels, silence, and plosive sounds that are pronounced.
[0064]
FIG. 5A shows the shape of the lip portion of the CG character when the vowel “a” is pronounced. FIG. 5B shows the shape when the vowel “i” is pronounced. FIG. 5C shows the shape when the vowel “u” is pronounced. FIG. 5D shows the shape when the vowel “e” is pronounced. FIG. 5E shows the shape when the vowel “o” is pronounced. FIG. 5F shows the shape during silence or when a plosive sound is produced. In this embodiment, although not shown, the shape of the lip portion of the CG character when each consonant is pronounced is also set.
[0065]
As shown in FIGS. 5A to 5F, the shape of the lip portion of the CG character is set precisely for each vowel. In lip sync video data generated on this basis, the “awkwardness” and “unnaturalness” of the lip portion of the CG character are eliminated compared with conventional CG animation (video data). That is, according to this video-accompanying audio data recording apparatus 1, it is possible to produce a CG animation with realistic CG characters that appear to speak their lines as a person would.
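The correspondence between phonemes and lip shapes can be sketched as a lookup table sampled once per video frame. This is a simplified illustration under loud assumptions: the shape identifiers (`fig5a` and so on) are hypothetical labels for the shapes in FIG. 5, every phoneme other than the five vowels falls back to the silence/plosive shape even though the embodiment also defines shapes for individual consonants, and the names `lip_shape_for` and `lip_track` are invented for this sketch.

```python
from typing import Sequence

# Hypothetical identifiers for the lip shapes of FIG. 5 (a) to (e).
LIP_SHAPES = {"a": "fig5a", "i": "fig5b", "u": "fig5c", "e": "fig5d", "o": "fig5e"}
CLOSED = "fig5f"  # shape used for silence and plosives


def lip_shape_for(phoneme: str) -> str:
    """Map a phoneme to a lip shape; non-vowels fall back to CLOSED (simplification)."""
    return LIP_SHAPES.get(phoneme.lower(), CLOSED)


def lip_track(
    timed_phonemes: Sequence[tuple[str, int, int]],  # (phoneme, start_ms, end_ms)
    fps: int,
    duration_ms: int,
) -> list[str]:
    """Return one lip shape per video frame for a clip of duration_ms."""
    n_frames = duration_ms * fps // 1000
    track = []
    for k in range(n_frames):
        t_ms = k * 1000 / fps  # timestamp of frame k
        shape = CLOSED         # default when nothing is being pronounced
        for phoneme, start_ms, end_ms in timed_phonemes:
            if start_ms <= t_ms < end_ms:
                shape = lip_shape_for(phoneme)
                break
        track.append(shape)
    return track
```

Sampling per frame ties the lip shape to the timing of the recorded utterance rather than to a fixed animation, which is the point of generating the lip sync video after recording.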
[0066]
(Example of video scene data and dialogue text data by TVML)
Next, an example of video scene data and dialogue text data by TVML will be described with reference to FIG. FIG. 6 is a TVML script (TVML script) in which video scene data and dialogue text data are described in TVML. This TVML script is used as video super synthesis data (drawn in CG animation) by the video generation unit 9, the super generation unit 11, and the video super synthesis unit 13.
[0067]
This TVML script can be executed by a general TVML player (not shown), which, as shown in FIG. 6, interprets it one line at a time from the top (interpreter operation). “Set: change” in the “A” line defines the data of the set (stage and background) used for the CG animation program (animation program) to be produced. In this example, model data named (fuji) is read.
[0068]
“Character: casting” in the “B” line defines the name of a CG character (character) appearing in a CG animation program (animation program) to be produced. “Character: bindmodel” in the “C” line assigns model data to a defined CG character (character).
[0069]
“Character: position” in the “D” line sets the initial position / orientation of the CG character (character) on the three-dimensional coordinates. “Character: setvoice” in the “E” line assigns a voice quality when the voice synthesis unit 15 generates synthesized voice data to the CG character (character).
[0070]
“Light: assign” in the “F” line defines the name of the lighting used for the CG animation program (animation program) to be produced. “Light: model” in the “G” line specifies the type of light source, the position / orientation on three-dimensional coordinates, brightness, color, and the like.
[0071]
“Camera: movement” in the “H” line is used to set the position / orientation, viewing angle, etc. of the camera (viewpoint in the CG space) used for the CG animation program (animation program) to be produced. “Character: walk” in the “I” line defines that a CG character (character) is allowed to walk to a coordinate point.
[0072]
“Character: turn” in the “J” line defines that the CG character (character) turns to a specified angle. “Character: look” in the “K” line defines that the line of sight (face) of the CG character (character) is directed toward a target object.
[0073]
“Character: talk” in the “L” line defines that the character string of the dialogue text data (text) is displayed as the text super (caption super) and that, at the same time, an example utterance is given using the synthesized voice data synthesized by the voice synthesis unit 15.
[0074]
When the “character: talk” command in the “L” line is reached, reproduction of the video super composite data is paused, a message prompting the start of recording of the speech audio data uttered by a speaker such as a voice actor is displayed on the display 17a, and recording of the speech audio data starts. The speech audio data input through the voice data input unit 19 is collated with the dialogue text data continuously, and recording of the speech audio data ends when the utterance is finished (when the end of the speech audio data is detected).
[0075]
For example, as shown in the “L” line in FIG. 6, with “character: talk (name = MARY, text = “good weather”)”, the text “good weather” is displayed on the display 17a as a text super (caption super) and, at the same time, is converted into synthesized voice data by the voice synthesis unit 15 and uttered (output) from the speaker 17b. When the output of the synthesized voice data is finished, the recording start message “● start recording” is displayed on the display 17a, and recording of the speech audio data “good weather” starts.
[0076]
The dialogue audio data “good weather” input via the voice data input unit 19 is collated with the dialogue text data “good weather”, and recording of the dialogue audio data is terminated when the end of the utterance is detected. The recorded dialogue audio data is associated with the TVML script command “character: talk(name=MARY, text=“good weather”)” and is recorded in the recording unit 7 as an audio data file (dialogue file) named, for example, serif1.wav.
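The association between each talk command and its recorded file can be pictured as a small table built in script order. The following is a minimal sketch under stated assumptions: the sequential naming serif1.wav, serif2.wav, and so on extends the single example name given above, and the function name plan_dialogue_files, the record layout, and the parameters of the walk command in the sample script are hypothetical.

```python
import re

# Matches only the talk command; other commands need no recording.
_TALK = re.compile(
    r'character\s*:\s*talk\s*\(\s*name\s*=\s*(\w+)\s*,\s*text\s*=\s*"([^"]*)"\s*\)'
)


def plan_dialogue_files(script_lines, prefix="serif", extension=".wav"):
    """Assign one audio file name to every talk command, in script order."""
    plan = []
    for line_number, line in enumerate(script_lines, start=1):
        match = _TALK.search(line)
        if match is None:
            continue  # camera moves, walking, turning, etc.
        name, text = match.groups()
        plan.append({
            "script_line": line_number,
            "character": name,
            "text": text,
            "audio_file": f"{prefix}{len(plan) + 1}{extension}",
        })
    return plan


script = [
    'character: walk(name=MARY, x=1.0)',  # hypothetical parameters
    'character: talk(name=MARY, text="good weather")',
    'character: talk(name=BOB, text="yes")',
]
plan = plan_dialogue_files(script)
```

Keeping the script line number alongside the file name lets a later stage find the dialogue text that belongs to each recording.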
[0077]
Although the present invention has been described above based on one embodiment, the invention is not limited to this embodiment.
For example, the processing of each component of the video-accompanying audio data recording apparatus 1 can be regarded as a video-accompanying audio data recording method in which each process is treated as one step, or as a video-accompanying audio data recording program in which the processing of each component is written in a general-purpose computer language. In these cases, the same effects as those of the video-accompanying audio data recording apparatus 1 are obtained, and the video-accompanying audio data recording program can additionally be stored on a storage medium and distributed, or distributed via a network or the like.
[0078]
[Effects of the Invention]
According to the first, second, and sixth aspects of the invention, the audio data read aloud by the speaker is recorded while the text super, in which the text of the audio data is rendered as a caption, and the synthesized voice data, obtained by speech synthesis of that text, are added to the video and output. The speaker can therefore utter the audio data with reference to the synthesized voice data and the text super, and audio data suited to the video can be recorded easily.
[0079]
According to the third aspect of the invention, when a text super has been added, it is deleted by the super deletion means, so that video/audio data in which the video data and the audio data read aloud by the speaker are synthesized is finally obtained.
[0080]
According to the fourth aspect of the invention, video data is created based on the video scene data, and a text super is generated based on the text data of the dialogue of the CG character. The video data and the text super are then combined based on the time information included in the video scene data to obtain video super synthesized data. Synthesized voice data is also generated based on the text data of the dialogue of the CG character. Display output of the video super synthesized data and the synthesized voice data is then controlled, and dialogue audio data, which is the audio data uttered by a speaker such as a voice actor with reference to the synthesized voice data and the text super, is recorded. The recorded dialogue audio data and the dialogue text data are then recorded in association with each other. A speaker such as a voice actor can therefore utter the dialogue audio data with reference to the synthesized voice data and the text super, and because the uttered dialogue audio data is recorded in association with the dialogue text data, dialogue audio data suited to the video of the CG character can be recorded easily.
[0081]
According to the fifth aspect of the invention, the dialogue text data is analyzed and converted into dialogue phonological data. Based on the dialogue phonological data and the video scene data, lip sync video data is generated, which is video in which the utterance of the dialogue audio data matches the movement of the lips of the CG character, and the lip sync video data and the dialogue audio data are synthesized. It is therefore possible to generate lip sync video/audio data in which the lip movement of the CG character in the CG animation matches the dialogue audio data, and thus to generate (draw) a more natural CG animation.
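To illustrate how time-series phonological data could drive lip movement, the sketch below converts time-stamped phonemes into one mouth-shape key per video frame. It assumes the timing has already been obtained by the phonological analysis. The mapping table, the shape names, the frame rate, and the function name are hypothetical; the actual lip shapes are those shown in FIG. 5, which are not reproduced here.

```python
# Assumed mapping from vowel to mouth-shape key (illustrative names only).
MOUTH_SHAPES = {
    "a": "open_a",
    "i": "spread_i",
    "u": "round_u",
    "e": "mid_e",
    "o": "round_o",
}
CLOSED = "closed"  # used for silence and for any phoneme not listed above


def mouth_shapes_per_frame(timed_phonemes, duration, fps=30):
    """Return one mouth-shape key per frame.

    timed_phonemes: list of (phoneme, start_sec, end_sec), non-overlapping.
    duration: length of the dialogue audio in seconds.
    """
    frame_count = int(round(duration * fps))
    frames = [CLOSED] * frame_count
    for phoneme, start, end in timed_phonemes:
        shape = MOUTH_SHAPES.get(phoneme, CLOSED)
        first = max(0, int(round(start * fps)))
        last = min(frame_count, int(round(end * fps)))
        for index in range(first, last):
            frames[index] = shape
    return frames


# Two vowels followed by 0.1 s of silence, at 10 frames per second
frames = mouth_shapes_per_frame([("a", 0.0, 0.2), ("i", 0.2, 0.4)], 0.5, fps=10)
```

A production renderer would normally interpolate between shapes rather than switch abruptly at frame boundaries; the per-frame list is only the simplest form of the idea.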
[Brief description of the drawings]
FIG. 1 is a block diagram of a video-accompanying audio data recording apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart for explaining the operation (during dialogue audio data recording) of the video-accompanying audio data recording apparatus shown in FIG. 1.
FIG. 3 is a flowchart for explaining the operation (during lip sync video data synthesis) of the video-accompanying audio data recording apparatus shown in FIG. 1.
FIG. 4 is an explanatory diagram illustrating a specific operation example of the video-accompanying audio data recording apparatus.
FIG. 5 is a diagram showing the relationship between the shape of the lips of a CG character and the vowel being pronounced, as well as silence and plosive sounds.
FIG. 6 is a diagram showing an example of video scene data and dialogue text data by TVML.
[Explanation of symbols]
1 Video-accompanying audio data recording apparatus
3 Video scene data input unit
5 Text data input unit
7 Recording unit
9 Video generator
11 Super generator
13 Video super synthesis unit
15 Speech synthesizer
17 Display output unit
17a Display
17b Speaker
19 Voice data input unit
21 Voice recording unit
23 Phonological analysis unit
25 Video/audio data synthesizer

Claims

A video-attached audio data recording method for recording, based on video data, information indicating an insertion location of audio data to be attached to the video data, and text data of the audio data, the audio data uttered by a speaker while the video data is displayed on a display screen, the method comprising:
a video stop control step of stopping reproduction of the video data displayed on the display screen, based on the information indicating the insertion location of the audio data;
a video super synthesized speech output step of outputting, in correspondence with the video data, a text super in which the text data of the audio data is rendered as a caption, and synthesized voice data obtained by speech synthesis of the text data of the audio data;
an audio data recording step of recording, in association with the text data, the audio data uttered by the speaker based on the text super and the synthesized voice data output in the video super synthesized speech output step; and
a video resumption control step of resuming reproduction of the video data after the audio data has been recorded in the audio data recording step.

A video-attached audio data recording apparatus for recording, based on video data, information indicating an insertion location of audio data to be attached to the video data, and text data of the audio data, the audio data uttered by a speaker while the video data is displayed on a display screen, the apparatus comprising:
video recording means for reading and recording the video data;
video super synthesized speech output means for outputting, in correspondence with the video data, a text super in which the text data of the audio data is rendered as a caption, and synthesized voice data obtained by speech synthesis of the text data of the audio data;
super synthesized audio output control means for stopping reproduction of the video data displayed on the display screen, having the audio data recorded, and resuming reproduction of the video data, based on the information indicating the insertion location of the audio data; and
audio data recording means for recording the audio data as one file corresponding to the text data.

The video-attached audio data recording apparatus according to claim 2, further comprising super deletion means for deleting the text super, in a case where the text super has been added to the video data by the super synthesized audio output control means, when video/audio data obtained by synthesizing the video data and the audio data is output from the video-attached audio data recording apparatus.

A video-attached audio data recording apparatus for recording audio data of dialogue spoken by a CG character while the CG character, rendered as video data by CG synthesis, is displayed on a display screen, the apparatus comprising:
video creation means for creating the video data based on video scene data including information about a video scene in which the CG character speaks the dialogue;
text super generating means for generating a text super to be combined with the video data, based on dialogue text data, which is the text data of the dialogue;
video super synthesis means for generating video super synthesized data by combining the video data and the text super based on time information included in the video scene data;
speech synthesis means for generating synthesized voice data by speech synthesis based on the dialogue text data;
synthesis display output control means for controlling display output of the video super synthesized data and the synthesized voice data;
audio data recording means for recording dialogue audio data, which is audio data uttered with reference to the synthesized voice data and the text super; and
recording means for recording the video scene data and the dialogue text data, and for recording the dialogue audio data recorded by the audio data recording means in association with the dialogue text data.

The video-attached audio data recording apparatus according to claim 4, further comprising:
phonological analysis means that has a word pronunciation dictionary and an acoustic model describing the feature values of each phoneme, analyzes the dialogue text data with reference to the dialogue audio data, and converts the dialogue text data into dialogue phonological data including time-series information of the dialogue text data; and
video/audio data synthesis means for synthesizing lip sync video data and the dialogue audio data,
wherein the video creation means generates, based on the dialogue phonological data produced by the phonological analysis means and the video scene data, the lip sync video data, which is video in which the utterance of the dialogue audio data matches the movement of the lips of the CG character.

A video-attached audio data recording program for causing a computer, in order to record, based on video data, information indicating an insertion location of audio data to be attached to the video data, and text data of the audio data, the audio data uttered by a speaker while the video data is displayed on a display screen, to function as:
video recording means for reading and recording the video data;
video super synthesized speech output means for outputting, in correspondence with the video data, a text super in which the text data of the audio data is rendered as a caption, and synthesized voice data obtained by speech synthesis of the text data of the audio data;
super synthesized audio output control means for stopping reproduction of the video data displayed on the display screen, having the audio data recorded, and resuming reproduction of the video data, based on the information indicating the insertion location of the audio data; and
audio data recording means for recording the audio data as one file corresponding to the text data.