JP3540984B2

JP3540984B2 - Speech synthesis apparatus, speech synthesis method, and storage medium storing speech synthesis program

Info

Publication number: JP3540984B2
Application number: JP2000190466A
Authority: JP
Inventors: 克人別所; 正曽根
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2000-06-26
Filing date: 2000-06-26
Publication date: 2004-07-07
Anticipated expiration: 2020-06-26
Also published as: JP2002006876A

Description

【０００１】
【発明の属する技術分野】
この発明は、音声合成装置、音声合成方法および音声合成プログラムを記憶した記憶媒体に関し、特に、同時通訳において言語間に存在する語順の相違により生ずる無音区間を適切なポーズ時間に設定して原言語から翻訳された目的言語の合成音声の自然性を向上させるが如き、音声合成されるべき目的言語の文間の無音区間を適切なポーズ時間に設定して音声合成する音声合成装置、音声合成方法および音声合成プログラムを記憶した記憶媒体に関する。
【０００２】
【従来の技術】
同時通訳音声が聞き手にとって、違和感なく自然で、内容が理解し易いものか否かを検証する方法およびこれを実施する装置は今のところ特に開発されていない。
同時通訳をした際に、意味のまとまりのある原言語の発話文を翻訳した翻訳音声が、長い無音区間により区切られていると、聞き手にとって違和感を感ずることがある。この現象は、原言語が１文発話された後に訳し始める逐次翻訳の場合に、原言語の１文が長いときに起こり得る。また、原言語が１文節発話される度毎に訳し始める漸進的翻訳の場合においても、例えば日英通訳の場合、日本語は英語とは語順が異なって、目的語が動詞に先行するところから、日本語で最後の動詞が発話されるまで、英語に翻訳することができず、発話タイミングが結局逐次翻訳と同等になる。この漸進的翻訳においても、日本語の目的語が長い場合に無音区間が長くなり、違和感を感ずることがある。この様な場合に、通訳者が人間であると、主語、目的語の最初の方の文節、或いは動詞に到る文脈から、動詞を予測して翻訳することがある。
【０００３】
【発明が解決しようとする課題】
この発明は、同時通訳において言語間に存在する語順の相違により生ずる無音区間を適切なポーズ時間に設定して原言語から翻訳された目的言語の合成音声の自然性を向上させるが如き、音声合成されるべき目的言語の文間の無音区間を適切なポーズ時間に設定して音声合成する場合に使用される音声合成装置、音声合成方法および音声合成プログラムを記憶した記憶媒体を提供することを目的とするものである。
【０００４】
【課題を解決するための手段】
請求項１：原言語の各発話文の発話開始時刻および発話終了時刻の情報を記憶した原言語データベース１１を具備し、原言語の各発話文の翻訳文と翻訳文の発話開始時刻および発話終了時刻の情報とを記憶した目的言語データベース２１を具備し、目的言語データベース２１内の各翻訳文に対応する原言語の発話文の発話終了時刻を原言語データベース１１から取得し、当該発話終了時刻に或る時間を加算して得られる時刻を当該翻訳文の発話開始時刻として当該翻訳文に付与する発話開始時刻付与部３１を具備し、目的言語データベース２１内の各翻訳文を発話開始時刻付与部３１で得られた発話開始時刻に応答して動作せしめられる音声合成部５１を具備する音声合成装置を構成した。
【０００５】
そして、請求項２：原言語の各発話文の発話開始時刻および発話終了時刻の情報を記憶した原言語データベース１１を具備し、原言語の各発話文の翻訳文と翻訳文の発話開始時刻および発話終了時刻の情報とを記憶した目的言語データベース２１を具備し、目的言語データベース２１内の各翻訳文に対応する原言語の発話文の発話終了時刻を原言語データベース１１から取得し、当該発話終了時刻に或る時間を加算して得られる時刻を当該翻訳文の発話開始時刻として当該翻訳文に付与する発話開始時刻付与部３１を具備し、目的言語データベース２１内の各翻訳文の発話時間を算出し、当該発話時間と発話開始時刻付与部３１で得られた発話開始時刻とから当該翻訳文の発話終了時刻を算出し、当該翻訳文に付与する発話終了時刻付与部３２を具備し、目的言語データベース２１内の各翻訳文を、発話開始時刻付与部３１或は発話終了時刻付与部３２で得られた発話開始時刻に応答して動作せしめられる音声合成部５１を具備する音声合成装置を構成した。
【０００６】
また、請求項３：請求項２に記載される音声合成装置において、発話終了時刻付与部３２は直前の翻訳文の発話終了時刻と後の翻訳文の発話開始時刻の間の前後逆転に対応して後の翻訳文の発話開始時刻と発話終了時刻を遅延する遅延処理部を有するものである音声合成装置を構成した。
更に、請求項４：原言語の各発話文の発話開始時刻および発話終了時刻の情報を記憶した原言語データベース１１を具備し、原言語の各発話文の翻訳文と翻訳文の発話開始時刻および発話終了時刻の情報とを記憶した目的言語データベース２１を具備し、目的言語データベース２１内の各翻訳文に対応する原言語の発話文の発話終了時刻を原言語データベース１１から取得し、当該発話終了時刻に或る時間を加算して得られる時刻を当該翻訳文の発話開始時刻として当該翻訳文に付与する発話開始時刻付与部３１を具備し、目的言語データベース２１内の各翻訳文の発話時間を算出し、当該発話時間と発話開始時刻付与部３１で得られた発話開始時刻とから当該翻訳文の発話終了時刻を算出し、当該翻訳文に付与する発話終了時刻付与部３２を具備し、発話開始時刻付与部３１と発話終了時刻付与部３２で得られた時刻情報に基づいて目的言語データベース２１内の各発話文間の無音区間の長さを算出し、その長さが或る値を超えている場合、後の翻訳文に対してその発話開始時刻および発話終了時刻をより前に修正した値を付与する修正発話時刻付与部４１を具備し、目的言語データベース２１内の各翻訳文を、発話開始時刻付与部３１、発話終了時刻付与部３２或いは修正発話時刻付与部４１で得られた発話開始時刻に応答して動作せしめられる音声合成部５１を具備する音声合成装置を構成した。
【０００７】
また、請求項５：請求項４に記載される音声合成装置において、発話終了時刻付与部３２は直前の翻訳文の発話終了時刻と後の翻訳文の発話開始時刻の間の前後逆転に対応して後の翻訳文の発話開始時刻と発話終了時刻を遅延する遅延処理部を有するものである音声合成装置を構成した。
ここで、請求項６：原言語の各発話文の発話開始時刻および発話終了時刻の情報を記憶した原言語データベース１１と原言語の各発話文の翻訳文と翻訳文の発話開始時刻および発話終了時刻の情報とを記憶した目的言語データベース２１を使用し、目的言語データベース２１内の各翻訳文に対応する原言語の発話文の発話終了時刻を原言語データベース１１から取得し当該発話終了時刻に或る時間を加算して得られる時刻を当該翻訳文の発話開始時刻として当該翻訳文に付与し、目的言語データベース２１内の各翻訳文を得られた発話開始時刻から音声合成する音声合成方法を構成した。
【０００８】
そして、請求項７：原言語の各発話文の発話開始時刻および発話終了時刻の情報を記憶した原言語データベース１１と原言語の各発話文の翻訳文と翻訳文の発話開始時刻および発話終了時刻の情報とを記憶した目的言語データベース２１を使用し、目的言語データベース２１内の各翻訳文に対応する原言語の発話文の発話終了時刻を原言語データベース１１から取得して当該発話終了時刻に或る時間を加算し得られた時刻を当該翻訳文の発話開始時刻として当該翻訳文に付与し、目的言語データベース２１内の各翻訳文の発話時間を算出し、当該発話時間と先に得られた発話開始時刻とから当該翻訳文の発話終了時刻を算出し、直前の翻訳文の発話終了時刻と後の翻訳文の発話開始時刻の間の前後逆転に対応して後の翻訳文の発話開始時刻と発話終了時刻を遅延せしめ、目的言語データベース２１内の各翻訳文を、先に得られた発話開始時刻或いは遅延せしめられた発話開始時刻から音声合成する音声合成方法を構成した。
【０００９】
また、請求項８：原言語の各発話文の発話開始時刻および発話終了時刻の情報を記憶した原言語データベース１１と原言語の各発話文の翻訳文と翻訳文の発話開始時刻および発話終了時刻の情報とを記憶した目的言語データベース２１を使用し、目的言語データベース２１内の各翻訳文に対応する原言語の発話文の発話終了時刻を原言語データベース１１から取得し、当該発話終了時刻に或る時間を加算して得られる時刻を当該翻訳文の発話開始時刻として当該翻訳文に付与し、目的言語データベース２１内の各翻訳文の発話時間を算出し、当該発話時間と先に得られた発話開始時刻とから当該翻訳文の発話終了時刻を算出し、直前の翻訳文の発話終了時刻と後の翻訳文の発話開始時刻の間の前後逆転に対応して後の翻訳文の発話開始時刻と発話終了時刻を遅延せしめ、目的言語データベース２１内の各発話文間の無音区間の長さを算出し、その長さが或る値を超えている場合、後の翻訳文に対してその発話開始時刻および発話終了時刻をより前に修正し、目的言語データベース２１内の各翻訳文を、先に得られた発話開始時刻、遅延せしめられた発話開始時刻、或いはより前に修正せしめられた発話開始時刻から音声合成する音声合成方法を構成した。
【００１０】
ここで、請求項９：原言語の各発話文の発話開始時刻および発話終了時刻の情報を記憶した原言語データベース１１と、原言語の各発話文の翻訳文と翻訳文の発話開始時刻および発話終了時刻の情報とを記憶した目的言語データベース２１と、目的言語データベース内の各翻訳文に対応する原言語の発話文の発話終了時刻に或る時間を加算して当該翻訳文の発話開始時刻を求めるプロセスと、目的言語データベース２１内の各翻訳文の発話時間を算出するプロセスと、発話時間と当該翻訳文に付与された発話開始時刻に基づいて当該翻訳文の発話終了時刻を求めるプロセスと、求められた各翻訳文の発話開始時刻および発話終了時刻情報に基づいて目的言語データベース２１内の各発話文間の無音区間の長さを算出するプロセスと、算出された無音区間の長さが或る値を超えている場合後の翻訳文に対してその発話開始時刻および発話終了時刻をより前に修正するプロセスと、目的言語データベース２１内の各翻訳文を修正された発話開始時刻から音声合成するプロセスと、より成る音声合成プログラムを記憶した記憶媒体を構成した。
【００１１】
【発明の実施の形態】
この発明の実施の形態を、先ず、図１および図２を参照して説明する。
図１の実施例において、１１は原言語データベース、２１は目的言語データベース、３１は発話開始時刻付与部、３２は発話終了時刻付与部、４１は修正発話時刻付与部、５１は音声合成部を示す。図２は原言語と目的言語との間の発話時刻の関係を示す図である。以下、図３ないし図５をも参照し具体的に説明する。
【００１２】
原言語データベース１１には、原言語の各発話文とその発話開始時刻および発話終了時刻の情報が記憶されている。図３は原言語データベース１１の一例を示す図であり、原言語の発話文と、その発話開始時刻、発話終了時刻の情報が記憶されている。原言語の発話文は、テキストデータの場合もあれば、音声データの場合もある。
目的言語データベース２１には、原言語の各発話文に対応する翻訳文とその発話開始時刻および発話終了時刻の情報とが記憶されている。図４は目的言語データベースの一例を示す図であり、翻訳文と、その発話開始時刻、発話終了時刻の情報が記憶されている。翻訳文は、テキストデータの場合もあれば、音声データの場合もある。目的言語データベース２１には、翻訳文として、（ａ）直訳表現の翻訳文の他に（ｂ）意訳等の修正処理を施した翻訳文を記憶しておき、また、（ｃ）通常の口調で翻訳した音声データ、更に、（ｄ）喜び、悲しみ、怒りなどの感情を込めた原言語の発話内容を反映した口調で翻訳した音声データを記憶しておくこともできる。
【００１３】
発話開始時刻付与部３１は、目的言語データベース２１内の各翻訳文に対して対応する原言語の発話文の発話終了時刻を原言語データベース１１から取得し、当該発話終了時刻に或る時間を加算して得られる時刻を当該翻訳文の発話開始時刻として付与する。図４においては、各翻訳文の発話開始時刻として対応する原言語の発話文の発話終了時刻に或る時間、一例として１ｍｓｅｃを加算した値が表示されている。
【００１４】
発話終了時刻付与部３２は、目的言語データベース２１内の各翻訳文の発話時間を算出し、当該発話時間と発話開始時刻付与部３１で得られた発話開始時刻とから、当該翻訳文の発話終了時刻を算出して付与する。ところで、翻訳文の発話時間は、翻訳文がテキストデータの場合、翻訳文を音声合成部５１により音声合成し、その音声の再生時間を測定することにより求めることができる。また、テキストデータの場合、音声合成部５１において、テキストを音素に分解し、予め音声合成部５１に記憶されている各音素長の情報を使用して音声を再生することなしに発話時間の推定値を算出することによっても得られる。翻訳文が音声データの場合、予め発話時間の情報が組み込まれていることがある。また、発話時間は、音声データを音声合成部５１により再生して、この再生時間を測定することによっても求めることができる。図４においては、以上の通りにして算出した発話時間を発話開始時刻に加算することにより得られた値を、発話終了時刻として示している。
【００１５】
ここで、発話終了時刻付与部３２により発話終了時刻を付与する際に、或る翻訳文の発話終了時刻が、次の翻訳文の発話開始時刻よりも後になる場合が起こり得る。この場合、発話の重なりを認めて特に時刻の修正をしない方法と、発話の重なりを認めずに時刻を修正する方法とがある。以下の方法は、後の翻訳文の発話開始を前の翻訳文の発話終了の直後とするものである。即ち、発話終了時刻付与部３２は、直前の翻訳文の発話終了時刻と後の翻訳文の発話開始時刻の間の前後逆転に対応して後の翻訳文の発話開始時刻と発話終了時刻を遅延する遅延処理部を有する。
【００１６】
Ｎを文の総数としたとき、次のルーチンで発話時刻を修正する。
ルーチンの内容
（１）ｉ＝１とする。
（２）ｉ＜Ｎならば（３）に進む。ｉ＝Ｎならば終了する。
（３）（ｉ＋１）番目の翻訳文の発話開始時刻がｆで、ｉ番目の翻訳文の発話終了時刻が（ｆ＋ｅ）（ｅ＞０）の場合、（ｉ＋１）番目の翻訳文の発話開始時刻と発話終了時刻をｅだけ遅らせる。
（４）ｉ＝ｉ＋１とする。（２）に進む。
【００１７】
図２を参照するに、修正発話時刻付与部４１は、発話開始時刻付与部３１と発話終了時刻付与部３２で得られた時刻情報に基づいて、目的言語データベース２１内の各翻訳文間の無音区間の長さを算出し、この長さが或る値を超えている場合、後の文に対してその発話開始時刻および発話終了時刻をより前に修正した値を付与する。実線は各文の発話区間を示しており、実線に添えられた番号ｉおよび（ｉ＋１）が対応する原言語文と目的言語文を示している。ｍは、発話開始時刻付与部３１において原言語の発話文の発話終了時刻に加算する時間を表わす。但し、修正発話時刻付与部４１において目的言語文の発話開始、終了時刻を修正した場合、このｍの値は一定とは限らない。
【００１８】
Ｎを文の総数としたとき、修正発話時刻付与部４１における時刻修正は、以下のルーチンで行う。
ルーチンの内容
（１）ｉ＝１とする。
（２）ｉ＜Ｎならば（３）に進む。ｉ＝Ｎならば終了する。
（３）目的言語文ｉ、（ｉ＋１）間の無音時間から原言語文ｉ、（ｉ＋１）間の無音時間ｘを除いたｙの長さが或る閾値ｋ以上の場合、目的言語文（ｉ＋１）の発話開始、終了時刻を、ｙの値がｋになる様に、同一時間だけ前方へずらす。但し、この結果、目的言語文（ｉ＋１）の発話開始時刻が原言語文（ｉ＋１）の発話開始時刻よりも前になれば、目的言語文（ｉ＋１）の発話開始時刻が原言語文（ｉ＋１）の発話開始時刻と等しくなる様に、目的言語文（ｉ＋１）の発話開始、終了時刻を同一時間だけ前方へずらす。
（４）ｉ＝ｉ＋１とする。（２）に進む。
【００１９】
以上の通り、発話開始時刻付与部３１と発話終了時刻付与部３２で得られた時刻情報に基づいて、目的言語データベース２１内の各発話文間の無音区間の長さを算出し、その長さが或る値を超えている場合、後の文に対してその発話開始時刻および発話終了時刻をより前に修正した値を付与する構成を具備することにより、音声合成されるべき目的言語の文間の無音区間を適切なポーズ時間に設定して音声合成することができる。
【００２０】
図５は、目的言語データベース２１において、発話開始時刻付与部３１と発話終了時刻付与部３２により付与された発話開始、終了時刻とは別に、修正発話時刻付与部４１で付与された修正時刻を示している。４番目の原言語である日本語文は、目的語が長いところから、最後の動詞が発話されるに到るまで時間がかかる。従って、目的言語である英語文の３番目と４番目の間の無音区間が長くなるので、４番目の文の発話開始、終了時刻が前方に修正されている。この場合、ｘ＝５秒、ｙ＝４０秒、ｋ＝１０秒である。
【００２１】
ここで、音声合成部５１について説明するに、音声合成部５１は目的言語データベース２１内の各翻訳文を発話開始時刻付与部３１或いは発話終了時刻付与部３２で得られた発話開始時刻から音声合成し、或は修正発話時刻付与部４１で得られた発話開始時刻から音声合成する構成を有する。音声合成部５１は、また、原言語データベース１１内の各発話文をその発話開始時刻から音声合成し、原言語文と目的言語文を同時に音声合成する構成を有するものとすることもできる。
【００２２】
以上の音声合成部５１は、目的言語データベース２１内の各翻訳文を発話開始時刻付与部３１或いは発話終了時刻付与部３２で得られた発話開始時刻から音声合成することにより、逐次翻訳音声が得られる。そして、原言語のある一文が長かった場合、逐次翻訳によっては、その翻訳文とその一つ前の翻訳文との間の無音区間が長くなるので、その無音区間を短くした翻訳音声は予測翻訳を取り入れた漸進的翻訳と考えられる。従って、目的言語データベース２１内の各文を、修正発話時刻付与部４１で得られた発話開始時刻から音声合成させることにより、予測翻訳を取り入れた漸進的翻訳音声が得られる。
【００２３】
ところで、この実施例の動作は、音声合成プログラムを記憶した記憶媒体を準備し、図示されている訳ではないが、ＣＰＵにより音声合成プログラムをこの記憶媒体からインストールし、原言語データベース１１および目的言語データベース２１を参照して実施する。この音声合成プログラムは、請求項９に規定される通り原言語の各発話文の発話開始時刻および発話終了時刻の情報を記憶した原言語データベース、原言語の各発話文の翻訳文と翻訳文の発話開始時刻および発話終了時刻の情報とを記憶した目的言語データベース、目的言語データベース内の各翻訳文に対応する原言語の発話文の発話終了時刻に或る時間を加算して当該翻訳文の発話開始時刻を求めるプロセス、目的言語データベース内の各翻訳文の発話時間を算出するプロセス、発話時間と当該翻訳文に付与された発話開始時刻に基づいて当該翻訳文の発話終了時刻を求めるプロセス、求められた各翻訳文の発話開始時刻および発話終了時刻情報に基づいて目的言語データベース内の各発話文間の無音区間の長さを算出するプロセス、算出された無音区間の長さが或る値を超えている場合後の翻訳文に対してその発話開始時刻および発話終了時刻をより前に修正するプロセス、目的言語データベース内の各翻訳文を修正された発話開始時刻から音声合成するプロセスより成るものである。
【００２４】
ここで、原言語文の或る発話文が、「どうしてこういうギャップができたかと申しますと。」であったとする。これに対応する翻訳文として、目的言語データベース２１内に、次に示す直訳表現の翻訳文と、意訳その他の修正処理を施した翻訳文を記憶しておく。
原言語の発話文：どうしてこういうギャップができたかと申しますと。
直訳表現の翻訳文：What is the reason of this gap ？
意訳等を施した翻訳文：How can we have this large gap between the foreigners and Japanese understanding ？
以上の意訳その他の処理を施した翻訳文は、それまでの発話内容の文脈に鑑みて翻訳したものである。以上の直訳表現の翻訳文と意訳その他の処理を施した翻訳文を音声合成部５１により音声合成する。
【００２５】
【発明の効果】
以上の通りであって、この発明は、逐次翻訳音声或いは漸進的翻訳音声を再生する構成を採用することにより、より自然な目的言語の音声合成をすることができる。そして、逐次翻訳音声と予測翻訳を取り入れた漸進的翻訳音声との間の違和感の違いを検証することができる。また、直訳表現の翻訳音声と意訳表現の翻訳音声の間の内容理解の容易性の違いを検証することができる。更に、この発明は、通訳者を志向する者が同時通訳の学習教材として使用することができる。また、同時通訳および機械翻訳の研究者が研究手段として使用することができる。
【図面の簡単な説明】
【図１】実施例を説明する図。
【図２】原言語と目的言語の発話時刻の相互関係を示す図。
【図３】原言語データベースの一例を示す図。
【図４】目的言語データベースの一例を示す図。
【図５】付与された修正時刻を示す図。
【符号の説明】
１１原言語データベース
２１目的言語データベース
３１発話開始時刻付与部
３２発話終了時刻付与部
４１修正発話時刻付与部
５１音声合成部
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speech synthesis apparatus, a speech synthesis method, and a storage medium storing a speech synthesis program. In particular, it relates to an apparatus, method, and storage medium that synthesize speech with the silent sections between sentences of the target language set to appropriate pause times, for example by setting the silent sections that arise in simultaneous interpretation from word-order differences between languages to appropriate pause times so as to improve the naturalness of target-language speech synthesized from a translation of the source language.
[0002]
[Prior art]
No method for verifying whether simultaneously interpreted speech sounds natural to the listener and is easy to understand, and no apparatus for carrying out such a method, has been developed so far.
In simultaneous interpretation, if the translated speech for a semantically coherent source-language utterance is broken up by long silent sections, the listener may find it unnatural. This can happen in sequential translation, where translation begins only after one full sentence of the source language has been uttered, when that sentence is long. It can also happen in gradual translation, where translation begins each time one phrase of the source language is uttered. In Japanese-to-English interpretation, for example, Japanese word order differs from English in that the object precedes the verb, so translation into English cannot proceed until the final verb has been uttered in Japanese, and the utterance timing ends up equivalent to that of sequential translation. In gradual translation too, a long Japanese object lengthens the silent section, and the listener may find it unnatural. In such cases a human interpreter may predict the verb from the subject, from the first phrases of the object, or from the context leading up to the verb, and translate accordingly.
[0003]
[Problems to be solved by the invention]
An object of the present invention is to provide a speech synthesis apparatus, a speech synthesis method, and a storage medium storing a speech synthesis program for use when speech is synthesized with the silent sections between sentences of the target language set to appropriate pause times, for example by setting the silent sections that arise in simultaneous interpretation from word-order differences between languages to appropriate pause times so as to improve the naturalness of target-language speech synthesized from a translation of the source language.
[0004]
[Means for Solving the Problems]
Claim 1: A speech synthesis apparatus is configured that comprises: a source language database 11 storing information on the utterance start time and utterance end time of each utterance sentence in the source language; a target language database 21 storing a translated sentence for each utterance sentence in the source language together with information on the utterance start time and utterance end time of the translated sentence; an utterance start time giving unit 31 that acquires from the source language database 11 the utterance end time of the source-language utterance sentence corresponding to each translated sentence in the target language database 21 and gives the translated sentence, as its utterance start time, the time obtained by adding a certain time to that utterance end time; and a speech synthesis unit 51 that is operated on each translated sentence in the target language database 21 in response to the utterance start time obtained by the utterance start time giving unit 31.
[0005]
Claim 2: A speech synthesis apparatus is configured that comprises: a source language database 11 storing information on the utterance start time and utterance end time of each utterance sentence in the source language; a target language database 21 storing a translated sentence for each utterance sentence in the source language together with information on the utterance start time and utterance end time of the translated sentence; an utterance start time giving unit 31 that acquires from the source language database 11 the utterance end time of the source-language utterance sentence corresponding to each translated sentence in the target language database 21 and gives the translated sentence, as its utterance start time, the time obtained by adding a certain time to that utterance end time; an utterance end time giving unit 32 that calculates the utterance time of each translated sentence in the target language database 21, calculates the utterance end time of the translated sentence from that utterance time and the utterance start time obtained by the utterance start time giving unit 31, and gives it to the translated sentence; and a speech synthesis unit 51 that is operated on each translated sentence in the target language database 21 in response to the utterance start time obtained by the utterance start time giving unit 31 or the utterance end time giving unit 32.
[0006]
Claim 3: In the speech synthesis apparatus according to claim 2, the utterance end time giving unit 32 has a delay processing unit that, when the utterance end time of the immediately preceding translated sentence and the utterance start time of the following translated sentence are in reversed order, delays the utterance start time and utterance end time of the following translated sentence.
Further, Claim 4: A speech synthesis apparatus is configured that comprises: a source language database 11 storing information on the utterance start time and utterance end time of each utterance sentence in the source language; a target language database 21 storing a translated sentence for each utterance sentence in the source language together with information on the utterance start time and utterance end time of the translated sentence; an utterance start time giving unit 31 that acquires from the source language database 11 the utterance end time of the source-language utterance sentence corresponding to each translated sentence in the target language database 21 and gives the translated sentence, as its utterance start time, the time obtained by adding a certain time to that utterance end time; an utterance end time giving unit 32 that calculates the utterance time of each translated sentence in the target language database 21, calculates the utterance end time of the translated sentence from that utterance time and the utterance start time obtained by the utterance start time giving unit 31, and gives it to the translated sentence; a corrected utterance time giving unit 41 that calculates the length of the silent section between utterance sentences in the target language database 21 from the time information obtained by the utterance start time giving unit 31 and the utterance end time giving unit 32 and, when that length exceeds a certain value, gives the following translated sentence an utterance start time and utterance end time corrected to earlier values; and a speech synthesis unit 51 that is operated on each translated sentence in the target language database 21 in response to the utterance start time obtained by the utterance start time giving unit 31, the utterance end time giving unit 32, or the corrected utterance time giving unit 41.
[0007]
Claim 5: In the speech synthesis apparatus according to claim 4, the utterance end time giving unit 32 has a delay processing unit that, when the utterance end time of the immediately preceding translated sentence and the utterance start time of the following translated sentence are in reversed order, delays the utterance start time and utterance end time of the following translated sentence.
Here, Claim 6: A speech synthesis method is configured that uses a source language database 11 storing information on the utterance start time and utterance end time of each utterance sentence in the source language and a target language database 21 storing a translated sentence for each utterance sentence in the source language together with information on the utterance start time and utterance end time of the translated sentence; acquires from the source language database 11 the utterance end time of the source-language utterance sentence corresponding to each translated sentence in the target language database 21 and gives the translated sentence, as its utterance start time, the time obtained by adding a certain time to that utterance end time; and synthesizes each translated sentence in the target language database 21 as speech starting at the utterance start time thus obtained.
[0008]
Claim 7: A speech synthesis method is configured that uses a source language database 11 storing information on the utterance start time and utterance end time of each utterance sentence in the source language and a target language database 21 storing a translated sentence for each utterance sentence in the source language together with information on the utterance start time and utterance end time of the translated sentence; acquires from the source language database 11 the utterance end time of the source-language utterance sentence corresponding to each translated sentence in the target language database 21 and gives the translated sentence, as its utterance start time, the time obtained by adding a certain time to that utterance end time; calculates the utterance time of each translated sentence in the target language database 21 and calculates the utterance end time of the translated sentence from that utterance time and the previously obtained utterance start time; delays the utterance start time and utterance end time of the following translated sentence when the utterance end time of the immediately preceding translated sentence and the utterance start time of the following translated sentence are in reversed order; and synthesizes each translated sentence in the target language database 21 as speech starting at the previously obtained utterance start time or the delayed utterance start time.
[0009]
Claim 8: A speech synthesis method is configured that uses a source language database 11 storing information on the utterance start time and utterance end time of each utterance sentence in the source language and a target language database 21 storing a translated sentence for each utterance sentence in the source language together with information on the utterance start time and utterance end time of the translated sentence; acquires from the source language database 11 the utterance end time of the source-language utterance sentence corresponding to each translated sentence in the target language database 21 and gives the translated sentence, as its utterance start time, the time obtained by adding a certain time to that utterance end time; calculates the utterance time of each translated sentence in the target language database 21 and calculates the utterance end time of the translated sentence from that utterance time and the previously obtained utterance start time; delays the utterance start time and utterance end time of the following translated sentence when the utterance end time of the immediately preceding translated sentence and the utterance start time of the following translated sentence are in reversed order; calculates the length of the silent section between utterance sentences in the target language database 21 and, when that length exceeds a certain value, corrects the utterance start time and utterance end time of the following translated sentence to earlier values; and synthesizes each translated sentence in the target language database 21 as speech starting at the previously obtained utterance start time, the delayed utterance start time, or the utterance start time corrected to an earlier value.
[0010]
Here, Claim 9: A storage medium is configured that stores a speech synthesis program consisting of: a source language database 11 storing information on the utterance start time and utterance end time of each utterance sentence in the source language; a target language database 21 storing a translated sentence for each utterance sentence in the source language together with information on the utterance start time and utterance end time of the translated sentence; a process of obtaining the utterance start time of each translated sentence in the target language database by adding a certain time to the utterance end time of the corresponding source-language utterance sentence; a process of calculating the utterance time of each translated sentence in the target language database 21; a process of obtaining the utterance end time of the translated sentence from the utterance time and the utterance start time given to the translated sentence; a process of calculating the length of the silent section between utterance sentences in the target language database 21 from the obtained utterance start time and utterance end time information of each translated sentence; a process of correcting the utterance start time and utterance end time of the following translated sentence to earlier values when the calculated length of the silent section exceeds a certain value; and a process of synthesizing each translated sentence in the target language database 21 as speech starting at the corrected utterance start time.
[0011]
BEST MODE FOR CARRYING OUT THE INVENTION
An embodiment of the present invention will first be described with reference to FIGS. 1 and 2.
In the embodiment of FIG. 1, 11 is a source language database, 21 is a target language database, 31 is an utterance start time giving unit, 32 is an utterance end time giving unit, 41 is a corrected utterance time giving unit, and 51 is a speech synthesis unit. FIG. 2 is a diagram showing the relationship between utterance times in the source language and the target language. A specific description follows, with reference also to FIGS. 3 to 5.
[0012]
The source language database 11 stores information on each utterance sentence in the source language and its utterance start time and utterance end time. FIG. 3 is a diagram showing an example of the source language database 11, in which an utterance sentence in the source language and information on the utterance start time and utterance end time are stored. The utterance sentence in the source language may be text data or voice data.
The target language database 21 stores a translated sentence corresponding to each utterance sentence in the source language, together with information on its utterance start time and utterance end time. FIG. 4 is a diagram showing an example of the target language database, in which a translated sentence and information on its utterance start time and utterance end time are stored. The translated sentence may be text data or voice data. As translated sentences, the target language database 21 can store, in addition to (a) a literal translation, (b) a translation that has undergone correction processing such as free translation, as well as (c) voice data translated in a normal tone and (d) voice data translated in a tone that reflects the content of the source-language utterance, conveying emotions such as joy, sadness, and anger.
[0013]
The utterance start time giving unit 31 acquires from the source language database 11 the utterance end time of the source-language utterance sentence corresponding to each translated sentence in the target language database 21, and gives the translated sentence, as its utterance start time, the time obtained by adding a certain time to that utterance end time. In FIG. 4, the utterance start time shown for each translated sentence is the utterance end time of the corresponding source-language utterance sentence plus a certain time, 1 msec as an example.
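The start-time rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `assign_start_times`, the use of integer milliseconds, and the sample end times are assumptions made here, not details taken from the patent. Only the 1 msec offset comes from the text.

```python
def assign_start_times(source_end_times_ms, offset_ms=1):
    """Return one target start time per source utterance.

    Each start time is the end time of the corresponding source-language
    utterance plus a fixed offset (1 ms in the example of FIG. 4).
    Times are integer milliseconds to avoid floating-point rounding.
    """
    return [end + offset_ms for end in source_end_times_ms]


# Hypothetical source end times, in milliseconds.
starts = assign_start_times([4000, 9500, 15000])
print(starts)
```

Keeping times as integers is a design choice of this sketch; any consistent time unit would do.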
[0014]
The utterance end time giving unit 32 calculates the utterance time of each translated sentence in the target language database 21, calculates the utterance end time of the translated sentence from that utterance time and the utterance start time obtained by the utterance start time giving unit 31, and gives it to the translated sentence. When the translated sentence is text data, its utterance time can be obtained by having the speech synthesis unit 51 synthesize the sentence and measuring the playback time of the resulting speech. Alternatively, for text data, the speech synthesis unit 51 can decompose the text into phonemes and use the phoneme length information stored in it in advance to calculate an estimate of the utterance time without playing back any speech. When the translated sentence is voice data, utterance time information may already be embedded in it. The utterance time can also be obtained by having the speech synthesis unit 51 play back the voice data and measuring the playback time. In FIG. 4, the value obtained by adding the utterance time calculated in this way to the utterance start time is shown as the utterance end time.
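As a rough illustration of the playback-free estimate, the sketch below sums per-phoneme lengths and adds the result to the start time. The table `PHONEME_MS`, the fallback `DEFAULT_MS`, and both function names are hypothetical. The patent says only that phoneme length information is stored in advance, and gives no values or phoneme inventory.

```python
# Hypothetical per-phoneme lengths in milliseconds (not from the patent).
PHONEME_MS = {"h": 60, "e": 90, "l": 70, "o": 110}
DEFAULT_MS = 80  # assumed fallback for phonemes missing from the table


def estimate_duration_ms(phonemes):
    """Estimate utterance time by summing stored phoneme lengths."""
    return sum(PHONEME_MS.get(p, DEFAULT_MS) for p in phonemes)


def assign_end_time(start_ms, phonemes):
    """End time = start time + estimated utterance time."""
    return start_ms + estimate_duration_ms(phonemes)


end = assign_end_time(4001, ["h", "e", "l", "l", "o"])
print(end)
```

The decomposition of text into phonemes is taken as given here; in practice it would come from the synthesizer's own front end.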
[0015]
Here, when the utterance end time giving unit 32 gives utterance end times, the utterance end time of one translated sentence may turn out to be later than the utterance start time of the next translated sentence. In this case, one approach is to allow the utterances to overlap and leave the times uncorrected, and another is to disallow overlap and correct the times. The method below makes the following translated sentence start immediately after the preceding translated sentence ends. That is, the utterance end time giving unit 32 has a delay processing unit that, when the utterance end time of the immediately preceding translated sentence and the utterance start time of the following translated sentence are in reversed order, delays the utterance start time and utterance end time of the following translated sentence.
[0016]
When N is the total number of sentences, the utterance time is corrected by the following routine.
Routine contents
(1) Set i = 1.
(2) If i < N, proceed to (3). If i = N, end.
(3) If the utterance start time of the (i+1)-th translated sentence is f and the utterance end time of the i-th translated sentence is (f + e) with e > 0, delay the utterance start time and utterance end time of the (i+1)-th translated sentence by e.
(4) Set i = i + 1. Proceed to (2).
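The delay routine above might be written as follows. This is a minimal sketch under stated assumptions: utterances are represented as `[start_ms, end_ms]` pairs in utterance order, the function name `delay_overlaps` is invented for illustration, and the input list is copied rather than modified in place.

```python
def delay_overlaps(utterances):
    """Remove overlaps by delaying each later utterance.

    utterances: list of [start_ms, end_ms] pairs in utterance order.
    If utterance i ends e > 0 ms after utterance i+1 starts, both the
    start and end of utterance i+1 are delayed by e, so it begins
    exactly when utterance i ends. A delay applied to one sentence is
    visible when the next pair is examined, as in the sequential routine.
    """
    result = [list(u) for u in utterances]
    for i in range(len(result) - 1):
        e = result[i][1] - result[i + 1][0]
        if e > 0:
            result[i + 1][0] += e
            result[i + 1][1] += e
    return result


fixed = delay_overlaps([[0, 5000], [4000, 6000], [6500, 7000]])
print(fixed)
```

In this example the second sentence is delayed by 1000 ms, which then pushes its end past the start of the third sentence, so the third is delayed as well.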
[0017]
Referring to FIG. 2, the corrected utterance time giving unit 41 calculates the length of the silent section between translated sentences in the target language database 21 from the time information obtained by the utterance start time giving unit 31 and the utterance end time giving unit 32 and, when this length exceeds a certain value, gives the following sentence an utterance start time and utterance end time corrected to earlier values. The solid lines indicate the utterance section of each sentence, and the numbers i and (i + 1) attached to the solid lines indicate which source-language sentence and target-language sentence correspond. m represents the time that the utterance start time giving unit 31 adds to the utterance end time of the source-language utterance sentence. However, when the corrected utterance time giving unit 41 has corrected the utterance start and end times of a target-language sentence, the value of m is not necessarily constant.
[0018]
When N is the total number of sentences, the time correction in the corrected utterance time giving unit 41 is performed by the following routine.
Routine contents
(1) Set i = 1.
(2) If i < N, proceed to (3). If i = N, end.
(3) Let x be the silent time between source-language sentences i and (i + 1), and let y be the silent time between target-language sentences i and (i + 1) minus x. If y is equal to or greater than a certain threshold k, shift the utterance start and end times of target-language sentence (i + 1) earlier by the same amount so that y becomes k. However, if this would make the utterance start time of target-language sentence (i + 1) earlier than the utterance start time of source-language sentence (i + 1), instead shift the utterance start and end times of target-language sentence (i + 1) earlier by the same amount so that its utterance start time equals the utterance start time of source-language sentence (i + 1).
(4) Set i = i + 1. Proceed to (2).
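The correction routine above can be sketched as below. The representation of sentences as `(start_ms, end_ms)` pairs and the name `shorten_gaps` are assumptions for illustration. The sketch also reads the "however" clause as replacing the shift amount, so that the target sentence never starts before its own source sentence starts.

```python
def shorten_gaps(source, target, k_ms):
    """Shorten over-long silences between target-language sentences.

    source, target: lists of (start_ms, end_ms), index-aligned.
    For each adjacent pair, x is the source-side silence and y is the
    target-side silence minus x. If y >= k_ms, sentence i+1 is moved
    earlier so that y becomes k_ms, but never to before the start of
    source sentence i+1.
    """
    tgt = [list(t) for t in target]
    for i in range(len(tgt) - 1):
        x = source[i + 1][0] - source[i][1]
        y = (tgt[i + 1][0] - tgt[i][1]) - x
        if y >= k_ms:
            shift = y - k_ms
            earliest = source[i + 1][0]
            if tgt[i + 1][0] - shift < earliest:
                # Clamp: start together with the source sentence.
                shift = tgt[i + 1][0] - earliest
            tgt[i + 1][0] -= shift
            tgt[i + 1][1] -= shift
    return tgt


# Hypothetical timings chosen so that x = 5 s, y = 40 s, k = 10 s.
src = [(0, 20000), (25000, 70000)]
tgt = [(20001, 25000), (70000, 75000)]
print(shorten_gaps(src, tgt, 10000))
```

With these invented timings the second target sentence moves 30 s earlier, leaving a 15 s silence (x + k) before it.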
[0019]
As described above, by providing a configuration that calculates the length of the silent section between utterance sentences in the target language database 21 from the time information obtained by the utterance start time giving unit 31 and the utterance end time giving unit 32 and, when that length exceeds a certain value, gives the following sentence an utterance start time and utterance end time corrected to earlier values, speech can be synthesized with the silent sections between sentences of the target language set to appropriate pause times.
[0020]
FIG. 5 shows, for the target language database 21, the corrected times given by the corrected utterance time giving unit 41, separately from the utterance start and end times given by the utterance start time giving unit 31 and the utterance end time giving unit 32. The fourth source-language sentence, a Japanese sentence, has a long object, so it takes a long time until the final verb is uttered. The silent section between the third and fourth sentences of the target language, English, therefore becomes long, and the utterance start and end times of the fourth sentence are corrected to earlier values. In this case, x = 5 seconds, y = 40 seconds, and k = 10 seconds.
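The arithmetic implied by these values can be worked through as follows. The symbols g (target-side silence) and Δ (shift amount) are introduced here for illustration and do not appear in the patent. The calculation also assumes that the limit on moving the target sentence before its source sentence does not come into play, which cannot be confirmed without the actual times in FIG. 5.

```latex
\begin{aligned}
g_{\text{before}} &= x + y = 5 + 40 = 45\ \text{s} \\
\Delta &= y - k = 40 - 10 = 30\ \text{s} \\
g_{\text{after}} &= g_{\text{before}} - \Delta = x + k = 5 + 10 = 15\ \text{s}
\end{aligned}
```

Under these assumptions, the fourth English sentence would start and end 30 seconds earlier than originally assigned.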
[0021]
Here, the speech synthesis unit 51 will be described. The speech synthesis unit 51 is configured to synthesize each translated sentence in the target language database 21 starting at the utterance start time obtained by the utterance start time giving unit 31 or the utterance end time giving unit 32, or starting at the utterance start time obtained by the corrected utterance time giving unit 41. The speech synthesis unit 51 may also be configured to synthesize each utterance sentence in the source language database 11 starting at its utterance start time, so that the source-language sentences and the target-language sentences are synthesized simultaneously.
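One way to picture the simultaneous case is as a single playback timeline that interleaves both languages by start time. The sketch below only builds that ordered event list. The function `build_timeline`, the tuple layout, and the sample data are assumptions, and no actual audio synthesis or playback is performed.

```python
def build_timeline(source, target):
    """Merge source and target utterances into one start-ordered list.

    source, target: lists of (start_ms, text).
    Returns (start_ms, language, text) tuples sorted by start time.
    Python's sort is stable, so on equal start times the source
    utterance is listed before the target utterance.
    """
    events = [(start, "source", text) for start, text in source]
    events += [(start, "target", text) for start, text in target]
    return sorted(events, key=lambda event: event[0])


timeline = build_timeline(
    [(0, "s1"), (5000, "s2")],
    [(4001, "t1"), (9001, "t2")],
)
for start_ms, language, text in timeline:
    print(start_ms, language, text)
```

A real player would hand each event to a synthesizer at its start time; that scheduling step is outside this sketch.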
[0022]
By synthesizing each translated sentence in the target language database 21 from the utterance start time obtained by the utterance start time giving unit 31 or the utterance end time giving unit 32, the speech synthesis unit 51 described above yields sequentially translated speech. When one sentence in the source language is long, sequential translation lengthens the silent section between its translation and the preceding translation, so translated speech in which that silent section has been shortened can be regarded as gradual translation incorporating predictive translation. Accordingly, by synthesizing each sentence in the target language database 21 from the utterance start time obtained by the corrected utterance time giving unit 41, gradually translated speech incorporating predictive translation is obtained.
[0023]
In operating this embodiment, a storage medium storing a speech synthesis program is prepared. Although not shown, the speech synthesis program is installed from this storage medium by the CPU and is executed with reference to the source language database 11 and the target language database 21. This speech synthesis program comprises: a source language database storing information on the utterance start time and utterance end time of each utterance sentence in the source language; a target language database storing the translated sentence of each utterance sentence in the source language and information on the utterance start time and utterance end time of the translated sentence; a process of adding a certain time to the utterance end time of the source language utterance sentence corresponding to each translated sentence in the target language database to obtain the utterance start time of the translated sentence; a process of calculating the utterance time of each translated sentence in the target language database; a process of obtaining the utterance end time of the translated sentence from the utterance time and the utterance start time given to the translated sentence; a process of calculating the length of the silent section between utterance sentences in the target language database from the obtained utterance start time and utterance end time of each translated sentence; a process of correcting the utterance start time and utterance end time of the later translated sentence to be earlier when the calculated length of the silent section exceeds a certain value; and a process of synthesizing speech for each translated sentence in the target language database starting at its corrected utterance start time.
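The sequence of processes listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention. The names `Translation` and `assign_times` are hypothetical. The parameters are assumed to mean: `x` is the time added to the source utterance end time, `y` is the silent-section threshold, and `k` is the amount by which a late sentence is moved earlier. That reading of `k`, and the rule of delaying an overlapping sentence to exactly the end of the preceding one (the reversal handling recited in the claims), are assumptions made for this sketch. Corrected times are kept separately from the uncorrected ones.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Translation:
    source_end: float   # utterance end time of the source sentence (seconds)
    duration: float     # utterance time of the translated sentence (seconds)
    start: float = 0.0
    end: float = 0.0
    corrected_start: Optional[float] = None
    corrected_end: Optional[float] = None


def assign_times(items: List[Translation], x: float, y: float,
                 k: float) -> List[Translation]:
    # Pass 1: start = source end + x, end = start + duration.
    # Assumed reversal rule: if a sentence would start before the
    # previous one ends, delay it to the previous end time.
    prev_end = None
    for t in items:
        t.start = t.source_end + x
        if prev_end is not None and t.start < prev_end:
            t.start = prev_end
        t.end = t.start + t.duration
        prev_end = t.end

    # Pass 2: measure each silent section on the pass-1 times and,
    # when it exceeds y, store times moved earlier by k (assumed rule),
    # never by more than the silent section itself.
    for prev, cur in zip(items, items[1:]):
        gap = cur.start - prev.end
        if gap > y:
            shift = min(k, gap)
            cur.corrected_start = cur.start - shift
            cur.corrected_end = cur.end - shift
    return items
```

For instance, calling `assign_times(items, x=5, y=40, k=10)` with the values quoted for FIG. 5 leaves short silent sections untouched and gives corrected times only to a sentence that follows a silent section longer than 40 seconds.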
[0024]
Here, suppose that a certain utterance sentence in the source language is "How did such a gap occur?" As its corresponding translations, a literal translation and a translation that has undergone free translation and other correction processing are stored in the target language database 21, as follows.
Source language utterance: How did such a gap occur?
Literal translation: What is the reason of this gap?
Free translation with other processing: How can we have this large gap between the foreigners and Japanese understanding?
The translated sentence that has undergone free translation and other processing is translated in light of the context of the utterances up to that point. The speech synthesis unit 51 synthesizes speech for both the literal translation and the translation that has undergone free translation and other processing.
[0025]
[Effects of the Invention]
As described above, according to the present invention, adopting a configuration that reproduces sequentially translated speech or progressively translated speech makes it possible to synthesize more natural speech in the target language. It also becomes possible to verify the difference in the sense of unnaturalness between sequentially translated speech and progressively translated speech that incorporates predictive translation. In addition, it is possible to verify the difference in ease of understanding between translated speech using a literal translation and translated speech using a free translation. Further, the present invention can be used as learning material for simultaneous interpretation by those who intend to become interpreters. It can also be used as a research tool by researchers in simultaneous interpretation and machine translation.
[Brief description of the drawings]
FIG. 1 illustrates an embodiment.
FIG. 2 is a diagram showing a mutual relationship between utterance times of a source language and a target language.
FIG. 3 is a diagram showing an example of a source language database.
FIG. 4 is a diagram showing an example of a target language database.
FIG. 5 is a diagram showing an assigned correction time.
[Explanation of symbols]
11 Source language database
21 Target language database
31 Utterance start time giving unit
32 Utterance end time giving unit
41 Corrected utterance time giving unit
51 Speech synthesis unit

Claims

A speech synthesizer comprising:
a source language database storing information on the utterance start time and utterance end time of each utterance sentence in the source language;
a target language database storing a translated sentence of each utterance sentence in the source language and information on the utterance start time and utterance end time of the translated sentence;
an utterance start time giving unit that obtains, from the source language database, the utterance end time of the source language utterance sentence corresponding to each translated sentence in the target language database, and gives to the translated sentence, as its utterance start time, the time obtained by adding a certain time to that utterance end time; and
a speech synthesis unit that is operated, for each translated sentence in the target language database, in response to the utterance start time obtained by the utterance start time giving unit.

A speech synthesizer comprising:
a source language database storing information on the utterance start time and utterance end time of each utterance sentence in the source language;
a target language database storing a translated sentence of each utterance sentence in the source language and information on the utterance start time and utterance end time of the translated sentence;
an utterance start time giving unit that obtains, from the source language database, the utterance end time of the source language utterance sentence corresponding to each translated sentence in the target language database, and gives to the translated sentence, as its utterance start time, the time obtained by adding a certain time to that utterance end time;
an utterance end time giving unit that calculates the utterance time of each translated sentence in the target language database, calculates the utterance end time of the translated sentence from that utterance time and the utterance start time obtained by the utterance start time giving unit, and gives the utterance end time to the translated sentence; and
a speech synthesis unit that is operated, for each translated sentence in the target language database, in response to the utterance start time obtained by the utterance start time giving unit or the utterance end time giving unit.

The speech synthesizer according to claim 2,
wherein the utterance end time giving unit delays the utterance start time and utterance end time of the subsequent translated sentence in response to a reversal between the utterance end time of the immediately preceding translated sentence and the utterance start time of the subsequent translated sentence.

A speech synthesizer comprising:
a source language database storing information on the utterance start time and utterance end time of each utterance sentence in the source language;
a target language database storing a translated sentence of each utterance sentence in the source language and information on the utterance start time and utterance end time of the translated sentence;
an utterance start time giving unit that obtains, from the source language database, the utterance end time of the source language utterance sentence corresponding to each translated sentence in the target language database, and gives to the translated sentence, as its utterance start time, the time obtained by adding a certain time to that utterance end time;
an utterance end time giving unit that calculates the utterance time of each translated sentence in the target language database, calculates the utterance end time of the translated sentence from that utterance time and the utterance start time obtained by the utterance start time giving unit, and gives the utterance end time to the translated sentence;
a corrected utterance time giving unit that calculates, from the time information obtained by the utterance start time giving unit and the utterance end time giving unit, the length of the silent section between utterance sentences in the target language database and, when the length exceeds a certain value, gives to the later translated sentence values obtained by correcting its utterance start time and utterance end time to be earlier; and
a speech synthesis unit that is operated, for each translated sentence in the target language database, in response to the utterance start time obtained by the utterance start time giving unit, the utterance end time giving unit, or the corrected utterance time giving unit.

The speech synthesizer according to claim 4,
wherein the utterance end time giving unit delays the utterance start time and utterance end time of the subsequent translated sentence in response to a reversal between the utterance end time of the immediately preceding translated sentence and the utterance start time of the subsequent translated sentence.

A speech synthesis method using a source language database storing information on the utterance start time and utterance end time of each utterance sentence in the source language, and a target language database storing a translated sentence of each utterance sentence in the source language and information on the utterance start time and utterance end time of the translated sentence, the method comprising:
obtaining, from the source language database, the utterance end time of the source language utterance sentence corresponding to each translated sentence in the target language database, and giving to the translated sentence, as its utterance start time, the time obtained by adding a certain time to that utterance end time; and
synthesizing speech for each translated sentence in the target language database starting at the utterance start time thus obtained.

A speech synthesis method using a source language database storing information on the utterance start time and utterance end time of each utterance sentence in the source language, and a target language database storing a translated sentence of each utterance sentence in the source language and information on the utterance start time and utterance end time of the translated sentence, the method comprising:
obtaining, from the source language database, the utterance end time of the source language utterance sentence corresponding to each translated sentence in the target language database, and giving to the translated sentence, as its utterance start time, the time obtained by adding a certain time to that utterance end time;
calculating the utterance time of each translated sentence in the target language database, and calculating the utterance end time of the translated sentence from that utterance time and the previously obtained utterance start time;
delaying the utterance start time and utterance end time of the subsequent translated sentence in response to a reversal between the utterance end time of the immediately preceding translated sentence and the utterance start time of the subsequent translated sentence; and
synthesizing speech for each translated sentence in the target language database starting at the previously obtained utterance start time or the delayed utterance start time.

A speech synthesis method using a source language database storing information on the utterance start time and utterance end time of each utterance sentence in the source language, and a target language database storing a translated sentence of each utterance sentence in the source language and information on the utterance start time and utterance end time of the translated sentence, the method comprising:
obtaining, from the source language database, the utterance end time of the source language utterance sentence corresponding to each translated sentence in the target language database, and giving to the translated sentence, as its utterance start time, the time obtained by adding a certain time to that utterance end time;
calculating the utterance time of each translated sentence in the target language database, and calculating the utterance end time of the translated sentence from that utterance time and the previously obtained utterance start time;
delaying the utterance start time and utterance end time of the subsequent translated sentence in response to a reversal between the utterance end time of the immediately preceding translated sentence and the utterance start time of the subsequent translated sentence;
calculating the length of the silent section between utterance sentences in the target language database and, when the length exceeds a certain value, correcting the utterance start time and utterance end time of the later translated sentence to be earlier; and
synthesizing speech for each translated sentence in the target language database starting at the previously obtained utterance start time, the delayed utterance start time, or the utterance start time corrected to be earlier.

A storage medium storing a speech synthesis program, the speech synthesis program comprising:
a source language database storing information on the utterance start time and utterance end time of each utterance sentence in the source language;
a target language database storing a translated sentence of each utterance sentence in the source language and information on the utterance start time and utterance end time of the translated sentence;
a process of adding a certain time to the utterance end time of the source language utterance sentence corresponding to each translated sentence in the target language database to obtain the utterance start time of the translated sentence;
a process of calculating the utterance time of each translated sentence in the target language database;
a process of obtaining the utterance end time of the translated sentence from the utterance time and the utterance start time given to the translated sentence;
a process of calculating the length of the silent section between utterance sentences in the target language database from the obtained utterance start time and utterance end time of each translated sentence;
a process of correcting the utterance start time and utterance end time of the later translated sentence to be earlier when the calculated length of the silent section exceeds a certain value; and
a process of synthesizing speech for each translated sentence in the target language database starting at its corrected utterance start time.