JP4332323B2

JP4332323B2 - Speech synthesis method and apparatus and dictionary generation method and apparatus

Info

Publication number: JP4332323B2
Application number: JP2002164624A
Authority: JP
Inventors: 雅章山田; 俊明深田; 康弘小森
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2002-06-05
Filing date: 2002-06-05
Publication date: 2009-09-16
Anticipated expiration: 2022-06-05
Also published as: JP2004012700A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声を合成する音声合成装置および方法に関する。
【０００２】
【従来の技術】
従来より、所望の合成音声を得るための音声合成方法として、あらかじめ収録し蓄えられた音声素片を複数の微細素片に分割し、分割の結果得られた微細素片の再配置を行って所望の合成音声を得る方法がある。これら微細素片の再配置において、微細素片に対して間隔変更・繰り返し・間引き等の処理が行われることにより、所望の時間長・基本周波数を持つ合成音声が得られる。
【０００３】
図１０は、音声波形を微細素片に分割する方法を模式的に示した図である。図１０に示された音声波形は、切り出し窓関数（以下、窓関数）によって微細素片に分割される。このとき、有声音の部分（音声波形の後半部）では原音声のピッチ間隔に同期した窓関数が用いられる。一方、無声音の部分では、適当な間隔の窓関数が用いられる。
【０００４】
そして、図１０に示すようにこれらの微細素片を間引いて用いることにより音声の継続時間長を短縮することができる。一方、これらの微細素片を繰り返して用いれば、音声の継続時間長を伸長することができる。更に、図１０に示すように、有声音の部分では、微細素片の間隔を詰めることにより合成音声の基本周波数を上げることが可能となる。一方、微細素片の間隔を広げることにより合成音声の基本周波数を下げることが可能である。
【０００５】
以上のような繰り返し・間引き・間隔変更を行なって再配置された微細素片を再び重畳することにより所望の合成音声が得られる。なお、音声素片を収録・蓄積する単位としては、音素やＣＶ・ＶＣあるいはＶＣＶといった単位が用いられる。ＣＶ・ＶＣは音素内に素片境界を置いた単位、ＶＣＶは母音内に素片境界を置いた単位である。
【０００６】
【発明が解決しようとする課題】
しかしながら、上記従来法においては、音声波形から微細素片を得るために窓関数が適用されることにより、音声のスペクトルに所謂「ぼやけ」が生じてしまう。すなわち、音声のホルマントが広がったりスペクトル包絡の山谷が曖昧になる等の現象が起こり、合成音声の音質が低下することになる。
【０００７】
本発明は上記の課題に鑑みてなされたものであり、微細素片を得るために適用した窓関数による音声のスペクトルの「ぼやけ」を軽減し、高音質な音声合成を実現することを目的とする。
【０００８】
【課題を解決するための手段】
上記の目的を達成するための本発明による音声合成方法は、
音声波形データと窓関数とから微細素片を取得する取得工程と、
前記取得工程で取得された微細素片を、合成時の韻律を変更するべく再配置する再配置工程と、
前記再配置工程で再配置された微細素片を重畳して得られる重畳波形データに基づいて合成音声波形データを出力する合成工程と、
前記音声波形データに基づいて構成された、前記微細素片を取得するために適用した前記窓関数による音声のスペクトルのぼやけを軽減するスペクトル補正フィルタを、前記取得工程で取得された微細素片に対して作用させる補正工程とを備える。
【０００９】
また、上記の目的を達成するための本発明による音声合成方法は以下の構成を備える。
すなわち、
音声波形データと窓関数とから微細素片を取得する取得工程と、
前記取得工程で取得された微細素片を、合成時の韻律を変更するべく再配置する再配置工程と、
前記再配置工程で再配置された微細素片を重畳して得られる重畳波形データに基づいて合成音声波形データを出力する合成工程とを備え、
前記取得工程で処理される音声波形データに基づいて構成された、前記微細素片を取得するために適用した前記窓関数による音声のスペクトルのぼやけを軽減するスペクトル補正フィルタを分解して得られる複数の要素フィルタのそれぞれを、前記取得工程、再配置工程、合成工程を含む処理過程中の複数個所において作用させる。
【００１０】
また、本発明によれば、上記音声合成方法或は音声合成装置に好適な音声合成用の辞書生成方法が提供される。
【００１１】
【発明の実施の形態】
以下、添付の図面を参照して本発明の好適な実施形態のいくつかについて詳細に説明する。
【００１２】
〈第１実施形態〉
図１は第１実施形態におけるハードウェア構成を示すブロック図である。
【００１３】
図１において、１１は中央処理装置であり、数値演算・制御等の処理を行なう。特に、中央処理装置１１は、以下に説明する手順に従った音声合成処理を実行する。１２は出力装置であり、中央処理装置１１の制御下でユーザに対して各種の情報を提示する。１３はタッチパネル或はキーボード等を備えた入力装置であり、ユーザが本装置に対して動作の指示を与えたり、各種の情報を入力するのに用いられる。１４は音声を出力する音声出力装置であり音声合成された内容を出力する。
【００１４】
１５はディスク装置や不揮発メモリ等の記憶装置であり、音声合成用辞書５０１等が保持される。１６は読み取り専用の記憶装置であり、本実施形態の音声合成処理の手順や、必要な固定的データが格納される。１７はＲＡＭ等の一時情報を保持する記憶装置であり、一時的なデータや各種フラグ等が保持される。以上の各構成（１１〜１７）は、バス１８によって接続されている。なお、本実施形態ではＲＯＭ１６に音声合成処理のための制御プログラムが格納され、中央処理装置１１がこれを実行する形態とするが、そのような制御プログラムを外部記憶装置１５に格納しておき、実行に際してＲＡＭ１７にロードするような形態としてもよい。
【００１５】
以上のような構成を備えた本実施形態の音声出力装置の動作について、図２及び図３を参照して以下に説明する。図２は第１実施形態による音声出力処理を説明するフローチャートである。また、図３は第１実施形態の音声合成処理の様子を表す図である。
【００１６】
まず、韻律目標値取得ステップＳ１において、合成音声の目標韻律値を取得する。合成音声の目標韻律値は、歌声合成の様に直接上位モジュールから与えられる場合もあれば、何らかの手段を用いて推定される場合もある。例えば、テキストからの音声合成であるならばテキストの言語解析結果より推定される。
【００１７】
次に、波形データ取得ステップＳ２において、合成音声の元となる波形データ（図３の音声波形３０１）を取得する。そして、音響分析ステップＳ３において、線形予測（ＬＰＣ）分析・ケプストラム分析・一般化ケプストラム分析等の音響分析を取得した波形データについて行い、スペクトル補正フィルタ３０４を構成するのに必要なパラメータを計算する。なお波形データの分析は、ある定められた時間間隔で行なっても良いし、ピッチ同期分析を行なっても良い。
【００１８】
次に、スペクトル補正フィルタ構成ステップＳ４において、前記音響分析ステップＳ３で計算されたパラメータを用いてスペクトル補正フィルタを構成する。例えば、前記音響分析にｐ次の線形予測分析を用いた場合には、以下の［数１］式で表される特性を持ったフィルタをスペクトル補正フィルタ３０４として用いる。なお、［数１］式を用いる場合、上記パラメータ計算においては線形予測係数α_jが算出されることになる。
【００１９】
【数１】

【００２０】
また、ｐ次のケプストラム分析を用いた場合には、以下の［数２］式で表される特性を持ったフィルタをスペクトル補正フィルタとして用いる。なお、［数２］式を用いる場合、上記パラメータ計算においてはケプストラム係数ｃ_jが算出されることになる。
【００２１】
【数２】

【００２２】
上記各式において、μ、γは適当な係数、αは線形予測係数、ｃはケプストラム係数である。あるいは、上記フィルタのインパルス応答を適当な次数で打ち切って構成した、以下の［数３］式で表されるＦＩＲフィルタが用いられる場合もある。なお、［数３］式を用いる場合、上記パラメータ計算においては係数β_jが計算されることになる。
【００２３】
【数３】

【００２４】
なお、実際には、上記の各式において、システムのゲインを考慮する必要がある。以上のようにして構成されたスペクトル補正フィルタは音声合成用辞書５０１に格納される（実際にはフィルタの係数を格納することになる）。
【００２５】
次に、微細素片切り出しステップＳ５において、前記波形データ取得ステップＳ２で取得した波形に窓関数３０２を適用し、微細素片３０３を切り出す。窓関数としてはハニング窓等が用いられる。
【００２６】
次に、微細素片スペクトル補正ステップＳ６において、微細素片切り出しステップＳ５で切り出した微細素片３０３に対して、スペクトル補正フィルタ構成ステップＳ４で構成されたフィルタ３０４を適用し、微細素片切り出しステップＳ５で切り出した微細素片のスペクトルを補正する。こうして、スペクトル補正された微細素片３０５が取得される。
【００２７】
次に、韻律変更ステップＳ７において、微細素片スペクトル補正ステップＳ６でスペクトル補正された微細素片３０５を、韻律目標値取得ステップＳ１で取得した韻律目標値に合致するように、間引き・繰り返し・間隔変更して再配置（３０６）する。そして波形重畳ステップＳ８において、韻律変更ステップＳ７で再配置した微細素片を重畳し、合成音声３０７を得る。なお、ステップＳ８で得られるのは音声素片であるので、実際の合成音声は波形重畳ステップＳ８で得られた複数の音声素片を接続して得られる。すなわち、音声出力ステップＳ９において、波形重畳ステップＳ８で得られた音声素片を接続して合成音声を出力する。
【００２８】
なお、微細素片の再配置処理に関して、「間引き」については、図３に示すようにスペクトル補正フィルタを作用させる前に実行するようにしてもよい。このようにすれば、不要な微細素片についてフィルタ処理を施すという無駄な処理を省くことができるからである。
【００２９】
〈第２実施形態〉
上記第１実施形態においてはスペクトル補正フィルタを音声合成時に構成しているが、スペクトル補正フィルタの構成を音声合成に先立って行い、フィルタを構成するための構成情報（フィルタ係数）を所定の記憶領域に保持しておくようにしてもよい。すなわち、第１実施形態のプロセスをデータ作成（図４）と音声合成（図５）の２つのプロセスに分離することが可能である。第２実施形態ではこの場合の処理について説明する。なお、本処理を実現するための装置構成は第１実施形態（図１）と同様である。また、本実施形態では、構成情報を音声合成用辞書５０１に格納することとする。
【００３０】
図４のフローチャートにおいて、ステップＳ２、Ｓ３、Ｓ４は第１実施形態（図２）と同様である。そして、スペクトル補正フィルタ記録ステップＳ１０１では、スペクトル補正フィルタ構成ステップＳ４で構成されたスペクトル補正フィルタのフィルタ係数を外部記憶装置１５に記録する。本実施形態では、音声合成用辞書５０１に登録された各波形データについてスペクトル補正フィルタを構成し、各波形データに対応するフィルタの係数をスペクトル補正フィルタとして音声合成用辞書５０１内に保持する。すなわち、第２実施形態の音声合成用辞書５０１には、各音声波形の波形データとスペクトル補正フィルタが登録されていることになる。
【００３１】
一方、音声合成時においては、図５のフローチャートに示されるように、第１実施形態の処理における音響分析ステップＳ３およびスペクトル補正フィルタ構成ステップＳ４が不要となり、代りにスペクトル補正フィルタ読込みステップＳ１０２が追加される。スペクトル補正フィルタ読込みステップＳ１０２では、スペクトル補正フィルタ記録ステップＳ１０１で記録したスペクトル補正フィルタ係数を読み込む。すなわち、波形データ取得ステップＳ２で取得された波形データに対応するスペクトル補正フィルタの係数を音声合成用辞書５０１から読み込んでスペクトル補正フィルタを構成する。そして、微細素片スペクトル補正ステップＳ６では、スペクトル補正フィルタ読込みステップＳ１０２で読込まれたスペクトル補正フィルタを用いて微細素片の処理が行われる。
【００３２】
以上のように、予め全ての波形データについてスペクトル補正フィルタを記録しておくことにより、音声合成時にスペクトル補正フィルタを構成する必要がなくなる。このため、第１実施形態に比べて音声合成時の処理量を軽減することが可能となる。
【００３３】
〈第３実施形態〉
上記第１及び第２実施形態では、スペクトル補正フィルタ構成ステップＳ４で構成されたフィルタを微細素片切り出しステップＳ５で切り出された微細素片に適用していた。しかし、スペクトル補正フィルタを前記波形データ取得ステップＳ２で取得した波形データ（音声波形３０１）に対して適用しても良い。第３実施形態ではこのようは音声合成処理について説明する。なお、本処理を実現するための装置構成は第１実施形態（図１）と同様である。
【００３４】
図６は第３実施形態による音声合成処理を説明するフローチャートである。図６において、波形データ取得ステップＳ２〜スペクトル補正フィルタ構成ステップＳ４の各ステップは上記第２実施形態と同様である。第３実施形態では、スペクトル補正フィルタ構成ステップＳ４によってスペクトル補正フィルタを構成した後、波形データスペクトル補正ステップＳ２０１において、波形データ取得ステップＳ２で取得した波形データに対してスペクトル補正フィルタ構成ステップＳ４で構成したスペクトル補正フィルタを適用し、波形データのスペクトルを補正する。
【００３５】
次に、スペクトル補正波形データ記録ステップＳ２０２において、波形データスペクトル補正ステップＳ２０１でスペクトル補正された波形データを記録する。すなわち、第２実施形態では、図１の音声合成用辞書５０１において、「スペクトル補正フィルタ」の代わりに「スペクトル補正された波形データ」が記憶されることになる。
【００３６】
一方、音声合成処理においては、図７のフローチャートに示される処理が実行される。第３実施形態では、上述の各実施形態における波形データ取得ステップＳ２の代りにスペクトル補正波形データ取得ステップＳ２０３が設けられる。これにより、スペクトル補正波形データ記録ステップＳ２０２で記録されたスペクトル補正後の波形データを、ステップＳ５における微細素片の切り出しの対象として取得させる。そして、この取得された波形データについて微細素片の切り出し、再配置が行なわれることで、スペクトル補正が施された合成音声を得ることになる。なお、スペクトル補正された波形データを用いるので、微細素片に対するスペクトル補正処理（第１、第２実施形態のステップＳ６）は不要となっている。
【００３７】
第３実施形態のように、微細素片ではなく波形データに対してスペクトル補正フィルタを適用した場合、微細素片切り出しステップＳ５にて用いられる窓関数の影響を完全に排除することは出来ない。すなわち、上記第１及び第２実施形態と比べて音質は若干劣ってしまう。しかし、スペクトル補正フィルタによるフィルタリングまでを音声合成に先立って行なうことが出来るため、音声合成時（図７）の処理量は第１、第２実施形態に比べて大幅に削減されるという特長がある。
【００３８】
尚、第３実施形態では、第２実施形態のように、データ作成と音声合成の２つのプロセスに分けた構成を説明したが、第１実施形態のように合成処理を実行する毎にフィルタリングを行なうように構成することもできる。この場合、図２のフローチャートにおいて、ステップＳ４とステップＳ５の間で合成処理対象の波形データにスペクトル補正フィルタを作用させることになる。また、ステップＳ６は不要となる。
【００３９】
〈第４実施形態〉
第１、第２実施形態では、スペクトル補正フィルタ構成ステップＳ４で構成されたフィルタを微細素片切り出しステップＳ５で切り出された微細素片に適用した。また、第３実施形態では、スペクトル補正フィルタ構成ステップＳ４で構成されたフィルタを、微細素片に切り出される前の波形データに適用した。これらに対して、スペクトル補正フィルタを波形重畳ステップＳ８で合成した合成音声の波形データに対して適用することもできる。第４実施形態ではこの場合の処理について説明する。なお、本処理を実現するための装置構成は第１実施形態（図１）と同様である。
【００４０】
図８は第４実施形態による音声合成処理を説明するフローチャートである。第１実施形態の処理（図２）と同様の処理には同一の参照番号が付されている。第４実施形態では、図８に示されるように、波形重畳ステップＳ８の後に合成音声スペクトル補正ステップＳ３０１を設け、微細素片スペクトル補正ステップＳ６を廃する。合成音声スペクトル補正ステップＳ３０１では、スペクトル補正フィルタ構成ステップＳ４において構成されたフィルタを、波形重畳ステップＳ８で得られた合成音声の波形データに適用し、スペクトル補正を行なう。
【００４１】
以上の第４実施形態によれば、韻律変更ステップＳ７の結果、同一微細素片の繰り返し回数が少ない場合等においては、第１実施形態に比べて処理量が少なくなる。
【００４２】
また、本実施形態においても、スペクトル補正フィルタをあらかじめ構成しておくことが可能な点は、第１及び第２実施形態との関係と同様である。即ち、予めフィルタ係数を音声合成用辞書５０１に格納しておき、音声合成時にはこれを読出してスペクトル補正用フィルタを構成し、ステップＳ８で波形重畳された波形データに作用させる。
【００４３】
〈第５実施形態〉
スペクトル補正フィルタとして、複数の部分フィルタの合成フィルタとして表現できる場合には、上記第１〜第４実施形態のように１ステップでスペクトル補正を行なうのではなく、スペクトル補正を複数のステップに分散させることが可能となる。スペクトル補正の分散により、上記各実施形態と比べて、音質と処理量のバランスを柔軟に調節することが可能となる。第５実施形態では、このようにスペクトル補正フィルタを分散させて音声合成処理する場合について説明する。なお、本処理を実現するための装置構成は第１実施形態（図１）と同様である。
【００４４】
図９は第５実施形態による音声合成処理を説明するフローチャートである。図９に示されるように、まず、韻律目標値取得ステップＳ１〜スペクトル補正フィルタ構成ステップＳ４の処理を行なう。これらの処理は、上記第１〜第４実施形態におけるステップＳ１〜Ｓ４の処理と同様である。
【００４５】
次に、スペクトル補正フィルタ分解ステップＳ４０１で、スペクトル補正フィルタ構成ステップＳ４で構成されたスペクトル補正フィルタを２乃至３個の部分フィルタ（要素フィルタ）に分解する。例えば、前記音響分析にｐ次の線形予測分析を用いた場合のスペクトル補正フィルタＦ１(z)は、分母多項式と分子多項式の積として、以下の［数４］式のように表現される。
【００４６】
【数４】

【００４７】
あるいは、以下の式のように分子・分母多項式を１次または２次の実係数多項式の積に因数分解することも可能である（以下の［数５］式は、ｐが偶数の場合を示したものである）。同様に、スペクトル補正フィルタにＦＩＲフィルタを使用した場合も、１次または２次の実係数多項式の積に因数分解することができる。すなわち、［数３］式を因数分解して、［数６］式のように表される。
【００４８】
【数５】

【数６】

【００４９】
また、ｐ次のケプストラム分析を用いた場合には、フィルタ特性は指数で表現されるため、［数７］式のようにケプストラム係数をグループ分けするだけで良い。
【００５０】
【数７】

【００５１】
次に、スペクトル補正フィルタ部分適用（１）ステップＳ４０２において、スペクトル補正フィルタ分解ステップＳ４０１で分解されたフィルタの１つを用いて、波形データ取得ステップＳ２で取得した波形データをフィルタリングする。すなわち、ステップＳ４０１で得られた複数のフィルタ要素のうちの一つである第１のフィルタ要素を用いて、微細素片切り出し前の波形データに対してスペクトル補正処理を施す。
【００５２】
次に、微細素片切り出しステップＳ５において、スペクトル補正フィルタ部分適用（１）ステップＳ４０２の結果として得られた波形データに対して窓関数を適用し、微細素片を切り出す。そして、スペクトル補正フィルタ部分適用（２）ステップＳ４０３において、スペクトル補正フィルタ分解ステップＳ４０１で分解されたフィルタの１つを用いて、微細素片切り出しステップＳ５で切り出された微細素片をフィルタリングする。すなわち、ステップＳ４０１で得られた複数のフィルタ要素のうちの一つである第２のフィルタ要素を用いて、切り出された各微細素片に対してスペクトル補正処理を施す。
【００５３】
次に、第１及び第２実施形態と同様に韻律変更ステップＳ７と波形重畳ステップＳ８を行なう。そして、スペクトル補正フィルタ部分適用（３）ステップＳ４０４において、スペクトル補正フィルタ分解ステップＳ４０１で分解されたフィルタの１つを用いて、波形重畳ステップＳ８の結果得られた合成音声をフィルタリングする。すなわち、ステップＳ４０１で得られた複数のフィルタ要素のうちの一つである第３のフィルタ要素を用いて、得られた合成音声の波形データに対してスペクトル補正処理を施す。
【００５４】
そして、音声出力ステップＳ９において、スペクトル補正フィルタ部分適用（３）ステップＳ４０４の結果得られた合成音声を出力する。
以上の構成において、例えば、［数５］式の分解を行った場合は、Ｆ_1,1(ｚ)をステップＳ４０２で、Ｆ_1,2(ｚ)をステップＳ４０３で、Ｆ_1,3(ｚ)をステップＳ４０４で用いるというようなことが可能である。
【００５５】
尚、［数４］式の様に、２要素の積に分割した場合にはステップＳ４０２，Ｓ４０３，Ｓ４０４のいずれかではフィルタリングを行わないことになる。すなわち、スペクトル補正フィルタ分解ステップＳ４０１においてスペクトル補正フィルタを２つに分解した場合（この例では、分母多項式と分子多項式の２つに分割している）には、スペクトル補正フィルタ部分適用（１）ステップＳ４０２、スペクトル補正フィルタ部分適用（２）ステップＳ４０３、スペクトル補正フィルタ部分適用（３）ステップＳ４０４のうちのいずれかは省略される。
【００５６】
また、第５実施形態においても、スペクトル補正フィルタや各要素フィルタをあらかじめ構成して音声合成用辞書５０１の一部として登録しておくようにしてもよいことは、第１及び第２実施形態の関連と同様、明らかである。
以上のように、第５の実施形態によれば、どの多項式（フィルタ）をどのステップ（Ｓ４０２，Ｓ４０３，Ｓ４０４）に割り当てるかという任意性があり、その割り当て方によって、音質・処理量の配分が変わってくる。特に、［数５］式や［数７］式、あるいはＦＩＲフィルタを因数分解した［数６］式の場合には、それぞれのステップに因数を何個ずつ割り当てるかまで制御できるので、さらに柔軟性があることになる。
【００５７】
〈その他の実施形態〉
上記各実施形態において、スペクトル補正フィルタ係数を直接記録するのではなく、ベクトル量子化等の手法を使って量子化した後に記録しても良い。これにより、外部記憶装置１５に記録されるデータ量を削減することが可能である。
【００５８】
このとき、音響分析の手法としてＬＰＣ分析や一般化ケプストラム分析を用いている場合には、フィルタ係数を線スペクトル対（ＬＳＰ）に変換した後に量子化を行なうと量子化の効率が良くなる。
【００５９】
また、波形のサンプリング周波数が高い場合には、帯域分割フィルタによって帯域分割を行い、帯域制限された個々の波形に対してスペクトル補正フィルタリングを行なっても良い。帯域分割によってスペクトル補正フィルタの次数が押えられ、計算量を削減する効果がある。メルケプストラムのような周波数軸の伸縮によっても同様の効果がある。
【００６０】
また、前記各実施形態で、スペクトル補正フィルタリングを行なうタイミングには、複数の選択肢があることを示した。どのタイミングでスペクトル補正フィルタリングを行なうか、あるいはスペクトル補正を行なうか行なわないかの選択を、素片毎に行なっても良い。選択のための情報として、音素種別や有声／無声の種別等を利用することができる。
なお、上記各実施形態において、スペクトル補正フィルタの一例としては、ホルマントを強調するホルマント強調フィルタが挙げられる。
【００６１】
また、本発明の目的は、前述した実施形態の機能を実現するソフトウェアのプログラムコードを記録した記憶媒体を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記憶媒体に格納されたプログラムコードを読出し実行することによっても、達成されることは言うまでもない。
【００６２】
この場合、記憶媒体から読出されたプログラムコード自体が前述した実施形態の機能を実現することになり、そのプログラムコードを記憶した記憶媒体は本発明を構成することになる。
【００６３】
プログラムコードを供給するための記憶媒体としては、例えば、フレキシブルディスク，ハードディスク，光ディスク，光磁気ディスク，ＣＤ−ＲＯＭ，ＣＤ−Ｒ，磁気テープ，不揮発性のメモリカード，ＲＯＭなどを用いることができる。
【００６４】
また、コンピュータが読出したプログラムコードを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのプログラムコードの指示に基づき、コンピュータ上で稼働しているＯＳ（オペレーティングシステム）などが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。
【００６５】
さらに、記憶媒体から読出されたプログラムコードが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書込まれた後、そのプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。
【００６６】
【発明の効果】
以上説明したように、本発明によれば、微細素片を得るために適用した窓関数による、音声のスペクトルの「ぼやけ」を軽減することができ、音質が高い音声合成を実現することができる。
【図面の簡単な説明】
【図１】第１実施形態におけるハードウェア構成を示すブロック図である。
【図２】第１実施形態による音声出力処理を説明するフローチャートである。
【図３】第１実施形態の音声合成処理の様子を表す図である。
【図４】第２実施形態による音声出力処理におけるスペクトル補正フィルタ登録処理を説明するフローチャートである。
【図５】第２実施形態による音声出力処理における音声合成処理を説明するフローチャートである。
【図６】第３実施形態による音声出力処理におけるスペクトル補正フィルタ登録処理を説明するフローチャートである。
【図７】第３実施形態による音声出力処理における音声合成処理を説明するフローチャートである。
【図８】第４実施形態による音声出力処理を説明するフローチャートである。
【図９】第５実施形態による音声出力処理を説明するフローチャートである。
【図１０】音声波形の微細素片への分割、再配置、合成による音声合成方法を模式的に示した図である。
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a speech synthesis apparatus and method for synthesizing speech.
[0002]
[Prior art]
Conventionally, one speech synthesis method for obtaining desired synthesized speech divides a speech segment, recorded and stored in advance, into a plurality of fine segments and then rearranges the fine segments obtained by the division. In this rearrangement, processing such as interval change, repetition, and thinning is applied to the fine segments, which yields synthesized speech with the desired duration and fundamental frequency.
[0003]
FIG. 10 is a diagram schematically showing a method of dividing a speech waveform into fine segments. The speech waveform shown in FIG. 10 is divided into fine segments by a cutout window function (hereinafter referred to as a window function). At this time, a window function synchronized with the pitch interval of the original speech is used in the voiced portion (second half of the speech waveform). On the other hand, a window function with an appropriate interval is used in the unvoiced sound part.
[0004]
Then, as shown in FIG. 10, the duration of the speech can be shortened by thinning out these fine segments. Conversely, if these fine segments are used repeatedly, the duration can be extended. Furthermore, as shown in FIG. 10, in the voiced portion the fundamental frequency of the synthesized speech can be raised by narrowing the interval between the fine segments, and lowered by widening it.
[0005]
The desired synthesized speech is obtained by superimposing again the fine segments that have been rearranged through the repetition, thinning, and interval change described above. Units such as phonemes, CV/VC, or VCV are used as the unit for recording and storing speech segments. CV/VC is a unit in which the segment boundary is placed within a phoneme, and VCV is a unit in which the segment boundary is placed within a vowel.
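The rearrangement and superposition described in this background can be sketched as follows. This is a minimal illustration, not the method of any particular system: the function names `rearrange` and `overlap_add`, the linear index mapping used to pick which segment to repeat or drop, and the parameters `n_out` and `spacing` are all assumptions introduced here.

```python
def rearrange(segments, n_out, spacing):
    """Choose n_out segments from the input and assign each a start offset.

    n_out < len(segments) thins segments out (shorter duration),
    n_out > len(segments) repeats segments (longer duration),
    and a smaller spacing raises the fundamental frequency.
    The linear index mapping is an illustrative assumption.
    """
    n_in = len(segments)
    placed = []
    for i in range(n_out):
        src = min(n_in - 1, i * n_in // n_out)
        placed.append((i * spacing, segments[src]))
    return placed


def overlap_add(placed):
    """Superimpose the placed segments into one output waveform."""
    length = max(start + len(seg) for start, seg in placed)
    out = [0.0] * length
    for start, seg in placed:
        for k, value in enumerate(seg):
            out[start + k] += value
    return out
```

For example, calling `rearrange` with `n_out` equal to half the number of input segments keeps every second segment, and `overlap_add` then sums the kept segments at the requested spacing.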
[0006]
[Problems to be solved by the invention]
However, in the conventional method described above, applying a window function to obtain fine segments from the speech waveform causes so-called “blurring” of the speech spectrum. That is, the formants of the speech broaden and the peaks and valleys of the spectral envelope become indistinct, which degrades the quality of the synthesized speech.
[0007]
The present invention has been made in view of the above problem, and its object is to reduce the “blurring” of the speech spectrum caused by the window function applied to obtain fine segments, and thereby to realize high-quality speech synthesis.
[0008]
[Means for Solving the Problems]
To achieve the above object, a speech synthesis method according to the present invention comprises:
an acquisition step of acquiring fine segments from speech waveform data and a window function;
a rearrangement step of rearranging the fine segments acquired in the acquisition step so as to change the prosody at the time of synthesis;
a synthesis step of outputting synthesized speech waveform data based on superimposed waveform data obtained by superimposing the fine segments rearranged in the rearrangement step; and
a correction step of applying, to the fine segments acquired in the acquisition step, a spectrum correction filter that is configured based on the speech waveform data and that reduces the blurring of the speech spectrum caused by the window function applied to acquire the fine segments.
[0009]
In addition, a speech synthesis method according to the present invention for achieving the above object has the following configuration.
That is,
an acquisition step of acquiring fine segments from speech waveform data and a window function;
a rearrangement step of rearranging the fine segments acquired in the acquisition step so as to change the prosody at the time of synthesis; and
a synthesis step of outputting synthesized speech waveform data based on superimposed waveform data obtained by superimposing the fine segments rearranged in the rearrangement step,
wherein each of a plurality of element filters, obtained by decomposing a spectrum correction filter that is configured based on the speech waveform data processed in the acquisition step and that reduces the blurring of the speech spectrum caused by the window function applied to acquire the fine segments, is applied at one of a plurality of points in the processing that includes the acquisition step, the rearrangement step, and the synthesis step.
[0010]
Further, according to the present invention, there is provided a dictionary generation method for speech synthesis suitable for the speech synthesis method or speech synthesis apparatus.
[0011]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, some preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.
[0012]
<First Embodiment>
FIG. 1 is a block diagram showing a hardware configuration in the first embodiment.
[0013]
In FIG. 1, reference numeral 11 denotes a central processing unit which performs processing such as numerical calculation and control. In particular, the central processing unit 11 performs speech synthesis processing according to the procedure described below. An output device 12 presents various information to the user under the control of the central processing unit 11. An input device 13 includes a touch panel or a keyboard, and is used by a user to give an operation instruction to the device and to input various information. Reference numeral 14 denotes an audio output device that outputs audio, and outputs the synthesized content.
[0014]
Reference numeral 15 denotes a storage device such as a disk device or a nonvolatile memory, which holds a speech synthesis dictionary 501 and the like. Reference numeral 16 denotes a read-only storage device, which stores the speech synthesis processing procedure of the present embodiment and necessary fixed data. Reference numeral 17 denotes a storage device such as a RAM that holds temporary information, including temporary data and various flags. The above components (11 to 17) are connected by a bus 18. In the present embodiment, a control program for the speech synthesis processing is stored in the ROM 16 and executed by the central processing unit 11; alternatively, the control program may be stored in the external storage device 15 and loaded into the RAM 17 upon execution.
[0015]
The operation of the speech output apparatus of this embodiment, configured as described above, will be described below with reference to FIGS. 2 and 3. FIG. 2 is a flowchart explaining the speech output processing according to the first embodiment. FIG. 3 is a diagram illustrating the speech synthesis processing of the first embodiment.
[0016]
First, in the prosodic target value acquisition step S1, the target prosodic value of the synthesized speech is acquired. The target prosodic value of the synthesized speech may be given directly from the upper module as in the case of singing voice synthesis, or may be estimated using some means. For example, if it is speech synthesis from text, it is estimated from the language analysis result of the text.
[0017]
Next, in waveform data acquisition step S2, the waveform data (speech waveform 301 in FIG. 3) from which the synthesized speech is produced is acquired. Then, in acoustic analysis step S3, acoustic analysis such as linear prediction (LPC) analysis, cepstrum analysis, or generalized cepstrum analysis is performed on the acquired waveform data, and the parameters needed to configure the spectrum correction filter 304 are calculated. The waveform data may be analyzed at predetermined time intervals or pitch-synchronously.
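As an illustration of the linear prediction analysis mentioned in step S3, the sketch below estimates prediction coefficients with the autocorrelation method and the Levinson-Durbin recursion. The text does not say which LPC algorithm is used, so this choice is an assumption, as are the function names and the sign convention x[n] ≈ Σ α_j x[n−j].

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    return [
        sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve for alpha[1..order] with x[n] ~ sum_j alpha[j] * x[n - j].

    Returns (coefficients, residual_energy). The sign convention is an
    assumption; other texts use the opposite sign.
    """
    alpha = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(alpha[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of stage i
        updated = alpha[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = alpha[j] - k * alpha[i - j]
        alpha = updated
        err *= 1.0 - k * k
    return alpha[1:], err
```

In a pitch-synchronous analysis, the frame passed to `autocorrelation` would be taken around each pitch mark; with fixed-interval analysis it would be taken at a constant hop.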
[0018]
Next, in spectrum correction filter configuration step S4, a spectrum correction filter is configured using the parameters calculated in acoustic analysis step S3. For example, when p-th order linear prediction analysis is used for the acoustic analysis, a filter having the characteristic expressed by [Equation 1] below is used as the spectrum correction filter 304. When [Equation 1] is used, the linear prediction coefficients α_j are calculated in the parameter calculation.
[0019]
[Equation 1]

[0020]
When p-th order cepstrum analysis is used, a filter having the characteristic expressed by [Equation 2] below is used as the spectrum correction filter. When [Equation 2] is used, the cepstrum coefficients c_j are calculated in the parameter calculation.
[0021]
[Equation 2]

[0022]
In the above equations, μ and γ are appropriate coefficients, α is a linear prediction coefficient, and c is a cepstrum coefficient. Alternatively, an FIR filter expressed by [Equation 3] below, constructed by truncating the impulse response of the above filter at an appropriate order, may be used. When [Equation 3] is used, the coefficients β_j are calculated in the parameter calculation.
[0023]
[Equation 3]

[0024]
Actually, it is necessary to consider the system gain in each of the above equations. The spectrum correction filter configured as described above is stored in the speech synthesis dictionary 501 (actually, filter coefficients are stored).
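The equations themselves are not reproduced in this text, so the following is only a sketch of one widely used form of LPC-based formant emphasis filter, F(z) = A(z/γ) / A(z/μ) with A(z) = 1 − Σ α_j z^(−j). Whether this matches [Equation 1] term for term, including the sign convention and the roles of γ and μ, is an assumption. The last function illustrates the FIR alternative by truncating the impulse response, which is one plausible way to obtain coefficients such as β_j. All function names are hypothetical, and the system gain mentioned above is not modeled.

```python
def bandwidth_expand(alpha, factor):
    """Weighted coefficients alpha_j * factor**j for j = 1..p, i.e. A(z/factor)."""
    return [a * factor ** (j + 1) for j, a in enumerate(alpha)]


def apply_filter(x, num, den):
    """Filter x with H(z) = (1 - sum num_j z^-j) / (1 - sum den_j z^-j)."""
    y = []
    for n in range(len(x)):
        value = x[n]
        for j, b in enumerate(num, start=1):
            if n - j >= 0:
                value -= b * x[n - j]
        for j, a in enumerate(den, start=1):
            if n - j >= 0:
                value += a * y[n - j]
        y.append(value)
    return y


def truncated_impulse_response(num, den, order):
    """FIR approximation: the first order + 1 samples of the impulse response."""
    impulse = [1.0] + [0.0] * order
    return apply_filter(impulse, num, den)
```

With γ < μ < 1 this form sharpens spectral peaks; when γ equals μ the numerator and denominator cancel and the filter leaves the signal unchanged, which is a convenient sanity check.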
[0025]
Next, in the fine segment extraction step S5, the window function 302 is applied to the waveform acquired in the waveform data acquisition step S2, and the fine segment 303 is extracted. A Hanning window or the like is used as the window function.
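The extraction in step S5 can be sketched as below: a Hanning window is centered on each pitch mark and multiplied into the waveform. The symmetric window definition, the fixed `half_width`, the zero padding at the waveform edges, and the function names are assumptions made for illustration; the text only states that a Hanning window or the like is used.

```python
import math


def hanning(length):
    """Symmetric Hanning window of the given length."""
    if length == 1:
        return [1.0]
    return [
        0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def cut_fine_segments(waveform, pitch_marks, half_width):
    """Cut one windowed fine segment around each pitch mark.

    Samples outside the waveform are treated as zero (an assumption).
    """
    window = hanning(2 * half_width + 1)
    segments = []
    for mark in pitch_marks:
        segment = []
        for k in range(-half_width, half_width + 1):
            idx = mark + k
            sample = waveform[idx] if 0 <= idx < len(waveform) else 0.0
            segment.append(sample * window[k + half_width])
        segments.append(segment)
    return segments
```

In voiced speech the pitch marks would follow the pitch period of the original speech, while in unvoiced speech they could simply be placed at a fixed interval.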
[0026]
Next, in fine segment spectrum correction step S6, the filter 304 configured in spectrum correction filter configuration step S4 is applied to the fine segments 303 cut out in fine segment extraction step S5, thereby correcting their spectrum. In this way, spectrum-corrected fine segments 305 are obtained.
[0027]
Next, in prosody change step S7, the fine segments 305 spectrum-corrected in fine segment spectrum correction step S6 are thinned, repeated, and respaced, and thus rearranged (306), so as to match the prosodic target values acquired in prosodic target value acquisition step S1. Then, in waveform superimposing step S8, the fine segments rearranged in prosody change step S7 are superimposed to obtain synthesized speech 307. Because what step S8 yields is a speech segment, the actual synthesized speech is obtained by connecting a plurality of speech segments obtained in waveform superimposing step S8. That is, in speech output step S9, the speech segments obtained in waveform superimposing step S8 are connected and the synthesized speech is output.
[0028]
Regarding the rearrangement of the fine segments, the “thinning” may be performed before the spectrum correction filter is applied, as shown in FIG. 3. Doing so avoids the wasted work of filtering fine segments that will not be used.
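The saving from thinning first can be sketched as follows. The function `correct_selected_segments` and its arguments are hypothetical; `correct` stands for whatever spectrum correction is applied to one fine segment. Reusing the corrected result for a segment that is selected more than once is an extra assumption that goes slightly beyond what the text states.

```python
def correct_selected_segments(segments, selected_indices, correct):
    """Apply `correct` only to segments that survive thinning.

    selected_indices lists, in output order, which input segment is used
    at each output position. Each distinct segment is corrected once.
    """
    corrected = {}
    for idx in selected_indices:
        if idx not in corrected:
            corrected[idx] = correct(segments[idx])
    return [corrected[idx] for idx in selected_indices]
```

Segments whose index never appears in `selected_indices` are never passed to `correct`, which is exactly the work the paragraph above describes as avoidable.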
[0029]
<Second Embodiment>
In the first embodiment, the spectrum correction filter is configured at the time of speech synthesis. However, the spectrum correction filter may instead be configured prior to speech synthesis, with the configuration information (filter coefficients) for the filter held in a predetermined storage area. That is, the process of the first embodiment can be separated into two processes: data creation (FIG. 4) and speech synthesis (FIG. 5). The second embodiment describes the processing in this case. The apparatus configuration for realizing this processing is the same as in the first embodiment (FIG. 1). In this embodiment, the configuration information is stored in the speech synthesis dictionary 501.
[0030]
In the flowchart of FIG. 4, steps S2, S3, and S4 are the same as those in the first embodiment (FIG. 2). In the spectrum correction filter recording step S101, the filter coefficient of the spectrum correction filter configured in the spectrum correction filter configuration step S4 is recorded in the external storage device 15. In the present embodiment, a spectrum correction filter is configured for each waveform data registered in the speech synthesis dictionary 501 and the coefficient of the filter corresponding to each waveform data is held in the speech synthesis dictionary 501 as a spectrum correction filter. That is, the waveform data of each speech waveform and the spectrum correction filter are registered in the speech synthesis dictionary 501 of the second embodiment.
[0031]
At the time of speech synthesis, on the other hand, as shown in the flowchart of FIG. 5, acoustic analysis step S3 and spectrum correction filter configuration step S4 of the first embodiment become unnecessary, and spectrum correction filter reading step S102 is added in their place. In spectrum correction filter reading step S102, the spectrum correction filter coefficients recorded in spectrum correction filter recording step S101 are read. That is, the coefficients of the spectrum correction filter corresponding to the waveform data acquired in waveform data acquisition step S2 are read from the speech synthesis dictionary 501, and the spectrum correction filter is configured from them. Then, in fine segment spectrum correction step S6, the fine segments are processed using the spectrum correction filter read in spectrum correction filter reading step S102.
[0032]
As described above, by recording the spectrum correction filter for all waveform data in advance, it is not necessary to configure the spectrum correction filter during speech synthesis. For this reason, it is possible to reduce the processing amount at the time of speech synthesis compared to the first embodiment.
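The split into data creation and synthesis can be pictured with a small in-memory stand-in for the speech synthesis dictionary. The layout below (a mapping from a unit identifier to a waveform and its filter coefficients) and the names `build_dictionary`, `load_for_synthesis`, and `design_filter` are assumptions for illustration; the text does not specify how the dictionary 501 is organized.

```python
def build_dictionary(waveforms, design_filter):
    """Data creation: store each waveform with its filter coefficients.

    design_filter stands for acoustic analysis plus filter configuration
    (steps S3 and S4) and is run once per waveform, ahead of synthesis.
    """
    return {
        unit_id: {"waveform": list(wave), "filter": design_filter(wave)}
        for unit_id, wave in waveforms.items()
    }


def load_for_synthesis(dictionary, unit_id):
    """Synthesis time: read the waveform and its stored coefficients."""
    entry = dictionary[unit_id]
    return entry["waveform"], entry["filter"]
```

At synthesis time only `load_for_synthesis` runs, so no analysis is repeated; a variant that stores already corrected waveform data instead of coefficients would drop the `"filter"` field.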
[0033]
<Third Embodiment>
In the first and second embodiments, the filter configured in spectrum correction filter configuration step S4 is applied to the fine segments cut out in fine segment extraction step S5. However, the spectrum correction filter may instead be applied to the waveform data (speech waveform 301) acquired in waveform data acquisition step S2. The third embodiment describes speech synthesis processing of this kind. The apparatus configuration for realizing this processing is the same as in the first embodiment (FIG. 1).
[0034]
FIG. 6 is a flowchart explaining the speech synthesis processing according to the third embodiment. In FIG. 6, waveform data acquisition step S2 through spectrum correction filter configuration step S4 are the same as in the second embodiment. In the third embodiment, after the spectrum correction filter is configured in spectrum correction filter configuration step S4, waveform data spectrum correction step S201 applies that filter to the waveform data acquired in waveform data acquisition step S2, thereby correcting the spectrum of the waveform data.
[0035]
Next, in spectrum-corrected waveform data recording step S202, the waveform data spectrum-corrected in waveform data spectrum correction step S201 is recorded. That is, the speech synthesis dictionary 501 of FIG. 1 stores “spectrum-corrected waveform data” in place of the “spectrum correction filter” stored in the second embodiment.
[0036]
On the other hand, in the speech synthesis process, the process shown in the flowchart of FIG. 7 is executed. In the third embodiment, a spectrum correction waveform data acquisition step S203 is provided instead of the waveform data acquisition step S2 in each of the above-described embodiments. Thereby, the waveform data after the spectrum correction recorded in the spectrum correction waveform data recording step S202 is acquired as a target for cutting out the fine segment in step S5. Then, by extracting and rearranging the fine segments with respect to the acquired waveform data, synthesized speech subjected to spectrum correction is obtained. In addition, since the spectrum-corrected waveform data is used, the spectrum correction processing (step S6 in the first and second embodiments) for the fine segment is not necessary.
[0037]
When the spectrum correction filter is applied to the waveform data rather than to the fine segments, as in the third embodiment, the influence of the window function used in fine segment extraction step S5 cannot be completely eliminated. That is, the sound quality is slightly inferior to that of the first and second embodiments. However, because everything up to and including the filtering by the spectrum correction filter can be performed prior to speech synthesis, the amount of processing during speech synthesis (FIG. 7) is significantly reduced compared with the first and second embodiments.
[0038]
The third embodiment has been described with the processing divided into two processes, data creation and speech synthesis, as in the second embodiment. However, it can also be configured so that the filtering is performed each time the synthesis processing is executed, as in the first embodiment. In this case, in the flowchart of FIG. 2, the spectrum correction filter is applied to the waveform data to be synthesized between step S4 and step S5, and step S6 becomes unnecessary.
[0039]
<Fourth embodiment>
In the first and second embodiments, the filter configured in spectrum correction filter configuration step S4 is applied to the fine segments cut out in fine segment extraction step S5. In the third embodiment, the filter configured in spectrum correction filter configuration step S4 is applied to the waveform data before it is cut into fine segments. As a further alternative, the spectrum correction filter can be applied to the waveform data of the synthesized speech produced in waveform superimposing step S8. The fourth embodiment describes the processing in this case. The apparatus configuration for realizing this processing is the same as in the first embodiment (FIG. 1).
[0040]
FIG. 8 is a flowchart for explaining speech synthesis processing according to the fourth embodiment. The same reference numerals are assigned to the same processes as those in the first embodiment (FIG. 2). In the fourth embodiment, as shown in FIG. 8, a synthesized speech spectrum correction step S301 is provided after the waveform superposition step S8, and the fine segment spectrum correction step S6 is eliminated. In the synthesized speech spectrum correction step S301, the filter configured in the spectrum correction filter configuration step S4 is applied to the waveform data of the synthesized speech obtained in the waveform superimposing step S8 to perform spectrum correction.
[0041]
According to the fourth embodiment described above, when the number of repetitions of the same fine segment is small as a result of the prosody change step S7, the amount of processing is reduced compared to the first embodiment.
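A rough way to see when filtering after superposition is cheaper is to count multiplications. The cost model below is an assumption introduced here, not something stated in the text: it charges one multiplication per filter tap per sample, assumes every cut segment is filtered in the first-embodiment style, and ignores any other overhead. The function and parameter names are hypothetical.

```python
def cost_filter_segments(n_segments, segment_len, taps):
    """Multiplications when every cut fine segment is filtered individually."""
    return n_segments * segment_len * taps


def cost_filter_output(n_placed, spacing, segment_len, taps):
    """Multiplications when only the superimposed output waveform is filtered."""
    output_len = (n_placed - 1) * spacing + segment_len
    return output_len * taps
```

Under this model, with 10 segments of 200 samples, 16 taps, and a spacing of 100 samples, filtering the output costs less when 10 segments are placed and more when 30 are placed, which is consistent with the statement that the saving appears when few repetitions occur.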
[0042]
In this embodiment as well, the spectrum correction filter can be configured in advance, in the same way that the second embodiment relates to the first. That is, the filter coefficients are stored in advance in the speech synthesis dictionary 501, read out at the time of speech synthesis to configure the spectrum correction filter, and applied to the waveform data superimposed in step S8.
[0043]
<Fifth Embodiment>
If the spectrum correction filter can be expressed as a composite of a plurality of partial filters, the spectrum correction need not be performed in a single step as in the first to fourth embodiments, but can instead be distributed over a plurality of steps. Distributing the spectrum correction makes it possible to adjust the balance between sound quality and processing amount more flexibly than in the above embodiments. The fifth embodiment describes speech synthesis processing in which the spectrum correction filter is distributed in this way. The apparatus configuration for realizing this processing is the same as in the first embodiment (FIG. 1).
[0044]
FIG. 9 is a flowchart for explaining speech synthesis processing according to the fifth embodiment. As shown in FIG. 9, first, the processing of prosody target value acquisition step S1 to spectrum correction filter configuration step S4 is performed. These processes are the same as the processes of steps S1 to S4 in the first to fourth embodiments.
[0045]
Next, in the spectrum correction filter decomposition step S401, the spectrum correction filter configured in the spectrum correction filter configuration step S4 is decomposed into two or three partial filters (element filters). For example, the spectrum correction filter F1(z) obtained when p-th order linear prediction analysis is used for the acoustic analysis can be expressed, as in [Expression 4] below, as a product of a denominator-polynomial part and a numerator-polynomial part.
[0046]
[Expression 4]

[0047]
Alternatively, the numerator and denominator polynomials can each be factorized into a product of first-order or second-order polynomials with real coefficients, as in the following expression ([Expression 5] below shows the case where p is an even number). Similarly, when an FIR filter is used as the spectrum correction filter, it can be factorized into a product of first-order or second-order polynomials with real coefficients. That is, [Expression 3] is factorized and expressed as [Expression 6].
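The reason such a factorization is usable can be sketched as follows: applying the low-order factors one after another gives the same result as applying their product in one pass. The sketch below checks this for two hypothetical second-order real-coefficient sections (so p = 4, an even order); the section coefficients are illustrative assumptions, not values from the patent, and only the FIR (numerator-type) case is shown.

```python
def poly_mul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def fir_filter(b, x):
    """Apply an FIR filter with coefficients b to x, starting from zero state."""
    return [sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
            for n in range(len(x))]


# Hypothetical second-order sections 1 + a1*z^-1 + a2*z^-2.
sections = [[1.0, -0.5, 0.25], [1.0, 0.5, 0.5]]

# Product of the sections: the unfactorized fourth-order polynomial.
full = [1.0]
for s in sections:
    full = poly_mul(full, s)

# Unit impulse through the cascade versus through the full filter.
x = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
cascaded = x
for s in sections:
    cascaded = fir_filter(s, cascaded)
direct = fir_filter(full, x)

print(full)
print(cascaded)
print(direct)
```

Because the sections commute, any subset of them can be grouped into one element filter, which is what lets the factors be assigned freely to different processing steps.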
[0048]
[Expression 5]

[Expression 6]

[0049]
In addition, when p-th order cepstrum analysis is used, the filter characteristic is expressed as an exponential, so the decomposition only requires grouping the cepstrum coefficients as shown in [Expression 7].
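Why grouping suffices can be checked numerically: the exponential of a sum is the product of the exponentials of its parts. The sketch below assumes, purely for illustration, a filter of the form exp(sum over k of c_k z^-k) and evaluates it at one point on the unit circle; the cepstrum values, the index range, and the grouping are hypothetical and do not reproduce the patent's expression.

```python
import cmath


def exp_filter_response(cepstrum, omega):
    """Response of exp(sum_k c_k z^-k) at z = e^{j*omega}; cepstrum maps k -> c_k."""
    return cmath.exp(sum(c * cmath.exp(-1j * omega * k)
                         for k, c in cepstrum.items()))


# Hypothetical cepstrum coefficients and a split into three groups.
cepstrum = {0: 0.1, 1: 0.3, 2: -0.2, 3: 0.05, 4: 0.02}
groups = [{0, 1}, {2, 3}, {4}]
omega = 0.7

full = exp_filter_response(cepstrum, omega)

# Each group defines one element filter; their responses multiply.
product = 1.0
for g in groups:
    product *= exp_filter_response({k: cepstrum[k] for k in g}, omega)

print(abs(full - product))
```

Any partition of the coefficient indices works in this sketch, so the number of coefficients per group is a free choice.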
[0050]
[Expression 7]

[0051]
Next, in the spectrum correction filter partial application (1) step S402, the waveform data acquired in the waveform data acquisition step S2 is filtered using one of the filters obtained in the spectrum correction filter decomposition step S401. That is, using the first element filter, which is one of the plurality of element filters obtained in step S401, spectrum correction processing is performed on the waveform data before the fine segments are cut out.
[0052]
Next, in the fine segment cutout step S5, the window function is applied to the waveform data obtained as a result of the spectrum correction filter partial application (1) step S402, and the fine segments are cut out. Then, in the spectrum correction filter partial application (2) step S403, the fine segments cut out in the fine segment cutout step S5 are filtered using one of the filters obtained in the spectrum correction filter decomposition step S401. That is, spectrum correction processing is performed on each cut-out fine segment using the second element filter, which is one of the plurality of element filters obtained in step S401.
[0053]
Next, as in the first and second embodiments, the prosody changing step S7 and the waveform superimposing step S8 are performed. Then, in the spectrum correction filter partial application (3) step S404, the synthesized speech obtained as a result of the waveform superimposing step S8 is filtered using one of the filters obtained in the spectrum correction filter decomposition step S401. That is, using the third element filter, which is one of the plurality of element filters obtained in step S401, spectrum correction processing is performed on the obtained synthesized speech waveform data.
[0054]
Then, in the speech output step S9, the synthesized speech obtained as a result of the spectrum correction filter partial application (3) step S404 is output.
In the above-described configuration, for example, when the filter is decomposed as in [Expression 5], F_{1,1}(z) can be used in step S402, F_{1,2}(z) in step S403, and F_{1,3}(z) in step S404.
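The order of operations in the fifth embodiment can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: the three element filters are hypothetical FIR coefficient lists `f1`, `f2`, `f3`, the window is applied at a fixed hop rather than pitch-synchronously, the prosody change of step S7 is left as the identity, and per-segment filtering truncates the filter tail to the segment length. None of these choices is taken from the patent.

```python
def fir_filter(b, x):
    """Apply an FIR filter with coefficients b to x, starting from zero state."""
    return [sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
            for n in range(len(x))]


def synthesize(waveform, window, hop, f1, f2, f3):
    # S402: first element filter on the waveform before segmentation.
    pre = fir_filter(f1, waveform)
    # S5: cut out fine segments with the window function.
    starts = range(0, len(pre) - len(window) + 1, hop)
    segments = [[pre[s + i] * w for i, w in enumerate(window)] for s in starts]
    # S403: second element filter on each fine segment.
    segments = [fir_filter(f2, seg) for seg in segments]
    # S7: prosody change would re-space, repeat, or thin segments; identity here.
    # S8: superimpose the segments at their (unchanged) positions.
    out = [0.0] * len(pre)
    for s, seg in zip(starts, segments):
        for i, v in enumerate(seg):
            out[s + i] += v
    # S404: third element filter on the superimposed waveform.
    return fir_filter(f3, out)


x = [1.0, 2.0, 3.0, 4.0]
rect = [1.0, 1.0]  # rectangular window, non-overlapping at hop 2

identity = synthesize(x, rect, 2, [1.0], [1.0], [1.0])
post_only = synthesize(x, rect, 2, [1.0], [1.0], [1.0, -0.5])
print(identity)
print(post_only)
```

Passing `[1.0]` for a stage corresponds to skipping that stage, which is how a two-way decomposition can be accommodated in the same structure.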
[0055]
In addition, when the filter is divided into a product of two elements as in [Expression 4], filtering is skipped in one of steps S402, S403, and S404. That is, when the spectrum correction filter is decomposed into two filters in the spectrum correction filter decomposition step S401 (in this example, into a denominator polynomial and a numerator polynomial), one of the spectrum correction filter partial application (1) step S402, the spectrum correction filter partial application (2) step S403, and the spectrum correction filter partial application (3) step S404 is omitted.
[0056]
In the fifth embodiment as well, it is clear that the spectrum correction filter and each element filter may be configured in advance and registered as part of the speech synthesis dictionary 501, in the same way that the second embodiment relates to the first.
As described above, according to the fifth embodiment, the choice of which polynomial (filter) is assigned to which step (S402, S403, S404) is arbitrary, and the trade-off between sound quality and processing amount changes depending on the assignment. In particular, in the case of [Expression 5], [Expression 7], or [Expression 6] obtained by factorizing the FIR filter, the number of factors assigned to each step can also be controlled, which provides even more flexibility.
[0057]
<Other embodiments>
In each of the above embodiments, the spectrum correction filter coefficients may be quantized using a technique such as vector quantization before being recorded, instead of being recorded directly. This reduces the amount of data recorded in the external storage device 15.
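As a minimal sketch of the storage saving, vector quantization replaces each coefficient vector with the index of its nearest codeword, so only the index has to be recorded per filter. The codebook below is a hypothetical, hand-written example; how the codebook is trained is not shown, and the function name `quantize` is an assumption for illustration.

```python
def quantize(vec, codebook):
    """Return the index of the codeword nearest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])))


# Hypothetical codebook of filter coefficient vectors.
codebook = [[1.0, -0.5], [1.0, 0.5], [0.8, 0.0]]

# A coefficient vector to be stored; only its codeword index is recorded.
coeffs = [0.95, 0.4]
index = quantize(coeffs, codebook)

# At synthesis time the filter is rebuilt from the codeword.
restored = codebook[index]
print(index, restored)
```

The restored coefficients are an approximation of the original ones, so the codebook size sets the balance between stored data and filter accuracy.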
[0058]
At this time, if LPC analysis or generalized cepstrum analysis is used as the acoustic analysis method, quantization efficiency is improved by performing quantization after converting the filter coefficient into a line spectrum pair (LSP).
[0059]
When the waveform sampling frequency is high, the waveform may be divided into bands by a band division filter, and spectrum correction filtering may be performed on each band-limited waveform. Band division keeps the order of the spectrum correction filter low and reduces the amount of calculation. The same effect can be obtained by warping the frequency axis, as in the mel cepstrum.
[0060]
Further, the above embodiments have shown that there are a plurality of options for the timing of spectrum correction filtering. When to perform spectrum correction filtering, or whether to perform it at all, may be selected for each segment. Information such as the phoneme type or the voiced / unvoiced distinction can be used for this selection.
In each of the above embodiments, an example of a spectrum correction filter is a formant emphasis filter that emphasizes formants.
[0061]
Needless to say, the object of the present invention can also be achieved by supplying a storage medium storing software program code that implements the functions of the above-described embodiments to a system or apparatus, and having the computer (or CPU or MPU) of the system or apparatus read and execute the program code stored in the storage medium.
[0062]
In this case, the program code itself read from the storage medium realizes the functions of the above-described embodiments, and the storage medium storing the program code constitutes the present invention.
[0063]
As a storage medium for supplying the program code, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a magnetic tape, a nonvolatile memory card, a ROM, or the like can be used.
[0064]
Further, it goes without saying that the invention covers not only the case where the functions of the above-described embodiments are realized by the computer executing the read program code, but also the case where an OS (operating system) running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0065]
Further, it goes without saying that the invention also covers the case where, after the program code read from the storage medium is written into a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, a CPU or the like provided on the function expansion board or in the function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0066]
[Effects of the Invention]
As described above, according to the present invention, it is possible to reduce the "blurring" of the speech spectrum caused by the window function applied to obtain the fine segments, and to realize speech synthesis with high sound quality.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a hardware configuration in a first embodiment.
FIG. 2 is a flowchart illustrating audio output processing according to the first embodiment.
FIG. 3 is a diagram illustrating a state of speech synthesis processing according to the first embodiment.
FIG. 4 is a flowchart illustrating a spectrum correction filter registration process in an audio output process according to the second embodiment.
FIG. 5 is a flowchart for explaining speech synthesis processing in speech output processing according to the second embodiment.
FIG. 6 is a flowchart illustrating a spectrum correction filter registration process in the audio output process according to the third embodiment.
FIG. 7 is a flowchart illustrating speech synthesis processing in speech output processing according to the third embodiment.
FIG. 8 is a flowchart illustrating audio output processing according to a fourth embodiment.
FIG. 9 is a flowchart illustrating audio output processing according to a fifth embodiment.
FIG. 10 is a diagram schematically showing a speech synthesis method by dividing a speech waveform into fine segments, rearrangement, and synthesis.

Claims

A speech synthesis method comprising:
an acquisition step of acquiring fine segments from speech waveform data and a window function;
a rearrangement step of rearranging the fine segments acquired in the acquisition step so as to change the prosody at the time of synthesis;
a synthesis step of outputting synthesized speech waveform data based on superimposed waveform data obtained by superimposing the fine segments rearranged in the rearrangement step; and
a correction step of applying, to the fine segments acquired in the acquisition step, a spectrum correction filter that is configured based on the speech waveform data and that reduces the blurring of the speech spectrum caused by the window function applied to acquire the fine segments.

The speech synthesis method according to claim 1, wherein the correction step includes a configuration step of configuring the spectrum correction filter based on the speech waveform data processed in the acquisition step,
and the spectrum correction filter configured in the configuration step is applied to the fine segments acquired in the acquisition step.

The speech synthesis method according to claim 1, further comprising a speech synthesis dictionary in which, for each item of speech waveform data, configuration information for a spectrum correction filter based on that speech waveform data is registered,
wherein the correction step configures the spectrum correction filter by acquiring, from the speech synthesis dictionary, the configuration information corresponding to the speech waveform data processed in the acquisition step, and applies the spectrum correction filter to the fine segments acquired in the acquisition step.

A speech synthesis method comprising:
an acquisition step of acquiring fine segments from speech waveform data and a window function;
a rearrangement step of rearranging the fine segments acquired in the acquisition step so as to change the prosody at the time of synthesis; and
a synthesis step of outputting synthesized speech waveform data based on superimposed waveform data obtained by superimposing the fine segments rearranged in the rearrangement step,
wherein a plurality of element filters, obtained by decomposing a spectrum correction filter that is configured based on the speech waveform data processed in the acquisition step and that reduces the blurring of the speech spectrum caused by the window function applied to acquire the fine segments, are applied respectively at a plurality of points in the processing sequence that includes the acquisition step, the rearrangement step, and the synthesis step.

The speech synthesis method according to any one of claims 1 to 4, wherein the rearrangement of the fine segments cut out by the window function is at least one of a change in the interval between the fine segments, repetition of fine segments, and thinning out of fine segments.

A speech synthesis apparatus comprising:
acquisition means for acquiring fine segments from speech waveform data and a window function;
rearrangement means for rearranging the fine segments acquired by the acquisition means so as to change the prosody at the time of synthesis;
synthesis means for outputting synthesized speech waveform data based on superimposed waveform data obtained by superimposing the fine segments rearranged by the rearrangement means; and
correction means for applying, to the fine segments acquired by the acquisition means, a spectrum correction filter that is configured based on the speech waveform data and that reduces the blurring of the speech spectrum caused by the window function applied to acquire the fine segments.

A speech synthesis apparatus comprising:
acquisition means for acquiring fine segments from speech waveform data and a window function;
rearrangement means for rearranging the fine segments acquired by the acquisition means so as to change the prosody at the time of synthesis; and
synthesis means for outputting synthesized speech waveform data based on superimposed waveform data obtained by superimposing the fine segments rearranged by the rearrangement means,
wherein a plurality of element filters, obtained by decomposing a spectrum correction filter that is configured based on the speech waveform data processed by the acquisition means and that reduces the blurring of the speech spectrum caused by the window function applied to acquire the fine segments, are applied respectively at a plurality of points in the processing sequence that includes the acquisition means, the rearrangement means, and the synthesis means.

A control program for causing a computer to execute the speech synthesis method according to any one of claims 1 to 5.