JP2004012700A

JP2004012700A - Method and apparatus for synthesizing voice and method and apparatus for preparing dictionary

Info

Publication number: JP2004012700A
Application number: JP2002164624A
Authority: JP
Inventors: Masaaki Yamada; 山田　雅章; Toshiaki Fukada; 深田　俊明; Yasuhiro Komori; 小森　康弘
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2002-06-05
Filing date: 2002-06-05
Publication date: 2004-01-15
Anticipated expiration: 2022-06-05
Also published as: JP4332323B2

Abstract

PROBLEM TO BE SOLVED: To reduce "diffusion" of voice spectrum due to a windowing function applied for the purpose of obtaining minute element pieces and to realize voice synthesis of high tone quality.

SOLUTION: At a spectrum correction filter constituting step S4, a spectrum correction filter is constituted based on waveform data acquired at a waveform data acquiring step S2. On the other hand, at a minute element piece segmenting step S5, minute element pieces are segmented by the windowing function and the waveform data acquired at the step S2. Then at a minute element piece spectrum correcting step S6, spectrum correction processing by the spectrum correction filter is performed to each minute element piece. At a rhythm changing step S7, the minute element pieces of corrected spectrum are rearranged in order to realize desired rhythm. At a waveform superposing step S8, rearranged minute element pieces are superposed to obtain synthesized voice waveform data.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、音声を合成する音声合成装置および方法に関する。
【０００２】
【従来の技術】
従来より、所望の合成音声を得るための音声合成方法として、あらかじめ収録し蓄えられた音声素片を複数の微細素片に分割し、分割の結果得られた微細素片の再配置を行って所望の合成音声を得る方法がある。これら微細素片の再配置において、微細素片に対して間隔変更・繰り返し・間引き等の処理が行われることにより、所望の時間長・基本周波数を持つ合成音声が得られる。
【０００３】
図１０は、音声波形を微細素片に分割する方法を模式的に示した図である。図１０に示された音声波形は、切り出し窓関数（以下、窓関数）によって微細素片に分割される。このとき、有声音の部分（音声波形の後半部）では原音声のピッチ間隔に同期した窓関数が用いられる。一方、無声音の部分では、適当な間隔の窓関数が用いられる。
【０００４】
そして、図１０に示すようにこれらの微細素片を間引いて用いることにより音声の継続時間長を短縮することができる。一方、これらの微細素片を繰り返して用いれば、音声の継続時間長を伸長することができる。更に、図１０に示すように、有声音の部分では、微細素片の間隔を詰めることにより合成音声の基本周波数を上げることが可能となる。一方、微細素片の間隔を広げることにより合成音声の基本周波数を下げることが可能である。
【０００５】
以上のような繰り返し・間引き・間隔変更を行なって再配置された微細素片を再び重畳することにより所望の合成音声が得られる。なお、音声素片を収録・蓄積する単位としては、音素やＣＶ・ＶＣあるいはＶＣＶといった単位が用いられる。ＣＶ・ＶＣは音素内に素片境界を置いた単位、ＶＣＶは母音内に素片境界を置いた単位である。
【０００６】
【発明が解決しようとする課題】
しかしながら、上記従来法においては、音声波形から微細素片を得るために窓関数が適用されることにより、音声のスペクトルに所謂「ぼやけ」が生じてしまう。すなわち、音声のホルマントが広がったりスペクトル包絡の山谷が曖昧になる等の現象が起こり、合成音声の音質が低下することになる。
【０００７】
本発明は上記の課題に鑑みてなされたものであり、微細素片を得るために適用した窓関数による音声のスペクトルの「ぼやけ」を軽減し、高音質な音声合成を実現することを目的とする。
【０００８】
【課題を解決するための手段】
上記の目的を達成するための本発明による音声合成方法は、
音声波形データと窓関数とから微細素片を取得する取得工程と、
前記取得工程で取得された微細素片を、合成時の韻律を変更するべく再配置する再配置工程と、
前記再配置工程で再配置された微細素片を重畳して得られる重畳波形データに基づいて合成音声波形データを出力する合成工程と、
前記取得工程で処理される音声波形データに基づいて構成されたスペクトル補正フィルタを、前記取得工程、再配置工程、合成工程を含む処理の過程において作用させる補正工程とを備える。
【０００９】
また、上記の目的を達成するための本発明による音声合成装置は以下の構成を備える。すなわち、
音声波形データと窓関数とから微細素片を取得する取得手段と、
前記取得手段で取得された微細素片を、合成時の韻律を変更するべく再配置する再配置手段と、
前記再配置手段で再配置された微細素片を重畳して得られる重畳波形データに基づいて合成音声波形データを出力する合成手段と、
前記取得手段で処理される音声波形データに基づいて構成されたスペクトル補正フィルタを、前記取得手段、再配置手段、合成手段を含む処理の過程において作用させる補正手段とを備える。
【００１０】
また、本発明によれば、上記音声合成方法或は音声合成装置に好適な音声合成用の辞書生成方法が提供される。
【００１１】
【発明の実施の形態】
以下、添付の図面を参照して本発明の好適な実施形態のいくつかについて詳細に説明する。
【００１２】
〈第１実施形態〉
図１は第１実施形態におけるハードウェア構成を示すブロック図である。
【００１３】
図１において、１１は中央処理装置であり、数値演算・制御等の処理を行なう。特に、中央処理装置１１は、以下に説明する手順に従った音声合成処理を実行する。１２は出力装置であり、中央処理装置１１の制御下でユーザに対して各種の情報を提示する。１３はタッチパネル或はキーボード等を備えた入力装置であり、ユーザが本装置に対して動作の指示を与えたり、各種の情報を入力するのに用いられる。１４は音声を出力する音声出力装置であり音声合成された内容を出力する。
【００１４】
１５はディスク装置や不揮発メモリ等の記憶装置であり、音声合成用辞書５０１等が保持される。１６は読み取り専用の記憶装置であり、本実施形態の音声合成処理の手順や、必要な固定的データが格納される。１７はＲＡＭ等の一時情報を保持する記憶装置であり、一時的なデータや各種フラグ等が保持される。以上の各構成（１１〜１７）は、バス１８によって接続されている。なお、本実施形態ではＲＯＭ１６に音声合成処理のための制御プログラムが格納され、中央処理装置１１がこれを実行する形態とするが、そのような制御プログラムを外部記憶装置１５に格納しておき、実行に際してＲＡＭ１７にロードするような形態としてもよい。
【００１５】
以上のような構成を備えた本実施形態の音声出力装置の動作について、図２及び図３を参照して以下に説明する。図２は第１実施形態による音声出力処理を説明するフローチャートである。また、図３は第１実施形態の音声合成処理の様子を表す図である。
【００１６】
まず、韻律目標値取得ステップＳ１において、合成音声の目標韻律値を取得する。合成音声の目標韻律値は、歌声合成の様に直接上位モジュールから与えられる場合もあれば、何らかの手段を用いて推定される場合もある。例えば、テキストからの音声合成であるならばテキストの言語解析結果より推定される。
【００１７】
次に、波形データ取得ステップＳ２において、合成音声の元となる波形データ（図３の音声波形３０１）を取得する。そして、音響分析ステップＳ３において、線形予測（ＬＰＣ）分析・ケプストラム分析・一般化ケプストラム分析等の音響分析を取得した波形データについて行い、スペクトル補正フィルタ３０４を構成するのに必要なパラメータを計算する。なお波形データの分析は、ある定められた時間間隔で行なっても良いし、ピッチ同期分析を行なっても良い。
【００１８】
次に、スペクトル補正フィルタ構成ステップＳ４において、前記音響分析ステップＳ３で計算されたパラメータを用いてスペクトル補正フィルタを構成する。例えば、前記音響分析にｐ次の線形予測分析を用いた場合には、以下の［数１］式で表される特性を持ったフィルタをスペクトル補正フィルタ３０４として用いる。なお、［数１］式を用いる場合、上記パラメータ計算においては線形予測係数α_ｊが算出されることになる。
【００１９】
【数１】

【００２０】
また、ｐ次のケプストラム分析を用いた場合には、以下の［数２］式で表される特性を持ったフィルタをスペクトル補正フィルタとして用いる。なお、［数２］式を用いる場合、上記パラメータ計算においてはケプストラム係数ｃ_ｊが算出されることになる。
【００２１】
【数２】

【００２２】
上記各式において、μ、γは適当な係数、αは線形予測係数、ｃはケプストラム係数である。あるいは、上記フィルタのインパルス応答を適当な次数で打ち切って構成した、以下の［数３］式で表されるＦＩＲフィルタが用いられる場合もある。なお、［数３］式を用いる場合、上記パラメータ計算においては係数β_ｊが計算されることになる。
【００２３】
【数３】

【００２４】
なお、実際には、上記の各式において、システムのゲインを考慮する必要がある。以上のようにして構成されたスペクトル補正フィルタは音声合成用辞書５０１に格納される（実際にはフィルタの係数を格納することになる）。
【００２５】
次に、微細素片切り出しステップＳ５において、前記波形データ取得ステップＳ２で取得した波形に窓関数３０２を適用し、微細素片３０３を切り出す。窓関数としてはハニング窓等が用いられる。
【００２６】
次に、微細素片スペクトル補正ステップＳ６において、微細素片切り出しステップＳ５で切り出した微細素片３０３に対して、スペクトル補正フィルタ構成ステップＳ４で構成されたフィルタ３０４を適用し、微細素片切り出しステップＳ５で切り出した微細素片のスペクトルを補正する。こうして、スペクトル補正された微細素片３０５が取得される。
【００２７】
次に、韻律変更ステップＳ７において、微細素片スペクトル補正ステップＳ６でスペクトル補正された微細素片３０５を、韻律目標値取得ステップＳ１で取得した韻律目標値に合致するように、間引き・繰り返し・間隔変更して再配置（３０６）する。そして波形重畳ステップＳ８において、韻律変更ステップＳ７で再配置した微細素片を重畳し、合成音声３０７を得る。なお、ステップＳ８で得られるのは音声素片であるので、実際の合成音声は波形重畳ステップＳ８で得られた複数の音声素片を接続して得られる。すなわち、音声出力ステップＳ９において、波形重畳ステップＳ８で得られた音声素片を接続して合成音声を出力する。
【００２８】
なお、微細素片の再配置処理に関して、「間引き」については、図３に示すようにスペクトル補正フィルタを作用させる前に実行するようにしてもよい。このようにすれば、不要な微細素片についてフィルタ処理を施すという無駄な処理を省くことができるからである。
【００２９】
〈第２実施形態〉
上記第１実施形態においてはスペクトル補正フィルタを音声合成時に構成しているが、スペクトル補正フィルタの構成を音声合成に先立って行い、フィルタを構成するための構成情報（フィルタ係数）を所定の記憶領域に保持しておくようにしてもよい。すなわち、第１実施形態のプロセスをデータ作成（図４）と音声合成（図５）の２つのプロセスに分離することが可能である。第２実施形態ではこの場合の処理について説明する。なお、本処理を実現するための装置構成は第１実施形態（図１）と同様である。また、本実施形態では、構成情報を音声合成用辞書５０１に格納することとする。
【００３０】
図４のフローチャートにおいて、ステップＳ２、Ｓ３、Ｓ４は第１実施形態（図２）と同様である。そして、スペクトル補正フィルタ記録ステップＳ１０１では、スペクトル補正フィルタ構成ステップＳ４で構成されたスペクトル補正フィルタのフィルタ係数を外部記憶装置１５に記録する。本実施形態では、音声合成用辞書５０１に登録された各波形データについてスペクトル補正フィルタを構成し、各波形データに対応するフィルタの係数をスペクトル補正フィルタとして音声合成用辞書５０１内に保持する。すなわち、第２実施形態の音声合成用辞書５０１には、各音声波形の波形データとスペクトル補正フィルタが登録されていることになる。
【００３１】
一方、音声合成時においては、図５のフローチャートに示されるように、第１実施形態の処理における音響分析ステップＳ３およびスペクトル補正フィルタ構成ステップＳ４が不要となり、代りにスペクトル補正フィルタ読込みステップＳ１０２が追加される。スペクトル補正フィルタ読込みステップＳ１０２では、スペクトル補正フィルタ記録ステップＳ１０１で記録したスペクトル補正フィルタ係数を読み込む。すなわち、波形データ取得ステップＳ２で取得された波形データに対応するスペクトル補正フィルタの係数を音声合成用辞書５０１から読み込んでスペクトル補正フィルタを構成する。そして、微細素片スペクトル補正ステップＳ６では、スペクトル補正フィルタ読込みステップＳ１０２で読込まれたスペクトル補正フィルタを用いて微細素片の処理が行われる。
【００３２】
以上のように、予め全ての波形データについてスペクトル補正フィルタを記録しておくことにより、音声合成時にスペクトル補正フィルタを構成する必要がなくなる。このため、第１実施形態に比べて音声合成時の処理量を軽減することが可能となる。
【００３３】
〈第３実施形態〉
上記第１及び第２実施形態では、スペクトル補正フィルタ構成ステップＳ４で構成されたフィルタを微細素片切り出しステップＳ５で切り出された微細素片に適用していた。しかし、スペクトル補正フィルタを前記波形データ取得ステップＳ２で取得した波形データ（音声波形３０１）に対して適用しても良い。第３実施形態ではこのようは音声合成処理について説明する。なお、本処理を実現するための装置構成は第１実施形態（図１）と同様である。
【００３４】
図６は第３実施形態による音声合成処理を説明するフローチャートである。図６において、波形データ取得ステップＳ２〜スペクトル補正フィルタ構成ステップＳ４の各ステップは上記第２実施形態と同様である。第３実施形態では、スペクトル補正フィルタ構成ステップＳ４によってスペクトル補正フィルタを構成した後、波形データスペクトル補正ステップＳ２０１において、波形データ取得ステップＳ２で取得した波形データに対してスペクトル補正フィルタ構成ステップＳ４で構成したスペクトル補正フィルタを適用し、波形データのスペクトルを補正する。
【００３５】
次に、スペクトル補正波形データ記録ステップＳ２０２において、波形データスペクトル補正ステップＳ２０１でスペクトル補正された波形データを記録する。すなわち、第２実施形態では、図１の音声合成用辞書５０１において、「スペクトル補正フィルタ」の代わりに「スペクトル補正された波形データ」が記憶されることになる。
【００３６】
一方、音声合成処理においては、図７のフローチャートに示される処理が実行される。第３実施形態では、上述の各実施形態における波形データ取得ステップＳ２の代りにスペクトル補正波形データ取得ステップＳ２０３が設けられる。これにより、スペクトル補正波形データ記録ステップＳ２０２で記録されたスペクトル補正後の波形データを、ステップＳ５における微細素片の切り出しの対象として取得させる。そして、この取得された波形データについて微細素片の切り出し、再配置が行なわれることで、スペクトル補正が施された合成音声を得ることになる。なお、スペクトル補正された波形データを用いるので、微細素片に対するスペクトル補正処理（第１、第２実施形態のステップＳ６）は不要となっている。
【００３７】
第３実施形態のように、微細素片ではなく波形データに対してスペクトル補正フィルタを適用した場合、微細素片切り出しステップＳ５にて用いられる窓関数の影響を完全に排除することは出来ない。すなわち、上記第１及び第２実施形態と比べて音質は若干劣ってしまう。しかし、スペクトル補正フィルタによるフィルタリングまでを音声合成に先立って行なうことが出来るため、音声合成時（図７）の処理量は第１、第２実施形態に比べて大幅に削減されるという特長がある。
【００３８】
尚、第３実施形態では、第２実施形態のように、データ作成と音声合成の２つのプロセスに分けた構成を説明したが、第１実施形態のように合成処理を実行する毎にフィルタリングを行なうように構成することもできる。この場合、図２のフローチャートにおいて、ステップＳ４とステップＳ５の間で合成処理対象の波形データにスペクトル補正フィルタを作用させることになる。また、ステップＳ６は不要となる。
【００３９】
〈第４実施形態〉
第１、第２実施形態では、スペクトル補正フィルタ構成ステップＳ４で構成されたフィルタを微細素片切り出しステップＳ５で切り出された微細素片に適用した。また、第３実施形態では、スペクトル補正フィルタ構成ステップＳ４で構成されたフィルタを、微細素片に切り出される前の波形データに適用した。これらに対して、スペクトル補正フィルタを波形重畳ステップＳ８で合成した合成音声の波形データに対して適用することもできる。第４実施形態ではこの場合の処理について説明する。なお、本処理を実現するための装置構成は第１実施形態（図１）と同様である。
【００４０】
図８は第４実施形態による音声合成処理を説明するフローチャートである。第１実施形態の処理（図２）と同様の処理には同一の参照番号が付されている。第４実施形態では、図８に示されるように、波形重畳ステップＳ８の後に合成音声スペクトル補正ステップＳ３０１を設け、微細素片スペクトル補正ステップＳ６を廃する。合成音声スペクトル補正ステップＳ３０１では、スペクトル補正フィルタ構成ステップＳ４において構成されたフィルタを、波形重畳ステップＳ８で得られた合成音声の波形データに適用し、スペクトル補正を行なう。
【００４１】
以上の第４実施形態によれば、韻律変更ステップＳ７の結果、同一微細素片の繰り返し回数が少ない場合等においては、第１実施形態に比べて処理量が少なくなる。
【００４２】
また、本実施形態においても、スペクトル補正フィルタをあらかじめ構成しておくことが可能な点は、第１及び第２実施形態との関係と同様である。即ち、予めフィルタ係数を音声合成用辞書５０１に格納しておき、音声合成時にはこれを読出してスペクトル補正用フィルタを構成し、ステップＳ８で波形重畳された波形データに作用させる。
【００４３】
〈第５実施形態〉
スペクトル補正フィルタとして、複数の部分フィルタの合成フィルタとして表現できる場合には、上記第１〜第４実施形態のように１ステップでスペクトル補正を行なうのではなく、スペクトル補正を複数のステップに分散させることが可能となる。スペクトル補正の分散により、上記各実施形態と比べて、音質と処理量のバランスを柔軟に調節することが可能となる。第５実施形態では、このようにスペクトル補正フィルタを分散させて音声合成処理する場合について説明する。なお、本処理を実現するための装置構成は第１実施形態（図１）と同様である。
【００４４】
図９は第５実施形態による音声合成処理を説明するフローチャートである。図９に示されるように、まず、韻律目標値取得ステップＳ１〜スペクトル補正フィルタ構成ステップＳ４の処理を行なう。これらの処理は、上記第１〜第４実施形態におけるステップＳ１〜Ｓ４の処理と同様である。
【００４５】
次に、スペクトル補正フィルタ分解ステップＳ４０１で、スペクトル補正フィルタ構成ステップＳ４で構成されたスペクトル補正フィルタを２乃至３個の部分フィルタ（要素フィルタ）に分解する。例えば、前記音響分析にｐ次の線形予測分析を用いた場合のスペクトル補正フィルタＦ１（ｚ）は、分母多項式と分子多項式の積として、以下の［数４］式のように表現される。
【００４６】
【数４】

【００４７】
あるいは、以下の式のように分子・分母多項式を１次または２次の実係数多項式の積に因数分解することも可能である（以下の［数５］式は、ｐが偶数の場合を示したものである）。同様に、スペクトル補正フィルタにＦＩＲフィルタを使用した場合も、１次または２次の実係数多項式の積に因数分解することができる。すなわち、［数３］式を因数分解して、［数６］式のように表される。
【００４８】
【数５】

【数６】

【００４９】
また、ｐ次のケプストラム分析を用いた場合には、フィルタ特性は指数で表現されるため、［数７］式のようにケプストラム係数をグループ分けするだけで良い。
【００５０】
【数７】

【００５１】
次に、スペクトル補正フィルタ部分適用（１）ステップＳ４０２において、スペクトル補正フィルタ分解ステップＳ４０１で分解されたフィルタの１つを用いて、波形データ取得ステップＳ２で取得した波形データをフィルタリングする。すなわち、ステップＳ４０１で得られた複数のフィルタ要素のうちの一つである第１のフィルタ要素を用いて、微細素片切り出し前の波形データに対してスペクトル補正処理を施す。
【００５２】
次に、微細素片切り出しステップＳ５において、スペクトル補正フィルタ部分適用（１）ステップＳ４０２の結果として得られた波形データに対して窓関数を適用し、微細素片を切り出す。そして、スペクトル補正フィルタ部分適用（２）ステップＳ４０３において、スペクトル補正フィルタ分解ステップＳ４０１で分解されたフィルタの１つを用いて、微細素片切り出しステップＳ５で切り出された微細素片をフィルタリングする。すなわち、ステップＳ４０１で得られた複数のフィルタ要素のうちの一つである第２のフィルタ要素を用いて、切り出された各微細素片に対してスペクトル補正処理を施す。
【００５３】
次に、第１及び第２実施形態と同様に韻律変更ステップＳ７と波形重畳ステップＳ８を行なう。そして、スペクトル補正フィルタ部分適用（３）ステップＳ４０４において、スペクトル補正フィルタ分解ステップＳ４０１で分解されたフィルタの１つを用いて、波形重畳ステップＳ８の結果得られた合成音声をフィルタリングする。すなわち、ステップＳ４０１で得られた複数のフィルタ要素のうちの一つである第３のフィルタ要素を用いて、得られた合成音声の波形データに対してスペクトル補正処理を施す。
【００５４】
そして、音声出力ステップＳ９において、スペクトル補正フィルタ部分適用（３）ステップＳ４０４の結果得られた合成音声を出力する。
以上の構成において、例えば、［数５］式の分解を行った場合は、Ｆ_１，１（ｚ）をステップＳ４０２で、Ｆ_１，２（ｚ）をステップＳ４０３で、Ｆ_１，３（ｚ）をステップＳ４０４で用いるというようなことが可能である。
【００５５】
尚、［数４］式の様に、２要素の積に分割した場合にはステップＳ４０２，Ｓ４０３，Ｓ４０４のいずれかではフィルタリングを行わないことになる。すなわち、スペクトル補正フィルタ分解ステップＳ４０１においてスペクトル補正フィルタを２つに分解した場合（この例では、分母多項式と分子多項式の２つに分割している）には、スペクトル補正フィルタ部分適用（１）ステップＳ４０２、スペクトル補正フィルタ部分適用（２）ステップＳ４０３、スペクトル補正フィルタ部分適用（３）ステップＳ４０４のうちのいずれかは省略される。
【００５６】
また、第５実施形態においても、スペクトル補正フィルタや各要素フィルタをあらかじめ構成して音声合成用辞書５０１の一部として登録しておくようにしてもよいことは、第１及び第２実施形態の関連と同様、明らかである。
以上のように、第５の実施形態によれば、どの多項式（フィルタ）をどのステップ（Ｓ４０２，Ｓ４０３，Ｓ４０４）に割り当てるかという任意性があり、その割り当て方によって、音質・処理量の配分が変わってくる。特に、［数５］式や［数７］式、あるいはＦＩＲフィルタを因数分解した［数６］式の場合には、それぞれのステップに因数を何個ずつ割り当てるかまで制御できるので、さらに柔軟性があることになる。
【００５７】
〈その他の実施形態〉
上記各実施形態において、スペクトル補正フィルタ係数を直接記録するのではなく、ベクトル量子化等の手法を使って量子化した後に記録しても良い。これにより、外部記憶装置１５に記録されるデータ量を削減することが可能である。
【００５８】
このとき、音響分析の手法としてＬＰＣ分析や一般化ケプストラム分析を用いている場合には、フィルタ係数を線スペクトル対（ＬＳＰ）に変換した後に量子化を行なうと量子化の効率が良くなる。
【００５９】
また、波形のサンプリング周波数が高い場合には、帯域分割フィルタによって帯域分割を行い、帯域制限された個々の波形に対してスペクトル補正フィルタリングを行なっても良い。帯域分割によってスペクトル補正フィルタの次数が押えられ、計算量を削減する効果がある。メルケプストラムのような周波数軸の伸縮によっても同様の効果がある。
【００６０】
また、前記各実施形態で、スペクトル補正フィルタリングを行なうタイミングには、複数の選択肢があることを示した。どのタイミングでスペクトル補正フィルタリングを行なうか、あるいはスペクトル補正を行なうか行なわないかの選択を、素片毎に行なっても良い。選択のための情報として、音素種別や有声／無声の種別等を利用することができる。
なお、上記各実施形態において、スペクトル補正フィルタの一例としては、ホルマントを強調するホルマント強調フィルタが挙げられる。
【００６１】
また、本発明の目的は、前述した実施形態の機能を実現するソフトウェアのプログラムコードを記録した記憶媒体を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記憶媒体に格納されたプログラムコードを読出し実行することによっても、達成されることは言うまでもない。
【００６２】
この場合、記憶媒体から読出されたプログラムコード自体が前述した実施形態の機能を実現することになり、そのプログラムコードを記憶した記憶媒体は本発明を構成することになる。
【００６３】
プログラムコードを供給するための記憶媒体としては、例えば、フレキシブルディスク，ハードディスク，光ディスク，光磁気ディスク，ＣＤ−ＲＯＭ，ＣＤ−Ｒ，磁気テープ，不揮発性のメモリカード，ＲＯＭなどを用いることができる。
【００６４】
また、コンピュータが読出したプログラムコードを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのプログラムコードの指示に基づき、コンピュータ上で稼働しているＯＳ（オペレーティングシステム）などが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。
【００６５】
さらに、記憶媒体から読出されたプログラムコードが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書込まれた後、そのプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。
【００６６】
【発明の効果】
以上説明したように、本発明によれば、微細素片を得るために適用した窓関数による、音声のスペクトルの「ぼやけ」を軽減することができ、音質が高い音声合成を実現することができる。
【図面の簡単な説明】
【図１】第１実施形態におけるハードウェア構成を示すブロック図である。
【図２】第１実施形態による音声出力処理を説明するフローチャートである。
【図３】第１実施形態の音声合成処理の様子を表す図である。
【図４】第２実施形態による音声出力処理におけるスペクトル補正フィルタ登録処理を説明するフローチャートである。
【図５】第２実施形態による音声出力処理における音声合成処理を説明するフローチャートである。
【図６】第３実施形態による音声出力処理におけるスペクトル補正フィルタ登録処理を説明するフローチャートである。
【図７】第３実施形態による音声出力処理における音声合成処理を説明するフローチャートである。
【図８】第４実施形態による音声出力処理を説明するフローチャートである。
【図９】第５実施形態による音声出力処理を説明するフローチャートである。
【図１０】音声波形の微細素片への分割、再配置、合成による音声合成方法を模式的に示した図である。
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speech synthesis apparatus and method for synthesizing speech.
[0002]
[Prior art]
Conventionally, one speech synthesis method for obtaining desired synthesized speech divides a speech unit, recorded and stored in advance, into a plurality of fine segments and then rearranges the resulting fine segments. In this rearrangement, processing such as interval change, repetition, and thinning is applied to the fine segments, so that synthesized speech with the desired duration and fundamental frequency is obtained.
[0003]
FIG. 10 is a diagram schematically showing a method of dividing a speech waveform into fine segments. The speech waveform shown in FIG. 10 is divided into fine segments by a cutout window function (hereinafter, a window function). At this time, a window function synchronized with the pitch interval of the original voice is used in the voiced portion (the latter half of the voice waveform). On the other hand, in the unvoiced portion, a window function with an appropriate interval is used.
[0004]
Then, as shown in FIG. 10, the duration of the speech can be shortened by thinning out these fine segments. Conversely, the duration can be extended by using these fine segments repeatedly. Further, as shown in FIG. 10, in the voiced portion the fundamental frequency of the synthesized speech can be raised by narrowing the interval between fine segments, and lowered by widening it.
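The rearrangement described in [0004] can be illustrated with a minimal sketch. It assumes a constant pitch period and places each fine segment by its start position; the function name `plan_rearrangement` and its parameters are hypothetical and do not appear in the patent.

```python
def plan_rearrangement(num_segments, duration_scale, source_period, target_period):
    """Return (source_index, output_position) pairs for the rearranged segments.

    duration_scale > 1 repeats segments (longer speech); < 1 thins them out.
    target_period < source_period packs segments closer together (higher F0).
    All names here are illustrative assumptions, not taken from the patent.
    """
    # Desired output duration in samples, then the number of segments
    # that fit into it at the new spacing.
    out_length = num_segments * source_period * duration_scale
    out_count = max(1, round(out_length / target_period))
    plan = []
    for k in range(out_count):
        # Map each output slot back to a source segment; this mapping is
        # what produces thinning (skipped indices) or repetition (reused ones).
        src = min(num_segments - 1, int(k * num_segments / out_count))
        plan.append((src, k * target_period))
    return plan


if __name__ == "__main__":
    # Halving the duration at the same pitch keeps every other segment.
    print(plan_rearrangement(4, 0.5, 100, 100))  # [(0, 0), (2, 100)]
```

With `target_period` set to half of `source_period` and `duration_scale=1.0`, the same sketch places twice as many segments at half the spacing, which corresponds to raising the fundamental frequency while keeping the duration.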
[0005]
The desired synthesized speech is obtained by superimposing again the fine segments that have been rearranged through the repetition, thinning, and interval changes described above. Speech units are recorded and stored in units such as phonemes, CV/VC, or VCV. CV and VC are units whose boundaries are placed within phonemes, and VCV is a unit whose boundaries are placed within vowels.
[0006]
[Problems to be solved by the invention]
However, in the conventional method described above, applying a window function to obtain fine segments from the speech waveform causes so-called "blurring" of the speech spectrum. That is, the formants of the speech broaden and the peaks and valleys of the spectral envelope become indistinct, which degrades the sound quality of the synthesized speech.
[0007]
The present invention has been made in view of the above problem, and its object is to reduce the "blurring" of the speech spectrum caused by the window function applied to obtain fine segments, and thereby to realize high-quality speech synthesis.
[0008]
[Means for Solving the Problems]
To achieve the above object, a speech synthesis method according to the present invention comprises:
an obtaining step of obtaining fine segments from speech waveform data and a window function;
a rearrangement step of rearranging the fine segments obtained in the obtaining step so as to change the prosody at the time of synthesis;
a synthesizing step of outputting synthesized speech waveform data based on superimposed waveform data obtained by superimposing the fine segments rearranged in the rearrangement step; and
a correction step of applying a spectrum correction filter, configured based on the speech waveform data processed in the obtaining step, in the course of processing that includes the obtaining step, the rearrangement step, and the synthesizing step.
[0009]
Further, a speech synthesis apparatus according to the present invention for achieving the above object has the following configuration. That is, it comprises:
acquisition means for acquiring fine segments from speech waveform data and a window function;
rearrangement means for rearranging the fine segments acquired by the acquisition means so as to change the prosody at the time of synthesis;
synthesizing means for outputting synthesized speech waveform data based on superimposed waveform data obtained by superimposing the fine segments rearranged by the rearrangement means; and
correction means for applying a spectrum correction filter, configured based on the speech waveform data processed by the acquisition means, in the course of processing that includes the acquisition means, the rearrangement means, and the synthesizing means.
[0010]
Further, according to the present invention, there is provided a dictionary generation method for speech synthesis suitable for the above speech synthesis method or speech synthesis apparatus.
[0011]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, some preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.
[0012]
<First embodiment>
FIG. 1 is a block diagram illustrating a hardware configuration according to the first embodiment.
[0013]
In FIG. 1, reference numeral 11 denotes a central processing unit which performs processing such as numerical calculation and control. In particular, the central processing unit 11 executes a speech synthesis process according to a procedure described below. An output device 12 presents various information to the user under the control of the central processing unit 11. An input device 13 includes a touch panel or a keyboard, and is used by a user to give an operation instruction to the device and to input various information. Reference numeral 14 denotes a voice output device for outputting voice, and outputs voice-synthesized contents.
[0014]
Reference numeral 15 denotes a storage device such as a disk device or a non-volatile memory, which holds a speech synthesis dictionary 501 and the like. Reference numeral 16 denotes a read-only storage device, which stores the procedure of the speech synthesis process of the present embodiment and necessary fixed data. Reference numeral 17 denotes a storage device such as a RAM that holds temporary information, including temporary data and various flags. The above components (11 to 17) are connected by a bus 18. In this embodiment, the control program for the speech synthesis process is stored in the ROM 16 and executed by the central processing unit 11. Alternatively, the control program may be stored in the external storage device 15 and loaded into the RAM 17 at the time of execution.
[0015]
The operation of the speech output apparatus according to the present embodiment having the above-described configuration will be described below with reference to FIGS. 2 and 3. FIG. 2 is a flowchart illustrating the speech output processing according to the first embodiment. FIG. 3 is a diagram illustrating the speech synthesis process according to the first embodiment.
[0016]
First, in a prosody target value obtaining step S1, a target prosody value of a synthesized speech is obtained. The target prosody value of the synthesized speech may be provided directly from a higher-level module as in singing voice synthesis, or may be estimated using some means. For example, in the case of speech synthesis from text, it is estimated from the result of language analysis of text.
[0017]
Next, in a waveform data acquisition step S2, waveform data (speech waveform 301 in FIG. 3) serving as the source of the synthesized speech is acquired. Then, in the acoustic analysis step S3, acoustic analysis such as linear prediction (LPC) analysis, cepstrum analysis, or generalized cepstrum analysis is performed on the acquired waveform data, and the parameters necessary for configuring the spectrum correction filter 304 are calculated. The waveform data may be analyzed at predetermined time intervals, or pitch-synchronous analysis may be performed.
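As one illustration of the acoustic analysis in step S3, the following sketch estimates linear prediction coefficients with the autocorrelation method and the Levinson-Durbin recursion. This is a standard textbook procedure and is not stated to be the one used in the patent; the sign convention (x[n] predicted as the sum of a[j] * x[n - j]) and the function names are assumptions.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite frame x."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (a, error) where x[n] is predicted as sum_j a[j-1] * x[n-j].

    Assumed convention: predictor coefficients, not the coefficients of the
    inverse filter A(z). r must hold at least order + 1 autocorrelation lags.
    """
    a = [0.0] * (order + 1)      # a[0] is unused
    error = r[0]                 # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for this order.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)
    return a[1:], error


if __name__ == "__main__":
    # A decaying exponential behaves like a first-order autoregressive signal.
    frame = [0.9 ** n for n in range(200)]
    coeffs, err = levinson_durbin(autocorrelation(frame, 2), 2)
    print(coeffs)  # first coefficient close to 0.9, second close to 0
```

The frame here is synthetic. In a pitch-synchronous setting, the frame would instead be taken around each pitch mark.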
[0018]
Next, in a spectrum correction filter configuration step S4, a spectrum correction filter is configured using the parameters calculated in the acoustic analysis step S3. For example, when p-th order linear prediction analysis is used for the acoustic analysis, a filter having the characteristic represented by the following [Equation 1] is used as the spectrum correction filter 304. When [Equation 1] is used, the linear prediction coefficients α_j are calculated in the above parameter calculation.
[0019]
[Equation 1]

[0020]
Further, when p-th order cepstrum analysis is used, a filter having the characteristic represented by the following [Equation 2] is used as the spectrum correction filter. When [Equation 2] is used, the cepstrum coefficients c_j are calculated in the above parameter calculation.
[0021]
[Equation 2]

[0022]
In the above equations, μ and γ are appropriate coefficients, α is a linear prediction coefficient, and c is a cepstrum coefficient. Alternatively, an FIR filter expressed by the following [Equation 3], constructed by truncating the impulse response of the above filter at an appropriate order, may be used. When [Equation 3] is used, the coefficients β_j are calculated in the above parameter calculation.
[0023]
[Equation 3]

[0024]
In practice, in each of the above equations, it is necessary to consider the gain of the system. The spectrum correction filter configured as described above is stored in the speech synthesis dictionary 501 (actually, the filter coefficients are stored).
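The filter construction in [0018] to [0024] can be sketched as follows. The equation images for [Equation 1] to [Equation 3] are not reproduced in this text, so the transfer function used here, A(z/γ)/A(z/μ) with A(z) = 1 - Σ a_j z^(-j), is an assumed, commonly used formant-emphasis form and not necessarily the patent's exact expression. The assignment of γ to the numerator and μ to the denominator, the function names, and the omission of gain normalization are likewise assumptions.

```python
def emphasis_filter(a, gamma, mu):
    """Return (numerator, denominator) of A(z/gamma) / A(z/mu).

    a holds predictor coefficients a_1..a_p with A(z) = 1 - sum_j a_j z^-j.
    This pole-zero form is an assumption made for illustration only.
    """
    num = [1.0] + [-aj * gamma ** (j + 1) for j, aj in enumerate(a)]
    den = [1.0] + [-aj * mu ** (j + 1) for j, aj in enumerate(a)]
    return num, den


def iir_filter(num, den, x):
    """Direct-form difference equation; den[0] is assumed to be 1."""
    y = []
    for n in range(len(x)):
        acc = sum(num[k] * x[n - k] for k in range(len(num)) if n - k >= 0)
        acc -= sum(den[k] * y[n - k] for k in range(1, len(den)) if n - k >= 0)
        y.append(acc)
    return y


def truncated_impulse_response(num, den, taps):
    """FIR approximation: the first `taps` samples of the impulse response."""
    impulse = [1.0] + [0.0] * (taps - 1)
    return iir_filter(num, den, impulse)


if __name__ == "__main__":
    num, den = emphasis_filter([0.9], gamma=0.5, mu=0.8)
    print(truncated_impulse_response(num, den, 3))  # approx. [1.0, 0.27, 0.1944]
```

The truncated response plays the role of the FIR coefficients that would be stored in the dictionary. How many taps are "appropriate" is left open by the text and is a tuning choice.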
[0025]
Next, in a fine segment extracting step S5, the window function 302 is applied to the waveform acquired in the waveform data acquiring step S2 to extract a fine segment 303. A Hanning window or the like is used as the window function.
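A minimal sketch of the extraction in step S5 is given below. It assumes that pitch marks are already known and uses a fixed window half-width for every mark; in practice the width would follow the local pitch period. The names `hann_window` and `extract_fine_segments` are hypothetical.

```python
import math


def hann_window(length):
    """Symmetric Hann (Hanning) window of the given length."""
    if length == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def extract_fine_segments(waveform, pitch_marks, half_width):
    """Cut one windowed segment centered on each pitch mark.

    Samples outside the waveform are treated as zero. The fixed half_width
    is a simplifying assumption of this sketch.
    """
    window = hann_window(2 * half_width + 1)
    segments = []
    for mark in pitch_marks:
        segment = []
        for i, w in enumerate(window):
            n = mark - half_width + i
            sample = waveform[n] if 0 <= n < len(waveform) else 0.0
            segment.append(w * sample)
        segments.append(segment)
    return segments


if __name__ == "__main__":
    segments = extract_fine_segments([1.0] * 20, [5, 10], half_width=2)
    print(len(segments), len(segments[0]))  # 2 5
```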
[0026]
Next, in the fine segment spectrum correction step S6, the filter 304 configured in the spectrum correction filter configuration step S4 is applied to the fine segments 303 extracted in the fine segment extraction step S5, thereby correcting their spectrum. In this way, spectrum-corrected fine segments 305 are obtained.
[0027]
Next, in the prosody changing step S7, the fine segments 305 whose spectrum was corrected in the fine segment spectrum correction step S6 are rearranged (306) by thinning, repetition, and interval change so as to match the prosody target value acquired in the prosody target value acquisition step S1. Then, in the waveform superimposing step S8, the fine segments rearranged in the prosody changing step S7 are superimposed to obtain synthesized speech 307. Note that what is obtained in step S8 is a speech unit, so the actual synthesized speech is obtained by connecting a plurality of speech units obtained in the waveform superimposing step S8. That is, in the speech output step S9, the speech units obtained in the waveform superimposing step S8 are connected and output as synthesized speech.
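The superposition in step S8 amounts to an overlap-add of the rearranged fine segments. The sketch below assumes each segment is given together with the output index of its first sample; the function name `overlap_add` is hypothetical.

```python
def overlap_add(segments, positions):
    """Sum segments into one waveform; positions[i] is the start of segments[i]."""
    length = max(pos + len(seg) for seg, pos in zip(segments, positions))
    out = [0.0] * length
    for seg, pos in zip(segments, positions):
        for i, value in enumerate(seg):
            out[pos + i] += value   # overlapping regions are added together
    return out


if __name__ == "__main__":
    # Two triangular segments overlapping by one sample.
    print(overlap_add([[1.0, 2.0, 1.0], [1.0, 2.0, 1.0]], [0, 2]))
    # [1.0, 2.0, 2.0, 2.0, 1.0]
```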
[0028]
Regarding the rearrangement of the fine segments, "thinning" may be executed before the spectrum correction filter is applied, as shown in FIG. 3. This avoids the wasted work of filtering fine segments that will not be used.
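The saving described in [0028] can be sketched by filtering only the segments that survive thinning. The sketch assumes an FIR spectrum correction filter and a list of source indices chosen by the prosody change. Caching repeated segments so that each is filtered once is an additional assumption that the text does not state.

```python
def fir_filter(coeffs, segment):
    """Full convolution; output length is len(segment) + len(coeffs) - 1."""
    out = [0.0] * (len(segment) + len(coeffs) - 1)
    for i, s in enumerate(segment):
        for k, c in enumerate(coeffs):
            out[i + k] += c * s
    return out


def correct_needed_segments(segments, used_indices, coeffs):
    """Filter only segments referenced by used_indices (hypothetical helper).

    Segments dropped by thinning are never filtered, and a segment that is
    repeated in the output is filtered once and then reused.
    """
    corrected = {}
    for idx in used_indices:
        if idx not in corrected:
            corrected[idx] = fir_filter(coeffs, segments[idx])
    return corrected


if __name__ == "__main__":
    segs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    result = correct_needed_segments(segs, [0, 2, 2], [1.0, -0.5])
    print(sorted(result))  # [0, 2]
```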
[0029]
<Second embodiment>
In the first embodiment, the spectrum correction filter is configured at the time of speech synthesis. However, the spectrum correction filter may instead be configured prior to speech synthesis, with the configuration information (filter coefficients) for configuring the filter held in a predetermined storage area. That is, the process of the first embodiment can be separated into two processes: data creation (FIG. 4) and speech synthesis (FIG. 5). The second embodiment describes the processing in this case. Note that the device configuration for implementing this processing is the same as in the first embodiment (FIG. 1). In the present embodiment, the configuration information is stored in the speech synthesis dictionary 501.
[0030]
In the flowchart of FIG. 4, steps S2, S3, and S4 are the same as in the first embodiment (FIG. 2). Then, in the spectrum correction filter recording step S101, the filter coefficient of the spectrum correction filter configured in the spectrum correction filter configuration step S4 is recorded in the external storage device 15. In the present embodiment, a spectrum correction filter is configured for each waveform data registered in the speech synthesis dictionary 501, and a coefficient of a filter corresponding to each waveform data is stored in the speech synthesis dictionary 501 as a spectrum correction filter. That is, in the speech synthesis dictionary 501 of the second embodiment, the waveform data of each speech waveform and the spectrum correction filter are registered.
[0031]
On the other hand, at the time of speech synthesis, as shown in the flowchart of FIG. 5, the acoustic analysis step S3 and the spectrum correction filter configuration step S4 of the first embodiment are no longer required, and a spectrum correction filter reading step S102 is added in their place. In the spectrum correction filter reading step S102, the spectrum correction filter coefficients recorded in the spectrum correction filter recording step S101 are read. That is, the coefficients of the spectrum correction filter corresponding to the waveform data acquired in the waveform data acquisition step S2 are read from the speech synthesis dictionary 501 to configure the spectrum correction filter. Then, in the fine segment spectrum correction step S6, the fine segments are processed using the spectrum correction filter read in the spectrum correction filter reading step S102.
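The split between data creation (FIG. 4) and speech synthesis (FIG. 5) can be sketched with a simple in-memory dictionary. The class `DictionaryEntry`, the helper names, and the use of a Python `dict` keyed by unit name are illustrative assumptions; the patent specifies only that waveform data and filter coefficients are registered together.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    """One registered speech unit: waveform plus its filter coefficients."""
    waveform: list
    filter_coeffs: list = field(default_factory=list)


def build_dictionary(waveforms, design_filter):
    """Data creation: design and record one filter per registered waveform.

    design_filter is any callable mapping a waveform to filter coefficients,
    standing in for steps S3 and S4.
    """
    return {name: DictionaryEntry(wave, design_filter(wave))
            for name, wave in waveforms.items()}


def load_for_synthesis(dictionary, name):
    """Synthesis: read the waveform and stored coefficients, with no analysis."""
    entry = dictionary[name]
    return entry.waveform, entry.filter_coeffs


if __name__ == "__main__":
    dummy_design = lambda wave: [1.0, -0.5]   # placeholder for S3 and S4
    d = build_dictionary({"ka": [0.1, 0.2]}, dummy_design)
    print(load_for_synthesis(d, "ka"))  # ([0.1, 0.2], [1.0, -0.5])
```

The design choice illustrated here is that the analysis cost is paid once per registered waveform during data creation, so the synthesis path only performs a lookup.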
[0032]
As described above, by storing the spectrum correction filters for all the waveform data in advance, it is not necessary to configure the spectrum correction filters at the time of speech synthesis. For this reason, it is possible to reduce the processing amount at the time of speech synthesis as compared with the first embodiment.
[0033]
<Third embodiment>
In the first and second embodiments, the filter configured in the spectrum correction filter configuration step S4 is applied to the fine element cut out in the fine element extraction step S5. However, a spectrum correction filter may be applied to the waveform data (audio waveform 301) acquired in the waveform data acquisition step S2. In the third embodiment, such a speech synthesis process will be described. Note that the device configuration for implementing this processing is the same as in the first embodiment (FIG. 1).
[0034]
FIG. 6 is a flowchart illustrating a speech synthesis process according to the third embodiment. In FIG. 6, the steps from the waveform data acquisition step S2 through the spectrum correction filter configuration step S4 are the same as in the second embodiment. In the third embodiment, after the spectrum correction filter is configured in the spectrum correction filter configuration step S4, the waveform data spectrum correction step S201 applies that filter to the waveform data acquired in the waveform data acquisition step S2, thereby correcting the spectrum of the waveform data.
[0035]
Next, in a spectrum correction waveform data recording step S202, the waveform data whose spectrum was corrected in the waveform data spectrum correction step S201 is recorded. That is, in the third embodiment, "spectrum-corrected waveform data" is stored in the speech synthesis dictionary 501 of FIG. 1 instead of the "spectrum correction filter".
[0036]
On the other hand, in the speech synthesis processing, the processing shown in the flowchart of FIG. 7 is executed. In the third embodiment, a spectrum correction waveform data acquisition step S203 is provided instead of the waveform data acquisition step S2 of the above embodiments. As a result, the spectrum-corrected waveform data recorded in the spectrum correction waveform data recording step S202 is acquired as the target from which fine segments are cut out in step S5. Fine segments are then cut out from the acquired waveform data and rearranged, yielding synthesized speech to which spectrum correction has been applied. Since spectrum-corrected waveform data is used, the spectrum correction process for the fine segments (step S6 in the first and second embodiments) is not required.
[0037]
When the spectrum correction filter is applied to the waveform data rather than to the fine segments, as in the third embodiment, the effect of the window function used in the fine segment extraction step S5 cannot be completely eliminated. That is, the sound quality is slightly inferior to that of the first and second embodiments. However, because everything up to and including filtering by the spectrum correction filter can be performed prior to speech synthesis, the processing amount at the time of speech synthesis (FIG. 7) is greatly reduced compared with the first and second embodiments.
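One way to see why the placement of the filter matters is that windowing and filtering do not commute. The toy example below compares filtering before windowing (the order of the third embodiment) with windowing before filtering (the order of the first embodiment). The signal, window, and coefficients are arbitrary illustrative values, and the output is truncated to the input length for simplicity.

```python
def fir_same(coeffs, x):
    """Causal FIR filter with the output truncated to len(x)."""
    return [sum(coeffs[k] * x[n - k] for k in range(len(coeffs)) if n - k >= 0)
            for n in range(len(x))]


def apply_window(window, x):
    """Sample-by-sample product of a window and a signal of equal length."""
    return [w * v for w, v in zip(window, x)]


signal = [1.0, 1.0, 1.0, 1.0, 1.0]
window = [0.0, 0.5, 1.0, 0.5, 0.0]
coeffs = [1.0, -0.5]

# Third-embodiment order: correct the waveform first, then cut out the segment.
filter_then_window = apply_window(window, fir_same(coeffs, signal))
# First-embodiment order: cut out the segment first, then correct it.
window_then_filter = fir_same(coeffs, apply_window(window, signal))

if __name__ == "__main__":
    print(filter_then_window)   # [0.0, 0.25, 0.5, 0.25, 0.0]
    print(window_then_filter)   # [0.0, 0.5, 0.75, 0.0, -0.25]
```

In the first ordering the window is applied after the correction, so whatever spectral effect the window has is still present in the segment. In the second, the filter acts on the already windowed segment.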
[0038]
The third embodiment has been described with the processing divided into two processes, data creation and speech synthesis, as in the second embodiment. However, it can also be configured so that filtering is performed every time the synthesis process is executed, as in the first embodiment. In this case, in the flowchart of FIG. 2, the spectrum correction filter is applied to the waveform data to be synthesized between steps S4 and S5, and step S6 becomes unnecessary.
[0039]
<Fourth embodiment>
In the first and second embodiments, the filter configured in the spectrum correction filter configuration step S4 is applied to the fine element cut out in the fine element extraction step S5. In the third embodiment, the filter configured in the spectrum correction filter configuration step S4 is applied to waveform data before being cut out into fine pieces. On the other hand, a spectrum correction filter can be applied to the waveform data of the synthesized voice synthesized in the waveform superimposing step S8. In the fourth embodiment, processing in this case will be described. Note that the device configuration for implementing this processing is the same as in the first embodiment (FIG. 1).
[0040]
FIG. 8 is a flowchart illustrating a speech synthesis process according to the fourth embodiment. The same processes as those in the first embodiment (FIG. 2) are denoted by the same reference numerals. In the fourth embodiment, as shown in FIG. 8, a synthesized speech spectrum correction step S301 is provided after the waveform superposition step S8, and the fine unit spectrum correction step S6 is omitted. In the synthesized voice spectrum correction step S301, the filter configured in the spectrum correction filter configuration step S4 is applied to the waveform data of the synthesized voice obtained in the waveform superposition step S8 to perform spectrum correction.
[0041]
According to the fourth embodiment described above, the processing amount is smaller than in the first embodiment in cases such as when the prosody changing step S7 results in few repetitions of the same fine segment.
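The comparison in [0041] can be made concrete with a rough count of how many samples pass through the filter. This is a simplified sketch: it ignores filter order, assumes every source segment is filtered once in the first embodiment, and assumes a constant output spacing. Both function names are hypothetical.

```python
def samples_filtered_per_segment(num_source_segments, segment_length):
    """First embodiment: every extracted fine segment is filtered once."""
    return num_source_segments * segment_length


def samples_filtered_after_overlap_add(num_output_segments, target_period,
                                       segment_length):
    """Fourth embodiment: only the superimposed output waveform is filtered."""
    return (num_output_segments - 1) * target_period + segment_length


if __name__ == "__main__":
    # 10 source segments of 200 samples, output spacing of 100 samples.
    print(samples_filtered_per_segment(10, 200))             # 2000
    print(samples_filtered_after_overlap_add(10, 100, 200))  # 1100 (no repetition)
    print(samples_filtered_after_overlap_add(30, 100, 200))  # 3100 (heavy repetition)
```

Under these assumptions, filtering after superposition is cheaper when few segments are repeated, because overlapping segments are filtered once as a single waveform. It becomes more expensive when heavy repetition makes the output much longer than the source.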
[0042]
Also in this embodiment, the spectrum correction filter can be configured in advance, in the same way that the second embodiment relates to the first. That is, the filter coefficients are stored in the speech synthesis dictionary 501 in advance and read out at the time of speech synthesis to configure the spectrum correction filter, which is then applied to the waveform data superimposed in step S8.
[0043]
<Fifth embodiment>
When the spectrum correction filter can be expressed as a composition of a plurality of partial filters, the spectrum correction need not be performed in a single step as in the first to fourth embodiments; it can instead be distributed over a plurality of steps. Distributing the spectrum correction makes it possible to adjust the balance between sound quality and processing amount more flexibly than in the above embodiments. In the fifth embodiment, a case will be described in which the spectrum correction filter is distributed in this way during speech synthesis processing. Note that the device configuration for implementing this processing is the same as in the first embodiment (FIG. 1).
[0044]
FIG. 9 is a flowchart illustrating a speech synthesis process according to the fifth embodiment. As shown in FIG. 9, first, the processing of the prosody target value acquisition step S1 to the spectrum correction filter configuration step S4 is performed. These processes are the same as the processes of steps S1 to S4 in the first to fourth embodiments.
[0045]
Next, in a spectrum correction filter decomposition step S401, the spectrum correction filter configured in the spectrum correction filter configuration step S4 is decomposed into two or three partial filters (element filters). For example, when p-th order linear prediction analysis is used for the acoustic analysis, the spectrum correction filter F1(z) is expressed as the product of a numerator-polynomial part and a denominator-polynomial part, as in [Equation 4] below.
[0046]
(Equation 4)
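The Python sketch below illustrates the idea behind this two-way decomposition for a generic pole-zero filter B(z)/A(z): applying the numerator part and then the denominator part gives the same output as applying the whole filter in one pass. The coefficient values and function names are hypothetical and are not taken from the equations of this document.

```python
def fir_filter(b, x):
    """Numerator part B(z): y[n] = sum_k b[k] * x[n-k]."""
    return [sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
            for n in range(len(x))]


def all_pole_filter(a, x):
    """Denominator part 1/A(z) with a[0] == 1: y[n] = x[n] - sum_{k>=1} a[k] * y[n-k]."""
    y = []
    for n in range(len(x)):
        acc = x[n] - sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y


def pole_zero_filter(b, a, x):
    """Whole filter B(z)/A(z) applied in a single pass."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y


# Hypothetical coefficients and input samples.
b = [1.0, -0.9, 0.4]
a = [1.0, -0.5, 0.2]
x = [1.0, 0.5, -0.25, 0.0, 0.75, -1.0, 0.3, 0.0]

one_pass = pole_zero_filter(b, a, x)
two_pass = all_pole_filter(a, fir_filter(b, x))
max_diff = max(abs(u - v) for u, v in zip(one_pass, two_pass))
print(max_diff)  # differs from zero only by floating-point rounding
```

Because both parts are linear and time-invariant, they can also be applied in the opposite order, which is what allows each part to be assigned to a different processing step.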

[0047]
Alternatively, the numerator and denominator polynomials can each be factorized into a product of first-order or second-order polynomials with real coefficients, as in the following equation ([Equation 5] below shows the case where p is an even number). Similarly, when an FIR filter is used as the spectrum correction filter, it can be factorized into a product of first-order or second-order polynomials with real coefficients. That is, Expression [3] is factorized and expressed as Expression [6].
[0048]
(Equation 5)

(Equation 6)
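As a minimal sketch of how such factors can be regrouped into element filters, the Python code below starts from second-order real-coefficient factors that are assumed to be already available (the root finding needed to obtain them is omitted) and checks that any grouping of the factors multiplies back to the same overall polynomial. The factor values and function names are hypothetical.

```python
def poly_mul(p, q):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def poly_product(factors):
    """Product of a list of polynomials (empty product is 1)."""
    result = [1.0]
    for f in factors:
        result = poly_mul(result, f)
    return result


# Hypothetical second-order factors of the form 1 + c1*z^-1 + c2*z^-2.
sections = [[1.0, -0.6, 0.25], [1.0, 0.3, 0.5], [1.0, -1.1, 0.49]]

full = poly_product(sections)            # the undivided sixth-order FIR filter
element_1 = poly_product(sections[:1])   # one factor assigned to one step
element_2 = poly_product(sections[1:])   # two factors assigned to another step
regrouped = poly_mul(element_1, element_2)
print(full)
print(regrouped)
```

Changing the slice boundaries changes how many factors each element filter carries, without changing the overall filter.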

[0049]
When p-th order cepstrum analysis is used, the filter characteristic is expressed as an exponential, so the decomposition only requires grouping the cepstrum coefficients, as shown in Expression [7].
[0050]
(Equation 7)
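The grouping works because the exponential of a sum equals the product of the exponentials. The Python sketch below checks this numerically, assuming a filter of the form exp(sum of c_j z^-j); that form, the coefficient values, and the function name are assumptions for illustration and are not taken from the equations of this document.

```python
import cmath
import math


def exp_filter_response(coeffs, omega):
    """Frequency response of exp(sum_j coeffs[j] * z^-j) at z = e^{i*omega}.

    `coeffs` maps each lag j to its coefficient.
    """
    s = sum(c * cmath.exp(-1j * omega * j) for j, c in coeffs.items())
    return cmath.exp(s)


# Hypothetical coefficients, split into two groups by lag.
cepstrum = {1: 0.30, 2: -0.12, 3: 0.05, 4: 0.02}
group_a = {j: c for j, c in cepstrum.items() if j <= 2}
group_b = {j: c for j, c in cepstrum.items() if j > 2}

omega = 2 * math.pi * 0.1
whole = exp_filter_response(cepstrum, omega)
product = exp_filter_response(group_a, omega) * exp_filter_response(group_b, omega)
print(abs(whole - product))  # differs from zero only by floating-point rounding
```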

[0051]
Next, in the spectrum correction filter partial application (1) step S402, the waveform data obtained in the waveform data acquisition step S2 is filtered using one of the filters produced in the spectrum correction filter decomposition step S401. That is, spectrum correction is performed on the waveform data before fine segment extraction, using the first element filter, which is one of the plurality of element filters obtained in step S401.
[0052]
Next, in the fine segment extraction step S5, a window function is applied to the waveform data obtained as a result of the spectrum correction filter partial application (1) step S402 to cut out fine segments. Then, in the spectrum correction filter partial application (2) step S403, the fine segments cut out in the fine segment extraction step S5 are filtered using one of the filters produced in the spectrum correction filter decomposition step S401. That is, spectrum correction is performed on each cut-out fine segment, using the second element filter, which is one of the plurality of element filters obtained in step S401.
[0053]
Next, the prosody changing step S7 and the waveform superposition step S8 are performed as in the first and second embodiments. Then, in the spectrum correction filter partial application (3) step S404, the synthesized speech obtained as a result of the waveform superposition step S8 is filtered using one of the filters produced in the spectrum correction filter decomposition step S401. That is, spectrum correction is performed on the obtained synthesized speech waveform data, using the third element filter, which is one of the plurality of element filters obtained in step S401.
[0054]
Then, in the speech output step S9, the synthesized speech obtained as a result of the spectrum correction filter partial application (3) step S404 is output.
In the above configuration, for example, when the decomposition of Expression [5] is performed, F_{1,1}(z) can be used in step S402, F_{1,2}(z) in step S403, and F_{1,3}(z) in step S404.
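The flow of steps S402, S5, S403, S7, S8, and S404 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: all three element filters are FIR filters, a Hann window of fixed length is used, and the segment centers are given directly. The function names and all numeric values are hypothetical.

```python
import math


def fir(h, x):
    """Causal FIR filtering, output truncated to len(x)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def hann(n):
    """Hann window of n points."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def cut_segments(x, centers, half):
    """S5: cut windowed fine segments of length 2*half+1 around each center."""
    w = hann(2 * half + 1)
    segments = []
    for c in centers:
        seg = [(x[c - half + i] if 0 <= c - half + i < len(x) else 0.0) * w[i]
               for i in range(2 * half + 1)]
        segments.append(seg)
    return segments


def overlap_add(segments, new_centers, half, length):
    """S7 and S8: place segments at their new centers and superimpose them."""
    y = [0.0] * length
    for seg, c in zip(segments, new_centers):
        for i, v in enumerate(seg):
            n = c - half + i
            if 0 <= n < length:
                y[n] += v
    return y


def synthesize(x, centers, new_centers, half, f1, f2, f3):
    pre = fir(f1, x)                                   # S402: first element filter
    segments = cut_segments(pre, centers, half)        # S5
    segments = [fir(f2, s) for s in segments]          # S403: second element filter
    y = overlap_add(segments, new_centers, half, len(x))  # S7, S8
    return fir(f3, y)                                  # S404: third element filter


x = [math.sin(0.3 * n) for n in range(64)]
centers = [8, 24, 40, 56]
new_centers = [8, 20, 32, 44]   # narrower spacing, as when raising the pitch
base = synthesize(x, centers, new_centers, 8, [1.0], [1.0], [1.0])
late = synthesize(x, centers, new_centers, 8, [1.0], [1.0], [2.0])
early = synthesize(x, centers, new_centers, 8, [2.0], [1.0], [1.0])
```

Passing `[1.0]` for one of `f1`, `f2`, or `f3` corresponds to skipping filtering at that step.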
[0055]
Note that when the filter is decomposed into a product of two elements as in Expression [4], filtering is not performed in one of steps S402, S403, and S404. That is, if the spectrum correction filter is decomposed into two element filters in the spectrum correction filter decomposition step S401 (in this example, into a denominator polynomial and a numerator polynomial), one of the spectrum correction filter partial application (1) step S402, the spectrum correction filter partial application (2) step S403, and the spectrum correction filter partial application (3) step S404 is omitted.
[0056]
Also in the fifth embodiment, the spectrum correction filter and each element filter may be configured in advance and registered as part of the speech synthesis dictionary 501, as is evident from the relationship between the first and second embodiments.
As described above, according to the fifth embodiment, there is a choice as to which polynomial (filter) is assigned to which step (S402, S403, S404), and the balance between sound quality and processing amount changes depending on the assignment. In particular, in the case of Expressions [5] and [7], or Expression [6] obtained by factorizing the FIR filter, the number of factors assigned to each step can be controlled, which provides even more flexibility.
[0057]
<Other embodiments>
In each of the above embodiments, instead of directly recording the spectrum correction filter coefficient, the spectrum correction filter coefficient may be recorded after being quantized using a method such as vector quantization. Thus, the amount of data recorded in the external storage device 15 can be reduced.
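As a minimal sketch of the vector quantization mentioned here, the Python code below replaces a coefficient vector with the index of its nearest codebook entry, so that only the index needs to be recorded. The codebook contents, the coefficient values, and the function name are hypothetical; how the codebook is trained is not covered.

```python
def nearest_code_index(vector, codebook):
    """Return the index of the codebook entry closest in squared Euclidean distance."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: dist2(vector, codebook[i]))


# Hypothetical codebook of two-dimensional coefficient vectors.
codebook = [[0.9, -0.4], [0.5, 0.1], [-0.2, 0.3]]
coeffs = [0.55, 0.05]

index = nearest_code_index(coeffs, codebook)  # recorded instead of coeffs
restored = codebook[index]                    # looked up again at synthesis time
print(index, restored)
```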
[0058]
At this time, when LPC analysis or generalized cepstrum analysis is used as the acoustic analysis method, if the quantization is performed after converting the filter coefficients into a line spectrum pair (LSP), the quantization efficiency is improved.
[0059]
When the sampling frequency of the waveform is high, band division may be performed by a band division filter, and spectrum correction filtering may be performed on each band-limited waveform. Band division keeps the order of the spectrum correction filter low, which reduces the amount of calculation. The same effect can be obtained by warping the frequency axis, as in the mel cepstrum.
[0060]
Further, in each of the embodiments, it has been described that there are a plurality of options for the timing of performing the spectrum correction filtering. The timing at which spectrum correction filtering is performed or whether spectrum correction is performed or not may be selected for each unit. As information for selection, a phoneme type, a voiced / unvoiced type, or the like can be used.
In each of the above embodiments, an example of the spectrum correction filter is a formant emphasis filter that emphasizes formants.
[0061]
Needless to say, the object of the present invention is also achieved by supplying a system or an apparatus with a storage medium storing the program code of software that realizes the functions of the above-described embodiments, and by having a computer (or CPU or MPU) of the system or apparatus read and execute the program code stored in the storage medium.
[0062]
In this case, the program code itself read from the storage medium realizes the function of the above-described embodiment, and the storage medium storing the program code constitutes the present invention.
[0063]
As a storage medium for supplying the program code, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a magnetic tape, a nonvolatile memory card, a ROM, and the like can be used.
[0064]
Needless to say, the functions of the above-described embodiments are realized not only when the computer executes the read program code, but also when an OS (operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and that processing realizes the functions of the above-described embodiments.
[0065]
Needless to say, the invention also covers the case where the program code read from the storage medium is written into a memory provided in a function expansion board inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided in the function expansion board or function expansion unit performs part or all of the actual processing based on the instructions of the program code, and that processing realizes the functions of the above-described embodiments.
[0066]
[Effect of the Invention]
As described above, according to the present invention, it is possible to reduce the "blurring" of the speech spectrum caused by the window function applied to obtain fine segments, and to realize speech synthesis with high sound quality.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a hardware configuration according to a first embodiment.
FIG. 2 is a flowchart illustrating audio output processing according to the first embodiment.
FIG. 3 is a diagram illustrating a state of a speech synthesis process according to the first embodiment.
FIG. 4 is a flowchart illustrating a spectrum correction filter registration process in the audio output process according to the second embodiment.
FIG. 5 is a flowchart illustrating a speech synthesis process in a speech output process according to a second embodiment.
FIG. 6 is a flowchart illustrating a spectrum correction filter registration process in the audio output process according to the third embodiment.
FIG. 7 is a flowchart illustrating a speech synthesis process in a speech output process according to a third embodiment.
FIG. 8 is a flowchart illustrating an audio output process according to a fourth embodiment.
FIG. 9 is a flowchart illustrating audio output processing according to a fifth embodiment.
FIG. 10 is a diagram schematically illustrating a speech synthesis method in which a speech waveform is divided into fine segments, which are then rearranged and synthesized.

Claims

A speech synthesis method comprising:
An obtaining step of obtaining fine segments from speech waveform data using a window function,
A rearrangement step of rearranging the fine segments obtained in the obtaining step so as to change the prosody at the time of synthesis,
A synthesizing step of outputting synthesized speech waveform data based on superimposed waveform data obtained by superimposing the fine segments rearranged in the rearrangement step, and
A correction step of applying a spectrum correction filter, configured based on the speech waveform data processed in the obtaining step, within the processing sequence that includes the obtaining step, the rearrangement step, and the synthesizing step.

The correcting step includes:
Having a configuration step of configuring a spectrum correction filter based on the audio waveform data processed in the obtaining step,
The speech synthesis method according to claim 1, wherein the spectrum correction filter configured in the configuration step is applied to the fine segment acquired in the acquisition step.

Further comprising a speech synthesis dictionary in which configuration information for a spectrum correction filter is registered for each item of speech waveform data,
The speech synthesis method according to claim 1, wherein the correction step acquires, from the speech synthesis dictionary, the configuration information corresponding to the speech waveform data processed in the obtaining step, configures a spectrum correction filter from it, and applies the spectrum correction filter to the fine segment obtained in the obtaining step.

The voice synthesizing method according to claim 1, wherein the correcting step applies a spectrum correction filter configured based on the voice waveform data to the voice waveform data processed in the obtaining step.

Further comprising a speech synthesis dictionary in which speech waveform data that has been spectrum-corrected by applying a spectrum correction filter configured based on each item of speech waveform data is registered,
The voice synthesizing method according to claim 1, wherein the correcting step provides the voice waveform data after spectrum correction to the obtaining step.

The speech synthesis method according to claim 1, wherein in the correction step, the spectrum correction filter is applied to the superimposed waveform data obtained in the synthesis step.

Further comprising a configuration step of configuring a spectrum correction filter based on the speech waveform data processed in the obtaining step and decomposing it into a plurality of element filters,
The speech synthesis method according to claim 1, wherein the correcting step applies each of the plurality of element filters obtained in the configuration step at a plurality of points in the processing sequence that includes the obtaining step, the rearranging step, and the synthesizing step.

The configuration step decomposes the spectrum correction filter into first to third element filters,
The speech synthesis method according to claim 7, wherein the correction step applies the first element filter to the speech waveform data processed in the obtaining step, applies the second element filter to the fine segment obtained in the obtaining step, and applies the third element filter to the superimposed waveform data obtained in the synthesis step.

Further comprising a speech synthesis dictionary in which a spectrum correction filter configured based on the speech waveform data is registered after being decomposed into a plurality of element filters,
The speech synthesis method according to claim 1, wherein the correction step acquires, from the speech synthesis dictionary, the plurality of element filters corresponding to the speech waveform data processed in the obtaining step, and applies each of the plurality of element filters at a plurality of points in the processing sequence that includes the obtaining step, the rearrangement step, and the synthesis step.

The speech synthesis method according to any one of claims 1 to 9, wherein the rearrangement of the fine segments cut out by the window function is at least one of a change of the interval between the fine segments, a repetition of the fine segments, and a thinning of the fine segments.

A method for generating a dictionary for speech synthesis in which speech waveform data is registered,
A generation step of generating a spectrum correction filter based on the voice waveform data registered in the voice synthesis dictionary;
A registration step of registering the spectrum correction filter generated in the generation step in association with the audio waveform data.

A method for generating a dictionary for speech synthesis in which speech waveform data is registered,
A first generation step of generating a spectrum correction filter based on each of the voice waveform data registered in the voice synthesis dictionary;
A second generation step of causing the spectrum correction filter to act on the corresponding voice waveform data to generate voice waveform data after spectrum correction;
A registration step of registering the spectrum-corrected audio waveform data generated in the second generation step in a dictionary.

A method for generating a dictionary for speech synthesis in which speech waveform data is registered,
A decomposition step of generating a spectrum correction filter based on the voice waveform data registered in the voice synthesis dictionary and decomposing this into a plurality of element filters;
A registration step of registering the plurality of element filters generated in the decomposition step in association with the audio waveform data.

The dictionary generation method according to claim 13, wherein the decomposing step includes factorizing a characteristic polynomial representing a spectrum correction filter and converting the spectrum correction filter into a product of element filters.

The dictionary generation method according to claim 13, wherein the decomposition step converts the spectrum correction filter into a product of element filters by approximating the spectrum correction filter with a filter expressed by a polynomial and factorizing the polynomial.

A speech synthesis apparatus comprising:
Obtaining means for obtaining fine segments from speech waveform data using a window function,
Rearrangement means for rearranging the fine segments obtained by the obtaining means so as to change the prosody at the time of synthesis,
Synthesizing means for outputting synthesized speech waveform data based on superimposed waveform data obtained by superimposing the fine segments rearranged by the rearrangement means, and
Correction means for applying a spectrum correction filter, configured based on the speech waveform data processed by the obtaining means, within the processing sequence performed by the obtaining means, the rearrangement means, and the synthesizing means.

An apparatus for generating a dictionary for speech synthesis in which speech waveform data is registered,
Generating means for generating a spectrum correction filter based on the voice waveform data registered in the voice synthesis dictionary;
Registration means for registering the spectrum correction filter generated by the generating means in association with the voice waveform data.

An apparatus for generating a dictionary for speech synthesis in which speech waveform data is registered,
First generating means for generating a spectrum correction filter based on each of the voice waveform data registered in the voice synthesis dictionary;
Second generating means for causing the spectrum correction filter to act on the corresponding voice waveform data to generate voice waveform data after spectrum correction;
Registration means for registering in the dictionary the spectrum-corrected voice waveform data generated by the second generating means.

An apparatus for generating a dictionary for speech synthesis in which speech waveform data is registered,
Decomposition means for generating a spectrum correction filter based on the voice waveform data registered in the voice synthesis dictionary and decomposing the filter into a plurality of element filters;
Registering means for registering the plurality of element filters generated by the decomposing means in association with the audio waveform data.

A control program for causing a computer to execute the speech synthesis method according to claim 1.

A control program for causing a computer to execute the dictionary generation method according to claim 12.

A storage medium storing the control program according to claim 20.