JP2004341259A - Speech segment expanding and contracting device and its method

Info

Publication number: JP2004341259A
Application number: JP2003137957A
Authority: JP
Inventors: Yumiko Kato (加藤 弓子); Takahiro Kamai (釜井 孝浩); Katsuyoshi Yamagami (山上 勝義); Yoshifumi Hirose (▲ひろ▼瀬 良文); Natsuki Saito (齋藤 夏樹)
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2003-05-15
Filing date: 2003-05-15
Publication date: 2004-12-02

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesis system which exhibits little speech quality deterioration and is capable of adjusting the duration of a speech segment according to a variety of speaking states.

SOLUTION: A speech synthesis system 10 expands or contracts the temporal length of speech segments for speech synthesis and is equipped with: a robustness calculation part 150 which calculates, for each state of the sound model corresponding to each speech segment, the robustness of the speech segment data to time-length expansion or contraction, by using a speech segment database 110a including border information on the speech segments and a sound model database 130 using a statistical method having a plurality of states for each speech segment; and a frame time length expansion/contraction rate calculation part 200 which, for the speech segments for speech synthesis, expands or contracts the temporal length of each part of the speech segment corresponding to each state on the basis of the ratio of the calculated robustness of the individual states.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、規則合成方式による音声合成に関し、特に、音声素片の継続時間長を制御する音声素片伸縮方法に関する。
【０００２】
【従来の技術】
機械的な方法等でテキストデータから人工的に音声を生成し出力する音声合成の一つの方法として、「あ」や「い」等の音声素片（以下、素片ともいう。）にイントネーションやアクセント等の韻律情報を与え、さまざまな合成規則を用いて音声を生成する規則合成方式がある。規則合成方式においては、外部から与えられる、あるいは韻律制御手段で生成された音韻毎に設定された継続時間長に合わせて、音声素片の継続時間長を制御する必要がある。自然音声では、発話の速度、意味内容、前後の音韻などの影響で、音韻の継続時間長が様々に変化しているので、自然音声に近い高品質な音声を合成するためには、任意の継続時間長の音声を合成器によって合成することが可能でなければならない。
【０００３】
規則合成方式の音声合成における音声素片の継続時間長に関する制御方式として、従来、１音韻内で素片の波形をピッチに同期して均一な割合で間引き、繰り返しを行う、あるいは、一定時間長のフレームを均一な割合で間引き、繰り返しを行う、あるいは、フレーム時間長を均一な割合で伸縮する等の制御方式がある。これによって、基本時間長の音声素片から、任意の継続時間長の音声素片を生成するというものである。
【０００４】
また、素片フレームの時間位置と生成しようとするフレームの時間位置との関係をマッピング関数として設定することで、素片内の時間伸縮の影響を受けにくい部分でフレーム時間長を伸縮する方式も提案されている（たとえば、特許文献１参照）。これによって、より自然で任意の継続時間長の音声素片を生成するというものである。
【０００５】
【特許文献１】
特開平１０−９１１９１号公報
【０００６】
【発明が解決しようとする課題】
しかしながら、均一な割合での間引きや繰り返し、あるいはフレーム長の伸縮を行う上記従来の方式は、音素境界近傍の渡り情報のように、時間あたりの変化量が音韻知覚のための重要な情報となっている部分における時間あたりの変化量を変えることになり、音質を大きく劣化させる。例えば、図１４（ａ）に示されるような「おんそ（音素）」という標準の音声パターン（ここでは、基本周波数の時間変化）の継続時間長を均一に伸張することによって、図１４（ｂ）に示される音声パターンを生成するが、音素境界近傍９０ａおよびｂ、９１ａおよびｂにおける時間あたりの変化量が変化してしまうために、音素境界近傍の渡り情報が異なるものとなってしまい、不自然な音声が生成されてしまう。
【０００７】
一方、上記特許文献１に開示された方式では、マッピング関数を用いて音声素片の継続時間長を制御しているために、図１４（ｃ）に示されるように、音素境界近傍９２ａおよびｂにおける時間あたりの変化量を変えることなく継続時間長を伸張しているものの、そのマッピング関数の生成方式については自動化されることなく、経験的に探り当てられているために、多様な発話状態に対応することができないという問題がある。つまり、音声の変化パターンは、音声の状態や発話様態、話者の個人特性等により異なるため、少なくとも素片データベースごとにマッピング関数を生成する必要があるが、上記特許文献１の方式は、そのような多様なマッピング関数を生成する機能を有していないために、多様な発話状態に対応することができず、例えば、２０歳代の女性の音声だけについての音声合成を行う等の特定種類の音声合成にしか適用することができないという課題を有している。
【０００８】
そこで、本発明は、上記の課題に鑑み、音質劣化が少なく、かつ、多様な発話状態に対応して音声素片の継続時間長を調整することが可能な音声素片伸縮装置等を提供することを目的とする。つまり、時間長調整による音質劣化が少なく、かつ、時間長調整に用いるマッピング関数を生成するための素片データベースの作成作業が軽減される音声素片伸縮装置等を提供することを目的とする。
【０００９】
【課題を解決するための手段】
上記の目的を達成するために、本発明にかかる音声素片伸縮装置は、音声合成のために音声素片の時間長を伸張または圧縮する音声素片伸縮装置であって、音声素片の境界情報を伴った複数の音声素片データと、音声素片ごとに複数個の状態を持つ統計的手法を用いた音響モデルとから、各音声素片に対応する音響モデルの各状態ごとに、音声素片データの時間長伸縮に対する頑健性を算出する頑健性算出手段と、算出された前記各状態の頑健性の比に基づいて、音声合成のための音声素片に対して、前記各状態に対応する音声素片各部の時間長を伸縮する伸縮手段とを備えることを特徴とする。
【００１０】
これによって、音声素片に対応する音響モデルの各状態ごとに、時間長伸縮に対する頑健性が算出され、音声合成時においては、その頑健性の比に基づいて、音声素片各部の時間長が伸縮されるので、例えば、音声素片内で頑健性が高い箇所の伸縮率を大きく、頑健性が低い箇所の伸縮率を小さくするように伸縮率を設定することで、音素境界のような音韻知覚上重要な箇所での時間当たりの変化量を変えることなく音声素片の時間長を伸縮することができ、時間長伸縮による音質劣化が回避される。また、そのような頑健性の情報は、音声素片データと音響モデルとから生成されるので、準備された音声素片データに最適な時間長伸縮が可能になるとともに、時間長伸縮に用いるマッピング関数の作成作業が自動化される。
【００１１】
ここで、前記頑健性算出手段は、前記複数の音声素片データについて、各音声素片に対応する音響モデルの各状態ごとに、時間幅のばらつきを特定し、特定した時間幅のばらつきを前記頑健性として算出してもよい。具体的には、前記頑健性算出手段は、前記複数の音声素片データについて、音声素片ごとに、音声素片データを構成する各フレームを前記各状態に対応づけ、各状態に対応づけられたフレーム数の標準偏差を前記頑健性として算出してもよい。なお、前記頑健性算出手段は、前記各状態に対応づけられたフレーム数の平均を算出し、算出した平均を用いて各状態の音声素片内での相対位置である音素内相対位置を特定し、特定した音素内相対位置と前記頑健性との対を頑健性データとして生成し、前記伸縮手段は、前記頑健性算出手段によって生成された音素内相対位置と前記頑健性との対から定まる音素内相対位置と頑健性との関係に従って、前記音声素片各部に対応する頑健性を特定し、特定した頑健性に基づいて、前記時間長を伸縮するのが好ましい。
【００１２】
これによって、音声素片に対応する音響モデルの状態ごとに、複数の音声素片データにおける時間幅のばらつきが頑健性として特定されるので、音響モデルの各状態に対応する時間幅のばらつきに対応した時間長伸縮が行われ、音声素片の時間長伸縮に伴う音質劣化が抑制される。なお、音素内相対位置は、例えば、音声素片の先頭からの時間位置（相対時間位置）である。
【００１３】
また、前記音響モデルは、前記音声素片データを用いた学習によって生成されたモデルであってもよいし、前記音響モデルは、不特定多数の話者が発生した音声からなる音声素片データを用いて生成されたモデルであってもよい。これによって、特定の話者に絞った最適な音声素片の時間長伸縮を行ったり、特定の話者の音声素片データと汎用の音響モデルとの組み合わせに基づく音声素片の時間長伸縮を行ったりすることができる。
【００１４】
また、前記音響モデルは、隠れマルコフモデルとし、前記頑健性算出手段は、前記複数の音声素片データを用いて前記音響モデルを生成する過程における音声素片データごとの遷移系列における前記各状態の停留回数のばらつきを特定し、特定した停留回数のばらつきを前記頑健性として算出してもよい。具体的には、前記頑健性算出手段は、前記停留回数の分散を前記頑健性として算出してもよい。
【００１５】
このとき、前記頑健性算出手段は、前記音響モデルを生成する過程における学習信号の各時刻、各状態に至る前向き確率または後ろ向き確率の計算において、学習系列における最尤状態遷移系列を求め、求めた最尤状態遷移系列における前記各状態の停留回数のばらつきを前記頑健性として算出してもよい。なお、前記頑健性算出手段は、前記各状態に対応づけられた停留回数の平均を算出し、算出した平均を用いて各状態の音声素片内での相対位置である音素内相対位置を特定し、特定した音素内相対位置と前記頑健性との対を頑健性データとして生成し、前記伸縮手段は、前記頑健性算出手段によって生成された音素内相対位置と前記頑健性との対から定まる音素内相対位置と頑健性との関係に従って、前記音声素片各部に対応する頑健性を特定し、特定した頑健性に基づいて、前記時間長を伸縮するのが好ましい。
【００１６】
これによって、音声素片ごとに、音響モデルを生成する過程における遷移系列の各状態での停留回数のばらつきが頑健性として特定されるので、各状態における停留回数のばらつきに対応した時間長伸縮が行われ、音声素片の時間長伸縮に伴う音質劣化が抑制される。
【００１７】
また、前記音声素片データの話者と前記音声合成のための音声素片の話者とを同一にしたり、前記音声素片データに前記音声合成のための音声素片を含ませてもよい。これによって、頑健性を算出するために用いられる音声素片データと音声合成に用いられる音声素片データとが関連づけられ、音声素片の時間長伸縮に伴う音質劣化が更に抑制され得る。
【００１８】
また、前記伸縮手段は、前記頑健性算出手段によって生成された音素内相対位置と前記頑健性との対を複数個用いて補間することで、前記音声素片各部に対応する頑健性を特定してもよい。具体的には、直線補間、スプライン補間、２次関数または３次関数による近似によって、音声素片各部の頑健性を特定し、伸縮率を決定すればよい。
【００１９】
なお、本発明は、以上のような音声素片伸縮装置として実現することができるだけでなく、その手段をステップとする音声素片伸縮方法として実現したり、音声素片伸縮装置としてコンピュータを機能させるプログラムとして実現することもできる。そして、そのようなプログラムは、ＣＤ−ＲＯＭ等の記録媒体やインターネット等の伝送媒体を介して配信することができるのは言うまでもない。
【００２０】
【発明の実施の形態】
以下、本発明の実施の形態について、図面を参照しながら詳細に説明する。
【００２１】
（実施の形態１）
図１は、本発明の実施の形態１における音声合成システム１０の構成を示す機能ブロック図である。この音声合成システム１０は、音素内の相対時間位置に対する時間長伸縮に対する頑健性を予め算出しておき、その頑健性に基づいて音声合成を行うシステムであり、オフラインで学習を行うことによって頑健性を算出するオフライン処理装置１０ａと、算出された頑健性のデータを用いてテキストデータから音声合成を行う音声合成装置１０ｂとから構成される。なお、「時間長伸縮における頑健性」とは、音声素片の継続時間長を伸縮した場合における音質劣化の生じにくさをいい、「頑健性が高い」とは、その音声素片の継続時間長を伸縮しても音質劣化が生じにくいことを意味する。
【００２２】
オフライン処理装置１０ａは、コンピュータ上で実行されるプログラム等で実現される３つの処理部（音響モデル学習部１２０、マッチング部１４０および頑健性計算部１５０）とデータが格納されたハードディスク等で実現される２つのデータベース（音声素片データベース１１０ａおよび音響モデルデータベース１３０）とから構成される。
【００２３】
音声素片データベース１１０ａは、音素境界がラベルされた音声データ（発音記号列、基本周波数、音声強度、音素時間長等を含む素片データの集まり）を格納しているデータベースである。音響モデル学習部１２０は、音声素片データベース１１０ａに格納された音声データに基づいて、ＨＭＭ（隠れマルコフモデル）の学習を行うことにより、音素ごとの音響モデルを作成し、音響モデルデータベース１３０に格納する。音響モデルデータベース１３０は、音響モデル学習部１２０より出力された音素ごとの音響モデルを格納する補助記憶部である。
【００２４】
マッチング部１４０は、音響モデルデータベース１３０に格納された学習が完了した音響モデルと音声素片データベース１１０ａから抽出した素片データとを比較し、ＤＰ（ＤｙｎａｍｉｃＰｒｏｇｒａｍｍｉｎｇ）マッチングを行うことで、音声素片データベース１１０ａに格納された素片データごとに、素片データを構成するフレームと学習モデルを構成する状態との対応関係を特定し、特定した対応関係を頑健性計算部１５０に通知する。
【００２５】
頑健性計算部１５０は、マッチング部１４０が出力する素片中の各フレームと音響モデルの状態との対応関係を示す情報を音響モデルごとに集計することで、音素内相対時間位置における時間長伸縮に対する頑健性を算出し、算出した頑健性を音声合成装置１０ｂに出力する。
【００２６】
一方、音声合成装置１０ｂは、コンピュータ上で実行されるプログラム等で実現される６つの処理部（言語処理部１６０、韻律生成部１７０、素片選択部１８０、フレーム時間長伸縮率計算部２００、フレーム変形部２１０および波形生成部２２０）とデータが格納されたハードディスク等で実現される２つのデータベース（頑健性データベース１９０および音声素片データベース１１０ｂ）とから構成される。
【００２７】
言語処理部１６０は、入力されたテキストデータの言語解析を行い、形態素や構文等の言語情報と音素列等の読み情報を生成する。韻律生成部１７０は、言語処理部１６０より出力された言語情報と読み情報に基づいて、合成音声の韻律を決定し、基本周波数、パワー、リズム等韻律情報を生成する。音声素片データベース１１０ｂは、波形を生成する際の処理単位で実音声から生成された音声合成パラメータを保持するデータベースである。素片選択部１８０は、韻律生成部１７０から出力された韻律情報と言語情報、読み情報に基づいて、音声素片データベース１１０ｂより最適な音声素片を選択する。
【００２８】
頑健性データベース１９０は、オフライン処理装置１０ａ（の頑健性計算部１５０）から出力された頑健性データ、つまり、音声素片の種類ごとの音素内相対時間位置における時間長伸縮に対する頑健性を示すデータを記憶するデータベースである。フレーム時間長伸縮率計算部２００は、頑健性データベース１９０内のデータに基づいて素片の処理単位であるフレームごとに時間長の伸縮率を算出する。フレーム変形部２１０は、フレーム時間長伸縮率計算部２００で計算された伸縮率に基づいて、素片選択部１８０で選択された音声素片の各フレームの時間長を伸縮させる。波形生成部２２０は、フレーム変形部２１０で伸縮された各フレームを接続し、フレーム毎のパラメータに基づいて信号処理を行うことで、音声波形を生成する。
【００２９】
次に、以上のように構成された本実施の形態１における音声合成システム１０の動作を説明する。
図２は、音声合成システム１０のオフライン処理装置１０ａの動作手順を示すフローチャートである。図３および図４は、その動作を説明するための図である。
【００３０】
まず、音響モデル学習部１２０は、音声素片データベース１１０ａに蓄積された音素境界がラベルされた音声データを学習データとして、ＨＭＭを用いた１音素６状態での音響モデルを音素ごとに生成し（Ｓ１０１）、得られたＨＭＭパラメータを音響モデルとして音響モデルデータベース１３０に記録する。
【００３１】
次に、マッチング部１４０は、ＤＰマッチングの手法を用いて音声素片データベース１１０ａ内のすべての音素について、音響モデルデータベース１３０に記録された音響モデルのうち対応する音素の６状態を抽出し、各音響モデルに素片データをマッチングさせることで、素片の各フレームと音響モデルの６つの状態との対応関係を特定する（Ｓ１０２）。図３には、その対応関係の例が示されている。ここでは、素片Ａフレーム列では、状態１が最初の２フレームに対応し、状態２が続く３フレームに対応し、状態３が続く３フレームに対応し、状態４が続く３フレームに対応し、状態５が続く３フレームに対応し、同様にして、素片Ｂフレーム列、素片Ｃフレーム列、素片Ｄフレーム列についても音響モデルの６つの状態とフレームとの対応関係が特定されている。
【００３２】
次に、頑健性計算部１５０は、上記ステップＳ１０２で特定された音素ごとの対応関係を蓄積し、音素ごとに、６つの状態それぞれに対応付けられたフレームを集計することで、フレーム数の平均および標準偏差を算出する（Ｓ１０３）。図３には、状態１〜４について、フレーム数の平均および標準偏差として、それぞれ、２および０、３．２５および０．５０、３．０および０．８２、２．５および１．２９が算出されている。
【００３３】
続いて、頑健性計算部１５０は、上記ステップＳ１０３で求めたフレーム数の平均より、各状態の中心時間位置を音素先頭からのフレーム位置として求めることで、音素全体の時間長に対する各状態の音素内相対位置を決定する（Ｓ１０４）。具体的には、以下の式に従って、各状態の音素内相対位置を算出する。
【００３４】
ｉ番目の状態の平均フレーム数Ｆ_ｉ
ｊ番目の状態の平均フレーム数Ｆ_ｊ
音素先頭からのフレーム数で表したｉ番目の状態の中心時間位置Ｃ_ｉ
ｉ番目の状態の音素内相対位置ｐ_ｉ
とした場合に、
【００３５】
【数１】

【００３６】
図４は、図３に示されたフレーム数平均について、音響モデルの各状態の音素内相対位置を算出した例を示している。なお、この例では、音素内フレーム数平均（全ての状態のフレーム数平均の合計）が１８として算出されている。
そして、頑健性計算部１５０は、各状態に対応付けられたフレーム数の標準偏差を状態の音素内相対位置における時間長伸縮に対する頑健性として対応づけ、これらの頑健性データ（音素内相対位置と頑健性との対からなるマッピング関数）を音声合成装置１０ｂの頑健性データベース１９０に記録する（Ｓ１０５）。
【００３７】
図５は、音声合成システム１０の音声合成装置１０ｂの動作手順を示すフローチャートである。音声合成装置１０ｂに入力されたテキストデータはまず、言語処理部１６０で形態素解析、構文解析がなされ、音韻列、アクセント、区切り等の読み情報および、品詞、活用等の形態素情報、文節係り受け、構文構造等の構文情報が生成される（Ｓ１１１）。次に、韻律生成部１７０は言語処理部１６０で生成された読み情報、形態素情報、構文情報に基づき、音韻列に対応付けられた基本周波数、音声強度、音素継続時間長等の韻律情報を生成する（Ｓ１１２）。
【００３８】
続いて、素片選択部１８０は言語処理部１６０で生成された読み情報、形態素情報、構文情報さらに、韻律生成部１７０で生成された韻律情報を素片選択コストとして用いて、音声素片データベース１１０ｂより合成しようとする音声の音韻列に対応した音声素片を音素単位で抽出し、音韻列に対応付けられた音声素片テーブルを生成する（Ｓ１１３）。
【００３９】
そして、フレーム時間長伸縮率計算部２００は韻律生成部１７０により生成された音素継続時間長と素片選択部１８０により音声素片データベース１１０ｂから抽出された音声素片テーブルの音素ごとの継続時間長とを比較し、頑健性データベース１９０に記録された、対象の音素に対応する音素内相対位置における時間長伸縮に対する頑健性を参照し、素片の各フレームの中心時間位置での時間長伸縮に対する頑健性を取得する（Ｓ１１４）。このとき、各フレームの中心時間位置に対応する頑健性が頑健性データベース１９０に記録されていない場合には、頑健性データベース１９０に記録されている頑健性データ（音素内相対時間位置と頑健性との組）をスプライン補間することによって、その頑健性を求める。図６は、頑健性データベース１９０に記録されている頑健性データ（グラフ上の点）をスプライン補間することによって、素片の各フレームの中心時間位置での頑健性を求める例を示す図である。
【００４０】
このようにして各フレームの頑健性を特定した後に、フレーム時間長伸縮率計算部２００は、各フレームの頑健性に応じて、各音声素片の音素時間長が韻律生成部１７０により生成された音素継続時間長と等しくなるよう、各フレームの時間長を計算する（Ｓ１１５）。その際、時間長伸縮に対する頑健性が高いフレームの伸縮率を高くし、頑健性の低いフレームの伸縮率が低くなるように、以下の式に従って生成する合成音のフレーム時間長を決定する。
【００４１】
【数２】

【００４２】
とした場合に、頑健性ｒ_ｉに応じた伸縮率（生成する合成音のフレーム時間長と素片のフレーム時間長の比）は以下のように表される。
【００４３】
【数３】

【００４４】
素片のフレーム時間長は一定であるため、
【００４５】
【数４】

【００４６】
生成する合成音の音素時間長はフレーム時間長の和であることから、上記の式に当てはめることにより合成しようとする音声のフレーム時間長は以下のように表すことができる。
【００４７】
【数５】

【００４８】
続いて、フレーム変形部２１０は、素片フレームを、上記式で求められたフレーム時間長に伸縮し、基本周波数、音声強度をステップＳ１１２で設定された値を当てはめ、合成処理フレームごとの合成パラメータを生成する（Ｓ１１６）。最後に、波形生成部２２０は、ステップＳ１６０でフレーム変形部２１０より出力された合成パラメータを参照して音声波形を生成する（Ｓ１１７）。
【００４９】
以上のように、本実施の形態１における音声素片伸縮方法により、ＨＭＭモデルの各状態にマッチングされたフレーム数の標準偏差を時間長伸縮に対する頑健性とし、素片内で頑健性が高いフレームの伸縮率を大きく、頑健性が低いフレームの伸縮率を小さくするように伸縮率を設定することで、ＨＭＭモデルの各状態におけるフレーム数のばらつきの大きい箇所がより大きな時間長伸縮が行われることとなり、これによって、時間長調整による音質劣化を防ぎ、より少ない音声素片データで高い音質の合成音声を生成することが可能になる。また、時間長伸縮に対する頑健性を自動生成することにより、与えられた素片データベースに最適な時間長伸縮が可能になるとともに、時間長伸縮関数の作成作業が軽減される。
【００５０】
つまり、本実施の形態１における音声素片伸縮方法によって、図１４（ａ）に示された音声素片は、図１４（ｃ）に示されるように、音韻知覚のための重要な情報となる音素境界近傍９２ａおよびｂにおける時間あたりの変化量を維持した状態で音声素片の継続時間長が伸張されるとともに、そのためのマッピング関数が自動生成されるので、音声劣化が少なく、かつ、多様な発話状態に対応した音声合成の自動化が可能となる。
【００５１】
（実施の形態２）
次に、本発明の実施の形態２について説明する。
図７は、本発明の実施の形態２における音声合成システムの構成を示す機能ブロック図である。この音声合成システム２０は、不特定話者の音響モデルを用いて音声合成を行うシステムであり、オフライン処理で音素内の相対時間位置に対する時間長伸縮に対する頑健性を算出するオフライン処理装置２０ａと、算出された頑健性のデータを用いて音声合成を行う音声合成装置１０ｂとから構成され、オフライン処理装置２０ａの不特定話者音響モデルデータベース３３０とマッチング部３４０を除いて、実施の形態１における音声合成システム１０と同様の構成を備える。以下、同一の構成要素には同一の符号を付し、その説明を省略する。
【００５２】
不特定話者音響モデルデータベース３３０は、あらかじめ学習されたＨＭＭに基づく不特定話者用の音響モデルのパラメータを格納した不特定話者音響モデルデータベースである。マッチング部３４０は、不特定話者音響モデルデータベース３３０より抽出した音響モデルと音声素片データベース１１０ａから抽出した素片データとを比較しＤＰマッチングを行って素片データの各フレームと音響モデルとをマッチングするマッチング部である。
【００５３】
実施の形態１では、音声素片データベース１１０ａと、その音声素片データベース１１０ａから学習された音響モデルデータベース１３０とに基づいて頑健性が算出されたのに対し、本実施の形態２では、音声素片データベース１１０ａと、あらかじめ学習されたＨＭＭに基づく不特定話者音響モデルデータベース３３０に基づいて頑健性が算出される点で異なる。
【００５４】
次に、以上のように構成された本実施の形態２における音声合成システム２０の動作を説明する。
図８は、音声合成システム２０のオフライン処理装置２０ａの動作手順を示すフローチャートである。まず、マッチング部３４０は、ＤＰマッチングの手法を用いて音声素片データベース１１０ａ内のすべての音素について、不特定話者音響モデルデータベース３３０に記録された音響モデルのうち対応する音素の６状態を抽出して、各音響モデルに素片データをマッチングし、素片の各フレームと音響モデルの６つの状態との対応をつける（Ｓ３０２）。以下、実施の形態１と同様にして、頑健性計算部１５０は、音響モデルと素片データのフレームとの対応を音素ごとに蓄積し、音素ごとに、６つの状態それぞれに対応付けられたフレームの数を集計し、各状態に対応付けられたフレーム数の平均と標準偏差を求め（Ｓ１０３）、さらに各状態の音素全体の時間長に対する各状態の相対位置を決定し（Ｓ１０４）、ステップＳ１０３で求めたフレーム数の標準偏差を音素内相対位置における時間長伸縮に対する頑健性として対応づけ、これらの頑健性データ（音素内相対位置と頑健性との対からなるマッピング関数）を音声合成装置１０ｂの頑健性データベース１９０に記録する（Ｓ１０５）。
【００５５】
一方、音声合成装置１０ｂは、実施の形態１と同様に、言語処理、韻律生成、素片選択を行い、頑健性データベース１９０に格納された時間長伸縮に対する頑健性に従って合成する音声のフレーム長を設定し、フレームの時間長、基本周波数、音声強度等を変形した合成パラメータに従って波形生成を行う。
【００５６】
以上のように、本実施の形態２における音声素片伸縮方法により、音声素片データベース１１０ａと不特定話者音響モデルデータベース３３０に基づいて、実施の形態１と同様に、ＨＭＭモデルの各状態にマッチングされたフレーム数の標準偏差を時間長伸縮に対する頑健性とし、素片内で頑健性が高いフレームの伸縮率を大きく、頑健性が低いフレームの伸縮率を小さくするように伸縮率を設定することで、ＨＭＭモデルの各状態におけるフレーム数のばらつきの大きい箇所がより大きな時間長伸縮が行われることとなり、これによって、時間長調整による音質劣化を防ぎ、より少ない音声素片データで高い音質の合成音声を生成することが可能になる。また、時間長伸縮に対する頑健性を汎用の不特定話者音響モデルを用いて自動生成することにより、音素ごとに最適な時間長伸縮関数を作成する作業がいっそう軽減される。
【００５７】
（実施の形態３）
次に、本発明の実施の形態３について説明する。
図９は、本発明の実施の形態３における音声合成システム３０の構成を示す機能ブロック図である。この音声合成システム３０は、ＨＭＭ音響モデルの学習過程における遷移系列から音声素片の頑健性データを生成する点に特徴を有し、オフラインで学習を行うとともに音素内の相対時間位置に対する時間長伸縮に対する頑健性を算出するオフライン処理装置３０ａと、算出された頑健性のデータを用いて音声合成を行う音声合成装置１０ｂとから構成され、オフライン処理装置３０ａの音響モデル学習部４２０、遷移系列記録部４３０、停留回数計算部４４０および頑健性計算部４５０を除いて、実施の形態１における音声合成システム１０と同様の構成を備える。以下、同一の構成要素には同一の符号を付し、その説明を省略する。
【００５８】
音響モデル学習部４２０は音素ごとのＨＭＭ音響モデルを作成する音響モデル学習部である。遷移系列記録部４３０は音響モデル学習部４２０より素片ごとに出力された、信号系列に対する遷移系列を記録する遷移系列記録部である。停留回数計算部４４０は遷移系列記録部４３０に記録された遷移系列データより、各音素の状態ごとの停留回数を計算する停留回数計算部である。頑健性計算部４５０は停留回数計算部４４０から出力された素片ごとの停留回数の平均と分散より音素内相対時間位置に対する時間長伸縮に対する頑健さを算出する。
【００５９】
次に、以上のように構成された本実施の形態３における音声合成システム３０の動作を説明する。
図１０は、音声合成システム２０のオフライン処理装置３０ａの動作手順を示すフローチャートである。図１１は、その動作手順を説明するための図である。
【００６０】
まず、音響モデル学習部４２０は、音声素片データベース１１０ａに蓄積された音素境界がラベルされた音声データを学習データとして、ＨＭＭを用いた１音素５状態の音響モデルを音素ごとに学習し（Ｓ４０１）、これによって得られた学習データごとの遷移系列データを遷移系列記録部４３０に格納する。具体的には、ＨＭＭの学習過程は、図１１に示されるように、Ｌ種類の信号系列集合Φ＝（Ｏ_ｌ｜１≦ｌ≦Ｌ）からなる信号系列集合が与えられ、開始状態Ｓ０から終了状態ＳＥまで、Ｎ＋１回の状態遷移によって出力される信号系列Ｏ_ｌに関するパラメタ学習をｆｏｒｗａｒｄ・ｂａｃｋｗａｒｄアルゴリズムを用いるＢａｕｍ−Ｗｅｌｃｈのアルゴリズムによって行う。図１１では、開始状態、終了状態を含む５状態によって形成される、隠れマルコフモデルの学習過程の模式図が示されている。
【００６１】
次に、停留回数計算部４４０は各音素について、ＨＭＭの状態ごとの停留回数を集計し、状態ごとの停留回数の平均と分散を求める（Ｓ４０２）。具体的には、図１１に示される各格子点（ｎ，ｉ｜０≦ｎ≦Ｎ＋１，０≦ｉ≦Ｅ）での前向き確率α（ｎ，ｉ）と後ろ向き確率Ｂ（ｎ，ｉ）に加え、Ｖｉｔｅｒｂｉアルゴリズムによる状態遷移系列の計算と同様の方法で求められるバックポインタと、最尤状態遷移系列上での各状態の停留回数を求める。即ち、時刻ｎにおける状態ｉへの遷移の中で、最も確率の高い遷移元の状態番号を記憶するバックポインタＢ（ｎ，ｉ）を用いて、信号系列Ｏ_ｌに対する、時刻ｎおよび状態ｉへの最尤遷移系列上での、時刻０からの状態ｊへの延べ停留回数Ｔ（ｎ，ｉ，ｊ）を以下のように漸化的に求める。
（１）初期遷移
１≦ｉ＜Ｅ、０≦ｊ＜Ｅについて、
Ｔ（１，ｉ，ｊ）＝０
（２）漸化計算式
１≦ｎ≦Ｎ、１≦ｉ＜Ｅ、０≦ｊ＜Ｅについて、
Ｔ（ｎ，ｉ，ｊ）＝Ｔ（ｎ−１，Ｂ（ｎ，ｉ），ｊ）ｉｆ（Ｂ（ｎ，ｉ）≠ｊ）
Ｔ（ｎ，ｉ，ｊ）＝Ｔ（ｎ−１，Ｂ（ｎ，ｉ），ｊ）＋１ｉｆ（Ｂ（ｎ，ｉ）＝ｊ）
（３）最終遷移
０≦ｊ＜Ｅ−１について、
Ｔ（Ｎ＋１，Ｅ，ｊ）＝Ｔ（Ｎ，Ｂ（Ｎ＋１，Ｅ），ｊ）ｉｆ（Ｂ（Ｎ＋１，Ｅ）≠ｊ）
Ｔ（Ｎ＋１，Ｅ，ｊ）＝Ｔ（Ｎ，Ｂ（Ｎ＋１，Ｅ），ｊ）＋１ｉｆ（Ｂ（Ｎ＋１，Ｅ）＝ｊ）
以上の計算により、信号系列に関する最尤状態遷移系列上での各状態ｊ（０≦ｊ＜Ｅ）の延べ停留回数Ｔ_ｌ（ｊ）が、Ｔ_ｌ（ｊ）＝Ｔ（Ｎ＋１，Ｅ，ｊ）として得られる。
【００６２】
信号系列集合Φ＝（Ｏ_ｌ｜１≦ｌ≦Ｌ）の各信号系列Ｏ_ｌの最尤状態遷移系列における各状態ｊの延べ停留回数Ｔ_ｌ（ｊ）を計算した上で、延べ停留回数の平均値μ_ｊと分散σ_ｊ ^２は、Ｂａｕｍ−Ｗｅｌｃｈのアルゴリズムによって求められたモデルＭにおいて信号系列Ｏ_ｌが最尤状態遷移系列Ｓ_ｌに沿って出力される確率Ｐ（Ｏ_ｌ，Ｓ_ｌ｜Ｍ）を用いて、以下のように求められる。
【００６３】
【数６】

【００６４】
また、簡単のため、モデルＭにおいて信号系列Ｏ_ｌが出力される確率Ｐ（Ｏ_ｌ｜Ｍ）を利用し、μ_ｊおよびσ_ｊ ^２の代わりとして以下の値を使っても良い。
【００６５】
【数７】

【００６６】
頑健性計算部４５０はステップＳ４０２で計算された状態ごとの停留回数の平均を用いて、実施の形態１のステップＳ１０４と同様にして音素内相対時間位置を決定し（Ｓ４０３）、ステップＳ４０２で計算された状態ごとの停留回数の分散を音素内相対時間位置における時間長伸縮に対する頑健性として対応づけ、これらの頑健性データ（音素内相対位置と頑健性との対からなるマッピング関数）を音声合成装置１０ｂの頑健性データベース１９０に記録する（Ｓ４０４）。
【００６７】
一方、音声合成装置１０ｂは、実施の形態１と同様に言語処理、韻律生成、素片選択を行い、頑健性データベース１９０に格納された時間長伸縮に対する頑健性に従って合成する音声のフレーム長を設定し、フレームの時間長、基本周波数、音声強度等を変形した合成パラメータに従って波形生成を行う。
【００６８】
以上のように、本実施の形態３における音声素片伸縮方法により、ＨＭＭモデルの学習過程で求められた各状態での停留回数の分散を時間長伸縮に対する頑健性とし、素片内で頑健性が高いフレームの伸縮率を大きく、頑健性が低いフレームの伸縮率を小さくするように伸縮率を設定することで、ＨＭＭモデルの各状態における停留回数のばらつきの大きい箇所がより大きな時間長伸縮が行われることとなり、これによって、時間長調整による音質劣化を防ぎ、より少ない音声素片データで高い音質の合成音声を生成することが可能になる。また、時間長伸縮に対する頑健性をＨＭＭ学習過程から自動生成することにより、個々の素片データベースに最適な時間長伸縮関数が作成され、その作成作業がいっそう軽減される。
【００６９】
（実施の形態４）
次に、本発明の実施の形態４について説明する。
図１２は、本発明の実施の形態４における音声合成システム４０の構成を示す機能ブロック図である。この音声合成システム４０は、ＨＭＭ音響モデルの学習過程における遷移系列から音声素片の頑健性データを生成する点に特徴を有し、オフラインで学習を行うとともに音素内の相対時間位置に対する時間長伸縮に対する頑健性を算出するオフライン処理装置４０ａと、算出された頑健性のデータを用いて音声合成を行う音声合成装置１０ｂとから構成され、オフライン処理装置４０ａの停留回数計算部５４０を除いて、実施の形態３における音声合成システム３０と同様の構成を備える。以下、同一の構成要素には同一の符号を付し、その説明を省略する。
【００７０】
停留回数計算部５４０は遷移系列記録部４３０に記録された遷移系列データより、各音素の状態ごとの停留回数を計算する停留回数計算部であり、状態間の遷移確率と各状態における出力確率とを用いて停留回数を計算する点で、Ｖｉｔｅｒｂｉアルゴリズムに基づくバックポインタＢ（ｎ，ｉ）を用いて停留回数を計算した実施の形態３における停留回数計算部４４０と異なる。
【００７１】
次に、以上のように構成された本実施の形態４における音声合成システム４０の動作を説明する。
図１３は、音声合成システム４０のオフライン処理装置４０ａの動作手順を示すフローチャートである。まず、音響モデル学習部４２０は、実施の形態３と同様にして、音声素片データベース１１０ａに蓄積された音素境界がラベルされた音声データを学習データとして、ＨＭＭを用いた１音素５状態の音響モデルを音素ごとに学習し（Ｓ４０１）、これによって得られた、学習データごとの遷移系列データを遷移系列記録部４３０に格納する。つまり、ＨＭＭの学習過程は、図１１に示されるように、Ｌ種類の信号系列集合Φ＝（Ｏ_ｌ｜１≦ｌ≦Ｌ）からなる信号系列集合が与えられ、開始状態Ｓ０から終了状態ＳＥまで、Ｎ＋１回の状態遷移によって出力される信号系列Ｏ_ｌに関するパラメタ学習をｆｏｒｗａｒｄ・ｂａｃｋｗａｒｄアルゴリズムを用いるＢａｕｍ−Ｗｅｌｃｈのアルゴリズムによって行う。
【００７２】
次に、停留回数計算部４４０は各音素について、ＨＭＭの状態ごとの停留回数を集計し、状態ごとの停留回数の平均と分散を求める（Ｓ５０２）。具体的には、信号系列Ｏ_ｌ＝（ｏ_ｌ（ｎ）｜１≦ｎ≦Ｎ）に対する、時刻０からｎにおける、状態ｊへの延べ停留回数の期待値Ｔ（ｎ，ｉ，ｊ）を以下のように漸化的に求める。
（１）初期遷移
１≦ｉ＜Ｅ、０≦ｊ＜Ｅについて、
Ｔ（１，ｉ，ｊ）＝０
（２）漸化計算式
１≦ｎ≦Ｎ、１≦ｉ＜Ｅ、０≦ｊ＜Ｅについて、
【００７３】
【数８】

【００７４】
ただし、
ｅｑ（ｋ，ｊ）＝１ｉｆｆｋ＝ｊ，
ｅｑ（ｋ，ｊ）＝０ｏｔｈｅｒｗｉｓｅ
とする。また、ａ_ｋｉは状態ｋから状態ｉへの遷移確率、ｂ_ｋ（ｘ）は状態ｋにおける信号ｘの出力確率を表す。
（３）最終遷移
０≦ｊ＜Ｅについて、
【００７５】
【数９】

【００７６】
以上の計算により、信号系列Ｏ_ｌに関する各状態ｊ（０≦ｊ＜Ｅ）の延べ停留回数の期待値Ｔ_ｌ（ｊ）が、Ｔ（Ｎ＋１，Ｅ，ｊ）として得られる。
信号系列集合Φ＝（Ｏ_ｌ｜１≦ｌ≦Ｌ）の各信号系列Ｏ_ｌの最尤状態遷移系列における各状態ｊの延べ停留回数Ｔ_ｌ（ｊ）を計算した上で、延べ停留回数の平均値μ_ｊと分散σ_ｊ ^２は、Ｂａｕｍ−Ｗｅｌｃｈのアルゴリズムによって求められたモデルＭにおいて信号系列Ｏ_ｌが最尤状態遷移系列Ｓ_ｌに沿って出力される確率Ｐ（Ｏ_ｌ｜Ｍ）を用いて、以下のように求められる。
【００７７】
【数１０】

【００７８】
続いて、頑健性計算部４５０は、実施の形態３と同様にして、ステップＳ５０２で計算された状態ごとの停留回数の平均を用いて音素内相対時間位置を決定し（Ｓ４０３）、ステップＳ５０２で計算された状態ごとの停留回数の分散を音素内相対時間位置における時間長伸縮に対する頑健性として対応づけ、これらの頑健性データ（音素内相対位置と頑健性との対からなるマッピング関数）を音声合成装置１０ｂの頑健性データベース１９０に記録する（Ｓ４０４）。
【００７９】
一方、音声合成装置１０ｂは、実施の形態１と同様に言語処理、韻律生成、素片選択を行い、頑健性データベース１９０に格納された時間長伸縮に対する頑健性に従って合成する音声のフレーム長を設定し、フレームの時間長、基本周波数、音声強度等を変形した合成パラメータに従って波形生成を行う。
【００８０】
以上のように、本実施の形態４における音声素片伸縮方法により、ＨＭＭモデルの学習過程で求められた各状態での停留回数の分散を時間長伸縮に対する頑健性とし、素片内で頑健性が高いフレームの伸縮率を大きく、頑健性が低いフレームの伸縮率を小さくするように伸縮率を設定することで、ＨＭＭモデルの各状態における停留回数のばらつきの大きい箇所がより大きな時間長伸縮が行われることとなり、これによって、時間長調整による音質劣化を防ぎ、より少ない音声素片データで高い音質の合成音声を生成することが可能になる。また、時間長伸縮に対する頑健性をＨＭＭ学習過程から自動生成することにより、個々の素片データベースに最適な時間長伸縮関数が作成され、その作成作業がいっそう軽減される。
【００８１】
以上、本発明に係る音声合成システムについて、実施の形態１〜４に基づいて説明したが、本発明は、これらの実施の形態に限られない。
例えば、音声合成装置１０ｂのフレーム時間長伸縮率計算部２００は、音声合成の対象となる音声素片の各フレームの中心時間位置に対応する頑健性が頑健性データベース１９０に記録されていない場合に、頑健性データベース１９０に記録されている頑健性データ（音素内相対時間位置と頑健性との組）をスプライン補間することによって、その頑健性を求めたが、補間方法としては、スプライン関数による補間だけ限られるものではなく、直線補間、２次関数または３次関数による近似に基づく補間であってもよい。
【００８２】
また、オフライン処理装置の音声素片データベース１１０ａと音声合成装置の音声素片データベース１１０ｂとを同一の話者から得られたデータとしたり、オフライン処理装置の音声素片データベース１１０ａに音声合成装置の音声素片データベース１１０ｂを含ませたりしてもよい。これによって、頑健性を算出するために用いられる音声素片データと音声合成に用いられる音声素片データとが関連づけられ、より音声品質の高い音声素片の時間長伸縮が可能となる。
【００８３】
【発明の効果】
以上の説明から明らかなように、本発明に係る音声合成システムによれば、音声素片に対応する音響モデルの各状態ごとに、時間長伸縮に対する頑健性が算出され、音声合成時においては、その頑健性の比に基づいて、音声素片各部の時間長が伸縮されるので、時間長調整による音質劣化が防止される。このことは、従来と同じ程度の音質劣化の範囲であれば、調整可能な時間幅が拡大したことになり、長音素片が不要になる、あるいは、同一音韻の素片を継続時間長によって複数用意する必要がなくなり、より少ない素片データで、時間長調整による音質劣化の少ない音声素片の伸縮が可能となるとともに、多様な発話状態に対応して音声素片の継続時間長を調整することが可能となる。そして、時間長調整に用いるマッピング関数が自動生成されるので、素片データベース作成のための作業が軽減される。
【００８４】
また、本発明に係る音声合成システムによれば、音声素片データベースおよびそのデータベースから学習された音響モデルまたは不特定話者の音響モデル、あるいは、音声素片データベースから音響モデルを生成する過程において記録された遷移系列等のデータベースに基づいて時間長伸縮に用いるマッピング関数が生成されるので、多様なデータベースを基礎として音声素片の継続時間長を調整することが可能となる。
【００８５】
以上のように、本発明により、音質を劣化させることなく、任意の継続時間長の音声を合成することが可能となり、これによって、自然音声に近い高品質な音声合成が実現され、本発明の実用的価値は極めて高い。
【図面の簡単な説明】
【図１】本発明の実施の形態１における音声合成システムの構成を示す機能ブロック図である。
【図２】音声合成システムのオフライン処理装置の動作手順を示すフローチャートである。
【図３】頑健性計算部による頑健性データの計算方法（前半）を示す図である。
【図４】頑健性計算部による頑健性データの計算方法（後半）を示す図である。
【図５】音声合成システムの音声合成装置の動作手順を示すフローチャートである。
【図６】頑健性データをスプライン補間することによって、素片の各フレームの中心時間位置での頑健性を求める例を示す図である。
【図７】本発明の実施の形態２における音声合成システムの構成を示す機能ブロック図である。
【図８】音声合成システムのオフライン処理装置の動作手順を示すフローチャートである。
【図９】本発明の実施の形態３における音声合成システムの構成を示す機能ブロック図である。
【図１０】音声合成システムのオフライン処理装置の動作手順を示すフローチャートである。
【図１１】隠れマルコフモデルの学習過程の模式図である。
【図１２】本発明の実施の形態４における音声合成システムの構成を示す機能ブロック図である。
【図１３】音声合成システムのオフライン処理装置の動作手順を示すフローチャートである。
【図１４】音声パターンの継続時間長の伸縮方法を示す図である。
【符号の説明】
１０、２０、３０、４０音声合成システム
１０ａ、２０ａ、３０ａ、４０ａオフライン処理装置
１０ｂ音声合成装置
１１０ａ、１１０ｂ音声素片データベース
１２０、４２０音響モデル学習部
１３０音響モデルデータベース
１４０、３４０マッチング部
１５０、４５０頑健性計算部
１６０言語処理部
１７０韻律生成部
１８０素片選択部
１９０頑健性データベース
２００フレーム時間長伸縮率計算部
２１０フレーム変形部
２２０波形生成部
３３０不特定話者音響モデルデータベース
４３０遷移系列記録部
４４０、５４０停留回数計算部
[0001]
[Technical field of the invention]
The present invention relates to speech synthesis by the rule-based synthesis method, and more particularly to a speech unit expansion/contraction method for controlling the duration of speech units.
[0002]
[Prior art]
One method of speech synthesis, in which speech is artificially generated from text data by mechanical or similar means and output, is the rule-based synthesis method: prosodic information such as intonation and accent is given to speech units (hereinafter also simply called units) such as "a" and "i", and speech is generated using various synthesis rules. In the rule-based synthesis method, the duration of each speech unit must be controlled to match the duration set for each phoneme, which is either given from outside or generated by prosody control means. In natural speech, phoneme durations vary widely under the influence of speaking rate, semantic content, the preceding and following phonemes, and so on. To synthesize high-quality speech close to natural speech, the synthesizer must therefore be able to synthesize speech of arbitrary duration.
[0003]
Conventional methods for controlling the duration of speech units in rule-based speech synthesis include thinning out or repeating the unit waveform within one phoneme at a uniform rate in synchronization with the pitch, thinning out or repeating fixed-length frames at a uniform rate, and expanding or contracting the frame length at a uniform rate. In this way, a speech unit of arbitrary duration is generated from a speech unit of basic duration.
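As a point of reference for the conventional approach, the following is a minimal sketch of uniform frame thinning and repetition. The function name `uniform_resample` and the representation of frames as list items are assumptions made for illustration only; they are not taken from any cited prior art.

```python
def uniform_resample(frames, target_count):
    """Thin out or repeat frames at a uniform rate.

    Every output slot k is filled with the input frame at the same
    proportional position, so all parts of the unit are stretched or
    compressed equally, including phoneme-boundary transitions.
    """
    if target_count < 0:
        raise ValueError("target_count must be non-negative")
    n = len(frames)
    if n == 0:
        return []
    # k * n // target_count is always < n because k < target_count.
    return [frames[k * n // target_count] for k in range(target_count)]
```

Doubling a four-frame unit with this sketch repeats every frame once, and halving it keeps every second frame, which is exactly the uniform behaviour that alters the rate of change near phoneme boundaries.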
[0004]
A method has also been proposed in which the relationship between the time position of a unit frame and the time position of the frame to be generated is set as a mapping function, so that the frame length is expanded or contracted in those portions of the unit that are less affected by time expansion and contraction (see, for example, Patent Document 1). In this way, a more natural speech unit of arbitrary duration is generated.
[0005]
[Patent Document 1]
JP-A-10-91191
[0006]
[Problems to be solved by the invention]
However, the conventional methods described above, which thin out or repeat at a uniform rate or expand and contract the frame length, change the amount of change per unit time in portions where that amount is itself important information for phoneme perception, such as the transition information near a phoneme boundary, and thus greatly degrade sound quality. For example, uniformly extending the duration of the standard speech pattern for "onso" (phoneme) shown in FIG. 14(a) (here, the time variation of the fundamental frequency) yields the speech pattern shown in FIG. 14(b). Because the amount of change per unit time near the phoneme boundaries 90a and b and 91a and b is altered, the transition information near the phoneme boundaries becomes different and unnatural speech is generated.
[0007]
The method disclosed in Patent Document 1, on the other hand, controls the duration of the speech unit by means of a mapping function, and so, as shown in FIG. 14(c), it extends the duration without changing the amount of change per unit time near the phoneme boundaries 92a and b. However, the mapping function is not generated automatically but is found empirically, so the method cannot handle diverse utterance states. That is, because the pattern of change in speech differs with the state of the speech, the manner of utterance, the individual characteristics of the speaker, and so on, a mapping function must be generated at least for each unit database. The method of Patent Document 1 has no function for generating such diverse mapping functions and therefore cannot handle diverse utterance states; it can be applied only to a specific kind of speech synthesis, for example synthesis of the voices of women in their twenties only.
[0008]
In view of the above problems, an object of the present invention is to provide a speech unit expansion/contraction device and the like that cause little sound quality degradation and can adjust the duration of speech units in accordance with diverse utterance states. In other words, the object is to provide a speech unit expansion/contraction device and the like in which sound quality degradation due to duration adjustment is small and the work of creating a unit database for generating the mapping function used in duration adjustment is reduced.
[0009]
[Means for Solving the Problems]
To achieve the above object, a speech unit expansion/contraction device according to the present invention expands or compresses the time length of speech units for speech synthesis, and comprises: robustness calculating means for calculating, from a plurality of speech unit data accompanied by speech unit boundary information and from an acoustic model that uses a statistical method and has a plurality of states for each speech unit, the robustness of the speech unit data against time-length expansion and contraction for each state of the acoustic model corresponding to each speech unit; and expansion/contraction means for expanding or contracting, for a speech unit for speech synthesis, the time length of each part of the speech unit corresponding to each state, based on the ratio of the calculated robustness values of the states.
[0010]
Thus, the robustness against time-length expansion and contraction is calculated for each state of the acoustic model corresponding to a speech unit, and at synthesis time the time length of each part of the speech unit is expanded or contracted based on the ratio of those robustness values. For example, by setting a large expansion/contraction rate for parts of a speech unit with high robustness and a small rate for parts with low robustness, the time length of the speech unit can be expanded or contracted without changing the amount of change per unit time at locations important for phoneme perception, such as phoneme boundaries, and sound quality degradation due to time-length expansion and contraction is avoided. Moreover, because this robustness information is generated from the speech unit data and the acoustic model, time-length expansion and contraction optimal for the prepared speech unit data becomes possible, and the creation of the mapping function used for time-length expansion and contraction is automated.
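One way to picture "expansion based on the ratio of robustness" is the sketch below, which spreads the required change in duration over the frames in proportion to their robustness. This allocation rule, the function name `allocate_frame_lengths`, and its parameters are assumptions made for illustration; the exact formula used by the invention is not reproduced in this text and may differ.

```python
def allocate_frame_lengths(base_length, robustness, target_total):
    """Distribute a duration change over frames in proportion to robustness.

    base_length  -- original length of every frame (assumed constant)
    robustness   -- one non-negative robustness value per frame
    target_total -- desired total duration of the unit
    """
    n = len(robustness)
    if n == 0:
        raise ValueError("at least one frame is required")
    if any(r < 0 for r in robustness):
        raise ValueError("robustness values must be non-negative")
    change = target_total - base_length * n
    total_r = sum(robustness)
    if total_r == 0:
        # No robustness information: fall back to a uniform change.
        weights = [1.0 / n] * n
    else:
        weights = [r / total_r for r in robustness]
    lengths = [base_length + change * w for w in weights]
    if min(lengths) < 0:
        # ASSUMPTION: reject contractions this simple rule cannot absorb.
        raise ValueError("contraction too large for this allocation")
    return lengths
```

Under this assumed rule, a frame whose robustness is zero keeps its original length, while frames with higher robustness absorb most of the change, which matches the qualitative behaviour described in the paragraph above.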
[0011]
Here, the robustness calculating means may identify, for the plurality of speech unit data, the variation in time width for each state of the acoustic model corresponding to each speech unit, and calculate the identified variation as the robustness. Specifically, the robustness calculating means may, for each speech unit in the plurality of speech unit data, associate each frame constituting the speech unit data with one of the states, and calculate the standard deviation of the number of frames associated with each state as the robustness. Preferably, the robustness calculating means calculates the average of the number of frames associated with each state, uses the calculated average to identify the relative position in the phoneme, that is, the relative position of each state within the speech unit, and generates pairs of the identified relative position and the robustness as robustness data; and the expansion/contraction means identifies the robustness corresponding to each part of the speech unit according to the relationship between relative position in the phoneme and robustness determined from the pairs generated by the robustness calculating means, and expands or contracts the time length based on the identified robustness.
[0012]
Thus, for each state of the acoustic model corresponding to a speech unit, the variation in time width across the plurality of speech unit data is identified as the robustness, so that time-length expansion and contraction is performed in accordance with the variation in time width of each state of the acoustic model, and the sound quality degradation accompanying time-length expansion and contraction of the speech unit is suppressed. The relative position in the phoneme is, for example, the time position measured from the head of the speech unit (relative time position).
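The calculation just described can be sketched as follows: given, for each speech unit, the number of frames aligned to each model state, compute the per-state mean and standard deviation and pair each state's relative position with its robustness. The function name `robustness_data`, the input layout, and the use of the state centre as the relative position are assumptions for illustration. The sample standard deviation is used because it reproduces the values the description quotes for FIG. 3; the text does not say which estimator is intended.

```python
from statistics import mean, stdev

def robustness_data(frame_counts):
    """Return (relative_position, robustness) pairs, one per state.

    frame_counts[u][s] is the number of frames of unit u that were
    aligned to state s of the acoustic model.
    """
    if len(frame_counts) < 2:
        raise ValueError("need at least two units to estimate a deviation")
    n_states = len(frame_counts[0])
    per_state = [[unit[s] for unit in frame_counts] for s in range(n_states)]
    means = [mean(c) for c in per_state]
    # Robustness: sample standard deviation of the per-state frame count.
    stds = [stdev(c) for c in per_state]
    total = sum(means)
    pairs = []
    start = 0.0
    for m, sd in zip(means, stds):
        centre = start + m / 2  # ASSUMPTION: state centre, in frames
        pairs.append((centre / total, sd))
        start += m
    return pairs

# Hypothetical alignment of four units to a four-state model.
counts = [[2, 3, 3, 3], [2, 3, 2, 1], [2, 3, 3, 2], [2, 4, 4, 4]]
pairs = robustness_data(counts)
```

With these hypothetical counts the robustness values round to 0, 0.50, 0.82 and 1.29, the figures the description gives for states 1 to 4 in FIG. 3, and the relative positions increase monotonically from the head of the phoneme.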
[0013]
The acoustic model may be a model generated by training on the speech unit data, or a model generated from speech unit data consisting of speech uttered by an unspecified large number of speakers. This makes it possible either to perform time-length expansion and contraction of speech units optimized for a specific speaker, or to perform it based on a combination of the speech unit data of a specific speaker and a general-purpose acoustic model.
[0014]
The acoustic model may be a hidden Markov model, and the robustness calculating means may identify the variation in the dwell count of each state (the number of times the model stays in that state) in the transition sequence of each speech unit data in the process of generating the acoustic model from the plurality of speech unit data, and calculate the identified variation in dwell count as the robustness. Specifically, the robustness calculating means may calculate the variance of the dwell count as the robustness.
[0015]
In this case, the robustness calculating means may, when computing the forward or backward probability of reaching each state at each time of the training signal in the process of generating the acoustic model, obtain the maximum-likelihood state transition sequence for each training sequence, and calculate the variation in the dwell count of each state in that maximum-likelihood state transition sequence as the robustness. Preferably, the robustness calculating means calculates the average of the dwell count associated with each state, uses the calculated average to identify the relative position in the phoneme, that is, the relative position of each state within the speech unit, and generates pairs of the identified relative position and the robustness as robustness data; and the expansion/contraction means identifies the robustness corresponding to each part of the speech unit according to the relationship between relative position in the phoneme and robustness determined from the pairs generated by the robustness calculating means, and expands or contracts the time length based on the identified robustness.
[0016]
Thus, for each speech unit, the variation in the dwell count in each state of the transition sequence in the process of generating the acoustic model is identified as the robustness, so that time-length expansion and contraction is performed in accordance with the variation in dwell count of each state, and the sound quality degradation accompanying time-length expansion and contraction of the speech unit is suppressed.
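The dwell-count statistic can be illustrated with a short sketch that takes already-decoded state sequences (for example, maximum-likelihood paths) and computes the mean and variance of the time spent in each state. The function names, the list-of-integers path format, and the unweighted statistics are assumptions for illustration; the description indicates that the mean and variance are computed using the probability of each sequence under the model, and that weighting is omitted here.

```python
from statistics import mean, pvariance

def dwell_counts(path, n_states):
    """Count how many time steps a state sequence spends in each state."""
    counts = [0] * n_states
    for state in path:
        counts[state] += 1
    return counts

def dwell_statistics(paths, n_states):
    """Per-state (mean, variance) of dwell counts over all sequences."""
    table = [dwell_counts(p, n_states) for p in paths]
    per_state = [[row[s] for row in table] for s in range(n_states)]
    # ASSUMPTION: unweighted population variance over the sequences.
    return [(mean(c), pvariance(c)) for c in per_state]

# Hypothetical maximum-likelihood state sequences for three training units.
paths = [[0, 0, 1, 1, 1, 2], [0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 1, 2]]
stats = dwell_statistics(paths, 3)
```

In this hypothetical data the first state always lasts two steps, so its variance (and hence its robustness) is zero, while the middle state varies between two and four steps and would receive most of any duration change.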
[0017]
The speaker of the speech unit data may be the same as the speaker of the speech units for speech synthesis, or the speech unit data may include the speech units for speech synthesis. This relates the speech unit data used for calculating the robustness to the speech unit data used for speech synthesis, so that the sound quality degradation accompanying time-length expansion and contraction of the speech unit can be suppressed further.
[0018]
Further, the expansion/contraction means may specify the robustness corresponding to each part of the speech unit by interpolating between a plurality of pairs of relative position in the phoneme and robustness generated by the robustness calculating means. Specifically, the robustness of each part of the speech unit may be specified by linear interpolation, by spline interpolation, or by approximation with a quadratic or cubic function, and the expansion/contraction ratio may then be determined from it.
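As an illustration of the interpolation described above, the following Python sketch uses the simplest of the listed options, linear interpolation, to look up the robustness at the center of each frame from a handful of (relative position, robustness) pairs. The function names, the sample pairs, and the choice to clamp positions outside the recorded range to the nearest recorded value are assumptions made for this sketch and are not taken from the specification.

```python
from bisect import bisect_right


def interpolate_robustness(pairs, position):
    """Linearly interpolate robustness at a relative position in [0, 1].

    pairs: list of (relative_position, robustness), sorted by position.
    Positions outside the recorded range are clamped (an assumption).
    """
    xs = [p for p, _ in pairs]
    ys = [r for _, r in pairs]
    if position <= xs[0]:
        return ys[0]
    if position >= xs[-1]:
        return ys[-1]
    k = bisect_right(xs, position)  # xs[k - 1] <= position < xs[k]
    weight = (position - xs[k - 1]) / (xs[k] - xs[k - 1])
    return ys[k - 1] + weight * (ys[k] - ys[k - 1])


def frame_robustness(pairs, n_frames):
    """Robustness at the center time position of each of n_frames frames."""
    return [interpolate_robustness(pairs, (f + 0.5) / n_frames)
            for f in range(n_frames)]


# Hypothetical robustness data for one phoneme.
example_pairs = [(0.1, 0.0), (0.5, 1.0), (0.9, 0.5)]
print(frame_robustness(example_pairs, 4))
```

Spline interpolation or a polynomial fit could replace `interpolate_robustness` without changing the rest of the lookup.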
[0019]
In addition, the present invention can be realized not only as the speech unit expansion/contraction device described above, but also as a speech unit expansion/contraction method having the above means as its steps, or as a program that causes a computer to function as the speech unit expansion/contraction device. Needless to say, such a program can be distributed via a recording medium such as a CD-ROM or a transmission medium such as the Internet.
[0020]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0021]
(Embodiment 1)
FIG. 1 is a functional block diagram illustrating a configuration of a speech synthesis system 10 according to Embodiment 1 of the present invention. The speech synthesis system 10 calculates in advance the robustness against time length expansion/contraction at each relative time position within a phoneme and performs speech synthesis based on that robustness. It comprises an offline processing device 10a that calculates the robustness and a speech synthesizer 10b that performs speech synthesis from text data using the calculated robustness data. Note that “robustness against time length expansion/contraction” refers to how resistant the sound quality is to deterioration when the duration of a speech unit is expanded or contracted, and “high robustness” means that the sound quality hardly deteriorates even if the duration of the speech unit is expanded or contracted.
[0022]
The offline processing device 10a is composed of three processing units (an acoustic model learning unit 120, a matching unit 140, and a robustness calculation unit 150), which are realized by a program or the like executed on a computer, and two databases (a speech unit database 110a and an acoustic model database 130), which are realized by a hard disk or the like that stores data.
[0023]
The speech unit database 110a is a database storing speech data (a collection of segment data including phonetic symbol strings, fundamental frequencies, speech intensities, phoneme time lengths, etc.) with phoneme boundaries labeled. The acoustic model learning unit 120 creates an acoustic model for each phoneme by performing HMM (Hidden Markov Model) learning based on the speech data stored in the speech unit database 110a, and stores the acoustic model in the acoustic model database 130. The acoustic model database 130 is an auxiliary storage unit that stores the acoustic model for each phoneme output from the acoustic model learning unit 120.
[0024]
The matching unit 140 performs DP (Dynamic Programming) matching between the learned acoustic models stored in the acoustic model database 130 and the segment data extracted from the speech unit database 110a. In this way, for each piece of segment data stored in the speech unit database 110a, it specifies the correspondence between the frames constituting the segment data and the states constituting the learned acoustic model, and notifies the robustness calculation unit 150 of the specified correspondence.
[0025]
The robustness calculation unit 150 aggregates, for each acoustic model, the information output from the matching unit 140 that indicates the correspondence between each frame in a segment and the states of the acoustic model, thereby calculates the robustness against time length expansion/contraction at each relative time position within the phoneme, and outputs the calculated robustness to the speech synthesizer 10b.
[0026]
On the other hand, the speech synthesizer 10b is composed of six processing units (a language processing unit 160, a prosody generation unit 170, a segment selection unit 180, a frame time length expansion/contraction rate calculation unit 200, a frame transformation unit 210, and a waveform generation unit 220), which are realized by a program or the like executed on a computer, and two databases (a robustness database 190 and a speech unit database 110b), which are realized by a hard disk or the like that stores data.
[0027]
The language processing unit 160 performs linguistic analysis of the input text data and generates linguistic information such as morphemes and syntax, and reading information such as phoneme strings. The prosody generation unit 170 determines the prosody of the synthesized speech based on the linguistic information and the reading information output from the language processing unit 160, and generates prosody information such as fundamental frequency, power, and rhythm. The speech unit database 110b is a database that holds speech synthesis parameters generated from real speech, in the units processed when a waveform is generated. Based on the prosody information output from the prosody generation unit 170 and on the linguistic information and reading information, the segment selection unit 180 selects the optimal speech units from the speech unit database 110b.
[0028]
The robustness database 190 is a database that stores the robustness data output from the offline processing device 10a (specifically, from its robustness calculation unit 150), that is, data indicating, for each type of speech unit, the robustness against time length expansion/contraction at each relative time position within the phoneme. The frame time length expansion/contraction rate calculation unit 200 calculates the expansion/contraction ratio of the time length for each frame, the unit in which a segment is processed, based on the data in the robustness database 190. The frame transformation unit 210 expands or contracts the time length of each frame of the speech unit selected by the segment selection unit 180, based on the expansion/contraction ratio calculated by the frame time length expansion/contraction rate calculation unit 200. The waveform generation unit 220 generates a speech waveform by connecting the frames expanded or contracted by the frame transformation unit 210 and performing signal processing based on the parameters of each frame.
[0029]
Next, the operation of the speech synthesis system 10 according to the first embodiment configured as described above will be described.
FIG. 2 is a flowchart showing an operation procedure of the offline processing device 10a of the speech synthesis system 10. FIGS. 3 and 4 are diagrams for explaining the operation.
[0030]
First, the acoustic model learning unit 120 uses the speech data with labeled phoneme boundaries stored in the speech unit database 110a as learning data and generates, for each phoneme, an acoustic model with six states per phoneme using an HMM (S101). The obtained HMM parameters are recorded in the acoustic model database 130 as the acoustic model.
[0031]
Next, for every phoneme in the speech unit database 110a, the matching unit 140 extracts the six states of the corresponding phoneme from the acoustic models recorded in the acoustic model database 130 and matches the segment data against the acoustic model using the DP matching technique, thereby specifying the correspondence between each frame of the segment and the six states of the acoustic model (S102). FIG. 3 shows an example of the correspondence. Here, in the unit A frame sequence, state 1 corresponds to the first two frames, state 2 to the following three frames, state 3 to the following three frames, state 4 to the following three frames, and state 5 to the following three frames. The correspondence between the six states of the acoustic model and the frames is specified in the same way for the unit B, unit C, and unit D frame sequences.
[0032]
Next, the robustness calculation unit 150 accumulates the correspondences specified in step S102 for each phoneme and, for each phoneme, tallies the frames associated with each of the six states, thereby calculating the average and the standard deviation of the number of frames (S103). In the example of FIG. 3, the average and standard deviation of the number of frames are calculated as 2 and 0 for state 1, 3.25 and 0.50 for state 2, 3.0 and 0.82 for state 3, and 2.5 and 1.29 for state 4.
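The tally in step S103 can be sketched in Python as follows. Each alignment is written as a string with one state label per frame. Only unit A (states 1 to 4) follows the description of FIG. 3; the alignments for units B, C, and D are hypothetical values chosen so that the statistics match the figures quoted above, and the use of the sample standard deviation (division by n - 1) is an inference from those quoted figures rather than something the text states.

```python
from collections import Counter
from statistics import mean, stdev

# One character per frame: the HMM state aligned to that frame.
alignments = {
    "A": "11222333444",      # states 1 to 4 of unit A as described for FIG. 3
    "B": "11222334",         # hypothetical
    "C": "1122233344",       # hypothetical
    "D": "11222233334444",   # hypothetical
}


def frame_count_stats(alignments, states):
    """Return {state: (mean, sample standard deviation)} of frames per state."""
    stats = {}
    for state in states:
        counts = [Counter(frames)[state] for frames in alignments.values()]
        stats[state] = (mean(counts), stdev(counts))
    return stats


for state, (avg, sd) in frame_count_stats(alignments, "1234").items():
    print(state, avg, round(sd, 2))
```

With these inputs the rounded results reproduce the pairs quoted for states 1 to 4 (2 and 0, 3.25 and 0.50, 3.0 and 0.82, 2.5 and 1.29).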
[0033]
Subsequently, from the average number of frames obtained in step S103, the robustness calculation unit 150 obtains the center time position of each state as a frame position counted from the beginning of the phoneme, and thereby determines the relative position of each state within the phoneme with respect to the time length of the entire phoneme (S104). Specifically, the relative position in the phoneme of each state is calculated according to the following equations.
[0034]
Let
F_i be the average number of frames in the i-th state,
F_j be the average number of frames in the j-th state,
C_i be the center time position of the i-th state, expressed as the number of frames from the head of the phoneme, and
p_i be the relative position in the phoneme of the i-th state.
Then:
[0035]
(Equation 1)

[0036]
FIG. 4 shows an example of calculating the relative position in the phoneme of each state of the acoustic model for the average numbers of frames shown in FIG. 3. In this example, the average number of frames in the phoneme (the sum of the average numbers of frames over all states) is calculated as 18.
Then, the robustness calculation unit 150 associates the standard deviation of the number of frames associated with each state, as the robustness against time length expansion and contraction, with the relative position in the phoneme of that state, and records this robustness data (a mapping function consisting of pairs of relative position in the phoneme and robustness) in the robustness database 190 of the speech synthesizer 10b (S105).
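Steps S104 and S105 can be sketched in Python as below. Equation 1 is not reproduced in this text, so the formulas used here are an assumption inferred from the prose: the center of state i is taken as the sum of the average frame counts of the preceding states plus half of F_i, and p_i is that center divided by the total average frame count. The values for states 5 and 6 are hypothetical and were chosen only so that the total is 18, as in the FIG. 4 example.

```python
def relative_positions(avg_frames):
    """Relative center position of each state within the phoneme.

    Assumed reading of Equation 1:
      C_i = F_1 + ... + F_(i-1) + F_i / 2
      p_i = C_i / (F_1 + ... + F_n)
    """
    total = sum(avg_frames)
    positions = []
    start = 0.0  # frames preceding the current state
    for frames in avg_frames:
        positions.append((start + frames / 2.0) / total)
        start += frames
    return positions


# States 1 to 4 use the averages quoted for FIG. 3; states 5 and 6 are hypothetical.
avg_frames = [2.0, 3.25, 3.0, 2.5, 3.25, 4.0]   # sums to 18
std_frames = [0.0, 0.50, 0.82, 1.29, 0.60, 0.90]

# Robustness data: pairs of (relative position in the phoneme, robustness).
robustness_data = list(zip(relative_positions(avg_frames), std_frames))
for position, robustness in robustness_data:
    print(round(position, 3), robustness)
```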
[0037]
FIG. 5 is a flowchart showing the operation procedure of the speech synthesizer 10b of the speech synthesis system 10. The text data input to the speech synthesizer 10b is first subjected to morphological analysis and syntactic analysis by the language processing unit 160, which generates reading information such as phoneme strings, accents, and delimiters, morphological information such as part of speech and inflection, and syntax information such as phrase dependency and syntactic structure (S111). Next, based on the reading information, morphological information, and syntax information generated by the language processing unit 160, the prosody generation unit 170 generates prosody information such as the fundamental frequency, voice intensity, and phoneme duration corresponding to the phoneme sequence (S112).
[0038]
Subsequently, using the reading information, morphological information, and syntax information generated by the language processing unit 160 and the prosody information generated by the prosody generation unit 170 as the unit selection cost, the segment selection unit 180 extracts from the speech unit database 110b, for each phoneme, a speech unit corresponding to the phoneme sequence of the speech to be synthesized, and generates a speech unit table associated with the phoneme sequence (S113).
[0039]
Then, the frame time length expansion/contraction rate calculation unit 200 compares the phoneme duration generated by the prosody generation unit 170 with the duration of each phoneme in the speech unit table that the segment selection unit 180 extracted from the speech unit database 110b, refers to the robustness against time length expansion and contraction at each relative position within the phoneme recorded in the robustness database 190 for the target phoneme, and acquires the robustness against time length expansion and contraction at the center time position of each frame of the unit (S114). At this time, if the robustness corresponding to the center time position of a frame is not recorded in the robustness database 190, it is determined by spline interpolation of the robustness data (pairs of relative time position in the phoneme and robustness) recorded in the robustness database 190. FIG. 6 is a diagram illustrating an example in which the robustness at the center time position of each frame of the unit is obtained by performing spline interpolation on the robustness data (the points on the graph) recorded in the robustness database 190.
[0040]
After specifying the robustness of each frame in this way, the frame time length expansion/contraction rate calculation unit 200 calculates the time length of each frame according to the robustness of that frame, so that the phoneme time length of each speech unit becomes equal to the phoneme duration generated by the prosody generation unit 170 (S115). At this time, the frame time length of the synthesized sound to be generated is determined according to the following equation, so that the expansion/contraction ratio is large for frames with high robustness against time length expansion/contraction and small for frames with low robustness.
[0041]
(Equation 2)

[0042]
The expansion/contraction ratio according to the robustness r_i (the ratio of the frame time length of the synthesized sound to be generated to the frame time length of the segment) is expressed by the following equation.
[0043]
(Equation 3)

[0044]
Since the frame time length of the unit is constant,
[0045]
(Equation 4)

[0046]
Since the phoneme time length of the synthesized sound to be generated is the sum of the frame time lengths, the frame time length of the speech to be synthesized by applying the above equation can be expressed as follows.
[0047]
(Equation 5)

[0048]
Subsequently, the frame transformation unit 210 expands or contracts each frame of the unit to the frame time length obtained by the above equation, adjusts the fundamental frequency and the voice intensity to the values set in step S112, and generates the synthesis parameters for each synthesis processing frame (S116). Finally, the waveform generation unit 220 generates a speech waveform with reference to the synthesis parameters output from the frame transformation unit 210 in step S116 (S117).
[0049]
As described above, according to the speech unit expansion/contraction method in the first embodiment, the standard deviation of the number of frames matched to each state of the HMM is used as the robustness against time length expansion and contraction, and the expansion/contraction ratio is set so that it is large for frames with high robustness within the unit and small for frames with low robustness. Consequently, portions in which the number of frames in each state of the HMM varies widely undergo larger time length expansion and contraction. This prevents the sound quality from deteriorating when the time length is adjusted, and makes it possible to generate high-quality synthesized speech from less speech unit data. Further, because the robustness against time length expansion/contraction is generated automatically, the time length expansion/contraction best suited to the given segment database can be performed, and the work of creating the time length expansion/contraction function is reduced.
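The following Python sketch shows one way the frame time lengths could be derived from per-frame robustness so that they sum to the target phoneme duration. Equations 2 to 5 are not reproduced in this text, so the rule used here, distributing the difference between the target duration and the original duration across frames in proportion to robustness, is an assumed stand-in rather than the formula of the specification. The function name, the fallback for all-zero robustness, and the sample values are likewise hypothetical.

```python
def frame_lengths(robustness, frame_len, target_len):
    """Per-frame time lengths that sum to target_len.

    Assumed rule: the total change in duration is shared among frames in
    proportion to their robustness, so robust frames change the most.
    """
    n = len(robustness)
    delta = target_len - n * frame_len   # > 0: expansion, < 0: contraction
    total = sum(robustness)
    if total == 0:
        # No robustness information: fall back to uniform scaling (assumption).
        return [target_len / n] * n
    return [frame_len + delta * r / total for r in robustness]


# Four 5 ms frames stretched to 30 ms; the first frame has zero robustness.
print(frame_lengths([0.0, 0.5, 1.0, 0.5], 5, 30))   # [5.0, 7.5, 10.0, 7.5]
# The same frames contracted to 16 ms.
print(frame_lengths([0.0, 0.5, 1.0, 0.5], 5, 16))   # [5.0, 4.0, 3.0, 4.0]
```

Under strong contraction this simple rule can drive a very robust frame to zero or negative length, so a practical implementation would need a lower bound on the frame length.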
[0050]
That is, with the speech unit expansion/contraction method in the first embodiment, the duration of the speech unit shown in FIG. 14A is extended as shown in FIG. 14C while the amount of change per unit time in the vicinity of the phoneme boundaries 92a and 92b, which is important information for phoneme perception, is maintained, and the mapping function for doing so is generated automatically. This makes it possible to automate speech synthesis adapted to the utterance state.
[0051]
(Embodiment 2)
Next, a second embodiment of the present invention will be described.
FIG. 7 is a functional block diagram showing a configuration of the speech synthesis system 20 according to Embodiment 2 of the present invention. The speech synthesis system 20 performs speech synthesis using an acoustic model of unspecified speakers. It comprises an offline processing device 20a that calculates the robustness and a speech synthesizer 10b that performs speech synthesis using the calculated robustness data. Except for the unspecified speaker acoustic model database 330 and the matching unit 340 of the offline processing device 20a, it has the same configuration as the speech synthesis system 10 in the first embodiment. Hereinafter, the same components are denoted by the same reference numerals, and description thereof will be omitted.
[0052]
The unspecified speaker acoustic model database 330 stores acoustic model parameters for unspecified speakers, based on HMMs that have been learned in advance. The matching unit 340 performs DP matching between the acoustic model extracted from the unspecified speaker acoustic model database 330 and the segment data extracted from the speech unit database 110a, and thereby associates each frame of the segment data with the acoustic model.
[0053]
In the first embodiment, the robustness is calculated based on the speech unit database 110a and the acoustic model database 130 learned from the speech unit database 110a. The second embodiment differs in that the robustness is calculated based on the speech unit database 110a and the unspecified speaker acoustic model database 330, which is based on HMMs learned in advance.
[0054]
Next, the operation of the speech synthesis system 20 according to the second embodiment configured as described above will be described.
FIG. 8 is a flowchart showing the operation procedure of the offline processing device 20a of the speech synthesis system 20. First, for every phoneme in the speech unit database 110a, the matching unit 340 extracts the six states of the corresponding phoneme from the acoustic models recorded in the unspecified speaker acoustic model database 330, matches the segment data against each acoustic model using the DP matching method, and establishes the correspondence between each frame of the segment and the six states of the acoustic model (S302). Thereafter, in the same manner as in the first embodiment, the robustness calculation unit 150 accumulates the correspondence between the acoustic model and the frames of the segment data for each phoneme, tallies, for each phoneme, the frames associated with each of the six states, and obtains the average and standard deviation of the number of frames associated with each state (S103). It then determines the relative position of each state within the phoneme with respect to the time length of the entire phoneme (S104), associates the standard deviation of the number of frames obtained in step S103, as the robustness against time length expansion and contraction, with that relative position in the phoneme, and records the robustness data (a mapping function consisting of pairs of relative position in the phoneme and robustness) in the robustness database 190 of the speech synthesizer 10b (S105).
[0055]
On the other hand, the speech synthesizer 10b performs language processing, prosody generation, and segment selection in the same manner as in the first embodiment, sets the frame length of the speech to be synthesized according to the robustness against time length expansion and contraction stored in the robustness database 190, and then generates a waveform in accordance with the synthesis parameters obtained by modifying the time length of the frame, the fundamental frequency, the voice intensity, and the like.
[0056]
As described above, according to the speech unit expansion/contraction method in the second embodiment, the standard deviation of the number of frames matched to each state of the HMM is obtained from the speech unit database 110a and the unspecified speaker acoustic model database 330 and, as in the first embodiment, is used as the robustness against time length expansion and contraction. The expansion/contraction ratio is set so that it is large for frames with high robustness within the unit and small for frames with low robustness. Consequently, portions in which the number of frames in each state of the HMM varies widely undergo larger time length expansion and contraction, which prevents sound quality deterioration due to time length adjustment and makes it possible to generate high-quality synthesized speech from less speech unit data. In addition, because the robustness against time length expansion and contraction is generated automatically using a general-purpose unspecified speaker acoustic model, the work of creating an optimum time length expansion/contraction function for each phoneme is further reduced.
[0057]
(Embodiment 3)
Next, a third embodiment of the present invention will be described.
FIG. 9 is a functional block diagram showing a configuration of the speech synthesis system 30 according to Embodiment 3 of the present invention. The speech synthesis system 30 is characterized in that the robustness data of a speech unit is generated from the transition sequences obtained in the learning process of an HMM acoustic model. It comprises an offline processing device 30a that performs offline learning and calculates the robustness against time length expansion/contraction at each relative time position in a phoneme, and a speech synthesizer 10b that performs speech synthesis using the calculated robustness data. Except for the acoustic model learning unit 420, the transition sequence recording unit 430, the number-of-stops calculation unit 440, and the robustness calculation unit 450 of the offline processing device 30a, it has the same configuration as the speech synthesis system 10 according to the first embodiment. Hereinafter, the same components are denoted by the same reference numerals, and description thereof will be omitted.
[0058]
The acoustic model learning unit 420 creates an HMM acoustic model for each phoneme. The transition sequence recording unit 430 records, for each segment, the transition sequence for the signal sequence output from the acoustic model learning unit 420. The number-of-stops calculation unit 440 calculates the number of stops in each state of each phoneme from the transition sequence data recorded in the transition sequence recording unit 430. The robustness calculation unit 450 calculates the robustness against time length expansion/contraction at each relative time position in the phoneme from the average and variance of the number of stops for each segment output from the number-of-stops calculation unit 440.
[0059]
Next, the operation of the speech synthesis system 30 according to the third embodiment configured as described above will be described.
FIG. 10 is a flowchart showing the operation procedure of the offline processing device 30a of the speech synthesis system 30. FIG. 11 is a diagram for explaining the operation procedure.
[0060]
First, the acoustic model learning unit 420 uses the speech data with labeled phoneme boundaries stored in the speech unit database 110a as learning data, learns, for each phoneme, an acoustic model with five states per phoneme using an HMM (S401), and stores the transition sequence data obtained for each piece of learning data in the transition sequence recording unit 430. Specifically, as shown in FIG. 11, the HMM is learned for a set of L signal sequences Φ = (O_l | 1 ≦ l ≦ L), where each signal sequence O_l is output by N + 1 state transitions from the start state S_0 to the end state S_E, by the Baum-Welch algorithm using the forward-backward algorithm. FIG. 11 is a schematic diagram illustrating the learning process of a hidden Markov model formed of five states including a start state and an end state.
[0061]
Next, the number-of-stops calculation unit 440 totals, for each phoneme, the number of stops in each state of the HMM, and calculates the average and variance of the number of stops for each state (S402). Specifically, when the forward probability α(n, i) and the backward probability β(n, i) are calculated at each grid point (n, i | 0 ≦ n ≦ N + 1, 0 ≦ i ≦ E) shown in FIG. 11, the back pointers and the number of times each state is occupied on the maximum likelihood state transition sequence are also obtained, by the same method as the calculation of the state transition sequence in the Viterbi algorithm. That is, using the back pointer B(n, i), which stores the state number of the most probable transition source for the transition to state i at time n, the total number of stops T(n, i, j) in state j from time 0 on the maximum likelihood transition sequence that reaches state i at time n is calculated recursively for the signal sequence O_l as follows.
(1) Initial transition
For 1 ≦ i <E and 0 ≦ j <E,
T (1, i, j) = 0
(2) Recurrence formula
For 1 ≦ n ≦ N, 1 ≦ i <E, 0 ≦ j <E,
T (n, i, j) = T (n-1, B (n, i), j) if (B (n, i) ≠ j)
T (n, i, j) = T (n-1, B (n, i), j) +1 if (B (n, i) = j)
(3) Final transition
For 0 ≦ j <E−1,
T (N + 1, E, j) = T (N, B (N + 1, E), j) if (B (N + 1, E) ≠ j)
T (N + 1, E, j) = T (N, B (N + 1, E), j) +1 if (B (N + 1, E) = j)
By the above calculation, the total number of stops T_l(j) in each state j (0 ≦ j < E) on the maximum likelihood state transition sequence for the signal sequence O_l is obtained as T_l(j) = T(N + 1, E, j).
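The recursion above can be sketched in Python as follows. The back pointer table is assumed to be given as a dictionary keyed by (n, i), and the table used in the example is hypothetical. The text states the recurrence for 1 ≦ n ≦ N; this sketch starts it at n = 2 so that the initial values at n = 1 are not overwritten, which is an interpretation and not something the text confirms. Back pointers are also assumed to point only at the emitting states 1 to E - 1.

```python
def stay_counts(back, n_frames, end_state):
    """Total number of stops in each state on the maximum likelihood path.

    back[(n, i)]: most probable predecessor state for reaching state i at time n.
    Returns a list whose entry j is T(N + 1, E, j) for 0 <= j < E.
    """
    E, N = end_state, n_frames
    # (1) Initial transition: T(1, i, j) = 0 for every emitting state i.
    T = {i: [0] * E for i in range(1, E)}
    # (2) Recurrence: copy the predecessor's counts and add one stop there.
    for n in range(2, N + 1):
        new_T = {}
        for i in range(1, E):
            k = back[(n, i)]
            row = list(T[k])
            row[k] += 1
            new_T[i] = row
        T = new_T
    # (3) Final transition into the end state E at time N + 1.
    k = back[(N + 1, E)]
    final = list(T[k])
    final[k] += 1
    return final


# Hypothetical back pointers for N = 6 frames, emitting states 1 to 3, E = 4.
# The best path visits states 1, 1, 2, 2, 2, 3 at times 1 to 6.
back = {(n, 1): 1 for n in range(2, 7)}
back.update({(n, 2): (1 if n <= 3 else 2) for n in range(2, 7)})
back.update({(n, 3): 2 for n in range(2, 7)})
back[(7, 4)] = 3

print(stay_counts(back, 6, 4))   # [0, 2, 3, 1]
```

The counts over all states sum to the number of frames N, which is a convenient sanity check on a back pointer table.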
[0062]
After the total number of stops T_l(j) in each state j on the maximum likelihood state transition sequence has been calculated for every signal sequence O_l in the signal sequence set Φ = (O_l | 1 ≦ l ≦ L), the average μ_j and the variance σ_j² of the total number of stops are obtained as follows, using the probability P(O_l, S_l | M) that the signal sequence O_l is output along the maximum likelihood state transition sequence S_l in the model M obtained by the Baum-Welch algorithm.
[0063]
(Equation 6)

[0064]
For simplicity, the following values, which use the probability P(O_l | M) that the signal sequence O_l is output, may be used in place of μ_j and σ_j².
[0065]
(Equation 7)

[0066]
Using the average number of stops for each state calculated in step S402, the robustness calculation unit 450 determines the relative time position within the phoneme in the same manner as in step S104 of the first embodiment (S403). It then associates the variance of the number of stops for each state calculated in step S402, as the robustness against time length expansion and contraction, with the relative time position within the phoneme, and records this robustness data (a mapping function consisting of pairs of relative position in the phoneme and robustness) in the robustness database 190 of the speech synthesizer 10b (S404).
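Equations 6 and 7 are not reproduced in this text. As a rough illustration only, the Python sketch below assumes that the average and variance of the number of stops in a state are weighted by the probability of each signal sequence, which is one reading of the surrounding prose and not a confirmed reconstruction. Log probabilities are used because sequence probabilities in an HMM are typically very small. The function name and sample values are hypothetical.

```python
import math


def weighted_stay_stats(counts, log_probs):
    """Probability-weighted mean and variance of stay counts for one state.

    counts[l]:    number of stops T_l(j) for signal sequence l
    log_probs[l]: log of the weight for sequence l, for example log P(O_l, S_l | M)
    """
    top = max(log_probs)
    # Rescale by the largest log probability to avoid underflow in exp().
    weights = [math.exp(lp - top) for lp in log_probs]
    total = sum(weights)
    mu = sum(w * c for w, c in zip(weights, counts)) / total
    var = sum(w * (c - mu) ** 2 for w, c in zip(weights, counts)) / total
    return mu, var


# Hypothetical stay counts for one state over four training sequences.
counts = [2, 3, 1, 4]
print(weighted_stay_stats(counts, [-1000.0] * 4))   # equal weights: (2.5, 1.25)
```

With equal weights this reduces to the ordinary mean and population variance, so the unweighted case is a special case of the sketch.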
[0067]
On the other hand, the speech synthesizer 10b performs linguistic processing, prosody generation and segment selection in the same manner as in the first embodiment, and sets the frame length of the speech to be synthesized according to the robustness against the time length expansion and contraction stored in the robustness database 190. Then, a waveform is generated in accordance with the synthesis parameters obtained by modifying the time length of the frame, the fundamental frequency, the voice intensity, and the like.
[0068]
As described above, according to the speech unit expansion/contraction method in the third embodiment, the variance of the number of stops in each state obtained in the learning process of the HMM is used as the robustness against time length expansion and contraction, and the expansion/contraction ratio is set so that it is large for frames with high robustness within the unit and small for frames with low robustness. Consequently, portions in which the number of stops in each state of the HMM varies widely undergo larger time length expansion and contraction. This prevents the sound quality from deteriorating when the time length is adjusted, and makes it possible to generate high-quality synthesized speech from less speech unit data. Also, because the robustness against time length expansion and contraction is generated automatically from the HMM learning process, an optimum time length expansion/contraction function is created for each segment database, and the work of creating it is further reduced.
[0069]
(Embodiment 4)
Next, a fourth embodiment of the present invention will be described.
FIG. 12 is a functional block diagram showing a configuration of the speech synthesis system 40 according to Embodiment 4 of the present invention. The speech synthesis system 40 is characterized in that the robustness data of a speech unit is generated from the transition sequences obtained in the learning process of an HMM acoustic model. It comprises an offline processing device 40a that performs offline learning and calculates the robustness against time length expansion/contraction at each relative time position in a phoneme, and a speech synthesizer 10b that performs speech synthesis using the calculated robustness data. Except for the number-of-stops calculation unit 540 of the offline processing device 40a, it has the same configuration as the speech synthesis system 30 according to the third embodiment. Hereinafter, the same components are denoted by the same reference numerals, and description thereof will be omitted.
[0070]
The number-of-stops calculation unit 540 calculates the number of stops in each state of each phoneme from the transition sequence data recorded in the transition sequence recording unit 430. It differs from the number-of-stops calculation unit 440 of the third embodiment, which calculates the number of stops using the back pointer B(n, i) based on the Viterbi algorithm, in that it calculates the expected value of the number of stops.
[0071]
Next, the operation of the speech synthesis system 40 according to Embodiment 4 configured as described above will be described.
FIG. 13 is a flowchart showing the operation procedure of the offline processing device 40a of the speech synthesis system 40. First, in the same manner as in the third embodiment, the acoustic model learning unit 420 uses the speech data with labeled phoneme boundaries stored in the speech unit database 110a as learning data, learns, for each phoneme, an acoustic model with five states per phoneme using an HMM (S401), and stores the transition sequence data obtained for each piece of learning data in the transition sequence recording unit 430. That is, as shown in FIG. 11, the HMM is learned for a set of L signal sequences Φ = (O_l | 1 ≦ l ≦ L), where each signal sequence O_l is output by N + 1 state transitions from the start state S_0 to the end state S_E, by the Baum-Welch algorithm using the forward-backward algorithm.
[0072]
Next, the number-of-stops calculation unit 540 totals, for each phoneme, the number of stops in each state of the HMM, and obtains the average and variance of the number of stops for each state (S502). Specifically, for the signal sequence O_l = (O_l(n) | 1 ≦ n ≦ N), the expected value T(n, i, j) of the total number of stops in state j from time 0 to time n is obtained recursively as follows.
(1) Initial transition
For 1 ≦ i < E and 0 ≦ j < E,
T(1, i, j) = 0
(2) Recurrence formula
For 1 ≦ n ≦ N, 1 ≦ i < E, and 0 ≦ j < E,
[0073]
(Equation 8)

[0074]
Here,
eq(k, j) = 1 if k = j,
eq(k, j) = 0 otherwise.
Also, a_ki is the transition probability from state k to state i, and b_k(x) is the output probability of the signal x in state k.
(3) Final transition
For 0 ≦ j < E,
[0075]
(Equation 9)

[0076]
By the above calculation, the total number of stops T_l(j) in each state j (0 ≦ j < E) for the signal sequence O_l is obtained as T(N + 1, E, j).
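Equations 8 and 9 are not reproduced in this text, so the following Python fragment is not a transcription of them. It is only a minimal sketch of the general idea, under the assumption that the stop counts are carried forward along the best predecessor state (the role played by the back pointer of the Viterbi algorithm). The function name `viterbi_stop_counts`, the argument layout, and the simplification that the path starts in state 0 and ends in the last state are all illustrative assumptions.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi_stop_counts(trans, emit):
    """Count the stops per state along the maximum likelihood path.

    trans[k][i]: transition probability a_ki from state k to state i.
    emit[n][i]:  output probability of frame n in state i.
    Assumption: the path starts in state 0 and ends in the last state.
    Returns (log probability of the best path, stops per state).
    """
    n_states = len(trans)
    score = [NEG_INF] * n_states
    counts = [[0] * n_states for _ in range(n_states)]
    score[0] = safe_log(emit[0][0])
    counts[0][0] = 1
    for frame in emit[1:]:
        new_score = [NEG_INF] * n_states
        new_counts = [[0] * n_states for _ in range(n_states)]
        for i in range(n_states):
            # Back pointer: the best predecessor k of state i at this time.
            back = max(range(n_states),
                       key=lambda k: score[k] + safe_log(trans[k][i]))
            best = score[back] + safe_log(trans[back][i])
            if best == NEG_INF:
                continue  # state i cannot be reached at this time
            new_score[i] = best + safe_log(frame[i])
            # Inherit the predecessor's counts and add one stop in state i.
            new_counts[i] = counts[back][:]
            new_counts[i][i] += 1
        score, counts = new_score, new_counts
    return score[-1], counts[-1]


# Hypothetical two-state left-to-right model and four frames.
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]
log_prob, stops = viterbi_stop_counts(trans, emit)
print(stops)
```

Because the counts travel with each partial path, no separate traceback pass is needed once the last frame has been processed.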
After the total number of stops T_l(j) in each state j along the maximum likelihood state transition sequence has been calculated for every signal sequence O_l in the set Φ = (O_l | 1 ≦ l ≦ L), the average μ_j and the variance σ_j² of the total number of stops are obtained as follows, using the probability P(O_l | M) that the signal sequence O_l is output along the maximum likelihood state transition sequence S_l in the model M obtained by the Baum-Welch algorithm.
[0077]
(Equation 10)
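Equation 10 itself is not reproduced in this text. Purely as an illustration, the following sketch assumes that the average and variance are weighted by the path probability P(O_l | M) of each signal sequence and normalized by the sum of those probabilities; the actual form of Equation 10 may differ. The function name and argument layout are hypothetical.

```python
def weighted_stop_statistics(stop_counts, path_probs):
    """Probability-weighted mean and variance of stops per state.

    stop_counts[l][j]: total number of stops T_l(j) of sequence l in state j.
    path_probs[l]:     P(O_l | M) for sequence l (assumed weighting).
    Returns (means, variances), one value per state j.
    """
    norm = sum(path_probs)
    n_states = len(stop_counts[0])
    means, variances = [], []
    for j in range(n_states):
        mu = sum(p * t[j] for p, t in zip(path_probs, stop_counts)) / norm
        var = sum(p * (t[j] - mu) ** 2
                  for p, t in zip(path_probs, stop_counts)) / norm
        means.append(mu)
        variances.append(var)
    return means, variances


# Two hypothetical training sequences, two states, equal path probabilities.
means, variances = weighted_stop_statistics([[2, 4], [4, 4]], [0.5, 0.5])
print(means, variances)
```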

[0078]
Subsequently, in the same manner as in the third embodiment, the robustness calculation unit 450 determines the relative time position within the phoneme using the average number of stops for each state calculated in step S502 (S403). It then associates the variance of the number of stops for each state calculated in step S502 with the robustness against time length expansion and contraction at that relative time position, and records the resulting robustness data (a mapping function consisting of pairs of relative position in the phoneme and robustness) in the robustness database 190 of the speech synthesizer 10b (S404).
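The rule of the third embodiment for deriving the relative time position is not restated in this section. The following sketch therefore assumes one plausible rule: each state is placed at the center of its average span, normalized by the total average length of the phoneme, and the variance is used directly as the robustness. The function name `robustness_data` is hypothetical.

```python
def robustness_data(mean_stops, var_stops):
    """Build (relative position in phoneme, robustness) pairs.

    mean_stops[j]: average number of stops in state j.
    var_stops[j]:  variance of the number of stops in state j.
    Assumption: a state sits at the center of its average span.
    """
    total = sum(mean_stops)
    pairs = []
    elapsed = 0.0
    for mean, var in zip(mean_stops, var_stops):
        center = (elapsed + mean / 2.0) / total  # position in [0, 1]
        pairs.append((center, var))
        elapsed += mean
    return pairs


# Three hypothetical states: a long, variable middle state.
print(robustness_data([2, 4, 2], [0.5, 2.0, 0.5]))
```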
[0079]
On the other hand, the speech synthesizer 10b performs linguistic processing, prosody generation and segment selection in the same manner as in the first embodiment, and sets the frame length of the speech to be synthesized according to the robustness against the time length expansion and contraction stored in the robustness database 190. Then, a waveform is generated in accordance with the synthesis parameters obtained by modifying the time length of the frame, the fundamental frequency, the voice intensity, and the like.
[0080]
As described above, according to the speech unit expansion/contraction method in the fourth embodiment, the variance of the number of stops in each state obtained in the learning process of the HMM is used as the robustness against time length expansion and contraction, and the expansion/contraction ratio is set so that it is large for frames with high robustness and small for frames with low robustness. The portions where the number of stops in each state of the HMM varies widely are thereby expanded or contracted the most. As a result, it is possible to prevent sound quality deterioration due to time length adjustment, and to generate synthesized speech of high sound quality with less speech unit data. In addition, because the robustness against time length expansion and contraction is generated automatically from the HMM learning process, an optimum time length expansion function is created for each segment database, and the creation work is further reduced.
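The allocation formula used by the frame time length expansion/contraction rate calculation unit 200 is not restated in this section. As a minimal sketch only, the following assumes that the required change in total duration is shared among the frames in proportion to robustness times original frame length. The function name `stretch_frames` and this weighting are illustrative assumptions, not the method of the embodiments.

```python
def stretch_frames(frame_lengths, robustness, target_total):
    """Distribute a duration change over frames according to robustness.

    frame_lengths[i]: original length of frame i (for example in ms).
    robustness[i]:    robustness at the center of frame i.
    target_total:     desired total duration of the speech unit.
    Assumption: the change is shared in proportion to robustness * length.
    """
    total = sum(frame_lengths)
    delta = target_total - total
    weights = [r * d for r, d in zip(robustness, frame_lengths)]
    weight_sum = sum(weights)
    if weight_sum == 0:
        # No robustness information: fall back to uniform scaling.
        return [d * target_total / total for d in frame_lengths]
    new_lengths = [d + delta * w / weight_sum
                   for d, w in zip(frame_lengths, weights)]
    if any(length <= 0 for length in new_lengths):
        raise ValueError("contraction too strong for this weighting")
    return new_lengths


# The middle frame is three times as robust and absorbs most of the change.
print(stretch_frames([10, 10, 10], [1, 3, 1], 40))
```

With this weighting the total always equals the target, while a frame with zero robustness keeps its original length.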
[0081]
As described above, the speech synthesis system according to the present invention has been described based on Embodiments 1 to 4, but the present invention is not limited to these embodiments.
For example, in the above embodiments, when the robustness corresponding to the center time position of a frame of the speech unit to be synthesized is not recorded in the robustness database 190, the frame time length expansion/contraction rate calculation unit 200 of the speech synthesizer 10b determines the robustness by spline interpolation of the robustness data (pairs of relative time position in the phoneme and robustness) recorded in the robustness database 190. However, the present invention is not limited to this; linear interpolation, or approximation by a quadratic or cubic function, may be used instead.
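As an illustration of the linear interpolation alternative, the following sketch looks up the robustness at an arbitrary relative position from recorded pairs. The function name `robustness_at` is hypothetical, and holding the end values constant outside the recorded range is an assumption of this sketch rather than something stated in the description.

```python
from bisect import bisect_right


def robustness_at(position, pairs):
    """Linearly interpolate robustness at a relative position in a phoneme.

    pairs: (relative position, robustness) tuples with distinct positions.
    Assumption: outside the recorded range the nearest value is held.
    """
    pairs = sorted(pairs)
    xs = [p for p, _ in pairs]
    ys = [r for _, r in pairs]
    if position <= xs[0]:
        return ys[0]
    if position >= xs[-1]:
        return ys[-1]
    hi = bisect_right(xs, position)  # first recorded position > position
    lo = hi - 1
    t = (position - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + t * (ys[hi] - ys[lo])


# Hypothetical robustness data and a frame centered between two entries.
data = [(0.125, 0.5), (0.5, 2.0), (0.875, 0.5)]
print(robustness_at(0.3125, data))
```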
[0082]
The speech unit database 110a of the offline processing device and the speech unit database 110b of the speech synthesizer may contain data obtained from the same speaker, or the speech unit database 110a of the offline processing device may include the speech unit database 110b of the speech synthesizer. As a result, the speech unit data used for calculating the robustness and the speech unit data used for speech synthesis are associated with each other, and the time length of a speech unit can be expanded or contracted with higher sound quality.
[0083]
[Effect of the invention]
As is clear from the above description, according to the speech synthesis system of the present invention, the robustness against time length expansion and contraction is calculated for each state of the acoustic model corresponding to a speech unit, and the time length of each part of the speech unit is expanded or contracted based on the ratio of the robustness, so that sound quality deterioration due to time length adjustment is prevented. This means that, within the same range of sound quality deterioration as before, the adjustable range of time length is widened, so that it is no longer necessary to prepare long phoneme segments or a plurality of segments of the same phoneme for different durations. Speech units can thus be expanded and contracted with little sound quality degradation using less segment data, and the duration of speech units can be adjusted to suit various utterance states. Furthermore, since the mapping function used for the time length adjustment is generated automatically, the work of creating the segment database is reduced.
[0084]
According to the speech synthesis system of the present invention, the mapping function used for time length expansion and contraction is generated based on a speech unit database together with an acoustic model learned from that database or an acoustic model of unspecified speakers, or based on the transition sequences recorded in the process of generating an acoustic model from the speech unit database. It is therefore possible to adjust the duration of speech units on the basis of various databases.
[0085]
As described above, according to the present invention, it is possible to synthesize speech having an arbitrary duration without deteriorating the sound quality, thereby realizing high-quality speech synthesis close to natural speech. The practical value is extremely high.
[Brief description of the drawings]
FIG. 1 is a functional block diagram showing a configuration of a speech synthesis system according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing an operation procedure of an offline processing device of the speech synthesis system.
FIG. 3 is a diagram illustrating a method (first half) of calculating robustness data by a robustness calculating unit.
FIG. 4 is a diagram illustrating a method (second half) of calculating robustness data by a robustness calculating unit.
FIG. 5 is a flowchart showing an operation procedure of the speech synthesis device of the speech synthesis system.
FIG. 6 is a diagram illustrating an example of obtaining robustness at a center time position of each frame of a unit by performing spline interpolation on robustness data.
FIG. 7 is a functional block diagram illustrating a configuration of a speech synthesis system according to Embodiment 2 of the present invention.
FIG. 8 is a flowchart showing an operation procedure of an offline processing device of the speech synthesis system.
FIG. 9 is a functional block diagram showing a configuration of a speech synthesis system according to Embodiment 3 of the present invention.
FIG. 10 is a flowchart showing an operation procedure of the offline processing device of the speech synthesis system.
FIG. 11 is a schematic diagram of a learning process of a hidden Markov model.
FIG. 12 is a functional block diagram showing a configuration of a speech synthesis system according to Embodiment 4 of the present invention.
FIG. 13 is a flowchart showing an operation procedure of the offline processing device of the speech synthesis system.
FIG. 14 is a diagram illustrating a method of expanding and contracting the duration of a voice pattern.
[Explanation of symbols]
10, 20, 30, 40 speech synthesis system
10a, 20a, 30a, 40a Offline processing device
10b speech synthesizer
110a, 110b Speech unit database
120, 420 Acoustic model learning unit
130 Acoustic model database
140, 340 Matching unit
150, 450 Robustness calculation unit
160 language processing unit
170 Prosody generation unit
180 unit selection unit
190 Robustness Database
200 frame time length expansion / contraction ratio calculation unit
210 Frame deformation part
220 Waveform generator
330 Unspecified speaker acoustic model database
430 Transition sequence recording unit
440, 540 Number-of-stops calculation unit

Claims

A speech unit expansion/contraction device for expanding or contracting the time length of a speech unit for speech synthesis, comprising:
robustness calculating means for calculating, from a plurality of speech unit data with speech unit boundary information and an acoustic model that uses a statistical method and has a plurality of states for each speech unit, the robustness of the speech unit data against time length expansion and contraction for each state of the acoustic model corresponding to each speech unit; and
expansion/contraction means for expanding or contracting, for a speech unit for speech synthesis, the time length of each part of the speech unit corresponding to each state, based on the ratio of the calculated robustness of the individual states.

The speech unit expansion/contraction device according to claim 1, wherein the robustness calculating means specifies, for the plurality of speech unit data, the variation of the time width of each state of the acoustic model corresponding to each speech unit, and calculates the specified variation of the time width as the robustness.

The speech unit expansion/contraction device according to claim 2, wherein the robustness calculating means associates, for each of the plurality of speech unit data, each frame constituting the speech unit data with one of the states, and calculates the standard deviation of the number of frames associated with each state as the robustness.

The speech unit expansion/contraction device according to claim 3, wherein the robustness calculating means calculates the average of the number of frames associated with each state, specifies, using the calculated average, a relative position in the phoneme, which is the relative position of each state within the speech unit, and generates pairs of the specified relative position in the phoneme and the robustness as robustness data, and
the expansion/contraction means specifies the robustness corresponding to each part of the speech unit according to the relationship between the relative position in the phoneme and the robustness determined from the pairs generated by the robustness calculating means, and expands or contracts the time length based on the specified robustness.

The speech unit expansion / contraction device according to claim 1, wherein the acoustic model is a model generated by learning using the speech unit data.

The speech unit expansion / contraction device according to claim 1, wherein the acoustic model is a model generated using speech unit data including speech generated by an unspecified number of speakers.

The speech unit expansion / contraction device according to claim 1, wherein the acoustic model is a hidden Markov model.

The speech unit expansion/contraction device according to claim 7, wherein the robustness calculating means specifies, in the process of generating the acoustic model using the plurality of speech unit data, the variation in the number of stops in each state in the transition sequence for each speech unit data, and calculates the specified variation in the number of stops as the robustness.

The speech unit expansion/contraction device according to claim 8, wherein the robustness calculating means calculates the variance of the number of stops as the robustness.

The speech unit expansion/contraction device according to claim 9, wherein the robustness calculating means calculates the average of the number of stops in each state, specifies, using the calculated average, a relative position in the phoneme, which is the relative position of each state within the speech unit, and generates pairs of the specified relative position in the phoneme and the robustness as robustness data, and
the expansion/contraction means specifies the robustness corresponding to each part of the speech unit according to the relationship between the relative position in the phoneme and the robustness determined from the pairs generated by the robustness calculating means, and expands or contracts the time length based on the specified robustness.

The speech unit expansion/contraction device according to claim 8, wherein the robustness calculating means calculates the maximum likelihood state transition sequence for each learning sequence when calculating, in the process of generating the acoustic model, the forward probability or the backward probability of reaching each state at each time of the learning signal, and calculates the variation in the number of stops in each state in that transition sequence as the robustness.

The speech unit expansion/contraction device according to claim 1, wherein the speaker of the speech unit data is the same as the speaker of the speech unit for speech synthesis.

The speech unit expansion / contraction device according to claim 1, wherein the speech unit data includes a speech unit for the speech synthesis.

The speech unit expansion/contraction device according to claim 4 or 10, wherein the expansion/contraction means specifies the robustness corresponding to each part of the speech unit by interpolation using a plurality of the pairs of the relative position in the phoneme and the robustness generated by the robustness calculating means.

The speech unit expansion/contraction device according to claim 14, wherein the expansion/contraction means specifies the robustness corresponding to each part of the speech unit by linear interpolation using a plurality of the pairs of the relative position in the phoneme and the robustness generated by the robustness calculating means.

The speech unit expansion/contraction device according to claim 14, wherein the expansion/contraction means specifies the robustness corresponding to each part of the speech unit by interpolation with a spline function using a plurality of the pairs of the relative position in the phoneme and the robustness generated by the robustness calculating means.

The speech unit expansion/contraction device according to claim 14, wherein the expansion/contraction means specifies the robustness corresponding to each part of the speech unit by approximation with a quadratic function or a cubic function using a plurality of the pairs of the relative position in the phoneme and the robustness generated by the robustness calculating means.

A speech unit expansion/contraction method for expanding or contracting the time length of a speech unit for speech synthesis, comprising:
a robustness calculating step of calculating, from a plurality of speech unit data with speech unit boundary information and an acoustic model that uses a statistical method and has a plurality of states for each speech unit, the robustness of the speech unit data against time length expansion and contraction for each state of the acoustic model corresponding to each speech unit; and
an expansion/contraction step of expanding or contracting, for a speech unit for speech synthesis, the time length of each part of the speech unit corresponding to each state, based on the ratio of the calculated robustness of the individual states.

The speech unit expansion/contraction method according to claim 18, wherein, in the robustness calculating step, the variation of the time width of each state of the acoustic model corresponding to each speech unit is specified for the plurality of speech unit data, and the specified variation of the time width is calculated as the robustness.

The speech unit expansion/contraction method according to claim 18, wherein, in the robustness calculating step, the variation in the number of stops in each state in the transition sequence for each speech unit data is specified in the process of generating the acoustic model using the plurality of speech unit data, and the specified variation in the number of stops is calculated as the robustness.

A program for a speech unit expansion / contraction device that expands or compresses the time length of a speech unit for speech synthesis,
A non-transitory computer-readable storage medium storing a program for causing a computer to function as means included in the speech unit expansion / contraction device according to claim 1.