JP2004246292A

JP2004246292A - Word clustering speech database, and device, method and program for generating word clustering speech database, and speech synthesizing device

Info

Publication number: JP2004246292A
Application number: JP2003038725A
Authority: JP
Inventors: Hiroyuki Segi; 寛之世木; Toru Tsugi; 徹都木
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2003-02-17
Filing date: 2003-02-17
Publication date: 2004-09-02

Abstract

PROBLEM TO BE SOLVED: To provide a word clustering speech database in which words are clustered with the phoneme environments before and after the same word taken into account, a device, method and program for generating the word clustering speech database, and a speech synthesizing device capable of generating synthesized speech quickly.

SOLUTION: The word clustering speech database generating device 1, which generates the word clustering speech database 19 that contains a plurality of words and is used for speech synthesis, is equipped with a feature quantity calculation part 7 that calculates an acoustic feature quantity of each word and a word clustering part 9 that generates word clustering information for clustering the words according to the acoustic feature quantity.

COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、音声合成する際に供される単語クラスタリング音声データベースおよび単語クラスタリング音声データベース生成装置、単語クラスタリング音声データベース生成方法、単語クラスタリング音声データベース生成プログラムならびに単語クラスタリング音声データベースを備える音声合成装置に関する。
【０００２】
【従来の技術】
従来、音声合成装置によって音声合成する際に利用される音声データベース中に含まれている単語をクラスタリングするものとして、単語自動分類装置（特許文献１参照）がある。なお、クラスタリングとは、クラスターと呼ばれるカテゴリに単語を分類することである。クラスターとは、単語の上位概念、または、単語を属している属性といえるもので、例えば、単語「サッカー」は「スポーツ」というクラスターに属しており、単語「晴れ」は「天気」というクラスターに属している。
【０００３】
特許文献１で開示されている単語自動分類装置は、異なる単語の共起頻度の統計量を算出し、この統計量に基づいて、確率モデルの推定問題として、音声データベース中に含まれている単語のクラスタリングを行うものである。
【０００４】
【特許文献１】
特開平１１−１４３８７５号公報（段落２４〜４０、第１図）
【０００５】
【発明が解決しようとする課題】
しかしながら、従来の単語自動分類装置では、異なる単語のクラスタリングが行われているので、例えば、「サッカー」と「野球」を同じクラスターに属するようにできても、音声データベースに含まれている単語「の」（助詞）の前後の音素環境を想定してクラスタリングすることができないという問題がある。
【０００６】
また、このような単語（クラスタリングされていない単語）を含むテキストデータを、音声合成装置によって音声合成しようとすると、音声データベースから最適な音声を検索するまでに多大な処理時間を要して、合成音声を高速に生成することができないという問題がある。
【０００７】
そこで、本発明の目的は前記した従来の技術が有する課題を解消し、同一の単語の前後の音素環境を想定してクラスタリングすることができる単語クラスタリング音声データベースおよび単語クラスタリング音声データベース生成装置、単語クラスタリング音声データベース生成方法、単語クラスタリング音声データベース生成プログラムならびに合成音声を高速に生成することができる音声合成装置を提供することにある。
【０００８】
【課題を解決するための手段】
本発明は、前記した目的を達成するため、以下に示す構成とした。
請求項１記載の単語クラスタリング音声データベースは、複数の単語を含んでなり、音声合成する際に供される単語クラスタリング音声データベースであって、前記単語における音響上の特徴量に基づいて、前記単語がクラスタリングされている構成とした。
【０００９】
かかる構成によれば、単語における音響上の特徴量に基づいて、単語がクラスタリングされている。音響上の特徴量とは、単語の最初の音素と最後の音素における特徴量（メルケプストラム係数や、線形予測分析によるケプストラム係数等）の平均と分散とを計算した（算出した）ものである。クラスタリングとは、クラスターと呼ばれるカテゴリに単語を分類することである。また、この構成におけるクラスターとは、一つの単語と、この単語に接続される前後の音素との関係を設定したものである。例えば、単語「の」（助詞）と、この「の」の前の音素が母音であるクラスター（クラスターの番号を“１”とする）は、“クラスター１；「の」の前の音素が母音である”といったように示されるものである。
【００１０】
請求項２記載の単語クラスタリング音声データベース生成装置は、複数の単語を含んでなり、音声合成する際に供される単語クラスタリング音声データベースを生成する単語クラスタリング音声データベース生成装置であって、語頭語尾音響特徴量算出手段と、単語クラスタリング手段と、を備える構成とした。
【００１１】
かかる構成によれば、語頭語尾音響特徴量算出手段と、単語クラスタリング手段とが備えられている。語頭語尾音響特徴量算出手段では、語頭語尾音響特徴量が算出される。この語頭語尾音響特徴量は、単語における語頭および語尾の音響上の特徴量、すなわち、メルケプストラム係数や、線形予測分析によるケプストラム係数等の平均、分散である。単語クラスタリング手段では、語頭語尾音響特徴量に基づいて、音声データベースの単語がクラスタリングされる。
【００１２】
請求項３記載の単語クラスタリング音声データベース生成装置は、複数の単語を含んでなり、音声合成する際に供される単語クラスタリング音声データベースを生成する単語クラスタリング音声データベース生成装置であって、音素音響特徴量算出手段と、音素継続時間長算出手段と、単語クラスタリング手段と、を備える構成とした。
【００１３】
かかる構成によれば、音素音響特徴量算出手段と、音素継続時間長算出手段と、単語クラスタリング手段とが備えられている。音素音響特徴量算出手段では、音素音響特徴量が算出される。この音素音響特徴量は、単語における最初の音素および最後の音素での音響上の特徴量、すなわち、メルケプストラム係数や、線形予測分析によるケプストラム係数等の平均、分散である。音素継続時間長算出手段では、音素継続時間長が算出される。この音素継続時間長は、単語における最初の音素および最後の音素の継続時間、通常、数十ｍｓ程度のものである。単語クラスタリング手段では、音素音響特徴量および音素継続時間長に基づいて、音声データベースの単語がクラスタリングされる。
【００１４】
請求項４記載の単語クラスタリング音声データベース生成装置は、複数の単語を含んでなり、音声合成する際に供される単語クラスタリング音声データベースを生成する単語クラスタリング音声データベース生成装置であって、モデル音響特徴量算出手段と、状態占有確率値算出手段と、単語クラスタリング手段と、を備える構成とした。
【００１５】
かかる構成によれば、モデル音響特徴量算出手段と、状態占有確率値算出手段と、単語クラスタリング手段とが備えられている。モデル音響特徴量算出手段では、モデル音響特徴量が算出される。このモデル音響特徴量は、モデル化した状態における音響上の特徴量、すなわち、メルケプストラム係数や、線形予測分析によるケプストラム係数等の平均、分散である。状態占有確率値算出手段では、状態占有確率値が算出される。この状態占有確率値は、モデル化した各状態における占有確率を示すものである。単語クラスタリング手段では、モデル音響特徴量と、状態占有確率値と、音韻表における調音方式と、音韻表における調音位置とに基づいて、音声データベースの単語がクラスタリングされる。
【００１６】
請求項５記載の単語クラスタリング音声データベース生成方法は、複数の単語を含んでなり、音声合成する際に供される単語クラスタリング音声データベースを生成する単語クラスタリング音声データベース生成方法であって、音響特徴量算出ステップと、単語クラスタリングステップと、を含むものとした。
【００１７】
この方法によれば、音響特徴量算出ステップと、単語クラスタリングステップとが含まれている。音響特徴量算出ステップでは、音響特徴量が算出され、単語クラスタリングステップでは、音響特徴量に基づいて、音声データベースの単語がクラスタリングされる。
【００１８】
請求項６記載の単語クラスタリング音声データベース生成プログラムは、複数の単語を含んでなり、音声合成する際に供される単語クラスタリング音声データベースを生成する装置を、以下に示す手段として機能させることを特徴とする。当該装置を機能させる手段は、音響特徴量算出手段、単語クラスタリング手段、である。
【００１９】
かかる構成によれば、音響特徴量算出手段と、単語クラスタリング手段とによって装置が機能する。音響特徴量算出手段は、音響特徴量が算出され、単語クラスタリング手段では、音響特徴量に基づいて、音声データベースの単語がクラスタリングされる。
【００２０】
請求項７記載の音声合成装置は、請求項１に記載の単語クラスタリング音声データベースを利用して、入力されたテキストデータを音声合成する音声合成装置であって、前記単語クラスタリング音声データベースと、テキストデータ分解手段と、クラスター探索手段と、合成音声生成手段と、を備える構成とした。
【００２１】
かかる構成によれば、単語クラスタリング音声データベースと、テキストデータ分解手段と、クラスター探索手段と、合成音声生成手段とが備えられている。テキストデータ分解手段では、例えば、形態素解析によってテキストデータが解析されて、まず、単語に分割され、続いて、音素に分解される。クラスター探索手段では、単語クラスタリング音声データベースにおいて、クラスターに分類されている単語の中から合成音声を生成する際に最適な単語の候補である音声合成候補が順次探索される。合成音声生成手段では、探索された音声合成候補が連結され、合成音声が生成される。
【００２２】
【発明の実施の形態】
以下、本発明の一実施の形態について、図面を参照して詳細に説明する。
この実施の形態の説明では、まず、単語クラスタリング音声データベースの概略について説明する。そして、単語クラスタリング音声データベース生成装置の構成を説明し、続いて、当該装置で生成した単語クラスタリング音声データベースを備える音声合成装置の構成を説明する（図１参照）。次に、単語クラスタリング音声データベース生成装置の動作（単語クラスタリング音声データベースの生成の仕方）について説明し（図２参照）、続いて、音声合成装置の動作を説明する（図３参照）。さらに、単語クラスタリング音声データベース生成装置において、単語クラスタリング情報の生成の仕方について説明する（図４参照）。
【００２３】
（単語クラスタリング音声データベースの概略について）
単語クラスタリング音声データベースは、音声合成装置に備えられている単語音素毎に発話時刻が区分されている音声データベースの単語を、音響上の特徴量（以下、この実施の形態の説明では、「音響的な特徴量」と記載する）に基づいて「クラスター」に分類したもので、当該音声合成装置で音声合成する際の単語の探索時間を短縮することができ、処理速度を向上させることができるものである。この単語クラスタリング音声データベースにおけるクラスターとは、単語とこの単語の前後の音素のつながりとに基づいて、音声データベース内のすべての単語をいくつかのグループに分類した際の、それぞれのグループ（グループ一つ一つ）のことである。
【００２４】
（単語クラスタリング音声データベース生成装置の構成）
図１は、単語クラスタリング音声データベース生成装置と音声合成装置とのブロック図であり、この図１に示すように、単語クラスタリング音声データベース生成装置１は、音声合成装置に供される通常の音声データベースに含まれている単語をクラスターに分類した単語クラスタリング音声データベースを生成するもので、入出力部５と、特徴算出部７と、単語クラスタリング部９と、音素継続時間長算出部１１と、状態占有確率値算出部１３とを備えている。
【００２５】
なお、この図１の音声合成装置は、単語クラスタリング音声データベース生成装置１でクラスタリングした単語クラスタリング音声データベースを備えているが、当初（単語クラスタリング音声データベース生成装置１で単語クラスタリング音声データベースを生成する以前）、通常の音声データベースを備えていたものである。つまり、通常の音声データベースは、各単語のクラスタリングが行われておらず、一つのクラスターしか存在していない状況であるといえる（前後の音素環境を考慮していないといえる）。
【００２６】
入出力部５は、音声合成装置に備えられている単語クラスタリング音声データベース（音声データベース）からデータベース情報が入力される（取得される）と共に、単語クラスタリング部９から出力される単語クラスタリング情報を当該音声合成装置に出力するものである。なお、この入出力部５は、インターネット等の通信回線網（図示せず）を介して、情報（データベース情報、単語クラスタリング情報）の送受信が行えるように構成してもよい。
【００２７】
データベース情報は、音声合成装置に備えられている音声データベース（単語クラスタリング音声データベース）の音声データに関する情報であり、音声データは、単語および音素（ローマ字表記したときの各アルファベットに該当）と音声波形とを対応させた複数の文章からなるもの（単語および音素毎に発話時刻が区分されている）である。つまり、データベース情報は、当該音声データベースに記録されている（含まれている）音声データについて、各単語および音素のローマ字表記（各アルファベットに該当）と、各単語および音素の発話時刻（発話開始時刻および発話終了時刻；発話時間）とを組み合わせたものである。
【００２８】
単語クラスタリング情報は、音声データベースに記録されている（含まれている）単語を複数のクラスターに分類するための情報であり、クラスターは、単語とこの単語の前後の音素のつながりとに基づいて、音声データベース内のすべての単語をいくつかのグループに分類した際の、それぞれのグループ（グループ一つ一つ）のことである。例えば、単語「の」（助詞）と、この「の」の前の音素が母音であるクラスター（クラスターの番号を“１”とする）は、“クラスター１；「の」の前の音素が母音である”といったように示されるものである。
【００２９】
特徴算出部７は、入出力部５を介して入力されたデータベース情報に基づいて、各音声データの「単語」の音響的な特徴量（特徴量の各次元での平均、分散等の統計量）を算出するもので、語頭語尾音響特徴量算出手段７ａと、音素音響特徴量算出手段７ｂと、モデル音響特徴量算出手段７ｃとを備えている。つまり、この特徴量算出部７は、単語の前後の音素環境を考慮しつつ（反映させつつ）、単語毎に、語頭・語尾の音素でのＭＦＣＣ（ＭｅｌＦｒｅｑｕｅｎｃｙＣｅｐｓｔｒｕｍＣｏｅｆｆｉｃｉｅｎｔ；音質を表すパラメータ［音声をＦＦＴし、逆フーリエ変換してケプストラム係数に変換したパラメータ］）、基本周波数等の統計量を算出するものである。
【００３０】
語頭語尾音響特徴量算出手段７ａは、単語の語頭と語尾の特定区間（４０ｍｓ）における音響的な特徴量の統計量（語頭語尾音響特徴量）を算出するもので、例えば、前記した単語「の」の場合、数式（１）〜（４）によって、語頭語尾音響特徴量を算出する。
【００３１】
【数１】

【００３２】
【数２】

【００３３】
【数３】

【００３４】
【数４】

【００３５】
なお、μ_語頭 ^ｊ（ａ−の＋ｕ）は、単語「の」の先行音素（直前の音素）が「ａ」で後続音素（直後の音素）が「ｕ」である場合の、語頭におけるｊ次元目の特徴量の平均値を示しており、μ_語尾 ^ｊ（ａ−の＋ｕ）は、単語「の」の先行音素（直前の音素）が「ａ」で後続音素（直後の音素）が「ｕ」である場合の、語尾におけるｊ次元目の特徴量の平均値を示している。また、Σ_語頭 ^ｊ（ａ−の＋ｕ）は、単語「の」の先行音素（直前の音素）が「ａ」で後続音素（直後の音素）が「ｕ」である場合の、語頭におけるｊ次元目の特徴量の分散値を示しており、Σ_語尾 ^ｊ（ａ−の＋ｕ）は、単語「の」の先行音素（直前の音素）が「ａ」で後続音素（直後の音素）が「ｕ」である場合の、語尾におけるｊ次元目の特徴量の分散値を示している。
【００３６】
音素音響特徴量算出手段７ｂは、単語における最初の音素と最後の音素における音響的な特徴量の統計量（音素音響特徴量）を算出するもので、例えば、前記した単語「の」の場合、前記した数式（１）〜（４）によって、音素音響特徴量を算出する。
【００３７】
モデル音響特徴量算出手段７ｃは、単語を音韻表における調音方式と、音韻表における調音位置との少なくとも一方に基づいて、モデル化した状態における音響的な特徴量の統計量（モデル音響特徴量）を算出するもので、例えば、前記した単語「の」の場合、ＨＭＭ（ＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌ）の作成方法によって、モデル音響特徴量を算出する。
【００３８】
単語クラスタリング部９は、特徴量算出部７で算出された特徴量の統計量（語頭語尾音響特徴量、音素音響特徴量、モデル音響特徴量）、音素継続時間長算出部１１で算出された音素継続時間長、状態占有確率値算出部１３で算出された状態占有確率値と、音韻表における調音方式と、音韻表における調音位置とに基づいて、単語のクラスタリング（この実施の形態では、Ｔｒｅｅ−ｂａｓｅｄｃｌｕｓｔｅｒｉｎｇ）を行うものである。つまり、この単語クラスタリング部９は、前後の音素環境を考慮した単語を「クラスター」にどのように分けるかを決定する単語クラスタリング情報を生成するものである。この単語クラスタリング部９が特許請求の請求項に記載した単語クラスタリング手段に相当するものである。
【００３９】
この単語クラスタリング部９において生成される単語クラスタリング情報は、特徴量算出部７で算出された特徴量に応じて、適宜変更されるものである。つまり、特徴量算出部７で語頭語尾音響特徴量が算出された場合、この語頭語尾音響特徴量および音素のデータ数に基づいて、単語クラスタリング情報が生成され、特徴量算出部７で音素音響特徴量が算出された場合、音素音響特徴量および音素継続時間長に基づいて、単語クラスタリング情報が生成され、特徴量算出部７で、モデル音響特徴量が算出された場合、モデル音響特徴量および状態占有確率値に基づいて、単語クラスタリング情報が生成されることとなる。
【００４０】
音韻表における調音方式とは、音声器官（喉頭等）によって発声する際の声道に、閉鎖または狭まりを形成する方式（動かし方）のことである。音韻表における調音位置とは、調音器官（舌、下唇）によって調整される調音点（口蓋、歯茎、上唇等）の場所を指すもので、この場所において音声が発声される際の閉鎖または狭まりが形成される。
【００４１】
なお、この音韻表における調音方式と調音位置に関するデータが単語クラスタリング音声データベース生成装置１の記録部（図示せず）に記録されており、状態占有確率値算出部１３で状態占有確率値が算出された場合に記録部（図示せず）から読み出されて利用される。
【００４２】
この単語クラスタリング情報の生成の手順を以下に示す。まず、単語の「中心」（前後の音素を除外した部分）が同じ単語（中心同一単語とする）を集める（分類する）。この中心同一単語を、特定の条件（中心同一単語を分けるための質問）に該当するもの、しないもので２分割する。分割前後の対数尤度Ｌ（Ｃ）（数式５）の差ΔＬ（ｑ）（数式６）が最も大きくなった条件を最初の「判別木」の条件として採用する。
【００４３】
【数５】

【００４４】
【数６】

【００４５】
数式５において、Σ（Ｃ）は、共有化された状態Ｃ（単語の「中心」）の分散、Ｘｆは、ｆ番目のフレームで観測された特徴ベクトル、ｎは特徴ベクトルの次元数、γｓ（Ｘｆ）は共有化された状態Ｃに含まれるある状態ｓが特徴ベクトルＸｆを出力する確率を示すものである。
【００４６】
また、数式６において、Ｃ（ｑ，ＹＥＳ）は条件（質問）ｑでＹＥＳに分類される共有化された状態、Ｃ（ｑ，ＮＯ）は条件（質問）ｑでＮＯに分類される共有化された状態を示すものである。このｑとしては、「前の音素が摩擦音か？」、「後の音素が子音か？」等、音韻表の調音方式や調音位置が同じ音素かどうかを設定する条件（質問）が用いられている。また、数式５におけるΣ（Ｃ）は、次に示す数式７によって算出するものである。
【００４７】
【数７】

【００４８】
ただし、μ（Ｃ）は、共有化された状態Ｃの平均を示すものであり、次に示す数式８によって算出するものである。
【００４９】
【数８】

【００５０】
次に、条件（質問）によって２分割されたそれぞれのクラスターについて、当初採用された条件（質問）以外の条件（質問）で、さらに、２分割を行い、前記したのと同様に、分割前後の対数尤度Ｌ（Ｃ）（数式５）の差ΔＬ（ｑ）（数式６）が最も大きくなる条件（質問）を次の「判別木」の条件（質問）として採用する。
【００５１】
さらに、これらの条件（質問）による分割を、分割前後の対数尤度Ｌ（Ｃ）（数式５）の差ΔＬ（ｑ）が予め設定された閾値を下回るまで繰り返していき、「判別木」を生成する。最後に、生成された「判別木」において、前後の音素環境を全て考慮した単語に適用して、同じリーフ（判別木の末端であり、葉っぱに該当する）に属した単語を同一のクラスターに属する単語として扱う「単語クラスタリング情報」とされる。
【００５２】
音素継続時間長算出部１１は、特徴量算出部７の音素音響特徴量算出手段７ｂで、音素音響特徴量が算出された場合に、音素の継続時間を算出するもので、単語の語頭、語尾の音素の継続時間長をそれぞれ算出し、単語クラスタリング部９に出力するものである。音素の継続時間は、音素の発話開始時刻と発話終了時刻とによって算出される。この音素継続時間長算出部１１が特許請求の範囲の請求項に記載した音素継続時間長算出手段に相当するものである。
【００５３】
状態占有確率値算出部１３は、特徴量算出部７のモデル音響特徴量算出手段７ｃで、モデル音響特徴量が算出された場合に、状態占有確率値を算出するものである。状態占有確率値は、モデル化した状態における単語の占有確率である。この状態占有確率値算出部１３が特許請求の範囲の請求項に記載した状態占有確率値算出手段に相当するものである。
【００５４】
この単語クラスタリング音声データベース生成装置１によれば、特徴量算出部７の語頭語尾音響特徴量算出手段７ａで、語頭語尾音響特徴量が算出され、単語クラスタリング部９で、語頭語尾音響特徴量と音素のデータ数に基づいて、音声データベースの単語をクラスタリングする単語クラスタリング情報が生成される。このため、単語の語頭および語尾の音響特徴量である語頭語尾音響特徴量が参照されて、音声データベースの単語をクラスタリングする単語クラスタリング情報が生成されており、この単語クラスタリング情報により、同一の単語の前後の音素環境を想定して（考慮して）、音声データベースの単語をクラスタリングすることができる。
【００５５】
また、この単語クラスタリング音声データベース生成装置１によれば、特徴量算出部７の音素音響特徴量算出手段７ｂで、音素音響特徴量が算出され、音素継続時間長算出部１１で、音素継続時間長が算出される。単語クラスタリング部９で、音素音響特徴量および音素継続時間長に基づいて、音声データベースの単語をクラスタリングする単語クラスタリング情報が生成される。このため、単語の最初の音素および最後の音素の音響特徴量である音素音響特徴量が参照されて、音声データベースの単語をクラスタリングする単語クラスタリング情報が生成されており、この単語クラスタリング情報により、同一の単語の前後の音素環境を想定して（考慮して）、音声データベースの単語をクラスタリングすることができる。
【００５６】
さらに、この単語クラスタリング音声データベース生成装置１によれば、特徴量算出部７のモデル音響特徴量算出手段７ｃで、モデル音響特徴量が算出され、状態占有確率値算出部１３で、状態占有確率値が算出される。単語クラスタリング部９で、モデル音響特徴量および状態占有確率値に基づいて、音声データベースの単語をクラスタリングする単語クラスタリング情報が生成される。このため、単語をモデル化した状態における音響特徴量であるモデル音響特徴量が参照されて、音声データベースの単語をクラスタリングする単語クラスタリング情報が生成されており、この単語クラスタリング情報により、同一の単語の前後の音素環境を想定して（考慮して）、音声データベースの単語をクラスタリングすることができる。
【００５７】
（音声合成装置の構成）
また、図１に示した音声合成装置３は、単語クラスタリング音声データベース生成装置１で生成された単語クラスタリング情報に従ってクラスタリングされた音声データベース（単語クラスタリング音声データベース１９）を使用して、入力されたテキストデータの音声合成を行うもので、テキストデータ解析部１５と、クラスター探索部１７と、単語クラスタリング音声データベース１９と、合成音声生成部２１とを備えている。
【００５８】
テキストデータ解析部１５は、入力されたテキストデータを形態素解析して、単語および音素に分解するものである。形態素解析は、テキストデータを解析して、このテキストデータに含まれている単語の語幹、接辞、語形変化等を、意味を担う最小の言語要素である形態素に同定することである。なお、テキストデータ解析部１５が特許請求の範囲の請求項に記載したテキストデータ分解手段に相当するものである。
【００５９】
クラスター探索部１７は、テキストデータ解析部１５で分解された単語および音素を、単語クラスタリング音声データベース１９に記録されているクラスター毎に探索して、このクラスターに含まれている単語の中から、音声合成する際の候補となる音声合成候補を取得するものである。この音声合成候補は、クラスターに属している「単語」のみの音声データを指すものである。なお、このクラスター探索部１７がクラスター探索手段に相当するものである。
【００６０】
単語クラスタリング音声データベース１９は、データベースの構成単位として、単語および音素からなる複数の「文章」を記録していると共に、各単語が単語クラスタリング情報によって、クラスタリングされている（クラスターに分類されている）ものである。なお、各文章には「文番号」が付されており、各単語の文章先頭から発話時刻が記録されている。
【００６１】
合成音声生成部２１は、クラスター探索部１７によって、単語クラスタリング音声データベース１９から取得された音声合成候補を連結し、合成音声を生成して、当該音声合成装置３の外部に出力するものである。なお、この合成音声生成部２１が特許請求の範囲の請求項に記載した合成音声生成手段に相当するものである。
【００６２】
この音声合成装置３によれば、テキストデータ解析部１５で、形態素解析によってテキストデータが解析されて、まず、単語および音素に分割される。クラスター探索部１７で、単語クラスタリング音声データベース１９において、クラスターに分類されている単語の中から合成音声を生成する際に最適な単語の候補である音声合成候補が順次探索される。合成音声生成部２１で、探索された音声合成候補が連結され、合成音声が生成される。このため、クラスター探索部１７で、予め、クラスターに分類されている単語の中から、音声合成候補が探索されるので、探索時間を短縮することができ、合成音声を高速に生成することができる。
【００６３】
（単語クラスタリング音声データベース生成装置の動作）
次に、図２に示すフローチャートを参照して、単語クラスタリング音声データベース生成装置１の動作について説明する（適宜、図１参照）。
【００６４】
まず、単語クラスタリング音声データベース生成装置１の入出力部５に音声合成装置３からデータベース情報が入力される（Ｓ１）。続いて、単語クラスタリング音声データベース生成装置１の主制御部（図示を省略）にて、特徴量算出部７で算出する音響特徴量が語頭語尾音響特徴量であるかどうかが判断される（Ｓ２）。特徴量算出部７で算出する音響特徴量が語頭語尾音響特徴量であると判断された場合（Ｓ２、Ｙｅｓ）には、語頭語尾音響特徴量算出手段７ａで語頭語尾音響特徴量が算出される（Ｓ３）。
【００６５】
また、特徴量算出部７で算出する音響特徴量が語頭語尾音響特徴量であると判断されなかった場合（Ｓ２、Ｎｏ）には、特徴量算出部７で算出する音響特徴量が音素音響特徴量であるかどうかが判断される（Ｓ４）。特徴量算出部７で算出する音響特徴量が音素音響特徴量であると判断された場合（Ｓ４、Ｙｅｓ）には、音素音響特徴量算出手段７ｂで音素音響特徴量が算出される（Ｓ５）。そして、音素継続時間長算出部１１で音素継続時間長が算出される（Ｓ６）。
【００６６】
さらに、Ｓ４にて、特徴量算出部７で算出する音響特徴量が音素音響特徴量であると判断されなかった場合（Ｓ４、Ｎｏ）には、モデル音響特徴量算出手段７ｃでモデル音響特徴量が算出される（Ｓ７）。そして、状態占有確率値算出部１３で状態占有確率値が算出される（Ｓ８）。
【００６７】
すると、単語クラスタリング部９で、語頭語尾音響特徴量、音素音響特徴量および音素継続時間長、モデル音響特徴量および状態占有確率値のいずれかに基づいて、単語クラスタリング情報が生成される（Ｓ９）。そして、入出力部５から音声合成装置３に単語クラスタリング情報が出力される（Ｓ１０）。
【００６８】
（音声合成装置の動作）
次に、図３に示すフローチャートを参照して、音声合成装置３の動作について説明する（適宜、図１参照）。
【００６９】
まず、音声合成装置３のテキストデータ分解部１５にテキストデータが入力され、このテキストデータ分解部１５で、テキストデータが形態素解析されて、単語および音素に分解される（Ｓ１１）。
【００７０】
そして、このテキストデータ分解部１５で分解された単語および音素に基づいて、クラスター探索部１７で単語クラスタリング音声データ１９が探索され、音声合成候補が取得される（Ｓ１２）。
【００７１】
その後、合成音声生成部２１で、音声合成候補が連結されて、合成音声が生成され、出力される（Ｓ１３）。
【００７２】
（単語クラスタリング情報の生成の仕方について）
最後に、単語クラスタリング音声データベース生成装置１において、単語クラスタリング情報の生成の仕方（一例）について、図４を参照して説明する。
【００７３】
この図４に示した単語クラスタリング情報の生成の仕方（一例）は、単語の「の」に関するものである。まず、単語「の」の全てのデータ「１００」個が「前の音素が母音である」という条件に基づいて、２分割される。単語「の」の前の音素が母音である個数が１０個であり、単語「の」の前の音素が母音でない個数が９０個である（この９０個が、図４において「クラスタ３」）。
【００７４】
そして、さらに、単語「の」の「後の音素が破裂音である」という条件に基づいて、２分割され、これらが図４において「クラスタ１」、「クラスタ２」とされる。
【００７５】
以上、一実施形態に基づいて本発明を説明したが、本発明はこれに限定されるものではない。
例えば、単語クラスタリング音声データベース生成装置１の各構成の処理を一つずつの過程ととらえ、単語クラスタリング音声データベース生成方法とみなすことや、各構成の処理を汎用的なコンピュータ言語で記述した単語クラスタリング音声データベース生成プログラムとみなすことも可能である。これらは、単語クラスタリング音声データベース生成装置１と同様の効果を得ることができる。
【００７６】
【発明の効果】
請求項１記載の発明によれば、単語における音響上の特徴量に基づいて、単語がクラスタリングされたものであるため、この単語クラスタリング音声データベースが音声合成装置に利用されれば、単語の探索時間を短縮化することができ、合成音声を高速に生成することができる。
【００７７】
請求項２、５、６記載の発明によれば、単語の音響特徴量に基づいて、音声データベースの単語をクラスタリングする単語クラスタリング情報が生成される。このため、この単語クラスタリング情報により、同一の単語の前後の音素環境を想定して（考慮して）、音声データベースの単語をクラスタリングすることができる。
【００７８】
請求項３記載の発明によれば、音素音響特徴量が算出され、音素継続時間長が算出される。これらに基づいて、音声データベースの単語をクラスタリングする単語クラスタリング情報が生成される。このため、単語の最初の音素および最後の音素の音響特徴量である音素音響特徴量が参照されて、音声データベースの単語をクラスタリングする単語クラスタリング情報が生成されており、この単語クラスタリング情報により、同一の単語の前後の音素環境を想定して（考慮して）、音声データベースの単語をクラスタリングすることができる。
【００７９】
請求項４記載の発明によれば、モデル音響特徴量が算出され、状態占有確率値が算出される。これらに基づいて、音声データベースの単語をクラスタリングする単語クラスタリング情報が生成される。このため、単語をモデル化した状態における音響特徴量であるモデル音響特徴量が参照されて、音声データベースの単語をクラスタリングする単語クラスタリング情報が生成されており、この単語クラスタリング情報により、同一の単語の前後の音素環境を想定して（考慮して）、音声データベースの単語をクラスタリングすることができる。
【００８０】
請求項７記載の発明によれば、テキストデータが解析されて、まず、単語に分割され、さらに、分割された単語それぞれに接続する前後の音素に分解され、クラスターに分類されている単語の中から合成音声を生成する際に最適な単語の候補である音声合成候補が順次探索される。そして、探索された音声合成候補が連結され、合成音声が生成される。このため、予め、クラスターに分類されている単語の中から、音声合成候補が探索されるので、探索時間を短縮することができ、合成音声を高速に生成することができる。
【図面の簡単な説明】
【図１】本発明による一実施の形態である単語クラスタリング音声データベース生成装置および音声合成装置のブロック図である。
【図２】図１に示した単語クラスタリング音声データベース生成装置の動作を説明したフローチャートである。
【図３】図１に示した音声合成装置の動作を説明したフローチャートである。
【図４】単語クラスタリング情報の生成の仕方を説明した図である。
【符号の説明】
１単語クラスタリング音声データベース生成装置
３音声合成装置
５入出力部
７特徴量算出部
７ａ語頭語尾音響特徴量算出手段
７ｂ音素音響特徴量算出手段
７ｃモデル音響特徴量算出手段
９単語クラスタリング部
１１音素継続時間長算出部
１３状態占有確率値算出部
１５テキストデータ分解部
１７クラスター探索部
１９単語クラスタリング音声データベース
２１合成音声生成部

[0001]
[Technical field to which the invention belongs]
The present invention relates to a word clustering speech database and a word clustering speech database generation device, a word clustering speech database generation method, a word clustering speech database generation program, and a speech synthesis device including a word clustering speech database used for speech synthesis.
[0002]
[Prior art]
Conventionally, an automatic word classification device (see Patent Document 1) has been used to cluster the words contained in a speech database that a speech synthesizer uses for speech synthesis. Clustering here means classifying words into categories called clusters. A cluster is a superordinate concept of a word, or an attribute to which the word belongs. For example, the word “soccer” belongs to the cluster “sports”, and the word “sunny” belongs to the cluster “weather”.
[0003]
The automatic word classification device disclosed in Patent Document 1 calculates statistics of the co-occurrence frequencies of different words and, based on these statistics, clusters the words contained in the speech database by treating the task as a probability model estimation problem.
[0004]
[Patent Document 1]
JP-A-11-143875 (paragraphs 24 to 40, FIG. 1)
[0005]
[Problems to be solved by the invention]
However, because the conventional automatic word classification device clusters different words, it can place, for example, “soccer” and “baseball” in the same cluster, but it cannot cluster the word “no” (の, a particle) contained in the speech database while taking into account the phoneme environment before and after that word.
[0006]
Furthermore, when a speech synthesizer tries to synthesize text data that includes such words (words that have not been clustered), searching the speech database for the optimal speech takes a great deal of processing time, so synthesized speech cannot be generated at high speed.
[0007]
Therefore, an object of the present invention is to solve the above-described problems of the conventional technology and to provide a word clustering speech database in which clustering takes into account the phoneme environment before and after the same word, together with a word clustering speech database generation device, a word clustering speech database generation method, a word clustering speech database generation program, and a speech synthesis device capable of generating synthesized speech at high speed.
[0008]
[Means for Solving the Problems]
The present invention has the following configuration to achieve the above object.
The word clustering speech database according to claim 1 comprises a plurality of words and is used for speech synthesis, and is configured such that the words are clustered based on acoustic feature amounts of the words.
[0009]
According to such a configuration, words are clustered based on their acoustic feature amounts. The acoustic feature amount is obtained by calculating the mean and variance of the feature values (mel-cepstrum coefficients, cepstrum coefficients obtained by linear prediction analysis, and the like) of the first phoneme and the last phoneme of a word. Clustering means classifying words into categories called clusters. A cluster in this configuration defines the relationship between one word and the phonemes connected before and after it. For example, for the word “no” (の, a particle), the cluster in which the phoneme before “no” is a vowel (let its cluster number be “1”) is written as “cluster 1; the phoneme before ‘no’ is a vowel”.
[0010]
The word clustering speech database generation device according to claim 2 generates a word clustering speech database that comprises a plurality of words and is used for speech synthesis, and is configured to include word-initial/word-final acoustic feature amount calculating means and word clustering means.
[0011]
According to such a configuration, word-initial/word-final acoustic feature amount calculating means and word clustering means are provided. The word-initial/word-final acoustic feature amount calculating means calculates the word-initial/word-final acoustic feature amount. This is the acoustic feature amount at the beginning and the end of a word, that is, the mean and variance of mel-cepstrum coefficients, cepstrum coefficients obtained by linear prediction analysis, and the like. The word clustering means clusters the words in the speech database based on the word-initial/word-final acoustic feature amount.
[0012]
The word clustering speech database generation device according to claim 3 generates a word clustering speech database that comprises a plurality of words and is used for speech synthesis, and is configured to include phoneme acoustic feature amount calculating means, phoneme duration calculating means, and word clustering means.
[0013]
According to such a configuration, phoneme acoustic feature amount calculating means, phoneme duration calculating means, and word clustering means are provided. The phoneme acoustic feature amount calculating means calculates the phoneme acoustic feature amount. This is the acoustic feature amount of the first phoneme and the last phoneme of a word, that is, the mean and variance of mel-cepstrum coefficients, cepstrum coefficients obtained by linear prediction analysis, and the like. The phoneme duration calculating means calculates the phoneme duration, which is the duration of the first phoneme and of the last phoneme of a word, typically on the order of tens of milliseconds. The word clustering means clusters the words in the speech database based on the phoneme acoustic feature amount and the phoneme duration.
[0014]
The word clustering speech database generation device according to claim 4 generates a word clustering speech database that comprises a plurality of words and is used for speech synthesis, and is configured to include model acoustic feature amount calculating means, state occupancy probability value calculating means, and word clustering means.
[0015]
According to such a configuration, model acoustic feature amount calculating means, state occupancy probability value calculating means, and word clustering means are provided. The model acoustic feature amount calculating means calculates the model acoustic feature amount. This is the acoustic feature amount in the modeled states, that is, the mean and variance of mel-cepstrum coefficients, cepstrum coefficients obtained by linear prediction analysis, and the like. The state occupancy probability value calculating means calculates the state occupancy probability value, which indicates the occupancy probability of each modeled state. The word clustering means clusters the words in the speech database based on the model acoustic feature amount, the state occupancy probability value, the articulation method in the phoneme table, and the articulation position in the phoneme table.
[0016]
The word clustering speech database generation method according to claim 5 generates a word clustering speech database that comprises a plurality of words and is used for speech synthesis, and includes an acoustic feature amount calculating step and a word clustering step.
[0017]
According to this method, an acoustic feature value calculating step and a word clustering step are included. In the acoustic feature amount calculating step, the acoustic feature amount is calculated, and in the word clustering step, words in the speech database are clustered based on the acoustic feature amount.
[0018]
The word clustering speech database generation program according to claim 6 causes a device that generates a word clustering speech database, which comprises a plurality of words and is used for speech synthesis, to function as the following means: acoustic feature amount calculating means and word clustering means.
[0019]
According to such a configuration, the device functions as the acoustic feature amount calculating means and the word clustering means. The acoustic feature amount calculating means calculates the acoustic feature amount, and the word clustering means clusters the words in the speech database based on the acoustic feature amount.
[0020]
The speech synthesizer according to claim 7 synthesizes speech from input text data using the word clustering speech database according to claim 1, and is configured to include the word clustering speech database, text data decomposing means, cluster search means, and synthesized speech generating means.
[0021]
According to this configuration, the word clustering speech database, the text data decomposing means, the cluster searching means, and the synthetic speech generating means are provided. In the text data decomposing means, for example, text data is analyzed by morphological analysis, firstly divided into words, and then decomposed into phonemes. The cluster search means sequentially searches the word clustering speech database for speech synthesis candidates that are optimal word candidates when generating a synthesized speech from words classified into clusters. In the synthesized speech generating means, the searched speech synthesis candidates are connected to generate a synthesized speech.
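The search step described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Unit` record, the toy cluster rule in `cluster_key` (split by whether the preceding phoneme is a vowel), and the candidate choice in `synthesize` are hypothetical and are not taken from the patent. The input is assumed to be already decomposed into words with their neighbouring phonemes, so morphological analysis is not shown.

```python
from dataclasses import dataclass

VOWELS = frozenset("aiueo")


@dataclass(frozen=True)
class Unit:
    """One recorded token of a word in the speech database (hypothetical layout)."""
    word: str
    prev_phoneme: str
    next_phoneme: str
    wave_id: str  # stand-in for the stored waveform segment


def cluster_key(word, prev_phoneme):
    """Toy cluster rule: split each word by whether the preceding phoneme is a vowel."""
    return (word, prev_phoneme in VOWELS)


def build_index(units):
    """Group database units by cluster so that a search scans only one cluster."""
    index = {}
    for unit in units:
        index.setdefault(cluster_key(unit.word, unit.prev_phoneme), []).append(unit)
    return index


def synthesize(words_with_context, index):
    """Pick one candidate per word from its cluster and return the chosen waveform ids."""
    chosen = []
    for word, prev_ph, next_ph in words_with_context:
        candidates = index.get(cluster_key(word, prev_ph), [])
        if not candidates:
            raise KeyError(f"no candidate for {word!r}")
        # Prefer a candidate whose following phoneme also matches, else take the first.
        best = next((c for c in candidates if c.next_phoneme == next_ph), candidates[0])
        chosen.append(best.wave_id)
    return chosen
```

The point of the sketch is only that candidates are looked up inside one pre-computed cluster rather than across every token of the word, which is where the claimed reduction in search time would come from.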
[0022]
[Embodiments of the invention]
Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.
In the description of this embodiment, first, an outline of a word clustering speech database will be described. Then, the configuration of the word clustering speech database generation device will be described, and then, the configuration of the speech synthesis device including the word clustering speech database generated by the device will be described (see FIG. 1). Next, the operation of the word clustering speech database generation device (how to generate the word clustering speech database) will be described (see FIG. 2), and subsequently, the operation of the speech synthesis device will be described (see FIG. 3). Further, a method of generating word clustering information in the word clustering speech database generation device will be described (see FIG. 4).
[0023]
(Outline of the word clustering speech database)
The word clustering speech database is obtained by classifying the words of the speech database provided in the speech synthesizer, in which utterance times are delimited for each word and phoneme, into “clusters” based on acoustic feature amounts (the claims' term 音響上の特徴量 is written 音響的な特徴量 in the description of this embodiment; both are rendered here as “acoustic feature amount”). It reduces the time needed to search for words when the speech synthesizer performs speech synthesis, and thus improves the processing speed. A cluster in this word clustering speech database is each individual group obtained when all the words in the speech database are classified into several groups based on a word and its connection to the phonemes before and after it.
[0024]
(Configuration of word clustering speech database generation device)
FIG. 1 is a block diagram of the word clustering speech database generation device and the speech synthesis device. As shown in FIG. 1, the word clustering speech database generation device 1 generates a word clustering speech database in which the words contained in the ordinary speech database provided for the speech synthesis device are classified into clusters, and includes an input/output unit 5, a feature amount calculation unit 7, a word clustering unit 9, a phoneme duration calculation unit 11, and a state occupancy probability value calculation unit 13.
[0025]
Note that the speech synthesizer in FIG. 1 is equipped with the word clustering speech database clustered by the word clustering speech database generation device 1, but initially (before the word clustering speech database generation device 1 generated the word clustering speech database) it was equipped with an ordinary speech database. In an ordinary speech database no word has been clustered, so in effect only a single cluster exists (that is, the preceding and following phoneme environments are not taken into account).
[0026]
The input/output unit 5 receives (acquires) database information from the word clustering speech database (speech database) provided in the speech synthesizer, and outputs the word clustering information produced by the word clustering unit 9 to the speech synthesizer. The input/output unit 5 may be configured to transmit and receive this information (database information and word clustering information) via a communication network (not shown) such as the Internet.
[0027]
The database information is information about the speech data in the speech database (word clustering speech database) provided in the speech synthesizer. The speech data consists of a plurality of sentences in which words and phonemes (each phoneme corresponding to a letter of the romanized transcription) are associated with speech waveforms, with the utterance time delimited for each word and phoneme. In other words, for the speech data recorded in (contained in) the speech database, the database information combines the romanized transcription of each word and phoneme with the utterance time of each word and phoneme (utterance start time and utterance end time, i.e. the utterance duration).
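One way to picture such database information is as labelled time segments. The following Python sketch is only an assumed layout: the names `Segment`, `WordEntry`, and `edge_durations_ms` are hypothetical, and the `sentence_id` field mirrors the sentence number mentioned in paragraph 【００６０】 of the original. It also shows how a phoneme duration follows from the start and end times, as paragraph 【００５２】 describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A romanized word or phoneme label with its utterance start and end times."""
    label: str
    start_s: float  # utterance start time in seconds
    end_s: float    # utterance end time in seconds

    @property
    def duration_ms(self):
        # Duration is derived from the start and end times.
        return (self.end_s - self.start_s) * 1000.0


@dataclass(frozen=True)
class WordEntry:
    """One word of one sentence, together with its phoneme segments."""
    sentence_id: int
    word: Segment
    phonemes: tuple  # tuple of Segment, in utterance order


def edge_durations_ms(entry):
    """Durations of the first and the last phoneme of the word, in milliseconds."""
    return entry.phonemes[0].duration_ms, entry.phonemes[-1].duration_ms
```

For a word such as “no” with phonemes “n” and “o”, `edge_durations_ms` would return the two phoneme durations that the phoneme duration calculation unit 11 is described as passing to the word clustering unit 9.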
[0028]
The word clustering information is information for classifying the words recorded in (contained in) the speech database into a plurality of clusters. A cluster is each individual group obtained when all the words in the speech database are classified into several groups based on a word and its connection to the phonemes before and after it. For example, for the word “no” (の, a particle), the cluster in which the phoneme before “no” is a vowel (let its cluster number be “1”) is written as “cluster 1; the phoneme before ‘no’ is a vowel”.
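As an illustration of how such word clustering information might be applied, the sketch below encodes the two questions of the FIG. 4 example for the word “no” (paragraphs 【００７３】 and 【００７４】 of the original) as a small decision function. The phoneme sets and the function name `cluster_for_no` are assumptions, and so is the mapping of the “plosive” branch to cluster 1 rather than cluster 2, since the text does not state which branch carries which number.

```python
VOWELS = frozenset("aiueo")
PLOSIVES = frozenset({"p", "b", "t", "d", "k", "g"})  # assumed plosive inventory


def cluster_for_no(prev_phoneme, next_phoneme):
    """Return the cluster number of one token of the word 'no' from its context."""
    # First question: is the preceding phoneme a vowel? If not, the token is in cluster 3.
    if prev_phoneme not in VOWELS:
        return 3
    # Second question: is the following phoneme a plosive?
    # Which answer maps to cluster 1 and which to cluster 2 is assumed here.
    return 1 if next_phoneme in PLOSIVES else 2
```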
[0029]
The feature amount calculation unit 7 calculates, based on the database information input via the input/output unit 5, the acoustic feature amounts of each “word” in the speech data (statistics such as the mean and variance in each dimension of the features), and includes word-initial/word-final acoustic feature amount calculating means 7a, phoneme acoustic feature amount calculating means 7b, and model acoustic feature amount calculating means 7c. In other words, while taking into account (reflecting) the phoneme environment before and after each word, the feature amount calculation unit 7 calculates, for each word, statistics of quantities such as the MFCC (Mel Frequency Cepstrum Coefficient: a parameter representing voice quality, obtained by applying an FFT to the speech and then an inverse Fourier transform to convert it into cepstrum coefficients) and the fundamental frequency at the word-initial and word-final phonemes.
[0030]
The word-initial/word-final acoustic feature amount calculating means 7a calculates statistics of the acoustic features in a specific section (40 ms) at the beginning and at the end of a word (the word-initial/word-final acoustic feature amount). For example, in the case of the word “no” mentioned above, it calculates the word-initial/word-final acoustic feature amount by Expressions (1) to (4).
[0031]
(Equation 1)

[0032]
(Equation 2)

[0033]
(Equation 3)

[0034]
(Equation 4)

[0035]
Here, μ_initial^j(a−no+u) denotes the mean of the j-th dimension of the feature at the beginning of the word “no” when its preceding phoneme (the immediately preceding phoneme) is “a” and its following phoneme (the immediately following phoneme) is “u”, and μ_final^j(a−no+u) denotes the mean of the j-th dimension of the feature at the end of the word in the same context. Likewise, Σ_initial^j(a−no+u) denotes the variance of the j-th dimension of the feature at the beginning of the word “no” when its preceding phoneme is “a” and its following phoneme is “u”, and Σ_final^j(a−no+u) denotes the variance of the j-th dimension of the feature at the end of the word in the same context.
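Expressions (1) to (4) themselves are not reproduced in this text, so the following Python sketch only illustrates the kind of statistic that the definitions above describe: a per-dimension mean and variance pooled over all tokens that share the same preceding phoneme, word, and following phoneme. The function name `context_stats`, the input layout, and the use of the population variance (dividing by the number of frames) are assumptions, not the patent's formulas.

```python
from collections import defaultdict


def context_stats(tokens):
    """Per-dimension mean and variance of feature frames, pooled per context.

    tokens: iterable of (prev_phoneme, word, next_phoneme, frames), where frames
    is a list of equal-length feature vectors taken from one word edge.
    Returns {(prev, word, next): (mean_vector, variance_vector)}.
    """
    pooled = defaultdict(list)
    for prev_ph, word, next_ph, frames in tokens:
        pooled[(prev_ph, word, next_ph)].extend(frames)

    stats = {}
    for key, frames in pooled.items():
        count = len(frames)
        dims = len(frames[0])
        mean = [sum(f[j] for f in frames) / count for j in range(dims)]
        # Population variance; whether the patent divides by N or N-1 is not stated.
        var = [sum((f[j] - mean[j]) ** 2 for f in frames) / count for j in range(dims)]
        stats[key] = (mean, var)
    return stats
```

Calling the function once with frames from the word-initial section and once with frames from the word-final section would give the four quantities named above for each context such as a−no+u.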
[0036]
The phoneme acoustic feature calculating means 7b calculates the statistics of the acoustic feature of the first phoneme and the last phoneme of the word (the phoneme acoustic feature). For example, in the case of the above-mentioned word "no", the phoneme acoustic feature is calculated by Equations (1) to (4) described above.
[0037]
The model acoustic feature calculating means 7c calculates the statistics of the acoustic feature in a modeled state (the model acoustic feature) based on at least one of the articulation method in the phoneme table and the articulation position in the phoneme table. For example, in the case of the above-mentioned word "no", the model acoustic feature is calculated by an HMM (Hidden Markov Model) creation method.
[0038]
The word clustering unit 9 performs word clustering (tree-based clustering in this embodiment) based on the statistics calculated by the feature calculation unit 7 (the initial/ending acoustic feature, the phoneme acoustic feature, or the model acoustic feature), the phoneme duration calculated by the phoneme duration calculation unit 11, the state occupancy probability value calculated by the state occupancy probability value calculation unit 13, the articulation method in the phoneme table, and the articulation position in the phoneme table. In other words, the word clustering unit 9 generates word clustering information that determines how words, with their preceding and following phoneme environments taken into account, are divided into "clusters". The word clustering unit 9 corresponds to the word clustering means described in the claims.
[0039]
The word clustering information generated by the word clustering unit 9 depends on which feature the feature calculation unit 7 calculates. When the feature calculation unit 7 calculates the initial/ending acoustic feature, the word clustering information is generated based on the initial/ending acoustic feature and the number of phoneme data items. When it calculates the phoneme acoustic feature, the word clustering information is generated based on the phoneme acoustic feature and the phoneme duration. When it calculates the model acoustic feature, the word clustering information is generated based on the model acoustic feature and the state occupancy probability value.
[0040]
The articulation method in the phoneme table is the way (manner) in which a closure or narrowing is formed in the vocal tract when the vocal organs (such as the larynx) produce a sound. The articulation position in the phoneme table is the location of the articulation point (palate, gums, upper lip, etc.) approached by the articulatory organs (tongue, lower lip); the closure or narrowing is formed at this location when a sound is uttered.
[0041]
Note that the data on the articulation method and the articulation position in this phoneme table are recorded in a recording unit (not shown) of the word clustering speech database generation device 1, and are read from this recording unit and used when the state occupancy probability value calculation unit 13 calculates the state occupancy probability value.
[0042]
The procedure for generating the word clustering information is described below. First, words whose "center" (the part excluding the preceding and following phonemes) is the same (same-center words) are collected. These same-center words are divided into two groups: those that satisfy a specific condition (a question for separating same-center words) and those that do not. The condition that maximizes the difference ΔL(q) (Equation 6) in the log likelihood L(C) (Equation 5) before and after the division is adopted as the first condition of the "discrimination tree" (decision tree).
[0043]
(Equation 5)

[0044]
(Equation 6)

[0045]
In Equation 5, Σ(C) is the variance of the shared state C (the "center" of the word), X_f is the feature vector observed in the f-th frame, n is the number of dimensions of the feature vector, and γ_s(X_f) is the probability that a state s included in the shared state C outputs the feature vector X_f.
[0046]
In Equation 6, C(q, YES) is the shared state classified as YES under condition (question) q, and C(q, NO) is the shared state classified as NO under condition (question) q. As q, conditions (questions) are prepared that ask whether phonemes share the same articulation method or articulation position in the phoneme table, such as "Is the preceding phoneme a fricative?" or "Is the following phoneme a consonant?". Further, Σ(C) in Equation 5 is calculated by Equation 7 shown below.
[0047]
(Equation 7)

[0048]
Here, μ(C) is the mean of the shared state C and is calculated by Equation 8 below.
[0049]
(Equation 8)
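The bodies of Equations 5 to 8 are not reproduced in this text. For orientation only, the form commonly used in tree-based clustering with a single Gaussian per shared state, which is consistent with the symbol definitions given for Σ(C), μ(C), X_f, n and γ_s(X_f), is sketched below. That the original equations take exactly this form is an assumption.

```latex
\begin{align}
L(C) &= -\frac{1}{2}\Bigl(\log\bigl[(2\pi)^{n}\,\lvert\Sigma(C)\rvert\bigr] + n\Bigr)
        \sum_{s\in C}\sum_{f}\gamma_s(X_f) \tag{5}\\
\Delta L(q) &= L\bigl(C(q,\mathrm{YES})\bigr) + L\bigl(C(q,\mathrm{NO})\bigr) - L(C) \tag{6}\\
\Sigma(C) &= \frac{\sum_{s\in C}\sum_{f}\gamma_s(X_f)\,
             \bigl(X_f-\mu(C)\bigr)\bigl(X_f-\mu(C)\bigr)^{\top}}
             {\sum_{s\in C}\sum_{f}\gamma_s(X_f)} \tag{7}\\
\mu(C) &= \frac{\sum_{s\in C}\sum_{f}\gamma_s(X_f)\,X_f}
          {\sum_{s\in C}\sum_{f}\gamma_s(X_f)} \tag{8}
\end{align}
```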

[0050]
Next, each of the two clusters produced by this condition (question) is further divided into two using conditions (questions) other than the one adopted first. As before, the condition (question) that maximizes the difference ΔL(q) (Equation 6) in the log likelihood L(C) (Equation 5) before and after the division is adopted as the next condition (question) of the "discrimination tree".
[0051]
Division based on these conditions (questions) is repeated until the difference ΔL(q) in the log likelihood L(C) (Equation 5) before and after the division falls below a preset threshold, which yields the "discrimination tree". Finally, in the generated "discrimination tree", words that belong to the same leaf (a terminal node of the tree) are treated as belonging to the same cluster, with all of their preceding and following phoneme environments taken into account. This assignment is the "word clustering information".
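The splitting procedure above can be sketched as follows. This is a minimal illustration rather than the device's actual implementation: it assumes a diagonal covariance, an occupancy of 1 for every item, and a variance floor to keep the logarithm finite, none of which is stated in the text. The function names, the question names and the toy data are hypothetical.

```python
import math
from statistics import pvariance

VAR_FLOOR = 1e-3  # assumed floor that keeps log(variance) finite


def log_likelihood(items):
    """L(C) of one cluster, assuming a diagonal Gaussian and occupancy 1 per item."""
    if not items:
        return 0.0
    dims = list(zip(*(x["feature"] for x in items)))
    n = len(dims)
    log_det = sum(math.log(max(pvariance(d), VAR_FLOOR)) for d in dims)
    return -0.5 * (n * math.log(2 * math.pi) + log_det + n) * len(items)


def best_split(items, questions):
    """Return (gain, name, yes, no) for the question with the largest gain."""
    best = None
    base = log_likelihood(items)
    for name, test in questions.items():
        yes = [x for x in items if test(x)]
        no = [x for x in items if not test(x)]
        if not yes or not no:
            continue  # this question does not separate the cluster
        gain = log_likelihood(yes) + log_likelihood(no) - base
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    return best


def build_tree(items, questions, threshold):
    """Split recursively; stop when the best gain falls below the threshold."""
    split = best_split(items, questions)
    if split is None or split[0] < threshold:
        return {"leaf": items}
    _, name, yes, no = split
    remaining = {k: v for k, v in questions.items() if k != name}
    return {
        "question": name,
        "yes": build_tree(yes, remaining, threshold),
        "no": build_tree(no, remaining, threshold),
    }


def leaves(tree):
    """Collect the leaf clusters of the tree, YES branch first."""
    if "leaf" in tree:
        return [tree["leaf"]]
    return leaves(tree["yes"]) + leaves(tree["no"])


questions = {
    "prev_is_vowel": lambda x: x["prev"] in "aiueo",
    "next_is_plosive": lambda x: x["next"] in "ptkbdg",
}
items = [
    {"prev": "a", "next": "k", "feature": [0.0]},
    {"prev": "i", "next": "u", "feature": [0.2]},
    {"prev": "n", "next": "k", "feature": [10.0]},
    {"prev": "s", "next": "u", "feature": [10.2]},
]
tree = build_tree(items, questions, threshold=5.0)
print(tree["question"], [len(c) for c in leaves(tree)])
```

In this toy data the features differ mainly by the preceding phoneme, so that question gives the largest gain and is adopted first; the second question yields a gain below the threshold, so both branches become leaves.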
[0052]
The phoneme duration calculation unit 11 calculates the duration of each phoneme when the phoneme acoustic feature is calculated by the phoneme acoustic feature calculating means 7b of the feature calculation unit 7, and outputs the result to the word clustering unit 9. The duration of a phoneme is calculated from the speech start time and the speech end time of the phoneme. The phoneme duration calculation unit 11 corresponds to the phoneme duration calculating means described in the claims.
[0053]
The state occupancy probability value calculation unit 13 calculates the state occupancy probability value when the model acoustic feature calculating means 7c of the feature calculation unit 7 calculates the model acoustic feature. The state occupancy probability value is the occupancy probability of a word in the modeled state. The state occupancy probability value calculation unit 13 corresponds to the state occupancy probability value calculation means described in the claims.
[0054]
According to the word clustering speech database generation device 1, the initial/ending acoustic feature calculating means 7a of the feature calculation unit 7 calculates the initial/ending acoustic feature, and the word clustering unit 9 generates, based on this feature and the number of phoneme data items, the word clustering information for clustering the words in the speech database. Because the word clustering information is generated with reference to the acoustic features at the beginning and the end of each word, it can be used to cluster the words in the speech database while assuming (considering) the phoneme environments before and after the same word.
[0055]
Further, according to the word clustering speech database generation device 1, the phoneme acoustic feature calculating means 7b of the feature calculation unit 7 calculates the phoneme acoustic feature, and the phoneme duration calculation unit 11 calculates the phoneme duration. The word clustering unit 9 then generates the word clustering information for clustering the words in the speech database based on the phoneme acoustic feature and the phoneme duration. Because the word clustering information is generated with reference to the phoneme acoustic feature, which is the acoustic feature of the first phoneme and the last phoneme of each word, it can be used to cluster the words in the speech database while assuming (considering) the phoneme environments before and after the same word.
[0056]
Further, according to the word clustering speech database generation device 1, the model acoustic feature calculating means 7c of the feature calculation unit 7 calculates the model acoustic feature, and the state occupancy probability value calculation unit 13 calculates the state occupancy probability value. The word clustering unit 9 then generates the word clustering information for clustering the words in the speech database based on the model acoustic feature and the state occupancy probability value. Because the word clustering information is generated with reference to the model acoustic feature, which is the acoustic feature in the state in which the word is modeled, it can be used to cluster the words in the speech database while assuming (considering) the phoneme environments before and after the same word.
[0057]
(Configuration of speech synthesizer)
The speech synthesizer 3 shown in FIG. 1 synthesizes speech from input text data using the speech database (the word clustering speech database 19) clustered according to the word clustering information generated by the word clustering speech database generation device 1, and includes a text data decomposition unit 15, a cluster search unit 17, the word clustering speech database 19, and a synthesized speech generation unit 21.
[0058]
The text data decomposition unit 15 performs morphological analysis on the input text data and decomposes it into words and phonemes. Morphological analysis analyzes the text data and identifies the stems, affixes, inflections, and so on of the words it contains as morphemes, which are the smallest linguistic elements that carry meaning. The text data decomposition unit 15 corresponds to the text data decomposing means described in the claims.
[0059]
The cluster search unit 17 searches, for the words and phonemes decomposed by the text data decomposition unit 15, each cluster recorded in the word clustering speech database 19, and obtains from the words included in the matching cluster the speech synthesis candidates, that is, the candidates to be used for speech synthesis. The speech synthesis candidates are the speech data of only the "words" that belong to that cluster. The cluster search unit 17 corresponds to the cluster search means described in the claims.
[0060]
The word clustering speech database 19 records a plurality of "sentences", each composed of words and phonemes, as the constituent units of the database, with the words clustered (classified into clusters) according to the word clustering information. Each sentence is assigned a "sentence number", and the utterance time of each word, measured from the beginning of its sentence, is recorded.
[0061]
The synthesized speech generation unit 21 concatenates the speech synthesis candidates acquired from the word clustering speech database 19 by the cluster search unit 17, generates synthesized speech, and outputs it to the outside of the speech synthesizer 3. The synthesized speech generation unit 21 corresponds to the synthesized speech generation means described in the claims.
[0062]
According to the speech synthesizer 3, the text data decomposition unit 15 analyzes the text data by morphological analysis and divides it into words and phonemes. The cluster search unit 17 then sequentially searches the word clustering speech database 19, among the words classified into clusters, for the speech synthesis candidates, that is, the optimal word candidates for generating synthesized speech. The synthesized speech generation unit 21 concatenates the retrieved speech synthesis candidates to generate synthesized speech. Because the cluster search unit 17 searches only among words that have been classified into clusters in advance, the search time is reduced and synthesized speech can be generated at high speed.
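The reduction in search time comes from restricting the search to one cluster. The following sketch shows one way such a lookup could be organized, assuming that each recorded word already carries a cluster number from the word clustering information. The record layout `WordUnit`, the functions `build_index` and `search_candidates`, and the sample values are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordUnit:
    """One recorded word in the database (hypothetical record layout)."""
    word: str
    sentence_number: int  # number of the sentence that contains the word
    start_time: float     # utterance time from the beginning of the sentence (s)
    end_time: float
    cluster: int          # cluster assigned by the word clustering information


def build_index(units):
    """Index the units by (word, cluster) so a search touches one cluster only."""
    index = {}
    for u in units:
        index.setdefault((u.word, u.cluster), []).append(u)
    return index


def search_candidates(index, word, cluster):
    """Return the speech synthesis candidates for a word in the given cluster."""
    return index.get((word, cluster), [])


units = [
    WordUnit("no", 1, 0.50, 0.62, cluster=1),
    WordUnit("no", 2, 1.10, 1.21, cluster=3),
    WordUnit("no", 3, 0.80, 0.93, cluster=3),
    WordUnit("wa", 3, 0.30, 0.41, cluster=1),
]
index = build_index(units)
candidates = search_candidates(index, "no", 3)
print([u.sentence_number for u in candidates])
```

With this layout, only the occurrences of "no" in cluster 3 are examined as candidates, instead of every occurrence of "no" in the database.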
[0063]
(Operation of the word clustering speech database generator)
Next, the operation of the word clustering speech database generation device 1 will be described with reference to the flowchart shown in FIG. 2 (see FIG. 1 as appropriate).
[0064]
First, database information is input from the speech synthesizer 3 to the input/output unit 5 of the word clustering speech database generation device 1 (S1). Subsequently, the main control unit (not shown) of the word clustering speech database generation device 1 determines whether the acoustic feature to be calculated by the feature calculation unit 7 is the initial/ending acoustic feature (S2). If it is determined to be the initial/ending acoustic feature (S2, Yes), the initial/ending acoustic feature calculating means 7a calculates the initial/ending acoustic feature (S3).
[0065]
On the other hand, if the acoustic feature to be calculated by the feature calculation unit 7 is not determined to be the initial/ending acoustic feature (S2, No), it is determined whether the acoustic feature to be calculated is the phoneme acoustic feature (S4). If it is determined to be the phoneme acoustic feature (S4, Yes), the phoneme acoustic feature calculating means 7b calculates the phoneme acoustic feature (S5). Then, the phoneme duration calculation unit 11 calculates the phoneme duration (S6).
[0066]
Further, if it is not determined in S4 that the acoustic feature to be calculated by the feature calculation unit 7 is the phoneme acoustic feature (S4, No), the model acoustic feature calculating means 7c calculates the model acoustic feature (S7). Then, the state occupancy probability value calculation unit 13 calculates the state occupancy probability value (S8).
[0067]
Then, the word clustering unit 9 generates the word clustering information based on one of the following: the initial/ending acoustic feature; the phoneme acoustic feature and the phoneme duration; or the model acoustic feature and the state occupancy probability value (S9). Then, the word clustering information is output from the input/output unit 5 to the speech synthesizer 3 (S10).
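The branching in steps S2 to S8 can be summarized in a short sketch. The enumeration `FeatureKind`, the function `select_inputs`, and the dictionary of stand-in callables are hypothetical names introduced only for illustration; the callables stand in for the means 7a, 7b, 7c and the units 11 and 13, whose internals are not modeled here.

```python
from enum import Enum


class FeatureKind(Enum):
    INITIAL_ENDING = "initial/ending acoustic feature"
    PHONEME = "phoneme acoustic feature"
    MODEL = "model acoustic feature"


def select_inputs(kind, calc):
    """Follow S2 to S8 and return the quantities passed to clustering (S9).

    `calc` maps reference numerals to callables standing in for the
    means 7a, 7b, 7c and the units 11 and 13.
    """
    if kind is FeatureKind.INITIAL_ENDING:              # S2: Yes
        return {"feature": calc["7a"]()}                # S3
    if kind is FeatureKind.PHONEME:                     # S4: Yes
        return {
            "feature": calc["7b"](),                    # S5
            "phoneme_duration": calc["11"](),           # S6
        }
    return {                                            # S4: No
        "feature": calc["7c"](),                        # S7
        "state_occupancy": calc["13"](),                # S8
    }


# Placeholder callables; real ones would compute statistics from the database.
calc = {
    "7a": lambda: "initial/ending statistics",
    "7b": lambda: "phoneme statistics",
    "7c": lambda: "model statistics",
    "11": lambda: "durations",
    "13": lambda: "occupancies",
}
inputs = select_inputs(FeatureKind.PHONEME, calc)
print(sorted(inputs))
```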
[0068]
(Operation of speech synthesizer)
Next, the operation of the speech synthesizer 3 will be described with reference to the flowchart shown in FIG. 3 (see FIG. 1 as appropriate).
[0069]
First, text data is input to the text data decomposing unit 15 of the speech synthesizing device 3, and the text data is subjected to morphological analysis and decomposed into words and phonemes (S11).
[0070]
Then, based on the words and phonemes decomposed by the text data decomposing unit 15, the cluster search unit 17 searches the word clustering speech database 19 and acquires speech synthesis candidates (S12).
[0071]
Thereafter, the synthesized speech generation unit 21 concatenates the speech synthesis candidates to generate and output synthesized speech (S13).
[0072]
(How to generate word clustering information)
Finally, a method (an example) of generating word clustering information in the word clustering speech database generation device 1 will be described with reference to FIG. 4.
[0073]
The example of generating word clustering information shown in FIG. 4 concerns the word "no". First, all 100 data items of the word "no" are divided into two by the condition "the preceding phoneme is a vowel". There are 10 items in which the phoneme preceding "no" is a vowel and 90 items in which it is not (these 90 form "cluster 3" in FIG. 4).
[0074]
Then, the remaining 10 items of the word "no" are further divided into two by the condition "the following phoneme is a plosive", and these become "cluster 1" and "cluster 2" in FIG. 4.
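The two questions in this example can be applied to a phoneme context as sketched below. The text does not say which branch of the second question corresponds to "cluster 1" and which to "cluster 2", nor how the 10 vowel-preceded items are divided, so the branch labels and the 4/6 split are assumptions made for illustration. The phoneme sets and the function `assign_cluster` are likewise hypothetical.

```python
from collections import Counter

VOWELS = {"a", "i", "u", "e", "o"}            # assumed vowel inventory
PLOSIVES = {"p", "t", "k", "b", "d", "g"}     # assumed plosive inventory


def assign_cluster(prev_phoneme, next_phoneme):
    """Walk the two questions of the example tree and return a cluster number."""
    if prev_phoneme not in VOWELS:
        return 3  # "the preceding phoneme is a vowel": No
    if next_phoneme in PLOSIVES:
        return 1  # assumed label for the Yes branch of the second question
    return 2      # assumed label for the No branch of the second question


# 100 occurrences of "no": 10 preceded by a vowel and 90 not, as in the example.
# The 4/6 division of the 10 vowel-preceded items is illustrative only.
contexts = [("a", "k")] * 4 + [("o", "u")] * 6 + [("n", "k")] * 90
counts = Counter(assign_cluster(p, n) for p, n in contexts)
print(sorted(counts.items()))
```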
[0075]
The present invention has been described above based on one embodiment, but the present invention is not limited to this embodiment.
For example, the processing of each component of the word clustering speech database generation device 1 can be regarded as a single process and treated as a word clustering speech database generation method, or the processing of each component can be described in a general-purpose computer language and treated as a word clustering speech database generation program. These provide the same effects as the word clustering speech database generation device 1.
[0076]
[Effect of the invention]
According to the first aspect of the present invention, the words are clustered based on the acoustic features of the words. Therefore, if this word clustering speech database is used in a speech synthesizer, the search time for words can be shortened and synthesized speech can be generated at high speed.
[0077]
According to the second, fifth, and sixth aspects of the present invention, word clustering information for clustering words in a speech database is generated based on the acoustic features of words. For this reason, the words in the speech database can be clustered by assuming (considering) the phoneme environments before and after the same word by using the word clustering information.
[0078]
According to the third aspect of the present invention, the phoneme acoustic feature is calculated, the phoneme duration is calculated, and word clustering information for clustering the words in the speech database is generated based on these. Because the word clustering information is generated with reference to the phoneme acoustic feature, which is the acoustic feature of the first phoneme and the last phoneme of each word, it can be used to cluster the words in the speech database while assuming (considering) the phoneme environments before and after the same word.
[0079]
According to the fourth aspect of the present invention, the model acoustic feature is calculated, the state occupancy probability value is calculated, and word clustering information for clustering the words in the speech database is generated based on these. Because the word clustering information is generated with reference to the model acoustic feature, which is the acoustic feature in the state in which the word is modeled, it can be used to cluster the words in the speech database while assuming (considering) the phoneme environments before and after the same word.
[0080]
According to the seventh aspect of the present invention, the text data is analyzed and divided first into words and then into the phonemes that connect before and after each of the divided words. To generate synthesized speech, the speech synthesis candidates, which are the optimal word candidates, are then searched for sequentially among the words classified into clusters, and the retrieved speech synthesis candidates are concatenated to generate synthesized speech. Because the speech synthesis candidates are searched for among words that have been classified into clusters in advance, the search time can be reduced and synthesized speech can be generated at high speed.
[Brief description of the drawings]
FIG. 1 is a block diagram of a word clustering speech database generation device and a speech synthesis device according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating an operation of the word clustering speech database generation device shown in FIG. 1;
FIG. 3 is a flowchart illustrating an operation of the speech synthesizer illustrated in FIG. 1;
FIG. 4 is a diagram illustrating a method of generating word clustering information.
[Explanation of symbols]
1 Word clustering speech database generation device
3 Speech synthesizer
5 Input/output unit
7 Feature calculation unit
7a Initial/ending acoustic feature calculating means
7b Phoneme acoustic feature calculating means
7c Model acoustic feature calculating means
9 Word clustering unit
11 Phoneme duration calculation unit
13 State occupancy probability value calculation unit
15 Text data decomposition unit
17 Cluster search unit
19 Word clustering speech database
21 Synthesized speech generation unit

Claims

A word clustering speech database that contains a plurality of words and is used for speech synthesis, wherein the words are clustered based on acoustic features of the words.

A word clustering speech database generation device that generates a word clustering speech database that contains a plurality of words and is used for speech synthesis, the device comprising:
initial/ending acoustic feature calculating means for calculating an initial/ending acoustic feature, which is an acoustic feature of the beginning and the end of each word; and
word clustering means for clustering the words based on the initial/ending acoustic feature calculated by the initial/ending acoustic feature calculating means.

A word clustering speech database generation device that generates a word clustering speech database that contains a plurality of words and is used for speech synthesis, the device comprising:
phoneme acoustic feature calculating means for calculating a phoneme acoustic feature, which is an acoustic feature of the first phoneme and the last phoneme of each word;
phoneme duration calculating means for calculating the phoneme duration of the first phoneme and the last phoneme of each word; and
word clustering means for clustering the words based on the phoneme acoustic feature and the phoneme duration.

A word clustering speech database generation device that generates a word clustering speech database that contains a plurality of words and is used for speech synthesis, the device comprising:
model acoustic feature calculating means for calculating a model acoustic feature, which is an acoustic feature in a state in which each word is modeled;
state occupancy probability value calculation means for calculating a state occupancy probability value, which is an occupancy probability in each of the modeled states; and
word clustering means for clustering the words based on the model acoustic feature, the state occupancy probability value, the articulation method in the phoneme table, and the articulation position in the phoneme table.

A word clustering speech database generation method for generating a word clustering speech database that contains a plurality of words and is used for speech synthesis, the method comprising:
an acoustic feature calculating step of calculating an acoustic feature of each word; and
a word clustering step of clustering the words based on the acoustic feature calculated in the acoustic feature calculating step.

A word clustering speech database generation program that causes a device that generates a word clustering speech database, which contains a plurality of words and is used for speech synthesis, to function as:
acoustic feature calculating means for calculating an acoustic feature of each word; and
word clustering means for clustering the words based on the acoustic feature calculated by the acoustic feature calculating means.

A speech synthesizer that performs speech synthesis on input text data using the word clustering speech database according to claim 1, the speech synthesizer comprising:
the word clustering speech database;
text data decomposing means for analyzing the text data and decomposing it into words and phonemes;
cluster search means for sequentially searching within the same cluster of the word clustering speech database, based on the words and phonemes decomposed by the text data decomposing means, to obtain speech synthesis candidates; and
synthesized speech generation means for generating synthesized speech based on the speech synthesis candidates acquired by the cluster search means.