JP3576066B2

JP3576066B2 - Speech synthesis system and speech synthesis method

Info

Publication number: JP3576066B2
Application number: JP2000087173A
Authority: JP
Inventors: 弓子加藤; 謙二松井; 孝浩釜井; 勝義山上
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1999-03-25
Filing date: 2000-03-27
Publication date: 2004-10-13
Anticipated expiration: 2020-03-27
Also published as: JP2001092482A

Description

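[0001]
[Technical Field of the Invention]
The present invention relates to a speech synthesis system that converts arbitrary input text, an input phonetic symbol string, or the like into synthesized speech and outputs it.
[0002]
[Prior Art]
In recent years, synthesized speech has come into wide use in electronic devices such as home appliances, car navigation systems, and mobile phones to speak messages such as device status, operating instructions, and responses. On personal computers and the like, it is also increasingly used for operation through voice interfaces and for checking the results of optical character recognition (OCR).
[0003]
One way to perform such speech synthesis is to store speech data in advance and play it back. This method is widely used when only a limited set of messages needs to be spoken, but producing arbitrary speech with it requires a large-capacity storage device and tends to be expensive, so its applications are limited.
[0004]
A relatively inexpensive way to produce arbitrary speech is to generate speech data from the input text or phonetic symbol string using predetermined speech data generation rules. With such rule-based methods, however, it is difficult to produce natural speech across a wide variety of expressions.
[0005]
Accordingly, as disclosed for example in Japanese Unexamined Patent Publication No. H8-87297 (特開平８−８７２９７号公報), a speech synthesis system is known that combines generation of synthesized speech by searching a database of speech information with generation of synthesized speech by synthesized speech generation rules. More specifically, as shown for example in Fig. 13, this kind of device comprises a character string input unit 910; a speech information database 920 that stores speech features extracted by analyzing real speech together with the corresponding utterance content; a speech information search unit 930 that searches the speech information database 920; a synthesized speech generation unit 940 that generates speech waveforms; synthesized speech generation rules 950, which include rules for generating speech features from input text or an input phonetic symbol string; and an electroacoustic transducer 960. In this system, when text or a phonetic symbol string is input to the character string input unit 910, the speech information search unit 930 searches the speech information database 920 for speech information whose utterance content matches the input text or input phonetic symbol string. If matching utterance content exists, the corresponding speech information is passed to the synthesized speech generation unit 940. If no matching utterance content exists, the speech information search unit 930 passes the input text or input phonetic symbol string unchanged to the synthesized speech generation unit 940. When the synthesized speech generation unit 940 receives retrieved speech information, it generates synthesized speech from it; when it receives input text or an input phonetic symbol string, it first generates speech features from that input and the synthesized speech generation rules 950 and then generates synthesized speech.
[0006]
By using both the speech information search and the synthesized speech generation rules in this way, arbitrary input text or the like can be converted into synthesized speech and output, and natural speech can be produced for part of the speech (where the search hits).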
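[0007]
[Problems to Be Solved by the Invention]
However, in the conventional speech synthesis system described above, the sound quality differs greatly depending on whether the search hits, that is, on whether utterance content corresponding to the input text or the like exists in the speech information database, and joining speech of such differing quality makes the unnaturalness stand out all the more. In addition, because the speech information database 920 is searched solely on whether the input phonetic symbol string matches the stored utterance content, whenever matching utterance content exists the retrieved speech information is used for synthesis regardless of sentence structure and the like, which again results in unnatural synthesized speech.
[0008]
Specifically, for example, when synthesizing the sentence 「大阪に住んでいる私は松下です」 ("I, who live in Osaka, am Matsushita"), if the proper noun 「松下」 (Matsushita) is not in the database, only that part tends to come out as mechanical synthesized speech. Alternatively, the speech information for 「大阪に住んでいる」 ("[I] live in Osaka"), stored as sentence-final utterance content, is used, and the result tends to sound like two sentences, 「大阪に住んでいる」 and 「私は松下です」 ("I am Matsushita"), joined together unnaturally.
[0009]
In view of the above, an object of the present invention is to provide a speech synthesis system that can produce natural synthesized speech for arbitrary input text or the like and that, in particular, can produce synthesized speech of similar quality whether or not utterance content corresponding to the input text or the like exists in the speech information (prosodic information) database.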
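[0010]
[Means for Solving the Problems]
To achieve the above object, the present invention is
a speech synthesis system that outputs synthesized speech based on synthesized speech information indicating the speech to be synthesized, characterized by comprising:
a database in which prosodic information used for speech synthesis, in the state in which it was extracted from real speech, is stored in association with key information serving as a search key;
search means for retrieving the prosodic information according to the degree of match between the synthesized speech information and the key information;
transformation means for transforming the prosodic information retrieved by the search means, based on a transformation rule that depends on the degree of match between the synthesized speech information and the key information; and
synthesis means for outputting synthesized speech based on the synthesized speech information and the prosodic information transformed by the transformation means.
[0011]
The transformation rule may set at least one of the pattern and the degree of the transformation. The synthesized speech information and the key information may each include a phonetic symbol string indicating phonetic attributes of the speech to be synthesized and, in addition, linguistic information indicating its linguistic attributes. The phonetic symbol string may include information that substantially indicates at least one of the phonological-unit sequence of the speech to be synthesized, the accent position, and the presence or length of pauses. The linguistic information may include at least one of grammatical information and semantic information about the speech to be synthesized.
[0012]
The system is further characterized by comprising language processing means that analyzes text information input to the speech synthesis system and generates the phonetic symbol string and the linguistic information.
[0013]
As a result, even when the database holds no prosodic information whose key information exactly matches the synthesized speech information, speech is synthesized from similar prosodic information, so relatively appropriate, even, and natural speech can be produced for arbitrary speech. Conversely, the storage capacity of the database can be reduced without impairing the naturalness of the synthesized speech. Furthermore, when similar prosodic information is used in this way, it is transformed according to the degree of similarity, so more appropriate synthesized speech is produced.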
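[0014]
The present invention is also
the above speech synthesis system,
characterized in that the synthesized speech information and the key information each substantially include a phonological category sequence indicating the phonological category to which each phonological unit of the speech to be synthesized belongs.
[0015]
It is further characterized by comprising conversion means that converts into a phonological category sequence at least one of the information, input to the speech synthesis system, that corresponds to the synthesized speech information and the information, stored in the database, that corresponds to the key information.
[0016]
The phonological categories may be, for example:
phonological units grouped using at least one of manner of articulation, place of articulation, and duration;
phonological units grouped by a statistical method such as multivariate analysis so as to best reflect groups of prosodic patterns that were themselves formed by a statistical method;
phonological units grouped according to inter-unit distances determined from a confusion table of phonological units by a statistical method such as multivariate analysis; or
phonological units grouped according to the similarity of their physical characteristics, such as fundamental frequency, intensity, duration, or spectrum.
[0017]
As a result, when searching for prosodic information, even if the phoneme strings do not match, reusing the prosodic information will in many cases still yield appropriate, natural synthesized speech, provided the phonological categories of the individual phonemes match.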
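[0018]
The present invention is also
the above speech synthesis system,
characterized in that the prosodic information stored in the database includes information indicating prosodic features extracted from the same real speech.
[0019]
The present invention is also
the above speech synthesis system,
characterized in that the information indicating prosodic features includes at least one of:
a fundamental frequency pattern indicating the temporal change of the fundamental frequency,
a speech intensity pattern indicating the temporal change of speech intensity,
a phonological-unit duration pattern indicating the duration of each phonological unit, and
pause information indicating the presence or length of pauses.
[0020]
The present invention is also
the above speech synthesis system,
characterized in that the database stores the prosodic information for each prosody control unit.
[0021]
The present invention is also
the above speech synthesis system,
characterized in that the prosody control unit is any of:
an accent phrase,
a phrase consisting of one or more accent phrases,
a bunsetsu (文節),
a phrase consisting of one or more bunsetsu,
a word,
a phrase consisting of one or more words,
a stress phrase, and
a phrase consisting of one or more stress phrases.
[0022]
As a result, appropriate and natural synthesized speech can be produced easily.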
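[0023]
The present invention is also
the above speech synthesis system,
characterized in that the synthesized speech information and the key information each include multiple types of speech index information, which are elements that determine the speech to be synthesized, and
the degree of match between the synthesized speech information and the key information is obtained by weighting and combining the degrees of match between each item of speech index information in the synthesized speech information and the corresponding item of speech index information in the key information.
[0024]
The present invention is also
the above speech synthesis system,
characterized in that the speech index information includes information that substantially indicates at least one of the phonological-unit sequence of the speech to be synthesized, the accent position, the presence or length of pauses, and linguistic information indicating linguistic attributes.
[0025]
The present invention is also
the above speech synthesis system,
characterized in that the speech index information includes information that substantially indicates the phonological-unit sequence of the speech to be synthesized, and
the degree of match between each item of speech index information in the synthesized speech information and the corresponding item in the key information includes the degree of similarity of the acoustic features of each phonological unit.
[0026]
The present invention is also
the above speech synthesis system,
characterized in that the speech index information substantially includes a phonological category sequence indicating the phonological category to which each phonological unit of the speech to be synthesized belongs.
[0027]
The present invention is also
the above speech synthesis system,
characterized in that the degree of match between each item of speech index information in the synthesized speech information and the corresponding item in the key information includes the degree of similarity of the phonological category of each phonological unit.
[0028]
As a result, appropriate prosodic information can be retrieved and transformed easily.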
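[0029]
The present invention is also
the above speech synthesis system,
characterized in that the prosodic information includes multiple types of prosodic feature information that characterize the speech to be synthesized.
[0030]
The present invention is also
the above speech synthesis system,
characterized in that the multiple types of prosodic feature information are stored in the database as a set.
[0031]
The present invention is also
the above speech synthesis system,
characterized in that the multiple types of prosodic feature information forming the set are each extracted from the same real speech.
[0032]
The present invention is also
the above speech synthesis system,
characterized in that the prosodic feature information includes at least one of:
a fundamental frequency pattern indicating the temporal change of the fundamental frequency,
a speech intensity pattern indicating the temporal change of speech intensity,
a phonological-unit duration pattern indicating the duration of each phonological unit, and
pause information indicating the presence or length of pauses.
[0033]
The present invention is also
the above speech synthesis system,
characterized in that the phonological-unit duration pattern includes at least one of a phoneme duration pattern, a mora duration pattern, and a syllable duration pattern.
[0034]
The present invention is also
the above speech synthesis system,
characterized in that each type of prosodic feature information is retrieved and transformed according to a degree of match between the synthesized speech information and the key information that is computed with a different weighting.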
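[0035]
The present invention is also
the above speech synthesis system,
characterized in that the retrieval of the prosodic information by the search means and the transformation of the prosodic information by the transformation means are performed according to degrees of match between the synthesized speech information and the key information that are computed with different weightings.
[0036]
The present invention is also
the above speech synthesis system,
characterized in that the retrieval of the prosodic information by the search means and the transformation of the prosodic information by the transformation means are performed according to degrees of match between the synthesized speech information and the key information that are computed with the same weighting.
[0037]
The present invention is also
the above speech synthesis system,
characterized in that the transformation means transforms the prosodic information retrieved by the search means based on a degree of match computed at least
per phoneme,
per mora,
per syllable,
per unit of speech waveform generation in the synthesis means, or
per phonological unit.
[0038]
The present invention is also
the above speech synthesis system,
characterized in that the degree of match per phoneme, per mora, per syllable, per unit of speech waveform generation in the synthesis means, or per phonological unit is set based on at least one of:
a distance based on acoustic characteristics,
a distance obtained from any of manner of articulation, place of articulation, and duration, and
a distance based on a confusion table obtained by listening experiments.
[0039]
As a result, appropriate transformation can be performed easily.
[0040]
The present invention is also
the above speech synthesis system,
characterized in that the acoustic characteristics are at least one of fundamental frequency, intensity, duration, and spectrum.
[0041]
The present invention is also
the above speech synthesis system,
characterized in that the database stores the key information and prosodic information for multiple languages.
[0042]
As a result, synthesized speech containing multiple languages can be produced easily.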
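[0043]
The present invention is also
a speech synthesis method that outputs synthesized speech based on synthesized speech information indicating the speech to be synthesized, characterized by:
retrieving, from a database in which prosodic information used for speech synthesis, in the state in which it was extracted from real speech, is stored in association with key information serving as a search key,
the prosodic information according to the degree of match between the synthesized speech information and the key information;
transforming the retrieved prosodic information based on a transformation rule that depends on the degree of match between the synthesized speech information and the key information; and
outputting synthesized speech based on the synthesized speech information and the transformed prosodic information.
[0044]
The present invention is also
the above speech synthesis method,
characterized in that the synthesized speech information and the key information each include multiple types of speech index information, which are elements that determine the speech to be synthesized, and
the degree of match between the synthesized speech information and the key information is obtained by weighting and combining the degrees of match between each item of speech index information in the synthesized speech information and the corresponding item of speech index information in the key information.
[0045]
The present invention is also
the above speech synthesis method,
characterized in that the prosodic information includes multiple types of prosodic feature information that characterize the speech to be synthesized.
[0046]
The present invention is also
the above speech synthesis method,
characterized in that each type of prosodic feature information is retrieved and transformed according to a degree of match between the synthesized speech information and the key information that is computed with a different weighting.
[0047]
The present invention is also
the above speech synthesis method,
characterized in that the retrieval of the prosodic information and the transformation of the prosodic information are performed according to degrees of match between the synthesized speech information and the key information that are computed with different weightings.
[0048]
The present invention is also
the above speech synthesis method,
characterized in that the retrieval of the prosodic information and the transformation of the prosodic information are performed according to degrees of match between the synthesized speech information and the key information that are computed with the same weighting.
[0049]
As a result, here too, even when the database holds no prosodic information whose key information exactly matches the synthesized speech information, speech is synthesized from similar prosodic information, so relatively appropriate, even, and natural speech can be produced for arbitrary speech. Conversely, the storage capacity of the database can be reduced without impairing the naturalness of the synthesized speech. Furthermore, when similar prosodic information is used in this way, it is transformed according to the degree of similarity, so more appropriate synthesized speech is produced.
[0050]
The present invention is also a speech synthesis system that converts input text into synthesized speech and outputs it, characterized by comprising:
language processing means that analyzes the input text and outputs a phonetic symbol string and linguistic information;
a prosodic information database in which prosodic information, in the state in which it was extracted from real speech, is stored in association with the phonetic symbol string and linguistic information corresponding to the speech to be synthesized;
search means that retrieves the prosodic information stored in the prosodic information database that corresponds to at least part of the search items, which consist of the phonetic symbol string and the linguistic information output from the language processing means;
prosody transformation means that transforms the prosodic information retrieved and selected from the prosodic information database according to the degree of match between the search items and the stored contents of the prosodic information database, based on a transformation rule that depends on that degree of match; and
waveform generation means that generates a speech waveform based on the prosodic information output from the prosody transformation means and the phonetic symbol string output from the language processing means.
[0051]
As a result, here too, relatively appropriate, even, and natural speech can be produced for arbitrary input text.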
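[0052]
[Embodiments of the Invention]
The invention is described concretely below on the basis of embodiments.
[0053]
(Embodiment 1)
Fig. 1 is a functional block diagram showing the configuration of the speech synthesis system of Embodiment 1. In Fig. 1,
the character string input unit 110 inputs text, such as a mixed kanji-kana string or a kana string, as the information to be synthesized. Specifically, an input device such as a keyboard is used as the character string input unit 110.
[0054]
The language processing unit 120 performs preprocessing for the database search described later: it analyzes the input text and, as shown for example in Fig. 2, outputs a phonetic symbol string and linguistic information for each accent phrase. Here, the accent phrase is, for convenience, the processing unit for speech synthesis. It corresponds roughly to a grammatical bunsetsu, but the input text is segmented in a way suited to speech synthesis; for example, in a number of two or more digits, each digit is treated as one accent phrase. The phonetic symbol string indicates, for example as a string of alphanumeric symbols, the phonemes that are the units of utterance, the accent position, and so on. The linguistic information indicates, for example, the grammatical information (part of speech, etc.) and semantic information (semantic attributes, etc.) of the accent phrase.
[0055]
The prosodic information database 130 stores, as shown for example in Fig. 3, prosodic information extracted from real speech for each accent phrase, in association with stored keys (the keys against which a search is matched). In the example shown in the figure, the stored keys are:
(a) phoneme string
(b) accent position
(c) number of morae (beats)
(d) pause lengths before and after the accent phrase
(e) grammatical information and semantic information
The prosodic information is:
(a) fundamental frequency pattern
(b) speech intensity pattern
(c) phonological-unit duration pattern
To produce natural synthesized speech, these items of prosodic information are preferably extracted from the same real speech. The number of morae need not be stored in the prosodic information database 130 in advance; it may instead be counted from the phoneme string at each search. In the example in the figure, the pause lengths before and after the accent phrase also serve as information indicating whether the accent phrase is at the beginning or end of a sentence. This allows the same accent phrase to be distinguished in the search even when its utterance intensity and so on differ with its position in the sentence, so that appropriate speech can be synthesized. This is not limiting, however: the item may contain pause lengths only, or information indicating sentence-initial or sentence-final position may be a separate stored key.
[0056]
The prosodic information search unit 140 searches the prosodic information database 130 based on the output of the language processing unit 120 and outputs prosodic information. The search is a so-called fuzzy search. That is, even if a search key such as the phoneme string derived from the output of the language processing unit 120 does not exactly match a stored key in the prosodic information database 130, entries that match to some extent are taken as search candidates, and from among them the one with the highest degree of match (the smallest approximation cost, which corresponds to the difference between the search key and the stored key) is selected, for example by a minimum-cost method. Thus, even when the search key and the stored key do not match exactly, using the prosodic information of a similar accent phrase makes it possible to produce more natural speech than generating the prosodic information by generation rules.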
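[0057]
The prosodic information transformation unit 150 transforms the prosodic information retrieved by the prosodic information search unit 140, based on the approximation cost from the search in the prosodic information search unit 140 and on the transformation rules held in the prosodic information transformation rule storage unit 160 described below. That is, when the search key and the stored key match in the search by the prosodic information search unit 140, the retrieved prosodic information gives the most appropriate synthesis. When the two keys do not match exactly, the prosodic information of a similar accent phrase is used as described above, so the lower the degree of match (the larger the approximation cost), the more the synthesized speech may deviate from the appropriate speech. A predetermined transformation is therefore applied to the retrieved prosodic information according to the approximation cost, so that more appropriate synthesized speech is produced.
[0058]
The prosodic information transformation rule storage unit 160 holds the transformation rules for transforming prosodic information according to the approximation cost.
[0059]
The waveform generation unit 170 synthesizes a speech waveform based on the phonetic symbol string output from the language processing unit 120 and the prosodic information output from the prosodic information transformation unit 150, and outputs an analog speech signal.
[0060]
The electroacoustic transducer 180, for example a loudspeaker or headphones, converts the analog speech signal into sound.
[0061]
Next, the speech synthesis operation of the system configured as above is described.
[0062]
(1) When text to be converted into speech is input to the character string input unit 110, the language processing unit 120 analyzes the input text, separates it into accent phrases, and outputs a phonetic symbol string and linguistic information as shown in Fig. 2. Specifically, when a mixed kanji-kana string is input, for example, the unit uses a conversion dictionary such as a kanji dictionary (not shown) to separate the text into accent phrases and convert it to readings, and generates a phonetic symbol string representing accent positions, the presence and length of pauses, and so on. In the example phonetic symbol string of Fig. 2, alphanumeric symbols indicate the following information.
[0063]
(a) Letters: phonemes ("N" denotes the moraic nasal.)
(b) "'": accent position
(c) "/": accent phrase boundary
(d) "cl": silent interval
(e) Digits: pause length
Although not shown in the figure, information indicating phrase and sentence boundaries may also be included. The notation of the phonetic symbol string is not limited to the above, and the phoneme string, a numerical accent position, and so on may each be output as separate information. The linguistic information (grammatical information, semantic information) may include, besides part of speech and meaning, the conjugated form, the presence of dependency relations, the general importance within a sentence, and so on. Moreover, its notation is not limited to strings such as 「名詞」 (noun) and 「連体形」 (attributive form) as shown in the figure; coded numbers may be used instead.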
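The symbol inventory in (a) to (e) can be illustrated with a small parser. This is only a sketch under stated assumptions: Fig. 2 is not reproduced here, so the concrete input string, the rule that each single letter is one phoneme, the convention that a digit run gives the pause length after the current accent phrase, and the names `AccentPhrase` and `parse_phonetic_string` are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# "cl" (silent interval) must be tried before single letters.
TOKEN = re.compile(r"cl|[A-Za-z]|'|\d+")


@dataclass
class AccentPhrase:
    phonemes: List[str] = field(default_factory=list)
    accent_index: Optional[int] = None  # number of phonemes before the accent mark
    pause_after: Optional[int] = None   # pause-length code, if any


def parse_phonetic_string(text: str) -> List[AccentPhrase]:
    """Split on '/' into accent phrases and classify each token."""
    phrases = []
    for chunk in text.split("/"):
        if not chunk.strip():
            continue  # ignore empty chunks such as a trailing '/'
        phrase = AccentPhrase()
        for token in TOKEN.findall(chunk):
            if token == "'":
                phrase.accent_index = len(phrase.phonemes)
            elif token.isdigit():
                phrase.pause_after = int(token)
            else:
                phrase.phonemes.append(token)  # a phoneme or "cl"
        phrases.append(phrase)
    return phrases


phrases = parse_phonetic_string("kado'ma/si2")
```

A real front end would also need multi-letter phonemes and the phrase or sentence boundary marks mentioned above; those are left out to keep the sketch short.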
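[0064]
(2) The prosodic information search unit 140 searches the prosodic information database 130 based on the phonetic symbol string and linguistic information for each accent phrase output from the language processing unit 120, and outputs the retrieved prosodic information together with the approximation cost detailed below. More specifically, when the language processing unit 120 outputs a phonetic symbol string in the notation above, the unit first derives from it the phoneme string and numerical values indicating the accent position, the number of morae, and so on, and uses these as search keys to search the prosodic information database 130. If a stored key that exactly matches the search key exists in the prosodic information database 130, the prosodic information corresponding to that stored key is the search result. If none exists, entries that match to some extent (for example, entries whose phoneme string matches but whose semantic information does not, or whose phoneme string does not match but whose accent and number of morae do) are first taken as search candidates, and the candidate with the highest degree of match between search key and stored key is selected as the search result.
[0065]
This selection can be made, for example, by a minimum-cost method using an approximation cost. Specifically, the approximation cost C is first obtained as follows.
[Equation 1]
[0066]
C = a1·D1 + a2·D2 + a3·D3 + a4·D4 + a5·D5 + a6·D6 + a7·D7
where a1, D1, and so on are as follows.
[0067]
D1: number of non-matching phonemes in the phoneme string
D2: difference in accent position
D3: difference in number of morae
D4: whether the immediately preceding pause length matches (whether it falls within the range of the stored key)
D5: whether the immediately following pause length matches (whether it falls within the range of the stored key)
D6: whether, or to what degree, the grammatical information matches
D7: whether, or to what degree, the semantic information matches
a1 to a7: coefficients that weight D1 to D7 (the degree to which each of D1 to D7 contributes to selecting appropriate prosodic information, obtained by statistical methods or learning).
[0068]
D1 to D7 are not limited to the above; various quantities can be used as long as they express the degree of match between the search key and the stored key. For example, D1 may take different values depending on whether the non-matching phonemes are similar to each other, where the non-matching phonemes are located, whether they are consecutive, and so on. For D4 and D5, when the pause length is expressed in levels such as long, short, and none as in Fig. 3, a match or mismatch may be expressed as 0 or 1, or the difference in level may be expressed numerically; when the pause length is given as a time value, the time difference may be used. For D6 and D7, a match or mismatch in grammatical or semantic information may be expressed as 0 or 1. Alternatively, a table indexed by search key and stored key may supply a value for the degree of match of each combination (for example, low between a noun and a verb, high between a particle and an auxiliary verb), or a thesaurus may be used to determine the degree of semantic similarity.
[0069]
By computing this approximation cost for each search candidate and selecting the candidate with the smallest approximation cost as the search result, relatively appropriate and natural speech can be produced from similar prosodic information even when the prosodic information database 130 holds no prosodic information whose stored key exactly matches the search key.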
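The minimum-cost selection built on (Equation 1) can be sketched as follows. The `Key` fields mirror the stored keys (a) to (e) of Fig. 3, but the field names, the 0/1 scoring chosen for D4 to D7, the weight values, and the sample entries are illustrative assumptions, not values from the specification. The prosodic information itself is represented by a placeholder string.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Key:
    phonemes: Tuple[str, ...]
    accent: int        # accent position
    morae: int         # number of morae
    pause_before: str  # e.g. "none", "short", "long"
    pause_after: str
    grammar: str       # part of speech, etc.
    meaning: str       # semantic attribute


def differences(q: Key, k: Key) -> Tuple[int, ...]:
    """D1..D7 for a search key q and a stored key k (0/1 variants for D4..D7)."""
    d1 = sum(a != b for a, b in zip(q.phonemes, k.phonemes))
    d1 += abs(len(q.phonemes) - len(k.phonemes))  # unmatched tail counts as mismatches
    return (
        d1,
        abs(q.accent - k.accent),
        abs(q.morae - k.morae),
        int(q.pause_before != k.pause_before),
        int(q.pause_after != k.pause_after),
        int(q.grammar != k.grammar),
        int(q.meaning != k.meaning),
    )


def approximation_cost(q: Key, k: Key, weights: Sequence[float]) -> float:
    """C = a1*D1 + ... + a7*D7."""
    return sum(a * d for a, d in zip(weights, differences(q, k)))


def search(q: Key, database: List[Tuple[Key, str]], weights: Sequence[float]):
    """Return (prosody, cost) of the entry with the smallest approximation cost."""
    key, prosody = min(database, key=lambda e: approximation_cost(q, e[0], weights))
    return prosody, approximation_cost(q, key, weights)


WEIGHTS = (1, 2, 2, 1, 1, 1, 1)  # hypothetical a1..a7
query = Key(("k", "a", "d", "o", "m", "a", "sh", "i"), 3, 4, "none", "none", "noun", "place")
database = [
    (Key(("n", "a", "g", "o", "y", "a", "sh", "i"), 3, 4, "none", "none", "noun", "place"),
     "prosody of nagoyashi"),
    (Key(("n", "a", "r", "u", "N", "d", "e", "s", "u"), 1, 5, "none", "none", "verb", "state"),
     "prosody of narundesu"),
]
prosody, cost = search(query, database, WEIGHTS)
```

With these made-up entries, only three phonemes differ between the query and the first stored key, so that entry is selected with cost 3; an exact match would give cost 0.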
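[0070]
(3) The prosodic information transformation unit 150 transforms the prosodic information (fundamental frequency pattern, speech intensity pattern, and phonological-unit duration pattern) output as the search result from the prosodic information search unit 140, using the rules stored in the prosodic information transformation rule storage unit 160 according to the approximation cost output from the prosodic information search unit 140. Specifically, for example, when a transformation rule that compresses the dynamic range of the fundamental frequency pattern is applied, the fundamental frequency pattern is transformed as shown in Fig. 4.
[0071]
Transformation according to the approximation cost has the following significance. Suppose, for example, that as shown in Fig. 5 the prosodic information of 「名古屋市」 (Nagoya-shi) is retrieved for the input text 「門真市」 (Kadoma-shi). Their phoneme strings differ, but the other search items match (the approximation cost is small), so appropriate speech can be synthesized by using the prosodic information of 「名古屋市」 as is, without transformation. Suppose instead that 「なるんです」 (narun desu) is retrieved for 「５分です」 (go-fun desu, "it is five minutes"). To obtain appropriate synthesized speech for 「５分です」, the following generally applies: considering the difference in part of speech, it is desirable to reduce the speech intensity pattern of 「なるんです」 somewhat; considering the bunsetsu information (for example, semantic importance), numerals are often uttered with high intensity, so it is desirable to increase the speech intensity pattern of 「なるんです」 to some extent; and overall, it is desirable to increase it slightly. Because the overall degree of transformation of this kind is correlated with the approximation cost, appropriate synthesized speech can be obtained by storing in the prosodic information transformation rule storage unit 160, as transformation rules, the degree of transformation (scaling factor, etc.) corresponding to the approximation cost.
The transformation of prosodic information is not limited to a uniform one over the entire elapsed time as in Fig. 4; the degree of transformation may vary over time according to a transformation pattern, for example one that mainly transforms the middle of the time course. As a concrete storage format, the transformation rule may be a coefficient for converting the approximation cost into a scaling factor, or a table that maps the approximation cost, as a parameter, to scaling factors or transformation patterns. The approximation cost used for transformation need not be the same as that used for the search: a value giving more appropriate transformation may be obtained from an expression whose coefficients a1 to a7 differ from those of (Equation 1), and different values may be used for the fundamental frequency pattern, the speech intensity pattern, and the phonological-unit duration pattern. Also, for example, where the terms of (Equation 1) can take negative values, the sum of the absolute values of the terms may be used as the approximation cost for search (zero or positive), and the sum of the terms as they are may be used as the approximation cost for transformation (which may be negative).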
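One of the storage formats mentioned above, a table that maps the approximation cost to a scaling factor, can be sketched together with a dynamic-range compression of a fundamental frequency pattern. The table values, the choice to compress around the mean in linear Hz, and the function names are assumptions made for illustration; the specification does not fix any of them.

```python
from typing import List, Sequence, Tuple

# Hypothetical rule table: (upper bound of approximation cost, compression ratio).
# A ratio of 1.0 leaves the pattern unchanged; smaller ratios flatten it more.
F0_RULES: Sequence[Tuple[float, float]] = ((0.0, 1.0), (3.0, 0.9), (6.0, 0.7))


def ratio_from_cost(cost: float, rules: Sequence[Tuple[float, float]]) -> float:
    """Look up the ratio of the first row whose bound is not exceeded."""
    for bound, ratio in rules:
        if cost <= bound:
            return ratio
    return rules[-1][1]  # costs beyond the table use the last row


def compress_dynamic_range(f0: List[float], ratio: float) -> List[float]:
    """Scale each value's deviation from the pattern mean by `ratio`."""
    mean = sum(f0) / len(f0)
    return [mean + ratio * (value - mean) for value in f0]


def transform_f0(f0: List[float], cost: float) -> List[float]:
    return compress_dynamic_range(f0, ratio_from_cost(cost, F0_RULES))


flattened = compress_dynamic_range([100.0, 200.0, 150.0], 0.5)
```

This applies one ratio uniformly over time, as in the Fig. 4 case; a time-varying transformation pattern would replace the single ratio with one value per frame.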
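[0072]
(4) The waveform generation unit 170 synthesizes a speech waveform based on the phonetic symbol string output from the language processing unit 120 and the prosodic information transformed by the prosodic information transformation unit 150, that is, based on the phoneme string and pause lengths together with the fundamental frequency pattern, speech intensity pattern, and phonological-unit duration pattern, and outputs an analog speech signal. This analog speech signal causes the electroacoustic transducer 180 to emit the synthesized speech.
[0073]
As described above, even when the prosodic information database 130 holds no prosodic information whose stored key exactly matches the search key, speech is synthesized from similar prosodic information, so relatively appropriate, even, and natural speech can be produced. Conversely, the storage capacity of the prosodic information database 130 can be reduced without impairing the naturalness of the synthesized speech. Furthermore, when similar prosodic information is used in this way, it is transformed according to the degree of similarity, so more appropriate synthesized speech is produced.
[0074]
(Embodiment 2)
As the speech synthesis system of Embodiment 2, an example is described in which the pause lengths before and after each accent phrase are also stored in the prosodic information database as prosodic information. In the following embodiments, components having the same functions as in Embodiment 1 and so on are given the same or corresponding reference numerals, and their detailed description is omitted.
[0075]
Fig. 6 is a functional block diagram showing the configuration of the speech synthesis system of Embodiment 2. This system differs from that of Embodiment 1 in the following points.
[0076]
(a) Unlike the language processing unit 120, the language processing unit 220 outputs a phonetic symbol string that contains no pause information.
[0077]
(b) In the prosodic information database 230, as shown in Fig. 7 and unlike the prosodic information database 130, pause information is stored as prosodic information rather than as a stored key. In practice, a database with the same data structure as the prosodic information database 130 may be used, with the pause length treated as prosodic information at search time.
[0078]
(c) The prosodic information search unit 240 searches by matching search keys and stored keys that contain no pause information, and outputs pause information as prosodic information as well (in addition to the fundamental frequency pattern, speech intensity pattern, and phonological-unit duration pattern).
[0079]
(d) The prosodic information transformation unit 250 also transforms the pause information according to the approximation cost, in the same way as the fundamental frequency pattern and so on.
[0080]
(e) The prosodic information transformation rule storage unit 260 holds pause length modification rules along with the fundamental frequency pattern transformation rules and so on.
[0081]
Using pause information retrieved from the prosodic information database 230 in this way makes it possible to produce synthesized speech with more natural pause lengths. It can also reduce the load of input text analysis in the language processing unit 220.
[0082]
As in Embodiment 1, pause information output from the language processing unit may also be used as a search key at search time, so that search accuracy can easily be improved. In that case, the prosodic information database may store pause information as a stored key and pause information as prosodic information separately, or one item may serve both purposes. When pause information is both output from the language processing unit and stored in the prosodic information database in this way, which pause information to use for synthesis may be chosen according to the analysis accuracy of the language processing unit and the reliability of the pause information retrieved from the prosodic information database; the choice may also be made according to the approximation cost (the confidence of the search result).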
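[0083]
(Embodiment 3)
As the speech synthesis system of Embodiment 3, an example is described in which the search and transformation of prosodic information are performed based on separate approximation costs for the fundamental frequency pattern and each of the other patterns.
[0084]
Fig. 8 is a functional block diagram showing the configuration of the speech synthesis system of Embodiment 3. This system differs from that of Embodiment 1 in the following points.
[0085]
(a) In place of the prosodic information search unit 140, a fundamental frequency pattern search unit 341, a speech intensity pattern search unit 342, and a phonological-unit duration pattern search unit 343 are provided.
[0086]
(b) In place of the prosodic information transformation unit 150, a fundamental frequency pattern transformation unit 351, a speech intensity pattern transformation unit 352, and a phonological-unit duration pattern transformation unit 353 are provided.
[0087]
The search units 341 to 343 and the transformation units 351 to 353 each independently search for (select a search candidate for) or transform the fundamental frequency pattern, the speech intensity pattern, or the phonological-unit duration pattern, using the approximation costs given by (Equation 2) to (Equation 4) below.
[Equation 2]
[0088]
(Search and transformation of the fundamental frequency pattern)
C = b1·D1 + b2·D2 + b3·D3 + b4·D4 + b5·D5 + b6·D6 + b7·D7
[Equation 3]
[0089]
(Search and transformation of the speech intensity pattern)
C = c1·D1 + c2·D2 + c3·D3 + c4·D4 + c5·D5 + c6·D6 + c7·D7
[Equation 4]
[0090]
(Search and transformation of the phonological-unit duration pattern)
C = d1·D1 + d2·D2 + d3·D3 + d4·D4 + d5·D5 + d6·D6 + d7·D7
Here D1 to D7 are the same as in (Equation 1) of Embodiment 1, but the weighting coefficients b1 to b7, c1 to c7, and d1 to d7 differ from a1 to a7 of (Equation 1): they are obtained by statistical methods or learning so that an appropriate fundamental frequency pattern, speech intensity pattern, or phonological-unit duration pattern, respectively, is selected. For example, fundamental frequency patterns are generally roughly similar when the accent position and the number of morae are the same, so the coefficients b2 and b3 are set larger than the coefficients a2 and a3 of (Equation 1). The speech intensity pattern is strongly influenced by the presence and length of pauses, so the coefficients c4 and c5 are set larger than the coefficients a4 and a5. Similarly, the phonological-unit duration pattern is strongly influenced by the phoneme sequence, so the coefficient d1 is set larger than the coefficient a1.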
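The effect of using a separate weight vector per pattern, as in (Equation 2) to (Equation 4), can be seen in a small sketch. The weight values below only follow the qualitative guidance in the text (b2 and b3 large, c4 and c5 large, d1 large); the numbers themselves, the candidate names, and their D1 to D7 vectors are invented for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical weights: b1..b7, c1..c7, d1..d7.
PATTERN_WEIGHTS: Dict[str, Sequence[float]] = {
    "f0":        (1, 3, 3, 1, 1, 1, 1),  # accent position and mora count dominate
    "intensity": (1, 1, 1, 3, 3, 1, 1),  # pauses before and after dominate
    "duration":  (3, 1, 1, 1, 1, 1, 1),  # phoneme sequence dominates
}


def cost(weights: Sequence[float], d: Sequence[float]) -> float:
    """Weighted sum of D1..D7."""
    return sum(w * x for w, x in zip(weights, d))


def select_per_pattern(candidates: List[Tuple[str, Sequence[float]]]) -> Dict[str, str]:
    """Pick, independently for each pattern, the candidate with the lowest cost."""
    return {
        pattern: min(candidates, key=lambda c: cost(weights, c[1]))[0]
        for pattern, weights in PATTERN_WEIGHTS.items()
    }


# Each candidate is (entry name, (D1, ..., D7)) relative to the same search key.
candidates = [
    ("A", (3, 0, 0, 1, 1, 0, 0)),  # same accent and mora count, other phonemes and pauses
    ("B", (0, 1, 1, 0, 0, 0, 0)),  # same phonemes and pauses, shifted accent and mora count
]
selection = select_per_pattern(candidates)
```

Here the fundamental frequency pattern is taken from entry A while the intensity and duration patterns come from entry B, which is the kind of mixed selection that independent costs make possible.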
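[0091]
By performing the search and transformation of the fundamental frequency pattern and so on independently with separate approximation costs in this way, well-balanced search and transformation can be achieved, and speech can be synthesized from the fundamental frequency pattern and other patterns that are each optimal. In addition, the prosodic information database 130 need not store the fundamental frequency pattern, speech intensity pattern, and phonological-unit duration pattern as sets; it suffices, for example, to store as many entries as there are distinct types of each pattern, so good-quality synthesized speech can be produced with a prosodic information database 130 of relatively small capacity.
[0092]
(Embodiment 4)
The speech synthesis system of Embodiment 4 is described.
[0093]
Fig. 9 is a functional block diagram showing the configuration of the speech synthesis system of Embodiment 4. This system has mainly the following features.
[0094]
(a) Unlike Embodiments 1 to 3, processing such as the search and transformation of prosodic information is performed per phrase rather than per accent phrase. Here, a phrase, also called a clause or breath group, is a group of one or more accent phrases that normally forms a break when uttered (as it would where there is a full stop).
[0095]
(b) As in Embodiment 2, there are provided a prosodic information database 430 in which pause information is stored as prosodic information, and a prosodic information transformation rule storage unit 460 in which pause length modification rules are stored along with the fundamental frequency pattern transformation rules and so on. These differ from the prosodic information database 230 and the prosodic information transformation rule storage unit 260 of Embodiment 2 in that, as shown in Fig. 10, prosodic information and transformation rules are also stored per phrase.
[0096]
(c) As in Embodiment 3, the search and transformation of prosodic information are performed based on separate approximation costs for the fundamental frequency pattern and each of the other patterns. The search for pause information and the modification of pause length are likewise performed independently.
[0097]
(d) As in Embodiments 1 to 3, prosodic information is transformed according to the approximation cost; the difference is that it is additionally transformed according to the per-phoneme degree of match (the degree or presence of a match) between the phoneme strings of the search key and the stored key.
[0098]
A more detailed description follows.
[0099]
Like the language processing unit 120 of Embodiment 1, the language processing unit 420 analyzes the text input from the character string input unit 110 and separates it into accent phrases; it then outputs the phonetic symbol string and linguistic information per phrase, a phrase being a predetermined group of accent phrases.
[0100]
In the prosodic information database 430, prosodic information is stored per phrase as described above, and accordingly the number of accent phrases contained in each phrase is also stored as a stored key, as shown in Fig. 10. The pause information stored as prosodic information is not limited to the pause lengths before and after the phrase; it may also include the pause lengths before and after each accent phrase.
[0101]
The fundamental frequency pattern search unit 441, the speech intensity pattern search unit 442, the phonological-unit duration pattern search unit 443, and the pause information search unit 444 take the number of accent phrases in the phrase into account in the approximation cost, in order to search for prosodic information per phrase. The units other than the pause information search unit 444 output, together with the retrieved fundamental frequency pattern or other pattern and the approximation cost, the per-phoneme degree of match between the phoneme strings of the search key and the stored key. The pause information search unit 444, in contrast, outputs, together with the pause information and the approximation cost, the degree of match of the number of morae, accent position, and so on of each accent phrase.
[0102]
Like the prosodic information transformation unit 150 and so on of Embodiments 1 to 3, the fundamental frequency pattern transformation unit 451, the speech intensity pattern transformation unit 452, and the phonological-unit duration pattern transformation unit 453 use the rules held in the prosodic information transformation rule storage unit 460 to transform the prosodic information according to the approximation cost output from the fundamental frequency pattern search unit 441 and so on; in addition, they transform it according to the per-phoneme degree of match between the phoneme strings of the search key and the stored key. That is, when the prosodic information of a word that differs in only some phonemes is used, as with 「さかな」 (sakana) for 「たかな」 (takana), the speech intensity pattern for the differing phoneme can be weakened, as in the portion marked P in Fig. 2, which makes it easy to apply a transformation that renders the effect of the phoneme difference less noticeable. Transformation according to the per-phoneme degree of match need not always be performed; conversely, only the transformation according to the per-phoneme degree of match may be performed, without the transformation according to the approximation cost.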
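The per-phoneme weakening described for 「たかな」 and 「さかな」 can be sketched as below. The frame-based representation of the intensity pattern, the `(start, end)` spans marking which frames belong to each stored phoneme, the attenuation factor, and the binary match test are all assumptions for illustration; the specification leaves the concrete rule to the transformation rule storage unit.

```python
from typing import List, Sequence, Tuple


def attenuate_mismatches(
    intensity: Sequence[float],
    spans: Sequence[Tuple[int, int]],
    query: Sequence[str],
    stored: Sequence[str],
    factor: float = 0.7,
) -> List[float]:
    """Scale the intensity frames of every phoneme that differs from the query.

    spans[i] is the half-open frame range (start, end) of stored phoneme i.
    """
    out = list(intensity)
    for (start, end), q, s in zip(spans, query, stored):
        if q != s:  # binary match; a graded distance could scale `factor` instead
            for i in range(start, end):
                out[i] *= factor
    return out


takana = ["t", "a", "k", "a", "n", "a"]  # search key
sakana = ["s", "a", "k", "a", "n", "a"]  # stored key whose prosody is reused
pattern = [1.0] * 6                      # one frame per phoneme, for simplicity
spans = [(i, i + 1) for i in range(6)]
weakened = attenuate_mismatches(pattern, spans, takana, sakana)
```

Only the frame belonging to the first phoneme is scaled, so the borrowed pattern stays intact where the two words agree.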
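[0103]
The pause length modification unit 454 uses the rules held in the prosodic information transformation rule storage unit 460 to transform the prosodic information according to the approximation cost output from the pause information search unit 444 and, in addition, modifies the pause length according to the degree of match of the number of morae, accent position, and so on of each accent phrase.
[0104]
By searching for and transforming prosodic information per phrase in this way, more natural synthesized speech that follows the flow of the sentence can be produced. As in Embodiment 2, using the pause information retrieved from the prosodic information database 430 makes it possible to produce synthesized speech with more natural pause lengths. As in Embodiment 3, performing the search and transformation of the fundamental frequency pattern and so on independently with separate approximation costs allows speech to be synthesized from patterns that are each optimal, and also makes it easy to reduce the storage capacity of the prosodic information database 430. Furthermore, transforming the fundamental frequency pattern and so on according to the per-phoneme degree of match makes the effect of phoneme differences less noticeable, and modifying the pause length according to the degree of match of the number of morae, accent position, and so on of each accent phrase makes it possible to produce synthesized speech with more natural pause lengths.
[0105]
(Embodiment 5)
As the speech synthesis system of Embodiment 5, an example is described in which a phonological category sequence is used to search for prosodic information.
[0106]
Fig. 11 is a functional block diagram showing the configuration of the speech synthesis system of Embodiment 5. Fig. 12 is an explanatory diagram showing an example of phonological categories.
[0107]
Here, the phonological categories group phonological units by distances derived from the phonetic features of the units, that is, by manner of articulation, place of articulation, duration, and so on. Phonemes in the same phonological category have similar acoustic characteristics, so, for example, an accent phrase and another accent phrase in which some phonemes are replaced by other phonemes of the same phonological category often have identical or relatively similar prosodic information. Therefore, when searching for prosodic information, even if the phoneme strings do not match, reusing the prosodic information will in many cases still yield appropriate synthesized speech, provided the phonological categories of the individual phonemes match. The grouping of phonological units is not limited to the above. For example, as shown in Fig. 12, units may be grouped according to inter-unit distances (psychological distances) determined from a confusion table of phonological units using multivariate analysis or the like; they may be grouped according to the similarity of their physical characteristics (fundamental frequency, intensity, duration, spectrum, and so on); or prosodic patterns may be grouped by a statistical method such as multivariate analysis and the phonological units then grouped by a statistical method so as to best reflect those groups of prosodic patterns.
[0108]
A concrete description follows. Compared with the speech synthesis system of Embodiment 1, the system of Embodiment 5 differs in that it has a prosodic information database 730 in place of the prosodic information database 130 and additionally has a phonological category sequence generation unit 790.
[0109]
In the prosodic information database 730, in addition to the contents of the prosodic information database 130 of Embodiment 1, a phonological category sequence indicating the phonological category to which each phoneme of the accent phrase belongs is stored as a stored key. As a concrete notation, the phonological category sequence may be expressed as a sequence of the numbers or symbols assigned to the categories, or as a sequence of representative phonemes, one phoneme in each category being chosen as its representative.
[0110]
The phonological category sequence generation unit 790 converts the phonetic symbol string for each accent phrase output from the language processing unit 120 into a phonological category sequence and outputs it.
[0111]
The prosodic information search unit 740 searches the prosodic information database 730 based on the phonological category sequence output from the phonological category sequence generation unit 790 and on the phonetic symbol string and linguistic information for each accent phrase output from the language processing unit 120, and outputs the retrieved prosodic information together with the approximation cost. By including the degree of match of the phonological category sequences (for example, the similarity of the phonological categories of each phonological unit) in the approximation cost, the cost can be made small when the phonological category sequences match even if the phoneme strings do not, so more appropriate prosodic information is retrieved (selected) and natural synthesized speech is produced. It also becomes easy, for example, to speed up the search by first narrowing the search candidates to those whose phonological category sequence matches or is similar.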
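The conversion performed by the phonological category sequence generation unit 790, and the way a category-level comparison can succeed where the phoneme-level comparison fails, can be sketched as follows. The category table is a hypothetical grouping by manner of articulation made up for this example; it is not the grouping of Fig. 12, and the category symbols and function names are likewise assumptions.

```python
from typing import Sequence, Tuple

# Hypothetical grouping by manner of articulation (not the table of Fig. 12).
CATEGORY = {
    "a": "V", "i": "V", "u": "V", "e": "V", "o": "V",  # vowels
    "k": "P", "t": "P", "p": "P",                      # voiceless plosives
    "g": "B", "d": "B", "b": "B",                      # voiced plosives
    "n": "N", "m": "N",                                # nasals
    "s": "F", "sh": "F", "h": "F", "z": "F",           # fricatives
    "r": "L", "y": "G", "w": "G",                      # liquid and glides
}


def to_categories(phonemes: Sequence[str]) -> Tuple[str, ...]:
    """Map a phoneme string to its category sequence (KeyError on unknown symbols)."""
    return tuple(CATEGORY[p] for p in phonemes)


def mismatches(a: Sequence[str], b: Sequence[str]) -> int:
    """Count differing positions, plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def category_mismatches(a: Sequence[str], b: Sequence[str]) -> int:
    return mismatches(to_categories(a), to_categories(b))


takana = ("t", "a", "k", "a", "n", "a")
katama = ("k", "a", "t", "a", "m", "a")
sakana = ("s", "a", "k", "a", "n", "a")
```

Under this made-up table, `takana` and `katama` differ in three phonemes but share the same category sequence, so a category-based term in the approximation cost would treat them as a match, while `takana` and `sakana` differ in one category.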
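[0112]
The example above converts the phonetic symbol string output from the language processing unit 120 into a phonological category sequence in the phonological category sequence generation unit 790, but this is not limiting: the language processing unit 120 may be given the function of generating the phonological category sequence, or the prosodic information search unit 740 may be given the function of converting the input phonetic symbol string into a phonological category sequence. Also, if the prosodic information search unit 740 is given the function of converting phoneme strings read from the prosodic information database into phonological category sequences, a prosodic information database that stores no phonological category sequences, like the prosodic information database 130 of Embodiment 1, can be used.
[0113]
The search need not use both the phoneme string and the phonological category sequence as search keys; the phonological category sequence alone may be used. In that case, entries of prosodic information that differ only in their phoneme strings can be merged, which makes it easy to reduce the capacity of the database and to increase search speed.
[0114]
The components described in the above embodiments and variations may be combined in various ways. Specifically, for example, the technique shown in Embodiment 5 of using a phonological category sequence for searching for prosodic information and the like may be applied to the other embodiments.
[0115]
The transformation of prosodic information according to the per-phoneme degree of match, shown in Embodiments 3 and 4, can also be used in the other embodiments, in place of or together with the transformation according to the approximation cost. Furthermore, the transformation may use the degree of match per phoneme, per mora, per syllable, per unit of speech waveform generation in the waveform generation unit, per phonological unit, and so on. The degree of match to use may also be selected according to the prosodic information being transformed. Specifically, for example, either the approximation cost or a per-phoneme or similar degree of match may be used to transform the fundamental frequency pattern, and both may be used together to transform the speech intensity pattern. Here, the degree of match of phonemes and the like can be determined based on, for example, a distance based on acoustic characteristics such as fundamental frequency, intensity, duration, and spectrum; a distance obtained phonetically from manner of articulation, place of articulation, duration, and so on; or a distance based on a confusion table obtained by listening experiments.
[0116]
The method shown in Embodiment 5 of using phonological categories for search and the like can also be used in the other embodiments, in place of or together with the use of phoneme strings.
[0117]
The configuration shown in Embodiments 2 and 4, in which pause information is stored in the prosodic information database as prosodic information and retrieved, may also be applied to the other embodiments; conversely, in Embodiments 2 and 4, pause information may also be used for the search.
[0118]
The language processing unit need not always be provided; a phonetic symbol string or the like may be input directly from outside. Such a configuration is particularly useful when applied to small devices such as mobile phones, as it makes it easier to reduce the size of the device and to compress communication data. The phonetic symbol string and the linguistic information may both be input from outside. That is, for example, highly accurate language processing may be performed on a large-scale server and its result input, so that even more appropriate speech is produced. Alternatively, only the phonetic symbol string or the like may be used, to simplify the configuration.
[0119]
The prosodic information used to synthesize speech is not limited to the above. For example, a phoneme duration pattern, a mora duration pattern, a syllable duration pattern, or the like may be used in place of the phonological-unit duration pattern. Various kinds of prosodic information, including duration patterns such as these, may also be combined.
[0120]
The prosody control unit, that is, the unit for storing, retrieving, and transforming prosodic information, may be either an accent phrase or a phrase consisting of one or more accent phrases; it may also be a bunsetsu, a word, or a stress phrase, or a phrase consisting of one or more bunsetsu, words, or stress phrases; and these may be mixed. Also, separately from the prosody control unit (for example, a phrase consisting of one or more accent phrases), the degree of match of the number of morae, accent position, and so on for another unit (for example, an accent phrase) may be used for transforming the prosodic information and the like.
[0121]
The items and number of search keys are not limited to the above. In general, having more search key items makes it easier to retrieve appropriate candidates, but the items should be optimized, together with the way the degree of match of each item is determined and weighted, so that the optimal candidate is easily retrieved. Search keys that contribute little to search accuracy may be omitted in order to simplify the configuration and increase processing speed.
[0122]
The above examples were described for Japanese, but the invention is not limited to this and can be applied just as easily to various other languages. In that case, modifications suited to the characteristics of each language may be made, for example changing processing per mora into processing per mora or syllable. The prosodic information database 130 and so on may also store information for multiple languages.
[0123]
The configuration described above may be implemented with a computer (and peripheral devices) and a program, or may be implemented in hardware.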
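[0124]
[Effects of the Invention]
As described above, according to the present invention, prosodic information such as fundamental frequency patterns, speech intensity patterns, phoneme duration patterns, and pause information extracted from real speech is held as a database; for an utterance target input as text, a phonetic symbol string, or the like, the prosodic information with, for example, the smallest approximation cost is retrieved and selected from the database; and the selected prosodic information is transformed based on predetermined transformation rules according to the approximation cost, degree of match, and so on. Natural synthesized speech can thereby be produced for arbitrary input text or the like. In particular, the invention has the effect that synthesized speech of similar quality, that is, natural synthesized speech that is close to real speech overall, can be produced whether or not utterance content corresponding to the input text or the like exists in the speech information database.
[0125]
Accordingly, the present invention can be used in electronic devices such as home appliances, car navigation systems, and mobile phones to speak messages such as device status, operating instructions, and responses, and on personal computers and the like for operation through voice interfaces and for checking the results of optical character recognition (OCR); it is useful in these and similar fields.
[Brief Description of the Drawings]
[Fig. 1] A functional block diagram showing the configuration of the speech synthesis system of Embodiment 1.
[Fig. 2] An explanatory diagram showing examples of the information at each part of the speech synthesis system of Embodiment 1.
[Fig. 3] An explanatory diagram showing the stored contents of the prosodic information database of the speech synthesis system of Embodiment 1.
[Fig. 4] An explanatory diagram showing an example of the transformation of a fundamental frequency pattern.
[Fig. 5] An explanatory diagram showing an example of the transformation of prosodic information.
[Fig. 6] A functional block diagram showing the configuration of the speech synthesis system of Embodiment 2.
[Fig. 7] An explanatory diagram showing the stored contents of the prosodic information database of the speech synthesis system of Embodiment 2.
[Fig. 8] A functional block diagram showing the configuration of the speech synthesis system of Embodiment 3.
[Fig. 9] A functional block diagram showing the configuration of the speech synthesis system of Embodiment 4.
[Fig. 10] An explanatory diagram showing the stored contents of the prosodic information database of the speech synthesis system of Embodiment 4.
[Fig. 11] A functional block diagram showing the configuration of the speech synthesis system of Embodiment 5.
[Fig. 12] An explanatory diagram showing an example of phonological categories.
[Fig. 13] A functional block diagram showing the configuration of a conventional speech synthesis system.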
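[Explanation of Reference Numerals]
110 Character string input unit
120 Language processing unit
130 Prosodic information database
140 Prosodic information search unit
150 Prosodic information transformation unit
160 Prosodic information transformation rule storage unit
170 Waveform generation unit
180 Electroacoustic transducer
220 Language processing unit
230 Prosodic information database
240 Prosodic information search unit
250 Prosodic information transformation unit
260 Prosodic information transformation rule storage unit
341 Fundamental frequency pattern search unit
342 Speech intensity pattern search unit
343 Phonological-unit duration pattern search unit
351 Fundamental frequency pattern transformation unit
352 Speech intensity pattern transformation unit
353 Phonological-unit duration pattern transformation unit
420 Language processing unit
430 Prosodic information database
441 Fundamental frequency pattern search unit
442 Speech intensity pattern search unit
443 Phonological-unit duration pattern search unit
444 Pause information search unit
451 Fundamental frequency pattern transformation unit
452 Speech intensity pattern transformation unit
453 Phonological-unit duration pattern transformation unit
454 Pause length modification unit
460 Prosodic information transformation rule storage unit
730 Prosodic information database
740 Prosodic information search unit
790 Phonological category sequence generation unit

[0001]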
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speech synthesis system that converts an arbitrary input text or an input phonetic symbol string or the like into a synthesized speech and outputs the synthesized speech.
[0002]
[Prior art]
2. Description of the Related Art In recent years, in various electronic devices such as home appliances, car navigation systems, and mobile phones, synthetic speech is often used to utter messages such as device states, operation instructions, and response messages. In personal computers and the like, it is also being used for operations through a voice interface and confirmation of character recognition results by optical character recognition (OCR).
[0003]
As a method of performing the above-described voice synthesis, there is a method of storing voice data in advance and reproducing the voice data, and is often used when a limited message or the like is uttered. Producing an arbitrary voice using this method requires a large-capacity storage device and tends to be expensive, so that its use is limited.
[0004]
On the other hand, as a method of producing an arbitrary voice with a relatively inexpensive configuration, a voice data is generated using a predetermined voice data generation rule based on an input text or a sequence of phonetic symbol strings. There is something. However, it is difficult to utter natural sounds for various expressions using such a method using the sound data generation rules.
[0005]
Therefore, as disclosed in, for example, Japanese Patent Application Laid-Open No. 8-87297, a speech synthesis system that combines the generation of synthesized speech by searching for speech information using a database and the generation of synthesized speech by a synthesized speech generation rule has been proposed. Are known. More specifically, as shown in FIG. 13, for example, this type of device includes a character string input unit 910 and a voice information database 920 storing voice feature amounts extracted by analyzing actual voices and utterance contents corresponding thereto. A speech information search unit 930 for searching the speech information database 920, a synthesized speech generation unit 940 for generating a speech waveform, and a synthesized speech including rules for generating a speech feature from an input text or an input phonetic symbol string. The configuration includes a generation rule 950 and an electroacoustic transducer 960. In this speech synthesis system, when a text or a phonetic symbol string is input to the character string input unit 910, the speech information searching unit 930 outputs a speech content corresponding to the input text or the input phonetic symbol string from the speech information database 920. Search audio information. If there is a matching utterance content, the corresponding speech information is passed to the synthesized speech generation unit 940. On the other hand, if there is no matching utterance content, the speech information search unit 930 passes the input text or the input phonetic symbol string to the synthetic speech generation unit 940 as it is. When the searched speech information is input, the synthesized speech generation unit 940 generates a synthesized speech based on the input speech information. When an input text or an input phonetic symbol string is input, the synthesized speech is generated with the synthesized speech. 
After generating a speech feature based on the generation rule 950, a synthesized speech is generated.
[0006]
As described above, by using the search of the voice information and the synthetic voice generation rule, it is possible to convert an arbitrary input text or the like into a synthetic voice and output it, and to output a part of the voice (when the search is hit). For, a natural sound can be uttered.
[0007]
[Problems to be solved by the invention]
However, in the above-mentioned conventional speech synthesis system, the case where the search hits and the case where the search does not hit, that is, the case where the speech information corresponding to the input text or the like exists in the speech information database and the case where it does not exist, There is a problem that the difference in sound quality is large, and unnaturalness becomes conspicuous by connecting such sounds having different sound quality. In addition, since the search of the voice information database 920 is performed simply based on whether or not the input phonetic symbol string matches the stored utterance content, if there is a matching utterance content, regardless of the sentence configuration, etc. However, there is also a problem that speech synthesis is performed based on the searched speech information, which results in an unnatural synthesized speech.
[0008]
For example, if the sentence "I live in Osaka is Matsushita" is synthesized, and if the proper noun "Matsushita" does not exist in the database, only that part will be a mechanical synthesized voice. Or two sentences such as "Living in Osaka" and "I'm Matsushita" were unnaturally joined together using the voice information of "living in Osaka" stored as the utterance content at the end of the sentence. It tends to be such a synthesized voice.
[0009]
In view of the above, the present invention can produce a natural synthesized voice in response to an arbitrary input text or the like. In particular, the voice information (prosodic information) database contains utterance contents corresponding to the input text or the like. It is an object of the present invention to provide a speech synthesis system capable of producing a synthesized speech with the same sound quality whether or not there is a sound.
[0010]
[Means for Solving the Problems]
To achieve the above objectives, Book The invention is
In a voice synthesis system that outputs a synthesized voice based on synthesized voice information indicating a voice to be synthesized,
Used for speech synthesis, corresponding to the key information that will be the search key In the state extracted from the actual voice A database storing prosody information,
Search means for searching for the prosody information according to the degree of coincidence between the synthesized speech information and the key information;
With the above synthesized speech information Based on the deformation rule according to the degree of matching with the above key information Transforming means for transforming the prosody information searched by the search means;
A synthesizing unit that outputs a synthesized voice based on the synthesized voice information and the prosody information deformed by the deforming unit;
It is characterized by having.
[0011]
The deformation rule may set at least one of a deformation pattern and a degree. The synthesized speech information and the key information may each include a phonetic symbol string indicating a speech attribute of a synthesized voice, and further include linguistic information indicating a linguistic attribute of a synthesized voice, The phonetic symbol sequence may include at least information substantially indicating any of a sequence of phonemes of the synthesized voice, an accent position, and the presence or absence or length of a pause. Further, the linguistic information may include at least one of grammatical information and semantic information of the synthesized speech.
[0012]
Further, the invention is characterized in that it further comprises a language processing means for analyzing the text information input to the speech synthesis system and generating the phonetic symbol string and the language information.
[0013]
Thereby, even when the prosody information such that the synthesized speech information and the key information completely match is not stored in the database, the speech synthesis is performed by the similar prosody information. Appropriate and even natural sounds can be produced. Conversely, the storage capacity of the database can be reduced without impairing the naturalness of the synthesized speech. Further, when similar prosody information is used as described above, the prosody information is transformed according to the degree of similarity, so that a more appropriate synthesized speech is emitted.
[0014]
The present invention is also the above speech synthesis system, wherein each of the synthesized speech information and the key information substantially includes a phoneme category string indicating the phoneme category to which each phoneme of the synthesized speech belongs.
[0015]
The system may further comprise conversion means for converting at least one of the information corresponding to the synthesized speech information input to the speech synthesis system and the information corresponding to the key information stored in the database into a phoneme category string.
[0016]
The phoneme categories may be obtained by any of the following:
Grouping phonemes using at least one of the manner of articulation, the place of articulation, and the duration of the phonemes;
Grouping prosodic patterns using a statistical method, and then grouping phonemes using a statistical method such as multivariate analysis so as to best reflect the groups of prosodic patterns;
Grouping phonemes according to the distance between phonemes, determined from a table of mishearings (confusion table) using a statistical method such as multivariate analysis; or
Grouping phonemes according to the similarity of physical characteristics such as the fundamental frequency, intensity, time length, or spectrum of the phonemes.
[0017]
In this way, in the search for prosody information, even if the phoneme strings do not match, prosody information whose phoneme categories match for each phoneme can be reused, and in many cases appropriate and natural synthesized speech can still be produced.
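The category-level matching described above can be sketched as follows. This is only an illustration: the table `PHONEME_CATEGORY` (grouped loosely by manner of articulation) and the function names are assumptions made for this example, not the grouping defined by the invention.

```python
# Hypothetical phoneme-to-category table (assumption for illustration only).
PHONEME_CATEGORY = {
    "a": "V", "i": "V", "u": "V", "e": "V", "o": "V",            # vowels
    "k": "P", "t": "P", "p": "P", "g": "P", "d": "P", "b": "P",  # plosives
    "s": "F", "z": "F", "h": "F",                                # fricatives
    "m": "N", "n": "N", "N": "N",                                # nasals
    "r": "L", "y": "L", "w": "L",                                # liquids and glides
}


def to_category_sequence(phonemes):
    """Convert a phoneme sequence into a phoneme category sequence."""
    return [PHONEME_CATEGORY[p] for p in phonemes]


def categories_match(phonemes_a, phonemes_b):
    """True when both sequences have the same category at every position."""
    return to_category_sequence(phonemes_a) == to_category_sequence(phonemes_b)
```

With this table, the made-up sequences `k a d o m a` and `t a g o n a` differ in three phonemes but share the category string `P V P V N V`, so the stored prosody of one could be considered for the other.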
[0018]
The present invention is also the above speech synthesis system, wherein the prosody information stored in the database includes information indicating prosodic features extracted from the same real voice.
[0019]
The present invention is also the above speech synthesis system, wherein the information indicating the prosodic features includes at least one of:
A fundamental frequency pattern indicating the temporal change of the fundamental frequency,
A voice intensity pattern indicating the temporal change of the voice intensity,
A phoneme duration pattern indicating the duration of each phoneme, and
Pause information indicating the presence or absence of a pause.
[0020]
The present invention is also the above speech synthesis system, wherein the database stores the prosody information for each prosody control unit.
[0021]
The present invention is also the above speech synthesis system, wherein the prosody control unit is any of:
An accent phrase,
A phrase composed of one or more accent phrases,
A clause,
A phrase composed of one or more clauses,
A word,
A phrase composed of one or more words,
A stress phrase, and
A phrase composed of one or more stress phrases.
[0022]
This makes it possible to easily produce an appropriate and natural synthesized speech.
[0023]
The present invention is also the above speech synthesis system, wherein:
The synthesized speech information and the key information each include a plurality of types of speech index information, which are the elements that determine the speech to be synthesized; and
The degree of coincidence between the synthesized speech information and the key information is obtained by weighting and combining the degrees of coincidence between each piece of speech index information in the synthesized speech information and the corresponding piece of speech index information in the key information.
[0024]
The present invention is also the above speech synthesis system, wherein the speech index information includes at least information substantially indicating any of a phoneme sequence, an accent position, the presence or absence, or the length, of a pause, and linguistic information indicating a linguistic attribute of the synthesized speech.
[0025]
The present invention is also the above speech synthesis system, wherein:
The speech index information includes information substantially indicating the phoneme sequence of the synthesized speech; and
The degree of coincidence between each piece of speech index information in the synthesized speech information and each piece of speech index information in the key information includes the similarity of acoustic feature values for each phoneme.
[0026]
The present invention is also the above speech synthesis system, wherein the speech index information substantially includes a phoneme category string indicating the phoneme category to which each phoneme of the synthesized speech belongs.
[0027]
The present invention is also the above speech synthesis system, wherein the degree of coincidence between each piece of speech index information in the synthesized speech information and each piece of speech index information in the key information includes the similarity of the phoneme category for each phoneme.
[0028]
This makes it possible to easily search for and modify appropriate prosody information.
[0029]
The present invention is also the above speech synthesis system, wherein the prosody information includes a plurality of types of prosodic feature information that characterize the synthesized speech.
[0030]
The present invention is also the above speech synthesis system, wherein the plurality of types of prosodic feature information are stored in the database as a set.
[0031]
The present invention is also the above speech synthesis system, wherein the plurality of types of prosodic feature information in the set are each extracted from the same real voice.
[0032]
The present invention is also the above speech synthesis system, wherein the prosodic feature information includes at least one of:
A fundamental frequency pattern indicating the temporal change of the fundamental frequency,
A voice intensity pattern indicating the temporal change of the voice intensity,
A phoneme duration pattern indicating the duration of each phoneme, and
Pause information indicating the presence or absence of a pause.
[0033]
The present invention is also the above speech synthesis system, wherein the phoneme time length pattern includes at least one of a phoneme time length pattern, a mora time length pattern, and a syllable time length pattern.
[0034]
The present invention is also the above speech synthesis system, wherein each type of prosodic feature information is searched for and transformed according to a degree of coincidence between the synthesized speech information and the key information that is computed with different weights for each type.
[0035]
The present invention is also the above speech synthesis system, wherein the search for the prosody information by the search means and the transformation of the prosody information by the transforming means are performed according to degrees of coincidence between the synthesized speech information and the key information that are computed with mutually different weights.
[0036]
The present invention is also the above speech synthesis system, wherein the search for the prosody information by the search means and the transformation of the prosody information by the transforming means are performed according to degrees of coincidence between the synthesized speech information and the key information that are computed with the same weights.
[0037]
The present invention is also the above speech synthesis system, wherein the transforming means transforms the prosody information retrieved by the search means based on the degree of coincidence for at least one of the following units:
For each phoneme,
Every mora,
Every syllable,
For each unit of speech waveform generation in the synthesis means, and
Each phoneme.
[0038]
The present invention is also the above speech synthesis system, wherein the degree of coincidence for each phoneme, for each mora, for each syllable, for each unit of speech waveform generation in the synthesis means, and for each phoneme is set based on at least one of:
A distance based on acoustic characteristics,
A distance determined by any of the manner of articulation, the place of articulation, and the duration, and
A distance based on a table of mishearings obtained by a listening experiment.
[0039]
Thereby, an appropriate transformation can be performed easily.
[0040]
The present invention is also the above speech synthesis system, wherein the acoustic characteristic is at least one of a fundamental frequency, an intensity, a time length, and a spectrum.
[0041]
The present invention is also the above speech synthesis system, wherein the database stores the key information and the prosody information for a plurality of languages.
[0042]
This makes it possible to easily produce synthesized speech that includes a plurality of languages.
[0043]
The present invention is also a speech synthesis method for outputting synthesized speech based on synthesized speech information indicating the speech to be synthesized, the method being characterized by:
Searching a database, in which prosody information extracted from real speech and used for speech synthesis is stored in correspondence with key information serving as a search key,
For the prosody information according to the degree of coincidence between the synthesized speech information and the key information;
Transforming the retrieved prosody information based on a transformation rule according to the degree of coincidence between the synthesized speech information and the key information; and
Outputting synthesized speech based on the synthesized speech information and the transformed prosody information.
[0044]
The present invention is also the above speech synthesis method, wherein:
The synthesized speech information and the key information each include a plurality of types of speech index information, which are the elements that determine the speech to be synthesized; and
The degree of coincidence between the synthesized speech information and the key information is obtained by weighting and combining the degrees of coincidence between each piece of speech index information in the synthesized speech information and the corresponding piece of speech index information in the key information.
[0045]
The present invention is also the above speech synthesis method, wherein the prosody information includes a plurality of types of prosodic feature information that characterize the synthesized speech.
[0046]
The present invention is also the above speech synthesis method, wherein each type of prosodic feature information is searched for and transformed according to a degree of coincidence between the synthesized speech information and the key information that is computed with different weights for each type.
[0047]
The present invention is also the above speech synthesis method, wherein the search for the prosody information and the transformation of the prosody information are performed according to degrees of coincidence between the synthesized speech information and the key information that are computed with mutually different weights.
[0048]
The present invention is also the above speech synthesis method, wherein the search for the prosody information and the transformation of the prosody information are performed according to degrees of coincidence between the synthesized speech information and the key information that are computed with the same weights.
[0049]
Thus, even when prosody information for which the synthesized speech information and the key information completely match is not stored in the database, speech synthesis is performed using similar prosody information, so that relatively appropriate and natural speech can be produced. Conversely, the storage capacity of the database can be reduced without impairing the naturalness of the synthesized speech. Further, when similar prosody information is used in this way, the prosody information is transformed according to the degree of similarity, so that more appropriate synthesized speech is produced.
[0050]
The present invention also provides a speech synthesis system that converts input text into synthesized speech and outputs the synthesized speech, the system comprising:
Language processing means for analyzing the input text and outputting a phonetic symbol string and linguistic information;
A prosody information database in which prosody information extracted from real speech is stored in correspondence with the phonetic symbol string and linguistic information of the corresponding synthesized speech;
Search means for searching the prosody information database for prosody information that corresponds to at least a part of a search item consisting of the phonetic symbol string and the linguistic information output from the language processing means;
Prosody transformation means for transforming the prosody information, which has been retrieved from the prosody information database and selected according to the degree of coincidence between the search item and the stored contents of the prosody information database, based on a transformation rule according to that degree of coincidence; and
Waveform generation means for generating a speech waveform based on the prosody information output from the prosody transformation means and the phonetic symbol string output from the language processing means.
[0051]
As a result, relatively appropriate and natural speech can be produced for any input text.
[0052]
BEST MODE FOR CARRYING OUT THE INVENTION
The contents of the present invention will be specifically described based on embodiments.
[0053]
(Embodiment 1)
FIG. 1 is a functional block diagram illustrating the configuration of the speech synthesis system according to the first embodiment.
In FIG. 1, the character string input unit 110 receives text, such as a mixed kanji-kana character string or a kana character string, as the information to be subjected to speech synthesis. Specifically, an input device such as a keyboard is used as the character string input unit 110.
[0054]
The language processing unit 120 performs preprocessing for the database search and other steps described later: it analyzes the input text and outputs, for example as shown in FIG. 2, a phonetic symbol string and linguistic information for each accent phrase. Here, the accent phrase serves as a convenient processing unit for speech synthesis and roughly corresponds to a grammatical phrase. The input text is, however, segmented so as to suit speech synthesis processing; for example, the digits of a number of two or more digits are each treated as one accent phrase. The phonetic symbol string indicates, for example, the phonemes (the units of speech utterance), the accent position, and the like by a character string of alphanumeric symbols. The linguistic information indicates, for example, the grammatical information (part of speech, etc.) and the semantic information (attribute of meaning, etc.) of the accent phrase.
[0055]
As shown in FIG. 3, for example, the prosody information database 130 stores, for each accent phrase, prosody information extracted from real speech in correspondence with a searched key. In the example shown in FIG. 3, the following are used as the searched key:
(A) Phoneme sequence
(B) Accent position
(C) Number of morae (beats)
(D) Pause length before and after the accent phrase
(E) Grammatical information and semantic information
The following are used as the prosody information:
(A) Fundamental frequency pattern
(B) Voice intensity pattern
(C) Phoneme duration pattern
Here, it is preferable that the items of prosody information be extracted from the same real voice in order to produce natural synthesized speech. The number of morae need not be stored in the prosody information database 130 in advance; it may be counted from the phoneme sequence each time a search is performed. In the example of FIG. 3, the pause length before and after the accent phrase also serves as information indicating whether the accent phrase is at the beginning or end of a sentence. As a result, even when the same accent phrase is uttered with a different intensity depending on its position in the sentence, the cases can be distinguished in the search and appropriate speech can be synthesized. The present invention is not limited to this, however: the searched key may include only the pause length, or information indicating the beginning and end of a sentence may be used as a separate searched key.
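One way to picture a record of the prosody information database 130 is the following sketch. The class and field names, the three-stage pause labels, and the units are assumptions chosen for illustration; the source only specifies which items make up the searched key and the prosody information.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SearchedKey:
    phonemes: List[str]    # (A) phoneme sequence
    accent_position: int   # (B) accent position
    mora_count: int        # (C) number of morae (could also be derived at search time)
    pause_before: str      # (D) staged pause label: "long", "short" or "none" (assumed labels)
    pause_after: str
    grammar: str           # (E) grammatical information, e.g. a part of speech
    semantics: str         #     semantic information


@dataclass
class ProsodyInfo:
    f0_pattern: List[float]         # fundamental frequency over time (assumed Hz)
    intensity_pattern: List[float]  # voice intensity over time (assumed arbitrary units)
    phoneme_durations: List[float]  # one duration per phoneme (assumed milliseconds)


@dataclass
class ProsodyEntry:
    key: SearchedKey
    prosody: ProsodyInfo  # all three patterns taken from the same real utterance
```

Keeping the three patterns together in one `ProsodyInfo` mirrors the preference stated above that they come from the same real voice.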
[0056]
The prosody information search unit 140 searches the prosody information database 130 based on the output of the language processing unit 120 and outputs the retrieved prosody information. This search is a so-called fuzzy search. That is, even if a search key, such as a phoneme string based on the output from the language processing unit 120, does not completely match a searched key in the prosody information database 130, entries that match to some extent are taken as search candidates. From these candidates, the one with the highest degree of matching (the one with the smallest approximate cost, which corresponds to the difference between the search key and the searched key) is selected by, for example, a minimum cost method. Thus, even when the search key and the searched key do not completely match, using the prosody information of a similar accent phrase makes it possible to produce more natural speech than when the prosody information is generated by a generation rule.
[0057]
The prosody information transformation unit 150 transforms the prosody information retrieved by the prosody information search unit 140, based on the approximate cost obtained during that search and on the transformation rules stored in the prosody information transformation rule storage unit 160 described later. That is, when the search key matches the searched key in the search by the prosody information search unit 140, the most appropriate speech synthesis can be performed with the retrieved prosody information as it is. Otherwise, because the prosody information of a similar accent phrase is used as described above, the lower the degree of matching between the two keys (the higher the approximate cost), the more the synthesized speech may deviate from the appropriate speech. Therefore, by applying a predetermined transformation to the retrieved prosody information according to the approximate cost, more appropriate synthesized speech can be produced.
[0058]
The prosody information transformation rule storage unit 160 holds the transformation rules for transforming the prosody information according to the approximate cost.
[0059]
The waveform generation unit 170 synthesizes a speech waveform based on the phonetic symbol string output from the language processing unit 120 and the prosody information output from the prosody information transformation unit 150, and outputs an analog audio signal.
[0060]
The electroacoustic transducer 180, such as a speaker or headphones, converts the analog audio signal into sound.
[0061]
Next, the speech synthesis operation of the speech synthesis system configured as described above will be described.
[0062]
(1) When text to be converted into speech is input to the character string input unit 110, the language processing unit 120 analyzes the input text, divides it into accent phrases, and outputs a phonetic symbol string and linguistic information as shown in FIG. 2. Specifically, for example, when a mixed kanji-kana character string is input, it is divided into accent phrases and converted into its reading using a conversion dictionary such as a kanji dictionary (not shown), and a phonetic symbol string is generated that also represents the accents and the presence and length of pauses. In the example phonetic symbol string of FIG. 2, the following information is indicated by alphanumeric symbols.
[0063]
(A) Alphabetic characters: phonemes ("N" indicates the moraic nasal)
(B) "'": Accent position
(C) "/": Accent phrase delimiter
(D) "cl": silent section
(E) Number: Pause length
Although not shown in the figure, information indicating the boundaries between phrases and sentences may also be included. The representation of the phonetic symbol string is not limited to the above; a phoneme string, a numerical value indicating the accent position, and the like may be output as separate items of information. The linguistic information (grammatical information, semantic information) may include not only the part of speech and the meaning but also the inflected form, the presence or absence of dependency, the importance in ordinary sentences, and the like. Also, the linguistic information is not limited to character strings such as "noun" or "continuative form" as shown in FIG. 2.
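A minimal parsing sketch for a phonetic symbol string in the notation listed above is shown below. The exact string format of FIG. 2 is not reproduced here, so the example input, the rule that every letter is one phoneme except the two-letter token `cl`, and the function names are all assumptions for illustration.

```python
import re

# "cl" must be tried before single letters so it is kept as one token.
TOKEN = re.compile(r"cl|[A-Za-z]|'|\d+")


def parse_accent_phrase(text):
    """Split one accent phrase into phonemes, accent position and pause length."""
    phonemes, accent, pause = [], None, None
    for tok in TOKEN.findall(text):
        if tok == "'":
            accent = len(phonemes)  # counted in phonemes before the mark (assumption)
        elif tok.isdigit():
            pause = int(tok)        # pause length in the notation's own units
        else:
            phonemes.append(tok)    # a phoneme, or "cl" for a silent section
    return {"phonemes": phonemes, "accent": accent, "pause": pause}


def parse_symbol_string(symbols):
    """Parse a whole string whose accent phrases are delimited by '/'."""
    return [parse_accent_phrase(p) for p in symbols.split("/") if p]
```

For a made-up input such as `ka'domasi/ni2`, this yields two accent phrases, the first with an accent mark after two phonemes and the second followed by a pause of length 2. Separating the phoneme string from the accent and pause values is the same decomposition the search step needs for its keys.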
[0064]
(2) The prosody information search unit 140 searches the prosody information database 130 based on the phonetic symbol string and the linguistic information for each accent phrase output from the language processing unit 120, and outputs the retrieved prosody information together with an approximate cost described in detail later. More specifically, when a phonetic symbol string in the above notation is output from the language processing unit 120, the phoneme string, a numerical value indicating the accent position, the number of morae, and the like are first obtained from the phonetic symbol string, and the prosody information database 130 is searched using these as search keys. If a searched key that completely matches the search key exists in the prosody information database 130, the prosody information corresponding to that searched key may be used as the search result. If not, entries that match to some extent (for example, entries whose phoneme string matches but whose semantic information does not, or whose phoneme string matches but whose accent position and number of morae do not) are first taken as search candidates, and the candidate with the highest degree of matching between the search key and the searched key is then selected as the search result.
[0065]
The selection can be made, for example, by a minimum cost method using approximate costs. Specifically, first, the approximate cost C is obtained as follows.
(Equation 1)
[0066]
C = a1 · D1 + a2 · D2 + a3 · D3 + a4 · D4 + a5 · D5 + a6 · D6 + a7 · D7
Here, the above a1, D1, etc. are as follows.
[0067]
D1: Number of phonemes that do not match in the phoneme sequence
D2: Difference of accent position
D3: Difference of mora number
D4: Whether the immediately preceding pause length matches (whether it falls within the range of the searched key)
D5: Whether the immediately following pause length matches (whether it falls within the range of the searched key)
D6: Whether, or to what degree, the grammatical information matches
D7: Whether, or to what degree, the semantic information matches
a1 to a7: Weighting coefficients for D1 to D7, obtained by a statistical method or by learning so as to reflect the degree to which each of D1 to D7 contributes to the selection of appropriate prosody information.
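As a small worked sketch of (Equation 1), the function below forms the weighted sum of the seven difference terms. The weight values used in the usage note are made up; in the source they are obtained by a statistical method or learning. The helper `count_phoneme_mismatches` is one simple, assumed way to obtain D1.

```python
def count_phoneme_mismatches(search_phonemes, stored_phonemes):
    """A simple D1: positions that differ, plus any difference in length."""
    differing = sum(1 for a, b in zip(search_phonemes, stored_phonemes) if a != b)
    return differing + abs(len(search_phonemes) - len(stored_phonemes))


def approximate_cost(differences, weights):
    """C = a1*D1 + ... + a7*D7 for matching lists of D values and coefficients."""
    if len(differences) != len(weights):
        raise ValueError("differences and weights must have the same length")
    return sum(a * d for a, d in zip(weights, differences))
```

For example, with hypothetical weights `[2, 1, 1, 1, 1, 0.5, 0.5]`, a candidate that differs in one phoneme and in grammatical information (`D = [1, 0, 0, 0, 0, 1, 0]`) has cost 2.5, while an exact match has cost 0.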
[0068]
Note that D1 to D7 are not limited to the above; various quantities can be used as long as they indicate the degree of coincidence between the search key and the searched key. For example, D1 may take different values depending on whether the unmatched phonemes are similar to each other, where the unmatched phonemes are located, whether the unmatched phonemes are consecutive, and so on. As for D4 and D5, when the pause length is expressed in stages such as long, short, or none as shown in FIG. 3, the match may be represented by 0 or 1, or the difference between stages may be represented numerically; when the pause length is expressed as a time value, the time difference may be used. As for D6 and D7, whether the grammatical information and the semantic information match may be represented by 0 or 1, or a table with the search key and the searched key as parameters may be used to obtain the degree of coincidence for each combination (for example, the degree of coincidence between a noun and a verb is low, while that between a particle and an auxiliary verb is high). Alternatively, a thesaurus may be used to determine the degree of similarity in meaning.
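The staged pause difference and the table-based grammatical difference mentioned above might look like the sketch below. The stage numbering, the table entries, and their numeric values are illustrative assumptions; only the ordering (noun versus verb is a poor match, particle versus auxiliary verb a closer one) follows the text.

```python
# Stages of pause length, ordered so that a difference of stages is meaningful.
PAUSE_STAGE = {"none": 0, "short": 1, "long": 2}

# Hypothetical mismatch values per unordered pair of parts of speech.
POS_MISMATCH = {
    frozenset(["noun", "verb"]): 1.0,                # low degree of coincidence
    frozenset(["particle", "auxiliary verb"]): 0.2,  # high degree of coincidence
}


def pause_difference(search_pause, stored_pause):
    """D4 or D5 expressed as the number of stages between two pause labels."""
    return abs(PAUSE_STAGE[search_pause] - PAUSE_STAGE[stored_pause])


def grammar_difference(search_pos, stored_pos, default=1.0):
    """D6 looked up in a table; identical parts of speech cost nothing."""
    if search_pos == stored_pos:
        return 0.0
    return POS_MISMATCH.get(frozenset([search_pos, stored_pos]), default)
```

Using `frozenset` keys makes the lookup symmetric, so the table does not need separate entries for (search key, searched key) and (searched key, search key).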
[0069]
The approximate cost described above is calculated for each search candidate, and the candidate with the smallest approximate cost is selected as the search result. Thus, even when prosody information whose searched key completely matches the search key is not stored in the prosody information database 130, relatively appropriate and natural speech can be produced using similar prosody information.
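The minimum cost selection can be sketched as below. Candidates are represented here as dictionaries holding a label and precomputed difference terms D1 to D7, which is an assumed representation; the function returns the chosen candidate together with its cost, since the cost is also needed by the later transformation step.

```python
def select_min_cost(candidates, weights):
    """Return (best_candidate, cost) for the candidate with the smallest cost."""
    if not candidates:
        raise ValueError("no search candidates")

    def cost(candidate):
        return sum(a * d for a, d in zip(weights, candidate["diffs"]))

    best = min(candidates, key=cost)
    return best, cost(best)
```

A short usage example with made-up values:

```python
weights = [2, 1, 1, 1, 1, 0.5, 0.5]
candidates = [
    {"label": "candidate A", "diffs": [3, 0, 0, 0, 0, 0, 0]},  # cost 6
    {"label": "candidate B", "diffs": [0, 1, 1, 0, 0, 1, 0]},  # cost 2.5
]
best, best_cost = select_min_cost(candidates, weights)
```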
[0070]
(3) The prosody information transformation unit 150 transforms the prosody information (fundamental frequency pattern, voice intensity pattern, phoneme time length pattern) output as the search result from the prosody information search unit 140, using the rules stored in the prosody information transformation rule storage unit 160 according to the approximate cost output from the prosody information search unit 140. Specifically, for example, when a transformation rule that compresses the dynamic range of the fundamental frequency pattern is applied, the fundamental frequency pattern is transformed as shown in FIG. 4.
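As an illustration of compressing the dynamic range of a fundamental frequency pattern, the sketch below scales each value toward the mean of the pattern. Compressing around the mean on a linear frequency scale is an assumption made for this example; the source does not state the reference level or the scale used.

```python
def compress_dynamic_range(f0_pattern, ratio):
    """Scale deviations from the mean by `ratio` (ratio < 1 compresses)."""
    if not f0_pattern:
        return []
    mean = sum(f0_pattern) / len(f0_pattern)
    return [mean + (value - mean) * ratio for value in f0_pattern]
```

With `ratio = 0.5`, a pattern of 100, 200, 300 becomes 150, 200, 250: the overall contour is kept while its range is halved. A ratio of 1.0 leaves the pattern unchanged, which corresponds to the case where the retrieved prosody is used without transformation.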
[0071]
The transformation according to the approximate cost has the following meaning. For example, as shown in FIG. 5, if the prosody information of "Nagoya City" is retrieved for the input text "Kadoma City", the phoneme strings differ but the other search items match. Since the degree of matching is high (the approximate cost is small), appropriate speech synthesis can be performed by using the prosody information of "Nagoya City" without transformation. On the other hand, if "Naru-nanda" is retrieved for "5 minutes", then to obtain appropriate synthesized speech for "5 minutes" it is generally desirable, considering only the difference in part of speech, to slightly reduce the voice intensity pattern of "Naru-nen". Considering the phrase information (for example, the importance of the meaning), however, numbers are often uttered with high intensity, so it is desirable to increase the voice intensity pattern of "Naru-nen" to some extent; overall, it is desirable to slightly increase the voice intensity pattern of "Naru-nen". Since such an overall degree of transformation correlates with the approximate cost, the degree of transformation (transformation ratio or the like) corresponding to the approximate cost is stored in the prosody information transformation rule storage unit 160 as a transformation rule, whereby appropriate synthesized speech can be obtained. The transformation of the prosody information is not limited to one applied uniformly over the entire elapsed time as shown in FIG. 4; the degree of transformation may vary with elapsed time. As a specific storage format, the transformation rule may be a coefficient for converting the approximate cost into a transformation ratio, or a table that takes the approximate cost as a parameter and gives the transformation ratio or transformation pattern.
Note that the approximate cost used for the transformation need not be the same as the approximate cost used for the search; a separately obtained value may be used, and different values may be used for the fundamental frequency pattern, the voice intensity pattern, and the phoneme time length pattern. Further, for example, when each term of (Equation 1) can take a negative value, the sum of the absolute values of the terms may be used as the approximate cost for the search (zero or positive), and the plain sum of the terms may be used as the approximate cost for the transformation (possibly negative).
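The two storage formats for a transformation rule (a coefficient, or a table indexed by cost) and the distinction between a search cost and a signed transformation cost can be sketched as follows. The thresholds, ratios, and the coefficient `k` are invented values; the source only states that such rules are stored, not what they contain.

```python
import bisect


def search_cost(terms):
    """Cost for candidate selection: sum of absolute values, never negative."""
    return sum(abs(t) for t in terms)


def transformation_cost(terms):
    """Cost for transformation: plain sum, so its sign can pick the direction."""
    return sum(terms)


def ratio_from_coefficient(cost, k=0.05):
    """Rule stored as a coefficient: ratio = 1 + k * cost (k is hypothetical)."""
    return 1.0 + k * cost


# Rule stored as a table: cost boundaries and the ratio used in each interval.
COST_BOUNDS = [-1.0, 1.0]
RATIOS = [0.9, 1.0, 1.1]


def ratio_from_table(cost):
    """Look up the ratio for the interval that contains `cost`."""
    return RATIOS[bisect.bisect_right(COST_BOUNDS, cost)]
```

Here a negative transformation cost leads to a ratio below 1 (for example, lowering a voice intensity pattern) and a positive one to a ratio above 1, while the search cost for the same terms stays non-negative so that candidates can still be ranked.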
[0072]
(4) The waveform generation unit 170 synthesizes a speech waveform based on the phonetic symbol string output from the language processing unit 120 and the prosody information transformed by the prosody information transformation unit 150,
that is, based on the phoneme sequence and the pause length together with the fundamental frequency pattern, the voice intensity pattern, and the phoneme time length pattern, and outputs an analog audio signal. The electroacoustic transducer 180 emits the synthesized speech from this analog audio signal.
[0073]
As described above, even when prosody information for which the search key and the searched key completely match is not stored in the prosody information database 130, speech synthesis is performed using similar prosody information, so that relatively appropriate and natural speech can be produced. Conversely, the storage capacity of the prosody information database 130 can be reduced without impairing the naturalness of the synthesized speech. Further, when similar prosody information is used in this way, the prosody information is transformed according to the degree of similarity, so that more appropriate synthesized speech is produced.
[0074]
(Embodiment 2)
As a speech synthesis system according to the second embodiment, an example of a speech synthesis system in which pause lengths before and after an accent phrase are stored as prosody information in a prosody information database will be described. In the following embodiments, components having the same functions as those in the first embodiment and the like will be denoted by the same or corresponding reference numerals, and detailed description thereof will be omitted.
[0075]
FIG. 6 is a functional block diagram illustrating a configuration of the speech synthesis system according to the second embodiment. This speech synthesis system differs from the speech synthesis system of the first embodiment in the following points.
[0076]
(A) Unlike the language processing unit 120, the language processing unit 220 outputs a phonetic symbol string that does not include pause information.
[0077]
(B) In the prosody information database 230, as shown in FIG. 7, pause information is stored as prosody information rather than as a searched key, unlike in the prosody information database 130. In practice, a database with the same data structure as the prosody information database 130 may be used, with the pause length handled as prosody information at search time.
[0078]
(C) The prosody information search unit 240 performs the search by comparing a search key and a searched key that do not include pause information, and outputs the pause information as prosody information in addition to the fundamental frequency pattern, the voice intensity pattern, and the phoneme time length pattern.
[0079]
(D) The prosody information transformation unit 250 also transforms the pause information according to the approximate cost, in the same way as the fundamental frequency pattern and the like.
[0080]
(E) The prosody information transformation rule storage unit 260 holds a pause length transformation rule together with the fundamental frequency pattern transformation rule and the like.
[0081]
As described above, by using the pause information retrieved from the prosody information database 230, a synthesized voice having a more natural pause length can be produced. Further, the load of the input text analysis processing in the language processing unit 220 can be reduced.
[0082]
Note that, as in the first embodiment, pause information output from the language processing unit may also be used as a search key at search time, which makes it easy to improve the search accuracy. In this case, the prosody information database may store the pause information used as the searched key and the pause information used as prosody information separately, or may share one item for both. When pause information is both output from the language processing unit and stored in the prosody information database, which of the two is used for speech synthesis may be selected according to the analysis accuracy of the language processing unit and the reliability of the pause information retrieved from the prosody information database; the choice may also be made according to the approximate cost (the certainty of the search result).
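One simple way to choose between the two sources of pause information according to the approximate cost is sketched below. The threshold value and the rule "trust the database when the cost is low" are assumptions for illustration; the source leaves the selection criterion open.

```python
def choose_pause(analyzed_pause, retrieved_pause, approx_cost, cost_threshold=1.0):
    """Pick the pause length to use for synthesis.

    analyzed_pause:  pause from the language processing unit (or None)
    retrieved_pause: pause stored with the retrieved prosody (or None)
    approx_cost:     approximate cost of the search result (lower = more certain)
    """
    if analyzed_pause is None:
        return retrieved_pause
    if retrieved_pause is None:
        return analyzed_pause
    # A low cost means the retrieved accent phrase is a close match,
    # so its pause, taken from real speech, is preferred.
    return retrieved_pause if approx_cost <= cost_threshold else analyzed_pause
```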
[0083]
(Embodiment 3)
As a speech synthesis system according to the third embodiment, an example will be described in which the search for and transformation of prosody information are performed based on a different approximate cost for each of the fundamental frequency pattern and the other patterns.
[0084]
FIG. 8 is a functional block diagram illustrating a configuration of the speech synthesis system according to the third embodiment. This speech synthesis system is different from the speech synthesis system of the first embodiment in the following points.
[0085]
(A) In place of the prosody information search unit 140, a fundamental frequency pattern search unit 341, a voice intensity pattern search unit 342, and a phoneme time length pattern search unit 343 are provided.
[0086]
(B) In place of the prosody information transforming section 150, a fundamental frequency pattern transforming section 351, a voice intensity pattern transforming section 352, and a phoneme time length pattern transforming section 353 are provided.
[0087]
The search units 341 to 343 and the deformation units 351 to 353 independently search for (select search candidates for) or transform the fundamental frequency pattern, the voice intensity pattern, and the phoneme time length pattern, respectively, using the approximate costs obtained by (Equation 2) to (Equation 4) below.
[0088]
(Equation 2) (Search and deformation of the fundamental frequency pattern)
C = b1 · D1 + b2 · D2 + b3 · D3 + b4 · D4 + b5 · D5 + b6 · D6 + b7 · D7
[0089]
(Equation 3) (Search and deformation of the voice intensity pattern)
C = c1 · D1 + c2 · D2 + c3 · D3 + c4 · D4 + c5 · D5 + c6 · D6 + c7 · D7
[0090]
(Equation 4) (Search and deformation of the phoneme time length pattern)
C = d1 · D1 + d2 · D2 + d3 · D3 + d4 · D4 + d5 · D5 + d6 · D6 + d7 · D7
Here, D1 to D7 are the same as in (Equation 1) of the first embodiment, but the weighting coefficients b1 to b7, c1 to c7, and d1 to d7, unlike a1 to a7 in (Equation 1), are obtained by a statistical method or by learning so that an appropriate fundamental frequency pattern, voice intensity pattern, or phoneme time length pattern is selected. For example, because fundamental frequency patterns are generally the same when the accent position and the number of mora are the same, the coefficients b2 and b3 are set larger than the coefficients a2 and a3 of (Equation 1). Because the presence or absence and the length of a pause contribute strongly to the voice intensity pattern, the coefficients c4 and c5 are set larger than the coefficients a4 and a5. Similarly, because the arrangement of the phoneme string contributes strongly to the phoneme time length pattern, the coefficient d1 is set larger than the coefficient a1.
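The per-pattern costs of (Equation 2) to (Equation 4) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: the weight values, the distance values D1 to D7, and the names `WEIGHTS`, `approximate_cost`, `select_candidate`, and `CANDIDATES` are hypothetical. The only property taken from the text is that the weights on D2 and D3, on D4 and D5, and on D1 are emphasized for the fundamental frequency, voice intensity, and phoneme time length patterns, respectively.

```python
from typing import Dict, List, Sequence

# Hypothetical weights for D1..D7 (real values would come from statistics or
# learning). Only the emphasized positions follow the text.
WEIGHTS: Dict[str, List[float]] = {
    "f0":        [1.0, 3.0, 3.0, 1.0, 1.0, 1.0, 1.0],  # b1..b7: b2, b3 emphasized
    "intensity": [1.0, 1.0, 1.0, 3.0, 3.0, 1.0, 1.0],  # c1..c7: c4, c5 emphasized
    "duration":  [3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # d1..d7: d1 emphasized
}


def approximate_cost(weights: Sequence[float], distances: Sequence[float]) -> float:
    """Weighted sum C = w1*D1 + ... + wn*Dn for one prosody pattern type."""
    if len(weights) != len(distances):
        raise ValueError("weights and distances must have the same length")
    return sum(w * d for w, d in zip(weights, distances))


def select_candidate(pattern: str, candidates: Dict[str, Sequence[float]]) -> str:
    """Return the id of the entry with the smallest cost for the given pattern."""
    return min(
        candidates,
        key=lambda cid: approximate_cost(WEIGHTS[pattern], candidates[cid]),
    )


# Hypothetical distances D1..D7 between the search key and three stored entries.
CANDIDATES: Dict[str, List[float]] = {
    "entry_a": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "entry_b": [0.0, 0.5, 0.0, 0.2, 0.0, 0.0, 0.0],
    "entry_c": [0.4, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
}

# Each pattern type is searched independently with its own weights.
best = {pattern: select_candidate(pattern, CANDIDATES) for pattern in WEIGHTS}
print(best)
```

With these illustrative numbers each pattern type selects a different entry, which is the situation in which searching the three patterns independently makes a difference.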
[0091]
As described above, because the search and deformation of the fundamental frequency pattern and the like are performed independently using different approximate costs, well-balanced search and deformation are possible, and speech synthesis can be performed based on the optimum fundamental frequency pattern, voice intensity pattern, and phoneme time length pattern, respectively. Further, the prosody information database 130 need not store the fundamental frequency pattern, the voice intensity pattern, and the phoneme time length pattern as a set; for example, it suffices to store, for each kind of pattern, only as many patterns as there are distinct types. Synthesized speech with good sound quality can therefore be produced even with a prosody information database 130 of small storage capacity.
[0092]
(Embodiment 4)
A speech synthesis system according to the fourth embodiment will be described.
[0093]
FIG. 9 is a functional block diagram showing a configuration of the speech synthesis system according to the fourth embodiment. This speech synthesis system mainly has the following features.
[0094]
(A) Unlike Embodiments 1 to 3, processing such as the search and modification of prosody information is performed in phrase units rather than in accent phrase units. Here, a phrase, also referred to as a section or a breath group (exhalation paragraph), is a group of one or more accent phrases that is usually delimited when uttered (as where a punctuation mark occurs).
[0095]
(B) As in the second embodiment, a prosody information database 430 in which pause information is stored as prosody information, and a prosody information deformation rule storage section 460 in which a pause length change rule is stored together with a fundamental frequency pattern deformation rule and the like, are provided. However, these differ from the prosody information database 230 and the prosody information deformation rule storage unit 260 of the second embodiment in that the prosody information and the deformation rules are stored in phrase units as shown in FIG. .
[0096]
(C) As in the third embodiment, the search and modification of the prosody information are performed using a different approximate cost for each of the fundamental frequency pattern and the other patterns. The search for the pause information and the change of the pause length are likewise performed independently.
[0097]
(D) In addition to being modified in accordance with the approximate cost as in the first to third embodiments, the prosody information is also modified in accordance with the degree of matching (including the presence or absence of a match) of each phoneme in the phoneme string between the search key and the searched key.
[0098]
Hereinafter, this will be described in more detail.
[0099]
Like the language processing unit 120 of the first embodiment, the language processing unit 420 analyzes the text input from the character string input unit 110 and separates it into accent phrases. It then outputs a phonetic symbol string and linguistic information for each phrase, a phrase being a predetermined group of accent phrases.
[0100]
The prosody information database 430 stores the prosody information in units of phrases as described above, and accordingly, as shown in FIG. 10, the number of accent phrases included in each phrase is also stored as a key to be searched. The pause information stored as prosody information may include not only the pause lengths before and after the phrase but also the pause lengths before and after each accent phrase.
[0101]
Because they search for prosody information in phrase units, the fundamental frequency pattern search unit 441, the voice intensity pattern search unit 442, the phoneme time length pattern search unit 443, and the pause information search unit 444 also take into account, in the approximate cost, the number of accent phrases included in the phrase. The search units other than the pause information search unit 444 output the degree of matching of each phoneme in the phoneme string between the search key and the searched key, together with the retrieved fundamental frequency pattern or the like and the approximate cost. The pause information search unit 444 outputs the degree of coincidence of, for example, the number of mora and the accent position for each accent phrase, together with the pause information and the approximate cost.
[0102]
Like the prosody information deformation unit 150 of the first to third embodiments, the fundamental frequency pattern deformation unit 451, the voice intensity pattern deformation unit 452, and the phoneme time length pattern deformation unit 453 use the rules stored in the prosody information deformation rule storage unit 460 to transform the prosody information in accordance with the approximate cost output from the fundamental frequency pattern search unit 441 and the like. In addition, they transform the prosody information in accordance with the degree of coincidence of each phoneme in the phoneme string between the search key and the searched key. That is, for example, when prosody information of a word in which only some phonemes differ from “kana” is used for “kana”, the voice intensity pattern for the differing phonemes is deformed as in the portion indicated by the symbol P in FIG. , which makes it easy to make the influence of the phoneme difference less noticeable. Note that the modification according to the degree of coincidence for each phoneme need not always be performed; conversely, the modification according to the degree of coincidence for each phoneme may be performed without the modification according to the approximate cost.
[0103]
The pause length changing unit 454 uses the rules stored in the prosody information deformation rule storage unit 460 to change the pause length according to the approximate cost output from the pause information search unit 444 and, furthermore, according to the degree of coincidence of, for example, the number of mora and the accent position for each accent phrase.
[0104]
As described above, by performing the search and modification of the prosody information in phrase units, more natural synthesized speech that follows the flow of the sentence can be uttered. Further, by using the pause information retrieved from the prosody information database 430 as in the second embodiment, synthesized speech with more natural pause lengths can be produced. By independently performing the search and deformation of the fundamental frequency pattern and the like using different approximate costs, as in the third embodiment, speech synthesis can be performed based on the optimum fundamental frequency pattern and the like, and the storage capacity of the prosody information database 430 can easily be reduced. In addition, by modifying the fundamental frequency pattern and the like according to the degree of coincidence for each phoneme, the effects of phoneme differences can be made less noticeable, and by changing the pause length according to the degree of coincidence of, for example, the number of mora and the accent position for each accent phrase, synthesized speech with more natural pause lengths can be produced.
[0105]
(Embodiment 5)
As the speech synthesis system according to the fifth embodiment, an example will be described in which a phoneme category sequence is used for searching for prosodic information.
[0106]
FIG. 11 is a functional block diagram showing the configuration of the speech synthesis system according to the fifth embodiment. FIG. 12 is an explanatory diagram illustrating an example of a phoneme category.
[0107]
Here, a phoneme category is obtained by grouping phonemes according to a distance obtained from the phonetic features of the phonemes, that is, according to the articulation method, the articulation position, and the duration of each phoneme. In other words, phonemes in the same phoneme category have similar acoustic characteristics. For example, an accent phrase and another accent phrase in which some of its phonemes are replaced with other phonemes of the same phoneme category often have the same or relatively similar prosody information. Therefore, in the search for prosody information, even if the phoneme strings do not match, appropriate synthesized speech can often be produced by reusing the prosody information as long as the phoneme category of each phoneme matches. Note that the grouping of phonemes is not limited to the above. For example, as shown in FIG. 12, phonemes may be grouped according to the distance (psychological distance) between phonemes determined by multivariate analysis or the like from a phoneme confusion (mishearing) table. Phonemes may also be grouped according to the similarity of their physical characteristics (fundamental frequency, intensity, time length, spectrum, and the like). Alternatively, prosodic patterns may be grouped using a statistical method such as multivariate analysis, and phonemes may then be grouped using a statistical method so as to best reflect the groups of prosodic patterns.
[0108]
Hereinafter, a specific description will be given. The speech synthesis system of the fifth embodiment differs from the speech synthesis system of the first embodiment in that it includes a prosody information database 730 instead of the prosody information database 130 and further includes a phoneme category sequence generation unit 790.
[0109]
In the prosody information database 730, in addition to the contents stored in the prosody information database 130 of the first embodiment, a phoneme category sequence indicating the phoneme category to which each phoneme of the accent phrase belongs is stored as a key to be searched. As a specific notation, the phoneme category sequence may be expressed, for example, as a sequence of numbers or symbols assigned to each phoneme category, or, with any one phoneme in each phoneme category set as a representative phoneme, as a sequence of representative phonemes.
[0110]
The phoneme category sequence generation unit 790 converts a phonetic symbol sequence for each accent phrase output from the language processing unit 120 into a phoneme category sequence and outputs it.
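The conversion performed by the phoneme category sequence generation unit 790 can be sketched as a table lookup. This is a minimal illustration only: the category table below is hypothetical (it is not the table of FIG. 12, whose contents are not reproduced in the text), as are the single-letter category symbols and the names `PHONEME_CATEGORY` and `to_category_sequence`.

```python
from typing import Dict, List

# Hypothetical category table: phonemes grouped roughly by manner of
# articulation. The actual categories of the embodiment may differ.
PHONEME_CATEGORY: Dict[str, str] = {
    "a": "V", "i": "V", "u": "V", "e": "V", "o": "V",  # vowels
    "p": "P", "t": "P", "k": "P",                      # voiceless plosives
    "b": "B", "d": "B", "g": "B",                      # voiced plosives
    "s": "F", "h": "F",                                # fricatives
    "m": "N", "n": "N",                                # nasals
}


def to_category_sequence(phonemes: List[str]) -> List[str]:
    """Map each phoneme to its category symbol (KeyError on an unknown phoneme)."""
    return [PHONEME_CATEGORY[p] for p in phonemes]


kata = to_category_sequence(["k", "a", "t", "a"])
pata = to_category_sequence(["p", "a", "t", "a"])
print(kata, pata)
```

In this sketch the two phoneme strings differ in their first phoneme but yield the same category sequence, so a search keyed on the category sequence would treat them as matching.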
[0111]
The prosody information search unit 740 searches the prosody information database 730 based on the phoneme category sequence output from the phoneme category sequence generation unit 790 and on the phonetic symbol string and language information for each accent phrase output from the language processing unit 120, and outputs the retrieved prosody information and the approximate cost. Because the approximate cost includes the degree of matching of the phoneme category sequences (for example, the similarity of the phoneme category for each phoneme), it takes a small value when the phoneme category sequences match even if the phoneme strings do not. More appropriate prosody information is therefore retrieved (selected), and natural synthesized speech is uttered. Further, the search speed is easily improved by, for example, first narrowing down the search candidates to those having the same or a similar phoneme category sequence.
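The two ideas in this paragraph, a cost term for the phoneme category sequence and a first narrowing step, can be sketched as follows. This is an illustration under assumptions, not the cost of the embodiment: only two terms are modeled (the remaining terms of the approximate cost are omitted), and the mismatch measure, the weights `w_phoneme` and `w_category`, and the names `mismatch_ratio` and `search` are hypothetical.

```python
from typing import List, Sequence, Tuple

# (entry id, phoneme sequence, phoneme category sequence)
Entry = Tuple[str, List[str], List[str]]


def mismatch_ratio(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of positions that differ; 1.0 if lengths differ or input is empty."""
    if len(a) != len(b) or not a:
        return 1.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def search(
    key_phonemes: List[str],
    key_categories: List[str],
    database: List[Entry],
    w_phoneme: float = 1.0,   # hypothetical weight
    w_category: float = 2.0,  # hypothetical weight
) -> Tuple[str, float]:
    """Return (entry id, cost) of the lowest-cost entry."""
    # Narrow to entries with an identical category sequence; if none exist,
    # fall back to scoring the whole database.
    narrowed = [e for e in database if e[2] == key_categories] or list(database)

    def cost(entry: Entry) -> float:
        return (w_phoneme * mismatch_ratio(key_phonemes, entry[1])
                + w_category * mismatch_ratio(key_categories, entry[2]))

    best = min(narrowed, key=cost)
    return best[0], cost(best)


DATABASE: List[Entry] = [
    ("e1", ["k", "a", "t", "a"], ["P", "V", "P", "V"]),
    ("e2", ["s", "a", "n", "a"], ["F", "V", "N", "V"]),
]
result = search(["p", "a", "t", "a"], ["P", "V", "P", "V"], DATABASE)
print(result)
```

Here the key shares no exact phoneme string with any entry, yet "e1" is selected at low cost because its category sequence matches.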
[0112]
Note that, in the above example, the phonetic symbol sequence output from the language processing unit 120 is converted into a phoneme category sequence by the phoneme category sequence generation unit 790, but the present invention is not limited to this. The language processing unit 120 may be provided with a function of generating the phoneme category sequence, or the prosody information search unit 740 may have a function of converting an input phonetic symbol string into a phoneme category sequence. If the prosody information search unit 740 has a function of converting a phoneme sequence read from the prosody information database into a phoneme category sequence, a prosody information database that does not store phoneme category sequences, like the prosody information database 130 of the first embodiment, can also be used.
[0113]
In addition, instead of using both the phoneme string and the phoneme category sequence as search keys, only the phoneme category sequence may be used. In this case, entries of prosody information that differ only in the phoneme sequence can be merged, which makes it easy to reduce the capacity of the database and improve the search speed.
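The merging of entries that differ only in the phoneme sequence can be sketched as grouping by the category-sequence key. This is a simplified illustration: a real key would also carry the other search items (accent position, number of mora, and so on), and the policy of keeping the first entry per key, like the names `merge_by_category` and `ENTRIES`, is an assumption made here.

```python
from typing import Dict, List, Tuple

# (phoneme string, phoneme category sequence, prosody information id)
Entry = Tuple[str, Tuple[str, ...], str]


def merge_by_category(entries: List[Entry]) -> Dict[Tuple[str, ...], str]:
    """Keep one prosody entry per category sequence (the first one seen)."""
    merged: Dict[Tuple[str, ...], str] = {}
    for _phonemes, categories, prosody_id in entries:
        merged.setdefault(categories, prosody_id)
    return merged


ENTRIES: List[Entry] = [
    ("kata", ("P", "V", "P", "V"), "prosody_1"),
    ("pata", ("P", "V", "P", "V"), "prosody_1b"),
    ("sana", ("F", "V", "N", "V"), "prosody_2"),
]
MERGED = merge_by_category(ENTRIES)
print(len(ENTRIES), "->", len(MERGED))
```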
[0114]
The components described in each of the above embodiments and modifications may be combined in various ways. Specifically, for example, the method of using the phoneme category sequence for retrieving prosodic information and the like shown in Embodiment 5 may be applied to other embodiments and the like.
[0115]
In addition, the modification of the prosody information according to the degree of coincidence for each phoneme shown in the third and fourth embodiments can be used in other embodiments instead of, or together with, the modification according to the approximate cost. The modification may also use the degree of coincidence for each phoneme, mora, syllable, unit of speech waveform generation in the waveform generation unit, or the like. Further, the degree of coincidence to be used may be selected according to the prosody information to be transformed. Specifically, for example, either the approximate cost or the degree of coincidence for each phoneme or the like may be used to transform the fundamental frequency pattern, and both may be used to transform the voice intensity pattern. Here, the degree of coincidence for each phoneme or the like can be determined based on, for example, a distance based on acoustic characteristics such as fundamental frequency, intensity, time length, and spectrum; a distance obtained phonetically from the articulation method, articulation position, duration, and the like; or a distance based on a confusion (mishearing) table obtained by a listening experiment.
[0116]
Further, the method of using the phoneme category described in the fifth embodiment for search or the like can be used in other embodiments or the like instead of or together with the use of the phoneme sequence.
[0117]
Further, as shown in the second and fourth embodiments, the configuration in which the pause information is stored in the prosody information database as the prosody information and searched may be applied to other embodiments and the like. In the second and fourth embodiments, the pause information may also be used for the search.
[0118]
The language processing unit does not necessarily have to be provided, and a phonetic symbol string or the like may be directly input from the outside. Such a configuration is particularly useful, for example, when applied to a small device such as a mobile phone, and makes it easier to reduce the size of the device and compress communication data. Further, the phonetic symbol string and the language information may be inputted from outside. That is, for example, highly accurate language processing can be performed using a large-scale server, and the result can be input, so that a more appropriate voice can be uttered. On the other hand, the configuration may be simplified by simply using only phonetic symbol strings and the like.
[0119]
Further, the prosody information for synthesizing speech is not limited to the above. For example, a phoneme time length pattern, a mora time length pattern, a syllable time length pattern, or the like may be used instead of the phoneme time length pattern. Further, various prosody information may be combined including the time length pattern as described above.
[0120]
Further, the unit of prosody control, that is, the unit in which prosody information is stored, searched, and transformed, is not limited to an accent phrase or a phrase composed of one or more accent phrases. It may be a clause, a word, a stress phrase, a phrase composed of one or more of any of these, or a mixture of these units. Further, apart from the prosody control unit (for example, a phrase composed of one or more accent phrases), the degree of coincidence of, for example, the number of mora and the accent position of another unit (for example, each accent phrase) may be used to transform the prosody information.
[0121]
Further, the items and the number of search keys are not limited to those described above. In other words, in general, a search key with more items is easier to search for an appropriate candidate, but optimization may be performed together with determination of the degree of coincidence of each item and weighting so that an optimum candidate is easily searched. Also, a search key that contributes little to the search accuracy may be omitted to simplify the configuration and improve the processing speed.
[0122]
Further, in the above example, the Japanese language has been described as an example, but the present invention is not limited to this, and the present invention can be easily applied to various languages. In this case, a modification in accordance with the characteristics of each language, for example, a modification in which processing in units of mora is replaced by processing in units of mora or syllable may be added. The prosody information database 130 and the like may store information on a plurality of languages.
[0123]
Further, the above configuration may be implemented by a computer (and peripheral devices) and a program, or may be implemented by hardware.
[0124]
[Effect of the Invention]
As described above, according to the present invention, prosody information extracted from real speech, such as a fundamental frequency pattern, a voice intensity pattern, a phoneme time length pattern, and pause information, is stored as a database. For an utterance target input as text, a phonetic symbol string, or the like, the prosody information that, for example, minimizes the approximate cost is searched for and selected from the database, and the selected prosody information is transformed based on a predetermined deformation rule according to the approximate cost, the degree of matching, or the like. Natural synthesized speech corresponding to arbitrary input text or the like can thereby be produced. In particular, regardless of whether speech content corresponding to the input text or the like exists in the database, synthesized speech can be produced with uniform sound quality that is, as a whole, natural and close to actual speech.
[0125]
Accordingly, the present invention can be used in various electronic devices, such as home appliances, car navigation systems, and mobile phones, to utter messages such as device status, operation instructions, and response messages. It can also be used for operation through a voice interface, confirmation of character recognition results from optical character recognition (OCR), and the like, and is useful in these fields.
[Brief description of the drawings]
FIG. 1 is a functional block diagram illustrating a configuration of a speech synthesis system according to a first embodiment.
FIG. 2 is an explanatory diagram illustrating an example of information of each unit of the speech synthesis system according to the first embodiment.
FIG. 3 is an explanatory diagram showing storage contents of a prosody information database of the speech synthesis system according to the first embodiment.
FIG. 4 is an explanatory diagram showing an example of a modification of a fundamental frequency pattern.
FIG. 5 is an explanatory diagram showing an example of modification of prosody information.
FIG. 6 is a functional block diagram illustrating a configuration of a speech synthesis system according to a second embodiment.
FIG. 7 is an explanatory diagram showing stored contents of a prosody information database of the speech synthesis system according to the second embodiment.
FIG. 8 is a functional block diagram illustrating a configuration of a speech synthesis system according to a third embodiment.
FIG. 9 is a functional block diagram illustrating a configuration of a speech synthesis system according to a fourth embodiment.
FIG. 10 is an explanatory diagram showing storage contents of a prosody information database of the speech synthesis system according to the fourth embodiment.
FIG. 11 is a functional block diagram illustrating a configuration of a speech synthesis system according to a fifth embodiment.
FIG. 12 is an explanatory diagram showing an example of a phoneme category.
FIG. 13 is a functional block diagram showing a configuration of a conventional speech synthesis system.
[Explanation of symbols]
110 Character string input unit
120 Language processing unit
130 Prosody information database
140 Prosody information search unit
150 Prosody information deformation unit
160 Prosody information deformation rule storage unit
170 Waveform generation unit
180 Electroacoustic transducer
220 Language processing unit
230 Prosody information database
240 Prosody information search unit
250 Prosody information deformation unit
260 Prosody information deformation rule storage unit
341 Fundamental frequency pattern search unit
342 Voice intensity pattern search unit
343 Phoneme time length pattern search unit
351 Fundamental frequency pattern deformation unit
352 Voice intensity pattern deformation unit
353 Phoneme time length pattern deformation unit
420 Language processing unit
430 Prosody information database
441 Fundamental frequency pattern search unit
442 Voice intensity pattern search unit
443 Phoneme time length pattern search unit
444 Pause information search unit
451 Fundamental frequency pattern deformation unit
452 Voice intensity pattern deformation unit
453 Phoneme time length pattern deformation unit
454 Pause length changing unit
460 Prosody information deformation rule storage unit
730 Prosody information database
740 Prosody information search unit
790 Phoneme category sequence generation unit

Claims

A speech synthesis system that outputs synthesized speech based on synthesized speech information indicating speech to be synthesized, the speech synthesis system comprising:
a database in which prosody information used for speech synthesis is stored, as extracted from actual speech, in correspondence with key information serving as a search key;
search means for searching for the prosody information according to the degree of coincidence between the synthesized speech information and the key information;
transforming means for transforming the prosody information retrieved by the search means, based on a deformation rule, according to the degree of coincidence between the synthesized speech information and the key information; and
synthesizing means for outputting synthesized speech based on the synthesized speech information and the prosody information transformed by the transforming means.

The speech synthesis system according to claim 1, wherein the deformation rule specifies at least one of a transformation pattern and a degree of transformation.

The speech synthesis system according to claim 1, wherein
each of the synthesized speech information and the key information includes a phonetic symbol string indicating a speech attribute of the speech to be synthesized.

The speech synthesis system according to claim 3, wherein
each of the synthesized speech information and the key information further includes linguistic information indicating a linguistic attribute of the speech to be synthesized.

The speech synthesis system according to claim 3, wherein
the phonetic symbol string includes at least information substantially indicating any of a phoneme string, an accent position, and the presence or absence or length of a pause of the speech to be synthesized.

The speech synthesis system according to claim 4, wherein
the linguistic information includes at least one of grammatical information and semantic information of the speech to be synthesized.

The speech synthesis system according to claim 4, further comprising
language processing means for analyzing text information input to the speech synthesis system to generate the phonetic symbol string and the linguistic information.

The speech synthesis system according to claim 1, wherein
each of the synthesized speech information and the key information substantially includes a phoneme category sequence indicating the phoneme category to which each phoneme of the speech to be synthesized belongs.

The speech synthesis system according to claim 8, further comprising
conversion means for converting at least one of the information corresponding to the synthesized speech information input to the speech synthesis system and the information corresponding to the key information stored in the database into a phoneme category sequence.

The speech synthesis system according to claim 8, wherein
the phoneme category is obtained by grouping phonemes using at least one of the articulation method, the articulation position, and the duration of the phonemes.

The speech synthesis system according to claim 8, wherein
the phoneme category is obtained by grouping prosodic patterns using a statistical method and grouping phonemes using a statistical method so as to best reflect the groups of prosodic patterns.

The speech synthesis system according to claim 11, wherein
the statistical method is multivariate analysis.

The speech synthesis system according to claim 8, wherein
the phoneme category is obtained by grouping phonemes according to a distance between phonemes determined using a statistical method from a phoneme confusion (mishearing) table.

The speech synthesis system according to claim 13, wherein
the statistical method is multivariate analysis.

The speech synthesis system according to claim 8, wherein
the phoneme category is obtained by grouping phonemes according to the similarity of the physical characteristics of the phonemes.

The speech synthesis system according to claim 15, wherein
the physical characteristic is at least one of the fundamental frequency, intensity, time length, and spectrum of a phoneme.

The speech synthesis system according to claim 1, wherein
the prosody information stored in the database includes information indicating prosodic features extracted from the same real speech.

The speech synthesis system according to claim 17, wherein
the information indicating the prosodic features includes at least:
a fundamental frequency pattern indicating the temporal change of the fundamental frequency,
a voice intensity pattern indicating the temporal change of the voice intensity,
a phoneme time length pattern indicating the time length of each phoneme, and
pause information indicating the presence or absence or length of a pause.

The speech synthesis system according to claim 1, wherein
the database stores the prosody information for each prosody control unit.

The speech synthesis system according to claim 19, wherein
the prosody control unit is one of:
an accent phrase,
a phrase composed of one or more accent phrases,
a clause,
a phrase composed of one or more clauses,
a word,
a phrase composed of one or more words,
a stress phrase, and
a phrase composed of one or more stress phrases.

The speech synthesis system according to claim 1, wherein
The synthesized voice information and the key information each include a plurality of types of voice index information that is an element that determines a voice to be synthesized,
The degree of coincidence between the synthesized speech information and the key information indicates that the degree of coincidence between the respective voice index information in the synthesized speech information and the respective voice index information in the key information is weighted and synthesized. Characteristic speech synthesis system.

22. The speech synthesis system according to claim 21 , wherein
The voice index information includes at least information substantially indicating any of a sequence of phonemes, an accent position, presence or absence or length of a pause, and linguistic information indicating a linguistic attribute of the synthesized voice. Characteristic speech synthesis system.

23. The speech synthesis system according to claim 22 , wherein
The voice index information includes information substantially indicating a sequence of phonemes of a synthesized voice,
A speech synthesis system wherein the degree of coincidence between each piece of speech index information in the synthesized speech information and each piece of speech index information in the key information includes a degree of similarity in acoustic feature length for each phoneme.

24. The speech synthesis system according to claim 21, wherein
the speech index information substantially includes a phoneme category string indicating the phoneme category to which each phoneme of the synthesized speech belongs.

25. The speech synthesis system according to claim 24, wherein
the degree of coincidence between each piece of speech index information in the synthesized speech information and the corresponding piece of speech index information in the key information includes a degree of similarity of the phoneme category for each phoneme.
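
A phoneme category string and a per-phoneme category similarity could be sketched as follows. The category table, the function names, and the scoring rule (fraction of positions whose categories match) are hypothetical; the claims do not define the categories or the similarity measure.

```python
from typing import Dict, List

# Hypothetical phoneme-to-category table (an assumption for illustration).
PHONEME_CATEGORY: Dict[str, str] = {
    "a": "vowel", "i": "vowel", "u": "vowel", "e": "vowel", "o": "vowel",
    "k": "plosive", "t": "plosive", "p": "plosive",
    "s": "fricative", "h": "fricative",
    "m": "nasal", "n": "nasal",
}


def category_string(phonemes: List[str]) -> List[str]:
    """Map each phoneme to the category it belongs to."""
    return [PHONEME_CATEGORY[p] for p in phonemes]


def category_coincidence(target: List[str], key: List[str]) -> float:
    """Fraction of positions whose phoneme categories match.
    Sequences of different length score 0.0 in this simplified sketch."""
    if len(target) != len(key):
        return 0.0
    if not target:
        return 1.0
    matches = sum(a == b for a, b in
                  zip(category_string(target), category_string(key)))
    return matches / len(target)


same_shape = category_coincidence(["k", "a", "s", "a"], ["t", "a", "h", "i"])
half_match = category_coincidence(["k", "a"], ["m", "a"])
print(same_shape, half_match)  # 1.0 0.5
```

Under this assumed table, two different phoneme sequences with the same category pattern (plosive, vowel, fricative, vowel) score 1.0, which is the kind of match a category-based index is meant to capture.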

26. The speech synthesis system according to claim 21, wherein
the prosody information includes a plurality of types of prosodic feature information characterizing the synthesized speech.

27. The speech synthesis system according to claim 26, wherein
the plurality of types of prosodic feature information are stored in the database as a set.

28. The speech synthesis system according to claim 27, wherein
the plurality of types of prosodic feature information in the set are each extracted from the same real speech.

29. The speech synthesis system according to claim 26, wherein
the prosodic feature information includes at least any of:
a fundamental frequency pattern indicating the temporal change of the fundamental frequency,
a speech intensity pattern indicating the temporal change of the speech intensity,
a phoneme time length pattern indicating the time length of each phoneme, and
pause information indicating the presence or absence, or the length, of a pause.

30. The speech synthesis system according to claim 29, wherein
the phoneme time length pattern includes at least one of a phoneme duration pattern, a mora duration pattern, and a syllable duration pattern.

31. The speech synthesis system according to claim 26, wherein
each type of prosodic feature information is retrieved and transformed according to a degree of coincidence between the synthesized speech information and the key information obtained with a different weighting for that type.

32. The speech synthesis system according to claim 21, wherein
the retrieval of the prosody information by the retrieval means and the transformation of the prosody information by the transformation means are performed according to degrees of coincidence between the synthesized speech information and the key information obtained with mutually different weightings.

33. The speech synthesis system according to claim 21, wherein
the retrieval of the prosody information by the retrieval means and the transformation of the prosody information by the transformation means are each performed according to a degree of coincidence between the synthesized speech information and the key information obtained with the same weighting.

34. The speech synthesis system according to claim 1, wherein
the transformation means transforms the prosody information retrieved by the retrieval means based on the degree of coincidence for at least any of:
each phoneme,
each mora,
each syllable,
each unit of speech waveform generation in the synthesis means, and
each phoneme.

35. The speech synthesis system according to claim 34, wherein
the degree of coincidence for each phoneme, each mora, each syllable, each unit of speech waveform generation in the synthesis means, and each phoneme is set based on at least any one of:
a distance based on acoustic characteristics,
a distance obtained from any of the manner of articulation, the place of articulation, and the duration, and
a distance based on a mishearing (confusion) table obtained by a listening experiment.
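
One of the options listed in the claim above, a distance derived from the manner and place of articulation, might look like the following sketch. The feature table, the equal treatment of the two features, and the conversion of the distance into a degree of coincidence are all assumptions for illustration and are not taken from the claims.

```python
from typing import Dict, Tuple

# Hypothetical (manner, place) features per phoneme; illustrative only.
ARTICULATION: Dict[str, Tuple[str, str]] = {
    "p": ("plosive", "bilabial"),
    "t": ("plosive", "alveolar"),
    "k": ("plosive", "velar"),
    "s": ("fricative", "alveolar"),
    "m": ("nasal", "bilabial"),
    "n": ("nasal", "alveolar"),
}


def articulation_distance(a: str, b: str) -> float:
    """Fraction of articulatory features (manner, place) that differ."""
    fa, fb = ARTICULATION[a], ARTICULATION[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)


def phoneme_coincidence(a: str, b: str) -> float:
    """Assumed mapping from distance to degree of coincidence in [0, 1]."""
    return 1.0 - articulation_distance(a, b)


print(articulation_distance("t", "s"))  # 0.5 (same place, different manner)
print(articulation_distance("p", "n"))  # 1.0 (both features differ)
print(phoneme_coincidence("k", "k"))    # 1.0 (identical phoneme)
```

A table built from a listening experiment could replace `ARTICULATION` without changing the surrounding logic, since only the distance function depends on it.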

36. The speech synthesis system according to claim 35, wherein
the acoustic characteristic is at least one of a fundamental frequency, an intensity, a time length, and a spectrum.

37. The speech synthesis system according to claim 1, wherein
the database stores the key information and the prosody information for a plurality of languages.

38. A speech synthesis method for outputting synthesized speech based on synthesized speech information indicating the speech to be synthesized, the method comprising:
retrieving prosody information from a database in which prosody information for use in speech synthesis is stored, in the state in which it was extracted from real speech, in association with key information serving as a search key,
the prosody information being retrieved according to the degree of coincidence between the synthesized speech information and the key information;
transforming the retrieved prosody information based on a transformation rule according to the degree of coincidence between the synthesized speech information and the key information; and
outputting synthesized speech based on the synthesized speech information and the transformed prosody information.
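
The retrieve-then-transform flow of the method above can be sketched in Python. Everything concrete here is an assumption for illustration: the `Entry` structure, the use of a phoneme sequence as key information, the position-wise match score, and in particular the transformation rule, which simply flattens the retrieved F0 contour toward a neutral pitch as the match gets worse. Waveform generation is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entry:
    key: List[str]    # key information (here: a phoneme sequence)
    f0: List[float]   # prosody information (here: an F0 contour in Hz)


def coincidence(target: List[str], key: List[str]) -> float:
    """Fraction of positions with identical phonemes (simplified measure)."""
    if len(target) != len(key):
        return 0.0
    if not target:
        return 1.0
    return sum(a == b for a, b in zip(target, key)) / len(target)


def retrieve(target: List[str], database: List[Entry]) -> Tuple[Entry, float]:
    """Return the best-matching entry and its degree of coincidence."""
    best = max(database, key=lambda e: coincidence(target, e.key))
    return best, coincidence(target, best.key)


def transform(f0: List[float], degree: float,
              neutral_hz: float = 120.0) -> List[float]:
    """Hypothetical rule: the lower the match, the flatter the contour."""
    return [degree * v + (1.0 - degree) * neutral_hz for v in f0]


database = [
    Entry(["k", "a", "s", "a"], [100.0, 140.0, 130.0, 110.0]),
    Entry(["m", "i", "z", "u"], [120.0, 150.0, 125.0, 100.0]),
]
entry, degree = retrieve(["k", "a", "t", "a"], database)
prosody = transform(entry.f0, degree)
print(degree, prosody)  # 0.75 [105.0, 135.0, 127.5, 112.5]
```

The point of the sketch is only that the same degree of coincidence drives both steps: it selects the entry and then controls how strongly the retrieved prosody is modified before synthesis.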

39. The speech synthesis method according to claim 38, wherein
the synthesized speech information and the key information each include a plurality of types of speech index information, each type being an element that determines the speech to be synthesized, and
the degree of coincidence between the synthesized speech information and the key information is obtained by weighting and combining the degrees of coincidence between each piece of speech index information in the synthesized speech information and the corresponding piece of speech index information in the key information.

40. The speech synthesis method according to claim 39, wherein
the prosody information includes a plurality of types of prosodic feature information characterizing the synthesized speech.

41. The speech synthesis method according to claim 40, wherein
each type of prosodic feature information is retrieved and transformed according to a degree of coincidence between the synthesized speech information and the key information obtained with a different weighting for that type.

42. The speech synthesis method according to claim 39, wherein
the retrieval of the prosody information and the transformation of the prosody information are performed according to degrees of coincidence between the synthesized speech information and the key information obtained with mutually different weightings.

43. The speech synthesis method according to claim 39, wherein
the retrieval of the prosody information and the transformation of the prosody information are each performed according to a degree of coincidence between the synthesized speech information and the key information obtained with the same weighting.

44. A speech synthesis system that converts input text into synthesized speech and outputs it, comprising:
language processing means for analyzing the input text and outputting a phonetic symbol string and linguistic information;
a prosody information database in which prosody information, in the state in which it was extracted from real speech, is stored in association with the phonetic symbol string and the linguistic information corresponding to the speech to be synthesized;
retrieval means for retrieving the prosody information stored in the prosody information database that corresponds to at least part of a search item consisting of the phonetic symbol string and the linguistic information output from the language processing means;
prosody transformation means for transforming the prosody information, which has been retrieved from the prosody information database according to the degree of coincidence between the search item and the stored contents of the prosody information database, based on a transformation rule according to that degree of coincidence; and
waveform generation means for generating a speech waveform based on the prosody information output from the prosody transformation means and the phonetic symbol string output from the language processing means.