JP4287664B2 - Speech synthesizer

Info

Publication number: JP4287664B2
Application number: JP2003029682A
Authority: JP
Inventors: 勝義山上; 弓子加藤; 亮望月
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2003-02-06
Filing date: 2003-02-06
Publication date: 2009-07-01
Anticipated expiration: 2023-02-06
Also published as: JP2004240201A

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer capable of synthesizing read-aloud speech of consistently high quality by making maximum use of prosody pattern data derived from real speech.
SOLUTION: A language analysis unit 101, which analyzes input text and outputs a language processing result expressed as a phonetic symbol string, designates positions at which the phonetic symbol string is split within any accent phrase. A prosody search unit 102 retrieves prosody patterns from a prosody pattern database 103 either in units of accent phrases or in units delimited by the split positions designated within an accent phrase, so that prosody is generated without any means other than prosody pattern retrieval.
COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
この発明は合成音声を生成する装置に関する。さらに詳しくは、入力された文字列（テキスト）を音声に変換する装置に関する。
【０００２】
【従来の技術】
自然な合成音声を生成するために、録音した音声データから音声の特徴量をデータベースとして蓄積しておき、そのデータベース内の音声の特徴量の単位を組み合わせて音声合成に必要な音響パラメータを得る方式が提案されている。たとえば、テキストの文字列あるいは表音記号列によりインデックスされた音声の特徴量をデータベースとしてもち、合成する対象のテキストの文字列、あるいは、テキストを言語処理することにより得られる表音記号列と合致する部分の音声の特徴量の単位を検索する方式（特開平８−８７２９７；方式１と呼ぶ）、あるいは、音韻のパターンとしてインデックス付けされた基本周波数パターンのデータベースから、合成時に必要となる基本周波数パターンを検索して生成する方式（US PATENT 5,905,972；方式２と呼ぶ）などがある。これらの方式は、音声から抽出した音声の特徴量を直接利用するので、たとえば韻律生成の方式として用いた場合、モデルベースによる韻律生成方式に比べ、より自然性の高い合成音が生成できる。
【０００３】
一方、規則に基づいて韻律を生成する音声合成装置においても、通常、言語処理の結果、出力される表音記号列のアクセント句単位に韻律情報を生成するのが一般的な方法である。
【０００４】
【特許文献１】
特開平８−８７２９７号公報
【特許文献２】
米国特許第５，９０５，９７２号明細書
【０００５】
【発明が解決しようとする課題】
しかしながら、上記のデータベースから必要な音声の特徴量を検索する方式では、合成時に必要とされる、すなわち、合成対象のテキストあるいは表音記号列に合致する韻律データがデータベースに見つからない場合に、韻律情報を生成する代替手段が必要になる。例えば方式１では、データベース内の韻律パターンに適合しない部分は、規則による韻律生成を行う。同じく、方式２においてもデータベースから生成できない部分に関しては、線形補完するという代替手段を用いている。
【０００６】
したがってこれらの方式では、図１Ａに示すように、生成される韻律は、データベース内の韻律データに適合する部分と代替手段により生成された部分との間で韻律の自然性のギャップが生じるという不都合を有す。一般に、自然発声の録音データから韻律データベースを構成すると、モーラ数が増えるにしたがって、データベース内の韻律データの分布は少なくなるので、必要となるアクセント型、表音文字列の組み合わせを充分カバーすることが困難になる。代替手段を用いる従来技術では、比較的モーラ数が長い部分（たとえば、名詞が連続する複合語、助動詞や助詞を伴う用言などを表音文字列に変換した部分）では、前記の不都合が生じる傾向が強まってくる。
【０００７】
また、規則に基づいて韻律生成を行う音声合成装置においても、図１Ｂに示すように、長いモーラ数のアクセント句に対して違和感のない安定した韻律を生成することは困難であり、読み上げ音声が聞きづらくなる一因となる。
【０００８】
また、実音声では、発話の速度が遅くなるとポーズ・ピッチ建て直しの頻度が増えるが、従来の音声合成装置では、言語処理の結果として１つのアクセント句となっている部分を分割して韻律を生成することができない。
【０００９】
この発明の目的は、自然な合成音声を生成することができる音声合成装置を提供することである。
【００１０】
【課題を解決するための手段】
この発明の１つの局面に従うと、音声合成装置は、言語解析部と、韻律パターンデータベースと、韻律検索部と、波形生成部とを備える。言語解析部は、入力されたテキストからアクセント句を少なくとも出力する。韻律パターンデータベースは、実音から抽出した韻律パターンを格納する。韻律検索部は、前記アクセント句に対応する韻律パターンを韻律パターンデータベースから検索して韻律情報を生成する。また韻律検索部は、前記アクセント句のうち、１つのアクセント句からなる特定のアクセント句を分割し、分割された単位で韻律パターンを検索して韻律情報を生成する。波形生成部は、前記韻律情報から音声波形を合成する。
【００１１】
好ましくは、上記特定のアクセント句は、対応する韻律パターンが韻律パターンデータベース内には存在しないアクセント句である。
【００１２】
好ましくは、上記特定のアクセント句は、モーラ数が基準値を超えているアクセント句である。上記基準値は、韻律パターンデータベースに格納されているモーラ数毎のグループの中で、すべてのアクセント型が存在しているグループの中の最大のモーラ数を有するグループのモーラ数である。
【００１３】
好ましくは、上記特定のアクセント句は、複数の単語が連続する複合語に対応するアクセント句である。上記韻律検索部は、上記複合語に含まれている単語の各々に分割し、分割された単語に対応する韻律パターンを検索して韻律情報を生成する。
【００１４】
好ましくは、上記特定のアクセント句は、設定されているポーズ頻度が基準値を超えるときに選ばれるアクセント句である。
【００１５】
この発明のもう１つの局面に従うと、音声合成装置は、言語解析部と、韻律生成部と、波形生成部とを備える。言語解析部は、入力されたテキストからアクセント句を少なくとも出力する。韻律生成部は、前記アクセント句ごとに所定の規則から韻律情報を生成する。また韻律生成部は、前記アクセント句のうち、１つのアクセント句からなる特定のアクセント句を分割し、分割された単位で前記規則から韻律情報を生成する。波形生成部は、前記韻律情報から音声波形を合成する。
【００１６】
好ましくは、上記特定のアクセント句は、モーラ数が基準値を超えているアクセント句である。
【００１７】
好ましくは、上記特定のアクセント句は、複数の単語が連続する複合語に対応するアクセント句である。上記韻律生成部は、上記複合語に含まれている単語ごとに上記規則から韻律情報を生成する。
【００１８】
好ましくは、上記特定のアクセント句は、設定されているポーズ頻度が基準値を超えるときに選ばれるアクセント句である。
【００１９】
本発明の第１の様態では、読み上げるための入力テキストを表音記号列に変換し、かつ、変換された表音記号列の任意のアクセント句の中で表音記号列を分割する位置を指定可能な言語解析手段と、実音声から抽出した音声特徴量を韻律パターン情報として格納した韻律パターンデータベースと、言語解析手段が出力する表音記号列のアクセント句の単位、あるいは、アクセント句内の分割位置で分割された単位に合致する韻律情報を、韻律パターンデータベースから検索して韻律情報を生成する韻律検索手段と、韻律検索手段によって生成された韻律情報にしたがって、音声波形を合成する波形合成生成手段と、を具備する音声合成装置として構成とした。
【００２０】
与えられたアクセント句に合致する韻律パターンがデータベースに得られない場合、分割位置によりアクセント句を分割した単位で韻律パターンを再検索できるので、長いモーラ数のアクセント句に対してもデータベースにある韻律パターンを最大限に利用した自然性の高い韻律生成が可能である。
【００２１】
本発明の第２の様態では、第１の様態において、韻律パターンデータベース内に格納された韻律パターンのモーラ数、アクセント型など韻律パターン検索において韻律を区別する特徴の分布から、言語解析手段においてアクセント句内の分割位置を決定する。
【００２２】
言語処理の段階で、データベースにある韻律パターンの特徴を考慮して、アクセント句内の分割位置を決定するので、少なくとも分割位置で分割された単位では、データベースから韻律パターンが検索できるので、代替の韻律生成手段を用いることなく入力の文全体にわたって自然性の高い韻律生成が可能である。
【００２３】
本発明の第３の様態では、第１の様態において、言語処理手段で、入力テキストの名詞が連続する複合語部分を変換した表音記号列のアクセント句内で、複合語を構成する形態素ごとにアクセント句内の分割位置を設定する構成とした。
【００２４】
単語ごとに分割して検索した韻律パターンを接続して韻律生成することで、複合語の各単語がはっきりと聞き取りやすい合成音を生成できる。
【００２５】
本発明の第４の様態では、第１の様態において、言語処理手段が設定するアクセント句内の分割位置を、韻律検索手段において外部からのポーズ頻度を制御するパラメータにより、分割位置として扱うかどうかを決定する構成とした。
【００２６】
ポーズ頻度を指定するパラメータが、ポーズを多く入れる側の設定になっているときには、アクセント句内の分割位置で分割した単位で韻律パターンを検索し、ポーズ頻度を指定するパラメータが、ポーズをなるべく入れない側の設定になっているときには、アクセント句内の分割位置を無視して分割位置がないものとして韻律パターンを検索するという制御が行えるので、読み上げの速度に応じて適切な韻律制御を行うことができる。
【００２７】
本発明の第５の様態では、読み上げるための入力テキストを表音記号列に変換し、かつ、変換された表音記号列の任意のアクセント句の中で表音記号列を分割する位置を指定可能な言語解析手段と、言語解析手段が出力する表音記号列のアクセント句の単位、あるいは、アクセント句内の分割位置で分割された単位ごとに規則により韻律情報生成する韻律生成手段と、韻律生成手段によって生成された韻律情報にしたがって、音声波形を合成する波形合成生成手段と、を具備する音声合成装置として構成した。
【００２８】
モーラ数が多いアクセント句でも分割位置を韻律生成の時のピッチの立ち上げポイントとして利用することで、メリハリ感のある韻律生成が可能である。
【００２９】
本発明の第６の様態では、第５の様態において、言語処理手段で、入力テキストの名詞が連続する複合語部分を変換した表音記号列のアクセント句内で、複合語を構成する形態素ごとにアクセント句内の分割位置を設定する構成とした。
【００３０】
複合語がアクセント結合してできたアクセント句においても、分割位置において微妙なピッチの立ち上げの制御を行うことができるので、規則による韻律生成を行う場合も、複合語を構成する単語ごとの発音が明瞭な合成音を生成可能である。
【００３１】
本発明の第７の様態では、第５の様態において、言語処理手段が設定するアクセント句内の分割位置を、韻律生成手段において外部からのポーズ頻度を制御するパラメータにより、分割位置として扱うかどうかを決定する構成とした。
【００３２】
ポーズ頻度を指定するパラメータが、ポーズを多く入れる側の設定になっているときには、アクセント句内の分割位置をピッチの立ち上げポイントとした韻律制御を行い、ポーズ頻度を指定するパラメータが、ポーズをなるべく入れない側の設定に成っているときには、アクセント句内の分割位置においてピッチの立ち上げを行わない韻律制御を行うことができるので、規則による韻律生成によっても読み上げの速度に応じて適切な韻律制御を行うことができる。
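[0033]
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Embodiments of the present invention are described in detail below with reference to the drawings. Identical or corresponding parts in the drawings are given the same reference numerals, and their description is not repeated.
[0034]
(First Embodiment)
<Configuration of the text-to-speech synthesizer>
FIG. 2 shows the configuration of the text-to-speech synthesizer according to the first embodiment. The apparatus includes a language analysis unit 101, a prosody search unit 102, a prosody pattern database 103, and a waveform generation unit 104.
[0035]
The language analysis unit 101 analyzes the input text and outputs a language analysis result containing the information needed to generate synthesized speech that reads the text aloud: a phonetic symbol string, accent information, accent phrase boundary positions, and split positions within accent phrases.
[0036]
The prosody pattern database 103 stores acoustic features extracted from real speech as prosody patterns that can be searched using at least the mora count and accent type, and additionally combinations of phonetic symbol strings, as search conditions. That is, the prosody pattern database 103 holds accent-phrase-level prosody patterns extracted from recorded speech and is searchable by the mora count and accent position (accent type) of the accent phrase.
[0037]
The prosody search unit 102 uses as a search key the phonetic symbol string and accent type of each accent phrase in the language analysis result output by the language analysis unit 101, or of each unit delimited by the split positions within an accent phrase, and retrieves the matching prosody pattern from the prosody pattern database 103. The prosody search unit 102 retrieves a prosody pattern for each accent phrase, connects the patterns, and outputs prosody information covering the whole sentence.
[0038]
The waveform generation unit 104 synthesizes a speech waveform (generates synthesized speech) according to the prosody information generated by the prosody search unit 102.
[0039]
Such a text-to-speech synthesizer is built, for example, on a computer system like the one shown in FIG. 3. This computer system includes a main unit 501, a keyboard 502, a display 503, an input device (mouse) 504, and a speaker 509, and supports text input and speech output. The language analysis unit 101, prosody search unit 102, prosody pattern database 103, and waveform generation unit 104 of FIG. 2 are stored on a CD-ROM 508 loaded into the main unit 501, on a disk (memory) 505 built into the main unit 501, or on a disk 506 of another system connected via a line 507.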
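[0040]
<Procedure for generating synthesized speech>
The procedure by which the text-to-speech synthesizer configured as above generates synthesized speech is described with reference to FIG. 4.
[0041]
[Step ST101]
The text to be read aloud is input, for example from an input device such as a keyboard or mouse, or by reading a text file. Here, the text 「そう思うかもしれません。」 ("You might think so.") is assumed to be input.
[0042]
[Step ST102]
The language analysis unit 101 analyzes the input text. The resulting language analysis result (phonetic symbol string) contains the information needed to generate synthesized speech for the input text: the reading, accent information, accent phrase boundary positions, and split positions within accent phrases. Here, as shown in FIG. 5, the phonetic symbol string 「ソー／オモ’ウカモ│シレマセン」 is obtained for the given text 「そう思うかもしれません。」. In this phonetic symbol string, the reading is written in katakana, the accent position (accent nucleus) is marked with 「’」, accent phrase boundaries are marked with 「／」, and split positions within an accent phrase are marked with 「│」. The intra-phrase split position 「│」 is inserted in an accent phrase that language analysis has merged into a single accent phrase (here, accent phrase 2), at a position that language analysis identifies as a bunsetsu boundary (here, the boundary between bunsetsu 2 and bunsetsu 3).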
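The notation described in step ST102 can be read mechanically. The following is a minimal sketch, not the patent's implementation: it assumes that every kana except a small kana (such as ョ) counts as one mora, and that the accent type equals the number of moras before the accent mark (0 when there is no mark). The names `Unit`, `count_mora`, `parse_unit`, and `parse` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

ACCENT_MARK = "’"   # accent nucleus marker used in the example string
PHRASE_SEP = "／"   # accent phrase boundary
UNIT_SEP = "│"      # split position inside an accent phrase
SMALL_KANA = set("ャュョァィゥェォヮ")  # assumed to merge with the preceding kana


@dataclass
class Unit:
    kana: str    # reading without the accent mark
    mora: int    # mora count
    accent: int  # accent type: 0 = no nucleus, n = nucleus on the n-th mora


def count_mora(kana: str) -> int:
    """Count moras: every character except a small kana is one mora."""
    return sum(1 for ch in kana if ch not in SMALL_KANA)


def parse_unit(text: str) -> Unit:
    """Parse one accent phrase or split unit that contains no separators."""
    head, mark, tail = text.partition(ACCENT_MARK)
    accent = count_mora(head) if mark else 0
    kana = head + tail
    return Unit(kana, count_mora(kana), accent)


def parse(symbols: str) -> List[Tuple[Unit, List[Unit]]]:
    """Return (whole accent phrase, its split units) for each accent phrase."""
    result = []
    for phrase in symbols.split(PHRASE_SEP):
        whole = parse_unit(phrase.replace(UNIT_SEP, ""))
        units = [parse_unit(part) for part in phrase.split(UNIT_SEP)]
        result.append((whole, units))
    return result


for whole, units in parse("ソー／オモ’ウカモ│シレマセン"):
    print(whole, units)
```

Under these assumptions the example string yields a 2-mora type-0 phrase and a 10-mora type-2 phrase whose split units are 5-mora type-2 and 5-mora type-0, matching the values used in the steps that follow.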
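[0043]
[Step ST103]
The prosody search unit 102 selects the first accent phrase (here, accent phrase 1, 「ソー」).
[0044]
[Step ST104]
The prosody search unit 102 searches the prosody pattern database 103 for a matching prosody pattern, using the mora count (2 moras) and accent type (type 0) of the selected accent phrase 1 「ソー」 as search conditions. If a matching prosody pattern exists, the procedure goes to step ST105; if not, it goes to step ST107. Here, as shown in FIG. 6(a), a 2-mora type-0 prosody pattern matching accent phrase 1 「ソー」 exists in the database 103, so the procedure goes to step ST105.
[0045]
[Step ST105]
The prosody search unit 102 determines whether all accent phrases have been searched. If so, the procedure goes to step ST108; otherwise it goes to step ST106. Here accent phrase 2 has not yet been searched, so the procedure goes to step ST106.
[0046]
[Step ST106]
The prosody search unit 102 selects the next accent phrase (here, accent phrase 2, 「オモ’ウカモ│シレマセン」).
[0047]
[Step ST104]
The prosody search unit 102 searches the prosody pattern database 103 for a matching prosody pattern, using the mora count (10 moras) and accent type (type 2) of the selected accent phrase 2 「オモ’ウカモ│シレマセン」 as search conditions. Here, as shown in FIG. 6(b), no 10-mora type-2 prosody pattern matching accent phrase 2 exists in the database 103. The procedure therefore goes to step ST107.
[0048]
[Step ST107]
The prosody search unit 102 searches for prosody patterns matching unit 1 「オモ’ウカモ」 and unit 2 「シレマセン」, which are obtained by splitting accent phrase 2 at the split position 「│」. As shown in FIG. 7(a), the prosody search unit 102 searches the prosody pattern database 103 using the mora count (5 moras) and accent type (type 2) of unit 1 「オモ’ウカモ」 as search conditions. Here a 5-mora type-2 prosody pattern matching unit 1 exists in the database 103. As shown in FIG. 7(b), the prosody search unit 102 also searches the prosody pattern database 103 using the mora count (5 moras) and accent type (type 0) of unit 2 「シレマセン」 as search conditions. Here a 5-mora type-0 prosody pattern matching unit 2 exists in the database 103. The procedure then goes to step ST105.
[0049]
[Step ST105]
All accent phrases have now been searched, so the procedure goes to step ST108.
[0050]
[Step ST108]
As shown in FIG. 8, the prosody search unit 102 connects the prosody patterns retrieved in steps ST104 and ST107 and outputs prosody information. Because all of this prosody information comes from prosody patterns in the prosody pattern database 103, the audible gap at the junctions is smaller than in the conventional examples.
[0051]
[Step ST109]
The waveform generation unit 104 synthesizes a speech waveform (generates synthesized speech) according to the prosody information generated by the prosody search unit 102.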
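[0052]
<Effect>
In the text-to-speech synthesizer according to the first embodiment, the language analysis unit 101 sets split positions within accent phrases. The prosody search unit 102 can therefore search for prosody patterns in units that have fewer moras than the accent phrase, which raises the proportion of units that match a prosody pattern stored in the prosody pattern database 103. As a result, synthesized speech with high-quality prosody can be generated, making maximum use of the highly natural prosody information extracted from real speech.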
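The search-then-split flow of steps ST103 to ST108 can be sketched as follows. This is an illustrative sketch only: the dictionary `TOY_DB`, the string placeholders standing in for prosody patterns, and the function `retrieve_prosody` are assumptions, and the behaviour when even a split unit is missing is not described for this embodiment, so the sketch simply raises an error.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Key = Tuple[int, int]  # (mora count, accent type)

# Toy stand-in for the prosody pattern database: only the keys that the
# walkthrough says are present. The strings are placeholders for real
# prosody patterns (for example F0 contours).
TOY_DB: Dict[Key, str] = {
    (2, 0): "pattern:2-mora/type-0",
    (5, 2): "pattern:5-mora/type-2",
    (5, 0): "pattern:5-mora/type-0",
}


def retrieve_prosody(
    phrases: Sequence[Tuple[Key, Sequence[Key]]],
    db: Dict[Key, str],
) -> List[str]:
    """Each phrase is (key of the whole accent phrase, keys of its split units)."""
    prosody: List[str] = []
    for whole_key, unit_keys in phrases:
        pattern: Optional[str] = db.get(whole_key)  # ST104: search by whole phrase
        if pattern is not None:
            prosody.append(pattern)
            continue
        for unit_key in unit_keys:                  # ST107: search by split unit
            unit_pattern = db.get(unit_key)
            if unit_pattern is None:
                # Assumption: this case is not covered by the walkthrough.
                raise LookupError(f"no prosody pattern for {unit_key}")
            prosody.append(unit_pattern)
    return prosody                                  # ST108: patterns in connection order


# Accent phrase 1 (2 moras, type 0) and accent phrase 2 (10 moras, type 2,
# split into 5-mora type-2 and 5-mora type-0 units).
sentence = [((2, 0), [(2, 0)]), ((10, 2), [(5, 2), (5, 0)])]
print(retrieve_prosody(sentence, TOY_DB))
```

The 10-mora type-2 key is absent from the toy database, so the second phrase falls through to its two 5-mora units, mirroring FIG. 6(b) and FIG. 7.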
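[0053]
(Second Embodiment)
The configuration of the text-to-speech synthesizer according to the second embodiment is the same as that shown in FIG. 2. It differs from the first embodiment in that a reference mora count is set in advance and prosody patterns are searched according to this reference mora count.
[0054]
<Setting the reference mora count>
The procedure for setting the reference mora count is described with reference to FIG. 9.
[0055]
[Step ST201]
The prosody patterns stored in the prosody pattern database 103 are divided into groups by mora count. Here, as shown in FIG. 10, they are assumed to be divided into a 2-mora group GR2, a 3-mora group GR3, a 4-mora group GR4, and a 5-mora group GR5.
[0056]
[Step ST202]
For each group, it is determined whether its prosody patterns cover all possible accent types. As shown in FIG. 10, the 2-mora group GR2 contains type-0, type-1, and type-2 prosody patterns. That is, the prosody patterns in GR2 cover every accent type possible for 2 moras. Likewise, the prosody patterns in the 3-mora group GR3 cover every accent type possible for 3 moras, and those in the 4-mora group GR4 cover every accent type possible for 4 moras. The 5-mora group GR5, by contrast, contains only type-0 and type-2 prosody patterns; it contains no type-1, type-3, type-4, or type-5 patterns. That is, the prosody patterns in GR5 do not cover every accent type possible for 5 moras.
[0057]
[Step ST203]
The largest mora count among the groups judged in step ST202 to cover all accent types is taken as the reference mora count used for setting split positions within accent phrases. Here the reference mora count is set to 4 moras.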
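Steps ST201 to ST203 amount to a small computation over the database keys. The sketch below assumes, following the GR2 example in step ST202, that an n-mora unit can take accent types 0 through n. The function name `reference_mora_count` and the return value of 0 when no group is fully covered are hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def reference_mora_count(keys: Iterable[Tuple[int, int]]) -> int:
    """keys: (mora count, accent type) of every pattern in the database."""
    groups: Dict[int, Set[int]] = defaultdict(set)
    for mora, accent in keys:                      # ST201: group by mora count
        groups[mora].add(accent)
    covered = [
        mora for mora, types in groups.items()
        if types >= set(range(mora + 1))           # ST202: types 0..n all present?
    ]
    return max(covered) if covered else 0          # ST203: largest covered group


# The FIG. 10 example: GR2, GR3 and GR4 are complete, GR5 has only types 0 and 2.
example_keys = [(m, a) for m in (2, 3, 4) for a in range(m + 1)] + [(5, 0), (5, 2)]
print(reference_mora_count(example_keys))  # → 4
```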
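[0058]
<Procedure for generating synthesized speech>
Next, the procedure for generating synthesized speech is described with reference to FIG. 11.
[0059]
[Step ST101]
The text to be read aloud (here, the text 「情報処理装置」, "information processing apparatus") is input (see FIG. 12(a)).
[0060]
[Step ST102]
The language analysis unit 101 analyzes the input text. Here, as shown in FIGS. 12(a) and 12(b), the phonetic symbol string 「ジョーホーショリソ’ーチ」 is obtained as the language analysis result for the given text 「情報処理装置」.
[0061]
[Step ST301]
The first accent phrase of the phonetic symbol string obtained in step ST102 is selected. Here the accent phrase 「ジョーホーショリソ’ーチ」 is selected.
[0062]
[Step ST302]
It is determined whether the mora count of the selected accent phrase exceeds the reference mora count. If it does, the procedure goes to step ST303; if not, it goes to step ST304. Here the accent phrase 「ジョーホーショリソ’ーチ」 has 9 moras, which exceeds the reference mora count of 4 moras. The procedure therefore goes to step ST303.
[0063]
[Step ST303]
Split positions are set within the selected accent phrase. They are set so that each unit obtained by splitting the accent phrase at the split positions (each split unit) has a mora count no greater than the reference mora count. Split positions are inserted at word boundaries. Here, as shown in FIG. 12(c), split positions are set between the words 「ジョーホー」 and 「ショリ」 and between the words 「ショリ」 and 「ソ’ーチ」. The units obtained by splitting the accent phrase at these positions, unit 1 「ジョーホー」, unit 2 「ショリ」, and unit 3 「ソ’ーチ」, have 4, 2, and 3 moras respectively, all no greater than the reference mora count (4 moras).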
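Step ST303 states the constraint (split at word boundaries so that every unit stays within the reference mora count) but not how the positions are chosen. The sketch below uses one possible strategy, a greedy left-to-right grouping of words; this strategy, the function name `split_units`, and the handling of a single word longer than the reference count (it is kept as its own unit) are all assumptions.

```python
from typing import List


def split_units(word_moras: List[int], reference: int) -> List[List[int]]:
    """Group word indices into units whose mora totals stay within `reference`.

    word_moras[i] is the mora count of the i-th word of the accent phrase.
    """
    if sum(word_moras) <= reference:
        # ST302: the phrase does not exceed the reference, so it is not split.
        return [list(range(len(word_moras)))]
    units: List[List[int]] = []
    current: List[int] = []
    current_mora = 0
    for index, mora in enumerate(word_moras):
        if current and current_mora + mora > reference:
            units.append(current)        # place a split position before this word
            current, current_mora = [], 0
        current.append(index)
        current_mora += mora
    units.append(current)
    return units


# 「ジョーホー」(4) 「ショリ」(2) 「ソ’ーチ」(3) with a reference mora count of 4.
print(split_units([4, 2, 3], 4))  # → [[0], [1], [2]]
```

For the example phrase the greedy grouping places a split at both word boundaries, which agrees with FIG. 12(c). Merging several short words into one unit would also require deciding that unit's accent type, which the source does not discuss.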
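[0064]
[Step ST304]
It is determined whether the check in step ST302 has been made for every accent phrase of the phonetic symbol string obtained in step ST102. If it has, the procedure goes to step ST306. Otherwise the procedure goes to step ST305, the next accent phrase is selected, and the procedure returns to step ST302. Here the check in step ST302 has been made for every accent phrase, so the procedure goes to step ST306.
[0065]
[Step ST306]
The first accent phrase of the phonetic symbol string obtained in step ST102 is selected. If split positions were set in that accent phrase in step ST303, the first split unit is selected. Here (split) unit 1 「ジョーホー」 is selected.
[0066]
[Step ST307]
The prosody search unit 102 searches the prosody pattern database 103 for a matching prosody pattern, using the mora count and accent type of the selected accent phrase or split unit as search conditions. Here, as shown in FIG. 13, the 4-mora type-0 prosody pattern matching unit 1 「ジョーホー」 is retrieved.
[0067]
[Step ST308]
It is determined whether matching prosody patterns have been retrieved for all accent phrases and split units. If so, the procedure goes to step ST108; otherwise it goes to step ST309. Here unit 2 「ショリ」 and unit 3 「ソ’ーチ」 have not yet been searched, so the procedure goes to step ST309.
[0068]
[Step ST309]
The next accent phrase or split unit is selected, and the procedure returns to step ST307. Here (split) unit 2 「ショリ」 is selected.
[0069]
[Step ST307]
As shown in FIG. 13, the 2-mora type-0 prosody pattern matching unit 2 「ショリ」 is retrieved.
[0070]
[Step ST308]
Unit 3 「ソ’ーチ」 has not yet been searched, so the procedure goes to step ST309.
[0071]
[Step ST309]
(Split) unit 3 「ソ’ーチ」 is selected.
[0072]
[Step ST307]
As shown in FIG. 13, the 3-mora type-1 prosody pattern matching unit 3 「ソ’ーチ」 is retrieved.
[0073]
[Step ST308]
Matching prosody patterns have been retrieved for all accent phrases and split units, so the procedure goes to step ST108.
[0074]
[Step ST108]
As shown in FIG. 13, the prosody search unit 102 connects the prosody patterns retrieved in step ST307 and outputs prosody information.
[0075]
[Step ST109]
The waveform generation unit 104 synthesizes a speech waveform (generates synthesized speech) according to the prosody information generated by the prosody search unit 102.
[0076]
<Effect>
In the second embodiment, split positions are set for accent phrases that exceed the reference mora count, and prosody patterns are searched for in split units. A matching prosody pattern is therefore always found in the prosody pattern database 103, which removes the need to generate prosody by an alternative means. As a result, synthesized speech with high-quality prosody can be generated over the whole of the text to be read aloud.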
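[0077]
(Third Embodiment)
Consider the case where, in the text-to-speech synthesizer according to the first embodiment, a compound word that becomes a single accent phrase through accent combination is given as input text. Here, as shown in FIG. 14, the text 「欧州戦線戦勝式典」 ("European front victory ceremony") is assumed to be given. This text is a compound word made up of the four words 「欧州」, 「戦線」, 「戦勝」, and 「式典」. Through accent combination, the phonetic symbol string of this text becomes a single 16-mora type-14 accent phrase. Then, as shown in FIG. 14, a 16-mora type-14 prosody pattern is retrieved from the prosody pattern database 103 and prosody information is generated. However, this prosody information is not necessarily a prosody pattern in which the boundaries between the pronunciations of the constituent words, 「オーシュー」, 「センセン」, 「センショー」, and 「シキテン」, are easy to hear. In addition, the sound 「セン」 occurs three times in a row, so the words are hard to make out by ear alone. If prosody patterns were instead distinguished by the mora counts of the constituent words, the number of combinations would grow on the order of the factorial of the mora count, and the prosody pattern database 103 would become too large.
[0078]
In the third embodiment, therefore, when the words making up a compound-word portion of the input text combine into a single accent phrase through accent combination, split positions are set so that the phonetic symbols corresponding to each morpheme of the compound word form a unit.
[0079]
The configuration of the text-to-speech synthesizer according to the third embodiment is the same as that of the first embodiment. The synthesizer of the third embodiment generates synthesized speech according to the flowchart shown in FIG. 15. Only the differences from the procedure of the first embodiment are described below.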
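[0080]
[Step ST101]
The text to be read aloud is input. Here, as shown in FIG. 16(a), the text 「欧州戦線戦勝式典」 is given. This text is a compound word made up of the four words 「欧州」, 「戦線」, 「戦勝」, and 「式典」.
[0081]
[Step ST102]
The language analysis unit 101 analyzes the input text. For an accent phrase obtained by accent combination of the words making up a compound word, intra-accent-phrase split positions are set at the boundaries between those words. That is, an intra-accent-phrase split position is set for the pronunciation of each word of the compound word. Here, as shown in FIG. 16(b), the four words 「欧州」, 「戦線」, 「戦勝」, and 「式典」 are combined into the single accent phrase 「オーシューセンセンセンショーシ’キテン」. Intra-accent-phrase split positions are therefore set between 「オーシュー」 and 「センセン」, between 「センセン」 and 「センショー」, and between 「センショー」 and 「シ’キテン」.
[0082]
[Step ST103]
The prosody search unit 102 selects the first accent phrase (here, the accent phrase 「オーシュー│センセン│センショー│シ’キテン」).
[0083]
[Step ST110]
It is determined whether the selected accent phrase was obtained by accent combination of the words making up a compound word. If it was, the procedure goes to step ST107; if not, it goes to step ST104. The accent phrase 「オーシュー│センセン│センショー│シ’キテン」 was obtained by accent combination of the words making up a compound word, so here the procedure goes to step ST107.
[0084]
[Step ST107]
As shown in FIG. 16(b), the prosody search unit 102 searches for the prosody patterns matching the units obtained by splitting the accent phrase at the split positions 「│」 (the individual words of the compound word): 「オーシュー」, 「センセン」, 「センショー」, and 「シ’キテン」.
[0085]
[Step ST108]
As shown in FIG. 16(c), the prosody search unit 102 connects the prosody patterns retrieved in step ST107 and outputs prosody information.
[0086]
<Effect>
In the third embodiment, a prosody pattern is retrieved for the pronunciation of each word of the compound word and the patterns are connected. Even for a compound-word portion that has a relatively large mora count as a result of accent combination, prosody can be generated in which the pronunciation of each word is clearly distinguishable, so highly intelligible synthesized speech can be generated.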
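[0087]
(Fourth Embodiment)
The configuration of the text-to-speech synthesizer according to the fourth embodiment is the same as that of the first embodiment. The difference is that, when the prosody search unit 102 searches for prosody patterns, an externally supplied pause frequency parameter determines whether the accent phrase is split at the intra-phrase split positions set by the language analysis unit 101. The pause frequency parameter is, for example, a continuously varying value of 0 or more, where a larger value means that pauses are inserted more aggressively. When the pause frequency parameter is at or below a certain positive threshold, the split positions within accent phrases are ignored and prosody patterns are searched for in units of accent phrases; when it exceeds the threshold, prosody patterns are searched for in the units obtained by splitting at the split positions.
[0088]
The synthesizer of the fourth embodiment generates synthesized speech according to the flowchart shown in FIG. 17. Only the differences from the procedure of the first embodiment are described below.
[0089]
The text to be read aloud, 「そう思うかもしれません。」, is input (ST101). The phonetic symbol string 「ソー／オモ’ウカモ│シレマセン」 is obtained as the language analysis result for this text (ST102).
[0090]
In step ST111, the pause frequency parameter is compared with the threshold. When (pause frequency parameter) ≤ (threshold), the procedure goes to step ST104. When (pause frequency parameter) > (threshold), the procedure goes to step ST107.
[0091]
1. When (pause frequency parameter) ≤ (threshold)
As shown in FIG. 18, the 2-mora type-0 prosody pattern matching accent phrase 1 「ソー」 is retrieved from the prosody pattern database 103 (ST104). The 10-mora type-2 prosody pattern matching the next accent phrase 2 「オモ’ウカモ│シレマセン」 is then retrieved from the prosody pattern database 103 (ST105, ST106, ST104). These prosody patterns are connected to generate prosody information (ST108). In this way, prosody is generated in which accent phrases are uttered in units that are as long as possible, with few pauses.
[0092]
2. When (pause frequency parameter) > (threshold)
As shown in FIG. 19, the 2-mora type-0 prosody pattern matching accent phrase 1 「ソー」 is retrieved from the prosody pattern database 103 (ST107). For the next accent phrase 2 「オモ’ウカモ│シレマセン」, the 5-mora type-2 prosody pattern matching unit 1 「オモ’ウカモ」 and the 5-mora type-0 prosody pattern matching unit 2 「シレマセン」, both obtained by splitting at the split position, are retrieved from the prosody pattern database 103 (ST105, ST106, ST107). These prosody patterns are connected to generate prosody information (ST108). In this way, prosody is generated in which accent phrases are uttered in small units, with frequent pauses.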
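The threshold test of step ST111 only changes which search keys are sent to the database. A minimal sketch, assuming the same (mora count, accent type) keys as in the walkthrough; the function name `search_keys` and the numeric parameter values in the usage lines are hypothetical.

```python
from typing import List, Sequence, Tuple

Key = Tuple[int, int]  # (mora count, accent type)


def search_keys(
    phrases: Sequence[Tuple[Key, Sequence[Key]]],
    pause_frequency: float,
    threshold: float,
) -> List[Key]:
    """Each phrase is (key of the whole accent phrase, keys of its split units)."""
    keys: List[Key] = []
    for whole_key, unit_keys in phrases:
        if pause_frequency > threshold:
            keys.extend(unit_keys)   # case 2: search in split units
        else:
            keys.append(whole_key)   # case 1: ignore the split positions
    return keys


sentence = [((2, 0), [(2, 0)]), ((10, 2), [(5, 2), (5, 0)])]
print(search_keys(sentence, pause_frequency=0.2, threshold=0.5))  # → [(2, 0), (10, 2)]
print(search_keys(sentence, pause_frequency=0.8, threshold=0.5))  # → [(2, 0), (5, 2), (5, 0)]
```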
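[0093]
<Effect>
According to the fourth embodiment, when reading at normal speed or quickly, the pause frequency parameter is made small so that prosody is generated in units that are as long as possible, with few pitch rises; conversely, when reading slowly, prosody control that includes pitch rises within accent phrases can be performed. That is, prosody can be controlled appropriately for the desired reading speed.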
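[0094]
(Fifth Embodiment)
<Configuration of the text-to-speech synthesizer>
FIG. 20 shows the configuration of the text-to-speech synthesizer according to the fifth embodiment. The apparatus includes a language analysis unit 101, a prosody generation unit 301, and a waveform generation unit 104.
[0095]
The prosody generation unit 301 generates prosody information by rule, using the mora count and accent type of each accent phrase in the output of the language analysis unit 101, or of each unit obtained by splitting an accent phrase. The rule-based prosody generation of the prosody generation unit 301 can be realized, for example, with the Fujisaki model.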
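For reference, the Fujisaki model in its commonly cited form expresses the logarithm of the fundamental frequency as a baseline plus phrase and accent components. The source names the model only as an example and does not give these equations or say how split positions map onto phrase or accent commands, so the formulation below is background rather than the patent's own method. Here $F_b$ is the baseline frequency, $A_{p,i}$ and $T_{0,i}$ are the magnitude and onset time of the $i$-th phrase command, $A_{a,j}$, $T_{1,j}$, and $T_{2,j}$ are the amplitude, onset, and offset of the $j$-th accent command, and $\alpha$, $\beta$, $\gamma$ are model constants.

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{p,i}\, G_p(t - T_{0,i})
  + \sum_{j=1}^{J} A_{a,j}\, \bigl[ G_a(t - T_{1,j}) - G_a(t - T_{2,j}) \bigr]

G_p(t) =
  \begin{cases}
    \alpha^2 t\, e^{-\alpha t}, & t \ge 0 \\
    0, & t < 0
  \end{cases}
\qquad
G_a(t) =
  \begin{cases}
    \min\bigl[\, 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \,\bigr], & t \ge 0 \\
    0, & t < 0
  \end{cases}
```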
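[0096]
As in the first embodiment, such a speech synthesizer is built on a computer system like the one shown in FIG. 3. The language analysis unit 101, prosody generation unit 301, and waveform generation unit 104 are stored on a CD-ROM 508 loaded into the main unit 501, on a disk (memory) 505 built into the main unit 501, or on a disk 506 of another system connected via a line 507.
[0097]
<Procedure for generating synthesized speech>
The procedure by which the text-to-speech synthesizer configured as above generates synthesized speech is described with reference to FIG. 21.
[0098]
As shown in FIG. 22, the phonetic symbol string 「ソー／オモ’ウカモ│シレマセン」 is obtained as the language analysis result for the input text 「そう思うかもしれません。」 (ST101 to ST102).
[0099]
The prosody generation unit 301 selects the first accent phrase (here, accent phrase 1, 「ソー」) (ST103). The prosody generation unit 301 compares the mora count of the selected accent phrase 1 「ソー」 (2 moras) with a threshold (here, 5 moras) (ST301). Because the mora count of accent phrase 1 is no greater than the threshold, the prosody generation unit 301 generates the prosody of accent phrase 1 「ソー」 (2 moras, type 0) by rule (ST302; see FIG. 22).
[0100]
The prosody generation unit 301 determines whether prosody has been generated for all accent phrases (ST304). Here prosody has not yet been generated for accent phrase 2 「オモ’ウカモ│シレマセン」, so the procedure goes to step ST106.
[0101]
The prosody generation unit 301 selects the next accent phrase, accent phrase 2 「オモ’ウカモ│シレマセン」 (ST106). The prosody generation unit 301 compares the mora count of the selected accent phrase 2 (10 moras) with the threshold (5 moras) (ST301). Because the mora count of accent phrase 2 is greater than the threshold, the prosody generation unit 301 generates by rule the prosody of unit 1 「オモ’ウカモ」 (5 moras, type 2) and the prosody of unit 2 「シレマセン」 (5 moras, type 0), which are obtained by splitting accent phrase 2 at the split position 「│」 (ST303; see FIG. 22).
[0102]
Once prosody has been generated for all accent phrases, the prosody generation unit 301 connects the prosody generated in steps ST302 and ST303 and outputs prosody information, as shown in FIG. 22.
[0103]
<Effect>
According to the fifth embodiment, even for accent phrases with many moras, using the split positions as pitch rise points during prosody generation makes it possible to generate well-modulated, natural-sounding prosody.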
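[0104]
(Sixth Embodiment)
Consider the case where, in the text-to-speech synthesizer according to the fifth embodiment, a compound word that becomes a single accent phrase through accent combination is given as input text. Here, as shown in FIG. 23, the text 「欧州戦線戦勝式典」 is assumed to be given. This text is a compound word made up of the four words 「欧州」, 「戦線」, 「戦勝」, and 「式典」. Through accent combination, the phonetic symbol string of this text becomes a single 16-mora type-14 accent phrase. Then, as shown in FIG. 23, 16-mora type-14 prosody is generated by rule. However, this prosody is not necessarily a prosody pattern in which the boundaries between the pronunciations of the constituent words, 「オーシュー」, 「センセン」, 「センショー」, and 「シキテン」, are easy to hear. In addition, the sound 「セン」 occurs three times in a row, so the words are hard to make out by ear alone.
[0105]
In the sixth embodiment, therefore, when the words making up a compound-word portion of the input text combine into a single accent phrase through accent combination, split positions are set so that the phonetic symbols corresponding to each morpheme of the compound word form a unit.
[0106]
The configuration of the text-to-speech synthesizer according to the sixth embodiment is the same as that of the fifth embodiment. The synthesizer of the sixth embodiment generates synthesized speech according to the flowchart shown in FIG. 24. Only the differences from the procedure of the fifth embodiment are described below.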
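[0107]
[Step ST101]
The text to be read aloud is input. Here, as shown in FIG. 25, the text 「欧州戦線戦勝式典」 is given. This text is a compound word made up of the four words 「欧州」, 「戦線」, 「戦勝」, and 「式典」.
[0108]
[Step ST102]
The language analysis unit 101 analyzes the input text. For an accent phrase obtained by accent combination of the words making up a compound word, intra-accent-phrase split positions are set at the boundaries between those words. That is, an intra-accent-phrase split position is set for the pronunciation of each word of the compound word. Here, as shown in FIG. 25, the four words 「欧州」, 「戦線」, 「戦勝」, and 「式典」 are combined into the single accent phrase 「オーシューセンセンセンショーシ’キテン」. Intra-accent-phrase split positions are therefore set between 「オーシュー」 and 「センセン」, between 「センセン」 and 「センショー」, and between 「センショー」 and 「シ’キテン」.
[0109]
[Step ST103]
The prosody generation unit 301 selects the first accent phrase (here, the accent phrase 「オーシュー│センセン│センショー│シ’キテン」).
[0110]
[Step ST110]
It is determined whether the selected accent phrase was obtained by accent combination of the words making up a compound word. If it was, the procedure goes to step ST303; if not, it goes to step ST302. The accent phrase 「オーシュー│センセン│センショー│シ’キテン」 was obtained by accent combination of the words making up a compound word, so here the procedure goes to step ST303.
[0111]
[Step ST303]
As shown in FIG. 25, the prosody generation unit 301 generates by rule the prosody of the units obtained by splitting the accent phrase at the split positions 「│」 (the individual words of the compound word): 「オーシュー」, 「センセン」, 「センショー」, and 「シ’キテン」.
[0112]
[Step ST108]
As shown in FIG. 25, the prosody generation unit 301 connects the prosody generated in step ST303 and outputs prosody information.
[0113]
<Effect>
In the sixth embodiment, prosody is generated for the pronunciation of each word of the compound word and then connected. Even in an accent phrase formed by accent combination of a compound word, subtle pitch rises can therefore be controlled at the split positions, so synthesized speech can be generated in which each word of the compound word is pronounced clearly.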
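[0114]
(Seventh Embodiment)
The configuration of the text-to-speech synthesizer according to the seventh embodiment is the same as that of the fifth embodiment. The difference is that, when the prosody generation unit 301 generates prosody, an externally supplied pause frequency parameter determines whether the accent phrase is split at the intra-phrase split positions set by the language analysis unit 101. The pause frequency parameter is, for example, a continuously varying value of 0 or more, where a larger value means that pauses are inserted more aggressively. When the pause frequency parameter is at or below a certain positive threshold, the split positions within accent phrases are ignored and prosody is generated in units of accent phrases; when it exceeds the threshold, prosody is generated in the units obtained by splitting at the split positions.
[0115]
The synthesizer of the seventh embodiment generates synthesized speech according to the flowchart shown in FIG. 26. Only the differences from the procedure of the fifth embodiment are described below.
[0116]
The text to be read aloud, 「そう思うかもしれません。」, is input (ST101). The phonetic symbol string 「ソー／オモ’ウカモ│シレマセン」 is obtained as the language analysis result for this text (ST102).
[0117]
In step ST111, the pause frequency parameter is compared with the threshold. When (pause frequency parameter) ≤ (threshold), the procedure goes to step ST302. When (pause frequency parameter) > (threshold), the procedure goes to step ST303.
[0118]
1. When (pause frequency parameter) ≤ (threshold)
As shown in FIG. 27, the prosody generation unit 301 generates the prosody of accent phrase 1 「ソー」 (2 moras, type 0) (ST302). The prosody generation unit 301 then generates the prosody of the next accent phrase 2 「オモ’ウカモ│シレマセン」 (10 moras, type 2) (ST304, ST106, ST111, ST302). The prosody generation unit 301 connects these to generate prosody information (ST305). In this way, prosody is generated in which accent phrases are uttered in units that are as long as possible, with few pauses.
[0119]
2. When (pause frequency parameter) > (threshold)
As shown in FIG. 28, the prosody generation unit 301 generates the prosody of accent phrase 1 「ソー」 (2 moras, type 0) (ST303). For the next accent phrase 2 「オモ’ウカモ│シレマセン」, the prosody generation unit 301 generates the prosody of unit 1 「オモ’ウカモ」 (5 moras, type 2) and of unit 2 「シレマセン」 (5 moras, type 0), both obtained by splitting at the split position (ST304, ST106, ST111, ST303). The prosody generation unit 301 connects these to generate prosody information (ST305). In this way, prosody is generated in which accent phrases are uttered in small units, with frequent pauses.
[0120]
<Effect>
According to the seventh embodiment, when reading at normal speed or quickly, the pause frequency parameter is made small so that prosody is generated in units that are as long as possible, with few pitch rises; conversely, when reading slowly, prosody control that includes pitch rises within accent phrases can be performed. That is, prosody can be controlled appropriately for the desired reading speed.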
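[0121]
[Effects of the Invention]
Split positions are set within the accent phrases of the language analysis result so that the unit used for prosody pattern search can be chosen appropriately according to the conditions. This makes maximum use of the prosody patterns in the prosody pattern database and, because no alternative to prosody pattern search is needed for prosody generation, yields synthesized speech of consistently high quality. When prosody is generated by rule, well-modulated, easy-to-understand synthesized speech can be generated even for accent phrases with many moras. Prosody can also be controlled appropriately for the reading speed.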
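[Brief Description of the Drawings]
[FIG. 1A] A diagram for explaining a drawback of the conventional methods.
[FIG. 1B] A diagram for explaining a drawback of the conventional methods.
[FIG. 2] A block diagram showing the configuration of the text-to-speech synthesizer according to the first embodiment.
[FIG. 3] A diagram showing the configuration of a computer system that implements the text-to-speech synthesizer shown in FIG. 2.
[FIG. 4] A flowchart showing the procedure by which the text-to-speech synthesizer shown in FIG. 2 generates synthesized speech.
[FIG. 5] A diagram showing an example of input text and a phonetic symbol string.
[FIG. 6] (a) and (b) are diagrams showing how a prosody pattern is searched for each accent phrase.
[FIG. 7] (a) and (b) are diagrams showing how a prosody pattern is searched for each unit obtained by splitting an accent phrase at its split position.
[FIG. 8] A diagram showing how retrieved prosody patterns are connected to generate prosody information.
[FIG. 9] A flowchart showing the procedure for setting the reference mora count.
[FIG. 10] A diagram showing an example of grouping the prosody patterns in the prosody pattern database.
[FIG. 11] A flowchart showing the procedure for generating synthesized speech according to the second embodiment.
[FIG. 12] (a) shows an example of input text. (b) shows an example of a phonetic symbol string. (c) shows an example of setting intra-accent-phrase split positions.
[FIG. 13] A diagram showing how retrieved prosody patterns are connected to generate prosody information.
[FIG. 14] A diagram showing the procedure for generating prosody pattern information for compound-word input text.
[FIG. 15] A flowchart showing the procedure for generating synthesized speech according to the third embodiment.
[FIG. 16] (a) shows an example of input text. (b) shows an example of a phonetic symbol string obtained as the result of language analysis. (c) shows an example of the generated prosody information.
[FIG. 17] A flowchart showing the procedure for generating synthesized speech according to the fourth embodiment.
[FIG. 18] A diagram showing how prosody information is generated when (pause frequency parameter) ≤ (threshold).
[FIG. 19] A diagram showing how prosody information is generated when (pause frequency parameter) > (threshold).
[FIG. 20] A block diagram showing the configuration of the text-to-speech synthesizer according to the fifth embodiment.
[FIG. 21] A flowchart showing the procedure by which the text-to-speech synthesizer shown in FIG. 20 generates synthesized speech.
[FIG. 22] A diagram showing an example of text, a phonetic symbol string, prosody, and prosody information.
[FIG. 23] A diagram showing the procedure for generating prosody for compound-word input text.
[FIG. 24] A flowchart showing the procedure for generating synthesized speech according to the sixth embodiment.
[FIG. 25] A diagram showing an example of text, a phonetic symbol string, prosody, and prosody information.
[FIG. 26] A flowchart showing the procedure for generating synthesized speech according to the seventh embodiment.
[FIG. 27] A diagram showing how prosody information is generated when (pause frequency parameter) ≤ (threshold).
[FIG. 28] A diagram showing how prosody information is generated when (pause frequency parameter) > (threshold).
[0001]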
BACKGROUND OF THE INVENTION
The present invention relates to an apparatus for generating synthesized speech. More specifically, the present invention relates to a device that converts an input character string (text) into speech.
[0002]
[Prior art]
To generate natural synthesized speech, a method has been proposed in which speech features extracted from recorded speech data are accumulated in a database, and the acoustic parameters required for speech synthesis are obtained by combining units of speech features from that database. For example, one method (Japanese Patent Laid-Open No. 8-87297; referred to as method 1) holds a database of speech features indexed by text character strings or phonetic symbol strings and searches for the speech feature units that match the character string of the text to be synthesized, or the phonetic symbol string obtained by language processing of that text. Another method (US Patent 5,905,972; referred to as method 2) searches a database of fundamental frequency patterns indexed by phoneme patterns for the fundamental frequency patterns needed at synthesis time. Because these methods directly use speech features extracted from speech, they can, when used for prosody generation for example, produce more natural synthesized speech than model-based prosody generation.
[0003]
Speech synthesizers that generate prosody by rule also generally generate prosody information for each accent phrase of the phonetic symbol string output by language processing.
[0004]
[Patent Document 1]
JP-A-8-87297
[Patent Document 2]
US Pat. No. 5,905,972
[0005]
[Problems to be solved by the invention]
However, methods that retrieve the necessary speech features from such a database need an alternative means of generating prosody information when the prosody data required at synthesis time, that is, prosody data matching the text or phonetic symbol string to be synthesized, is not found in the database. For example, method 1 generates prosody by rule for portions that do not match any prosody pattern in the database. Method 2 likewise falls back on linear interpolation for portions that cannot be generated from the database.
[0006]
Therefore, in these methods, as shown in FIG. 1A, the generated prosody suffers from a gap in naturalness between the portions that match prosody data in the database and the portions generated by the alternative means. In general, when a prosody database is built from recordings of natural utterances, the amount of prosody data in the database decreases as the mora count increases, so it becomes difficult to adequately cover the required combinations of accent type and phonetic character string. In the prior art that uses alternative means, this problem tends to become more pronounced in portions with relatively many moras (for example, portions obtained by converting compound words consisting of consecutive nouns, or predicates accompanied by auxiliary verbs and particles, into phonetic character strings).
[0007]
Likewise, in a speech synthesizer that generates prosody by rule, as shown in FIG. 1B, it is difficult to generate stable, natural-sounding prosody for accent phrases with many moras, which is one reason the read-aloud speech becomes hard to listen to.
[0008]
Furthermore, in real speech, pauses and pitch resets become more frequent as the speaking rate slows, but conventional speech synthesizers cannot generate prosody by splitting a portion that language processing has output as a single accent phrase.
[0009]
An object of the present invention is to provide a speech synthesizer capable of generating natural synthesized speech.
[0010]
[Means for Solving the Problems]
According to one aspect of the present invention, a speech synthesizer includes a language analysis unit, a prosody pattern database, a prosody search unit, and a waveform generation unit. The language analysis unit outputs at least accent phrases from input text. The prosody pattern database stores prosody patterns extracted from real speech. The prosody search unit searches the prosody pattern database for the prosody patterns corresponding to the accent phrases and generates prosody information. The prosody search unit also splits a specific accent phrase, consisting of a single accent phrase, among the accent phrases, and generates prosody information by searching for prosody patterns in the split units. The waveform generation unit synthesizes a speech waveform from the prosody information.
[0011]
Preferably, the specific accent phrase is an accent phrase for which no corresponding prosody pattern exists in the prosody pattern database.
[0012]
Preferably, the specific accent phrase is an accent phrase whose mora count exceeds a reference value. The prosody patterns stored in the prosody pattern database are grouped by mora count, and the reference value is the largest mora count among those groups in which all accent types are present.
[0013]
Preferably, the specific accent phrase is an accent phrase corresponding to a compound word made up of several consecutive words. The prosody search unit splits the compound word into its constituent words and generates prosody information by searching for the prosody pattern corresponding to each word.
[0014]
Preferably, the specific accent phrase is an accent phrase that is selected when the configured pause frequency exceeds a reference value.
[0015]
According to another aspect of the present invention, a speech synthesizer includes a language analysis unit, a prosody generation unit, and a waveform generation unit. The language analysis unit outputs at least accent phrases from input text. The prosody generation unit generates prosody information for each accent phrase from predetermined rules. The prosody generation unit also splits a specific accent phrase, consisting of a single accent phrase, among the accent phrases, and generates prosody information from the rules in the split units. The waveform generation unit synthesizes a speech waveform from the prosody information.
[0016]
Preferably, the specific accent phrase is an accent phrase whose mora count exceeds a reference value.
[0017]
Preferably, the specific accent phrase is an accent phrase corresponding to a compound word made up of several consecutive words. The prosody generation unit generates prosody information from the rules for each word contained in the compound word.
[0018]
Preferably, the specific accent phrase is an accent phrase that is selected when the configured pause frequency exceeds a reference value.
[0019]
In a first aspect of the present invention, a speech synthesizer comprises: language analysis means that converts input text to be read aloud into a phonetic symbol string and can designate positions at which the phonetic symbol string is split within any accent phrase of the converted string; a prosody pattern database that stores speech features extracted from real speech as prosody pattern information; prosody search means that generates prosody information by searching the prosody pattern database for prosody information matching each accent phrase of the phonetic symbol string output by the language analysis means, or each unit obtained by splitting at the split positions within an accent phrase; and waveform synthesis means that synthesizes a speech waveform according to the prosody information generated by the prosody search means.
[0020]
When no prosody pattern matching a given accent phrase is found in the database, the search can be repeated in the units obtained by splitting the accent phrase at the split positions. Highly natural prosody can therefore be generated even for accent phrases with many moras, making maximum use of the prosody patterns in the database.
[0021]
In a second aspect of the present invention, in the first aspect, the language analysis means determines the split positions within accent phrases from the distribution of the features that distinguish prosody in prosody pattern search, such as the mora count and accent type of the prosody patterns stored in the prosody pattern database.
[0022]
Because the split positions within accent phrases are determined at the language processing stage with the characteristics of the prosody patterns in the database taken into account, a prosody pattern can be retrieved from the database at least for each unit obtained by splitting at the split positions. Highly natural prosody can thus be generated over the entire input sentence without any alternative prosody generation means.
[0023]
In a third aspect of the present invention, in the first aspect, the language processing means sets split positions at each morpheme of a compound word, within the accent phrase of the phonetic symbol string obtained by converting a compound-word portion of the input text in which nouns occur consecutively.
[0024]
By connecting prosody patterns that are searched for word by word, synthesized speech can be generated in which each word of the compound word is clearly audible.
[0025]
In the fourth aspect of the present invention, in the first aspect, whether or not the division position in the accent phrase set by the language processing means is handled as the division position by the parameter for controlling the pause frequency from the outside in the prosody search means. It was set as the structure which determines.
[0026]
If the parameter that specifies the pause frequency is set to include more pauses, the prosody pattern is searched in units divided by the split position in the accent phrase, and the parameter that specifies the pause frequency includes the pause as much as possible. If it is set to the other side, the prosody pattern can be controlled by ignoring the split position in the accent phrase and searching for the prosody pattern as if there is no split position. Can do.
[0027]
In the fifth aspect of the present invention, the speech synthesizer comprises: language analysis means for converting the input text to be read out into a phonetic symbol string and specifying a position at which the phonetic symbol string is divided within an arbitrary accent phrase of the converted phonetic symbol string; prosody generation means for generating prosodic information according to rules for each accent phrase of the phonetic symbol string output by the language analysis means, or for each unit divided at the division position within the accent phrase; and waveform generation means for synthesizing a speech waveform according to the prosodic information generated by the prosody generation means.
[0028]
Even for accent phrases with a large number of mora, using the division position as a pitch rise point when generating prosody yields prosody with clear modulation.
[0029]
According to a sixth aspect of the present invention, in the fifth aspect, the language processing means sets a division position within the accent phrase for each morpheme constituting a compound word, in the accent phrase of the phonetic symbol string obtained by converting a compound word part of the input text in which nouns occur in succession.
[0030]
Even in an accent phrase formed by accent combination of a compound word, a subtle pitch rise can be controlled at the division position. Therefore, even when prosody is generated by rules, a synthesized sound can be produced in which the pronunciation of each word constituting the compound word is clear.
[0031]
In the seventh aspect of the present invention, in the fifth aspect, the prosody generation means determines, according to a parameter that controls the pause frequency from the outside, whether or not the division position within the accent phrase set by the language processing means is treated as a division position.
[0032]
When the parameter that specifies the pause frequency is set so that more pauses are inserted, prosody control is performed using the division position within the accent phrase as a pitch rise point. When the parameter is set so that pauses are avoided as far as possible, prosody control that does not raise the pitch at the division position within the accent phrase is performed.
[0033]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In the drawings, the same or corresponding parts are denoted by the same reference numerals, and description thereof will not be repeated.
[0034]
(First embodiment)
<Configuration of text-to-speech synthesizer>
The configuration of the text-to-speech synthesizer according to the first embodiment is shown in FIG. This apparatus includes a language analysis unit 101, a prosody search unit 102, a prosody pattern database 103, and a waveform generation unit 104.
[0035]
The language analysis unit 101 performs language analysis on the input text and outputs, as information for generating a synthesized sound that reads out the text, an analysis result including a phonetic symbol string, accent information, accent phrase delimiter positions, and division positions within accent phrases.
[0036]
The prosodic pattern database 103 stores acoustic feature quantities extracted from real speech as prosodic patterns that can be searched using at least the number of mora and accent type, and combinations of phonetic symbol strings as search conditions. The prosodic pattern database 103 is a database in which prosodic patterns in units of accent phrases extracted from recorded speech can be searched by the number of accent phrase mora and accent positions (accent type).
[0037]
The prosody search unit 102 searches the prosodic pattern database 103 using, as a search key, the phonetic symbol string and accent type of each accent phrase output by the language analysis unit 101, and extracts a prosodic pattern matching the search key. The prosody search unit 102 extracts a prosodic pattern for each accent phrase, connects the patterns, and generates and outputs prosodic information for the whole sentence.
[0038]
The waveform generation unit 104 synthesizes a speech waveform according to the prosodic information generated by the prosody search unit 102 (generates synthesized speech).
[0039]
Such a text-to-speech synthesizer is constructed on a computer system as shown in FIG. 3, for example. This computer system includes a main unit 501, a keyboard 502, a display 503, an input device (mouse) 504, and a speaker 509, and is capable of text input and voice output. The language analysis unit 101, the prosody search unit 102, the prosodic pattern database 103, and the waveform generation unit 104 in FIG. 1 are stored on the CD-ROM 508 set in the main unit 501, on the disk (memory) 505 built into the main unit 501, or on the disk 506 of another system connected by the line 507.
[0040]
<Synthetic voice generation procedure>
A procedure for generating synthesized speech by the text-to-speech synthesizer configured as described above will be described with reference to FIG.
[0041]
[Step ST101]
The text to be read is entered. Text input is performed with an input device such as a keyboard or mouse, by reading a text file, or the like. Here, it is assumed that the text “You may think so” is entered.
[0042]
[Step ST102]
The language analysis of the input text is performed by the language analysis unit 101. As information for generating a synthesized sound of the input text, a language analysis result (phonetic symbol string) including reading information, accent information, accent phrase delimiter positions, and division positions within accent phrases is obtained. Here, as shown in FIG. 5, the phonetic symbol string “Sou / Omo'ukamo | Shiremasen” is obtained as the language analysis result for the given text “You may think so”. In this phonetic symbol string, the reading of the text is written in katakana, the accent position (accent nucleus) is indicated by the symbol “′”, the accent phrase delimiter is indicated by the symbol “/”, and the division position within an accent phrase is indicated by the symbol “|”. The division position “|” is inserted, within an accent phrase that language analysis has combined into a single accent phrase (here, accent phrase 2), at a position that is a phrase boundary in the language analysis (here, the boundary between phrase 2 and phrase 3).
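The notation described in step ST102 can be illustrated with a small parser. The sketch below is not taken from the patent: it assumes the reading is supplied in katakana, and the input string used in the example is an assumed katakana rendering of the example phonetic symbol string. The function names `count_mora`, `parse_unit`, and `parse_phonetic_string` are hypothetical. Mora are counted as one per kana, except small kana such as ャ, ュ, ョ, which merge with the preceding kana. The accent type is taken as the number of mora before the accent mark, or 0 when there is no mark.

```python
# Hypothetical parser for the "/", "|" and "'" notation; not the patent's implementation.
SMALL_KANA = set("ァィゥェォャュョヮ")  # merge with the preceding kana, so they add no mora


def count_mora(kana: str) -> int:
    """Count mora: every kana counts except the small combining ones."""
    return sum(1 for ch in kana if ch not in SMALL_KANA)


def parse_unit(text: str) -> dict:
    """Return reading, mora count and accent type for a string that may contain "'"."""
    head, mark, tail = text.partition("'")
    reading = head + tail
    accent_type = count_mora(head) if mark else 0  # type 0: no accent nucleus
    return {"reading": reading, "mora": count_mora(reading), "accent": accent_type}


def parse_phonetic_string(s: str) -> list:
    """Split on "/" into accent phrases and on "|" into division units."""
    phrases = []
    for phrase in s.replace(" ", "").split("/"):
        info = parse_unit(phrase.replace("|", ""))          # the whole accent phrase
        info["units"] = [parse_unit(u) for u in phrase.split("|")]
        phrases.append(info)
    return phrases


# Assumed katakana form of the example string.
result = parse_phonetic_string("ソウ / オモ'ウカモ | シレマセン")
for p in result:
    print(p["reading"], p["mora"], p["accent"],
          [(u["mora"], u["accent"]) for u in p["units"]])
```

For the example string this yields a 2-mora type 0 accent phrase, and a 10-mora type 2 accent phrase whose division units are 5-mora type 2 and 5-mora type 0, matching the values used in the steps that follow.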
[0043]
[Step ST103]
The prosody search unit 102 selects the first accent phrase (here, accent phrase 1 “Sou”).
[0044]
[Step ST104]
The prosody search unit 102 searches the prosodic pattern database 103 for a corresponding (matching) prosodic pattern, using the number of mora (2 mora) and accent type (type 0) of the selected accent phrase 1 “Sou” as the search condition. If a corresponding prosodic pattern exists, the process proceeds to step ST105; if not, the process proceeds to step ST107. Here, as shown in FIG. 6A, a 2-mora type 0 prosodic pattern corresponding to the accent phrase 1 “Sou” exists in the database 103, so the process proceeds to step ST105.
[0045]
[Step ST105]
The prosody search unit 102 determines whether all accent phrases have been searched. If it is determined that all the accent phrases have been searched, the process proceeds to step ST108, and if not, the process proceeds to step ST106. Here, since the accent phrase 2 has not been searched yet, the process proceeds to step ST106.
[0046]
[Step ST106]
The prosody search unit 102 selects the next accent phrase (in this case, accent phrase 2 “Omo'ukamo | Shiremasen”).
[0047]
[Step ST104]
The prosody search unit 102 searches the prosodic pattern database 103 for a corresponding prosodic pattern, using the number of mora (10 mora) and accent type (type 2) of the selected accent phrase 2 “Omo'ukamo | Shiremasen” as the search condition. Here, as shown in FIG. 6B, no 10-mora type 2 prosodic pattern corresponding to the accent phrase 2 “Omo'ukamo | Shiremasen” exists in the database 103. Therefore, the process proceeds to step ST107.
[0048]
[Step ST107]
The prosody search unit 102 searches for prosodic patterns corresponding to the unit 1 “Omo'ukamo” and the unit 2 “Shiremasen” obtained by dividing the accent phrase 2 at the division position “|”. As shown in FIG. 7A, the prosody search unit 102 searches the prosodic pattern database 103 for a corresponding (matching) prosodic pattern, using the number of mora (5 mora) and the accent type (type 2) of the unit 1 “Omo'ukamo” as the search condition. Here, a 5-mora type 2 prosodic pattern corresponding to the unit 1 “Omo'ukamo” exists in the database 103. Further, as shown in FIG. 7B, the prosody search unit 102 searches the prosodic pattern database 103 for a corresponding (matching) prosodic pattern, using the number of mora (5 mora) and the accent type (type 0) of the unit 2 “Shiremasen” as the search condition. Here, a 5-mora type 0 prosodic pattern corresponding to the unit 2 “Shiremasen” exists in the database 103. The process then proceeds to step ST105.
[0049]
[Step ST105]
Since all the accent phrases have been searched, the process proceeds to step ST108.
[0050]
[Step ST108]
As shown in FIG. 8, the prosody search unit 102 connects the prosodic patterns retrieved in steps ST104 and ST107 and outputs prosodic information. Since all of the prosodic information consists of prosodic patterns obtained from the prosodic pattern database 103, the audible discontinuity at the connections is smaller than in the conventional example.
[0051]
[Step ST109]
The waveform generation unit 104 synthesizes a speech waveform (generates synthesized speech) according to the prosodic information generated by the prosody search unit 102.
[0052]
<Effect>
In the text-to-speech synthesizer according to the first embodiment, the language analysis unit 101 sets a division position within the accent phrase. Therefore, the prosody search unit 102 can search for prosodic patterns in units with fewer mora than the accent phrase. This increases the proportion of units for which a matching prosodic pattern is found in the prosodic pattern database 103. As a result, it is possible to generate a synthesized sound having a high-quality prosody using the most natural prosodic information, extracted from real speech.
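The search loop of steps ST103 to ST108 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the dictionary `PATTERN_DB` is a hypothetical in-memory stand-in for the prosodic pattern database, the pattern names are placeholders, and raising `LookupError` when even a divided unit has no match is an assumption, since the first embodiment does not describe that case.

```python
# Hypothetical stand-in for the prosodic pattern database:
# keys are (mora count, accent type), values are placeholder pattern names.
PATTERN_DB = {
    (2, 0): "pattern_2mora_type0",
    (5, 2): "pattern_5mora_type2",
    (5, 0): "pattern_5mora_type0",
}


def search_prosody(accent_phrases, db):
    """Return the patterns to be connected in ST108, in utterance order."""
    connected = []
    for phrase in accent_phrases:                       # ST103 / ST106
        key = (phrase["mora"], phrase["accent"])
        if key in db:                                   # ST104: whole accent phrase matches
            connected.append(db[key])
            continue
        for unit in phrase["units"]:                    # ST107: fall back to divided units
            unit_key = (unit["mora"], unit["accent"])
            if unit_key not in db:
                # Assumption: the source does not say what happens here.
                raise LookupError(f"no pattern for {unit_key}")
            connected.append(db[unit_key])
    return connected


# The two accent phrases of the worked example.
phrases = [
    {"mora": 2, "accent": 0, "units": [{"mora": 2, "accent": 0}]},
    {"mora": 10, "accent": 2,
     "units": [{"mora": 5, "accent": 2}, {"mora": 5, "accent": 0}]},
]
print(search_prosody(phrases, PATTERN_DB))
```

Keying the lookup on mora count and accent type only mirrors the search condition used in the example. A fuller version would presumably also use the phonetic symbol string, which the database description allows as a search condition.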
[0053]
(Second Embodiment)
The configuration of the text-to-speech synthesizer according to the second embodiment is the same as the configuration of the apparatus shown in FIG. The difference from the apparatus according to the first embodiment is that a reference number of mora is set in advance and a prosodic pattern is searched according to this number of reference mora.
[0054]
<Setting the number of reference mora>
A procedure for setting the number of reference mora will be described with reference to FIG.
[0055]
[Step ST201]
The prosodic patterns stored in the prosodic pattern database 103 are divided into groups by number of mora. Here, as shown in FIG. 10, it is assumed that they are divided into a 2-mora group GR2, a 3-mora group GR3, a 4-mora group GR4, and a 5-mora group GR5.
[0056]
[Step ST202]
It is determined whether the prosodic patterns in each group cover all possible accent types. As shown in FIG. 10, the 2-mora group GR2 includes a type 0 prosodic pattern, a type 1 prosodic pattern, and a type 2 prosodic pattern. That is, the prosodic patterns included in the 2-mora group GR2 cover all the accent types possible for 2 mora. Similarly, the prosodic patterns included in the 3-mora group GR3 cover all the accent types possible for 3 mora, and the prosodic patterns included in the 4-mora group GR4 cover all the accent types possible for 4 mora. On the other hand, the 5-mora group GR5 includes only a type 0 prosodic pattern and a type 2 prosodic pattern; prosodic patterns of accent types 1, 3, 4, and 5 are not included. That is, the prosodic patterns included in the 5-mora group GR5 do not cover all the accent types possible for 5 mora.
[0057]
[Step ST203]
The maximum value of the number of mora of the group determined to cover all the accent types in step ST202 is set as the reference number of mora for setting the division position in the accent phrase. Here, the reference number of mora is set to 4 mora.
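Steps ST201 to ST203 amount to a coverage check per mora count. The sketch below illustrates this under the assumption, consistent with the example above, that an n-mora unit can take accent types 0 through n. The function name `reference_mora_number` and the return value `None` for a database with no fully covered group are hypothetical choices.

```python
from collections import defaultdict


def reference_mora_number(patterns):
    """patterns: iterable of (mora count, accent type) pairs present in the database."""
    groups = defaultdict(set)
    for mora, accent in patterns:                 # ST201: group by number of mora
        groups[mora].add(accent)
    covered = [m for m, types in groups.items()
               if types >= set(range(m + 1))]     # ST202: all of types 0..m present?
    return max(covered) if covered else None      # ST203: largest fully covered group


# Groups GR2 to GR4 cover every accent type; GR5 holds only types 0 and 2.
example = [(m, a) for m in (2, 3, 4) for a in range(m + 1)] + [(5, 0), (5, 2)]
print(reference_mora_number(example))  # 4
```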
[0058]
<Synthetic voice generation procedure>
Next, a synthetic speech generation procedure will be described with reference to FIG.
[0059]
[Step ST101]
The text to be read (here, the text “information processing apparatus”) is input (see FIG. 12A).
[0060]
[Step ST102]
The language analysis of the input text is performed by the language analysis unit 101. Here, as shown in FIGS. 12A and 12B, the phonetic symbol string “Johoshorisochi” is obtained as the language analysis result for the given text “information processing apparatus”.
[0061]
[Step ST301]
The first accent phrase is selected from the phonetic symbol string obtained in step ST102. Here, the accent phrase “Johoshorisochi” is selected.
[0062]
[Step ST302]
It is determined whether or not the number of mora of the selected accent phrase exceeds the reference mora number. If it does, the process proceeds to step ST303; if it does not, the process proceeds to step ST304. Here, the accent phrase “Johoshorisochi” has 9 mora, which exceeds the reference mora number of 4 mora. Therefore, the process proceeds to step ST303.
[0063]
[Step ST303]
A division position is set within the selected accent phrase. The division position is set so that the number of mora in each unit (division unit) obtained by dividing the accent phrase at the division position is equal to or less than the reference mora number. The division position is inserted at a word boundary. Here, as shown in FIG. 12C, division positions are set between the word “Joho” and the word “Shori”, and between the word “Shori” and the word “Sochi”. The unit 1 “Joho”, unit 2 “Shori”, and unit 3 “Sochi” obtained by dividing the accent phrase at these division positions have 4 mora, 2 mora, and 3 mora respectively, each of which is equal to or less than the reference mora number (4 mora).
[0064]
[Step ST304]
It is determined whether or not the determination in step ST302 has been made for all accent phrases of the phonetic symbol string obtained in step ST102. If the determination in step ST302 is made for all accent phrases, the process proceeds to step ST306. Otherwise, the process proceeds to step ST305, and after the next accent phrase is selected, the process returns to step ST302. Here, since all the accent phrases are determined in step ST302, the process proceeds to step ST306.
[0065]
[Step ST306]
The first accent phrase is selected from the phonetic symbol string obtained in step ST102. If a division position was set in the accent phrase in step ST303, the first division unit is selected. Here, the (division) unit 1 “Joho” is selected.
[0066]
[Step ST307]
The prosody search unit 102 searches the prosodic pattern database 103 for the corresponding prosodic pattern, using the number of mora and the accent type of the selected accent phrase or division unit as the search condition. Here, as shown in FIG. 13, a 4-mora type 0 prosodic pattern corresponding to the unit 1 “Joho” is retrieved.
[0067]
[Step ST308]
It is determined whether or not corresponding prosodic patterns have been searched for all accent phrases / dividing units. If all the items have been searched, the process proceeds to step ST108, and if not, the process proceeds to step ST309. Here, since the search for the unit 2 “Shori” and the unit 3 “Sochi” has not been performed yet, the process proceeds to Step ST309.
[0068]
[Step ST309]
The next accent phrase or division unit is selected, and the process returns to step ST307. Here, the (division) unit 2 “Shori” is selected.
[0069]
[Step ST307]
As shown in FIG. 13, a 2-mora 0 type prosodic pattern corresponding to the unit 2 “Shori” is searched.
[0070]
[Step ST308]
Since the search for the unit 3 “Sochi” has not been performed yet, the process proceeds to Step ST309.
[0071]
[Step ST309]
(Division) Unit 3 “Sochi” is selected.
[0072]
[Step ST307]
As shown in FIG. 13, a 3 mora type 1 prosodic pattern corresponding to the unit 3 “Sochi” is searched.
[0073]
[Step ST308]
Since prosodic patterns corresponding to all accent phrases and division units have been searched, the process proceeds to step ST108.
[0074]
[Step ST108]
As shown in FIG. 13, the prosodic search unit 102 connects the prosodic patterns searched in step ST307 and outputs prosodic information.
[0075]
[Step ST109]
The waveform generation unit 104 synthesizes a speech waveform (generates synthesized speech) according to the prosodic information generated by the prosody search unit 102.
[0076]
<Effect>
In the second embodiment, for accent phrases that exceed the reference mora number, division positions are set and prosodic patterns are searched for each division unit. Therefore, a matching prosodic pattern is always found in the prosodic pattern database 103. This eliminates the need to generate prosody by alternative means. As a result, a synthesized sound having a high-quality prosody over the entire text to be read can be generated.
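The division of step ST303 can be sketched as below. The source only requires that division positions fall on word boundaries and that every resulting unit has at most the reference number of mora. Merging consecutive words greedily while they still fit is an assumption made for this illustration, as are the function name `divide_at_word_boundaries` and the error raised for a single word that is longer than the reference.

```python
def divide_at_word_boundaries(word_moras, reference):
    """Group consecutive words into units of at most `reference` mora.

    word_moras: mora count of each word in the accent phrase, in order.
    Returns a list of units, each a list of word mora counts.
    """
    units, current = [], []
    for mora in word_moras:
        if mora > reference:
            # Assumption: the source does not cover a word longer than the reference.
            raise ValueError("a single word exceeds the reference mora number")
        if current and sum(current) + mora > reference:
            units.append(current)      # close the unit at this word boundary
            current = []
        current.append(mora)
    if current:
        units.append(current)
    return units


# "Joho" (4 mora), "Shori" (2 mora), "Sochi" (3 mora), reference mora number 4.
print(divide_at_word_boundaries([4, 2, 3], 4))  # [[4], [2], [3]]
```

With the worked example every word ends up in its own unit, because "Shori" and "Sochi" together would have 5 mora and exceed the reference of 4.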
[0077]
(Third embodiment)
Consider a case where, in the text-to-speech synthesizer according to the first embodiment, a compound word that becomes a single accent phrase through accent combination is given as input text. Here, as shown in FIG. 14, the text “European Front Victory Ceremony” is given. This text is a compound word composed of the four words “Europe”, “front line”, “war victory”, and “ceremony”. Through accent combination, the phonetic symbol string of this text becomes a single 16-mora type 14 accent phrase. Then, as shown in FIG. 14, a 16-mora type 14 prosodic pattern is retrieved from the prosodic pattern database 103 to generate prosodic information. However, this prosodic information is not necessarily a prosodic pattern in which the pronunciation of each word constituting the compound word, “Oshu”, “Sensen”, “Sensho”, and “Shikiten”, is easy to make out. In addition, the sound “sen” occurs three times in succession, so the words are hard to identify by ear alone. If prosodic patterns were instead classified by the number of mora of the constituent words, the number of combinations would grow on the order of the factorial of the number of mora, and the prosodic pattern database 103 would become too large.
[0078]
Therefore, in the third embodiment, when the words constituting a compound word in the input text become a single accent phrase as a result of accent combination, division positions are set so that the phonetic symbols corresponding to each morpheme constituting the compound word form one unit.
[0079]
The configuration of the text-to-speech synthesizer according to the third embodiment is the same as that of the apparatus according to the first embodiment. The procedure for generating synthesized speech by the text-to-speech synthesizer according to the third embodiment is performed according to the flowchart shown in FIG. Hereinafter, differences from the synthetic speech generation procedure according to the first embodiment will be described.
[0080]
[Step ST101]
The text to be read is entered. Here, as shown in FIG. 16A, the text “European Front Victory Ceremony” is given. This text is a compound word composed of the four words “Europe”, “front line”, “war victory”, and “ceremony”.
[0081]
[Step ST102]
The language analysis of the input text is performed by the language analysis unit 101. For an accent phrase obtained by accent combination of the words constituting a compound word, a division position within the accent phrase is set at each boundary between the words constituting the compound word. That is, a division position is set for each unit of pronunciation of a word constituting the compound word. Here, as shown in FIG. 16 (b), the four words “Europe”, “front line”, “war victory”, and “ceremony” that compose the compound word are accent-combined to form the single accent phrase “Oshusensensenshoshi'kiten”. Division positions within the accent phrase are therefore set between the words “Oshu” and “Sensen”, between the words “Sensen” and “Sensho”, and between the words “Sensho” and “Shikiten”.
[0082]
[Step ST103]
The prosodic search unit 102 selects the first accent phrase (in this case, the accent phrase “Oshu | sensen | sensho | shi'kiten”).
[0083]
[Step ST110]
It is determined whether or not the selected accent phrase is an accent phrase obtained by accent combination of the words constituting a compound word. If it is, the process proceeds to step ST107; if not, the process proceeds to step ST104. Here, since the accent phrase “Oshu | sensen | sensho | shi'kiten” is an accent phrase obtained by accent combination of the words constituting a compound word, the process proceeds to step ST107.
[0084]
[Step ST107]
As shown in FIG. 16 (b), the prosody search unit 102 searches for prosodic patterns corresponding to the units (the words constituting the compound word) “Oshu”, “Sensen”, “Sensho”, and “Shikiten” obtained by dividing the accent phrase at the division positions “|”.
[0085]
[Step ST108]
As shown in FIG. 16C, the prosodic search unit 102 connects the prosodic patterns searched in step ST107 and outputs prosodic information.
[0086]
<Effect>
In the third embodiment, prosodic patterns are searched for and connected for each unit of pronunciation of the compound word. Therefore, even for a compound word part whose number of mora has become relatively large as a result of accent combination, it is possible to generate a highly intelligible synthesized sound whose prosody clearly marks the unit of pronunciation of each word.
[0087]
(Fourth embodiment)
The configuration of the text-to-speech synthesizer according to the fourth embodiment is the same as that of the apparatus according to the first embodiment. The difference is that, when the prosody search unit 102 searches for prosodic patterns, whether or not the accent phrase is divided at the division position set by the language analysis unit 101 and searched in the divided units is determined by a pause frequency parameter given from the outside. The pause frequency parameter is, for example, a continuously variable value of 0 or more; a larger value means that pauses are inserted more actively. When the pause frequency parameter is at or below a certain positive threshold, the division position within the accent phrase is ignored and the prosodic pattern is searched for each accent phrase; when the pause frequency parameter exceeds the threshold, the prosodic pattern is searched for each unit divided at the division position within the accent phrase.
[0088]
The procedure for generating synthesized speech by the text-to-speech synthesizer according to the fourth embodiment is performed according to the flowchart shown in FIG. Hereinafter, differences from the synthetic speech generation procedure according to the first embodiment will be described.
[0089]
The text “You may think so” is input (ST101). The phonetic symbol string “Sou / Omo'ukamo | Shiremasen” is obtained as the language analysis result for the given text “You may think so” (ST102).
[0090]
In step ST111, the pause frequency parameter is compared with the threshold value. When (pause frequency parameter) ≦ (threshold value), the process proceeds to step ST104. When (pause frequency parameter) > (threshold value), the process proceeds to step ST107.
[0091]
1. When (pause frequency parameter) ≤ (threshold)
As shown in FIG. 18, a 2-mora type 0 prosodic pattern corresponding to accent phrase 1 “Sou” is retrieved from the prosodic pattern database 103 (ST104). A 10-mora type 2 prosodic pattern corresponding to the next accent phrase 2 “Omo'ukamo | Shiremasen” is retrieved from the prosodic pattern database 103 (ST105, ST106, ST104). Prosodic information is generated by connecting these prosodic patterns (ST108). In this way, a prosody is generated in which accent phrases are uttered in units that are as long as possible, with few pauses.
[0092]
2. When (pause frequency parameter)> (threshold)
As shown in FIG. 19, a 2-mora type 0 prosodic pattern corresponding to accent phrase 1 “Sou” is retrieved from the prosodic pattern database 103 (ST107). For the next accent phrase 2 “Omo'ukamo | Shiremasen”, a 5-mora type 2 prosodic pattern corresponding to unit 1 “Omo'ukamo” and a 5-mora type 0 prosodic pattern corresponding to unit 2 “Shiremasen”, obtained by dividing at the division position, are retrieved from the prosodic pattern database 103 (ST105, ST106, ST107). Prosodic information is generated by connecting these prosodic patterns (ST108). In this way, a prosody is generated in which accent phrases are uttered in small units, with frequent pauses.
[0093]
<Effect>
According to the fourth embodiment, when reading at normal or fast speed, the pause frequency parameter is made small, so that the units for which prosody is generated are as long as possible and prosody control with few pitch rises is performed. When reading slowly, prosody control with pitch rises even within an accent phrase can be performed. That is, appropriate prosody control can be performed according to the desired reading speed.
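The branch at step ST111 can be sketched in a few lines. This is an illustration only: the function name `select_search_units`, the dictionary layout, and the threshold value 0.5 are hypothetical, since the source states only that the threshold is some positive value.

```python
def select_search_units(phrase, pause_frequency, threshold):
    """ST111: honour the "|" division positions only above the threshold.

    phrase["whole"] is the (mora, accent type) of the undivided accent phrase;
    phrase["units"] lists the (mora, accent type) of each divided unit.
    """
    if pause_frequency > threshold:
        return phrase["units"]       # search per divided unit (ST107)
    return [phrase["whole"]]         # ignore "|" and search the whole phrase (ST104)


# Accent phrase 2 of the worked example; 0.5 is an arbitrary illustrative threshold.
phrase2 = {"whole": (10, 2), "units": [(5, 2), (5, 0)]}
print(select_search_units(phrase2, pause_frequency=0.2, threshold=0.5))  # [(10, 2)]
print(select_search_units(phrase2, pause_frequency=0.8, threshold=0.5))  # [(5, 2), (5, 0)]
```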
[0094]
(Fifth embodiment)
<Configuration of text-to-speech synthesizer>
The configuration of a text-to-speech synthesizer according to the fifth embodiment is shown in FIG. This apparatus includes a language analysis unit 101, a prosody generation unit 301, and a waveform generation unit 104.
[0095]
The prosody generation unit 301 generates prosodic information according to rules, using the mora number and accent type information, for each accent phrase of the language analysis result output by the language analysis unit 101 or for each unit obtained by dividing an accent phrase. Rule-based prosody generation in the prosody generation unit 301 can be realized by, for example, the Fujisaki model.
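For orientation, the Fujisaki model named above expresses the logarithm of the fundamental frequency as a baseline plus the responses to phrase commands and accent commands. The sketch below follows the commonly cited form of the model and is not taken from the patent. The constants `ALPHA`, `BETA`, and `GAMMA` are typical textbook values used here as assumptions, and the command timings in the example are invented for illustration. How the prosody generation unit 301 would actually map accent phrases or divided units onto commands is not specified in the source.

```python
import math

# Typical values from the literature; treat them as assumptions.
ALPHA, BETA, GAMMA = 3.0, 20.0, 0.9


def phrase_response(t):
    """Impulse response of the phrase control mechanism."""
    return ALPHA ** 2 * t * math.exp(-ALPHA * t) if t >= 0 else 0.0


def accent_response(t):
    """Step response of the accent control mechanism, with ceiling GAMMA."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + BETA * t) * math.exp(-BETA * t), GAMMA)


def f0(t, base_hz, phrase_cmds, accent_cmds):
    """F0 in Hz at time t (seconds).

    phrase_cmds: list of (onset, amplitude)
    accent_cmds: list of (onset, offset, amplitude)
    """
    log_f0 = math.log(base_hz)
    log_f0 += sum(a * phrase_response(t - t0) for t0, a in phrase_cmds)
    log_f0 += sum(a * (accent_response(t - t1) - accent_response(t - t2))
                  for t1, t2, a in accent_cmds)
    return math.exp(log_f0)


# Invented timings: one phrase command and two accent commands,
# the second starting where a division position might fall.
phrase_cmds = [(0.0, 0.5)]
accent_cmds = [(0.10, 0.45, 0.4), (0.60, 0.90, 0.3)]
contour = [f0(i * 0.05, 100.0, phrase_cmds, accent_cmds) for i in range(24)]
```

In such a formulation, one plausible way to realize a pitch rise at a division position is to start a new accent command there, as the second command in the example does.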
[0096]
Such a speech synthesizer is constructed on a computer system as shown in FIG. 3, as in the first embodiment. The language analysis unit 101, prosody generation unit 301, and waveform generation unit 104 are stored on the CD-ROM 508 set in the main unit 501, on the disk (memory) 505 built into the main unit 501, or on the disk 506 of another system connected via the line 507.
[0097]
<Synthetic voice generation procedure>
A procedure for generating synthesized speech by the text-to-speech synthesizer configured as described above will be described with reference to FIG.
[0098]
As shown in FIG. 22, the phonetic symbol string “Sou / Omo'ukamo | Shiremasen” is obtained as the language analysis result for the input text “You may think so” (ST101 to ST102).
[0099]
The prosody generation unit 301 selects the first accent phrase (here, accent phrase 1 “Sou”) (ST103). The prosody generation unit 301 compares the number of mora (2 mora) of the selected accent phrase 1 “Sou” with a threshold (here assumed to be 5 mora) (ST301). Since the number of mora of the selected accent phrase 1 is less than or equal to the threshold, the prosody generation unit 301 generates the prosody of the accent phrase 1 “Sou” (2 mora, type 0) based on the rules (ST302, see FIG. 22).
[0100]
The prosody generation unit 301 determines whether or not prosody has been generated for all accent phrases (ST304). Here, since the prosody for the accent phrase 2 “Omo'ukamo | Shiremasen” has not yet been generated, the process proceeds to step ST106.
[0101]
The prosody generation unit 301 selects the next accent phrase 2 “Omo'ukamo | Shiremasen” (ST106). The prosody generation unit 301 compares the number of mora (10 mora) of the selected accent phrase 2 “Omo'ukamo | Shiremasen” with the threshold (5 mora) (ST301). Since the number of mora of the selected accent phrase 2 is larger than the threshold, the prosody generation unit 301 generates, based on the rules, the prosody of the unit 1 “Omo'ukamo” (5 mora, type 2) and the unit 2 “Shiremasen” (5 mora, type 0) obtained by dividing the accent phrase 2 at the division position “|” (ST303, see FIG. 22).
[0102]
When the prosody is generated for all accent phrases, as shown in FIG. 22, the prosody generation unit 301 connects the prosody generated in steps ST302 and ST303 and outputs prosodic information.
[0103]
<Effect>
According to the fifth embodiment, even in an accent phrase with a large number of mora, the division position is used as a pitch rise point when generating prosody, so that prosody with clear modulation and little unnaturalness can be generated.
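The unit selection of steps ST301 to ST303 can be summarized as follows. This is a sketch under stated assumptions: the function name `generation_units` and the data layout are hypothetical, and the default threshold of 5 mora is simply the value assumed in the worked example. The rule-based generator itself is left out.

```python
def generation_units(phrase, mora_threshold=5):
    """ST301: divide at "|" only when the accent phrase exceeds the threshold.

    Returns the (mora, accent type) pairs for which prosody is generated by rule.
    """
    if phrase["mora"] > mora_threshold:
        return list(phrase["units"])                 # ST303: one prosody per divided unit
    return [(phrase["mora"], phrase["accent"])]      # ST302: one prosody for the phrase


# The two accent phrases of the worked example.
phrase1 = {"mora": 2, "accent": 0, "units": [(2, 0)]}
phrase2 = {"mora": 10, "accent": 2, "units": [(5, 2), (5, 0)]}
sentence_units = generation_units(phrase1) + generation_units(phrase2)
print(sentence_units)  # [(2, 0), (5, 2), (5, 0)]
```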
[0104]
(Sixth embodiment)
Consider a case where, in the text-to-speech synthesizer according to the fifth embodiment, a compound word that becomes a single accent phrase through accent combination is given as input text. Here, as shown in FIG. 23, the text “European Front Victory Ceremony” is given. This text is a compound word composed of the four words “Europe”, “front line”, “war victory”, and “ceremony”. Through accent combination, the phonetic symbol string of this text becomes a single 16-mora type 14 accent phrase. Then, as shown in FIG. 23, a 16-mora type 14 prosody is generated based on the rules. However, this prosody is not necessarily a prosodic pattern in which the pronunciation of each word constituting the compound word, “Oshu”, “Sensen”, “Sensho”, and “Shikiten”, is easy to make out. In addition, the sound “sen” occurs three times in succession, so the words are hard to identify by ear alone.
[0105]
Therefore, in the sixth embodiment, when the words constituting a compound word in the input text become a single accent phrase as a result of accent combination, division positions are set so that the phonetic symbols corresponding to each morpheme constituting the compound word form one unit.
[0106]
The configuration of the text-to-speech synthesizer according to the sixth embodiment is the same as that of the apparatus according to the fifth embodiment. The procedure for generating synthesized speech by the text-to-speech synthesizer according to the sixth embodiment is performed according to the flowchart shown in FIG. Hereinafter, differences from the synthetic speech generation procedure according to the fifth embodiment will be described.
[0107]
[Step ST101]
The text to be read is entered. Here, as shown in FIG. 25, the text “European Front Victory Ceremony” is given. This text is a compound word composed of the four words “Europe”, “front line”, “war victory”, and “ceremony”.
[0108]
[Step ST102]
The language analysis of the input text is performed by the language analysis unit 101. For an accent phrase obtained by accent combination of the words constituting a compound word, a division position within the accent phrase is set at each boundary between the words constituting the compound word. That is, a division position is set for each unit of pronunciation of a word constituting the compound word. In this case, as shown in FIG. 25, the four words “Europe”, “front line”, “war victory”, and “ceremony” constituting the compound word are accent-combined to form the single accent phrase “Oshusensensenshoshi'kiten”. Division positions within the accent phrase are therefore set between the words “Oshu” and “Sensen”, between the words “Sensen” and “Sensho”, and between the words “Sensho” and “Shikiten”.
[0109]
[Step ST103]
The prosody generation unit 301 selects the first accent phrase (in this example, the accent phrase “Oshu | sensen | sensho | shi'kiten”).
[0110]
[Step ST110]
It is determined whether or not the selected accent phrase is an accent phrase obtained by accent combination of the words constituting a compound word. If it is, the process proceeds to step ST303; otherwise, the process proceeds to step ST302. Here, since the accent phrase “Oshu | sensen | sensho | shi'kiten” is an accent phrase obtained by accent combination of the words constituting a compound word, the process proceeds to step ST303.
[0111]
[Step ST303]
As shown in FIG. 25, the prosody generation unit 301 generates, by rule, the prosody of each unit obtained by dividing the accent phrase at the delimiter positions “|” (that is, each word constituting the compound word): “Oshu”, “sensen”, “sensho”, and “shi'kiten”.
[0112]
[Step ST108]
As shown in FIG. 25, the prosody generation unit 301 connects the prosody generated in step ST303 and outputs prosodic information.
[0113]
<Effect>
In the sixth embodiment, a prosody is generated for each word constituting a compound word, and the generated prosodies are connected. As a result, even in an accent phrase formed by accent concatenation of a compound word, the subtle pitch rise at each division position can be controlled, so that synthesized speech in which each word of the compound word is clearly pronounced can be generated.
[0114]
(Seventh embodiment)
The configuration of the text-to-speech synthesizer according to the seventh embodiment is the same as that of the device according to the fifth embodiment. The difference is that, when the prosody generation unit 301 generates a prosody, whether or not to divide the accent phrase at the intra-accent-phrase division positions set by the language analysis unit 101 is determined by a pause frequency parameter given from the outside. The pause frequency parameter is, for example, a continuously varying value of 0 or more; the larger the value, the more readily pauses are inserted. When the pause frequency parameter does not exceed a certain positive threshold, the prosody is generated in accent phrase units, ignoring the division positions in the accent phrase. When it exceeds the threshold, the prosody is generated in the units obtained by dividing the accent phrase at the division positions.
[0115]
Synthesized speech is generated by the text-to-speech synthesizer according to the seventh embodiment in accordance with the flowchart shown in FIG. 26. Only the differences from the synthesized speech generation procedure of the fifth embodiment are described below.
[0116]
The text “Sou omou kamo shiremasen” (roughly, “one might think so”) is input (ST101). The phonetic symbol string “Sou / omo'ukamo | shiremasen” is obtained as the language analysis result for this text (ST102).
[0117]
In step ST111, the pause frequency parameter is compared with the threshold value. When (pause frequency parameter) ≦ (threshold value), the process proceeds to step ST302. If (pause frequency parameter)> (threshold value), the process proceeds to step ST303.
[0118]
1. When (pause frequency parameter) ≤ (threshold)
As shown in FIG. 27, the prosody generation unit 301 generates the prosody of accent phrase 1 “Sou” (2 morae, type 0) (ST302). It then generates the prosody of the next accent phrase 2 “omo'ukamo | shiremasen” (10 morae, type 2) as a single unit (ST304, ST106, ST111, ST302). The prosody generation unit 301 connects these prosodies and generates prosodic information (ST305). In this way, a prosody is generated in which each accent phrase is uttered in as long a unit as possible and pauses are infrequent.
[0119]
2. When (pause frequency parameter)> (threshold)
As shown in FIG. 28, the prosody generation unit 301 generates the prosody of accent phrase 1 “Sou” (2 morae, type 0) (ST303). For the next accent phrase 2 “omo'ukamo | shiremasen”, the prosody generation unit 301 generates the prosody of each unit obtained by dividing at the delimiter position: unit 1 “omo'ukamo” (5 morae, type 2) and unit 2 “shiremasen” (5 morae, type 0) (ST304, ST106, ST111, ST303). The prosody generation unit 301 connects these prosodies and generates prosodic information (ST305). In this way, a prosody is generated in which accent phrases are uttered in small units and pauses are frequent.
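The two cases above differ only in how the prosody generation units are chosen at step ST111. The following is a minimal sketch of that selection. The function names and the numeric parameter values are hypothetical, and the example string is written here in a normalized romanization, with “/” separating accent phrases and “|” marking a division position.

```python
from typing import List


def parse_accent_phrases(phonetic: str) -> List[str]:
    """Split a phonetic symbol string into accent phrases at '/'."""
    return [p.strip() for p in phonetic.split("/") if p.strip()]


def prosody_units(
    accent_phrases: List[str], pause_frequency: float, threshold: float
) -> List[str]:
    """Return, in order, the units for which prosody is generated."""
    units: List[str] = []
    for phrase in accent_phrases:
        parts = [p.strip() for p in phrase.split("|") if p.strip()]
        if pause_frequency > threshold:
            # ST303: generate prosody per divided unit.
            units.extend(parts)
        else:
            # ST302: ignore division positions, use the whole accent phrase.
            units.append("".join(parts))
    return units


phrases = parse_accent_phrases("Sou / omo'ukamo | shiremasen")
print(prosody_units(phrases, pause_frequency=0.2, threshold=0.5))
print(prosody_units(phrases, pause_frequency=0.8, threshold=0.5))
```

A parameter exactly equal to the threshold falls into the undivided case, matching the condition (pause frequency parameter) ≤ (threshold value).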
[0120]
<Effect>
According to the seventh embodiment, when text is read at normal speed or faster, the pause frequency parameter is set low, so that the units of prosody generation are made as long as possible and the prosody is controlled with fewer pitch rises. When text is read slowly, prosody control with pitch rises can be performed even within an accent phrase. That is, prosody control appropriate to the desired reading speed can be performed.
[0121]
[Effects of the invention]
Division positions are set within the accent phrases of the language analysis result so that the prosodic pattern search unit can be selected appropriately according to conditions. As a result, synthesized speech of consistently high quality can be generated by making maximum use of the prosodic patterns in the prosodic pattern database, without requiring prosody generation by any means other than prosodic pattern search. In addition, even when prosody is generated by rule, clear and easy-to-understand synthesized speech can be generated even for accent phrases with a large number of morae. Prosody control can also be performed appropriately according to the reading speed.
[Brief description of the drawings]
FIG. 1A is a diagram for explaining inconveniences in a conventional method.
FIG. 1B is a diagram for explaining inconveniences in a conventional method.
FIG. 2 is a block diagram showing a configuration of a text-to-speech synthesizer according to the first embodiment.
FIG. 3 is a diagram showing a configuration of a computer system that realizes the text-to-speech synthesizer shown in FIG. 1.
FIG. 4 is a flowchart showing a procedure for generating synthesized speech by the text-to-speech synthesizer shown in FIG. 1.
FIG. 5 is a diagram illustrating an example of an input text and a phonetic symbol string.
FIGS. 6A and 6B are diagrams showing a state in which prosodic patterns are searched for each accent phrase.
FIGS. 7A and 7B are diagrams showing a state in which prosodic patterns are searched for each unit divided at an accent phrase delimiter position.
FIG. 8 is a diagram illustrating a state in which prosodic information is generated by connecting searched prosodic patterns.
FIG. 9 is a flowchart showing a procedure for setting the reference number of morae.
FIG. 10 is a diagram showing an example in which prosodic patterns in a prosodic pattern database are grouped.
FIG. 11 is a flowchart showing a procedure for generating synthesized speech according to the second embodiment.
FIG. 12A shows an example of input text. FIG. 12B shows an example of a phonetic symbol string. FIG. 12C shows an example of setting division positions within an accent phrase.
FIG. 13 is a diagram illustrating a state in which prosodic information is generated by connecting searched prosodic patterns.
FIG. 14 is a diagram showing a procedure for generating prosodic pattern information for an input text of a compound word.
FIG. 15 is a flowchart showing a procedure for generating synthesized speech according to the third embodiment.
FIG. 16A shows an example of input text. FIG. 16B shows an example of a phonetic symbol string obtained as a result of language analysis. FIG. 16C shows an example of the prosodic information to be generated.
FIG. 17 is a flowchart showing a synthetic speech generation procedure according to the fourth embodiment.
FIG. 18 is a diagram showing how prosodic information is generated when (pause frequency parameter) ≦ (threshold value).
FIG. 19 is a diagram showing how prosodic information is generated when (pause frequency parameter)> (threshold value).
FIG. 20 is a block diagram showing a configuration of a text-to-speech synthesizer according to a fifth embodiment.
FIG. 21 is a flowchart showing a procedure for generating synthesized speech by the text-to-speech synthesizer shown in FIG. 20.
FIG. 22 is a diagram showing examples of text, phonetic symbol strings, prosody, and prosody information.
FIG. 23 is a diagram showing a procedure for generating a prosody of an input text of a compound word.
FIG. 24 is a flowchart showing a synthetic speech generation procedure according to the sixth embodiment.
FIG. 25 is a diagram showing examples of text, phonetic symbol strings, prosody, and prosody information.
FIG. 26 is a flowchart showing a synthetic speech generation procedure according to the seventh embodiment.
FIG. 27 is a diagram showing how prosodic information is generated when (pause frequency parameter) ≦ (threshold value).
FIG. 28 is a diagram showing how prosodic information is generated when (pause frequency parameter)> (threshold value).

Claims

A language analysis unit for outputting an accent phrase having at least one clause boundary or word boundary from the input text;
A prosody search unit for searching for a prosodic pattern corresponding to the accent phrase from a prosodic pattern database storing prosodic patterns extracted from real sounds and generating prosodic information;
A waveform generation unit that synthesizes a speech waveform from the prosodic information,
wherein the prosody search unit
searches the prosodic pattern database for a prosodic pattern corresponding to the accent phrase and, for an accent phrase that has no corresponding prosodic pattern, divides the accent phrase at the clause boundary or word boundary and searches for prosodic patterns in the divided units to generate the prosodic information,
A speech synthesizer characterized by the above.
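The search-then-divide behavior recited above can be illustrated with a short sketch. Everything here is an assumption made for illustration: the database is modeled as a dictionary indexed by a hypothetical key of (number of morae, accent type), the pattern values are placeholder strings, and the claim does not say what happens if a divided unit is also missing, so that case simply yields `None`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical index: (number of morae, accent type).
Key = Tuple[int, int]


@dataclass
class AccentPhrase:
    key: Key
    # Keys of the units obtained by dividing at clause or word boundaries.
    sub_units: List[Key] = field(default_factory=list)


def search_prosody(phrase: AccentPhrase, db: Dict[Key, str]) -> List[Optional[str]]:
    """Search by whole accent phrase first; divide only if nothing matches."""
    whole = db.get(phrase.key)
    if whole is not None:
        return [whole]
    # No pattern for the whole phrase: search per divided unit instead.
    # A missing unit yields None (behavior not specified in the claim).
    return [db.get(k) for k in phrase.sub_units]


# Placeholder database contents, not real prosodic data.
db = {(2, 0): "P2-0", (5, 2): "P5-2", (5, 0): "P5-0"}
long_phrase = AccentPhrase(key=(10, 2), sub_units=[(5, 2), (5, 0)])
print(search_prosody(long_phrase, db))
```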

In claim 1,
The prosody search unit,
when the number of morae of the accent phrase output by the language analysis unit exceeds a reference value, divides the accent phrase at the clause boundary or word boundary and searches for prosodic patterns in the divided units to generate prosodic information, and
the reference value is,
among the groups for each number of morae stored in the prosodic pattern database in which all accent types exist, the number of morae of the group having the largest number of morae,
A speech synthesizer characterized by the above.
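The reference value described above can be computed as in the following sketch. It rests on an assumption not stated in this section: that the full set of accent types for an n-mora group is 0 through n. The function names and the sample data are hypothetical.

```python
from typing import Dict, Set


def reference_mora_count(groups: Dict[int, Set[int]]) -> int:
    """Largest mora count whose group contains every accent type.

    `groups` maps a mora count to the set of accent types present in the
    prosodic pattern database for that mora count.
    Assumption: an n-mora group is complete when types 0..n all exist.
    """
    complete = [n for n, types in groups.items() if set(range(n + 1)) <= types]
    return max(complete, default=0)


def should_divide(n_morae: int, reference: int) -> bool:
    """Divide the accent phrase only when it exceeds the reference value."""
    return n_morae > reference


# Placeholder database summary: the 4- and 5-mora groups are incomplete.
groups = {1: {0, 1}, 2: {0, 1, 2}, 3: {0, 1, 2, 3}, 4: {0, 1, 3}, 5: {0}}
ref = reference_mora_count(groups)
print(ref, should_divide(10, ref), should_divide(3, ref))
```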

In claim 1,
The word boundary is
A word boundary in a compound word in which a plurality of words are consecutive,
The prosody search unit
divides the compound word into the words it contains and searches for prosodic patterns corresponding to the divided words to generate prosodic information,
A speech synthesizer characterized by the above.

In claim 1,
The accent phrase is
an accent phrase that is selected when the set pause frequency exceeds the reference value,
A speech synthesizer characterized by the above.

A language analysis step of outputting an accent phrase having at least one clause boundary or word boundary from the input text;
A search step for searching a prosodic pattern corresponding to the accent phrase from a prosodic pattern database storing prosodic patterns extracted from real sounds;
A prosody generation step of generating prosody from the corresponding prosodic pattern for an accent phrase for which a corresponding prosodic pattern is found in the search step and, for an accent phrase for which no corresponding prosodic pattern is found in the search step, dividing the accent phrase at the clause boundary or word boundary and searching for prosodic patterns in the divided units to generate prosodic information;
A waveform generation step of synthesizing a speech waveform from the prosodic information,
A speech synthesis method characterized by the above.