JP3578598B2

JP3578598B2 - Speech synthesizer

Info

Publication number: JP3578598B2
Application number: JP18176897A
Authority: JP
Inventors: 修司久保田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1997-06-23
Filing date: 1997-06-23
Publication date: 2004-10-20
Anticipated expiration: 2017-06-23
Also published as: JPH1115495A

Description

[0001]
[Technical field of the invention]
The present invention relates to a speech synthesizer that converts text information such as character information or phonetic symbol strings into speech and outputs the speech.
[0002]
[Prior art]
In recent years, various techniques have been proposed for making it easier to hear synthesized speech under noise.
[0003]
For example, JP-A-61-145633 and JP-A-2-238494 disclose techniques for changing the amplitude of a synthesized voice in accordance with the power level of ambient noise.
[0004]
Further, for example, Japanese Patent Application Laid-Open No. 2-293900 discloses a technique of changing a pitch frequency according to a power level of ambient noise.
[0005]
Japanese Patent Application Laid-Open No. 2-293900 also discloses a technique in which the noise is subjected to frequency analysis, a filter suited to its frequency characteristics is calculated, filtering with that filter is performed against the noise, and the synthesized speech is output.
[0006]
[Problems to be solved by the invention]
However, the techniques disclosed in JP-A-61-145633 and JP-A-2-238494 merely increase the output power level of the synthesized voice in accordance with the level of the ambient noise.
[0007]
In the technique of Japanese Patent Application Laid-Open No. 2-293900, the pitch is raised toward higher frequencies in fixed steps according to the noise level, so the pitch frequency moves only in the upward direction. Moreover, because the noise frequencies are not analyzed, shifting the pitch frequency may bring it into coincidence with a frequency component contained in the noise, and the speech may instead become harder to hear because of the masking effect (the phenomenon in which, when two or more sounds are present simultaneously, one becomes inaudible owing to the presence of the other).
[0008]
Further, the technique of Japanese Patent Application Laid-Open No. 2-293900 uses the frequency characteristics of the noise and attempts to raise the S/N ratio by passing the synthesized speech through a filter having those characteristics, but it fundamentally does not remove the masking effect described above. In addition, since filter design and filter processing are required, there is the problem that the system becomes large in scale.
[0009]
As described above, various techniques have been proposed to make synthesized speech easier to hear under noise. However, conventional techniques merely raise the overall output level of the synthesized speech according to the noise level, and perform no processing that focuses on consonants or on vowel devoicing, which become particularly hard to hear under noise. Further, conventional techniques do not determine whether a frequency component contained in the noise overlaps the pitch frequency or a formant frequency of the synthesized speech, so they do not avoid the masking effect. That is, the conventional techniques are limited in how far they can improve the audibility of synthesized speech under noise.
[0010]
An object of the present invention is to provide a speech synthesizer capable of significantly improving the audibility of synthesized speech under noise with a small amount of processing.
[0011]
[Means for Solving the Problems]
In order to achieve the above object, the invention according to claim 1 comprises: a text input unit for inputting text information; a text analysis unit that performs morphological analysis processing and phoneme and prosody assignment processing on the input text to generate phonetic symbols; a phoneme data storage unit in which phoneme data is stored; a rule-based speech synthesis unit that, in accordance with the phonetic symbols generated by the text analysis unit, retrieves phoneme data from the phoneme data storage unit and converts it into a speech signal by superimposing the phoneme data at pitch intervals; a speech output unit that outputs the converted speech signal as speech; and noise analysis means that captures and analyzes ambient noise and calculates the power level of the ambient noise, wherein the rule-based speech synthesis unit controls vowel devoicing so that vowel devoicing is not performed when the power level of the ambient noise calculated by the noise analysis means exceeds a threshold α.
[0014]
Further, the invention according to claim 2 comprises: a text input unit for inputting text information; a text analysis unit that performs morphological analysis processing and phoneme and prosody assignment processing on the input text to generate phonetic symbols; a phoneme data storage unit in which phoneme data is stored; a rule-based speech synthesis unit that, in accordance with the phonetic symbols generated by the text analysis unit, retrieves phoneme data from the phoneme data storage unit and converts it into a speech signal by superimposing the phoneme data at pitch intervals; a speech output unit that outputs the converted speech signal as speech; and noise analysis means that captures and analyzes ambient noise and calculates both the power level of the ambient noise and the frequency distribution of the ambient noise, wherein, when the noise analysis means has calculated the power level and the frequency distribution of the ambient noise, the rule-based speech synthesis unit performs control to shift the speech output sampling frequency if noise is present at a frequency overlapping a formant frequency held in advance for each vowel and the noise power level at that frequency exceeds a threshold γ.
[0015]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a diagram showing a configuration example of a speech synthesizer according to the present invention. Referring to FIG. 1, this speech synthesizer has a text input unit 1 for inputting text information such as character information or phonetic symbol strings; a text analysis unit 2 that performs morphological analysis processing and phoneme and prosody assignment processing on the input text to generate phonetic symbols; a phoneme data storage unit (phoneme database) 3 in which phoneme data is stored; a rule-based speech synthesis unit 4 that converts the phonetic symbols generated by the text analysis unit 2 into a speech signal on the basis of the phoneme data stored in the phoneme data storage unit (phoneme database) 3; a speech output unit 5 that outputs the converted speech signal as speech; and noise analysis means 6 that captures and analyzes ambient noise.
[0016]
Here, the text analysis unit 2 includes a morphological analysis processing unit 11 and a phoneme and prosody assignment processing unit 12, and the morphological analysis processing unit 11 performs morphological analysis processing by referring to (looking up), for example, a standard word dictionary 14.
[0017]
The rule-based speech synthesis unit 4 retrieves phoneme data from the phoneme data storage unit 3 in accordance with the phonetic symbols generated by the text analysis unit 2 and converts it into a speech signal by superimposing the phoneme data at pitch intervals.
[0018]
The noise analysis means 6 has an ambient noise capturing unit 7 that captures ambient noise and an ambient noise analysis unit 8 that analyzes the ambient noise captured by the ambient noise capturing unit 7. The ambient noise analysis unit 8 analyzes the captured ambient noise to obtain the noise power (power level) and the noise frequency distribution. An FFT or a filter bank can be used to analyze the frequency distribution.
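The analysis performed by the ambient noise analysis unit 8 can be illustrated with a minimal sketch. The function names, the dB reference, the band edges, and the use of a naive DFT in place of an FFT are all assumptions made for illustration; the patent text does not specify them.

```python
import cmath
import math


def noise_power_db(samples):
    """Mean power of one frame in dB relative to full scale (1.0); an assumed unit."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_square, 1e-12))


def band_powers(samples, sample_rate, band_edges_hz):
    """Sum DFT bin power into bands [edge[i], edge[i+1]), a crude stand-in for a filter bank."""
    n = len(samples)
    powers = [0.0] * (len(band_edges_hz) - 1)
    for k in range(1, n // 2):  # skip DC, stay below Nyquist
        freq = k * sample_rate / n
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for i in range(len(powers)):
            if band_edges_hz[i] <= freq < band_edges_hz[i + 1]:
                powers[i] += abs(coeff) ** 2 / (n * n)
                break
    return powers


# Hypothetical frame: a 200 Hz tone sampled at 1600 Hz stands in for captured noise.
frame = [0.5 * math.sin(2 * math.pi * 200 * t / 1600) for t in range(64)]
level_db = noise_power_db(frame)
spectrum = band_powers(frame, 1600, [0, 150, 300, 800])
```

In this sketch the second band (150 to 300 Hz) carries almost all of the power, which is the kind of per-band information the later embodiments compare against the pitch and formant frequencies.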
[0019]
In the first embodiment of the present invention, when the phoneme data retrieved from the phoneme data storage unit 3 is consonant phoneme data, the rule-based speech synthesis unit 4 applies to that consonant a power control corresponding to the power level of the ambient noise calculated by the noise analysis means 6, and converts the data into a speech signal.
[0020]
In the second embodiment of the present invention, the rule-based speech synthesis unit 4 controls vowel devoicing so that vowel devoicing is not performed when the power level of the ambient noise calculated by the noise analysis means 6 exceeds the threshold α.
[0021]
In the third embodiment of the present invention, when the noise analysis means 6 has calculated the power level and the frequency distribution of the ambient noise, the rule-based speech synthesis unit 4 performs control to shift the pitch frequency up or down if noise is present at a frequency overlapping the pitch frequency and the noise power level at that frequency exceeds a threshold β.
[0022]
In the fourth embodiment of the present invention, when the noise analysis means 6 has calculated the power level and the frequency distribution of the ambient noise, the rule-based speech synthesis unit 4 performs control to shift the speech output sampling frequency up or down if noise is present at a frequency overlapping a formant frequency and the noise power level at that frequency exceeds a threshold γ.
[0023]
Next, the processing operation of the speech synthesizer configured as described above will be described. Normally, when text is input at the text input unit 1, the text analysis unit 2 applies morphological analysis processing and phoneme and prosody control processing to the input text and converts it into a phonetic symbol string. The phonetic symbol string output from the text analysis unit 2 is then input to the rule-based speech synthesis unit 4, which performs rule-based speech synthesis processing and converts it into speech waveform data. The speech waveform data thus obtained from the phonetic symbol string is input to the speech output unit 5 and output from it as speech.
[0024]
More specifically, the text analysis unit 2 converts the text into a phonetic symbol string by looking up the word dictionary 14 during morphological analysis processing. The rule-based speech synthesis unit 4 then retrieves phoneme data (the phoneme data corresponding to the phonetic symbol string) from the phoneme database 3 in accordance with the phonetic symbol string output from the text analysis unit 2, and converts it into a speech signal (speech waveform data) by superimposing the retrieved phoneme data in time series at pitch intervals (the waveform superposition method). The speech signal (speech waveform data) converted by the rule-based speech synthesis unit 4 in this way is input to the speech output unit 5 and output from it as speech.
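The waveform superposition step can be pictured with the following minimal sketch. It assumes that each phoneme unit is already available as a short list of samples and that the pitch interval is a fixed number of samples; `overlap_add` and its arguments are hypothetical names, and a practical synthesizer would also window the units and vary the interval with the prosody.

```python
def overlap_add(unit_waveforms, pitch_period):
    """Place each unit one pitch period after the previous one and sum where they overlap."""
    if not unit_waveforms:
        return []
    total = max(i * pitch_period + len(w) for i, w in enumerate(unit_waveforms))
    out = [0.0] * total
    for i, wave in enumerate(unit_waveforms):
        start = i * pitch_period  # unit i begins i pitch periods into the output
        for j, sample in enumerate(wave):
            out[start + j] += sample
    return out


# Two three-sample units placed two samples apart overlap in exactly one sample.
signal = overlap_add([[1, 1, 1], [1, 1, 1]], pitch_period=2)
```

A shorter pitch period places the units closer together, which is how this family of methods raises the perceived pitch without altering the stored unit waveforms.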
[0025]
In the present invention, during this rule-based speech synthesis processing, the ambient noise capturing unit 7 additionally captures the ambient noise and the ambient noise analysis unit 8 analyzes it to generate the noise power level and frequency distribution parameters.
[0026]
In the first embodiment of the present invention, the rule-based speech synthesis unit 4 controls the power of consonants that become hard to hear under noise, such as plosives and fricatives, according to the noise power level obtained by the ambient noise analysis unit 8. That is, when phoneme data is retrieved from the phoneme database 3 and converted into a speech signal in accordance with the phonetic symbol string output from the text analysis unit 2, and the retrieved phoneme data is consonant phoneme data, the rule-based speech synthesis unit 4 controls the power (gain) of that consonant according to the noise power level. Specifically, the consonant power is controlled by, for example, determining the power (gain) of the consonant so that it is proportional to the noise power level, applying (multiplying) the determined gain to the consonant phoneme waveform retrieved from the phoneme database 3, and then performing waveform superposition.
[0027]
By controlling the consonant power according to the noise power level obtained by the ambient noise analysis unit 8 in this way, consonants, which are particularly hard to hear under noise, are output with a power that grows with the noise power level, so they become easier to hear even under noise.
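A minimal sketch of the consonant power control of the first embodiment follows. The proportional rule comes from the description; the reference power, the lower bound of 1, and the upper cap are assumptions added here so that the example behaves sensibly, and the function names are hypothetical.

```python
def consonant_gain(noise_power, reference_power=1.0, max_gain=4.0):
    """Gain proportional to noise power; the floor of 1.0 and the cap are assumptions."""
    return min(max(noise_power / reference_power, 1.0), max_gain)


def apply_noise_gain(phoneme_waveform, is_consonant, noise_power):
    """Scale consonant waveforms only; vowel waveforms pass through unchanged."""
    if not is_consonant:
        return list(phoneme_waveform)
    gain = consonant_gain(noise_power)
    return [gain * s for s in phoneme_waveform]


# With noise at twice the reference power, a consonant is amplified by 2.
boosted = apply_noise_gain([0.1, -0.2], is_consonant=True, noise_power=2.0)
```

The scaled waveform would then go through waveform superposition as usual, so only the consonant segments of the output become louder.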
[0028]
In the second embodiment of the present invention, the rule-based speech synthesis unit 4 determines whether the noise power level exceeds the threshold α and, if it does, performs control so that phonemes subject to devoicing are not devoiced. That is, ordinary Japanese speech synthesis automatically devoices certain phonemes ("shi", "ki", "ku", and so on) in order to produce smoother speech, but when the noise power level exceeds the predetermined threshold α, devoicing of those phonemes is forcibly prohibited, so that vowels are uttered clearly even under noise and the synthesized speech becomes easier to hear. The threshold α is set, for example, to the noise power level at which devoiced vowels make the synthesized speech hard to hear.
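The devoicing control of the second embodiment reduces to a single comparison, as in the sketch below. The set of devoiceable morae holds only the three examples named in the text, and the sketch ignores the phonetic context that a real devoicing rule would check; `mark_devoicing` and its arguments are hypothetical names.

```python
DEVOICEABLE = {"shi", "ki", "ku"}  # examples from the text; a real rule set depends on context


def mark_devoicing(morae, noise_power, alpha):
    """Return (mora, devoiced) pairs; devoicing is suppressed when noise_power exceeds alpha."""
    allow = noise_power <= alpha
    return [(m, allow and m in DEVOICEABLE) for m in morae]


quiet = mark_devoicing(["shi", "ta"], noise_power=0.2, alpha=0.5)
noisy = mark_devoicing(["shi", "ta"], noise_power=0.8, alpha=0.5)
```

In the quiet case "shi" is marked as devoiced, while in the noisy case every mora keeps its voiced vowel.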
[0029]
In the third embodiment of the present invention, when the noise frequency distribution parameters show that noise overlapping the pitch frequency is present, the rule-based speech synthesis unit 4 determines whether the noise power level at that frequency exceeds the threshold β and, if it does, shifts the pitch frequency up or down within the range over which the pitch frequency can be changed. More specifically, the pitch frequency is shifted to a frequency band in which the noise power level is low. The threshold β is determined, for example, from the S/N ratio in accordance with the output level of the synthesized speech.
[0030]
Thus, in the third embodiment, when the noise frequency distribution parameters show that noise overlapping the pitch frequency is present, it is determined whether the noise power level at that frequency exceeds the threshold β, and if it does, the pitch frequency is shifted up or down within the range over which it can be changed. This avoids the masking effect and makes the synthesized speech easier to hear.
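The pitch shifting decision of the third embodiment might be sketched as follows. The callable `noise_at`, which returns the noise power at a given frequency, the candidate step size, and the tie-breaking rule (prefer the candidate closest to the original pitch) are all assumptions; the text states only that the pitch moves to a band with low noise power within its allowed range.

```python
def choose_pitch(pitch_hz, noise_at, beta, min_hz, max_hz, step_hz=5.0):
    """Keep pitch_hz if its noise is at or below beta; else pick the quietest candidate in range."""
    if noise_at(pitch_hz) <= beta:
        return pitch_hz
    candidates = []
    f = min_hz
    while f <= max_hz:
        candidates.append(f)
        f += step_hz
    # Lowest noise first; ties are broken by closeness to the original pitch.
    return min(candidates, key=lambda c: (noise_at(c), abs(c - pitch_hz)))


def example_noise(freq_hz):
    """Hypothetical noise profile with a strong component around 120 Hz."""
    return 1.0 if 115 <= freq_hz <= 125 else 0.1


new_pitch = choose_pitch(120.0, example_noise, beta=0.5, min_hz=100.0, max_hz=140.0)
```

Because candidates are searched on both sides of the original pitch, the shift can go down as well as up, in contrast to the upward-only prior art discussed earlier.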
[0031]
In the fourth embodiment of the present invention, when the noise frequency distribution parameters show that noise overlapping a formant frequency is present, the rule-based speech synthesis unit 4 determines whether the noise power level at that frequency exceeds the threshold γ and, if it does, shifts the speech output sampling frequency up or down within the range over which the speech output sampling frequency can be changed. More specifically, the speech output sampling frequency is shifted so that the formant frequency moves to a frequency band in which the noise power level is low.
[0032]
That is, when rule-based speech synthesis is performed by the waveform superposition method, phoneme waveform data is used, so the vocal tract characteristics are fixed for each phoneme and, unlike in parameter-based speech synthesis methods, the positions of the formant frequencies cannot be changed directly. Therefore, in the fourth embodiment, the formant frequencies are shifted up or down in relative terms by shifting (changing) the sampling frequency.
[0033]
In this case, since the formant frequency is determined in advance for each phoneme (vowel), formant information (frequency position, gain) is stored as a parameter for each phoneme data. The threshold value γ is determined from the S / N ratio according to the output level of the synthesized speech and the formant information, for example.
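The relation between the output sampling frequency and the formant position can be sketched as below, assuming that a waveform recorded at one rate and played back at another has every frequency scaled by the ratio of the two rates. The candidate rates, the callable `noise_at`, and the selection rule are assumptions for illustration and are not taken from the patent text.

```python
def shifted_formant(formant_hz, recorded_rate, output_rate):
    """Frequency at which a stored formant is heard when played back at output_rate."""
    return formant_hz * output_rate / recorded_rate


def choose_output_rate(formants_hz, recorded_rate, candidate_rates, noise_at, gamma):
    """Keep recorded_rate unless a formant sits on noise above gamma; else pick the quietest rate."""
    def worst_noise(rate):
        return max(noise_at(shifted_formant(f, recorded_rate, rate)) for f in formants_hz)

    if worst_noise(recorded_rate) <= gamma:
        return recorded_rate
    # Lowest worst-case noise first; ties are broken by closeness to the recorded rate.
    return min(candidate_rates, key=lambda r: (worst_noise(r), abs(r - recorded_rate)))


def example_noise(freq_hz):
    """Hypothetical noise profile with a strong component around 800 Hz."""
    return 1.0 if 780 <= freq_hz <= 820 else 0.1


rate = choose_output_rate([800.0, 1200.0], 16000, [15000, 16000, 17000], example_noise, gamma=0.5)
```

Under the same scaling assumption, changing the playback rate also scales the pitch and the duration of the speech, which is one reason to restrict the change to a limited range.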
[0034]
Thus, in the fourth embodiment, when the noise frequency distribution parameters show that noise overlapping a formant frequency is present, it is determined whether the noise power level at that frequency exceeds the threshold γ, and if it does, the speech output sampling frequency is shifted up or down within the range over which it can be changed (more specifically, so that the formant frequency moves to a frequency band in which the noise power level is low). This avoids the masking effect and makes the synthesized speech easier to hear.
[0035]
Although the first, second, third, and fourth embodiments have been described individually above, they can also be used in any combination. For example, all four embodiments may be combined so that the rule-based speech synthesis unit 4 is provided with a function for controlling the consonant power, a function for preventing vowel devoicing, a function for shifting (changing) the pitch frequency, and a function for shifting (changing) the speech output sampling frequency up or down.
[0036]
In the above example, the function for controlling the consonant power, the function for preventing vowel devoicing, the function for shifting (changing) the pitch frequency, and the function for shifting (changing) the speech output sampling frequency up or down are provided inside the rule-based speech synthesis unit 4, but they can also be provided outside it. That is, as shown in FIG. 2 for example, a consonant power control unit 15 that controls the consonant power, a vowel devoicing determination processing unit 16 that performs control so that vowel devoicing is not carried out, a pitch changing unit 17 that shifts (changes) the pitch frequency, and a sampling changing unit 18 that shifts (changes) the speech output sampling frequency up or down can each be provided separately.
[0037]
FIG. 3 is a diagram showing an example of the hardware configuration of the speech synthesizer of FIG. 1 or FIG. 2. Referring to FIG. 3, this speech synthesizer is realized by, for example, a personal computer, and has a CPU 51 that controls the whole device, a ROM 52 that stores the control program and the like for the CPU 51, a RAM 53 used as a work area and the like for the CPU 51, the text input unit 1 for inputting text, the speech output unit (for example, a speaker) 5, and the ambient noise capturing unit (for example, a microphone) 7 for capturing ambient noise.
[0038]
Here, the word dictionary 14, the phoneme database 3, and the like can be set up in the RAM 53. The CPU 51 provides the functions of the text analysis unit 2, the rule-based speech synthesis unit 4, the ambient noise analysis unit 8, and so on.
[0039]
The functions of the CPU 51 as the text analysis unit 2, the rule-based speech synthesis unit 4, the ambient noise analysis unit 8, and so on can be provided, for example, in the form of a software package (specifically, an information recording medium such as a CD-ROM). For this reason, the example of FIG. 3 includes a medium driving device 61 that drives an information recording medium 60 when the medium is set.
[0040]
In other words, the speech synthesizer of the present invention can also be implemented in a configuration in which a general-purpose computer system reads a program recorded on an information recording medium such as a CD-ROM and the microprocessor of that system executes the speech synthesis processing of the present invention. In this case, the program for executing the speech synthesis processing of the present invention (that is, the program used by the hardware system) is provided recorded on a medium. The information recording medium on which the program and the like are recorded is not limited to a CD-ROM; a ROM, a RAM, a flexible disk, a memory card, or the like may also be used. The program recorded on the medium is installed in a storage device incorporated in the hardware system, for example a hard disk device, and is then executed to realize the functions of the speech synthesizer of the present invention.
[0041]
Further, the program for realizing the speech synthesis processing of the present invention may be provided not only in the form of a medium but also by communication (for example, by a server).
[0042]
[Effects of the invention]
As described above, according to the invention of claim 1, the rule-based speech synthesis unit controls vowel devoicing so that vowel devoicing is not performed when the power level of the ambient noise calculated by the noise analysis means exceeds the threshold α. Vowels are therefore uttered clearly even under noise, and the synthesized speech becomes easier to hear.
[0045]
According to the invention of claim 2, when the noise analysis means has calculated the power level and the frequency distribution of the ambient noise, the rule-based speech synthesis unit performs control to shift the speech output sampling frequency if noise is present at a frequency overlapping a formant frequency held in advance for each vowel and the noise power level at that frequency exceeds the threshold γ. The masking effect is therefore avoided, and the synthesized speech becomes easier to hear.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration example of a speech synthesis device according to the present invention.
FIG. 2 is a diagram showing a modified example of the speech synthesizer of FIG. 1.
FIG. 3 is a diagram showing an example of the hardware configuration of the speech synthesizer of FIG. 1 or FIG. 2.
[Explanation of symbols]
1 Text input unit
2 Text analysis unit
3 Phoneme data storage unit
4 Rule-based speech synthesis unit
5 Speech output unit
6 Noise analysis means
7 Ambient noise capturing unit
8 Ambient noise analysis unit
11 Morphological analysis processing unit
12 Phoneme and prosody assignment processing unit
14 Standard word dictionary
15 Consonant power control unit
16 Vowel devoicing determination processing unit
17 Pitch changing unit
18 Sampling changing unit
51 CPU
52 ROM
53 RAM
60 Information storage medium
61 Medium driving device

Claims

A speech synthesizer comprising: a text input unit for inputting text information; a text analysis unit that performs morphological analysis processing and phoneme and prosody assignment processing on the input text to generate phonetic symbols; a phoneme data storage unit in which phoneme data is stored; a rule-based speech synthesis unit that, in accordance with the phonetic symbols generated by the text analysis unit, retrieves phoneme data from the phoneme data storage unit and converts it into a speech signal by superimposing the phoneme data at pitch intervals; a speech output unit that outputs the converted speech signal as speech; and noise analysis means that captures and analyzes ambient noise and calculates the power level of the ambient noise, wherein the rule-based speech synthesis unit controls vowel devoicing so that vowel devoicing is not performed when the power level of the ambient noise calculated by the noise analysis means exceeds a threshold α.

A speech synthesizer comprising: a text input unit for inputting text information; a text analysis unit that performs morphological analysis processing and phoneme and prosody assignment processing on the input text to generate phonetic symbols; a phoneme data storage unit in which phoneme data is stored; a rule-based speech synthesis unit that, in accordance with the phonetic symbols generated by the text analysis unit, retrieves phoneme data from the phoneme data storage unit and converts it into a speech signal by superimposing the phoneme data at pitch intervals; a speech output unit that outputs the converted speech signal as speech; and noise analysis means that captures and analyzes ambient noise and calculates both the power level of the ambient noise and the frequency distribution of the ambient noise, wherein, when the noise analysis means has calculated the power level and the frequency distribution of the ambient noise, the rule-based speech synthesis unit performs control to shift the speech output sampling frequency if noise is present at a frequency overlapping a formant frequency held in advance for each vowel and the noise power level at that frequency exceeds a threshold γ.