JP3785892B2

JP3785892B2 - Speech synthesizer and recording medium

Info

Publication number: JP3785892B2
Application number: JP2000071150A
Authority: JP
Inventors: 賢大谷; ゆみ堤; 敏幸佐野
Original assignee: Omron Corp
Current assignee: Omron Corp
Priority date: 2000-03-14
Filing date: 2000-03-14
Publication date: 2006-06-14
Anticipated expiration: 2020-03-14
Also published as: JP2001265374A

Description

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speech synthesizer and a recording medium for creating a voice message from input text or the like, and for editing and uttering that message.
[0002]
[Background]
A conventional voice message creating / editing apparatus uses rule synthesis or waveform superposition as its speech synthesis method. Synthetic speech created by these methods has natural prosody, but its voice quality (how human the voice sounds) is electronic and unnatural. In addition, with these methods it is difficult to change the voice quality of the synthesized speech, so the speech cannot be differentiated or individualized by voice.
[0003]
In order to solve the above problems, a speech unit connection type speech synthesis method (Japanese Patent Laid-Open No. 10-49193) has been proposed. In this method, a speaker (model) reads aloud a standard text in the language to be spoken and the reading is recorded. The recorded waveform of this natural utterance is divided into speech units (hereinafter called phonemes), and a phoneme component database covering the phonemes of that language is created. When synthesized speech is output, the phoneme components are recombined in accordance with the text to be read out. By using this phoneme component database, it is possible to obtain synthesized speech that is close to the real voice of the person who read the text.
[0004]
The voice synthesized by phoneme connection has a natural voice quality close to that of the real voice, and a synthesized voice can be created by a plurality of speakers by changing the phoneme component database used at the time of synthesis. However, this phoneme-connected speech synthesis method has a problem that the prosody of the synthesized speech (speech length or accent position) becomes unnatural.
[0005]
As means for changing the prosody of speech, there are a voice pitch conversion method (Japanese Patent No. 2612867) and a voice utterance speed conversion method (Japanese Patent No. 2612868). According to these methods, the prosody can be changed arbitrarily either by fitting the waveform to be corrected to a reference (teacher) waveform or by manually manipulating the waveform to be corrected.
[0006]
However, the former method compares the student's recorded voice with the teacher's recorded voice and converts the waveform of the student's recorded voice to match the waveform of the teacher's recorded voice. In applications such as synthesizing speech from arbitrary given text and reading it out electronically, it is impossible to prepare a teacher recording or waveform for every case. In addition, when synthesized speech is created in real time in combination with speech recognition or the like, it is also impossible to correct the prosody by manipulating the waveform manually.
[0007]
This is the current state of speech synthesis: it has been extremely difficult to achieve both natural voice quality and natural prosody in the speech synthesis used in voice message creating / editing devices, text-to-speech devices, and the like.
[0008]
Also, when phoneme component databases of a plurality of speakers are used, the individuality or characteristics of a speaker may cause the synthesized speech to lose its smoothness or become unnatural when the speaker is switched.
[0009]
DISCLOSURE OF THE INVENTION
The present invention has been made in view of the above prior art, and an object of the present invention is to provide a speech synthesizer capable of creating a voice message with natural voice quality and smoothness, and in particular with natural voice quality and prosody.
[0010]
A speech synthesizer according to the present invention comprises: a speech synthesis database storing speech waveform data relating to a plurality of speakers; means for extracting, from the phoneme waveform data of any speaker stored in the speech synthesis database, speaker-dependent features among the parameters that express a speech fundamental frequency pattern; phoneme combining means for selecting and connecting phoneme waveforms corresponding to the target characters from the speech synthesis database; target prosody generation means for generating, using the features, a target prosody for synthesizing the speech corresponding to the target characters; and means for correcting the synthesized speech generated by the phoneme combining means based on the target prosody generated by the target prosody generation means.
[0011]
Synthesized speech generated by connecting phonemes is characterized by natural voice quality. Furthermore, this speech synthesizer extracts features corresponding to the speaker and corrects the synthesized speech using those features. Therefore, when synthesized speech is created while switching among a plurality of speakers, even if the synthesizer switches to the phoneme speech synthesis database of a different speaker, the synthesized speech is corrected according to the characteristics of the phoneme waveforms of the selected speaker, so that the synthesized speech can be generated with natural voice quality and smoothness.
[0012]
The speech synthesis database according to the embodiment of the present invention holds the speech waveform data and phoneme character data corresponding to the content, segment for each phoneme, and prosodic feature parameters of each phoneme.
Further, the means for extracting the feature in another embodiment of the present invention extracts, as the features, the parameters Api, Aaj and Fmin used in the mathematical expression (Expression 1) that expresses the speech fundamental frequency pattern from the phoneme waveform data of the speaker.
[0013]
In still another embodiment of the present invention, the target prosody generation means includes a process of generating a speech fundamental frequency pattern based on the parameters and a process of calculating a phoneme duration.
Further, the target prosody generation means according to still another embodiment of the present invention calculates the duration of each phoneme to be synthesized using the average value of each phoneme duration in the speech database of each speaker.
The means for correcting the synthesized speech in yet another embodiment of the present invention compares the prosody of the synthesized speech generated by the phoneme combining means with the target prosody generated by the target prosody generation means, and corrects the synthesized speech when the two do not match.
Further, the means for correcting the synthesized speech corrects the prosody of the synthesized speech using the feature corresponding to the selected speaker.
[0014]
Still another embodiment of the present invention includes a storage unit that stores the synthesized speech created by the phoneme combining unit or the synthesized speech corrected by the unit for correcting the synthesized speech as a speech waveform.
Further, another embodiment of the present invention includes a prosody operating unit that arbitrarily changes the prosody by operating the target prosody displayed by the display device or the prosody of the actually selected phoneme.
[0015]
Each embodiment as described above can also synthesize speech with natural voice quality and prosody according to the selected speaker.
[0016]
Further, the recording medium according to the present invention is a computer-readable recording medium on which is recorded a program for causing a computer to function as: means for extracting, from the phoneme waveform data of a speaker stored in a speech synthesis database, speaker-dependent features among the parameters that express a speech fundamental frequency pattern; phoneme combining means for selecting and connecting phoneme waveforms corresponding to the target characters from the speech synthesis database; target prosody generation means for generating, using the features, a target prosody for synthesizing the speech corresponding to the target characters; and means for correcting the synthesized speech generated by the phoneme combining means based on the target prosody generated by the target prosody generation means. Such a recording medium is usually realized in the form of a CD, DVD, MO, or the like.
[0017]
By executing the program stored in such a recording medium, the features corresponding to the speaker are extracted and the synthesized speech produced by phoneme connection is corrected using those features. Therefore, even when the phoneme speech synthesis database is switched to that of a different speaker, synthesized speech can be generated with natural voice quality and smoothness.
Also with this recording medium, if the correction means corrects the prosody of the synthesized speech using the features corresponding to the selected speaker, speech can be synthesized with natural voice quality and prosody according to the selected speaker.
[0018]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 is a diagram showing a configuration of a speech synthesizer 1 according to an embodiment of the present invention. This speech synthesizer 1 includes speech synthesis databases 2A, 2B, ..., a speaker database selection unit 3, a database feature amount calculation unit 4, a Fujisaki model parameter calculation unit 5, a text reading unit 6, a phoneme sequence conversion unit 7, a target prosody generation unit 8, a phoneme combination unit 9, a prosody display unit 10, a prosody correction unit 11, a synthesized speech output unit 12, and a storage unit 13. A prosody operation unit 14 is also provided as necessary. The speech synthesizer 1 is configured as a program that runs on a computer and is stored in a recording medium such as a CD, as a computer system in which the program is stored in a recording medium such as a CD or a hard disk, or as a dedicated device. FIG. 2A shows a screen display (user interface) on the computer. This display screen (parent window) 21 contains a text display window 22 for displaying input text, a phoneme notation display window 23 for displaying the phoneme notation converted from the text, a prosody display window 24, and, within an operation panel 25, a speaker selection combo box 26, an utterance button 27, a waveform saving button 28, and the like. FIG. 2B shows a state in which the speaker selection combo box 26 is opened. The speech synthesizer 1 will be described below with reference to FIGS. 1, 2 (a) and 2 (b).
[0019]
The speech synthesis databases 2A, 2B, ... are databases that hold the speech waveforms from which speech is synthesized, the phoneme character data corresponding to each speech waveform and its content, the segment of each phoneme, and the prosodic feature parameters of each phoneme. They are stored in advance in a storage medium such as a hard disk. A separate speech synthesis database 2A, 2B, ... is prepared for each speaker.
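As a rough illustration of the database contents listed above, the following Python sketch models one per-speaker database. The class names, field names, and units are assumptions made for this example; the text only states which kinds of data are held.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhonemeUnit:
    """One phoneme segment cut from a speaker's recording (hypothetical layout)."""
    label: str          # phoneme character data, e.g. "a" or "k"
    start: int          # segment start within the waveform, in samples
    end: int            # segment end (exclusive), in samples
    mean_f0_hz: float   # prosodic feature parameter: mean fundamental frequency
    duration_s: float   # prosodic feature parameter: duration in seconds


@dataclass
class SpeechSynthesisDatabase:
    """Per-speaker database: the recorded waveform plus phoneme-level annotations."""
    speaker: str
    sample_rate: int
    waveform: List[float]
    units: List[PhonemeUnit] = field(default_factory=list)

    def units_for(self, label: str) -> List[PhonemeUnit]:
        """Return every stored segment of the given phoneme."""
        return [u for u in self.units if u.label == label]

    def samples_of(self, unit: PhonemeUnit) -> List[float]:
        """Return the waveform samples belonging to one segment."""
        return self.waveform[unit.start:unit.end]
```

Keeping the segment boundaries next to the prosodic features lets later stages look up candidates by phoneme label and compare their prosody without re-analysing the waveform.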
[0020]
The speaker database selection unit 3 selects the speaker to be used for speech synthesis from the plurality of registered speech synthesis databases 2A, 2B, .... That is, as shown in FIG. 2B, various types of speakers are registered in the speaker selection combo box 26, and when a speaker is selected by opening the speaker selection combo box on the screen, the speech synthesis database 2A, 2B, ... of that speaker is selected by the speaker database selection unit 3.
[0021]
When the speech synthesis database 2A, 2B, ... of any speaker is selected or designated (hereinafter, the selected speech synthesis database may be referred to as the selection database 2), the database feature amount calculation unit 4 calculates feature amounts for the speech waveform in the selection database 2. That is, for the voiced sections in the speech waveform extracted from the selection database 2, the fundamental frequency of each window is calculated by taking the autocorrelation with a constant window width, and the average value (fo)mean and the standard deviation (fo)std of the per-window fundamental frequencies are calculated as the feature amounts.
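The feature calculation described above can be sketched as follows. This is a minimal pure-Python sketch, not the patent's implementation: the search range `f_lo`/`f_hi`, the window width, the function names, and the use of the population standard deviation are all assumptions of this example.

```python
import statistics


def frame_f0(frame, sample_rate, f_lo=60.0, f_hi=400.0):
    """Estimate the F0 of one voiced frame from its autocorrelation peak."""
    lag_min = int(sample_rate / f_hi)
    lag_max = min(int(sample_rate / f_lo), len(frame) - 1)
    best_lag, best_val = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        val = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else None


def database_features(voiced, sample_rate, window=400):
    """Return ((fo)mean, (fo)std) over fixed-width windows of a voiced section."""
    f0s = []
    for start in range(0, len(voiced) - window + 1, window):
        f0 = frame_f0(voiced[start:start + window], sample_rate)
        if f0 is not None:
            f0s.append(f0)
    # Population standard deviation is an assumption; the text only says
    # "standard deviation".
    return statistics.mean(f0s), statistics.pstdev(f0s)
```

A production system would normally use a more robust pitch tracker with voicing detection; the sketch only mirrors the windowed autocorrelation and the two summary statistics named in the text.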
[0022]
The Fujisaki model parameter calculation unit 5 calculates the parameters of the Fujisaki model for each speaker using the fundamental frequency feature amounts of the speaker whose speech is stored in the selection database 2. The Fujisaki model is a model of the speech fundamental frequency pattern produced when a Japanese declarative sentence is read aloud, and is expressed by the following equation (1). Here, Api and Aaj are coefficients, where p denotes the phrase component and a denotes the accent component.
[0023]
[Expression 1]
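For orientation, the form of the Fujisaki model commonly given in the speech literature is shown below. It is not recovered from this document, so the notation may differ from equation (1): the timing symbols $T_{0i}$, $T_{1j}$, $T_{2j}$, the constants $\alpha$, $\beta$, $\gamma$, and the baseline symbol $F_b$ are assumptions taken from the standard formulation. Because equation (4) defines Fmin as a logarithm, Fmin in this document may correspond to $\ln F_b$ here.

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{pi}\, G_p(t - T_{0i})
  + \sum_{j=1}^{J} A_{aj}\, \bigl\{ G_a(t - T_{1j}) - G_a(t - T_{2j}) \bigr\}

G_p(t) =
  \begin{cases}
    \alpha^{2}\, t\, e^{-\alpha t}, & t \ge 0 \\
    0, & t < 0
  \end{cases}

G_a(t) =
  \begin{cases}
    \min\bigl[\, 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \,\bigr], & t \ge 0 \\
    0, & t < 0
  \end{cases}
```

The first sum is the phrase component (amplitudes $A_{pi}$) and the second is the accent component (amplitudes $A_{aj}$), matching the roles of p and a described above.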

[0024]
The Fujisaki model parameter calculation unit 5 determines the speaker-dependent parameters of this model using the feature amounts of each speaker's speech synthesis database (selection database 2) obtained by the database feature amount calculation unit 4. In the Fujisaki model equation (1), Api, Aaj, and Fmin are the speaker-dependent parameters. To realize natural synthesized speech, the values of these parameters must be changed for each of the speech synthesis databases 2A, 2B, ... in accordance with the characteristics of the speech in the selection database 2. Therefore, the Fujisaki model parameter calculation unit 5 determines values of the three parameters Api, Aaj, and Fmin that suit the feature amounts of the speaker.
[0025]
That is, the Fujisaki model parameter calculation unit 5 uses the fundamental frequency average value (fo)mean and standard deviation (fo)std of the speech synthesis database 2 selected when the speaker was chosen, and calculates the Fujisaki model parameters suited to that database 2 from the following equations (4) to (6).
Fmin = ln{(fo)mean − (fo)std} ... (4)
Api = 0.3{−0.42 ln[(fo)mean] + 0.42 ln[(fo)std]} + 1.1 ... (5)
Aaj = 0.7{−0.42 ln[(fo)mean] + 0.42 ln[(fo)std]} + 1.1 ... (6)
Thus, the characteristics of the speaker of the database selected from among the speech synthesis databases 2A, 2B, ... are modeled in the form of the Fujisaki model.
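Equations (4) to (6) translate directly into code. The sketch below assumes the feature values are given in Hz and adds an input check that is not part of the source; the function name is hypothetical.

```python
import math


def fujisaki_speaker_parameters(f0_mean, f0_std):
    """Apply equations (4)-(6) to the database feature values (assumed in Hz)."""
    # Guard added for this sketch: both logarithms need positive arguments.
    if not 0.0 < f0_std < f0_mean:
        raise ValueError("expected 0 < (fo)std < (fo)mean")
    spread = -0.42 * math.log(f0_mean) + 0.42 * math.log(f0_std)
    f_min = math.log(f0_mean - f0_std)  # equation (4)
    a_p = 0.3 * spread + 1.1            # equation (5)
    a_a = 0.7 * spread + 1.1            # equation (6)
    return f_min, a_p, a_a
```

For example, calling `fujisaki_speaker_parameters(120.0, 20.0)` gives a baseline of ln 100 and accent and phrase amplitudes somewhat below 1.1, since the shared bracketed term is negative whenever (fo)std is smaller than (fo)mean.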
[0026]
The text reading unit 6 reads the original sentence (text) to be uttered by synthesizing speech into the memory of the computer. The text is input from the keyboard of the personal computer or sent through a line such as the Internet, and these texts are read into the memory of the computer. The read text is displayed in the text display window 22 as kana-kanji mixed text.
[0027]
The kana-kanji mixed text input from the text reading unit 6 is converted into a phoneme sequence for synthesis by the phoneme sequence conversion unit 7 and displayed in the phoneme notation display window 23 as accented phoneme notation. When the utterance button 27 on the operation panel 25 is pressed, the speech synthesis process is executed.
[0028]
The target prosody generation unit 8 uses the Fujisaki model parameters Api, Aaj, and Fmin determined by the Fujisaki model parameter calculation unit 5 and applies the Fujisaki model to generate the prosody that serves as the target during speech synthesis. The target prosody consists of a speech fundamental frequency pattern and phoneme durations, and accordingly the target prosody generation unit 8 consists of a process of generating the speech fundamental frequency pattern and a process of calculating the duration of each phoneme. In the process of generating the speech fundamental frequency pattern, the values of the three parameters Api, Aaj, and Fmin determined by the Fujisaki model parameter calculation unit 5 are applied to equation (1) of the Fujisaki model to generate the speech fundamental frequency pattern used as the target during speech synthesis. In the process of calculating the phoneme durations, the duration of each phoneme to be synthesized is calculated using the average value of each phoneme duration in each speaker's speech database.
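The two processes of target prosody generation can be sketched as below. The response functions follow the standard Fujisaki formulation rather than this document's equation (1), and several things are assumptions of the sketch: the constants `alpha`, `beta`, `gamma`, the command timings (`phrase_onsets`, `accent_spans`), the use of one amplitude for all phrase commands and one for all accent commands, and treating Fmin as a log-domain baseline as equation (4) suggests.

```python
import math


def g_phrase(t, alpha=3.0):
    """Phrase control response (standard form; alpha is an assumed constant)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def g_accent(t, beta=20.0, gamma=0.9):
    """Accent control response with ceiling gamma (assumed constants)."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def target_f0(times, f_min, a_p, a_a, phrase_onsets, accent_spans):
    """Target F0 contour in Hz; f_min is a log-domain baseline as in equation (4)."""
    contour = []
    for t in times:
        log_f0 = f_min
        log_f0 += sum(a_p * g_phrase(t - t0) for t0 in phrase_onsets)
        log_f0 += sum(a_a * (g_accent(t - t1) - g_accent(t - t2))
                      for t1, t2 in accent_spans)
        contour.append(math.exp(log_f0))
    return contour


def target_durations(phonemes, stored_durations):
    """Target duration of each phoneme: the mean of that phoneme's stored durations."""
    return [sum(stored_durations[p]) / len(stored_durations[p]) for p in phonemes]
```

Before any command starts, the contour sits at exp(Fmin); phrase and accent commands then raise it by amounts scaled by Api and Aaj, which is how the speaker-dependent parameters shape the target.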
[0029]
For the text to be synthesized, input from a keyboard or the like, the phoneme combination unit 9 selects phonemes (phoneme waveforms) from the selection database 2 based on the target prosody (fundamental frequency pattern and phoneme durations) generated by the target prosody generation unit 8, and joins them to create synthesized speech. The prosody of the synthesized speech created in this way and the target prosody are displayed graphically in the prosody display window 24 by the prosody display unit 10.
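One way to picture selection against the target prosody is a nearest-match search per phoneme followed by concatenation. The text does not give the selection criterion, so the cost function below (log F0 distance plus weighted duration difference), the tuple layout of a stored unit, and the function names are all assumptions of this sketch.

```python
import math


def select_unit(candidates, target_f0_hz, target_dur_s, dur_weight=1.0):
    """Pick the stored unit whose prosody is closest to the target (assumed cost)."""
    def cost(unit):
        f0_hz, dur_s, _samples = unit
        return (abs(math.log(f0_hz / target_f0_hz))
                + dur_weight * abs(dur_s - target_dur_s))
    return min(candidates, key=cost)


def concatenate(phonemes, database, targets):
    """Join the selected phoneme waveforms into one synthesized waveform.

    database maps a phoneme label to a list of (f0_hz, duration_s, samples);
    targets holds one (target_f0_hz, target_duration_s) pair per phoneme.
    """
    speech = []
    for phoneme, (f0_hz, dur_s) in zip(phonemes, targets):
        _, _, samples = select_unit(database[phoneme], f0_hz, dur_s)
        speech.extend(samples)
    return speech
```

Plain concatenation leaves audible joins in practice; real unit-connection systems smooth the boundaries, which is outside what this sketch shows.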
[0030]
The prosody correction unit 11 compares the prosody of the synthesized speech (speech fundamental frequency and phoneme durations) generated by the phoneme combination unit 9 with the target prosody calculated based on the Fujisaki model. If they do not match, it corrects the synthesized speech waveform by extending or shortening each of its phonemes so that the waveform fits the target values.
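The compare-then-adjust step for phoneme durations might look like the sketch below. The text does not say how a phoneme is extended or shortened; linear-interpolation resampling is used here only because it is short, and it also shifts pitch, so a real system would more likely use a pitch-preserving technique. The tolerance `tol_s` and the function names are assumptions.

```python
def stretch(samples, new_len):
    """Linearly interpolate samples to new_len points (naive time-scaling)."""
    if new_len <= 1 or len(samples) <= 1:
        return list(samples[:max(new_len, 0)])
    step = (len(samples) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def correct_durations(segments, target_durs_s, sample_rate, tol_s=0.005):
    """Extend or shorten each phoneme segment whose duration misses the target."""
    corrected = []
    for samples, target in zip(segments, target_durs_s):
        actual = len(samples) / sample_rate
        # Only segments that disagree with the target prosody are modified.
        if abs(actual - target) > tol_s:
            samples = stretch(samples, round(target * sample_rate))
        corrected.append(list(samples))
    return corrected
```

Segments already within tolerance pass through unchanged, which mirrors the "correct only when they do not match" behaviour described above.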
[0031]
The synthesized speech output unit 12 outputs the synthesized speech created in this way. For example, the output synthesized speech is amplified and converted into speech by a speaker or the like, or stored in the storage unit 13 or a recording medium as appropriate.
[0032]
When the waveform saving button 28 on the operation panel 25 is pressed, or automatically, the synthesized speech created by the phoneme combination unit 9 or the synthesized speech corrected by the prosody correction unit 11 is saved in the storage unit 13 as a speech waveform. When the prosody operation unit 14 is provided, the prosody can be changed arbitrarily by manipulating the target prosody displayed in the prosody display window 24 or the prosody of the actually selected phonemes.
[0033]
Therefore, according to the speech synthesizer 1, when synthesized speech is created with a plurality of speakers, the feature amounts of the selected speaker can be extracted and modeled by the Fujisaki model. Then, after the phonemes extracted from the speech synthesis database are joined to create synthesized speech, the prosody of the synthesized speech can be corrected using the prosody of this model. Therefore, according to the speech synthesizer 1, it is possible to create a voice message with natural prosody while preserving the voice quality of each speech synthesis database.
[0034]
[Effects of the Invention]
According to the speech synthesizer and the recording medium of the present invention, it is possible to obtain a smooth synthesized speech with a natural voice quality. In particular, it becomes possible to utter with natural voice quality and prosody.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a speech synthesizer according to an embodiment of the present invention.
FIG. 2A is a diagram showing a display screen of the speech synthesizer, and FIG. 2B is a diagram showing a state in which a speaker selection combo box is opened.
[Explanation of symbols]
2A, 2B, ... Speech synthesis database
3 Speaker database selection unit
4 Database feature amount calculation unit
5 Fujisaki model parameter calculation unit
8 Target prosody generation unit
9 Phoneme combination unit
11 Prosody correction unit
22 Text display window
23 Phoneme notation display window
24 Prosody display window

Claims

A speech synthesizer comprising:
a speech synthesis database that stores speech waveform data for a plurality of speakers;
means for extracting, from the phoneme waveform data of any speaker stored in the speech synthesis database, a speaker-dependent feature among the parameters that express a speech fundamental frequency pattern;
phoneme combining means for selecting and connecting phoneme waveforms corresponding to target characters from the speech synthesis database;
target prosody generation means for generating, using the feature, a target prosody for synthesizing speech corresponding to the target characters; and
means for correcting the synthesized speech generated by the phoneme combining means based on the target prosody generated by the target prosody generation means.

The speech synthesizer according to claim 1, wherein the speech synthesis database holds the speech waveform data, phoneme character data corresponding to its content, a segment for each phoneme, and prosodic feature parameters of each phoneme.

The speech synthesizer according to claim 1, wherein the means for extracting the feature extracts, as the features, the parameters Api, Aaj, and Fmin used in the mathematical expression (Expression 1) that expresses the speech fundamental frequency pattern from the phoneme waveform data of the speaker.

The speech synthesizer according to claim 1, wherein the target prosody generation means includes a process of generating a speech fundamental frequency pattern based on the parameter and a process of calculating a phoneme duration.

The speech synthesizer wherein the target prosody generation means calculates the duration of a phoneme to be synthesized by using the average value of each phoneme duration in the speech database of each speaker.

The speech synthesizer according to claim 1, wherein the means for correcting the synthesized speech compares the prosody of the synthesized speech generated by the phoneme combining means with the target prosody generated by the target prosody generation means, and corrects the synthesized speech when the two prosodies do not match.

The speech synthesizer according to claim 1, wherein the means for correcting the synthesized speech corrects the prosody of the synthesized speech using the feature corresponding to the selected speaker.

The speech synthesizer according to claim 1, further comprising a storage unit that stores, as a speech waveform, the synthesized speech created by the phoneme combining means or the synthesized speech corrected by the means for correcting the synthesized speech.

The speech synthesizer according to claim 1, further comprising a prosody operating unit that manipulates the target prosody displayed by the display device or the prosody of the actually selected phoneme to arbitrarily change the prosody.

A computer-readable recording medium on which is recorded a program for causing a computer to function as: means for extracting, from the phoneme waveform data of a speaker stored in a speech synthesis database, a speaker-dependent feature among the parameters that express a speech fundamental frequency pattern; phoneme combining means for selecting and connecting phoneme waveforms corresponding to target characters from the speech synthesis database; target prosody generation means for generating, using the feature, a target prosody for synthesizing speech corresponding to the target characters; and means for correcting the synthesized speech generated by the phoneme combining means based on the target prosody generated by the target prosody generation means.