JP3748064B2

JP3748064B2 - Speech synthesis method, speech synthesizer, and speech synthesis program

Info

Publication number: JP3748064B2
Application number: JP2002033118A
Authority: JP
Inventors: 秀之水野; 匡伸阿部
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-02-08
Filing date: 2002-02-08
Publication date: 2006-02-22
Anticipated expiration: 2022-02-08
Also published as: JP2003233386A

Description

[0001]
[Technical field of the invention]
The present invention relates to a speech synthesis method and apparatus, and a program therefor, that synthesize speech by selecting appropriate speech segment data from a speech database and connecting the segments in order, in accordance with the phoneme string and prosodic information indicated by input text.
[0002]
[Prior art]
In this type of speech synthesis technology, as the cost of large-capacity storage has fallen in recent years, speech synthesizers have been proposed that store speech data ranging from several tens of minutes to several hours as-is in a large-capacity storage device, select the optimal speech segments according to the phoneme string and prosodic information indicated by the input text, and appropriately transform and connect them to synthesize high-quality speech (for example, Japanese Patent No. 2761552).
[0003]
Such a speech synthesizer can output high-quality synthesized speech for input text. However, because it requires a large volume of speech data and a storage device to hold it, it is difficult to mount on small portable electronic devices with limited storage capacity (for example, personal digital assistants (PDAs) and mobile phones).
[0004]
There is also a client-server method in which the client transmits text data to the server, the server synthesizes speech using a large-capacity database installed on the server side, and the client receives the resulting speech signal. In that case, however, transmission and reception take time, and transmitting synthesized speech is problematic, particularly over the Internet as it currently stands.
[0005]
[Problems to be solved by the invention]
A first object of the present invention is to enable realization of high-quality speech synthesis similar to using a large-capacity speech database even in a portable electronic device or the like having a limited mounting capacity.
The second object of the present invention is to significantly reduce the response time as compared with the case where voice is synthesized on the server side by a client-server method in a portable electronic device or the like and voice signals are transmitted and received.
[0006]
[Means for Solving the Problems]
In the present invention, a small-scale, small-capacity local speech database is provided on the terminal side, and speech segment data is selected from the local speech database according to the phoneme string and prosodic information indicated by the text information. It is then determined whether the speech segment data selected from the local speech database satisfies a predetermined fitness; if it does, that speech segment data is selected as-is. If, on the other hand, no speech segment data is selected from the local speech database, or the selected data does not satisfy the fitness, appropriate speech segment data is downloaded via a network from a large-scale, large-capacity speech database at a remote location according to the phoneme string and prosodic information. Speech is then synthesized using the speech segment data selected from the local speech database or the speech segment data downloaded from the remote speech database. This enables high-quality speech synthesis without mounting a large-scale, large-capacity speech database on the terminal side.
[0007]
Further, in the present invention, the speech segment data downloaded from the remote speech database is either appended to the local speech database or held in a speech segment data cache memory separate from the local speech database, so that the downloaded speech segment data can thereafter be selected on the terminal side. This also shortens the response time of speech synthesis.
[0008]
[Embodiments of the invention]
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
First, FIG. 1 shows a conceptual diagram of an entire system using the speech synthesizer of the present invention. In FIG. 1, the user terminal 1 is a generic term for PDAs, portable electronic devices, PHS handsets, and the like. The user terminal 1 is equipped with a speech synthesizer (for example, a speech synthesis module) 100 that contains a small-scale, small-capacity local speech database 120. A large-scale, large-capacity speech database 5 is installed on the center 3 side. The user uses the speech synthesizer 100 to listen to received mail or other text information as speech. The speech synthesizer 100 takes text information as input, selects speech segment data from the local speech database 120 according to its phoneme string and prosodic information, and synthesizes speech. If speech segment data needs to be downloaded from the center 3 during synthesis, the user terminal 1 connects to the center 3 via the network 2. For example, if the Internet is used as the network 2, the connection can be made over the packet communication network of a mobile phone or PHS; if a LAN is used as the network 2, it can be made over a wireless LAN. The user terminal 1 transmits a request for the required speech segment data to the reception server 4 on the center 3 side. The reception server 4 searches for speech segment data using the large-capacity speech database 5, to which it is connected by a LAN, and transmits the optimal speech segment data it finds to the requesting user terminal 1. The speech synthesizer 100 of the user terminal 1 performs speech synthesis using the received speech segment data.
[0009]
In FIG. 1, the reception server 4 and the large-capacity speech database 5 are separate, but they may be implemented on the same device, in which case the LAN is not needed.
[0010]
Next, a few embodiments of the speech synthesis apparatus and speech synthesis method according to the present invention will be described in detail.
[0011]
[Example 1]
FIG. 2 is a block diagram showing a first embodiment of the speech synthesizer of the present invention. In FIG. 2, the speech synthesizer 100 comprises a text analysis unit 101, a prosody generation unit 102, a speech segment selection unit 103, a fitness determination unit 104, a synthesis unit 105, a data transmission/reception control unit 106, a speech segment storage control unit 107, a text analysis dictionary 110, and a local speech database 120. FIG. 2 omits the memory (working memory) that temporarily stores the input text information, the corresponding phoneme string and prosodic information, speech segment data, and the like. When the speech synthesizer 100 is mounted on a PDA, mobile phone, or similar device, the device's own communication function also serves as the data transmission/reception control unit 106.
[0012]
FIG. 3 shows a configuration example of the large-capacity speech database 5 on the center 3 side and the local speech database 120 in the speech synthesizer 100. The only difference between the large-capacity speech database 5 on the center 3 side in FIG. 1 and the local speech database 120 is the amount of data stored. For example, the local speech database 120 in the speech synthesizer 100 may hold only the basic items among the speech segment data stored in the large-capacity speech database 5 on the center 3 side. Alternatively, statistics may be taken of the phoneme concatenations in the headwords of, for example, a Japanese-language dictionary, and the local speech database 120 may hold those with a high frequency of occurrence (for example, roughly the top 1000). The frequency of occurrence of phoneme concatenations is described in detail in, for example, Japanese Patent Laid-Open No. 1-44498.
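The frequency-based choice of local contents can be illustrated with a minimal Python sketch. It assumes that a "phoneme concatenation" is a pair of adjacent phonemes and that each headword is already available as a list of phoneme symbols; the function name `top_concatenations` and the sample words are hypothetical and do not come from the patent.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def top_concatenations(
    headwords: Iterable[Sequence[str]], n: int
) -> List[Tuple[str, str]]:
    """Return the n most frequent adjacent phoneme pairs (assumed unit)."""
    counts: Counter = Counter()
    for phonemes in headwords:
        # Count every pair of neighbouring phonemes in this headword.
        counts.update(zip(phonemes, phonemes[1:]))
    return [pair for pair, _ in counts.most_common(n)]


# Hypothetical headwords, already converted to phoneme symbols.
words = [["k", "a", "k", "i"], ["k", "a", "s", "a"], ["a", "k", "a"]]
frequent = top_concatenations(words, 2)
```

In a real system `n` would be on the order of the "top 1000" mentioned above, and only segments covering the returned pairs would be shipped in the local database.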
[0013]
FIG. 4 is an example of a flowchart of the speech synthesis method of the present invention corresponding to Example 1 of FIG. 2. The operation of Example 1 shown in FIG. 2 is described below with reference to FIG. 4.
[0014]
When text information is input (step 1001), the text analysis unit 101 analyzes the text and determines a phoneme string and accents (step 1002). Specifically, the text analysis unit 101 refers to the text analysis dictionary 110, performs morphological analysis (such as dependency and part-of-speech analysis), kanji-to-kana conversion, and accent processing on the text information, and determines the phoneme string (phoneme symbol string) and accents.
[0015]
Next, the prosody generation unit 102 determines the prosodic information based on the phoneme string and accents. The prosodic information includes a pitch pattern (average F0, F0 slope, etc.), a duration pattern for each phoneme, an amplitude pattern, and the like. As is well known, the prosody generation unit 102 generates the necessary prosodic information by referring to predetermined generation rules, tables, and the like.
[0016]
Next, the speech segment selection unit 103 selects the optimal speech segment data from the local speech database 120 according to the phoneme string and prosodic information (step 1003), and the fitness determination unit 104 obtains the fitness of the selected speech segment data and determines whether it satisfies a predetermined threshold (step 1004).
[0017]
The fitness of speech segment data is obtained, for example, as follows. Assume the local speech database 120 has the structure shown in FIG. 3. For the prosodic information (the target), let P_t be the preceding phoneme environment, S_t the following phoneme environment, FA_t the average F0, FS_t the F0 slope, and D_t the duration. For the selected speech segment data, let P_c be the preceding phoneme environment, S_c the following phoneme environment, FA_c the average F0, FS_c the F0 slope, and D_c the duration. Let DP(a, b) be a function that gives the degree of difference between phonemes a and b. The fitness can then be expressed in terms of these quantities by equation (1), which is not reproduced in this text.

Here, α_p, α_s, α_fa, α_fs, and α_d are appropriate weights. One example of DP(a, b), where SP_a and SP_b are the average spectra (vectors) of phonemes a and b, is a function such as DP(a, b) = |SP_a − SP_b|.
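Because the body of equation (1) does not survive in this text, the following Python sketch only illustrates one plausible reading of it: a weighted sum of the five differences between target and candidate. The weighted-sum form, the use of absolute differences for the scalar terms, the Euclidean norm for |SP_a − SP_b|, and every name in the code (`Features`, `dp`, `fitness`, the weight keys) are assumptions made for illustration, not the patent's definition.

```python
import math
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass
class Features:
    preceding: str   # preceding phoneme environment (P)
    following: str   # following phoneme environment (S)
    avg_f0: float    # average F0 (FA)
    f0_slope: float  # F0 slope (FS)
    duration: float  # duration (D)


def dp(a: str, b: str, spectra: Mapping[str, Sequence[float]]) -> float:
    """DP(a, b) = |SP_a - SP_b|, taken here as the Euclidean norm."""
    return math.dist(spectra[a], spectra[b])


def fitness(
    target: Features,
    cand: Features,
    spectra: Mapping[str, Sequence[float]],
    w: Mapping[str, float],
) -> float:
    """Assumed form of equation (1): a weighted sum of differences."""
    return (
        w["p"] * dp(target.preceding, cand.preceding, spectra)
        + w["s"] * dp(target.following, cand.following, spectra)
        + w["fa"] * abs(target.avg_f0 - cand.avg_f0)
        + w["fs"] * abs(target.f0_slope - cand.f0_slope)
        + w["d"] * abs(target.duration - cand.duration)
    )


# Hypothetical average spectra and unit weights.
spectra = {"a": (1.0, 0.0), "i": (0.0, 1.0)}
weights = {"p": 1.0, "s": 1.0, "fa": 1.0, "fs": 1.0, "d": 1.0}
target = Features("a", "i", 120.0, 0.5, 0.10)
candidate = Features("i", "i", 110.0, 0.5, 0.10)
score = fitness(target, candidate, spectra, weights)
```

Under this reading the value behaves like a distance: it is zero when the candidate matches the target exactly and grows as the two diverge.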
[0018]
The fitness determination unit 104 calculates the fitness by, for example, equation (1) above; a smaller value indicates a closer match. If the fitness is smaller than a predetermined threshold, the speech segment data selected from the local speech database 120 is judged to fit and is selected as the final choice. If, on the other hand, no speech segment data is selected from the local speech database 120, or the fitness of the selected data is equal to or greater than the threshold, the result is judged not to fit. In that case, the fitness determination unit 104 activates the data transmission/reception control unit 106 and passes it the phoneme string and prosodic information.
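The decision rule in this paragraph is small enough to state as code. The sketch below is illustrative only; the function name `is_fit` and the convention of passing `None` when no local segment was selected are assumptions.

```python
from typing import Optional


def is_fit(score: Optional[float], threshold: float) -> bool:
    """True only if a local candidate exists and scores below the threshold.

    `score` is None when no segment could be selected locally; a score
    equal to or greater than the threshold counts as not fitting.
    """
    return score is not None and score < threshold
```

A `False` result is the point at which the data transmission/reception control unit 106 would be activated.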
[0019]
The data transmission/reception control unit 106 first connects to the center 3 via the network 2 (step 1008), then transmits the phoneme string and prosodic information to the center 3 to request selection of speech segment data (step 1009), and receives via the network 2 the speech segment data selected at the center 3 using the large-capacity speech database 5, together with its associated information (step 1010). The data transmission/reception control unit 106 sends the speech segment data received (downloaded) from the center 3 to the fitness determination unit 104, which selects this speech segment data as the final choice.
[0020]
The data transmission/reception control unit 106 also sends the speech segment data received (downloaded) from the center 3 and its associated information to the speech segment storage control unit 107, which appends them to the local speech database 120 (step 1011).
[0021]
Next, it is determined whether all the speech segment data corresponding to the phoneme string has been selected (step 1005). If not, processing is repeated from step 1003. At this point, if speech segment data has been newly appended to the local speech database 120, the speech segment selection unit 103 can select the optimal speech segment data from a pool that includes it.
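The loop formed by steps 1003 to 1005 and 1008 to 1011 can be sketched as follows. This is a simplified illustration, not the patent's implementation: each target is reduced to a string key, the local database holds at most one segment per key, and `score` and `download` are hypothetical callables standing in for the fitness calculation and the exchange with the center 3.

```python
from typing import Callable, Dict, List

Target = str   # stands in for one phoneme plus its prosodic target
Segment = str  # stands in for speech segment data


def select_all(
    targets: List[Target],
    local_db: Dict[Target, Segment],
    score: Callable[[Target, Segment], float],
    threshold: float,
    download: Callable[[Target], Segment],
) -> List[Segment]:
    """Select one segment per target, downloading when the local one fails."""
    chosen: List[Segment] = []
    for target in targets:                 # loop ends when step 1005 succeeds
        candidate = local_db.get(target)   # step 1003: local selection
        if candidate is not None and score(target, candidate) < threshold:
            chosen.append(candidate)       # step 1004: judged to fit
            continue
        segment = download(target)         # steps 1008-1010: ask the center
        local_db[target] = segment         # step 1011: keep it locally
        chosen.append(segment)
    return chosen                          # handed to synthesis (step 1007)


# Usage with a stubbed center that records each request.
requests: List[Target] = []


def fake_download(target: Target) -> Segment:
    requests.append(target)
    return "remote-" + target


db: Dict[Target, Segment] = {"a": "local-a"}
result = select_all(["a", "b", "b"], db, lambda t, s: 0.0, 1.0, fake_download)
```

In the usage example the second request for "b" is served from the local database, which is the response-time benefit the description attributes to storing downloaded segments.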
[0022]
If it is determined in step 1005 that all the speech segment data has been selected, the synthesis unit 105 modifies the prosody of each piece of speech segment data according to the prosodic information and connects the modified segments to synthesize speech (step 1007). Because this is the same as in conventional methods (for example, Japanese Patent No. 2761552), a detailed description is omitted.
[0023]
[Example 2]
FIG. 5 shows a second embodiment of the speech synthesizer of the present invention. FIG. 5 differs from the configuration of FIG. 2 in that a speech segment cache memory 130 is added, and the speech segment storage control unit 107 stores the speech segment data received (downloaded) from the center 3, together with its associated information, in the speech segment cache memory 130. The speech segment selection unit 103 can therefore subsequently select the optimal speech segment data using both the local speech database 120 and the speech segment cache memory 130.
[0024]
The second embodiment is effective when, for example, the local speech database 120 is implemented in ROM and new speech segment data cannot be added to it. The data in the speech segment cache memory 130 may either be erased each time the main power of the apparatus is turned off (that is, erased after each use) or be retained when the main power is off by means of an auxiliary power supply or the like.
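A minimal Python sketch of this arrangement follows. It assumes one segment per key and checks the cache before the read-only database, on the reasoning that a cached segment was downloaded because the local one did not fit; that lookup order, the class name `SegmentStore`, and its method names are illustrative assumptions rather than details given in the patent.

```python
from types import MappingProxyType
from typing import Dict, Mapping, Optional


class SegmentStore:
    """Read-only local database plus a writable cache for downloads."""

    def __init__(self, rom_db: Mapping[str, bytes], persistent: bool = False):
        # MappingProxyType gives a read-only view, mimicking a ROM database.
        self._rom = MappingProxyType(dict(rom_db))
        self._cache: Dict[str, bytes] = {}
        self._persistent = persistent  # True models an auxiliary power supply

    def lookup(self, key: str) -> Optional[bytes]:
        """Return a segment from the cache if present, else from the ROM."""
        if key in self._cache:
            return self._cache[key]
        return self._rom.get(key)

    def store_download(self, key: str, segment: bytes) -> None:
        """Keep a downloaded segment in the cache; the ROM is never written."""
        self._cache[key] = segment

    def power_off(self) -> None:
        """Erase the cache unless it is kept alive across power cycles."""
        if not self._persistent:
            self._cache.clear()


store = SegmentStore({"a": b"rom-a"})
store.store_download("b", b"downloaded-b")
```

Constructing the store with `persistent=True` corresponds to the variant in which the cache contents survive turning the main power off.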
[0025]
The flowchart of the speech synthesis method of the present invention corresponding to Example 2 of FIG. 5 is basically the same as FIG. 4 and is therefore omitted. The only difference is in step 1011 of FIG. 4: the speech segment data received from the center 3 is stored in the local speech database 120 in Example 1, but in the speech segment cache memory 130 in Example 2.
[0026]
FIG. 6 is a block diagram showing a conceptual configuration for building the speech synthesizer of the present invention on a computer. In FIG. 6, the speech synthesizer 100 comprises a CPU 210 that executes processing based on a program and controls each component, a memory 220 that stores the program and intermediate processing results, a data storage device 230 that stores speech segment data, dictionaries, and other files, and data transmission/reception control means 240 (the data transmission/reception control unit 106 in FIGS. 2 and 5) for exchanging data with a host via a network. A speech segment cache memory 250 may be added as needed. In small portable devices in particular, the data storage device 230 may be implemented in ROM rather than on a rewritable medium such as a magnetic disk, in which case the speech segment cache memory 250 is essential.
[0027]
Needless to say, some or all of the processing functions of the units in the apparatus shown in FIGS. 2 and 5 can be implemented as a computer program and the present invention realized by executing that program on a computer (FIG. 6), or the processing procedure shown in FIG. 4 can be implemented as a computer program and executed by a computer. A program that realizes these processing functions on a computer, or that causes a computer to execute the processing procedure, can be recorded on a computer-readable recording medium such as an FD, MO, ROM, memory card, CD, DVD, or removable disk for storage or provision, and can also be distributed over a network such as the Internet.
[0028]
[Effects of the invention]
As described above, according to the present invention, accessing a large-capacity speech database on the host side via a network as needed makes high-quality speech synthesis possible even when only a small-capacity local speech database is mounted on a terminal with limited storage capacity. In addition, providing a memory area on the terminal side to store the speech segment data downloaded from the host via the network makes it possible to shorten the response time.
[Brief description of the drawings]
FIG. 1 is a conceptual diagram of an entire system using a speech synthesizer of the present invention.
FIG. 2 is a block diagram of a first embodiment of a speech synthesizer according to the present invention.
FIG. 3 is a diagram illustrating a configuration example of a speech database.
FIG. 4 is a flowchart example of a speech synthesis method according to the present invention.
FIG. 5 is a block diagram of a second embodiment of the speech synthesizer according to the present invention.
FIG. 6 is a configuration diagram when the speech synthesizer according to the present invention is realized by a computer.
[Explanation of symbols]
1 User terminal
2 Network
3 Center
4 Reception server
5 Large-capacity speech database
100 Speech synthesizer
101 Text analysis unit
102 Prosody generation unit
103 Speech segment selection unit
104 Fitness determination unit
105 Synthesis unit
106 Data transmission/reception control unit
107 Speech segment storage control unit
110 Text analysis dictionary
120 Local speech database
130 Speech segment cache memory

Claims

A speech synthesis method for synthesizing speech by selecting speech segment data from a speech database in accordance with a phoneme string and prosodic information indicated by text information, the method comprising:
selecting speech segment data from a small-scale local speech database according to the phoneme string and prosodic information;
determining whether the speech segment data selected from the local speech database satisfies a predetermined fitness;
if the fitness is satisfied, selecting the speech segment data selected from the local speech database as-is;
if the fitness is not satisfied, downloading appropriate speech segment data, according to the phoneme string and prosodic information, from a remote speech database via a network; and
synthesizing speech using the speech segment data selected from the local speech database or the speech segment data downloaded from the remote speech database.

The speech synthesis method according to claim 1, wherein the speech segment data downloaded from the remote speech database is stored in the local speech database so that the stored speech segment data can thereafter be selected.

The speech synthesis method according to claim 1, wherein the speech segment data downloaded from the remote speech database is stored in a speech segment data cache memory separate from the local speech database, and speech segment data is thereafter selected using both the local speech database and the speech segment data cache memory.

A speech synthesizer that synthesizes speech by selecting speech segment data from a speech database in accordance with a phoneme string and prosodic information indicated by text information, the speech synthesizer comprising:
a small-scale local speech database that stores speech segment data;
means for selecting speech segment data from the local speech database according to the phoneme string and prosodic information;
means for determining whether the speech segment data selected from the local speech database satisfies a predetermined fitness;
means for selecting, if the fitness is satisfied, the speech segment data selected from the local speech database as-is;
means for downloading, if the fitness is not satisfied, appropriate speech segment data according to the phoneme string and prosodic information from a remote speech database via a network; and
means for synthesizing speech using the speech segment data selected from the local speech database or the speech segment data downloaded from the remote speech database.

A speech synthesis program for causing a computer to execute the speech synthesis method according to claim 1.