JP4392383B2

JP4392383B2 - Speech synthesis system, client device, speech segment database server device, speech synthesis method and program

Info

Publication number: JP4392383B2
Application number: JP2005143581A
Authority: JP
Inventors: 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-05-17
Filing date: 2005-05-17
Publication date: 2009-12-24
Anticipated expiration: 2025-05-17
Also published as: JP2006322962A

Abstract

PROBLEM TO BE SOLVED: To easily output high-quality synthesized speech with a short response time without providing a client apparatus with a large-scale database.

SOLUTION: The client apparatus 100 reads speech element data from a local speech element database storage part 111 using reading information and prosody information extracted from text data, and creates synthesized speech. Independently of this processing, the client apparatus 100 searches an optimum speech element index storage part 113 and reads out optimum speech element sequence information indicating the speech element data best suited to the text data. The client apparatus 100 then requests a speech element database server 300 to transmit the speech element data indicated by this optimum speech element sequence information. On receiving the optimum speech element data from the speech element database server 300, the client apparatus 100 stores it in the local speech element database storage part 111 and uses it for speech synthesis from the next time onward.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a technique for generating synthesized speech from text data.

In recent years, as the cost of large-capacity storage devices has fallen, waveform-concatenation corpus-based speech synthesis methods have been proposed. These methods store a large volume of speech data, several tens of minutes or more, as-is in a large-capacity storage device, select speech units appropriately according to the input text and prosodic information, and concatenate and modify them to synthesize high-quality speech (see, for example, Patent Literature 1 and Non-Patent Literature 1).

With such methods it has become possible, in principle, to generate high-quality synthesized speech equivalent to a natural voice. Specifically, speech units that partially or completely match the phoneme sequence corresponding to the character string to be synthesized are first retrieved from a speech database using a speech unit dictionary organized as a binary tree or the like. Next, from among the many speech units, each assigned a cost according to an evaluation measure (a combination of several parameters) for evaluating speech unit similarity, an appropriate combination of speech units is selected by a method such as DP (dynamic programming). Speech is then synthesized by concatenating the selected speech units in order (Non-Patent Literature 2). However, with this kind of speech synthesis, high-quality synthesized speech cannot be generated if appropriate speech units do not exist in the speech database in the first place.
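The DP-based selection step described above can be illustrated with a minimal sketch. It is not taken from the cited literature: the function name `select_units`, the two cost callbacks, and the representation of a speech unit as a single number are assumptions made purely for illustration. The sketch picks one candidate per position so that the sum of per-unit costs and costs between adjacent units is minimal.

```python
from typing import Callable, List, Sequence, TypeVar

U = TypeVar("U")  # a speech unit; its concrete type is left open in this sketch


def select_units(
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[int, U], float],
    join_cost: Callable[[U, U], float],
) -> List[U]:
    """Pick one unit per position, minimising total cost by dynamic programming.

    Assumes every position has at least one candidate.
    """
    # best[i][j] = (cheapest total cost of a path ending at candidates[i][j],
    #               index of the predecessor at position i - 1)
    best = [[(target_cost(0, c), -1) for c in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for c in candidates[i]:
            costs = [
                best[i - 1][k][0] + join_cost(p, c)
                for k, p in enumerate(candidates[i - 1])
            ]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_min] + target_cost(i, c), k_min))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path: List[U] = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]
```

In a real system the cost functions would combine several parameters, as the text notes; here they are supplied by the caller so that the search itself stays independent of any particular evaluation measure.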

Therefore, to synthesize a wide variety of text with high quality, it is essential to use a speech database containing a rich variety of speech units. For this reason, development in recent years has moved toward further increasing the capacity of speech databases in order to increase the variety of speech units and improve the quality of synthesized speech.

Patent Literature 1: Japanese Patent No. 2761552
Non-Patent Literature 1: M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and A. Syrdal, "Choose the best to modify the least: A new generation concatenative synthesis system", in Proc. Eurospeech '99, 1999, pp. 2291-2294
Non-Patent Literature 2: Hirokawa et al., "Waveform Selection Method in Waveform Editing Type Rule Synthesis Method", IEICE Speech Research Group materials, SP89-114, pp. 33-40 (1990)

Increasing the capacity of speech databases in this way has improved the quality of synthesized speech, but it has also created a practical problem: how to distribute software that requires such a large-scale speech database.

Today, as Internet access has become faster through ADSL, FTTH, and the like, distributing software by online download over the Internet, rather than on physical media such as CD-ROMs as in the past, is becoming widespread because it reduces distribution costs.

However, even with recent high-speed access, a large-scale database takes a very long time to download, so online distribution is practically impossible.

In the field of portable devices such as mobile terminals and mobile phones, a variety of user-friendly software is being developed to promote further adoption and differentiation of devices. Speech synthesis is considered a necessary technology in these fields because it can convey information in speech that is easy for people to understand. However, a waveform-concatenation corpus-based speech synthesis method, which requires a large-scale database, cannot be run on such mobile phones and similar devices.

These technical problems can be solved by providing a server device on a network such as the Internet and running the speech synthesis software there. That is, if a client device such as a mobile phone sends text data to the server device, the server device generates synthesized speech and returns it to the client device, then the client device no longer needs to hold a large-scale database. In that case, however, large-scale server facilities are needed to cope with heavy access from client devices. This approach therefore entails enormous server costs, among other problems, and is not realistic.

For this reason, methods have also been proposed, such as "Speech synthesis method, speech synthesis apparatus, and speech synthesis program" (Japanese Patent Laid-Open No. 2003-233386), in which synthesized speech is generated on the client device but part of the speech unit data is placed on a server device on the network, and the speech units held by the client device and those on the network are used selectively according to the matching rate of the speech units. To choose properly between the two according to the matching rate, an appropriate matching-rate threshold must be set. In practice, however, setting an appropriate threshold is difficult. Moreover, with this method, if the speech units downloaded from the server device are not appropriate, speech units must be downloaded again, so the throughput from text input until synthesized speech is obtained varies greatly.

To reduce the load on the server device, one can also readily conceive of a method that addresses the above problem by performing text analysis and similar processing on the client device and executing only the processing that requires speech unit data on the server device. In that case, however, no speech at all can be output from the time the client device sends the reading information and prosodic parameters needed for speech synthesis to the server device until the server device generates and sends the corresponding synthesized speech and the client device receives it. As a result, depending on network congestion, the response time before speech is output can become very long.

The present invention has been made in view of these points, and its object is to provide a technique that can easily output high-quality synthesized speech with a short response time without providing a large-scale database in the client device.

To solve the above problems, in the first aspect of the present invention, optimum speech unit data (meaning "speech unit data stored in the speech unit database server device") is stored in the optimum speech unit database storage unit of the speech unit database server device.

On the client side, local speech unit data (meaning "speech unit data stored in the client device") is stored in the local speech unit database storage unit of the client device. The local speech unit index storage unit of the client device stores local speech unit sequence information, in which local speech unit storage information designating local speech unit data is associated with the reading information and prosodic parameters corresponding to that local speech unit data. The optimum speech unit index storage unit of the client device stores optimum speech unit sequence information, in which optimum speech unit storage information designating optimum speech unit data is associated with the reading information and prosodic parameters corresponding to that optimum speech unit data.

First, the text data to be converted to speech is input to the text analysis unit of the client device, which performs text analysis on the text data to generate reading information and prosodic information and outputs them. Next, the prosodic information output from the text analysis unit is input to the prosody parameter acquisition unit of the client device, which uses it to generate the physical prosodic parameters needed for speech synthesis and outputs those parameters. The reading information output from the text analysis unit and the prosodic parameters output from the prosody parameter acquisition unit are then input to the local speech unit search unit of the client device. Using the input reading information and prosodic parameters as keys, the local speech unit search unit searches the local speech unit index storage unit, extracts the local speech unit sequence information corresponding to reading information and prosodic parameters that fall within a similar range of the input ones, and outputs the extracted local speech unit sequence information.

The local speech unit storage information in the local speech unit sequence information output from the local speech unit search unit is input to the local speech unit data reading unit of the client device, which reads the local speech unit data designated by that storage information from the local speech unit database storage unit. The local speech unit data thus read is input to the speech unit connection unit of the client device, which uses it to generate synthesized speech data and outputs the synthesized speech data.
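The client-side flow above (index search by reading information and prosodic parameters, reading of the designated unit data, and connection) might be sketched as follows. The names `UnitSequenceInfo`, `search_index`, and `synthesize` are hypothetical, as are the integer storage IDs, the two-element prosody tuple (pitch and duration), and the fixed tolerance used to model the "similar range"; this section does not define any of these concretely.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Prosody = Tuple[float, float]  # assumed: (pitch in Hz, duration in ms)


@dataclass(frozen=True)
class UnitSequenceInfo:
    storage_info: int  # designates a unit in the database (assumed: integer ID)
    reading: str       # reading information, e.g. a phoneme string
    prosody: Prosody   # prosodic parameters associated with the unit


def search_index(
    index: Sequence[UnitSequenceInfo],
    reading: str,
    prosody: Prosody,
    tolerance: Prosody = (15.0, 30.0),  # assumed definition of "similar range"
) -> List[UnitSequenceInfo]:
    """Return entries with the same reading whose prosody is within tolerance."""
    return [
        e for e in index
        if e.reading == reading
        and all(abs(a - b) <= t for a, b, t in zip(e.prosody, prosody, tolerance))
    ]


def synthesize(
    keys: Sequence[Tuple[str, Prosody]],
    index: Sequence[UnitSequenceInfo],
    database: Dict[int, bytes],
) -> bytes:
    """For each key, read the closest matching local unit and join the waveforms."""
    pieces = []
    for reading, prosody in keys:
        hits = search_index(index, reading, prosody)
        if not hits:
            continue  # no unit in range; fallback behaviour is outside this sketch
        closest = min(
            hits, key=lambda e: sum(abs(a - b) for a, b in zip(e.prosody, prosody))
        )
        pieces.append(database[closest.storage_info])
    return b"".join(pieces)
```

Simple byte concatenation stands in for the speech unit connection unit; actual waveform connection would also smooth the joins.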

In addition, the reading information output from the text analysis unit and the prosodic parameters output from the prosody parameter acquisition unit are input to the optimum speech unit search unit of the client device. Using them as keys, the optimum speech unit search unit searches the optimum speech unit index storage unit, extracts the optimum speech unit sequence information corresponding to reading information and prosodic parameters that fall within a similar range of the input ones, and outputs the extracted optimum speech unit sequence information. The local speech unit sequence information and the optimum speech unit sequence information output from the local speech unit search unit and the optimum speech unit search unit, respectively, are then input to the requested speech unit determination unit of the client device. This unit generates requested speech unit sequence information by excluding, from the optimum speech unit sequence information, the entries whose reading information and prosodic parameters are common to the local speech unit sequence information, and outputs it.

The speech unit information transmission unit of the client device then transmits the optimum speech unit storage information in the requested speech unit sequence information to the speech unit database server device over the network.
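The exclusion performed by the requested speech unit determination unit amounts to a set difference keyed on reading information and prosodic parameters. A minimal sketch follows; the type `UnitSequenceInfo` and the function `determine_requested_units` are hypothetical names, and "common" is assumed here to mean exact equality of both fields.

```python
from typing import List, NamedTuple, Sequence, Tuple


class UnitSequenceInfo(NamedTuple):
    storage_info: int             # designates a unit (assumed: integer ID)
    reading: str                  # reading information
    prosody: Tuple[float, float]  # prosodic parameters (assumed: pitch, duration)


def determine_requested_units(
    optimum: Sequence[UnitSequenceInfo],
    local: Sequence[UnitSequenceInfo],
) -> List[UnitSequenceInfo]:
    """Drop optimum entries whose reading and prosody already exist locally."""
    held = {(e.reading, e.prosody) for e in local}
    return [e for e in optimum if (e.reading, e.prosody) not in held]


def storage_info_to_send(requested: Sequence[UnitSequenceInfo]) -> List[int]:
    """Only the storage information is transmitted to the server."""
    return [e.storage_info for e in requested]
```

An empty result means every optimum unit is already held locally, so no request needs to be sent; no threshold is involved in that decision.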

Next, the speech unit information receiving unit of the speech unit database server device receives the optimum speech unit storage information in the requested speech unit sequence information. The received optimum speech unit storage information is input to the optimum speech unit data reading unit of the speech unit database server device, which reads the optimum speech unit data designated by that storage information from the optimum speech unit database storage unit. The speech unit data transmission unit of the speech unit database server device then returns the read optimum speech unit data to the client device over the network.

The speech unit data receiving unit of the client device then receives the optimum speech unit data, and the speech unit database addition unit of the client device additionally stores the received optimum speech unit data in the local speech unit database storage unit as new local speech unit data. The speech unit index addition unit of the client device additionally stores the local speech unit sequence information corresponding to the new local speech unit data in the local speech unit index storage unit.

Here, the generation of synthesized speech data on the client device is performed independently of the addition of optimum speech unit data to the local speech unit database storage unit, which requires access to the speech unit database server device. The time from input of the text data to output of the synthesized speech therefore depends only on the processing performance of the client device and not at all on the quality or configuration of the network.
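One way to picture this independence is to run the database update on a background thread while synthesis proceeds with whatever is already local. The sketch below is only an illustration under that assumption; the text does not say that threads are used, and the class `Client`, its methods, and the `fetch` callback standing in for the network exchange are all hypothetical.

```python
import threading
from typing import Callable, Dict, List, Sequence, Tuple


class Client:
    def __init__(
        self,
        local_db: Dict[int, bytes],
        fetch: Callable[[List[int]], Dict[int, bytes]],  # hypothetical server call
    ) -> None:
        self.local_db = dict(local_db)
        self._fetch = fetch
        self._lock = threading.Lock()

    def speak(
        self, local_ids: Sequence[int], optimum_ids: Sequence[int]
    ) -> Tuple[bytes, threading.Thread]:
        """Synthesize from local units now; fetch missing optimum units separately."""
        with self._lock:
            missing = [u for u in optimum_ids if u not in self.local_db]
        worker = threading.Thread(target=self._add_units, args=(missing,), daemon=True)
        worker.start()
        # Synthesis never waits on the worker, so it is unaffected by network delay.
        with self._lock:
            audio = b"".join(self.local_db[u] for u in local_ids)
        return audio, worker

    def _add_units(self, ids: List[int]) -> None:
        """Store fetched units as new local units for use in later syntheses."""
        if not ids:
            return
        fetched = self._fetch(ids)
        with self._lock:
            self.local_db.update(fetched)
```

The first call returns audio built from the units on hand; once the worker has finished, later calls for the same text can draw on the newly added units.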

In addition, because optimum speech unit data is added to the local speech unit database storage unit as new local speech unit data, the quality of synthesized speech data generated thereafter improves.

Furthermore, the client device determines which optimum speech unit data to request from the speech unit database server device based on the requested speech unit sequence information, that is, the optimum speech unit sequence information with the entries whose reading information and prosodic parameters are common to the local speech unit sequence information excluded. In other words, the client device decides whether to request optimum speech unit data from the speech unit database server device according to whether that optimum speech unit data already exists in the local speech unit database storage unit. The present invention therefore does not require setting parameters, such as thresholds, that are difficult to set appropriately.

To solve the above problems, in the second aspect of the present invention, optimum speech unit data is stored in the optimum speech unit database storage unit of the speech unit database server device, and the optimum speech unit index storage unit of the speech unit database server device stores optimum speech unit sequence information, in which optimum speech unit storage information designating optimum speech unit data is associated with the reading information and prosodic parameters corresponding to that optimum speech unit data.

On the client side, local speech unit data is stored in the local speech unit database storage unit of the client device, and the local speech unit index storage unit of the client device stores local speech unit sequence information, in which local speech unit storage information designating local speech unit data is associated with the reading information and prosodic parameters corresponding to that local speech unit data.

First, the text data to be converted to speech is input to the text analysis unit of the client device, which performs text analysis on the text data to generate reading information and prosodic information and outputs them. Next, the prosodic information output from the text analysis unit is input to the prosody parameter acquisition unit of the client device, which uses it to generate the physical prosodic parameters needed for speech synthesis and outputs those parameters. The reading information output from the text analysis unit and the prosodic parameters output from the prosody parameter acquisition unit are then input to the local speech unit search unit of the client device. Using the input reading information and prosodic parameters as keys, the local speech unit search unit searches the local speech unit index storage unit, extracts the local speech unit sequence information corresponding to reading information and prosodic parameters that fall within a similar range of the input ones, and outputs the extracted local speech unit sequence information.

The local speech unit storage information in the local speech unit sequence information output from the local speech unit search unit is input to the local speech unit data reading unit of the client device, which reads the local speech unit data designated by that storage information from the local speech unit database storage unit. The local speech unit data thus read is input to the speech unit connection unit of the client device, which uses it to generate synthesized speech data and outputs the synthesized speech data. In addition, the speech unit information transmission unit of the client device transmits the reading information output from the text analysis unit and the prosodic parameters output from the prosody parameter acquisition unit to the speech unit database server device over the network.

The speech unit information receiving unit of the speech unit database server device receives this reading information and these prosodic parameters. Using the received reading information and prosodic parameters as keys, the optimum speech unit search unit of the speech unit database server device searches the optimum speech unit index storage unit, extracts the optimum speech unit sequence information corresponding to reading information and prosodic parameters that fall within a similar range of the received ones, and outputs the extracted optimum speech unit sequence information. The optimum speech unit storage information in the optimum speech unit sequence information output from the optimum speech unit search unit is input to the optimum speech unit data reading unit of the speech unit database server device, which reads the optimum speech unit data designated by that storage information from the optimum speech unit database storage unit. The speech unit data transmission unit of the speech unit database server device then returns the read optimum speech unit data to the client device over the network.

The speech unit data receiving unit of the client device receives the optimum speech unit data. The speech unit database addition unit of the client device then additionally stores at least part of the received optimum speech unit data in the local speech unit database storage unit as new local speech unit data. The speech unit index addition unit of the client device additionally stores the local speech unit sequence information corresponding to the new local speech unit data in the local speech unit index storage unit.

Here, the generation of synthesized speech data on the client device is performed independently of the addition of optimum speech unit data to the local speech unit database storage unit, which requires access to the speech unit database server device. The time from input of the text data to output of the synthesized speech therefore depends only on the processing performance of the client device and not at all on the quality or configuration of the network.

In addition, because optimum speech unit data is added to the local speech unit database storage unit as new local speech unit data, the quality of synthesized speech data generated thereafter improves.

Furthermore, the client device transmits the reading information output from the text analysis unit and the prosodic parameters output from the prosody parameter acquisition unit to the speech unit database server device and requests transmission of the corresponding optimum speech unit data. The present invention therefore does not require setting parameters, such as thresholds, that are difficult to set appropriately.

To solve the above problems, in the third aspect of the present invention, optimum speech unit data is stored in the optimum speech unit database storage unit of the speech unit database server device, and the optimum speech unit index storage unit of the speech unit database server device stores optimum speech unit sequence information, in which optimum speech unit storage information designating optimum speech unit data is associated with the reading information and prosodic parameters corresponding to that optimum speech unit data.

On the client side, local speech unit data is stored in the local speech unit database storage unit of the client device, and the local speech unit index storage unit of the client device stores local speech unit sequence information, in which local speech unit storage information designating local speech unit data is associated with the reading information and prosodic parameters corresponding to that local speech unit data.

First, the text data to be converted to speech is input to the text analysis unit of the client device, which performs text analysis on the text data to generate reading information and prosodic information and outputs them. Next, the prosodic information output from the text analysis unit is input to the prosody parameter acquisition unit of the client device, which uses it to generate the physical prosodic parameters needed for speech synthesis and outputs those parameters. The reading information output from the text analysis unit and the prosodic parameters output from the prosody parameter acquisition unit are then input to the local speech unit search unit of the client device. Using the input reading information and prosodic parameters as keys, the local speech unit search unit searches the local speech unit index storage unit, extracts the local speech unit sequence information corresponding to reading information and prosodic parameters that fall within a similar range of the input ones, and outputs the extracted local speech unit sequence information.

The local speech unit storage information in the local speech unit sequence information output from the local speech unit search unit is input to the local speech unit data reading unit of the client device, which reads the local speech unit data designated by that storage information from the local speech unit database storage unit. The local speech unit data thus read is input to the speech unit connection unit of the client device, which uses it to generate synthesized speech data and outputs the synthesized speech data. In addition, the speech unit information transmission unit of the client device transmits the local speech unit sequence information output from the local speech unit search unit, the reading information output from the text analysis unit, and the prosodic parameters output from the prosody parameter acquisition unit to the speech unit database server device over the network.

音声素片データベースサーバ装置の音声素片情報受信部は、ローカル音声素片系列情報、読み情報及び韻律パラメータを受信する。そして、受信された読み情報及び韻律パラメータをキーとして、音声素片データベースサーバ装置の最適音声素片探索部において、最適音声素片インデックス格納部を検索し、受信された読み情報及び韻律パラメータの類似範囲に属する読み情報及び韻律パラメータに対応する最適音声素片系列情報を抽出し、抽出した最適音声素片系列情報を出力する。また、音声素片情報受信部において受信されたローカル音声素片系列情報及び最適音声素片探索部から出力された最適音声素片系列情報が、音声素片データベースサーバ装置の送信音声素片決定部に入力され、当該送信音声素片決定部において、当該最適音声素片系列情報から当該ローカル音声素片系列情報と読み情報及び韻律パラメータが共通するもの除外した送信音声素片系列情報を生成し、当該送信音声素片系列情報を出力する。そして、送信音声素片系列情報の最適音声素片格納情報が、音声素片データベースサーバ装置の最適音声素片データ読み出し部に入力され、当該最適音声素片データ読み出し部において、当該最適音声素片格納情報が指定する最適音声素片データを、最適音声素片データベース格納部から読み出す。その後、読み出された最適音声素片データを、音声素片データベースサーバ装置の音声素片データ送信部において、ネットワークを通じ、クライアント装置に返信する。 The speech unit information receiving unit of the speech unit database server device receives the local speech unit sequence information, the reading information, and the prosodic parameters. Using the received reading information and prosodic parameters as keys, the optimal speech unit search unit of the speech unit database server device searches the optimal speech unit index storage unit, extracts the optimal speech unit sequence information corresponding to reading information and prosodic parameters that fall within a similar range of the received ones, and outputs the extracted optimal speech unit sequence information. The local speech unit sequence information received by the speech unit information receiving unit and the optimal speech unit sequence information output from the optimal speech unit search unit are then input to the transmission speech unit determination unit of the speech unit database server device, which generates transmission speech unit sequence information by removing from the optimal speech unit sequence information every entry whose reading information and prosodic parameters are shared with the local speech unit sequence information, and outputs the transmission speech unit sequence information. Next, the optimal speech unit storage information contained in the transmission speech unit sequence information is input to the optimal speech unit data reading unit of the speech unit database server device, which reads the optimal speech unit data specified by that storage information from the optimal speech unit database storage unit. Finally, the speech unit data transmission unit of the speech unit database server device returns the read optimal speech unit data to the client device through the network.
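
The exclusion step performed by the transmission speech unit determination unit can be pictured as a set difference keyed on reading information and prosodic parameters. The following is only a minimal sketch under stated assumptions: the `UnitInfo` record, its field layout, and the function name `units_to_send` are hypothetical and do not appear in the patent, and "shared" is taken here to mean exact equality of the reading and prosody values.

```python
from typing import NamedTuple

class UnitInfo(NamedTuple):
    reading: str      # reading information (e.g. a phoneme sequence)
    prosody: tuple    # prosodic parameters (e.g. average F0, power)
    storage: tuple    # storage information: (file number, start, duration)

def units_to_send(optimal, local):
    """Drop optimal entries whose (reading, prosody) the client already holds."""
    held = {(u.reading, u.prosody) for u in local}
    return [u for u in optimal if (u.reading, u.prosody) not in held]

# Hypothetical data: the client already holds one of the two optimal units.
local = [UnitInfo("A", (200, 60), (8, 10, 110))]
optimal = [
    UnitInfo("A", (200, 60), (3, 40, 110)),   # same reading and prosody: skipped
    UnitInfo("A", (220, 62), (5, 300, 95)),   # not held locally: sent
]
to_send = units_to_send(optimal, local)
```

Because the decision reduces to a membership test, no similarity threshold has to be tuned on the server side in this sketch.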

クライアント装置の音声素片データ受信部は、最適音声素片データを受信する。そして、受信された最適音声素片データを、クライアント装置の音声素片データベース追加部において、新たなローカル音声素片データとしてローカル音声素片データベース格納部に追加格納する。また、クライアント装置の音声素片インデックス追加部において、新たなローカル音声素片データに対応するローカル音声素片系列情報を、ローカル音声素片インデックス格納部に追加格納する。
ここで、クライアント装置における合成音声データの生成処理は、音声素片データベースサーバ装置へのアクセスが必要なローカル音声素片データベース格納部への最適音声素片データの追加処理と独立に行われる。その場合、テキストデータの入力から合成音声が出力されるまでの時間は、クライアント装置の処理性能のみに依存し、ネットワークの品質や構成に全く依存しない。 The speech unit data receiving unit of the client device receives the optimal speech unit data. Then, the received optimum speech unit data is additionally stored in the local speech unit database storage unit as new local speech unit data in the speech unit database addition unit of the client device. Further, the speech unit index adding unit of the client device additionally stores the local speech unit sequence information corresponding to the new local speech unit data in the local speech unit index storage unit.
Here, the generation of synthesized speech data in the client device is performed independently of the process of adding optimal speech unit data to the local speech unit database storage unit, which is the process that requires access to the speech unit database server device. Consequently, the time from the input of the text data to the output of the synthesized speech depends only on the processing performance of the client device and does not depend on the quality or configuration of the network at all.
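
One way to picture this independence is to run the database update on a background thread while synthesis reads only from the local store. The sketch below is an illustration under stated assumptions, not the patented implementation: the dictionaries `local_db` and `server_db`, the function names, and the use of a `threading.Event` to stand in for network delay are all hypothetical.

```python
import threading

local_db = {"A": "basic-A"}            # hypothetical local speech unit store
server_db = {"A": "high-quality-A"}    # stand-in for the server-side database
network_ready = threading.Event()      # lets the example control "network" timing

def synthesize(reading):
    """Build output from whatever the local store holds right now."""
    return local_db[reading]

def update_from_server(reading):
    """Background task: wait for the network, then store the better unit."""
    network_ready.wait()
    local_db[reading] = server_db[reading]

worker = threading.Thread(target=update_from_server, args=("A",))
worker.start()
first = synthesize("A")    # returns without waiting for the network
network_ready.set()        # the server reply "arrives"
worker.join()
second = synthesize("A")   # a later synthesis uses the added unit
```

In this sketch the first call completes from local data alone, and only subsequent calls benefit from the unit fetched in the background.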

また、ローカル音声素片データベース格納部へ最適音声素片データが新たなローカル音声素片データとして追加されることにより、その後生成される合成音声データの品質が向上する。
さらに、音声素片データベースサーバ装置は、最適音声素片系列情報からローカル音声素片系列情報と読み情報及び韻律パラメータが共通するもの除外した送信音声素片系列情報をもとに、クライアント装置に送信する最適音声素片データを決定する。すなわち、音声素片データベースサーバ装置は、クライアント装置へ最適音声素片データを送信するか否かの判断を、ローカル音声素片データベース格納部に最適な音声素片データが存在するか否かによって行う。そのため、本発明では、適切な設定が困難な閾値等のパラメータを設定する必要はない。 Moreover, the quality of synthesized speech data generated thereafter is improved by adding the optimum speech unit data as new local speech unit data to the local speech unit database storage unit.
Furthermore, the speech unit database server device determines the optimal speech unit data to be transmitted to the client device on the basis of the transmission speech unit sequence information, which is obtained by removing from the optimal speech unit sequence information every entry whose reading information and prosodic parameters are shared with the local speech unit sequence information. In other words, the speech unit database server device decides whether to transmit optimal speech unit data to the client device according to whether the optimal speech unit data already exists in the local speech unit database storage unit. Therefore, the present invention does not require setting parameters such as thresholds, which are difficult to set appropriately.

また、第1から第3の本発明において好ましくは、クライアント装置の音声素片データ削除部において、ローカル音声素片データベース格納部に格納されたローカル音声素片データの合計サイズが予め決められた大きさ以下であるか否かを判定し、ローカル音声素片データベース格納部に格納されたローカル音声素片データの合計サイズが予め決められた大きさ以下でない場合、所定の優先順位に従って、当該ローカル音声素片データベース格納部に格納されたローカル音声素片データの一部を削除する。そして、クライアント装置の音声素片系列情報削除部において、音声素片データ削除部において削除されたローカル音声素片データに対応するローカル音声素片系列情報をローカル音声素片インデックス格納部から削除する。これにより、ローカル音声素片データベース格納部に格納可能なデータ量に制限がある環境においても本発明を適用することが可能となる。 Preferably, in the first to third aspects of the present invention, the speech unit data deletion unit of the client device determines whether the total size of the local speech unit data stored in the local speech unit database storage unit is no greater than a predetermined size and, if it is not, deletes part of the local speech unit data stored in the local speech unit database storage unit according to a predetermined priority order. The speech unit sequence information deletion unit of the client device then deletes, from the local speech unit index storage unit, the local speech unit sequence information corresponding to the local speech unit data deleted by the speech unit data deletion unit. This makes it possible to apply the present invention even in an environment where the amount of data that can be stored in the local speech unit database storage unit is limited.
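
The size check and priority-based deletion described above might look like the following sketch. The text only says that a predetermined priority order is used, so the specific policy here (delete the entry with the smallest priority value first, with last-use timestamps as the priority) is an assumption, as are the function name `evict` and the dictionary layout.

```python
def evict(db, index, max_bytes, priority):
    """Delete lowest-priority units until the total size is within max_bytes.

    db:       unit id -> speech unit data (bytes)
    index:    unit id -> speech unit sequence information
    priority: unit id -> number; smaller values are deleted first (assumption)
    """
    total = sum(len(data) for data in db.values())
    for unit_id in sorted(db, key=lambda u: priority[u]):
        if total <= max_bytes:
            break
        total -= len(db.pop(unit_id))
        index.pop(unit_id, None)   # keep the index consistent with the database
    return total

# Hypothetical contents: 1200 bytes stored against a 1000-byte limit.
db = {"u1": b"\x00" * 400, "u2": b"\x00" * 300, "u3": b"\x00" * 500}
index = {"u1": "info1", "u2": "info2", "u3": "info3"}
last_used = {"u1": 30, "u2": 10, "u3": 20}   # e.g. last-use timestamps
remaining = evict(db, index, max_bytes=1000, priority=last_used)
```

Removing the index entry together with the data mirrors the pairing of the speech unit data deletion unit and the speech unit sequence information deletion unit.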

以上のように、本発明では、クライアント装置における合成音声データの生成処理が、音声素片データベースサーバ装置へのアクセスが必要なローカル音声素片データベース格納部への最適音声素片データの追加処理と独立に行われる。よって、テキストデータの入力から合成音声が出力されるまでの時間は、クライアント装置の処理性能のみに依存し、ネットワークの品質や構成に全く依存しない。その結果、高速な音声合成が可能となる。
また、本発明では、ローカル音声素片データベース格納部へ最適音声素片データが新たなローカル音声素片データとして追加されることにより、その後生成される合成音声データの品質が向上する。 As described above, in the present invention, the generation of synthesized speech data in the client device is performed independently of the process of adding optimal speech unit data to the local speech unit database storage unit, which requires access to the speech unit database server device. Therefore, the time from the input of text data to the output of synthesized speech depends only on the processing performance of the client device and does not depend on the quality or configuration of the network at all. As a result, high-speed speech synthesis is possible.
In the present invention, the optimum speech unit data is added as new local speech unit data to the local speech unit database storage unit, thereby improving the quality of synthesized speech data generated thereafter.

さらに、本発明では、音声素片データの適合性を判定するために、適切な設定が困難な閾値等のパラメータを設定する必要はない。よって、閾値等のパラメータの調整に時間が掛かったり、最適なパラメータの設定ができないため性能が十分でなかったり等の問題が生じる事が無い。
以上より、本発明では、クライアント装置に大規模なデータベースを設けることなく、容易に高い品質の合成音声を短い応答期間で出力することができる。 Furthermore, in the present invention, there is no need to set parameters such as thresholds, which are difficult to set appropriately, in order to judge the suitability of speech unit data. This avoids problems such as parameter tuning taking a long time, or performance being insufficient because optimal parameter values could not be set.
As described above, according to the present invention, high-quality synthesized speech can be easily output in a short response period without providing a large-scale database in the client device.

以下、本発明の実施の形態を図面を参照して説明する。
〔第１の実施の形態〕
初めに本発明の第１の実施の形態を述べる。
＜構成＞
図１（ａ）は、本形態の音声合成システム１の概念図である。
図１（ａ）に例示するように、本形態の音声合成システム１は、少なくとも１つのクライアント装置１００と、当該クライアント装置１００とネットワーク２００を通じて接続される少なくとも１つの音声素片データベースサーバ装置３００とを具備する。なお、説明の簡略化のため、以下では１つのクライアント装置１００と１つの音声素片データベースサーバ装置３００とのみについて説明を行うが、これ以上の数のクライアント装置１００及び音声素片データベースサーバ装置３００を設ける構成としてもよい。 Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[First Embodiment]
First, a first embodiment of the present invention will be described.
<Configuration>
FIG. 1A is a conceptual diagram of the speech synthesis system 1 of the present embodiment.
As illustrated in FIG. 1A, the speech synthesis system 1 of this embodiment comprises at least one client device 100 and at least one speech unit database server device 300 connected to the client device 100 through the network 200. For simplicity, only one client device 100 and one speech unit database server device 300 are described below, but a configuration with a larger number of client devices 100 and speech unit database server devices 300 is also possible.

この例のクライアント装置１００は、ユーザ２が利用する音声合成装置であり、例えばＰＤＡ（personal digital assistance）、携帯電話、パーソナルコンピュータ等のハードウェアに所定のプログラムを実行させることにより構成されるものである。また、この例の音声素片データベースサーバ装置３００は、大容量の音声素片データベースを保持するサーバ装置であり、公知のコンピュータに所定のプログラムを実行させることにより構成されるものである。なお、この例の音声素片データベースサーバ装置３００は、センタ３が管理運用する。また、ネットワーク２００としては、例えば、携帯電話のパケット通信網、電話線を利用したＡＤＳＬ（asymmetric digital subscriber line）通信網、光ファイバー通信網等を例示できるが、特にこれらに限定されるものではない。 The client device 100 in this example is a speech synthesis device used by the user 2, and is configured by causing hardware such as a PDA (personal digital assistant), a mobile phone, or a personal computer to execute a predetermined program. The speech unit database server device 300 in this example is a server device that holds a large-capacity speech unit database, and is configured by causing a known computer to execute a predetermined program. The speech unit database server device 300 in this example is managed and operated by the center 3. Examples of the network 200 include, but are not limited to, a mobile phone packet communication network, an ADSL (asymmetric digital subscriber line) communication network using telephone lines, and an optical fiber communication network.

［クライアント装置１００のハードウェア構成］
図１（ｂ）は、図１（ａ）におけるクライアント装置１００のハードウェア構成を例示した概念図である。
この図に例示するように、本形態のクライアント装置１００は、プログラム及び演算結果等を格納するワークメモリ１０１、プログラムに基づき演算等を行うとともにクライアント装置の各構成要素を制御するＭＰＵ（Micro Processing Unit）１０２、音声素片データ及びその他のファイルを格納する蓄積メモリ１０３、ネットワーク２００を通じてデータを送受信するためのデータ送受信部１０４、テキストデータ等が入力される入力部１０６及び合成音声データを出力する出力部１０７を具備する。なお、必要に応じ、クライアント装置１００がさらに書き換え可能メモリ１０５を具備することとしてもよい。また、ワークメモリ１０１としては、ＲＡＭ（Random Access Memory）等の半導体メモリを例示でき、書き換え可能メモリ１０５としては、ＥＥＰＲＲＯＭ（Electronically Erasable and Programmable Read Only Memory）等の半導体メモリを例示できる。また、蓄積メモリ１０３としては、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等を例示できる。さらに、入力部１０６としては、キー入力を受け付ける入力装置やデータ入力を受け付ける入力インタフェース等を例示できる。また、出力部１０７としては、合成音声を出力するスピーカやそのデータを出力するインタフェース等を例示できる。 [Hardware Configuration of Client Device 100]
FIG. 1B is a conceptual diagram illustrating the hardware configuration of the client device 100 in FIG. 1A.
As illustrated in this figure, the client device 100 of this embodiment includes a work memory 101 that stores programs, calculation results, and the like; an MPU (Micro Processing Unit) 102 that performs calculations based on the programs and controls each component of the client device; a storage memory 103 that stores speech unit data and other files; a data transmission/reception unit 104 for transmitting and receiving data through the network 200; an input unit 106 to which text data and the like are input; and an output unit 107 that outputs synthesized speech data. The client device 100 may further include a rewritable memory 105 as necessary. Examples of the work memory 101 include a semiconductor memory such as a RAM (Random Access Memory), and examples of the rewritable memory 105 include a semiconductor memory such as an EEPROM (Electrically Erasable Programmable Read-Only Memory). Examples of the storage memory 103 include a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory. Examples of the input unit 106 include an input device that accepts key input and an input interface that accepts data input. Examples of the output unit 107 include a speaker that outputs synthesized speech and an interface that outputs the synthesized speech data.

なお、図１（ｂ）の代わりに、例えばＣＰＵ（Central Processing Unit）、ＲＡＭ、ハードディスク装置等から構成される公知のコンピュータによって本形態のクライアント装置１００を構成することとしてもよい。
[音声素片データベースサーバ装置３００のハードウェア構成]
図１（ｃ）は、図１（ａ）における音声素片データベースサーバ装置３００のハードウェア構成を例示した概念図である。
この図に例示するように、本形態の音声素片データベースサーバ装置３００は、レジスタ３０１ａを持ち音声素片データベースサーバ装置３００全体を制御するＣＰＵ３０１、プログラム、音声素片データ及びその他のファイルを格納する補助記憶装置３０２、ＲＯＭ（Read Only Memory）３０３、プログラム及び演算結果等を格納するＲＡＭ３０４、及びデータ送受信部３０５を有している。なお、補助記憶装置３０２としては、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等を例示できる。また、データ送受信部３０５としては、ＬＡＮ（Local Area Network）カード、モデム、ルータ、ハブ等の通信装置を例示できる。 Instead of FIG. 1B, the client device 100 of this embodiment may be configured by a known computer including, for example, a CPU (Central Processing Unit), a RAM, a hard disk device, and the like.
[Hardware Configuration of Speech Unit Database Server Device 300]
FIG. 1C is a conceptual diagram illustrating the hardware configuration of the speech unit database server device 300 in FIG. 1A.
As illustrated in this figure, the speech unit database server device 300 of this embodiment has a CPU 301 that has a register 301a and controls the entire speech unit database server device 300, an auxiliary storage device 302 that stores programs, speech unit data, and other files, a ROM (Read Only Memory) 303, a RAM 304 that stores programs, calculation results, and the like, and a data transmission/reception unit 305. Examples of the auxiliary storage device 302 include a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory. Examples of the data transmission/reception unit 305 include communication devices such as a LAN (Local Area Network) card, a modem, a router, and a hub.

［クライアント装置１００の機能構成］
本形態のクライアント装置１００は、例えば、図１（ｂ）に例示したＭＰＵ１０２や公知のコンピュータのＣＰＵに所定のプログラムが読み込まれ、実行されることにより構成されるものである。
図２は、このクライアント装置１００の機能構成を例示したブロック図である。
この図に例示するように、本形態のクライアント装置１００は、ローカル音声素片データベース格納部１１１、ローカル音声素片インデックス格納部１１２、最適音声素片インデックス格納部１１３、一時記憶部１１４、テキストデータ入力部１２１、テキスト解析部１２２、韻律パラメータ取得部１２３、ローカル音声素片探索部１２４、ローカル音声素片データ読み出し部１２５、音声素片接続部１２６、音声出力部１２７、最適音声素片探索部１３１、要求音声素片決定部１３２、音声素片情報送信部１３３、音声素片データ受信部１３４、音声素片インデックス追加部１３６、音声素片データベース追加部１３７及び制御部１４０を有しており、音声素片情報送信部１３３及び音声素片データ受信部１３４を通じてネットワーク２００に接続可能に構成されている。 [Functional Configuration of Client Device 100]
The client device 100 according to the present embodiment is configured by, for example, a predetermined program being read and executed by the MPU 102 illustrated in FIG. 1B or a CPU of a known computer.
FIG. 2 is a block diagram illustrating a functional configuration of the client apparatus 100.
As illustrated in this figure, the client device 100 of this embodiment includes a local speech unit database storage unit 111, a local speech unit index storage unit 112, an optimal speech unit index storage unit 113, a temporary storage unit 114, a text data input unit 121, a text analysis unit 122, a prosody parameter acquisition unit 123, a local speech unit search unit 124, a local speech unit data reading unit 125, a speech unit connection unit 126, a speech output unit 127, an optimal speech unit search unit 131, a requested speech unit determination unit 132, a speech unit information transmission unit 133, a speech unit data reception unit 134, a speech unit index addition unit 136, a speech unit database addition unit 137, and a control unit 140, and is configured to be connectable to the network 200 through the speech unit information transmission unit 133 and the speech unit data reception unit 134.

なお、ローカル音声素片データベース格納部１１１、ローカル音声素片インデックス格納部１１２及び最適音声素片インデックス格納部１１３は、例えば、図１（ｂ）の蓄積メモリ１０３やワークメモリ１０１等によって構成される。また、一時記憶部１１４は、例えば、図１（ｂ）のワークメモリ１０１や書き換え可能メモリ１０５等によって構成される。また、テキストデータ入力部１２１は、例えば、図１（ｂ）の入力部１０６等によって構成され、音声出力部１２７は、例えば、出力部１０７等によって構成される。さらに、テキスト解析部１２２、韻律パラメータ取得部１２３、ローカル音声素片探索部１２４、ローカル音声素片データ読み出し部１２５、音声素片接続部１２６、音声出力部１２７、最適音声素片探索部１３１、要求音声素片決定部１３２、音声素片インデックス追加部１３６、音声素片データベース追加部１３７及び制御部１４０は、例えば、図１（ｂ）のＭＰＵ１０２にワークメモリ１０１からプログラムが読み込まれ、さらにＭＰＵ１０２がこのプログラムを実行することにより構成されるものである。また、音声素片情報送信部１３３及び音声素片データ受信部１３４は、例えば、図１（ｂ）のデータ送受信部１０４等によって構成される。また、クライアント装置１００は、制御部１４０の制御のもと各処理を実行する。 The local speech unit database storage unit 111, the local speech unit index storage unit 112, and the optimal speech unit index storage unit 113 are implemented by, for example, the storage memory 103 or the work memory 101 in FIG. 1B. The temporary storage unit 114 is implemented by, for example, the work memory 101 or the rewritable memory 105 in FIG. 1B. The text data input unit 121 is implemented by, for example, the input unit 106 in FIG. 1B, and the speech output unit 127 is implemented by, for example, the output unit 107. The text analysis unit 122, the prosody parameter acquisition unit 123, the local speech unit search unit 124, the local speech unit data reading unit 125, the speech unit connection unit 126, the speech output unit 127, the optimal speech unit search unit 131, the requested speech unit determination unit 132, the speech unit index addition unit 136, the speech unit database addition unit 137, and the control unit 140 are implemented by, for example, a program being read from the work memory 101 into the MPU 102 in FIG. 1B and executed by the MPU 102. The speech unit information transmission unit 133 and the speech unit data reception unit 134 are implemented by, for example, the data transmission/reception unit 104 in FIG. 1B. The client device 100 executes each process under the control of the control unit 140.

［音声素片データベースサーバ装置３００の機能構成］
本形態の音声素片データベースサーバ装置３００は、例えば、図１（ｃ）に例示したＣＰＵ３０１に所定のプログラムが読み込まれ、実行されることにより構成されるものである。
図３は、本形態における音声素片データベースサーバ装置３００の機能構成を例示したブロック図である。
この図に例示するように、本形態の音声素片データベースサーバ装置３００は、一時記憶部３１１、最適音声素片データベース格納部３１２、音声素片情報受信部３２１、最適音声素片データ読み出し部３２３、音声素片データ送信部３２２及び制御部３３０を有し、音声素片情報受信部３２１及び音声素片データ送信部３２２を通じ、ネットワーク２００に接続可能に構成されている。 [Functional Configuration of Speech Segment Database Server 300]
The speech unit database server device 300 of this embodiment is configured by, for example, a predetermined program being read into and executed by the CPU 301 illustrated in FIG. 1C.
FIG. 3 is a block diagram illustrating the functional configuration of the speech unit database server device 300 of this embodiment.
As illustrated in this figure, the speech unit database server device 300 of this embodiment has a temporary storage unit 311, an optimal speech unit database storage unit 312, a speech unit information receiving unit 321, an optimal speech unit data reading unit 323, a speech unit data transmission unit 322, and a control unit 330, and is configured to be connectable to the network 200 through the speech unit information receiving unit 321 and the speech unit data transmission unit 322.

なお、一時記憶部３１１は、例えば、図１（ｃ）のレジスタ３０１ａやＲＡＭ３０４等によって構成される。また、最適音声素片データベース格納部３１２は、例えば、図１（ｃ）の補助記憶装置３０２やＲＡＭ３０４等によって構成される。さらに、最適音声素片データ読み出し部３２３及び制御部３３０は、例えば、図１（ｃ）のＣＰＵ３０１に所定のプログラムが読み込まれ、さらにＣＰＵ３０１がこのプログラムを実行することにより構成されるものである。また、音声素片情報受信部３２１及び音声素片データ送信部３２２は、例えば、図１（ｃ）のデータ送受信部３０５等によって構成されるものである。また、音声素片データベースサーバ装置３００は、制御部３３０の制御のもと各処理を実行する。 The temporary storage unit 311 is implemented by, for example, the register 301a or the RAM 304 in FIG. 1C. The optimal speech unit database storage unit 312 is implemented by, for example, the auxiliary storage device 302 or the RAM 304 in FIG. 1C. The optimal speech unit data reading unit 323 and the control unit 330 are implemented by, for example, a predetermined program being read into the CPU 301 in FIG. 1C and executed by the CPU 301. The speech unit information receiving unit 321 and the speech unit data transmission unit 322 are implemented by, for example, the data transmission/reception unit 305 in FIG. 1C. The speech unit database server device 300 executes each process under the control of the control unit 330.

［ローカル音声素片データベース格納部１１１及び最適音声素片データベース格納部３１２のデータ構成］
ローカル音声素片データベース格納部１１１（図２）には、ローカル音声素片データが格納され、最適音声素片データベース格納部３１２（図３）には、最適音声素片データが格納される。ここで、ローカル音声素片データベース格納部１１１に格納されるローカル音声素片データと、最適音声素片データベース格納部３１２に格納される最適音声素片データとの概念的な構成は同一であるが、格納される内容が異なっている。以下、これについて説明する。 [Data structure of local speech unit database storage unit 111 and optimum speech unit database storage unit 312]
The local speech unit database storage unit 111 (FIG. 2) stores local speech unit data, and the optimal speech unit database storage unit 312 (FIG. 3) stores optimal speech unit data. Here, the conceptual configuration of the local speech unit data stored in the local speech unit database storage unit 111 and the optimal speech unit data stored in the optimal speech unit database storage unit 312 are the same. The stored contents are different. This will be described below.

図４は、ローカル音声素片データベース格納部１１１に格納されるローカル音声素片データの構成を説明するための概念図である。
この図に例示するように、この例のローカル音声素片データベース格納部１１１には、時間情報に対応付けられた複数の音声素片データが格納され、それらによって１つのファイルを構成している。そして、各ファイルにはファイル番号が対応付けられ、ファイル番号と時間とを指定することにより各音声素片データを特定できる構成となっている。例えば、ファイル番号８のファイルには、音韻系列「Ａ」「Ｒ」「Ｅ」「Ｓ」「Ｕ」「Ｒ」「Ａ」・・・に対応する複数の音声素片データが、ファイル番号２３のファイルには、音韻系列「Ｄ」「Ａ」「Ｒ」「Ａ」・・・に対応する複数の音声素片データが、それぞれ時間情報に対応付けられて格納されている。そして、例えば、ファイル番号８、始点位置１０、時間長１１０と指定することにより、これらに対応する音韻系列「Ａ」の音声素片データを特定できる構成となっている。 FIG. 4 is a conceptual diagram for explaining the configuration of local speech unit data stored in the local speech unit database storage unit 111.
As illustrated in this figure, the local speech unit database storage unit 111 in this example stores a plurality of speech unit data items associated with time information, which together constitute one file. Each file is associated with a file number, so that each speech unit data item can be identified by specifying a file number and a time. For example, the file with file number 8 stores a plurality of speech unit data items corresponding to the phoneme sequence “A”, “R”, “E”, “S”, “U”, “R”, “A”, ..., and the file with file number 23 stores a plurality of speech unit data items corresponding to the phoneme sequence “D”, “A”, “R”, “A”, ..., each associated with time information. Thus, for example, specifying file number 8, start position 10, and duration 110 identifies the speech unit data of the phoneme “A” corresponding to those values.
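
The addressing scheme above (file number, start position, duration) amounts to slicing a span out of a stored file. The sketch below illustrates this with hypothetical data; the function name `read_unit`, the in-memory `files` dictionary, and the interpretation of the start position and duration as sample offsets are assumptions, since the text does not state the units.

```python
def read_unit(files, file_number, start, duration):
    """Return the samples in [start, start + duration) of the given file."""
    samples = files[file_number]
    end = start + duration
    if start < 0 or end > len(samples):
        raise ValueError("requested span lies outside the file")
    return samples[start:end]

# Hypothetical store: file number -> sequence of samples.
files = {8: list(range(1000)), 23: list(range(500))}
unit = read_unit(files, file_number=8, start=10, duration=110)
```

With this layout an index entry only needs to carry three integers to locate a unit, which keeps the index small compared with the waveform data itself.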

また、最適音声素片データベース格納部３１２に格納される最適音声素片データも構造的にはローカル音声素片データベース格納部１１１に格納されるローカル音声素片データと同様である。ローカル音声素片データベース格納部１１１に格納されるローカル音声素片データと、最適音声素片データベース格納部３１２に格納される最適音声素片データとの相違は、そのデータの種類や量である。即ち、最適音声素片データベース格納部３１２には、大量の音声素片データが格納されるのに対し、初期段階のローカル音声素片データベース格納部１１１には、最低限の音声素片データのみが格納される。例えば、最適音声素片データベース格納部３１２には、任意のテキストに対して高品質な合成音声を生成することが可能な非常に多くの音声素片データを格納する。これに対し、初期段階のローカル音声素片データベース格納部１１１には、例えば、任意のテキストに対応する合成音声を生成可能な最低限の音声素片データのみを格納する。なお、合成音声を生成可能な最低限の音声素片データとしては、例えば、日本語の全ての音素、全ての音節、全ての三つ組み音素等に対応する音声素片データを例示できる。しかし、実際にどのような音声素片データを初期段階のローカル音声素片データベース格納部１１１に格納するかはクライアント装置１００の構成や初期時点で配布可能なデータ量や最低限求められる合成音声の品質等に対応して決めればよい。例えば、最低限の音声素片データに加え、合成音声の品質を部分的に向上させることが可能な音声素片データを初期段階のローカル音声素片データベース格納部１１１に格納することとしてもよい。 The optimum speech unit data stored in the optimum speech unit database storage unit 312 is structurally similar to the local speech unit data stored in the local speech unit database storage unit 111. The difference between the local speech unit data stored in the local speech unit database storage unit 111 and the optimal speech unit data stored in the optimal speech unit database storage unit 312 is the type and amount of the data. That is, a large amount of speech unit data is stored in the optimum speech unit database storage unit 312, whereas only the minimum speech unit data is stored in the local speech unit database storage unit 111 at the initial stage. Stored. For example, the optimum speech segment database storage unit 312 stores a large amount of speech segment data that can generate high-quality synthesized speech for an arbitrary text. On the other hand, the local speech unit database storage unit 111 at the initial stage stores, for example, only minimum speech unit data that can generate synthesized speech corresponding to an arbitrary text. Note that examples of the minimum speech unit data that can generate a synthesized speech include speech unit data corresponding to all Japanese phonemes, all syllables, all triplet phonemes, and the like. 
However, which speech unit data is actually stored in the local speech unit database storage unit 111 at the initial stage may be decided according to the configuration of the client device 100, the amount of data that can be distributed initially, the minimum quality required of the synthesized speech, and so on. For example, in addition to the minimum speech unit data, speech unit data that can partially improve the quality of the synthesized speech may be stored in the local speech unit database storage unit 111 at the initial stage.

［ローカル音声素片インデックス格納部１１２及び最適音声素片インデックス格納部１１３のデータ構成］
この例のローカル音声素片インデックス格納部１１２（図２）には、ローカル音声素片データを指定するローカル音声素片格納情報と当該ローカル音声素片データに対応する読み情報及び韻律パラメータとが関連付けられたローカル音声素片系列情報が格納される。また、最適音声素片インデックス格納部１１３には、最適音声素片データを指定する最適音声素片格納情報と当該最適音声素片データに対応する読み情報及び韻律パラメータとが関連付けられた最適音声素片系列情報が格納される。 [Data structure of local speech unit index storage unit 112 and optimum speech unit index storage unit 113]
In this example, the local speech unit index storage unit 112 (FIG. 2) stores local speech unit sequence information, in which local speech unit storage information designating local speech unit data is associated with the reading information and prosodic parameters corresponding to that local speech unit data. Likewise, the optimal speech unit index storage unit 113 stores optimal speech unit sequence information, in which optimal speech unit storage information designating optimal speech unit data is associated with the reading information and prosodic parameters corresponding to that optimal speech unit data.

図５（ａ）は、図２のローカル音声素片インデックス格納部１１２に格納されるローカル音声素片インデックス１１２ａのデータ構成を例示した概念図であり、図５（ｂ）は、図２の最適音声素片インデックス格納部１１３に格納される最適音声素片インデックス１１３ａの構成を例示した概念図である。
図５（ａ）に例示するように、この例のローカル音声素片インデックス１１２ａは、ローカル音声素片データベース格納部１１１に格納される複数のローカル音声素片データに対応する複数のローカル音声素片系列情報１１２ａａ〜１１２ａｆを有している。ここで、各ローカル音声素片系列情報１１２ａａ〜１１２ａｆは、対応するローカル音声素片データの「音韻列」「前音韻環境」「後音韻環境」「平均Ｆ０（基準周波数）」「Ｆ０の傾斜」「パワー」及び「ローカル音声素片格納情報」が関連付けられた情報である。ここで、この例の「ローカル音声素片格納情報」は、対応するローカル音声素片データの格納位置を特定する「ファイル番号」「時間長」「始点位置」からなる情報である。なお、これらの「音韻列」「前音韻環境」「後音韻環境」が「読み情報」に相当し、「平均Ｆ０（基準周波数）」「Ｆ０の傾斜」「パワー」「時間長」が「韻律パラメータ」に相当する。また、「前音韻環境」とは、対応する「音韻列」に対し時系列的に前の音韻を示す情報であり、「後音韻環境」とは、対応する「音韻列」に対し時系列的に後の音韻を示す情報である。また「＃」はポーズ（無音状態）を示している。 FIG. 5A is a conceptual diagram illustrating the data structure of the local speech unit index 112a stored in the local speech unit index storage unit 112 in FIG. 2, and FIG. 5B is a conceptual diagram illustrating the structure of the optimal speech unit index 113a stored in the optimal speech unit index storage unit 113 in FIG. 2.
As illustrated in FIG. 5A, the local speech unit index 112 a in this example includes a plurality of local speech units corresponding to a plurality of local speech unit data stored in the local speech unit database storage unit 111. It has series information 112aa to 112af. Here, each local speech unit sequence information 112aa to 112af includes “phoneme sequence” “pre-phoneme environment” “post-phoneme environment” “average F0 (reference frequency)” “slope of F0” of the corresponding local speech unit data. This is information associated with “power” and “local speech unit storage information”. Here, the “local speech unit storage information” in this example is information including “file number”, “time length”, and “starting point location” that specify the storage location of the corresponding local speech unit data. The “phoneme sequence”, “pre-phoneme environment”, and “post-phoneme environment” correspond to “reading information”, and “average F0 (reference frequency)”, “slope of F0”, “power”, and “time length” are “prosodic”. It corresponds to “parameter”. The “pre-phoneme environment” is information indicating the previous phoneme in time series with respect to the corresponding “phoneme sequence”, and the “post-phoneme environment” is time-series with respect to the corresponding “phoneme sequence”. Is information indicating a later phoneme. “#” Indicates a pause (silent state).
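
The fields listed above can be pictured as one record per index entry, and a search over the index as a filter on those fields. The following sketch is only an illustration: the class name `UnitSequenceInfo`, the field names, the helper `find_units`, and all entry values are hypothetical (they are not the values shown in FIG. 5A), and matching is reduced to exact equality on the reading fields plus a tolerance on average F0.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UnitSequenceInfo:
    phonemes: str        # phoneme sequence
    prev_phoneme: str    # preceding phoneme environment ("#" = pause)
    next_phoneme: str    # following phoneme environment
    avg_f0: float        # average F0 in Hz
    f0_slope: float      # slope of F0
    power: float
    file_number: int     # storage information: which file
    duration: int        # storage information and prosodic parameter
    start: int           # storage information: start position

def find_units(index, phonemes, prev_phoneme, f0, tolerance):
    """Return entries matching the reading whose average F0 is within tolerance."""
    return [e for e in index
            if e.phonemes == phonemes
            and e.prev_phoneme == prev_phoneme
            and abs(e.avg_f0 - f0) <= tolerance]

# Made-up entries for illustration only.
index = [
    UnitSequenceInfo("a", "#", "r", 195.0, 0.1, 60.0, 8, 110, 10),
    UnitSequenceInfo("a", "#", "k", 208.0, -0.2, 58.0, 12, 95, 400),
    UnitSequenceInfo("a", "d", "r", 201.0, 0.0, 61.0, 23, 100, 120),
    UnitSequenceInfo("a", "#", "s", 230.0, 0.3, 63.0, 31, 105, 50),
]
hits = find_units(index, "a", "#", f0=200.0, tolerance=10.0)
```

Here the duration appears once in the record even though it serves both as storage information and as a prosodic parameter, which follows the description above.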

また、図５（ｂ）に例示するように、この例の最適音声素片インデックス１１３ａは、最適音声素片データベース格納部３１２に格納される複数の最適音声素片データに対応する複数の最適音声素片系列情報１１３ａａ〜１１３ａｈを有している。ここで、各最適音声素片系列情報１１３ａａ〜１１３ａｈは、対応する最適音声素片データの「音韻列」「前音韻環境」「後音韻環境」「平均Ｆ０（基準周波数）」「Ｆ０の傾斜」「パワー」及び「最適音声素片格納情報」が関連付けられた情報である。ここで、この例の「最適音声素片格納情報」は、対応する最適音声素片データの格納場所を特定する「ファイル番号」「時間長」「始点位置」からなる情報である。 Further, as illustrated in FIG. 5B, the optimum speech unit index 113 a in this example includes a plurality of optimum speech corresponding to a plurality of optimum speech unit data stored in the optimum speech unit database storage unit 312. It has unit sequence information 113aa-113ah. Here, each optimum speech unit sequence information 113aa to 113ah includes “phoneme sequence”, “pre-phoneme environment”, “rear phoneme environment”, “average F0 (reference frequency)”, “slope of F0” of the corresponding optimum speech unit data. This is information associated with “power” and “optimum speech unit storage information”. Here, the “optimum speech unit storage information” in this example is information including “file number”, “time length”, and “starting point position” that specifies the storage location of the corresponding optimal speech unit data.

また、前述のようにローカル音声素片データベース格納部１１１に格納されるローカル音声素片データの数は、最適音声素片データベース格納部３１２に格納される最適音声素片データの数よりも少ないため、当然ローカル音声素片インデックス１１２ａが有するローカル音声素片系列情報の数も、最適音声素片インデックス１１３ａが有する最適音声素片系列情報の数よりも少ない。なお、音声素片インデックスの構成は、例えば、特許３５１５４０６号公報「音声合成方法及び装置」などで開示されている。
＜クライアント装置１００の処理＞
次に、クライアント装置１００の処理について説明する。 Further, as described above, the number of local speech unit data stored in the local speech unit database storage unit 111 is smaller than the number of optimal speech unit data stored in the optimal speech unit database storage unit 312. Of course, the number of local speech unit sequence information included in the local speech unit index 112a is also smaller than the number of optimal speech unit sequence information included in the optimal speech unit index 113a. The structure of the speech unit index is disclosed in, for example, Japanese Patent No. 3515406 “Speech Synthesis Method and Device”.
<Processing of Client Device 100>
Next, processing of the client device 100 will be described.

図６（ａ）は、本形態のクライアント装置１００における音声合成処理を説明するための流れ図である。以下、この図に従って、本形態の音声合成処理の詳細を説明する。
テキストデータ入力部１２１は、音声化すべきテキストデータの入力を受け付け、入力されたテキストデータは一時記憶部１１４に格納される。これをトリガにテキスト解析部１２２は、一時記憶部１１４からテキストデータを読み込む。そして、このテキスト解析部１２２は、読み込んだテキストデータに対してテキスト解析処理を行って読み情報及び韻律情報を生成し、当該読み情報及び韻律情報を一時記憶部１１４に出力し、そこに格納させる（ステップＳ１０）。なお、ここでいうテキスト解析処理は、主に形態素解析処理と読み・アクセント付与処理からなる。これらの処理方法については従来から様々な方法が存在し、例えば、特許３３７９６４３号公報「形態素解析方法および形態素解析プログラムを記録した記録媒体」で開示された方法や、特許３５１８３４０号公報「読み韻律情報設定方法及び装置及び読み韻律情報設定プログラムを格納した記憶媒体」で開示された方法を用いることができる。 FIG. 6A is a flowchart for explaining speech synthesis processing in the client apparatus 100 of the present embodiment. Hereinafter, the details of the speech synthesis processing of this embodiment will be described with reference to FIG.
The text data input unit 121 accepts input of text data to be converted to speech, and the input text data is stored in the temporary storage unit 114. Triggered by this, the text analysis unit 122 reads the text data from the temporary storage unit 114, performs text analysis on it to generate reading information and prosodic information, and outputs the reading information and prosodic information to the temporary storage unit 114, where they are stored (step S10). The text analysis here consists mainly of morphological analysis and reading/accent assignment. Various methods for these have long existed; for example, the method disclosed in Japanese Patent No. 3379643, “Morphological analysis method and recording medium on which a morphological analysis program is recorded”, or the method disclosed in Japanese Patent No. 3518340, “Reading and prosody information setting method and device, and storage medium storing a reading and prosody information setting program”, can be used.

Next, the prosodic parameter acquisition unit 123 reads the prosodic information from the temporary storage unit 114, uses it to calculate the physical prosodic parameters needed for speech synthesis, and outputs the prosodic parameters to the temporary storage unit 114 for storage (step S11). The prosodic parameters include pitch (fundamental frequency F0) and time length (phoneme duration), and methods for obtaining them also already exist. For example, the pitch (fundamental frequency F0) can be obtained by the methods disclosed in Japanese Patent No. 3240691, "Pitch Pattern Generation Method, Apparatus and Program Recording Medium", and Japanese Patent No. 3344487, "Voice Fundamental Frequency Pattern Generation Device". The time length can be obtained by the methods disclosed in, for example, Kaiki et al., "Control of vowel duration using linguistic information", IEICE Transactions, vol. 75, no. 3, pp. 467-473, 1992, and M. D. Riley, "Tree-based modeling for speech synthesis", in G. Bailly, C. Benoit, and T. R. Sawallis, editors, Talking Machines: Theories, Models, and Designs, pages 265-273, Elsevier, 1992. The processing of step S11 is executed for all the prosodic information corresponding to the text data.

Next, the local speech unit search unit 124 reads the reading information and the prosodic parameters from the temporary storage unit 114 and searches the local speech unit index storage unit 112 using them as keys. In this example, the local speech unit search unit 124 extracts, from the local speech unit index 112a (FIG. 5(a)) stored in the local speech unit index storage unit 112, the local speech unit sequence information whose reading information and prosodic parameters fall within the similarity range of the reading information and prosodic parameters that were read, and outputs the extracted local speech unit sequence information to the temporary storage unit 114 for storage (step S12). In other words, the local speech unit search unit 124 determines, from the local speech unit index 112a, the local speech unit sequence information best suited to the reading information and prosodic parameters obtained by text analysis. A method for determining local speech unit sequence information is disclosed in, for example, Japanese Patent No. 3515406, "Speech Synthesis Method and Device". The similarity range here is a concept that covers, for example, entries whose reading information and prosodic parameters match completely, entries that match partially, and entries with a high similarity as determined by a cost. For example, when the phoneme "a" and the preceding phoneme environment "#" are given as the reading information and the condition average F0 = 200 ± 10 Hz is given as the prosodic parameter, the three pieces of local speech unit sequence information 112aa, 112ab, and 112ac in FIG. 5(a) match.
One way to specify the similarity range by cost is, for example, to calculate a total cost value from the reading information and prosodic parameters obtained by text analysis and the reading information and prosodic parameters of each piece of local speech unit sequence information in the local speech unit index, and to treat the local speech unit sequence information that minimizes this total cost as optimal. The total cost Pnew can be calculated using sub-cost functions, for example as follows (see, for example, "A waveform selection method considering spectral continuity in waveform-editing synthesis", Proceedings of the Acoustical Society of Japan, 2-6-10, pp. 239-240, September 1990).
[Calculation of total cost Pnew using sub-cost functions]
First, the local speech unit search unit 124 calculates the following sub-cost functions using the reading information and prosodic parameters from the temporary storage unit 114 and the local speech unit index 112a stored in the local speech unit index storage unit 112, and stores each sub-cost function value in the temporary storage unit 114.

(1) Sub-cost function for the reading information
$C_1(n) = 1 / e^{n}$
Here, $n$ is the number of phonemes that match between the phoneme sequence formed by the reading information obtained by text analysis of the text data and the phoneme sequence of the local speech unit sequence information.
(2) Sub-cost function for the average pitch
$C_2(V_p, V_s) = |V_p - V_s|^2$
Here, $V_p$ is the average pitch obtained by text analysis of the text data, and $V_s$ is the average pitch (average F0) of the local speech unit sequence information.
(3) Sub-cost function for the pitch slope
$C_3(F_p, F_s) = |F_p - F_s|^2$
Here, $F_p$ is the pitch slope obtained by text analysis of the text data, and $F_s$ is the pitch slope (F0 gradient) of the local speech unit sequence information.
(4) Sub-cost function for the time length
$C_4(T_p, T_s) = |T_p - T_s|^2$
Here, $T_p$ is the time length obtained by text analysis of the text data, and $T_s$ is the time length of the local speech unit sequence information.
(5) Sub-cost function for the amplitude
$C_5(A_p, A_s) = |A_p - A_s|^2$
Here, $A_p$ is the amplitude obtained by text analysis of the text data, and $A_s$ is the amplitude (power) of the local speech unit sequence information.

Then, the local speech unit search unit 124 reads each sub-cost function value from the temporary storage unit 114, calculates the total cost Pnew as follows, and stores the calculated total cost Pnew in the temporary storage unit 114.
(6) Calculate $\Omega = \omega_2 C_2(V_p, V_s) + \omega_3 C_3(F_p, F_s) + \omega_4 C_4(T_p, T_s) + \omega_5 C_5(A_p, A_s)$.
(7) Calculate $P = \omega_1 C_1(n) + (1 - \omega_1)\,\Omega$.
(8) Calculate $P_{\mathrm{new}} = (1 + G)\,P$.
Note that $\omega_1, \omega_2, \omega_3, \omega_4, \omega_5$ are constants indicating the sub-cost weights for the respective sub-cost functions and are set in the program in advance. $G$ is an acoustic constant, which is also set in the program in advance.
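The calculation in (1) to (8) can be sketched in Python as follows. This is only an illustrative sketch: the class name `Prosody`, the function name `total_cost`, and the default values of the weights and of G are assumptions made here, since the description states only that these constants are preset in the program.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    avg_pitch: float    # V: average pitch (average F0)
    pitch_slope: float  # F: pitch slope (F0 gradient)
    duration: float     # T: time length
    amplitude: float    # A: amplitude (power)


def total_cost(n_match, target, candidate,
               weights=(0.5, 0.25, 0.25, 0.25, 0.25), g=0.0):
    """Return Pnew for one candidate, following steps (1) to (8).

    weights = (w1, ..., w5) and g are placeholder values for this sketch;
    the description only says they are constants preset in the program.
    """
    w1, w2, w3, w4, w5 = weights
    c1 = 1.0 / math.exp(n_match)                                # (1)
    c2 = abs(target.avg_pitch - candidate.avg_pitch) ** 2       # (2)
    c3 = abs(target.pitch_slope - candidate.pitch_slope) ** 2   # (3)
    c4 = abs(target.duration - candidate.duration) ** 2         # (4)
    c5 = abs(target.amplitude - candidate.amplitude) ** 2       # (5)
    omega = w2 * c2 + w3 * c3 + w4 * c4 + w5 * c5               # (6)
    p = w1 * c1 + (1.0 - w1) * omega                            # (7)
    return (1.0 + g) * p                                        # (8)
```

Because C_1 decays exponentially with the number of matching phonemes, a candidate that matches more of the target phoneme sequence receives a lower cost when its prosody is otherwise equal. The prosodic sub-costs carry different units, so in practice the weights would presumably also have to absorb those scale differences.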

The total cost Pnew is calculated as above for each piece of local speech unit sequence information, and each calculated total cost Pnew is stored in the temporary storage unit 114 in association with the corresponding local speech unit sequence information. The local speech unit search unit 124 then applies a general DP (dynamic programming) method to the total costs Pnew stored in the temporary storage unit 114, finds the minimum total cost Pnew, and selects the local speech unit sequence information associated with it as the optimal one (that is, as the local speech unit sequence information whose reading information and prosodic parameters fall within the similarity range of those obtained by text analysis). This concludes the description of [Calculation of total cost Pnew using sub-cost functions]. The processing of step S12 is executed for every pair of reading information and prosodic parameters corresponding to the text data.
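As a simplified sketch of this selection step, assume that each target position has its own list of candidates and that only the per-candidate total cost Pnew is considered. Under that assumption (no cost between adjacent units, which the description does not specify either way), minimizing the summed cost reduces to taking the minimum at each position, so no full DP search is needed. The function name `select_best` and the cost values in the usage below are hypothetical.

```python
def select_best(candidates_per_target):
    """Pick, for each target position, the candidate with the smallest Pnew.

    candidates_per_target: one list per target position, each containing
    (entry_id, pnew) pairs. Returns the chosen entry_id per position.
    """
    return [min(cands, key=lambda c: c[1])[0] for cands in candidates_per_target]
```

For example, `select_best([[("112aa", 3.0), ("112ab", 1.2)], [("112ad", 0.4)]])` chooses 112ab for the first position and 112ad for the second.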

Next, the local speech unit data reading unit 125 sequentially reads, from the local speech unit database storage unit 111, the local speech unit data corresponding to each piece of local speech unit sequence information extracted in step S12 (step S13). That is, the local speech unit data reading unit 125 sequentially reads from the temporary storage unit 114 the local speech unit storage information of the local speech unit sequence information extracted in step S12, and sequentially reads from the local speech unit database storage unit 111 the local speech unit data designated by that storage information. For example, when local speech unit sequence information 112ab, 112ad, and 112af (FIG. 5(a)) is extracted in step S12, the local speech unit data reading unit 125 sequentially reads from the local speech unit database storage unit 111 the local speech unit data indicated by the storage information "file number 8, start point 10 msec, time length 110 msec" of 112ab, "file number 23, start point 5225 msec, time length 15 msec" of 112ad, and "file number 23, start point 5240 msec, time length 95 msec" of 112af. Each piece of local speech unit data read in this way is stored in the temporary storage unit 114.
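The read-out in step S13 can be pictured as slicing a stored waveform by its storage information. The sketch below assumes that the database maps a file number to a list of samples and that the sampling rate is 16 kHz. Both are assumptions for illustration, as are the names `StorageInfo` and `read_unit`; the description does not specify the storage format.

```python
from dataclasses import dataclass

SAMPLE_RATE = 16000  # samples per second; assumed, not given in the description


@dataclass(frozen=True)
class StorageInfo:
    file_number: int
    start_ms: int
    length_ms: int


def read_unit(database, info, sample_rate=SAMPLE_RATE):
    """Return the samples designated by one piece of storage information.

    database: dict mapping file number -> list of samples (assumed layout).
    """
    samples = database[info.file_number]
    start = info.start_ms * sample_rate // 1000
    end = (info.start_ms + info.length_ms) * sample_rate // 1000
    return samples[start:end]
```

With this layout, the storage information "file number 8, start point 10 msec, time length 110 msec" of 112ab would be read as `read_unit(database, StorageInfo(8, 10, 110))`.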

Next, the speech unit connection unit 126 sequentially reads each piece of local speech unit data from the temporary storage unit 114, generates synthesized speech data from it, and outputs the synthesized speech data to the temporary storage unit 114 for storage (step S14). The synthesized speech data may be generated, for example, by simply connecting the read local speech unit data in temporal order, or by interpolating between different local speech unit data in the time or frequency domain (see, for example, Japanese Patent Application Laid-Open No. 07-072897). Alternatively, predetermined signal processing may be applied to the local speech unit data based on the prosodic parameters stored in the temporary storage unit 114 before the data are connected to generate the synthesized speech data (see, for example, Y. Stylianou, "Applying the Harmonic Plus Noise Model in Concatenative Speech Synthesis", IEEE Transactions on Speech and Audio Processing, vol. 9, no. 1, pp. 21-29, January 2001).
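The two simplest options for step S14 can be sketched as follows. With `overlap=0` the units are simply joined in temporal order. With a positive `overlap` a linear time-domain crossfade is applied at each joint. The crossfade is a stand-in chosen here for illustration and is not the interpolation method of the cited references; the function name `concatenate` is likewise hypothetical.

```python
def concatenate(units, overlap=0):
    """Join speech units in order, optionally crossfading `overlap` samples.

    units: iterable of sample sequences (numbers).
    overlap: number of samples blended linearly at each joint (0 = plain join).
    """
    out = []
    for unit in units:
        unit = list(unit)
        # Never blend more samples than either side actually has.
        k = min(overlap, len(out), len(unit))
        base = len(out) - k
        for i in range(k):
            w = (i + 1) / (k + 1)  # weight of the incoming unit rises across the joint
            out[base + i] = (1 - w) * out[base + i] + w * unit[i]
        out.extend(unit[k:])
    return out
```

Crossfading shortens the output by `overlap` samples per joint, so a real implementation would need to account for that when matching the target time lengths.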

Finally, the speech output unit 127 reads the synthesized speech data generated as described above from the temporary storage unit 114 and outputs it as sound or as data (step S15).
In addition, to generate higher-quality synthesized speech, the client device 100 also performs the processing described below independently of steps S10 to S15 (for example, in parallel with those steps or after them).
FIG. 6(b) is a flowchart for explaining this processing performed independently of steps S10 to S15. The processing is described below with reference to this flowchart.

First, the optimal speech unit search unit 131 reads from the temporary storage unit 114 the reading information output by the text analysis unit 122 and the prosodic parameters output by the prosodic parameter acquisition unit 123. Using these as keys, the optimal speech unit search unit 131 searches the optimal speech unit index storage unit 113 and extracts from the optimal speech unit index the optimal speech unit sequence information whose reading information and prosodic parameters fall within the similarity range of those keys. The extracted optimal speech unit sequence information is output to the temporary storage unit 114 and stored there (step S16). As described above, the structure of the optimal speech unit index is the same as that of the local speech unit index (see FIG. 5), and the optimal speech unit sequence can be determined in the same way as in the local speech unit search unit 124 (see step S12).

Next, the requested speech unit determination unit 132 excludes, from the optimal speech unit sequence information determined by the optimal speech unit search unit 131, the speech unit sequence information contained in the local speech unit sequence information determined by the local speech unit search unit 124, and thereby determines the optimal speech unit data whose transmission should be requested from the speech unit database server device 300 (step S17). That is, the requested speech unit determination unit 132 first reads the local speech unit sequence information and the optimal speech unit sequence information from the temporary storage unit 114. It then generates requested speech unit sequence information by removing from the optimal speech unit sequence information those entries whose reading information and prosodic parameters are shared with the local speech unit sequence information, and outputs the requested speech unit sequence information to the temporary storage unit 114 for storage.

For example, suppose that the local speech unit sequence information determined by the local speech unit search unit 124 is {local speech unit sequence information 112ab, 112ac, 112ad, 112ae} and that the optimal speech unit sequence information determined by the optimal speech unit search unit 131 is {optimal speech unit sequence information 113ab, 113ac, 113ad, 113ag, 113ae} (see FIG. 5). In this case, the requested speech unit sequence information is {optimal speech unit sequence information 113ag}, which is what remains after removing the entries whose reading information and prosodic parameters are common to the local and the optimal speech unit sequence information.

Whether an item of optimal speech unit sequence information and an item of local speech unit sequence information share the same reading information and prosodic parameters can be judged, for example, by assigning the same file number to optimal and local speech unit sequence information that have identical reading information and prosodic parameters, and then checking whether the optimal speech unit storage information and the local speech unit storage information are identical. Alternatively, the judgment may be made by directly comparing the reading information and prosodic parameters of the optimal speech unit sequence information and the local speech unit sequence information.
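Step S17 amounts to a set difference keyed on the storage information. The sketch below follows the first judgment method above (identical storage information means identical reading information and prosodic parameters). The names `Entry` and `requested_units` are assumptions, and the storage values marked as hypothetical in the usage note are not taken from FIG. 5.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    entry_id: str
    storage: Tuple[int, int, int]  # (file number, start msec, length msec)


def requested_units(optimal_entries, local_entries):
    """Return the optimal entries whose storage information is not held locally."""
    local_storage = {entry.storage for entry in local_entries}
    return [e for e in optimal_entries if e.storage not in local_storage]
```

Applied to the example of {112ab, 112ac, 112ad, 112ae} and {113ab, 113ac, 113ad, 113ag, 113ae}, with matching pairs given the same storage information (hypothetical values except for 112ab, 112ad, and 113ag), the function returns only the entry for 113ag. The order of the optimal sequence is preserved, which matters if responses are later matched to requests by order.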

Next, the speech unit information transmission unit 133 sequentially transmits the optimal speech unit storage information of the optimal speech unit sequence information stored in the temporary storage unit 114 to the speech unit database server device 300 through the network 200 (step S18).
In response, the speech unit database server device 300 returns the optimal speech unit data corresponding to this optimal speech unit storage information to the client device 100 (details are described later).
In the client device 100, the speech unit data receiving unit 134 waits for the optimal speech unit data to arrive sequentially from the speech unit database server device 300 and receives it (step S19). The received optimal speech unit data is first stored in the temporary storage unit 114.

The received optimal speech unit data is then added to the local speech unit database in the local speech unit database storage unit 111 by the speech unit database addition unit 137 (step S20). Along with this, the speech unit index addition unit 136 rewrites the local speech unit index in the local speech unit index storage unit 112 so that the local speech unit search unit 124 can refer to the received optimal speech unit data (step S21).
Specifically, the speech unit database addition unit 137 first reads from the temporary storage unit 114 the optimal speech unit data received by the speech unit data receiving unit 134 and additionally stores it in the local speech unit database storage unit 111 as new local speech unit data.

For example, when {optimal speech unit sequence information 113ag} has been obtained as the requested speech unit sequence information (see FIG. 5(b)), the speech unit database addition unit 137 stores the new local speech unit data in the local speech unit database storage unit 111 as follows.
First, the speech unit database addition unit 137 reads {optimal speech unit sequence information 113ag} from the temporary storage unit 114 and searches the local speech unit database storage unit 111 for local speech unit data corresponding to the file number "243" contained in the speech unit storage information of {optimal speech unit sequence information 113ag}. If local speech unit data corresponding to file number "243" exists in the local speech unit database storage unit 111, the speech unit database addition unit 137 adds the optimal speech unit data received by the speech unit data receiving unit 134, with a time length of 112 msec, at the 43 msec start position of that local speech unit data, and stores the result in the local speech unit database storage unit 111 as the new local speech unit data corresponding to file number "243". If no local speech unit data corresponding to file number "243" exists in the local speech unit database storage unit 111, the speech unit database addition unit 137 generates local speech unit data in which the received optimal speech unit data is placed with a time length of 112 msec from the 43 msec start position, and stores it in the local speech unit database storage unit 111 in association with file number "243".
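Both cases of step S20 (file number already present, or not yet present) can be handled by one routine that places the received samples at the designated start position. The sketch below assumes a dictionary from file number to a list of samples, a 16 kHz sampling rate, and zero-filling of regions not yet received. None of these details are given in the description, and the name `add_unit` is hypothetical.

```python
def add_unit(database, file_number, start_ms, unit, sample_rate=16000):
    """Place `unit` at `start_ms` inside the waveform stored under `file_number`.

    database: dict mapping file number -> list of samples (assumed layout).
    Regions of the file that have not been received are zero-filled (assumption).
    """
    start = start_ms * sample_rate // 1000
    samples = database.setdefault(file_number, [])  # creates the file if absent
    end = start + len(unit)
    if len(samples) < end:
        samples.extend([0] * (end - len(samples)))
    samples[start:end] = unit
    return samples
```

Keeping the original start position means the storage information of 113ag (file number 243, start point 43 msec, time length 112 msec) remains valid unchanged when it is copied into the local speech unit index.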

The optimal speech unit sequence information generated by the requested speech unit determination unit 132 is associated with the optimal speech unit data received by the speech unit data receiving unit 134 by, for example, storing in the temporary storage unit 114 the order in which the optimal speech unit storage information was transmitted to the speech unit database server device 300 (see step S18) and matching that transmission order against the order in which the optimal speech unit data is received (see step S19). The association is not limited to this method, however. For example, the speech unit database server device 300 may transmit to the client device 100, together with each piece of optimal speech unit data, information (such as a file number) indicating its correspondence with the optimal speech unit sequence information.
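Matching by transmission order can be sketched with a first-in, first-out queue. The sketch assumes that the server answers every request exactly once and in the order received (as would hold, for instance, on a single ordered connection). If that cannot be guaranteed, the second method above, in which the server tags each response, is the safer choice. The class name `PendingRequests` is hypothetical.

```python
from collections import deque


class PendingRequests:
    """Pair each received speech unit with the oldest unanswered request."""

    def __init__(self):
        self._queue = deque()

    def sent(self, storage_info):
        # Record the request in transmission order (step S18).
        self._queue.append(storage_info)

    def received(self, unit_data):
        # The oldest outstanding request belongs to this response (step S19).
        return self._queue.popleft(), unit_data
```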

Along with the storage of the new local speech unit data in the local speech unit database storage unit 111 described above, the speech unit index addition unit 136 additionally stores the local speech unit sequence information corresponding to the new local speech unit data in the local speech unit index storage unit 112. That is, the speech unit index addition unit 136 reads from the temporary storage unit 114 the optimal speech unit sequence information corresponding to the new local speech unit data (the optimal speech unit data) and stores it in the local speech unit index storage unit 112 as new local speech unit sequence information.

The new local speech unit sequence information may simply be appended at the end of the local speech unit index 112a, or it may be appended after the last piece of local speech unit information that corresponds to the same phoneme sequence. Alternatively, as shown for example in FIG. 7(a), the new local speech unit sequence information 112ag may be inserted at a position in the local speech unit index chosen according to the similarity of the phoneme sequence and of the preceding and following phoneme environments, so that searches for local speech unit sequence information remain efficient. The insertion position of the new local speech unit sequence information may further be determined by taking into account the similarity of prosodic parameters such as average F0.
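One way to realize a similarity-aware insertion position is to keep the index sorted by a composite key and insert with a binary search. The sketch below assumes the key (phoneme, preceding environment, following environment, average F0) and an index that is already sorted by that key. The field names, the function names, and the key order are assumptions for illustration and are not taken from FIG. 7(a).

```python
import bisect
from typing import NamedTuple


class IndexEntry(NamedTuple):
    phoneme: str
    prev_env: str
    next_env: str
    avg_f0: float
    entry_id: str


def sort_key(entry):
    # Entries with the same phoneme and environments end up adjacent,
    # ordered by average F0 within each group.
    return (entry.phoneme, entry.prev_env, entry.next_env, entry.avg_f0)


def insert_entry(index, entry):
    """Insert `entry` into the sorted list `index` and return its position."""
    keys = [sort_key(e) for e in index]
    pos = bisect.bisect_right(keys, sort_key(entry))
    index.insert(pos, entry)
    return pos
```

Rebuilding the key list on every insertion costs linear time, which is acceptable for a sketch. A larger index would presumably keep the keys alongside the entries or use a tree structure.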

<Processing of Speech Unit Database Server Device 300>
Next, the processing of the speech unit database server device 300 will be described.
FIG. 7(b) is a flowchart for explaining the transmission processing of optimal speech unit data in the speech unit database server device 300 of this embodiment. The details of this transmission processing are described below with reference to the figure.
The optimal speech unit storage information transmitted from the client device 100 through the network in step S18 described above is sequentially received by the speech unit information receiving unit 321 of the speech unit database server device 300 and stored in the temporary storage unit 311 (step S30).

Next, the optimal speech unit data reading unit 323 reads the optimal speech unit storage information stored in the temporary storage unit 311, reads the optimal speech unit data designated by that storage information from the optimal speech unit database storage unit 312, and stores it in the temporary storage unit 311 (step S31). Reading the optimal speech unit data designated by the optimal speech unit storage information from the optimal speech unit database storage unit 312 can be performed in the same way as the reading of local speech unit data by the local speech unit data reading unit 125 described above (step S13).

Finally, the optimal speech unit data stored in the temporary storage unit 311 is sequentially input to the speech unit data transmission unit 322, which sequentially transmits it to the client device 100 through the network (step S32).
<Features of this embodiment>
As described above, in this embodiment the speech unit data best suited to the text data input at the client device 100 is transmitted from the speech unit database server device 300 to the client device 100 through the network 200 and accumulated in the local speech unit database storage unit 111, and the local speech unit index 112a is rewritten so that this speech unit data can be searched. As a result, the next time the client device 100 generates synthesized speech for similar text data, it can use the speech unit data accumulated in the local speech unit database storage unit 111 this time, which makes it possible to generate high-quality synthesized speech.

Also, in this embodiment, the processing from the search for optimal speech unit sequence information (step S16) to the update of the local speech unit index (step S21) is executed in parallel with, or after completion of, the processing from text analysis (step S10) to speech output (step S15). Consequently, the processing time from text analysis (step S10) to speech output (step S15) is determined almost entirely by the hardware configuration of the client device 100 and depends very little on the time needed for data transmission and reception over the network 200, for adding data to the database, and for similar processing. As a result, the client device 100 can always generate synthesized speech quickly, regardless of the line speed, quality, or congestion of the network 200.
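The independence of the two processing paths can be illustrated with a background thread: the synthesis path (steps S10 to S15) returns as soon as the local result is ready, while the update path (steps S16 to S21) completes on its own schedule. The function name `speak` and the two callables passed to it are hypothetical stand-ins for the units described above, and a thread is only one possible way to realize the parallel execution.

```python
import threading


def speak(text, synthesize_locally, update_local_db):
    """Run the update path in the background and return local audio at once.

    synthesize_locally: callable for steps S10 to S15 (local data only).
    update_local_db: callable for steps S16 to S21 (network and database).
    """
    worker = threading.Thread(target=update_local_db, args=(text,), daemon=True)
    worker.start()
    audio = synthesize_locally(text)  # does not wait for the network
    return audio, worker
```

The caller can ignore the returned thread or join it before shutting down. If the update path writes to structures that the synthesis path reads, some form of locking would also be needed, which this sketch omits.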

Furthermore, in this embodiment, each time the client device 100 generates synthesized speech from text data, the local speech unit data in the client device 100 is updated to suit that text data. Repeated use of the client device 100 therefore accumulates in it local speech unit data suited to the texts for which its user generates synthesized speech, so the more the device is used, the better suited to that user the synthesized speech becomes. In addition, as local speech unit data suited to the user accumulates in the client device 100 over long-term use, the number of requests from the client device 100 to the speech unit database server device 300 for optimal speech unit data decreases, and the processing load on the network 200 and on the speech unit database server device 300 decreases as well.

[Second Embodiment]
Next, a second embodiment of the present invention will be described.
This embodiment is a modification of the first embodiment in which the client device does not have the optimal speech unit index; instead, the speech unit database server device has it. The description below focuses on the differences from the first embodiment and omits matters common to both.
<Configuration>
The speech synthesis system of this embodiment is configured with a client device 400 in place of the client device 100 of the first embodiment, and a speech unit database server device 500 in place of the speech unit database server device 300. As in the first embodiment, the client device 400 and the speech unit database server device 500 are configured by, for example, loading a predetermined program into the MPU 102 illustrated in FIG. 1B or into the CPU of a known computer and executing it.

[Functional Configuration of Client Device 400]
FIG. 8 is a block diagram illustrating the functional configuration of the client device 400 in this embodiment. Parts of FIG. 8 that are common to FIG. 2 are given the same reference numerals as in FIG. 2.
As illustrated in this figure, the client device 400 of this embodiment includes a local speech unit database storage unit 111, a local speech unit index storage unit 112, a temporary storage unit 114, a text data input unit 121, a text analysis unit 122, a prosodic parameter acquisition unit 123, a local speech unit search unit 124, a local speech unit data reading unit 125, a speech unit connection unit 126, a speech output unit 127, a speech unit information transmission unit 133, a speech unit data reception unit 134, a speech unit index addition unit 136, a speech unit database addition unit 137, and a control unit 140, and is configured to be connectable to the network 200 through the speech unit information transmission unit 133 and the speech unit data reception unit 134. That is, unlike the first embodiment, the client device 400 of this embodiment does not include the optimal speech unit index storage unit 113, the optimal speech unit search unit 131, or the requested speech unit determination unit 132.

[Functional Configuration of Speech Unit Database Server Device 500]
FIG. 9 is a block diagram illustrating the functional configuration of the speech unit database server device 500 in this embodiment. Parts of FIG. 9 that are common to FIG. 3 are given the same reference numerals as in FIG. 3.
As illustrated in this figure, the speech unit database server device 500 of this embodiment includes a temporary storage unit 311, an optimal speech unit database storage unit 312, a speech unit information reception unit 321, an optimal speech unit data reading unit 323, a speech unit data transmission unit 322, a control unit 330, an optimal speech unit search unit 531, and an optimal speech unit index storage unit 532, and is configured to be connectable to the network 200 through the speech unit information reception unit 321 and the speech unit data transmission unit 322. The configuration of the optimal speech unit index storage unit 532 is the same as that of the optimal speech unit index storage unit 113 in the first embodiment.

<Processing of Client Device 400>
Next, the processing of the client device 400 in this embodiment will be described.
The client device 400 of this embodiment performs the same speech synthesis processing (steps S10 to S15) as the client device 100 of the first embodiment; its description is omitted here for brevity.
In this embodiment as well, the processing described below is performed independently of the speech synthesis processing (steps S10 to S15), either in parallel with it or after it, so that the client device 400 can generate higher-quality synthesized speech.

FIG. 10A is a flowchart for explaining the processing performed in the client device 400 independently of the speech synthesis processing. This processing is described below with reference to the figure.
First, the reading information and prosodic parameters stored in the temporary storage unit 114 (the reading information output from the text analysis unit 122 and the prosodic parameters output from the prosodic parameter acquisition unit 123) are input to the speech unit information transmission unit 133. The speech unit information transmission unit 133 then transmits this reading information and these prosodic parameters to the speech unit database server device 500 through the network 200 (step S50).

In response, the speech unit database server device 500 returns to the client device 400 the optimal speech unit data and optimal speech unit sequence information corresponding to the above reading information and prosodic parameters (details are given later).
Next, the speech unit data reception unit 134 of the client device 400 receives this optimal speech unit data and optimal speech unit sequence information and stores them in the temporary storage unit 114 (step S51).
Next, as in the first embodiment (see FIG. 6B: step S20), the speech unit database addition unit 137 reads from the temporary storage unit 114 the optimal speech unit data received by the speech unit data reception unit 134 and additionally stores it in the local speech unit database storage unit 111 as new local speech unit data (step S52). Along with this, the speech unit index addition unit 136 reads from the temporary storage unit 114 the optimal speech unit sequence information received in step S51 (the local speech unit sequence information corresponding to the new local speech unit data) and additionally stores it in the local speech unit index storage unit 112 (step S53). The new local speech unit data is added, for example, in the same way as in the first embodiment (see FIG. 6B: step S21). Alternatively, instead of treating all of the optimal speech unit data received by the speech unit data reception unit 134 as new local speech unit data, only part of it (for example, excluding data that duplicates local speech unit data already stored in the local speech unit database storage unit 111) may be stored in the local speech unit database storage unit 111 as new local speech unit data.
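Steps S50 to S53, including the optional variation that skips duplicates, can be sketched as follows. This is an illustrative Python outline under assumptions that are not in the patent: the `unit_id` key, the dictionary and list used for the local database and index, and the `request_units` callback standing in for the network exchange are all hypothetical.

```python
def update_local_store(reading, prosody, local_db, local_index, request_units):
    """Sketch of steps S50 to S53: request units, then store the new ones."""
    # S50/S51: send reading and prosody, receive (sequence_info, unit_data) pairs.
    received = request_units(reading, prosody)
    added = []
    for info, data in received:
        unit_id = info["unit_id"]  # hypothetical identifier field
        if unit_id in local_db:
            continue  # optional variation: skip units already held locally
        local_db[unit_id] = data   # S52: add to the local speech unit database
        local_index.append(info)   # S53: add to the local speech unit index
        added.append(unit_id)
    return added
```

Keeping the database write and the index write in the same loop iteration reflects the requirement that every stored unit has a matching index entry, so that the local search in step S12 can find it later.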

<Processing of Speech Unit Database Server Device 500>
Next, the processing of the speech unit database server device 500 in this embodiment will be described.
FIG. 10B is a flowchart for explaining the transmission processing of optimal speech unit data in the speech unit database server device 500 of this embodiment. The details of this transmission processing are described below with reference to the figure.
First, the reading information and prosodic parameters transmitted from the client device 400 through the network 200 are received by the speech unit information reception unit 321 and stored in the temporary storage unit 311 (step S60).

Next, the optimal speech unit search unit 531 reads the received reading information and prosodic parameters from the temporary storage unit 311 and searches the optimal speech unit index storage unit 532 using them as keys. The optimal speech unit search unit 531 then extracts from the optimal speech unit index storage unit 532 the optimal speech unit sequence information corresponding to reading information and prosodic parameters that fall within a similarity range of the received ones, and stores the extracted optimal speech unit sequence information in the temporary storage unit 311 (step S61). As described above, the optimal speech unit index stored in the optimal speech unit index storage unit 532 has the same configuration as the local speech unit index, and the speech unit sequence information can be determined in the same way as in the local speech unit search unit 124 (see FIG. 6A: step S12).
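One way to picture a search "within a similarity range" is the Python sketch below. The patent does not define the range here, so everything concrete is an assumption made for illustration: the `IndexEntry` fields, the use of pitch and duration as the prosodic parameters, the tolerance values, and the ranking by summed absolute difference.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexEntry:
    reading: str        # reading (phoneme) label, hypothetical field
    pitch_hz: float     # assumed prosodic parameter
    duration_ms: float  # assumed prosodic parameter
    location: int       # where the unit data is stored

def search_index(index, reading, pitch_hz, duration_ms,
                 pitch_tol=20.0, duration_tol=30.0):
    """Return entries with the same reading whose prosody is within tolerance."""
    matches = [
        e for e in index
        if e.reading == reading
        and abs(e.pitch_hz - pitch_hz) <= pitch_tol
        and abs(e.duration_ms - duration_ms) <= duration_tol
    ]
    # Closest prosodic match first.
    matches.sort(
        key=lambda e: abs(e.pitch_hz - pitch_hz) + abs(e.duration_ms - duration_ms)
    )
    return matches
```

A linear scan keeps the sketch short; a real index of this kind would more likely use a tree or hash structure keyed on the reading.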

Next, the optimal speech unit storage information of the extracted optimal speech unit sequence information is read into the optimal speech unit data reading unit 323, which reads the optimal speech unit data specified by that storage information from the optimal speech unit database storage unit 312 and stores it in the temporary storage unit 311 (step S62). The optimal speech unit data can be read from the optimal speech unit database storage unit 312 in the same way as the local speech unit data is read by the local speech unit data reading unit 125 (see FIG. 6A: step S13).

Finally, the optimal speech unit data stored in the temporary storage unit 311 and the corresponding optimal speech unit sequence information are input to the speech unit data transmission unit 322 and transmitted sequentially to the client device 400 through the network 200 (step S63).
<Features of this embodiment>
The configuration described above provides the same effects as the first embodiment. In addition, because the optimal speech unit search unit 531 and the optimal speech unit index storage unit 532 are placed in the speech unit database server device 500 in this embodiment, the data storage capacity and the amount of computation required of the client device 400 can be reduced.

[Third Embodiment]
Next, a third embodiment of the present invention will be described.
This embodiment is a modification of the first and second embodiments. Its main differences from the first embodiment are that the client device does not have the optimal speech unit index and the speech unit database server device has it instead, and that the speech unit database server device generates transmission speech unit sequence information instead of the client device generating requested speech unit sequence information. The description below focuses on the differences from the first and second embodiments and omits matters common to them.

<Configuration>
The speech synthesis system of this embodiment is configured with a client device 600 in place of the client device 100 of the first embodiment, and a speech unit database server device 700 in place of the speech unit database server device 300. As in the first embodiment, the client device 600 and the speech unit database server device 700 are configured by, for example, loading a predetermined program into the MPU 102 illustrated in FIG. 1B or into the CPU of a known computer and executing it.

[Functional Configuration of Client Device 600]
FIG. 11 is a block diagram illustrating the functional configuration of the client device 600 in this embodiment. Parts of FIG. 11 that are common to FIG. 2 are given the same reference numerals as in FIG. 2.
As illustrated in this figure, the configuration of the client device 600 of this embodiment is the same as that of the client device 400 of the second embodiment. That is, the client device 600 includes a local speech unit database storage unit 111, a local speech unit index storage unit 112, a temporary storage unit 114, a text data input unit 121, a text analysis unit 122, a prosodic parameter acquisition unit 123, a local speech unit search unit 124, a local speech unit data reading unit 125, a speech unit connection unit 126, a speech output unit 127, a speech unit information transmission unit 133, a speech unit data reception unit 134, a speech unit index addition unit 136, a speech unit database addition unit 137, and a control unit 140, and is configured to be connectable to the network 200 through the speech unit information transmission unit 133 and the speech unit data reception unit 134.

[Functional Configuration of Speech Unit Database Server Device 700]
FIG. 12 is a block diagram illustrating the functional configuration of the speech unit database server device 700 in this embodiment. Parts of FIG. 12 that are common to FIG. 3 or FIG. 9 are given the same reference numerals as in FIG. 3 or FIG. 9.
As illustrated in this figure, the speech unit database server device 700 of this embodiment includes a temporary storage unit 311, an optimal speech unit database storage unit 312, a speech unit information reception unit 321, an optimal speech unit data reading unit 323, a speech unit data transmission unit 322, a control unit 330, an optimal speech unit search unit 531, an optimal speech unit index storage unit 532, and a transmission speech unit determination unit 711, and is configured to be connectable to the network 200 through the speech unit information reception unit 321 and the speech unit data transmission unit 322. The configuration of the optimal speech unit index storage unit 532 is the same as that of the optimal speech unit index storage unit 113 in the first embodiment.

<Processing of Client Device 600>
Next, the processing of the client device 600 in this embodiment will be described.
The client device 600 of this embodiment performs the same speech synthesis processing (steps S10 to S15) as the client device 100 of the first embodiment; its description is omitted here for brevity.
In this embodiment as well, the processing described below is performed independently of the speech synthesis processing (steps S10 to S15), either in parallel with it or after it, so that the client device 600 can generate higher-quality synthesized speech.

FIG. 13A is a flowchart for explaining the processing performed in the client device 600 independently of the speech synthesis processing. This processing is described below with reference to the figure.
First, the local speech unit sequence information (output from the local speech unit search unit 124), the reading information (output from the text analysis unit 122), and the prosodic parameters (output from the prosodic parameter acquisition unit 123) stored in the temporary storage unit 114 are input to the speech unit information transmission unit 133. The speech unit information transmission unit 133 then transmits this local speech unit sequence information, reading information, and prosodic parameters to the speech unit database server device 700 through the network 200 (step S80).

In response, the speech unit database server device 700 returns to the client device 600 the optimal speech unit data and optimal speech unit sequence information corresponding to the transmission speech unit sequence information that it derives from the above local speech unit sequence information, reading information, and prosodic parameters (details are given later).
The subsequent processing in the client device 600 is the same as in the second embodiment. That is, the speech unit data reception unit 134 of the client device 600 receives the above optimal speech unit data and optimal speech unit sequence information and stores them in the temporary storage unit 114 (step S81).

Next, as in the first embodiment (see FIG. 6B: step S20), the speech unit database addition unit 137 reads from the temporary storage unit 114 the optimal speech unit data received by the speech unit data reception unit 134 and additionally stores it in the local speech unit database storage unit 111 as new local speech unit data (step S82). Along with this, the speech unit index addition unit 136 reads from the temporary storage unit 114 the optimal speech unit sequence information received in step S81 (the local speech unit sequence information corresponding to the new local speech unit data) and additionally stores it in the local speech unit index storage unit 112 (step S83). The new local speech unit data is added, for example, in the same way as in the first embodiment (see FIG. 6B: step S21).

<Processing of Speech Unit Database Server Device 700>
Next, the processing of the speech unit database server device 700 in this embodiment will be described.
FIG. 13B is a flowchart for explaining the transmission processing of optimal speech unit data in the speech unit database server device 700 of this embodiment. The details of this transmission processing are described below with reference to the figure.
First, the local speech unit sequence information, reading information, and prosodic parameters transmitted from the client device 600 through the network 200 are received by the speech unit information reception unit 321 and stored in the temporary storage unit 311 (step S90).

Next, the optimal speech unit search unit 531 reads the received reading information and prosodic parameters from the temporary storage unit 311 and searches the optimal speech unit index storage unit 532 using them as keys. The optimal speech unit search unit 531 then extracts from the optimal speech unit index storage unit 532 the optimal speech unit sequence information corresponding to reading information and prosodic parameters that fall within a similarity range of the received ones, and stores the extracted optimal speech unit sequence information in the temporary storage unit 311 (step S91). As described above, the optimal speech unit index stored in the optimal speech unit index storage unit 532 has the same configuration as the local speech unit index, and the speech unit sequence information can be determined in the same way as in the local speech unit search unit 124 (see FIG. 6A: step S12).

Next, the transmission speech unit determination unit 711 reads from the temporary storage unit 311 the local speech unit sequence information received by the speech unit information reception unit 321 and the optimal speech unit sequence information output from the optimal speech unit search unit 531. The transmission speech unit determination unit 711 then generates transmission speech unit sequence information by excluding from the optimal speech unit sequence information those entries whose reading information and prosodic parameters are shared with the local speech unit sequence information, and stores the transmission speech unit sequence information in the temporary storage unit 311 (step S92). The transmission speech unit sequence information is generated, for example, in the same way as the requested speech unit sequence information in the first embodiment (see step S17).
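The exclusion performed in step S92 amounts to a set difference keyed on reading information and prosodic parameters. The following Python sketch shows that idea only; the dictionary fields `reading`, `prosody`, and `location` are hypothetical, and exact equality of the prosodic parameters is an assumption (the patent defers the actual matching rule to step S17 of the first embodiment).

```python
def determine_units_to_send(optimal_infos, local_infos):
    """Drop optimal entries whose (reading, prosody) key the client already holds."""
    held = {(info["reading"], info["prosody"]) for info in local_infos}
    return [
        info for info in optimal_infos
        if (info["reading"], info["prosody"]) not in held
    ]
```

Only the entries that survive this filter need their unit data read and transmitted, which is what reduces the traffic sent back to the client.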

Next, the optimal speech unit storage information of the transmission speech unit sequence information is read from the temporary storage unit 311 into the optimal speech unit data reading unit 323. The optimal speech unit data reading unit 323 then reads the optimal speech unit data specified by that storage information from the optimal speech unit database storage unit 312 and stores it in the temporary storage unit 311 (step S93). The optimal speech unit data can be read from the optimal speech unit database storage unit 312 in the same way as the local speech unit data is read by the local speech unit data reading unit 125 (see FIG. 6A: step S13).

Finally, the optimal speech unit data stored in the temporary storage unit 311 and the corresponding optimal speech unit sequence information are input to the speech unit data transmission unit 322 and transmitted sequentially to the client device 600 through the network 200 (step S94).
<Features of this embodiment>
The configuration described above provides the same effects as the first embodiment. In addition, because the optimal speech unit search unit 531 and the optimal speech unit index storage unit 532 are placed in the speech unit database server device 700 in this embodiment, the data storage capacity and the amount of computation required of the client device 600 can be reduced. Moreover, the only optimal speech unit data and optimal speech unit sequence information transmitted from the speech unit database server device 700 to the client device 600 are those corresponding to the transmission speech unit sequence information determined by the transmission speech unit determination unit 711. This prevents transmission of optimal speech unit data that duplicates speech unit data already stored in the client device 600, and thus also reduces the communication load on the network.

[Fourth Embodiment]
This embodiment is a modification of the first embodiment, and is effective when the hardware configuration of the client device prevents the local speech unit database from growing beyond a predetermined size. Only the differences from the first embodiment are described below; matters common to the first embodiment are omitted.
<Configuration>
FIG. 14 is a block diagram illustrating the functional configuration of the client device 800 in this embodiment. Parts of FIG. 14 that are common to FIG. 2 are given the same reference numerals as in FIG. 2. The configurations of the speech synthesis system as a whole and of the speech unit database server device are the same as in the first embodiment.

The client device 800 of this embodiment differs from the client device 100 of the first embodiment in that it further includes a speech unit data deletion unit 811, which deletes part of the local speech unit data stored in the local speech unit database storage unit 111 in accordance with a predetermined priority order so that the total size of the stored local speech unit data is at or below a predetermined size, and a speech unit sequence information deletion unit 812, which deletes from the local speech unit index storage unit 112 the local speech unit sequence information corresponding to the local speech unit data deleted by the speech unit data deletion unit 811.

In this example, the structure of the local speech unit index stored in the local speech unit index storage unit 112 of the client device 800 also differs from that of the first embodiment. FIG. 15(a) illustrates the data structure of the local speech unit index 812a stored in the local speech unit index storage unit 112 of the client device 800. As illustrated, the local speech unit index 812a associates local speech unit sequence information 812aa, which is the same as in the first embodiment, with "read information" such as the number of times each entry has been read and the time at which it was last read.

<Processing>
In this embodiment, processing similar to steps S10 to S21 and S30 to S32 described in the first embodiment is performed, with the following differences.
[Search processing of local speech unit sequence information (corresponding to step S12)]
The difference from the first embodiment is that, when the local speech unit search unit 124 reads local speech unit sequence information from the local speech unit index 812a (FIG. 15(a)) stored in the local speech unit index storage unit 112, it writes "read information", such as the read count and the time of the last read, into the local speech unit index 812a. The rest is the same as step S12.
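The index layout and the read-information update described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `LocalUnitIndexEntry`, the field names, the representation of prosodic parameters as a tuple, and the exact-match lookup (used here in place of the "similar range" search of step S12) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional
import time


@dataclass
class LocalUnitIndexEntry:
    """One row of the local speech unit index (hypothetical layout)."""
    reading: str            # reading information, e.g. a phoneme string
    prosody: tuple          # prosodic parameters (assumed: F0 in Hz, duration in ms)
    file_number: int        # storage information locating the unit data
    start_position: int
    duration: int
    read_count: int = 0                 # "read information": number of reads
    last_read: Optional[float] = None   # "read information": time of last read


def lookup(index, reading, prosody, now=None):
    """Return the entries matching the key and record the read on each one."""
    now = time.time() if now is None else now
    # Exact matching is a simplification of the similar-range search.
    hits = [e for e in index if e.reading == reading and e.prosody == prosody]
    for entry in hits:
        entry.read_count += 1
        entry.last_read = now
    return hits
```

Keeping the read count and the last-read time on the index entry itself means that the deletion processing can later rank entries without consulting any separate log.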

[Local speech unit data deletion processing]
The local speech unit data deletion processing deletes part of the local speech unit data stored in the local speech unit database storage unit 111, in accordance with a predetermined priority order, so that the total size of the stored local speech unit data is less than or equal to a predetermined size, and then deletes the local speech unit sequence information corresponding to the deleted data from the local speech unit index storage unit 112. This processing may be executed, for example, between step S19 and step S20, after step S21, or at any other timing.

FIG. 15(b) is a flowchart of the local speech unit data deletion processing, which is described below with reference to that figure.
First, the speech unit data deletion unit 811 of the client device 800 determines whether the total size of the local speech unit data stored in the local speech unit database storage unit 111 is less than or equal to a predetermined size (step S100). If it is, the processing ends. If it is not, the speech unit data deletion unit 811 selects part of the local speech unit data stored in the local speech unit database storage unit 111 as deletion targets in accordance with a predetermined priority order (step S101) and deletes the selected data (step S102). The deletion is performed, for example, so that the size of the local speech unit database becomes the predetermined size. The "predetermined priority order" here means, for example, an order in which deletion starts from the local speech unit data that was read the fewest times in a predetermined period, or from the local speech unit data that has gone unread for the longest time (including data that has never been read).

The speech unit data deletion unit 811 then stores information identifying the deleted local speech unit data (for example, "file number", "duration", and "start position") in the temporary storage unit 114. Next, the speech unit sequence information deletion unit 812 reads this information from the temporary storage unit 114 and deletes the corresponding local speech unit sequence information from the local speech unit index storage unit 112 (step S103).
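Steps S100 to S103 can be sketched as below. This is only one possible reading: the names `StoredUnit`, `deletion_priority`, and `delete_local_units` are hypothetical, the database and index are modeled as dictionaries keyed by an assumed `unit_id`, and the priority shown (fewest reads first, with never-read or longest-unread units breaking ties) is just one combination of the example orderings given in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredUnit:
    unit_id: int                        # hypothetical key shared by database and index
    size: int                           # size of the unit data in bytes
    read_count: int = 0
    last_read: Optional[float] = None   # None means "never read"


def deletion_priority(unit):
    """Sort key: fewest reads first, then never-read, then oldest last read."""
    never_read = unit.last_read is None
    return (unit.read_count, 0 if never_read else 1, unit.last_read or 0.0)


def delete_local_units(database, index, max_total_size):
    """Trim `database` (unit_id -> StoredUnit) to at most max_total_size bytes.

    `index` maps unit_id -> sequence information. Returns the deleted ids.
    """
    total = sum(u.size for u in database.values())
    if total <= max_total_size:          # step S100: already within the limit
        return []
    deleted = []                         # stands in for temporary storage unit 114
    for unit in sorted(database.values(), key=deletion_priority):  # step S101
        if total <= max_total_size:
            break
        del database[unit.unit_id]       # step S102: delete the unit data
        total -= unit.size
        deleted.append(unit.unit_id)
    for unit_id in deleted:              # step S103: delete matching index entries
        index.pop(unit_id, None)
    return deleted
```

Collecting the deleted identifiers first and removing the index entries afterwards mirrors the hand-off through the temporary storage unit 114 between units 811 and 812.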
<Features of this embodiment>
With the configuration of the present embodiment, the present invention can be applied even when the size of the local speech unit database cannot be increased beyond a predetermined size due to the hardware configuration of the client device 800.

This embodiment has described an example in which the local speech unit data deletion processing is executed in the first embodiment, but the processing may also be executed in the second and third embodiments. In that case, the client device is configured, for example, by adding the speech unit data deletion unit 811 and the speech unit sequence information deletion unit 812 to the client devices 400 and 600 described above.
The present invention is not limited to the embodiments described above and may of course be modified as appropriate without departing from its spirit. The various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as otherwise needed.

When the above configuration is implemented by a computer, the processing performed by each function of each device is described by a program. The program can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium, such as a DVD or CD-ROM, on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.
A computer that executes such a program first stores the program recorded on the portable recording medium, or the program transferred from the server computer, in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time a program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, each device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be implemented in hardware.

Industrial applications of the present invention include, for example, the generation of synthesized speech on mobile phones and personal computers, and automatic voice response systems.

FIG. 1(a) is a conceptual diagram of the speech synthesis system according to the first embodiment. FIG. 1(b) is a conceptual diagram illustrating the hardware configuration of the client device in FIG. 1(a). FIG. 1(c) is a conceptual diagram illustrating the hardware configuration of the speech unit database server device in FIG. 1(a).
FIG. 2 is a block diagram illustrating the functional configuration of the client device according to the first embodiment.
FIG. 3 is a block diagram illustrating the functional configuration of the speech unit database server device according to the first embodiment.
FIG. 4 is a conceptual diagram for explaining the structure of the local speech unit data stored in the local speech unit database storage unit.
FIG. 5(a) is a conceptual diagram illustrating the data structure of the local speech unit index stored in the local speech unit index storage unit of FIG. 2. FIG. 5(b) is a conceptual diagram illustrating the structure of the optimal speech unit index stored in the optimal speech unit index storage unit of FIG. 3.
FIG. 6(a) is a flowchart for explaining the speech synthesis processing in the client device according to the first embodiment. FIG. 6(b) is a flowchart for explaining processing performed independently of the processing of FIG. 6(a).
FIG. 7(a) is a conceptual diagram illustrating the data structure of the local speech unit index stored in the local speech unit index storage unit of FIG. 2. FIG. 7(b) is a flowchart for explaining the optimal speech unit data transmission processing in the speech unit database server device according to the first embodiment.
FIG. 8 is a block diagram illustrating the functional configuration of the client device according to the second embodiment.
FIG. 9 is a block diagram illustrating the functional configuration of the speech unit database server device according to the second embodiment.
FIG. 10(a) is a flowchart for explaining the processing in the client device that is performed independently of the speech synthesis processing in the second embodiment. FIG. 10(b) is a flowchart for explaining the optimal speech unit data transmission processing in the speech unit database server device according to the second embodiment.
FIG. 11 is a block diagram illustrating the functional configuration of the client device according to the third embodiment.
FIG. 12 is a block diagram illustrating the functional configuration of the speech unit database server device according to the third embodiment.
FIG. 13(a) is a flowchart for explaining the processing in the client device that is performed independently of the speech synthesis processing in the third embodiment. FIG. 13(b) is a flowchart for explaining the optimal speech unit data transmission processing in the speech unit database server device according to the third embodiment.
FIG. 14 is a block diagram illustrating the functional configuration of the client device according to the fourth embodiment.
FIG. 15(a) is a diagram illustrating the data structure of the local speech unit index stored in the local speech unit index storage unit of the client device according to the fourth embodiment. FIG. 15(b) is a flowchart for explaining the local speech unit data deletion processing.

Explanation of symbols

1 Speech synthesis system
100, 400, 600, 800 Client device
300, 500, 700 Speech unit database server device

Claims

A speech synthesis system comprising at least one client device and at least one speech unit database server device connected to the client device through a network, wherein
the speech unit database server device comprises
an optimal speech unit database storage unit that stores optimal speech unit data (meaning "speech unit data stored in the speech unit database server device"),
the client device comprises
a local speech unit database storage unit that stores local speech unit data (meaning "speech unit data stored in the client device");
a local speech unit index storage unit that stores local speech unit sequence information in which local speech unit storage information specifying local speech unit data is associated with the reading information and prosodic parameters corresponding to that local speech unit data;
an optimal speech unit index storage unit that stores optimal speech unit sequence information in which optimal speech unit storage information specifying optimal speech unit data is associated with the reading information and prosodic parameters corresponding to that optimal speech unit data;
a text analysis unit that receives text data to be uttered, performs text analysis on the text data to generate reading information and prosodic information, and outputs the reading information and the prosodic information;
a prosodic parameter acquisition unit that receives the prosodic information output from the text analysis unit, uses the prosodic information to generate the physical prosodic parameters required for speech synthesis, and outputs the prosodic parameters;
a local speech unit search unit that receives the reading information output from the text analysis unit and the prosodic parameters output from the prosodic parameter acquisition unit, searches the local speech unit index storage unit using the input reading information and prosodic parameters as keys, extracts local speech unit sequence information corresponding to reading information and prosodic parameters within a similar range of the input reading information and prosodic parameters, and outputs the extracted local speech unit sequence information;
a local speech unit data reading unit that receives the local speech unit storage information of the local speech unit sequence information output from the local speech unit search unit and reads the local speech unit data specified by that local speech unit storage information from the local speech unit database storage unit;
a speech unit connection unit that receives the local speech unit data read by the local speech unit data reading unit, generates synthesized speech data using that local speech unit data, and outputs the synthesized speech data;
an optimal speech unit search unit that receives the reading information output from the text analysis unit and the prosodic parameters output from the prosodic parameter acquisition unit, searches the optimal speech unit index storage unit using the input reading information and prosodic parameters as keys, extracts optimal speech unit sequence information corresponding to reading information and prosodic parameters within a similar range of the input reading information and prosodic parameters, and outputs the extracted optimal speech unit sequence information;
a requested speech unit determination unit that receives the local speech unit sequence information and the optimal speech unit sequence information output from the local speech unit search unit and the optimal speech unit search unit, respectively, generates requested speech unit sequence information by excluding, from the optimal speech unit sequence information, entries whose reading information and prosodic parameters are common to the local speech unit sequence information, and outputs the requested speech unit sequence information; and
a speech unit information transmission unit that transmits the optimal speech unit storage information of the requested speech unit sequence information to the speech unit database server device through the network,
the speech unit database server device further comprises
a speech unit information receiving unit that receives the optimal speech unit storage information of the requested speech unit sequence information;
an optimal speech unit data reading unit that receives the received optimal speech unit storage information and reads the optimal speech unit data specified by that optimal speech unit storage information from the optimal speech unit database storage unit; and
a speech unit data transmission unit that returns the read optimal speech unit data to the client device through the network, and
the client device further comprises
a speech unit data receiving unit that receives the optimal speech unit data;
a speech unit database adding unit that additionally stores the optimal speech unit data received by the speech unit data receiving unit in the local speech unit database storage unit as new local speech unit data; and
a speech unit index adding unit that additionally stores local speech unit sequence information corresponding to the new local speech unit data in the local speech unit index storage unit.
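The step in this claim that generates the requested speech unit sequence information amounts to a set difference keyed on reading information and prosodic parameters. The sketch below illustrates that idea under stated assumptions: sequence information is modeled as a dictionary with hypothetical keys `reading`, `prosody`, and `storage`, and "common" is taken to mean exact equality of the (reading, prosody) pair.

```python
def requested_unit_sequence(optimal_entries, local_entries):
    """Drop optimal entries whose (reading, prosody) key is already held locally."""
    local_keys = {(e["reading"], e["prosody"]) for e in local_entries}
    return [
        e for e in optimal_entries
        if (e["reading"], e["prosody"]) not in local_keys
    ]


def storage_info_to_send(requested):
    """Extract only the storage information that is sent to the server."""
    return [e["storage"] for e in requested]
```

Because only storage information for units that the client lacks is transmitted, the server returns no data that the local database already holds.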

A speech synthesis system comprising at least one client device and at least one speech unit database server device connected to the client device through a network, wherein
the speech unit database server device comprises
an optimal speech unit database storage unit that stores optimal speech unit data (meaning "speech unit data stored in the speech unit database server device"); and
an optimal speech unit index storage unit that stores optimal speech unit sequence information in which optimal speech unit storage information specifying optimal speech unit data is associated with the reading information and prosodic parameters corresponding to that optimal speech unit data,
the client device comprises
a local speech unit database storage unit that stores local speech unit data (meaning "speech unit data stored in the client device");
a local speech unit index storage unit that stores local speech unit sequence information in which local speech unit storage information specifying local speech unit data is associated with the reading information and prosodic parameters corresponding to that local speech unit data;
a text analysis unit that receives text data to be uttered, performs text analysis on the text data to generate reading information and prosodic information, and outputs the reading information and the prosodic information;
a prosodic parameter acquisition unit that receives the prosodic information output from the text analysis unit, uses the prosodic information to generate the physical prosodic parameters required for speech synthesis, and outputs the prosodic parameters;
a local speech unit search unit that receives the reading information output from the text analysis unit and the prosodic parameters output from the prosodic parameter acquisition unit, searches the local speech unit index storage unit using the input reading information and prosodic parameters as keys, extracts local speech unit sequence information corresponding to reading information and prosodic parameters within a similar range of the input reading information and prosodic parameters, and outputs the extracted local speech unit sequence information;
a local speech unit data reading unit that receives the local speech unit storage information of the local speech unit sequence information output from the local speech unit search unit and reads the local speech unit data specified by that local speech unit storage information from the local speech unit database storage unit;
a speech unit connection unit that receives the local speech unit data read by the local speech unit data reading unit, generates synthesized speech data using that local speech unit data, and outputs the synthesized speech data; and
a speech unit information transmission unit that transmits the reading information output from the text analysis unit and the prosodic parameters output from the prosodic parameter acquisition unit to the speech unit database server device through the network,
the speech unit database server device further comprises
a speech unit information receiving unit that receives the reading information and the prosodic parameters;
an optimal speech unit search unit that searches the optimal speech unit index storage unit using the received reading information and prosodic parameters as keys, extracts optimal speech unit sequence information corresponding to reading information and prosodic parameters within a similar range of the received reading information and prosodic parameters, and outputs the extracted optimal speech unit sequence information;
an optimal speech unit data reading unit that receives the optimal speech unit storage information of the optimal speech unit sequence information output from the optimal speech unit search unit and reads the optimal speech unit data specified by that optimal speech unit storage information from the optimal speech unit database storage unit; and
a speech unit data transmission unit that returns the read optimal speech unit data to the client device through the network, and
the client device further comprises
a speech unit data receiving unit that receives the optimal speech unit data;
a speech unit database adding unit that additionally stores at least part of the optimal speech unit data received by the speech unit data receiving unit in the local speech unit database storage unit as new local speech unit data; and
a speech unit index adding unit that additionally stores local speech unit sequence information corresponding to the new local speech unit data in the local speech unit index storage unit.
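The index search keyed on reading information and prosodic parameters can be illustrated as follows. The claim does not define "similar range" here, so the sketch assumes one simple interpretation: the reading must match exactly and each numeric prosodic parameter must lie within a per-parameter tolerance. The function names, the dictionary keys, and the tolerance values are hypothetical.

```python
def in_similar_range(query, candidate, tolerances):
    """True if every prosodic parameter differs by no more than its tolerance."""
    return all(abs(q - c) <= t for q, c, t in zip(query, candidate, tolerances))


def search_index(index, reading, prosody, tolerances):
    """Return index entries whose reading matches and whose prosody is close."""
    return [
        e for e in index
        if e["reading"] == reading
        and in_similar_range(prosody, e["prosody"], tolerances)
    ]
```

In practice the similarity criterion could be a weighted cost rather than a hard threshold; the threshold form is used only to keep the example short.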

A speech synthesis system comprising at least one client device and at least one speech unit database server device connected to the client device through a network, wherein
the speech unit database server device comprises
an optimal speech unit database storage unit that stores optimal speech unit data (meaning "speech unit data stored in the speech unit database server device"); and
an optimal speech unit index storage unit that stores optimal speech unit sequence information in which optimal speech unit storage information specifying optimal speech unit data is associated with the reading information and prosodic parameters corresponding to that optimal speech unit data,
the client device comprises
a local speech unit database storage unit that stores local speech unit data (meaning "speech unit data stored in the client device");
a local speech unit index storage unit that stores local speech unit sequence information in which local speech unit storage information specifying local speech unit data is associated with the reading information and prosodic parameters corresponding to that local speech unit data;
a text analysis unit that receives text data to be uttered, performs text analysis on the text data to generate reading information and prosodic information, and outputs the reading information and the prosodic information;
a prosodic parameter acquisition unit that receives the prosodic information output from the text analysis unit, uses the prosodic information to generate the physical prosodic parameters required for speech synthesis, and outputs the prosodic parameters;
a local speech unit search unit that receives the reading information output from the text analysis unit and the prosodic parameters output from the prosodic parameter acquisition unit, searches the local speech unit index storage unit using the input reading information and prosodic parameters as keys, extracts local speech unit sequence information corresponding to reading information and prosodic parameters within a similar range of the input reading information and prosodic parameters, and outputs the extracted local speech unit sequence information;
a local speech unit data reading unit that receives the local speech unit storage information of the local speech unit sequence information output from the local speech unit search unit and reads the local speech unit data specified by that local speech unit storage information from the local speech unit database storage unit;
a speech unit connection unit that receives the local speech unit data read by the local speech unit data reading unit, generates synthesized speech data using that local speech unit data, and outputs the synthesized speech data; and
a speech unit information transmission unit that transmits the local speech unit sequence information output from the local speech unit search unit, the reading information output from the text analysis unit, and the prosodic parameters output from the prosodic parameter acquisition unit to the speech unit database server device through the network,
the speech unit database server device further comprises
a speech unit information receiving unit that receives the local speech unit sequence information, the reading information, and the prosodic parameters;
an optimal speech unit search unit that searches the optimal speech unit index storage unit using the received reading information and prosodic parameters as keys, extracts optimal speech unit sequence information corresponding to reading information and prosodic parameters within a similar range of the received reading information and prosodic parameters, and outputs the extracted optimal speech unit sequence information;
a transmission speech unit determination unit that receives the local speech unit sequence information received by the speech unit information receiving unit and the optimal speech unit sequence information output from the optimal speech unit search unit, generates transmission speech unit sequence information by excluding, from the optimal speech unit sequence information, entries whose reading information and prosodic parameters are common to the local speech unit sequence information, and outputs the transmission speech unit sequence information;
an optimal speech unit data reading unit that receives the optimal speech unit storage information of the transmission speech unit sequence information and reads the optimal speech unit data specified by that optimal speech unit storage information from the optimal speech unit database storage unit; and
a speech unit data transmission unit that returns the read optimal speech unit data to the client device through the network, and
the client device further comprises
a speech unit data receiving unit that receives the optimal speech unit data;
a speech unit database adding unit that additionally stores the optimal speech unit data received by the speech unit data receiving unit in the local speech unit database storage unit as new local speech unit data; and
a speech unit index adding unit that additionally stores local speech unit sequence information corresponding to the new local speech unit data in the local speech unit index storage unit.

The speech synthesis system according to any one of claims 1 to 3,
The client device comprises:
A speech unit data deletion unit that deletes, according to a predetermined priority order, a part of the local speech unit data stored in the local speech unit database storage unit so that the total size of the local speech unit data stored in the local speech unit database storage unit is less than or equal to a predetermined size; and
A speech unit sequence information deletion unit that deletes, from the local speech unit index storage unit, the local speech unit sequence information corresponding to the local speech unit data deleted by the speech unit data deletion unit.
A speech synthesis system characterized by this.
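The size-bounded deletion described in this claim can be illustrated with a minimal sketch. The claim only requires "a predetermined priority order"; the integer `priority` field (lower values deleted first), the `LocalUnit` type, and the dictionary-based database and index are all assumptions made for illustration and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class LocalUnit:
    unit_id: str      # hypothetical storage key for one local speech unit
    size_bytes: int   # size of the stored waveform data
    priority: int     # assumed meaning: lower value is deleted first


def evict(database, index, max_bytes):
    """Delete lowest-priority units until the total size is <= max_bytes.

    database: dict unit_id -> LocalUnit (stands in for the local database)
    index:    dict unit_id -> (reading, prosody) (stands in for the local index)
    Returns the list of deleted unit ids.
    """
    total = sum(u.size_bytes for u in database.values())
    removed = []
    # sorted() copies the values, so deleting from the dict while looping is safe.
    for unit in sorted(database.values(), key=lambda u: u.priority):
        if total <= max_bytes:
            break
        del database[unit.unit_id]      # speech unit data deletion
        index.pop(unit.unit_id, None)   # matching sequence information deletion
        total -= unit.size_bytes
        removed.append(unit.unit_id)
    return removed
```

Deleting the index entry in the same loop iteration as the data keeps the index from pointing at unit data that no longer exists, which is the purpose of the second deletion unit in the claim.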

A local speech segment database storage unit for storing local speech segment data (meaning “speech segment data stored in the client device”);
A local speech unit index storage unit for storing local speech unit sequence information in which local speech unit storage information for specifying local speech unit data is associated with the reading information and prosodic parameters corresponding to the local speech unit data;
An optimal speech unit index storage unit for storing optimal speech unit sequence information in which optimal speech segment storage information for specifying optimal speech segment data (meaning “speech segment data stored in the speech segment database server device”) is associated with the reading information and prosodic parameters corresponding to the optimal speech segment data;
A text analysis unit that receives text data to be uttered, performs text analysis on the text data to generate reading information and prosodic information, and outputs the reading information and prosodic information;
A prosodic parameter acquisition unit that receives the prosodic information output from the text analysis unit, uses the prosodic information to generate the physical prosodic parameters necessary for speech synthesis, and outputs the prosodic parameters;
A local speech unit search unit that receives the reading information output from the text analysis unit and the prosodic parameters output from the prosodic parameter acquisition unit, searches the local speech unit index storage unit using the input reading information and prosodic parameters as keys, extracts local speech unit sequence information corresponding to reading information and prosodic parameters belonging to a similar range of the input reading information and prosodic parameters, and outputs the extracted local speech unit sequence information;
A local speech unit data reading unit that receives the local speech unit storage information of the local speech unit sequence information output from the local speech unit search unit and reads the local speech unit data specified by that storage information from the local speech unit database storage unit;
A speech unit connection unit that receives the local speech unit data read by the local speech unit data reading unit, generates synthesized speech data using the local speech unit data, and outputs the synthesized speech data;
An optimal speech unit search unit that receives the reading information output from the text analysis unit and the prosodic parameters output from the prosodic parameter acquisition unit, searches the optimal speech unit index storage unit using the input reading information and prosodic parameters as keys, extracts optimal speech unit sequence information corresponding to reading information and prosodic parameters belonging to a similar range of the input reading information and prosodic parameters, and outputs the extracted optimal speech unit sequence information;
A requested speech unit determination unit that receives the local speech unit sequence information and the optimal speech unit sequence information output from the local speech unit search unit and the optimal speech unit search unit, respectively, generates requested speech unit sequence information by excluding from the optimal speech unit sequence information those entries whose reading information and prosodic parameters are common to the local speech unit sequence information, and outputs the requested speech unit sequence information;
A speech unit information transmitting unit that transmits the optimal speech unit storage information of the requested speech unit sequence information to the speech unit database server device via the network;
A speech unit data receiving unit for receiving optimum speech unit data transmitted from the speech unit database server device through a network;
A speech unit database adding unit for additionally storing the optimum speech unit data received by the speech unit data receiving unit in the local speech unit database storage unit as new local speech unit data;
A speech unit index adding unit that additionally stores local speech unit sequence information corresponding to the new local speech unit data in the local speech unit index storage unit;
A client device.
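The requested speech unit determination in this claim amounts to a difference between two sequences, keyed on reading information and prosodic parameters. The following is a minimal sketch under that reading; the `UnitEntry` type, the string form of `storage_info`, and the tuple form of `prosody` are hypothetical and not specified by the patent.

```python
from typing import NamedTuple


class UnitEntry(NamedTuple):
    storage_info: str   # hypothetical locator for the unit data
    reading: str        # reading information, e.g. a phoneme string
    prosody: tuple      # assumed form: (F0 in Hz, duration in ms)


def determine_requested(optimal_seq, local_seq):
    """Keep optimal entries whose (reading, prosody) pair is absent locally."""
    local_keys = {(e.reading, e.prosody) for e in local_seq}
    return [e for e in optimal_seq if (e.reading, e.prosody) not in local_keys]
```

Only the `storage_info` of the returned entries would then need to be sent to the server, so units the client already holds are never transferred again.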

An optimal speech segment database storage unit for storing optimal speech segment data (meaning “speech segment data stored in the speech segment database server device”);
An optimal speech unit index storage unit for storing optimal speech unit sequence information in which optimal speech unit storage information for designating optimal speech unit data is associated with the reading information and prosodic parameters corresponding to the optimal speech unit data;
A speech unit information receiving unit for receiving reading information and prosodic parameters transmitted from a client device through a network;
An optimal speech unit search unit that searches the optimal speech unit index storage unit using the received reading information and prosodic parameters as keys, extracts optimal speech unit sequence information corresponding to reading information and prosodic parameters belonging to a similar range of the received reading information and prosodic parameters, and outputs the extracted optimal speech unit sequence information;
An optimal speech unit data reading unit that receives the optimal speech unit storage information of the optimal speech unit sequence information output from the optimal speech unit search unit and reads the optimal speech unit data specified by that storage information from the optimal speech unit database storage unit;
A speech unit data transmission unit that returns the read optimal speech unit data to the client device via a network;
A speech segment database server device characterized by the above.
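The claim does not define what a "similar range" of reading information and prosodic parameters is. As one possible interpretation only, the sketch below requires an exact match on the reading and accepts prosodic parameters within fixed tolerances. The field names, the choice of F0 and duration as the prosodic parameters, the tolerance values, and the sort by absolute distance are all assumptions, not the patent's search procedure.

```python
def search_index(index, reading, f0_hz, duration_ms, f0_tol=10.0, dur_tol=20.0):
    """Return index entries in the assumed 'similar range', closest first.

    index: list of dicts with keys 'storage_info', 'reading', 'f0_hz', 'duration_ms'
    """
    candidates = [
        e for e in index
        if e["reading"] == reading
        and abs(e["f0_hz"] - f0_hz) <= f0_tol
        and abs(e["duration_ms"] - duration_ms) <= dur_tol
    ]
    # Rank by a simple unweighted distance; a real system would use a tuned cost.
    return sorted(
        candidates,
        key=lambda e: abs(e["f0_hz"] - f0_hz) + abs(e["duration_ms"] - duration_ms),
    )
```

The `storage_info` of the returned entries is what the optimal speech unit data reading unit would use to read the unit data from the server's database.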

An optimal speech segment database storage unit for storing optimal speech segment data (meaning “speech segment data stored in the speech segment database server device”);
An optimal speech unit index storage unit for storing optimal speech unit sequence information in which optimal speech unit storage information for designating optimal speech unit data is associated with the reading information and prosodic parameters corresponding to the optimal speech unit data;
A speech unit information receiving unit that receives local speech unit sequence information, reading information, and prosodic parameters transmitted from the client device through the network;
An optimal speech unit search unit that searches the optimal speech unit index storage unit using the received reading information and prosodic parameters as keys, extracts optimal speech unit sequence information corresponding to reading information and prosodic parameters belonging to a similar range of the received reading information and prosodic parameters, and outputs the extracted optimal speech unit sequence information;
A transmission speech unit determination unit that receives as input the local speech unit sequence information received by the speech unit information receiving unit and the optimal speech unit sequence information output from the optimal speech unit search unit, generates transmission speech unit sequence information by excluding from the optimal speech unit sequence information those entries whose reading information and prosodic parameters are common to the local speech unit sequence information, and outputs the transmission speech unit sequence information;
An optimal speech unit data reading unit that receives as input the optimal speech unit storage information of the transmission speech unit sequence information and reads the optimal speech unit data specified by that storage information from the optimal speech unit database storage unit;
A speech unit data transmission unit that returns the read optimal speech unit data to the client device via a network;
A speech segment database server device characterized by the above.

A speech synthesis method for a speech synthesis system comprising at least one client device and at least one speech segment database server device connected to the client device through a network,
Optimal speech segment data (meaning “speech segment data stored in the speech segment database server device”) is stored in the optimal speech segment database storage unit of the speech segment database server device,
Local speech segment data (meaning “speech segment data stored in the client device”) is stored in the local speech segment database storage unit of the client device,
The local speech unit index storage unit of the client device stores local speech unit sequence information in which local speech unit storage information for specifying local speech unit data is associated with the reading information and prosodic parameters corresponding to the local speech unit data,
The optimal speech unit index storage unit of the client device stores optimal speech unit sequence information in which optimal speech unit storage information for designating optimal speech unit data is associated with the reading information and prosodic parameters corresponding to the optimal speech unit data, and, in the state where the above are stored,
Inputting text data to be voiced to the text analysis unit of the client device, and, in the text analysis unit, performing text analysis on the text data to generate reading information and prosodic information and outputting the reading information and prosodic information;
Inputting the prosodic information output from the text analysis unit to the prosodic parameter acquisition unit of the client device, and, in the prosodic parameter acquisition unit, generating the physical prosodic parameters necessary for speech synthesis using the prosodic information and outputting the prosodic parameters;
Inputting the reading information output from the text analysis unit and the prosodic parameters output from the prosodic parameter acquisition unit to the local speech unit search unit of the client device, and, in the local speech unit search unit, searching the local speech unit index storage unit using the input reading information and prosodic parameters as keys, extracting local speech unit sequence information corresponding to reading information and prosodic parameters belonging to a similar range of the input reading information and prosodic parameters, and outputting the extracted local speech unit sequence information;
Inputting the local speech unit storage information of the local speech unit sequence information output from the local speech unit search unit to the local speech unit data reading unit of the client device, and, in the local speech unit data reading unit, reading out the local speech unit data designated by the local speech unit storage information from the local speech unit database storage unit;
Inputting the local speech unit data read by the local speech unit data reading unit to the speech unit connection unit of the client device, and, in the speech unit connection unit, generating synthesized speech data using the local speech unit data and outputting the synthesized speech data;
Inputting the reading information output from the text analysis unit and the prosodic parameters output from the prosodic parameter acquisition unit to the optimal speech unit search unit of the client device, and, in the optimal speech unit search unit, searching the optimal speech unit index storage unit using the input reading information and prosodic parameters as keys, extracting optimal speech unit sequence information corresponding to reading information and prosodic parameters belonging to a similar range of the input reading information and prosodic parameters, and outputting the extracted optimal speech unit sequence information;
Inputting the local speech unit sequence information and the optimal speech unit sequence information output from the local speech unit search unit and the optimal speech unit search unit, respectively, to the requested speech unit determination unit of the client device, and, in the requested speech unit determination unit, generating requested speech unit sequence information by excluding from the optimal speech unit sequence information those entries whose reading information and prosodic parameters are common to the local speech unit sequence information, and outputting the requested speech unit sequence information;
In the speech unit information transmission unit of the client device, transmitting the optimal speech unit storage information of the requested speech unit sequence information to the speech unit database server device through the network;
In the speech unit information receiving unit of the speech unit database server device, receiving the optimum speech unit storage information of the requested speech unit sequence information;
Inputting the received optimal speech unit storage information to the optimal speech unit data reading unit of the speech unit database server device, and, in the optimal speech unit data reading unit, reading out the optimal speech unit data specified by the optimal speech unit storage information from the optimal speech unit database storage unit;
Returning the read optimum speech unit data to the client device through a network in the speech unit data transmission unit of the speech unit database server device;
In the speech unit data receiving unit of the client device, receiving optimal speech unit data;
Additionally storing, in the speech unit database addition unit of the client device, the optimal speech unit data received by the speech unit data receiving unit in the local speech unit database storage unit as new local speech unit data;
Additionally storing local speech unit sequence information corresponding to the new local speech unit data in the local speech unit index storage unit in the speech unit index addition unit of the client device;
A speech synthesis method characterized by executing the above steps.
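The request, return, and additional-storage steps of this method can be sketched as an in-memory round trip. This is a simplified illustration, not the claimed implementation: the `Server` and `Client` classes and their method names are hypothetical, the network is replaced by a direct method call, and the "similar range" search is reduced to an exact key match on `(reading, prosody)`.

```python
class Server:
    """Stands in for the speech segment database server device."""

    def __init__(self, database):
        self.database = database  # storage_info -> unit data (bytes)

    def fetch(self, storage_infos):
        # Read the designated optimal units and return them to the client.
        return {s: self.database[s] for s in storage_infos}


class Client:
    """Stands in for the client device; holds both indexes and the local DB."""

    def __init__(self, optimal_index, server):
        self.local_db = {}                  # storage_info -> unit data
        self.local_index = {}               # (reading, prosody) -> storage_info
        self.optimal_index = optimal_index  # (reading, prosody) -> storage_info
        self.server = server

    def update_cache(self, keys):
        """keys: (reading, prosody) pairs derived from one input text."""
        # Requested units: listed in the optimal index but not yet held locally.
        requested = [
            self.optimal_index[k]
            for k in keys
            if k in self.optimal_index and k not in self.local_index
        ]
        received = self.server.fetch(requested)
        info_to_key = {v: k for k, v in self.optimal_index.items()}
        for storage_info, data in received.items():
            self.local_db[storage_info] = data                         # database addition
            self.local_index[info_to_key[storage_info]] = storage_info  # index addition
        return requested
```

Calling `update_cache` twice with the same keys requests the units only the first time; the second call finds them in the local index, which is how the stored units become available for subsequent synthesis.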

A speech synthesis method for a speech synthesis system comprising at least one client device and at least one speech segment database server device connected to the client device through a network,
Optimal speech segment data (meaning “speech segment data stored in the speech segment database server device”) is stored in the optimal speech segment database storage unit of the speech segment database server device,
The optimal speech unit index storage unit of the speech segment database server device stores optimal speech segment sequence information in which optimal speech unit storage information for designating optimal speech unit data is associated with the reading information and prosodic parameters corresponding to the optimal speech unit data,
Local speech segment data (meaning “speech segment data stored in the client device”) is stored in the local speech segment database storage unit of the client device,
The local speech unit index storage unit of the client device stores local speech unit sequence information in which local speech unit storage information for specifying local speech unit data is associated with the reading information and prosodic parameters corresponding to the local speech unit data, and, in the state where the above are stored,
Inputting text data to be voiced to the text analysis unit of the client device, and, in the text analysis unit, performing text analysis on the text data to generate reading information and prosodic information and outputting the reading information and prosodic information;
Inputting the prosodic information output from the text analysis unit to the prosodic parameter acquisition unit of the client device, and, in the prosodic parameter acquisition unit, generating the physical prosodic parameters necessary for speech synthesis using the prosodic information and outputting the prosodic parameters;
Inputting the reading information output from the text analysis unit and the prosodic parameters output from the prosodic parameter acquisition unit to the local speech unit search unit of the client device, and, in the local speech unit search unit, searching the local speech unit index storage unit using the input reading information and prosodic parameters as keys, extracting local speech unit sequence information corresponding to reading information and prosodic parameters belonging to a similar range of the input reading information and prosodic parameters, and outputting the extracted local speech unit sequence information;
Inputting the local speech unit storage information of the local speech unit sequence information output from the local speech unit search unit to the local speech unit data reading unit of the client device, and, in the local speech unit data reading unit, reading the local speech unit data designated by the local speech unit storage information from the local speech unit database storage unit;
Inputting the local speech unit data read by the local speech unit data reading unit to the speech unit connection unit of the client device, and, in the speech unit connection unit, generating synthesized speech data using the local speech unit data and outputting the synthesized speech data;
Transmitting the reading information output from the text analysis unit and the prosodic parameters output from the prosodic parameter acquisition unit to the speech unit database server device through the network in the speech unit information transmission unit of the client device;
In the speech unit information receiving unit of the speech unit database server device, receiving reading information and prosodic parameters;
Searching, in the optimal speech unit search unit of the speech unit database server device, the optimal speech unit index storage unit using the received reading information and prosodic parameters as keys, extracting optimal speech segment sequence information corresponding to reading information and prosodic parameters belonging to a similar range of the received reading information and prosodic parameters, and outputting the extracted optimal speech segment sequence information;
Inputting the optimal speech unit storage information of the optimal speech unit sequence information output from the optimal speech unit search unit to the optimal speech unit data reading unit of the speech unit database server device, and, in the optimal speech unit data reading unit, reading out the optimal speech unit data specified by the optimal speech unit storage information from the optimal speech unit database storage unit;
Returning the read optimum speech unit data to the client device through a network in the speech unit data transmission unit of the speech unit database server device;
In the speech unit data receiving unit of the client device, receiving optimal speech unit data;
Additionally storing, in the speech unit database addition unit of the client device, at least a part of the optimal speech unit data received by the speech unit data receiving unit in the local speech unit database storage unit as new local speech unit data;
In the speech unit index addition unit of the client device, additionally storing local speech unit sequence information corresponding to the new local speech unit data in the local speech unit index storage unit;
A speech synthesis method characterized by executing the above steps.

A speech synthesis method for a speech synthesis system comprising at least one client device and at least one speech segment database server device connected to the client device through a network,
Optimal speech segment data (meaning “speech segment data stored in the speech segment database server device”) is stored in the optimal speech segment database storage unit of the speech segment database server device,
The optimal speech unit index storage unit of the speech segment database server device stores optimal speech segment sequence information in which optimal speech unit storage information for designating optimal speech unit data is associated with the reading information and prosodic parameters corresponding to the optimal speech unit data,
Local speech segment data (meaning “speech segment data stored in the client device”) is stored in the local speech segment database storage unit of the client device,
The local speech unit index storage unit of the client device stores local speech unit sequence information in which local speech unit storage information for specifying local speech unit data is associated with the reading information and prosodic parameters corresponding to the local speech unit data, and, in the state where the above are stored,
Inputting text data to be voiced to the text analysis unit of the client device, and, in the text analysis unit, performing text analysis on the text data to generate reading information and prosodic information and outputting the reading information and prosodic information;
Inputting the prosodic information output from the text analysis unit to the prosodic parameter acquisition unit of the client device, and, in the prosodic parameter acquisition unit, generating the physical prosodic parameters necessary for speech synthesis using the prosodic information and outputting the prosodic parameters;
Inputting the reading information output from the text analysis unit and the prosodic parameters output from the prosodic parameter acquisition unit to the local speech unit search unit of the client device, and, in the local speech unit search unit, searching the local speech unit index storage unit using the input reading information and prosodic parameters as keys, extracting local speech unit sequence information corresponding to reading information and prosodic parameters belonging to a similar range of the input reading information and prosodic parameters, and outputting the extracted local speech unit sequence information;
Inputting the local speech unit storage information of the local speech unit sequence information output from the local speech unit search unit to the local speech unit data reading unit of the client device, and, in the local speech unit data reading unit, reading the local speech unit data designated by the local speech unit storage information from the local speech unit database storage unit;
Inputting the local speech unit data read by the local speech unit data reading unit to the speech unit connection unit of the client device, and, in the speech unit connection unit, generating synthesized speech data using the local speech unit data and outputting the synthesized speech data;
In the speech unit information transmitting unit of the client device, local speech unit sequence information output from the local speech unit search unit, reading information output from the text analysis unit, and output from the prosodic parameter acquisition unit Transmitting the prosodic parameters to the speech segment database server device over the network;
In the speech unit information receiving unit of the speech unit database server device, receiving local speech unit sequence information, reading information, and prosodic parameters;
Searching, in the optimal speech unit search unit of the speech unit database server device, the optimal speech unit index storage unit using the received reading information and prosodic parameters as keys, extracting optimal speech segment sequence information corresponding to reading information and prosodic parameters belonging to a similar range of the received reading information and prosodic parameters, and outputting the extracted optimal speech segment sequence information;
Inputting the local speech unit sequence information received by the speech unit information receiving unit and the optimal speech unit sequence information output from the optimal speech unit search unit to the transmission speech unit determination unit of the speech unit database server device, and, in the transmission speech unit determination unit, generating transmission speech unit sequence information by excluding from the optimal speech unit sequence information those entries whose reading information and prosodic parameters are common to the local speech unit sequence information, and outputting the transmission speech unit sequence information;
Inputting the optimal speech unit storage information of the transmission speech unit sequence information to the optimal speech unit data reading unit of the speech unit database server device, and, in the optimal speech unit data reading unit, reading the optimal speech unit data designated by the optimal speech unit storage information from the optimal speech unit database storage unit;
Returning the read optimum speech unit data to the client device through a network in the speech unit data transmission unit of the speech unit database server device;
In the speech unit data receiving unit of the client device, receiving optimal speech unit data;
Additionally storing, in the speech unit database addition unit of the client device, the optimal speech unit data received by the speech unit data receiving unit in the local speech unit database storage unit as new local speech unit data;
In the speech unit index addition unit of the client device, additionally storing local speech unit sequence information corresponding to the new local speech unit data in the local speech unit index storage unit;
A speech synthesis method characterized by executing the above steps.

The speech synthesis method according to any one of claims 8 to 10, further comprising:
Determining, in the speech unit data deletion unit of the client device, whether or not the total size of the local speech unit data stored in the local speech unit database storage unit is equal to or less than a predetermined size;
Deleting, in the speech unit data deletion unit, a part of the local speech unit data stored in the local speech unit database storage unit according to a predetermined priority order when the total size of the local speech unit data stored in the local speech unit database storage unit is not equal to or less than the predetermined size; and
Deleting, in the speech unit sequence information deletion unit of the client device, the local speech unit sequence information corresponding to the local speech unit data deleted by the speech unit data deletion unit from the local speech unit index storage unit.
A speech synthesis method characterized by the above.

A program for causing a computer to function as the client device according to claim 5.

A program for causing a computer to function as the speech segment database server device according to claim 6 or 7.