JP4653572B2

JP4653572B2 - Client terminal, speech synthesis information processing server, client terminal program, speech synthesis information processing program

Info

Publication number: JP4653572B2
Application number: JP2005177720A
Authority: JP
Inventors: 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-06-17
Filing date: 2005-06-17
Publication date: 2011-03-16
Anticipated expiration: 2025-06-17
Also published as: JP2006350091A

Abstract

PROBLEM TO BE SOLVED: To generate good-quality synthesized speech close to a natural voice on a mobile terminal with limited processing capacity.

SOLUTION: Text analysis, prosodic parameter calculation, and speech unit search, which require substantial processing capacity, are performed on the server side. Based on the speech unit information sent from the server, the client terminal reads speech unit data from the speech unit database provided in the client terminal, connects the read speech unit data in sequence to generate synthesized speech data, and outputs the synthesized speech data as synthesized speech.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a speech synthesis method and a speech synthesis information processing method, and to a client terminal and a speech synthesis information processing server that operate using these methods.

In recent years, as the cost of large-capacity storage has fallen, waveform-concatenation corpus-based speech synthesis methods have been proposed (Patent Document 1, Non-Patent Document 1). These methods store several tens of minutes or more of speech data as-is on a large-capacity storage device, select appropriate speech units according to the input text and prosodic information, and connect and modify them to synthesize high-quality speech.
Such methods have made it possible to generate synthesized speech of a quality physically comparable to natural speech. Specifically, speech units that partially or completely match the phoneme sequence of the character string to be synthesized are retrieved from the speech database using a speech unit dictionary structured as, for example, a binary tree. The many candidate units are assigned costs according to an evaluation measure that combines several parameters for evaluating speech unit similarity, an appropriate combination of units is selected by a method such as DP (Dynamic Programming), and the selected units are connected in order to synthesize speech (Non-Patent Document 2). With such a method, however, high-quality synthesized speech cannot be generated if no appropriate speech unit exists in the speech database in the first place.

Therefore, synthesizing a wide range of text with high quality requires a speech database containing a rich variety of speech units, and recent development has moved toward ever larger speech databases in order to increase unit variety and improve synthesis quality. Enlarging the database improves the quality of the synthesized speech, but it also increases the number of stored speech units, so the amount of search processing needed to find appropriate units for the input text among the enormous number in the database has been growing.
Patent Document 1: Japanese Patent No. 2761552
Non-Patent Document 1: M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and A. Syrdal, "Choose the best to modify the least: A new generation concatenative synthesis system", in Proc. Eurospeech'99, 1999, pp. 2291-2294
Non-Patent Document 2: Hirokawa et al., "Waveform Selection Method in Waveform Editing Type Rule Synthesis Method", IEICE Technical Report, SP89-114, pp. 362-369 (1990)

Waveform-concatenation corpus-based speech synthesis has made high-quality synthesized speech possible by enlarging the speech database, but at the cost of more processing than conventional techniques. On the personal computers and workstations that have so far been the main platforms requiring speech synthesis, processing power and storage capacity have grown remarkably in recent years, so the processing load has not been a particular problem.
On the other hand, in the field of small devices such as mobile phones, car navigation systems, and home appliances, a variety of user-friendly software is being developed to promote the adoption and differentiation of devices. Speech synthesis is beginning to be regarded as a necessary technology there as well, because it can convey information by voice in a way that is easy for people to understand.

However, such devices often have very little processing power and memory. Even when they do not, most of the processing power and memory is consumed by the devices' primary functions, such as displaying images and controlling the device, leaving almost none available for speech synthesis.
Some speed-up is possible by simplifying the calculations or by pruning during the speech unit search, but the optimal units may then not be selected and the quality of the synthesized speech may degrade. Search time can also be reduced by keeping the speech unit index and similar data in memory, but this requires a large amount of external working memory, which often cannot be installed. It has therefore been very difficult to run waveform-concatenation corpus-based speech synthesis on these devices.

For this reason, two client-server approaches to outputting high-quality synthesized speech from a mobile device readily come to mind. In the first, the mobile device (the client) sends text, a high-performance workstation or similar machine (the server) generates synthesized speech from the text, and the server transmits the synthesized speech to the client. In the second, the client performs text analysis, which needs little processing and memory, and sends the analysis result to the server; only the speech synthesis unit, which requires a large amount of speech unit data and working memory, runs on the server, which generates synthesized speech from the analysis result and sends it to the client. In both cases, however, the client cannot output any speech until the server has generated and transmitted the synthesized speech for the text and the client has received it. Depending on network speed and throughput, the response time before speech is output can therefore be very long.

Moreover, mobile phones and car navigation systems currently tend to use the mobile phone packet network, which is mostly billed by usage, so network charges become very high. In addition, the delay and speed fluctuation of a packet network are extremely large, which causes problems such as speech being interrupted partway through.

A speech synthesis information processing server is provided that analyzes the text to be synthesized, determines an appropriate speech unit sequence from the text analysis result using a speech unit index, and then transmits at least the speech unit sequence information out of the information needed for speech synthesis.
The client that is to output the synthesized speech is provided with at least a function for receiving the speech unit sequence information, a speech unit database storing speech units, and a speech synthesis unit.
Both the speech synthesis information processing server and the client have a function for sending and receiving data over a network, and the two are connected via the network.

In this networked client-server configuration, the processing from text analysis through determination of the speech unit sequence, which is computationally heavy and requires a large amount of memory, is performed on a server such as a high-performance workstation. The speech unit database is stored in read-only memory or a storage device, and reading from the database and synthesis, which need only light processing and a small amount of memory, are performed on the client, which has little working memory.

A speech synthesis information processing server that determines from text the speech unit sequence needed for synthesis and transmits information on that sequence, and a client having a speech synthesis unit with a speech unit database, are provided on a network. When the client uses speech synthesis, it receives the speech unit sequence information, prosodic parameters, and other information for the text to be synthesized from the server, reads the speech unit data corresponding to the received speech unit sequence information from the speech unit database, and performs synthesis with that data to generate high-quality synthesized speech. With this processing, high-quality, fast speech synthesis can be achieved regardless of processing performance or network type, even when the client's processing performance is limited and the network is limited in speed or usage.

This is because the client needs only enough processing performance and memory to read speech units and synthesize speech from them: it can synthesize speech by reading the units indicated by the speech unit sequence received from the server out of the speech unit database and connecting them.
Because the synthesis itself runs on the client, speech can be output regardless of network congestion.
Furthermore, the speech unit sequence information and prosodic parameters sent over the network are extremely small compared with speech data, so charges can be kept low even on a mobile packet network. When synthesized speech is transmitted, sending the large number of packets takes time, delays are likely to occur during transmission, and any delay interrupts the speech. In the present invention the necessary information can be sent in a short time with few packets, so the scheme is robust to delay and interruptions of the synthesized speech become extremely rare.
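The size difference claimed above can be made concrete with a rough calculation. The sketch below compares the bytes needed to send three speech units as audio against the bytes needed to send only their unit information. The durations (110, 15, and 95 msec) are the ones used in the description's read-out example; the audio format (16 kHz, 16-bit mono PCM) and the 8 bytes of unit information per unit are assumptions made for illustration and are not specified in the patent.

```python
# Rough payload comparison: synthesized audio vs. speech unit information.
# Assumptions (not from the patent): 16 kHz, 16-bit mono PCM audio and
# 8 bytes of unit information (e.g. a unit identifier) per speech unit.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
BYTES_PER_UNIT_INFO = 8


def pcm_bytes(duration_ms: int) -> int:
    """Bytes of uncompressed PCM audio for the given duration."""
    return SAMPLE_RATE_HZ * duration_ms // 1000 * BYTES_PER_SAMPLE


# Durations of the three units in the description's read-out example.
unit_durations_ms = [110, 15, 95]

audio_bytes = sum(pcm_bytes(d) for d in unit_durations_ms)
info_bytes = BYTES_PER_UNIT_INFO * len(unit_durations_ms)

print(f"audio payload: {audio_bytes} bytes")
print(f"unit info payload: {info_bytes} bytes")
print(f"ratio: about {audio_bytes / info_bytes:.0f}x")
```

Under these assumptions the audio payload is a few hundred times larger than the unit information. Compressed audio would narrow the gap, but the unit information would still be far smaller.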

The client terminal operating according to the speech synthesis method of this invention and the speech synthesis information processing server operating according to the speech synthesis information processing method can both be implemented in hardware. The simplest implementation, and the best mode, is to install the client terminal program and the speech synthesis information processing program proposed in this invention on a computer and have the computer function as the client terminal or the speech synthesis information processing server.
When a computer is to function as the client terminal, the program constructs at least a text transmission unit, a speech unit information receiving unit, a speech unit data reading unit, a speech unit connection unit, and a speech output unit on the computer, which then functions as a client terminal for speech synthesis.

When a computer is to function as the speech synthesis information processing server, the program constructs at least a text data receiving unit, a text analysis unit, a prosodic parameter acquisition unit, a speech unit index, a speech unit search unit, and a speech unit information sending unit on the computer, which then functions as the speech synthesis information processing server.

A first embodiment of the present invention is described below.
FIG. 1 is a conceptual diagram of the entire system. Synthesized speech is used at the client terminal 1. When the speech synthesizer or speech synthesis program in the client terminal 1 performs speech synthesis, it connects to the speech synthesis information processing server 3 via the network 2. Examples of the network 2 include a mobile phone packet communication network, ADSL over a telephone line, and FTTH over optical fiber. Although FIG. 1 shows only one client terminal, in practice multiple client terminals 1 may access the speech synthesis information processing server 3 simultaneously.

The speech synthesis information processing server 3 determines the speech unit information corresponding to the text to be synthesized and transmits it to the client terminal 1. The client terminal 1 receives the speech unit sequence information needed for synthesis from the speech synthesis information processing server 3, and the speech synthesizer or speech synthesis program in the client terminal 1 uses the received speech unit information to perform speech synthesis.
The speech synthesizer in the client terminal 1 may be implemented by having a known computer, comprising for example a CPU (Central Processing Unit), RAM, and a hard disk device, execute a predetermined program. Alternatively, as shown in FIG. 2, it may comprise a work memory 40, such as RAM, that stores the program and computation results; an MPU (Micro Processing Unit) 41 that performs computation according to the program and controls each component of the speech synthesizer; a storage memory 42, such as ROM, that stores the speech unit data and other files; a data transmission/reception unit 43 that sends text data to the network 2 and receives data from the network 2; and an audio output unit 44.

In a personal computer or the like, the storage memory 42 may be implemented with a magnetic disk or similar device. The speech synthesis information processing server 3 is implemented by having a known computer, comprising for example a CPU, RAM, and a hard disk device, execute a predetermined program.
FIG. 3 shows an example conceptual configuration of the speech synthesis information processing server 3 in this embodiment.
The speech synthesis information processing server 3 of this embodiment has a text data receiving unit 9, a text analysis unit 10, a prosodic parameter acquisition unit 11, a speech unit search unit 12, a speech unit information sending unit 13, and a speech unit index Index-1.

FIG. 4 is a flowchart of the speech unit information transmission processing in the speech synthesis information processing server 3 of this embodiment.
The transmission of speech unit information in this embodiment is described in detail below with reference to FIGS. 3 and 4. When the speech synthesis information processing server 3 receives the text data sent from the client terminal 1 (step S4-1), the text analysis unit 10 performs text analysis and generates reading information and prosodic information (S4-2). The text analysis consists mainly of morphological analysis and the assignment of readings and accents. Various methods for these have long existed; for example, the methods of Japanese Patent No. 3379643, "Morphological analysis method and recording medium recording a morphological analysis program", or Japanese Patent No. 3518340, "Reading and prosody information setting method and apparatus, and recording medium storing a reading and prosody information setting program", can be used.

Next, the prosodic parameter acquisition unit 11 obtains prosodic parameters from the prosodic information (S4-3). The prosodic parameters include pitch (fundamental frequency) and phoneme duration, and methods for obtaining them also already exist. Pitch can be obtained, for example, by the methods of Japanese Patent No. 3240691, "Pitch pattern generation method, apparatus and program therefor, and recording medium", or Japanese Patent No. 3344487, "Speech fundamental frequency pattern generation device". Phoneme duration can be obtained, for example, by the methods of Kaiki et al., "Control of vowel duration using linguistic information", IEICE Transactions, vol. 75, No. 3, pp. 467-463, 1992, or M. D. Riley, "Tree-based modeling for speech synthesis", in G. Bailly, C. Benoit, and T. R. Sawallis, editors, Talking Machines: Theories, Models, and Designs, pages 265-273, Elsevier, 1992.

Next, the speech unit search unit 12 determines the optimal speech unit sequence from the reading information and prosodic parameters described above, using the speech unit index Index-1 (S4-4). Structures for the speech unit index and methods for determining the speech unit sequence are described in, for example, Japanese Patent No. 3515406, "Speech synthesis method and apparatus"; here they are explained using the conceptual diagram of the speech unit index shown in FIG. 11.
The speech unit index Index-1 is searched using the pair of reading information and prosodic information as the key. From the index, speech units whose phoneme sequence (the reading information) and prosodic parameters fall within the similarity range of that pair are selected, and the speech unit sequence is determined.

The similarity range here is a concept that includes, for example, units whose reading information and prosodic parameters match completely, units that match partially, and units whose similarity as determined by cost is high. For example, if the reading information specifies the phoneme "a" with the preceding phoneme environment "#", and the prosodic parameters specify a mean F0 of 200 ± 10 Hz, the three speech units A1, A2, and A3 shown in FIG. 11 match. In addition, a total cost can be computed between the requested reading information and prosodic parameters and those of each matching speech unit in the speech unit index Index-1, and the unit with the minimum cost selected.
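The lookup in the example above can be sketched as a simple filter over index entries. This is only an illustration: FIG. 11 is not reproduced in this text, so the index entries, their F0 values, and the names `IndexEntry` and `find_candidates` are hypothetical, chosen so that A1, A2, and A3 fall inside the 200 ± 10 Hz range. A real index would use a faster structure than a linear scan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One hypothetical entry of the speech unit index."""
    unit_id: str
    phoneme: str
    prev_phoneme: str   # preceding phoneme environment ("#" = silence/boundary)
    mean_f0_hz: float


# Hypothetical index contents; the actual values in FIG. 11 are not available.
INDEX = [
    IndexEntry("A1", "a", "#", 195.0),
    IndexEntry("A2", "a", "#", 203.0),
    IndexEntry("A3", "a", "#", 209.0),
    IndexEntry("A4", "a", "#", 240.0),   # F0 outside the range
    IndexEntry("A5", "a", "k", 200.0),   # wrong preceding phoneme
    IndexEntry("I1", "i", "#", 201.0),   # wrong phoneme
]


def find_candidates(index, phoneme, prev_phoneme, target_f0_hz, tolerance_hz):
    """Return entries matching the reading information whose mean F0 lies
    within target_f0_hz +/- tolerance_hz."""
    return [
        entry for entry in index
        if entry.phoneme == phoneme
        and entry.prev_phoneme == prev_phoneme
        and abs(entry.mean_f0_hz - target_f0_hz) <= tolerance_hz
    ]


candidates = find_candidates(INDEX, "a", "#", 200.0, 10.0)
print([entry.unit_id for entry in candidates])
```

With the hypothetical entries above, the filter returns A1, A2, and A3, which are the candidates that would then be ranked by total cost.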

The total cost Pnew can be computed, for example, from sub-cost functions as follows (reference: "Waveform selection method considering spectral continuity in waveform-editing synthesis", Proceedings of the Acoustical Society of Japan, 2-6-10, pp. 413-414, September 1990).
Let n be the number of phonemes for which the phoneme sequence of the requested reading information matches the phoneme sequence of the speech unit. The sub-cost function for the reading information is
$$C_1(n) = \frac{1}{e^{n}}$$
The sub-cost function for the mean pitch Vp of the prosodic parameters and the mean pitch Vs of the speech unit is
$$C_2(V_p, V_s) = |V_p - V_s|^2$$
The sub-cost function for the pitch slope Fp of the prosodic parameters and the pitch slope Fs of the speech unit is
$$C_3(F_p, F_s) = |F_p - F_s|^2$$
The sub-cost function for the duration Tp of the prosodic parameters and the duration Ts of the speech unit is
$$C_4(T_p, T_s) = |T_p - T_s|^2$$
The sub-cost function for the amplitude Ap of the prosodic parameters and the amplitude As of the speech unit is
$$C_5(A_p, A_s) = |A_p - A_s|^2$$
When the sub-cost weights ω₁, ω₂, ω₃, ω₄, and ω₅ corresponding to the sub-cost functions C₁, C₂, C₃, C₄, and C₅ are given in advance, the total cost is
$$\Omega = \omega_2 C_2(V_p, V_s) + \omega_3 C_3(F_p, F_s) + \omega_4 C_4(T_p, T_s) + \omega_5 C_5(A_p, A_s)$$
$$P = \omega_1 C_1(n) + (1 - \omega_1)\,\Omega$$
$$P_{\mathrm{new}} = (1 + G)\,P$$
where G is an acoustic measure.
Using the total cost obtained in this way for each speech unit, the units that minimize the cost can readily be selected in sequence, for example by a standard DP (Dynamic Programming) or Viterbi method, and the speech unit sequence determined.
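The total cost defined above translates directly into code. The following is a minimal sketch of the formulas only: the `Prosody` container, the function name `total_cost`, and all numeric values are illustrative assumptions, and the acoustic measure G is simply passed in as a number because the text does not define how it is computed. The final line picks the cheapest candidate for a single target; the DP or Viterbi search over a whole sequence is not shown.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    """Prosodic values of a target (p) or of a stored speech unit (s)."""
    mean_pitch: float    # V
    pitch_slope: float   # F
    duration: float      # T
    amplitude: float     # A


def total_cost(n, target, unit, weights, g=0.0):
    """Pnew = (1 + G) * P, with P = w1*C1(n) + (1 - w1)*Omega.

    n       -- number of matching phonemes between target and unit
    weights -- (w1, w2, w3, w4, w5), the sub-cost weights
    g       -- acoustic measure G (assumed to be supplied by the caller)
    """
    w1, w2, w3, w4, w5 = weights
    c1 = 1.0 / math.exp(n)
    c2 = abs(target.mean_pitch - unit.mean_pitch) ** 2
    c3 = abs(target.pitch_slope - unit.pitch_slope) ** 2
    c4 = abs(target.duration - unit.duration) ** 2
    c5 = abs(target.amplitude - unit.amplitude) ** 2
    omega = w2 * c2 + w3 * c3 + w4 * c4 + w5 * c5
    p = w1 * c1 + (1.0 - w1) * omega
    return (1.0 + g) * p


# Illustrative values only.
target = Prosody(mean_pitch=200.0, pitch_slope=0.0, duration=100.0, amplitude=1.0)
candidates = {
    "A1": Prosody(mean_pitch=195.0, pitch_slope=0.0, duration=100.0, amplitude=1.0),
    "A2": Prosody(mean_pitch=203.0, pitch_slope=0.0, duration=100.0, amplitude=1.0),
}
weights = (0.5, 1.0, 1.0, 1.0, 1.0)

costs = {uid: total_cost(2, target, unit, weights) for uid, unit in candidates.items()}
best_unit = min(costs, key=costs.get)
print(best_unit, costs[best_unit])
```

Because the prosodic sub-costs are squared differences in raw units, the weights in practice also have to absorb the different scales of pitch, duration, and amplitude.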

Next, the speech unit information sending unit 13 transmits the speech unit information for the speech unit sequence to the client terminal 1 (S4-5).
FIG. 5 shows an example conceptual configuration of the client terminal 1 in the above embodiment. The client terminal 1 of this embodiment has a text data transmitting unit 29, a speech unit information receiving unit 30, a speech unit data reading unit 31, a speech unit connection unit 32, a speech output unit 33, and a speech unit database DB-1.
FIG. 6 is a flowchart of the speech synthesis processing in the client terminal 1 of this embodiment.

The speech synthesis processing in this embodiment is described in detail below with reference to FIGS. 5 and 6.
First, the client terminal 1 sends the text data for the synthesized speech it wants to generate to the speech synthesis information processing server 3 through the network 2 (S6-1).
The speech synthesis information processing server 3 analyzes the received text data and returns speech unit information to the client terminal 1. The client terminal 1 receives this speech unit information over the network 2 at the speech unit information receiving unit 30 (S6-2).

Next, the speech unit data reading unit 31 reads speech unit data from the speech unit database DB-1 based on the received speech unit information (S6-3).
The speech unit index Index-1, which resides on the speech synthesis information processing server 3 and holds the speech unit information, and the speech unit database DB-1 provided in the client terminal 1 as shown in FIG. 12 are physically separate but logically associated with each other, so the corresponding speech data can readily be read out from the speech unit information.
For example, suppose the speech unit information lists speech unit A2, speech unit R1, speech unit I2, and so on. Based on the speech unit storage information for A2, the speech data in file number 8 starting at 10 msec with a duration of 110 msec is read as its speech unit data; for R1, the data in file number 23 starting at 5225 msec with a duration of 15 msec; and for I2, the data in file number 23 starting at 5240 msec with a duration of 95 msec. The speech unit data is read out in sequence in this way.
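The read-out step S6-3 amounts to slicing stored recordings by file number, start time, and duration. The sketch below illustrates this with the three units from the example. The in-memory `database` dictionary, the 16 kHz sample rate, and the helper names are assumptions for illustration; the patent does not specify the storage layout of DB-1, and the silent (all-zero) recordings here merely stand in for real speech data.

```python
SAMPLE_RATE_HZ = 16_000  # assumed; not specified in the patent


def ms_to_samples(ms: int) -> int:
    """Convert a time in milliseconds to a sample count."""
    return ms * SAMPLE_RATE_HZ // 1000


def read_unit(database, file_no, start_ms, length_ms):
    """Return the samples of one speech unit from the stored recording."""
    samples = database[file_no]
    start = ms_to_samples(start_ms)
    end = start + ms_to_samples(length_ms)
    if end > len(samples):
        raise ValueError(f"unit extends past the end of file {file_no}")
    return samples[start:end]


# Stand-in for DB-1: file number -> list of samples (silence, for illustration).
database = {
    8: [0] * ms_to_samples(1000),
    23: [0] * ms_to_samples(6000),
}

# Speech unit information from the example: (unit, file number, start msec, length msec).
unit_info = [
    ("A2", 8, 10, 110),
    ("R1", 23, 5225, 15),
    ("I2", 23, 5240, 95),
]

units = [read_unit(database, file_no, start, length)
         for _, file_no, start, length in unit_info]
print([len(u) for u in units])
```

Note that R1 ends at 5240 msec in file 23, exactly where I2 begins, so these two units are adjacent in the original recording.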

Next, the speech unit connection unit 32 sequentially connects the read speech units to generate synthesized speech data (S6-4). The speech unit data may simply be connected in temporal order, but it is also easy to interpolate between different speech units in time or frequency (reference: Japanese Laid-Open Patent Publication No. 07-072897, "Speech Synthesis Method and Device"). Finally, the connected speech unit data is output as synthesized speech by the speech output unit 33 (S6-5).
In the above description, the text data is transmitted from the client terminal 1 to the speech synthesis information processing server 3, but this is not essential. For example, a server holding a large amount of text data may be provided on the network, and the client terminal 1 may instruct that server to send the desired text data to the speech synthesis information processing server 3, so that the target text data is delivered to the speech synthesis information processing server 3.
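The connection step (S6-4) described above can be sketched as below. The function `connect_units` and its `overlap` parameter are hypothetical. With `overlap=0` it performs the simple temporal concatenation; with a positive overlap it applies a linear cross-fade, which is used here only as a simple stand-in for smoothing in the time domain and is not the method of the cited publication.

```python
def connect_units(units, overlap=0):
    """Join units in order; if overlap > 0, cross-fade that many samples.

    The linear cross-fade is an illustrative substitute for the smoothing
    technique in the cited reference, which is not reproduced here.
    """
    out = []
    for unit in units:
        n = min(overlap, len(out), len(unit))
        for i in range(n):
            w = (i + 1) / (n + 1)            # weight of the incoming unit
            j = len(out) - n + i             # position in the existing tail
            out[j] = (1 - w) * out[j] + w * unit[i]
        out.extend(unit[n:])                 # append the non-overlapping rest
    return out

plain = connect_units([[1.0, 1.0], [2.0, 2.0]])
faded = connect_units([[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]], overlap=1)
```

In the second call the last sample of the first unit and the first sample of the second unit are blended into a single sample, so the output is one sample shorter than the plain concatenation.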

FIG. 7 illustrates a conceptual configuration of the speech synthesis information processing server 3' proposed in the second embodiment of the present invention.
The speech synthesis information processing server 3' of this embodiment comprises a text data receiving unit 9, a text analysis unit 10, a prosodic parameter acquisition unit 11, a speech unit search unit 12, a speech unit information / prosodic parameter transmission unit 14, and a speech unit index Index-1.
FIG. 8 is a flowchart for explaining the speech unit information transmission processing in the speech synthesis information processing server 3' proposed in the second embodiment.

The details of the speech unit information transmission in the second embodiment are described below with reference to FIGS. 7 and 8. In the speech synthesis information processing server 3', the configuration and processing of the text analysis unit 10, the prosodic parameter acquisition unit 11, and the speech unit search unit 12, from the input of the text information transmitted from the client terminal until the speech unit information is obtained, can be carried out in the same manner as in the first embodiment.
In the second embodiment, the following processing is performed so that the synthesized speech quality at the client terminal can be improved.

The speech unit information / prosodic parameter transmission unit 14 transmits to the client terminal, in addition to the speech unit information determined by the speech unit search unit 12, the prosodic parameters obtained by the prosodic parameter acquisition unit 11.
FIG. 9 illustrates a conceptual configuration of the client terminal 1' corresponding to the speech synthesis information processing server 3'.
The speech synthesizer in the client terminal 1' of this embodiment has a text data transmission unit 29, a speech unit information / prosodic parameter receiving unit 34, a speech unit data reading unit 31, a speech unit connection / deformation unit 35, a speech output unit 33, and a speech unit database DB-1.
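One way to picture what the transmission unit 14 sends and the receiving unit 34 receives is the sketch below. The JSON layout, the function names `build_response` and `parse_response`, and every field name are assumptions made for illustration; the source does not specify a wire format, and the prosodic values shown are invented placeholders.

```python
import json

def build_response(unit_refs, prosody):
    """Server side: bundle unit references with prosodic parameters.

    The message layout is hypothetical; the source defines no wire format.
    """
    return json.dumps({"units": unit_refs, "prosody": prosody})

def parse_response(payload):
    """Client side: recover the unit references and prosodic parameters."""
    message = json.loads(payload)
    return message["units"], message["prosody"]

# Unit references follow the file number / start / duration example in the text.
unit_refs = [
    {"unit": "A2", "file": 8, "start_ms": 10, "duration_ms": 110},
    {"unit": "R1", "file": 23, "start_ms": 5225, "duration_ms": 15},
]
# Illustrative targets only; the source gives no concrete prosodic values.
prosody = [
    {"f0_hz": 220.0, "duration_ms": 120},
    {"f0_hz": 210.0, "duration_ms": 20},
]

payload = build_response(unit_refs, prosody)
received_units, received_prosody = parse_response(payload)
```

The message carries only references and a few numbers per unit, so it stays small compared with transmitting waveform data, which is consistent with the goal of keeping heavy processing and large data off the mobile terminal.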

FIG. 10 is a flowchart for explaining the speech synthesis processing in the client terminal 1' of this embodiment. The details of the speech synthesis processing in this embodiment are described below with reference to this figure.
In the second embodiment, as in the case of FIG. 6, the client terminal 1' transmits the target text data from the text data transmission unit 29 to the speech synthesis information processing server 3' (S10-1). The speech synthesis information processing server 3' analyzes the received text data and returns the speech unit information and prosodic parameters corresponding to that text data to the client terminal 1'. The client terminal 1' receives, through the network 2, the speech unit information and prosodic parameters transmitted from the speech synthesis information processing server 3' at the speech unit information / prosodic parameter receiving unit 34 (S10-2).

Next, the processing in the speech unit data reading unit 31 can be carried out in the same manner as the processing of the speech unit data reading unit 31 in the first embodiment (S10-3).
Next, the speech unit connection / deformation unit 35 sequentially connects the speech units read by the speech unit data reading unit 31 to generate synthesized speech data (S10-4). When connecting the speech unit data in temporal order, it interpolates between different speech units in time or frequency (reference: Japanese Laid-Open Patent Publication No. 07-072897, "Speech Synthesis Method and Device") and, before connecting, applies signal processing to the speech unit data based on the received prosodic information (reference: Y. Stylianou, "Applying the Harmonic Plus Noise Model in Concatenative Speech Synthesis," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 1, pp. 21-29, January 2001).

Although this increases the amount of processing somewhat, the signal processing makes it possible to control the duration and the fundamental frequency F0 of each speech unit, unlike the case of simply interpolating and connecting. Synthesized speech that is more accurate prosodically can therefore be output, and the overall quality of the synthesized speech improves.
Finally, the connected speech unit data is output as synthesized speech by the speech output unit 33 (S10-5).
In the second embodiment as well, the text data has been described as being sent from the client terminal 1' to the speech synthesis information processing server 3', but this is not essential; the text data may be sent to the speech synthesis information processing server 3' from another server in accordance with an instruction from the client terminal 1'.
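The control of duration and F0 in step S10-4 amounts to deciding, per unit, how far the stored unit must be stretched and how far its pitch must be shifted to meet the received prosodic parameters. The sketch below computes only those two ratios; the function name `modification_factors` is hypothetical, and it assumes the original F0 of each stored unit is known to the client, which the source does not state. The actual signal processing (for example, the harmonic plus noise model in the cited paper) is not shown.

```python
def modification_factors(unit_duration_ms, unit_f0_hz,
                         target_duration_ms, target_f0_hz):
    """Return (time_scale, pitch_scale) needed to match the target prosody.

    A factor above 1.0 means the unit must be lengthened or raised in pitch;
    below 1.0 means shortened or lowered. Hypothetical helper for illustration.
    """
    if unit_duration_ms <= 0 or unit_f0_hz <= 0:
        raise ValueError("unit duration and F0 must be positive")
    time_scale = target_duration_ms / unit_duration_ms
    pitch_scale = target_f0_hz / unit_f0_hz
    return time_scale, pitch_scale

# A 110 msec unit stored at an assumed 200 Hz, with illustrative targets
# of 132 msec and 220 Hz.
time_scale, pitch_scale = modification_factors(110, 200.0, 132, 220.0)
```

When both factors are close to 1.0 the selected unit already fits the target prosody, so little modification is required; larger deviations imply heavier processing on the terminal.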

The client terminals 1 and 1' and the speech synthesis information processing servers 3 and 3' according to the present invention described above can each be realized by installing the client terminal program and the speech synthesis information processing program proposed in the present invention on a computer and causing the computer to execute the program.
The client terminal program and the speech synthesis information processing program proposed in the present invention are written in a program language that a computer can interpret, and are recorded on a computer-readable recording medium such as a magnetic disk or a CD-ROM. They are installed on the computer from this recording medium or through a communication line. The installed program is interpreted and executed by a CPU or MPU provided in the computer.

The present invention can be used in fields such as voice guidance systems using portable terminals, automatic reservation systems, and voice guidance systems in car navigation.

FIG. 1 is a block diagram for explaining the outline of the speech synthesis method according to the present invention.
FIG. 2 is a block diagram for explaining the overall configuration of the client terminal used in the speech synthesis method according to the present invention.
FIG. 3 is a block diagram for explaining the configuration of the speech synthesis information processing server used in the speech synthesis method according to the present invention.
FIG. 4 is a flowchart for explaining the operation of the speech synthesis information processing server shown in FIG. 3.
FIG. 5 is a block diagram for explaining the configuration of the speech synthesis means built in the client terminal shown in FIG. 2.
FIG. 6 is a flowchart for explaining the operation of the speech synthesis means built in the client terminal shown in FIG. 5.
FIG. 7 is a block diagram for explaining the configuration of the speech synthesis information processing server proposed in the second embodiment of the present invention.
FIG. 8 is a flowchart for explaining the operation of the speech synthesis information processing server shown in FIG. 7.
FIG. 9 is a block diagram for explaining the configuration of the client terminal that operates in correspondence with the speech synthesis information processing server shown in FIG. 7.
FIG. 10 is a flowchart for explaining the operation of the client terminal shown in FIG. 9.
FIG. 11 is a diagram for explaining the outline of the index provided in the speech synthesis information processing server according to the present invention.
FIG. 12 is a diagram for explaining the outline of the speech unit database provided in the client terminal according to the present invention.

Explanation of symbols

1, 1' client terminal
2 network
3, 3' speech synthesis information processing server
9 text data receiving unit
10 text analysis unit
11 prosodic parameter acquisition unit
12 speech unit search unit
13 speech unit information transmission unit
14 speech unit information / prosodic parameter transmission unit
29 text data transmission unit
30 speech unit information receiving unit
31 speech unit data reading unit
32 speech unit connection unit
33 speech output unit
34 speech unit information / prosodic parameter receiving unit
35 speech unit connection / deformation unit
40 work memory
41 MPU
42 storage memory
43 data transmission / reception unit
44 speech output unit
Index-1 speech unit index
DB-1 speech unit database

Claims

A client terminal comprising:
speech unit information receiving means for receiving speech unit information, which identifies speech unit data, and a prosodic parameter, which is a physical parameter that determines prosody, both sent to the client terminal;
speech unit data reading means for reading out speech unit data from a speech unit database based on the received speech unit information;
speech unit connection means for transforming the read speech unit data according to the prosodic parameter and then connecting the transformed data in sequence to generate concatenated synthesized speech data; and
speech output means for sequentially outputting the generated concatenated synthesized speech data as synthesized speech.

A speech synthesis information processing server comprising:
text data receiving means for receiving text data sent to the server;
text analysis means for performing text analysis on the received text data to obtain reading information and prosodic information;
prosodic parameter acquisition means for acquiring, from the prosodic information, a prosodic parameter that is a physical parameter determining the prosody required for speech synthesis;
speech unit search means for acquiring, using a speech unit index, speech unit information that identifies speech unit data based on the reading information and the prosodic parameter; and
speech unit information transmitting means for adding the prosodic parameter to the speech unit information and sending them out to a network.

A client terminal program that is written in a computer-readable program language and causes the computer to function as at least the client terminal according to claim 1.

A speech synthesis information processing program written in a computer-readable program language and causing the computer to function as at least the speech synthesis information processing server according to claim 2.

A computer-readable recording medium on which at least one of the client terminal program according to claim 3 and the speech synthesis information processing program according to claim 4 is recorded.