JP2006010845A

JP2006010845A - Synthesized speech uttering device and program thereof, and data set generating device for speech synthesis, and program thereof

Info

Publication number: JP2006010845A
Application number: JP2004185120A
Authority: JP
Inventors: Reiko Tako; 礼子田高; Norifumi Oide; 訓史大出; Hiroyuki Segi; 寛之世木; Atsushi Imai; 篤今井; Toru Tsugi; 徹都木
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2004-06-23
Filing date: 2004-06-23
Publication date: 2006-01-12

Abstract

PROBLEM TO BE SOLVED: To provide a data receiving device that utters synthesized speech with a variety of speech expressions by easily switching speech expressions, such as the voice quality and intonation of the synthesized speech, according to the user's preference.

SOLUTION: A data receiving device 20 is equipped with a speech synthesizing means 25. The speech synthesizing means 25 acquires kind information for dividing elementary speech unit groups by voice quality, elementary speech unit information associated with the elementary speech units of each elementary speech unit group, and control information defining how to deform the waveform of the synthesized speech generated by connecting the elementary speech units associated on the basis of the kind information and the elementary speech unit information. It uses the kind information to specify one of the elementary speech unit groups stored by voice quality in a speech database 251, extracts elementary speech units from the specified group according to the elementary speech unit information, generates a waveform by connecting the extracted elementary speech units, and deforms the generated waveform according to the control information to utter the synthesized speech.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a synthesized speech utterance device and a program therefor, and to a speech synthesis data set generation device and a program therefor, which allow synthesized speech to be uttered with a variety of speech expressions by easily switching expressions such as the voice quality and intonation of the synthesized speech according to the user's preference.

Recently, with the start of terrestrial digital broadcasting, various data broadcasting technologies have been developed in which the transmitting device sends content as text and the receiving device synthesizes speech from the received text and utters the content as synthesized speech.
In one known conventional data broadcasting system, various rules for synthesizing speech from the text sent by the transmitting device are registered in a database on the receiving device side, and the receiving device utters an arbitrary synthesized voice according to a selection made on the transmitting device side or by the user on the receiving device side.

For example, a technique (data broadcast display device) is known in which the receiving device is provided with a male voice phoneme dictionary, which stores symbol strings and male voice phoneme segments, and a female voice phoneme dictionary, which stores symbol strings and female voice phoneme segments. The text sent from the transmitting device is output as male or female synthesized speech by referring to the corresponding dictionary, and the text is also displayed (see, for example, Patent Document 1).
In the data broadcast display device disclosed in Patent Document 1, the receiving device holds databases for the male and female voice phoneme dictionaries, and the voice quality of the synthesized speech can be changed by choosing male or female, which switches the database referred to during speech synthesis. The choice can either be set in advance on the transmitting device side or be set by the user on the receiving device side each time.

A technique (data broadcasting system) is also known in which the transmitting device (data transmission device) embeds additional speech information, used for speech synthesis on the receiving device side, in the program data and transmits it as a broadcast wave, and the receiving device identifies the additional speech information embedded in the program data of the received broadcast wave and performs speech synthesis based on it (see, for example, Patent Document 2).
In the data broadcasting system disclosed in Patent Document 2, the receiving device can specify speech synthesis parameters, the reading speed (speech rate, five levels), and the reading volume (five levels), but the speaker of the synthesized speech is limited to a preset voice quality.

Patent Document 1: Japanese Patent Laid-Open No. 2001-16553 (paragraphs 0021 to 0029, FIG. 1)
Patent Document 2: Japanese Patent Laid-Open No. 2002-280981 (paragraph 0060, FIG. 19)

However, in the data broadcast display device disclosed in Patent Document 1 and similar documents, the male voice phoneme dictionary and the female voice phoneme dictionary are limited to predetermined ones, so synthesized speech could only be uttered as a male or female voice with a fixed voice quality and intonation. The device therefore let the user choose between a male and a female voice, but because the voice quality and intonation of both were fixed, neither came close to a speech expression that matched the user's preference.

In addition, in the data broadcast display device disclosed in Patent Document 1 and similar documents, the speech databases of the male and female voice dictionaries were in many cases fixed in advance and could not be changed. Even where the contents of the speech database could be updated, the user had to obtain a semiconductor device storing the updated contents in some way, such as by going to a store, and replace the device personally, or, if a CD (Compact Disc)-ROM (Read Only Memory) drive was provided, obtain a CD-ROM storing the updated contents and have the device read it. This was inconvenient for the user.

In the data broadcasting system disclosed in Patent Document 2 and similar documents, the speech database is limited to a single speaker, but because the system has speech synthesis parameter setting means, the speed and volume of the synthesized speech could be adjusted. The system could therefore adjust the speed and volume of the synthesized speech according to the user's preference, but it could not utter synthesized speech with a different voice quality or intonation, and so could not produce a speech expression that matched the user's preference.

Furthermore, even a technique that simply combines the data broadcast display device of Patent Document 1 with the data broadcasting system of Patent Document 2 can only select between a male and a female voice of fixed voice quality and intonation and adjust the speed and volume of that voice. It therefore could not utter synthesized speech with a voice quality and intonation matching the user's preference.

The present invention has been made to solve the problems described above. Its object is to provide a synthesized speech utterance device and a program therefor, and a speech synthesis data set generation device and a program therefor, that can utter synthesized speech with a variety of speech expressions by easily switching speech expressions such as the voice quality and intonation of the synthesized speech according to the user's preference.

The present invention was devised to achieve this object. First, the synthesized speech utterance device according to claim 1 acquires a speech synthesis data set from a speech synthesis data set generation device and utters it as synthesized speech. The speech synthesis data set consists of: type information for dividing speech unit groups by voice quality, where each speech unit group is made up of a plurality of speech units of the same voice quality that form the minimum-unit waveforms uttered as synthesized speech; unit information associated with the speech units of each speech unit group; and at least one kind of control information defining how to deform the waveform of the synthesized speech generated by connecting the speech units associated on the basis of the type information and the unit information. To this end, the device comprises a speech database, speech synthesis data set acquisition means, speech synthesis control information setting means, and speech synthesis means.

With this configuration, the synthesized speech utterance device stores speech unit groups for each voice quality in the speech database. Here, a speech unit group is a collection of speech units of the same voice quality stored in the speech database, and the groups are classified and defined by type of voice quality. For example, as described later, the speech database stores speech unit groups defined by differences in content or speaker so that they can be distinguished by type.
The synthesized speech utterance device then uses the speech synthesis data set acquisition means to acquire the speech synthesis data set from the speech synthesis data set generation device.
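As an illustration only, a speech database that keeps one speech unit group per voice quality could be sketched as a two-level mapping. The class name `SpeechDatabase`, the voice type labels, the unit ids, and the sample values below are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SpeechDatabase:
    """Hypothetical speech database: one speech unit group per voice quality."""
    # voice type (e.g. "female") -> {unit id -> waveform samples}
    groups: Dict[str, Dict[str, List[float]]] = field(default_factory=dict)

    def store(self, voice_type: str, unit_id: str, samples: List[float]) -> None:
        # Create the group for this voice quality on first use.
        self.groups.setdefault(voice_type, {})[unit_id] = list(samples)

    def group(self, voice_type: str) -> Dict[str, List[float]]:
        # Return the speech unit group for one voice quality.
        return self.groups[voice_type]


db = SpeechDatabase()
db.store("male", "ka", [0.0, 0.2, 0.1])      # illustrative samples only
db.store("female", "ka", [0.0, 0.4, 0.3])
print(sorted(db.groups))
```

Keying the outer mapping by voice type means the same unit id can exist once per voice quality, which is what lets a type selection switch the whole group at once.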

Next, the speech synthesis control information setting means receives the speech synthesis data set acquired by the speech synthesis data set acquisition means, selects type information from the data set to specify the speech unit group to extract from, selects control information from the data set to specify how to deform the waveform, and sets the selected type information and control information, together with the unit information, as speech synthesis data.

This setting is made by displaying a screen that asks which type information and control information to select and having the user specify them through an interface. For example, when unit information is provided for both a male voice and a female voice and the user specifies the female voice through the interface, the female speech unit group is set, so that the speech units designated by the unit information are selected from the female speech unit group in the speech database. Examples of control information include functions, parameters, and control information numbers.
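The selection step can be pictured with a small sketch. It assumes, purely for illustration, that the data set carries unit information per voice type and control information keyed by a control information number; the names `SynthesisDataSet`, `SynthesisData`, and `set_synthesis_data` and the `rate` parameter are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SynthesisDataSet:
    """Hypothetical container for a received speech synthesis data set."""
    unit_info: Dict[str, List[str]]            # voice type -> unit ids, in order
    control_info: Dict[int, Dict[str, float]]  # control number -> parameters


@dataclass
class SynthesisData:
    """What the setting step hands on to the synthesis step."""
    voice_type: str
    unit_ids: List[str]
    control: Dict[str, float]


def set_synthesis_data(data_set: SynthesisDataSet, voice_type: str,
                       control_number: int) -> SynthesisData:
    # Reject choices that the data set does not actually offer.
    if voice_type not in data_set.unit_info:
        raise KeyError(f"no unit information for voice type {voice_type!r}")
    if control_number not in data_set.control_info:
        raise KeyError(f"unknown control information number {control_number}")
    return SynthesisData(
        voice_type=voice_type,
        unit_ids=list(data_set.unit_info[voice_type]),
        control=dict(data_set.control_info[control_number]),
    )


data_set = SynthesisDataSet(
    unit_info={"male": ["m_ko", "m_n"], "female": ["f_ko", "f_n"]},
    control_info={1: {"rate": 1.0}, 2: {"rate": 1.5}},
)
# The user picks the female voice and control information number 2.
chosen = set_synthesis_data(data_set, "female", 2)
print(chosen)
```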

After that, the speech synthesis means receives the speech synthesis data from the speech synthesis control information setting means, extracts speech units according to the unit information from the speech unit group of the speech database specified by the type information, connects the extracted speech units to generate the waveform of the synthesized speech, deforms the waveform in the manner specified by the selected control information, and utters the deformed waveform as synthesized speech.
For example, the speech synthesis means selects speech units from the female speech unit group and connects them to generate a waveform. If fast speech is set as the control information, it processes the generated waveform so as to shorten its period, and the synthesized speech is uttered in a fast-talking female voice. The speech synthesis means sends the generated waveform to speech output means, which utters it as synthesized speech. An example of the speech output means is a loudspeaker that outputs, as sound, the waveform that the speech synthesis means outputs as an electrical signal.
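The connect-then-deform flow can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the patent's method: waveforms are plain lists of samples, and fast speech is approximated by naive linear-interpolation resampling, which also raises the pitch. The function names `concatenate` and `change_rate` are hypothetical.

```python
from typing import List


def concatenate(units: List[List[float]]) -> List[float]:
    """Connect speech units end to end to form one waveform."""
    wave: List[float] = []
    for unit in units:
        wave.extend(unit)
    return wave


def change_rate(wave: List[float], rate: float) -> List[float]:
    """Naively shorten (rate > 1) or lengthen (rate < 1) a waveform.

    Assumption: simple resampling by linear interpolation. A real
    synthesizer would normally use a pitch-preserving method instead.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not wave:
        return []
    n_out = max(1, round(len(wave) / rate))
    out = []
    for i in range(n_out):
        pos = i * rate                   # position in the input waveform
        lo = int(pos)
        hi = min(lo + 1, len(wave) - 1)  # clamp at the last sample
        frac = pos - lo
        out.append(wave[lo] * (1 - frac) + wave[hi] * frac)
    return out


wave = concatenate([[0.0, 1.0], [2.0, 3.0]])  # two toy "speech units"
fast = change_rate(wave, 2.0)                 # fast speech: half the length
print(len(wave), len(fast))
```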

As described above, the type information divides the speech unit groups by voice quality, and the speech database stores a speech unit group for each voice quality. Because the user can choose a voice quality type according to preference and have speech uttered in it, it is preferable that at least two voice quality types be available for the user of the synthesized speech utterance device to choose from. However, there may be only one voice quality type.

The synthesized speech utterance device according to claim 2 is the device according to claim 1 in which the speech synthesis control information setting means comprises at least one of content item type setting means and speaker setting means.

With this configuration, the content item type setting means sets the type of speech unit group stored in the speech database according to the content item type included in the type information, and the speaker setting means sets the type of speech unit group stored in the speech database according to the speaker included in the type information. This makes it possible to utter synthesized speech whose voice quality differs for each content item, and also for each speaker. For example, the speaker setting means can set the voice quality to a male voice, a female voice, a specific actor or actress, or a specific character from an animation or the like, and have speech uttered in that voice.
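One way to picture the two setting means is as a lookup that resolves a voice type from the content item type or from an explicit speaker choice. The sketch below is hypothetical: the function name, the labels, and in particular the rule that a speaker setting takes precedence over a content item setting are assumptions, since the text does not say how the two interact.

```python
from typing import Dict, Optional


def resolve_voice_type(content_item: str,
                       content_settings: Dict[str, str],
                       speaker_setting: Optional[str] = None,
                       default: str = "male") -> str:
    """Pick the speech unit group (voice type) to use for one content item."""
    # Assumption: an explicit speaker choice overrides the per-content choice.
    if speaker_setting is not None:
        return speaker_setting
    # Otherwise use the voice set for this content item type, if any.
    return content_settings.get(content_item, default)


# Hypothetical user settings: one voice per content item type.
settings = {"news": "male", "weather": "female"}
print(resolve_voice_type("weather", settings))
print(resolve_voice_type("news", settings, speaker_setting="anime_character"))
```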

The synthesized speech utterance device according to claim 3 is the device according to claim 1 or claim 2 in which the speech synthesis control information setting means comprises at least one of speech rate setting means, dynamic feature control information setting means, and static feature control information setting means.

With this configuration, the speech rate setting means of the speech synthesis control information setting means sets speech rate control information, which is control information for changing the speech rate of the waveform generated by connecting speech units.
The dynamic feature control information setting means of the speech synthesis control information setting means sets dynamic feature control information, which is control information for varying dynamic features, that is, the way the waveform generated by connecting speech units changes as time elapses. Examples of dynamic features include the fundamental frequency and spectrum of the waveform, and the pauses ("ma") and sense of tempo in speech.

The static feature control information setting means of the speech synthesis control information setting means sets static feature control information, which is control information for varying static features, that is, the shape of the waveform generated by connecting speech units over a fixed interval at a given time. Examples of static features include the average fundamental frequency and the overall shape of the spectral envelope.
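To make the distinction concrete, the sketch below works on a fundamental frequency (F0) contour given as a list of values in Hz. It is an illustrative assumption, not taken from the patent: shifting the contour changes its average (a static feature) without changing its movement, while scaling it about its mean changes how strongly it moves over time (a dynamic feature). Both function names are hypothetical.

```python
from typing import List


def shift_mean_f0(f0: List[float], target_mean: float) -> List[float]:
    """Static feature control: move the average F0 to target_mean."""
    mean = sum(f0) / len(f0)
    delta = target_mean - mean
    return [value + delta for value in f0]


def scale_intonation(f0: List[float], factor: float) -> List[float]:
    """Dynamic feature control: widen (factor > 1) or flatten (factor < 1)
    the movement of F0 over time while keeping its average."""
    mean = sum(f0) / len(f0)
    return [mean + factor * (value - mean) for value in f0]


contour = [100.0, 120.0, 140.0]            # toy F0 contour in Hz
higher = shift_mean_f0(contour, 150.0)     # same movement, higher voice
flatter = scale_intonation(contour, 0.5)   # same average, less intonation
print(higher, flatter)
```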

The synthesized speech utterance device according to claim 4 is the device according to any one of claims 1 to 3 further comprising update information acquisition means and speech database update means for updating the speech unit groups in the speech database.

With this configuration, the update information acquisition means acquires, from the speech synthesis data set generation device, update information for updating the speech unit groups stored in the speech database. The speech database update means then updates the speech database according to the update information acquired by the update information acquisition means.
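A minimal sketch of the update step follows. The format of the update information is not specified in this passage, so the sketch assumes a list of `(voice type, unit id, samples)` entries in which `samples=None` means deletion; that format and the function name `apply_update` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Database = Dict[str, Dict[str, List[float]]]
UpdateInfo = List[Tuple[str, str, Optional[List[float]]]]


def apply_update(database: Database, update_info: UpdateInfo) -> Database:
    """Apply received update information to the speech database in place."""
    for voice_type, unit_id, samples in update_info:
        group = database.setdefault(voice_type, {})  # new voice types allowed
        if samples is None:
            group.pop(unit_id, None)                 # assumed: None deletes
        else:
            group[unit_id] = list(samples)           # add or overwrite a unit
    return database


database = {"female": {"ka": [0.1, 0.2]}}
update = [
    ("female", "ka", [0.3, 0.4]),   # replace an existing unit
    ("child", "a", [0.5]),          # add a unit group for a new voice quality
]
apply_update(database, update)
print(database)
```

Receiving updates this way avoids the physical media exchange described earlier as inconvenient.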

The synthesized speech utterance program according to claim 5 causes a computer to function as speech synthesis data set acquisition means, speech synthesis control information setting means, and speech synthesis means, in order to acquire a speech synthesis data set from a speech synthesis data set generation device and utter it as synthesized speech. The speech synthesis data set consists of: type information for dividing speech unit groups by voice quality, where each speech unit group is made up of a plurality of speech units of the same voice quality that form the minimum-unit waveforms uttered as synthesized speech; unit information associated with the speech units of each speech unit group; and at least one kind of control information defining how to deform the waveform of the synthesized speech generated by connecting the speech units associated on the basis of the type information and the unit information.

With this configuration, the synthesized speech utterance program uses the speech synthesis data set acquisition means to acquire the speech synthesis data set from the speech synthesis data set generation device.
The speech synthesis control information setting means then receives the speech synthesis data set acquired by the speech synthesis data set acquisition means, selects type information from the data set to specify the speech unit group to extract from in the speech database, which stores a speech unit group for each voice quality, selects control information from the data set to specify how to deform the waveform, and sets the selected type information and control information, together with the unit information, as speech synthesis data.

After that, the speech synthesis means receives the speech synthesis data from the speech synthesis control information setting means, extracts speech units according to the unit information from the speech unit group of the speech database specified by the type information, connects the extracted speech units to generate the waveform of the synthesized speech, deforms the waveform in the manner specified by the selected control information, and utters the deformed waveform as synthesized speech.

The speech synthesis data set generation device according to claim 6 generates a speech synthesis data set and provides it to a synthesized speech utterance device that utters synthesized speech based on that data set. The speech synthesis data set consists of: type information for dividing speech unit groups by voice quality, where each speech unit group is made up of a plurality of speech units of the same voice quality that form the minimum-unit waveforms uttered as synthesized speech; unit information associated with the speech units of each speech unit group; and at least one kind of control information defining how to deform the waveform of the synthesized speech generated by connecting the speech units associated on the basis of the type information and the unit information. To this end, the device comprises a speech database, material text acquisition means, reading information acquisition means, unit selection means, control information creation means, and speech synthesis data set providing means.

With this configuration, the speech synthesis data set generation device stores a speech unit group for each voice quality in the speech database.
The material text acquisition means acquires the material text to be uttered as synthesized speech. This text is the character information that the synthesized speech utterance device will utter.
Next, the reading information acquisition means acquires reading information for the text acquired by the material text acquisition means.

Next, the unit selection means selects from the speech database, for each voice quality type, the speech units for the reading information acquired by the reading information acquisition means, and uses the selection as the unit information. The control information creation means creates the type information and the control information. It does so by displaying, on the display means of the speech synthesis data set generation device, a screen prompting for type information and control information, and having an operator enter them through input means.
After that, the speech synthesis data set providing means provides the unit information selected by the unit selection means, together with the type information and control information created by the control information creation means, to the synthesized speech utterance device as the speech synthesis data set.
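The generation flow (text, then reading, then unit selection per voice quality, then the assembled data set) can be sketched as below. Everything here is a hypothetical stand-in: the toy reading dictionary replaces real text analysis, unit ids are simply the reading symbols, and the dictionary layout of the data set is an assumption.

```python
from typing import Dict, List

Database = Dict[str, Dict[str, List[float]]]


def get_reading(text: str, readings: Dict[str, List[str]]) -> List[str]:
    """Look up reading information for the material text (toy dictionary)."""
    return list(readings[text])


def select_units(reading: List[str], database: Database) -> Dict[str, List[str]]:
    """For each voice quality type, select the units that realize the reading."""
    unit_info = {}
    for voice_type, group in database.items():
        missing = [unit for unit in reading if unit not in group]
        if missing:
            raise KeyError(f"{voice_type} group lacks units {missing}")
        unit_info[voice_type] = list(reading)
    return unit_info


def build_data_set(text, readings, database, type_info, control_info):
    """Assemble type, unit, and control information into one data set."""
    reading = get_reading(text, readings)
    return {
        "type_info": type_info,        # entered by the operator
        "unit_info": select_units(reading, database),
        "control_info": control_info,  # entered by the operator
    }


readings = {"tenki": ["te", "n", "ki"]}  # hypothetical reading dictionary
database = {
    "male": {"te": [0.1], "n": [0.2], "ki": [0.3]},
    "female": {"te": [0.4], "n": [0.5], "ki": [0.6]},
}
data_set = build_data_set("tenki", readings, database,
                          type_info=["male", "female"],
                          control_info={1: {"rate": 1.0}})
print(data_set["unit_info"])
```

Note that the data set carries unit identifiers rather than waveforms, so the receiving side must hold matching speech unit groups in its own database.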

The speech synthesis data set generation device according to claim 7 is the device according to claim 6 further comprising speech database update information creation means and update information providing means.
With this configuration, the speech database update information creation means creates update information for updating the speech units stored in the speech database and updates the speech units in the speech database with that update information.
The update information providing means then provides the update information created by the speech database update information creation means to the synthesized speech utterance device.
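On the generation side, update information could, for instance, be derived by comparing the database before and after new units are registered. The passage does not say how update information is created, so this diff-based approach, the `(voice type, unit id, samples)` entry format, and the function name `make_update_info` are all assumptions.

```python
from typing import Dict, List, Tuple

Database = Dict[str, Dict[str, List[float]]]


def make_update_info(old: Database, new: Database) -> List[Tuple[str, str, List[float]]]:
    """List every unit that is new or changed in `new` relative to `old`."""
    updates = []
    for voice_type, group in new.items():
        old_group = old.get(voice_type, {})
        for unit_id, samples in group.items():
            if old_group.get(unit_id) != samples:
                updates.append((voice_type, unit_id, list(samples)))
    return updates


old_db = {"female": {"ka": [0.1]}}
new_db = {"female": {"ka": [0.1], "ki": [0.2]}, "child": {"a": [0.3]}}
update_info = make_update_info(old_db, new_db)
print(update_info)
```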

The speech synthesis data set generation program according to claim 8 causes a computer to function as material text acquisition means, reading information acquisition means, unit selection means, control information creation means, and speech synthesis data set providing means, in order to generate a speech synthesis data set and provide it to a synthesized speech utterance device that utters synthesized speech based on that data set. The speech synthesis data set consists of: type information for dividing speech unit groups by voice quality, where each speech unit group is made up of a plurality of speech units of the same voice quality that form the minimum-unit waveforms uttered as synthesized speech; unit information associated with the speech units of each speech unit group; and at least one kind of control information defining how to deform the waveform of the synthesized speech generated by connecting the speech units associated on the basis of the type information and the unit information.

With this configuration, the speech synthesis data set generation program uses the material text acquisition means to acquire the material text to be uttered as synthesized speech.
Next, the reading information acquisition means acquires reading information for the text acquired by the material text acquisition means.

Next, the unit selection means selects, for each voice quality type, the speech units for the reading information acquired by the reading information acquisition means from the speech database, which stores a speech unit group for each voice quality, and uses the selection as the unit information. The control information creation means creates the type information and the control information. After that, the speech synthesis data set providing means provides the unit information selected by the unit selection means, together with the type information and control information created by the control information creation means, to the synthesized speech utterance device as the speech synthesis data set.

According to the invention of claim 1 or claim 5, the speech synthesis control information setting means specifies a speech unit group in accordance with the unit information identified by the type information, speech units are extracted from the speech unit group of the same voice quality stored in the speech database, and the speech synthesis means generates the synthesized speech waveform, so synthesized speech of a specific voice quality can be uttered. In addition, because the generated waveform is altered according to the control information, the expression of the synthesized speech, such as subtle voice quality settings and intonation settings, can be set in fine detail. The user can therefore easily select a voice quality to suit his or her preference, fine-tune the expression of the synthesized speech, and have the provided content uttered in the preferred synthesized speech.
By selecting a voice quality and finely setting the expression in this way, synthesized speech with a variety of tones of voice can be uttered. Here, tone of voice includes, for example, the loudness, strength, pitch, intonation, and timbre of the voice, as well as hesitation, articulation, clarity of word endings, and pauses.

According to the invention of claim 2, the content item type setting means allows the voice quality to be switched easily for each content item, and the speaker setting means allows it to be switched easily for each speaker. The user can thus switch voice quality by items that are conceptually easy to tell apart, such as content and speaker. An example of switching by content is using a male voice for news and a female voice for the weather forecast. Examples of switching by speaker include switching among the voices of an announcer, an actor, an actress, and an animation character.

According to the invention of claim 3, the speech rate setting means can switch the speaking rate of the synthesized speech, the dynamic feature control information setting means can cause synthesized speech to be uttered with a speech expression that reflects the dynamic features, and the static feature control information setting means can do the same for the static features. The synthesized speech utterance device can thus set the synthesized speech waveform in fine detail to suit the user's preference, and can therefore produce richly expressive utterances.

According to the invention of claim 4, the types of speech unit groups in the speech database can be changed, increased, or decreased, and the speech units themselves can be changed, increased, or decreased. Because the speech database of the synthesized speech utterance device is updated according to update information provided by the speech synthesis data set generation device, speech unit groups with voice qualities the user likes can easily be added. In other words, the user can gain more preferred voice qualities by adding the provided update information for speech unit groups, without having to create a voice quality personally.

According to the invention of claim 6 or claim 8, a speech synthesis data set can be provided to the synthesized speech utterance device, and by providing the control information to that device and having it set there, the expression of the synthesized speech can be adjusted (fine-tuned). The user of the synthesized speech utterance device can thus easily have the preferred synthesized speech uttered by setting the control information and adjusting (fine-tuning) the expression.

According to the invention of claim 7, the speech synthesis data set generation device creates update information for the speech database and provides it to the synthesized speech utterance device, so the speech database in the synthesized speech utterance device can be updated, which improves usability for the user of that device.

Therefore, the present invention can provide a synthesized speech utterance device and program, and a speech synthesis data set generation device and program, that allow expressions such as the voice quality and intonation of synthesized speech to be switched easily according to the user's preference so that synthesized speech with a wide variety of speech expressions can be uttered.

The best mode for carrying out the present invention (hereinafter the "embodiment") is described below with reference to the drawings. The description takes as its example a data broadcasting system in which the speech synthesis data set generation device is implemented as a data transmitting device that transmits the speech synthesis data set and update information on a broadcast wave, and the synthesized speech utterance device is implemented as a data receiving device that receives the broadcast wave from the data transmitting device, acquires the speech synthesis data set and update information, and utters synthesized speech according to the speech synthesis data set.

[Overview of the data broadcasting system]
First, an overview of the data broadcasting system 1, which consists of a data transmitting device 10 and a data receiving device 20, is given with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the data broadcasting system according to an embodiment of the present invention.
In the data broadcasting system 1, broadcasts are sent and received between the data transmitting device 10, which transmits a speech synthesis data set created from the text of the material for speech synthesis, and the data receiving device 20, which receives the speech synthesis data set transmitted from the data transmitting device 10, performs speech synthesis, and utters the result as synthesized speech.
The configurations of the data transmitting device 10 and the data receiving device 20 are described in turn below.

(Configuration of the data transmitting device)
The data transmitting device 10 generates, as a speech synthesis data set, type information for classifying by voice quality the speech unit groups, each consisting of a plurality of speech units of the same voice quality that constitute the minimum-unit waveforms to be uttered as synthesized speech; unit information associated with the speech units of each speech unit group; and at least one type of control information defining how to deform the waveform of the synthesized speech generated by concatenating the speech units associated on the basis of the type information and the unit information. It provides this speech synthesis data set to the data receiving device 20, which utters synthesized speech based on it.

To this end, the data transmitting device 10 mainly comprises material text acquisition means 11, speech synthesis data set creation means 12, speech synthesis data set adding means 13, speech database update information creation means 14, multiplexing means 15, and transmitting means 16.

The material text acquisition means 11 receives data broadcast material stored in storage means of a data broadcast material processing device or the like (not shown), acquires from it the text of the material to be uttered as synthesized speech, outputs the text to the speech synthesis data set creation means 12, and outputs the data broadcast material to the speech synthesis data set adding means 13. Here, the data broadcast material is assumed to be a stream containing video data, audio data, and text, and to be stored in the storage means of the data broadcast material processing device or the like (not shown). The text is assumed to be Japanese, such as sentences mixing kanji and kana, but it is not limited to this; for example, it may consist of hiragana only, or be in a language such as English or Chinese.

The speech synthesis data set creation means 12 generates a speech synthesis data set containing unit information, type information, and control information. To this end, it mainly comprises reading information acquisition means 121, unit selection means 122, a speech database 123, and control information creation means 124.

The speech synthesis data set creation means 12 obtains reading information for each text, selects the corresponding speech units from the speech database 123, and creates the unit information. It then creates a speech synthesis data set that brings together the type information, the unit information, and the control information.

The reading information acquisition means 121 acquires reading information for the text acquired by the material text acquisition means 11. It refers to various dictionaries, such as a grammar dictionary and a language dictionary, to recognize the characters and to analyze the text into morphemes, and then acquires the reading information.

FIG. 2 illustrates the concept of how the reading information acquisition means of the speech synthesis data set creation means in the data transmitting device acquires reading information. FIG. 2(A) shows an example sentence of text. FIG. 2(B) shows an example of morphological analysis of the example sentence in FIG. 2(A). FIG. 2(C) shows an example of the reading information for the example sentence in FIG. 2(A).
When the material text acquisition means 11 acquires the sentence 「テキストです。」 ("This is text.") shown in FIG. 2(A), it sends the sentence to the speech synthesis data set creation means 12, where the reading information acquisition means 121 receives it.

The reading information acquisition means 121 first performs morphological analysis on the sentence 「テキストです。」 with reference to the various dictionaries, dividing it into morpheme 1 「テキスト」 and morpheme 2 「です。」 as shown in FIG. 2(B). Once the morphological analysis is complete, it acquires the reading information of morphemes 1 and 2. If the reading of morpheme 1 is written "(silent)tekisuto" and the reading of morpheme 2 is written "desu(silent)", the reading information acquisition means 121 produces the reading information "(silent)tekisutodesu(silent)" shown in FIG. 2(C) and passes it to the unit selection means 122.
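The two steps in this example, splitting the sentence into morphemes and joining their readings, can be illustrated with a toy sketch. The two-entry lexicon and the greedy longest-match segmentation are assumptions made for illustration; the description says only that various dictionaries are consulted and does not name a segmentation method. ASCII parentheses are used in the readings.

```python
# Toy lexicon (assumption): surface form -> reading, from the FIG. 2 example.
READINGS = {
    "テキスト": "(silent)tekisuto",
    "です。": "desu(silent)",
}


def split_morphemes(sentence, lexicon):
    """Greedy longest-match segmentation against the toy lexicon."""
    morphemes = []
    pos = 0
    while pos < len(sentence):
        # Try the longest candidate first, then shorter ones.
        for end in range(len(sentence), pos, -1):
            if sentence[pos:end] in lexicon:
                morphemes.append(sentence[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no lexicon entry at position {pos}")
    return morphemes


def reading_of(sentence, lexicon=READINGS):
    """Concatenate the readings of the morphemes in order."""
    return "".join(lexicon[m] for m in split_morphemes(sentence, lexicon))


morphemes = split_morphemes("テキストです。", READINGS)
reading = reading_of("テキストです。")
```

A production system would use a full morphological analyzer; the point here is only that the sentence-level reading is the concatenation of per-morpheme readings.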

Returning to FIG. 1, the description continues.
The unit selection means 122 selects, for each type of voice quality, the speech units for the reading information acquired by the reading information acquisition means 121 from the speech database 123, and takes them as the unit information.
The unit selection means 122 searches the speech database 123 according to the reading information passed from the reading information acquisition means 121, and extracts from it the speech units that match the reading information for each defined type of voice quality, such as each speaker.
The unit information includes speech unit numbers that specify which speech units to select from the speech unit groups stored in the speech database 123. As described later, a speech unit number can be defined by assigning, for example, the number "No250" to the speech unit "(silent)tekisuto".

Conceptually, a speech unit group is a classification, tied to the type information, under which speech units are stored separately for each category such as speaker. Using speech unit numbers as the unit information, however, makes it possible to reduce the number of speech units stored in the speech database 123. For example, when synthesized speech is to be uttered in the voices of Mr. A and Mr. B, their voice qualities may differ overall and yet be identical for a particular phrase. In such a case, rather than classifying by speaker and storing one copy per speaker in the speech database 123, the shared speech unit of identical voice quality can be called up whenever needed, which reduces the amount of data stored in the speech database 123. It also reduces the size of the speech synthesis data set transmitted on the broadcast wave from the data transmitting device 10 to the data receiving device 20.

The data transmitting device 10 also has a user interface (not shown) through which the content provider enters commands and data, and storage means (not shown) for holding various data.
The content provider adds, modifies, and deletes speech units via the speech database update information creation means 14. In doing so, the content provider specifies through the user interface (not shown) which voice quality's speech unit group each speech unit belongs to, and the speech unit is stored in the speech database 123 accordingly.

The speech database 123 stores, for each voice quality, a speech unit group consisting of a plurality of speech units of the same voice quality; for example, it stores each speech unit in association with a speech unit number. Specifically, the speech database 123 stores the speech unit "(silent)tekisuto" in association with the speech unit number "No250". To allow a voice quality to be specified, the speech database 123 defines and stores a speech unit group for each speaker. A speech unit group can be defined as a table that associates speech unit numbers with speech units.

Here, therefore, the speech database 123 is assumed to be built as a set of tables (speech unit groups), one per speaker for example, each associating speech unit numbers with speech units. Building the database per speech unit group in this way means speech units can be added, deleted, or corrected group by group, so requests from users of the data receiving device 20 can be answered quickly and the database can easily be updated to suit user preferences.
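A database organized as one table per speech unit group, with per-group add, correct, and delete operations, might look like the following minimal sketch. The class name `SpeechDatabase` and its methods are hypothetical, and strings stand in for the waveforms that a real database would hold.

```python
class SpeechDatabase:
    """One table per speech unit group: unit number -> speech unit."""

    def __init__(self):
        self._groups = {}  # group label (e.g. speaker) -> {number: unit}

    def put(self, group, unit_no, unit):
        """Add a unit, or correct it if the number is already registered."""
        self._groups.setdefault(group, {})[unit_no] = unit

    def delete(self, group, unit_no):
        """Remove one unit from one group; other groups are untouched."""
        del self._groups[group][unit_no]

    def get(self, group, unit_no):
        return self._groups[group][unit_no]

    def unit_numbers(self, group):
        return sorted(self._groups.get(group, {}))


db = SpeechDatabase()
db.put("speaker01", 250, "(silent)tekisuto")  # string stands in for a waveform
db.put("speaker01", 480, "desu.(silent)")
db.put("speaker02", 255, "(silent)tekisuto")
```

Because each group is its own table, an update to one speaker's units does not require touching any other group.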

When synthesized speech of different voice qualities is to be uttered depending on the speaker, a speech unit group consists of speech units collected by, for example, sampling the voice of a speaker such as a man, a woman, an announcer, an actor, or an animation character, or by shaping waveforms that carry the characteristic voice quality of each character. When the voice quality is to differ by content, a speech unit group consists of speech units defined and collected according to content such as news or the weather forecast. For example, one speech unit group may be defined as the collected speech units of a male announcer's voice for reading the news, and another as those of a female announcer's voice for reading the weather forecast.

FIG. 3 illustrates the concept of how the unit selection means of the speech synthesis data set creation means in the data transmitting device selects speech units. In the description, the contents of the speech database 123 for speaker 01 in FIG. 3 are shown as Table 1, and those for speaker 02 as Table 2. Each table lists speech unit numbers in the left column and speech units in the right column. A speech unit is the minimum-unit waveform making up the waveform to be uttered as synthesized speech, and corresponds to the minimum unit of the reading information. A speech unit is therefore itself a waveform, but for convenience this specification writes speech units as reading information. A speech unit need only be long enough to be useful for distinguishing differences in voice quality, and its length is preferably decided per speaker.

Table 1 shows a case in which, for speaker 01, several speech units are registered in the speech database 123 even though they have the same semantic content. For example, the speech unit "(silent)tekisuto" with speech unit number "No250" and the speech unit "(silent)tekisuto" with speech unit number "No251" express the same content, yet both are registered. This is the case when, for instance, "No250" is defined as the speech unit for speaker 01's bright voice quality and "No251" as the speech unit for speaker 01's normal-tone voice quality. The same applies to speaker 02 in Table 2.

As described later, different voice qualities can also be defined by deforming, according to the control information, the synthesized speech waveform obtained by concatenating speech units. It is therefore not strictly necessary to register both the bright and the normal-tone voice quality of speaker 01 as described above, but frequently used speech units are preferably registered in the speech database 123. The speech database 123 is also assumed to store the control information described later.

Because speech units are waveforms that differ in voice quality, units with the same semantic content may be stored separately in the speech database 123 according to their voice quality. For example, the speech unit "(silent)tekisuto" with speech unit number "No251" is written identically to that of "No250", but the two are speech units with different waveforms.

Next, returning to FIG. 1 and referring to FIG. 3, a specific example is given of how the unit selection means 122 selects speech units from the speech database 123 based on the reading information acquired by the reading information acquisition means 121. FIG. 3 illustrates the concept of unit selection by the unit selection means of the speech synthesis data set creation means in the data transmitting device. When the unit selection means 122 is passed the reading information "(silent)tekisutodesu.(silent)" for speaker 01 from the reading information acquisition means 121, it searches the speech database 123, selects speaker 01's speech unit "(silent)tekisuto" and obtains the speech unit number "No250", and selects the speech unit "desu.(silent)" and obtains the speech unit number "No480". For speaker 02, it likewise obtains the speech unit numbers "No255" and "No497" for the reading information "(silent)tekisutodesu.(silent)".
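The selection in this example can be reproduced with a small sketch. The table contents follow the numbers quoted in the description; the matching strategy (longest reading first, lowest unit number on ties) is an assumption, since the description leaves the choice among equally matching units to the selection command set by the content provider.

```python
# Per-speaker tables from the example (readings stand in for waveforms).
TABLES = {
    "speaker01": {250: "(silent)tekisuto", 251: "(silent)tekisuto",
                  480: "desu.(silent)"},
    "speaker02": {255: "(silent)tekisuto", 497: "desu.(silent)"},
}


def select_units(reading, table):
    """Cover the reading left to right with units from one speaker's table."""
    numbers = []
    pos = 0
    while pos < len(reading):
        # Units whose reading matches at the current position.
        matches = [(len(r), -no) for no, r in table.items()
                   if r and reading.startswith(r, pos)]
        if not matches:
            raise LookupError(f"no speech unit matches at position {pos}")
        # Assumption: prefer the longest match, then the lowest unit number.
        length, neg_no = max(matches)
        numbers.append(-neg_no)
        pos += length
    return numbers


READING = "(silent)tekisutodesu.(silent)"
units_01 = select_units(READING, TABLES["speaker01"])
units_02 = select_units(READING, TABLES["speaker02"])
```

With these tables the sketch yields the unit numbers 250 and 480 for speaker 01 and 255 and 497 for speaker 02, matching the example.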

The unit selection means 122 selects speech units based on a selection command that the content provider enters and sets through the user interface (not shown). As described above, the selection command may have the unit selection means 122 select speech units per speaker, but it is not limited to this: provided the selection is made per category of differing voice quality, it may be made per content item, or according to the expressiveness of the speaker's voice. For example, the unit selection means 122 may select separately for a particular speaker's bright voice quality and normal voice quality.

The control information creation means 124 creates the type information for classifying by voice quality the speech unit groups, each consisting of a plurality of speech units of the same voice quality that constitute the minimum-unit waveforms to be uttered as synthesized speech, and at least one type of control information defining how to deform the waveform of the synthesized speech generated by concatenating the speech units associated on the basis of the unit information. The type information indicates the classification, for example speaker 01 or speaker 02.

The control information fine-tunes and deforms the synthesized speech waveform obtained by concatenating speech units, shaping it into waveforms of different speech expression types. Examples of control information are speech rate control information, dynamic feature control information, and static feature control information.

The speech rate control information changes the speaking rate of the waveform generated by concatenating speech units. One example is a voice lead setting function (ボイスリードセッティング関数), which switches among, for example, five speaking-rate levels. The speech rate control information changes the speaking rate by stretching the waveform along the time axis.
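Stretching a waveform along the time axis with five selectable levels can be illustrated as below. The level-to-factor table and the function name `change_rate` are hypothetical, since the description gives no numeric values. The sketch uses plain linear-interpolation resampling, which also shifts pitch; techniques such as PSOLA or WSOLA are commonly used when duration must change without altering pitch, and the description does not say which method is intended.

```python
# Hypothetical mapping from the five levels to speed factors
# (factor > 1 means faster speech and a shorter waveform).
RATE_FACTORS = {1: 0.5, 2: 0.75, 3: 1.0, 4: 1.5, 5: 2.0}


def change_rate(samples, level):
    """Resample a waveform along the time axis by linear interpolation."""
    if not samples:
        return []
    factor = RATE_FACTORS[level]
    n_out = max(1, round(len(samples) / factor))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        x = i * factor          # fractional read position in the input
        lo = min(int(x), last)  # sample at or before the read position
        hi = min(lo + 1, last)  # next sample, clamped at the end
        frac = x - int(x)
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


ramp = [0, 2, 4, 6, 8, 10, 12, 14]
fast = change_rate(ramp, 5)  # half as many samples
slow = change_rate(ramp, 1)  # twice as many samples
```

Level 3 leaves the waveform unchanged, which makes it a natural default in this sketch.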

The dynamic feature control information varies the dynamic features, which govern how the waveform generated by concatenating speech units moves as time elapses. Examples are the fundamental frequency and the spectrum of the waveform. Other examples are the pauses (間, ma) in speech and the sense of tempo.

The static feature control information varies the static features, which govern the shape, over a fixed duration at a given time, of the waveform generated by concatenating speech units. Examples are the average value of the fundamental frequency and the outline of the spectral envelope.
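The difference between a static and a dynamic feature can be illustrated on a fundamental-frequency (F0) contour. This is a sketch under stated assumptions, not the method of the description: the contour is a list of per-frame F0 values in Hz, a value of 0 marks an unvoiced frame, and the function names are hypothetical. Scaling every value moves the average pitch (a static feature), while stretching the values around their mean changes how strongly the pitch moves over time (a dynamic feature) and leaves the average where it was.

```python
def mean_f0(contour):
    """Average F0 over voiced frames (0 marks an unvoiced frame)."""
    voiced = [f for f in contour if f > 0]
    return sum(voiced) / len(voiced)


def scale_mean(contour, ratio):
    """Static change: raise or lower the average pitch by a ratio."""
    return [f * ratio for f in contour]


def scale_movement(contour, gain):
    """Dynamic change: widen or flatten pitch movement around the mean."""
    m = mean_f0(contour)
    return [m + gain * (f - m) if f > 0 else 0.0 for f in contour]


contour = [100.0, 120.0, 0.0, 140.0]
lowered = scale_mean(contour, 0.5)        # average pitch halved
emphasized = scale_movement(contour, 2.0)  # movement doubled, mean kept
```

Settings such as "average pitch down" or "accent emphasis" could be mapped onto operations of this kind, though that mapping is an assumption here.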

The concept of the speech synthesis data set is now described with reference to FIG. 4. FIG. 4 illustrates the concept of the speech synthesis data set produced by the speech synthesis data set creation means of the data transmitting device.
The speech synthesis data set can be divided into general items, which do not depend on the voice quality of each speaker or the like, and special items, which differ with the voice quality of each speaker or the like.

The general items include, for example, content information such as "news", text such as the example sentence 「テキストです。」, reading information such as "(silent)te'kisutodesu.(silent)", and the accent type of each part of speech, such as "type-1 noun 1" (１型名詞１). The "'" in the reading information marks the accent. Although the speech synthesis data set is described here as including text such as the example sentence 「テキストです。」, the text may be omitted from the data set, because the data receiving device 20 performs speech synthesis from the reading information, as described later.

The special items, on the other hand, include, for example, type information indicating the class, such as speaker 01 or speaker 02; segment information used by the data receiving device 20 to select speech units; and control information expressing types of adjustment such as pitch.
In this example, speaker 01 has the segment information "No250 (tekisuto), No480 (desu.)", control information type 1 "average pitch fall", and control information type 2 "average pitch rise, accent 1 emphasis". Speaker 02, whose voice quality differs from that of speaker 01, has, for the same semantic content, the segment information "No255 (tekisuto), No497 (desu.)", control information type 1 "average pitch fall", and control information type 2 "accent 1 emphasis". The segment information may consist of speech unit numbers alone, because, as described later, the data receiving device 20 retrieves speech units from the speech database 251 using the speech unit number as a key.
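
The division into general and special items can be pictured as a small data structure. The following is a minimal sketch, not the format used in the patent: the class and field names (`GeneralItems`, `SpecialItems`, `SpeechSynthesisDataSet`) are hypothetical, and only the example values are taken from the description of FIG. 4.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GeneralItems:
    """Items that do not depend on voice quality (hypothetical names)."""
    content: str                 # content information, e.g. "news"
    reading: str                 # reading information; "'" marks the accent
    accent_type: str             # accent type per part of speech
    text: Optional[str] = None   # may be omitted: synthesis uses the reading


@dataclass
class SpecialItems:
    """Items that differ for each voice quality (hypothetical names)."""
    unit_numbers: List[int]        # segment information (unit numbers only)
    control: Dict[int, List[str]]  # control information type -> adjustments


@dataclass
class SpeechSynthesisDataSet:
    general: GeneralItems
    # Special items keyed by type information (here, the speaker).
    special: Dict[str, SpecialItems] = field(default_factory=dict)


data_set = SpeechSynthesisDataSet(
    general=GeneralItems(
        content="news",
        reading="(silent)te'kisutodesu.(silent)",
        accent_type="type 1 noun 1",
    ),
    special={
        "speaker01": SpecialItems(
            [250, 480],
            {1: ["average pitch fall"],
             2: ["average pitch rise", "accent 1 emphasis"]},
        ),
        "speaker02": SpecialItems(
            [255, 497],
            {1: ["average pitch fall"], 2: ["accent 1 emphasis"]},
        ),
    },
)
```

Keying the special items by type information mirrors the idea that one set of general items is shared while each speaker carries its own unit numbers and control information.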

The speech synthesis data set adding means 13 acquires the data broadcast material output from the material text acquisition means 11 and incorporates into it the speech synthesis data set created by the speech synthesis data set creation means 12.
For example, the speech synthesis data set adding means 13 adds the speech synthesis data set created by the speech synthesis data set creation means 12 to a data broadcasting program. One example of such a data broadcasting program is a BML (Broadcast Markup Language) program.

This BML program is written in a script language. Within the script, items such as the reading information are enclosed between "<dedicated tag>" and "</dedicated tag>", for example as "<dedicated tag>reading information</dedicated tag>".
The speech synthesis data set need not be embedded between tags in the data broadcasting program; it may instead be carried on the broadcast wave and transmitted sequentially as communication data.

The speech database update information creation means 14 creates update information to be transmitted to the data receiving device 20 when the speech database 123 has been updated with update data that a content provider enters through a user interface (not shown). The update data and the update information are in principle identical in content, but they are given different names because they can differ in exceptional cases.

Here, an outline of the processing performed when the speech database is updated, together with the relationship between the update data and the update information, will be described with reference to FIG. 5. FIG. 5 is a diagram explaining the concept of creating update information when the speech database is updated.
First, update data is input to the data transmission device 10 through a user interface (not shown). Here, the update data d1 is assumed to add a new speech unit for an existing speaker. In this case, the update data d1 includes, for example, the speaker type "speaker 01", the speech unit number "No500", the reading information "desita", and the speech unit "desita".

The input update data d1 is passed to the speech database update information creation means 14, which creates update information D1. In this case, the created update information D1 is identical to the update data d1. The update data d1 and the update information D1 differ when, for instance, specific information already registered for a particular speaker is reused and the corresponding item is therefore omitted from the input update data d1. For example, if the speech unit specified in the update data d1 to be registered for speaker 01 is already registered as a speech unit of speaker 02, the input of that speech unit is omitted, and the speech database update information creation means 14 retrieves the omitted speech unit to create the update information.
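
The relationship between update data d1 and update information D1 might be sketched as follows. This is only an illustration under stated assumptions: the patent does not say how the update data points at an already registered speech unit, so the `reuse_from` field, the dictionary layout, and the function name `create_update_info` are all hypothetical.

```python
def create_update_info(update_data, voice_db):
    """Build update information D1 from update data d1.

    If the speech unit itself was omitted from d1, it is retrieved from
    the unit already registered for another speaker (assumed to be named
    by a hypothetical "reuse_from" field).
    """
    info = {k: v for k, v in update_data.items() if k != "reuse_from"}
    if info.get("unit") is None:
        src_speaker, src_number = update_data["reuse_from"]
        info["unit"] = voice_db[src_speaker][src_number]
    return info


# Speech database on the transmitting side: speaker -> unit number -> waveform.
voice_db = {"speaker02": {497: [0.3, 0.1]}}

# Normal case: d1 is complete, so D1 has the same content as d1.
d1 = {"speaker": "speaker01", "unit_number": 500,
      "reading": "desita", "unit": [0.2, 0.4]}
D1 = create_update_info(d1, voice_db)

# Exceptional case: the unit is omitted and reused from speaker 02.
d1_short = {"speaker": "speaker01", "unit_number": 501, "reading": "desu.",
            "unit": None, "reuse_from": ("speaker02", 497)}
D1_filled = create_update_info(d1_short, voice_db)
```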

Returning to FIG. 1, the speech database update information creation means 14 registers the created update information in the speech database 123 and also outputs it to the multiplexing means 15. Besides the addition of a new speech unit for an existing speaker, the update information may concern speech units of a new speaker or improved versions of an existing speaker's speech units.

The multiplexing means 15 multiplexes the data broadcast material incorporating the speech synthesis data set, output from the speech synthesis data set adding means 13, with the update information output from the speech database update information creation means 14, and sends the resulting multiplexed data to the transmission means 16. Because the multiplexing means 15 here multiplexes the data broadcast material with the update information, the data broadcast material includes text. However, as described later, the data receiving device 20 does not synthesize speech by recognizing characters in the text, so the multiplexed data may omit the text.

The transmission means 16 carries the multiplexed data, in which the multiplexing means 15 has multiplexed the data broadcast material including the speech synthesis data set with the update information, on a broadcast wave and transmits it to the data receiving device 20. The transmission means 16 is, specifically, an antenna (not shown), but it is described here as also including electronic circuits (not shown) such as a D/A converter, which converts the multiplexed data from digital to analog, and an oscillator. The multiplexed data is thus transmitted from the data transmission device 10 to the data receiving device 20. As described later, the data receiving device 20 extracts the update information from the multiplexed data and updates its speech database 251, so that the contents of the speech database 123 of the data transmission device 10 and the speech database 251 of the data receiving device 20 can be kept identical.

The speech synthesis data set adding means 13, the multiplexing means 15, and the transmission means 16 together function as speech synthesis data set providing means, which provides the data receiving device 20 with the segment information selected by the segment selection means 122 and the type information and control information created by the control information creation means 124 as a speech synthesis data set.
Likewise, the speech database update information creation means 14, the multiplexing means 15, and the transmission means 16 together function as update information providing means, which provides the data receiving device 20 with the update information created by the speech database update information creation means 14.

The data transmission device 10 can be realized by having a general-purpose computer execute a program that operates the arithmetic unit and storage devices in the computer. This program (the speech synthesis data set generation program) can be distributed over a communication line or written to a recording medium such as a CD-ROM and distributed.

(Configuration of the data receiving device)
The data receiving device 20, serving as the synthesized speech utterance device, acquires a speech synthesis data set from the data transmission device 10 and utters it as synthesized speech. The data transmission device 10 generates the following as the speech synthesis data set: type information for dividing speech unit groups by voice quality, each group consisting of a plurality of speech units of the same voice quality that form the minimum-unit waveforms uttered as synthesized speech; segment information associated with the speech units of each speech unit group; and at least one kind of control information defining how to deform the waveform of the synthesized speech generated by connecting the speech units associated on the basis of the type information and the segment information. To this end, the data receiving device 20 mainly comprises receiving means 21, data identification means 22, speech synthesis control information setting means 23, speech database update means 24, speech synthesis means 25, a user interface 26, and speech output means 27.

The receiving means 21 receives the broadcast wave transmitted from the data transmission device 10 and passes the received data, as an electrical signal, to the data identification means 22. The receiving means 21 is, specifically, an antenna (not shown), but it is described here as also including electronic circuits such as an amplifier circuit and an A/D converter, which perform the conversions needed before the broadcast wave received by the antenna is passed to the data identification means 22 as received data.

When the received data passed from the receiving means 21 is multiplexed data transmitted from the data transmission device 10, the data identification means 22 identifies the data broadcast material and the update information in the multiplexed data, extracts the data broadcasting program from the data broadcast material, and identifies the speech synthesis data set in that data broadcasting program. The data identification means 22 also sends the speech synthesis data set to the speech synthesis control information setting means 23 and the update information to the speech database update means 24.

One case in which the data identification means 22 identifies the received data passed from the receiving means 21 as not being multiplexed data is noise at the same frequency as that used for transmission by the data transmission device 10.

Here, the concept of the function of the data identification means 22 will be described with reference to FIG. 6. FIG. 6 is a diagram explaining the concept of the function of the data identification means provided in the data receiving device.
On receiving the received data from the receiving means 21, the data identification means 22 determines, based on the amount of received data, whether the received data is multiplexed data. If it is, the data identification means 22 identifies the data broadcast material and the update information and separates them. For the update information D2, the data identification means 22 sends the update information D2, which includes segment information, reading information, speaker information (type information), content information (type information), control information, and the like, to the speech database update means 24. If the received data is not multiplexed data, the data identification means 22 judges that it was not transmitted to the data receiving device 20 and ignores it.

On the other hand, when the data broadcast material is, for example, a data broadcasting program P, the data identification means 22 identifies the script and the data enclosed between <dedicated tag> and </dedicated tag> in the data broadcasting program P as the speech synthesis data set, and sends the speech synthesis data set to the speech synthesis control information setting means 23.
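
Picking out the data enclosed between the dedicated tags could look like the sketch below. It makes several assumptions: the tag name `tts` is a hypothetical stand-in for the unnamed dedicated tag, the program string is a simplified placeholder rather than real BML, and a regular expression is used only for brevity (a real receiver would presumably use a proper markup parser).

```python
import re

TAG = "tts"  # hypothetical stand-in for the patent's unnamed "dedicated tag"


def extract_tagged(program, tag=TAG):
    """Return every string enclosed between <tag> and </tag>, in order."""
    pattern = re.compile(
        r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL
    )
    return pattern.findall(program)


# Simplified placeholder for a data broadcasting program P (not real BML).
program_p = (
    "<body>"
    "<tts>(silent)tekisutodesu.(silent)</tts>"
    "<tts>speaker02: No20, No10</tts>"
    "</body>"
)

items = extract_tagged(program_p)
```

The non-greedy `(.*?)` keeps each match inside a single pair of tags, so two adjacent tagged items are returned separately.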

Returning to FIG. 1, the speech synthesis control information setting means 23 has two functions. The first is a segment information specifying process, which specifies the extraction target among the speech unit groups stored in the speech database 251 (described later) of the speech synthesis means 25 by selecting from the type information of the speech synthesis data set passed from the data identification means 22. The second is a control information setting process, which specifies how the waveform is deformed by selecting from the control information of the speech synthesis data set passed from the data identification means 22. The means sets the type information, the control information, and the segment information passed from the data identification means 22 as speech synthesis data.

To carry out the segment information specifying process, the speech synthesis control information setting means 23 mainly comprises content item type setting means 231 and speaker setting means 232.
The content item type setting means 231 sets the type of speech unit group stored in the speech database 251 according to the type of content item included in advance in the type information.
The speaker setting means 232 sets the type of speech unit group stored in the speech database 251 according to the speaker included in advance in the type information.

In the segment information specifying process of the speech synthesis control information setting means 23, a screen for choosing a speech unit group is displayed, and the user designates the group through the user interface 26.
For example, when a male voice and a female voice are offered as the segment information and the user designates the female voice through the user interface 26, the speech synthesis control information setting means 23 specifies the female-voice speech unit group in the segment information specifying process, so that the speech units designated by the segment information are selected from the female-voice speech unit group in the speech database 251.

To carry out the control information setting process, the speech synthesis control information setting means 23 mainly comprises speech rate setting means 233, dynamic feature control information setting means 234, and static feature control information setting means 235.
The speech rate setting means 233 sets speech rate control information, which is control information for changing the speech rate of the waveform generated by connecting speech units.
The dynamic feature control information setting means 234 sets dynamic feature control information, which is control information for varying the dynamic features, that is, how the waveform generated by connecting speech units moves as time elapses. A dynamic feature is a physical property describing how the waveform uttered as synthesized speech changes over time.
The static feature control information setting means 235 sets static feature control information, which is control information for varying the static features, that is, the shape of the waveform generated by connecting speech units over a fixed interval at a given time. A static feature is a physical property that is uniquely determined over a fixed interval at a given time of the waveform uttered as synthesized speech.

The speech synthesis control information setting means 23 then sets, for each data item, the control information relating to the speaker, speech rate, intonation, voice quality, and so on designated by the user, forms the speech synthesis data, and sends this speech synthesis data to the speech synthesis means 25.
Through the control information setting process, the speech synthesis control information setting means 23 also sets control information that uniquely designates how the waveform is deformed. Examples of control information include functions, parameters, and control information numbers. These functions, parameters, and control information numbers fine-tune the deformation of the waveform generated by connecting speech units; they are not primarily intended to turn it into an entirely different waveform.

In other words, setting the control information does not change the voice quality established by the segment information into an entirely different voice quality; it fine-tunes that voice quality. One example is a deformation that rounds off and smooths the sharp corners of the waveform, so that the voice of the same person is adjusted from an ordinary tone to a gentle tone. A control information number is a number associated with a function or parameter and is treated as defining a way of deforming the waveform. As explained here, the control information is not primarily intended to produce an entirely different waveform, but the voice quality may be changed to an entirely different waveform if the user so requests.
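
As a rough picture of "rounding off the sharp corners" of a waveform, a centered moving average can be used. The patent does not specify the smoothing function, so the moving average, the function name `smooth`, and the default window of three samples are assumptions chosen only for illustration.

```python
def smooth(waveform, window=3):
    """Round off sharp corners with a centered moving average.

    Near both ends the window is truncated so that the output has the
    same number of samples as the input.
    """
    half = window // 2
    out = []
    for i in range(len(waveform)):
        lo = max(0, i - half)
        hi = min(len(waveform), i + half + 1)
        out.append(sum(waveform[lo:hi]) / (hi - lo))
    return out


# A single sharp peak is flattened while the overall shape is kept.
softened = smooth([0.0, 3.0, 0.0])
```

A wider window gives a stronger smoothing, which corresponds loosely to a control parameter that selects how far the tone is adjusted.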

The speech synthesis control information setting means 23 also has a function of analyzing the speech synthesis data set sent from the data identification means 22 and dividing it into items. Within the speech synthesis control information setting means 23, the content item type setting means 231, the speaker setting means 232, the speech rate setting means 233, the dynamic feature control information setting means 234, and the static feature control information setting means 235 receive from the data identification means 22 the selection items used to display characters and messages for selecting items, display them on the display means (not shown) of the user interface 26, and enable selection through the input means (not shown).
The concept of analyzing the speech synthesis data set will be described with reference to FIG. 7. FIG. 7 is a diagram explaining the concept of analyzing the speech synthesis data set.

The speech synthesis control information setting means 23 acquires, as the speech synthesis data set, the script and the data enclosed between <dedicated tag> and </dedicated tag> that were extracted from the data broadcasting program P identified by the data identification means 22, and distinguishes the general items of the speech synthesis data set from the special items.
As described above, the general items include, for example, content information such as "news"; text such as the example sentence "Tekisuto desu." ("This is text."); reading information such as "(silent)tekisutodesu.(silent)"; and an accent type for each part of speech, such as "type 1 noun 1". This example shows the reading information without the accent mark "'" described earlier.

As described above, the special items include, for example, type information indicating the class, such as speaker 01 or speaker 02; segment information used by the data receiving device 20 to select speech units; and control information expressing types of adjustment such as pitch.
In this example, speaker 01 has the segment information "No35 (teki), No45 (suto), No15 (desu.)", control information type 1 "average pitch fall", and control information type 2 "average pitch rise". Speaker 02, whose voice quality differs from that of speaker 01, has, for the same semantic content, the segment information "No20 (tekisuto), No10 (desu.)", control information type 1 "average pitch fall", and control information type 2 "accent 1 emphasis".

The contents shown here differ from those of the speech synthesis data set shown in the description of the data transmission device 10. This is only because each is given as an illustration; it does not mean that the speech synthesis data set changes in transit from the data transmission device 10 to the data receiving device 20. The following description assumes that the speech synthesis data set shown here for the data receiving device 20 is the one transmitted from the data transmission device 10.

Returning to FIG. 1, the speech database update means 24 updates the speech database 251 (described later) of the speech synthesis means 25 according to the update information sent from the data identification means 22.
Here, the concept of the function of the speech database update means 24 will be described with reference to FIG. 8. FIG. 8 is a diagram explaining the concept of the function of the speech database update means provided in the data receiving device.

Based on the content of the update information D2 sent from the data identification means 22, the speech database update means 24 updates the speech database 251 and also updates what is shown on the display unit (not shown) through which the user makes various settings via the user interface 26. For example, the speech database update means 24 selects speech database update data D3 from the update information D2. The speech database update data D3 includes, for example, speech units and reading information.

The speech database update means 24 then stores the speech database update data D3 in the speech database 251. The speech database update data D3 may also include, for example, a command to delete specific information or a command to modify specific information.
The speech database update means 24 also selects user interface setting data D4 from the update information D2. The user interface setting data D4 includes, for example, speaker information and control information.

The speech database update means 24 then outputs to the user interface 26 a command to add options for the user interface setting data D4, such as selection buttons or selection numbers, to the screen of the display means (not shown) of the user interface 26.
For instance, when the data identification means 22 identifies update information that adds a speech unit group with the voice quality of announcer X, the speech database update means 24 outputs to the user interface 26 a command to display, on the display means (not shown) of the user interface 26, a character such as an option button that lets the user select announcer X.

When an item already registered in the speech database 251 (described later) of the speech synthesis means 25 is to be deleted according to the update information identified by the data identification means 22, the speech database update means 24 outputs to the user interface 26 a command to delete the corresponding character, such as an option button, displayed on the display means (not shown) of the user interface 26.
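
The way an update keeps the speech database and the selectable options in step might be sketched as follows. The dictionary layout of the update information, the `delete` flag, and the use of a plain set to stand for the option buttons on the display are all hypothetical simplifications.

```python
def apply_update(voice_db, ui_options, update):
    """Apply one piece of update information on the receiving side.

    voice_db:   speaker -> {unit number: waveform}
    ui_options: set of speakers currently offered to the user
    update:     hypothetical dict with "speaker" and either a "delete"
                flag or "unit_number" and "unit"
    """
    speaker = update["speaker"]
    if update.get("delete"):
        # Remove the speaker's unit group and its option button together.
        voice_db.pop(speaker, None)
        ui_options.discard(speaker)
    else:
        # Store the unit and make sure the speaker can be selected.
        voice_db.setdefault(speaker, {})[update["unit_number"]] = update["unit"]
        ui_options.add(speaker)


receiver_db = {}
options = set()
apply_update(receiver_db, options,
             {"speaker": "announcer X", "unit_number": 1, "unit": [0.1, 0.2]})
```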

Returning to FIG. 1, the speech synthesis means 25 contains the speech database 251. The speech database 251 stores a speech unit group for each voice quality. A speech unit group is a collection of speech units of the same voice quality stored in the speech database 251, and the groups are classified and defined by type of voice quality. For example, as described later, the speech database 251 stores the speech unit groups defined by differences in content or speaker so that they can be distinguished by type.

More specifically, when the speech database 251 stores speech unit groups for four types of voice quality, namely a male announcer's voice, a female announcer's voice, an actor's voice, and an actress's voice, the segment information of the speech synthesis data set serves to identify which of these four speakers' speech unit groups the speech units are to be selected from. One example of such information is the speech unit number associated with each speech unit.

Accordingly, the speech synthesis means 25 receives the speech synthesis data from the speech synthesis control information setting means 23; extracts speech units, according to the segment information of the speech synthesis data, from the speech unit group of the speech database 251 specified by the type information of the speech synthesis data; connects the extracted speech units to generate the waveform of the synthesized speech; deforms the waveform in the manner specified by the control information of the speech synthesis data; and utters the deformed waveform as synthesized speech.

For example, the speech synthesis means 25 selects, according to the type information, the speech units making up the female-voice speech unit group from the speech database 251 and connects them to generate a waveform. If fast speech is set as the speech rate control information, it processes the generated waveform so as to shorten its period, and the synthesized speech is uttered in a fast female voice. If "1.5x speed" is set as the speech rate control information, the speech synthesis means 25 deforms the waveform generated by connecting the speech units by compressing it along the time axis so that it plays at 1.5 times the speed.
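
Compressing a waveform along the time axis to 1.5 times the speed can be illustrated with plain linear-interpolation resampling. This is a deliberately naive sketch: the function name `compress_time_axis` is hypothetical, the patent does not state which method is used, and simple resampling of this kind also raises the pitch, which a practical synthesizer would normally avoid with a pitch-preserving method.

```python
def compress_time_axis(waveform, rate):
    """Shorten `waveform` by the factor `rate` using linear interpolation.

    rate=1.5 yields roughly len(waveform) / 1.5 samples, so playback at
    the original sampling frequency is 1.5 times faster.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not waveform:
        return []
    n_out = max(1, int(len(waveform) / rate))
    out = []
    for i in range(n_out):
        pos = i * rate                       # position in the input signal
        lo = int(pos)
        hi = min(lo + 1, len(waveform) - 1)  # clamp at the last sample
        frac = pos - lo
        out.append(waveform[lo] * (1 - frac) + waveform[hi] * frac)
    return out


# Six input samples become four output samples at 1.5x speed.
fast = compress_time_axis([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 1.5)
```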

To this end, the speech units are stored in the speech database 251 in association with speech unit numbers. The speech synthesis means 25 searches the speech database 251 using, as keys, the speech unit numbers contained in the speech synthesis data set multiplexed into the multiplexed data sent from the data transmitting device 10, and thereby selects the speech units (that is, the waveforms) corresponding to those numbers. It then connects the selected speech units to generate a synthesized speech waveform, deforms the generated waveform according to the control information, and outputs the synthesized speech of the deformed waveform to the speech output means 27.

Here, the concept of speech synthesis by the speech synthesis means 25 is described with reference to FIG. 9. FIG. 9 is a diagram explaining the concept of speech synthesis by the speech synthesis means of the data receiving device.
When the speech synthesis means 25 acquires the speech synthesis data from the speech synthesis control information setting means 23, it analyzes the data and identifies the general items (content information, text, reading information, accent type, part-of-speech type, and so on), the speech units for the speaker set by the user, and the special items such as the average pitch lowering function.

In this example, the speech synthesis data specifies the following: the content item "News"; the text "テキストです。" ("This is text."); the reading information "(silent)tekisutodesu.(silent)"; the accent type "type 1 noun 1"; the speaker number "speaker 02"; the unit information, consisting of the speech unit "tekisuto" with speech unit number "No20" and the speech unit "desu." with speech unit number "No1"; and the average pitch lowering function Pav(0.8).
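The example values above could be held in a record such as the following. This is a sketch under assumptions: the class name `SpeechSynthesisData` and its field names are hypothetical, and the source does not define a concrete data layout.

```python
from dataclasses import dataclass, field


@dataclass
class SpeechSynthesisData:
    """Hypothetical container for one item of speech synthesis data."""
    content_item: str
    text: str
    reading: str
    accent_type: str
    speaker: str
    units: list = field(default_factory=list)  # (speech unit number, unit reading) pairs
    pav: float = 1.0                            # argument of the average pitch lowering function Pav


example = SpeechSynthesisData(
    content_item="News",
    text="テキストです。",
    reading="(silent)tekisutodesu.(silent)",
    accent_type="type 1 noun 1",
    speaker="speaker 02",
    units=[(20, "tekisuto"), (1, "desu.")],
    pav=0.8,
)
```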

In this case, once the speech synthesis means 25 has identified the speech synthesis data, it identifies, according to the type information, the speech unit group from which speech units are to be extracted, refers to the unit information, searches the speech database 251 using the speech unit numbers as keys, and extracts the waveforms of the corresponding speech units. The speech synthesis means 25 connects the speech units thus obtained and generates a synthesized speech waveform, either for the entire text to be uttered or for each segment of a predetermined length. This enables the speech synthesis means 25 to utter synthesized speech with the voice quality of speaker 02.
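The lookup-and-connect procedure described above can be sketched as follows. The dictionary `SPEECH_DB` is a hypothetical in-memory stand-in for the speech database 251, the sample values are dummies, and simple concatenation is an assumption; the source does not say how unit boundaries are joined.

```python
# Hypothetical stand-in for speech database 251:
# speaker -> {speech unit number -> waveform samples}
SPEECH_DB = {
    "speaker 02": {
        20: [0.1, 0.3, 0.2],   # "tekisuto" (dummy samples)
        1: [0.0, -0.1],        # "desu."    (dummy samples)
    },
}


def synthesize(db, speaker, unit_numbers):
    """Pick the unit group by speaker, look up each unit by number, and concatenate."""
    group = db[speaker]                  # unit group specified by the type information
    waveform = []
    for number in unit_numbers:          # unit information: ordered speech unit numbers
        waveform.extend(group[number])   # raises KeyError if the unit is not stored
    return waveform


wave = synthesize(SPEECH_DB, "speaker 02", [20, 1])
```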

Next, the speech synthesis means 25 deforms the generated synthesized speech waveform according to control information such as the accent type and the part-of-speech type. For example, it increases the amplitude of the designated accent portion of the waveform so that the accent portion is uttered more strongly.
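The amplitude emphasis of an accent portion can be sketched as follows. This is a minimal illustration: the function name `emphasize`, the sample-index span, and the gain value are assumptions, since the source does not state how the accent portion is located or how much it is amplified.

```python
def emphasize(samples, start, end, gain):
    """Return a copy of the waveform with samples[start:end] scaled by gain."""
    if not 0 <= start <= end <= len(samples):
        raise ValueError("span out of range")
    return [s * gain if start <= i < end else s for i, s in enumerate(samples)]


wave = [0.2, -0.2, 0.4, -0.4, 0.2, -0.2]   # dummy waveform
accented = emphasize(wave, 2, 4, 1.5)      # strengthen the middle (accent) portion
```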

In this case, the speech synthesis means 25 deforms the waveform generated for the noun portion to be uttered into a waveform in which the part designated by the accent type is emphasized, and also deforms the waveform so as to lower the average pitch based on the function Pav(0.8). In this way it deforms the waveform of the synthesized speech having the voice quality of speaker 02. The speech synthesis means 25 then outputs the deformed waveform to the speech output means 27, which utters it as synthesized speech. The speech synthesis means 25 can thus deform the synthesized speech waveform generated by connecting speech units and cause synthesized speech that suits the user's preference to be uttered.

The user interface 26 is mainly for inputting the content item, the speaker, the speech speed, and the control information for dynamic or static features into the data receiving device 20. It comprises input means through which the user issues various commands to the data receiving device 20, and display means (not shown) that displays various information from the data receiving device 20.

For example, when the speaker setting means 232 can set four types of voice quality as described above, the user interface 26 displays four buttons on the display means (not shown) and has the user select one of them, thereby selecting the speaker. The user can thus have the data receiving device 20 utter synthesized speech in a preferred voice quality through the simple operation of making a selection via the user interface 26.

Here, the concept of setting control information and the like in the user interface 26 is described with reference to FIG. 10. FIG. 10 is a diagram explaining the concept of setting speech synthesis control information through the user interface of the data receiving device.
Here, the display means of the user interface 26 displays options for the content item, the speaker, the speech speed, and the type of speech expression. The type of speech expression is a type defined by the control information; it classifies the items of the control information so that the user can select among them. For example, Type 1 lowers the average pitch and Type 2 raises the average pitch.

The input means (not shown) of the user interface 26 may be a so-called remote control that communicates with the data receiving device 20 by infrared and is used to select the various options, but it is not limited to this. For example, selection buttons may be displayed as options on a so-called touch panel, and the selection may be entered by the user touching the screen.

As shown in FIG. 10, the content items include, for example, news, weather, and program commentary. The speakers include, for example, a man, a woman, an actor, an actress, and an anime character.
Here, speaker numbers are displayed as options in association with speaker names: for example, speaker 01 is the man, speaker 02 the woman, speaker 03 the actor, speaker 04 the actress, and speaker 05 the anime character.

For the speech speed, 1.0x is taken as the reference speed, and options from 0.8x to 1.4x, expressed as ratios to this reference speed, are displayed. The speech speed need not be expressed as a ratio to the reference speed; it may instead be set, for example, as the number of characters uttered per unit time.
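The two ways of expressing the speech speed can be related as in the following sketch. The 0.1 step between options and the reference value of 300 characters per minute are assumptions for illustration only; the source gives only the 0.8x to 1.4x range and the 1.0x reference.

```python
def speed_options(low=0.8, high=1.4, step=0.1):
    """List the selectable speed ratios (the 0.1 step is an assumed granularity)."""
    count = round((high - low) / step)
    return [round(low + i * step, 2) for i in range(count + 1)]


def ratio_from_characters(target_per_minute, reference_per_minute):
    """Convert a characters-per-minute setting into a ratio to the reference speed."""
    if reference_per_minute <= 0:
        raise ValueError("reference must be positive")
    return target_per_minute / reference_per_minute


options = speed_options()
ratio = ratio_from_characters(360, 300)   # hypothetical reference: 300 characters per minute
```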

For the type of speech expression, a defined type number such as Type 1 or Type 2 is selected. As described above, Type 1 and Type 2 are defined for each speaker in the special items of the speech synthesis data set. For example, for speaker 01 in FIG. 8, Type 1 is defined as lowering the average pitch and Type 2 as raising it. The speech expression type need not be selected as a single discrete type; as shown in FIG. 9, it may be set in gradations, such as "slightly Type 1".

The content item name, speaker number, speech speed setting, and speech expression type may each be given an identifying number, and the options may be presented to the user by displaying them in a window separate from the selection screen or by distributing a paper medium on which they are printed. In this case, the user looks at the options and selects one by entering the number displayed on the screen or by touching the screen.

Next, with reference to FIG. 11, the following describes how the speech synthesis control information setting means 23 having the functions described above sets the options of the content item type setting means 231, the speaker setting means 232, the speech speed setting means 233, the dynamic feature control information setting means 234, and the static feature control information setting means 235 (hereinafter collectively referred to as "the content item type setting means 231 and the like") according to the setting data sent from the user interface 26 described with reference to FIG. 10.

When the speech synthesis control information setting means 23 receives a speech synthesis data set from the data identification means 22, it identifies the contents of the data set as described above and sets up the options for each of the content item type setting means 231 and the like. It then waits for setting data from the user interface 26, selects among the options of each of the content item type setting means 231 and the like, assembles the selected options into speech synthesis data, and outputs this speech synthesis data to the speech synthesis means 25.

When the content item in the setting data indicates news, "News" is selected as the content in the content item type setting means 231. When the setting data indicates speaker 02, speaker 02 is selected in the speaker setting means 232. When the control information in the setting data indicates a speech speed, a speed ratio such as 1.2x is selected. When the control information in the setting data is the average pitch lowering function, which represents static feature control information, the average pitch lowering function Pav of the static feature control information setting means 235 is selected, the variable value 0.8 specified by the user is substituted, and the function Pav(0.8) is set.

Here, because the setting data contains no dynamic feature control information, the dynamic feature control information setting means 234 outputs, via the user interface 26, a message prompting the user to select a dynamic feature, waits for the user's selection, and selects a preset default value once a predetermined time has elapsed. In general, when a speech synthesis data set is input from the data identification means 22, the speech synthesis control information setting means 23 waits for setting data from the user interface 26 and sets any item for which no setting data is input to a predetermined value.
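The fallback to preset values can be sketched as follows. The key names and the default values in `DEFAULTS` are hypothetical; the source says only that items without setting data are set to predetermined values. The wait for the predetermined time is omitted here.

```python
# Hypothetical preset values; the source does not list the actual defaults.
DEFAULTS = {
    "content_item": "News",
    "speaker": "speaker 01",
    "speech_rate": 1.0,
    "dynamic_feature": "preset",
    "static_feature": "preset",
}


def resolve_settings(setting_data, defaults=DEFAULTS):
    """Use the user's setting where one was input; otherwise keep the preset value."""
    resolved = dict(defaults)
    for key, value in setting_data.items():
        if key not in defaults:
            raise KeyError(f"unknown setting: {key}")
        resolved[key] = value
    return resolved


# Setting data as in the example above: no dynamic feature was input.
setting_data = {
    "content_item": "News",
    "speaker": "speaker 02",
    "speech_rate": 1.2,
    "static_feature": ("Pav", 0.8),
}
resolved = resolve_settings(setting_data)
```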

Returning to FIG. 1, the speech output means 27 utters, as synthesized speech, the synthesized speech waveform output as an electrical signal from the speech synthesis means 25; specifically, it is a loudspeaker. The speech output means 27 utters the synthesized speech according to the waveform of the electrical signal sent from the speech synthesis means 25. Here the speech output means 27 is used specifically for uttering synthesized speech and is described as separate from the user interface 26. However, because it can also output various information from the data receiving device 20 as speech or as notification sounds, it may instead be included in the user interface 26.

The receiving means 21 and the data identification means 22 together function as speech synthesis data set acquisition means, which acquires from the data transmitting device 10, as a speech synthesis data set, the unit information for specifying, for each type of voice quality, the speech units to be uttered as synthesized speech, and the control information for changing the waveform generated by connecting the speech units.
The receiving means 21 and the data identification means 22 also function as update information acquisition means, which acquires from the data transmitting device 10 the update information for updating the speech unit groups stored in the speech database 251.

The speech synthesis control information setting means 23 functions as unit information specifying means, which sets from which type of speech unit group stored in the speech database 251 the speech units are to be selected and specifies the unit information for selecting speech units of the same type from that group. It also functions as control information setting means, which sets the control information acquired by the speech synthesis data set acquisition means (21, 22).

The data receiving device 20 can be realized by having a general-purpose computer execute a program that operates the computer's arithmetic unit and storage device. This program (the synthesized speech utterance program) can be distributed over a communication line, or written to a recording medium such as a CD-ROM and distributed.

Next, the operation of the data transmitting device 10 and of the data receiving device 20 in the data broadcasting system 1 is described.
(Operation of the data transmitting device)
First, the operation of the data transmitting device 10 according to the embodiment of the present invention is described with reference to FIG. 12 (and FIGS. 1 to 11 as appropriate). FIG. 12 is a flowchart showing the operation of the data transmitting device according to the embodiment of the present invention.

<Update data acquisition step>
First, control means (not shown) of the data transmitting device 10 determines whether the content provider has supplied, via the various interfaces, update data for updating the speech database 123. If update data is present, the control means acquires it and the process proceeds to step SA2 (Yes in step SA1). If no update data arrives within a predetermined time, the process proceeds to step SA4 (No in step SA1). The predetermined time is measured, for example, as a relative time from when the power supply (not shown) of the data transmitting device 10 is turned on or from when a transmission start button (not shown) is pressed.

<Update information creation step>
On receiving the update data, the speech database update information creation means 14 creates update information for updating the speech database 123 (step SA2). For example, when the content provider inputs update data containing the speaker number "speaker 01", a speech unit number, reading information, and a speech unit, as shown in FIG. 5, the update information likewise contains the speaker, the speech unit number, the reading information, and the speech unit.

<Speech database update step>
The speech database update information creation means 14 updates the speech database 123 based on the created update information (step SA3). Although the speech unit number is described here as being input by the content provider together with the speech unit, the speech database update information creation means 14 may instead manage the numbering of speech units and assign a new speech unit number each time a speech unit is input.
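The alternative in which the device manages the numbering itself can be sketched as follows. The class name `UnitNumbering` and the rule of continuing from the largest existing number are assumptions; the source says only that a new number is assigned for each input speech unit.

```python
class UnitNumbering:
    """Hand out a fresh speech unit number for each newly input speech unit."""

    def __init__(self, existing=()):
        # Continue after the largest number already in use (assumed policy).
        self._next = max(existing, default=0) + 1

    def assign(self):
        number = self._next
        self._next += 1
        return number


numbering = UnitNumbering(existing=[1, 20])
first = numbering.assign()
second = numbering.assign()
```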

<Material text acquisition step>
Next, the material text acquisition means 11 receives data broadcasting material stored in storage means such as a data broadcasting material processing device (not shown), and acquires from this material the text to be uttered as synthesized speech (step SA4). It outputs the acquired text to the speech synthesis data set creation means 12 and outputs the data broadcasting material to the speech synthesis data set addition means 13.

<Speech synthesis data set creation step>
On receiving the text from the material text acquisition means 11, the speech synthesis data set creation means 12 starts the speech synthesis data set creation process (step SA5) and repeats it until steps SA6 to SA8 have been completed for all of the text (step SA9).

<Reading information acquisition step>
In the speech synthesis data set creation means 12, when the reading information acquisition means 121 acquires the text sent from the material text acquisition means 11, it recognizes the characters and analyzes the morphemes of the text with reference to various dictionaries such as a grammar dictionary and a language dictionary, acquires the reading information (step SA6), and passes the reading information to the unit selection means 122.

<Unit selection step>
The unit selection means 122 searches the speech database 123 according to the reading information passed from the reading information acquisition means 121, selects from the speech database 123 the speech units that match the reading information for each defined speaker (step SA7), and passes the selected speech units to the control information creation means 124.

<Control information creation step>
The control information creation means 124 receives the speech units selected from the speech database 123 by the unit selection means 122. It creates control information for varying, for each type of voice quality (such as each speaker), the synthesized speech waveform obtained by connecting those speech units, together with type information for distinguishing the types (step SA8).
The speech synthesis data set creation means 12 then outputs, to the speech synthesis data set addition means 13, a speech synthesis data set containing the reading information acquired by the reading information acquisition means 121, the speech units selected by the unit selection means 122, and the control information and type information created by the control information creation means 124.
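The loop over steps SA5 to SA9 can be sketched as follows. Everything concrete here is an assumption made for illustration: the text is pre-segmented into words, a small lookup table `LEXICON` stands in for the dictionary-based morphological analysis, `DATABASE` stands in for the speech database 123, units are represented by their numbers, and the control information is a placeholder.

```python
LEXICON = {"テキスト": "tekisuto", "です。": "desu."}   # hypothetical word -> reading table
DATABASE = {                                            # hypothetical: speaker -> reading -> unit number
    "speaker 01": {"tekisuto": 7, "desu.": 3},
    "speaker 02": {"tekisuto": 20, "desu.": 1},
}


def create_data_sets(texts, lexicon, database):
    """Build one speech synthesis data set per text (steps SA5 to SA9)."""
    data_sets = []
    for words in texts:                                  # SA5/SA9: repeat for all text
        readings = [lexicon[w] for w in words]           # SA6: reading information
        units = {                                        # SA7: units per defined speaker
            speaker: [table[r] for r in readings]
            for speaker, table in database.items()
        }
        control = {                                      # SA8: placeholder control information
            speaker: {"speech_rate": 1.0} for speaker in database
        }
        data_sets.append(
            {"reading": "".join(readings), "units": units, "control": control}
        )
    return data_sets


sets = create_data_sets([["テキスト", "です。"]], LEXICON, DATABASE)
```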

<Speech synthesis data set addition step>
The speech synthesis data set addition means 13 acquires the data broadcasting material output from the material text acquisition means 11, incorporates into it the speech synthesis data set created by the speech synthesis data set creation means 12 (step SA10), and passes the result to the multiplexing means 15.

<Multiplexing step>
The multiplexing means 15 generates multiplexed data by multiplexing the data broadcasting material incorporating the speech synthesis data set, output from the speech synthesis data set addition means 13, with the update information output from the speech database update information creation means 14, and sends the generated multiplexed data to the transmitting means 16 (step SA11).
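The idea of combining the data broadcasting material and the update information into one stream, and separating them again on the receiving side, can be sketched as follows. The tagged JSON records are purely an assumption for illustration; the source does not describe the multiplexing format, and the function names `multiplex` and `identify` are hypothetical.

```python
import json


def multiplex(material, update_info=None):
    """Pack the material and the optional update information as tagged records."""
    records = [{"kind": "material", "body": material}]
    if update_info is not None:
        records.append({"kind": "update", "body": update_info})
    return json.dumps(records, ensure_ascii=False).encode("utf-8")


def identify(multiplexed):
    """Separate multiplexed data back into (material, update information)."""
    material, update_info = None, None
    for record in json.loads(multiplexed.decode("utf-8")):
        if record["kind"] == "material":
            material = record["body"]
        elif record["kind"] == "update":
            update_info = record["body"]
    return material, update_info


material = {"text": "テキストです。", "data_set": {"speaker": "speaker 02", "units": [20, 1]}}
update = {"speaker": "speaker 01", "unit_number": 21, "reading": "desu."}
stream = multiplex(material, update)
received_material, received_update = identify(stream)
```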

<Transmission step>
The transmitting means 16 places the multiplexed data received from the multiplexing means 15, in which the data broadcasting material containing the speech synthesis data set and the update information are multiplexed, on a broadcast wave of a predetermined frequency and transmits it to the data receiving device 20 (step SA12).

Through the above processing, the data transmitting device 10 multiplexes the speech synthesis data set, which contains the speech units to be uttered as synthesized speech, with the update information, which represents the updates to the speech database 123 holding the speech units, and transmits the result to the data receiving device 20. The data receiving device 20 can therefore generate a synthesized speech waveform by connecting the speech units of the speech synthesis data set, shape the generated waveform according to the control information of the speech synthesis data set, and utter it as synthesized speech.

Finally, the operation of the data receiving device according to the embodiment of the present invention is described with reference to FIG. 13 (and FIGS. 1 to 11 as appropriate). FIG. 13 is a flowchart showing the operation of the data receiving device according to the embodiment of the present invention.

<Reception step>
In the data receiving device 20, when the receiving means 21 receives a broadcast wave of the predetermined frequency (step SB1), it performs analog-to-digital conversion and passes the multiplexed data, as digital data, to the data identification means 22.

<Data identification step>
When the received data passed from the receiving means 21 is multiplexed data transmitted from the data transmitting device 10, the data identification means 22 identifies the speech synthesis data set and the update information in the multiplexed data (step SB2), sends the speech synthesis data set to the speech synthesis control information setting means 23, and sends the update information to the speech database update means 24.

<Speech database update step>
When the data identification means 22 has identified update information (Yes in step SB3), the update information is given to the speech database update means 24, which updates the speech database 251 of the speech synthesis means 25 according to the update information sent from the data identification means 22 (step SB4). When there is no update information (No in step SB3), the process proceeds to step SB5.
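Steps SB3 and SB4 can be sketched as follows. The dictionary layout and the function name `update_database` are assumptions; the fields mirror the update information described for the transmitting side (speaker, speech unit number, reading information, and speech unit), and the waveform samples are dummies.

```python
def update_database(db, update_info):
    """Apply update information to the database; return True if it was updated."""
    if not update_info:                        # No in step SB3: nothing to update
        return False
    group = db.setdefault(update_info["speaker"], {})
    group[update_info["unit_number"]] = {      # step SB4: add or overwrite the unit
        "reading": update_info["reading"],
        "waveform": update_info["waveform"],
    }
    return True


db = {"speaker 01": {}}                        # hypothetical stand-in for speech database 251
updated = update_database(
    db,
    {"speaker": "speaker 01", "unit_number": 21, "reading": "desu.", "waveform": [0.0, 0.1]},
)
skipped = update_database(db, None)
```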

<Speech synthesis control information setting step>
When the speech synthesis data set identified by the data identification means 22 contains type information and control information (step SB5), the speech synthesis control information setting process is started (step SB6) and is repeated until steps SB6 to SB11 have been completed for all speech synthesis data sets (step SB12).

<Content item type selection step>
The content item type setting means 231 acquires the content items (type information) from the speech synthesis data set and displays them on the display means (not shown) of the user interface 26. When the content item type setting means 231 acquires the content item setting data input by the user via the input means (not shown) of the user interface 26, it selects the content item based on that setting data. If no content item is selected within a predetermined time, the process proceeds to the next step.
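Waiting for the user's selection for a predetermined time, and moving on if none arrives, can be sketched as follows. Modeling the input means as a `queue.Queue` and returning `None` on timeout are assumptions for illustration; the function name `wait_for_selection` is hypothetical.

```python
import queue


def wait_for_selection(inputs, options, timeout_s):
    """Return the user's choice, or None if no valid choice arrives in time."""
    try:
        choice = inputs.get(timeout=timeout_s)   # block for at most timeout_s seconds
    except queue.Empty:
        return None                              # timed out: proceed to the next step
    return choice if choice in options else None


options = ["News", "Weather", "Program commentary"]

user_input = queue.Queue()
user_input.put("News")                           # the user selected "News"
selected = wait_for_selection(user_input, options, timeout_s=0.05)

no_input = queue.Queue()                         # the user selected nothing
timed_out = wait_for_selection(no_input, options, timeout_s=0.05)
```

The same pattern would apply to the speaker, speech speed, dynamic feature, and static feature setting steps, each of which also proceeds when nothing is selected within the predetermined time.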

In this way, the content item type setting means 231 specifies, using the selected content item as a key, the unit information for selecting a speech unit group from the speech database 251 (step SB7). For example, for news content, speech units can be selected from the male-voice speech unit group.

<Speaker setting step>
The speaker setting means 232 acquires the type information relating to the speaker (speaker type and speaker number) from the speech synthesis data set and displays the speaker information on the display means (not shown) of the user interface 26. When the speaker setting means 232 acquires the speaker setting data input by the user via the input means (not shown) of the user interface 26, it selects the speaker based on that setting data. If no speaker is selected within a predetermined time, the process proceeds to the next step.

In this way, the speaker setting means 232 specifies the unit information for selecting, from the speech database 251, the speech unit group whose voice quality corresponds to the selected speaker (step SB8).

<Speech speed setting step>
The speech speed setting means 233 acquires the information relating to the speech speed from the speech synthesis data set and displays the speech speed setting items on the display means (not shown) of the user interface 26. When the speech speed setting means 233 acquires the speech speed setting data input by the user via the input means (not shown) of the user interface 26, it selects the speech speed based on that setting data. If no speech speed is selected within a predetermined time, the process proceeds to the next step.

In this way, the speech rate setting means 233 sets the speech rate control information, that is, the control information for changing the speech rate of the waveform generated by connecting speech segments (step SB9).
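
As a rough illustration of what speech rate control information could do to a connected waveform, the Python sketch below rescales the duration of a sample list by linear interpolation. This is an assumption made for illustration only: the specification does not state how the rate is changed, the function name `change_rate` and the `rate` parameter are hypothetical, and this naive resampling also shifts pitch, which a practical synthesizer would normally avoid with a pitch-preserving method.

```python
from __future__ import annotations


def change_rate(waveform: list[float], rate: float) -> list[float]:
    """Return the waveform played `rate` times faster (hypothetical helper).

    rate > 1 shortens the waveform, rate < 1 lengthens it.
    Naive linear interpolation: duration and pitch change together.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if len(waveform) < 2:
        return list(waveform)

    last = len(waveform) - 1
    out_len = max(1, round(len(waveform) / rate))
    out: list[float] = []
    for i in range(out_len):
        pos = i * rate                 # fractional read position in the input
        left = min(int(pos), last)     # sample at or before the position
        right = min(left + 1, last)    # next sample, clamped at the end
        frac = pos - left
        out.append(waveform[left] * (1 - frac) + waveform[right] * frac)
    return out
```

For instance, `change_rate(samples, 2.0)` halves the number of samples, and `change_rate(samples, 0.5)` doubles it.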

<Dynamic feature control setting step>
The dynamic feature control information setting means 234 acquires the dynamic feature control information from the speech synthesis data set and displays the dynamic feature setting items on the display means (not shown) of the user interface 26. When the dynamic feature control information setting means 234 acquires the dynamic feature control setting data entered by the user through the input means (not shown) of the user interface 26, it selects the dynamic feature control information based on that setting data. If no dynamic feature control information is selected within a predetermined time, processing moves on to the next step.

In this way, the dynamic feature control information setting means 234 sets the dynamic feature control information, that is, the control information for changing how the waveform generated by connecting speech segments moves as time elapses (step SB10).

<Static feature control setting step>
The static feature control information setting means 235 acquires the static feature control information from the speech synthesis data set and displays the static feature setting items on the display means (not shown) of the user interface 26. When the static feature control information setting means 235 acquires the static feature control setting data entered by the user through the input means (not shown) of the user interface 26, it selects the static feature control information based on that setting data. If no static feature control information is selected within a predetermined time, processing moves on to the next step.

In this way, the static feature control information setting means 235 sets the static feature control information, that is, the control information for varying the static feature, which alters the shape, over a fixed duration at a given time, of the waveform generated by connecting speech segments (step SB11).
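
Steps SB7 to SB11 all follow one pattern: show the options carried in the speech synthesis data set, adopt the user's choice if one arrives in time, and otherwise move on to the next step. The Python sketch below illustrates that pattern under stated assumptions. The setting names, the `ask` callback, and the use of `None` for an unselected item are hypothetical; the specification does not say what value applies when nothing is selected.

```python
from __future__ import annotations

from typing import Callable, Optional, Sequence

# Hypothetical setting names, one per step SB7 to SB11.
SETTING_STEPS = (
    "content_item",
    "speaker",
    "speech_rate",
    "dynamic_feature",
    "static_feature",
)


def run_setting_steps(
    offered: dict[str, Sequence[str]],
    ask: Callable[[str, Sequence[str]], Optional[str]],
) -> dict[str, Optional[str]]:
    """Walk the setting steps in order and collect the user's choices.

    `offered` maps each setting to the options found in the data set.
    `ask` displays the options and returns the user's choice, or None
    when the user did not choose within the allowed time.
    """
    chosen: dict[str, Optional[str]] = {}
    for name in SETTING_STEPS:
        options = offered.get(name, ())
        answer = ask(name, options) if options else None
        # Keep only answers that are among the offered options;
        # anything else leaves the item unset and the loop moves on.
        chosen[name] = answer if answer in options else None
    return chosen
```

In a receiver, `ask` would wrap the display and input means of the user interface together with a timer; here it is left as a plain callback so the control flow stays visible.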

<Speech synthesis step>
The speech synthesis means 25 acquires the speech synthesis data sent from the speech synthesis control information setting means 23 (step SB13). It then takes the type information and the speech segment numbers out of the speech synthesis data, searches the speech database 251 that corresponds to the type information using each speech segment number as a key, and retrieves the speech segments. The speech synthesis means 25 connects the speech segments retrieved from the speech database 251 to generate the waveform of the synthesized speech (step SB14). After that, the speech synthesis means 25 takes the various kinds of control information out of the speech synthesis data, deforms the generated waveform accordingly, and outputs the deformed waveform to the speech output means 27 (step SB15).
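
The flow of steps SB13 to SB15 (look up segments by number within the group named by the type information, connect them, then deform the result) can be sketched in Python as follows. This is a minimal illustration, not the device's implementation: the `SynthesisData` class, the nested-dictionary database layout, and the single `gain` value standing in for the control information are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass

Waveform = list[float]


@dataclass
class SynthesisData:
    """Hypothetical container for the speech synthesis data."""
    kind: str                   # type information, e.g. a voice-quality label
    segment_numbers: list[int]  # speech segment numbers, in utterance order
    gain: float = 1.0           # stand-in for the control information


def synthesize(db: dict[str, dict[int, Waveform]], data: SynthesisData) -> Waveform:
    """Retrieve, connect, and deform speech segments."""
    # Pick the speech segment group named by the type information.
    group = db[data.kind]

    # Retrieve each segment by its number and connect them in order (SB14).
    waveform: Waveform = []
    for number in data.segment_numbers:
        waveform.extend(group[number])

    # Deform the connected waveform according to the control information
    # (SB15); a plain amplitude scale is used here purely as a placeholder.
    return [sample * data.gain for sample in waveform]
```

A real deformation would apply the speech rate, dynamic feature, and static feature control information in place of the placeholder gain.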

<Speech output step>
The speech output means 27 utters, as synthesized speech, the waveform that it receives as an electrical signal from the speech synthesis means 25 (step SB16).

Through the above processing, the data reception device 20 sets a different voice quality for each type, such as speaker, based on the speech synthesis data set sent from the data transmission device 10. Because the user can easily deform the waveform of the synthesized speech obtained by connecting speech segments of the set voice quality simply by specifying control information, synthesized speech that matches the user's preference can be uttered.

[Supplement]
The reception means 21 of the above embodiment was described as including electronic circuits such as an amplifier circuit and an A/D converter. These circuits may be built in hardware from electronic components and semiconductor devices, or may be realized through the cooperation of a program that implements equivalent functions and a CPU that executes it.

In the above embodiment, the segment information of the speech synthesis data set was described as information containing speech segment numbers and speech segments, with the data reception device 20 provided with the speech database 251 that stores speech segments. Alternatively, the data reception device 20 may have no speech database, and the speech segments to be uttered as synthesized speech may be transmitted from the data transmission device 10 to the data reception device 20 each time.

In the above embodiment, where the update data and the update information differ, the speech database update information creation means 14 of the data transmission device 10 was described as supplementing the omitted speech segment waveform data. Alternatively, the update information transmitted from the data transmission device 10 to the data reception device 20 may omit the same speech segments as the update data, and the speech database update means 24 of the data reception device 20 may obtain the corresponding speech segments from the speech database 251, fill them in, and then update the speech database 251.
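
The receiver-side variant just described, in which omitted segments are filled in from the local speech database before the update is applied, can be sketched in Python as below. The update format is an assumption made for illustration: each entry is either a waveform or a hypothetical `(group, segment number)` reference to a segment the receiver already holds. The specification does not define how omitted segments are identified.

```python
from __future__ import annotations

from typing import Union

Waveform = list[float]
Reference = tuple[str, int]  # hypothetical: (existing group, segment number)
Database = dict[str, dict[int, Waveform]]


def apply_update(
    db: Database,
    group: str,
    update: dict[int, Union[Waveform, Reference]],
) -> None:
    """Update `group` in the local database from partly omitted update info."""
    resolved: dict[int, Waveform] = {}
    for number, entry in update.items():
        if isinstance(entry, tuple):
            # The waveform was omitted from the update information:
            # supplement it from the receiver's own database.
            src_group, src_number = entry
            resolved[number] = list(db[src_group][src_number])
        else:
            # The waveform was transmitted in full.
            resolved[number] = list(entry)

    # Write only after every entry resolved, so a missing reference
    # raises KeyError and leaves the database untouched.
    db.setdefault(group, {}).update(resolved)
```

Resolving every entry before writing is a design choice of this sketch; it keeps a failed lookup from leaving the database half-updated.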

In the above embodiment, the speech database 123 was described as being built into the segment selection means 122, but it may instead be provided outside the segment selection means 122. Likewise, the speech database 251 was described as being built into the speech synthesis means 25, but it may instead be provided outside the speech synthesis means 25.

In the above embodiment, the functions and operation of the data transmission device 10 were described for a device configuration that fits within a single housing. In practice, however, the data transmission device 10 is usually provided at a broadcasting station as large-scale equipment that would more properly be called a data transmission system. Apart from the fact that its components are large and do not fit within a single housing, such a data transmission system is essentially identical to the data transmission device 10 in functional configuration and operation. For convenience, therefore, this specification and the claims describe the functional configuration and operation in terms of a device that fits within a single housing. This does not exclude the configuration and functions of a large-scale data transmission system from the scope of the present invention.

The above embodiment was described in terms of transmission and reception between one data transmission device 10 and one data reception device 20, but there may be a plurality of either or both. For example, one data transmission device 10 may transmit data to a large, unspecified number of data reception devices 20 by wireless means such as terrestrial broadcast waves or by wired means such as optical fiber cable. Likewise, one data reception device 20 may receive data from a plurality of data transmission devices 10 and register the rules defined by each data transmission device 10.

The above embodiment was described using the example of the data broadcasting system 1, which consists of the data reception device 20 as the synthesized speech utterance device and the data transmission device 10 as the speech synthesis data set generation device. However, the speech synthesis data set generation device and the synthesized speech utterance device of the present invention are not limited to use in a data broadcasting system; they may also be used, for example, in a data communication system (not shown) that operates over a network such as the Internet.

Whereas the data broadcasting system 1 transmits the speech synthesis data set and the update information on broadcast waves, this data communication system (not shown) distributes them over a network such as the Internet. Accordingly, the processing by which the speech synthesis data set generation device creates the speech synthesis data set and the update information is the same in both systems. The processing by which the synthesized speech utterance device utters synthesized speech according to the speech synthesis data set and updates the speech database according to the update information is also the same in both systems.

The data communication system (not shown) can therefore be implemented in the same way as the configuration and operation of the data broadcasting system of the above embodiment, and its description is omitted. Note, however, that in the data broadcasting system 1 of the above embodiment the transmission means 16 transmits broadcast waves, whereas in a data communication system the transmission means 16 should be provided with an interface for connecting to various networks and should transmit the data over them.

Because the above embodiment was described as a data broadcasting system, the speech synthesis data set acquisition means was constituted by the reception means 21 and the data identification means 22. In a data communication system, it can likewise be constituted by the reception means 21 and the data identification means 22. In that case, however, the reception means 21 must be provided with an interface for connecting to various networks.

In the data transmission device 10 of the above embodiment, reading information is acquired from text and speech segments are selected based on that reading information. Alternatively, live speech may be picked up by a microphone and passed through speech recognition to obtain the reading information. In this case, the material text acquisition means 11 of the above embodiment should include speech input means (not shown) that picks up speech and inputs it as an electrical signal, and speech recognition means (not shown) that refers to various dictionaries to recognize text from the waveform of that electrical signal. The reading information acquisition means 121 then acquires the text recognized by the speech recognition means and obtains the reading information in the same way as in the above embodiment. Because the speech recognition means generates reading information during recognition, the reading information acquisition means 121 may also acquire that reading information directly.

The above embodiment described content serving as data broadcasting material that includes video, audio data, and text. However, the content is not limited to this. As long as it includes text to be uttered as synthesized speech (including recorded speech data), it may include still images or any combination of these.

In the data broadcasting system 1 of the above embodiment, the speech synthesis data set and the update information are provided from the data transmission device 10, as the speech synthesis data set generation device, to the data reception device 20, as the synthesized speech utterance device, by wireless broadcast waves. However, they may also be provided by wired broadcasting (cable television) facilities, by wired or wireless data transmission paths, or by wired or wireless telephone networks. Furthermore, portable removable media storing the speech synthesis data set and the update information may be delivered to the user of the synthesized speech utterance device by mail or a similar method, and the synthesized speech utterance device (data reception device) may read the speech synthesis data set and the update information from the removable media.

In particular, with the data broadcasting system 1 of the above embodiment or the data communication system (not shown), the speech synthesis data set can be provided from the speech synthesis data set generation device (data transmission device) to the synthesized speech utterance device (data reception device) in near real time. With removable media, various content can be delivered through synthesized speech to areas that wired and wireless services do not reach. Furthermore, when a movie is stored and provided on a DVD-ROM (Digital Versatile Disk-Read Only Memory), storing the speech database itself along with the segment information of the speech synthesis data set makes it possible to have the actors' voices uttered as synthesized speech. The synthesized speech utterance program itself may also be stored on the DVD-ROM so that the content can be played back interactively on an ordinary computer.

A block diagram showing the configuration of the data broadcasting system according to an embodiment of the present invention.
A diagram explaining how the reading information acquisition means of the speech synthesis data set creation means in the data transmission device acquires reading information.
A diagram explaining how the segment selection means of the speech synthesis data set creation means in the data transmission device selects speech segments.
A diagram explaining the concept of the speech synthesis data set created by the speech synthesis data set creation means in the data transmission device.
A diagram explaining how update information is created when the speech database is updated.
A diagram explaining the function of the data identification means in the data reception device.
A diagram explaining how the speech synthesis data set is analyzed.
A diagram explaining the function of the speech database update means in the data reception device.
A diagram explaining speech synthesis by the speech synthesis means in the data reception device.
A diagram explaining how speech synthesis control information is set through the user interface of the data reception device.
A diagram explaining how the speech synthesis control information setting means in the data reception device sets speech synthesis control information.
A flowchart showing the operation of the data transmission device according to the embodiment of the present invention.
A flowchart showing the operation of the data reception device according to the embodiment of the present invention.

Explanation of symbols

1 Data broadcasting system
10 Data transmission device (speech synthesis data set generation device)
11 Material text acquisition means
12 Speech synthesis data set creation means
121 Reading information acquisition means
122 Segment selection means
123 Speech database
124 Control information creation means
13 Speech synthesis data set addition means (speech synthesis data set providing means)
14 Speech database update information creation means (update information providing means)
15 Multiplexing means (speech synthesis data set providing means, update information providing means)
16 Transmission means (speech synthesis data set providing means, update information providing means)
20 Data reception device (synthesized speech utterance device)
21 Reception means (speech synthesis data set acquisition means, update information acquisition means)
22 Data identification means (speech synthesis data set acquisition means, update information acquisition means)
23 Speech synthesis control information setting means
231 Content item type setting means
232 Speaker setting means
233 Speech rate setting means
234 Dynamic feature control information setting means
235 Static feature control information setting means
24 Speech database update means
25 Speech synthesis means
251 Speech database
26 User interface
27 Speech output means

Claims

A synthesized speech utterance device that acquires a speech synthesis data set from a speech synthesis data set generation device and utters it as synthesized speech, the speech synthesis data set being generated from: type information for classifying speech segment groups, each composed of a plurality of speech segments of the same voice quality, a speech segment being a minimum-unit waveform uttered as synthesized speech; segment information associated with the speech segments for each speech segment group; and at least one kind of control information that defines how to deform the waveform of the synthesized speech generated by connecting the speech segments associated on the basis of the type information and the segment information, the device comprising:
a speech database that stores the speech segment groups by voice quality;
speech synthesis data set acquisition means for acquiring the speech synthesis data set from the speech synthesis data set generation device;
speech synthesis control information setting means for receiving the speech synthesis data set acquired by the speech synthesis data set acquisition means, selecting the type information of the speech synthesis data set to specify the speech segment group to be extracted, selecting the control information of the speech synthesis data set to specify the waveform deformation method, and setting the selected type information and control information, together with the segment information, as speech synthesis data; and
speech synthesis means for receiving the speech synthesis data from the speech synthesis control information setting means, extracting speech segments according to the segment information from the speech segment group of the speech database specified by the type information, connecting the extracted speech segments to generate the waveform of synthesized speech, deforming the waveform according to the waveform deformation method specified by the selected control information, and uttering the deformed waveform as synthesized speech.

The synthesized speech utterance device according to claim 1, wherein the speech synthesis control information setting means comprises at least one of:
content item type setting means for setting the kind of speech segment group stored in the speech database according to the kind of content item included in the type information; and
speaker setting means for setting the kind of speech segment group stored in the speech database according to the difference between speakers included in the type information.

The synthesized speech utterance device according to claim 1, wherein the speech synthesis control information setting means comprises at least one of:
speech rate setting means for setting speech rate control information, which is control information for changing the speech rate of the waveform generated by connecting the speech segments;
dynamic feature control information setting means for setting dynamic feature control information, which is control information for varying the dynamic feature that changes how the waveform generated by connecting the speech segments moves as time elapses; and
static feature control information setting means for setting static feature control information, which is control information for varying the static feature that changes the shape, over a fixed duration at a given time, of the waveform generated by connecting the speech segments.

The synthesized speech utterance device according to any one of claims 1 to 3, further comprising:
update information acquisition means for acquiring, from the speech synthesis data set generation device, update information for updating the speech segment groups stored in the speech database; and
speech database update means for updating the speech database according to the update information acquired by the update information acquisition means.

A synthesized speech utterance program for acquiring a speech synthesis data set from a speech synthesis data set generation device and uttering it as synthesized speech, the speech synthesis data set being generated from: type information for classifying speech segment groups, each composed of a plurality of speech segments of the same voice quality, a speech segment being a minimum-unit waveform uttered as synthesized speech; segment information associated with the speech segments for each speech segment group; and at least one kind of control information that defines how to deform the waveform of the synthesized speech generated by connecting the speech segments associated on the basis of the type information and the segment information, the program functioning as:
speech synthesis data set acquisition means for acquiring the speech synthesis data set from the speech synthesis data set generation device;
speech synthesis control information setting means for receiving the speech synthesis data set acquired by the speech synthesis data set acquisition means, selecting the type information of the speech synthesis data set to specify the speech segment group to be extracted from a speech database that stores the speech segment groups by voice quality, selecting the control information of the speech synthesis data set to specify the waveform deformation method, and setting the selected type information and control information, together with the segment information, as speech synthesis data; and
speech synthesis means for receiving the speech synthesis data from the speech synthesis control information setting means, extracting speech segments according to the segment information from the speech segment group of the speech database specified by the type information, connecting the extracted speech segments to generate the waveform of synthesized speech, deforming the waveform according to the waveform deformation method specified by the selected control information, and uttering the deformed waveform as synthesized speech.

A speech synthesis data set generation device that generates a speech synthesis data set and provides it to a synthesized speech utterance device that utters synthesized speech based on the speech synthesis data set, the speech synthesis data set being generated from: type information for classifying speech segment groups, each composed of a plurality of speech segments of the same voice quality, a speech segment being a minimum-unit waveform uttered as synthesized speech; segment information associated with the speech segments for each speech segment group; and at least one kind of control information that defines how to deform the waveform of the synthesized speech generated by connecting the speech segments associated on the basis of the type information and the segment information, the device comprising:
a speech database that stores the speech segment groups by voice quality;
material text acquisition means for acquiring material text to be uttered as synthesized speech;
reading information acquisition means for acquiring reading information of the text acquired by the material text acquisition means;
segment selection means for selecting from the speech database, for each kind of voice quality, the speech segments for the reading information acquired by the reading information acquisition means, and setting them as segment information;
control information creation means for creating the type information and the control information; and
speech synthesis data set providing means for providing the synthesized speech utterance device with the segment information selected by the segment selection means and the type information and control information created by the control information creation means as a speech synthesis data set.

The speech synthesis data set generation device according to claim 6, further comprising:
speech database update information creation means for creating update information for updating the speech segments stored in the speech database and for updating the speech segments of the speech database with the update information; and
update information providing means for providing the update information created by the speech database update information creation means to the synthesized speech utterance device.

A speech synthesis data set generation program for generating a speech synthesis data set and providing it to a synthesized speech utterance device that utters synthesized speech based on the speech synthesis data set, the speech synthesis data set being generated from: type information for classifying speech segment groups, each composed of a plurality of speech segments of the same voice quality, a speech segment being a minimum-unit waveform uttered as synthesized speech; segment information associated with the speech segments for each speech segment group; and at least one kind of control information that defines how to deform the waveform of the synthesized speech generated by connecting the speech segments associated on the basis of the type information and the segment information, the program functioning as:
material text acquisition means for acquiring material text to be uttered as synthesized speech;
reading information acquisition means for acquiring reading information of the text acquired by the material text acquisition means;
segment selection means for selecting, for each kind of voice quality, the speech segments for the reading information acquired by the reading information acquisition means from a speech database that stores the speech segment groups by voice quality;
control information creation means for creating the type information and the control information; and
speech synthesis data set providing means for providing the synthesized speech utterance device with the segment information selected by the segment selection means and the type information and control information created by the control information creation means as a speech synthesis data set.