JP2003308083A

JP2003308083A - Voice synthesizing processor

Info

Publication number: JP2003308083A
Application number: JP2002114771A
Authority: JP
Inventors: Yasuo Okuya; 泰夫奥谷
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2002-04-17
Filing date: 2002-04-17
Publication date: 2003-10-31

Abstract

PROBLEM TO BE SOLVED: To perform voice synthesis on a terminal with small-scale hardware resources while maintaining the quality of the synthesized voice.

SOLUTION: A voice synthesizing processor is communicably connected to a server 20 that stores voice piece data, and performs voice synthesis on character strings. The processor determines the voice pieces required to output a character string as synthesized voice and sends the server a transmission request for the data of those voice pieces. The server transmits the voice piece data in response to the request; the processor receives the data and generates synthesized voice data for the character string from it.

COPYRIGHT: (C)2004,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to speech synthesis technology.

[0002]

[Prior Art] In recent years, waveform editing and concatenation has become the mainstream approach to speech synthesis, as described, for example, in Japanese Patent Laid-Open No. 10-49193. This approach uses a synthesis unit dictionary that stores waveform data for speech units, and creates the waveform data of the synthesized speech by concatenating the waveform data of speech units stored in that dictionary.

[0003] Because such a synthesis unit dictionary often requires a relatively large storage capacity, techniques for compacting the dictionary have also been proposed so that speech synthesis can run with fewer resources, as described, for example, in Japanese Patent Laid-Open No. 11-95796.

[0004] Techniques for outputting synthesized speech on portable terminals such as mobile phones have also been proposed. Because a portable terminal has limited hardware resources, systems have been proposed in which a server on a network performs the speech synthesis processing and the resulting synthesized speech data is sent to the portable terminal, which then outputs the speech. This reduces the load on the terminal.

[0005]

[Problems to Be Solved by the Invention] However, when a terminal with limited hardware resources, such as a portable terminal, outputs synthesized speech and all of the speech synthesis processing is performed by a server on the network as described above, the computational load on the server can be expected to grow as multiple portable terminals request synthesis from it. The number of servers would then have to be increased, and the cost of doing so becomes a problem.

[0006] Conversely, a terminal such as a portable terminal does not necessarily have enough hardware resources to perform all of the speech synthesis processing itself; in particular, it often cannot provide enough memory to hold the synthesis unit dictionary. The dictionary could be compacted as described in Japanese Patent Laid-Open No. 11-95796, but the accompanying degradation in synthesized speech quality is unavoidable.

[0007] Accordingly, an object of the present invention is to provide a technique that allows a terminal with limited hardware resources to perform speech synthesis processing while maintaining the quality of the synthesized speech.

[0008]

[Means for Solving the Problems] According to the present invention, there is provided a speech synthesis processing device that is communicably connected to a server storing speech unit data and that performs speech synthesis processing on a character string, the device comprising: determining means for determining the speech units needed to output the character string as synthesized speech; transmitting means for transmitting, to the server, a transmission request for the data of the speech units determined by the determining means; receiving means for receiving the speech unit data transmitted from the server in response to the transmission request; and creating means for creating synthesized speech data for the character string based on the received speech unit data.

[0009] According to the present invention, there is also provided a program that causes a computer communicably connected to a server storing speech unit data to perform speech synthesis processing on a character string, the program causing the computer to execute: a determining step of determining the speech units needed to output the character string as synthesized speech; a transmitting step of transmitting, to the server, a transmission request for the data of the speech units determined in the determining step; a receiving step of receiving the speech unit data transmitted from the server in response to the transmission request; and a creating step of creating synthesized speech data for the character string based on the received speech unit data.

[0010] According to the present invention, there is also provided a server communicably connected to a speech synthesis processing device that performs speech synthesis processing on a character string, the server comprising: storage means storing speech unit data; receiving means for receiving, from the speech synthesis processing device, a transmission request for the speech unit data; and transmitting means for transmitting the speech unit data to the speech synthesis processing device in response to the received transmission request.

[0011] According to the present invention, there is also provided a data providing method using a server that is communicably connected to a speech synthesis processing device that performs speech synthesis processing on a character string and that stores speech unit data, the method comprising: a receiving step in which the server receives, from the speech synthesis processing device, a transmission request for the speech unit data; and a transmitting step in which the server transmits the speech unit data to the speech synthesis processing device in response to the received transmission request.

[0012] According to the present invention, there is also provided a speech synthesis processing system comprising a server storing speech unit data and a speech synthesis processing device that is communicably connected to the server and performs speech synthesis processing on a character string, wherein the speech synthesis processing device comprises: determining means for determining the speech units needed to output the character string as synthesized speech; transmitting means for transmitting, to the server, a transmission request for the data of the speech units determined by the determining means; receiving means for receiving the speech unit data transmitted from the server in response to the transmission request; and creating means for creating synthesized speech data for the character string based on the received speech unit data.

[0013]

[Embodiments of the Invention] Preferred embodiments of the present invention are described below with reference to the drawings.

[0014] <First Embodiment> <System Configuration> FIG. 1 is a system diagram of a speech synthesis processing system 1 according to an embodiment of the present invention. The speech synthesis processing system 1 includes a mobile phone 10, which serves as a speech synthesis processing device that performs speech synthesis processing on character strings, and a server 20, which stores a speech unit dictionary containing speech unit data. The two can communicate via a base station 2 and a network 3, of which the Internet is a typical example. The configuration of the server 20 is described first, with reference to FIG. 1.

[0015] The CPU 21 controls the server 20 as a whole and, in this embodiment, executes the processing described later. The RAM 22 is a memory used as a work area for the CPU 21. The ROM 23 is a memory that stores fixed data such as the control program executed by the CPU 21 and the data used by that program. In this embodiment, the ROM 23 stores, among other things, a program implementing the communication procedure with the mobile phone 10 and a program for searching the speech unit dictionary 24.

[0016] The speech unit dictionary 24 is dictionary data containing the waveform data of each speech unit, and is stored on a storage device such as a hard disk or MO disk inside or outside the server 20. FIG. 7 shows an example of the data recorded in the speech unit dictionary 24.

[0017] In the example of FIG. 7, the speech unit is the diphone, and digital waveform data is recorded for each diphone at three quantized fundamental-frequency levels: high, medium, and low. This embodiment is described on the assumption that the speech unit is the diphone, but it is not limited to this. The speech unit dictionary 24 can be organized in various ways; the unit may instead be the phoneme, the triphone, or the like, and different kinds of units may be mixed. Nor is it necessary to partition the data by fundamental frequency: data for only a single fundamental frequency may be stored for each speech unit. The data may also be partitioned by prosodic information other than fundamental frequency, such as duration or power.
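One way to picture the dictionary layout described for FIG. 7 is a mapping keyed by diphone and fundamental-frequency level. The following is a minimal sketch under stated assumptions: the names `UnitKey`, `F0_LEVELS`, and `build_dictionary`, and the sample values, are hypothetical and do not come from the specification.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (diphone label, quantized fundamental-frequency level).
UnitKey = Tuple[str, str]

# Three quantization levels, as in the FIG. 7 example; the labels are assumed.
F0_LEVELS = ("high", "mid", "low")


def build_dictionary(diphones: List[str]) -> Dict[UnitKey, List[int]]:
    """Create one empty waveform slot per (diphone, F0 level) pair."""
    return {(d, level): [] for d in diphones for level in F0_LEVELS}


unit_dictionary = build_dictionary(["a.i", "o.h"])
# Placeholder samples; a real dictionary would hold recorded waveform data.
unit_dictionary[("a.i", "mid")] = [0, 12, -7, 3]
```

Partitioning by duration or power instead would only change what goes into the key tuple.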

[0018] The network interface 25 is an interface for communicating with other terminals via the network 3; it allows the server 20 to exchange information with the mobile phone 10.

[0019] Next, the configuration of the mobile phone 10 is described. FIG. 2 is a block diagram showing the hardware configuration of the mobile phone 10. Although this embodiment uses the mobile phone 10 as an example of the speech synthesis processing device, a portable terminal such as a PDA, a personal computer, or the like may be used instead. Likewise, although this embodiment describes the mobile phone 10 communicating wirelessly, wired communication may of course be used.

[0020] The CPU 11 controls the mobile phone 10 as a whole and, in this embodiment, executes the speech synthesis processing described later. The RAM 12 is a memory used as a work area for the CPU 11. The ROM 13 is a memory that stores fixed data such as the control program executed by the CPU 11 and the data used by that program. The program for speech synthesis processing and the data of the various dictionaries it requires can be stored in the ROM 13. The flash memory 14 is a memory that stores e-mail received by the mobile phone 10 and other information. Some of the programs and data stored in the ROM 13 may instead be stored in the flash memory 14.

[0021] The interface 15a serves as the interface between the CPU 11 and the operation buttons 15. The operation buttons 15, which consist of key switches and the like, let the user give various instructions to the mobile phone 10. The display 16 consists of, for example, a liquid crystal display device, and presents information to the user; its display is controlled by the CPU 11 via the display driver 16a.

[0022] The communication device 17 is a device for communicating with other terminals via the base station 2 and the network 3, and includes electronic circuitry such as an RF circuit for wireless communication with the base station 2. It allows the mobile phone 10 to exchange information with the server 20.

[0023] The A/D converter 18a is a circuit that converts an analog signal into a digital signal. In this embodiment, it converts the analog voice signal input from the microphone 18 into a digital signal and outputs it to the CPU 11.

[0024] The D/A converter 19a is a circuit that converts a digital signal into an analog signal. In this embodiment, it is used, among other things, to convert the digital synthesized speech data output from the CPU 11 into an analog signal. The speaker 19 outputs the analog signal from the D/A converter 19a as sound and is, for example, a headphone. Needless to say, an amplifier circuit or the like may be provided between the D/A converter 19a and the speaker 19.

[0025] <System Processing> Next, the processing in the speech synthesis processing system 1 configured as above is described with reference to FIG. 3. FIG. 3 is a flowchart showing the processing performed by the mobile phone 10 and the server 20.

[0026] In step S301, the mobile phone 10 determines whether there is text to be read aloud. Examples of such text include the body of an e-mail received by the mobile phone 10 and the text of a web page provided on the network 3 and retrieved by the mobile phone 10. If there is text to be read aloud, processing proceeds to S302; otherwise, it ends.

[0027] In S302, the mobile phone 10 performs speech unit determination processing, which determines the speech units needed to output the character string contained in the text as synthesized speech. FIG. 4 is a flowchart showing the speech unit determination processing.

[0028] In S401, one sentence is extracted from the text to be read aloud. The example here assumes that the sentence consisting of the character string 「おはよう。」 ("Ohayou.", meaning "Good morning.") shown in FIG. 8(a) has been extracted. In S402, language analysis is performed on the extracted sentence to identify and store the reading, accent, and other information needed for speech synthesis. FIG. 8(b) shows the reading (diphone sequence) created for "Ohayou.": "-.o", "o.h", "h.a", "a.y", "y.o", "o.o", "o.-".
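The diphone sequence given for FIG. 8(b) can be reproduced by pairing each phoneme with its neighbor and padding both ends with a silence marker. This is a minimal sketch, assuming the language analysis of S402 has already produced the phoneme list; the function name `to_diphones` is hypothetical.

```python
from typing import List


def to_diphones(phonemes: List[str], boundary: str = "-") -> List[str]:
    """Pair adjacent phonemes, with a boundary marker at both ends."""
    padded = [boundary] + phonemes + [boundary]
    return [f"{left}.{right}" for left, right in zip(padded, padded[1:])]


# Phonemes for "ohayou" with the long vowel written as two "o" phonemes.
diphones = to_diphones(["o", "h", "a", "y", "o", "o"])
print(diphones)
```

A sequence of n phonemes always yields n + 1 diphones, which is why six phonemes produce the seven entries listed in the text.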

[0029] In S403, the language analysis result from S402 is used to estimate the prosodic information needed for speech synthesis, such as fundamental frequency, phoneme duration, and power; this information is created and stored. FIG. 8(c) shows an example of prosodic information, giving the duration (broken line) and fundamental frequency (solid line) for each diphone.

[0030] In S404, based on the language analysis result from S402 and the prosodic information from S403, the types of speech units needed to create the synthesized speech waveform data for the extracted sentence are listed, and the list is created and stored. FIG. 8(c) shows the speech unit list created for "Ohayou.", giving each diphone together with its fundamental frequency (quantized to three levels).
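The list-building step of S404 can be sketched as quantizing each estimated fundamental frequency to one of three levels and pairing it with its diphone. The thresholds of 120 Hz and 180 Hz below are illustrative assumptions only; the specification gives no numeric boundaries, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_f0(f0_hz: float, low_max: float = 120.0, mid_max: float = 180.0) -> str:
    """Map an F0 estimate to a level; the thresholds are assumed values."""
    if f0_hz < low_max:
        return "low"
    if f0_hz < mid_max:
        return "mid"
    return "high"


def build_unit_list(diphones: List[str], f0_values: List[float]) -> List[Tuple[str, str]]:
    """Pair each diphone with the quantized level of its estimated F0."""
    if len(diphones) != len(f0_values):
        raise ValueError("one F0 estimate is required per diphone")
    return [(d, quantize_f0(f)) for d, f in zip(diphones, f0_values)]


unit_list = build_unit_list(["-.o", "o.h", "h.a"], [110.0, 150.0, 200.0])
```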

[0031] Returning to FIG. 3, in S303 the mobile phone 10 sends the speech unit list created in S404 to the server 20 and requests that it transfer the corresponding speech unit data.

[0032] In S311, the server 20 receives the transfer request from the mobile phone 10. In response, in S312 it searches the speech unit dictionary 24 and extracts the data of the requested speech units. In S313, the server 20 transmits the extracted speech unit data to the mobile phone 10.
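The server-side lookup of S312 amounts to extracting the requested entries from the dictionary. The sketch below leaves out the network transport of S311 and S313 entirely. Raising an error for a unit that is not in the dictionary is an assumption, since the specification does not say how that case is handled, and `handle_transfer_request` is a hypothetical name.

```python
from typing import Dict, List, Tuple

UnitKey = Tuple[str, str]  # hypothetical (diphone, F0 level) key


def handle_transfer_request(
    dictionary: Dict[UnitKey, List[int]], requested: List[UnitKey]
) -> Dict[UnitKey, List[int]]:
    """Look up every requested unit and return the data to send back."""
    missing = [key for key in requested if key not in dictionary]
    if missing:
        # Assumed behavior: reject the request rather than send a partial reply.
        raise KeyError(f"units not in dictionary: {missing}")
    return {key: dictionary[key] for key in requested}


server_dictionary = {("o.h", "mid"): [1, 2], ("h.a", "high"): [3, 4], ("a.y", "low"): [5]}
response = handle_transfer_request(server_dictionary, [("o.h", "mid"), ("a.y", "low")])
```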

[0033] In S304, the mobile phone 10 receives the speech unit data transmitted from the server 20, and in S305 it creates the synthesized speech data for the extracted sentence based on the received data. In S305, for example, the received speech unit data is edited and concatenated (the waveforms are joined) according to the prosodic information created in S403 to produce the synthesized speech data. Speech based on this data is then output from the speaker 19, either in real time or when the user requests it. After S305, processing returns to S301 and the same processing is performed on the remaining sentences of the text.
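The concatenation part of S305 can be sketched as walking the speech unit list in order and appending each unit's samples. This is a deliberately simplified illustration: it omits the prosody-driven editing (duration and pitch adjustment) and any smoothing at the joins, and the name `concatenate_units` is hypothetical.

```python
from typing import Dict, List, Tuple

UnitKey = Tuple[str, str]  # hypothetical (diphone, F0 level) key


def concatenate_units(
    unit_list: List[UnitKey], unit_data: Dict[UnitKey, List[int]]
) -> List[int]:
    """Join the waveforms of the listed units in order, without editing."""
    samples: List[int] = []
    for key in unit_list:
        samples.extend(unit_data[key])
    return samples


received = {("o.h", "mid"): [1, 2], ("h.a", "high"): [3]}
waveform = concatenate_units(
    [("o.h", "mid"), ("h.a", "high"), ("o.h", "mid")], received
)
```

Because the lookup is by key, a unit that occurs more than once in the list is reused from a single copy of its data.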

[0034] As described, in this embodiment the speech synthesis processing is executed on the mobile phone 10, while the speech unit dictionary 24 it uses is stored on the server 20 and accessed as needed. This allows even a mobile phone 10 with limited hardware resources, memory in particular, to perform speech synthesis, and it also reduces the computational load on the server 20. Furthermore, the speech unit dictionary can be managed centrally, and because the speech unit dictionary 24 need not be compacted, degradation of sound quality is avoided.

[0035] <Second Embodiment> In the first embodiment, if the speech unit list created in S404 contains duplicate speech unit types, the duplicates may be removed before the list is used to send the transfer request for speech unit data to the server 20. In this case, the positions of the removed speech units are recorded, and when the synthesized speech data is created in S305, the data of the same speech unit is inserted at each of those positions.

[0036] Creating the speech unit list in this way reduces the amount of data communicated and enables faster communication. FIG. 9 shows an example of creating a speech unit list with duplicates removed. The synthesis unit (diphone: /a.i/, fundamental frequency: medium) appears at both the second and fifth positions, so one occurrence is removed to produce a more compact list.
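The duplicate removal can be sketched as keeping only the first occurrence of each unit in the request while retaining the full list for synthesis. The list below is hypothetical apart from the repeated /a.i/ medium unit at the second and fifth positions, which mirrors the FIG. 9 description; `deduplicate` is an assumed name.

```python
from typing import List, Tuple

UnitKey = Tuple[str, str]  # hypothetical (diphone, F0 level) key


def deduplicate(unit_list: List[UnitKey]) -> List[UnitKey]:
    """Drop repeated units while preserving first-occurrence order."""
    return list(dict.fromkeys(unit_list))


# Kept on the terminal so the removed positions can be filled in later.
full_list = [
    ("k.a", "mid"), ("a.i", "mid"), ("i.s", "low"), ("s.a", "low"), ("a.i", "mid"),
]
# Sent to the server: one entry fewer than the full list.
request_list = deduplicate(full_list)
```

Keeping `full_list` is one simple way to record where the removed units belong: synthesis walks the full list and looks each unit up in the received data.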

[0037] <Third Embodiment> In the first embodiment, speech synthesis is performed one sentence at a time and a speech unit list is created for each sentence, but this is not a limitation. The processing may be performed in units shorter than a sentence, such as a fixed number of speech units, a pause phrase, a bunsetsu (phrase), or an accent phrase. It may also be performed in units longer than a sentence, such as two sentences, a paragraph, or the entire text.

[0038] The second embodiment described removing duplicate speech units from the list. The larger the number of speech units, the more duplicates there are likely to be. Therefore, in the second embodiment, creating the speech unit list in units longer than one sentence and requesting that speech unit data from the server 20 reduces the amount of communicated data even more effectively.

[0039] On the other hand, when the capacity of the communication line is small, requesting speech unit data from the server 20 a little at a time, in units shorter than one sentence, makes it possible to realize speech synthesis that prioritizes real-time response.

[0040] <Fourth Embodiment> In the first embodiment, the speech unit data that the mobile phone 10 receives from the server 20 may be saved in the mobile phone 10 for reuse, so that the mobile phone 10 holds a simple speech unit dictionary (hereinafter, the sub speech unit dictionary). The data of the sub speech unit dictionary can be stored in, for example, the flash memory 14 of the mobile phone 10, and its structure can be the same as that of the speech unit dictionary of the server 20 described with reference to FIG. 7, differing only in the amount of data stored.

[0041] FIG. 5 is a flowchart showing the processing of S404 in this case, and FIG. 6 is a flowchart showing the processing of S305 in this case.

[0042] Referring to FIG. 5, in S501 the speech units are listed and the list is stored. In S502, the sub speech unit dictionary held in the mobile phone 10 is consulted. In S503, it is determined whether any of the speech units listed in S501 is registered in the sub speech unit dictionary. If none is registered, the processing ends; if at least one is registered, processing proceeds to S504.

[0043] In S504, it is determined whether all of the listed speech units are registered in the sub speech unit dictionary. If all are registered, there is no need to request speech unit data from the server 20, so processing proceeds to S305 and moves on to creating the synthesized speech data without accessing the server 20. If not all are registered, the speech units that are registered in the sub speech unit dictionary are removed from the list so that only the units still needed remain, and the processing ends.
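The three outcomes of S503 and S504 (nothing cached, everything cached, partly cached) can be covered by a single filter over the list. This is a minimal sketch in which the sub speech unit dictionary is modeled as a plain mapping. The name `units_to_request` is hypothetical, and the removal of repeated entries is borrowed from the second embodiment rather than being part of FIG. 5.

```python
from typing import Dict, List, Tuple

UnitKey = Tuple[str, str]  # hypothetical (diphone, F0 level) key


def units_to_request(
    unit_list: List[UnitKey], sub_dictionary: Dict[UnitKey, List[int]]
) -> List[UnitKey]:
    """Return the listed units that are not yet held on the terminal."""
    return [key for key in dict.fromkeys(unit_list) if key not in sub_dictionary]


sub_dictionary = {("o.h", "mid"): [1, 2]}
wanted = [("o.h", "mid"), ("h.a", "high"), ("a.y", "low")]
to_fetch = units_to_request(wanted, sub_dictionary)
# An empty result corresponds to the "all registered" branch: skip the server.
server_needed = bool(to_fetch)
```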

[0044] Referring to FIG. 6, in S601 it is determined whether speech unit data has been received from the server 20. If it has, the received data is for speech units not yet registered in the sub speech unit dictionary, so processing proceeds to S602, where the data is registered in the sub speech unit dictionary, and then to S603. If no speech unit data has been received from the server 20, that is, if the result in S504 was Yes, processing proceeds directly to S603.

[0045] In S603, the speech unit data corresponding to the list of speech units stored in S501 is read from the sub speech unit dictionary and concatenated to create the synthesized speech data, and the processing ends.

[0046] This reduces both the number of communications and the amount of data communicated between the mobile phone 10 and the server 20. In this embodiment, the size of the sub speech unit dictionary can be expected to grow the more often speech is synthesized, but data can be deleted from the sub speech unit dictionary as appropriate for the storage capacity of the mobile phone 10. For example, once the sub speech unit dictionary exceeds a certain size, speech unit data that has been used infrequently in the past can be deleted. To do this, the capacity threshold, the current size, and the usage frequency of each item of speech unit data are calculated in advance, and data is deleted according to the result.
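The frequency-based deletion could be sketched as a small cache that counts uses and evicts the least-used entry once a size threshold is exceeded. Everything here is an assumption for illustration: the class name `SubUnitDictionary`, measuring size in samples, and protecting the newly registered unit from immediate eviction are not taken from the specification.

```python
from typing import Dict, List, Tuple

UnitKey = Tuple[str, str]  # hypothetical (diphone, F0 level) key


class SubUnitDictionary:
    """Terminal-side unit cache with least-frequently-used eviction."""

    def __init__(self, capacity_samples: int) -> None:
        self.capacity = capacity_samples  # assumed threshold, in samples
        self.data: Dict[UnitKey, List[int]] = {}
        self.use_count: Dict[UnitKey, int] = {}

    def size(self) -> int:
        return sum(len(wave) for wave in self.data.values())

    def fetch(self, key: UnitKey) -> List[int]:
        self.use_count[key] += 1
        return self.data[key]

    def register(self, key: UnitKey, waveform: List[int]) -> None:
        self.data[key] = waveform
        self.use_count.setdefault(key, 0)
        # Evict least-used entries, never the unit that was just added.
        while self.size() > self.capacity:
            candidates = [k for k in self.data if k != key]
            if not candidates:
                break
            victim = min(candidates, key=lambda k: self.use_count[k])
            del self.data[victim]
            del self.use_count[victim]


cache = SubUnitDictionary(capacity_samples=4)
cache.register(("a.i", "mid"), [1, 2])
cache.fetch(("a.i", "mid"))
cache.fetch(("a.i", "mid"))
cache.register(("o.h", "mid"), [3, 4])
cache.fetch(("o.h", "mid"))
cache.register(("h.a", "high"), [5, 6])  # exceeds capacity, evicts ("o.h", "mid")
```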

[0047] <Fifth Embodiment> In the embodiments above, speech units are distinguished by prosodic information (fundamental frequency) when they are determined, but this is not a limitation. The speech units may be determined without regard to prosodic information (diphone only), and the prosodic information may be sent together with them from the mobile phone 10 to the server 20, so that the server 20 selects the speech units distinguished by prosodic information.
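Server-side selection by prosody might look like the following sketch, in which the terminal sends a diphone with a target fundamental frequency and the server picks the stored level closest to it. The representative frequency assigned to each level and the nearest-match rule are assumptions; the specification does not state how the server makes the selection.

```python
from typing import Dict, List, Tuple

UnitKey = Tuple[str, str]  # hypothetical (diphone, F0 level) key

# Assumed representative F0 in Hz for each stored level; not from the source.
LEVEL_CENTER_HZ = {"low": 100.0, "mid": 150.0, "high": 200.0}


def select_unit(
    dictionary: Dict[UnitKey, List[int]], diphone: str, target_f0_hz: float
) -> UnitKey:
    """Pick the stored variant of a diphone whose level is nearest the target."""
    candidates = [key for key in dictionary if key[0] == diphone]
    if not candidates:
        raise KeyError(diphone)
    return min(candidates, key=lambda key: abs(LEVEL_CENTER_HZ[key[1]] - target_f0_hz))


stored = {("a.i", "low"): [1], ("a.i", "mid"): [2], ("a.i", "high"): [3]}
chosen = select_unit(stored, "a.i", 160.0)
```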

[0048] <Sixth Embodiment> In the embodiments above, each process is implemented in software, but this is not a limitation; the processes may instead be implemented by dedicated circuits that perform the same operations.

[0049] It goes without saying that the object of the present invention is also achieved by supplying a software program that implements the functions of the embodiments described above to a system or apparatus, and having a computer (or CPU or MPU) of that system or apparatus read and execute the program.

[0050] In this case, the program itself implements the functions of the above embodiments, and the program, a storage medium storing the program, or a program product constitutes the present invention. Needless to say, the invention covers not only the case where the functions of the above embodiments are implemented by the computer executing the program code it has read, but also the case where an operating system (OS) or the like running on the computer performs some or all of the actual processing based on the instructions of the program code, and the functions of the above embodiments are implemented by that processing.

[0051] Needless to say, the invention further covers the case where the program code read from the storage medium is written to a memory provided on a function expansion card inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on the function expansion card or in the function expansion unit performs some or all of the actual processing based on the instructions of the program code, and the functions of the above embodiments are implemented by that processing.

[0052]

[Effects of the Invention] As described above, according to the present invention, speech synthesis processing can be performed on a terminal with small-scale hardware resources while maintaining the quality of the synthesized speech.

[Brief Description of the Drawings]

FIG. 1 is a system diagram of a speech synthesis processing system 1 according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the hardware configuration of the mobile phone 10.

FIG. 3 is a flowchart showing the processing performed by the mobile phone 10 and the server 20.

FIG. 4 is a flowchart showing the speech unit determination process.

FIG. 5 is a flowchart showing the processing of S404 in the fourth embodiment.

FIG. 6 is a flowchart showing the processing of S306 in the fourth embodiment.

FIG. 7 is a diagram showing an example of the data recorded in the speech unit dictionary 24.

FIG. 8 is a diagram showing the process by which a sentence undergoes speech synthesis processing.

FIG. 9 is an example showing how a speech unit list is created with duplicates deleted.

Claims

[Claims]

1. A speech synthesis processing apparatus capable of communicating with a server that stores speech unit data, and performing speech synthesis processing of a character string, the apparatus comprising: determining means for determining the speech units necessary for outputting the character string as synthesized speech; transmitting means for transmitting to the server a transmission request for the data of the speech units determined by the determining means; receiving means for receiving the speech unit data transmitted from the server in response to the transmission request; and creating means for creating synthesized speech data of the character string based on the received speech unit data.

2. The speech synthesis processing apparatus according to claim 1, wherein the determining means creates a list of the determined speech units, and the transmitting means transmits the list to the server.

3. The speech synthesis processing apparatus according to claim 2, wherein, when the determined speech units include duplicate speech units, the determining means creates the list with the duplicate speech units deleted.

4. The speech synthesis processing apparatus according to claim 2, further comprising storage means for storing the speech unit data received by the receiving means, wherein, when the determined speech units include a speech unit whose data is stored in the storage means, the determining means creates the list with that speech unit deleted, and the creating means creates the synthesized speech data of the character string based on the speech unit data stored in the storage means.

5. The speech synthesis processing apparatus, wherein the determining means comprises language analysis means for performing language analysis of the character string and prosody information creating means for creating prosody information of the character string, and determines the speech units necessary for outputting the character string as synthesized speech based on the analysis result of the language analysis means and the prosody information.

6. The speech synthesis processing apparatus according to claim 1, further comprising output means for outputting synthesized speech based on the synthesized speech data created by the creating means.

7. A program for causing a computer capable of communicating with a server that stores speech unit data to execute speech synthesis processing of a character string, the program causing the computer to execute: a determining step of determining the speech units necessary for outputting the character string as synthesized speech; a transmitting step of transmitting to the server a transmission request for the data of the speech units determined in the determining step; a receiving step of receiving the speech unit data transmitted from the server in response to the transmission request; and a creating step of creating synthesized speech data of the character string based on the received speech unit data.

8. A server capable of communicating with a speech synthesis processing apparatus that performs speech synthesis processing of a character string, the server comprising: storage means for storing speech unit data; receiving means for receiving a transmission request for the speech unit data; and transmitting means for transmitting the speech unit data to the speech synthesis processing apparatus in response to the received transmission request.

9. A data providing method using a server that is communicably connected to a speech synthesis processing apparatus performing speech synthesis processing of a character string and that stores speech unit data, the method comprising: a receiving step in which the server receives a transmission request for the speech unit data from the speech synthesis processing apparatus; and a transmitting step in which the server transmits the speech unit data to the speech synthesis processing apparatus in response to the received transmission request.

10. A speech synthesis processing system comprising: a server that stores speech unit data; and a speech synthesis processing apparatus capable of communicating with the server and performing speech synthesis processing of a character string, wherein the speech synthesis processing apparatus comprises: determining means for determining the speech units necessary for outputting the character string as synthesized speech; transmitting means for transmitting to the server a transmission request for the data of the speech units determined by the determining means; receiving means for receiving the speech unit data transmitted from the server in response to the transmission request; and creating means for creating synthesized speech data of the character string based on the received speech unit data.