JP2008015361A

JP2008015361A - Voice synthesizer, voice synthesizing method, and program for attaining the voice synthesizing method

Info

Publication number: JP2008015361A
Application number: JP2006188405A
Authority: JP
Inventors: Toshio Akaha; 俊夫赤羽; Yoichiro Hachiman; 洋一郎八幡
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2006-07-07
Filing date: 2006-07-07
Publication date: 2008-01-24
Anticipated expiration: 2026-07-07
Also published as: JP5019807B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer that outputs speech in which the occurrence of distortion is suppressed.

SOLUTION: A waveform generation device 200 included in the speech synthesizer comprises: an energy changing unit 210 that changes energy information Ep to energy information corresponding to volume designation information Ev; an energy correction unit 220 that corrects the energy information E based on correction data prepared in advance; a waveform decoding unit 230 that decodes, from waveform information for each predetermined period, a waveform of a predetermined time unit and outputs it; and an amplification unit 240 that amplifies the decoded waveform S[i] based on the corrected energy information f(E) and outputs it.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a technique for synthesizing speech. More specifically, the present invention relates to a speech synthesizer for synthesizing and outputting speech, a speech synthesis method, and a program for realizing the speech synthesis method.

In a device capable of outputting sound, a digital signal cannot represent values beyond the amplitude range defined by its data format. To obtain a volume louder than that range allows, an amplifier with a higher amplification factor or a more efficient speaker is used. However, changing the amplifier or the speaker increases the cost of the device.

In so-called text-to-speech synthesis, which synthesizes speech from character information, and in speech decoding, there is also the problem that the sound distorts abruptly once the output digital signal exceeds the specified range. If the volume is lowered to avoid this distortion, synthesized speech with a low output level may become inaudible.

Thus, for example, Japanese Patent Laid-Open No. 2001-109500 (Patent Document 1) discloses a pitch processing technique that does not impair the naturalness of speech.
Patent Document 1: JP 2001-109500 A

Furthermore, as a way of correcting sound distortion, a technique is known, for example in a compressor used as recording equipment, that keeps the output volume constant by successively adjusting the amplification factor using the short-time average energy of the input waveform. However, performing this technique in real time makes it difficult to tune the response characteristics, and the processing spent on computing the average may be wasted.

The present invention has been made to solve the above-described problems, and an object of the present invention is to provide a speech synthesizer that can prevent distortion of output speech.

Another object is to provide a speech synthesis method that can prevent distortion of output speech.
Still another object is to provide a program for causing a computer to implement a speech synthesis method that can prevent distortion of output speech.

In order to solve the above problems, according to one aspect of the present invention, a speech synthesizer that synthesizes and outputs speech is provided. The speech synthesizer includes: acquisition means for acquiring data for generating the speech to be output and designation information for designating the volume of the speech to be output; storage means for storing waveform information for generating the waveform of each portion obtained by dividing speech into predetermined units, energy information for each portion, and correction data for correcting each piece of energy information; changing means for changing each piece of energy information to energy information corresponding to the designation information; correction means for correcting, based on the correction data, each piece of energy information changed by the changing means; decoding means for decoding each waveform of the speech to be output based on each piece of waveform information; amplification means for amplifying each waveform decoded by the decoding means based on each piece of corrected energy information; and output means for outputting speech based on each waveform amplified by the amplification means.

Preferably, the correction means corrects the energy information changed by the changing means so that it asymptotically approaches an upper limit value defined in advance for the energy information.

Preferably, the correction means corrects the energy information nonlinearly.
Preferably, the correction means corrects the energy information continuously.

Preferably, the correction data includes energy information before correction and the corresponding energy information after correction.

Preferably, the output means includes waveform connecting means for connecting the waveforms amplified by the amplification means to generate a synthesized waveform, and speech output means for outputting speech based on the synthesized waveform.

Preferably, the output means further includes waveform saturation means for adjusting the synthesized waveform output from the waveform connecting means so that it asymptotically approaches a predetermined upper limit value. The speech output means outputs speech based on the synthesized waveform adjusted by the waveform saturation means.

Preferably, the speech synthesizer further includes driving means in which a removable recording medium is mounted and which drives the recording medium. The recording medium stores speech data and designation information associated with the speech data. The acquisition means includes reading means for reading the speech data and the designation information from the recording medium mounted in the driving means.

Preferably, the acquisition means includes receiving means for receiving a signal containing character information, and input means for accepting input of the designation information.

Preferably, the acquisition means includes a microphone that receives an utterance and outputs a speech signal corresponding to the utterance, waveform information analysis means for analyzing the speech signal and outputting waveform information corresponding to the utterance, and prosody analysis means for outputting prosodic information that includes energy information corresponding to the utterance.

A speech synthesizer according to another aspect of the present invention includes: a memory that stores waveform information for generating the waveform of each portion obtained by dividing speech into predetermined units, energy information for each portion, correction data for correcting each piece of energy information, and a program; and a processor that receives a plurality of instructions from the program. The instructions include: an acquisition step of acquiring data for generating the speech to be output and designation information for designating the volume of the speech to be output; a changing step of changing each piece of energy information to energy information corresponding to the designation information; a correction step of correcting, based on the correction data, each piece of energy information changed in the changing step; a decoding step of decoding each waveform of the speech to be output based on each piece of waveform information; and an amplification step of amplifying each waveform decoded in the decoding step based on each piece of corrected energy information. The speech synthesizer further includes an output unit that outputs speech based on each amplified waveform.

Preferably, the correction step corrects the energy information changed in the changing step so that it asymptotically approaches an upper limit value defined in advance for the energy information.

Preferably, the correction step corrects the energy information nonlinearly.
Preferably, the correction step corrects the energy information continuously.

Preferably, the instructions further include a waveform connection step of connecting the waveforms amplified in the amplification step to generate a synthesized waveform. The output unit outputs speech based on the synthesized waveform.

Preferably, the instructions further include a waveform saturation step of adjusting the synthesized waveform generated in the waveform connection step so that it asymptotically approaches a predetermined upper limit value. The output unit outputs speech based on the synthesized waveform adjusted in the waveform saturation step.

Preferably, the speech synthesizer further includes a drive device in which a removable recording medium is mounted and which drives the recording medium. The recording medium stores speech data and designation information associated with the speech data. The acquisition step includes a reading step of reading the speech data and the designation information from the recording medium mounted in the drive device.

Preferably, the acquisition step includes a receiving step of receiving a signal containing character information, and an input step of accepting input of the designation information.

Preferably, the speech synthesizer further includes a microphone that receives an utterance and outputs a speech signal corresponding to the utterance. The acquisition step includes a step of analyzing the speech signal and outputting waveform information corresponding to the utterance, and a step of outputting prosodic information that includes energy information corresponding to the utterance.

According to another aspect of the present invention, a speech synthesis method for synthesizing and outputting speech is provided. The method includes the steps of: acquiring data for generating the speech to be output and designation information for designating the volume of the speech to be output; loading waveform information for generating the waveform of each portion obtained by dividing speech into predetermined units, energy information for each portion, and correction data for correcting each piece of energy information; changing each piece of energy information to energy information corresponding to the designation information; correcting, based on the correction data, each piece of energy information changed in the changing step; decoding each waveform of the speech to be output based on each piece of waveform information; amplifying each waveform decoded in the decoding step based on each piece of corrected energy information; and outputting speech based on each waveform amplified in the amplification step.

According to still another aspect of the present invention, a program for causing a computer having a memory and a processor to implement a speech synthesis method is provided. The speech synthesis method includes the steps of: the processor acquiring, from the memory, data for generating the speech to be output and designation information for designating the volume of the speech to be output; the processor reading, from the memory, waveform information for generating the waveform of each portion obtained by dividing speech into predetermined units, energy information for each portion, and correction data for correcting each piece of energy information; the processor changing each piece of energy information to energy information corresponding to the designation information; the processor correcting, based on the correction data, each piece of changed energy information; the processor decoding each waveform of the speech to be output based on each piece of waveform information; the processor amplifying each waveform decoded in the decoding step based on each piece of corrected energy information; and the processor outputting a speech signal based on each amplified waveform.

The speech synthesizer according to the present invention can prevent distortion of output speech. The speech synthesis method according to the present invention can output speech while preventing distortion of that speech. The program according to the present invention enables a computer to implement a speech synthesis method that can prevent distortion of output speech.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the following description, identical parts are given identical reference numerals. Their names and functions are also the same. Therefore, detailed description of them will not be repeated.

<First Embodiment>
With reference to FIG. 1, a waveform generation device 100 provided in a speech synthesizer will be described. FIG. 1 is a block diagram showing the configuration of functions realized by the conventional waveform generation device 100. The waveform generation device 100 includes a waveform decoding unit 110 and amplification units 120 and 130.

The waveform decoding unit 110 is configured to receive a signal input to the waveform generation device 100. Specifically, the waveform decoding unit 110 receives input of waveform information (spectrum information or sound source information). The waveform decoding unit 110 decodes a short-time waveform based on the waveform information and outputs the decoded waveform (S[i]).

The amplification unit 120 is connected to the waveform decoding unit 110 so as to operate based on the output from the waveform decoding unit 110. Specifically, the waveform signal output by the waveform decoding unit 110 and energy information (Ep) corresponding to that waveform signal are input to the amplification unit 120. The amplification unit 120 amplifies the decoded waveform using the energy information Ep and outputs the amplified waveform signal.

The amplification unit 130 is connected to the amplification unit 120 so as to operate based on the output from the amplification unit 120. Specifically, the waveform signal amplified by the amplification unit 120 is input to the amplification unit 130. In addition, volume designation information Ev given to the waveform generation device 100 is input to the amplification unit 130. The amplification unit 130 amplifies the waveform output from the amplification unit 120 based on the volume designation information Ev and outputs it as volume-adjusted waveform data.

With reference to FIG. 2, a waveform generation device 200 provided in the speech synthesizer according to the first embodiment of the present invention will be described. FIG. 2 is a block diagram showing the configuration of functions realized by the waveform generation device 200 according to the present embodiment. The waveform generation device 200 includes, as its main components, an energy changing unit 210, an energy correction unit 220, a waveform decoding unit 230, and an amplification unit 240. Specifically, the speech synthesizer is realized by a computer system, a mobile phone, a game device, or another information processing apparatus that includes at least a speech output device. In the following description, speech synthesis includes at least: processing previously created speech data; processing digital data based on a speech signal acquired from an utterance into a microphone; and synthesizing speech based on an input character string or on character information contained in a received signal.

The energy changing unit 210 receives input of the volume designation information Ev and the energy information Ep. The volume designation information Ev is given from outside the waveform generation device 200. For example, the volume designation information Ev is given as an input of volume designation information to the speech synthesizer that includes the waveform generation device 200 (for example, operation of a volume button or setting of a level that defines the volume). Alternatively, the volume designation information Ev is written in advance in a memory provided in the speech synthesizer. In yet another aspect, when the speech synthesizer accepts input of speech data, the volume designation information is associated with that speech data. The energy changing unit 210 changes the energy information Ep to energy information corresponding to the volume designation information Ev and outputs it.

The energy correction unit 220 is connected to the energy changing unit 210 so as to operate based on the output from the energy changing unit 210. More specifically, the energy correction unit 220 receives input of the energy information E changed by the energy changing unit 210. The energy correction unit 220 corrects the energy information E based on correction data prepared in advance.

Preferably, the energy correction unit 220 corrects the energy information E changed by the energy changing unit 210 so that the corrected energy information asymptotically approaches an upper limit value defined in advance. Such correction is performed, for example, according to a predefined relationship between the energy E before correction and its corrected value f(E).

In another aspect, the energy correction unit 220 corrects the energy information E nonlinearly. Here, "nonlinear" means, for example, that the relationship between the energy information before correction and the energy information after correction is defined by a nonlinear function, as described later. The nonlinear relationship between the energy before and after correction is expressed, for example, as a function of second or higher order, or as an exponential function. The correction need not be nonlinear over the entire range of the energy E before correction. For example, part of the range may follow a proportional relationship.
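A correction that is proportional over part of the range and then approaches an upper limit asymptotically can be sketched as follows. This is only an illustrative sketch: the function name `correct_energy`, the parameters `knee` and `upper`, and the particular exponential curve are assumptions and are not taken from the patent, which does not give a formula for f(E).

```python
import math


def correct_energy(e, knee, upper):
    """Soft-knee correction f(E) (hypothetical form).

    Below `knee` the energy passes through unchanged (proportional range).
    Above `knee` the output follows a curve that approaches `upper`
    asymptotically and never exceeds it.
    """
    if not knee < upper:
        raise ValueError("knee must be smaller than upper")
    if e <= knee:
        return e
    span = upper - knee
    # The slope is 1 at the knee, so the curve joins the linear part
    # continuously, and it decays toward 0 as e grows.
    return knee + span * (1.0 - math.exp(-(e - knee) / span))
```

With `knee=1.0` and `upper=2.0`, an input of 0.5 is returned unchanged, while larger inputs are compressed toward 2.0.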

In another aspect, the energy correction unit 220 may correct the energy information E based on a table that defines in advance the correspondence between energy information before correction and energy information after correction. The table is defined, for example, as an array of correction coefficients, each applied to a certain range of energy information.
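A table of per-range correction coefficients might look like the sketch below. The range edges, the coefficient values, and the name `correct_with_table` are all hypothetical; the patent only states that the table can be an array of coefficients applied to ranges of energy.

```python
from bisect import bisect_right

# Hypothetical table. Three edges split the energy axis into four ranges:
# below 0.5, 0.5 to 0.75, 0.75 to 1.0, and 1.0 or above.
RANGE_EDGES = [0.5, 0.75, 1.0]
# One correction coefficient per range (assumed values).
COEFFICIENTS = [1.0, 0.9, 0.8, 0.7]


def correct_with_table(e):
    """Look up the range that contains e and scale e by its coefficient."""
    index = bisect_right(RANGE_EDGES, e)
    return e * COEFFICIENTS[index]
```

A coarse table like this one jumps at each range boundary, so a finer table or interpolation between entries would be needed where continuous correction is wanted.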

In still another aspect, the energy correction unit 220 corrects the energy information for each frame to be processed. In PSOLA (Pitch Synchronous OverLap Add) speech synthesis, for example, the amplitude (energy) of each one-pitch waveform to be added within a frame may be corrected by the energy correction unit 220, after which the waveforms are added with a time shift of one pitch interval each to generate the waveform of that frame.
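The PSOLA variant described above can be illustrated with a much simplified overlap-add loop. This sketch omits windowing and pitch marking, uses a fixed pitch period, and treats the corrected amplitude directly as a gain; the function name `overlap_add` and its parameters are assumptions for illustration only.

```python
def overlap_add(pitch_waveforms, pitch_period, frame_length, correct):
    """Build one frame from one-pitch waveforms (simplified sketch).

    pitch_waveforms: list of (samples, amplitude) pairs.
    pitch_period:    shift in samples between successive waveforms.
    correct:         function applied to each amplitude before adding,
                     standing in for the energy correction unit 220.
    """
    frame = [0.0] * frame_length
    for k, (samples, amplitude) in enumerate(pitch_waveforms):
        gain = correct(amplitude)      # correct the amplitude first
        offset = k * pitch_period      # then shift by one pitch interval
        for n, sample in enumerate(samples):
            if offset + n < frame_length:
                frame[offset + n] += gain * sample
    return frame
```

For example, two three-sample waveforms with amplitude 2.0, a pitch period of 2, and `correct=lambda a: min(a, 1.5)` overlap at the third sample of a five-sample frame.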

The waveform decoding unit 230 is configured to accept input of data to the waveform generation device 200. Specifically, waveform information for each predefined interval (for example, several milliseconds to several tens of milliseconds) is input to the waveform decoding unit 230. The waveform information includes spectrum information and sound source information. The waveform decoding unit 230 decodes a waveform of a predefined time unit from the waveform information and outputs it.

The amplification unit 240 is connected to the energy correction unit 220 and the waveform decoding unit 230 so as to operate based on the outputs of both. Specifically, the corrected energy information f(E) generated by the energy correction unit 220 and the waveform S[i] decoded by the waveform decoding unit 230 are input to the amplification unit 240. The amplification unit 240 amplifies the decoded waveform S[i] based on the corrected energy information f(E) and outputs it. The data output by the amplification unit 240 is input, as volume-adjusted waveform data, to a digital-to-analog conversion circuit (not shown), for example.

The correction of energy information will be described with reference to FIG. 3. The energy information here is the energy of the speech waveform over each short time interval: for example, the sum of the squared waveform values, the mean square amplitude per sample obtained by dividing that sum by the number of samples, or the square root of that mean, which represents the average amplitude per sample. FIG. 3 shows the relationship between the energy information E before correction and the energy information f(E) after correction. Regarding the value of the corrected energy information f(E), the energy information Y(1) is the upper limit of the energy information at which the reproduced speech is not distorted. The energy information Y(2) is the maximum value that the energy information can take before or after correction. In the range from Y(1) to Y(2), the reproduced speech may therefore be distorted.
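The three energy measures named above can be computed for one short frame as in the following sketch. The function name `energy_measures` is an assumption; the third value is the root mean square, which is what the text refers to as the average amplitude per sample.

```python
import math


def energy_measures(samples):
    """Return (sum of squares, mean square, root mean square) of a frame."""
    total = sum(s * s for s in samples)    # sum of squared waveform values
    mean_square = total / len(samples)     # divided by the number of samples
    rms = math.sqrt(mean_square)           # square root of the mean square
    return total, mean_square, rms
```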

The energy information E is corrected by a function that defines this relationship (for example, the nonlinear function 310 or the polyline function 320). Data defining the nonlinear function 310 or the polyline function 320 is stored in advance, for example, in a memory area of the energy correction unit 220. The energy correction unit 220 corrects the energy information E output by the energy changing unit 210 using one of these functions.
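A polyline correction of the kind attributed to function 320 can be sketched as linear interpolation between stored breakpoints. The breakpoint values below are hypothetical (the actual shape in FIG. 3 is not reproduced in the text); here the last breakpoint plays the role of an input maximum of 1.0 mapped to an upper limit of 0.8.

```python
from bisect import bisect_right

# Hypothetical breakpoints (E, f(E)), sorted by E.
POLYLINE = [(0.0, 0.0), (0.6, 0.6), (1.0, 0.8)]


def correct_polyline(e):
    """Evaluate the polyline at e, clamping outside the breakpoint range."""
    xs = [x for x, _ in POLYLINE]
    if e <= xs[0]:
        return POLYLINE[0][1]
    if e >= xs[-1]:
        return POLYLINE[-1][1]
    i = bisect_right(xs, e)
    (x0, y0), (x1, y1) = POLYLINE[i - 1], POLYLINE[i]
    # Linear interpolation on the segment that contains e.
    return y0 + (y1 - y0) * (e - x0) / (x1 - x0)
```

Storing only the breakpoints keeps the correction data small, at the cost of a change in slope at each breakpoint.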

The function for correcting the energy information E is not limited to those shown in FIG. 3. Any function will do as long as the corrected energy information f(E) does not exceed the upper limit Y(1).

As a way of realizing the correction of the energy information E, besides correcting the energy information itself as described above, a corrected amplification factor may instead be obtained at the point where the energy information is converted into an amplification factor.

Waveform amplification will be described with reference to FIG. 4. FIG. 4 shows the relationship between energy information and amplification factor. This relationship is expressed, for example, either as an exponential function 410 or as a function 420 that combines an exponential relationship with a linear relationship.

In FIG. 4, for example, in the range above the energy information E(1), the function 420 is defined to give a lower amplification factor than the exponential function 410. Amplification of the waveform is therefore suppressed in the range above E(1). That is, when corrected energy information exceeding E(1) is input, the amplification unit 240 reduces the degree of amplification applied to the waveform S[i] output from the waveform decoding unit 230. In this way, a signal matched to the capability of the speaker that reproduces sound from the volume-adjusted waveform data output by the amplification unit 240 can be supplied to that speaker. As a result, the speaker can output speech without distortion.
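The difference between the two gain curves can be sketched as follows. The threshold value `E1` and the choice of continuing along the tangent line above it are assumptions made for illustration; the text only states that function 420 yields a lower amplification factor than the exponential function 410 above E(1).

```python
import math

E1 = 2.0  # hypothetical threshold corresponding to E(1)


def gain_exponential(e):
    """Amplification factor in the style of function 410: purely exponential."""
    return math.exp(e)


def gain_limited(e):
    """Amplification factor in the style of function 420 (assumed shape).

    Exponential up to E1, then linear along the tangent of exp at E1.
    Because exp is convex, the tangent lies below exp for every e > E1.
    """
    if e <= E1:
        return math.exp(e)
    return math.exp(E1) * (1.0 + (e - E1))
```

Below the threshold the two functions agree; above it the limited gain grows linearly, so a large corrected energy no longer produces an exponentially large amplification.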

With reference to FIG. 5, the waveform generation algorithm applied in the waveform generation device 200 according to the present embodiment will be described. FIG. 5 is a flowchart showing the operations executed by each unit of the waveform generation device 200.

In step S510, the waveform generation device 200 accepts input of waveform information and energy information Ep. Specifically, the waveform decoding unit 230 accepts input of the waveform information, and the energy changing unit 210 accepts input of the energy information Ep. In step S520, the waveform decoding unit 230 decodes the waveform (S[i]) using the waveform information. In step S530, the energy changing unit 210 changes the energy information Ep based on the volume designation information Ev (E ← E + V).

ステップＳ５４０にて、エネルギー補正部２２０は、エネルギー変更部２１０によって変更されたエネルギー情報Ｅを補正する（Ｅ←ｆ（Ｅ））。ステップＳ５５０にて、増幅部２４０は、エネルギー補正部２２０によって生成された補正後のエネルギー情報を波形の増幅率に変換する（Ｇ←ｅｘｐ（Ｅ））。ステップＳ５６０にて、増幅部２４０は、波形Ｓ［ｉ］を増幅率Ｇを用いて増幅する（Ｓ［ｉ］←Ｓ［ｉ］×Ｇ）。 In step S540, the energy correction unit 220 corrects the energy information E changed by the energy change unit 210 (E ← f (E)). In step S550, amplification section 240 converts the corrected energy information generated by energy correction section 220 into a waveform amplification factor (G ← exp (E)). In step S560, amplification section 240 amplifies waveform S [i] using amplification factor G (S [i] ← S [i] × G).
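
The flow of steps S530 to S560 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the energy values are logarithmic (which is what the additive update E ← E + V and the conversion G ← exp(E) suggest), that the waveform has already been decoded in step S520, and that the correction f(E) is a simple knee function. The names `correct_energy`, `generate_waveform`, `knee` and `slope` are hypothetical.

```python
import math

def correct_energy(e, knee=0.0, slope=0.5):
    """Hypothetical correction f(E): identity below the knee, reduced slope above it."""
    return e if e < knee else knee + (e - knee) * slope

def generate_waveform(samples, ep, ev, correct=correct_energy):
    """Apply steps S530 to S560 to one decoded frame (list of samples)."""
    e = ep + ev                      # S530: E <- E + V (log-domain volume change)
    e = correct(e)                   # S540: E <- f(E)
    g = math.exp(e)                  # S550: G <- exp(E)
    return [s * g for s in samples]  # S560: S[i] <- S[i] * G
```

With the assumed knee at 0 and slope 0.5, a requested gain of 4 (E = ln 4) is reduced to an applied gain of about 2, while requests at or below the knee pass through unchanged.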

以上のようにして、波形生成が実現される。ここで、各動作が実行される順序は、前後の他の動作による依存関係を損なわない範囲で、変更可能である。たとえば、ステップＳ５２０とステップＳ５３０との順序は、入れ替えられてもよい。係る順序の変更は、たとえば図２に示されるブロック図によっても容易に理解され得る。 Waveform generation is realized as described above. The order in which the operations are executed can be changed as long as the dependencies on the preceding and following operations are preserved. For example, steps S520 and S530 may be interchanged. Such a change in order can also be readily understood from, for example, the block diagram shown in FIG. 2.

なお、上記の動作は、波形生成装置２００に格納されているソフトウェアによる情報処理が、波形生成装置２００が備えるハードウェアを用いて実現されるものとして説明されている。具体的には、当該ソフトウェアは、各ステップに示される動作を実現するためのプログラムであり、波形生成装置２００が備える記憶装置に格納されている。当該ハードウェアは、波形生成装置２００を備える音声合成装置を具体的に実現するＣＰＵ（Central Processing Unit）その他の演算処理装置と、上記プログラムを格納した記憶装置とを含む。 Note that the above-described operation is described as information processing by software stored in the waveform generation device 200 being realized using hardware included in the waveform generation device 200. Specifically, the software is a program for realizing the operation shown in each step, and is stored in a storage device included in the waveform generation device 200. The hardware includes a CPU (Central Processing Unit) or other arithmetic processing device that specifically implements a speech synthesizer including the waveform generation device 200, and a storage device that stores the program.

しかしながら、当該波形生成の実現は、ハードウェアとソフトウェアとの組み合わせによってのみ実現されるものではなく、各動作を実現するための回路素子のようなハードウェアのみの組み合わせによっても実現可能である。 However, the waveform generation can be realized not only by a combination of hardware and software, but also by a combination of only hardware such as circuit elements for realizing each operation.

図６を参照して、本実施の形態に係る波形生成の構成を用いた音声合成装置６００について説明する。図６は、音声合成装置６００の機能的構成を表わすブロック図である。音声合成装置６００は、テキスト解析部６１０と、波形辞書記憶部６２０と、音量設定部６３０と、韻律生成部６４０と、波形データ選択部６５０と、波形生成部６６０と、波形重畳部６７０と、増幅部６８０とを備える。 A speech synthesizer 600 using the waveform generation configuration according to the present embodiment will be described with reference to FIG. 6. FIG. 6 is a block diagram showing a functional configuration of the speech synthesizer 600. The speech synthesizer 600 includes a text analysis unit 610, a waveform dictionary storage unit 620, a volume setting unit 630, a prosody generation unit 640, a waveform data selection unit 650, a waveform generation unit 660, a waveform superimposing unit 670, and an amplifying unit 680.

テキスト解析部６１０は、テキストデータの入力を受けるように構成される。テキストデータは、音声合成装置６００の使用者が文字列を入力することにより音声合成装置６００に与えられる。あるいは、音声合成装置６００が文字情報を含む信号を受信可能な場合、テキストデータは、音声合成装置６００が当該信号を受信して、その信号から文字情報を取得することにより与えられる。 The text analysis unit 610 is configured to receive input of text data. The text data is given to the speech synthesizer 600 when the user of the speech synthesizer 600 inputs a character string. Alternatively, when the speech synthesizer 600 can receive a signal including character information, the text data is given by the speech synthesizer 600 receiving the signal and acquiring the character information from the signal.

テキスト解析部６１０は、そのようにして与えられたテキストデータ（以下、入力テキスト）を解析し、各単語の読みと、アクセント情報とを出力する。テキスト解析部６１０は、他の局面においては、品詞情報などを出力する。入力テキストが漢字仮名混じり文である場合には、テキスト解析部６１０は、言語辞書（図示しない）を用いて、上記の各情報を生成する。あるいは、入力テキストが仮名入力またはアルファベットのような発音記号の入力である場合、テキスト解析部６１０は、仮名と同時に入力されるアクセント情報を用いて上記の各情報を生成する。たとえば「ホ’ンジツハ／セーテンナ’リ」のように、アクセント位置とアクセント句の境界を指定するテキストとが同時に入力される。 The text analysis unit 610 analyzes the text data given in this way (hereinafter, the input text) and outputs the reading of each word and its accent information. In another aspect, the text analysis unit 610 also outputs part-of-speech information and the like. When the input text is a sentence that mixes kanji and kana, the text analysis unit 610 generates this information using a language dictionary (not shown). Alternatively, when the input text is kana or phonetic symbols such as alphabetic characters, the text analysis unit 610 generates this information using accent information that is input together with the kana. For example, text such as 「ホ’ンジツハ／セーテンナ’リ」 (ho'njitsuha / se-tenna'ri), in which the apostrophes mark accent positions and the slash marks an accent phrase boundary, is input as a single string.

波形辞書記憶部６２０は、音声を合成するために使用される素片データを格納する。素片データは、音声合成装置６００の製造者によって予め与えられる。あるいは、音声合成装置６００の使用者が素片データを含む波形辞書を入力できる構成であってもよい。波形辞書記憶部６２０のデータ構造は後述する（図７）。 The waveform dictionary storage unit 620 stores the segment data used for synthesizing speech. The segment data is provided in advance by the manufacturer of the speech synthesizer 600. Alternatively, the speech synthesizer 600 may be configured so that its user can input a waveform dictionary containing segment data. The data structure of the waveform dictionary storage unit 620 is described later (FIG. 7).

音量設定部６３０は、テキスト解析部６１０からの出力に基づいて作動可能なように、テキスト解析部６１０に接続される。音量設定部６３０は、音声合成装置６００を使用する機器から出力される音量を設定し、設定値を音量指定情報Ｅｖとして出力する。ここで音声合成装置６００を使用する機器あるいは音声合成装置６００として機能する機器は、音声出力機能を有するＰＣ（Personal Computer）、携帯電話その他の情報通信端末あるいはゲーム機器として実現される。音量設定部６３０は、これらの機器が有する操作部（たとえばキーボード、マウス、数字ボタン、音量調整ダイヤルなど）に対する操作に基づいて出力される音量を規定する。 The volume setting unit 630 is connected to the text analysis unit 610 so as to be operable based on the output from the text analysis unit 610. The volume setting unit 630 sets the volume output from a device that uses the speech synthesizer 600 and outputs the set value as volume designation information Ev. Here, a device that uses the speech synthesizer 600 or a device that functions as the speech synthesizer 600 is realized as a PC (Personal Computer) having a speech output function, a mobile phone or other information communication terminal, or a game device. The volume setting unit 630 defines a volume output based on an operation on an operation unit (for example, a keyboard, a mouse, a numeric button, a volume adjustment dial, etc.) included in these devices.

たとえば、音量設定部６３０が音量調整ダイヤルによって実現される場合には、出力音量は、音声合成装置の使用者によって設定される。音量設定部６３０は、その設定に応じた信号を音量指定情報Ｅｖとして波形生成装置２００に送出する。波形生成装置２００は、その設定に応じて、全ての区間のエネルギー情報を変更して補正する。 For example, when the volume setting unit 630 is realized by a volume adjustment dial, the output volume is set by the user of the speech synthesizer. The volume setting unit 630 sends a signal corresponding to the setting to the waveform generation apparatus 200 as volume designation information Ev. The waveform generation device 200 changes and corrects the energy information of all the sections according to the setting.

あるいは、当該音量指定情報がこれらの機器に入力されるデータに付随している場合には、音量設定部６３０は、その付随している情報を用いて音量を設定し、音量指定情報Ｅｖを出力する。たとえば、音声合成装置６００がゲーム装置として使用される場合、当該ゲーム装置は、ゲームを実現するためのカートリッジを駆動し、映像データと音声データとを読み出す。この場合、音量指定情報が映像データあるいは音声データのいずれかに関連付けられていれば、音量設定部６３０は、当該カートリッジから取得されたその音量指定情報を用いて、音声合成装置６００から出力される音声の音量として設定する。 Alternatively, when the volume designation information accompanies data input to these devices, the volume setting unit 630 sets the volume using the accompanying information and outputs the volume designation information Ev. For example, when the speech synthesizer 600 is used as a game device, the game device drives a cartridge that implements a game and reads video data and audio data from it. In this case, if volume designation information is associated with either the video data or the audio data, the volume setting unit 630 uses the volume designation information acquired from the cartridge to set the volume of the speech output from the speech synthesizer 600.

また、さらに他の局面においては、音量指定情報は、上記の機器の使用者による操作によって指定されるものでもよい。たとえば、使用者が音量設定部６３０を操作した場合に、当該操作に応じて出力される電気信号が、音量指定情報として使用されてもよい。ここで、当該操作は、ダイヤルの回転、ボタンの押下、音量を規定するための設定値の入力等を含む。 In still another aspect, the volume designation information may be designated by an operation by a user of the device. For example, when the user operates the volume setting unit 630, an electrical signal output in response to the operation may be used as the volume designation information. Here, the operation includes rotation of a dial, pressing of a button, input of a setting value for defining a volume, and the like.

韻律生成部６４０は、テキスト解析部６１０からの出力に基づいて作動可能なように、テキスト解析部６１０に接続される。韻律生成部６４０は、アクセント情報あるいは文の境界に基づいて、韻律情報を生成して出力する。韻律情報は、たとえば時間長、ピッチ、エネルギー（パワー）情報などを含む。一般的には、韻律情報は、音素単位に求められ、その後、内挿により各フレーム単位の情報として生成される。 The prosody generation unit 640 is connected to the text analysis unit 610 so as to be operable based on the output from the text analysis unit 610. The prosody generation unit 640 generates and outputs prosodic information based on accent information or sentence boundaries. The prosody information includes, for example, time length, pitch, energy (power) information, and the like. In general, prosodic information is obtained in units of phonemes, and then generated as information for each frame by interpolation.

波形データ選択部６５０は、テキスト解析部６１０からの出力と、波形辞書記憶部６２０に格納されているデータとに基づいて作動可能なように、テキスト解析部６１０と波形辞書記憶部６２０とに接続される。波形データ選択部６５０は、各単語の読みから設定される発音記号列に従って、波形辞書記憶部６２０から、各発音記号についての条件に合致する素片データを選択する。波形データ選択部６５０は、その選択した素片情報から各フレームごとの波形情報（図８）を取得し、出力する。 The waveform data selection unit 650 is connected to the text analysis unit 610 and the waveform dictionary storage unit 620 so that it can operate based on the output from the text analysis unit 610 and the data stored in the waveform dictionary storage unit 620. Following the phonetic symbol string derived from the reading of each word, the waveform data selection unit 650 selects from the waveform dictionary storage unit 620 the segment data that matches the conditions for each phonetic symbol. The waveform data selection unit 650 then acquires the waveform information for each frame (FIG. 8) from the selected segment data and outputs it.

波形生成部６６０は、音量設定部６３０からの出力と、韻律生成部６４０からの出力と、波形データ選択部６５０からの出力とに基づいて作動可能なように、音量設定部６３０と韻律生成部６４０と波形データ選択部６５０とに接続される。波形生成部６６０は、図２に示される波形生成装置２００に相当し、波形生成装置２００によって実現される機能を実現する。 The waveform generation unit 660 is connected to the volume setting unit 630, the prosody generation unit 640, and the waveform data selection unit 650 so that it can operate based on the outputs from these three units. The waveform generation unit 660 corresponds to the waveform generation device 200 shown in FIG. 2 and realizes the functions realized by the waveform generation device 200.

具体的には、波形生成部６６０には、音量指定情報Ｅｖと、エネルギー情報Ｅｐと、波形情報とが入力される。波形生成部６６０は、これらの情報を用いて音量が調整された音声を出力するための波形データを生成し出力する。 Specifically, the sound volume designation information Ev, the energy information Ep, and the waveform information are input to the waveform generation unit 660. The waveform generation unit 660 generates and outputs waveform data for outputting sound whose volume is adjusted using these pieces of information.

波形重畳部６７０は、韻律生成部６４０からの出力と、波形生成部６６０からの出力とに基づいて作動するように、韻律生成部６４０と波形生成部６６０とに接続される。波形重畳部６７０は、波形生成部６６０によって出力されるデータに加えて、韻律生成部６４０によって出力されるデータの入力を受け付ける。具体的には、波形重畳部６７０は、波形データと、ピッチ情報と、時間長情報との入力を受け付ける。波形重畳部６７０は、フレームごとに取り出された波形データを、各フレームに対応するピッチ情報から導かれるサンプル間隔で重畳し、出力する。たとえば、出力される音声が有声音である場合、ピッチの間隔は、当該音声の基本周波数に対応する。また、音声が無声音である場合には、波形データは、固定長のフレーム間隔で重畳される。 The waveform superimposing unit 670 is connected to the prosody generation unit 640 and the waveform generation unit 660 so as to operate based on the output from the prosody generation unit 640 and the output from the waveform generation unit 660. The waveform superimposing unit 670 receives input of data output from the prosody generation unit 640 in addition to data output from the waveform generation unit 660. Specifically, the waveform superimposing unit 670 receives input of waveform data, pitch information, and time length information. The waveform superimposing unit 670 superimposes waveform data extracted for each frame at a sample interval derived from pitch information corresponding to each frame, and outputs the result. For example, when the output sound is a voiced sound, the pitch interval corresponds to the fundamental frequency of the sound. When the voice is an unvoiced sound, the waveform data is superimposed at a fixed-length frame interval.
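
The superposition described above amounts to an overlap-add of per-frame waveforms at offsets derived from the pitch. The following is a minimal sketch under stated assumptions: frames are plain lists of samples, the spacing for a voiced frame is taken as the sampling rate divided by the fundamental frequency, and the names `pitch_interval` and `overlap_add` are hypothetical rather than taken from the source.

```python
def pitch_interval(sample_rate, f0):
    """Samples between successive pitch waveforms for a voiced frame (assumed: fs / f0)."""
    return round(sample_rate / f0)

def overlap_add(frames, intervals):
    """Sum frames into one signal; intervals[k] is the offset from frame k to frame k+1."""
    if not frames:
        return []
    if len(intervals) < len(frames) - 1:
        raise ValueError("need one interval per pair of consecutive frames")
    starts = [0]
    for step in intervals[:len(frames) - 1]:
        starts.append(starts[-1] + step)
    length = max(start + len(frame) for start, frame in zip(starts, frames))
    out = [0.0] * length
    for start, frame in zip(starts, frames):
        for i, value in enumerate(frame):
            out[start + i] += value  # overlapping regions add together
    return out
```

For unvoiced frames the same routine can be called with a fixed interval equal to the frame spacing.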

増幅部６８０は、波形重畳部６７０からの出力に基づいて作動可能なように、波形重畳部６７０に接続される。増幅部６８０は、波形重畳部６７０によって生成された波形を増幅し、音声データとして出力する。音声データは、デジタルアナログコンバータによってアナログ信号に変換され、スピーカ（図示しない）に送出される。スピーカはその信号に基づいて音量が調整された音声を出力する。 The amplifying unit 680 is connected to the waveform superimposing unit 670 so as to be operable based on the output from the waveform superimposing unit 670. The amplifying unit 680 amplifies the waveform generated by the waveform superimposing unit 670 and outputs it as audio data. The audio data is converted into an analog signal by a digital / analog converter and sent to a speaker (not shown). The speaker outputs sound whose volume is adjusted based on the signal.

ここで、音量設定部６３０からの出力が、増幅部６８０ではなく、波形生成部６６０に向けられていることに留意されるべきである。すなわち、従来の構成では、音量指定情報Ｅｖ（たとえば音量を調整するための調整値）は、増幅部６８０に相当する構成に対して入力されていた。このような構成では、音声の歪みを抑制するために音量指定情報Ｅｖが使用者によって変更されると、音声信号が出力される段階で増幅率が変更される。そのため、歪の調整が本来不要な音声信号（すなわち、出力レベルがスピーカの規定値を超えないような信号）も一律に調整の対象となり、音量が小さい音声は聞こえにくくなる。 Here, it should be noted that the output from the volume setting unit 630 is directed to the waveform generation unit 660, not the amplification unit 680. That is, in the conventional configuration, the volume designation information Ev (for example, an adjustment value for adjusting the volume) is input to the configuration corresponding to the amplification unit 680. In such a configuration, when the sound volume designation information Ev is changed by the user in order to suppress the distortion of the sound, the amplification factor is changed at the stage where the sound signal is output. Therefore, an audio signal that does not need to be adjusted for distortion (that is, a signal whose output level does not exceed the specified value of the speaker) is also subject to adjustment, and it is difficult to hear a sound with a low volume.

しかしながら、本実施の形態に係る音声合成装置６００においては、図６からも明らかなように、音量指定情報Ｅｖは、波形生成部６６０に入力される。そのため、歪の発生を抑制するための調整が必要な音声データの調整のみが実現可能となる。その結果、歪の調整が本来不要な音声信号の出力レベルは調整されなくなり、音量が小さい音声は、そのまま出力される。これにより、音声合成装置６００の使用者は、大音量の音声と小音量の音声とのいずれをも心地よく聞くことができる。 In the speech synthesizer 600 according to the present embodiment, however, the volume designation information Ev is input to the waveform generation unit 660, as is apparent from FIG. 6. It therefore becomes possible to adjust only the audio data that needs adjustment to suppress distortion. As a result, the output level of audio signals that do not need distortion adjustment is left unchanged, and low-volume speech is output as it is. The user of the speech synthesizer 600 can thus listen comfortably to both loud and quiet speech.

なお、波形重畳部６７０による重畳後のデータの出力の態様は、上記のものに限られない。たとえばデジタルアナログコンバータに代えてデジタルアンプが使用されてもよい。 Note that the output mode of data after superposition by the waveform superimposing unit 670 is not limited to the above. For example, a digital amplifier may be used instead of the digital-analog converter.

ここで、図７および図８を参照して、音声合成装置６００のデータ構造について説明する。図７は、波形辞書記憶部６２０におけるデータの格納の一態様を概念的に表わす図である。波形辞書記憶部６２０は、データを格納するための領域７１０〜領域７９０を含む。各領域には、音素を基本単位として使用する素片辞書のデータ（素片データ）が格納されている。たとえば、領域７１０には、素片データとして「ａ」のデータが格納されている。各素片データは、さらに詳細な情報を有する。 The data structure of the speech synthesizer 600 will now be described with reference to FIGS. 7 and 8. FIG. 7 is a diagram conceptually showing one mode of data storage in the waveform dictionary storage unit 620. The waveform dictionary storage unit 620 includes areas 710 to 790 for storing data. Each area stores data of a segment dictionary (segment data) that uses phonemes as its basic unit. For example, area 710 stores the data for "a" as segment data. Each item of segment data contains more detailed information.

すなわち、図８を参照して、領域７９０は、フレーム番号を格納するための領域８１０と、エネルギー情報を格納するための領域８２０と、波形情報を格納するための領域８３０とを含む。たとえば領域７９０は、Ｍ個のフレームについてエネルギー情報と波形情報とをそれぞれ有している。 That is, referring to FIG. 8, region 790 includes a region 810 for storing a frame number, a region 820 for storing energy information, and a region 830 for storing waveform information. For example, the area 790 has energy information and waveform information for M frames.
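
The layout of FIGS. 7 and 8 can be pictured as a dictionary keyed by phoneme, where each entry holds M frames and each frame carries a frame number, energy information, and waveform information. The sketch below is only an illustration of that shape: the class names, field names, and sample values are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Frame:
    number: int            # frame number (area 810)
    energy: float          # energy information (area 820)
    waveform: List[float]  # waveform information (area 830)

@dataclass
class Segment:
    phoneme: str                                        # basic unit, e.g. "a"
    frames: List[Frame] = field(default_factory=list)   # M frames per segment

def build_example_dictionary() -> Dict[str, Segment]:
    """Build a toy waveform dictionary with one two-frame segment (made-up values)."""
    seg = Segment("a", [Frame(1, 0.0, [0.0, 1.0, 0.0]),
                        Frame(2, -0.5, [0.0, 0.5, 0.0])])
    return {seg.phoneme: seg}
```

Looking up `build_example_dictionary()["a"].frames` then yields the per-frame energy and waveform information that the waveform data selection unit would pass on.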

図８を参照して、波形情報の生成についてさらに説明する。音声素片辞書は、元になる音声データを、基本単位毎に分析して作成する。一般的には、基本単位として、音素、音節、ＶＣＶ（Vowel-Consonant-Vowel：母音−子音−母音）などが使用される。図８においては、説明を平易にするために、基本単位として音素が使用される場合が例示されているが、その他の基本単位でも、以下の説明は成立する。 The generation of waveform information will be further described with reference to FIG. 8. The speech segment dictionary is created by analyzing the original speech data for each basic unit. Phonemes, syllables, VCV (vowel-consonant-vowel) sequences and the like are generally used as basic units. FIG. 8 illustrates the case where phonemes are used as the basic unit for simplicity of explanation, but the following description also holds for other basic units.

まず、基本単位に切り出された音声を、フレーム毎に分析し、波形情報と、エネルギー情報とに分解する。波形情報を取得するための処理としては、音声が有声音の場合には、一般的には、ケプストラム分析を用いて周期性を取り除き、１ピッチ波形を取り出す処理が行なわれる。音声が無声音の場合には、一般的には、元の波形をフレームの長さに切り出すための処理が行なわれる。また、波形情報のデータ圧縮を目的として、波形情報を、ケプストラムやＬＳＰ（Linear Spectrum Pair：線スペクトル対）などのスペクトル情報と、残差波形に分解したり、スペクトル情報と、有声か無声かのフラグに分解する場合もある。この場合、スペクトルを表現するフィルタを用いて、有声音では、インパルスに対するフィルタ応答を求めて波形データを得、他方、無声音では、ランダム雑音に対するフィルタ応答を求めて波形データを得るのが一般的である。 First, the speech cut out into basic units is analyzed frame by frame and decomposed into waveform information and energy information. To acquire the waveform information for voiced speech, cepstrum analysis is generally used to remove the periodicity and extract a single-pitch waveform. For unvoiced speech, the original waveform is generally cut to the frame length. For the purpose of data compression, the waveform information may also be decomposed into spectral information, such as a cepstrum or LSP (line spectrum pair) coefficients, and a residual waveform, or into spectral information and a voiced/unvoiced flag. In that case, a filter representing the spectrum is used: for voiced speech the waveform data is generally obtained as the filter response to an impulse, and for unvoiced speech as the filter response to random noise.

波形情報は精度を保つために、一定の振幅レベルになるように正規化しておくのが望ましい。たとえば分析された波形をＸ［ｉ］（ｉ=１，・・・，N) としたとき、平均振幅Ｐxは、所定の式（数式は本文中に再現されていない）で表される。正規化した波形情報Ｓ［ｉ］は、次式のように正規化する。 To maintain precision, it is desirable to normalize the waveform information to a constant amplitude level. For example, when the analyzed waveform is X[i] (i = 1, ..., N), its average amplitude Px is defined by an equation that is not reproduced in this text. The normalized waveform information S[i] is then obtained as follows.

Ｓ［ｉ］＝Ｘ［ｉ］×Ａ／Ｐｘ S[i] = X[i] × A / Px
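
The normalization S[i] = X[i] × A / Px can be sketched as below. Because the defining equation for Px is not available here, the sketch assumes that Px is the mean absolute amplitude of the frame; a root-mean-square definition would work the same way. The function names are hypothetical.

```python
def average_amplitude(x):
    """Assumed definition of Px: mean absolute value of the samples."""
    return sum(abs(v) for v in x) / len(x)

def normalize(x, target=2048.0):
    """Scale a frame so that its average amplitude equals `target` (A in the text)."""
    px = average_amplitude(x)
    if px == 0:
        return list(x)  # silent frame: nothing to scale
    return [v * target / px for v in x]
```

After normalization every frame has the same average amplitude A, so the per-frame loudness is carried entirely by the separately stored energy information.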

ここで、Ａには、波形が振幅限界の範囲（たとえば、１６ビットなら−３２７６８から３２７６７）を超えないで、なるべく大きな振幅になるような値が設定される。具体的には、たとえばＡ＝２０４８といった値を設定する。 Here, A is set to a value that makes the amplitude as large as possible without the waveform exceeding the amplitude limits (for example, −32768 to 32767 for 16 bits). Specifically, a value such as A = 2048 is set.

エネルギー情報については、分析時の各フレームの短時間平均エネルギーＰｘが、エネルギー情報として用いられる。また、エネルギー情報の値自体を使用するよりも、その値を対数化して用いるほうが、音量の変更など各種の制御が「加算処理」により可能となるため、望ましい。 For energy information, the short-time average energy Px of each frame at the time of analysis is used as energy information. In addition, it is preferable to logarithmically use the energy information value rather than using the energy information value itself, because various controls such as changing the volume can be performed by “addition processing”.

音声合成における波形生成の場合、音声素片辞書が波形情報のままであれば復号処理は不要である。しかし、音声素片辞書がスペクトル情報を用いて圧縮されている場合は、一般的な復号化の処理を行なう。この際、復号された波形の平均振幅がＡでなければ、平均振幅Ａになるように、一旦、ゲイン調整を行なう。 In the case of waveform generation in speech synthesis, decoding processing is not necessary if the speech unit dictionary remains waveform information. However, when the speech unit dictionary is compressed using the spectrum information, a general decoding process is performed. At this time, if the average amplitude of the decoded waveform is not A, gain adjustment is once performed so that the average amplitude A is obtained.

音声合成に際して、合成のための処理の対象となるフレーム（以下、対象フレーム）の目標エネルギー情報Ｅが、平均振幅でＡに等しい場合には、増幅率を１として、波形を生成する。他方、目標エネルギー情報Ｅの値がＡに等しくない場合には、増幅率Ｅ／Ａで増幅することにより、平均振幅Ｅの波形を得る。 At the time of speech synthesis, when target energy information E of a frame to be processed for synthesis (hereinafter referred to as a target frame) is equal to A in average amplitude, a waveform is generated with an amplification factor of 1. On the other hand, when the value of the target energy information E is not equal to A, a waveform with an average amplitude E is obtained by amplification with the amplification factor E / A.

また、本発明の実施の形態に係る波形生成部における波形生成は、より具体的には、以下のとおりである。前述の対象フレームの目標エネルギーＥは、韻律生成に基づくエネルギー情報Ｅpと、音量指定情報に基づくＥvとを用いて、算式「Ｅ＝Ｅｐ×Ｅｖ」により算出される。ここで、Ｅｖの値は、標準の値を１とする相対的な値である。 Moreover, the waveform generation in the waveform generation unit according to the embodiment of the present invention is more specifically as follows. The target energy E of the target frame is calculated by the formula “E = Ep × Ev” using energy information Ep based on prosody generation and Ev based on sound volume designation information. Here, the value of Ev is a relative value where the standard value is 1.

正規化された平均振幅ＡをＡ＝２０４８とした時、実際の波形情報のうち最もピーク振幅の大きい場合のピーク振幅は、平均振幅Ａの８倍である１６３８４程度と仮定すると、１６ビットの最大振幅（−３２７６８〜３２７６７）に対しては、当該ピーク振幅は、平均振幅Ａの最大2倍の増幅しか許されないことになる。したがって、生成波形に対する増幅率には、制約が課せられる。この制約は、たとえば、Ｅ＜２×Ａとして表わされる。 Suppose the normalized average amplitude is A = 2048 and the largest peak amplitude found in the actual waveform information is about 16384, eight times the average amplitude A. Within the 16-bit amplitude range (−32768 to 32767), a waveform with that peak can then be amplified by a factor of at most two. A constraint is therefore imposed on the amplification factor applied to the generated waveform. This constraint is expressed, for example, as E < 2 × A.

しかし、韻律情報に基づくエネルギー情報Ｅｐの値は、ルールや統計に基づくため、エネルギー情報Ｅｐを制限するためには、複雑なルールを調整したり、統計量を修正したりする必要がある。そこで、本発明の実施の形態においては、これらの調整や修正を避けるため、Ｅの関数としてｆ（Ｅ）を定義し、ｆ（Ｅ）＜２×Ａとなるように関数ｆ（）を設定する方法が用いられる。関数ｆ（Ｅ）の例として、図３に示されるような、非線形の関数３１０や折れ線として規定される関数３２０が考えられる。ここで、非線形の関数の一例は、たとえば、次のようなものである。 However, because the value of the energy information Ep derived from the prosodic information is based on rules and statistics, limiting Ep directly would require adjusting complex rules or modifying statistics. To avoid such adjustments and modifications, the embodiment of the present invention defines f(E) as a function of E and sets the function f() so that f(E) < 2 × A. Examples of the function f(E) include the nonlinear function 310 and the piecewise-linear function 320 shown in FIG. 3. An example of the nonlinear function is as follows.

ｆ（Ｅ）＝｛ｌｏｇ（Ｅ＋１）＾（２×Ａ）｝／ｌｏｇ（４×Ａ） f(E) = log((E + 1)^(2 × A)) / log(4 × A) = 2 × A × log(E + 1) / log(4 × A)
この関数は、Ｅ＝（４×Ａ−１）の時にｆ（Ｅ）＝２×Ａとなるように定義されたものである。非線形の関数は、この関数に限られず、他にも様々な関数が考えられる。 This function is defined so that f(E) = 2 × A when E = 4 × A − 1. The nonlinear function is not limited to this one; various other functions are conceivable.
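
The logarithmic correction above can be checked numerically. The sketch below reads the formula as f(E) = 2A · log(E + 1) / log(4A), which is the reading that satisfies the stated condition f(4A − 1) = 2A; the function name `f_log` and the default A = 2048 are illustrative assumptions.

```python
import math

def f_log(e, a=2048.0):
    """Logarithmic energy correction: grows slowly and reaches 2*A at E = 4*A - 1."""
    return 2.0 * a * math.log(e + 1.0) / math.log(4.0 * a)
```

Since the function is monotonically increasing and equals 2 × A exactly at E = 4 × A − 1, the bound f(E) < 2 × A holds for inputs below that point.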

また、折れ線の関数の一例は、たとえば、次のようなものである。 An example of a piecewise-linear function is as follows.
ｆ（Ｅ）＝Ｅ（Ｅ＜Ｂの場合、ただしＢ＜２×Ａ） f(E) = E (when E < B, where B < 2 × A)
ｆ（Ｅ）＝Ｂ＋（Ｅ−Ｂ）／２（Ｅ≧Ｂの場合） f(E) = B + (E − B) / 2 (when E ≥ B)
ここで、折れ線の関数を設定する方法についても様々な方法が考えられ、本実施の形態において示されたものに限定されるものではない。 Various methods of defining the piecewise-linear function are conceivable, and they are not limited to the one shown in the present embodiment.
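
The piecewise-linear correction can be sketched in the same way. The breakpoint B is a free parameter subject to B < 2 × A; the value 3000 used as a default here is purely illustrative, as is the function name `f_piecewise`.

```python
def f_piecewise(e, b=3000.0):
    """Identity below the breakpoint B, half slope at or above it."""
    if e < b:
        return e
    return b + (e - b) / 2.0
```

The two branches meet at E = B, so the correction is continuous, and above B every further increase in requested energy is halved.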

また、別の局面においては、他の実施例として、エネルギー情報が対数の値として規定されている場合があり得る。この場合、当該値を増幅率に変換することが必要になるが、一般的には、エネルギー情報がＥｌ=２０ｌｏｇ（Ｅ）という値で入力されるとすれば、増幅率Ｇは次式で表わされる。 In another aspect, as a further example, the energy information may be specified as a logarithmic value. In that case the value must be converted into an amplification factor. In general, if the energy information is input as El = 20 log(E), the amplification factor G is expressed by the following equation.

Ｇ＝Ｅ／Ａ＝exp（Ｅｌ／２０）／Ａ G = E / A = exp(El / 20) / A
なお、増幅率Ｇを算出するための関係式は、上記のものに限られない。少なくとも、Ｅ＝Ａ／２に対応するＥｌ＝２０ｌｏｇ（Ａ／２）より大きな値に対して、連続的につながる線形関数を用いて急激に大きくならないようにするような方法でもよい。 The relational expression for calculating the amplification factor G is not limited to the one above. For example, at least for values larger than El = 20 log(A / 2), which corresponds to E = A / 2, a continuously connected linear function may be used so that the amplification factor does not increase sharply.
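
The conversion from logarithmic energy to an amplification factor can be sketched as follows. The sketch assumes that "log" in El = 20 log(E) is the natural logarithm, since the source inverts it with exp; with a base-10 definition the inverse would be 10 ** (El / 20) instead. The function names are hypothetical.

```python
import math

def log_energy(e):
    """El = 20 * ln(E), assuming a natural logarithm."""
    return 20.0 * math.log(e)

def gain_from_log_energy(el, a=2048.0):
    """G = E / A = exp(El / 20) / A."""
    return math.exp(el / 20.0) / a
```

A target energy equal to the normalized amplitude A therefore maps to a gain of 1, and doubling the target energy doubles the gain.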

このように増幅率がある値よりも極端に大きくならないように調整することで、本発明の実施の形態に係る波形生成装置２００が動作することで、出力波形の歪みを抑えることができ、大きな音量を設定することができる。その結果、同一のスピーカを用いる従来の出力音量よりも大きな音量での出力を可能にする。 By adjusting the amplification factor in this way so that it does not become extremely larger than a certain value, the waveform generation device 200 according to the embodiment of the present invention can suppress distortion of the output waveform while still allowing a large volume to be set. As a result, output at a higher volume than the conventional output volume becomes possible with the same speaker.

図７および図８に示されるデータは、音声合成装置６００の製造者によって波形辞書記憶部６２０に格納されるが、音声合成装置６００の使用者によって再入力が可能であってもよい。 The data shown in FIGS. 7 and 8 is stored in the waveform dictionary storage unit 620 by the manufacturer of the speech synthesizer 600, but may be re-input by the user of the speech synthesizer 600.

本実施の形態に係る波形生成装置は、音声を符号化した上で復号化する装置にも適用可能である。そこで、図９を参照して、本実施の形態に係る波形生成の構成を有する音声符号化復号化装置９００について説明する。図９は、音声符号化復号化装置９００によって実現される機能の構成を表わすブロック図である。音声符号化復号化装置９００は、符号化装置９１０と、復号化装置９４０とを備える。復号化装置９４０は、符号化装置９１０からの出力に基づいて作動可能なように符号化装置９１０に接続される。符号化装置９１０は、韻律分析部９２０と、波形情報分析部９３０とを含む。復号化装置９４０は、音量設定部９５０と、波形生成部９６０と、波形重畳部９７０と、増幅部９８０とを含む。 The waveform generation device according to the present embodiment can also be applied to a device that encodes speech and then decodes it. A speech encoding/decoding device 900 having the waveform generation configuration according to the present embodiment will therefore be described with reference to FIG. 9. FIG. 9 is a block diagram showing the configuration of the functions implemented by the speech encoding/decoding device 900. The speech encoding/decoding device 900 includes an encoding device 910 and a decoding device 940. The decoding device 940 is connected to the encoding device 910 so that it can operate based on the output from the encoding device 910. The encoding device 910 includes a prosody analysis unit 920 and a waveform information analysis unit 930. The decoding device 940 includes a volume setting unit 950, a waveform generation unit 960, a waveform superimposing unit 970, and an amplifying unit 980.

符号化装置９１０は、音声符号化復号化装置９００に対して入力された音声に対応する音声信号の入力を受け付ける。当該音声は、たとえば、音声符号化復号化装置９００が備える、あるいは音声符号化復号化装置９００に接続されたマイク（図示しない）を介して入力される。音声信号は、韻律分析部９２０と波形情報分析部９３０とにそれぞれ入力される。 The encoding device 910 receives an audio signal corresponding to the speech input to the speech encoding/decoding device 900. The speech is input, for example, via a microphone (not shown) that is built into or connected to the speech encoding/decoding device 900. The audio signal is input to both the prosody analysis unit 920 and the waveform information analysis unit 930.

韻律分析部９２０は、入力音声のアクセント情報、音声の間隔などに基づいて、入力音声の韻律情報を生成して出力する。波形情報分析部９３０は、入力された音声信号を分析し、当該音声信号の波形情報を取得して出力する。符号化装置９１０から出力される信号は、復号化装置９４０に入力される。具体的には、韻律分析部９２０から出力される情報は、波形生成部９６０と波形重畳部９７０とに入力される。波形情報分析部９３０から出力される波形情報は、波形生成部９６０に入力される。音量設定部９５０は、図６に示される音量設定部６３０と同様の機能を実現する。 The prosody analysis unit 920 generates and outputs prosodic information of the input speech based on its accent information, the intervals in the speech, and the like. The waveform information analysis unit 930 analyzes the input audio signal, acquires its waveform information, and outputs that information. The signals output from the encoding device 910 are input to the decoding device 940. Specifically, the information output from the prosody analysis unit 920 is input to the waveform generation unit 960 and the waveform superimposing unit 970, and the waveform information output from the waveform information analysis unit 930 is input to the waveform generation unit 960. The volume setting unit 950 realizes the same function as the volume setting unit 630 shown in FIG. 6.

波形生成部９６０は、韻律分析部９２０からの出力と、波形情報分析部９３０からの出力と、音量設定部９５０からの出力とに基づいて作動可能なように、韻律分析部９２０と波形情報分析部９３０と音量設定部９５０とに接続される。波形生成部９６０は、図２に示される波形生成装置２００によって実現される機能と同様の機能を実現する。すなわち、波形生成部９６０は、波形情報分析部９３０によってフレームごとに分析された波形情報から波形を復号化し、韻律分析部９２０によって出力されたエネルギー情報と、音量設定部９５０によって出力される音量指定情報とに基づいて当該波形を増幅し、振幅が調整された波形を出力する。 The waveform generation unit 960 is connected to the prosody analysis unit 920, the waveform information analysis unit 930, and the volume setting unit 950 so that it can operate based on the outputs from these three units. The waveform generation unit 960 realizes the same function as that realized by the waveform generation device 200 shown in FIG. 2. That is, the waveform generation unit 960 decodes a waveform from the waveform information analyzed frame by frame by the waveform information analysis unit 930, amplifies that waveform based on the energy information output by the prosody analysis unit 920 and the volume designation information output by the volume setting unit 950, and outputs the amplitude-adjusted waveform.

波形重畳部９７０は、波形生成部９６０からの出力と、韻律分析部からの出力とに基づいて作動可能なように、韻律分析部９２０と波形生成部９６０とに接続される。波形重畳部９７０は、韻律分析部９２０によって取得されたピッチ情報から導かれるサンプル間隔で、波形生成部９６０によってフレームごとに生成された波形を重畳し出力する。波形重畳部９７０から出力される信号は、増幅部９８０に入力される。 The waveform superimposing unit 970 is connected to the prosody analysis unit 920 and the waveform generation unit 960 so as to operate based on the output from the waveform generation unit 960 and the output from the prosody analysis unit. The waveform superimposing unit 970 superimposes and outputs the waveform generated for each frame by the waveform generating unit 960 at the sample interval derived from the pitch information acquired by the prosody analyzing unit 920. A signal output from the waveform superimposing unit 970 is input to the amplifying unit 980.

増幅部９８０は、波形重畳部９７０からの出力に基づいて作動可能なように、波形重畳部９７０に接続される。具体的には、増幅部９８０は、波形重畳部９７０から出力される波形データを増幅し、復号化された音声データとして出力する。 The amplifying unit 980 is connected to the waveform superimposing unit 970 so as to be operable based on the output from the waveform superimposing unit 970. Specifically, the amplifying unit 980 amplifies the waveform data output from the waveform superimposing unit 970 and outputs it as decoded audio data.

図１０を参照して、本実施の形態の他の局面における波形生成アルゴリズムについて説明する。図１０は、他の局面に従う波形生成装置２００が実行する動作を表わすフローチャートである。なお、図５に示される動作と同一の動作には同一のステップ番号を付してある。したがって、ここではそれらについての説明は繰り返さない。 With reference to FIG. 10, a waveform generation algorithm in another aspect of the present embodiment will be described. FIG. 10 is a flowchart representing an operation executed by waveform generation device 200 according to another aspect. The same steps as those shown in FIG. 5 are denoted by the same step numbers. Therefore, description thereof will not be repeated here.

ステップＳ１０４０にて、波形生成装置２００は、エネルギー変更部２１０によって変更されたエネルギーＥを、増幅率Ｇに変換する（Ｇ←ｇ（Ｅ））。 In step S1040, waveform generation apparatus 200 converts energy E changed by energy changing unit 210 into amplification factor G (G ← g (E)).

ここで、図１１を参照して、本実施の形態における波形増幅の概念について説明する。図１１（Ａ）から図１１（Ｄ）は、それぞれ３つのフレームの波形を表わす図である。図１１（Ａ）は、処理の対象となる元の波形をフレーム１１１１〜フレーム１１１３として表わしたものである。各フレームについて波形を２倍に増幅すると、図１１（Ｂ）に示されるように、フレーム１１２１〜フレーム１１２３が得られる。ここで、フレーム１１２２を表わす図から明らかなように、振幅は、上限値および下限値を超えている。この場合、上限値および下限値を超えた範囲について処理の対象外とすると、各フレームの形状は、図１１（Ｃ）に示されるように、特にフレーム１１３２において振幅の制限値（上限値および下限値）で飽和している。このような信号に基づいて音声が出力されると、振幅が飽和した部分に係る音声は歪んで発せられることとなる。 The concept of waveform amplification in the present embodiment will now be described with reference to FIG. 11. FIGS. 11(A) to 11(D) each show the waveforms of three frames. FIG. 11(A) shows the original waveforms to be processed as frames 1111 to 1113. When the waveform of each frame is amplified by a factor of two, frames 1121 to 1123 are obtained, as shown in FIG. 11(B). As is clear from frame 1122, the amplitude exceeds the upper and lower limit values. If the portion beyond the upper and lower limits is excluded from processing, the frames take the shapes shown in FIG. 11(C), where frame 1132 in particular is saturated at the amplitude limits (the upper and lower limit values). When sound is output based on such a signal, the sound corresponding to the saturated portions is distorted.

一方、本実施の形態に係る波形生成を用いて図１１（Ａ）に示される各フレームを増幅すると、増幅後の波形は、フレーム１１４１〜１１４３として示される。図１１（Ｄ）において明らかなように、特にフレーム１１４２においては、振幅の上限値あるいは下限値を超えることなく予め規定された範囲内で波形が増幅されている。したがって、このような波形を有する信号に基づいて音声が出力されると、図１１（Ｃ）に示される波形に基づいて音声が出力される場合よりも、歪みが少ないあるいは歪みが生じない音声が出力されることになる。 On the other hand, when the frames shown in FIG. 11(A) are amplified using the waveform generation according to the present embodiment, the amplified waveforms are as shown in frames 1141 to 1143. As is clear from FIG. 11(D), in frame 1142 in particular, the waveform is amplified within the predefined range without exceeding the upper or lower limit value of the amplitude. Therefore, sound output based on a signal having such a waveform has less distortion, or no distortion, compared with sound output based on the waveform shown in FIG. 11(C).
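The contrast between FIG. 11(C) and FIG. 11(D) can be illustrated with a small sketch. Note the substitution: the embodiment derives its gain from energy information, whereas this example limits the gain per frame from the frame's peak sample, purely to keep the illustration short. The function names and the limit of 1.0 are assumptions.

```python
from typing import List


def amplify_hard(frame: List[float], gain: float, limit: float = 1.0) -> List[float]:
    """Apply the gain, then clip at +/-limit (the behaviour of FIG. 11(B)-(C))."""
    return [max(-limit, min(limit, s * gain)) for s in frame]


def amplify_limited(frame: List[float], gain: float, limit: float = 1.0) -> List[float]:
    """Reduce the gain for this frame so that no sample exceeds +/-limit.

    Illustrative stand-in for FIG. 11(D): the peak-based rule is an
    assumption, not the energy-based correction described in the source.
    """
    peak = max((abs(s) for s in frame), default=0.0)
    if peak * gain > limit:
        gain = limit / peak  # largest gain that keeps the peak inside the range
    return [s * gain for s in frame]
```

Clipping flattens the waveform at the limit and changes its shape, while the limited gain scales every sample in the frame by the same factor, so the shape of the frame is preserved.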

次に、図１２を参照して、本実施の形態に係る波形生成の機能を実現できるゲーム装置１２００について説明する。図１２は、ゲーム装置１２００のハードウェア構成を表わすブロック図である。ゲーム装置１２００は、操作ボタン１２０２と、データＲＯＭ１２０４と、プログラム用ＲＯＭ（Read Only Memory）１２０６と、ＲＡＭ（Random Access Memory）１２０８と、ＣＰＵ１２１０と、デジタルアナログコンバータ１２３０と、アンプ１２４０と、スピーカ１２５０と、液晶ディスプレイ１２６０と、カードコネクタ１２７０とを備える。カードコネクタ１２７０には、ゲームカートリッジ１２８０が装着される。ＣＰＵ１２１０は、エネルギー変化部１２１２と、エネルギー補正部１２１４と、復号化部１２１６と、増幅部１２１８と、波形接続部１２２０とを含む。 Next, with reference to FIG. 12, a game apparatus 1200 capable of realizing the waveform generation function according to the present embodiment will be described. FIG. 12 is a block diagram showing a hardware configuration of game device 1200. The game device 1200 includes an operation button 1202, a data ROM 1204, a program ROM (Read Only Memory) 1206, a RAM (Random Access Memory) 1208, a CPU 1210, a digital-analog converter 1230, an amplifier 1240, a speaker 1250, A liquid crystal display 1260 and a card connector 1270. A game cartridge 1280 is attached to the card connector 1270. CPU 1210 includes an energy changing unit 1212, an energy correcting unit 1214, a decoding unit 1216, an amplifying unit 1218, and a waveform connecting unit 1220.

操作ボタン１２０２は、ゲーム装置１２００に対する操作を受け付けて、当該操作に応じた信号をＣＰＵ１２１０に出力する。データＲＯＭ１２０４は、ゲーム装置１２００を実現するために予め作成された制御データを格納する。プログラム用ＲＯＭ１２０６は、ゲーム装置１２００に予め規定された処理を実行するためのプログラムを格納する。ＲＡＭ１２０８は、ゲーム装置１２００の動作中に生成されたデータあるいはカードコネクタ１２７０を介して読み取られたゲームカートリッジ１２８０に格納されているデータを一時的に保持する。 The operation button 1202 receives an operation on the game device 1200 and outputs a signal corresponding to the operation to the CPU 1210. The data ROM 1204 stores control data created in advance for realizing the game apparatus 1200. The program ROM 1206 stores a program for executing processing predefined in the game device 1200. The RAM 1208 temporarily holds data generated during the operation of the game apparatus 1200 or data stored in the game cartridge 1280 read via the card connector 1270.

ＣＰＵ１２１０は、操作ボタン１２０２とデータＲＯＭ１２０４とプログラム用ＲＯＭ１２０６とＲＡＭ１２０８とカードコネクタ１２７０からの各出力信号に基づいて作動可能なように、操作ボタン１２０２とデータＲＯＭ１２０４とプログラム用ＲＯＭ１２０６とＲＡＭ１２０８とカードコネクタ１２７０とに接続される。 The CPU 1210 is connected to the operation button 1202, the data ROM 1204, the program ROM 1206, the RAM 1208, and the card connector 1270 so that it can operate based on the output signals from each of them.

ＣＰＵ１２１０は、ゲームカートリッジ１２８０に格納されているデータを用いて、ゲームカートリッジ１２８０に応じた機器としてゲーム装置１２００を作動させるための処理を実行する。具体的には、ＣＰＵ１２１０は、ゲームカートリッジ１２８０に格納されている映像データ１２８２と、音声データ１２８４とを読み出して映像データ１２８２に基づく映像を液晶ディスプレイ１２６０に表示させる。また、ＣＰＵ１２１０は、音声データ１２８４を用いてスピーカ１２５０にそのデータに対応する音声を出力させる。音声データは、人によって発生された音声を分析したデータや、テキスト音声合成によって発生させるためのテキストデータも含む。ここで、音声データ１２８４に基づく音声の適正な出力（歪みを生じない出力）に必要とされるスピーカの能力と、スピーカ１２５０の能力とが一致しない場合がある。たとえば音声データ１２８４が求める出力がスピーカ１２５０による出力を上回る場合がある。この場合、ＣＰＵ１２１０は、音声データ１２８４を用いて上記の波形生成処理を行ない、歪みが生じない音声が出力されるように波形生成処理を実行する。 The CPU 1210 uses the data stored in the game cartridge 1280 to execute processing for operating the game device 1200 as a device corresponding to the game cartridge 1280. Specifically, the CPU 1210 reads the video data 1282 and the audio data 1284 stored in the game cartridge 1280 and causes the liquid crystal display 1260 to display video based on the video data 1282. The CPU 1210 also uses the audio data 1284 to cause the speaker 1250 to output the corresponding sound. The audio data also includes data obtained by analyzing speech uttered by a person and text data from which speech is generated by text-to-speech synthesis. Here, the speaker capability required to output sound based on the audio data 1284 properly (that is, without distortion) may not match the capability of the speaker 1250. For example, the output level required by the audio data 1284 may exceed what the speaker 1250 can deliver. In this case, the CPU 1210 performs the waveform generation processing described above on the audio data 1284 so that sound without distortion is output.

具体的には、ＣＰＵ１２１０において、エネルギー変更部１２１２は、ＲＡＭ１２０８から読み出されるデータに基づいて機能するように構成される。具体的には、エネルギー変更部１２１２は、図２に示されるエネルギー変更部２１０と同様の機能を実現する。 Specifically, in the CPU 1210, the energy changing unit 1212 is configured to function based on data read from the RAM 1208. Specifically, the energy changing unit 1212 realizes the same function as the energy changing unit 210 shown in FIG.

エネルギー補正部１２１４は、データＲＯＭ１２０４からの出力に基づいて機能するように構成される。具体的には、エネルギー補正部１２１４は、エネルギー補正部２２０と同様の機能を実現する。 The energy correction unit 1214 is configured to function based on the output from the data ROM 1204. Specifically, the energy correction unit 1214 realizes the same function as the energy correction unit 220.

復号化部１２１６は、波形復号化部２３０と同様の機能を実現する。具体的には、復号化部１２１６は、カードコネクタ１２７０から送出されたデータ、すなわちゲームカートリッジ１２８０に格納されていた音声データ、に基づいて波形を復号化する。 The decoding unit 1216 realizes the same function as the waveform decoding unit 230. Specifically, the decoding unit 1216 decodes the waveform based on the data sent from the card connector 1270, that is, the audio data stored in the game cartridge 1280.

増幅部１２１８は、エネルギー補正部１２１４と復号化部１２１６からの各出力に基づいて機能するように構成される。具体的には、増幅部１２１８は、増幅部２４０と同様の機能を実現する。 The amplifying unit 1218 is configured to function based on outputs from the energy correcting unit 1214 and the decoding unit 1216. Specifically, the amplifying unit 1218 realizes the same function as the amplifying unit 240.

波形接続部１２２０は、増幅部１２１８からの出力に基づいて機能するように構成される。波形接続部１２２０には、増幅部１２１８から出力される波形データが入力される。この波形データは、音声信号におけるフレーム単位で構成されている。そこで、波形接続部１２２０は、各波形データを接続して音声データを生成する。 The waveform connection unit 1220 is configured to function based on the output from the amplification unit 1218. Waveform data output from the amplifying unit 1218 is input to the waveform connecting unit 1220. This waveform data is configured in units of frames in the audio signal. Therefore, the waveform connecting unit 1220 connects each waveform data to generate audio data.
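The chain of units in the CPU 1210 can be summarized in a short sketch. It assumes several things the source does not state: each frame arrives already decoded, paired with its precomputed energy Ep; the volume setting acts as a multiplicative factor on the energy; decoded frames have unit energy, so the gain is the square root of the corrected energy; and `correct` is a caller-supplied stand-in for the correction data. The name `synthesize` is hypothetical.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple


def synthesize(
    frames: Iterable[Tuple[float, Sequence[float]]],
    volume_scale: float,
    correct: Callable[[float], float],
) -> List[float]:
    """Run (Ep, decoded samples) pairs through the change/correct/amplify/connect chain."""
    output: List[float] = []
    for ep, samples in frames:
        e = ep * volume_scale   # energy changing unit 1212 (assumed multiplicative)
        fe = correct(e)         # energy correcting unit 1214 (f(E), caller-supplied)
        gain = math.sqrt(fe)    # amplifying unit 1218 (assumes unit-energy frames)
        # waveform connecting unit 1220: append the amplified frame to the output
        output.extend(s * gain for s in samples)
    return output
```

For example, with `correct=lambda e: min(e, 4.0)`, a frame whose requested energy is 16.0 is amplified with a gain of 2.0 instead of 4.0, while a frame with energy 1.0 passes through unchanged.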

ＣＰＵ１２１０によって生成された音声データは、デジタルアナログコンバータ１２３０に入力される。デジタルアナログコンバータ１２３０は、その音声データをアナログの音声信号に変換して出力する。音声信号は、アンプ１２４０に入力される。 The audio data generated by the CPU 1210 is input to the digital / analog converter 1230. The digital-analog converter 1230 converts the audio data into an analog audio signal and outputs it. The audio signal is input to the amplifier 1240.

アンプ１２４０は、音量を調整するための信号として操作ボタン１２０２によって出力される信号に基づいて、その音声信号を増幅し出力する。増幅された音声信号は、スピーカ１２５０に入力される。スピーカ１２５０は、当該音声信号を音声に変換して出力する。 The amplifier 1240 amplifies and outputs the audio signal based on the signal output from the operation button 1202 as a signal for adjusting the volume. The amplified audio signal is input to the speaker 1250. The speaker 1250 converts the sound signal into sound and outputs the sound.

上記のような構成により、ゲーム装置１２００は、ゲームカートリッジ１２８０に格納されているゲームのための映像データ１２８２に基づいて液晶ディスプレイ１２６０に映像を表示する。このとき、ゲーム装置１２００は、ＣＰＵ１２１０において歪が生じないように波形が調整された音声データに基づく音声を、スピーカ１２５０から出力することができる。そのため、ゲーム装置１２００の使用者は、操作ボタン１２０２の一部である音量調整ダイヤル（図示しない）を操作して、アンプ１２４０を介した音量の調整によって歪みを調整するといった作業を行なう必要がない。 With the configuration described above, the game device 1200 displays video on the liquid crystal display 1260 based on the video data 1282 for the game stored in the game cartridge 1280. At this time, the game device 1200 can output, from the speaker 1250, sound based on audio data whose waveform has been adjusted in the CPU 1210 so that distortion does not occur. Therefore, the user of the game device 1200 does not need to operate a volume adjustment dial (not shown), which is part of the operation button 1202, to reduce distortion by adjusting the volume via the amplifier 1240.

このようにすると、予め作成された音声データ１２８４が、スピーカ１２５０における出力レベルが小さいデータと、スピーカ１２５０の定格出力を上回る程度に出力レベルが大きなデータとからなる場合であっても、ゲーム装置１２００は、出力レベルが小さいデータに基づく音声をそのまま出力し、出力レベルが大きなデータは歪を抑制する程度に調整した音量で出力する。これにより、ゲーム装置１２００による興趣が妨げられなくなる。 In this way, even when the audio data 1284 created in advance consists of data whose output level at the speaker 1250 is low and data whose output level is high enough to exceed the rated output of the speaker 1250, the game device 1200 outputs sound based on the low-level data as is and outputs the high-level data at a volume adjusted to suppress distortion. As a result, the enjoyment of the game device 1200 is not impaired.

また、出力レベルが異なる様々なゲームカートリッジがカードコネクタ１２７０に装着される場合であっても、ＣＰＵ１２１０における処理によって音量が調整されるため、スピーカ１２５０を、より高い定格出力規格を有するスピーカに変更する必要がない。そのため、ゲーム装置１２００自体をゲームの内容（具体的にはゲームカートリッジ１２８０に格納された音声データ１２８４の出力レベル）に合わせて変更する必要がない。これにより、ゲーム装置１２００のハードウェアとして設計変更が不要になり、またゲーム装置１２００のコストの増加を防止することができる。 Even when various game cartridges having different output levels are attached to the card connector 1270, the volume is adjusted by the processing in the CPU 1210, so the speaker 1250 does not need to be replaced with a speaker having a higher rated output. Therefore, the game device 1200 itself does not need to be changed to suit the content of the game (specifically, the output level of the audio data 1284 stored in the game cartridge 1280). This makes a hardware design change of the game device 1200 unnecessary and prevents an increase in the cost of the game device 1200.

なお、本実施の形態に係る音声合成装置は、テキストデータの受信機能を有する通信端末に対しても適用することができる。通信端末は、たとえば携帯電話、ＰＤＡ（Personal Digital Assistant）として実現される。 Note that the speech synthesizer according to the present embodiment can also be applied to a communication terminal having a text data reception function. The communication terminal is realized as, for example, a mobile phone or a PDA (Personal Digital Assistant).

そこで、図１３を参照して、本実施の形態に係る波形生成の機能を実現する携帯電話１３００について説明する。図１３は、携帯電話１３００のハードウェア構成を表わすブロック図である。携帯電話１３００は、アンテナ１３０２と、通信回路１３０４と、操作ボタン１３０６と、カメラ１３０８と、ＣＰＵ１３１０と、フラッシュメモリ１３１２と、ＲＡＭ１３１４と、データ用ＲＯＭ１３１６と、アナログデジタルコンバータ１３２２と、マイク１３２０と、デジタルアナログコンバータ１３２４と、スピーカ１３２６と、ディスプレイ１３３０と、ＬＥＤ（Light Emitting Diode）１３３２と、データ通信Ｉ／Ｆ（Interface）１３３４と、バイブレータ１３３６と、メモリカード駆動装置１３４０とを備える。メモリカード駆動装置１３４０には、メモリカード１３４２が装着可能である。 A mobile phone 1300 that implements the waveform generation function according to the present embodiment will now be described with reference to FIG. 13. FIG. 13 is a block diagram showing the hardware configuration of the mobile phone 1300. The mobile phone 1300 includes an antenna 1302, a communication circuit 1304, an operation button 1306, a camera 1308, a CPU 1310, a flash memory 1312, a RAM 1314, a data ROM 1316, an analog-digital converter 1322, a microphone 1320, a digital-analog converter 1324, a speaker 1326, a display 1330, an LED (Light Emitting Diode) 1332, a data communication I/F (Interface) 1334, a vibrator 1336, and a memory card driving device 1340. A memory card 1342 can be attached to the memory card driving device 1340.

アンテナ１３０２と通信回路１３０４とは、電気的に接続されている。ＣＰＵ１３１０は、通信回路１３０４と、操作ボタン１３０６と、カメラ１３０８と、フラッシュメモリ１３１２と、ＲＡＭ１３１４と、データ用ＲＯＭ１３１６と、メモリカード駆動装置１３４０と、アナログデジタルコンバータ１３２２と、デジタルアナログコンバータ１３２４と、ディスプレイ１３３０と、ＬＥＤ１３３２と、データ通信Ｉ／Ｆ１３３４と、バイブレータ１３３６とに対して、電気的に接続されている。 The antenna 1302 and the communication circuit 1304 are electrically connected. The CPU 1310 is electrically connected to the communication circuit 1304, the operation button 1306, the camera 1308, the flash memory 1312, the RAM 1314, the data ROM 1316, the memory card driving device 1340, the analog-digital converter 1322, the digital-analog converter 1324, the display 1330, the LED 1332, the data communication I/F 1334, and the vibrator 1336.

アンテナ１３０２によって受信された電波は、通信回路１３０４によって予め規定された処理が実行された後、デジタル信号としてＣＰＵ１３１０に伝送される。当該電波は、通話のための電波およびデータ通信のための電波を含む。ＣＰＵ１３１０は、そのデジタル信号を内部処理し、処理後の信号をデジタルアナログコンバータ１３２４に転送する。 The radio wave received by the antenna 1302 is subjected to processing prescribed in advance by the communication circuit 1304 and then transmitted to the CPU 1310 as a digital signal. The radio waves include radio waves for calls and radio waves for data communication. The CPU 1310 internally processes the digital signal and transfers the processed signal to the digital / analog converter 1324.

デジタルアナログコンバータ１３２４は、ＣＰＵ１３１０から出力されるデジタル信号をアナログ信号に変換し、スピーカ１３２６に送出する。スピーカ１３２６は、そのアナログ信号に基づいて音声（すなわち着信を受けた電話）を出力する。 The digital-analog converter 1324 converts the digital signal output from the CPU 1310 into an analog signal and sends it to the speaker 1326. The speaker 1326 outputs sound (that is, the voice of the received call) based on the analog signal.

マイク１３２０は、携帯電話１３００に対する発話を受け付けて、その発話に応じた電気信号を出力する。アナログデジタルコンバータ１３２２は、マイク１３２０によって出力された信号をデジタル変換処理し、ＣＰＵ１３１０に送出する。ＣＰＵ１３１０は、その信号を送信用の信号に変換し、通信回路１３０４に送出する。通信回路１３０４は、アンテナ１３０２を介してその信号を無線発信する。携帯電話１３００の使用者は、このようにして他の相手と通話することができる。 Microphone 1320 accepts an utterance to mobile phone 1300 and outputs an electrical signal corresponding to the utterance. The analog-digital converter 1322 performs a digital conversion process on the signal output from the microphone 1320 and sends it to the CPU 1310. CPU 1310 converts the signal into a signal for transmission and sends the signal to communication circuit 1304. Communication circuit 1304 wirelessly transmits the signal via antenna 1302. The user of the mobile phone 1300 can talk with other parties in this way.

操作ボタン１３０６は、文字あるいは数字の入力を受け付けるためのボタンとして実現される。あるいは他の局面においては、当該入力を受け付ける構成として、操作ボタン１３０６の代わりにジョグダイヤル、タッチパネルその他の操作部として実現されてもよい。操作ボタン１３０６は、携帯電話１３００に対する操作を受け付けて、その操作に応じた信号をＣＰＵ１３１０に送出する。操作ボタン１３０６に対する操作は、携帯電話１３００の使用者が文字を入力するための操作を含む。 The operation button 1306 is realized as a button for accepting input of characters or numbers. Alternatively, in another aspect, as a configuration for receiving the input, a jog dial, a touch panel, or other operation unit may be realized instead of the operation button 1306. The operation button 1306 receives an operation on the mobile phone 1300 and sends a signal corresponding to the operation to the CPU 1310. The operation on the operation button 1306 includes an operation for the user of the mobile phone 1300 to input characters.

カメラ１３０８は、操作ボタン１３０６に対する操作に基づいて被写体を撮影し、その撮影により取得された信号をＣＰＵ１３１０に送出する。カメラ１３０８は、被写体を静止画としてあるいは動画として撮影できる。ＣＰＵ１３１０は、その信号を一時的に保持し、操作ボタン１３０６に対する保存の指示に応答してフラッシュメモリ１３１２に格納する。 The camera 1308 shoots a subject based on an operation on the operation button 1306, and sends a signal acquired by the shooting to the CPU 1310. The camera 1308 can shoot a subject as a still image or a moving image. The CPU 1310 temporarily holds the signal and stores it in the flash memory 1312 in response to a save instruction for the operation button 1306.

ＲＡＭ１３１４は、操作ボタン１３０６に対して行なわれた操作に基づいてＣＰＵ１３１０によって生成されたデータを一時的に保持する。あるいは、ＲＡＭ１３１４は、アンテナ１３０２によって受信された電波に含まれるデータを一時的に保持する。データ用ＲＯＭ１３１６は、携帯電話１３００によって予め規定された動作を実行させるためのデータあるいはアプリケーションプログラムなどを格納する。ＣＰＵ１３１０は、データ用ＲＯＭ１３１６から当該データあるいはアプリケーションプログラムを読み出し、携帯電話１３００に予め規定された処理を実行させる。 The RAM 1314 temporarily holds data generated by the CPU 1310 based on an operation performed on the operation button 1306. Alternatively, the RAM 1314 temporarily holds data included in the radio wave received by the antenna 1302. The data ROM 1316 stores data or application programs for causing the mobile phone 1300 to perform operations specified in advance. The CPU 1310 reads the data or application program from the data ROM 1316 and causes the mobile phone 1300 to execute a predetermined process.

ディスプレイ１３３０は、ＣＰＵ１３１０から出力されるデータに基づいてそのデータによって規定される画像あるいは映像を表示する。たとえば、ＣＰＵ１３１０がフラッシュメモリ１３１２もしくはメモリカード１３４２に格納されている動画データを読み出すと、ディスプレイ１３３０はそのデータに応じた映像を表示する。 Display 1330 displays an image or video defined by the data based on the data output from CPU 1310. For example, when the CPU 1310 reads moving image data stored in the flash memory 1312 or the memory card 1342, the display 1330 displays an image corresponding to the data.

ＬＥＤ１３３２は、ＣＰＵ１３１０から出力される信号に基づいて予め規定された発光動作を実現する。たとえば、ＬＥＤ１３３２が複数の色を表示可能な場合には、ＬＥＤ１３３２は、ＣＰＵ１３１０から出力される信号に含まれるデータに基づいてそのデータに関連付けられている色で発光する。 The LED 1332 realizes a light emitting operation defined in advance based on a signal output from the CPU 1310. For example, when the LED 1332 can display a plurality of colors, the LED 1332 emits light in a color associated with the data based on data included in a signal output from the CPU 1310.

データ通信Ｉ／Ｆ１３３４は、外部から通信用のケーブルの装着を受け付ける。データ通信Ｉ／Ｆ１３３４は、ＣＰＵ１３１０から出力される信号を当該ケーブルに対して送出する。あるいは、データ通信Ｉ／Ｆ１３３４は、ケーブルを介して受信される信号をＣＰＵ１３１０に対して送出する。 The data communication I / F 1334 accepts attachment of a communication cable from the outside. The data communication I / F 1334 sends a signal output from the CPU 1310 to the cable. Alternatively, the data communication I / F 1334 sends a signal received via a cable to the CPU 1310.

バイブレータ１３３６は、ＣＰＵ１３１０から出力される信号に基づいて予め定められた周波数で振動動作を実行する。 Vibrator 1336 performs a vibration operation at a predetermined frequency based on a signal output from CPU 1310.

メモリカード駆動装置１３４０は、メモリカード１３４２の装着を受け付ける。メモリカード駆動装置１３４０は、メモリカード１３４２に格納されているデータを読み出し、ＣＰＵ１３１０に送出する。逆に、メモリカード駆動装置１３４０は、ＣＰＵ１３１０によって出力されたデータをメモリカード１３４２のデータ記憶領域に格納する。メモリカード１３４２は、たとえば記憶媒体としてフラッシュメモリを用いるが、その他のものが使用されてもよい。 The memory card driving device 1340 accepts the mounting of the memory card 1342. The memory card drive device 1340 reads data stored in the memory card 1342 and sends it to the CPU 1310. Conversely, the memory card driving device 1340 stores the data output by the CPU 1310 in the data storage area of the memory card 1342. The memory card 1342 uses, for example, a flash memory as a storage medium, but other cards may be used.

このような構成において、ＣＰＵ１３１０は、図６に示される音声合成装置６００として機能し得る。すなわち、通信回路１３０４によって受信された文字情報（たとえば電子メール）あるいは操作ボタン１３０６の操作により入力された文字列は、テキストデータとしてＣＰＵ１３１０に入力される。ＣＰＵ１３１０は、操作ボタン１３０６に対する音声出力指示の入力に応答して、当該テキストデータから音声データを合成する。携帯電話１３００は、その音声データに基づく音声をスピーカ１３２６を介して出力する。 In such a configuration, the CPU 1310 can function as the speech synthesizer 600 shown in FIG. That is, character information (for example, electronic mail) received by the communication circuit 1304 or a character string input by operating the operation button 1306 is input to the CPU 1310 as text data. In response to the input of the voice output instruction to the operation button 1306, the CPU 1310 synthesizes voice data from the text data. The mobile phone 1300 outputs a sound based on the sound data via the speaker 1326.

この場合、ＣＰＵ１３１０は、図６に示されたテキスト解析部６１０、音量設定部６３０、韻律生成部６４０、波形データ選択部６５０、波形生成部６６０、波形重畳部６７０、増幅部６８０として機能する。波形辞書記憶部６２０は、たとえばフラッシュメモリ１３１２によって実現される。波形辞書記憶部６２０に格納されているデータは、携帯電話１３００の製造者によって製造時に書き込まれてもよいし、携帯電話１３００の使用者によってデータ通信Ｉ／Ｆ１３３４を介して、あるいはメモリカード駆動装置１３４０を介して入力されてもよい。また、波形生成のための具体的な処理を実行するソフトウェアも、フラッシュメモリ１３１２、データ用ＲＯＭ１３１６のような記憶装置に格納されている。 In this case, the CPU 1310 functions as the text analysis unit 610, the volume setting unit 630, the prosody generation unit 640, the waveform data selection unit 650, the waveform generation unit 660, the waveform superposition unit 670, and the amplification unit 680 shown in FIG. 6. The waveform dictionary storage unit 620 is realized by, for example, the flash memory 1312. The data stored in the waveform dictionary storage unit 620 may be written by the manufacturer of the mobile phone 1300 at the time of manufacture, or may be input by the user of the mobile phone 1300 via the data communication I/F 1334 or via the memory card driving device 1340. Software that executes the specific processing for waveform generation is also stored in a storage device such as the flash memory 1312 or the data ROM 1316.

このような構成により、携帯電話１３００も、テキストデータを用いた音声を出力することができる。ゲーム装置１２００に関して説明したように、本発明の実施の形態に係る波形生成を用いることにより、スピーカ１３２６の能力を変更することなく、音量レベルの大小に関わらず聴取し易い音声が出力可能となる。また、スピーカ１３２６に加えて、イヤホン（図示しない）を介した音声の出力が可能である場合、当該イヤホンを介して出力される音声の音量も調整可能となる。その結果、たとえば携帯電話１３００の使用者の付近が騒がしい場合であっても、小さな音量を大きくしつつ、当初から大音量に設定される音声の音量レベルを抑制した出力が可能になる。これにより、聞き取り易い音声が出力される。 With such a configuration, the mobile phone 1300 can also output voice using text data. As described with respect to the game device 1200, by using the waveform generation according to the embodiment of the present invention, it is possible to output a sound that is easy to hear regardless of the volume level without changing the capability of the speaker 1326. . Further, in the case where sound can be output through an earphone (not shown) in addition to the speaker 1326, the volume of sound output through the earphone can also be adjusted. As a result, for example, even when the vicinity of the user of the mobile phone 1300 is noisy, it is possible to output while suppressing the volume level of the sound set to a large volume from the beginning while increasing the small volume. As a result, a voice that is easy to hear is output.

なお、携帯電話１３００の基本的な動作は、当業者にとって容易に理解できるものである。したがって、ここではそれらについての詳細な説明は繰り返さない。 Note that the basic operation of the mobile phone 1300 can be easily understood by those skilled in the art. Therefore, detailed description thereof will not be repeated here.

以上のようにして、本実施の形態に係る波形生成装置２００によると、大きな音量を出力するための音声データの歪みを防止しつつ音声データを抑制し、大きすぎない音量を出力するための音声データ、すなわち、スピーカの出力能力を超えない程度の音量を出力するデータは、その能力を超えない程度に大きくする。その結果、全体として音量が大きくされた音声の出力が可能になる。 As described above, the waveform generation device 200 according to the present embodiment attenuates audio data intended for loud output so that it is not distorted, and amplifies audio data intended for output that is not too loud, that is, data whose output volume does not exceed the output capability of the speaker, to the extent that the capability is not exceeded. As a result, sound whose overall volume is increased can be output.

具体的には、波形生成装置２００を備える音声合成装置は、音声合成を行なう際に、予め計算されたエネルギー情報を変更して補正を行なう。したがって、音声合成を行なった後のデータを用いて出力音量の調整を行なう場合に比べて、波形エネルギーの計算が不要になる。また、増幅率の計算は簡易に実現されるため、プロセッサの負荷も抑制される。その結果、出力音声の音量調整における応答が速くなる。 Specifically, when performing speech synthesis, the speech synthesizer including the waveform generation device 200 makes the correction by modifying energy information calculated in advance. Therefore, unlike the case where the output volume is adjusted using the data after speech synthesis, the waveform energy does not need to be calculated. In addition, because the amplification factor is simple to compute, the load on the processor is kept low. As a result, volume adjustment of the output sound responds faster.

＜第２の実施の形態＞ <Second Embodiment>
以下、本発明の第２の実施の形態について説明する。本実施の形態に係る波形生成装置１４００は、出力される波形を徐々に飽和するように変形する機能を有する点で前述の波形生成装置２００と異なる。 Hereinafter, a second embodiment of the present invention will be described. The waveform generation apparatus 1400 according to the present embodiment is different from the waveform generation apparatus 200 described above in that it has a function of deforming an output waveform so as to be gradually saturated.

図１４を参照して、本実施の形態に係る波形生成装置１４００について説明する。図１４は、本実施の形態に係る波形生成装置１４００によって実現される機能の構成を表わすブロック図である。波形生成装置１４００は、図２に示される構成に加えて、飽和処理部１４５０を備える。 With reference to FIG. 14, a waveform generation apparatus 1400 according to the present embodiment will be described. FIG. 14 is a block diagram showing a configuration of functions realized by waveform generation device 1400 according to the present embodiment. The waveform generation device 1400 includes a saturation processing unit 1450 in addition to the configuration shown in FIG.

飽和処理部１４５０は、増幅部２４０からの出力に基づいて作動可能なように、増幅部２４０に接続される。より具体的には、飽和処理部１４５０は、増幅部２４０によって出力される波形の入力を受け付ける。飽和処理部１４５０は、その波形に対して非線形に飽和するように波形を変形する。具体的には、飽和処理部１４５０は、入力される波形の振幅と出力される波形の振幅との関係が非線形な関係として規定される関数を用いて、あるいは同等のマップデータを用いて波形の形状を補正する。 The saturation processing unit 1450 is connected to the amplification unit 240 so as to be operable based on the output from the amplification unit 240. More specifically, the saturation processing unit 1450 receives the waveform output by the amplification unit 240 and deforms that waveform so that it saturates nonlinearly. Specifically, the saturation processing unit 1450 corrects the shape of the waveform using a function that defines a nonlinear relationship between the amplitude of the input waveform and the amplitude of the output waveform, or using equivalent map data.

そこで、図１５および図１６を参照して、本実施の形態に係る波形生成装置１４００の波形振幅の飽和特性について説明する。図１５は、図２に示される波形生成装置２００における入力される波形の振幅と出力される波形の振幅との関係を表わす図である。図１６は、図１４に示される本実施の形態に係る波形生成装置１４００における入力される波形の振幅と出力される波形の振幅との関係を表わす図である。 Therefore, with reference to FIG. 15 and FIG. 16, the waveform amplitude saturation characteristic of the waveform generation device 1400 according to the present embodiment will be described. FIG. 15 is a diagram showing the relationship between the amplitude of the input waveform and the amplitude of the output waveform in waveform generating apparatus 200 shown in FIG. FIG. 16 is a diagram showing the relationship between the amplitude of the input waveform and the amplitude of the output waveform in waveform generating apparatus 1400 according to the present embodiment shown in FIG.

図１５を参照して、入力される波形の振幅（以下、入力振幅）（Ａｉｎ）は、範囲（Ｘｍｉｎ−Ｘｍａｘ）において、出力される波形の振幅（以下、出力振幅）（Ａｏｕｔ）と比例の関係にある。ここで出力振幅の値は範囲（Ｙｍｉｎ−Ｙｍａｘ）内の値のみを取り得るため、下限の振幅値Ｙｍｉｎよりも小さな振幅および上限の振幅値Ｙｍａｘを上回る振幅は出力されない。そのため、たとえば入力振幅の値が上限値Ｘｍａｘよりも大きい場合でも、出力振幅の値はＹｍａｘに留まるため、出力される音声に歪みが残る場合がある。 Referring to FIG. 15, the amplitude of the input waveform (hereinafter referred to as input amplitude) (Ain) is proportional to the amplitude of the output waveform (hereinafter referred to as output amplitude) (Aout) in the range (Xmin−Xmax). There is a relationship. Here, since the output amplitude value can take only a value within the range (Ymin−Ymax), an amplitude smaller than the lower limit amplitude value Ymin and an amplitude exceeding the upper limit amplitude value Ymax are not output. Therefore, for example, even when the value of the input amplitude is larger than the upper limit value Xmax, the output amplitude value remains at Ymax, so that distortion may remain in the output audio.
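The characteristic of FIG. 15 can be written as a piecewise function. This is a sketch that assumes the proportional segment passes through the origin; the slope symbol k is introduced here for illustration and does not appear in the source.

```latex
A_{out} =
\begin{cases}
Y_{min}, & A_{in} < X_{min} \\
k \, A_{in}, & X_{min} \le A_{in} \le X_{max} \\
Y_{max}, & A_{in} > X_{max}
\end{cases}
\qquad
k = \frac{Y_{max}}{X_{max}} = \frac{Y_{min}}{X_{min}}
```

The slope of this characteristic changes abruptly at X_min and X_max, which is why saturation in this configuration sets in suddenly.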

これに対して、図１６を参照して、本実施の形態に係る波形生成装置１４００においては、出力振幅の範囲は前述の範囲と同じ範囲を維持しつつ、各出力振幅をもたらす入力振幅の範囲が変更されている。すなわち、下限の入力振幅Ｘｍｉｎを下回る値としてＸ（１）が規定されており、入力振幅Ｘ（２）までは、出力振幅も線形に出力される。同様にして、入力振幅の上限値の近傍においても、上限値Ｘｍａｘを下回る入力振幅Ｘ（３）と上限値Ｘｍａｘを上回る値Ｘ（４）の範囲で、出力振幅は線形で出力されるように特性が補正されている。その結果、入力振幅の値がＸ（１）からＸ（４）まで出力振幅の値は、予め規定された範囲（下限値Ｙｍｉｎから上限値Ｙｍａｘ）の間を取り得る。出力振幅の上限値Ｙｍａｘを超えるような値を与え得る入力振幅は、上限値Ｘ（４）を上回るわずかな領域のみに制限される。このようにすると、音声の出力に関し歪みが急激に生じなくなるため、図２に示される波形生成装置２００を有する音声合成装置よりもさらに大きめの増幅率を得ることができる。 In contrast, referring to FIG. 16, in the waveform generation device 1400 according to the present embodiment, the range of the output amplitude is kept the same as the range described above, while the range of input amplitudes that yields each output amplitude is changed. Specifically, X(1) is defined as a value below the lower limit input amplitude Xmin, and the output amplitude is output linearly from X(1) up to input amplitude X(2). Similarly, near the upper limit of the input amplitude, the characteristic is corrected so that the output amplitude is output linearly between input amplitude X(3), which is below the upper limit Xmax, and X(4), which is above Xmax. As a result, for input amplitudes from X(1) to X(4), the output amplitude takes values within the predefined range (from the lower limit Ymin to the upper limit Ymax). Input amplitudes that could produce a value exceeding the upper limit Ymax of the output amplitude are confined to the small region above X(4). Because distortion in the output sound then no longer sets in abruptly, a larger amplification factor can be used than in a speech synthesizer having the waveform generation device 200 shown in FIG. 2.
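A piecewise-linear characteristic of the kind described for FIG. 16 might be sketched as follows. The sketch assumes a symmetric curve (X(1) = -X(4), X(2) = -X(3), Ymin = -Ymax) and a central segment with slope Ymax/Xmax; the default breakpoint values are illustrative, not taken from the source, and the function name is hypothetical.

```python
import math


def soft_saturate(a_in: float, x_max: float = 1.0, y_max: float = 1.0,
                  x3: float = 0.8, x4: float = 1.6) -> float:
    """Map an input amplitude to an output amplitude with a gradual knee.

    Assumed symmetric curve: proportional up to |a_in| = x3, a shallower
    linear segment from x3 to x4 that reaches y_max at x4, flat beyond x4.
    """
    if not (0 < x3 < x_max < x4):
        raise ValueError("expected 0 < x3 < x_max < x4")
    k = y_max / x_max           # slope of the central proportional segment
    mag = abs(a_in)
    if mag <= x3:
        out = k * mag
    elif mag < x4:
        y3 = k * x3             # output level where the knee begins
        out = y3 + (y_max - y3) * (mag - x3) / (x4 - x3)
    else:
        out = y_max
    return math.copysign(out, a_in)
```

With the default values, an input of 1.2, which a hard limiter would flatten to 1.0, maps to about 0.9, so peaks somewhat above Xmax are compressed rather than cut off.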

以上詳述したように、本発明の第１および第２の実施の形態に係る波形生成装置によれば、音声データの出力レベルの増幅が行なわれる前に、音声波形を調整する。この調整は、音声データを処理するソフトウェアによって実現される。そのため、当該調整は、スピーカ、音声出力端子その他の音声出力装置の規格によって規定される上限値を超える恐れがあるデータに対してのみ適用可能である。このような構成により、波形の調整が真に必要な音声データ（音量レベルが高すぎるようなデータ）に対するレベル調整が行なわれ、調整が不要な音声データに対する調整は行なわれない。 As described above in detail, according to the waveform generation apparatuses according to the first and second embodiments of the present invention, the sound waveform is adjusted before the output level of the sound data is amplified. This adjustment is realized by software that processes audio data. Therefore, the adjustment can be applied only to data that may exceed the upper limit value defined by the speaker, audio output terminal, or other audio output device standards. With such a configuration, level adjustment is performed on audio data that truly requires waveform adjustment (data whose volume level is too high), and adjustment on audio data that does not require adjustment is not performed.

これにより、音声出力装置などのハードウェア構成を変更することなく、どのような出力レベルが音声データに対して規定されていても、出力レベルの上限値を超えない音声、すなわち歪が生じない音声の出力が可能となる。 As a result, audio that does not exceed the upper limit of the output level regardless of what output level is defined for the audio data without changing the hardware configuration of the audio output device or the like, that is, audio that does not cause distortion Can be output.

なお、本発明の実施の形態に係る波形生成装置および音声合成装置は、音声出力機能を有するコンピュータシステムによっても実現可能である。そこで、図１７を参照して、音声合成装置として機能するコンピュータシステム１７００について説明する。図１７は、コンピュータシステム１７００のハードウェア構成を表わすブロック図である。 Note that the waveform generation device and the speech synthesizer according to the embodiment of the present invention can also be realized by a computer system having a speech output function. A computer system 1700 that functions as a speech synthesizer will be described with reference to FIG. FIG. 17 is a block diagram showing a hardware configuration of computer system 1700.

コンピュータシステム１７００は、ハードウェアとして、ＣＰＵ１７１０と、コンピュータシステム１７００の使用者による指示の入力を受けるマウス１７２０およびキーボード１７３０と、ＣＰＵ１７１０によるプログラムの実行により生成されたデータ、又はマウス１７２０若しくはキーボード１７３０を介して入力されたデータを揮発的に格納するＲＡＭ１７４０と、データを不揮発的に格納するハードディスク１７５０と、ＣＤ（Compact Disk）−ＲＯＭ駆動装置１７６０と、音声データから音声信号を生成して出力するサウンドカード１７７０と、サウンドカード１７７０から出力される信号に基づいて音声を出力するスピーカ１７７２と、モニタ１７８０と、通信ＩＦ１７９０とを含む。各構成要素は、相互にデータバスによって接続されている。ＣＤ−ＲＯＭ駆動装置１７６０には、ＣＤ−ＲＯＭ１７６２が装着される。 The computer system 1700 includes, as hardware, a CPU 1710, a mouse 1720 and a keyboard 1730 that receive input of instructions from a user of the computer system 1700, data generated by execution of a program by the CPU 1710, or a mouse 1720 or a keyboard 1730. RAM 1740 for storing the input data in a volatile manner, hard disk 1750 for storing the data in a nonvolatile manner, CD (Compact Disk) -ROM drive 1760, and a sound card for generating and outputting an audio signal from the audio data 1770, a speaker 1772 that outputs sound based on a signal output from sound card 1770, a monitor 1780, and a communication IF 1790. Each component is connected to each other by a data bus. A CD-ROM 1762 is mounted on the CD-ROM drive 1760.

コンピュータシステム１７００における情報処理は、ハードウェアおよびＣＰＵ１７１０により実行されるソフトウェアによって実現される。このようなソフトウェアは、ハードディスク１７５０に予め記憶されている場合がある。また、ソフトウェアは、ＣＤ−ＲＯＭ１７６２その他の記憶媒体に格納されて、プログラム製品として流通している場合もある。あるいは、ソフトウェアは、いわゆるインターネットに接続されている情報提供事業者によってダウンロード可能なプログラム製品として提供される場合もある。このようなソフトウェアは、ＣＤ−ＲＯＭ駆動装置１７６０その他の読取装置によりその記憶媒体から読み取られて、あるいは、通信ＩＦ１７９０を介してダウンロードされた後、ハードディスク１７５０に一旦格納される。そのソフトウェアは、ＣＰＵ１７１０によってハードディスク１７５０から読み出され、ＲＡＭ１７４０に実行可能なプログラムの形式で格納される。ＣＰＵ１７１０は、そのプログラムを実行する。 Information processing in the computer system 1700 is realized by hardware and software executed by the CPU 1710. Such software may be stored in the hard disk 1750 in advance. The software may be stored in a CD-ROM 1762 or other storage medium and distributed as a program product. Alternatively, the software may be provided as a program product that can be downloaded by an information provider connected to the so-called Internet. Such software is read from the storage medium by the CD-ROM drive 1760 or other reading device, or downloaded via the communication IF 1790 and then temporarily stored in the hard disk 1750. The software is read from the hard disk 1750 by the CPU 1710 and stored in the RAM 1740 in the form of an executable program. CPU 1710 executes the program.

The hardware components of the computer system 1700 shown in FIG. 17 are of a general type. Therefore, the essential part of the present invention can be said to be the software stored in the RAM 1740, the hard disk 1750, the CD-ROM 1762, or another storage medium, or software that can be downloaded via a network. Because the operation of each hardware component of the computer system 1700 is well known, a detailed description is not repeated.

The recording medium is not limited to a CD-ROM, an FD, or a hard disk. It may be any medium that carries the program in a fixed manner, such as a magnetic tape, a cassette tape, an optical disc (MO (Magnetic Optical Disc), MD (Mini Disc), or DVD (Digital Versatile Disc)), an IC (Integrated Circuit) card (including a memory card), an optical card, or a semiconductor memory such as a mask ROM, an EPROM (Erasable Programmable Read Only Memory), an EEPROM (Electronically Erasable Programmable Read Only Memory), or a flash ROM.

The term "program" as used here includes not only a program that can be executed directly by a CPU but also a program in source form, a compressed program, an encrypted program, and the like.

The embodiments disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

The present invention is applicable to devices having an audio output interface such as a speaker or an audio output terminal, for example mobile phones and other communication devices, PCs and other information processing equipment, and game devices.

FIG. 1 is a block diagram showing the configuration of functions realized by a conventional waveform generation device.
FIG. 2 is a block diagram showing the configuration of functions realized by the waveform generation device according to the first embodiment of the present invention.
FIG. 3 is a diagram showing the relationship between the energy information E before correction and the energy information f(E) after correction.
FIG. 4 is a diagram showing the relationship between energy information and amplification factor.
FIG. 5 is a flowchart showing the operation executed by the waveform generation device according to the first embodiment of the present invention.
FIG. 6 is a block diagram showing the configuration of functions realized by a speech synthesizer that uses the waveform generation configuration according to the present embodiment.
FIG. 7 is a diagram (part 1) conceptually showing one mode of data storage in the waveform dictionary storage unit.
FIG. 8 is a diagram (part 1) conceptually showing one mode of data storage in the waveform dictionary storage unit.
FIG. 9 is a block diagram showing the configuration of functions realized by a speech encoding/decoding device having the waveform generation configuration according to the present embodiment.
FIG. 10 is a flowchart showing the operation executed by a waveform generation device according to another aspect of the present embodiment.
FIG. 11 is a diagram showing the waveform of a frame.
FIG. 12 is a block diagram showing the hardware configuration of a game device that can realize the waveform generation function according to the embodiment of the present invention.
FIG. 13 is a block diagram showing the hardware configuration of a mobile phone that realizes the waveform generation function according to the embodiment of the present invention.
FIG. 14 is a block diagram showing the configuration of functions realized by the waveform generation device according to the second embodiment of the present invention.
FIG. 15 is a diagram showing the relationship between the amplitude of the waveform input to the waveform generation device according to the first embodiment of the present invention and the amplitude of the waveform it outputs.
FIG. 16 is a diagram showing the relationship between the amplitude of the waveform input to the waveform generation device according to the second embodiment of the present invention and the amplitude of the waveform it outputs.
FIG. 17 is a block diagram showing the hardware configuration of the computer system 1700.

Explanation of symbols

100, 200, 1400: waveform generation device; 600: speech synthesizer; 900: speech encoding/decoding device; 910: encoding device; 940: decoding device; 1200: game device; 1210: CPU; 1280: game cartridge; 1302: antenna; 1342: memory card.

Claims

A speech synthesizer that synthesizes and outputs speech, comprising:
acquisition means for acquiring data for generating output speech and designation information for designating a volume of the output speech;
storage means for storing waveform information for generating a waveform of each part obtained by dividing speech into predetermined units, energy information for each part, and correction data for correcting the energy information;
changing means for changing each piece of the energy information to energy information corresponding to the designation information;
correction means for correcting, based on the correction data, each piece of energy information changed by the changing means;
decoding means for decoding each waveform of the output speech based on the waveform information;
amplifying means for amplifying each waveform decoded by the decoding means based on the corrected energy information; and
output means for outputting speech based on each waveform amplified by the amplifying means.

The speech synthesizer according to claim 1, wherein the correction means corrects the energy information changed by the changing means so that it gradually approaches an upper limit value defined in advance for the energy information.

The speech synthesizer according to claim 2, wherein the correction means corrects the energy information nonlinearly.

The speech synthesizer according to claim 2, wherein the correction means corrects the energy information continuously.

The speech synthesizer according to claim 1, wherein the correction data includes energy information before correction and the corresponding energy information after correction.
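
As an illustration of correction data that holds energy information before and after correction, the following minimal Python sketch stores the data as a table of pairs and looks up the corrected value f(E). The table values, the name `correct_energy`, and the use of piecewise-linear interpolation between entries are assumptions made for this example; the source does not specify them.

```python
from bisect import bisect_right

# Hypothetical correction data: (energy before correction, energy after correction).
# The values are illustrative only. They rise monotonically and flatten toward 100.
CORRECTION_TABLE = [
    (0.0, 0.0),
    (50.0, 50.0),
    (100.0, 80.0),
    (200.0, 95.0),
    (400.0, 100.0),
]


def correct_energy(energy, table=CORRECTION_TABLE):
    """Return the corrected energy f(E) by piecewise-linear interpolation.

    Interpolation between table entries is an assumption of this sketch.
    Inputs outside the table are clamped to the first or last entry.
    """
    xs = [before for before, _ in table]
    ys = [after for _, after in table]
    if energy <= xs[0]:
        return ys[0]
    if energy >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, energy)  # index of the first entry strictly above energy
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (energy - x0) / (x1 - x0)
```

With this table, small energies pass through unchanged while large energies are compressed toward the last entry, which mirrors the behavior of approaching an upper limit gradually.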

The speech synthesizer according to claim 1, wherein the output means includes:
waveform connecting means for connecting the waveforms amplified by the amplifying means to generate a synthesized waveform; and
speech output means for outputting speech based on the synthesized waveform.

The speech synthesizer according to claim 6, wherein the output means further includes waveform saturation means for adjusting the synthesized waveform output from the waveform connecting means so that it gradually approaches a predetermined upper limit value, and
the speech output means outputs speech based on the synthesized waveform adjusted by the waveform saturation means.
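
The waveform saturation described above can be pictured as a soft limiter applied to the connected waveform. The sketch below is one possible realization, not the curve used in the source: the `tanh` shape, the function name `saturate`, and the 16-bit limit of 32767 are all assumptions chosen for illustration.

```python
import math


def saturate(samples, limit=32767.0):
    """Soft-limit each sample so that its magnitude never exceeds `limit`.

    Assumption: a tanh curve is used. Small samples pass almost unchanged,
    and large samples approach `limit` gradually instead of clipping hard.
    """
    return [limit * math.tanh(s / limit) for s in samples]
```

Compared with hard clipping, a smooth curve of this kind avoids the abrupt flat tops that are a common source of audible distortion.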

The speech synthesizer according to claim 1, further comprising a drive device into which a removable recording medium is loaded and which drives the recording medium, the recording medium storing audio data and the designation information associated with the audio data,
wherein the acquisition means includes reading means for reading the audio data and the designation information from the recording medium loaded in the drive device.

The speech synthesizer according to claim 1, wherein the acquisition means includes:
receiving means for receiving a signal including character information; and
input means for receiving an input of the designation information.

The speech synthesizer according to claim 1, wherein the acquisition means includes:
a microphone that receives an utterance and outputs an audio signal corresponding to the utterance;
waveform information analyzing means for analyzing the audio signal and outputting waveform information corresponding to the utterance; and
prosodic analysis means for outputting prosodic information including energy information corresponding to the utterance.

A speech synthesizer that synthesizes and outputs speech, comprising:
a memory that stores waveform information for generating a waveform of each part obtained by dividing speech into predetermined units, energy information for each part, correction data for correcting the energy information, and a program; and
a processor that receives a plurality of instructions from the program,
wherein the instructions include:
an acquisition step of acquiring data for generating output speech and designation information for designating a volume of the output speech;
a changing step of changing each piece of the energy information to energy information corresponding to the designation information;
a correction step of correcting, based on the correction data, each piece of energy information changed by the changing step;
a decoding step of decoding each waveform of the output speech based on the waveform information; and
an amplification step of amplifying each waveform decoded by the decoding step based on the corrected energy information,
and wherein the speech synthesizer further comprises an output unit that outputs speech based on each amplified waveform.

The speech synthesizer according to claim 11, wherein the correction step corrects the energy information changed by the changing step so that it gradually approaches an upper limit value defined in advance for the energy information.

The speech synthesizer according to claim 12, wherein the correction step corrects the energy information nonlinearly.

The speech synthesizer according to claim 12, wherein the correction step corrects the energy information continuously.

The speech synthesizer according to claim 11, wherein the correction data includes energy information before correction and the corresponding energy information after correction.

The speech synthesizer according to claim 11, wherein the instructions further include a waveform connection step of connecting the waveforms amplified by the amplification step to generate a synthesized waveform, and
the output unit outputs speech based on the synthesized waveform.

The speech synthesizer according to claim 16, wherein the instructions further include a waveform saturation step of adjusting the synthesized waveform generated in the waveform connection step so that it approaches a predetermined upper limit value, and
the output unit outputs speech based on the synthesized waveform adjusted in the waveform saturation step.

The speech synthesizer according to claim 11, further comprising a drive device into which a removable recording medium is loaded and which drives the recording medium, the recording medium storing audio data and the designation information associated with the audio data,
wherein the acquisition step includes a reading step of reading the audio data and the designation information from the recording medium loaded in the drive device.

The speech synthesizer according to claim 11, wherein the acquisition step includes:
a receiving step of receiving a signal including character information; and
an input step of receiving an input of the designation information.

The speech synthesizer according to claim 11, further comprising a microphone that receives an utterance and outputs an audio signal corresponding to the utterance,
wherein the acquisition step includes:
a step of analyzing the audio signal and outputting waveform information corresponding to the utterance; and
a step of outputting prosodic information including energy information corresponding to the utterance.

A speech synthesis method for synthesizing and outputting speech, comprising the steps of:
acquiring data for generating output speech and designation information for designating a volume of the output speech;
loading waveform information for generating a waveform of each part obtained by dividing speech into predetermined units, energy information for each part, and correction data for correcting the energy information;
changing each piece of the energy information to energy information corresponding to the designation information;
correcting, based on the correction data, each piece of energy information changed in the changing step;
decoding each waveform of the output speech based on the waveform information;
amplifying each waveform decoded in the decoding step based on the corrected energy information; and
outputting speech based on each waveform amplified in the amplifying step.
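
The order of the method steps above (change, correct, decode, amplify, output) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation from the source: the volume designation is treated as a linear multiplier, a closed-form exponential curve stands in for the stored correction data, the "decoder" simply copies samples, and the gain is chosen so that the mean square of each unit equals the corrected energy. All function names are hypothetical.

```python
import math


def change_energy(stored_energy, volume):
    # Assumption: the designation information acts as a linear volume multiplier.
    return stored_energy * volume


def correct_energy(energy, upper_limit=1.0):
    # Assumption: a continuous, nonlinear curve that approaches upper_limit
    # gradually stands in for the stored correction data.
    return upper_limit * (1.0 - math.exp(-energy / upper_limit))


def decode_waveform(waveform_info):
    # Placeholder decoder: the waveform information is already a list of samples.
    return list(waveform_info)


def amplify(samples, target_energy):
    # Assumption: energy is the mean square of the unit, so the gain is
    # sqrt(target / current). Empty or silent units are returned unchanged.
    if not samples:
        return []
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return list(samples)
    gain = math.sqrt(target_energy / mean_square)
    return [gain * s for s in samples]


def synthesize(units, volume, upper_limit=1.0):
    """Process (waveform_info, stored_energy) units in order and join the results."""
    output = []
    for waveform_info, stored_energy in units:
        changed = change_energy(stored_energy, volume)
        corrected = correct_energy(changed, upper_limit)
        samples = decode_waveform(waveform_info)
        output.extend(amplify(samples, corrected))
    return output
```

Because the correction is applied to the energy before amplification, raising the volume designation increases the output level only up to the limit, so the amplified samples stay bounded without clipping individual samples.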

A program for causing a computer comprising a memory and a processor to implement a speech synthesis method, the speech synthesis method comprising:
a step in which the processor acquires, from the memory, data for generating output speech and designation information for designating a volume of the output speech;
a step in which the processor reads, from the memory, waveform information for generating a waveform of each part obtained by dividing speech into predetermined units, energy information for each part, and correction data for correcting the energy information;
a step in which the processor changes each piece of the energy information to energy information corresponding to the designation information;
a correction step in which the processor corrects, based on the correction data, each piece of energy information after the change;
a step in which the processor decodes each waveform of the output speech based on the waveform information;
a step in which the processor amplifies each waveform decoded in the decoding step based on the corrected energy information; and
a step of outputting an audio signal based on each of the amplified waveforms.