JP6044284B2

JP6044284B2 - Speech synthesizer

Info

Publication number: JP6044284B2
Application number: JP2012250436A
Authority: JP
Inventors: 嘉山　啓
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2012-11-14
Filing date: 2012-11-14
Publication date: 2016-12-14
Anticipated expiration: 2032-11-14
Also published as: JP2014098800A

Description

The present invention relates to speech synthesis technology, and more particularly to real-time speech synthesis technology.

Speech synthesis techniques that use electrical signal processing and multiple types of synthesis information to synthesize speech signals, such as guidance voices for voice guidance, reading voices for literary works, or singing voices for songs, have become widespread. For singing voice synthesis, for example, the synthesis information consists of music expression information: prosodic information indicating the prosodic changes in the song to be synthesized (for example, note information representing the pitch and duration of each note in the song's melody) and information representing the phoneme sequence of the song's lyrics. When synthesizing a guidance voice or a reading voice of a literary work, information representing the phoneme sequence of the guidance text or literary text is used as synthesis information representing the utterance content, and prosodic information indicating prosodic changes such as intonation and accent is used as synthesis information representing the manner of utterance. In addition, information specifying sound intensity is sometimes used as synthesis information. Conventionally, this kind of speech synthesis has generally used a so-called batch processing method, in which all the synthesis information covering the entire speech to be synthesized is input to the speech synthesizer in advance, and a speech signal representing the waveform of the entire speech is generated in one pass from that information. In recent years, however, real-time speech synthesis techniques have also been proposed (see, for example, Patent Document 1).

One example of real-time speech synthesis is a technique in which information representing the lyrics of an entire song is input to a singing synthesizer in advance, and singing synthesis proceeds as the user specifies, as needed, the pitch and other attributes with which each lyric is to be voiced, using a keyboard modeled on a piano keyboard. More recently, it has also been proposed to perform singing synthesis with a singing synthesis keyboard in which a phoneme information input section, where operators for entering the consonants and vowels of the lyrics are arranged, sits side by side with a note information input section modeled on a piano keyboard, so that the user enters both phoneme information and note information sequentially in real time.

Patent Document 1: Japanese Patent No. 3879402

When real-time singing synthesis is performed with a singing synthesis keyboard, even if only the essential synthesis information (that is, phoneme information and prosodic information) is used, the user must operate the phoneme information input section with one hand to enter phoneme information and the note information input section with the other hand to enter prosodic information, so the two input timings differ slightly. The singing synthesizer, meanwhile, does not start the singing synthesis process until both phoneme information and prosodic information are available. Consequently, when the input timings of the two differ, the synthesized singing voice is output later than the timing of whichever was input first (that is, the timing at which the user first signaled the intent to output a singing voice), and this delay can feel unnatural to the user. The same applies to speech synthesis that synthesizes guidance voices or reading voices of literary works in real time. Moreover, when the input order of the multiple types of synthesis information is fixed in advance, each type must be entered in that order, which limits the freedom of input.

The present invention has been made in view of the above problems. Its object is to provide a technique for real-time speech synthesis that allows sound to be output without delay from the earliest input timing, even when the input timings of the multiple types of synthesis information used for speech synthesis differ, while avoiding any reduction in the freedom with which that information can be input.

To solve the above problems, the present invention provides a speech synthesizer comprising: input means for inputting multiple types of synthesis information used for synthesizing a speech signal, the multiple types including phoneme information indicating the phonemes of the speech to be synthesized and prosodic information indicating prosodic changes in that speech; and speech synthesis means that synthesizes and outputs a speech signal using the synthesis information input via the input means between the time the earliest of the multiple types of synthesis information is input to the input means and the time at least the phoneme information and the prosodic information are both available, and that outputs a dummy sound signal representing a dummy sound from the time the earliest synthesis information is input until output of the speech signal begins. Among the multiple types of synthesis information used for synthesizing a speech signal, examples of synthesis information other than phoneme information and prosodic information include information indicating sound intensity, information indicating the application of vibrato, and information specifying the voice quality of the synthesized speech. Synthesis information other than phoneme information and prosodic information is, however, not essential to synthesizing a speech signal.

With such a speech synthesizer, a speech signal is synthesized from the synthesis information input between the time the earliest of the multiple types of synthesis information is input to the input means and the time at least the essential synthesis information, namely the phoneme information and the prosodic information, is available, and a dummy sound is output from the input of the earliest synthesis information until output of that speech signal begins. Sound is therefore output from the moment the user first signals the intent to synthesize speech (that is, from the input of the earliest of the multiple types of synthesis information), which avoids giving the user a needless sense of unnaturalness. In addition, because the speech synthesizer of the present invention imposes no order on the input of the multiple types of synthesis information, the freedom with which that information can be input is not reduced.

Various forms of dummy sound are conceivable. In a first aspect, a sound that depends on the earliest synthesis information is used as the dummy sound. In a second aspect, the dummy sound is a sustainable sound unrelated to the earliest synthesis information, such as a noise sound, a breath sound, a nasal sound, a predetermined phoneme output at a predetermined pitch, or a periodic sound of a given pitch (a sound whose waveform is a sine wave). As a concrete example of the first aspect, consider applying the present invention to real-time singing synthesis using only the essential synthesis information. If phoneme information is input first and the first phoneme it indicates is sustainable, that first phoneme output at a predetermined pitch is used as the dummy sound. If note information, which serves as the prosodic information, is input first, a predetermined sustainable phoneme output at the pitch indicated by the note information, or a periodic sound having that pitch, is used as the dummy sound. Here, a sustainable phoneme means a vowel or a sustainable consonant such as a fricative or a nasal. If phoneme information is input first and the first phoneme it indicates is not sustainable, a predetermined sustainable sound such as a noise sound may be used as the dummy sound.
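The selection rule of the first aspect can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `choose_dummy_sound`, the set `SUSTAINABLE`, and the default pitch and phoneme are all assumptions introduced here, since the description gives no concrete values.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a small illustrative set of sustainable phonemes
# (vowels plus a few fricatives and nasals). The description gives no list.
SUSTAINABLE = {"a", "i", "u", "e", "o", "s", "z", "m", "n"}

DEFAULT_PITCH = "C4"    # hypothetical "predetermined pitch"
DEFAULT_PHONEME = "a"   # hypothetical "predetermined sustainable phoneme"


@dataclass(frozen=True)
class DummySound:
    kind: str                      # "phoneme" or "noise"
    phoneme: Optional[str] = None  # set only when kind == "phoneme"
    pitch: Optional[str] = None    # set only when kind == "phoneme"


def choose_dummy_sound(first_kind: str, value: str) -> DummySound:
    """Pick a dummy sound from the earliest synthesis information.

    first_kind is "phoneme" or "note"; value is the phoneme symbol or
    the pitch name carried by that earliest input.
    """
    if first_kind == "note":
        # Note first: predetermined sustainable phoneme at the note's pitch.
        return DummySound("phoneme", DEFAULT_PHONEME, value)
    if first_kind == "phoneme":
        if value in SUSTAINABLE:
            # Sustainable head phoneme: hold it at the predetermined pitch.
            return DummySound("phoneme", value, DEFAULT_PITCH)
        # Non-sustainable head phoneme (e.g. a plosive): fall back to noise.
        return DummySound("noise")
    raise ValueError(f"unknown synthesis information type: {first_kind}")
```

For example, entering the consonant "s" first would select "s" held at the default pitch, while entering a plosive such as "t" first would fall back to the noise sound.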

In a more preferred aspect, the speech synthesis means outputs the dummy sound signal after adjusting it (or while adjusting the signal levels of both the speech signal and the dummy sound signal) so that the dummy sound connects smoothly to the speech represented by the speech signal synthesized from the synthesis information input to the input means. Because the dummy sound then gives way smoothly to the synthesized speech, the user is spared the unnatural impression caused by an abrupt switch between sounds. In another aspect, the speech synthesis means outputs the dummy sound signal while gradually raising its signal level to a predetermined value, so that the volume of the dummy sound increases gradually and does not strike the user as unnatural. In yet another aspect, the speech synthesis means switches the dummy sound signal so that several kinds of dummy sound are output in sequence from the time the earliest of the multiple types of synthesis information is input until output begins of the speech synthesized from the synthesis information input by the time at least the essential synthesis information is available.
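The level adjustments described here can be illustrated with a linear fade-in and a linear crossfade. The description does not specify the shape of either curve, so the linear ramps and the names `fade_in_gain` and `crossfade` below are assumptions made for this sketch.

```python
from typing import List


def fade_in_gain(t: float, ramp: float, target: float = 1.0) -> float:
    """Gain applied to the dummy sound t seconds after it starts.

    The gain rises linearly from 0 to `target` over `ramp` seconds and
    then stays at `target` (the "predetermined value").
    """
    if ramp <= 0:
        return target
    return target * min(max(t / ramp, 0.0), 1.0)


def crossfade(dummy: List[float], voice: List[float]) -> List[float]:
    """Blend equal-length sample blocks from dummy sound to synthesized voice.

    The dummy weight falls linearly from 1 to 0 while the voice weight
    rises from 0 to 1, so the levels of both signals are adjusted.
    """
    n = len(dummy)
    if n != len(voice):
        raise ValueError("blocks must have the same length")
    if n == 1:
        return [voice[0]]
    out = []
    for i in range(n):
        w = i / (n - 1)  # 0.0 at the first sample, 1.0 at the last
        out.append((1.0 - w) * dummy[i] + w * voice[i])
    return out
```

A real implementation might prefer an equal-power curve; the linear form is used only because it is the simplest way to show both signals being scaled during the transition.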

In another preferred aspect, if the phoneme information and the prosodic information are not both available by the time a predetermined waiting time has elapsed since the earliest of the multiple types of synthesis information was input, the speech synthesis means may stop outputting the dummy sound signal. This prevents the dummy sound from continuing indefinitely when speech synthesis cannot be performed because part of the essential synthesis information is missing, for example when phoneme information has been input but note information has not (or vice versa). Stop instruction means that lets the user order output of the dummy sound signal to stop may also be provided (when a singing synthesis keyboard is used as the input means, an operator for issuing that instruction), with the speech synthesis means stopping output of the dummy sound signal when the instruction is given. Then, even if output of the dummy sound was triggered by a mistaken input of synthesis information, the dummy sound need not continue for the full predetermined time.

Another aspect of the present invention is to provide a program that causes a computer to execute a first process and a second process. The first process synthesizes and outputs a speech signal using the synthesis information input between the time the earliest of multiple types of synthesis information is input and the time at least phoneme information and prosodic information are both available, where the multiple types of synthesis information are used for synthesizing the speech signal and include the phoneme information, which indicates the phonemes of the speech to be synthesized, and the prosodic information, which indicates prosodic changes in that speech. The second process outputs a dummy sound signal representing a dummy sound from the time the earliest synthesis information is input until output of the speech signal begins. The program may be distributed by writing it to a computer-readable recording medium such as a CD-ROM (Compact Disc Read-Only Memory) or by download over a telecommunication line such as the Internet.

A diagram showing a configuration example of the singing synthesizer 1 according to the first embodiment of the present invention.
A diagram for explaining the operation of the singing synthesizer 1.
A diagram for explaining the dummy sound output process of the second embodiment.
A diagram for explaining the dummy sound output process of the third embodiment.
A diagram for explaining the dummy sound output process of the fourth embodiment.

Embodiments of the present invention are described below with reference to the drawings.
(A: First Embodiment)
FIG. 1 is a block diagram showing a configuration example of a singing synthesizer 1, which is one embodiment of the speech synthesizer of the present invention. The singing synthesizer 1 has the user enter multiple types of synthesis information sequentially, note by note (in this embodiment, two types: phoneme information and prosodic information, which are the essential synthesis information for singing synthesis), and performs real-time singing synthesis. As shown in FIG. 1, the singing synthesizer 1 includes a control unit 110, an operation unit 120, a display unit 130, an audio output unit 140, an external device interface (hereinafter abbreviated "I/F") unit 150, a storage unit 160, and a bus 170 that mediates data exchange among these components.

The control unit 110 is, for example, a CPU (Central Processing Unit). By operating according to the singing synthesis program stored in the storage unit 160, the control unit 110 functions as the control center of the singing synthesizer 1. The processing that the control unit 110 executes according to this program is detailed later. Although this embodiment uses a CPU as the control unit 110, a DSP (Digital Signal Processor) may of course be used instead.

The operation unit 120 is the singing synthesis keyboard described above. By operating the operation unit 120, the user of the singing synthesizer 1 can specify the notes that make up the melody of the song to be synthesized and the phonemes of the lyrics to be sung on those notes. For example, to specify "sa" as a lyric phoneme, the user presses the operator for the consonant "s" and then the operator for the vowel "a". To specify "C4" as the pitch of the corresponding note, the user presses the key for that pitch to start the sound and releases the key to end it. The length of time the key is held is thus the duration of the note. When an operation specifying a phoneme is performed, the operation unit 120 gives the control unit 110 phoneme information indicating that phoneme. When a key is pressed to start a sound, the operation unit 120 gives the control unit 110 a note-on event (a MIDI (Musical Instrument Digital Interface) event) for the pressed key as note information instructing the start of sounding; when the key is released, it gives the control unit 110 a note-off event (a MIDI event) for that key as note information instructing the end of sounding. The note information entered by operating the operators of the note information input section thus serves as prosodic information indicating prosodic changes in the singing voice. In other words, the operation unit 120 serves as the input means for inputting the multiple types of synthesis information used to synthesize the singing voice.
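As an illustration of how key-hold time becomes note duration, the following sketch pairs note-on and note-off events. The tuple layout `(time, kind, note)` and the function name `note_durations` are hypothetical and are not taken from the description. MIDI note number 60 stands for middle C here; which octave label that number receives depends on the naming convention in use.

```python
from typing import Dict, List, Tuple

Event = Tuple[float, str, int]      # (time in seconds, "on" or "off", MIDI note number)
Note = Tuple[int, float, float]     # (MIDI note number, onset time, duration)


def note_durations(events: List[Event]) -> List[Note]:
    """Pair each note-on with the next note-off for the same key.

    The duration of a note is the time the key was held, i.e. the
    note-off time minus the note-on time.
    """
    pending: Dict[int, float] = {}  # note number -> onset time of the held key
    notes: List[Note] = []
    for time, kind, note in sorted(events, key=lambda e: e[0]):
        if kind == "on":
            pending[note] = time
        elif kind == "off" and note in pending:
            onset = pending.pop(note)
            notes.append((note, onset, time - onset))
    return notes
```

Keys that are still held when the event list ends produce no entry, because their duration is not yet known.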

The display unit 130 is, for example, a liquid crystal display and its drive circuit, and under the control of the control unit 110 displays various images such as menu images that prompt use of the singing synthesizer 1. As shown in FIG. 1, the audio output unit 140 includes a D/A converter 142, an amplifier 144, and a speaker 146. The D/A converter 142 performs D/A conversion on the digital audio data supplied from the control unit 110 and supplies the resulting analog audio signal to the amplifier 144. The amplifier 144 amplifies the signal level (that is, the volume) of the audio signal supplied from the D/A converter 142 to a level suitable for driving the speaker and supplies it to the speaker 146. The speaker 146 outputs the audio signal supplied from the amplifier 144 as sound.

The external device I/F unit 150 is a collection of interfaces, such as a USB (Universal Serial Bus) interface and an audio interface, for connecting other external devices to the singing synthesizer 1. This embodiment describes the case where the singing synthesis keyboard (operation unit 120) and the audio output unit 140 are components of the singing synthesizer 1, but the singing synthesis keyboard and the audio output unit 140 may of course instead be external devices connected to the external device I/F unit 150.

The storage unit 160 includes a nonvolatile storage unit 162 and a volatile storage unit 164. The volatile storage unit 164 consists of volatile memory such as RAM (Random Access Memory) and is used by the control unit 110 as a work area when executing various programs. The nonvolatile storage unit 162 consists of nonvolatile memory such as ROM (Read Only Memory), flash memory, or a hard disk. As shown in FIG. 1, the nonvolatile storage unit 162 stores in advance a singing synthesis library 162a, a singing synthesis program 162b, and a dummy sound library 162c.

The singing synthesis library 162a is a database storing segment data representing speech waveforms of various phonemes (monophones), diphones (transitions from one phoneme to a different phoneme, including silence), and the like. The singing synthesis library 162a may also store triphone segment data in addition to monophones and diphones, or may store the stationary parts of phonemes and the transition (transient) parts to other phonemes of speech waveforms. The singing synthesis program 162b is a program that causes the control unit 110 to perform singing synthesis using the singing synthesis library 162a. The control unit 110, operating according to the singing synthesis program 162b, executes two processes: a singing synthesis process and a dummy sound output process.

The singing synthesis process synthesizes and outputs audio data representing the waveform of the singing voice on the basis of the phoneme information and the note information. In this process, the control unit 110 reads from the singing synthesis library 162a the segment data corresponding to the phonemes or diphones represented by the phoneme information, converts the data to the frequency domain, joins the segments while applying pitch conversion so that they take the pitch indicated by the note information, and then converts the result back to the time domain, thereby synthesizing audio data representing the waveform of the singing voice. Because the singing synthesis process presupposes that both the phoneme information representing the lyrics and the note information representing the pitch are available, it is executed when both have been input. More specifically, the singing synthesis process of this embodiment is executed when, of the phoneme information and the note information, the second is input before a predetermined waiting time TW has elapsed since the first was input; the input of the second triggers the process. The waiting time TW may be set in advance to a suitable length determined by experiment, or may be set by the user according to preference. Although this embodiment adopts the segment concatenation algorithm described above as the singing synthesis algorithm, another algorithm may be adopted, in which case the singing synthesis library 162a should be configured to suit that algorithm.

The dummy sound output process causes the audio output unit 140 to output a dummy sound from the time the first of the phoneme information and the note information is input until output begins of the speech represented by the audio data synthesized by the singing synthesis process (that is, the synthesized singing voice). In this embodiment, the dummy sound output process starts when the first of the phoneme information and the note information is input. The dummy sound is a predetermined sound for which the phoneme and the pitch need not both be specified (for example, a breath sound, a noise sound, a nasal sound, a predetermined sustainable phoneme held at a predetermined pitch, or a periodic sound having a predetermined pitch). The dummy sound library 162c mentioned above is a database storing waveform data representing the waveforms of such dummy sounds. In the dummy sound output process, the control unit 110 reads the waveform data of the sound designated as the dummy sound from the dummy sound library 162c and supplies it to the audio output unit 140 to output the dummy sound. Which sound to use as the dummy sound may be fixed in advance or selected by the user. This embodiment uses a breath sound (hereinafter denoted by the phoneme symbol br) as the dummy sound. Although the singing synthesis library 162a and the dummy sound library 162c are separate databases in this embodiment, they may be combined (for example, by including the dummy sound library 162c in the singing synthesis library 162a).

In the dummy sound output process, the control unit 110 repeatedly determines whether both the phoneme information and the note information (that is, all of the synthesis information used for singing synthesis) have been received, and continues doing so until both have been received or the waiting time TW has elapsed. If not all of the synthesis information has been received by the time the waiting time TW elapses, the control unit 110 stops the dummy sound output process at that point. In other words, in the present embodiment, when one of the phoneme information and the note information is input and the other is not input before the predetermined waiting time TW has elapsed from that input, the control unit 110 aborts the dummy sound output process when TW elapses and stops outputting the dummy sound. Conversely, when the other is input before TW has elapsed, that input triggers the singing synthesis process, and the dummy sound output process continues until output of the synthesized singing voice begins. When phoneme information is input phoneme by phoneme (for example, consonant and vowel separately), the above determination may be made with reference to the input timing of the phoneme information for the first phoneme. The dummy sound output process is described below using an example in which, to have the singing synthesizer 1 synthesize the lyric “sa” sung at pitch “C4”, information indicating the consonant “s” and the vowel “a” is input as phoneme information, and information indicating the start and stop of the note at pitch “C4” is input as note information.
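The timeout rule described above can be sketched as follows. This is only an illustrative model, not the patented implementation: the names `Outcome` and `resolve_wait` are hypothetical, times are expressed in milliseconds, the second input is assumed to arrive no earlier than the first, and the handover to the synthesized voice is modeled as happening at the moment the second input arrives (any synthesis latency is ignored).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    dummy_start_ms: int      # dummy sound starts at the first input
    dummy_stop_ms: int       # moment the dummy sound is stopped
    synthesis_started: bool  # True if singing synthesis was triggered


def resolve_wait(first_ms: int, second_ms: Optional[int], tw_ms: int) -> Outcome:
    """Decide how the dummy sound ends for one note.

    first_ms  -- arrival time of the earlier of phoneme/note information
    second_ms -- arrival time of the other one, or None if it never arrives
    tw_ms     -- waiting time TW (hypothetical unit: milliseconds)
    """
    deadline = first_ms + tw_ms
    if second_ms is not None and second_ms <= deadline:
        # Both kinds of information are available in time: synthesis is
        # triggered and the dummy sound hands over to the synthesized voice.
        return Outcome(first_ms, second_ms, True)
    # TW expired first: stop the dummy sound and do not synthesize.
    return Outcome(first_ms, deadline, False)
```

With TW set to 500 ms, `resolve_wait(0, 300, 500)` reports that synthesis starts, whereas `resolve_wait(0, 800, 500)` reports that the dummy sound stops at 500 ms without synthesis.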

For example, FIG. 2(a) illustrates a case in which phoneme information indicating the consonant “s” is input at time TA1, phoneme information indicating the vowel “a” is input at time TA2, and note information indicating the start of the note at pitch “C4” is input at time TA3 (where TA3 - TA1 > TW). In this case, no note information is input before the waiting time TW has elapsed from the input timing of the earliest synthesis information (that is, the phoneme information for the first phoneme), so output of the dummy sound stops when TW has elapsed from the input of the phoneme information indicating the consonant “s”, and the singing synthesis process is not executed. In contrast, as shown in FIG. 2(b), when phoneme information indicating the consonant “s” is input at time TA1, phoneme information indicating the vowel “a” is input at time TA2, and note information indicating the start of the note at pitch “C4” is input at time TA3′ (where TA3′ - TA1 ≤ TW), the dummy sound continues to be output until output of the synthesized singing voice begins. The example in FIG. 2(b) uses, as the dummy sound, a sound composed of several phoneme pieces ([#-br], [br], and [br-#]) that transitions from silence (denoted by # in FIG. 2(b)) to a breath sound and back to silence, so that output of the synthesized singing voice starts from a silent state. This is so that the dummy sound and the synthesized singing voice connect smoothly. If the smoothness of the join between the dummy sound and the synthesized singing voice is not a concern, however, a sound composed of a single phoneme piece (for example, [#-br]) may be used as the dummy sound. Alternatively, the dummy sound may transition directly into the synthesized singing voice so that the two connect even more smoothly, for example by omitting [br-#] in FIG. 2(b) and using [br-s] in place of [#-s], or by using [br-s] in place of [br-#] and omitting [#-s].
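The two ways of joining the dummy sound to the singing voice described for FIG. 2(b) can be written out as phoneme-piece lists. The sketch below is a minimal illustration; the function name `onset_pieces` and its `direct` flag are hypothetical, and only the pieces named in the text are produced (pieces that follow the first sung phoneme are outside its scope).

```python
from typing import List


def onset_pieces(first_phoneme: str, direct: bool = False) -> List[str]:
    """Return the phoneme pieces from silence up to the first sung phoneme.

    '#' denotes silence and 'br' the breath sound, as in FIG. 2(b).
    """
    if direct:
        # The breath sound leads straight into the first phoneme.
        return ["#-br", "br", f"br-{first_phoneme}"]
    # The breath sound returns to silence before the singing voice starts.
    return ["#-br", "br", "br-#", f"#-{first_phoneme}"]
```

For the lyric “sa”, the default call yields the sequence shown in FIG. 2(b), and `direct=True` yields the variant in which [br-s] replaces the pair [br-#], [#-s].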

FIGS. 2(c) and 2(d) show the operation when the note information is input first. In this case too, if the phoneme information for the first phoneme is not input before the waiting time TW has elapsed from the input timing of the earliest synthesis information (the note information indicating the start of the note at pitch “C4”), output of the dummy sound stops when TW has elapsed from the input of the note information, and the singing synthesis process is not executed (see FIG. 2(c)). Conversely, as shown in FIG. 2(d), if the phoneme information for the first phoneme is input before TW has elapsed from the input of the note information indicating the start of the note at pitch “C4”, output of the dummy sound continues until output of the synthesized singing voice begins, as in FIG. 2(b). When the note information is input before the phoneme information, the duration of the transition from silence to the breath sound in the dummy sound (that is, [#-br]) may be adjusted according to the velocity indicated by that note information.

As described above, the singing synthesizer 1 of the present embodiment is no different from conventional real-time singing synthesis technology in that singing synthesis starts once both the note information representing a note of the melody and the phoneme information representing the phonemes of the lyrics sung on that note are available. However, a dummy sound is output from the time the earlier of the phoneme information and the note information (that is, the earliest of the multiple types of synthesis information used for singing synthesis) is input until output of the synthesized voice begins. Sound is therefore output from the moment the user signals the intent to synthesize singing, and the user is spared needless discomfort. In addition, according to the present embodiment, either the phoneme information or the note information may be input first; there is no restriction on the input order.

The present embodiment has been described for the case in which the singing synthesis process is not executed when not all of the synthesis information has been received before the waiting time TW has elapsed from the input timing of the earliest synthesis information. Alternatively, output of the dummy sound may be stopped when TW elapses while the apparatus continues to wait for the subsequent synthesis information regardless of TW, and the singing synthesis process may be executed once all of the synthesis information has been received. To produce a more natural impression, the control unit 110 may also perform volume control in the dummy sound output process so that the volume gradually increases (that is, gradually raise the signal level of the dummy sound signal to a predetermined value). Furthermore, instead of smoothing the join between the dummy sound and the synthesized singing voice through the way the dummy sound is generated, the control unit 110 may adjust the signal levels of the dummy sound signal and the synthesized voice signal so that the dummy sound and the synthesized singing voice crossfade. Specifically, the output end time of the phoneme piece [br-#] in FIG. 2(b) is made later than the output start time of the phoneme piece [#-s], while the signal level of the former is gradually lowered to silence and that of the latter is gradually raised from silence.
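The crossfade between the end of the dummy sound and the start of the synthesized voice can be illustrated with a few lines of sample arithmetic. This is a sketch under stated assumptions: the text only requires that one level fall gradually to silence while the other rises from silence, so the linear gain ramp, the function name `crossfade`, and the representation of signals as lists of float samples are all illustrative choices.

```python
from typing import List


def crossfade(dummy: List[float], synth: List[float], overlap: int) -> List[float]:
    """Overlap the tail of `dummy` with the head of `synth` for `overlap` samples.

    Within the overlap the dummy gain falls linearly toward 0 while the
    synthesized-voice gain rises linearly toward 1 (assumed ramp shape).
    """
    if overlap <= 0 or overlap > min(len(dummy), len(synth)):
        raise ValueError("overlap must be between 1 and the shorter signal length")
    start = len(dummy) - overlap
    mixed = []
    for i in range(overlap):
        rise = (i + 1) / (overlap + 1)  # gain applied to the synthesized voice
        mixed.append(dummy[start + i] * (1.0 - rise) + synth[i] * rise)
    # Untouched start of the dummy sound, the mixed region, then the rest of the voice.
    return dummy[:start] + mixed + synth[overlap:]
```

The result is `overlap` samples shorter than the two signals laid end to end, which corresponds to the dummy sound ending later than the synthesized voice begins.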

In addition, the operation unit 120 may be provided with a control that lets the user instruct the apparatus to stop the dummy sound. When this control is operated, the operation unit 120 outputs a predetermined control signal, and if the control unit 110 receives that control signal during the waiting time TW, it may abort the dummy sound output process at that point. With this arrangement, even if phoneme information or note information is entered by mistake (for example, through an accidental touch on the operation unit 120) and output of the dummy sound begins, the user can stop the dummy sound by operating the control. This avoids the dummy sound continuing until TW has elapsed from the erroneous input.

In the present embodiment, regardless of the input order of the phoneme information and the note information, the control unit 110 executes the dummy sound output process when the first of the two is input and executes the singing synthesis process when the other is input. Alternatively, the singing synthesizer may be given two operation modes, a singing synthesis performance mode in which singing voice is synthesized and an instrument sound performance mode in which an instrument sound at the pitch indicated by the note information is output without singing synthesis, and the operation mode may be switched according to which of the phoneme information and the note information is input first. For example, while operating in the singing synthesis performance mode, the control unit 110 executes the dummy sound output process if phoneme information is input first, and switches the operation mode to the instrument sound performance mode immediately (or when no phoneme information is input before the waiting time TW has elapsed) if note information is input first. Similarly, in the instrument sound performance mode, the control unit 110 switches the operation mode to the singing synthesis performance mode immediately (or when no note information is input before TW has elapsed) if phoneme information is input first, and executes the dummy sound output process if note information is input first. This arrangement has the effect of letting the user switch seamlessly between singing synthesis performance and instrument sound performance through operations on the operation unit 120.
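The mode-switching variant above amounts to a small decision table keyed on the current mode and on which kind of information arrives first. The sketch below follows the text literally and covers only the "immediate" switching case; the constant names, the function name `on_first_input`, and the action labels are hypothetical.

```python
from typing import Tuple

SINGING = "singing_synthesis"    # singing synthesis performance mode
INSTRUMENT = "instrument_sound"  # instrument sound performance mode


def on_first_input(mode: str, kind: str) -> Tuple[str, str]:
    """Return (new_mode, action) for the first input of a note.

    kind is "phoneme" or "note". The TW-delayed variant of the switch
    mentioned in the text is not modeled here.
    """
    if mode not in (SINGING, INSTRUMENT):
        raise ValueError(f"unknown mode: {mode}")
    if kind not in ("phoneme", "note"):
        raise ValueError(f"unknown input kind: {kind}")
    if mode == SINGING:
        if kind == "phoneme":
            return SINGING, "dummy_sound"
        return INSTRUMENT, "switch_mode"
    # Instrument sound performance mode.
    if kind == "phoneme":
        return SINGING, "switch_mode"
    return INSTRUMENT, "dummy_sound"
```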

(B: Second embodiment)
The first embodiment described the case in which the dummy sound is a continuable sound unrelated to the phoneme information and note information input by the user, such as a noise sound, a breath sound, a nasal sound, or a predetermined continuable phoneme output at a predetermined pitch. The present embodiment differs from the first embodiment in that it outputs a dummy sound that depends on the earliest synthesis information (that is, whichever of the phoneme information and the note information is input first). The hardware configuration of the singing synthesizer of the present embodiment is the same as in the first embodiment, so a detailed description is omitted (the same applies to the third and fourth embodiments). The dummy sound output process of the present embodiment is described below separately for the case in which phoneme information is input first and the case in which note information is input first.

(B-1: When phoneme information is received first)
In this case, the control unit 110 determines whether the first phoneme indicated by the phoneme information is a continuable phoneme. If the result is “Yes”, it outputs, as the dummy sound, a sound that sustains that first phoneme at a predetermined pitch. If the result is “No”, the control unit 110 outputs as the dummy sound, as in the first embodiment, a noise sound, a breath sound, a nasal sound, a sound that sustains a predetermined continuable phoneme at a predetermined pitch, or a periodic sound having a predetermined pitch.

(B-2: When note information is received first)
In this case, the control unit 110 outputs, as the dummy sound, a periodic sound having the pitch indicated by the note information (or a sound that sustains a predetermined continuable phoneme at that pitch).

To summarize, the dummy sounds output according to the earliest synthesis information (whichever of the phoneme information and the note information is input first) can be categorized as shown in FIG. 3. In the present embodiment too, even if the input timings of the multiple types of synthesis information used for singing synthesis differ, sound is output without delay from the input of the earliest synthesis information, so the user does not feel discomfort. The length of the waiting time TW may differ for each category shown in FIG. 3, and the user may be allowed to set the length of TW for each phoneme type. Moreover, because the present embodiment outputs a dummy sound that depends on the earliest synthesis information, the transition from the dummy sound to the synthesized singing voice is expected to be smoother, and the result more natural, than when a sound unrelated to that information is used as the dummy sound.
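The selection rules of the second embodiment (cases B-1 and B-2) can be sketched as a single function. FIG. 3 is not reproduced in this text, so the sketch reflects only the rules stated above. The function name `select_dummy`, the contents of the `CONTINUABLE` set, the use of "A3" as the predetermined pitch, and the choice of the breath sound as the fallback are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# Assumed set of continuable phonemes; the real classification is the one in FIG. 3.
CONTINUABLE = {"a", "i", "u", "e", "o", "s"}
# Hypothetical stand-in for the "predetermined pitch".
DEFAULT_PITCH = "A3"


def select_dummy(kind: str, value: str) -> Tuple[str, Optional[str]]:
    """Return (sound, pitch) for the dummy sound.

    kind "phoneme": value is the first phoneme of the phoneme information.
    kind "note":    value is the pitch indicated by the note information.
    """
    if kind == "note":
        # B-2: periodic sound at the pitch indicated by the note information.
        return "periodic", value
    if kind == "phoneme":
        if value in CONTINUABLE:
            # B-1, "Yes": sustain the first phoneme at the predetermined pitch.
            return value, DEFAULT_PITCH
        # B-1, "No": fall back to an unpitched sound such as the breath sound.
        return "br", None
    raise ValueError(f"unknown input kind: {kind}")
```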

(C: Third embodiment)
The first and second embodiments described the case in which a dummy sound is output from the input of the earliest synthesis information until output of the synthesized singing voice begins. The present embodiment is characterized in that the dummy sound output during that period is switched in sequence. Specifically, as shown in FIG. 4(a), the control unit 110 of the singing synthesizer of the present embodiment starts outputting a dummy sound D1 when the earlier of the phoneme information and the note information is input. When the other is input, it stops outputting the dummy sound D1, starts outputting a dummy sound D2 different from D1, and continues outputting D2 until output of the synthesized singing voice begins.

FIG. 4(a) illustrates the case in which phoneme information is input first. In this case, one possible arrangement is to output a sound that depends on the phoneme information as the dummy sound D1 and then, when the note information is input, to stop outputting D1 and output a sound that depends on that note information as the dummy sound D2. Alternatively, the dummy sound D2 may be the dummy sound D1 pitch-shifted to the pitch indicated by the note information. In such arrangements, where the dummy sound is switched when the later of the phoneme information and the note information is input, the dummy sounds D1 and D2 may be crossfaded, and the dummy sound D2 and the synthesized voice may be crossfaded as well.

If “t” is entered by mistake where “sa” should have been entered as phoneme information, and “sa” is then entered, the control unit 110 may, as shown in FIG. 4(b), output a dummy sound D10 when “t” is input, a dummy sound D20 when “s” is input, and a dummy sound D30 when the note information is input. In this case, following the categories shown in FIG. 3, “t” is not continuable, so a breath sound or the like is used as the dummy sound D10; “s” is continuable, so that phoneme (that is, “s”) output at a predetermined pitch is used as the dummy sound D20; and “s” output at the pitch indicated by the note information is used as the dummy sound D30.
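The D10, D20, D30 example of FIG. 4(b) can be traced with a short event loop. This is a minimal sketch, assuming that every event in the list is one that triggers a dummy-sound switch (the vowel “a” of “sa” is therefore left out of the example input). The function name `dummy_sequence`, the `CONTINUABLE` set, and the predetermined pitch "A3" are hypothetical.

```python
from typing import List, Optional, Tuple

CONTINUABLE = {"a", "i", "u", "e", "o", "s"}  # assumed; "t" is not continuable
DEFAULT_PITCH = "A3"                          # hypothetical predetermined pitch

Sound = Tuple[str, Optional[str]]  # (sound, pitch)


def dummy_sequence(events: List[Tuple[str, str]]) -> List[Sound]:
    """Return the dummy sounds output in order for a list of (kind, value) events."""
    sounds: List[Sound] = []
    phoneme: Optional[str] = None  # most recently entered phoneme
    for kind, value in events:
        if kind == "phoneme":
            phoneme = value
            if value in CONTINUABLE:
                sounds.append((value, DEFAULT_PITCH))
            else:
                sounds.append(("br", None))  # breath sound for a non-continuable phoneme
        elif kind == "note":
            if phoneme in CONTINUABLE:
                sounds.append((phoneme, value))  # same phoneme at the indicated pitch
            else:
                sounds.append(("periodic", value))
        else:
            raise ValueError(f"unknown input kind: {kind}")
    return sounds
```

Feeding it `[("phoneme", "t"), ("phoneme", "s"), ("note", "C4")]` gives a breath sound, then “s” at the predetermined pitch, then “s” at C4, matching D10, D20, and D30 in turn.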

In the present embodiment too, even if the input timings of the multiple types of synthesis information used for singing synthesis differ, sound is output without delay from the input of the earliest synthesis information, and the user is spared needless discomfort. Furthermore, if a sound that depends on the earlier of the note information and the phoneme information is used as the dummy sound D1 and a sound that depends on both the note information and the phoneme information is used as the dummy sound D2, the dummy sound and the synthesized singing voice are expected to connect even more smoothly than in the second embodiment, giving a still more natural result. Although the present embodiment has been described for the case in which the dummy sound is switched when another type of synthesis information is input, the dummy sound may instead be switched to D2 when a predetermined time has elapsed since output of D1 began. The dummy sound may of course also be switched more than once between the input of the earliest synthesis information and the start of output of the synthesized singing voice. In short, any arrangement suffices in which the dummy sound signal is switched so that multiple kinds of dummy sounds are output in sequence from the input of the earliest of the multiple types of synthesis information until output of the synthesized singing voice begins.

(D: Fourth embodiment)
In the first to third embodiments, the dummy sound output process starts when the earliest synthesis information (whichever of the note information and the phoneme information is input first) is input. The present embodiment differs in that, as shown in FIG. 5, output of the dummy sound starts when a predetermined time TM has elapsed from the input of the earliest synthesis information. In the present embodiment, if the other of the note information and the phoneme information is input before TM has elapsed and the first phoneme indicated by the phoneme information is a continuable phoneme, a sound that depends on both the note information and the phoneme information (for example, a sound that sustains that phoneme at the pitch indicated by the note information) is output as the dummy sound. Otherwise, a sound that depends on whichever information was input first, or a breath sound or the like, is output as the dummy sound. With this arrangement too, the dummy sound is output ahead of the synthesized singing voice, which reduces the user's discomfort. Because individuals differ in how much delay they can tolerate between an input operation and the actual start of the synthesized singing voice, it is preferable that the length of the time TM be adjustable to suit the user.

(E: Modifications)
Embodiments of the present invention have been described above; the following modifications may of course be applied to them.
(1) In the above embodiments, phoneme information and note information (prosodic information) were described as specific examples of the multiple types of synthesis information used for singing synthesis, but velocity and note control information may also be used, in addition to the phoneme information and the note information (prosodic information), to control output of the dummy sound signal. Velocity is information indicating the intensity of a sound; in MIDI, it forms the note information together with pitch information indicating the pitch. One conceivable use of velocity is to control the initial volume of the dummy sound according to the velocity contained in the note information for the preceding note (the higher the velocity, the louder the volume). Examples of note control information include vibrato and attack or release given as control data. When note control information indicating vibrato is given, vibrato may be applied to the dummy sound, and the volume at the onset of the dummy sound may be varied according to the magnitude of the attack given as control data.
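The velocity-based volume control in modification (1) only requires that a higher velocity give a louder onset. The sketch below uses a plain linear mapping from the 7-bit MIDI velocity range (0 to 127) to a gain between 0.0 and 1.0; the linear curve and the function name `initial_dummy_gain` are assumptions, and any monotonically increasing curve would satisfy the description equally well.

```python
def initial_dummy_gain(prev_velocity: int) -> float:
    """Map the previous note's MIDI velocity (0..127) to an initial gain in 0.0..1.0.

    A linear mapping is assumed; the text only states that a higher
    velocity should produce a louder initial volume.
    """
    if not 0 <= prev_velocity <= 127:
        raise ValueError("MIDI velocity must be in the range 0..127")
    return prev_velocity / 127
```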

When velocity is used as synthesis information, the velocity constitutes the note information together with the pitch information, so acquiring the note information, which serves as the prosodic information, also yields the velocity. Information indicating that vibrato should be applied, however, is not necessarily acquired together with any of the essential synthesis information. At the same time, it is undesirable for the dummy sound to continue even though all essential synthesis information is available. Accordingly, when information that is not strictly essential, such as information indicating that vibrato should be applied, is used as synthesis information in addition to the phoneme information and the prosodic information, the singing voice may be synthesized using whatever synthesis information has been input between the input of the earliest of the multiple types of synthesis information and the point at which the essential synthesis information is complete. For example, suppose the earliest synthesis information is phoneme information. If note information is input next, synthesis of the singing voice starts at that point. If information indicating that vibrato should be applied is input next, the apparatus waits for the note information and then synthesizes a singing voice with vibrato applied. If input of the phoneme information triggers output of a dummy sound that outputs the first phoneme indicated by that phoneme information at a predetermined pitch, input of the vibrato information may in turn trigger applying vibrato, centered on that pitch, to the dummy sound. If the earliest synthesis information is neither phoneme information nor prosodic information, a predetermined sound such as a noise sound or a breath sound may be output as the dummy sound, as in the first embodiment.
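The handling of essential and non-essential synthesis information can be sketched as follows: synthesis starts at the event that completes the essential set, and only the optional information received up to that point is applied. The function name `plan_synthesis`, the string labels for the information kinds, and the representation of the input as an ordered list are illustrative assumptions.

```python
from typing import List, Optional, Set, Tuple

ESSENTIAL = {"phoneme", "note"}  # information without which synthesis cannot start


def plan_synthesis(events: List[str]) -> Tuple[Optional[int], Set[str]]:
    """Scan input kinds in arrival order.

    Returns (index of the event that completes the essential set, or None if
    it never completes; optional kinds received up to that point).
    """
    seen: Set[str] = set()
    optional: Set[str] = set()
    for index, kind in enumerate(events):
        if kind in ESSENTIAL:
            seen.add(kind)
        else:
            optional.add(kind)  # e.g. "vibrato"
        if seen == ESSENTIAL:
            # Synthesis starts here; later optional input is not waited for.
            return index, optional
    return None, optional
```

With the order phoneme, vibrato, note, synthesis starts at the note and vibrato is applied; with the order phoneme, note, vibrato, synthesis starts at the note without vibrato.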

(2) In the above embodiments, the operation unit 120 for inputting the multiple types of synthesis information used for singing synthesis and the audio output unit 140 for outputting the synthesized singing voice are built into the singing synthesizer 1. However, either or both of the operation unit 120 and the audio output unit 140 may instead be connected to the external device I/F unit 150 of the singing synthesizer 1. As one example of connecting both the operation unit 120 and the audio output unit 140 to the external device I/F unit 150, an Ethernet (registered trademark) interface is used as the external device I/F unit 150, a telecommunication line such as a LAN (Local Area Network) or the Internet is connected to the external device I/F unit 150, and the operation unit 120 and the audio output unit 140 are connected to that telecommunication line. This arrangement makes it possible to provide a singing synthesis service in the so-called cloud computing style. Specifically, the phoneme information and note information input by operating the operation unit 120 are sent to the singing synthesizer over the telecommunication line, and the singing synthesizer executes the singing synthesis process based on the phoneme information and note information thus received. The audio data of the synthesized singing voice produced by the singing synthesizer is then sent over the telecommunication line to the audio output unit 140, which outputs sound according to that audio data.

(3) In the above embodiments, a singing synthesis keyboard is used as the input means (operation unit 120) for inputting the multiple types of synthesis information to the singing synthesizer. However, the input means may instead be a combination of a general-purpose keyboard, with a numeric keypad, cursor keys, and keys for the letters of the alphabet, and a so-called MIDI keyboard. In that case, the MIDI keyboard serves as the note information input unit and the general-purpose keyboard serves as the phoneme information input unit. The note information input unit or the phoneme information input unit may also be realized by a combination of a GUI and a pointing device such as a mouse. When the note information input unit is realized in that way, the input means can be realized by combining it with a general-purpose keyboard serving as the phoneme information input unit. When the phoneme information input unit is realized in that way, the input means can be realized by combining it with a MIDI keyboard serving as the note information input unit.

(4) In each of the above embodiments, the singing synthesis program 162b, which causes the control unit 110 to execute the singing synthesis process and the dummy sound output process, is stored in advance in the nonvolatile storage unit 162 of the singing synthesis apparatus 1. However, the singing synthesis program 162b may be distributed on a computer-readable recording medium such as a CD-ROM, or by download via a telecommunication line such as the Internet. Causing a general-purpose computer such as a personal computer to execute a program distributed in this way allows that computer to function as the singing synthesis apparatus 1 of the above embodiments. The present invention may of course also be applied to the game program of a game that includes real-time singing synthesis as part of its processing. Specifically, the singing synthesis program contained in the game program may be replaced with the singing synthesis program 162b. In a game as well, it remains preferable to keep the time difference between the input timing of the earliest synthesis information and the output timing of the synthesized voice small.

(5) The above embodiments describe examples in which the present invention is applied to a real-time singing synthesis apparatus. However, the application of the present invention is not limited to real-time singing synthesis apparatuses. For example, the present invention may be applied to a speech synthesizer that synthesizes guidance voices for voice guidance in real time, or to a speech synthesizer that synthesizes, in real time, voices reading aloud literary works such as novels and poems. In these speech synthesizers as well, the speech synthesis process is triggered when the phoneme information representing the utterance content and the prosody information indicating the manner of utterance have both become available, just as in the singing synthesis apparatuses of the above embodiments. The present invention may also be applied to a toy having a singing synthesis function or a speech synthesis function (a toy incorporating a singing synthesis apparatus or a speech synthesizer).

DESCRIPTION OF SYMBOLS: 1 ... singing synthesis apparatus, 110 ... control unit, 120 ... operation unit, 130 ... display unit, 140 ... voice output unit, 142 ... D/A converter, 144 ... amplifier, 146 ... speaker, 150 ... external device I/F, 160 ... storage unit, 162 ... nonvolatile storage unit, 162a ... singing synthesis library, 162b ... singing synthesis program, 162c ... dummy sound library, 164 ... volatile storage unit, 170 ... bus.

Claims

A speech synthesizer comprising:
input means for inputting a plurality of types of synthesis information used for synthesis of a speech signal, including phoneme information indicating the phonemes of the speech to be synthesized and prosody information indicating prosodic change in that speech; and
speech synthesis means for, after the earliest of the plurality of types of synthesis information is input to the input means, waiting until at least the phoneme information and the prosody information are both available, then synthesizing and outputting the speech signal using the synthesis information input through the input means, while outputting a dummy sound signal representing a dummy sound from the time the earliest synthesis information is input until output of the speech signal is started.

The speech synthesizer according to claim 1, wherein the speech synthesis means adjusts the signal level of the dummy sound signal, of the speech signal, or of both before output, so that the dummy sound and the speech represented by the speech signal synthesized using the synthesis information input to the input means are smoothly connected.

3. The speech synthesizer according to claim 1, wherein the speech synthesis means outputs the dummy sound signal while adjusting its signal level so that the volume of the dummy sound gradually increases.

The speech synthesizer according to claim 1, wherein the speech synthesis means switches the dummy sound signal so that a plurality of types of dummy sounds are output in sequence during the period from when the earliest of the plurality of types of synthesis information is input until output of the speech signal synthesized using the synthesis information input to the input means is started.

The speech synthesizer according to claim 1, wherein the speech synthesis means stops outputting the dummy sound signal when at least the phoneme information and the prosody information have not both become available before a predetermined waiting time elapses after the earliest of the plurality of types of synthesis information is input.
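The behaviour recited in the claims can be pictured as a small state machine. The following Python sketch is illustrative only and is not taken from the specification: the class `DummySoundController`, the state names, the linear fade-in, and the parameters `wait_limit` and `fade_time` (both in seconds) are assumptions introduced for the example. It covers the dummy sound output between the earliest input and the start of the speech signal, a gradually increasing dummy sound volume, and the stop after a predetermined waiting time.

```python
from typing import Optional


class DummySoundController:
    """Decides what to output while synthesis information is still arriving.

    Hypothetical sketch: names and the linear fade-in are assumptions.
    """

    def __init__(self, wait_limit: float, fade_time: float) -> None:
        self.wait_limit = wait_limit    # predetermined waiting time (assumed unit: s)
        self.fade_time = fade_time      # time for the dummy sound to reach full level
        self.first_input_at: Optional[float] = None
        self.have_phoneme = False
        self.have_prosody = False

    def on_input(self, kind: str, now: float) -> None:
        # Remember when the earliest piece of synthesis information arrived.
        if self.first_input_at is None:
            self.first_input_at = now
        if kind == "phoneme":
            self.have_phoneme = True
        elif kind == "prosody":
            self.have_prosody = True

    def state(self, now: float) -> str:
        if self.first_input_at is None:
            return "idle"       # nothing has been input yet
        if self.have_phoneme and self.have_prosody:
            return "voice"      # both available: output the synthesized speech
        if now - self.first_input_at >= self.wait_limit:
            return "stopped"    # waiting time elapsed: stop the dummy sound
        return "dummy"          # fill the gap with the dummy sound

    def dummy_gain(self, now: float) -> float:
        # Linear fade-in from 0.0 to 1.0 so the dummy sound grows gradually.
        if self.state(now) != "dummy":
            return 0.0
        elapsed = now - self.first_input_at
        return min(1.0, elapsed / self.fade_time)
```

For example, with `wait_limit=1.0` and `fade_time=0.5`, a phoneme input at time 0.0 puts the controller in the `"dummy"` state with a gain that rises to full level by 0.5 s; if prosody information arrives before 1.0 s the state becomes `"voice"`, and otherwise it becomes `"stopped"`.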