JP6167503B2 - Speech synthesizer

Info

Publication number: JP6167503B2
Application number: JP2012250441A
Authority: JP
Inventors: 嘉山　啓
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2012-11-14
Filing date: 2012-11-14
Publication date: 2017-07-26
Anticipated expiration: 2032-11-14
Also published as: JP2014098802A

Description

The present invention relates to speech synthesis technology, and more particularly to real-time speech synthesis technology.

Speech synthesis techniques have become widespread that use electrical signal processing and a plurality of types of synthesis information to synthesize a speech signal representing guidance speech in voice guidance, read-aloud speech of a literary work, or the singing voice of a song. In singing synthesis, for example, music expression information is used as the synthesis information: prosodic information indicating the prosodic change in the song to be synthesized (for example, note information representing the pitch and duration of each note in the song's melody) and information representing the phoneme sequence of the song's lyrics. When synthesizing guidance speech or read-aloud speech, the synthesis information consists of information representing the phoneme sequence of the guidance text or literary text and prosodic information indicating prosodic changes such as intonation and accent. Conventionally, this kind of speech synthesis has generally used a so-called batch processing method, in which all the synthesis information covering the entire speech to be synthesized is input to the speech synthesizer in advance and a speech signal representing the waveform of the entire speech is generated at once from that information. In recent years, however, real-time speech synthesis techniques have also been proposed (see, for example, Patent Document 1).

In one example of real-time speech synthesis, information indicating the phoneme sequence of the lyrics of an entire song is input to a singing synthesizer in advance, and note information representing the pitch and duration at which the lyrics are voiced is then entered note by note on a keyboard modeled on a piano keyboard, so that the singing voice is synthesized one note at a time. More recently, a singing synthesis keyboard has also been proposed in which a phoneme information input section and a note information input section are arranged side by side. The phoneme information input section is an array of operators for entering the individual phonemes (consonants and vowels) that make up the phoneme sequence of the lyrics, and the note information input section is modeled on a piano keyboard. With this keyboard the user enters, in real time and for each note, the note information and the phoneme sequence information indicating the phoneme sequence of the lyric to be voiced on that note, and the singing voice is synthesized note by note.

Patent Document 1: Japanese Patent No. 3879402

Some electronic keyboard instruments, such as electronic pianos, let the player specify the intensity (velocity) of each note through the key-press speed, and such instruments allow richly expressive performance. Some singing synthesis keyboards also allow velocity to be specified by key-press speed, but for a singing voice, merely changing the intensity of each note often fails to yield sufficient expressiveness. The same holds for real-time synthesis of guidance speech and read-aloud speech.

The present invention has been made in view of the above problem, and its object is to provide a technique that makes it possible to synthesize, in real time, speech that is more expressive than before.

To solve the above problem, the present invention provides a speech synthesizer comprising acquisition means and speech synthesis means. The acquisition means acquires a plurality of types of synthesis information including phoneme sequence information indicating the phoneme sequence of the speech to be synthesized, prosodic information indicating the prosodic change of that speech, and phoneme control information indicating that some of the phonemes in the phoneme sequence indicated by the phoneme sequence information are to be changed; at least one of the phoneme sequence information, the prosodic information, and the phoneme control information is acquired through operation of an operator. The speech synthesis means performs speech synthesis while changing some of the phonemes of the phoneme sequence indicated by the phoneme sequence information included in the acquired synthesis information, in accordance with the phoneme control information included in that synthesis information.

In the speech synthesizer of the present invention, speech synthesis by the speech synthesis means is triggered when the acquisition means acquires the plurality of types of synthesis information. Because at least one of the phoneme sequence information, the prosodic information, and the phoneme control information is acquired through operation of an operator, the synthesizer performs real-time speech synthesis. The speech synthesis means performs synthesis while changing some of the phonemes of the phoneme sequence indicated by the phoneme sequence information in accordance with the phoneme control information. When the synthesis target is a Japanese singing voice and the phoneme sequence represented by the phoneme sequence information is "consonant + vowel", specific examples of processing that changes part of the phoneme sequence include changing (for example, shortening) the duration of the consonant, dropping the consonant, replacing the consonant with another consonant that sounds similar, and repeating the consonant (that is, inserting one or more consonants before the vowel). When singing synthesis is performed with part of the phoneme sequence changed in these ways, the consonant becomes harder to hear in the synthesized singing voice. Human singing and read-aloud voices tend to have consonants that are harder to catch the more rapidly they are spoken. The present invention can therefore reproduce this tendency and synthesize, in real time, speech that is more expressive than before. When the synthesis target is a singing voice in a language other than Japanese, such as English, a consonant may fall at the end of the lyric sung on a note. In that case the same processing may be applied to the final consonant: changing its duration, dropping it, replacing it with a similar-sounding consonant, or repeating it (for example, inserting one or more copies of the consonant between it and the preceding phoneme).

In more preferable aspects, the speech synthesis means may execute processing such as changing the consonant duration at a frequency that depends on the phoneme control information, or the amount by which the consonant duration is adjusted, the number of consonants dropped or inserted, or the number of consonants replaced may be a variable value that depends on the phoneme control information. These aspects allow fine control over how easily consonants can be heard in the synthesized singing voice and further improve its expressiveness.

When the present invention is applied to a singing synthesizer that synthesizes a singing voice from a plurality of types of synthesis information, a singing synthesis keyboard may be used as the input device through which the user enters phoneme sequence information and prosodic information. In this case, the series of note information entered via the singing synthesis keyboard serves as the prosodic information. To allow phoneme control information to be entered from the singing synthesis keyboard, a dedicated operator for that purpose may be provided in either the phoneme information input section or the note information input section, or an operator used for entering note information may double as the input for phoneme control information. Specifically, the velocity corresponding to the key-press speed when a pitch is specified may serve as the phoneme control information.

In another preferred aspect, the plurality of types of synthesis information includes prosody control information instructing that speech be synthesized with a deviation from the ideal prosodic change, and the speech synthesis means performs synthesis while adjusting the prosodic change in accordance with the prosody control information so that the deviation arises. Specific ways of producing a deviation from the ideal prosodic change include changing the depth of the attack (the transient prosodic change from consonant to vowel), changing the duration of the attack, and omitting the attack. Other specific examples of adjusting the prosodic change include changing the depth of the undershoot (the prosodic change from silence to consonant), the height of the overshoot (the prosodic change from consonant to vowel), or both; changing the duration of the undershoot, the overshoot, or both; and omitting the undershoot, the overshoot, or both. Adjusting the prosodic change in this way also makes it possible to adjust how easily consonants can be heard in the synthesized speech.

A diagram showing a configuration example of the singing synthesizer 1 according to the first embodiment of the present invention.
A diagram for explaining the operation of the singing synthesizer 1.
A diagram for explaining how part of the phoneme sequence is changed by adjusting the duration of a consonant.
A diagram for explaining how part of the phoneme sequence is changed by dropping a consonant.
A diagram for explaining how part of the phoneme sequence is changed by replacing a consonant.
A diagram for explaining how part of the phoneme sequence is changed by inserting a consonant.
A diagram for explaining how a deviation from the ideal prosodic change is produced by adjusting the depth of the attack.
A diagram for explaining how a deviation from the ideal prosodic change is produced by adjusting the duration of the attack.
A diagram for explaining how a deviation from the ideal prosodic change is produced by omitting the attack.
A diagram for explaining how a deviation from the ideal prosodic change is produced by adjusting the depth of the undershoot and the height of the overshoot.
A diagram for explaining how a deviation from the ideal prosodic change is produced by adjusting the durations of the undershoot and the overshoot.
A diagram for explaining how a deviation from the ideal prosodic change is produced by omitting the undershoot and the overshoot.

Embodiments of the present invention are described below with reference to the drawings.
(A: First Embodiment)
FIG. 1 is a block diagram showing a configuration example of a singing synthesizer 1, which is one embodiment of the speech synthesizer of the present invention. The singing synthesizer 1 has the user enter a plurality of types of synthesis information, such as phoneme sequence information and prosodic information, and uses that information to perform real-time singing synthesis. As shown in FIG. 1, the singing synthesizer 1 includes a control unit 110, an operation unit 120, a display unit 130, an audio output unit 140, an external device interface (hereinafter abbreviated as "I/F") unit 150, a storage unit 160, and a bus 170 that mediates data exchange among these components.

The control unit 110 is, for example, a CPU (Central Processing Unit). By operating in accordance with the singing synthesis program stored in the storage unit 160, the control unit 110 functions as speech synthesis means that synthesizes a singing voice based on the plurality of types of synthesis information. The processing that the control unit 110 executes under this program is detailed later. Although this embodiment uses a CPU as the control unit 110, a DSP (Digital Signal Processor) may of course be used instead.

The operation unit 120 is the singing synthesis keyboard described above and has a phoneme information input section and a note information input section. By operating the operation unit 120, the user of the singing synthesizer 1 can specify the notes that make up the melody of the song to be synthesized and the phoneme sequence of the lyric to be voiced on each note. For example, to specify the lyric "sa" (さ), the user presses, in order, the operator for the consonant "s" and the operator for the vowel "a" among the operators provided in the phoneme information input section. To specify "C4" as the pitch of the note for that lyric, the user presses the corresponding key among the operators (keys) of the note information input section to start the sound and releases the key to end it. The length of time the key is held down thus becomes the duration of the note. Through the key-press speed, the user can also specify the velocity with which the lyric is voiced on that note. The mechanism that allows note information including a velocity to be entered by key presses may be the same as that used in conventional electronic keyboard instruments.

When an operation specifying a phoneme sequence is performed, the operation unit 120 supplies phoneme sequence information indicating that sequence to the control unit 110. When a key is pressed to start a sound, the operation unit 120 supplies the control unit 110 with a note-on event (a MIDI (Musical Instrument Digital Interface) event) for the pressed key as note information instructing the start of the sound. The note-on event contains information indicating the pitch of the pressed key and information indicating the velocity corresponding to the key-press speed (an integer from 1 to 127). When the key is released, the operation unit 120 supplies the control unit 110 with a note-off event (MIDI event) for that key as note information instructing the end of the sound. The note information entered through the operators of the note information input section thus serves as the prosodic information described above.
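As a rough illustration of the note information described above, the following Python sketch decodes a raw three-byte MIDI channel message into a note-on or note-off record. The names `NoteEvent` and `parse_midi_message` are hypothetical and do not come from the patent. The status-byte values (0x9n for note-on, 0x8n for note-off) and the convention that a note-on with velocity 0 is treated as a note-off follow general MIDI practice rather than anything stated in this document.

```python
from dataclasses import dataclass


@dataclass
class NoteEvent:
    """Hypothetical container for the note information passed to the control unit."""
    kind: str      # "note_on" or "note_off"
    pitch: int     # MIDI note number, 0..127
    velocity: int  # 1..127 for a note-on; doubles as phoneme control information


def parse_midi_message(msg: bytes) -> NoteEvent:
    """Decode a 3-byte MIDI channel voice message (status, pitch, velocity)."""
    if len(msg) != 3:
        raise ValueError("expected a 3-byte MIDI message")
    status, pitch, velocity = msg
    command = status & 0xF0  # upper nibble is the command, lower nibble the channel
    if command == 0x90 and velocity > 0:
        return NoteEvent("note_on", pitch, velocity)
    # General MIDI practice: note-on with velocity 0 is equivalent to note-off.
    if command == 0x80 or command == 0x90:
        return NoteEvent("note_off", pitch, velocity)
    raise ValueError("not a note-on or note-off message")


event = parse_midi_message(bytes([0x90, 60, 100]))
print(event.kind, event.pitch, event.velocity)
```

In this sketch the time between a `note_on` and the matching `note_off` for the same pitch would give the note's duration, and the `velocity` field is the value the embodiment reuses as phoneme control information.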

As detailed later, the user of the singing synthesizer 1 can adjust (lower) how easily consonants can be heard in the synthesized singing voice through the key-press speed of the key for a note. This adjustment is achieved by performing singing synthesis with some of the phonemes changed in the phoneme sequence specified through the phoneme information input section (that is, the phoneme sequence indicated by the phoneme sequence information entered via the operation unit 120). In this embodiment, the velocity contained in the note information instructing the start of a sound is given the role of phoneme control information, which instructs that singing synthesis be performed with some of the phonemes of the phoneme sequence changed. In other words, the operation unit 120 serves as acquisition means through which the control unit 110 acquires the plurality of types of synthesis information used for singing synthesis (in this embodiment, phoneme sequence information, prosodic information, and phoneme control information).

The display unit 130 is, for example, a liquid crystal display and its drive circuit, and under the control of the control unit 110 it displays various images such as a menu image prompting use of the singing synthesizer 1. As shown in FIG. 1, the audio output unit 140 includes a D/A converter 142, an amplifier 144, and a speaker 146. The D/A converter 142 performs D/A conversion on the digital audio data supplied from the control unit 110 (audio data representing the waveform of the synthesized singing voice) and supplies the resulting analog audio signal to the amplifier 144. The amplifier 144 amplifies the signal level (that is, the volume) of that signal to a level suitable for driving a speaker and supplies it to the speaker 146, which outputs it as sound.

The external device I/F unit 150 is a set of interfaces, such as a USB (Universal Serial Bus) interface and an audio interface, for connecting external devices to the singing synthesizer 1. This embodiment describes the case where the singing synthesis keyboard (operation unit 120) and the audio output unit 140 are components of the singing synthesizer 1, but either may of course be an external device connected to the external device I/F unit 150.

The storage unit 160 includes a nonvolatile storage unit 162 and a volatile storage unit 164. The nonvolatile storage unit 162 consists of nonvolatile memory such as ROM (Read Only Memory), flash memory, or a hard disk, and the volatile storage unit 164 consists of volatile memory such as RAM (Random Access Memory). The control unit 110 uses the volatile storage unit 164 as a work area when executing programs. As shown in FIG. 1, the nonvolatile storage unit 162 stores in advance a singing synthesis library 162a and a singing synthesis program 162b.

The singing synthesis library 162a is a database storing segment data that represents the speech waveforms of various phonemes and diphones (transitions from one phoneme to a different phoneme, including silence). The library may also store triphone segment data in addition to monophones and diphones, or may store the steady parts of phonemes and the transition (transient) parts to other phonemes. The singing synthesis program 162b causes the control unit 110 to perform singing synthesis using the singing synthesis library 162a. Operating under this program, the control unit 110 executes two kinds of processing: a clarity adjustment process and a singing synthesis process.

The singing synthesis process synthesizes and outputs audio data representing the waveform of the singing voice, based on the plurality of types of synthesis information acquired via the operation unit 120. Suppose, as shown in FIG. 2(a), that "ma" (ま) is specified as the lyric to synthesize and "C4" as the pitch at which it is voiced. In this case, phoneme sequence information representing consonant "m" + vowel "a" and note information instructing the start of a sound at pitch "C4" are supplied from the operation unit 120 to the control unit 110. As preprocessing for the singing synthesis process, the control unit 110 generates the phoneme sequence indicated by the phoneme sequence information. For the lyric "ma", as shown in FIG. 2(b), it generates a sequence consisting of the transition from silence (written "#" in FIG. 2(b) and in the subsequent figures) to the consonant "m", the transition from the consonant "m" to the vowel "a", the vowel "a", and the transition from the vowel "a" to silence. In this preprocessing the control unit 110 also generates the pitch curve shown in FIG. 2(c), based on the note information instructing the start of the sound. In the singing synthesis process itself, the control unit 110 reads the segment data of each phoneme (or diphone) in the sequence from the singing synthesis library 162a and converts it to frequency-domain data. It then applies pitch conversion to each converted segment according to the pitch curve, concatenates the segments, and converts the result back to time-domain data to generate the audio data of the synthesized singing voice.
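The preprocessing step that expands a lyric into a sequence of transitions can be sketched as follows. This is only an illustrative sketch: the function name `build_phoneme_sequence` is hypothetical, and the rule that only the last phoneme contributes a steady-state unit is an assumption generalized from the single "ma" example in the text.

```python
SILENCE = "#"  # the text writes silence as "#"


def build_phoneme_sequence(phonemes):
    """Expand e.g. ['m', 'a'] into transition and steady-state units.

    Assumed rule (inferred from the 'ma' example only): a transition from
    silence into the first phoneme, transitions between neighbours, the final
    phoneme as a steady unit, and a transition from it back to silence.
    """
    if not phonemes:
        raise ValueError("at least one phoneme is required")
    units = [f"{SILENCE}-{phonemes[0]}"]
    for prev, nxt in zip(phonemes, phonemes[1:]):
        units.append(f"{prev}-{nxt}")
    units.append(phonemes[-1])
    units.append(f"{phonemes[-1]}-{SILENCE}")
    return units


print(build_phoneme_sequence(["m", "a"]))
```

For the lyric "ma" this yields the four units named in the text: silence to "m", "m" to "a", the vowel "a", and "a" to silence.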

The pitch curve shown in FIG. 2(c) represents a prosodic change that is ideal in the sense that it yields a natural-sounding singing voice. In this pitch curve, the prosodic change in section T1 corresponds to the transient transition from silence to the consonant "m" (undershoot). The prosodic change in section T2 corresponds to the attack in the consonant "m". The prosodic change in section T3 corresponds to the transient transition from the consonant "m" to the vowel "a" (overshoot). The pitch change in section T4 corresponds to the steady pitch change of the vowel "a" (sustain), and the pitch change in section T5 corresponds to the transition from the vowel "a" to silence (release). In this embodiment, pitch curve data characterizing the ideal pitch curve of FIG. 2(c) is stored in advance in the nonvolatile storage unit 162. This data indicates, among other things, the duration of each of sections T1 to T5, the undershoot depth D, the attack slope α, and the overshoot height H. The control unit 110 generates a pitch curve from the pitch curve data and the pitch indicated by the note information instructing the start of the sound, and pitch-converts each segment according to that curve. Likewise for the volume of the synthesized singing voice, data indicating a time course of volume that is ideal for a natural-sounding result may be stored in advance in the nonvolatile storage unit 162, and the synthesized singing voice may be output with its volume controlled according to that data.
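One way to picture how stored pitch curve data and a note pitch could combine into a pitch curve is the piecewise-linear sketch below. It is not the patent's method: FIG. 2(c) is not reproduced here, so the exact shape (where the overshoot peaks, a flat release), the units (cents, seconds), and the names `PitchCurveData`, `build_breakpoints`, `offset_at`, and `to_hz` are all assumptions made for illustration. Only the parameters themselves (section durations T1 to T5, depth D, slope α, height H) come from the text.

```python
from dataclasses import dataclass


@dataclass
class PitchCurveData:
    durations: tuple         # (T1, T2, T3, T4, T5) in seconds (unit assumed)
    undershoot_depth: float  # D, cents below the note pitch (unit assumed)
    attack_slope: float      # alpha, cents per second (unit assumed)
    overshoot_height: float  # H, cents above the note pitch (unit assumed)


def build_breakpoints(data):
    """Return (time, offset_in_cents) breakpoints; the shape is an assumption."""
    t1, t2, t3, t4, t5 = data.durations
    e1 = t1
    e2 = e1 + t2
    e3 = e2 + t3
    e4 = e3 + t4
    e5 = e4 + t5
    return [
        (0.0, 0.0),
        (e1, -data.undershoot_depth),                           # T1: dip by D
        (e2, -data.undershoot_depth + data.attack_slope * t2),  # T2: rise at slope alpha
        (e2 + t3 / 2, data.overshoot_height),                   # T3: peak at H (midpoint assumed)
        (e3, 0.0),                                              # end of T3: settle on the note
        (e4, 0.0),                                              # T4: sustain
        (e5, 0.0),                                              # T5: release shape not modelled
    ]


def offset_at(points, t):
    """Linearly interpolate the pitch offset in cents at time t."""
    if t <= points[0][0]:
        return points[0][1]
    for (t0, y0), (t1, y1) in zip(points, points[1:]):
        if t <= t1:
            return y1 if t1 == t0 else y0 + (y1 - y0) * (t - t0) / (t1 - t0)
    return points[-1][1]


def to_hz(note_hz, cents):
    """Apply an offset in cents to a base frequency (1200 cents = one octave)."""
    return note_hz * 2.0 ** (cents / 1200.0)
```

Under these assumptions, the note pitch from the note-on event supplies `note_hz`, and `to_hz(note_hz, offset_at(points, t))` gives the target frequency at time `t` that the pitch conversion of each segment would follow.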

The clarity adjustment process adjusts how easily the consonant of the lyric voiced on a note can be heard, according to the velocity value v contained in the note information instructing the start of the sound. As noted above, in this embodiment the adjustment is achieved by changing some of the phonemes in the phoneme sequence indicated by the phoneme sequence information (the sequence generated in the preprocessing described above). The clarity adjustment process is executed before the singing synthesis process (in parallel with the preprocessing or following it). It changes some of the phonemes in the phoneme sequence by one of the following: (a) changing (for example, shortening) the duration of the consonant, (b) dropping the consonant, (c) replacing the consonant with another consonant that sounds similar, or (d) repeating the consonant. The specifics are described below, taking as an example the same case as FIG. 2(a), in which phoneme sequence information indicating "ma" (that is, consonant "m" + vowel "a") is supplied from the operation unit 120 to the control unit 110.

(A-1: Mode in which some phonemes of the phoneme sequence are changed by shortening the duration of a consonant)
As a specific example of changing some phonemes of the phoneme sequence by shortening the duration of a consonant, as shown in FIG. 3, the duration of each of the transient transition from silence to the consonant ([#-m] in the example of FIG. 3) and the transient transition from the consonant to the vowel ([m-a] in the example of FIG. 3) is shortened by an adjustment amount that depends on the velocity value v (a larger amount for a larger velocity value v), and the duration of the vowel is extended by the amount of the shortening. When the transition from silence to the consonant and the transition from the consonant to the vowel are shortened, the consonant becomes harder to hear, as it does when pronounced rapidly. The adjustment deliberately lowers the ease of hearing because conventional singing synthesis techniques already produce singing voices whose consonants are sufficiently easy to hear; by deliberately making the consonant harder to hear, a singing voice that sounds as if it were pronounced rapidly can be synthesized.
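The duration adjustment of mode A-1 can be illustrated with a minimal sketch. The `Segment` type, the function name, the millisecond unit, the linear mapping from velocity to the shortening ratio, and the `max_ratio` parameter are all assumptions made for illustration; the description only requires that a larger velocity value v gives a larger shortening and that the vowel is extended by the same amount.

```python
from dataclasses import dataclass, replace

MAX_VELOCITY = 127  # upper end of the velocity range used in the description


@dataclass(frozen=True)
class Segment:
    label: str          # "#-m" and "m-a" are transitions, "a" is a stationary vowel
    duration_ms: float  # hypothetical unit


def shorten_transitions(segments, velocity, max_ratio=0.5):
    """Shorten every transition segment in proportion to velocity and
    give the removed time to the last stationary (non-transition) segment."""
    ratio = max_ratio * velocity / MAX_VELOCITY  # assumed linear mapping
    out, removed = [], 0.0
    for seg in segments:
        if "-" in seg.label:  # transition such as [#-m] or [m-a]
            cut = seg.duration_ms * ratio
            removed += cut
            out.append(replace(seg, duration_ms=seg.duration_ms - cut))
        else:
            out.append(seg)
    # Extend the last stationary segment (the vowel) by the removed time.
    for i in range(len(out) - 1, -1, -1):
        if "-" not in out[i].label:
            out[i] = replace(out[i], duration_ms=out[i].duration_ms + removed)
            break
    return out
```

With this sketch the total length of the note is unchanged whenever the sequence contains a stationary vowel; only the split between the transitions and the vowel moves.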

In this embodiment, the adjustment amount of the consonant duration is a variable value that depends on the velocity value v. Alternatively, the adjustment amount may be a fixed value, with the consonant duration adjusted when the velocity value v exceeds a predetermined threshold th1. The control unit 110 may also be made to shorten the consonant duration at a frequency that depends on the velocity value v (that is, more frequently for a larger velocity). Specifically, a pseudo-random number generated in the range 1 to 127 is compared with the velocity, and the control unit 110 shortens the consonant duration when the former is less than or equal to the latter. When the shortening is triggered at a frequency that depends on the velocity value v, the amount of shortening may be either a fixed value or a variable value that depends on the velocity value v.
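The two triggering rules described here (a velocity-dependent frequency and a fixed threshold) can be sketched as follows. The function names and the example value of `th1` are hypothetical; the comparison of a pseudo-random number in the range 1 to 127 with the velocity follows the description.

```python
import random


def triggered_by_frequency(velocity, rng=random):
    """Return True with probability velocity/127: draw a pseudo-random
    integer in 1..127 and trigger when it is <= the velocity."""
    return rng.randint(1, 127) <= velocity


def triggered_by_threshold(velocity, th1=100):
    """Return True whenever the velocity exceeds the threshold th1.
    The value 100 is only a placeholder, not taken from the description."""
    return velocity > th1
```

A velocity of 127 therefore always triggers the modification under the frequency rule, and a velocity of 64 triggers it on roughly half of the notes.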

(A-2: Mode in which some phonemes of the phoneme sequence are changed by dropping a consonant)
As a specific example of changing some phonemes of the phoneme sequence by dropping a consonant, the consonant may be dropped at a frequency that depends on the velocity value v. Specifically, as shown in FIG. 4, the control unit 110 is made to execute, at a frequency that depends on the velocity value v, a process that removes from the phoneme sequence indicated by the phoneme sequence information the diphones corresponding to the transient transition from silence to the consonant and the transient transition from the consonant to the vowel, and supplies in their place a diphone corresponding to the transition from silence to the vowel. The diphone for the transition from silence to the vowel is supplied so that the singing voice rises smoothly. In this mode the consonant is no longer pronounced at all. Instead of dropping the consonant at a frequency that depends on the velocity value v, the consonant may always be dropped when the velocity value v exceeds the predetermined threshold th1. Further, when the consonant portion consists of a plurality of phonemes (for example, when [#-m] in FIG. 5 is replaced by [#-m] + [m]), the consonant may be dropped by removing, from the plurality of phonemes making up the consonant portion, a number of phonemes that depends on the velocity value v (or by removing that number of phonemes at a frequency that depends on the velocity value v).
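The diphone substitution of mode A-2 can be sketched as a small list transformation. Representing diphones as strings such as "#-m", and the function names, are assumptions for illustration; whether to call the function for a given note would be decided by the frequency or threshold rule described above.

```python
def split_diphone(label):
    """Return (left, right) for a diphone label like "#-m", or None
    for a stationary phoneme label like "a"."""
    left, sep, right = label.partition("-")
    return (left, right) if sep else None


def drop_leading_consonant(diphones):
    """Replace a leading [#-C], [C-V] pair with the single diphone [#-V],
    so that the voice still rises smoothly from silence into the vowel."""
    if len(diphones) >= 2:
        first = split_diphone(diphones[0])
        second = split_diphone(diphones[1])
        if first and second and first[0] == "#" and first[1] == second[0]:
            return [f"#-{second[1]}"] + list(diphones[2:])
    return list(diphones)  # nothing to drop; return an unchanged copy
```

For the "ma" example, the sequence [#-m], [m-a], [a] becomes [#-a], [a].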

(A-3: Mode in which some phonemes of the phoneme sequence are changed by replacing a consonant)
As a specific example of changing some phonemes of the phoneme sequence by replacing a consonant, the replacement may be made at a frequency that depends on the velocity value v. Specifically, as shown in FIG. 5, the control unit 110 is made to execute, at a frequency that depends on the velocity value v, a process that replaces a consonant in the phoneme sequence indicated by the phoneme sequence information with another, similar-sounding consonant. FIG. 5 illustrates the case where the consonant "m" is replaced with the consonant "n". To make such replacement possible, replacement control information indicating, for each consonant, the consonants that are candidates for replacing it is stored in advance in the non-volatile storage unit 162, and the control unit 110 replaces consonants according to this replacement control information. Instead of replacing the consonant at a frequency that depends on the velocity value v, the consonant may always be replaced when the velocity value v exceeds the predetermined threshold th1. Further, when the consonant portion consists of a plurality of phonemes, a number of those phonemes that depends on the velocity value v may be replaced with other, similar-sounding phonemes (or that number of phonemes may be replaced at a frequency that depends on the velocity value v).
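The role of the replacement control information in mode A-3 can be illustrated with a lookup table. The table contents beyond the "m" to "n" pair of FIG. 5, the string representation of diphones, and the function name are assumptions; the description does not specify the format in which the replacement control information is stored.

```python
# Hypothetical replacement control information: one replacement candidate
# per consonant. Only the pair "m" -> "n" appears in the description.
REPLACEMENT_TABLE = {"m": "n"}


def replace_consonants(diphones, table=REPLACEMENT_TABLE):
    """Rewrite every phoneme that has an entry in the table, both in
    diphone labels such as "#-m" and in stationary labels such as "m"."""
    out = []
    for label in diphones:
        parts = label.split("-")
        out.append("-".join(table.get(p, p) for p in parts))
    return out
```

For the "ma" example, [#-m], [m-a], [a] becomes [#-n], [n-a], [a], so the note is sung as "na".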

(A-4: Mode in which some phonemes of the phoneme sequence are changed by inserting a consonant)
As a specific example of changing some phonemes of the phoneme sequence by inserting a consonant, as shown in FIG. 6, the control unit 110 may be made to execute, at a frequency that depends on the velocity value v, a process that inserts diphones related to a consonant contained in the phoneme sequence indicated by the phoneme sequence information (in the example of FIG. 6, a diphone D1 corresponding to the transient change from the consonant to silence and a diphone D2 corresponding to the transient change from silence to the consonant) between that consonant and the vowel. Inserting a consonant in this way makes it possible to synthesize a singing voice that sounds as if the singer were stumbling over the words, and thus to lower the ease of hearing the consonant. The number of diphones to be inserted may be a predetermined fixed value or a variable value that depends on the velocity value v (a value that increases as the velocity value v increases). Instead of inserting the consonant at a frequency that depends on the velocity value v, the control unit 110 may always insert a predetermined number of consonants (or a number determined by the velocity value v) when the velocity value v exceeds the predetermined threshold th1, or may simply insert a number of consonants determined by the velocity value v.
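The insertion of mode A-4 can be sketched as follows. The string labels and the `repeats` parameter are assumptions; D1 and D2 correspond to the consonant-to-silence and silence-to-consonant diphones of FIG. 6, and the pair is placed between the diphone that ends in the consonant and the diphone that leads into the vowel.

```python
def insert_stutter(diphones, consonant, repeats=1):
    """After each diphone that ends in `consonant`, insert `repeats`
    copies of the pair D1 = [consonant-#], D2 = [#-consonant]."""
    d1 = f"{consonant}-#"  # transient change from the consonant to silence
    d2 = f"#-{consonant}"  # transient change from silence to the consonant
    out = []
    for label in diphones:
        out.append(label)
        if label.endswith(f"-{consonant}"):
            out.extend([d1, d2] * repeats)
    return out
```

For "ma" with one repetition, [#-m], [m-a], [a] becomes [#-m], [m-#], [#-m], [m-a], [a], which sounds like a stumbled "m, ma".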

As described above, according to the singing voice synthesizing apparatus 1 of this embodiment, the singing voice is synthesized after some of the phonemes of the phoneme sequence indicated by the phoneme sequence information have been changed at a frequency (or by an adjustment amount) that depends on the velocity value v specified by the operation that instructs the start of sounding. According to this embodiment, therefore, the ease of hearing consonants in the singing voice can be controlled by the key-press speed used when instructing the start of sounding, and singing voices of various kinds can be synthesized, such as a singing voice that sounds as if it were pronounced rapidly or one that sounds as if the singer were stumbling over the words, which improves the expressiveness of singing synthesis. Which of (a) shortening the consonant duration, (b) dropping the consonant, (c) replacing the consonant, and (d) inserting a consonant is used to change some of the phonemes of the phoneme sequence may be determined in advance or may be selected by the user.

Further, instead of changing some phonemes of the phoneme sequence by only one of modes (a) to (d) above, they may be changed by a combination of two or more of these modes (for example, shortening the consonant duration combined with replacing the consonant, or replacing the consonant combined with inserting a consonant). In addition, the processing need not be limited to consonants: vowels may also be subjected to duration adjustment, dropping, replacement with another vowel, or insertion. In such a mode, when synthesizing a singing voice for the lyric "saita", it becomes possible to synthesize a variety of singing voices, such as "sa, sa, sa, ita" (repetition of the consonant + vowel portion) or "a, a, aita" (dropping of the consonant + repetition of the vowel), which makes the expressiveness even richer. Moreover, because lyrics written in languages other than Japanese, such as English, may end in a consonant (for example, "fan"), the ease of hearing a final consonant may also be adjusted by any of modes (a) to (d) above (or a combination of two or more of them). In short, any mode is acceptable as long as part of the phonemes indicated by the phoneme sequence information is changed by adjusting the duration of, dropping, replacing, or inserting phonemes.

(B: Second embodiment)
In the intelligibility adjustment process of the first embodiment, the ease of hearing consonants was lowered by synthesizing the singing voice after changing some of the phonemes (consonants) of the phoneme sequence indicated by the phoneme sequence information. The intelligibility adjustment process of this embodiment, in contrast, is characterized in that it lowers the ease of hearing consonants by synthesizing the singing voice so that a deviation from the ideal prosody change shown in FIG. 2(c) arises. Specific examples of how to produce a deviation from the ideal prosody change include changing the depth or duration of the attack from that of the ideal prosody change, omitting the attack, changing the height of the overshoot and the depth of the undershoot or their durations, and omitting the overshoot and the undershoot. The configuration of the singing voice synthesizing apparatus of this embodiment does not differ in any particular way from that of the first embodiment, so a detailed description is omitted; specific modes of adjusting the prosody change are described below.

(B-1: Mode in which a deviation from the ideal prosody change is produced by adjusting the attack)
FIG. 7 illustrates a mode in which a deviation from the ideal prosody change is produced by adjusting the depth of the attack. In FIG. 7, the pitch curve of the ideal prosody change is drawn as a broken line, and the pitch curve used in this mode is drawn as a solid line. In FIG. 7 the two pitch curves coincide from the overshoot onward (in FIGS. 8 to 12 as well, parts of the two pitch curves coincide). In this mode, the control unit 110 adjusts the pitch curve data (more precisely, the data representing the attack gradient α) so that the depth of the attack changes (for example, becomes shallower) by an adjustment amount that depends on the velocity value v contained in the note information instructing the start of sounding, and generates the pitch curve based on the adjusted pitch curve data and the pitch indicated by that note information. This is because making the attack shallower than in the ideal pitch curve lowers the ease of hearing the consonant. In this embodiment the ease of hearing the consonant is lowered by making the attack shallower than in the ideal pitch curve, but making the attack deeper than in the ideal pitch curve also lowers the ease of hearing. Accordingly, the attack may instead be deepened according to the velocity value v contained in the note information instructing the start of sounding. Whether the attack is made shallower or deeper than in the ideal pitch curve may be determined in advance or may be selected by the user.

In this embodiment, the velocity contained in the note information instructing the start of sounding serves as prosody control information, which instructs that the singing voice be synthesized with a deviation from the ideal prosody change and indicates the magnitude of that deviation. The control unit 110 may also be made to change the depth of the attack from its depth in the ideal pitch curve by a predetermined amount (or an amount that depends on the velocity value v) at a frequency that depends on the velocity value v contained in the note information instructing the start of sounding. Alternatively, the control unit 110 may be made to change the depth of the attack from its depth in the ideal pitch curve by a predetermined amount (or an amount that depends on the velocity value v) when the velocity value v exceeds a predetermined threshold th2 (which may be the same value as the threshold th1 of the first embodiment or a different value).

FIG. 8 illustrates a mode in which a deviation from the ideal prosody change is produced by adjusting the duration of the attack. In FIG. 8, as in FIG. 7, the pitch curve of the ideal prosody change is drawn as a broken line, and the pitch curve used in this mode is drawn as a solid line. In this mode, the control unit 110 changes (specifically, shortens) the duration of the attack from its duration in the ideal pitch curve by an adjustment amount that depends on the velocity value v contained in the note information instructing the start of sounding, changes (specifically, extends) the sustain of the vowel by the same amount, and generates the pitch curve. Changing the duration of the attack from its duration in the ideal pitch curve in this way also lowers the ease of hearing the consonant. One specific way to generate the pitch curve drawn as a solid line in FIG. 8 is to adjust the pitch curve data (more precisely, the data indicating the length of section T2 in FIG. 2(c), the data indicating the attack gradient α, and the data indicating the length of section T4) so that section T2 becomes shorter by a length that depends on the velocity value v and section T4 becomes longer by the same length, and then to generate the pitch curve based on the adjusted pitch curve data and the pitch indicated by the note information instructing the start of sounding. In this mode as well, the control unit 110 may be made to change the duration of the attack from its duration in the ideal pitch curve by a predetermined amount (or an amount that depends on the velocity value v) at a frequency that depends on the velocity value v contained in the note information instructing the start of sounding. Alternatively, the control unit 110 may be made to change the duration of the attack from its duration in the ideal pitch curve by a predetermined amount (or an amount that depends on the velocity value v) when the velocity value v exceeds the predetermined threshold th2.
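The adjustment of sections T2 and T4 can be sketched as a transfer of time from one field to another. The `PitchCurveParams` type, its field names, the millisecond unit, and the linear velocity mapping are assumptions for illustration; here T2 is taken to be the attack section and T4 the vowel sustain section, as the adjustment described above implies. How the gradient α is updated alongside T2 is not specified in this passage, so the sketch leaves it unchanged.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PitchCurveParams:
    t2_ms: float  # length of section T2 (assumed: the attack)
    alpha: float  # attack gradient
    t4_ms: float  # length of section T4 (assumed: the vowel sustain)


def shorten_attack(params, velocity, max_ratio=0.8):
    """Shorten T2 by an amount proportional to velocity and lengthen
    T4 by the same amount, keeping the overall note length constant."""
    cut = params.t2_ms * max_ratio * velocity / 127  # assumed linear mapping
    return replace(params, t2_ms=params.t2_ms - cut, t4_ms=params.t4_ms + cut)
```

Because the time removed from T2 is added to T4, the note ends at the same moment as in the ideal pitch curve; only the attack becomes shorter.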

FIG. 9 illustrates a mode in which a deviation from the ideal prosody change is produced by omitting the attack (that is, generating the pitch curve with the duration of the attack set to zero). In FIG. 9, as in FIG. 7, the pitch curve of the ideal prosody change is drawn as a broken line, and the pitch curve used in this mode is drawn as a solid line. An adjustment that omits the attack in this way also lowers the ease of hearing the consonant. The attack may be omitted at a frequency that depends on the velocity value v contained in the note information instructing the start of sounding, or it may be omitted when the velocity value v exceeds the predetermined threshold th2.

(B-2: Mode in which a deviation from the ideal prosody change is produced by adjusting the undershoot and the overshoot)
FIG. 10 illustrates a mode in which a deviation from the ideal prosody change is produced by adjusting the undershoot depth D and the overshoot height H. In FIG. 10, the pitch curve of the ideal prosody change is drawn as a broken line, and the pitch curve used in this mode is drawn as a solid line. In this mode, the control unit 110 changes the undershoot depth D from its depth in the ideal prosody change (for example, makes it shallower) by an adjustment amount that depends on the velocity value v contained in the note information instructing the start of sounding, changes the overshoot height H from its height in the ideal prosody change (for example, makes it lower) by an adjustment amount that depends on the same velocity value v, and generates the pitch curve. This mode also lowers the ease of hearing the consonant. As with the adjustment of the attack depth, the undershoot depth D may instead be made deeper, and the overshoot height H higher, by adjustment amounts that depend on the velocity value v contained in the note information instructing the start of sounding. The control unit 110 may also be made to change the undershoot depth D and the overshoot height H by a predetermined amount (or an amount that depends on the velocity value v) at a frequency that depends on the velocity value v contained in the note information instructing the start of sounding. Alternatively, the control unit 110 may be made to change the undershoot depth D and the overshoot height H by a predetermined amount (or an amount that depends on the velocity value v) when the velocity value v exceeds the predetermined threshold th2.
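The adjustment of the undershoot depth D and the overshoot height H can be sketched as a velocity-dependent scaling. The function name, the multiplicative form of the adjustment, and the `max_ratio` and `deepen` parameters are assumptions; the description only states that D and H are changed by adjustment amounts that depend on the velocity value v, in either direction.

```python
def adjust_shoots(depth_d, height_h, velocity, max_ratio=1.0, deepen=False):
    """Scale the undershoot depth D and the overshoot height H.

    With deepen=False both values shrink toward zero as velocity grows;
    with deepen=True both values grow instead.
    """
    amount = max_ratio * velocity / 127  # assumed linear mapping
    factor = 1.0 + amount if deepen else 1.0 - amount
    return depth_d * factor, height_h * factor
```

In this sketch, setting `max_ratio` to 1.0 flattens both excursions completely at the maximum velocity, while smaller values leave some undershoot and overshoot in place.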

FIG. 11 illustrates a mode in which a deviation from the ideal prosody change is produced by adjusting the durations of the undershoot and the overshoot. In FIG. 11, the pitch curve of the ideal prosody change is drawn as a broken line, and the pitch curve used in this mode is drawn as a solid line. In this mode, the control unit 110 changes (specifically, shortens) the duration of each of the undershoot and the overshoot from its duration in the ideal prosody change by an adjustment amount that depends on the velocity value v contained in the note information instructing the start of sounding, changes (specifically, extends) the duration of the vowel sustain by the same amount, and generates the pitch curve. This mode also lowers the ease of hearing the consonant. The control unit 110 may also be made to change the durations of the undershoot and the overshoot from those in the ideal pitch curve by a predetermined amount (or an amount that depends on the velocity value v) at a frequency that depends on the velocity value v contained in the note information instructing the start of sounding. Alternatively, the control unit 110 may be made to change the durations of the undershoot and the overshoot by a predetermined amount (or an amount that depends on the velocity value v) when the velocity value v exceeds the predetermined threshold th2.

FIG. 12 illustrates a mode in which a deviation from the ideal prosody change is produced by omitting the undershoot and the overshoot (that is, generating the pitch curve with both the undershoot duration and the overshoot duration set to zero). In FIG. 12, the pitch curve of the ideal prosody change is drawn as a broken line, and the pitch curve used in this mode is drawn as a solid line. This mode also lowers the ease of hearing the consonant. The undershoot and the overshoot may be omitted at a frequency that depends on the velocity value v contained in the note information instructing the start of sounding, or they may be omitted when the velocity value v exceeds the predetermined threshold th2.

As described above, the ease of hearing consonants in the singing voice can also be controlled by synthesizing the singing voice while adjusting the prosody change according to prosody control information that instructs that the voice be synthesized with a deviation from the ideal prosody change. In this embodiment, the velocity value contained in the note information instructing the start of sounding is used as the prosody control information, so the user can control the ease of hearing consonants in the synthesized singing voice by the key-press speed used when specifying the pitch of the singing voice, and richly expressive singing voices can be synthesized in real time. Which of depth adjustment, duration adjustment, and omission is used to adjust the attack may be determined in advance or may be selected by the user. Likewise, for the undershoot and the overshoot, which of adjustment of the depth and height, adjustment of the duration, and omission is used may be determined in advance or may be selected by the user. Further, instead of adjusting both the overshoot and the undershoot together, only one of them may be adjusted.

(C: Modifications)
Embodiments of the present invention have been described above; the following modifications may of course be made to these embodiments.
(1) The first embodiment described a mode in which the ease of hearing consonants is adjusted by changing some of the phonemes of the phoneme sequence indicated by the phoneme sequence information, and the second embodiment described a mode in which it is adjusted by producing a deviation from the ideal prosody change. However, phoneme sequence information, prosody information, phoneme control information, and prosody control information may of course all be input to the singing voice synthesizing apparatus as the plural types of synthesis information, and the ease of hearing consonants may be adjusted by combining the process of changing some phonemes of the phoneme sequence according to the phoneme control information with the process of producing a deviation from the ideal prosody change according to the prosody control information. Using both together is expected to make it possible to synthesize a wider variety of singing voices than using either one alone. The phoneme control information and the prosody control information may be separate pieces of information, or a single piece of information may serve as both. As an example of the latter, the velocity contained in the note information instructing the start of sounding (a note-on event) may be given both the role of phoneme control information and the role of prosody control information.

(2) In the first embodiment, the velocity contained in the note information that instructs the start of sounding served as the phoneme control information, and in the second embodiment the same velocity served as the prosody control information. However, the combination of that velocity and the pitch indicated by the note information may of course serve as the phoneme control information, the prosody control information, or both. Alternatively, the singing synthesis keyboard may be provided with a dedicated operator with which the user inputs phoneme control information or prosody control information, so that this information is input by operating that operator, separately from the note information that instructs the start of sounding.

(3) In each of the above embodiments, the singing synthesizer 1 incorporated both the operation unit 120 and the audio output unit 140. The operation unit 120 serves as the acquisition means that causes the singing synthesizer 1 to acquire the plural types of synthesis information used for singing synthesis: phoneme sequence information, prosody information (in the above embodiments, note information indicating the pitch, sounding start timing, and sounding end timing of a note), and phoneme control information (or prosody control information). The audio output unit 140 outputs the synthesized singing voice. However, either or both of the operation unit 120 and the audio output unit 140 may instead be connected to the external device I/F unit 150 of the singing synthesizer 1. When the operation unit 120 is connected to the singing synthesizer 1 via the external device I/F unit 150, the external device I/F unit 150 serves as the acquisition means.

One example of connecting both the operation unit 120 and the audio output unit 140 to the external device I/F unit 150 is to use an Ethernet (registered trademark) interface as the external device I/F unit 150, connect a telecommunication line such as a LAN (Local Area Network) or the Internet to it, and connect the operation unit 120 and the audio output unit 140 to that telecommunication line. This makes it possible to provide a so-called cloud computing type singing synthesis service. Specifically, the plural types of synthesis information input by operating the operators of the operation unit 120 are supplied to the singing synthesizer via the telecommunication line, and the singing synthesizer executes the singing synthesis process based on the synthesis information thus acquired. The audio data of the synthesized singing voice is then supplied to the audio output unit 140 via the telecommunication line, and the audio output unit 140 outputs the corresponding sound.
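
The exchange of synthesis information over a telecommunication line can be pictured with the following minimal sketch. The JSON message format, the field names `phonemes`, `pitch`, and `velocity`, and both function names are assumptions introduced only for illustration; the text above does not specify any wire format.

```python
import json
from typing import Any, Dict, List


def encode_note_request(phonemes: List[str], pitch: int, velocity: int) -> bytes:
    """Operation-unit side: pack one note's synthesis information.

    The field names are hypothetical, not taken from the source.
    """
    message = {"phonemes": phonemes, "pitch": pitch, "velocity": velocity}
    return json.dumps(message).encode("utf-8")


def decode_note_request(payload: bytes) -> Dict[str, Any]:
    """Synthesizer side: unpack the request and check that it is complete."""
    message = json.loads(payload.decode("utf-8"))
    for key in ("phonemes", "pitch", "velocity"):
        if key not in message:
            raise ValueError(f"missing field: {key}")
    return message
```

In such an arrangement the synthesizer would run the singing synthesis process on the decoded fields and return the resulting audio data over the same line; that part is omitted here.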

(4) In each of the above embodiments, a singing synthesis keyboard was used as the operation unit 120 for inputting the plural types of synthesis information to the singing synthesizer 1. However, a combination of a general keyboard, on which a numeric keypad, cursor keys, keys for the letters of the alphabet, and the like are arranged, and a so-called MIDI keyboard may be used as the operation unit 120. In that case, the MIDI keyboard serves as the note information input unit and the general keyboard serves as the phoneme information input unit. The note information input unit or the phoneme information input unit may also be realized by a combination of a pointing device such as a mouse and a GUI. When the note information input unit is realized by a pointing device and a GUI, the operation unit 120 may be formed by combining that note information input unit with a general keyboard serving as the phoneme information input unit. When the phoneme information input unit is realized by a pointing device and a GUI, the operation unit 120 may be formed by combining that phoneme information input unit with a MIDI keyboard serving as the note information input unit.

(5) In each of the above embodiments, the singing synthesis program 162b, which causes the control unit 110 to execute the singing synthesis process and the clarity adjustment process, was stored in advance in the nonvolatile storage unit 162 of the singing synthesizer 1. However, the singing synthesis program 162b may be distributed by being written on a computer-readable recording medium such as a CD-ROM (Compact Disc Read-Only Memory), or by download via a telecommunication line such as the Internet. Causing a general computer such as a personal computer to execute a program distributed in this way makes that computer function as the singing synthesizer 1 of the above embodiments. The present invention may of course also be applied to the game program of a game that includes real-time singing synthesis as part of its processing. Specifically, the singing synthesis program included in the game program may be replaced with the singing synthesis program 162b. This makes it possible to improve the expressiveness of the singing voice synthesized as the game progresses.

(6) The above embodiments described examples of applying the present invention to a real-time singing synthesizer. However, the invention is not limited to real-time singing synthesizers. For example, it may be applied to a speech synthesizer that synthesizes guidance speech for voice guidance in real time, or to a speech synthesizer that synthesizes read-aloud speech of literary works such as novels and poems in real time. The invention may also be applied to a toy having a singing synthesis function or a speech synthesis function (a toy incorporating a singing synthesizer or a speech synthesizer).

(7) In the first embodiment, phoneme sequence information, prosody information, and phoneme control information were used as the plural types of synthesis information for synthesizing a singing voice, and in the second embodiment, phoneme sequence information, prosody information, and prosody control information were used. However, the synthesis information used for synthesizing a singing voice is not limited to phoneme sequence information, prosody information, and phoneme control information (or prosody control information). For example, when the singing synthesis library 162a stores segment data for plural voice qualities, classified by voice quality, voice quality designation information that designates a voice quality may be included in the synthesis information in addition to the phoneme sequence information, prosody information, and phoneme control information (or prosody control information), and the singing voice may be synthesized using the segment data of the voice quality so designated.
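
Selecting segment data by voice quality can be sketched as a two-level lookup. The nested-dictionary layout of the library, the type alias `SegmentLibrary`, and the function name are assumptions made for illustration; the actual structure of the singing synthesis library 162a is not given here.

```python
from typing import Dict

# Hypothetical library layout: voice quality -> phoneme -> segment data.
SegmentLibrary = Dict[str, Dict[str, bytes]]


def select_segment(library: SegmentLibrary, voice_quality: str, phoneme: str) -> bytes:
    """Return the segment data for a phoneme in the designated voice quality."""
    try:
        return library[voice_quality][phoneme]
    except KeyError as exc:
        raise KeyError(
            f"no segment for voice quality {voice_quality!r}, phoneme {phoneme!r}"
        ) from exc
```

Keeping the voice quality as the outer key mirrors the description of segment data being classified and stored per voice quality.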

In each of the above embodiments, the phoneme sequence information, prosody information, and phoneme control information (or prosody control information) were all input to the singing synthesizer 1 by operating the operators of the operation unit 120. However, at least one of these may be input by operating the operators of the operation unit 120 while the others are stored in the singing synthesizer 1 in advance. Specifically, phoneme sequence information indicating the phoneme sequence of the lyrics of the entire song to be synthesized may be stored in the nonvolatile storage unit 162 in advance, and the note information serving as the prosody information and the phoneme control information may be input note by note by operating the operators of the operation unit 120. In this case, the operation unit 120 and the means that reads out the phoneme sequence information stored in the nonvolatile storage unit 162 (for example, the control unit 110) together serve as the acquisition means for acquiring the plural types of synthesis information. Alternatively, the prosody information and the phoneme control information (or the prosody control information in place of the phoneme control information, or both) may be stored in the singing synthesizer 1 in advance, and only the phoneme sequence information may be input to the singing synthesizer 1 by operating the operators of the operation unit 120.
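
The case of lyrics stored in advance and notes entered one at a time can be sketched as follows. The names `NoteOn` and `StoredLyricSource`, and the choice to store the lyrics as one list of phonemes per note, are hypothetical and serve only to illustrate how stored and operator-entered information could be combined by the acquisition means.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class NoteOn:
    """Note information entered by the performer for one note."""
    pitch: int
    velocity: int


class StoredLyricSource:
    """Pairs pre-stored phonemes with notes as they are played."""

    def __init__(self, syllables: List[List[str]]) -> None:
        # Assumed storage format: one list of phonemes per note.
        self._syllables = iter(syllables)

    def acquire(self, note: NoteOn) -> Tuple[List[str], NoteOn]:
        """Return the next stored syllable together with the played note."""
        try:
            phonemes = next(self._syllables)
        except StopIteration:
            raise RuntimeError("stored lyrics exhausted") from None
        return phonemes, note
```

Each call to `acquire` yields everything needed to synthesize one note: the phonemes read from storage and the pitch and velocity supplied by the operator.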

(8) The above embodiments described adjustment that lowers the ease with which consonants can be heard, but adjustment that raises it according to the magnitude of the velocity or the like may also be performed. For example, when the nonvolatile storage unit 162 stores pitch curve data representing a pitch curve whose attack depth or the like differs from that of the ideal pitch curve shown in FIG. 2(c), speech synthesis may be performed after applying to that pitch curve data a process that adjusts the attack depth or the like according to the prosody control information so that the curve approaches the ideal pitch curve.
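
One simple way to picture moving a stored attack depth toward the ideal one is linear interpolation weighted by the velocity. The function name, the use of velocity as the prosody control information, the linear weighting, and the convention that velocity 127 reaches the ideal depth exactly are all assumptions for illustration.

```python
def adjust_attack_depth(stored_depth: float, ideal_depth: float, velocity: int) -> float:
    """Move the stored attack depth toward the ideal depth.

    Assumption for illustration only: the weight grows linearly with the
    note-on velocity, so velocity 0 keeps the stored depth and velocity
    127 yields the ideal depth.
    """
    if not 0 <= velocity <= 127:
        raise ValueError("velocity must be in 0..127")
    weight = velocity / 127
    return stored_depth + (ideal_depth - stored_depth) * weight
```

The same interpolation could be applied to other pitch-curve parameters, such as the attack duration, if they are stored alongside the depth.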

Description of symbols: 1 ... singing synthesizer, 110 ... control unit, 120 ... operation unit, 130 ... display unit, 140 ... audio output unit, 142 ... D/A converter, 144 ... amplifier, 146 ... speaker, 150 ... external device I/F, 160 ... storage unit, 162 ... nonvolatile storage unit, 162a ... singing synthesis library, 162b ... singing synthesis program, 164 ... volatile storage unit, 170 ... bus.

Claims

1. A speech synthesizer comprising:
acquisition means for acquiring plural types of synthesis information that include phoneme sequence information indicating a phoneme sequence of speech to be synthesized and prosody information indicating a prosodic change of the speech, and that further include a velocity, the acquisition means acquiring at least one of the prosody information and the velocity through operation of an operator; and
speech synthesis means for synthesizing speech using the plural types of synthesis information acquired by the acquisition means, the speech synthesis means performing the speech synthesis while lowering the ease with which a consonant contained in the phoneme sequence indicated by the phoneme sequence information can be heard, in accordance with the value of the velocity included in the plural types of synthesis information.

2. The speech synthesizer according to claim 1, wherein the speech synthesis means lowers the ease with which the consonant can be heard by changing the synthesized content of the consonant.

3. The speech synthesizer according to claim 2, wherein the process of changing the synthesized content of the consonant is a predetermined one or more of a process of changing the duration of a consonant, a process of deleting a consonant, a process of replacing a consonant, and a process of inserting a consonant into the phoneme sequence indicated by the phoneme sequence information, and
the speech synthesis means controls the duration according to the value of the velocity in the process of changing the duration of a consonant, controls the number of consonants to be inserted according to the value of the velocity in the process of inserting a consonant into the phoneme sequence indicated by the phoneme sequence information, controls the number of consonants to be deleted according to the value of the velocity in the process of deleting a consonant, and controls the number of consonants to be replaced according to the value of the velocity in the process of replacing a consonant.

4. The speech synthesizer according to claim 2, wherein the speech synthesis means changes the synthesized content of consonants at a frequency corresponding to the value of the velocity.

5. The speech synthesizer according to claim 1, wherein the speech synthesis means lowers the ease with which the consonant can be heard by generating a deviation from the prosodic change indicated by the prosody information.

6. The speech synthesizer according to claim 5, wherein the speech synthesis means executes, as the process of generating the deviation from the prosodic change, a predetermined one or more of: a process of adjusting the depth or duration of the attack in a pitch curve representing the prosodic change according to the value of the velocity; a process of adjusting the height or duration of the overshoot in the pitch curve according to the value of the velocity; and a process of adjusting the depth or duration of the undershoot in the pitch curve according to the value of the velocity.