JP4353174B2 - Speech synthesizer

Info

Publication number: JP4353174B2
Application number: JP2005336272A
Authority: JP
Inventors: 裕司久湊; ボナダジョルディ
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2005-11-21
Filing date: 2005-11-21
Publication date: 2009-10-28
Anticipated expiration: 2021-03-09
Also published as: JP2006119655A

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer in which the database size is reduced while degradation of sound quality is kept to a minimum.

SOLUTION: The speech synthesizer has: storage means that stores feature values of speech at a specific time, indexed by phoneme and pitch; phoneme template storage means that stores, indexed by phoneme and pitch, a plurality of templates representing the temporal variation of pitch and of the speech feature values, some obtained by analyzing speech in parts where the feature values are stationary and others obtained by analyzing speech in the connecting parts between phonemes; input means that receives speech information used for speech synthesis, including a pitch, a phoneme, and a phoneme flag indicating whether the phoneme is a stationary part or a transition part; and speech synthesis means that reads speech feature values from the storage means using the pitch and phoneme included in the input speech information as indexes, and synthesizes speech based on the feature values and the pitch.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a speech synthesizer, and more particularly to a speech synthesizer that synthesizes a human singing voice.

Human speech is composed of phonemes, and each phoneme is composed of a plurality of formants. To synthesize a human singing voice, the system is therefore first prepared so that, for every phoneme a human can produce, all the formants constituting that phoneme can be generated and combined, and the required phonemes are generated. Next, the generated phonemes are connected in sequence, and the pitch is controlled to follow the melody. This technique is applicable not only to human speech but also to the synthesis of musical sounds that have formants, for example sounds produced by wind instruments.

Speech synthesizers using this technique have long been known. For example, Japanese Patent No. 2504172 discloses a formant sound generator configured so that it does not generate unnecessary spectral components even when generating a formant sound of high pitch.

It is also known that formant frequencies depend on pitch. As described in the embodiment of Japanese Unexamined Patent Publication No. H6-308997, a technique is known in which several phoneme segments are held in a database for each pitch frequency and an appropriate phoneme segment is selected according to the pitch of the speech.

However, in a conventional database of this kind, each phoneme segment must be held at a certain number of pitch frequencies or more, so the database becomes relatively large.

In addition, because phoneme segments must be extracted from speech uttered at many different pitches, building the database takes time.

Furthermore, formant frequencies do not depend on pitch alone. When other factors such as dynamics are added, the amount of data grows as the square, the cube, and so on.

An object of the present invention is to provide a speech synthesizer with a reduced database size while keeping degradation of sound quality to a minimum.

Another object of the present invention is to provide a speech synthesizer that uses the above database.

According to one aspect of the present invention, a speech synthesizer comprises: storage means that stores feature values of speech at a specific time, including at least one of an excitation resonance and a formant, indexed by phoneme and pitch; template storage means that stores templates representing the temporal variation of pitch and of the speech feature values, indexed by phoneme and pitch; input means that receives speech information for speech synthesis including at least a pitch and a phoneme; reading means that reads the speech feature values and a template from the storage means and the template storage means, respectively, according to the input speech information; and speech synthesis means that applies the read template to the read speech feature values and to the pitch included in the input speech information, and synthesizes speech based on the feature values and pitch after that application. When the pitch included in the input speech information exceeds the value of the highest index in the storage means, the reading means obtains a pitch difference by subtracting the pitch of the highest index stored in the storage means from the pitch included in the input speech information, adds the pitch difference to the excitation resonance frequency or the formant frequency included in the feature values read from the storage means with the highest pitch index, and outputs the resulting feature values to the speech synthesis means.

According to another aspect of the present invention, a speech synthesizer comprises: storage means that stores feature values of speech at a specific time, including at least one of an excitation resonance and a formant, indexed by phoneme and pitch; template storage means that stores templates representing the temporal variation of pitch and of the speech feature values, indexed by phoneme and pitch; input means that receives speech information for speech synthesis including at least a pitch and a phoneme; reading means that reads the speech feature values and a template from the storage means and the template storage means, respectively, according to the input speech information; and speech synthesis means that applies the read template to the read speech feature values and to the pitch included in the input speech information, and synthesizes speech based on the feature values and pitch after that application. When the pitch included in the input speech information falls below the value of the lowest index in the storage means, the reading means obtains a pitch difference by subtracting the pitch of the lowest index stored in the storage means from the pitch included in the input speech information, adds a specified proportion of the pitch difference to the excitation resonance frequency or the formant frequency included in the feature values read from the storage means with the lowest pitch index, and outputs the resulting feature values to the speech synthesis means.
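The out-of-range handling in the two aspects above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Features` type, the `read_features` name, the use of Hz, the default `low_ratio` of 0.5, shifting both the excitation resonance and the formants, and the nearest-index choice for in-range pitches are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Features:
    """Hypothetical subset of the stored feature values for one phoneme."""
    er_freq: float              # excitation resonance center frequency (Hz, assumed unit)
    formant_freqs: List[float]  # formant center frequencies (Hz, assumed unit)


def read_features(table: Dict[float, Features], pitch: float,
                  low_ratio: float = 0.5) -> Features:
    """Read features for `pitch` from a table keyed by stored pitch index.

    Above the highest index the full pitch difference is added to the
    frequencies; below the lowest index only `low_ratio` of the (negative)
    difference is added. `low_ratio` stands in for the "specified proportion".
    """
    highest, lowest = max(table), min(table)
    if pitch > highest:
        base, shift = table[highest], pitch - highest
    elif pitch < lowest:
        base, shift = table[lowest], (pitch - lowest) * low_ratio
    else:
        # Assumption: inside the stored range, take the nearest pitch index.
        nearest = min(table, key=lambda p: abs(p - pitch))
        base, shift = table[nearest], 0.0
    return Features(base.er_freq + shift,
                    [f + shift for f in base.formant_freqs])
```

For example, with stored indexes at 200 Hz and 400 Hz, a request at 450 Hz reuses the 400 Hz entry with every frequency raised by 50 Hz, so no extra database entry is needed for the higher pitch.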

According to the present invention, it is possible to provide a speech synthesis database of reduced size while keeping degradation of sound quality to a minimum.

Further, according to the present invention, it is possible to provide a speech synthesizer that synthesizes a more realistic human singing voice and can sing a song naturally, without any sense of incongruity.

FIG. 1 is a block diagram showing the configuration of the speech synthesizer 1.

The speech synthesizer 1 includes a data input unit 2, a feature parameter generation unit 3, a database 4, and an EpR speech synthesis engine 5.

Input data Score supplied to the data input unit 2 is sent to the feature parameter generation unit 3 and the EpR speech synthesis engine 5. Based on the input data Score, the feature parameter generation unit 3 reads feature parameters and various templates, described later, from the database 4. It then applies the templates to the feature parameters it has read, generates the final feature parameters, and sends them to the EpR speech synthesis engine 5.

The EpR speech synthesis engine 5 generates pulses based on the pitch, dynamics, and other values in the input data Score, and synthesizes and outputs speech by applying the feature parameters to the generated pulses.

FIG. 2 is a conceptual diagram showing an example of the input data Score. The input data Score consists of a phoneme track PHT, a note track NT, a pitch track PIT, a dynamics track DYT, and an opening track OT. It is music data that stores time-varying data for a phrase of a piece of music or for the entire piece.

The phoneme track PHT contains phoneme names and their sounding durations. Each phoneme is further classified as either an articulation (Articulation), which indicates a transition between phonemes, or a stationary (Stationary), which indicates any other, steady part. Each phoneme also carries a flag indicating which of the two it is. Because an articulation is a transition, it has multiple phoneme names: a leading phoneme name and a following phoneme name. A stationary is a steady part and therefore has only one phoneme name.

The note track NT records a flag indicating one of note attack (NoteAttack), note-to-note (NoteToNote), or note release (NoteRelease). These are commands that specify the musical expression at the onset of a sound, at a change of pitch, and at the decay of a sound, respectively.

The pitch track PIT records the fundamental frequency of the speech to be sounded at each time. The pitch of the speech actually sounded is calculated from the pitch information recorded in the pitch track PIT together with other information, so the pitch actually sounded may differ from the pitch recorded here.

The dynamics track DYT records the dynamics value at each time, a parameter indicating the strength of the speech. The dynamics value ranges from 0 to 1.

The opening track OT records the opening value at each time, a parameter indicating how far the lips are open (lip opening). The opening value ranges from 0 to 1.
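The five tracks described above can be pictured as one record with per-track validation. The sketch below is only an illustration of the data layout: the class names, field names, and the rule that an articulation carries exactly two phoneme names are assumptions, while the flag names and the 0 to 1 ranges come from the description.

```python
from dataclasses import dataclass
from typing import List

# Flag names as given for the note track NT.
NOTE_FLAGS = {"NoteAttack", "NoteToNote", "NoteRelease"}


@dataclass
class PhonemeEvent:
    """One entry of the phoneme track PHT (hypothetical layout)."""
    names: List[str]    # one name if stationary, leading and following if articulation
    duration: float     # sounding duration in seconds
    articulation: bool  # True = articulation, False = stationary

    def __post_init__(self) -> None:
        # Assumption: an articulation carries exactly two phoneme names.
        expected = 2 if self.articulation else 1
        if len(self.names) != expected:
            raise ValueError(f"expected {expected} phoneme name(s), got {len(self.names)}")


@dataclass
class Score:
    """Input data Score: the five tracks, sampled per time step."""
    pht: List[PhonemeEvent]  # phoneme track
    nt: List[str]            # note track flags
    pit: List[float]         # pitch track, fundamental frequency
    dyt: List[float]         # dynamics track, each value in [0, 1]
    ot: List[float]          # opening track, each value in [0, 1]

    def __post_init__(self) -> None:
        if any(flag not in NOTE_FLAGS for flag in self.nt):
            raise ValueError("unknown note track flag")
        for name, track in (("DYT", self.dyt), ("OT", self.ot)):
            if any(not 0.0 <= v <= 1.0 for v in track):
                raise ValueError(f"{name} values must lie in [0, 1]")
```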

Based on the input data Score supplied from the data input unit 2, the feature parameter generation unit 3 reads data from the database 4. As described later, it generates feature parameters from the input data Score and the data read from the database 4, and outputs them to the EpR speech synthesis engine 5.

The feature parameters generated by the feature parameter generation unit 3 can be classified into four kinds, for example: the envelope of the excitation waveform spectrum, the excitation resonance, the formants, and the difference spectrum. These four feature parameters are obtained by decomposing the spectral envelope of the harmonic components (the original spectrum) obtained by analyzing actual human speech or the like (the original speech).

The envelope of the excitation waveform spectrum (ExcitationCurve) consists of three parameters: EGain, which represents the magnitude (dB) of the vocal cord waveform; ESlopeDepth, which represents the slope of the spectral envelope of the vocal cord waveform; and ESlope, which represents the depth (dB) from the maximum to the minimum of the spectral envelope of the vocal cord waveform. It can be expressed by the following formula (A).

The excitation resonance represents resonance in the chest. It consists of three parameters, center frequency (ERFreq), bandwidth (ERBW), and amplitude (ERAmp), and has the characteristics of a second-order filter.

The formants represent resonance in the vocal tract by combining 1 to 12 resonances. Each resonance consists of three parameters: center frequency (FormantFreq_i), bandwidth (FormantBW_i), and amplitude (FormantAmp_i), where i takes a value from 1 to 12 (1 ≦ i ≦ 12).

The difference spectrum is a feature parameter that holds the spectrum of the difference from the original spectrum, that is, the part that cannot be expressed by the three items above: the envelope of the excitation waveform spectrum, the excitation resonance, and the formants.
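The four kinds of feature parameters can be grouped into a single per-frame record. The sketch below only mirrors the parameter names given in the description; the class names, the list representation of the difference spectrum, and the validation are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExcitationCurve:
    """Envelope of the excitation waveform spectrum (three parameters)."""
    e_gain: float         # EGain
    e_slope_depth: float  # ESlopeDepth
    e_slope: float        # ESlope


@dataclass
class Resonance:
    """One second-order resonance: center frequency, bandwidth, amplitude."""
    freq: float
    bandwidth: float
    amp: float


@dataclass
class EpRFrame:
    """Hypothetical container for the feature parameters at one instant."""
    excitation: ExcitationCurve
    excitation_resonance: Resonance  # ERFreq, ERBW, ERAmp
    formants: List[Resonance]        # FormantFreq_i, FormantBW_i, FormantAmp_i
    difference_spectrum: List[float]  # assumed to be sampled on a frequency grid

    def __post_init__(self) -> None:
        # The description allows between 1 and 12 formant resonances.
        if not 1 <= len(self.formants) <= 12:
            raise ValueError("formants must contain 1 to 12 resonances")
```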

The database 4 consists of at least a Timbre database TDB, a phoneme template database PDB, and a note template database NDB.

In general, speech synthesized using only the feature parameters stored in the Timbre database TDB, which are taken from specific instants, sounds very monotonous and mechanical. In addition, when phonemes follow one another the speech in the transition actually changes gradually, so simply concatenating the steady parts of the phonemes produces very unnatural speech at the junctions. Holding phoneme templates and note templates as databases and using them during synthesis makes it possible to reduce these drawbacks.

Timbre is the tone color of a phoneme and is expressed by the feature parameters at a single instant (the set of excitation spectrum, excitation resonance, formants, and difference spectrum). FIG. 3 shows an example of the Timbre database TDB. This database is indexed by phoneme name and pitch.

The rest of this specification uses the Timbre database TDB shown in FIG. 3. However, so that feature parameters can be specified more finely, a database with four indexes (phoneme name, pitch, dynamics, and opening) may be prepared instead, as shown in FIG. 4.
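How the choice of indexes affects the number of Timbre entries can be illustrated by enumerating the keys. This is a sketch under assumed names and sample values; `timbre_keys` and the grids passed to it are hypothetical and not taken from FIG. 3 or FIG. 4.

```python
from itertools import product
from typing import List, Sequence, Tuple


def timbre_keys(phonemes: Sequence[str], pitches: Sequence[float],
                dynamics: Sequence = (None,),
                openings: Sequence = (None,)) -> List[Tuple]:
    """Enumerate database keys.

    With the defaults this gives the two-index layout (phoneme, pitch);
    passing dynamics and opening grids gives the four-index layout.
    """
    return list(product(phonemes, pitches, dynamics, openings))


# Two phonemes at two pitches: 4 entries with two indexes.
two_index = timbre_keys(["a", "i"], [220.0, 440.0])
# Adding 3 dynamics levels and 2 opening levels multiplies that to 24.
four_index = timbre_keys(["a", "i"], [220.0, 440.0],
                         dynamics=[0.2, 0.5, 0.8], openings=[0.3, 0.9])
```

The entry count is the product of the grid sizes, which is why each additional index multiplies the database size.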

The phoneme template database PDB consists of a stationary template database and an articulation template database. Here, a template is a set consisting of a sequence in which pairs of a feature parameter P and a pitch Pitch are arranged at fixed time intervals, together with the length T (sec.) of that section. It can be expressed by the following formula (B).

Here t = 0, Δt, 2Δt, 3Δt, ..., T, and in this embodiment Δt is 5 ms.

Making Δt smaller improves the time resolution and therefore the sound quality, but increases the size of the database. Conversely, making Δt larger degrades the sound quality but reduces the size of the database. Δt should be chosen by weighing the priority of sound quality against that of database size.
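The template structure and the effect of Δt on its size can be sketched as below. The `Template` class and `frame_count` helper are hypothetical names; only the contents of a template (a sequence of P and Pitch pairs plus the length T) and the 5 ms default come from the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Template:
    """A template: sampled feature parameters and pitch, plus section length T."""
    params: List[List[float]]  # P(t) for t = 0, dt, 2*dt, ..., T
    pitch: List[float]         # Pitch(t) on the same time grid
    length: float              # T in seconds
    dt: float = 0.005          # sampling interval; 5 ms in the embodiment


def frame_count(length: float, dt: float = 0.005) -> int:
    """Number of samples on the grid t = 0, dt, ..., T (both ends included)."""
    return int(round(length / dt)) + 1


frames_5ms = frame_count(1.0)          # 1 s template at 5 ms spacing
frames_10ms = frame_count(1.0, 0.010)  # same template at 10 ms spacing
```

Doubling Δt roughly halves the number of stored frames per template, which is the size side of the trade-off against time resolution.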

FIG. 5 shows an example of the stationary template database. The stationary template database holds stationary templates for all voiced phonemes, indexed by phoneme name and representative pitch. A stationary template can be obtained by using the EpR model to analyze speech in a part where the phoneme and pitch are stable.

When a single voiced sound, for example "a", is sustained at a certain pitch, for example C4, feature parameters such as pitch and formant frequency are nearly constant and can be called stationary, but in practice they fluctuate slightly. Without this fluctuation, a perfectly constant sound would be lifeless and mechanical. Put the other way round, the fluctuation is what conveys human quality and naturalness.

When synthesizing a voiced sound, naturalness can be given to it by not using the Timbre alone, that is, the feature parameters at a single instant, but adding to it the temporal variation of the feature parameters and the pitch variation held in the stationary template, which were extracted from actual human speech.

In singing voice synthesis the sounding time has to change with the length of the note, but only one sufficiently long template is prepared. When synthesizing a voiced sound longer than the template, the time axis of the template is not stretched or compressed. The template is applied from the beginning of the voiced sound with its own timing left unchanged.

When the end of the template is reached, the same template is applied again from the start, repeatedly. Alternatively, when the end of the template is reached, the template can be applied in reverse time order. With this method there is no discontinuity at the point where the templates join.

The time axis of the template is not stretched or compressed because naturalness is lost if the speed at which the feature parameters and pitch fluctuate changes greatly. Leaving it unscaled is also preferable in view of the idea that fluctuation in a steady part is not something a person controls consciously.
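The two ways of extending a stationary template without time scaling, repeating it from the start or playing it back in reverse at the end, can be sketched as an index mapping. The function names and the representation of the template as a list of per-frame variations added to a single Timbre value are assumptions for illustration.

```python
from typing import List


def template_index(frame: int, n: int, mirror: bool = False) -> int:
    """Map an output frame number to an index into a template of n frames.

    mirror=False repeats the template from its start; mirror=True plays it
    forward, then backward, and so on, so consecutive frames stay adjacent.
    """
    if n == 1:
        return 0
    if not mirror:
        return frame % n
    period = 2 * (n - 1)
    k = frame % period
    return k if k < n else period - k


def apply_stationary(timbre_value: float, deltas: List[float],
                     num_frames: int, mirror: bool = False) -> List[float]:
    """Add the template's variation to one Timbre value, frame by frame."""
    n = len(deltas)
    return [timbre_value + deltas[template_index(i, n, mirror)]
            for i in range(num_frames)]


looped = apply_stationary(100.0, [0.0, 1.0, 2.0], 7)
mirrored = apply_stationary(100.0, [0.0, 1.0, 2.0], 7, mirror=True)
```

In the looped result the value jumps from the last template frame back to the first, whereas the mirrored result moves only between neighbouring frames, which corresponds to the absence of a discontinuity at the junction.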

A stationary template does not hold the time series of the steady-part feature parameters as they are. Instead it holds a representative feature parameter of the phoneme together with the variation from it. Because the variation of the feature parameters in a steady part is small, storing variations requires less information than storing the feature parameters themselves, which reduces the size of the database.

FIG. 6 shows an example of the articulation template database. The articulation template database is indexed by leading phoneme name, following phoneme name, and representative pitch. It stores articulation templates for the phoneme combinations that can realistically occur in a given language.

An articulation template can be obtained by using the EpR model to analyze speech at a junction between phonemes where the pitch is stable.

The feature parameter P(t) may be stored as an absolute value, but a difference value can also be used. As described later, synthesis does not use the absolute values in these templates directly but the relative change of the parameters. Depending on how the template is to be applied, the feature parameters are therefore recorded, as shown in the following formulas (C1) to (C3), as the difference from P(t = T), as the difference from P(0), or as the difference from the value on the straight line connecting P(0) and P(T).

When a person pronounces two phonemes in succession, the sound does not change abruptly but shifts gradually. For example, when the vowel "e" is pronounced immediately after the vowel "a" without a break, "a" is pronounced first and the sound changes to "e" by way of a sound lying between "a" and "e".

This phenomenon is generally called coarticulation. To synthesize speech in which the junctions between phonemes sound natural, it is preferable to hold, in some form, speech information about the junction for each phoneme combination that can occur in a given language.

Methods that hold the junction between phonemes directly, in the form of LPC coefficients or speech waveforms, already exist. In this embodiment, however, the articulation (Articulation) part between two phonemes is synthesized using an articulation template that holds difference information for the feature parameters and pitch.

For example, consider synthesizing two consecutive quarter notes of the same pitch sung with the lyrics "a" and "i". At the boundary between the two notes there is a transition from "a" to "i". Both "a" and "i" are vowels and voiced sounds, so this is an articulation from V (voiced) to V (voiced), and the feature parameters of the transition can be obtained by applying an articulation template using the type 3 method described later.

That is, by reading the feature parameters of "a" and "i" from the Timbre database TDB and applying the articulation template from "a" to "i" to them, feature parameters with a natural change are obtained for the transition.

If the duration of the transition from "a" to "i" is made equal to the original duration of the articulation template applied to it, the result changes in the same way as the speech waveform that was used to create the template.

To synthesize speech that changes more slowly, or over a longer time, than the template, the template is stretched linearly in length before the feature parameter differences are added. Unlike the stationary case described earlier, the speed of the change between two phonemes is something a person can control consciously, so stretching or compressing the template linearly does not produce any great unnaturalness.
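Linear stretching of an articulation template followed by addition of its differences can be sketched as follows. The helper names are hypothetical, the template is represented as one list of per-frame differences for a single parameter, and linear interpolation between neighbouring frames is an assumed way of realizing the linear stretch.

```python
from typing import List


def stretch(values: List[float], target_len: int) -> List[float]:
    """Resample `values` to `target_len` frames by linear interpolation."""
    n = len(values)
    if n == 1 or target_len == 1:
        return [values[0]] * target_len
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)  # position on the original grid
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return out


def apply_articulation(base: float, deltas: List[float],
                       target_len: int) -> List[float]:
    """Stretch the template differences, then add them to a base value."""
    return [base + d for d in stretch(deltas, target_len)]


# A 3-frame template stretched to 5 frames and added to a base of 500.
transition = apply_articulation(500.0, [0.0, 2.0, 4.0], 5)
```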

Next, consider synthesizing two consecutive quarter notes of the same pitch sung with the lyrics "a" and "su". At the boundary between the two notes there is a short transition from "a" to the consonant part of "su". This is an articulation from V (voiced) to U (unvoiced), so the feature parameters of the transition can be obtained by applying an articulation template using the type 1 method described later.

By obtaining the feature parameters of "a" from the Timbre database TDB and applying the articulation template from "a" to "s" to them, feature parameters with a natural change are obtained for the transition.

The reason for using type 1, that is, the difference from the beginning of the template, for an articulation from V (voiced) to U (unvoiced) is simply that no pitch or feature parameters exist in the U (unvoiced) part at the end.

"Su" consists of a consonant part "s" and a vowel part "u". Between them there is also a transition in which "u" begins to sound while the "s" sound still remains. This is an articulation from U to V, so here too the articulation template is applied using the type 1 method.

By reading the feature parameters of "u" from the Timbre database TDB and applying the articulation template from "s" to "u" to them, the feature parameters of the change from "s" to "u" are obtained.

An articulation template that holds difference information for the feature parameters has the advantage of a smaller data size than a template that records the feature parameters as absolute values.
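The three difference forms mentioned for recording template values (difference from the end value P(T), from the start value P(0), and from the straight line connecting P(0) and P(T)) can be sketched for a single parameter track. The function names are hypothetical, and the sketch does not assume which form corresponds to which of the formula labels or application types.

```python
from typing import List


def diff_from_end(p: List[float]) -> List[float]:
    """P(t) - P(T): variation relative to the last frame."""
    return [v - p[-1] for v in p]


def diff_from_start(p: List[float]) -> List[float]:
    """P(t) - P(0): variation relative to the first frame."""
    return [v - p[0] for v in p]


def diff_from_line(p: List[float]) -> List[float]:
    """P(t) minus the straight line joining P(0) and P(T)."""
    n = len(p) - 1
    if n == 0:
        return [0.0]
    return [v - (p[0] + (p[-1] - p[0]) * i / n) for i, v in enumerate(p)]


track = [10.0, 14.0, 12.0]         # absolute values of one parameter
from_end = diff_from_end(track)      # zero at the last frame
from_start = diff_from_start(track)  # zero at the first frame
from_line = diff_from_line(track)    # zero at both ends
```

The line-based form is zero at both ends, so it can be added on top of an interpolation between two known Timbre values, while the start-based form needs only the value at the beginning.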

The note template database NDB includes at least a note attack template (NA template) database NADB, a note release template (NR template) database NRDB, and a note-to-note template (NN template) database NNDB.

FIG. 7 shows an example of the NA template database NADB. An NA template contains the feature parameters and pitch change information for the onset of a voice.

The NA template database NADB stores NA templates for all voiced phonemes, indexed by phoneme name and representative pitch. An NA template is obtained by analyzing the onset of speech that was actually uttered.

An NR template contains the feature parameters and pitch change information for the decay of a voice. The NR template database NRDB has the same structure as the NA template database NADB and holds NR templates for all voiced phonemes, indexed by phoneme name and representative pitch.

Analysis of the onset (Attack) of an attempt to utter a phoneme, for example "a", at a constant pitch shows that the amplitude increases gradually and then settles at a steady level. It is not only the amplitude that changes: the formant frequencies, formant bandwidths, and pitch change as well.

Applying an NA template, obtained by analyzing the onset of speech actually uttered by a person, for example "a", to the feature parameters of the steady part gives the onset the natural change found in human speech.

If an NA template is prepared for every phoneme, the change in the attack part can be given to any phoneme.

歌唱では、音楽的に表情をつけるために立ち上がりを速くしたり、ゆったりと歌う場合がある。ＮＡテンプレートは、あるひとつの立ち上がりの時間を持っているが、もともとＮＡテンプレートの持っている速さよりも速く、若しくは遅くすることは、テンプレートの時間軸を線形に伸縮してから適用することで可能になる。 In singing, the attack may be made fast or slow for musical expression. An NA template has one particular rise time, but the attack can be made faster or slower than the template's own rate by linearly expanding or contracting the template's time axis before applying it.

テンプレートを伸縮しても、数倍の範囲内ならば、アタックに不自然さは生じないことが実験によりわかっている。より広範囲のアタックの長さを指定して合成できるようにするには、数段階の長さのＮＡテンプレートを用意して、最も長さの近いテンプレートを選択して伸縮するなどの方法を使う。 Experiments show that even if the template is expanded or contracted, the attack does not cause unnaturalness within a range of several times. In order to be able to synthesize by specifying a wider range of attack lengths, a method such as preparing NA templates of several stages in length, selecting a template having the closest length, and expanding and contracting is used.
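
The linear time-axis stretching and the "closest length" selection described above can be sketched in Python as follows. This is only an illustration under assumptions: a template track is taken to be a list of values sampled at a fixed frame rate, and the names `stretch_template` and `pick_closest_length` are hypothetical, not taken from the patent.

```python
def stretch_template(values, target_len):
    """Linearly resample one template track to target_len frames.

    Assumption: `values` holds one parameter track (e.g. amplitude)
    sampled at a fixed frame rate.
    """
    if target_len < 2 or len(values) < 2:
        raise ValueError("need at least two frames on both sides")
    scale = (len(values) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        x = k * scale                      # position on the original time axis
        i = min(int(x), len(values) - 2)   # left neighbour index
        frac = x - i
        out.append(values[i] * (1.0 - frac) + values[i + 1] * frac)
    return out


def pick_closest_length(templates, target_len):
    """Return the template whose frame count is closest to target_len."""
    return min(templates, key=lambda tpl: abs(len(tpl) - target_len))
```

For instance, with a short and a long attack template available, `pick_closest_length` would choose the one needing the smaller stretch factor before `stretch_template` is applied.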

発声の終了する部分、つまり立下り（Ｒｅｌｅａｓｅ）についても、立ち上がり（Ａｔｔａｃｋ）と同様に振幅、ピッチ、フォルマントが変化する。 As for the portion where the utterance ends, that is, the fall (Release), the amplitude, pitch, and formant change in the same manner as the rise (Attack).

立下り部分に人間の音声の持つ自然な変化を与えるのは、人間が実際に発声した音声の立ち下がり部分を解析して得たＮＲテンプレートを、立下りの開始する前の音素の特徴パラメータに対して適用することで可能となる。 The natural variation of a human voice can be given to the release portion by applying an NR template, obtained by analyzing the release portion of speech actually uttered by a person, to the feature parameters of the phoneme just before the release begins.

図８は、ＮＮテンプレートデータベースＮＮＤＢの一例である。ＮＮテンプレートはピッチが変化する部分の音声の特徴パラメータを持っている。ＮＮテンプレートデータベースＮＮＤＢには、音韻名、テンプレートの始点時刻のピッチ、終了時刻のピッチをインデックスとして、すべての有声の音韻についてのＮＮテンプレートが保存されている。 FIG. 8 is an example of the NN template database NNDB. The NN template has a voice feature parameter of a portion where the pitch changes. The NN template database NNDB stores NN templates for all voiced phonemes using the phoneme name, the pitch of the start time of the template, and the pitch of the end time as indexes.

ピッチの異なる２つの音符を連続して間を置かずに歌唱するときに、前の音符の音程から、後ろの音符のピッチに滑らかにピッチを変化させながら歌う歌唱方法がある。ピッチやアンプリチュードが変化するのは当然であるが、さらに、前後２つの音符の発音が同じ（例えば同じ「あ」）だとしても、フォルマント周波数などの音声の周波数特性が微妙に変化する。 There is a singing method in which, when two notes having different pitches are sung continuously without any gap, the pitch is smoothly changed from the pitch of the previous note to the pitch of the subsequent note. Naturally, the pitch and amplitude change, but even if the two preceding and following notes have the same pronunciation (for example, the same “A”), the frequency characteristics of the sound such as the formant frequency slightly change.

実際にピッチを変化させて歌った音声の変化を始点から終点まで解析して求めたＮＮテンプレートを使うことによって、そのような音程の異なる音符の境界に、自然な音楽的表情を、与えることができる。 By using an NN template obtained by analyzing, from start point to end point, the change in a voice actually sung with a changing pitch, a natural musical expression can be given to the boundary between notes of different pitch.

実際の音楽における旋律では、２オクターブ２４音の音域としたとしても、ピッチ変化の組合せは非常に多い。しかし、実際にはピッチの絶対値が異なっていてもピッチ差が近いテンプレートで代用することができるので全ての組合せについてＮＮテンプレートを用意する必要はない。 In real melodies, even if the range is limited to two octaves (24 notes), the number of pitch-change combinations is very large. In practice, however, a template with a similar pitch difference can be substituted even when the absolute pitches differ, so NN templates need not be prepared for every combination.

ＮＮテンプレートの選択においては、後述するように、ピッチの絶対値が近いものよりも、ピッチの変化幅が近いテンプレートを優先的に選択する。選択されたＮＮテンプレートは、後述するタイプ３の方法で適用する。 In the selection of the NN template, as will be described later, a template having a close pitch change width is preferentially selected rather than a template having a close pitch absolute value. The selected NN template is applied by a type 3 method described later.

このとき、ピッチの変化幅が近いＮＮテンプレートを優先的に選ぶのは、ピッチの大きく変動する部分から作成したＮＮテンプレートには大きな値が入っている可能性があり、それをピッチの変化幅が少ない部分に適用した場合には元のＮＮテンプレートの持っている変化の形状を保てなくなり、変化が不自然になる可能性があるからである。 An NN template with a similar pitch change width is preferred because an NN template created from a portion where the pitch varies greatly may contain large values. If such a template is applied to a portion with a small pitch change, the shape of the change in the original NN template cannot be preserved and the result may sound unnatural.

なお、ある特定の音素、例えば「あ」のピッチの変化している音声から求めたＮＮテンプレートを、全ての音素のピッチ変化に代用して使うことも可能であるが、データサイズが大きくても問題がない環境であれば、音素ごとに何パターンかピッチを変化させてＮＮテンプレートを用意するほうが、より単調でない豊かな合成音声が可能となる。 An NN template obtained from the pitch-changing voice of one particular phoneme, for example “A”, can also be used in place of templates for the pitch changes of all phonemes. However, in an environment where a large data size is not a problem, preparing NN templates with several pitch-change patterns for each phoneme yields richer, less monotonous synthesized speech.

次に、データベース４に記録されているテンプレートの適用方法を説明する。テンプレートの適用とは、入力データＳｃｏｒｅ上のある区間に対して、テンプレートの時間長を伸縮して、基準点となる１つ又は複数の特徴パラメータにテンプレートの特徴パラメータの差分を加算して、Ｓｃｏｒｅのある区間と同じ時間長を持つ特徴パラメータ、ピッチの列を得ることである。具体的にはタイプ１からタイプ４までの４種類のテンプレートの適用方法がある。以下の説明ではテンプレートを｛Ｐ（ｔ），Ｐｉｔｃｈ（ｔ），Ｔ｝であらわす。 Next, the method of applying the templates recorded in the database 4 is described. Applying a template to a section of the input data Score means expanding or contracting the time length of the template and adding the differences of the template's feature parameters to one or more reference feature parameters, thereby obtaining a sequence of feature parameters and pitches with the same time length as that section of Score. Specifically, there are four template application methods, type 1 to type 4. In the following description, a template is represented by {P(t), Pitch(t), T}.

まずタイプ１によるテンプレートの適用を説明する。タイプ1は、始点指定タイプによるテンプレートの適用方法である。入力データＳｃｏｒｅの長さＴ’の区間Ｋに対するタイプ１によるテンプレートの適用は、下記式（Ｄ）に従って時刻ｔでの特徴パラメータＰ’_ｔを求めることである。なおＰ_ｔは区間Ｋの時刻ｔの特徴パラメータである。 First, template application by type 1 is described. Type 1 is the start-point-designated method of applying a template. Applying a template by type 1 to a section K of length T′ in the input data Score means obtaining the feature parameter P′_t at time t according to equation (D) below, where P_t is the feature parameter at time t in section K.

なお、時刻ｔ＝０にテンプレート及び区間Ｋの始点があるとする。この式（Ｄ）はテンプレートの始点からの変化分を時刻ｔの特徴パラメータに加算することを意味する。 The start points of the template and of section K are both taken to be at time t = 0. Equation (D) means that the change of the template from its start point is added to the feature parameter at time t.

タイプ１は、テンプレートを主にノートリリース部分の特徴パラメータに適用する場合に用いる。何故なら、ノートリリースの開始部分では、定常部分の音声が存在する為、ノートリリースの開始部分でパラメータの連続性、つまりは音声の連続性を保つ必要があり、ノートリリースの終端部は無音であるので、その必要がないからである。 Type 1 is used when the template is mainly applied to the feature parameter of the note release part. This is because, at the beginning of the note release, there is a steady part of the voice, so it is necessary to maintain the continuity of the parameters at the beginning of the note release, that is, the continuity of the voice, and the end of the note release is silent. Because there is no need.

次にタイプ２によるテンプレートの適用方法を説明する。タイプ２は、終点指定タイプによるテンプレートの適用方法である。入力データＳｃｏｒｅの長さＴ’の区間Ｋに対するタイプ２によるテンプレートの適用は、下記式（Ｅ）に従って時刻ｔでの特徴パラメータＰ’_ｔを求めることである。なおＰ_ｔは区間Ｋの時刻ｔの特徴パラメータである。 Next, template application by type 2 is described. Type 2 is the end-point-designated method of applying a template. Applying a template by type 2 to a section K of length T′ in the input data Score means obtaining the feature parameter P′_t at time t according to equation (E) below, where P_t is the feature parameter at time t in section K.

なお、時刻ｔ＝０にテンプレート及び区間Ｋの始点があるとする。この式（Ｅ）はテンプレートの終点からの変化分を時刻ｔの特徴パラメータに加算することを意味する。 The start points of the template and of section K are both taken to be at time t = 0. Equation (E) means that the change of the template relative to its end point is added to the feature parameter at time t.

タイプ２は、テンプレートを主にノートアタック部分の特徴パラメータに適用する場合に用いる。何故なら、ノートアタックの後方部分では、定常部分の音声が存在する為、ノートアタックの後方部分でパラメータの連続性、つまりは音声の連続性を保つ必要があり、ノートアタックの開始部分は無音であるので、その必要がないからである。 Type 2 is used when the template is mainly applied to the feature parameter of the note attack portion. This is because, in the rear part of the note attack, there is a steady part of the voice, so it is necessary to maintain the continuity of the parameters in the rear part of the note attack, that is, the continuity of the voice, and the start part of the note attack is silent. Because there is no need.
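
Equations (D) and (E) are not reproduced in this text. Read literally, the prose suggests P′_t = P_t + P(t·T/T′) − P(0) for type 1 and P′_t = P_t + P(t·T/T′) − P(T) for type 2. These forms are inferred from the description and are not the patent's own equations. The Python sketch below illustrates them, assuming the template has already been stretched to the same number of frames as section K. The function names are hypothetical.

```python
def apply_type1(section, template):
    """Start-point-designated application (type 1, inferred form).

    Adds the template's change from its first frame to each frame of the
    section, so the output equals the section at the start point.
    """
    if len(section) != len(template):
        raise ValueError("stretch the template to the section length first")
    start = template[0]
    return [p + (q - start) for p, q in zip(section, template)]


def apply_type2(section, template):
    """End-point-designated application (type 2, inferred form).

    Adds the template's offset from its last frame, so the output equals
    the section at the end point.
    """
    if len(section) != len(template):
        raise ValueError("stretch the template to the section length first")
    end = template[-1]
    return [p + (q - end) for p, q in zip(section, template)]
```

With this reading, type 1 leaves the first frame of the section unchanged (continuity at the start of a note release), while type 2 leaves the last frame unchanged (continuity at the end of a note attack).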

次にタイプ３によるテンプレートの適用方法を説明する。タイプ３は、両点指定タイプによるテンプレートの適用方法である。入力データＳｃｏｒｅの長さＴ’の区間Ｋに対するタイプ３によるテンプレートの適用は、下記式（Ｆ）に従って時刻ｔでの特徴パラメータＰ’_ｔを求めることである。なおＰ_ｔは区間Ｋの時刻ｔの特徴パラメータである。 Next, template application by type 3 is described. Type 3 is the both-points-designated method of applying a template. Applying a template by type 3 to a section K of length T′ in the input data Score means obtaining the feature parameter P′_t at time t according to equation (F) below, where P_t is the feature parameter at time t in section K.

なお、時刻ｔ＝０にテンプレート及び区間Ｋの始点があるとする。この式（Ｆ）はテンプレートの始点と終点を結んだ直線との差を、区間Ｋの始点と終点を結んだ直線に加算することを意味する。 The start points of the template and of section K are both taken to be at time t = 0. Equation (F) means that the difference between the template and the straight line connecting the template's start and end points is added to the straight line connecting the start and end points of section K.
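
Equation (F) is not reproduced in this text. The prose description of type 3 can be illustrated in Python as follows. This is an inferred sketch, not the patent's equation. It assumes that only the start and end values of section K are given and that the template supplies one value per output frame. The name `apply_type3` is hypothetical.

```python
def apply_type3(start_value, end_value, template):
    """Both-points-designated application (type 3, inferred form).

    The template's deviation from the straight line joining its own end
    points is added to the straight line joining the section's end points.
    """
    n = len(template)
    if n < 2:
        raise ValueError("template needs at least two frames")
    out = []
    for k, q in enumerate(template):
        u = k / (n - 1)  # normalized time, 0 at start and 1 at end
        section_line = start_value + (end_value - start_value) * u
        template_line = template[0] + (template[-1] - template[0]) * u
        out.append(section_line + (q - template_line))
    return out
```

By construction the output passes exactly through `start_value` and `end_value`, which fits the use of type 3 for note-to-note transitions where both neighbours are voiced.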

次にタイプ４によるテンプレートの適用方法を説明する。タイプ４は、ステーショナリータイプによるテンプレートの適用方法である。入力データＳｃｏｒｅの長さＴ’の区間Ｋに対するタイプ４によるテンプレートの適用は、下記式（Ｇ）に従って時刻ｔでの特徴パラメータＰ’_ｔを求めることである。なおＰ_ｔは区間Ｋの時刻ｔの特徴パラメータである。 Next, template application by type 4 is described. Type 4 is the stationary-type method of applying a template. Applying a template by type 4 to a section K of length T′ in the input data Score means obtaining the feature parameter P′_t at time t according to equation (G) below, where P_t is the feature parameter at time t in section K.

なお、時刻ｔ＝０にテンプレート及び区間Ｋの始点があるとする。この式（Ｇ）は区間Ｋに対してテンプレートの始点からの特徴パラメータの変化分を加算することをＴ毎に繰り返すことを意味する。 The start points of the template and of section K are both taken to be at time t = 0. Equation (G) means that adding the template's change in feature parameters from its start point to section K is repeated every T.

タイプ４は、主にステーショナリー部分に適用する場合に用いる。このタイプ４は、比較的長時間の音声の定常的部分に自然な揺らぎを与える効果をもっている。 Type 4 is mainly used when applied to the stationary part. This type 4 has an effect of giving natural fluctuation to a stationary part of a relatively long speech.
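
Equation (G) is not reproduced in this text. The description "repeated every T" suggests a looped application such as P′_t = P_t + P(t mod T) − P(0), which is an inferred form. The Python sketch below illustrates that reading, with frames standing in for time. The name `apply_type4` is hypothetical.

```python
def apply_type4(section, template):
    """Stationary-type application (type 4, inferred form).

    The template's change from its first frame is added to the section,
    looping back to the template start every len(template) frames.
    """
    if not template:
        raise ValueError("template must not be empty")
    start = template[0]
    n = len(template)
    return [p + (template[k % n] - start) for k, p in enumerate(section)]
```

Because the template is not stretched here, a short stationary template can cover a steady section of any length.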

図９は、特徴パラメータ発生処理を表すフローチャートである。この処理により、ある時刻ｔにおける特徴パラメータを発生させる。この特徴パラメータ発生処理を、ある一定時刻毎に時刻ｔを増加させながら、繰り返し行うことにより、フレーズ、曲といった単位の音声を合成することが出来る。 FIG. 9 is a flowchart showing the feature parameter generation process. With this process, a feature parameter at a certain time t is generated. By repeating this characteristic parameter generation process while increasing the time t at every certain time, it is possible to synthesize unit sounds such as phrases and songs.

ステップＳＡ１では、特徴パラメータ発生処理を開始して次のステップＳＡ２に進む。 In step SA1, feature parameter generation processing is started, and the process proceeds to next step SA2.

ステップＳＡ２では、入力データＳｃｏｒｅの時刻ｔにおける各トラックの値を取得する。具体的には、入力データＳｃｏｒｅ中の時刻ｔにおける音韻名、アーティキュレーション又はステーショナリーの区別、ノートアタック、ノートトゥノート又はノートリリースの区別、ピッチ、ダイナミクス値、及びオープニング値を取得する。その後次のステップＳＡ３に進む。 In step SA2, the value of each track of the input data Score at time t is acquired. Specifically, the phoneme name, the articulation/stationary distinction, the note attack/note-to-note/note release distinction, the pitch, the dynamics value, and the opening value at time t in the input data Score are acquired. The process then proceeds to step SA3.

ステップＳＡ３では、ステップＳＡ２で取得した入力データＳｃｏｒｅの各トラックの値に基づき、必要なテンプレートを音韻テンプレートデータベースＰＤＢとノートテンプレートデータベースＮＤＢから読み込む。その後次のステップＳＡ４に進む。 In step SA3, necessary templates are read from the phoneme template database PDB and the note template database NDB based on the value of each track of the input data Score acquired in step SA2. Thereafter, the process proceeds to next Step SA4.

このステップＳＡ３での音韻テンプレートの読み込みは、例えば、以下の手順で行われる。時刻ｔでの音韻がアーティキュレーションであると判断すると、アーティキュレーションテンプレートデータベースを検索して、先頭と後続の音韻名が一致して、かつピッチが一番近いテンプレートを読み込む。 The phoneme template is read in step SA3 by the following procedure, for example. If it is determined that the phoneme at time t is articulation, the articulation template database is searched to read a template whose head and subsequent phoneme names match and whose pitch is closest.

一方、時刻ｔでの音韻がステーショナリーであると判断すると、ステーショナリーテンプレートデータベースを検索して、音韻名が一致して、かつピッチが一番近いステーショナリーテンプレートを読み込む。 On the other hand, if it is determined that the phoneme at time t is stationery, the stationery template database is searched, and the stationery template with the same phoneme name and the closest pitch is read.

また、ノートテンプレートの読み込みは、以下のように行われる。例えば、時刻ｔのノートトラックがノートアタックであると判断した場合は、ＮＡテンプレートデータベースＮＡＤＢを検索して、音韻名が一致して、かつピッチが一番近いテンプレートを読み込む。 The note template is read as follows. For example, if it is determined that the note track at time t is a note attack, the NA template database NADB is searched, and the template with the same phoneme name and the closest pitch is read.

また、例えば、時刻ｔのノートトラックがノートリリースであると判断した場合は、ＮＲテンプレートデータベースＮＲＤＢを検索して、音韻名が一致して、かつピッチが一番近いテンプレートを読み込む。 Also, for example, if it is determined that the note track at time t is a note release, the NR template database NRDB is searched to read a template with the same phoneme name and the closest pitch.
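
The lookups described above (matching phoneme name, then nearest pitch) can be sketched in Python as follows. The database layout, a list of dictionaries with `phoneme` and `pitch` keys, and the name `find_template` are assumptions made for illustration. The patent does not specify a storage format.

```python
def find_template(db, phoneme, pitch_cents):
    """Return the entry with a matching phoneme name and the closest pitch.

    Assumption: each entry is a dict with 'phoneme' and 'pitch' (cents).
    Returns None when no entry has the requested phoneme name.
    """
    candidates = [entry for entry in db if entry["phoneme"] == phoneme]
    if not candidates:
        return None
    return min(candidates, key=lambda entry: abs(entry["pitch"] - pitch_cents))
```

The same helper would serve the stationary, NA, and NR lookups. The articulation lookup would additionally match the following phoneme name.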

さらに、例えば、時刻ｔのノートトラックがノートトゥノートであると判断した場合は、ＮＮテンプレートデータベースＮＮＤＢを検索して、音韻名が一致して、かつ始点ピッチと終了時刻ピッチを元に以下の式（Ｈ）で求められる距離ｄが一番近くなるテンプレートを読み込む。以下の式（Ｈ）は、周波数の変化量と平均値を重み付けして加算した値を元に距離尺度としている。 Further, for example, if the note track at time t is determined to be note-to-note, the NN template database NNDB is searched and the template is read whose phoneme name matches and for which the distance d, obtained by equation (H) below from the start-point pitch and the end-time pitch, is smallest. Equation (H) defines the distance measure as a weighted sum of the amount of frequency change and the average value.

ここで、 Here,

上記式（Ｈ）で求めた距離ｄに基づき、テンプレートを読み込むことにより、ピッチの絶対値が近いものよりも、ピッチの変化幅が近いテンプレートを優先的に選択するようにしている。 By reading a template based on the distance d obtained by equation (H), a template with a similar pitch change width is selected in preference to one whose absolute pitch is close.
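
Equation (H) and its weights are not reproduced in this text, so the following Python sketch only illustrates the stated idea: a distance built as a weighted sum of a pitch-change term and an average-pitch term, with the change term weighted more heavily. The weights `w_change` and `w_mean`, the absolute-difference form, and the dictionary layout are all assumptions.

```python
def nn_distance(tpl_start, tpl_end, tgt_start, tgt_end,
                w_change=1.0, w_mean=0.1):
    """Hypothetical distance: weighted sum of the difference in pitch
    change and the difference in mean pitch (all values in cents)."""
    change_diff = abs((tpl_end - tpl_start) - (tgt_end - tgt_start))
    mean_diff = abs((tpl_start + tpl_end) / 2 - (tgt_start + tgt_end) / 2)
    return w_change * change_diff + w_mean * mean_diff


def select_nn_template(db, phoneme, tgt_start, tgt_end):
    """Pick the matching-phoneme NN template with the smallest distance."""
    candidates = [e for e in db if e["phoneme"] == phoneme]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda e: nn_distance(e["start"], e["end"], tgt_start, tgt_end),
    )
```

With a larger weight on the change term, a template recorded an octave away but with the same interval is preferred over a template at a nearby pitch with a much larger interval.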

ステップＳＡ４では、ノートトラックの現在時刻ｔと同じ属性を持つ領域の開始時刻及び終了時刻を求め、音韻トラックがステーショナリーである場合はノートアタック、ノートトゥノート又はノートリリースの区別にしたがって、開始時刻あるいは終了時刻又は双方の特徴パラメータを取得若しくは算出する。その後次のステップＳＡ５に進む。 In step SA4, the start time and end time of an area having the same attribute as the current time t of the note track are obtained. If the phonological track is stationary, the start time or the note release is determined according to the distinction between note attack, note to note or note release. The end time or both characteristic parameters are acquired or calculated. Thereafter, the process proceeds to next Step SA5.

時刻ｔのノートトラックがノートアタックである場合には、ＴｉｍｂｒｅデータベースＴＤＢを検索して、音韻名及びノートアタック終了時刻のピッチが一致する特徴パラメータを読み込む。 If the note track at time t is a note attack, the Timbre database TDB is searched to read a feature parameter that matches the phoneme name and the pitch of the note attack end time.

ピッチが一致する特徴パラメータがないときには、音韻名が一致し、かつノートアタック終了時刻のピッチをはさむ２つの特徴パラメータを取得して、これらを補間することによりノートアタック終了時刻の特徴パラメータを算出する。補間方法の詳細は後述する。 When there is no feature parameter with the same pitch, two feature parameters with the same phoneme name and sandwiching the pitch of the note attack end time are obtained, and the feature parameter of the note attack end time is calculated by interpolating these. . Details of the interpolation method will be described later.

時刻ｔのノートトラックがノートリリースである場合には、ＴｉｍｂｒｅデータベースＴＤＢを検索して、音韻名及びノートリリース開始時刻のピッチが一致する特徴パラメータを読み込む。 When the note track at time t is a note release, the Timbre database TDB is searched to read a feature parameter whose phoneme name and pitch match those at the note release start time.

ピッチが一致する特徴パラメータがないときには、音韻名が一致し、かつノートリリース開始時刻のピッチをはさむ２つの特徴パラメータを取得して、これらを補間することによりノートリリース開始時刻の特徴パラメータを算出する。補間方法の詳細は後述する。 When there is no feature parameter having the same pitch, two feature parameters having the same phoneme name and sandwiching the pitch of the note release start time are acquired, and the feature parameter of the note release start time is calculated by interpolating these. . Details of the interpolation method will be described later.

時刻ｔのノートトラックがノートトゥノートである場合には、ＴｉｍｂｒｅデータベースＴＤＢを検索して、音韻名とノートトゥノート開始時刻のピッチが一致する特徴パラメータ及び音韻名とノートトゥノート終了時刻が一致する特徴パラメータを読み込む。 When the note track at time t is note-to-note, the Timbre database TDB is searched to read a feature parameter whose phoneme name and pitch match those at the note-to-note start time, and a feature parameter whose phoneme name and pitch match those at the note-to-note end time.

ピッチが一致する特徴パラメータがないときには、音韻名が一致し、かつノートトゥノート開始（終了）時刻のピッチをはさむ２つの特徴パラメータを取得して、これらを補間することによりノートトゥノート開始（終了）時刻の特徴パラメータを算出する。補間方法の詳細は後述する。 When there is no feature parameter with the same pitch, two feature parameters with the same phoneme name and sandwiching the pitch of the note-to-note start (end) time are obtained, and note-to-note start (end) is interpolated between them. ) Calculate the time feature parameter. Details of the interpolation method will be described later.

なお、音韻トラックがアーティキュレーションである場合は開始時刻及び終了時刻の特徴パラメータを取得若しくは算出する。この場合は、ＴｉｍｂｒｅデータベースＴＤＢを検索して、音韻名とアーティキュレーション開始時刻のピッチが一致する特徴パラメータ及び音韻名とアーティキュレーション終了時刻のピッチが一致する特徴パラメータを読み込む。 If the phoneme track is articulation, the feature parameters of the start time and end time are acquired or calculated. In this case, the Timbre database TDB is searched to read the feature parameters having the same phoneme name and the pitch of the articulation start time, and the feature parameters having the same phoneme name and the pitch of the articulation end time.

ピッチが一致する特徴パラメータがないときには、音韻名が一致し、かつアーティキュレーション開始（終了）時刻のピッチをはさむ２つの特徴パラメータを取得して、これらを補間することによりアーティキュレーション開始（終了）時刻の特徴パラメータを算出する。 When there is no feature parameter with the same pitch, two feature parameters that match the phoneme name and sandwich the pitch at the start (end) time of the articulation are obtained, and articulation start (end) is performed by interpolating these ) Calculate the time feature parameter.

ステップＳＡ５では、ステップＳＡ４で求めた始点、終了時刻の特徴パラメータとピッチに対して、ステップＳＡ３で読み込んだテンプレートを適用して、時刻ｔにおけるピッチとダイナミクスを求める。 In step SA5, the template read in step SA3 is applied to the feature parameters and pitch of the start point and end time obtained in step SA4, and the pitch and dynamics at time t are obtained.

時刻ｔのノートトラックがノートアタックならば、ノートアタック部分に対してステップＳＡ４で求めたノートアタック部分の終了時刻の特徴パラメータを使いタイプ２でＮＡテンプレートを適用する。テンプレートを適用した後の時刻ｔにおけるピッチとダイナミクス（ＥＧａｉｎ）を記憶する。 If the note track at time t is a note attack, the NA template is applied to the note attack portion by type 2, using the feature parameter at the end time of the note attack portion obtained in step SA4. The pitch and dynamics (EGain) at time t after applying the template are stored.

一方、時刻ｔのノートトラックがノートリリースならば、ノートリリース部分に対してステップＳＡ４で求めたノートリリース始点の特徴パラメータを使いタイプ１でＮＲテンプレートを適用する。テンプレートを適用した後の時刻ｔにおけるピッチとダイナミクス（ＥＧａｉｎ）を記憶する。 If instead the note track at time t is a note release, the NR template is applied to the note release portion by type 1, using the feature parameter at the note release start point obtained in step SA4. The pitch and dynamics (EGain) at time t after applying the template are stored.

また、時刻ｔのノートトラックがノートトゥノートならば、ノートトゥノート部分に対してステップＳＡ４で求めたノートトゥノートの始点及び終了時刻における特徴パラメータを使い、その区間に対してタイプ３でＮＮテンプレートを適用する。テンプレートを適用した後の時刻ｔにおけるピッチとダイナミクス（ＥＧａｉｎ）を記憶する。 If the note track at time t is note-to-note, the NN template is applied to that section by type 3, using the feature parameters at the start point and end time of the note-to-note portion obtained in step SA4. The pitch and dynamics (EGain) at time t after applying the template are stored.

さらに、時刻ｔのノートトラックが上記のいずれでもない場合には、入力データＳｃｏｒｅのピッチとダイナミクス（ＥＧａｉｎ）を記憶する。 Further, when the note track at time t is none of the above, the pitch and dynamics (EGain) of the input data Score are stored as they are.

以上のいずれかの処理を行ったら、次のステップＳＡ６に進む。 When any of the above processes is performed, the process proceeds to the next step SA6.

ステップＳＡ６では、ステップＳＡ２で求めた各トラックの値から、時刻ｔの音韻がアーティキュレーションであるか否かを判断する。アーティキュレーションである場合には、ＹＥＳの矢印で示すステップＳＡ９に進む。アーティキュレーションでない場合、すなわち時刻ｔの音韻がステーショナリーである場合には、ＮＯの矢印で示すステップＳＡ７に進む。 In step SA6, it is determined from the value of each track obtained in step SA2 whether or not the phoneme at time t is articulation. If it is articulation, the process proceeds to step SA9 indicated by a YES arrow. If it is not articulation, that is, if the phoneme at time t is stationary, the process proceeds to step SA7 indicated by a NO arrow.

ステップＳＡ７では、ステップＳＡ２で求めた時刻ｔにおける音韻名と、ステップＳＡ５で求めたピッチ、ダイナミクスをインデックスとして、ＴｉｍｂｒｅデータベースＴＤＢから特徴パラメータを読み込み補間する。読み込みと補間の方法は、ステップＳＡ４で行ったものと同様である。その後、ステップＳＡ８に進む。 In step SA7, feature parameters are read from the Timbre database TDB and interpolated using the phoneme name at time t obtained in step SA2 and the pitch and dynamics obtained in step SA5 as indexes. The reading and interpolation methods are the same as those performed in step SA4. Thereafter, the process proceeds to Step SA8.

ステップＳＡ８では、ステップＳＡ７で求めた時刻ｔにおける特徴パラメータ及びピッチに対して、ステップＳＡ３で求めたステーショナリーテンプレートをタイプ４で適用する。 In step SA8, the stationary template obtained in step SA3 is applied as type 4 to the feature parameter and pitch at time t obtained in step SA7.

このステップＳＡ８で、ステーショナリーテンプレートを適用することで、時刻ｔでの特徴パラメータ及びピッチが更新され、ステーショナリーテンプレートの持つ音声の揺らぎが加えられる。その後、ステップＳＡ１０に進む。 In step SA8, by applying the stationery template, the feature parameter and the pitch at time t are updated, and the sound fluctuation of the stationery template is added. Thereafter, the process proceeds to step SA10.

ステップＳＡ９では、ステップＳＡ４で求めたアーティキュレーション部分の開始時刻及び終了時刻の特徴パラメータに、ステップＳＡ３で読み込んだアーティキュレーションテンプレートを適用して、時刻ｔでの特徴パラメータ及びピッチを求める。その後、ステップＳＡ１０に進む。 In step SA9, the feature parameter and pitch at time t are obtained by applying the articulation template read in step SA3 to the feature parameter of the start time and end time of the articulation part obtained in step SA4. Thereafter, the process proceeds to step SA10.

ただし、テンプレートの適用方法は有声音（Ｖ）から無声音（Ｕ）への変化の場合はタイプ１で行い、無声音（Ｕ）から有声音（Ｖ）への変化の場合はタイプ２で行い、有声音（Ｖ）から有声音（Ｖ）又は無声音（Ｕ）からから無声音（Ｕ）への変化の場合はタイプ３で行う。 However, the template is applied using type 1 for a change from voiced sound (V) to unvoiced sound (U), and for a change from unvoiced sound (U) to voiced sound (V), type 2 is used. In the case of a change from a voiced sound (V) to a voiced sound (V) or from an unvoiced sound (U) to an unvoiced sound (U), the change is made with type 3.

上記のようにテンプレートの適用方法を変えるのは、有声部分での連続性を保ちつつ、テンプレートに含まれている自然な音声の変化を再現する為である。 The reason for changing the template application method as described above is to reproduce the natural change in the voice included in the template while maintaining continuity in the voiced portion.
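
The rule for choosing the application type of an articulation template from the voicing of the two phonemes can be written as a small Python helper. The function name and the boolean encoding of voiced/unvoiced are illustrative choices, not part of the patent.

```python
def articulation_type(first_voiced: bool, second_voiced: bool) -> int:
    """Return the template application type (1, 2, or 3) for a transition.

    V -> U uses type 1 (continuity kept at the voiced start),
    U -> V uses type 2 (continuity kept at the voiced end),
    V -> V and U -> U use type 3 (both end points fixed).
    """
    if first_voiced and not second_voiced:
        return 1
    if not first_voiced and second_voiced:
        return 2
    return 3
```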

ステップＳＡ１０では、ステップＳＡ８若しくはステップＳＡ９で求められた特徴パラメータに対して、ＮＡテンプレート、ＮＲテンプレート、ＮＮテンプレートのいずれかを適用する。ただし、ここでは、特徴パラメータのＥＧａｉｎに対しては、テンプレートを適用しない。その後次のステップＳＡ１１に進み、特徴パラメータ発生処理を終了する。 In step SA10, any one of the NA template, NR template, and NN template is applied to the feature parameter obtained in step SA8 or step SA9. However, here, the template is not applied to the characteristic parameter EGain. Thereafter, the process proceeds to the next step SA11, and the feature parameter generation process is terminated.

このステップＳＡ１０でのテンプレートの適用は、時刻ｔでのノートトラックがノートアタックである場合には、ステップＳＡ３で求めた、ＮＡテンプレートをタイプ２により適用して、特徴パラメータを更新する。 In the application of the template in step SA10, when the note track at time t is a note attack, the NA template obtained in step SA3 is applied by type 2 to update the feature parameter.

時刻ｔでのノートトラックがノートリリースである場合には、ステップＳＡ３で求めた、ＮＲテンプレートをタイプ１により適用して、特徴パラメータを更新する。 When the note track at time t is a note release, the feature parameter is updated by applying the NR template obtained in step SA3 by type 1.

時刻ｔでのノートトラックがノートトゥノートである場合には、ステップＳＡ３で求めた、ＮＮテンプレートをタイプ３により適用して、特徴パラメータを更新する。 If the note track at time t is note-to-note, the NN template obtained in step SA3 is applied by type 3 to update the feature parameter.

ただし上記いずれの場合にも、ここでは、特徴パラメータのＥＧａｉｎに対しては、テンプレートを適用しない。また、ピッチについても、このステップＳＡ１０の前のステップで求められたものをそのまま使用する。 In any of the above cases, however, no template is applied to the characteristic parameter EGain. As for the pitch, the pitch obtained in the step before step SA10 is used as it is.

以下に、図９のステップＳＡ４で行う特徴パラメータの補間について説明する。特徴パラメータの補間には、２つの特徴パラメータの補間と、１つの特徴パラメータからの推定がある。 Hereinafter, the feature parameter interpolation performed in step SA4 of FIG. 9 will be described. The feature parameter interpolation includes interpolation of two feature parameters and estimation from one feature parameter.

人間が音声を発声するときにピッチを変化させると声帯波形（肺からの空気と声帯の振動によって発生する音源波形）が変化することが知られており、またフォルマントもピッチによって変化することが知られている。ある特定のピッチで歌った音声から得られた特徴パラメータを他のピッチの音声を合成するときにそのまま流用した場合には、ピッチを変えても同じような声の音色になってしまい不自然になってしまう。 It is known that the vocal cord waveform (sound source waveform generated by the air from the lungs and the vocal cord vibration) changes when the pitch is changed when a human utters voice, and the formant also changes with the pitch. It has been. If the feature parameters obtained from the voice sung at a certain pitch are used as they are when synthesizing the voice at other pitches, even if the pitch is changed, a similar voice tone will be produced, which is unnatural. turn into.

それを避けるために人間の歌唱音域である２〜３オクターブの音域中、対数軸で、ほぼ等間隔で３点程度のピッチを選び、特徴パラメータをＴｉｍｂｒｅデータベースＴＤＢに保存しておく。ＴｉｍｂｒｅデータベースＴＤＢ中にあるピッチ以外のピッチの音声を合成する場合には、２つの特徴パラメータの補間（直線補間）若しくは１つの特徴パラメータからの推定（外挿）によって特徴パラメータが求められる。 In order to avoid this, pitches of about three points are selected at approximately equal intervals on the logarithmic axis in the range of 2 to 3 octaves which is a human singing range, and feature parameters are stored in the Timbre database TDB. When synthesizing speech with a pitch other than the pitch in the Timbre database TDB, the feature parameters are obtained by interpolation of two feature parameters (linear interpolation) or estimation from one feature parameter (extrapolation).

この方法によって、ピッチが変化したときの音声の特徴パラメータの変化を擬似的に表現することができる。また、ピッチの異なる特徴パラメータを３点程度持つのは、同じ音素、同じピッチの発生でもそのときによって特徴パラメータには変動があり、３点程度から補間して求めた場合とさらに細かく分割して求めた場合との差は余り意味がないからである。 This method approximates how the feature parameters of a voice change when the pitch changes. Only about three pitches are stored because the feature parameters vary from one utterance to the next even for the same phoneme at the same pitch, so the difference between interpolating from about three points and interpolating from a finer division is of little significance.
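
The linear interpolation between two stored entries can be sketched in Python as follows. It assumes that each feature parameter set is a dictionary of scalar values and that pitches are expressed in cents, so interpolation is linear on a logarithmic frequency axis. The name `interpolate_params` and the dictionary layout are hypothetical.

```python
def interpolate_params(p1, f1, p2, f2, f):
    """Linearly interpolate two feature parameter sets at pitch f (cents).

    p1 and p2 are dicts with identical keys, stored at pitches f1 and f2.
    """
    if f1 == f2:
        raise ValueError("the two stored pitches must differ")
    w = (f - f1) / (f2 - f1)  # 0 at f1, 1 at f2
    return {key: p1[key] + (p2[key] - p1[key]) * w for key in p1}
```

For a target pitch between the two stored pitches, `w` lies between 0 and 1. Values outside that range would amount to plain linear extrapolation, which is not the rule-based estimation the text describes for pitches outside the database range.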

２つの特徴パラメータの補間は、例えば、２つの特徴パラメータとそれぞれのピッチの組｛Ｐ１，ｆ１［ｃｅｎｔｓ］｝、｛Ｐ２，ｆ２［ｃｅｎｔｓ］｝が与えられたときに、時刻ｔのピッチｆ１［ｃｅｎｔｓ］における特徴パラメータを、以下の式（Ｉ）により直線補間して求めることにより行われる。 Interpolation of two feature parameters is performed, for example, as follows: given two pairs of a feature parameter and its pitch, {P1, f1 [cents]} and {P2, f2 [cents]}, the feature parameter at the pitch f1 [cents] at time t is obtained by linear interpolation using equation (I) below.

上記式（Ｉ）では、データベースのインデックスがピッチ1個だけの場合を考えたが、一般的にインデックスがＮ個ある場合でも、目標を囲む近傍のＮ＋1個のデータをもとに、以下の式（Ｉ’）を用いて、目標のインデックスｆの代理として使用する特徴パラメータを補間して求めることが出来る。なお、Ｐ_ｉは、近傍のｉ番目の特徴パラメータであり、ｆ_ｉはそのインデックスである。 Equation (I) considers the case where the database has a single index, the pitch. In general, even when there are N indexes, the feature parameter to be used for the target index f can be obtained by interpolation from the N + 1 neighboring data points surrounding the target, using equation (I′) below. Here, P_i is the i-th neighboring feature parameter and f_i is its index.

１つの特徴パラメータからの推定は、データベースに含まれるデータの音域を外れる音声の特徴パラメータを推定するときに用いる。 Estimation from one feature parameter is used to estimate the feature parameters of a voice whose pitch lies outside the range of the data in the database.

これは、データベースの音域よりもピッチの高い音声を合成する場合に、データベース中の最もピッチの高い特徴パラメータをそのまま利用すると、明らかに音質が劣化するからである。
This is because, when synthesizing speech having a pitch higher than the range of the database, if the feature parameter with the highest pitch in the database is used as it is, the sound quality is clearly degraded.

また、データベースの音域よりもピッチの低い音声を合成する場合に、最もピッチの低い特徴パラメータを利用すると同様に音質が劣化するからである。そこで本実施例では実際の音声データの観察からの知見に基づいた規則を使って、以下のように特徴パラメータを変化させて劣化を防いでいる。 In addition, when synthesizing a voice having a pitch lower than the range of the database, the sound quality is similarly deteriorated when the feature parameter having the lowest pitch is used. Therefore, in this embodiment, using the rules based on the knowledge based on observation of actual audio data, the characteristic parameters are changed as follows to prevent the deterioration.

まず、データベースの音域よりも高いピッチ（目標ピッチ）の音声を合成する場合を説明する。 First, the case where a voice having a pitch (target pitch) higher than the range of the database is synthesized will be described.

まず、目標ピッチＴａｒｇｅｔＰｉｔｃｈ［ｃｅｎｔｓ］からデータベース中の最も高いピッチＨｉｇｈｅｓｔＰｉｔｃｈ［ｃｅｎｔｓ］を引いた値ＰｉｔｃｈＤｉｆｆ［ｃｅｎｔｓ］を求める。 First, the value PitchDiff [cents] is obtained by subtracting the highest pitch in the database, HighestPitch [cents], from the target pitch TargetPitch [cents].

次に、データベースから最も高いピッチを持つ特徴パラメータを読み出して、その内の励起レゾナンス周波数ＥｐＲＦｒｅｑ及び第ｉフォルマント周波数ＦｏｒｍａｎｔＦｒｅｑ_ｉに、それぞれ上記ＰｉｔｃｈＤｉｆｆ［ｃｅｎｔｓ］を加算して、ＥｐＲＦｒｅｑ’、ＦｏｒｍａｎｔＦｒｅｑ_ｉ’に置き換えたものを目標ピッチの特徴パラメータとして使う。 Next, the feature parameters with the highest pitch are read from the database. PitchDiff [cents] is added to the excitation resonance frequency EpRFreq and to each i-th formant frequency FormantFreq_i, giving EpRFreq′ and FormantFreq_i′, and the parameter set with these replacements is used as the feature parameters for the target pitch.
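
The upward extrapolation described above can be sketched in Python as follows. The sketch assumes that the frequencies EpRFreq and FormantFreq_i are held in cents, so that "adding PitchDiff" is a plain addition. If they were stored in Hz, the equivalent operation would be a multiplication by 2^(PitchDiff/1200). The dictionary layout and the function name are hypothetical.

```python
def extrapolate_above_range(params, target_pitch, highest_pitch):
    """Shift resonance and formant frequencies up by PitchDiff (cents).

    `params` is the parameter set stored at `highest_pitch`; all other
    entries are copied unchanged.
    """
    pitch_diff = target_pitch - highest_pitch
    if pitch_diff <= 0:
        raise ValueError("target pitch must lie above the database range")
    shifted = dict(params)  # shallow copy; replaced entries are new objects
    shifted["EpRFreq"] = params["EpRFreq"] + pitch_diff
    shifted["FormantFreq"] = [f + pitch_diff for f in params["FormantFreq"]]
    return shifted
```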

次に、データベースの音域よりも低いピッチ（目標ピッチ）の音声を合成する場合を説明する。 Next, a description will be given of the case of synthesizing speech having a pitch (target pitch) lower than the range of the database.

まず、目標ピッチＴａｒｇｅｔＰｉｔｃｈ［ｃｅｎｔｓ］からデータベース中の最も低いピッチＬｏｗｅｓｔＰｉｔｃｈ［ｃｅｎｔｓ］を引いた値ＰｉｔｃｈＤｉｆｆ［ｃｅｎｔｓ］を求める。 First, a value PitchDiff [cents] is obtained by subtracting the lowest pitch LowestPitch [cents] in the database from the target pitch TargetPitch [cents].

次に、データベースから最も低いピッチを持つ特徴パラメータを読み出して、以下のようにパラメータを置き換えて目標ピッチの特徴パラメータとして用いる。 Next, the feature parameter having the lowest pitch is read from the database, and the parameter is replaced as follows and used as the feature parameter of the target pitch.

まず、励起レゾナンス周波数ＥｐＲＦｒｅｑ及び第１から第４フォルマント周波数ＦｏｒｍａｎｔＦｒｅｑ（１≦ｉ≦４）を、それぞれ下記式（Ｊ１）及び（Ｊ２）を用いて、ＥｐＲＦｒｅｑ’、ＦｏｒｍａｎｔＦｒｅｑ_i’に置き換える。 First, the excitation resonance frequency EpRFreq and the first to fourth formant frequencies FormantFreq_i (1 ≦ i ≦ 4) are replaced with EpRFreq′ and FormantFreq_i′ using equations (J1) and (J2) below, respectively.

…（Ｊ１）

…（Ｊ２）

さらに、ピッチが低くなるほどバンド幅が狭くなるように、励起レゾナンスバンド幅ＥＲＢＷ及び第１から第３フォルマントのバンド幅ＦｏｒｍａｎｔＢＷ_i（１≦ｉ≦３）をそれぞれ下記式（Ｊ３）のＥＲＢＷ’、ＦｏｒｍａｎｔＢＷ_i’に置き換える。 Further, so that the bandwidth becomes narrower as the pitch becomes lower, the excitation resonance bandwidth ERBW and the bandwidths FormantBW_i (1 ≦ i ≦ 3) of the first to third formants are replaced with ERBW′ and FormantBW_i′ according to equation (J3) below.

…（Ｊ３）

さらに、第１から第４フォルマントのアンプリチュードＦｏｒｍａｎｔＡｍｐ１〜ＦｏｒｍａｎｔＡｍｐ４を下記式（Ｊ５）〜（Ｊ８）に従いＰｉｔｃｈＤｉｆｆに比例させて大きくして、ＦｏｒｍａｎｔＡｍｐ１’〜ＦｏｒｍａｎｔＡｍｐ４’に置き換える。 Further, the amplitudes FormantAmp1 to FormantAmp4 of the first to fourth formants are increased in proportion to PitchDiff according to equations (J5) to (J8) below and replaced with FormantAmp1′ to FormantAmp4′.

…（Ｊ５）

…（Ｊ６）

…（Ｊ７）

…（Ｊ８）

さらに、スペクトル・エンベロープの傾きＥｓｌｏｐｅを下記式（Ｊ９）に従いＥｓｌｏｐｅ’に置き換える。 Further, the slope Eslope of the spectral envelope is replaced with Eslope′ according to equation (J9) below.

…（Ｊ９）

図４に示すような、ピッチ、ダイナミクス、オープニングをインデックスとしてＴｉｍｂｒｅデータベースＴＤＢを作成することが好ましいが、時間的、データベースサイズ的な制約がある場合には、本実施例のように、図３に示すような、ピッチのみをインデックスとしたデータベースを用いることになる。 It is preferable to create the Timbre database TDB with pitch, dynamics, and opening as indexes, as shown in FIG. 4. However, when there are constraints on time or database size, a database indexed by pitch alone, as shown in FIG. 3, is used, as in this embodiment.

そのような場合に、ダイナミクス関数や、オープニング関数を用いて、ピッチのみをインデックスとした特徴パラメータを変化させ、あたかも、ピッチ、ダイナミクス、オープニングをインデックスとして作成したＴｉｍｂｒｅデータベースＴＤＢを使用したかのような効果を擬似的に得る事が出来る。 In such a case, a dynamics function and an opening function can be used to modify the feature parameters indexed by pitch only, giving approximately the same effect as using a Timbre database TDB built with pitch, dynamics, and opening as indexes.

すなわち、ピッチのみを変化させて録音した音声を使用して、ピッチ、ダイナミクス、オープニングを変化させて録音した音声を使用したかのような効果を得る事が出来る。 That is, using speech recorded while varying only the pitch, an effect can be obtained as if speech recorded while varying pitch, dynamics, and opening had been used.

ダイナミクス関数及び、オープニング関数は、ダイナミクス、オープニングを変化させて発声した実際の音声と、特徴パラメータの相関関係を分析して得る事が出来る。以下に、ダイナミクス関数及び、オープニング関数の例をあげ、その適用方法を説明する。 The dynamics function and the opening function can be obtained by analyzing the correlation between the actual speech uttered by changing the dynamics and the opening, and the feature parameter. In the following, examples of dynamics functions and opening functions will be given and their application methods will be described.

図１０は、ダイナミクス関数の一例を表すグラフである。図１０（Ａ）は、関数ｆＥＧを表すグラフであり、図１０（Ｂ）は、関数ｆＥＳを表すグラフであり、図１０（Ｃ）は、関数ｆＥＳＤを表すグラフである。 FIG. 10 is a graph showing an example of a dynamics function. FIG. 10A is a graph representing the function fEG, FIG. 10B is a graph representing the function fES, and FIG. 10C is a graph representing the function fESD.

これらの、図１０（Ａ）〜（Ｃ）に示される関数ｆＥＧ、ｆＥＳ、ｆＥＳＤを利用して、ダイナミクス値を特徴パラメータＥｘｃｉｔａｔｉｏｎＧａｉｎ（ＥＧ）、ＥｘｃｉｔａｔｉｏｎＳｌｏｐｅ（ＥＳ）、ＥｘｃｉｔａｔｉｏｎＳｌｏｐｅＤｅｐｔｈ（ＥＳＤ）に反映させる。 Using these functions fEG, fES, and fESD shown in FIGS. 10A to 10C, the dynamics values are reflected in the characteristic parameters ExcitationGain (EG), ExcitationSlope (ES), and ExcitationSlopeDepth (ESD).

図１０（Ａ）〜（Ｃ）の関数ｆＥＧ、ｆＥＳ、ｆＥＳＤの入力は、全てダイナミクス値であり、０から１までの値をとる。このダイナミクス値をｄｙｎとして、関数ｆＥＧ、ｆＥＳ、ｆＥＳＤを使い、下記式（Ｋ１）〜（Ｋ３）で、特徴パラメータＥＧ’、ＥＳ’、ＥＳＤ’を求め、ダイナミクス値（ｄｙｎ）の時の特徴パラメータとして用いる。 The inputs to the functions fEG, fES, and fESD in FIGS. 10A to 10C are all dynamics values, which range from 0 to 1. With this dynamics value denoted dyn, the functions fEG, fES, and fESD are used in the following formulas (K1) to (K3) to obtain the feature parameters EG′, ES′, and ESD′, which are used as the feature parameters at the dynamics value dyn.

なお、図１０（Ａ）〜（Ｃ）の関数ｆＥＧ、ｆＥＳ、ｆＥＳＤは、一例であり、歌唱者によって様々な関数を用意することにより、より自然性を持った音声合成を行うことが出来る。 Note that the functions fEG, fES, and fESD in FIGS. 10A to 10C are only examples; preparing different functions for each singer allows more natural speech synthesis.
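The role of the dynamics functions can be sketched as below. Formulas (K1) to (K3) are not reproduced in this text, so the sketch does not claim to implement them: the way each function output is combined with the stored parameter is left to a caller-supplied `combine` callable, and the addition used in the usage lines is purely an assumption. The curve breakpoints, parameter values, and the helper names `make_curve` and `apply_dynamics` are likewise hypothetical.

```python
from bisect import bisect_right


def make_curve(points):
    """Build a piecewise-linear function from (x, y) breakpoints sorted by x."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def curve(x):
        # Clamp outside the tabulated range.
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, x)
        t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
        return ys[i - 1] + t * (ys[i] - ys[i - 1])

    return curve


def apply_dynamics(params, dyn, curves, combine):
    """Return a copy of params with the curve-driven entries adjusted for dyn."""
    if not 0.0 <= dyn <= 1.0:
        raise ValueError("dyn must lie in [0, 1]")
    out = dict(params)
    for name, curve in curves.items():
        out[name] = combine(params[name], curve(dyn))
    return out


# Hypothetical stand-ins for fEG, fES and fESD (not the curves of FIG. 10).
curves = {
    "EG": make_curve([(0.0, -12.0), (1.0, 0.0)]),
    "ES": make_curve([(0.0, 0.0), (0.5, 1.0), (1.0, 4.0)]),
    "ESD": make_curve([(0.0, 0.0), (1.0, 10.0)]),
}
params = {"EG": 60.0, "ES": 2.0, "ESD": 30.0, "Pitch": 100}
# Additive combination is an assumption made only for this illustration.
adjusted = apply_dynamics(params, 0.5, curves, lambda stored, delta: stored + delta)
```

Keeping the curves as data makes it straightforward to swap in a different set per singer, which is the variation the note above describes.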

図１１は、オープニング関数の一例を表すグラフである。図中、横軸は周波数（Ｈｚ）であり、縦軸はアンプリチュード（ｄＢ）である。 FIG. 11 is a graph showing an example of the opening function. In the figure, the horizontal axis represents frequency (Hz) and the vertical axis represents amplitude (dB).

このオープニング関数をｆＯｐｅｎ（ｆｒｅｑ）とし、オープニング値をＯｐｅｎとして、以下の式（Ｌ１）により、励起レゾナンス周波数ＥＲＦｒｅｑ’を励起レゾナンス周波数ＥＲＦｒｅｑから求め、オープニング値（Ｏｐｅｎ）のときの特徴パラメータとして用いる。 With this opening function denoted fOpen(freq) and the opening value denoted Open, the excitation resonance frequency ERFreq′ is obtained from the excitation resonance frequency ERFreq by the following formula (L1) and used as the feature parameter at the opening value Open.

また、以下の式（Ｌ２）により、ｉ番目のフォルマント周波数ＦｏｒｍａｎｔＦｒｅｑ_ｉ’をｉ番目のフォルマント周波数ＦｏｒｍａｎｔＦｒｅｑ_ｉから求め、オープニング値（Ｏｐｅｎ）のときの特徴パラメータとして用いる。 Further, the i-th formant frequency FormantFreq_i′ is obtained from the i-th formant frequency FormantFreq_i by the following formula (L2) and used as the feature parameter at the opening value Open.

これにより、周波数０〜５００Ｈｚにあるフォルマントのアンプリチュードをオープニング値に比例させて増減させることができ、合成音声に、唇開度による音声の変化を与えることが出来る。 As a result, the amplitude of formants in the 0 to 500 Hz range can be increased or decreased in proportion to the opening value, giving the synthesized speech the change in sound caused by lip opening.

なお、オープニング値を入力とする関数を歌唱者別に用意して、変化させることにより、合成音声をより多様化させることが出来る。 It should be noted that the synthesized speech can be further diversified by preparing and changing functions for which the opening value is input for each singer.

図１２は、本実施例によるテンプレートの第１の適用例を表す図である。図中（ａ）の楽譜による歌唱を本実施例により合成する場合を説明する。 FIG. 12 is a diagram illustrating a first application example of the template according to the present embodiment. The case where the singing by the score of (a) in the figure is synthesized according to the present embodiment will be described.

この楽譜は、最初の２分音符の音程は「ソ」であり、強さは「ピアノ（弱く）」で「あ」という発音である。２つ目の２分音符の音程は「ド」であり、強さは「メゾフォルテ（やや強く）」で「あ」という発音である。２つの２分音符は、レガートで接続されているので、音と音の間に切れ目がなく滑らかに接続する。 In this score, the first half note has the pitch “So”, the dynamic “piano (soft)”, and the pronunciation “a”. The second half note has the pitch “Do”, the dynamic “mezzo forte (moderately loud)”, and the pronunciation “a”. Because the two half notes are joined legato, they connect smoothly with no break between the sounds.

ここで、「ソ」から「ド」への変化の時間は、入力データ（楽譜）とともに与えられるものとする。 Here, it is assumed that the change time from “So” to “Do” is given together with the input data (music score).

まず、音符の音名から２つのピッチの周波数が得られる。その後、２つのピッチの終点と始点を直線で結んで、図中（ｂ）に示すように音符の境界部分のピッチを得ることが出来る。 First, the frequencies of the two pitches are obtained from the note names. Then the end point of the first pitch and the start point of the second are connected by a straight line to obtain the pitch in the note boundary portion, as shown in (b) of the figure.

次にダイナミクスであるが、これは、「ピアノ（弱く）」や「メゾフォルテ（やや強く）」といった強弱記号に対応した値をテーブルとして記憶しておき、これを使って数値に変換して２つの音符に対応するダイナミクス値を得る。このようにして得た２つのダイナミクス値を直線で結ぶことにより、図中（ｂ）に示すように音符の境界部分のダイナミクス値を得ることが出来る。 Next, for dynamics, values corresponding to dynamic markings such as “piano (soft)” and “mezzo forte (moderately loud)” are stored in a table, which is used to convert the markings into numerical dynamics values for the two notes. Connecting the two dynamics values obtained in this way with a straight line gives the dynamics value in the note boundary portion, as shown in (b) of the figure.
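The two steps just described (note name to frequency, dynamic marking to value, then a straight line across the boundary) can be sketched as follows. The octave placement (G4 rising to C5), the equal-tempered MIDI conversion, the dynamics table values, and the choice to interpolate in Hz are all assumptions of this sketch; the text only states that the end and start points are joined by a straight line.

```python
A4_MIDI, A4_HZ = 69, 440.0

# Hypothetical lookup tables; only the entries needed for this example.
NOTE_TO_MIDI = {"G4": 67, "C5": 72}      # "So" and "Do", octaves assumed
DYNAMICS_TABLE = {"p": 0.3, "mf": 0.6}   # illustrative values in [0, 1]


def midi_to_hz(midi_note):
    """Equal-tempered frequency of a MIDI note number."""
    return A4_HZ * 2.0 ** ((midi_note - A4_MIDI) / 12.0)


def linear_ramp(start, end, n_frames):
    """Return n_frames values running linearly from start to end inclusive."""
    if n_frames < 2:
        raise ValueError("n_frames must be at least 2")
    return [
        (1.0 - i / (n_frames - 1)) * start + (i / (n_frames - 1)) * end
        for i in range(n_frames)
    ]


# Boundary portion between the two half notes, sampled at 5 frames.
boundary_pitch = linear_ramp(
    midi_to_hz(NOTE_TO_MIDI["G4"]), midi_to_hz(NOTE_TO_MIDI["C5"]), 5
)
boundary_dyn = linear_ramp(DYNAMICS_TABLE["p"], DYNAMICS_TABLE["mf"], 5)
```

These straight-line tracks are only the starting point; the NN template is what subsequently shapes them into a natural legato transition.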

このようにして得て、ピッチと、ダイナミクス値をそのまま用いると、ピッチ、ダイナミクスが音符の境界部分で急激に変化してしまうので、レガートに接続する為、この音符の境界部分に、図中（ｂ）に示すようにＮＮテンプレートを適用する。 If the pitch and dynamics values obtained in this way were used as they are, the pitch and dynamics would change abruptly at the note boundary. Therefore, to connect the notes legato, the NN template is applied to the note boundary portion as shown in (b) of the figure.

ここでは、ピッチとダイナミクスにだけ、ＮＮテンプレートを適用して、図中（ｃ）に示すような音符の境界部分が滑らかに接続されたピッチとダイナミクスを得る。 Here, the NN template is applied only to the pitch and dynamics, yielding pitch and dynamics in which the note boundary portion is smoothly connected, as shown in (c) of the figure.

次に、図中（ｃ）に示す決定されたピッチとダイナミクス及び「あ」という音韻名をインデックスとして、ＴｉｍｂｒｅデータベースＴＤＢから、図中（ｄ）に示すような各時刻の特徴パラメータを求める。 Next, using the determined pitch and dynamics shown in (c) of the figure and the phoneme name “a” as indexes, the feature parameters at each time, as shown in (d) of the figure, are obtained from the Timbre database TDB.

ここで求めた各時刻の特徴パラメータに対して、図中（ｃ）に示す音韻名「あ」に対応するステーショナリーテンプレートを適用し、音符境界の接続部分以外の定常部分に音声の揺らぎを付加して、図中（ｅ）に示すような特徴パラメータを得る。 The stationary template corresponding to the phoneme name “a” shown in (c) of the figure is applied to the feature parameters at each time obtained here, adding voice fluctuation to the stationary portions other than the connecting portion at the note boundary, to obtain feature parameters as shown in (e) of the figure.

次に、図中（ｂ）でピッチとダイナミクスのみ適用したＮＮテンプレートの残り（フォルマント周波数など）を、図中（ｅ）に示す特徴パラメータに適用し、音符の境界部分のフォルマント周波数などに揺らぎを与えた図中（ｆ）で示す特徴パラメータを得る。 Next, the rest of the NN template (formant frequency, etc.) applied only to pitch and dynamics in (b) in the figure is applied to the feature parameters shown in (e) in the figure, and fluctuations are caused in the formant frequency, etc. at the boundary of the note. A feature parameter indicated by (f) in the given figure is obtained.

最後に、図中（ｃ）のピッチ、ダイナミクスと、図中（ｆ）の特徴パラメータを用いて、音声合成を行うことにより、図中（ａ）の楽譜で表す歌唱を合成することが出来る。 Finally, by performing speech synthesis using the pitch and dynamics in (c) in the figure and the characteristic parameters in (f) in the figure, it is possible to synthesize a song represented by the score in (a) in the figure.

なお、図１２の（ｂ）で、ＮＮテンプレートを適用する部分の時間幅は、例えば、図１３に示すように長くすることが出来る。図１３に示すように、ＮＮテンプレートを適用する部分の時間幅を長くすると、ＮＮテンプレートが伸長されて適用されるので、ゆっくりとした変化を持つ歌唱音声を合成することが出来る。 In FIG. 12B, the time width of the portion to which the NN template is applied can be increased, for example, as shown in FIG. As shown in FIG. 13, when the time width of the portion to which the NN template is applied is increased, the NN template is extended and applied, so that it is possible to synthesize a singing voice having a slow change.

また、逆に、ＮＮテンプレートを適用する時間幅を狭くすれば、早く滑らかに変化する歌唱音声を合成することが出来る。このようにＮＮテンプレートの適用時間を制御することで、変化のスピードをコントロールすることが出来る。 Conversely, if the time width during which the NN template is applied is narrowed, a singing voice that changes quickly and smoothly can be synthesized. By controlling the application time of the NN template in this way, the speed of change can be controlled.
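Stretching or compressing a template to a chosen application time can be illustrated with a simple resampler. This is a sketch under the assumption that a template track is stored as a list of per-frame values and is resampled by linear interpolation; the source does not state how the NN template is stored or resampled, and the function name `stretch_template` is hypothetical.

```python
def stretch_template(template, n_frames):
    """Resample a template track to n_frames frames by linear interpolation."""
    if len(template) < 2 or n_frames < 2:
        raise ValueError("need at least two template points and two output frames")
    scale = (len(template) - 1) / (n_frames - 1)
    out = []
    for i in range(n_frames):
        pos = i * scale
        lo = min(int(pos), len(template) - 2)  # index of the left neighbor
        frac = pos - lo
        out.append(template[lo] + frac * (template[lo + 1] - template[lo]))
    return out


slow = stretch_template([0.0, 1.0, 0.0], 5)            # longer: slower change
fast = stretch_template([0.0, 1.0, 2.0, 3.0, 4.0], 3)  # shorter: quicker change
```

The same shape is preserved in both cases; only the number of frames over which it unfolds changes, which is how the application time controls the speed of the transition.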

また、同じ時間で、ピッチをある高さから別の高さに変化させる場合でも、前半で急激に変化させ、後半はゆっくり変化させる歌い方があり、その逆もある。このように、ピッチの変化の道筋は何通りもあり、その違いは結果的に音楽的な聞こえ方の違いとなって現れる。そこで、このようなレガートの歌い方を変えて歌唱した音声から複数種類のＮＮテンプレートを作成して記録しておけば、様々なバリエーションを合成音声に持たせることが出来る。 Also, even when the pitch is changed from one height to another at the same time, there is a way of singing that is changed rapidly in the first half and slowly in the second half, and vice versa. In this way, there are various ways of changing the pitch, and the difference appears as a difference in the way of listening musically. Therefore, by creating and recording a plurality of types of NN templates from voices sung by changing the way of singing such legato, various variations can be given to the synthesized voice.

さらに、音程（ピッチ）の変化の仕方には、上記のレガート奏法以外にも様々なものがあり、それらについても別にテンプレートを作成して記録するようにしてもよい。 Further, there are various ways of changing the pitch (pitch) other than the legato playing method described above, and a template may be separately created and recorded for these.

例えば、レガートのように完全に連続的にピッチを変化させるのではなく、半音ごとにピッチを変化させたり、楽曲の長で使われる音階（例えば、ハ長調では、ドレミファソラシド）だけで飛び飛びに変化させたりする、いわゆるグリッサンド奏法がある。 For example, there is the so-called glissando technique, in which the pitch is not changed completely continuously as in legato but is changed in semitone steps, or in discrete steps using only the scale of the key of the piece (for example, do-re-mi-fa-sol-la-ti-do in C major).

この場合には、グリッサンドで実際に歌唱した音声から、ＮＮテンプレートを作成し、そのテンプレートを適用して２つの音符を滑らかに接続した歌唱を合成することが出来る。 In this case, it is possible to create an NN template from the voice actually sung by the glissando, and synthesize a song in which two notes are smoothly connected by applying the template.
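To make the difference between a legato and a glissando pitch path concrete, the sketch below snaps a continuous pitch ramp to semitones or to the notes of a scale. It is only an illustration of the stepwise path: in the source, the glissando NN template is obtained from actually sung glissando, not computed this way. Representing pitch as fractional MIDI note numbers and the helper names are assumptions.

```python
import math

C_MAJOR = (0, 2, 4, 5, 7, 9, 11)  # pitch classes of C major, semitones above C


def snap_to_semitone(midi_pitch):
    """Round a fractional MIDI pitch to the nearest semitone (halves round up)."""
    return float(math.floor(midi_pitch + 0.5))


def snap_to_scale(midi_pitch, scale=C_MAJOR):
    """Snap to the nearest note whose pitch class is in scale (ties go lower)."""
    base = int(math.floor(midi_pitch))
    candidates = [n for n in range(base - 12, base + 13) if n % 12 in scale]
    return float(min(candidates, key=lambda n: (abs(n - midi_pitch), n)))


# A continuous ramp from G4 (67) to C5 (72), as in a legato transition.
ramp = [67.0, 68.25, 69.5, 70.75, 72.0]
chromatic = [snap_to_semitone(p) for p in ramp]
diatonic = [snap_to_scale(p) for p in ramp]
```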

なお、本実施例では、ＮＮテンプレートは、同じ音韻でピッチが変化している場合だけを作成して記録しているが、例えば、「あ」から「え」のように違う音韻でピッチが変化している場合についても作成することができる。この場合は、ＮＮテンプレートの数が多くなってしまうが、実際の歌唱により近づけることが出来る。 In this embodiment, the NN template is created and recorded only when the pitch changes with the same phoneme. For example, the pitch changes with a different phoneme such as “A” to “E”. You can also create a case. In this case, although the number of NN templates will increase, it can be brought closer to actual singing.

図１４は、本実施例によるテンプレートの第２の適用例を表す図である。図中（ａ）の楽譜による歌唱を本実施例により合成する場合を説明する。 FIG. 14 is a diagram illustrating a second application example of the template according to the present embodiment. The case where the singing by the score of (a) in the figure is synthesized according to the present embodiment will be described.

この楽譜は、最初の２分音符の音程は「ソ」であり、強さは「ピアノ（弱く）」で「あ」という発音である。２つ目の２分音符の音程は「ド」であり、強さは「メゾフォルテ（やや強く）」で「え」という発音である。 In this score, the first half note has the pitch “So”, the dynamic “piano (soft)”, and the pronunciation “a”. The second half note has the pitch “Do”, the dynamic “mezzo forte (moderately loud)”, and the pronunciation “e”.

ここで、「あ」から「え」へのアーティキュレーションの時間は、２つの音素の組合せ毎に固定値として設定しておくか、又は入力データとともに与えられるものとする。 Here, the articulation time from “A” to “E” is set as a fixed value for each combination of two phonemes, or given together with input data.

まず、音符の音名から２つのピッチの周波数が得られる。その後、２つのピッチの終点と始点を直線で結んで、図中（ｂ）に示すように音符の境界部分（アーティキュレーション部分）のピッチを得ることが出来る。 First, the frequencies of the two pitches are obtained from the note names. Then the end point of the first pitch and the start point of the second are connected by a straight line to obtain the pitch in the note boundary portion (articulation portion), as shown in (b) of the figure.

次に、図中（ｂ）に示す決定されたピッチとダイナミクス及び「あ」、「え」という音韻名をインデックスとして、ＴｉｍｂｒｅデータベースＴＤＢから、図中（ｃ）に示すような各時刻の特徴パラメータを求める。ただし、アーティキュレーション部分の特徴パラメータは、仮に音韻「あ」の終点部分と、音韻「え」の始点部分を直線補間した値である。 Next, using the determined pitch and dynamics shown in (b) of the figure and the phoneme names “a” and “e” as indexes, the feature parameters at each time, as shown in (c) of the figure, are obtained from the Timbre database TDB. For the articulation portion, however, the feature parameters are provisionally set to values linearly interpolated between the end point of the phoneme “a” and the start point of the phoneme “e”.
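The provisional linear interpolation across the articulation portion can be sketched as follows. The parameter names follow the document's FormantFreq naming, but the numeric values for “a” and “e” are illustrative placeholders, not measured data, and the function name `interpolate_params` is hypothetical.

```python
def interpolate_params(end_of_first, start_of_second, n_frames):
    """Linearly interpolate every shared feature parameter over n_frames frames."""
    if n_frames < 2:
        raise ValueError("n_frames must be at least 2")
    keys = end_of_first.keys() & start_of_second.keys()
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # 0 at the end of the first phoneme, 1 at the start of the second
        frames.append(
            {k: (1.0 - t) * end_of_first[k] + t * start_of_second[k] for k in keys}
        )
    return frames


# Placeholder feature parameters at the end of "a" and the start of "e".
a_end = {"FormantFreq1": 800.0, "FormantFreq2": 1200.0}
e_start = {"FormantFreq1": 500.0, "FormantFreq2": 1900.0}
frames = interpolate_params(a_end, e_start, 3)
```

These interpolated values serve only as the baseline onto which the articulation template is then applied.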

次に、図中（ｃ）に示すように、「あ」のステーショナリーテンプレート、「あ」から「え」へのアーティキュレーションテンプレート、「え」のステーショナリーテンプレートを先に求めた、特徴パラメータのそれぞれの該当部分に適用し、図中（ｄ）に示すような特徴パラメータを得る。 Next, as shown in (c) of the figure, the stationary template for “a”, the articulation template from “a” to “e”, and the stationary template for “e” are applied to the corresponding portions of the previously obtained feature parameters, giving feature parameters as shown in (d) of the figure.

最後に、図中（ｂ）のピッチ、ダイナミクスと、（ｄ）の特徴パラメータを使って、音声合成を行う。 Finally, speech synthesis is performed using the pitch and dynamics of (b) in the figure and the characteristic parameters of (d).

このようにすると、人間が実際に発声する場合と同様に、自然に「あ」から「え」に変化する歌唱音声を合成することが出来る。 In this way, it is possible to synthesize a singing voice that naturally changes from “a” to “e”, as in the case where a person actually utters.

なお、アーティキュレーションテンプレートも、ＮＮテンプレートの場合と同様に、境界部分（アーティキュレーション部分）の長さを楽譜とともに与えられるようにしておけば、「あ」から「え」へのアーティキュレーションの時間を制御することができ、ゆっくりと変化する音声や、早く変化する音声を、１つのテンプレートを伸縮することで合成できる。すなわち、こうすることで、音韻の変化する時間を制御することが出来る。 As with the NN template, if the length of the boundary portion (articulation portion) is supplied together with the score, the articulation time from “a” to “e” can be controlled, and both slowly changing and quickly changing speech can be synthesized by stretching or compressing a single template. In other words, the time over which the phoneme changes can be controlled.

図１５は、本実施例によるテンプレートの第３の適用例を表す図である。図中（ａ）の楽譜による歌唱を本実施例により合成する場合を説明する。 FIG. 15 is a diagram illustrating a third application example of the template according to the present embodiment. The case where the singing by the score of (a) in the figure is synthesized according to the present embodiment will be described.

この楽譜は、音程が「ソ」で、発音は「あ」である全音符の強さを立ち上がりから次第に強くしていき、立下りで次第に弱くしていくものである。 In this score, a whole note with the pitch “So” and the pronunciation “a” becomes gradually louder from its onset and gradually softer toward its end.

この楽譜の場合は、ピッチ、ダイナミクスは図中（ｂ）に示すように平坦である。これらのピッチ、ダイナミクスの先頭にＮＡテンプレートを適用し、さらに音符の最後にＮＲテンプレートを適用して、図中（ｃ）で示すようなピッチとダイナミクスを求めて、決定する。 In the case of this score, the pitch and dynamics are flat as shown in (b) of the figure. The NA template is applied to the beginning of this pitch and dynamics, and the NR template is applied to the end of the note, to determine the pitch and dynamics as shown in (c) of the figure.

なお、ＮＡテンプレート及びＮＲテンプレートを適用する長さは、クレッシェンド記号及びデクレッシェンド記号自身に長さを持たせて入力されているものとする。 It should be noted that the lengths to which the NA template and the NR template are applied are input with the crescendo and decrescendo symbols themselves having a length.

次に、決定した図中（ｃ）のピッチ、ダイナミクス及び音韻名「あ」をインデックスとして、図中（ｄ）に示すようにアタックでもリリースでもない通常部分の特徴パラメータが求められる。 Next, using the determined pitch, dynamics and phoneme name “A” in the figure (c) as an index, the characteristic parameters of the normal part which is neither an attack nor a release are obtained as shown in the figure (d).

さらに、図中（ｄ）に示す通常部分の特徴パラメータに、ステーショナリーテンプレートを適用して、図中（ｅ）に示すような、揺らぎが与えられた特徴パラメータを求める。この（ｅ）の特徴パラメータを元に、アタック部分とリリース部分の特徴パラメータを求める。 Further, a stationary template is applied to the characteristic parameter of the normal part shown in (d) in the figure to obtain a characteristic parameter given fluctuation as shown in (e) in the figure. Based on the feature parameter (e), the feature parameters of the attack portion and the release portion are obtained.

アタック部分の特徴パラメータは、通常部分の始点（アタック部分の終点）に対して、音韻「あ」のＮＡテンプレートを前述のタイプ２の方法で適用して求める。 The characteristic parameter of the attack part is obtained by applying the NA template of the phoneme “A” to the start point of the normal part (end point of the attack part) by the above-described type 2 method.

リリース部分の特徴パラメータは、通常部分の終点（リリース部分の始点）に対して、音韻「あ」のＮＲテンプレートを前述のタイプ１の方法で適用して求める。 The characteristic parameter of the release part is obtained by applying the NR template of the phoneme “A” to the end point of the normal part (start point of the release part) by the type 1 method described above.

このようにして、アタック部分、通常部分、リリース部分の特徴パラメータが、図中（ｆ）のように求められる。この特徴パラメータと、（ｃ）のピッチ、ダイナミクスを使用して、音声を合成することで、（ａ）の楽譜によるクレッシェンド、デクレッシェンドで歌った歌唱音声を得ることが出来る。 In this way, the feature parameters of the attack portion, the normal portion, and the release portion are obtained as shown in (f) of the figure. By synthesizing speech from these feature parameters and the pitch and dynamics of (c), a singing voice sung with the crescendo and decrescendo of the score in (a) can be obtained.

以上、本実施例に拠れば、実際の人間の歌唱音声を分析して得られる音韻テンプレートを用いて、特徴パラメータに変動を与えるので、歌唱音声の持っている母音を長く伸ばした部分や、音韻が変化する部分の特徴を反映した自然な合成音声を生成することが出来る。 As described above, according to this embodiment, the feature parameters are varied using phoneme templates obtained by analyzing actual human singing, so natural synthesized speech can be generated that reflects the characteristics of sustained vowels and of the portions where the phoneme changes.

また、本実施例に拠れば、実際の人間の歌唱音声を分析して得られるノートテンプレートを用いて、特徴パラメータに変動を与えるので、単なる音量の違いだけでない、音楽的な強弱の表現力を持った合成音声を生成することが出来る。 In addition, according to the present embodiment, since the feature parameter is changed using the note template obtained by analyzing the actual human singing voice, not only the difference in volume but also the expressive power of musical strength is obtained. You can generate synthesized speech.

さらに、本実施例に拠れば、ピッチ、ダイナミクス、オープニングなどの音楽表現度を細かく変化させたデータを用意しなくても、他に用意されているデータを補間して、用いることが出来るので、少ないサンプルですみ、データベースのサイズを小さくすることが出来るとともに、データベースの作成時間を短縮することが出来る。 Furthermore, according to the present embodiment, it is possible to interpolate and use other prepared data without preparing data with finely changed music expression such as pitch, dynamics, and opening. With a small number of samples, the database size can be reduced and the database creation time can be shortened.

さらに、また、本実施例に拠れば、音楽表現度として、ピッチのみをインデックスとしたデータベースを使用したとしても、オープニング及びダイナミクス関数を用いて、擬似的にピッチ、オープニング、ダイナミクスの３つの音楽表現度をインデックスとして持つデータベースを使用した場合に近い効果を得る事が出来る。 Furthermore, according to this embodiment, even if a database using only the pitch as an index is used as the music expression level, three music expressions of pitch, opening, and dynamics are simulated using the opening and dynamics functions. You can get an effect close to using a database with the degree as an index.

なお、本実施例では、図２に示したように、入力データＳｃｏｒｅとして、音韻トラックＰＨＴ、ノートトラックＮＴ、ピッチトラックＰＩＴ、ダイナミクストラックＤＹＴ、オープニングトラックＯＴを入力したが、入力データＳｃｏｒｅの構成はこれに限られない。 In this embodiment, as shown in FIG. 2, the phoneme track PHT, the note track NT, the pitch track PIT, the dynamics track DYT, and the opening track OT are input as the input data Score, but the configuration of the input data Score is It is not limited to this.

例えば、図２の入力データＳｃｏｒｅに、ビブラートトラックを追加して用意してもよい。ビブラートトラックには、０〜１のビブラート値が記録されている。 For example, a vibrato track may be added to the input data Score shown in FIG. In the vibrato track, vibrato values of 0 to 1 are recorded.

この場合、データベース４には、ビブラート値を引数として、ピッチ、ダイナミクスの時系列を返す関数、若しくはテーブルをビブラートテンプレートとして保存しておく。 In this case, the database 4 stores, as a vibrato template, a function or table that returns a time series of pitch and dynamics using a vibrato value as an argument.

そして、図４のステップＳＡ５のピッチ、ダイナミクスの計算において、このビブラートテンプレートを適用することで、ビブラート効果を与えたピッチ、ダイナミクスを得る事が出来る。 Then, by applying this vibrato template in the calculation of the pitch and dynamics in step SA5 in FIG. 4, it is possible to obtain the pitch and dynamics giving the vibrato effect.

ビブラートテンプレートは、実際の人間の歌唱音声を分析することで得る事が出来る。 The vibrato template can be obtained by analyzing the actual human singing voice.
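A vibrato template of the "function" kind described above could look like the sketch below: it takes the vibrato value (0 to 1) and returns time series of pitch and dynamics offsets. The sinusoidal shape, the rate, the maximum depths, and the frame rate are all assumptions standing in for a template that, per the source, would be derived from analysis of real singing.

```python
import math


def vibrato_template(vibrato, n_frames, frame_rate_hz=100.0, rate_hz=5.0,
                     max_pitch_depth_cents=50.0, max_dyn_depth=0.05):
    """Return (pitch_offsets_cents, dynamics_offsets) for a vibrato value in [0, 1]."""
    if not 0.0 <= vibrato <= 1.0:
        raise ValueError("vibrato must lie in [0, 1]")
    pitch_offsets, dyn_offsets = [], []
    for i in range(n_frames):
        s = math.sin(2.0 * math.pi * rate_hz * i / frame_rate_hz)
        pitch_offsets.append(vibrato * max_pitch_depth_cents * s)
        dyn_offsets.append(vibrato * max_dyn_depth * s)
    return pitch_offsets, dyn_offsets


def apply_offsets(track, offsets):
    """Add per-frame offsets to a pitch or dynamics track of the same length."""
    return [value + offset for value, offset in zip(track, offsets)]


# Flat pitch track (in cents) with half-strength vibrato applied.
pitch_track = [700.0] * 20
pitch_offsets, dyn_offsets = vibrato_template(0.5, 20)
pitch_with_vibrato = apply_offsets(pitch_track, pitch_offsets)
```

A table-based template would replace the sine with stored offset curves while keeping the same interface.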

なお、本実施例は歌唱音声合成を中心に説明したが、歌唱音声に限られるものではなく、通常の会話の音声や楽器音なども同様に合成することができる。 Although this embodiment has been described mainly in terms of singing voice synthesis, it is not limited to singing voices; ordinary conversational speech, instrument sounds, and the like can be synthesized in the same way.

なお、本実施例は、本実施例に対応するコンピュータプログラム等をインストールした市販のコンピュータ等によって、実施させるようにしてもよい。 This embodiment may also be implemented on a commercially available computer or the like on which a computer program or the like corresponding to this embodiment is installed.

その場合には、本実施例に対応するコンピュータプログラム等を、ＣＤ−ＲＯＭやフロッピーディスク等の、コンピュータが読み込むことが出来る記憶媒体に記憶させた状態で、ユーザに提供してもよい。 In that case, the computer program or the like corresponding to the present embodiment may be provided to the user while being stored in a storage medium that can be read by the computer, such as a CD-ROM or a floppy disk.

そのコンピュータ等が、ＬＡＮ、インターネット、電話回線等の通信ネットワークに接続されている場合には、通信ネットワークを介して、コンピュータプログラムや各種データ等をコンピュータ等に提供してもよい。 When the computer or the like is connected to a communication network such as a LAN, the Internet, or a telephone line, a computer program or various data may be provided to the computer or the like via the communication network.

以上実施例に沿って本発明を説明したが、本発明はこれらに制限されるものではない。例えば、種々の変更、改良、組合せ等が可能なことは当業者に自明であろう。 Although the present invention has been described with reference to the embodiments, the present invention is not limited thereto. It will be apparent to those skilled in the art that various modifications, improvements, combinations, and the like can be made.

本発明の実施例による音声合成装置１の構成を表すブロック図である。 FIG. 1 is a block diagram showing the configuration of the speech synthesizer 1 according to an embodiment of the present invention.

入力データＳｃｏｒｅの一例を示す概念図である。 FIG. 2 is a conceptual diagram showing an example of the input data Score.

ＴｉｍｂｒｅデータベースＴＤＢの一例である。 FIG. 3 shows an example of the Timbre database TDB.

ＴｉｍｂｒｅデータベースＴＤＢの他の例である。 FIG. 4 shows another example of the Timbre database TDB.

ステーショナリーテンプレートデータベースの一例である。 FIG. 5 shows an example of the stationary template database.

アーティキュレーションテンプレートデータベースの一例である。 FIG. 6 shows an example of the articulation template database.

ＮＡテンプレートデータベースＮＡＤＢの一例である。 FIG. 7 shows an example of the NA template database NADB.

ＮＮテンプレートデータベースＮＮＤＢの一例である。 FIG. 8 shows an example of the NN template database NNDB.

特徴パラメータ発生処理を表すフローチャートである。 FIG. 9 is a flowchart showing the feature parameter generation process.

ダイナミクス関数の一例を表すグラフである。 FIG. 10 is a graph showing an example of a dynamics function.

オープニング関数の一例を表すグラフである。 FIG. 11 is a graph showing an example of an opening function.

本実施例によるテンプレートの第１の適用例を表す図である。 FIG. 12 is a diagram showing a first application example of the template according to this embodiment.

本実施例によるテンプレートの第１の適用例の変形例を表す図である。 FIG. 13 is a diagram showing a modification of the first application example of the template according to this embodiment.

本実施例によるテンプレートの第２の適用例を表す図である。 FIG. 14 is a diagram showing a second application example of the template according to this embodiment.

本実施例によるテンプレートの第３の適用例を表す図である。 FIG. 15 is a diagram showing a third application example of the template according to this embodiment.

Explanation of symbols

１…音声合成装置、２…データ入力部、３…特徴パラメータ発生部、４…データベース、５…ＥｐＲ音声合成エンジン 1 ... speech synthesizer, 2 ... data input unit, 3 ... feature parameter generation unit, 4 ... database, 5 ... EpR speech synthesis engine

Claims

1. A speech synthesizer comprising:
storage means for storing, with phoneme and pitch as indexes, a feature quantity of speech at a specific time, the feature quantity including at least one of an excitation resonance and a formant;
template storage means for storing, with phoneme and pitch as indexes, a template representing temporal changes in pitch and in the speech feature quantity;
input means for inputting speech information for speech synthesis including at least a pitch and a phoneme;
reading means for reading the speech feature quantity and the template from the storage means and the template storage means, respectively, according to the input speech information; and
speech synthesis means for applying the read template to the read speech feature quantity and to the pitch included in the input speech information, and synthesizing speech based on the speech feature quantity and the pitch after the application,
wherein, when the pitch included in the input speech information exceeds the value of the highest index in the storage means, the reading means obtains a pitch difference by subtracting the pitch of the highest index stored in the storage means from the pitch included in the input speech information, and outputs to the speech synthesis means a feature quantity obtained by adding the pitch difference to the frequency of the excitation resonance or the formant included in the feature quantity read from the storage means using the highest pitch index.

2. A speech synthesizer comprising:
storage means for storing, with phoneme and pitch as indexes, a feature quantity of speech at a specific time, the feature quantity including at least one of an excitation resonance and a formant;
template storage means for storing, with phoneme and pitch as indexes, a template representing temporal changes in pitch and in the speech feature quantity;
input means for inputting speech information for speech synthesis including at least a pitch and a phoneme;
reading means for reading the speech feature quantity and the template from the storage means and the template storage means, respectively, according to the input speech information; and
speech synthesis means for applying the read template to the read speech feature quantity and to the pitch included in the input speech information, and synthesizing speech based on the speech feature quantity and the pitch after the application,
wherein, when the pitch included in the input speech information is lower than the value of the lowest index in the storage means, the reading means obtains a pitch difference by subtracting the pitch of the lowest index stored in the storage means from the pitch included in the input speech information, and outputs to the speech synthesis means a feature quantity obtained by adding a specified ratio of the pitch difference to the frequency of the excitation resonance or the formant included in the feature quantity read from the storage means using the lowest pitch index.

3. The speech synthesizer according to claim 2, wherein the speech feature quantity stored in the storage means includes an excitation resonance, and
the reading means corrects and outputs the excitation resonance so that the bandwidth of the excitation resonance becomes narrower as the pitch included in the input speech information becomes lower.

4. The speech synthesizer according to claim 2, wherein the speech feature quantity stored in the storage means includes a formant, and
the reading means corrects and outputs the formant so that the amplitude of the formant becomes larger as the pitch included in the input speech information becomes lower.