JP5699496B2 - Stochastic model generation device for sound synthesis, feature amount locus generation device, and program - Google Patents


Info

Publication number
JP5699496B2
Authority
JP
Japan
Prior art keywords
transition
note
model
sound
array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP2010198710A
Other languages
Japanese (ja)
Other versions
JP2012058306A (en)
Inventor
慶二郎 才野 (Keijiro Saino)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corp
Priority to JP2010198710A
Publication of JP2012058306A
Application granted
Publication of JP5699496B2
Expired - Fee Related


Description

The present invention relates to the generation of a probability model representing a time series of an acoustic feature quantity (for example, pitch or power), and to the generation of a time series of the feature quantity using such a probability model. A feature-quantity time series generated from the probability model is suitably used for synthesizing sounds such as singing voices.

By imparting to a synthesized sound a feature-quantity variation that approximates that of a recorded sound (hereinafter "reference sound"), it is possible to generate a synthesized sound that is aurally natural. For example, Non-Patent Document 1 discloses a technique for generating a synthesized sound using a probability model (for example, an HMM (Hidden Markov Model)) that represents the time series of the pitch of a reference sound. Specifically, the reference sound is divided into a plurality of note intervals, one per note, and a probability model is generated for each note by a learning process applied to the time series of pitches within each note interval.

Shinji Sako, Keijiro Saino, Yoshihiko Nankaku, Keiichi Tokuda and Tadashi Kitamura, "A Singing Voice Synthesis System Capable of Automatically Learning Voice Quality and Singing Style," IPSJ SIG Technical Report [Music and Computer], 2008(12), pp. 39-44, February 2008

FIG. 13 is a schematic diagram showing the relationship between the pitch P of a reference sound containing the singing of a piece of music and the pitch of each note V (V1, V2, V3) of that piece (that is, the target value of the pitch P). As shown in parts (A) and (B) of FIG. 13, the transition of the pitch P of the reference sound can differ, for example according to the singing expression, even when the sequence of notes V is the same. For example, in part (A) of FIG. 13, the pitch P of the reference sound temporarily drops before and after the boundary between note V1 and note V2 (the so-called "scoop" singing expression), whereas in part (B) the pitch P is kept substantially constant from note V1 through note V2.

With the technique of Non-Patent Document 1, a probability model is generated for each note by a learning process applied to the pitch time series of every note interval of the reference sound that shares that note. In the case illustrated in FIG. 13, for example, the pitch P within the note-V2 interval of both part (A) and part (B) is applied to the generation of the probability model for note V2, despite the difference in the transition of the pitch P described above. Consequently, a probability model expressing a pitch-P transition intermediate between part (A) and part (B) is generated. When a probability model that does not faithfully reflect the characteristics of the actual reference sound is used in this way, there is the problem that an aurally unnatural synthesized sound is generated.
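A small numerical illustration of this failure mode (the contours and the 60-semitone target are invented for the example, not taken from the patent): training one per-note model on both a scooped rendition and a flat rendition effectively averages the two contours, producing a half-depth dip that occurs in neither actual performance.

```python
# Hypothetical illustration: frame-wise averaging of two renditions of the
# same note, which is what a single per-note model effectively learns.

def average_contour(contour_a, contour_b):
    """Frame-wise mean of two equally long pitch contours (in semitones)."""
    return [(a + b) / 2 for a, b in zip(contour_a, contour_b)]

# Rendition (A): a "scoop" -- pitch dips below the 60-semitone target at the
# note boundary, then recovers.
scoop = [60.0, 58.0, 56.0, 58.0, 60.0]
# Rendition (B): pitch held flat at the target.
flat = [60.0, 60.0, 60.0, 60.0, 60.0]

mixed = average_contour(scoop, flat)
print(mixed)  # [60.0, 59.0, 58.0, 59.0, 60.0] -- a dip present in neither take
```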

Although the above description takes as an example a probability model expressing the transition of the pitch, the same problem can arise with probability models of other feature quantities (for example, power). In view of these circumstances, an object of the present invention is to generate a probability model that faithfully reflects the transitions of a feature quantity in a reference sound, and thereby to generate an aurally natural synthesized sound.

The means adopted by the present invention to solve the above problems will now be described. To facilitate understanding of the present invention, the following description notes in parentheses the correspondence between elements of the present invention and elements of the embodiments described later; this is not intended to limit the scope of the present invention to the exemplified embodiments.

A probability model generation apparatus for sound synthesis according to a first aspect of the present invention comprises: section setting means (for example, a section setting unit 34) that divides a reference sound into unit sections, one per transition type, according to the tendency of variation of a feature quantity (for example, a reference pitch Pref); and probability model generation means (for example, a probability model generation unit 421) that generates, for each transition type, a feature-quantity model (for example, a feature-quantity model QA) indicating a probability distribution of the feature quantity for each of a plurality of states (for example, states St), from the time series of the feature quantity in the unit sections of that transition type within the reference sound. In this configuration, because a feature-quantity model is generated for each transition type of the reference sound, differences in the variation tendency of the feature quantity of the reference sound are faithfully reflected in the feature-quantity models. Compared with a configuration that generates feature-quantity models from the reference sound without regard to differences in transition type, this has the advantage that feature-quantity models capable of producing an aurally natural synthesized sound faithfully reflecting the characteristics of the reference sound can be generated.
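A minimal sketch of this first aspect, not the patent's implementation: unit sections of the reference pitch are grouped by transition type ("B", "S", "E"), each section's frames are divided evenly among a fixed number of states, and a Gaussian (mean, variance) per state is estimated per type. The state count and the training values are invented for illustration.

```python
# Per-transition-type feature models: (type, state) -> (mean, variance).
from collections import defaultdict
from statistics import mean, pvariance

N_STATES = 3  # states per model; value chosen for illustration

def fit_feature_models(sections):
    """sections: list of (transition_type, [pitch frames]) pairs."""
    frames_per_state = defaultdict(list)  # (type, state) -> frames
    for ttype, pitches in sections:
        step = len(pitches) / N_STATES
        for i, p in enumerate(pitches):
            state = min(int(i / step), N_STATES - 1)
            frames_per_state[(ttype, state)].append(p)
    return {key: (mean(v), pvariance(v)) for key, v in frames_per_state.items()}

models = fit_feature_models([
    ("S", [60.0, 60.2, 60.1, 59.9, 60.0, 60.1]),  # sustain: flat around 60
    ("E", [60.0, 59.0, 58.0, 57.0, 56.0, 55.0]),  # end: falling away
])
# models[("S", 0)], models[("E", 2)], ... hold (mean, variance) per state
```

Because the sustain and end sections are fitted separately, the falling end contour no longer contaminates the sustain statistics.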

A probability model generation apparatus for sound synthesis according to a preferred example of the first aspect comprises feature-quantity classification means (for example, a feature-quantity classification unit 423) that classifies the plurality of feature-quantity models generated by the probability model generation means into a plurality of sets, and generates feature-quantity information including a feature-quantity decision tree (for example, a feature-quantity decision tree TA) constructed by the classification and a feature-quantity model (for example, a feature-quantity model MA) generated for each set from the feature-quantity models classified into that set. In this configuration, for each of the plurality of sets into which the feature-quantity models generated by the probability model generation means are classified, a feature-quantity model is generated according to the feature-quantity models in that set, so a feature-quantity model reflecting many feature quantities of the reference sound (that is, one with high statistical validity) can be generated. Further, by applying a designated sound to be synthesized to the feature-quantity decision tree constructed by the classification of the feature-quantity models, an appropriate feature-quantity model can be selected even for a designated sound whose attributes do not exist in the reference sound.
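A toy sketch of the selection step only: a decision tree of yes/no questions about a designated note's attributes routes the note to a leaf model, so even an attribute combination never seen in the reference sound still reaches a leaf. The questions, thresholds and leaf names here are all invented; the patent's tree is built by classifying trained models.

```python
# Route a note's context through a hand-written yes/no decision tree.

TREE = {
    "question": lambda note: note["duration"] >= 0.5,  # long note?
    "yes": {"question": lambda note: note["pitch"] >= 64,  # high note?
            "yes": "model_long_high", "no": "model_long_low"},
    "no": "model_short",
}

def select_model(tree, note):
    """Descend until a leaf (a model name) is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if tree["question"](note) else tree["no"]
    return tree

# An unseen combination of attributes still reaches a leaf:
print(select_model(TREE, {"duration": 0.8, "pitch": 62}))  # model_long_low
```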

A probability model generation apparatus for sound synthesis according to a second aspect of the present invention comprises: section setting means (for example, a section setting unit 34) that divides a reference sound into unit sections, one per transition type, according to the tendency of variation of a feature quantity (for example, a reference pitch Pref); and probability model generation means (for example, a probability model generation unit 421) that generates, for each transition type, a duration model (for example, a duration model QB) indicating a probability distribution of duration for each of a plurality of states (for example, states St), from the time series of the feature quantity in the unit sections of that transition type within the reference sound. In this configuration, because a duration model is generated for each transition type of the reference sound, differences in the variation tendency of the feature quantity of the reference sound are faithfully reflected in the duration models. Compared with a configuration that generates duration models from the reference sound without regard to differences in transition type, it is therefore possible to generate duration models capable of producing an aurally natural synthesized sound faithfully reflecting the characteristics of the reference sound.

A probability model generation apparatus for sound synthesis according to a preferred example of the second aspect comprises duration classification means (for example, a duration classification unit 425) that classifies the plurality of duration models generated by the probability model generation means into a plurality of sets, and generates duration information including a duration decision tree (for example, a duration decision tree TB) constructed by the classification and a duration model (for example, a duration model MB) generated for each set from the duration models classified into that set. In this configuration, for each of the plurality of sets into which the duration models generated by the probability model generation means are classified, a duration model is generated according to the duration models in that set, so a duration model reflecting many feature quantities of the reference sound (that is, one with high statistical validity) can be generated. Further, by applying a designated sound to be synthesized to the duration decision tree constructed by the classification of the duration models, an appropriate duration model can be selected even for a designated sound whose attributes do not exist in the reference sound.

The transition type (the tendency of variation of the feature quantity) in each of the above forms means the behavior of the feature quantity over time, such as rising/falling or changing/holding. For example, a process in which the feature quantity approaches the target value over time from the start of a sound (a beginning portion B), a stationary process in which the feature quantity is kept substantially constant (a sustain portion S), and a process in which the feature quantity departs from the target value over time toward the end of the sound (an end portion E) can be cited as typical examples of transition types.

A probability model generation apparatus for sound synthesis according to a third aspect comprises transition arrangement model generation means (for example, a transition arrangement model generation unit 441) that generates, for each of a plurality of kinds of arrangements of transition types, a transition arrangement model (for example, a transition arrangement model QC) specifying the discrete probability that the arrangement appears within the note interval of each note, from the arrangement of transition types within the note interval corresponding to that note in the reference sound. In this configuration, because a transition arrangement model indicating the appearance probability of each arrangement of transition types within a note interval is generated, an appropriate transition arrangement can be determined for a designated sound to be synthesized, and the probability models (feature-quantity model, duration model) corresponding to each transition type can be selected. It is therefore possible to generate an aurally natural synthesized sound that faithfully reflects the characteristics of the reference sound.
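One simple way such a discrete distribution could be estimated, sketched here under the assumption (not stated in the patent) that relative frequency of observed arrangements is used: for one note context, count each arrangement observed across the reference sound and normalize. The observations are invented.

```python
# Estimate a transition arrangement model as relative frequencies.
from collections import Counter

def estimate_arrangement_model(observed_arrangements):
    """Map each arrangement string to its probability of appearing."""
    counts = Counter(observed_arrangements)
    total = sum(counts.values())
    return {arr: n / total for arr, n in counts.items()}

# Arrangements observed for one note context across the reference sound:
model = estimate_arrangement_model(["B-S-E", "B-S-E", "S-E", "B-S"])
print(model["B-S-E"])  # 0.5
```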

A probability model generation apparatus for sound synthesis according to a preferred example of the third aspect comprises transition arrangement classification means (for example, a transition arrangement classification unit 443) that classifies the plurality of transition arrangement models generated by the transition arrangement model generation means into a plurality of sets, and generates transition arrangement information including a transition arrangement decision tree (for example, a transition arrangement decision tree TC) constructed by the classification and a transition arrangement model (for example, a transition arrangement model MC) generated for each set from the transition arrangement models classified into that set. In this configuration, for each of the plurality of sets into which the transition arrangement models generated by the transition arrangement model generation means are classified, a transition arrangement model is generated according to the transition arrangement models in that set, so a transition arrangement model reflecting many feature quantities of the reference sound (that is, one with high statistical validity) can be generated. Further, by applying a designated sound to be synthesized to the transition arrangement decision tree constructed by the classification of the transition arrangement models, an appropriate transition arrangement model can be selected even for a designated sound whose attributes do not exist in the reference sound.

The present invention is also specified as a feature-quantity trajectory generation apparatus that generates a time series of a feature quantity using the transition arrangement models generated by the probability model generation apparatus for sound synthesis of the third aspect exemplified above. That is, the feature-quantity trajectory generation apparatus of the present invention comprises: storage means (for example, a storage device 14) that stores a plurality of transition arrangement models (for example, transition arrangement models MC), each indicating the probability that each arrangement of transition types, defined according to the tendency of variation of the feature quantity, appears within the note interval of each note; and trajectory generation means (for example, a trajectory generation unit 52) that determines the transition type of each unit section of a designated sound according to the probabilities indicated by the transition arrangement model corresponding to the note of the designated sound, among the plurality of transition arrangement models, and generates a time series of the feature quantity (for example, a synthesized pitch trajectory Psyn) such that the feature quantity in each unit section varies with the tendency corresponding to its transition type. In this configuration, the transition type of each unit section of the designated sound is determined according to the probabilities indicated by the transition arrangement model corresponding to the note of the designated sound, and the feature-quantity time series is generated such that the feature quantity in each unit section varies with the tendency corresponding to its transition type. Compared with, for example, a configuration that generates a feature-quantity time series without regard to differences in transition type, it is therefore possible to determine a feature-quantity trajectory such that an aurally natural synthesized sound faithfully reflecting the characteristics of the reference sound is generated.
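A rough sketch of the trajectory-generation idea: pick the arrangement with the highest probability under the note's transition arrangement model, then emit frames per unit section whose tendency matches the section's transition type. The section shapes, frame counts and depths below are invented stand-ins; in the patent the contour within each section comes from the per-state probability models.

```python
# Choose an arrangement, then render each of its sections with a type-specific
# tendency (rise toward target, hold, fall away).

def choose_arrangement(model):
    """Most probable arrangement under a {arrangement: probability} model."""
    return max(model, key=model.get)

def render_section(ttype, target, n_frames):
    if ttype == "B":   # approach the target pitch from below
        return [target - 2.0 * (1 - i / (n_frames - 1)) for i in range(n_frames)]
    if ttype == "S":   # hold the target pitch
        return [target] * n_frames
    return [target - 2.0 * (i / (n_frames - 1)) for i in range(n_frames)]  # "E"

model = {"B-S-E": 0.6, "S": 0.3, "S-E": 0.1}  # hypothetical model for one note
arrangement = choose_arrangement(model)        # "B-S-E"
trajectory = [f for t in arrangement.split("-") for f in render_section(t, 60.0, 3)]
print(arrangement, trajectory)
```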

The present invention can also be specified as a sound synthesis apparatus (for example, a sound synthesis apparatus 100) using the feature-quantity trajectory generation apparatus described above. The sound synthesis apparatus of the present invention comprises: storage means (for example, a storage device 14) that stores a plurality of transition arrangement models (for example, transition arrangement models MC), each indicating the probability that each arrangement of transition types, defined according to the tendency of variation of a feature quantity, appears within the note interval of each note; trajectory generation means (for example, a trajectory generation unit 52) that determines the transition type of each unit section of a designated sound according to the probabilities indicated by the transition arrangement model corresponding to the note of the designated sound, among the plurality of transition arrangement models, and generates a time series of the feature quantity (for example, a synthesized pitch trajectory Psyn) such that the feature quantity in each unit section varies with the tendency corresponding to its transition type; and synthesis processing means (for example, a synthesis processing unit 54) that generates synthesized sound data (for example, synthesized sound data Vout) by processing sound waveform data (for example, sound waveform data ZA) so that it follows the feature-quantity time series generated by the trajectory generation means.

The apparatuses according to the aspects above (the probability model generation apparatus for sound synthesis, the feature-quantity trajectory generation apparatus, and the sound synthesis apparatus) can be realized by dedicated electronic circuitry such as a DSP (Digital Signal Processor), or by the cooperation of a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) with a program. A program that causes a computer to function as an apparatus according to any of the above aspects may be provided to the user in a form stored on a computer-readable recording medium and installed on the computer, or may be provided from a server apparatus in a form distributed via a communication network and installed on the computer.

FIG. 1 is a block diagram of a sound synthesis apparatus according to an embodiment of the present invention. FIG. 2 is a block diagram of a first processing unit. FIG. 3 is an explanatory diagram illustrating variation of the reference pitch of a reference sound. FIG. 4 is an explanatory diagram illustrating another variation of the reference pitch of a reference sound. FIG. 5 is a block diagram of a synthesis information generation unit. FIG. 6 is an explanatory diagram of a feature-quantity model and a duration model. FIG. 7 is an explanatory diagram of a feature-quantity decision tree. FIG. 8 is an explanatory diagram of a duration decision tree. FIG. 9 is an explanatory diagram of a transition arrangement model. FIG. 10 is an explanatory diagram of a transition arrangement decision tree. FIG. 11 is a block diagram of a second processing unit. FIG. 12 is an explanatory diagram of the operation of a trajectory generation unit. FIG. 13 is an explanatory diagram of a problem with probability model generation in the background art.

<A: Embodiment>
FIG. 1 is a block diagram of a sound synthesis apparatus 100 according to one embodiment of the present invention. The sound synthesis apparatus 100 of FIG. 1 is a singing synthesis apparatus that generates synthesized sound data Vout representing the singing of a piece of music with the desired notes and lyrics, and, as shown in FIG. 1, it is realized by a computer system comprising an arithmetic processing device 12, a storage device 14 and an input device 16. The input device 16 (for example, a mouse or keyboard) receives instructions from the user.

The storage device 14 stores a program PGM executed by the arithmetic processing device 12 and various data used by the arithmetic processing device 12 (reference information X, synthesis information Y, sound waveform information Z, and score data SC). A known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of a plurality of kinds of recording media, may be used as desired as the storage device 14.

The reference information X consists of reference sound data XA and score data XB, and is used for the generation (learning) of the synthesis information Y. The reference sound data XA is a sample sequence expressing the time-domain waveform of a voice (the reference sound) in which a specific singer (hereinafter "reference singer") sang a piece of music. The score data XB expresses the musical score of the piece represented by the reference sound data XA; that is, the score data XB specifies the notes (pitch name, duration) and lyrics (phonetic characters) of the reference sound in time series.

The synthesis information Y is generated from the reference information X for each reference singer (or for each genre of music sung by the reference singer), and is used to specify a time series (trajectory) of a feature quantity characteristic of the reference singer's singing. In the present embodiment, pitch (fundamental frequency) is assumed as the feature quantity specified from the synthesis information Y. The generation of the synthesis information Y using the reference information X is described later.

The sound waveform information Z comprises a plurality of pieces of sound waveform data ZA. Each piece of sound waveform data ZA is generated in advance for each speech unit uttered by the reference singer, and expresses the waveform characteristics of the speech unit (for example, the time-domain waveform or the shape of the frequency spectrum). A speech unit is a phoneme, the smallest unit that can be distinguished aurally, or a phoneme chain in which a plurality of phonemes are concatenated.

The score data SC specifies the notes (pitch name, duration) and lyrics (phonetic characters) of each designated sound to be synthesized in time series. The score data SC is generated according to instructions from the user via the input device 16 (instructions to add or edit each designated sound). In outline, the synthesized sound data Vout is generated by processing the pitch of the sound waveform data ZA corresponding to the notes and lyrics of each designated sound specified by the score data SC so that it follows a time series of pitches generated according to the synthesis information Y (hereinafter "synthesized pitch trajectory"). That is, a singing expression (pitch variation) characteristic of the reference singer is imparted to the synthesized sound represented by the synthesized sound data Vout.

By executing the program PGM stored in the storage device 14, the arithmetic processing device 12 of FIG. 1 realizes a plurality of functions (a first processing unit 21 and a second processing unit 22) necessary for generating the synthesized sound data Vout (voice synthesis). The first processing unit 21 generates the synthesis information Y using the reference information X, and the second processing unit 22 generates the synthesized sound data Vout using the synthesis information Y, the sound waveform information Z and the score data SC. A configuration in which each function of the arithmetic processing device 12 is realized by a dedicated electronic circuit (DSP), or a configuration in which the functions of the arithmetic processing device 12 are distributed over a plurality of integrated circuits, may also be adopted. The configurations and operations of the first processing unit 21 and the second processing unit 22 are described in turn below.

(1) First processing unit 21
FIG. 2 is a block diagram of the first processing unit 21. As shown in FIG. 2, the first processing unit 21 comprises a feature-quantity extraction unit 32, a section setting unit 34, and a synthesis information generation unit 36. The feature-quantity extraction unit 32 sequentially detects the pitch (hereinafter "reference pitch") Pref of the reference sound represented by the reference sound data XA. Any known technique may be used to detect the reference pitch Pref. For sections of the reference sound in which no harmonic structure exists (for example, consonant sections in which no pitch is detected), the reference pitch Pref is set to a predetermined value (for example, a value interpolated from the preceding and following reference pitches Pref). FIG. 3 shows, on a common time axis, the time series of the reference pitch Pref detected by the feature-quantity extraction unit 32 and the time series of the sounds (V1, V2, ...) specified by the score data XB.

The section setting unit 34 of FIG. 2 divides the reference sound represented by the reference sound data XA (the time series of the reference pitch Pref) into a plurality of unit sections μ on the time axis. As shown in FIG. 2, the section setting unit 34 of the present embodiment comprises a first section setting unit 341, a second section setting unit 343, and an identification information setting unit 345. As shown in FIG. 3, the first section setting unit 341 divides the time series of the reference pitch Pref detected by the feature-quantity extraction unit 32 into sections σ, one per note (hereinafter "note intervals"). The score data XB of the reference information X is used to set each note interval σ; that is, the first section setting unit 341 divides the time series of the reference pitch Pref into a plurality of note intervals σ bounded by the start and end points of each sound (V1, V2, ...) specified note by note by the score data XB.

The second section setting unit 343 in FIG. 2 divides each note section σ of the time series of the reference pitch Pref into unit sections μ for each transition type. A transition type is a classification according to the tendency of variation of the reference pitch Pref. In this embodiment, as shown in FIG. 3, three transition types are used as examples: a start part B (Beginning), a steady part S (Sustain), and an end part E (End). The start part B is a section immediately after the onset of a note in which the reference pitch Pref varies (for example, rises) so as to approach the pitch of that note; the steady part S is a section during the sounding of a note in which the reference pitch Pref is maintained substantially constant at the pitch of that note; and the end part E is a section immediately before the sounding of a note ends in which the reference pitch Pref deviates (for example, falls) from the pitch of that note.

One or more transition types appear in each note section σ. Moreover, the temporal relationship is fixed: the start part B precedes the steady part S and the end part E, and the end part E follows the start part B and the steady part S. Accordingly, the transition-type arrangement patterns (hereinafter referred to as "transition arrays") that can appear within one note section σ are limited to seven: "B-S-E", "B-S", "S-E", "B-E", "B", "S", and "E". For example, the note section σ corresponding to the designated sound V1 in FIG. 3 is divided into a unit section μ of the steady part S and a unit section μ of the end part E (transition array "S-E"); the note section σ of the designated sound V2 is divided into a unit section μ of the start part B, a unit section μ of the steady part S, and a unit section μ of the end part E (transition array "B-S-E"); and the note section σ of the designated sound V3 is set as a single unit section μ of the steady part S (transition array "S").
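Because the order B → S → E is fixed, the seven admissible transition arrays are exactly the non-empty order-preserving selections of the three transition types. A minimal Python sketch (names are ours) enumerates them:

```python
from itertools import combinations

TRANSITION_TYPES = ["B", "S", "E"]  # fixed temporal order: B before S before E

def valid_transition_arrays():
    """Enumerate every non-empty selection of B/S/E that preserves their
    temporal order; combinations() never reorders, so exactly 7 arrays result."""
    return ["-".join(combo)
            for r in range(1, len(TRANSITION_TYPES) + 1)
            for combo in combinations(TRANSITION_TYPES, r)]
```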

As described above, each note section σ is divided into unit sections μ for each transition type. Therefore, even when the note sequence of the reference sound (that is, the notes of the designated sounds in the score data XB) is the same, the way a note section σ is divided (the number and time lengths of the unit sections μ) changes according to how the reference pitch Pref varies (the transition types). For example, when the reference pitch Pref temporarily falls around the boundary between the designated sound V1 and the designated sound V2 as illustrated in FIG. 3 (that is, when the singing expression known as "shakuri" (scooping) is applied to the reference sound), the note section σ corresponding to the designated sound V1 is, as described above, divided into two unit sections μ corresponding to the steady part S and the end part E, and the note section σ corresponding to the designated sound V2 is divided into three unit sections μ corresponding to the start part B, the steady part S, and the end part E. On the other hand, when the reference pitch Pref does not vary around the boundary between the designated sound V1 and the designated sound V2 as shown in FIG. 4, the note section σ corresponding to the designated sound V1 is set as one unit section μ of the steady part S (transition array "S"), and the note section σ corresponding to the designated sound V2 is divided into two unit sections μ corresponding to the steady part S and the end part E (transition array "S-E").

Each unit section μ is variably set according to instructions from the user. For example, the user designates each unit section μ by operating the input device 16 as appropriate while estimating the transition type at each point in time by visually checking the time series of the reference pitch Pref displayed on a display device (not shown) (for example, the time variation of the reference pitch Pref illustrated in FIG. 3) and by listening to the reference sound reproduced from a sound emitting device (for example, a speaker). The second section setting unit 343 sets each unit section μ in accordance with the user's instructions to the input device 16.

The identification information setting unit 345 in FIG. 2 sets identification information A for each unit section μ divided by the second section setting unit 343. The identification information A is an identifier (label) indicating the attributes of the unit section μ and, as shown in FIG. 3, includes a note attribute a1 and a transition type a2. The transition type a2 designates the transition type of the unit section μ (one of the start part B, the steady part S, and the end part E). The transition type a2 is designated by the user, for example by operating the input device 16 when the unit section μ is set.

The note attribute a1 is information indicating the attributes of the note corresponding to the unit section μ (hereinafter referred to as the "target note"), and includes variables p1 to p3 and variables d1 to d3. The variable p2 is set to the note name (note number) of the target note. The variable p1 is set to the interval of the note immediately before the target note (relative to the target note), and the variable p3 is set to the interval of the note immediately after the target note. The variable d2 is set to the duration of the target note; the variable d1 is set to the duration of the note immediately before the target note, and the variable d3 is set to the duration of the note immediately after the target note. Each variable (p1 to p3, d1 to d3) of the note attribute a1 is specified from the score data XB. As understood from the above description, a plurality of unit sections μ that share the same musical conditions share the same identification information A. The content of the note attribute a1 is not limited to the above examples. For example, any information that affects the time series of the pitch can be designated in the note attribute a1, such as information indicating on which beat of each measure the target note falls (first beat / second beat), or information indicating the position of the target note (earlier / later) within the period corresponding to one breath of the reference sound.
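For illustration, the note attribute a1 could be represented as follows; the data layout, the tuple-based score format, and the helper name are our assumptions (the patent specifies only the variables p1 to p3 and d1 to d3):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class NoteAttribute:
    """Note attribute a1. Field meanings follow the text; the class itself
    is an illustrative assumption."""
    p1: Optional[int]    # interval of the preceding note relative to the target
    p2: int              # note name (note number) of the target note
    p3: Optional[int]    # interval of the following note relative to the target
    d1: Optional[float]  # duration of the preceding note
    d2: float            # duration of the target note
    d3: Optional[float]  # duration of the following note

def note_attribute(notes, i):
    """Build a1 for the i-th note; notes = [(note_number, duration), ...]."""
    prev_note = notes[i - 1] if i > 0 else None
    next_note = notes[i + 1] if i + 1 < len(notes) else None
    return NoteAttribute(
        p1=prev_note[0] - notes[i][0] if prev_note else None,
        p2=notes[i][0],
        p3=next_note[0] - notes[i][0] if next_note else None,
        d1=prev_note[1] if prev_note else None,
        d2=notes[i][1],
        d3=next_note[1] if next_note else None,
    )
```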

The synthesis information generation unit 36 in FIG. 2 generates the synthesis information Y using the time series of the reference pitch Pref for each unit section μ set by the section setting unit 34 (the second section setting unit 343). FIG. 5 is a block diagram of the synthesis information generation unit 36. As shown in FIG. 5, the synthesis information generation unit 36 includes a first information generation unit 42 that generates feature amount information YA and duration information YB, and a second information generation unit 44 that generates transition array information YC. The feature amount information YA, the duration information YB, and the transition array information YC are stored in the storage device 14 as the synthesis information Y in FIG. 1.

As shown in FIG. 5, the first information generation unit 42 includes a probability model generation unit 421, a feature amount classification unit 423, and a duration classification unit 425. The probability model generation unit 421 generates, for each piece of identification information A (for each combination of note attribute a1 and transition type a2), a probability model Q expressing the probability of appearance of the pitch P within one unit section μ corresponding to the transition type. In this embodiment, as shown in FIG. 6, an HSMM (Hidden Semi-Markov Model) defined by a plurality of states St (three in the example of FIG. 6) is used as the probability model Q. The probability model Q includes a feature amount model QA and a duration model QB. The feature amount model QA defines, for each state St, the probability distribution (output distribution) of the pitch P and its temporal change (derivative) ΔP within the unit section μ, and the duration model QB defines, for each state St, the probability distribution (duration distribution) of the duration D within the unit section μ. A configuration in which the feature amount model QA also defines, for each state St, the probability distribution of the second-order derivative of the pitch P is likewise suitable.

The probability model generation unit 421 in FIG. 5 generates the probability model Q corresponding to each piece of identification information A by executing a learning process (a maximum-likelihood estimation algorithm) on the time series of the reference pitch Pref within the unit sections μ sharing that identification information A. Specifically, the probability model Q is generated so that the time series of the reference pitch Pref within each unit section μ appears with maximum probability. The probability model Q is generated for each piece of identification information A. That is, even when a plurality of unit sections μ share the same note attribute a1, a separate probability model Q is generated for each transition type a2 if the transition type a2 differs between the unit sections μ.
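The learning process itself is not detailed here; a real HSMM would infer the state alignment, for example by an EM-style algorithm. The following deliberately simplified Python sketch conveys the idea by splitting each unit section evenly across the states and fitting a Gaussian per state to the pitch P, its change ΔP, and the state duration D; the function name, the even split, and the Gaussian form are all our assumptions:

```python
import numpy as np

def fit_model_q_simplified(pitch_sections, n_states=3):
    """Simplified stand-in for maximum-likelihood learning of model Q:
    each unit section's reference-pitch sequence is split evenly across the
    states (a real HSMM would infer this alignment), then per state a Gaussian
    is fitted over P and ΔP (feature amount model QA) and over the state
    duration D (duration model QB)."""
    pitches = [[] for _ in range(n_states)]
    deltas = [[] for _ in range(n_states)]
    durations = [[] for _ in range(n_states)]
    for pref in pitch_sections:                 # pref: pitches of one section
        pref = np.asarray(pref, dtype=float)
        delta = np.diff(pref, prepend=pref[0])  # ΔP, first-order change
        for st, (p_chunk, d_chunk) in enumerate(
                zip(np.array_split(pref, n_states),
                    np.array_split(delta, n_states))):
            pitches[st].extend(p_chunk)
            deltas[st].extend(d_chunk)
            durations[st].append(len(p_chunk))
    return [{"pitch": (np.mean(pitches[st]), np.var(pitches[st])),
             "delta": (np.mean(deltas[st]), np.var(deltas[st])),
             "dur":   (np.mean(durations[st]), np.var(durations[st]))}
            for st in range(n_states)]
```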

The feature amount classification unit 423 in FIG. 5 classifies the feature amount models QA generated by the probability model generation unit 421 for each piece of identification information A into a plurality of sets (fewer than the total number of probability models Q). Any known machine learning technique may be employed for the classification (clustering) of the feature amount models QA, but the decision tree learning exemplified below is preferable.

The feature amount classification unit 423 constructs the decision tree TA in FIG. 7 (hereinafter referred to as the "feature amount decision tree") by sequentially determining, for each feature amount model QA, whether predetermined conditions related to the identification information A are satisfied. As shown in FIG. 7, the feature amount decision tree TA is a classification tree composed of a root node serving as the starting point of the classification, a plurality of intermediate nodes corresponding to the determination of the respective conditions, and KA leaf nodes corresponding to the sets into which the feature amount models QA are finally classified. At the root node and each intermediate node, conditions are evaluated such as whether the duration d2 of the target note exceeds a threshold, or whether the interval p1 between the target note and the immediately preceding note (or the interval p3 between the target note and the immediately following note) exceeds a threshold. The point at which the classification of the feature amount models QA is stopped (the point at which the feature amount decision tree TA is finalized) is determined according to, for example, the minimum description length (MDL) criterion.

For each of the KA leaf nodes of the feature amount decision tree TA, the feature amount classification unit 423 generates one feature amount model MA according to the plurality of feature amount models QA classified into that leaf node. Specifically, the feature amounts (pitches P) applied to the generation (learning) of the feature amount models QA of one leaf node are used collectively to re-estimate one new feature amount model MA corresponding to that leaf node. For example, a weighted sum of the feature amount models QA classified into each leaf node is generated as the feature amount model MA. As shown in FIG. 5, the feature amount classification unit 423 stores, in the storage device 14, feature amount information YA including the feature amount decision tree TA generated by the above method and the KA feature amount models MA.
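The exact form of this "weighted sum" of models is not specified. One plausible reading, for Gaussian output distributions, is moment matching of the per-leaf Gaussians, as in the following hedged Python sketch (the combination rule and function name are our assumptions):

```python
def merge_gaussians(models, weights):
    """Combine several Gaussians N(mu_i, var_i) into one Gaussian by moment
    matching - one plausible reading of the 'weighted sum' of the models
    classified into a leaf node. models = [(mu, var), ...]; weights are
    assumed to sum to 1."""
    mu = sum(w * m for w, (m, v) in zip(weights, models))
    var = sum(w * (v + m * m) for w, (m, v) in zip(weights, models)) - mu * mu
    return mu, var
```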

Like the feature amount classification unit 423, the duration classification unit 425 in FIG. 5 classifies the duration models QB generated by the probability model generation unit 421 for each piece of identification information A into a plurality of sets by decision tree learning. That is, the duration classification unit 425 constructs the decision tree TB in FIG. 8 (hereinafter referred to as the "duration decision tree") by sequentially determining, for each duration model QB, whether predetermined conditions related to the identification information A are satisfied. The conditions evaluated when constructing the duration decision tree TB and the criterion for stopping its construction are the same as for the feature amount decision tree TA. For each of the KB leaf nodes of the finalized duration decision tree TB, the duration classification unit 425 generates one duration model MB according to the plurality of duration models QB classified into that leaf node (for example, a weighted sum of those duration models QB). As shown in FIG. 5, the duration classification unit 425 then stores, in the storage device 14, duration information YB including the duration decision tree TB and the KB duration models MB.

The second information generation unit 44 in FIG. 5 includes a transition array model generation unit 441 and a transition array classification unit 443. The transition array model generation unit 441 generates a transition array model QC for each note attribute a1 in the identification information A (that is, for each group of note sections σ sharing the same note attribute a1). As illustrated in FIG. 9, the transition array model QC corresponding to each note attribute a1 is a probability model indicating the probability (discrete probability) that each of the seven transition arrays ("B-S-E", "B-S", "S-E", "B-E", "B", "S", "E") appears within a note section σ of that note attribute a1.

The transition array model QC of a note attribute a1 is generated according to the frequency with which each transition array appears in the note sections σ sharing that note attribute a1. That is, in the transition array model QC of each note attribute a1, a larger appearance probability is set for a transition array that appears more often in the note sections σ of that note attribute a1. For example, if, of two note sections σ sharing a note attribute a1, the transition array of one note section σ is "B-S-E" as for the designated sound V2 in FIG. 3 and the transition array of the other note section σ is "S-E" as for the designated sound V2 in FIG. 4, then in the transition array model QC of that note attribute a1 the appearance probabilities of the transition array "B-S-E" and the transition array "S-E" are each set to 0.5, and the appearance probabilities of the other transition arrays are set to 0.
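The worked example above (probabilities 0.5, 0.5, and 0 for the remaining arrays) corresponds to a simple relative-frequency estimate, which can be sketched as follows (function name is ours):

```python
from collections import Counter

TRANSITION_ARRAYS = ["B-S-E", "B-S", "S-E", "B-E", "B", "S", "E"]

def transition_array_model(observed_arrays):
    """Relative-frequency estimate of the discrete distribution over the seven
    transition arrays, from the arrays observed in the note sections sharing
    one note attribute a1 (observed_arrays must be non-empty)."""
    counts = Counter(observed_arrays)
    total = sum(counts.values())
    return {a: counts.get(a, 0) / total for a in TRANSITION_ARRAYS}
```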

The transition array classification unit 443 in FIG. 5 classifies the transition array models QC generated by the transition array model generation unit 441 for each note attribute a1 (for each note section σ) into a plurality of sets (fewer than the total number of transition array models QC). Any known machine learning technique may be employed for the classification of the transition array models QC, but, as with the classification by the feature amount classification unit 423 and the duration classification unit 425, the decision tree learning described below is preferable.

The transition array classification unit 443 constructs the decision tree TC in FIG. 10 (hereinafter referred to as the "transition array decision tree") by sequentially determining, for each transition array model QC, whether predetermined conditions related to the identification information A are satisfied. Like the feature amount decision tree TA and the duration decision tree TB described above, the transition array decision tree TC is a classification tree composed of a root node, a plurality of intermediate nodes, and KC leaf nodes corresponding to the sets into which the transition array models QC are finally classified.

The transition array within a note section σ is affected by the duration d2 of the target note, the intervals (p1, p3) between the target note and the preceding and following notes, and the like. For example, the total number of transition types within a note section σ tends to increase as the duration d2 of the target note becomes longer, and also as the interval (pitch difference) between the target note and the preceding or following note becomes larger. In consideration of these tendencies, the transition array classification unit 443, like the feature amount classification unit 423 and the duration classification unit 425, determines at the root node and each intermediate node whether various conditions are satisfied, such as whether the duration d2 of the target note exceeds a threshold, or whether the interval p1 between the target note and the immediately preceding note (or the interval p3 between the target note and the immediately following note) exceeds a threshold. The minimum description length (MDL) criterion, for example, is suitably applied to determine when to stop the classification of the transition array models QC (when to finalize the transition array decision tree TC).

For each of the KC leaf nodes of the transition array decision tree TC, the transition array classification unit 443 generates one transition array model MC according to the plurality of transition array models QC classified into that leaf node. For example, a weighted sum of the transition array models QC classified into each leaf node is generated as the transition array model MC. As shown in FIG. 5, the transition array classification unit 443 stores, in the storage device 14, transition array information YC including the transition array decision tree TC generated by the above method and the KC transition array models MC. The above is the configuration and operation of the first processing unit 21.

(2) Second processing unit 22
FIG. 11 is a block diagram of the second processing unit 22, which generates the synthesized sound data Vout. As shown in FIG. 11, the second processing unit 22 includes a trajectory generation unit 52 and a synthesis processing unit 54. The trajectory generation unit 52 generates, from the synthesis information Y, the time series of the pitch of each designated sound designated by the score data SC (the synthesized pitch trajectory Psyn). The synthesis processing unit 54 generates synthesized sound data Vout of a singing sound whose pitch varies over time so as to follow the synthesized pitch trajectory Psyn generated by the trajectory generation unit 52. Specifically, the synthesis processing unit 54 acquires, from the storage device 14, the sound waveform data ZA corresponding to the lyrics of each designated sound indicated by the score data SC, and generates the synthesized sound data Vout by processing the sound waveform data ZA so that the pitch varies over time along the synthesized pitch trajectory Psyn. The reproduced sound of the synthesized sound data Vout is therefore a singing sound to which the singing expression (pitch trajectory) peculiar to the reference singer who uttered the reference sound has been added.

FIG. 12 is an explanatory diagram of the operation of the trajectory generation unit 52. The processing of FIG. 12 is started in response to a predetermined operation on the input device 16 (an instruction to start generating the synthesized sound) and is executed sequentially for each designated sound of the score data SC.

When the processing of FIG. 12 starts, the trajectory generation unit 52 determines the note attribute a1 (variables p1 to p3 and variables d1 to d3) of the designated sound by referring to the score data SC (S11). The trajectory generation unit 52 then selects, from among the KC transition array models MC in the transition array information YC of the synthesis information Y stored in the storage device 14, the one transition array model MC corresponding to the note attribute a1 of the designated sound (S12). The transition array decision tree TC in the transition array information YC is used for selecting the transition array model MC. That is, the trajectory generation unit 52 applies the note attribute a1 of the designated sound to the transition array decision tree TC (sequentially determines, for the note attribute a1 of the designated sound, whether the conditions at the root node and each intermediate node of the transition array decision tree TC are satisfied), thereby identifying the leaf node (set) to which the designated sound should belong, and selects the transition array model MC corresponding to that leaf node from the transition array information YC. In other words, the transition array model MC generated from the note sections σ of the reference sound whose note attribute a1 is similar to the note attribute a1 of the designated sound is selected.
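The decision-tree selection in step S12 (and likewise in steps S14 and S15) amounts to walking a binary tree of yes/no conditions on the note attribute. A minimal Python sketch follows, with an illustrative two-level tree; the dict-based node layout and the thresholds are our assumptions:

```python
def select_leaf(tree, attr):
    """Walk a binary decision tree: each internal node evaluates one yes/no
    condition on the note attribute a1; each leaf holds the index of the
    model (MA, MB, or MC) associated with that set."""
    node = tree
    while "leaf" not in node:
        node = node["yes"] if node["test"](attr) else node["no"]
    return node["leaf"]

# Illustrative tree: split first on the target note's duration d2, then on
# the interval p1 to the preceding note (thresholds are arbitrary).
example_tree = {
    "test": lambda a: a["d2"] > 0.5,
    "yes": {"leaf": 0},
    "no": {
        "test": lambda a: abs(a["p1"]) > 2,
        "yes": {"leaf": 1},
        "no": {"leaf": 2},
    },
}
```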

The trajectory generation unit 52 determines the transition array of the designated sound according to the transition array model MC selected in step S12 (S13). Specifically, the trajectory generation unit 52 determines the transition array of the designated sound so that, over the designated sounds sharing the note attribute a1 set in step S11, the probability with which each transition array is determined in step S13 approximates the appearance probability defined for that transition array by the transition array model MC. That is, a transition array with a higher appearance probability under the transition array model MC is selected with higher probability as the transition array of the designated sound. The trajectory generation unit 52 then sets a unit section μ of the designated sound for each transition type constituting the transition array determined in step S13. For example, when the transition array selected in step S13 is "B-S-E", three unit sections μ corresponding to the respective transition types are set for the one designated sound. For each unit section μ of the designated sound, the trajectory generation unit 52 sets identification information A including the note attribute a1 of the designated sound and the transition type a2 determined for that unit section μ.
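Step S13 can be read as drawing a transition array at random according to the discrete distribution MC, which the following hedged Python sketch illustrates (names are ours):

```python
import random

def choose_transition_array(model_mc, rng=None):
    """Draw one transition array from the discrete distribution MC so that,
    over many designated sounds sharing a note attribute a1, the relative
    frequencies of the chosen arrays approach the probabilities in MC.
    model_mc maps each transition array to its appearance probability."""
    rng = rng or random.Random()
    arrays = list(model_mc)
    weights = [model_mc[a] for a in arrays]
    return rng.choices(arrays, weights=weights, k=1)[0]
```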

The trajectory generation unit 52 selects, for each unit section μ of the designated sound, the feature amount model MA corresponding to the identification information A of the designated sound from among the KA feature amount models MA in the feature amount information YA (S14). Specifically, the trajectory generation unit 52 identifies the leaf node (set) to which the identification information A of the designated sound should belong by applying the identification information A of the designated sound to the feature amount decision tree TA of the feature amount information YA, and selects the one feature amount model MA corresponding to that leaf node from the feature amount information YA. Similarly, by applying the identification information A of the designated sound to the duration decision tree TB of the duration information YB, the trajectory generation unit 52 selects, for each unit section μ of the designated sound, the one duration model MB corresponding to the identification information A of the designated sound from among the KB duration models MB in the duration information YB (S15).

The trajectory generation unit 52 then generates the synthesized pitch trajectory Psyn within each unit section μ of the designated sound using the feature amount model MA selected in step S14 and the duration model MB selected in step S15 (S16). Specifically, the duration D of each state St within the unit section μ is determined according to the duration model MB, and the synthesized pitch trajectory Psyn for each unit section μ is generated so as to maximize the joint probability under the probability distribution of the pitch P and the probability distribution of the temporal change ΔP defined by the feature amount model MA. The synthesized pitch trajectory Psyn of the designated sound is then generated by concatenating, on the time axis, the synthesized pitch trajectories Psyn generated for the respective unit sections μ by the above procedure.
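Step S16 resembles the parameter-generation problem known from HMM-based speech synthesis: with Gaussian distributions over P and ΔP, maximizing the joint probability reduces to a weighted least-squares problem solvable as a linear system. The following Python sketch illustrates this under that Gaussian assumption; the patent does not fix the distributions or the solver, and all names are ours:

```python
import numpy as np

def generate_trajectory(state_params, durations):
    """Generate the pitch curve of one unit section. Each state contributes
    Gaussian targets for the pitch P (mu_p, var_p) and its change ΔP
    (mu_d, var_d); the per-state durations would come from the duration model
    MB (e.g. its means). Maximizing the joint Gaussian likelihood of P and ΔP
    is solved as a linear system (weighted least squares).
    state_params: list of (mu_p, var_p, mu_d, var_d), one tuple per state."""
    mu_p, var_p, mu_d, var_d = [], [], [], []
    for (pm, pv, dm, dv), dur in zip(state_params, durations):
        mu_p += [pm] * dur; var_p += [pv] * dur
        mu_d += [dm] * dur; var_d += [dv] * dur
    T = len(mu_p)
    A = np.zeros((T, T)); b = np.zeros(T)
    for t in range(T):
        A[t, t] += 1.0 / var_p[t]          # static (pitch) term
        b[t] += mu_p[t] / var_p[t]
        if t > 0:                          # dynamic term: ΔP_t = P_t - P_(t-1)
            w = 1.0 / var_d[t]
            A[t, t] += w; A[t - 1, t - 1] += w
            A[t, t - 1] -= w; A[t - 1, t] -= w
            b[t] += mu_d[t] * w; b[t - 1] -= mu_d[t] * w
    return np.linalg.solve(A, b)
```

With a single state whose pitch target is constant and whose ΔP target is zero, the solution is the flat curve at the pitch mean, as expected.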

以上の形態では、参照音を遷移種別に応じて区分した単位区間μ内の参照ピッチPrefの時系列を利用して遷移種別毎に特徴量モデルQAおよび継続長モデルQBが生成されるから、参照音の音符(音符属性a1)が共通する場合でも、参照ピッチPrefの変動の相違が特徴量モデルQAや継続長モデルQB(さらには特徴量モデルMAや継続長モデルMB)に忠実に反映される。また、楽譜データSC内の指定音の各単位区間μの遷移種別が遷移配列モデルMCに応じて決定され、当該遷移種別に対応する特徴量モデルQAおよび継続長モデルQBに応じた合成ピッチ軌跡Psynが指定音の単位区間μ毎に生成される。したがって、参照音の参照ピッチPrefの変動の相違が忠実に反映されない確率モデルを利用する場合と比較して、参照歌唱者に特有の表現を忠実に反映した合成音を、聴感的な自然性を維持しながら生成することが可能である。   In the above embodiment, the feature amount model QA and the duration model QB are generated for each transition type using the time series of the reference pitch Pref in the unit sections μ into which the reference sound is divided according to transition type. Therefore, even when reference sounds share a common note (note attribute a1), differences in the variation of the reference pitch Pref are faithfully reflected in the feature amount model QA and the duration model QB (and hence in the feature amount model MA and the duration model MB). Further, the transition type of each unit section μ of the designated sound in the musical score data SC is determined according to the transition array model MC, and the synthesized pitch trajectory Psyn corresponding to the feature amount model QA and the duration model QB for that transition type is generated for each unit section μ of the designated sound. Therefore, compared with the case of using a probability model that does not faithfully reflect differences in the variation of the reference pitch Pref of the reference sound, a synthesized sound that faithfully reflects expression peculiar to the reference singer can be generated while maintaining perceptual naturalness.

<B:変形例>
以上の実施形態は多様に変形され得る。具体的な変形の態様を以下に例示する。以下の例示から任意に選択された2以上の態様は適宜に併合され得る。
<B: Modification>
The above embodiment can be variously modified. Specific modifications are exemplified below. Two or more aspects arbitrarily selected from the following examples can be appropriately combined.

(1)変形例1
参照音を音符区間σや単位区間μに区分する方法は適宜に変更される。例えば、前述の実施形態では楽譜データXBに応じて参照音を各音符区間σに区分したが、利用者からの指示に応じて各音符区間σを設定する構成も採用され得る。例えば、利用者は、表示装置に表示される参照音の波形を視認するとともに放音装置から再生される参照音を聴取することで各音符の境界を推定しながら、入力装置16を適宜に操作して各音符区間σを指定する。第1区間設定部341は、利用者からの指示に応じて各音符区間σを設定する。利用者が各音符区間σを指定する構成では、楽譜データXBは省略され得る。
(1) Modification 1
The method of dividing the reference sound into note intervals σ and unit sections μ may be changed as appropriate. For example, in the above embodiment the reference sound is divided into note intervals σ according to the score data XB, but a configuration in which each note interval σ is set according to an instruction from the user may also be employed. For example, the user designates each note interval σ by appropriately operating the input device 16 while estimating the boundary of each note by visually checking the waveform of the reference sound displayed on the display device and listening to the reference sound reproduced from the sound emitting device. The first section setting unit 341 then sets each note interval σ according to the instruction from the user. In a configuration in which the user designates each note interval σ, the score data XB may be omitted.

また、前述の実施形態では利用者からの指示に応じて参照音の各単位区間μを設定したが、第2区間設定部343が参照音データXAに応じて自動的に(すなわち利用者からの指示を必要とせずに)各単位区間μを設定する構成も採用され得る。例えば、第2区間設定部343は、音符区間σの始点の直後で参照ピッチPrefが変動する区間を開始部Bの単位区間μとして設定する。同様に、参照ピッチPrefが略一定に維持される区間が定常部Sの単位区間μに設定され、音符区間σの終点にかけて参照ピッチPrefが変動する区間が終了部Eの単位区間μに設定される。   In the above embodiment, each unit section μ of the reference sound is set according to an instruction from the user; however, a configuration in which the second section setting unit 343 sets each unit section μ automatically (that is, without requiring an instruction from the user) according to the reference sound data XA may also be employed. For example, the second section setting unit 343 sets a section in which the reference pitch Pref varies immediately after the start point of the note interval σ as the unit section μ of the start part B. Similarly, a section in which the reference pitch Pref is maintained substantially constant is set as the unit section μ of the steady part S, and a section in which the reference pitch Pref varies toward the end point of the note interval σ is set as the unit section μ of the end part E.
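A minimal sketch of such automatic segmentation is shown below, assuming a simple frame-to-frame threshold on the pitch change. The 20-cent threshold, the cents unit, and the single start/steady/end layout per note are illustrative assumptions for this sketch, not values prescribed by the patent.

```python
import numpy as np

def segment_note(pitch, flat_cents=20.0):
    """Split one note's pitch contour into start (B) / steady (S) / end (E).

    A transition between adjacent frames counts as "steady" when the pitch
    change stays within flat_cents. This sketch assumes a single steady run
    per note, approximated by the span between the first and last steady
    transitions. Returns half-open (begin, end) frame ranges, or None.
    """
    dp = np.abs(np.diff(pitch))          # frame-to-frame change, length T-1
    steady = dp <= flat_cents
    idx = np.flatnonzero(steady)
    T = len(pitch)
    if idx.size == 0:                    # pitch never settles: whole note is B
        return {"B": (0, T), "S": None, "E": None}
    s_begin, s_end = idx[0], idx[-1] + 2  # frames spanned by steady transitions
    return {"B": (0, s_begin) if s_begin > 0 else None,
            "S": (s_begin, s_end),
            "E": (s_end, T) if s_end < T else None}

# Toy contour: a rise into the note (B), a held pitch (S), a fall at the end (E)
pitch = np.concatenate([np.linspace(0.0, 100.0, 5),
                        np.full(6, 100.0),
                        np.array([70.0, 40.0, 10.0])])
seg = segment_note(pitch)
```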

(2)変形例2
前述の実施形態では、遷移配列モデルQCの分類の結果に応じてKC個の遷移配列モデルMCを生成したが、未分類の遷移配列モデルQCを遷移配列情報YCとして指定音の合成に適用する構成(以下「構成A」という)も採用され得る。指定音に指示された音符に対応する遷移配列モデルQCを利用して指定音の遷移配列が決定される。構成Aでは遷移配列分類部443(遷移配列モデルMCや遷移配列決定木TC)が省略されるから、第1処理部21の構成が簡素化されるという利点がある。
(2) Modification 2
In the above embodiment, KC transition array models MC are generated according to the result of classifying the transition array models QC; however, a configuration in which the unclassified transition array models QC are applied, as the transition array information YC, to the synthesis of the designated sound (hereinafter referred to as "Configuration A") may also be employed. The transition array of the designated sound is determined using the transition array model QC corresponding to the note designated for the designated sound. In Configuration A, the transition array classification unit 443 (and the transition array models MC and the transition array decision tree TC) is omitted, which has the advantage of simplifying the configuration of the first processing unit 21.

ただし、構成Aでは、1個の遷移配列モデルQCの生成に利用される参照ピッチPrefの個数が不足するため、遷移配列モデルQCの統計的な妥当性を担保することが困難となる。また、全種類の音符属性a1について遷移配列モデルQCを用意することが現実的には困難である以上、遷移配列モデルQCが用意されていない音符属性a1の指定音を合成できないという問題もある。前述の実施形態では、遷移配列モデルQCの分類の結果に応じてKC個の遷移配列モデルMCが生成される(すなわち遷移配列モデルQCと比較して1個の遷移配列モデルMCに多数の参照ピッチPrefが反映される)から、遷移配列モデルMCの統計的な妥当性を充分に担保することが可能である。また、合成時に指定音を遷移配列決定木TCに適用することで指定音の遷移配列が決定される(S13)から、参照音に存在しない音符の指定音についても、聴感的に自然な合成音の生成を実現し得る適切な遷移配列を選択できるという利点がある。   In Configuration A, however, the number of reference pitches Pref used to generate each transition array model QC is insufficient, which makes it difficult to ensure the statistical validity of the transition array model QC. Moreover, since it is practically difficult to prepare transition array models QC for all types of note attributes a1, there is also the problem that a designated sound with a note attribute a1 for which no transition array model QC has been prepared cannot be synthesized. In the above embodiment, KC transition array models MC are generated according to the result of classifying the transition array models QC (that is, many reference pitches Pref are reflected in each transition array model MC compared with a transition array model QC), so the statistical validity of the transition array models MC can be sufficiently ensured. In addition, since the transition array of the designated sound is determined by applying the designated sound to the transition array decision tree TC at synthesis time (S13), there is the advantage that an appropriate transition array yielding a perceptually natural synthesized sound can be selected even for a designated sound whose note does not exist in the reference sound.
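The role of a transition array decision tree such as TC can be illustrated with a toy tree over note attributes. The attribute questions, probabilities, and transition arrays below are all hypothetical placeholders for illustration and are not contents of the actual tree in the patent.

```python
# Hypothetical decision tree: each internal node asks a yes/no question about
# note attributes; each leaf holds a discrete distribution over transition
# arrays (tuples of B/S/E unit-section types).
tree = {
    "q": lambda note: note["duration_beats"] >= 1.0,
    "yes": {"leaf": {("B", "S", "E"): 0.7, ("B", "S"): 0.3}},
    "no": {
        "q": lambda note: note["pitch"] >= 64,
        "yes": {"leaf": {("S", "E"): 0.6, ("S",): 0.4}},
        "no": {"leaf": {("S",): 0.8, ("B", "S"): 0.2}},
    },
}

def select_transition_array(tree, note):
    """Walk the tree with the note's attributes; at the leaf, return the
    transition array with the highest discrete probability."""
    node = tree
    while "leaf" not in node:
        node = node["yes"] if node["q"](note) else node["no"]
    return max(node["leaf"], key=node["leaf"].get)

# A short, high note reaches a leaf that favors a steady part plus an end part.
array = select_transition_array(tree, {"duration_beats": 0.5, "pitch": 67})
```

Because every note ends up at some leaf, an array can be selected even for notes that never occurred in the reference sound, which is the advantage noted above.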

前述の構成Aと同様に、特徴量モデルQAや継続長モデルQBを指定音の合成に適用する構成(特徴量分類部423や継続長分類部425を省略した構成)も採用され得るが、確率モデルQの統計的な妥当性を担保して聴感的に自然な合成音を合成するという観点からは、前述の実施形態の例示のように特徴量モデルQAの分類で生成された特徴量モデルMAや継続長モデルQBの分類で生成された継続長モデルMBを指定音の合成に利用する構成が格別に好適である。   As in Configuration A above, a configuration in which the feature amount models QA and the duration models QB are applied to the synthesis of the designated sound (a configuration in which the feature amount classification unit 423 and the duration classification unit 425 are omitted) may also be employed. From the viewpoint of synthesizing a perceptually natural sound while ensuring the statistical validity of the probability models Q, however, a configuration that uses for the synthesis of the designated sound the feature amount models MA generated by classifying the feature amount models QA and the duration models MB generated by classifying the duration models QB, as illustrated in the above embodiment, is particularly suitable.

(3)変形例3
前述の実施形態では、記憶装置14に格納された参照音データXAから特徴量抽出部32が参照ピッチPrefを抽出したが、参照音から事前に抽出された参照ピッチPrefの時系列を記憶装置14に格納した構成(したがって特徴量抽出部32は省略される)も採用され得る。また、参照音を事前に各音符区間σに区分して記憶装置14に格納した構成(したがって第1区間設定部341は省略される)も採用され得る。
(3) Modification 3
In the above embodiment, the feature amount extraction unit 32 extracts the reference pitch Pref from the reference sound data XA stored in the storage device 14; however, a configuration in which a time series of the reference pitch Pref extracted in advance from the reference sound is stored in the storage device 14 (so that the feature amount extraction unit 32 is omitted) may also be employed. Further, a configuration in which the reference sound is divided in advance into note intervals σ and stored in the storage device 14 (so that the first section setting unit 341 is omitted) may also be employed.

(4)変形例4
前述の実施形態では第1処理部21と第2処理部22とを具備する音響合成装置100を例示したが、合成用情報Y(特徴量情報YA,継続長情報YB,遷移配列情報YC)を生成する第1処理部21を具備する音合成用確率モデル生成装置(第2処理部22を省略した装置)や、合成用情報Yを利用して合成音データVoutを生成する第2処理部22を具備する音響合成装置(第1処理部21を省略した装置)としても本発明は実施され得る。また、合成用情報Yを記憶する記憶装置14と第2処理部22の軌跡生成部52とを具備する装置(合成処理部54を省略した構成)は、合成音の特徴量の時系列(例えば合成ピッチ軌跡Psyn)を生成する特徴量軌跡生成装置としても把握され得る。
(4) Modification 4
In the above embodiment, the sound synthesis apparatus 100 including the first processing unit 21 and the second processing unit 22 is illustrated; however, the present invention may also be implemented as a sound synthesis probability model generation apparatus including the first processing unit 21 that generates the synthesis information Y (the feature amount information YA, the duration information YB, and the transition array information YC) (an apparatus from which the second processing unit 22 is omitted), or as a sound synthesis apparatus including the second processing unit 22 that generates the synthesized sound data Vout using the synthesis information Y (an apparatus from which the first processing unit 21 is omitted). Further, an apparatus including the storage device 14 that stores the synthesis information Y and the trajectory generation unit 52 of the second processing unit 22 (a configuration from which the synthesis processing unit 54 is omitted) may also be understood as a feature amount trajectory generation apparatus that generates a time series of a feature amount of the synthesized sound (for example, the synthesized pitch trajectory Psyn).

(5)変形例5
前述の実施形態では参照音の参照ピッチPrefの時系列から合成用情報Yを生成するとともに合成用情報Yから合成ピッチ軌跡Psynを生成したが、合成用情報Yの生成に利用される参照音の特徴量や合成用情報Yから生成される指定音の特徴量はピッチ(基本周波数)に限定されない。例えば、参照音のパワーの時系列から合成用情報Yを生成するとともに合成用情報Yから指定音のパワーの時系列(合成パワー軌跡)を生成する構成も採用され得る。また、指定音のMFCC(Mel-Frequency cepstral coefficient)等の特徴量の生成にも、前述の実施形態と同様に本発明を適用することが可能である。
(5) Modification 5
In the above embodiment, the synthesis information Y is generated from the time series of the reference pitch Pref of the reference sound, and the synthesized pitch trajectory Psyn is generated from the synthesis information Y; however, the feature amount of the reference sound used to generate the synthesis information Y and the feature amount of the designated sound generated from the synthesis information Y are not limited to pitch (fundamental frequency). For example, a configuration in which the synthesis information Y is generated from a time series of the power of the reference sound and a time series of the power of the designated sound (a synthesized power trajectory) is generated from the synthesis information Y may also be employed. The present invention is also applicable, as in the above embodiment, to the generation of feature amounts such as MFCCs (Mel-frequency cepstral coefficients) of the designated sound.

なお、特徴量は参照音から直接的に抽出される数値に限定されない。例えば、所定の目標値に対する参照音の特徴量の相対値を利用して合成用情報Yを生成する構成も採用され得る。具体的には、所定の目標値(例えば参照音の音符の音高)に対する参照音の参照ピッチPrefの相対値から合成用情報Yを生成し、合成用情報Yに応じて生成されるピッチの相対値と指定音の音符の音高とから合成ピッチ軌跡Psynを生成する構成が採用される。   The feature amount is not limited to a value extracted directly from the reference sound. For example, a configuration in which the synthesis information Y is generated using the value of a feature amount of the reference sound relative to a predetermined target value may also be employed. Specifically, a configuration is employed in which the synthesis information Y is generated from the value of the reference pitch Pref of the reference sound relative to a predetermined target value (for example, the pitch of the note of the reference sound), and the synthesized pitch trajectory Psyn is generated from the relative pitch values generated according to the synthesis information Y and the pitch of the note of the designated sound.
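A sketch of this relative-pitch variant follows, assuming MIDI note numbers for the nominal note pitch and cents as the relative unit; both are illustrative choices for this sketch, not units prescribed by the patent.

```python
import numpy as np

def to_relative(pitch_hz, note_midi):
    """Reference pitch expressed as cents relative to the note's nominal pitch."""
    nominal_hz = 440.0 * 2.0 ** ((note_midi - 69) / 12.0)
    return 1200.0 * np.log2(np.asarray(pitch_hz) / nominal_hz)

def from_relative(cents, note_midi):
    """Reconstruct absolute pitch for a designated note from relative cents."""
    nominal_hz = 440.0 * 2.0 ** ((note_midi - 69) / 12.0)
    return nominal_hz * 2.0 ** (np.asarray(cents) / 1200.0)

# A contour measured around a reference A4 note can be re-anchored on a
# designated B4 note: the expressive deviation transfers, the pitch transposes.
ref_hz = np.array([438.0, 441.5, 444.0])   # measured reference pitch (Hz)
rel = to_relative(ref_hz, 69)              # cents around the notated A4
synth_hz = from_relative(rel, 71)          # same deviation around B4
```

This is why a relative representation lets one set of models serve notes at any pitch: the model captures only the deviation from the notated pitch.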

(6)変形例6
前述の実施形態では歌唱音の合成を例示したが、本発明が適用される範囲は歌唱音の合成に限定されない。例えば、楽器の演奏音(楽音)を合成する場合にも、前述の実施形態と同様に本発明を適用することが可能である。
(6) Modification 6
In the above-described embodiment, the synthesis of the singing sound is exemplified, but the range to which the present invention is applied is not limited to the synthesis of the singing sound. For example, when synthesizing musical instrument performance sounds (musical sounds), the present invention can be applied in the same manner as in the above-described embodiment.

100……音響合成装置、12……演算処理装置、14……記憶装置、16……入力装置、21……第1処理部、22……第2処理部、32……特徴量抽出部、34……区間設定部、341……第1区間設定部、343……第2区間設定部、345……識別情報設定部、36……合成用情報生成部、42……第1情報生成部、421……確率モデル生成部、423……特徴量分類部、425……継続長分類部、44……第2情報生成部、441……遷移配列モデル生成部、443……遷移配列分類部、52……軌跡生成部、54……合成処理部。
DESCRIPTION OF SYMBOLS 100 ... sound synthesis apparatus, 12 ... arithmetic processing device, 14 ... storage device, 16 ... input device, 21 ... first processing unit, 22 ... second processing unit, 32 ... feature amount extraction unit, 34 ... section setting unit, 341 ... first section setting unit, 343 ... second section setting unit, 345 ... identification information setting unit, 36 ... synthesis information generation unit, 42 ... first information generation unit, 421 ... probability model generation unit, 423 ... feature amount classification unit, 425 ... duration classification unit, 44 ... second information generation unit, 441 ... transition array model generation unit, 443 ... transition array classification unit, 52 ... trajectory generation unit, 54 ... synthesis processing unit.

Claims (8)

特徴量の変動の傾向に応じた遷移種別毎に参照音を単位区間に区分する区間設定手段と、
複数の状態の各々について特徴量の確率分布を示す遷移種別毎の特徴量モデルを、前記参照音のうち当該遷移種別の単位区間における特徴量の時系列から生成する確率モデル生成手段と、
遷移種別の複数種類の配列の各々について当該配列が各音符の音符区間内に出現する離散確率を指定する遷移配列モデルを、前記参照音のうち当該音符に対応する音符区間内の遷移種別の配列から生成する遷移配列モデル生成手段と
を具備する音合成用確率モデル生成装置。
Section setting means for dividing the reference sound into unit sections for each transition type according to the trend of fluctuation of the feature amount,
Probability model generation means for generating a feature amount model for each transition type indicating a probability distribution of a feature amount for each of a plurality of states from a time series of feature amounts in a unit section of the transition type in the reference sound;
transition array model generation means for generating, for each of a plurality of types of arrays of transition types, a transition array model that specifies a discrete probability that the array appears within the note interval of each note, from the array of transition types within the note interval corresponding to the note in the reference sound; a sound synthesis probability model generation apparatus comprising the above means.
前記遷移配列モデル生成手段は、前記音符区間の開始部、前記開始部の直後で特徴量が定常的に維持される定常部、および、前記定常部の直後の終了部のうちの複数の遷移種別の各配列と、前記開始部、前記定常部および前記終了部の各々とについて、出現確率を示す前記遷移配列モデルを生成する
請求項1の音合成用確率モデル生成装置。
The sound synthesis probability model generation apparatus according to claim 1, wherein the transition array model generation means generates the transition array model indicating an appearance probability for each array of a plurality of transition types among a start part of the note interval, a steady part in which the feature amount is constantly maintained immediately after the start part, and an end part immediately after the steady part, and for each of the start part, the steady part, and the end part.
特徴量の変動の傾向に応じた遷移種別の複数種類の配列の各々について当該配列が一の音符の音符区間内に出現する離散確率を指定する遷移配列モデルを、音符毎の音符区間内に1種類以上の遷移種別を包含する参照音のうち前記一の音符に対応する音符区間内の遷移種別の配列から生成する遷移配列モデル生成手段
を具備する音合成用確率モデル生成装置。
A sound synthesis probability model generation apparatus comprising transition array model generation means for generating, for each of a plurality of types of arrays of transition types according to a tendency of variation of a feature amount, a transition array model that specifies a discrete probability that the array appears within the note interval of one note, from the array of transition types within the note interval corresponding to the one note in a reference sound that includes one or more transition types within the note interval of each note.
前記遷移配列モデル生成手段が生成した複数の遷移配列モデルを複数の集合に分類し、前記分類で構築される遷移配列決定木と前記各集合に分類された遷移配列モデルから集合毎に生成される遷移配列モデルとを含む遷移配列情報を生成する遷移配列分類手段
を具備する請求項1から請求項3の何れかの音合成用確率モデル生成装置。
The sound synthesis probability model generation apparatus according to any one of claims 1 to 3, further comprising transition array classification means for classifying the plurality of transition array models generated by the transition array model generation means into a plurality of sets and generating transition array information including a transition array decision tree constructed by the classification and a transition array model generated for each set from the transition array models classified into that set.
特徴量の変動の傾向に応じた遷移種別の複数種類の配列の各々について当該配列が各音符の音符区間内に出現する離散確率を指定する複数の遷移配列モデルのうち指定音の音符に対応する遷移配列モデルが指定する離散確率に応じて指定音の各単位区間の遷移種別を決定し、前記各遷移種別に応じた傾向で各単位区間内の特徴量が変動するように特徴量の時系列を生成する軌跡生成手段
を具備する特徴量軌跡生成装置。
A feature amount trajectory generation apparatus comprising trajectory generation means for determining the transition type of each unit section of a designated sound according to the discrete probabilities specified by that one of a plurality of transition array models which corresponds to the note of the designated sound, each transition array model specifying, for each of a plurality of types of arrays of transition types according to a tendency of variation of a feature amount, a discrete probability that the array appears within the note interval of each note, and for generating a time series of the feature amount such that the feature amount within each unit section varies with the tendency according to each transition type.
コンピュータを、
特徴量の変動の傾向に応じた遷移種別毎に参照音を単位区間に区分する区間設定手段、
複数の状態の各々について特徴量の確率分布を示す遷移種別毎の特徴量モデルを、前記参照音のうち当該遷移種別の単位区間における特徴量の時系列から生成する確率モデル生成手段、および、
遷移種別の複数種類の配列の各々について当該配列が各音符の音符区間内に出現する離散確率を指定する遷移配列モデルを、前記参照音のうち当該音符に対応する音符区間内の遷移種別の配列から生成する遷移配列モデル生成手段
として機能させるプログラム。
A program for causing a computer to function as:
section setting means for dividing a reference sound into unit sections for each transition type according to a tendency of variation of a feature amount;
probability model generation means for generating, for each transition type, a feature amount model indicating a probability distribution of the feature amount for each of a plurality of states, from the time series of the feature amount in the unit sections of that transition type in the reference sound; and
transition array model generation means for generating, for each of a plurality of types of arrays of transition types, a transition array model that specifies a discrete probability that the array appears within the note interval of each note, from the array of transition types within the note interval corresponding to the note in the reference sound.
コンピュータを、
特徴量の変動の傾向に応じた遷移種別の複数種類の配列の各々について当該配列が一の音符の音符区間内に出現する離散確率を指定する遷移配列モデルを、音符毎の音符区間内に1種類以上の遷移種別を包含する参照音のうち前記一の音符に対応する音符区間内の遷移種別の配列から生成する遷移配列モデル生成手段
として機能させるプログラム。
A program for causing a computer to function as transition array model generation means for generating, for each of a plurality of types of arrays of transition types according to a tendency of variation of a feature amount, a transition array model that specifies a discrete probability that the array appears within the note interval of one note, from the array of transition types within the note interval corresponding to the one note in a reference sound that includes one or more transition types within the note interval of each note.
コンピュータを、
特徴量の変動の傾向に応じた遷移種別の複数種類の配列の各々について当該配列が各音符の音符区間内に出現する離散確率を指定する複数の遷移配列モデルのうち指定音の音符に対応する遷移配列モデルが指定する離散確率に応じて指定音の各単位区間の遷移種別を決定し、前記各遷移種別に応じた傾向で各単位区間内の特徴量が変動するように特徴量の時系列を生成する軌跡生成手段
として機能させるプログラム。
A program for causing a computer to function as trajectory generation means for determining the transition type of each unit section of a designated sound according to the discrete probabilities specified by that one of a plurality of transition array models which corresponds to the note of the designated sound, each transition array model specifying, for each of a plurality of types of arrays of transition types according to a tendency of variation of a feature amount, a discrete probability that the array appears within the note interval of each note, and for generating a time series of the feature amount such that the feature amount within each unit section varies with the tendency according to each transition type.
JP2010198710A 2010-09-06 2010-09-06 Stochastic model generation device for sound synthesis, feature amount locus generation device, and program Expired - Fee Related JP5699496B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2010198710A JP5699496B2 (en) 2010-09-06 2010-09-06 Stochastic model generation device for sound synthesis, feature amount locus generation device, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2010198710A JP5699496B2 (en) 2010-09-06 2010-09-06 Stochastic model generation device for sound synthesis, feature amount locus generation device, and program

Publications (2)

Publication Number Publication Date
JP2012058306A JP2012058306A (en) 2012-03-22
JP5699496B2 2015-04-08

Family

ID=46055517

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2010198710A Expired - Fee Related JP5699496B2 (en) 2010-09-06 2010-09-06 Stochastic model generation device for sound synthesis, feature amount locus generation device, and program

Country Status (1)

Country Link
JP (1) JP5699496B2 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6497025B2 (en) * 2013-10-17 2019-04-10 ヤマハ株式会社 Audio processing device
JP6390690B2 (en) * 2016-12-05 2018-09-19 ヤマハ株式会社 Speech synthesis method and speech synthesis apparatus

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6202049B1 (en) * 1999-03-09 2001-03-13 Matsushita Electric Industrial Co., Ltd. Identification of unit overlap regions for concatenative speech synthesis system
JP3966074B2 (en) * 2002-05-27 2007-08-29 ヤマハ株式会社 Pitch conversion device, pitch conversion method and program
JP2004325831A (en) * 2003-04-25 2004-11-18 Roland Corp Singing data generating program
JP4367437B2 (en) * 2005-05-26 2009-11-18 ヤマハ株式会社 Audio signal processing apparatus, audio signal processing method, and audio signal processing program
JP4839891B2 (en) * 2006-03-04 2011-12-21 ヤマハ株式会社 Singing composition device and singing composition program

Also Published As

Publication number Publication date
JP2012058306A (en) 2012-03-22

Similar Documents

Publication Publication Date Title
US9818396B2 (en) Method and device for editing singing voice synthesis data, and method for analyzing singing
JP5471858B2 (en) Database generating apparatus for singing synthesis and pitch curve generating apparatus
US11468870B2 (en) Electronic musical instrument, electronic musical instrument control method, and storage medium
US8115089B2 (en) Apparatus and method for creating singing synthesizing database, and pitch curve generation apparatus and method
CN101308652B (en) Synthesizing method of personalized singing voice
CN104347080B (en) The medium of speech analysis method and device, phoneme synthesizing method and device and storaged voice analysis program
JP5605066B2 (en) Data generation apparatus and program for sound synthesis
JP2017107228A (en) Singing voice synthesis device and singing voice synthesis method
Umbert et al. Expression control in singing voice synthesis: Features, approaches, evaluation, and challenges
EP2447944B1 (en) Technique for suppressing particular audio component
CN112382257B (en) Audio processing method, device, equipment and medium
EP3759706A1 (en) Method of combining audio signals
Saino et al. A singing style modeling system for singing voice synthesizers.
JP5598516B2 (en) Voice synthesis system for karaoke and parameter extraction device
JP6756151B2 (en) Singing synthesis data editing method and device, and singing analysis method
JP6390690B2 (en) Speech synthesis method and speech synthesis apparatus
JP5699496B2 (en) Stochastic model generation device for sound synthesis, feature amount locus generation device, and program
JP2013164609A (en) Singing synthesizing database generation device, and pitch curve generation device
JP5131904B2 (en) System and method for automatically associating music acoustic signal and lyrics with time
Shih et al. A statistical multidimensional humming transcription using phone level hidden Markov models for query by humming systems
CN112908302B (en) Audio processing method, device, equipment and readable storage medium
JP2022065554A (en) Method for synthesizing voice and program
JP5131130B2 (en) Follow-up evaluation system, karaoke system and program
JP6380305B2 (en) Data generation apparatus, karaoke system, and program
CN112489607A (en) Method and device for recording songs, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20130724

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20131226

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20140121

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20140324

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20141021

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20141218

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20150120

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20150202

R151 Written notification of patent or utility model registration

Ref document number: 5699496

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R151

LAPS Cancellation because of no payment of annual fees