JP5699496B2 - Stochastic model generation device for sound synthesis, feature amount locus generation device, and program - Google Patents


Info

Publication number
JP5699496B2
Authority
JP
Japan
Prior art keywords
transition
note
model
sound
array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP2010198710A
Other languages
Japanese (ja)
Other versions
JP2012058306A (en)
Inventor
慶二郎 才野 (Keijiro Saino)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corp
Priority to JP2010198710A
Publication of JP2012058306A
Application granted
Publication of JP5699496B2
Expired - Fee Related


Description

The present invention relates to the generation of a probability model representing a time series of an acoustic feature quantity (for example, pitch or power), and to the generation of a time series of the feature quantity using such a probability model. A feature-quantity time series generated from the probability model is suitably used for synthesizing sounds such as singing voices.

By imparting to a synthesized sound a feature-quantity variation that approximates that of a recorded sound (hereinafter "reference sound"), it is possible to generate a synthesized sound that is aurally natural. For example, Non-Patent Document 1 discloses a technique for generating a synthesized sound using a probability model (for example, an HMM (Hidden Markov Model)) that represents the time series of the pitch of a reference sound. Specifically, the reference sound is divided into a plurality of note intervals, one per note, and a probability model is generated for each note by a learning process applied to the time series of pitches within each note interval.

Shinji Sako, Keijiro Saino, Yoshihiko Nankaku, Keiichi Tokuda and Tadashi Kitamura, "A Singing Voice Synthesis System Capable of Automatically Learning Voice Quality and Singing Style," IPSJ SIG Technical Report [Music and Computer], 2008(12), pp. 39-44, February 2008

FIG. 13 is a schematic diagram showing the relationship between the pitch P of a reference sound containing the singing of a piece of music and the pitch of each note V (V1, V2, V3) of that piece (that is, the target value of the pitch P). As shown in parts (A) and (B) of FIG. 13, the transition of the pitch P of the reference sound can differ, for example according to the singing expression, even when the sequence of notes V is the same. For example, in part (A) of FIG. 13, the pitch P of the reference sound temporarily drops before and after the boundary between note V1 and note V2 (the so-called "scoop" singing expression), whereas in part (B) the pitch P is kept substantially constant from note V1 through note V2.

With the technique of Non-Patent Document 1, a probability model is generated for each note by a learning process applied to the pitch time series of every note interval of the reference sound that shares that note. In the case illustrated in FIG. 13, for example, the pitch P within the note-V2 interval of both part (A) and part (B) is applied to the generation of the probability model for note V2, despite the difference in the transition of the pitch P described above. Consequently, a probability model expressing a pitch-P transition intermediate between part (A) and part (B) is generated. When a probability model that does not faithfully reflect the characteristics of the actual reference sound is used in this way, there is the problem that an aurally unnatural synthesized sound is generated.
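A small numerical illustration of this failure mode (the contours and the 60-semitone target are invented for the example, not taken from the patent): training one per-note model on both a scooped rendition and a flat rendition effectively averages the two contours, producing a half-depth dip that occurs in neither actual performance.

```python
# Hypothetical illustration: frame-wise averaging of two renditions of the
# same note, which is what a single per-note model effectively learns.

def average_contour(contour_a, contour_b):
    """Frame-wise mean of two equally long pitch contours (in semitones)."""
    return [(a + b) / 2 for a, b in zip(contour_a, contour_b)]

# Rendition (A): a "scoop" -- pitch dips below the 60-semitone target at the
# note boundary, then recovers.
scoop = [60.0, 58.0, 56.0, 58.0, 60.0]
# Rendition (B): pitch held flat at the target.
flat = [60.0, 60.0, 60.0, 60.0, 60.0]

mixed = average_contour(scoop, flat)
print(mixed)  # [60.0, 59.0, 58.0, 59.0, 60.0] -- a dip present in neither take
```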

Although the above description takes as an example a probability model expressing the transition of the pitch, the same problem can arise with probability models of other feature quantities (for example, power). In view of these circumstances, an object of the present invention is to generate a probability model that faithfully reflects the transitions of a feature quantity in a reference sound, and thereby to generate an aurally natural synthesized sound.

The means adopted by the present invention to solve the above problems will now be described. To facilitate understanding of the present invention, the following description notes in parentheses the correspondence between elements of the present invention and elements of the embodiments described later; this is not intended to limit the scope of the present invention to the exemplified embodiments.

A probability model generation apparatus for sound synthesis according to a first aspect of the present invention comprises: section setting means (for example, a section setting unit 34) that divides a reference sound into unit sections, one per transition type, according to the tendency of variation of a feature quantity (for example, a reference pitch Pref); and probability model generation means (for example, a probability model generation unit 421) that generates, for each transition type, a feature-quantity model (for example, a feature-quantity model QA) indicating a probability distribution of the feature quantity for each of a plurality of states (for example, states St), from the time series of the feature quantity in the unit sections of that transition type within the reference sound. In this configuration, because a feature-quantity model is generated for each transition type of the reference sound, differences in the variation tendency of the feature quantity of the reference sound are faithfully reflected in the feature-quantity models. Compared with a configuration that generates feature-quantity models from the reference sound without regard to differences in transition type, this has the advantage that feature-quantity models capable of producing an aurally natural synthesized sound faithfully reflecting the characteristics of the reference sound can be generated.
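A minimal sketch of this first aspect, not the patent's implementation: unit sections of the reference pitch are grouped by transition type ("B", "S", "E"), each section's frames are divided evenly among a fixed number of states, and a Gaussian (mean, variance) per state is estimated per type. The state count and the training values are invented for illustration.

```python
# Per-transition-type feature models: (type, state) -> (mean, variance).
from collections import defaultdict
from statistics import mean, pvariance

N_STATES = 3  # states per model; value chosen for illustration

def fit_feature_models(sections):
    """sections: list of (transition_type, [pitch frames]) pairs."""
    frames_per_state = defaultdict(list)  # (type, state) -> frames
    for ttype, pitches in sections:
        step = len(pitches) / N_STATES
        for i, p in enumerate(pitches):
            state = min(int(i / step), N_STATES - 1)
            frames_per_state[(ttype, state)].append(p)
    return {key: (mean(v), pvariance(v)) for key, v in frames_per_state.items()}

models = fit_feature_models([
    ("S", [60.0, 60.2, 60.1, 59.9, 60.0, 60.1]),  # sustain: flat around 60
    ("E", [60.0, 59.0, 58.0, 57.0, 56.0, 55.0]),  # end: falling away
])
# models[("S", 0)], models[("E", 2)], ... hold (mean, variance) per state
```

Because the sustain and end sections are fitted separately, the falling end contour no longer contaminates the sustain statistics.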

A probability model generation apparatus for sound synthesis according to a preferred example of the first aspect comprises feature-quantity classification means (for example, a feature-quantity classification unit 423) that classifies the plurality of feature-quantity models generated by the probability model generation means into a plurality of sets, and generates feature-quantity information including a feature-quantity decision tree (for example, a feature-quantity decision tree TA) constructed by the classification and a feature-quantity model (for example, a feature-quantity model MA) generated for each set from the feature-quantity models classified into that set. In this configuration, for each of the plurality of sets into which the feature-quantity models generated by the probability model generation means are classified, a feature-quantity model is generated according to the feature-quantity models in that set, so a feature-quantity model reflecting many feature quantities of the reference sound (that is, one with high statistical validity) can be generated. Further, by applying a designated sound to be synthesized to the feature-quantity decision tree constructed by the classification of the feature-quantity models, an appropriate feature-quantity model can be selected even for a designated sound whose attributes do not exist in the reference sound.
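A toy sketch of the selection step only: a decision tree of yes/no questions about a designated note's attributes routes the note to a leaf model, so even an attribute combination never seen in the reference sound still reaches a leaf. The questions, thresholds and leaf names here are all invented; the patent's tree is built by classifying trained models.

```python
# Route a note's context through a hand-written yes/no decision tree.

TREE = {
    "question": lambda note: note["duration"] >= 0.5,  # long note?
    "yes": {"question": lambda note: note["pitch"] >= 64,  # high note?
            "yes": "model_long_high", "no": "model_long_low"},
    "no": "model_short",
}

def select_model(tree, note):
    """Descend until a leaf (a model name) is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if tree["question"](note) else tree["no"]
    return tree

# An unseen combination of attributes still reaches a leaf:
print(select_model(TREE, {"duration": 0.8, "pitch": 62}))  # model_long_low
```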

A probability model generation apparatus for sound synthesis according to a second aspect of the present invention comprises: section setting means (for example, a section setting unit 34) that divides a reference sound into unit sections, one per transition type, according to the tendency of variation of a feature quantity (for example, a reference pitch Pref); and probability model generation means (for example, a probability model generation unit 421) that generates, for each transition type, a duration model (for example, a duration model QB) indicating a probability distribution of duration for each of a plurality of states (for example, states St), from the time series of the feature quantity in the unit sections of that transition type within the reference sound. In this configuration, because a duration model is generated for each transition type of the reference sound, differences in the variation tendency of the feature quantity of the reference sound are faithfully reflected in the duration models. Compared with a configuration that generates duration models from the reference sound without regard to differences in transition type, it is therefore possible to generate duration models capable of producing an aurally natural synthesized sound faithfully reflecting the characteristics of the reference sound.

A probability model generation apparatus for sound synthesis according to a preferred example of the second aspect comprises duration classification means (for example, a duration classification unit 425) that classifies the plurality of duration models generated by the probability model generation means into a plurality of sets, and generates duration information including a duration decision tree (for example, a duration decision tree TB) constructed by the classification and a duration model (for example, a duration model MB) generated for each set from the duration models classified into that set. In this configuration, for each of the plurality of sets into which the duration models generated by the probability model generation means are classified, a duration model is generated according to the duration models in that set, so a duration model reflecting many feature quantities of the reference sound (that is, one with high statistical validity) can be generated. Further, by applying a designated sound to be synthesized to the duration decision tree constructed by the classification of the duration models, an appropriate duration model can be selected even for a designated sound whose attributes do not exist in the reference sound.

The transition type (the tendency of variation of the feature quantity) in each of the above forms means the behavior of the feature quantity over time, such as rising/falling or changing/holding. For example, a process in which the feature quantity approaches the target value over time from the start of a sound (a beginning portion B), a stationary process in which the feature quantity is kept substantially constant (a sustain portion S), and a process in which the feature quantity departs from the target value over time toward the end of the sound (an end portion E) can be cited as typical examples of transition types.

A probability model generation apparatus for sound synthesis according to a third aspect comprises transition arrangement model generation means (for example, a transition arrangement model generation unit 441) that generates, for each of a plurality of kinds of arrangements of transition types, a transition arrangement model (for example, a transition arrangement model QC) specifying the discrete probability that the arrangement appears within the note interval of each note, from the arrangement of transition types within the note interval corresponding to that note in the reference sound. In this configuration, because a transition arrangement model indicating the appearance probability of each arrangement of transition types within a note interval is generated, an appropriate transition arrangement can be determined for a designated sound to be synthesized, and the probability models (feature-quantity model, duration model) corresponding to each transition type can be selected. It is therefore possible to generate an aurally natural synthesized sound that faithfully reflects the characteristics of the reference sound.
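One simple way such a discrete distribution could be estimated, sketched here under the assumption (not stated in the patent) that relative frequency of observed arrangements is used: for one note context, count each arrangement observed across the reference sound and normalize. The observations are invented.

```python
# Estimate a transition arrangement model as relative frequencies.
from collections import Counter

def estimate_arrangement_model(observed_arrangements):
    """Map each arrangement string to its probability of appearing."""
    counts = Counter(observed_arrangements)
    total = sum(counts.values())
    return {arr: n / total for arr, n in counts.items()}

# Arrangements observed for one note context across the reference sound:
model = estimate_arrangement_model(["B-S-E", "B-S-E", "S-E", "B-S"])
print(model["B-S-E"])  # 0.5
```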

A probability model generation apparatus for sound synthesis according to a preferred example of the third aspect comprises transition arrangement classification means (for example, a transition arrangement classification unit 443) that classifies the plurality of transition arrangement models generated by the transition arrangement model generation means into a plurality of sets, and generates transition arrangement information including a transition arrangement decision tree (for example, a transition arrangement decision tree TC) constructed by the classification and a transition arrangement model (for example, a transition arrangement model MC) generated for each set from the transition arrangement models classified into that set. In this configuration, for each of the plurality of sets into which the transition arrangement models generated by the transition arrangement model generation means are classified, a transition arrangement model is generated according to the transition arrangement models in that set, so a transition arrangement model reflecting many feature quantities of the reference sound (that is, one with high statistical validity) can be generated. Further, by applying a designated sound to be synthesized to the transition arrangement decision tree constructed by the classification of the transition arrangement models, an appropriate transition arrangement model can be selected even for a designated sound whose attributes do not exist in the reference sound.

The present invention is also specified as a feature-quantity trajectory generation apparatus that generates a time series of a feature quantity using the transition arrangement models generated by the probability model generation apparatus for sound synthesis of the third aspect exemplified above. That is, the feature-quantity trajectory generation apparatus of the present invention comprises: storage means (for example, a storage device 14) that stores a plurality of transition arrangement models (for example, transition arrangement models MC), each indicating the probability that each arrangement of transition types, defined according to the tendency of variation of the feature quantity, appears within the note interval of each note; and trajectory generation means (for example, a trajectory generation unit 52) that determines the transition type of each unit section of a designated sound according to the probabilities indicated by the transition arrangement model corresponding to the note of the designated sound, among the plurality of transition arrangement models, and generates a time series of the feature quantity (for example, a synthesized pitch trajectory Psyn) such that the feature quantity in each unit section varies with the tendency corresponding to its transition type. In this configuration, the transition type of each unit section of the designated sound is determined according to the probabilities indicated by the transition arrangement model corresponding to the note of the designated sound, and the feature-quantity time series is generated such that the feature quantity in each unit section varies with the tendency corresponding to its transition type. Compared with, for example, a configuration that generates a feature-quantity time series without regard to differences in transition type, it is therefore possible to determine a feature-quantity trajectory such that an aurally natural synthesized sound faithfully reflecting the characteristics of the reference sound is generated.
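A rough sketch of the trajectory-generation idea: pick the arrangement with the highest probability under the note's transition arrangement model, then emit frames per unit section whose tendency matches the section's transition type. The section shapes, frame counts and depths below are invented stand-ins; in the patent the contour within each section comes from the per-state probability models.

```python
# Choose an arrangement, then render each of its sections with a type-specific
# tendency (rise toward target, hold, fall away).

def choose_arrangement(model):
    """Most probable arrangement under a {arrangement: probability} model."""
    return max(model, key=model.get)

def render_section(ttype, target, n_frames):
    if ttype == "B":   # approach the target pitch from below
        return [target - 2.0 * (1 - i / (n_frames - 1)) for i in range(n_frames)]
    if ttype == "S":   # hold the target pitch
        return [target] * n_frames
    return [target - 2.0 * (i / (n_frames - 1)) for i in range(n_frames)]  # "E"

model = {"B-S-E": 0.6, "S": 0.3, "S-E": 0.1}  # hypothetical model for one note
arrangement = choose_arrangement(model)        # "B-S-E"
trajectory = [f for t in arrangement.split("-") for f in render_section(t, 60.0, 3)]
print(arrangement, trajectory)
```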

The present invention can also be specified as a sound synthesis apparatus (for example, a sound synthesis apparatus 100) using the feature-quantity trajectory generation apparatus described above. The sound synthesis apparatus of the present invention comprises: storage means (for example, a storage device 14) that stores a plurality of transition arrangement models (for example, transition arrangement models MC), each indicating the probability that each arrangement of transition types, defined according to the tendency of variation of a feature quantity, appears within the note interval of each note; trajectory generation means (for example, a trajectory generation unit 52) that determines the transition type of each unit section of a designated sound according to the probabilities indicated by the transition arrangement model corresponding to the note of the designated sound, among the plurality of transition arrangement models, and generates a time series of the feature quantity (for example, a synthesized pitch trajectory Psyn) such that the feature quantity in each unit section varies with the tendency corresponding to its transition type; and synthesis processing means (for example, a synthesis processing unit 54) that generates synthesized sound data (for example, synthesized sound data Vout) by processing sound waveform data (for example, sound waveform data ZA) so that it follows the feature-quantity time series generated by the trajectory generation means.

The apparatuses according to the aspects above (the probability model generation apparatus for sound synthesis, the feature-quantity trajectory generation apparatus, and the sound synthesis apparatus) can be realized by dedicated electronic circuitry such as a DSP (Digital Signal Processor), or by the cooperation of a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) with a program. A program that causes a computer to function as an apparatus according to any of the above aspects may be provided to the user in a form stored on a computer-readable recording medium and installed on the computer, or may be provided from a server apparatus in a form distributed via a communication network and installed on the computer.

FIG. 1 is a block diagram of a sound synthesis apparatus according to an embodiment of the present invention. FIG. 2 is a block diagram of a first processing unit. FIG. 3 is an explanatory diagram illustrating variation of the reference pitch of a reference sound. FIG. 4 is an explanatory diagram illustrating another variation of the reference pitch of a reference sound. FIG. 5 is a block diagram of a synthesis information generation unit. FIG. 6 is an explanatory diagram of a feature-quantity model and a duration model. FIG. 7 is an explanatory diagram of a feature-quantity decision tree. FIG. 8 is an explanatory diagram of a duration decision tree. FIG. 9 is an explanatory diagram of a transition arrangement model. FIG. 10 is an explanatory diagram of a transition arrangement decision tree. FIG. 11 is a block diagram of a second processing unit. FIG. 12 is an explanatory diagram of the operation of a trajectory generation unit. FIG. 13 is an explanatory diagram of a problem with probability model generation in the background art.

<A: Embodiment>
FIG. 1 is a block diagram of a sound synthesis apparatus 100 according to one embodiment of the present invention. The sound synthesis apparatus 100 of FIG. 1 is a singing synthesis apparatus that generates synthesized sound data Vout representing the singing of a piece of music with the desired notes and lyrics, and, as shown in FIG. 1, it is realized by a computer system comprising an arithmetic processing device 12, a storage device 14 and an input device 16. The input device 16 (for example, a mouse or keyboard) receives instructions from the user.

The storage device 14 stores a program PGM executed by the arithmetic processing device 12 and various data used by the arithmetic processing device 12 (reference information X, synthesis information Y, sound waveform information Z, and score data SC). A known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of a plurality of kinds of recording media, may be used as desired as the storage device 14.

The reference information X consists of reference sound data XA and score data XB, and is used for the generation (learning) of the synthesis information Y. The reference sound data XA is a sample sequence expressing the time-domain waveform of a voice (the reference sound) in which a specific singer (hereinafter "reference singer") sang a piece of music. The score data XB expresses the musical score of the piece represented by the reference sound data XA; that is, the score data XB specifies the notes (pitch name, duration) and lyrics (phonetic characters) of the reference sound in time series.

The synthesis information Y is generated from the reference information X for each reference singer (or for each genre of music sung by the reference singer), and is used to specify a time series (trajectory) of a feature quantity characteristic of the reference singer's singing. In the present embodiment, pitch (fundamental frequency) is assumed as the feature quantity specified from the synthesis information Y. The generation of the synthesis information Y using the reference information X is described later.

The sound waveform information Z comprises a plurality of pieces of sound waveform data ZA. Each piece of sound waveform data ZA is generated in advance for each speech unit uttered by the reference singer, and expresses the waveform characteristics of the speech unit (for example, the time-domain waveform or the shape of the frequency spectrum). A speech unit is a phoneme, the smallest unit that can be distinguished aurally, or a phoneme chain in which a plurality of phonemes are concatenated.

The score data SC specifies the notes (pitch name, duration) and lyrics (phonetic characters) of each designated sound to be synthesized in time series. The score data SC is generated according to instructions from the user via the input device 16 (instructions to add or edit each designated sound). In outline, the synthesized sound data Vout is generated by processing the pitch of the sound waveform data ZA corresponding to the notes and lyrics of each designated sound specified by the score data SC so that it follows a time series of pitches generated according to the synthesis information Y (hereinafter "synthesized pitch trajectory"). That is, a singing expression (pitch variation) characteristic of the reference singer is imparted to the synthesized sound represented by the synthesized sound data Vout.

By executing the program PGM stored in the storage device 14, the arithmetic processing device 12 of FIG. 1 realizes a plurality of functions (a first processing unit 21 and a second processing unit 22) necessary for generating the synthesized sound data Vout (voice synthesis). The first processing unit 21 generates the synthesis information Y using the reference information X, and the second processing unit 22 generates the synthesized sound data Vout using the synthesis information Y, the sound waveform information Z and the score data SC. A configuration in which each function of the arithmetic processing device 12 is realized by a dedicated electronic circuit (DSP), or a configuration in which the functions of the arithmetic processing device 12 are distributed over a plurality of integrated circuits, may also be adopted. The configurations and operations of the first processing unit 21 and the second processing unit 22 are described in turn below.

(1) First processing unit 21
FIG. 2 is a block diagram of the first processing unit 21. As shown in FIG. 2, the first processing unit 21 comprises a feature-quantity extraction unit 32, a section setting unit 34, and a synthesis information generation unit 36. The feature-quantity extraction unit 32 sequentially detects the pitch (hereinafter "reference pitch") Pref of the reference sound represented by the reference sound data XA. Any known technique may be used to detect the reference pitch Pref. For sections of the reference sound in which no harmonic structure exists (for example, consonant sections in which no pitch is detected), the reference pitch Pref is set to a predetermined value (for example, a value interpolated from the preceding and following reference pitches Pref). FIG. 3 shows, on a common time axis, the time series of the reference pitch Pref detected by the feature-quantity extraction unit 32 and the time series of the sounds (V1, V2, ...) specified by the score data XB.

The section setting unit 34 of FIG. 2 divides the reference sound represented by the reference sound data XA (the time series of the reference pitch Pref) into a plurality of unit sections μ on the time axis. As shown in FIG. 2, the section setting unit 34 of the present embodiment comprises a first section setting unit 341, a second section setting unit 343, and an identification information setting unit 345. As shown in FIG. 3, the first section setting unit 341 divides the time series of the reference pitch Pref detected by the feature-quantity extraction unit 32 into sections σ, one per note (hereinafter "note intervals"). The score data XB of the reference information X is used to set each note interval σ; that is, the first section setting unit 341 divides the time series of the reference pitch Pref into a plurality of note intervals σ bounded by the start and end points of each sound (V1, V2, ...) specified note by note by the score data XB.

The second section setting unit 343 in FIG. 2 divides each note section σ of the time series of the reference pitch Pref into unit sections μ for each transition type. A transition type is a classification according to the tendency of variation of the reference pitch Pref. In this embodiment, as shown in FIG. 3, three transition types are used as examples: a start part B (Beginning), a steady part S (Sustain), and an end part E (End). The start part B is a section immediately after the onset of a note in which the reference pitch Pref varies (for example, rises) so as to approach the pitch of that note; the steady part S is a section during the sounding of a note in which the reference pitch Pref is maintained substantially constant at the pitch of that note; and the end part E is a section immediately before the sounding of a note ends in which the reference pitch Pref deviates (for example, falls) from the pitch of that note.

One or more transition types appear in each note section σ. Moreover, the temporal relationship is fixed: the start part B precedes the steady part S and the end part E, and the end part E follows the start part B and the steady part S. Accordingly, the transition-type arrangement patterns (hereinafter referred to as "transition arrays") that can appear within one note section σ are limited to seven: "B-S-E", "B-S", "S-E", "B-E", "B", "S", and "E". For example, the note section σ corresponding to the designated sound V1 in FIG. 3 is divided into a unit section μ of the steady part S and a unit section μ of the end part E (transition array "S-E"); the note section σ of the designated sound V2 is divided into a unit section μ of the start part B, a unit section μ of the steady part S, and a unit section μ of the end part E (transition array "B-S-E"); and the note section σ of the designated sound V3 is set as a single unit section μ of the steady part S (transition array "S").
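Because the order B → S → E is fixed, the seven admissible transition arrays are exactly the non-empty order-preserving selections of the three transition types. A minimal Python sketch (names are ours) enumerates them:

```python
from itertools import combinations

TRANSITION_TYPES = ["B", "S", "E"]  # fixed temporal order: B before S before E

def valid_transition_arrays():
    """Enumerate every non-empty selection of B/S/E that preserves their
    temporal order; combinations() never reorders, so exactly 7 arrays result."""
    return ["-".join(combo)
            for r in range(1, len(TRANSITION_TYPES) + 1)
            for combo in combinations(TRANSITION_TYPES, r)]
```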

As described above, each note section σ is divided into unit sections μ for each transition type. Therefore, even when the note sequence of the reference sound (that is, the notes of the designated sounds in the score data XB) is the same, the way a note section σ is divided (the number and time lengths of the unit sections μ) changes according to how the reference pitch Pref varies (the transition types). For example, when the reference pitch Pref temporarily falls around the boundary between the designated sound V1 and the designated sound V2 as illustrated in FIG. 3 (that is, when the singing expression known as "shakuri" (scooping) is applied to the reference sound), the note section σ corresponding to the designated sound V1 is, as described above, divided into two unit sections μ corresponding to the steady part S and the end part E, and the note section σ corresponding to the designated sound V2 is divided into three unit sections μ corresponding to the start part B, the steady part S, and the end part E. On the other hand, when the reference pitch Pref does not vary around the boundary between the designated sound V1 and the designated sound V2 as shown in FIG. 4, the note section σ corresponding to the designated sound V1 is set as one unit section μ of the steady part S (transition array "S"), and the note section σ corresponding to the designated sound V2 is divided into two unit sections μ corresponding to the steady part S and the end part E (transition array "S-E").

Each unit section μ is variably set according to instructions from the user. For example, the user designates each unit section μ by operating the input device 16 as appropriate while estimating the transition type at each point in time by visually checking the time series of the reference pitch Pref displayed on a display device (not shown) (for example, the time variation of the reference pitch Pref illustrated in FIG. 3) and by listening to the reference sound reproduced from a sound emitting device (for example, a speaker). The second section setting unit 343 sets each unit section μ in accordance with the user's instructions to the input device 16.

The identification information setting unit 345 in FIG. 2 sets identification information A for each unit section μ divided by the second section setting unit 343. The identification information A is an identifier (label) indicating the attributes of the unit section μ and, as shown in FIG. 3, includes a note attribute a1 and a transition type a2. The transition type a2 designates the transition type of the unit section μ (one of the start part B, the steady part S, and the end part E). The transition type a2 is designated by the user, for example by operating the input device 16 when the unit section μ is set.

The note attribute a1 is information indicating the attributes of the note corresponding to the unit section μ (hereinafter referred to as the "target note"), and includes variables p1 to p3 and variables d1 to d3. The variable p2 is set to the note name (note number) of the target note. The variable p1 is set to the interval of the note immediately before the target note (relative to the target note), and the variable p3 is set to the interval of the note immediately after the target note. The variable d2 is set to the duration of the target note; the variable d1 is set to the duration of the note immediately before the target note, and the variable d3 is set to the duration of the note immediately after the target note. Each variable (p1 to p3, d1 to d3) of the note attribute a1 is specified from the score data XB. As understood from the above description, a plurality of unit sections μ that share the same musical conditions share the same identification information A. The content of the note attribute a1 is not limited to the above examples. For example, any information that affects the time series of the pitch can be designated in the note attribute a1, such as information indicating on which beat of each measure the target note falls (first beat / second beat), or information indicating the position of the target note (earlier / later) within the period corresponding to one breath of the reference sound.
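For illustration, the note attribute a1 could be represented as follows; the data layout, the tuple-based score format, and the helper name are our assumptions (the patent specifies only the variables p1 to p3 and d1 to d3):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class NoteAttribute:
    """Note attribute a1. Field meanings follow the text; the class itself
    is an illustrative assumption."""
    p1: Optional[int]    # interval of the preceding note relative to the target
    p2: int              # note name (note number) of the target note
    p3: Optional[int]    # interval of the following note relative to the target
    d1: Optional[float]  # duration of the preceding note
    d2: float            # duration of the target note
    d3: Optional[float]  # duration of the following note

def note_attribute(notes, i):
    """Build a1 for the i-th note; notes = [(note_number, duration), ...]."""
    prev_note = notes[i - 1] if i > 0 else None
    next_note = notes[i + 1] if i + 1 < len(notes) else None
    return NoteAttribute(
        p1=prev_note[0] - notes[i][0] if prev_note else None,
        p2=notes[i][0],
        p3=next_note[0] - notes[i][0] if next_note else None,
        d1=prev_note[1] if prev_note else None,
        d2=notes[i][1],
        d3=next_note[1] if next_note else None,
    )
```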

The synthesis information generation unit 36 in FIG. 2 generates the synthesis information Y using the time series of the reference pitch Pref for each unit section μ set by the section setting unit 34 (the second section setting unit 343). FIG. 5 is a block diagram of the synthesis information generation unit 36. As shown in FIG. 5, the synthesis information generation unit 36 includes a first information generation unit 42 that generates feature amount information YA and duration information YB, and a second information generation unit 44 that generates transition array information YC. The feature amount information YA, the duration information YB, and the transition array information YC are stored in the storage device 14 as the synthesis information Y in FIG. 1.

As shown in FIG. 5, the first information generation unit 42 includes a probability model generation unit 421, a feature amount classification unit 423, and a duration classification unit 425. The probability model generation unit 421 generates, for each piece of identification information A (for each combination of note attribute a1 and transition type a2), a probability model Q expressing the probability of appearance of the pitch P within one unit section μ corresponding to the transition type. In this embodiment, as shown in FIG. 6, an HSMM (Hidden Semi-Markov Model) defined by a plurality of states St (three in the example of FIG. 6) is used as the probability model Q. The probability model Q includes a feature amount model QA and a duration model QB. The feature amount model QA defines, for each state St, the probability distribution (output distribution) of the pitch P and its temporal change (derivative) ΔP within the unit section μ, and the duration model QB defines, for each state St, the probability distribution (duration distribution) of the duration D within the unit section μ. A configuration in which the feature amount model QA also defines, for each state St, the probability distribution of the second-order derivative of the pitch P is likewise suitable.

The probability model generation unit 421 in FIG. 5 generates the probability model Q corresponding to each piece of identification information A by executing a learning process (a maximum-likelihood estimation algorithm) on the time series of the reference pitch Pref within the unit sections μ sharing that identification information A. Specifically, the probability model Q is generated so that the time series of the reference pitch Pref within each unit section μ appears with maximum probability. The probability model Q is generated for each piece of identification information A. That is, even when a plurality of unit sections μ share the same note attribute a1, a separate probability model Q is generated for each transition type a2 if the transition type a2 differs between the unit sections μ.
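The learning process itself is not detailed here; a real HSMM would infer the state alignment, for example by an EM-style algorithm. The following deliberately simplified Python sketch conveys the idea by splitting each unit section evenly across the states and fitting a Gaussian per state to the pitch P, its change ΔP, and the state duration D; the function name, the even split, and the Gaussian form are all our assumptions:

```python
import numpy as np

def fit_model_q_simplified(pitch_sections, n_states=3):
    """Simplified stand-in for maximum-likelihood learning of model Q:
    each unit section's reference-pitch sequence is split evenly across the
    states (a real HSMM would infer this alignment), then per state a Gaussian
    is fitted over P and ΔP (feature amount model QA) and over the state
    duration D (duration model QB)."""
    pitches = [[] for _ in range(n_states)]
    deltas = [[] for _ in range(n_states)]
    durations = [[] for _ in range(n_states)]
    for pref in pitch_sections:                 # pref: pitches of one section
        pref = np.asarray(pref, dtype=float)
        delta = np.diff(pref, prepend=pref[0])  # ΔP, first-order change
        for st, (p_chunk, d_chunk) in enumerate(
                zip(np.array_split(pref, n_states),
                    np.array_split(delta, n_states))):
            pitches[st].extend(p_chunk)
            deltas[st].extend(d_chunk)
            durations[st].append(len(p_chunk))
    return [{"pitch": (np.mean(pitches[st]), np.var(pitches[st])),
             "delta": (np.mean(deltas[st]), np.var(deltas[st])),
             "dur":   (np.mean(durations[st]), np.var(durations[st]))}
            for st in range(n_states)]
```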

The feature amount classification unit 423 in FIG. 5 classifies the feature amount models QA generated by the probability model generation unit 421 for each piece of identification information A into a plurality of sets (fewer than the total number of probability models Q). Any known machine learning technique may be employed for the classification (clustering) of the feature amount models QA, but the decision tree learning exemplified below is preferable.

The feature amount classification unit 423 constructs the decision tree TA in FIG. 7 (hereinafter referred to as the "feature amount decision tree") by sequentially determining, for each feature amount model QA, whether predetermined conditions related to the identification information A are satisfied. As shown in FIG. 7, the feature amount decision tree TA is a classification tree composed of a root node serving as the starting point of the classification, a plurality of intermediate nodes corresponding to the determination of the respective conditions, and KA leaf nodes corresponding to the sets into which the feature amount models QA are finally classified. At the root node and each intermediate node, conditions are evaluated such as whether the duration d2 of the target note exceeds a threshold, or whether the interval p1 between the target note and the immediately preceding note (or the interval p3 between the target note and the immediately following note) exceeds a threshold. The point at which the classification of the feature amount models QA is stopped (the point at which the feature amount decision tree TA is finalized) is determined according to, for example, the minimum description length (MDL) criterion.

For each of the KA leaf nodes of the feature amount decision tree TA, the feature amount classification unit 423 generates one feature amount model MA according to the plurality of feature amount models QA classified into that leaf node. Specifically, the feature amounts (pitches P) applied to the generation (learning) of the feature amount models QA of one leaf node are used collectively to re-estimate one new feature amount model MA corresponding to that leaf node. For example, a weighted sum of the feature amount models QA classified into each leaf node is generated as the feature amount model MA. As shown in FIG. 5, the feature amount classification unit 423 stores, in the storage device 14, feature amount information YA including the feature amount decision tree TA generated by the above method and the KA feature amount models MA.
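The exact form of this "weighted sum" of models is not specified. One plausible reading, for Gaussian output distributions, is moment matching of the per-leaf Gaussians, as in the following hedged Python sketch (the combination rule and function name are our assumptions):

```python
def merge_gaussians(models, weights):
    """Combine several Gaussians N(mu_i, var_i) into one Gaussian by moment
    matching - one plausible reading of the 'weighted sum' of the models
    classified into a leaf node. models = [(mu, var), ...]; weights are
    assumed to sum to 1."""
    mu = sum(w * m for w, (m, v) in zip(weights, models))
    var = sum(w * (v + m * m) for w, (m, v) in zip(weights, models)) - mu * mu
    return mu, var
```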

Like the feature amount classification unit 423, the duration classification unit 425 in FIG. 5 classifies the duration models QB generated by the probability model generation unit 421 for each piece of identification information A into a plurality of sets by decision tree learning. That is, the duration classification unit 425 constructs the decision tree TB in FIG. 8 (hereinafter referred to as the "duration decision tree") by sequentially determining, for each duration model QB, whether predetermined conditions related to the identification information A are satisfied. The conditions evaluated when constructing the duration decision tree TB and the criterion for stopping its construction are the same as for the feature amount decision tree TA. For each of the KB leaf nodes of the finalized duration decision tree TB, the duration classification unit 425 generates one duration model MB according to the plurality of duration models QB classified into that leaf node (for example, a weighted sum of those duration models QB). As shown in FIG. 5, the duration classification unit 425 then stores, in the storage device 14, duration information YB including the duration decision tree TB and the KB duration models MB.

The second information generation unit 44 in FIG. 5 includes a transition array model generation unit 441 and a transition array classification unit 443. The transition array model generation unit 441 generates a transition array model QC for each note attribute a1 in the identification information A (that is, for each group of note sections σ sharing the same note attribute a1). As illustrated in FIG. 9, the transition array model QC corresponding to each note attribute a1 is a probability model indicating the probability (discrete probability) that each of the seven transition arrays ("B-S-E", "B-S", "S-E", "B-E", "B", "S", "E") appears within a note section σ of that note attribute a1.

The transition array model QC of a note attribute a1 is generated according to the frequency with which each transition array appears in the note sections σ sharing that note attribute a1. That is, in the transition array model QC of each note attribute a1, a larger appearance probability is set for a transition array that appears more often in the note sections σ of that note attribute a1. For example, if, of two note sections σ sharing a note attribute a1, the transition array of one note section σ is "B-S-E" as for the designated sound V2 in FIG. 3 and the transition array of the other note section σ is "S-E" as for the designated sound V2 in FIG. 4, then in the transition array model QC of that note attribute a1 the appearance probabilities of the transition array "B-S-E" and the transition array "S-E" are each set to 0.5, and the appearance probabilities of the other transition arrays are set to 0.
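The worked example above (probabilities 0.5, 0.5, and 0 for the remaining arrays) corresponds to a simple relative-frequency estimate, which can be sketched as follows (function name is ours):

```python
from collections import Counter

TRANSITION_ARRAYS = ["B-S-E", "B-S", "S-E", "B-E", "B", "S", "E"]

def transition_array_model(observed_arrays):
    """Relative-frequency estimate of the discrete distribution over the seven
    transition arrays, from the arrays observed in the note sections sharing
    one note attribute a1 (observed_arrays must be non-empty)."""
    counts = Counter(observed_arrays)
    total = sum(counts.values())
    return {a: counts.get(a, 0) / total for a in TRANSITION_ARRAYS}
```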

The transition array classification unit 443 in FIG. 5 classifies the transition array models QC generated by the transition array model generation unit 441 for each note attribute a1 (for each note section σ) into a plurality of sets (fewer than the total number of transition array models QC). Any known machine learning technique may be employed for the classification of the transition array models QC, but, as with the classification by the feature amount classification unit 423 and the duration classification unit 425, the decision tree learning described below is preferable.

The transition array classification unit 443 constructs the decision tree TC in FIG. 10 (hereinafter referred to as the "transition array decision tree") by sequentially determining, for each transition array model QC, whether predetermined conditions related to the identification information A are satisfied. Like the feature amount decision tree TA and the duration decision tree TB described above, the transition array decision tree TC is a classification tree composed of a root node, a plurality of intermediate nodes, and KC leaf nodes corresponding to the sets into which the transition array models QC are finally classified.

The transition array within a note section σ is affected by the duration d2 of the target note, the intervals (p1, p3) between the target note and the preceding and following notes, and the like. For example, the total number of transition types within a note section σ tends to increase as the duration d2 of the target note becomes longer, and also as the interval (pitch difference) between the target note and the preceding or following note becomes larger. In consideration of these tendencies, the transition array classification unit 443, like the feature amount classification unit 423 and the duration classification unit 425, determines at the root node and each intermediate node whether various conditions are satisfied, such as whether the duration d2 of the target note exceeds a threshold, or whether the interval p1 between the target note and the immediately preceding note (or the interval p3 between the target note and the immediately following note) exceeds a threshold. The minimum description length (MDL) criterion, for example, is suitably applied to determine when to stop the classification of the transition array models QC (when to finalize the transition array decision tree TC).

For each of the KC leaf nodes of the transition array decision tree TC, the transition array classification unit 443 generates one transition array model MC according to the plurality of transition array models QC classified into that leaf node. For example, a weighted sum of the transition array models QC classified into each leaf node is generated as the transition array model MC. As shown in FIG. 5, the transition array classification unit 443 stores, in the storage device 14, transition array information YC including the transition array decision tree TC generated by the above method and the KC transition array models MC. The above is the configuration and operation of the first processing unit 21.

(2) Second processing unit 22
FIG. 11 is a block diagram of the second processing unit 22, which generates the synthesized sound data Vout. As shown in FIG. 11, the second processing unit 22 includes a trajectory generation unit 52 and a synthesis processing unit 54. The trajectory generation unit 52 generates, from the synthesis information Y, the time series of the pitch of each designated sound designated by the score data SC (the synthesized pitch trajectory Psyn). The synthesis processing unit 54 generates synthesized sound data Vout of a singing sound whose pitch varies over time so as to follow the synthesized pitch trajectory Psyn generated by the trajectory generation unit 52. Specifically, the synthesis processing unit 54 acquires, from the storage device 14, the sound waveform data ZA corresponding to the lyrics of each designated sound indicated by the score data SC, and generates the synthesized sound data Vout by processing the sound waveform data ZA so that the pitch varies over time along the synthesized pitch trajectory Psyn. The reproduced sound of the synthesized sound data Vout is therefore a singing sound to which the singing expression (pitch trajectory) peculiar to the reference singer who uttered the reference sound has been added.

FIG. 12 is an explanatory diagram of the operation of the trajectory generation unit 52. The processing of FIG. 12 is started in response to a predetermined operation on the input device 16 (an instruction to start generating the synthesized sound) and is executed sequentially for each designated sound of the score data SC.

When the processing of FIG. 12 starts, the trajectory generation unit 52 determines the note attribute a1 (variables p1 to p3 and variables d1 to d3) of the designated sound by referring to the score data SC (S11). The trajectory generation unit 52 then selects, from among the KC transition array models MC in the transition array information YC of the synthesis information Y stored in the storage device 14, the one transition array model MC corresponding to the note attribute a1 of the designated sound (S12). The transition array decision tree TC in the transition array information YC is used for selecting the transition array model MC. That is, the trajectory generation unit 52 applies the note attribute a1 of the designated sound to the transition array decision tree TC (sequentially determines, for the note attribute a1 of the designated sound, whether the conditions at the root node and each intermediate node of the transition array decision tree TC are satisfied), thereby identifying the leaf node (set) to which the designated sound should belong, and selects the transition array model MC corresponding to that leaf node from the transition array information YC. In other words, the transition array model MC generated from the note sections σ of the reference sound whose note attribute a1 is similar to the note attribute a1 of the designated sound is selected.
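The decision-tree selection in step S12 (and likewise in steps S14 and S15) amounts to walking a binary tree of yes/no conditions on the note attribute. A minimal Python sketch follows, with an illustrative two-level tree; the dict-based node layout and the thresholds are our assumptions:

```python
def select_leaf(tree, attr):
    """Walk a binary decision tree: each internal node evaluates one yes/no
    condition on the note attribute a1; each leaf holds the index of the
    model (MA, MB, or MC) associated with that set."""
    node = tree
    while "leaf" not in node:
        node = node["yes"] if node["test"](attr) else node["no"]
    return node["leaf"]

# Illustrative tree: split first on the target note's duration d2, then on
# the interval p1 to the preceding note (thresholds are arbitrary).
example_tree = {
    "test": lambda a: a["d2"] > 0.5,
    "yes": {"leaf": 0},
    "no": {
        "test": lambda a: abs(a["p1"]) > 2,
        "yes": {"leaf": 1},
        "no": {"leaf": 2},
    },
}
```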

The trajectory generation unit 52 determines the transition array of the designated sound according to the transition array model MC selected in step S12 (S13). Specifically, the trajectory generation unit 52 determines the transition array of the designated sound so that, over the designated sounds sharing the note attribute a1 set in step S11, the probability with which each transition array is determined in step S13 approximates the appearance probability defined for that transition array by the transition array model MC. That is, a transition array with a higher appearance probability under the transition array model MC is selected with higher probability as the transition array of the designated sound. The trajectory generation unit 52 then sets a unit section μ of the designated sound for each transition type constituting the transition array determined in step S13. For example, when the transition array selected in step S13 is "B-S-E", three unit sections μ corresponding to the respective transition types are set for the one designated sound. For each unit section μ of the designated sound, the trajectory generation unit 52 sets identification information A including the note attribute a1 of the designated sound and the transition type a2 determined for that unit section μ.
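Step S13 can be read as drawing a transition array at random according to the discrete distribution MC, which the following hedged Python sketch illustrates (names are ours):

```python
import random

def choose_transition_array(model_mc, rng=None):
    """Draw one transition array from the discrete distribution MC so that,
    over many designated sounds sharing a note attribute a1, the relative
    frequencies of the chosen arrays approach the probabilities in MC.
    model_mc maps each transition array to its appearance probability."""
    rng = rng or random.Random()
    arrays = list(model_mc)
    weights = [model_mc[a] for a in arrays]
    return rng.choices(arrays, weights=weights, k=1)[0]
```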

The trajectory generation unit 52 selects, for each unit section μ of the designated sound, the feature amount model MA corresponding to the identification information A of the designated sound from among the KA feature amount models MA in the feature amount information YA (S14). Specifically, the trajectory generation unit 52 identifies the leaf node (set) to which the identification information A of the designated sound should belong by applying the identification information A of the designated sound to the feature amount decision tree TA of the feature amount information YA, and selects the one feature amount model MA corresponding to that leaf node from the feature amount information YA. Similarly, by applying the identification information A of the designated sound to the duration decision tree TB of the duration information YB, the trajectory generation unit 52 selects, for each unit section μ of the designated sound, the one duration model MB corresponding to the identification information A of the designated sound from among the KB duration models MB in the duration information YB (S15).

The trajectory generation unit 52 then generates the synthesized pitch trajectory Psyn within each unit section μ of the designated sound using the feature amount model MA selected in step S14 and the duration model MB selected in step S15 (S16). Specifically, the duration D of each state St within the unit section μ is determined according to the duration model MB, and the synthesized pitch trajectory Psyn for each unit section μ is generated so as to maximize the joint probability under the probability distribution of the pitch P and the probability distribution of the temporal change ΔP defined by the feature amount model MA. The synthesized pitch trajectory Psyn of the designated sound is then generated by concatenating, on the time axis, the synthesized pitch trajectories Psyn generated for the respective unit sections μ by the above procedure.
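Step S16 resembles the parameter-generation problem known from HMM-based speech synthesis: with Gaussian distributions over P and ΔP, maximizing the joint probability reduces to a weighted least-squares problem solvable as a linear system. The following Python sketch illustrates this under that Gaussian assumption; the patent does not fix the distributions or the solver, and all names are ours:

```python
import numpy as np

def generate_trajectory(state_params, durations):
    """Generate the pitch curve of one unit section. Each state contributes
    Gaussian targets for the pitch P (mu_p, var_p) and its change ΔP
    (mu_d, var_d); the per-state durations would come from the duration model
    MB (e.g. its means). Maximizing the joint Gaussian likelihood of P and ΔP
    is solved as a linear system (weighted least squares).
    state_params: list of (mu_p, var_p, mu_d, var_d), one tuple per state."""
    mu_p, var_p, mu_d, var_d = [], [], [], []
    for (pm, pv, dm, dv), dur in zip(state_params, durations):
        mu_p += [pm] * dur; var_p += [pv] * dur
        mu_d += [dm] * dur; var_d += [dv] * dur
    T = len(mu_p)
    A = np.zeros((T, T)); b = np.zeros(T)
    for t in range(T):
        A[t, t] += 1.0 / var_p[t]          # static (pitch) term
        b[t] += mu_p[t] / var_p[t]
        if t > 0:                          # dynamic term: ΔP_t = P_t - P_(t-1)
            w = 1.0 / var_d[t]
            A[t, t] += w; A[t - 1, t - 1] += w
            A[t, t - 1] -= w; A[t - 1, t] -= w
            b[t] += mu_d[t] * w; b[t - 1] -= mu_d[t] * w
    return np.linalg.solve(A, b)
```

With a single state whose pitch target is constant and whose ΔP target is zero, the solution is the flat curve at the pitch mean, as expected.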

以上の形態では、参照音を遷移種別に応じて区分した単位区間μ内の参照ピッチPrefの時系列を利用して遷移種別毎に特徴量モデルQAおよび継続長モデルQBが生成されるから、参照音の音符(音符属性a1)が共通する場合でも、参照ピッチPrefの変動の相違が特徴量モデルQAや継続長モデルQB(さらには特徴量モデルMAや継続長モデルMB)に忠実に反映される。また、楽譜データSC内の指定音の各単位区間μの遷移種別が遷移配列モデルMCに応じて決定され、当該遷移種別に対応する特徴量モデルQAおよび継続長モデルQBに応じた合成ピッチ軌跡Psynが指定音の単位区間μ毎に生成される。したがって、参照音の参照ピッチPrefの変動の相違が忠実に反映されない確率モデルを利用する場合と比較して、参照歌唱者に特有の表現を忠実に反映した合成音を、聴感的な自然性を維持しながら生成することが可能である。   In the above embodiment, the feature amount model QA and the duration model QB are generated for each transition type using the time series of the reference pitch Pref in the unit sections μ into which the reference sound is divided according to transition type. Therefore, even when reference sounds share a common note (note attribute a1), differences in the variation of the reference pitch Pref are faithfully reflected in the feature amount model QA and the duration model QB (and hence in the feature amount model MA and the duration model MB). Further, the transition type of each unit section μ of the designated sound in the musical score data SC is determined according to the transition array model MC, and the synthesized pitch trajectory Psyn corresponding to the feature amount model QA and the duration model QB for that transition type is generated for each unit section μ of the designated sound. Therefore, compared with the case of using a probability model that does not faithfully reflect differences in the variation of the reference pitch Pref of the reference sound, a synthesized sound that faithfully reflects expression peculiar to the reference singer can be generated while maintaining perceptual naturalness.

<B:変形例>
以上の実施形態は多様に変形され得る。具体的な変形の態様を以下に例示する。以下の例示から任意に選択された2以上の態様は適宜に併合され得る。
<B: Modification>
The above embodiment can be variously modified. Specific modifications are exemplified below. Two or more aspects arbitrarily selected from the following examples can be appropriately combined.

(1)変形例1
参照音を音符区間σや単位区間μに区分する方法は適宜に変更される。例えば、前述の実施形態では楽譜データXBに応じて参照音を各音符区間σに区分したが、利用者からの指示に応じて各音符区間σを設定する構成も採用され得る。例えば、利用者は、表示装置に表示される参照音の波形を視認するとともに放音装置から再生される参照音を聴取することで各音符の境界を推定しながら、入力装置16を適宜に操作して各音符区間σを指定する。第1区間設定部341は、利用者からの指示に応じて各音符区間σを設定する。利用者が各音符区間σを指定する構成では、楽譜データXBは省略され得る。
(1) Modification 1
The method of dividing the reference sound into note intervals σ and unit sections μ may be changed as appropriate. For example, in the above embodiment the reference sound is divided into note intervals σ according to the score data XB, but a configuration in which each note interval σ is set according to an instruction from the user may also be employed. For example, the user designates each note interval σ by appropriately operating the input device 16 while estimating the boundary of each note by visually checking the waveform of the reference sound displayed on the display device and listening to the reference sound reproduced from the sound emitting device. The first section setting unit 341 then sets each note interval σ according to the instruction from the user. In a configuration in which the user designates each note interval σ, the score data XB may be omitted.

また、前述の実施形態では利用者からの指示に応じて参照音の各単位区間μを設定したが、第2区間設定部343が参照音データXAに応じて自動的に(すなわち利用者からの指示を必要とせずに)各単位区間μを設定する構成も採用され得る。例えば、第2区間設定部343は、音符区間σの始点の直後で参照ピッチPrefが変動する区間を開始部Bの単位区間μとして設定する。同様に、参照ピッチPrefが略一定に維持される区間が定常部Sの単位区間μに設定され、音符区間σの終点にかけて参照ピッチPrefが変動する区間が終了部Eの単位区間μに設定される。   In the above embodiment, each unit section μ of the reference sound is set according to an instruction from the user; however, a configuration in which the second section setting unit 343 sets each unit section μ automatically (that is, without requiring an instruction from the user) according to the reference sound data XA may also be employed. For example, the second section setting unit 343 sets a section in which the reference pitch Pref varies immediately after the start point of the note interval σ as the unit section μ of the start part B. Similarly, a section in which the reference pitch Pref is maintained substantially constant is set as the unit section μ of the steady part S, and a section in which the reference pitch Pref varies toward the end point of the note interval σ is set as the unit section μ of the end part E.
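A minimal sketch of such automatic segmentation is shown below, assuming a simple frame-to-frame threshold on the pitch change. The 20-cent threshold, the cents unit, and the single start/steady/end layout per note are illustrative assumptions for this sketch, not values prescribed by the patent.

```python
import numpy as np

def segment_note(pitch, flat_cents=20.0):
    """Split one note's pitch contour into start (B) / steady (S) / end (E).

    A transition between adjacent frames counts as "steady" when the pitch
    change stays within flat_cents. This sketch assumes a single steady run
    per note, approximated by the span between the first and last steady
    transitions. Returns half-open (begin, end) frame ranges, or None.
    """
    dp = np.abs(np.diff(pitch))          # frame-to-frame change, length T-1
    steady = dp <= flat_cents
    idx = np.flatnonzero(steady)
    T = len(pitch)
    if idx.size == 0:                    # pitch never settles: whole note is B
        return {"B": (0, T), "S": None, "E": None}
    s_begin, s_end = idx[0], idx[-1] + 2  # frames spanned by steady transitions
    return {"B": (0, s_begin) if s_begin > 0 else None,
            "S": (s_begin, s_end),
            "E": (s_end, T) if s_end < T else None}

# Toy contour: a rise into the note (B), a held pitch (S), a fall at the end (E)
pitch = np.concatenate([np.linspace(0.0, 100.0, 5),
                        np.full(6, 100.0),
                        np.array([70.0, 40.0, 10.0])])
seg = segment_note(pitch)
```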

(2)変形例2
前述の実施形態では、遷移配列モデルQCの分類の結果に応じてKC個の遷移配列モデルMCを生成したが、未分類の遷移配列モデルQCを遷移配列情報YCとして指定音の合成に適用する構成(以下「構成A」という)も採用され得る。指定音に指示された音符に対応する遷移配列モデルQCを利用して指定音の遷移配列が決定される。構成Aでは遷移配列分類部443(遷移配列モデルMCや遷移配列決定木TC)が省略されるから、第1処理部21の構成が簡素化されるという利点がある。
(2) Modification 2
In the above embodiment, KC transition array models MC are generated according to the result of classifying the transition array models QC; however, a configuration in which the unclassified transition array models QC are applied, as the transition array information YC, to the synthesis of the designated sound (hereinafter referred to as "Configuration A") may also be employed. The transition array of the designated sound is determined using the transition array model QC corresponding to the note designated for the designated sound. In Configuration A, the transition array classification unit 443 (and the transition array models MC and the transition array decision tree TC) is omitted, which has the advantage of simplifying the configuration of the first processing unit 21.

ただし、構成Aでは、1個の遷移配列モデルQCの生成に利用される参照ピッチPrefの個数が不足するため、遷移配列モデルQCの統計的な妥当性を担保することが困難となる。また、全種類の音符属性a1について遷移配列モデルQCを用意することが現実的には困難である以上、遷移配列モデルQCが用意されていない音符属性a1の指定音を合成できないという問題もある。前述の実施形態では、遷移配列モデルQCの分類の結果に応じてKC個の遷移配列モデルMCが生成される(すなわち遷移配列モデルQCと比較して1個の遷移配列モデルMCに多数の参照ピッチPrefが反映される)から、遷移配列モデルMCの統計的な妥当性を充分に担保することが可能である。また、合成時に指定音を遷移配列決定木TCに適用することで指定音の遷移配列が決定される(S13)から、参照音に存在しない音符の指定音についても、聴感的に自然な合成音の生成を実現し得る適切な遷移配列を選択できるという利点がある。   In Configuration A, however, the number of reference pitches Pref used to generate each transition array model QC is insufficient, which makes it difficult to ensure the statistical validity of the transition array model QC. Moreover, since it is practically difficult to prepare transition array models QC for all types of note attributes a1, there is also the problem that a designated sound with a note attribute a1 for which no transition array model QC has been prepared cannot be synthesized. In the above embodiment, KC transition array models MC are generated according to the result of classifying the transition array models QC (that is, many reference pitches Pref are reflected in each transition array model MC compared with a transition array model QC), so the statistical validity of the transition array models MC can be sufficiently ensured. In addition, since the transition array of the designated sound is determined by applying the designated sound to the transition array decision tree TC at synthesis time (S13), there is the advantage that an appropriate transition array yielding a perceptually natural synthesized sound can be selected even for a designated sound whose note does not exist in the reference sound.
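The role of a transition array decision tree such as TC can be illustrated with a toy tree over note attributes. The attribute questions, probabilities, and transition arrays below are all hypothetical placeholders for illustration and are not contents of the actual tree in the patent.

```python
# Hypothetical decision tree: each internal node asks a yes/no question about
# note attributes; each leaf holds a discrete distribution over transition
# arrays (tuples of B/S/E unit-section types).
tree = {
    "q": lambda note: note["duration_beats"] >= 1.0,
    "yes": {"leaf": {("B", "S", "E"): 0.7, ("B", "S"): 0.3}},
    "no": {
        "q": lambda note: note["pitch"] >= 64,
        "yes": {"leaf": {("S", "E"): 0.6, ("S",): 0.4}},
        "no": {"leaf": {("S",): 0.8, ("B", "S"): 0.2}},
    },
}

def select_transition_array(tree, note):
    """Walk the tree with the note's attributes; at the leaf, return the
    transition array with the highest discrete probability."""
    node = tree
    while "leaf" not in node:
        node = node["yes"] if node["q"](note) else node["no"]
    return max(node["leaf"], key=node["leaf"].get)

# A short, high note reaches a leaf that favors a steady part plus an end part.
array = select_transition_array(tree, {"duration_beats": 0.5, "pitch": 67})
```

Because every note ends up at some leaf, an array can be selected even for notes that never occurred in the reference sound, which is the advantage noted above.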

前述の構成Aと同様に、特徴量モデルQAや継続長モデルQBを指定音の合成に適用する構成(特徴量分類部423や継続長分類部425を省略した構成)も採用され得るが、確率モデルQの統計的な妥当性を担保して聴感的に自然な合成音を合成するという観点からは、前述の実施形態の例示のように特徴量モデルQAの分類で生成された特徴量モデルMAや継続長モデルQBの分類で生成された継続長モデルMBを指定音の合成に利用する構成が格別に好適である。   As in Configuration A above, a configuration in which the feature amount models QA and the duration models QB are applied to the synthesis of the designated sound (a configuration in which the feature amount classification unit 423 and the duration classification unit 425 are omitted) may also be employed. From the viewpoint of synthesizing a perceptually natural sound while ensuring the statistical validity of the probability models Q, however, a configuration that uses for the synthesis of the designated sound the feature amount models MA generated by classifying the feature amount models QA and the duration models MB generated by classifying the duration models QB, as illustrated in the above embodiment, is particularly suitable.

(3)変形例3
前述の実施形態では、記憶装置14に格納された参照音データXAから特徴量抽出部32が参照ピッチPrefを抽出したが、参照音から事前に抽出された参照ピッチPrefの時系列を記憶装置14に格納した構成(したがって特徴量抽出部32は省略される)も採用され得る。また、参照音を事前に各音符区間σに区分して記憶装置14に格納した構成(したがって第1区間設定部341は省略される)も採用され得る。
(3) Modification 3
In the above embodiment, the feature amount extraction unit 32 extracts the reference pitch Pref from the reference sound data XA stored in the storage device 14; however, a configuration in which a time series of the reference pitch Pref extracted in advance from the reference sound is stored in the storage device 14 (so that the feature amount extraction unit 32 is omitted) may also be employed. Further, a configuration in which the reference sound is divided in advance into note intervals σ and stored in the storage device 14 (so that the first section setting unit 341 is omitted) may also be employed.

(4)変形例4
前述の実施形態では第1処理部21と第2処理部22とを具備する音響合成装置100を例示したが、合成用情報Y(特徴量情報YA,継続長情報YB,遷移配列情報YC)を生成する第1処理部21を具備する音合成用確率モデル生成装置(第2処理部22を省略した装置)や、合成用情報Yを利用して合成音データVoutを生成する第2処理部22を具備する音響合成装置(第1処理部21を省略した装置)としても本発明は実施され得る。また、合成用情報Yを記憶する記憶装置14と第2処理部22の軌跡生成部52とを具備する装置(合成処理部54を省略した構成)は、合成音の特徴量の時系列(例えば合成ピッチ軌跡Psyn)を生成する特徴量軌跡生成装置としても把握され得る。
(4) Modification 4
In the above embodiment, the sound synthesis apparatus 100 including the first processing unit 21 and the second processing unit 22 is illustrated; however, the present invention may also be implemented as a sound synthesis probability model generation apparatus including the first processing unit 21 that generates the synthesis information Y (the feature amount information YA, the duration information YB, and the transition array information YC) (an apparatus from which the second processing unit 22 is omitted), or as a sound synthesis apparatus including the second processing unit 22 that generates the synthesized sound data Vout using the synthesis information Y (an apparatus from which the first processing unit 21 is omitted). Further, an apparatus including the storage device 14 that stores the synthesis information Y and the trajectory generation unit 52 of the second processing unit 22 (a configuration from which the synthesis processing unit 54 is omitted) may also be understood as a feature amount trajectory generation apparatus that generates a time series of a feature amount of the synthesized sound (for example, the synthesized pitch trajectory Psyn).

(5)変形例5
前述の実施形態では参照音の参照ピッチPrefの時系列から合成用情報Yを生成するとともに合成用情報Yから合成ピッチ軌跡Psynを生成したが、合成用情報Yの生成に利用される参照音の特徴量や合成用情報Yから生成される指定音の特徴量はピッチ(基本周波数)に限定されない。例えば、参照音のパワーの時系列から合成用情報Yを生成するとともに合成用情報Yから指定音のパワーの時系列(合成パワー軌跡)を生成する構成も採用され得る。また、指定音のMFCC(Mel-Frequency cepstral coefficient)等の特徴量の生成にも、前述の実施形態と同様に本発明を適用することが可能である。
(5) Modification 5
In the above embodiment, the synthesis information Y is generated from the time series of the reference pitch Pref of the reference sound, and the synthesized pitch trajectory Psyn is generated from the synthesis information Y; however, the feature amount of the reference sound used to generate the synthesis information Y and the feature amount of the designated sound generated from the synthesis information Y are not limited to pitch (fundamental frequency). For example, a configuration in which the synthesis information Y is generated from a time series of the power of the reference sound and a time series of the power of the designated sound (a synthesized power trajectory) is generated from the synthesis information Y may also be employed. The present invention is also applicable, as in the above embodiment, to the generation of feature amounts such as MFCCs (Mel-frequency cepstral coefficients) of the designated sound.

なお、特徴量は参照音から直接的に抽出される数値に限定されない。例えば、所定の目標値に対する参照音の特徴量の相対値を利用して合成用情報Yを生成する構成も採用され得る。具体的には、所定の目標値(例えば参照音の音符の音高)に対する参照音の参照ピッチPrefの相対値から合成用情報Yを生成し、合成用情報Yに応じて生成されるピッチの相対値と指定音の音符の音高とから合成ピッチ軌跡Psynを生成する構成が採用される。   The feature amount is not limited to a value extracted directly from the reference sound. For example, a configuration in which the synthesis information Y is generated using the value of a feature amount of the reference sound relative to a predetermined target value may also be employed. Specifically, a configuration is employed in which the synthesis information Y is generated from the value of the reference pitch Pref of the reference sound relative to a predetermined target value (for example, the pitch of the note of the reference sound), and the synthesized pitch trajectory Psyn is generated from the relative pitch values generated according to the synthesis information Y and the pitch of the note of the designated sound.
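A sketch of this relative-pitch variant follows, assuming MIDI note numbers for the nominal note pitch and cents as the relative unit; both are illustrative choices for this sketch, not units prescribed by the patent.

```python
import numpy as np

def to_relative(pitch_hz, note_midi):
    """Reference pitch expressed as cents relative to the note's nominal pitch."""
    nominal_hz = 440.0 * 2.0 ** ((note_midi - 69) / 12.0)
    return 1200.0 * np.log2(np.asarray(pitch_hz) / nominal_hz)

def from_relative(cents, note_midi):
    """Reconstruct absolute pitch for a designated note from relative cents."""
    nominal_hz = 440.0 * 2.0 ** ((note_midi - 69) / 12.0)
    return nominal_hz * 2.0 ** (np.asarray(cents) / 1200.0)

# A contour measured around a reference A4 note can be re-anchored on a
# designated B4 note: the expressive deviation transfers, the pitch transposes.
ref_hz = np.array([438.0, 441.5, 444.0])   # measured reference pitch (Hz)
rel = to_relative(ref_hz, 69)              # cents around the notated A4
synth_hz = from_relative(rel, 71)          # same deviation around B4
```

This is why a relative representation lets one set of models serve notes at any pitch: the model captures only the deviation from the notated pitch.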

(6)変形例6
前述の実施形態では歌唱音の合成を例示したが、本発明が適用される範囲は歌唱音の合成に限定されない。例えば、楽器の演奏音(楽音)を合成する場合にも、前述の実施形態と同様に本発明を適用することが可能である。
(6) Modification 6
In the above-described embodiment, the synthesis of the singing sound is exemplified, but the range to which the present invention is applied is not limited to the synthesis of the singing sound. For example, when synthesizing musical instrument performance sounds (musical sounds), the present invention can be applied in the same manner as in the above-described embodiment.

100……音響合成装置、12……演算処理装置、14……記憶装置、16……入力装置、21……第1処理部、22……第2処理部、32……特徴量抽出部、34……区間設定部、341……第1区間設定部、343……第2区間設定部、345……識別情報設定部、36……合成用情報生成部、42……第1情報生成部、421……確率モデル生成部、423……特徴量分類部、425……継続長分類部、44……第2情報生成部、441……遷移配列モデル生成部、443……遷移配列分類部、52……軌跡生成部、54……合成処理部。
DESCRIPTION OF SYMBOLS 100 ... sound synthesis apparatus, 12 ... arithmetic processing device, 14 ... storage device, 16 ... input device, 21 ... first processing unit, 22 ... second processing unit, 32 ... feature amount extraction unit, 34 ... section setting unit, 341 ... first section setting unit, 343 ... second section setting unit, 345 ... identification information setting unit, 36 ... synthesis information generation unit, 42 ... first information generation unit, 421 ... probability model generation unit, 423 ... feature amount classification unit, 425 ... duration classification unit, 44 ... second information generation unit, 441 ... transition array model generation unit, 443 ... transition array classification unit, 52 ... trajectory generation unit, 54 ... synthesis processing unit.

Claims (8)

特徴量の変動の傾向に応じた遷移種別毎に参照音を単位区間に区分する区間設定手段と、
複数の状態の各々について特徴量の確率分布を示す遷移種別毎の特徴量モデルを、前記参照音のうち当該遷移種別の単位区間における特徴量の時系列から生成する確率モデル生成手段と、
遷移種別の複数種類の配列の各々について当該配列が各音符の音符区間内に出現する離散確率を指定する遷移配列モデルを、前記参照音のうち当該音符に対応する音符区間内の遷移種別の配列から生成する遷移配列モデル生成手段と
を具備する音合成用確率モデル生成装置。
Section setting means for dividing the reference sound into unit sections for each transition type according to the trend of fluctuation of the feature amount,
Probability model generation means for generating a feature amount model for each transition type indicating a probability distribution of a feature amount for each of a plurality of states from a time series of feature amounts in a unit section of the transition type in the reference sound;
transition array model generation means for generating, for each of a plurality of types of arrays of transition types, a transition array model that specifies a discrete probability that the array appears within the note interval of each note, from the array of transition types within the note interval corresponding to the note in the reference sound; a sound synthesis probability model generation apparatus comprising the above means.
前記遷移配列モデル生成手段は、前記音符区間の開始部、前記開始部の直後で特徴量が定常的に維持される定常部、および、前記定常部の直後の終了部のうちの複数の遷移種別の各配列と、前記開始部、前記定常部および前記終了部の各々とについて、出現確率を示す前記遷移配列モデルを生成する
請求項1の音合成用確率モデル生成装置。
The sound synthesis probability model generation apparatus according to claim 1, wherein the transition array model generation means generates the transition array model indicating an appearance probability for each array of a plurality of transition types among a start part of the note interval, a steady part in which the feature amount is constantly maintained immediately after the start part, and an end part immediately after the steady part, and for each of the start part, the steady part, and the end part.
特徴量の変動の傾向に応じた遷移種別の複数種類の配列の各々について当該配列が一の音符の音符区間内に出現する離散確率を指定する遷移配列モデルを、音符毎の音符区間内に1種類以上の遷移種別を包含する参照音のうち前記一の音符に対応する音符区間内の遷移種別の配列から生成する遷移配列モデル生成手段
を具備する音合成用確率モデル生成装置。
A sound synthesis probability model generation apparatus comprising transition array model generation means for generating, for each of a plurality of types of arrays of transition types according to a tendency of variation of a feature amount, a transition array model that specifies a discrete probability that the array appears within the note interval of one note, from the array of transition types within the note interval corresponding to the one note in a reference sound that includes one or more transition types within the note interval of each note.
前記遷移配列モデル生成手段が生成した複数の遷移配列モデルを複数の集合に分類し、前記分類で構築される遷移配列決定木と前記各集合に分類された遷移配列モデルから集合毎に生成される遷移配列モデルとを含む遷移配列情報を生成する遷移配列分類手段
を具備する請求項1から請求項3の何れかの音合成用確率モデル生成装置。
The sound synthesis probability model generation apparatus according to any one of claims 1 to 3, further comprising transition array classification means for classifying the plurality of transition array models generated by the transition array model generation means into a plurality of sets and generating transition array information including a transition array decision tree constructed by the classification and a transition array model generated for each set from the transition array models classified into that set.
特徴量の変動の傾向に応じた遷移種別の複数種類の配列の各々について当該配列が各音符の音符区間内に出現する離散確率を指定する複数の遷移配列モデルのうち指定音の音符に対応する遷移配列モデルが指定する離散確率に応じて指定音の各単位区間の遷移種別を決定し、前記各遷移種別に応じた傾向で各単位区間内の特徴量が変動するように特徴量の時系列を生成する軌跡生成手段
を具備する特徴量軌跡生成装置。
A feature amount trajectory generation apparatus comprising trajectory generation means for determining the transition type of each unit section of a designated sound according to the discrete probabilities specified by that one of a plurality of transition array models which corresponds to the note of the designated sound, each transition array model specifying, for each of a plurality of types of arrays of transition types according to a tendency of variation of a feature amount, a discrete probability that the array appears within the note interval of each note, and for generating a time series of the feature amount such that the feature amount within each unit section varies with the tendency according to each transition type.
コンピュータを、
特徴量の変動の傾向に応じた遷移種別毎に参照音を単位区間に区分する区間設定手段、
複数の状態の各々について特徴量の確率分布を示す遷移種別毎の特徴量モデルを、前記参照音のうち当該遷移種別の単位区間における特徴量の時系列から生成する確率モデル生成手段、および、
遷移種別の複数種類の配列の各々について当該配列が各音符の音符区間内に出現する離散確率を指定する遷移配列モデルを、前記参照音のうち当該音符に対応する音符区間内の遷移種別の配列から生成する遷移配列モデル生成手段
として機能させるプログラム。
A program for causing a computer to function as:
section setting means for dividing a reference sound into unit sections for each transition type according to a tendency of variation of a feature amount;
probability model generation means for generating, for each transition type, a feature amount model indicating a probability distribution of the feature amount for each of a plurality of states, from the time series of the feature amount in the unit sections of that transition type in the reference sound; and
transition array model generation means for generating, for each of a plurality of types of arrays of transition types, a transition array model that specifies a discrete probability that the array appears within the note interval of each note, from the array of transition types within the note interval corresponding to the note in the reference sound.
コンピュータを、
特徴量の変動の傾向に応じた遷移種別の複数種類の配列の各々について当該配列が一の音符の音符区間内に出現する離散確率を指定する遷移配列モデルを、音符毎の音符区間内に1種類以上の遷移種別を包含する参照音のうち前記一の音符に対応する音符区間内の遷移種別の配列から生成する遷移配列モデル生成手段
として機能させるプログラム。
A program for causing a computer to function as transition array model generation means for generating, for each of a plurality of types of arrays of transition types according to a tendency of variation of a feature amount, a transition array model that specifies a discrete probability that the array appears within the note interval of one note, from the array of transition types within the note interval corresponding to the one note in a reference sound that includes one or more transition types within the note interval of each note.
コンピュータを、
特徴量の変動の傾向に応じた遷移種別の複数種類の配列の各々について当該配列が各音符の音符区間内に出現する離散確率を指定する複数の遷移配列モデルのうち指定音の音符に対応する遷移配列モデルが指定する離散確率に応じて指定音の各単位区間の遷移種別を決定し、前記各遷移種別に応じた傾向で各単位区間内の特徴量が変動するように特徴量の時系列を生成する軌跡生成手段
として機能させるプログラム。
A program for causing a computer to function as trajectory generation means for determining the transition type of each unit section of a designated sound according to the discrete probabilities specified by that one of a plurality of transition array models which corresponds to the note of the designated sound, each transition array model specifying, for each of a plurality of types of arrays of transition types according to a tendency of variation of a feature amount, a discrete probability that the array appears within the note interval of each note, and for generating a time series of the feature amount such that the feature amount within each unit section varies with the tendency according to each transition type.
JP2010198710A 2010-09-06 2010-09-06 Stochastic model generation device for sound synthesis, feature amount locus generation device, and program Expired - Fee Related JP5699496B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2010198710A JP5699496B2 (en) 2010-09-06 2010-09-06 Stochastic model generation device for sound synthesis, feature amount locus generation device, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2010198710A JP5699496B2 (en) 2010-09-06 2010-09-06 Stochastic model generation device for sound synthesis, feature amount locus generation device, and program

Publications (2)

Publication Number Publication Date
JP2012058306A JP2012058306A (en) 2012-03-22
JP5699496B2 2015-04-08

Family

ID=46055517

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2010198710A Expired - Fee Related JP5699496B2 (en) 2010-09-06 2010-09-06 Stochastic model generation device for sound synthesis, feature amount locus generation device, and program

Country Status (1)

Country Link
JP (1) JP5699496B2 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6497025B2 (en) * 2013-10-17 2019-04-10 ヤマハ株式会社 Audio processing device
JP6390690B2 (en) * 2016-12-05 2018-09-19 ヤマハ株式会社 Speech synthesis method and speech synthesis apparatus

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6202049B1 (en) * 1999-03-09 2001-03-13 Matsushita Electric Industrial Co., Ltd. Identification of unit overlap regions for concatenative speech synthesis system
JP3966074B2 (en) * 2002-05-27 2007-08-29 ヤマハ株式会社 Pitch conversion device, pitch conversion method and program
JP2004325831A (en) * 2003-04-25 2004-11-18 Roland Corp Singing data generating program
JP4367437B2 (en) * 2005-05-26 2009-11-18 ヤマハ株式会社 Audio signal processing apparatus, audio signal processing method, and audio signal processing program
JP4839891B2 (en) * 2006-03-04 2011-12-21 ヤマハ株式会社 Singing composition device and singing composition program

Also Published As

Publication number Publication date
JP2012058306A (en) 2012-03-22

Similar Documents

Publication Publication Date Title
US9818396B2 (en) Method and device for editing singing voice synthesis data, and method for analyzing singing
JP5471858B2 (en) Database generating apparatus for singing synthesis and pitch curve generating apparatus
US11468870B2 (en) Electronic musical instrument, electronic musical instrument control method, and storage medium
US8115089B2 (en) Apparatus and method for creating singing synthesizing database, and pitch curve generation apparatus and method
CN101308652B (en) Synthesizing method of personalized singing voice
CN104347080B (en) The medium of speech analysis method and device, phoneme synthesizing method and device and storaged voice analysis program
JP5605066B2 (en) Data generation apparatus and program for sound synthesis
JP2017107228A (en) Singing voice synthesis device and singing voice synthesis method
Umbert et al. Expression control in singing voice synthesis: Features, approaches, evaluation, and challenges
EP2447944B1 (en) Technique for suppressing particular audio component
CN112382257B (en) Audio processing method, device, equipment and medium
EP3759706A1 (en) Method of combining audio signals
Saino et al. A singing style modeling system for singing voice synthesizers.
JP5598516B2 (en) Voice synthesis system for karaoke and parameter extraction device
JP6756151B2 (en) Singing synthesis data editing method and device, and singing analysis method
JP6390690B2 (en) Speech synthesis method and speech synthesis apparatus
JP5699496B2 (en) Stochastic model generation device for sound synthesis, feature amount locus generation device, and program
JP2013164609A (en) Singing synthesizing database generation device, and pitch curve generation device
JP5131904B2 (en) System and method for automatically associating music acoustic signal and lyrics with time
Shih et al. A statistical multidimensional humming transcription using phone level hidden Markov models for query by humming systems
CN112908302B (en) Audio processing method, device, equipment and readable storage medium
JP2022065554A (en) Method for synthesizing voice and program
JP5131130B2 (en) Follow-up evaluation system, karaoke system and program
JP6380305B2 (en) Data generation apparatus, karaoke system, and program
CN112489607A (en) Method and device for recording songs, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20130724

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20131226

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20140121

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20140324

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20141021

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20141218

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20150120

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20150202

R151 Written notification of patent or utility model registration

Ref document number: 5699496

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R151

LAPS Cancellation because of no payment of annual fees