JP5689782B2

JP5689782B2 - Target speaker learning method, apparatus and program thereof

Info

Publication number: JP5689782B2
Application number: JP2011256042A
Authority: JP
Inventors: 勇祐井島; 光昭磯貝; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-11-24
Filing date: 2011-11-24
Publication date: 2015-03-25
Anticipated expiration: 2031-11-24
Also published as: JP2013109274A

Description

The present invention relates to a technique for synthesizing the speech of a desired speaker.

Speaker adaptation methods based on model conversion have been proposed for synthesizing a desired speaker's speech from a small amount of that speaker's speech data (arbitrary-speaker speech synthesis) (see, for example, Non-Patent Document 1). In a conventional speaker adaptation method, the speech of the speaker whose voice is to be synthesized (the target speaker) is used to convert an initial model learned in advance into an adapted model of the target speaker. Arbitrary-speaker speech synthesis is realized by performing speech synthesis with the resulting adapted model.

Meanwhile, results of perceptual experiments have shown that, besides the cepstrum, which is the feature amount generally used in speech processing, several other acoustic feature amounts contribute to the similarity of speech (see, for example, Non-Patent Document 2).

Non-Patent Document 1: Tamura et al., "Speaker adaptation of pitch and spectrum in HMM-based speech synthesis," IEICE Transactions, vol. J85-D-II, no. 4, pp. 545-553, April 2002.
Non-Patent Document 2: Ijima et al., "Correlation analysis between perception of voice quality similarity and acoustic features," Proceedings of the Acoustical Society of Japan Autumn Meeting, 3-Q-13, pp. 383-384, Sep. 2011.

In the conventional speaker adaptation method, an initial model corresponding to the cepstrum (feature amount) of speech data prepared in advance is converted into an adapted model corresponding to the cepstrum of the target speaker's speech, and the target speaker's speech is synthesized using this adapted model. However, converting the initial model into an adapted model corresponding to the cepstrum of the target speaker's speech degrades the quality (naturalness) of the synthesized speech, which is a problem.

In the present invention, feature amounts F(k, n) representing features F(k) (k = 1, ..., K, K ≥ 2) of speech data D(n) (n = 1, ..., N, N ≥ 2) of N speakers are clustered independently for each feature F(k), so that J(k) clusters CF(k, j(k)) (k = 1, ..., K, j(k) = 1, ..., J(k), J(k) ≥ 2) are set for each feature F(k). As a result, the K feature amounts F(k, n) of each piece of speech data D(n) belong to K clusters CF(k, j(k, n)) (k = 1, ..., K, j(k, n) = 1, ..., J(k), J(k) ≥ 2), one cluster per feature.

From the clusters CF(k, j(k)) thus set, the combination of K clusters CF(k, j(k, T)) (k = 1, ..., K, j(k, T) = 1, ..., J(k)) to which the K feature amounts F(k, T) (k = 1, ..., K) of the target speaker's speech data D(T) (T ≠ 1, ..., N) belong is selected. Then, speech data D(S) corresponding to this combination of K clusters CF(k, j(k, T)) is selected from the speech data D(n) of the N speakers.

When the combination of K clusters CF(k, j(k, S)) (k = 1, ..., K, j(k, S) = 1, ..., J(k)) to which the K feature amounts F(k, S) (k = 1, ..., K) of the speech data D(S) belong differs from the combination of K clusters CF(k, j(k, T)), a conversion function is used to convert some feature amounts F(r, S) (r ∈ {1, ..., K}) of the K feature amounts F(k, S) into feature amounts TF(r, S), yielding K feature amounts F(k', S) (k' ∈ {1, ..., K}, k' ≠ r) and TF(r, S). Here, the conversion function converts feature amounts belonging to the cluster CF(r, j(r, S)), to which the feature amount F(r, S) belongs, into feature amounts belonging to the cluster CF(r, j(r, T)) [CF(r, j(r, T)) ≠ CF(r, j(r, S))], to which the feature amount F(r, T), one of the K feature amounts F(k, T), belongs.

In the present invention, only some feature amounts F(r, S) of the K feature amounts F(k, S) of the speech data D(S) are converted into feature amounts TF(r, S) to obtain the K feature amounts F(k', S) (k' ∈ {1, ..., K}, k' ≠ r) and TF(r, S) used for synthesizing the target speaker's speech. The degradation of naturalness caused by conversion can therefore be controlled according to the type of feature amount that is converted.
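The selection and partial conversion summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the embodiment: the rule for choosing D(S) (here, the speaker whose cluster combination differs from the target's on the fewest features) and the names `plan_conversion`, `target_clusters` and `speaker_clusters` are hypothetical and were introduced only for this example.

```python
def plan_conversion(target_clusters, speaker_clusters):
    """Pick a source speaker S and list which features would need conversion.

    target_clusters: tuple of cluster indices j(k, T), one entry per feature k.
    speaker_clusters: dict mapping a speaker id n to its tuple of j(k, n).
    Returns (S, [(r, j(r, S), j(r, T)), ...]) where r is a 0-based feature index.
    """
    def mismatched(clusters):
        # Features whose cluster differs between this speaker and the target.
        return [
            (r, src, dst)
            for r, (src, dst) in enumerate(zip(clusters, target_clusters))
            if src != dst
        ]

    # Assumed rule: the fewest mismatching features wins; ties go to the
    # smallest speaker id so that the result is deterministic.
    selected = min(
        speaker_clusters,
        key=lambda n: (len(mismatched(speaker_clusters[n])), n),
    )
    return selected, mismatched(speaker_clusters[selected])
```

Each returned triple names one feature r together with its source and destination cluster, that is, the conversion function that would be applied to F(r, S). An empty list means the selected speaker already lies in the target's combination of clusters and no conversion is needed.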

FIG. 1 is a block diagram for explaining the speech synthesis apparatus of the embodiment.
FIG. 2 is a block diagram for explaining the speaker clustering unit of the embodiment.
FIG. 3 is a block diagram for explaining the target speaker learning unit of the embodiment.
FIG. 4 is a block diagram for explaining the speech synthesis unit of the embodiment.
FIG. 5A is a flowchart for explaining the speaker clustering process of the embodiment, and FIG. 5B is a flowchart for explaining the target speaker learning process of the embodiment.
FIG. 6 is a flowchart for explaining the conversion function learning process of the embodiment.
FIG. 7 is a flowchart for explaining the speaker selection process of the embodiment.
FIG. 8A is a diagram for explaining the label data of the embodiment, and FIG. 8B is a diagram illustrating multidimensional clusters of the embodiment.
FIG. 9A is a diagram for explaining the conversion function learning process of the embodiment, FIG. 9B is a diagram for explaining the multidimensional cluster selection process of the embodiment, and FIG. 9C is a diagram for explaining the feature amount conversion process of the embodiment.

Embodiments will be described with reference to the drawings.
<Configuration>
As illustrated in FIG. 1, the speech synthesis apparatus 1 of this embodiment includes a speaker clustering unit 110, a target speaker learning unit 120, a speech synthesis unit 130, and a control unit 140. The speaker clustering unit 110, the target speaker learning unit 120, and the speech synthesis unit 130 execute their processes under the control of the control unit 140. The speech synthesis apparatus 1 is, for example, a special apparatus configured by loading a special program into a known or dedicated computer that includes a CPU (central processing unit), RAM (random-access memory), and the like.

As illustrated in FIG. 2, the speaker clustering unit 110 of this embodiment includes a multi-speaker speech DB (database) storage unit 111a, a feature amount DB storage unit 111b, a cluster information DB storage unit 111c, a conversion function DB storage unit 111d, a feature amount extraction unit 112b, a clustering unit 112c, and a conversion function learning unit 112d.

As illustrated in FIG. 3, the target speaker learning unit 120 of this embodiment includes a target speaker speech storage unit 121a, a feature amount storage unit 121b, a belonging cluster storage unit 121c, a selected speaker storage unit 121d, a target speaker DB storage unit 121e, a feature amount extraction unit 122b, a cluster selection unit 122c, a speaker selection unit 122d, and a feature amount conversion unit 122e.

As illustrated in FIG. 4, the speech synthesis unit 130 of this embodiment includes a text storage unit 131a, a context storage unit 131b, a prosody model DB storage unit 131c, a prosody parameter storage unit 131d, a synthesized speech storage unit 131e, a text analysis unit 132b, a prosody generation unit 132d, and a segment selection unit 132e.

<Speaker clustering processing>
In the speaker clustering process, the feature amounts of the speech data of many speakers are clustered, and conversion functions that convert feature amounts belonging to one cluster into feature amounts of another cluster are learned. The speaker clustering process of this embodiment is described below with reference to FIG. 5A.
Speech of N speakers (N ≥ 2) (multiple speakers) is recorded in advance, and speech data D(n) (n = 1, ..., N) representing each speaker's speech is stored in the multi-speaker speech DB storage unit 111a (FIG. 2) (step S11). In this embodiment, speakers and speech data D(n) correspond one-to-one. From the viewpoint of processing performance in the target speaker learning unit 120 and the speech synthesis unit 130, the speech data D(n) desirably satisfies the following requirements. These conditions, however, do not limit the present invention.
(1) The amount of speech data per speaker (the duration of speech segments excluding silent segments) is at least the duration needed to learn a model for speech synthesis. That duration depends on the speech synthesis method used. For example, when unit-selection speech synthesis is used, about several hours of speech data is required for each speaker.
(2) The number N of recorded speakers is at least several tens for each gender.

In this embodiment, label data (phoneme segmentation information) is associated with each piece of speech data D(n), and a multi-speaker speech DB consisting of the speech data D(n) and the label data is stored in the multi-speaker speech DB storage unit 111a. FIG. 8A shows an example of label data. In the example of FIG. 8A, each phoneme (including silence) contained in the speech data D(n) is associated with a pair of its start time and end time. The label data may be assigned manually, or automatically by a computer according to a method such as that disclosed in Japanese Patent Application Laid-Open No. 2004-77901.
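As an illustration of how such label data might be represented and used, the sketch below stores each phoneme with its start and end time and sums the duration of the non-silent segments, which is the quantity that requirement (1) on the speech data refers to. The `Segment` type, the `speech_duration` helper and the silence labels `sil` and `pau` are assumptions made for this example; the actual layout of FIG. 8A is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One label entry: a phoneme (or silence) with its time span in seconds."""
    phoneme: str
    start: float
    end: float


def speech_duration(segments, silence_labels=("sil", "pau")):
    """Total duration of the segments that are not labeled as silence."""
    return sum(
        seg.end - seg.start
        for seg in segments
        if seg.phoneme not in silence_labels
    )
```

For instance, three segments `sil` (0.0 to 0.5 s), `a` (0.5 to 0.75 s) and `k` (0.75 to 1.0 s) give a speech duration of 0.5 s, since the silent segment is excluded.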

The feature amount extraction unit 112b extracts feature amounts F(k, n) representing features F(k) (k = 1, ..., K, K ≥ 2) of the speech data D(n) (n = 1, ..., N) stored in the multi-speaker speech DB storage unit 111a (step S12). A feature amount F(k, n) is extracted for each of the plurality of features F(k) of each piece of speech data D(n). For convenience of explanation, a feature amount F(k, n) in this embodiment corresponds one-to-one to a pair of a feature F(k) and speech data D(n). When the feature amount for the feature F(k) of the speech data D(n) is extracted for each predetermined section (for example, each frame or subband), the set of all feature amounts corresponding to the pair of the feature F(k) and the speech data D(n) is denoted "feature amount F(k, n)". For example, when the feature amount for the feature F(1) of the speech data D(1) is extracted for each frame, the set of all feature amounts extracted over the frames for the pair of the feature F(1) and the speech data D(1) is denoted "feature amount F(1, 1)". Each extracted feature amount F(k, n) is stored in the feature amount DB storage unit 111b in association with the corresponding pair of feature F(k) and speech data D(n). Specific examples of feature amounts are given below.
(Feature amount 1) The cepstrum of the speech data (for example, the mel-cepstrum).
(Feature amount 2) A cepstrum obtained from the spectrum of band-limited speech data D(n) (for example, with the band limited to 4 kHz).
(Feature amount 3) An aperiodicity index representing the ratio between periodic and aperiodic components in each band of the speech data.
(Feature amount 4) The ratio of the spectral power of each band to the spectral power of the entire band of the speech data. The ratio BSP_i of the spectral power of the i-th band to the spectral power of the entire band is obtained, for example, by the following equation.
BSP_i = mean(spec_i) / mean(spec_all)
Here, BSP_i is the power ratio of the i-th band, spec_all is the spectral power of the entire band, and spec_i is the spectral power of the i-th band. mean(α) is a function that calculates the average value of α. Example bands are 0-1 kHz (i = 1), 1-2 kHz (i = 2), 2-4 kHz (i = 3), 4-6 kHz (i = 4), and 6-8 kHz (i = 5).
(Feature amount 5) A warping parameter for vocal tract length normalization (VTLN) between speakers of the speech data (see, for example, E. Eide, "A Parametric Approach to Vocal Tract Length Normalization," In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 346-348, 1996).
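The band power ratio of feature amount 4 can be sketched as follows for a single power spectrum. The function name `band_power_ratios`, the representation of the spectrum as parallel lists of bin frequencies and powers, and the treatment of each band as a half-open interval are assumptions made for this illustration; the text does not specify whether the means are taken over one frame or over a whole utterance.

```python
from statistics import fmean

# Example bands from the text, in Hz. Treating each band as the half-open
# interval [lo, hi) is an assumption made for this sketch.
BANDS_HZ = [(0, 1000), (1000, 2000), (2000, 4000), (4000, 6000), (6000, 8000)]


def band_power_ratios(freqs_hz, power, bands=BANDS_HZ):
    """Return BSP_i = mean(spec_i) / mean(spec_all) for each band i.

    freqs_hz: center frequency of each spectral bin, in Hz.
    power: spectral power of each bin (same length as freqs_hz).
    Raises statistics.StatisticsError if a band contains no bins.
    """
    mean_all = fmean(power)  # mean(spec_all): average over every bin
    ratios = []
    for lo, hi in bands:
        band = [p for f, p in zip(freqs_hz, power) if lo <= f < hi]
        ratios.append(fmean(band) / mean_all)  # mean(spec_i) / mean(spec_all)
    return ratios
```

With one bin per band at 500, 1500, 3000, 5000 and 7000 Hz and powers 1, 2, 3, 4 and 10, the overall mean is 4 and the ratios are 0.25, 0.5, 0.75, 1.0 and 2.5.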

Feature amounts 1 to 5 all contribute to the similarity of speech. However, converting feature amounts 1 and 2 has a large impact on the naturalness of the synthesized speech: the naturalness of speech synthesized from the converted feature amount falls considerably relative to that of speech synthesized from the feature amount before conversion. In contrast, converting feature amounts 3 to 5 has a small impact: the corresponding drop in naturalness is small. In other words, conversion of feature amounts 3 to 5 degrades the naturalness of the synthesized speech less than conversion of feature amounts 1 and 2. The features F(k) (k = 1, ..., K, K ≥ 2) of this embodiment include a plurality of features whose conversion affects the naturalness of the synthesized speech to different degrees. That is, the feature amounts F(k, n) (k = 1, ..., K, K ≥ 2) of this embodiment include feature amounts whose conversion has a large impact on naturalness (for example, feature amounts 1 and 2) and feature amounts whose conversion has a small impact (for example, feature amounts 3 to 5).
A feature with a small impact on naturalness is a feature amount having either of the following two properties.
1. The feature amount [spectrum (cepstrum), etc.] of one speaker's speech data can be converted into the feature amount [spectrum (cepstrum), etc.] of a different speaker's speech data by a simple filter such as a first-order all-pass function, a high-frequency emphasis filter, or a filter representing the spectral power ratio between the speech data of different speakers (for example, FIL_i described later) (for example, feature amounts 4 and 5). That is, the feature amount F(k, n) can be converted into a feature amount F(k, n') (n' ∈ {1, ..., N}, n ≠ n') by such a simple filter.
2. A feature amount for which the average power over the entire band on the frequency axis of the speech data (frequency-domain speech data) affects the similarity (for example, feature amount 3). That is, the similarity between feature amounts corresponds to the similarity of the average power over the entire band on the frequency axis of the speech data corresponding to each of those feature amounts.

The clustering unit 112c clusters the feature amounts F(k, n) stored in the feature amount DB storage unit 111b independently for each feature F(k), and sets J(k) clusters CF(k, j(k)) (k = 1, ..., K, j(k) = 1, ..., J(k), J(k) ≥ 2) for each feature F(k). In other words, the clustering unit 112c clusters the feature amounts F(1, n) (n = 1, ..., N) to set J(1) clusters CF(1, j(1)) (j(1) = 1, ..., J(1)), clusters the feature amounts F(2, n) (n = 1, ..., N) to set J(2) clusters CF(2, j(2)) (j(2) = 1, ..., J(2)), and so on, up to clustering the feature amounts F(K, n) (n = 1, ..., N) to set J(K) clusters CF(K, j(K)) (j(K) = 1, ..., J(K)) (step S13).

When the feature amount extraction unit 112b extracts feature amounts frame by frame, clustering that uses the set of feature amounts for a pair of a feature and speech data (the set of frame-level feature amounts) directly as a sample often fails to produce appropriate clusters. In such a case, for example, the feature amounts corresponding to each vowel may be extracted from the set obtained by the feature amount extraction unit 112b, the average feature amount of each vowel in the set may be computed for each pair of a feature and speech data, and clustering may be performed using, as a sample, a speaker vector whose elements are the per-vowel average feature amounts. Concretely, from the set of frame-level feature amounts corresponding to the pair of a feature F(k') and speech data D(n'), the feature amounts corresponding to each vowel are extracted, the average feature amount of each vowel is computed, and the speaker vector whose elements are these per-vowel averages serves as the clustering sample for the pair of F(k') and D(n'). Alternatively, clustering may use as a sample the supervector of a Gaussian mixture model (GMM) obtained from the set of feature amounts for each pair of a feature and speech data (see, for example, W. M. Campbell, "Support Vector Machines Using GMM Supervectors for Speaker Verification," IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308-311, May 2006). Commonly used methods such as the k-means method or the LBG method can be used as the clustering algorithm.
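The speaker-vector approach can be sketched as follows: per-vowel mean vectors are concatenated into one sample per speaker, and the samples for one feature are then clustered. The helper names `speaker_vector` and `kmeans`, the vowel labels `a, i, u, e, o`, and the deterministic initialization from the first k samples are assumptions made for this illustration; a practical system would more likely use an established k-means or LBG implementation.

```python
VOWELS = ("a", "i", "u", "e", "o")  # assumed vowel labels in the label data


def speaker_vector(frames, vowels=VOWELS):
    """Concatenate the per-vowel mean feature vectors of one speaker.

    frames: iterable of (phoneme, feature_vector) pairs, one per frame.
    Raises KeyError if one of the vowels never occurs in the frames.
    """
    sums, counts = {}, {}
    for phoneme, vec in frames:
        if phoneme in vowels:
            acc = sums.get(phoneme, [0.0] * len(vec))
            sums[phoneme] = [s + x for s, x in zip(acc, vec)]
            counts[phoneme] = counts.get(phoneme, 0) + 1
    out = []
    for v in vowels:
        out.extend(s / counts[v] for s in sums[v])
    return out


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(samples, k, iters=20):
    """Plain Lloyd iterations; the first k samples seed the centroids."""
    centroids = [list(s) for s in samples[:k]]
    labels = [0] * len(samples)
    for _ in range(iters):
        # Assignment step: nearest centroid for every sample.
        labels = [
            min(range(k), key=lambda j: sq_dist(s, centroids[j]))
            for s in samples
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

To follow the per-feature independence described for step S13, `kmeans` would be called once per feature F(k), on the N speaker vectors built from that feature only, with k set to J(k).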

As a result of clustering, each feature amount F(k, n) belongs to one of the clusters CF(k, j(k, n)). That is, the K feature amounts F(k, n) of each piece of speech data D(n) belong to K clusters CF(k, j(k, n)) (k = 1, ..., K, j(k, n) = 1, ..., J(k), J(k) ≥ 2). In other words, the combination of the K feature amounts F(k, n) of each piece of speech data D(n) corresponds to a combination of K clusters CF(k, j(k, n)). A combination of K clusters CF(k, j(k, n)) (k = 1, ..., K) is called a "multidimensional cluster" and is written as follows.
C(j(1, n), ..., j(K, n))
= (CF(1, j(1, n)), ..., CF(K, j(K, n)))

FIG. 8B illustrates a clustering result for K = 2 and J(1) = J(2) = 5. In the example of FIG. 8B, the pairs of two clusters CF(1, j(1, n)) and CF(2, j(2, n)), that is, the multidimensional clusters C(j(1, n), j(2, n)), are shown as a 5 × 5 table. Each column (vertical) represents a cluster CF(1, j(1, n)) obtained by clustering the feature amounts F(1, n), and each row (horizontal) represents a cluster CF(2, j(2, n)) obtained by clustering the feature amounts F(2, n). Each black dot in FIG. 8B represents the pair of two feature amounts F(1, n) and F(2, n) of speech data D(n). The coordinate of a black dot in the row direction (horizontal) represents the feature amount F(1, n), and its coordinate in the column direction (vertical) represents the feature amount F(2, n). In the example of FIG. 8B, the two feature amounts F(1, n) and F(2, n) of each piece of speech data D(n) belong to a multidimensional cluster C(j(1, n), j(2, n)) consisting of two clusters CF(1, j(1, n)) and CF(2, j(2, n)). For example, the feature amount F(1, α) of speech data D(α) belongs to the cluster CF(1, 5), the feature amount F(2, α) belongs to the cluster CF(2, 1), and the combination of the two feature amounts F(1, α) and F(2, α) belongs to the multidimensional cluster C(5, 1) = (CF(1, 5), CF(2, 1)). Although FIG. 8B shows the result of clustering two types of feature amounts as a two-dimensional table, the result of clustering K types of feature amounts can be expressed as a K-dimensional table of size J(1) × ... × J(K).

Information representing all the clusters CF(k, j(k)) and information representing the feature amounts F(k, n) belonging to each cluster CF(k, j(k)) are stored in association with each other in the cluster information DB storage unit 111c. This information makes it possible to identify which combination of K clusters CF(k, j(k, n)) the combination of K feature amounts F(k, n) of each piece of speech data D(n) belongs to.
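One way to picture the stored cluster information is as a table of per-feature cluster indices from which each speaker's multidimensional cluster is read off. The sketch below is only an illustration: the function name `multidimensional_clusters` and the nested-list layout `labels_per_feature[k][n]` are assumptions, not the storage format of the cluster information DB.

```python
from collections import defaultdict


def multidimensional_clusters(labels_per_feature):
    """Combine per-feature cluster indices into multidimensional clusters.

    labels_per_feature[k][n] holds j(k, n), the cluster index of speaker n
    for feature k. Returns:
      tuples: tuples[n] == (j(1, n), ..., j(K, n)) for each speaker n
      index:  dict mapping each occupied multidimensional cluster to the
              list of speakers whose feature amounts fall in it
    """
    tuples = list(zip(*labels_per_feature))
    index = defaultdict(list)
    for n, key in enumerate(tuples):
        index[key].append(n)
    return tuples, dict(index)
```

With K = 2 and per-feature labels `[[5, 1, 5], [1, 2, 1]]` for three speakers, speakers 0 and 2 both fall in the multidimensional cluster (5, 1), which matches the C(5, 1) example given for FIG. 8B, and speaker 1 falls in (1, 2).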

Using the information stored in the cluster information DB storage unit 111c, the conversion function learning unit 112d learns (generates), independently for each feature F(k), conversion functions f_{k,j(k),j'(k)} (k = 1, ..., K, j(k) ≠ j'(k)) that convert feature amounts belonging to a cluster CF(k, j(k)) into feature amounts belonging to another cluster CF(k, j'(k)). A conversion function f_{k,j(k),j'(k)} may convert all feature amounts belonging to the cluster CF(k, j(k)) into feature amounts belonging to the cluster CF(k, j'(k)), or may convert only at least some of them. FIG. 9A illustrates a conversion function f_{2,5,3} that converts all feature amounts belonging to the cluster CF(2, 5) into feature amounts belonging to the cluster CF(2, 3). The conversion function learning unit 112d may generate conversion functions f_{k,j(k),j'(k)} for all features F(k) (k = 1, ..., K), or only for features F(k) whose conversion has a small impact on the naturalness of the synthesized speech. This embodiment describes an example in which conversion functions f_{k,j(k),j'(k)} are generated for all features F(k) (k = 1, ..., K). The generated conversion functions f_{k,j(k),j'(k)} are stored in the conversion function DB storage unit 111d (step S14).

変換関数ｆ_{k,j(k),j’(k)}の学習法の一例として、両クラスタＣＦ（ｋ，ｊ（ｋ）），ＣＦ（ｋ，ｊ’（ｋ））の代表値の差を使用する方法を説明する。この方法の場合、まず変換関数学習部１１２ｄは、クラスタＣＦ（ｋ，ｊ（ｋ）），ＣＦ（ｋ，ｊ’（ｋ））にそれぞれ含まれる全特徴量を用いて、各クラスタＣＦ（ｋ，ｊ（ｋ）），ＣＦ（ｋ，ｊ’（ｋ））の各代表値を求める。クラスタの代表値の例は、そのクラスタに属する全特徴量の平均値や中央値等である。次に変換関数学習部１１２ｄは、各クラスタＣＦ（ｋ，ｊ（ｋ）），ＣＦ（ｋ，ｊ’（ｋ））の各代表値を用い、以下のように変換関数ｆ_{k,j(k),j’(k)}を生成する。
ｆ_{k,j(k),j’(k)}(ν)=ν+(cent(CF(k,j’(k)))-cent(CF(k,j(k))))
ここでcent(β)はクラスタβの代表値を求める関数を表し、νはクラスタＣＦ（ｋ，ｊ（ｋ））に属する任意の特徴量（ベクトル等）を表す。 As an example of a method for learning the conversion function f_{k,j(k),j′(k)}, a method that uses the difference between the representative values of the two clusters CF(k, j(k)) and CF(k, j′(k)) is described. In this method, the conversion function learning unit 112d first obtains the representative value of each of the clusters CF(k, j(k)) and CF(k, j′(k)) using all the feature amounts included in each cluster. Examples of the representative value of a cluster are the mean or the median of all feature amounts belonging to that cluster. Next, the conversion function learning unit 112d uses these representative values to generate the conversion function f_{k,j(k),j′(k)} as follows.
f_{k,j(k),j′(k)}(ν) = ν + (cent(CF(k, j′(k))) − cent(CF(k, j(k))))
Here, cent(β) denotes a function that returns the representative value of the cluster β, and ν denotes an arbitrary feature amount (a vector or the like) belonging to the cluster CF(k, j(k)).
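The representative-value method above can be sketched in a few lines of Python. This is only an illustrative sketch, assuming that each cluster is given as a list of equal-length feature vectors and that the representative value is the per-dimension mean; the names `centroid` and `make_shift_conversion` are hypothetical and do not appear in the embodiment.

```python
from statistics import mean

def centroid(cluster):
    """Representative value cent(.) of a cluster: per-dimension mean (an assumption)."""
    return [mean(dim) for dim in zip(*cluster)]

def make_shift_conversion(src_cluster, dst_cluster):
    """Return f(v) = v + (cent(dst) - cent(src)) for feature vectors v."""
    src_c = centroid(src_cluster)
    dst_c = centroid(dst_cluster)
    offset = [d - s for d, s in zip(dst_c, src_c)]

    def convert(v):
        # Shift every dimension of v by the difference of the representative values.
        return [x + o for x, o in zip(v, offset)]

    return convert

# Toy 2-D feature vectors standing in for CF(k, j(k)) and CF(k, j'(k)).
cf_src = [[1.0, 2.0], [3.0, 4.0]]      # centroid (2.0, 3.0)
cf_dst = [[10.0, 0.0], [12.0, 2.0]]    # centroid (11.0, 1.0)
f = make_shift_conversion(cf_src, cf_dst)
shifted = f([1.0, 2.0])
```

Because the function only adds a constant offset, the representative value of the converted source cluster coincides with that of the destination cluster, while the spread of the source feature amounts is left unchanged.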

その他、クラスタごとに統計モデル（HMM: Hidden Markov Model）を学習して、変換先のクラスタの特徴量を用い、非特許文献１の話者適応手法により、変換関数ｆ_{k,j(k),j’(k)}が学習されてもよい。この手法では、まずクラスタ毎にクラスタ内に存在する話者の特徴量を用いてHMMを学習する。学習した変換元のクラスタのHMMと変換先のクラスタの特徴量とを用いて、変換元のクラスタのHMMを変換先のクラスタへ変換するための回帰行列W（非特許文献１の式(4)）を最尤推定により求める。この回帰行列Wは変換関数ｆ_{k,j(k),j’(k)}に相当する。すべての話者の音声データＤ（ｎ）（ｎ＝１，．．．，Ｎ）が同一テキストを発話して得られたものなのであれば、ＧＭＭによる特徴量変換関数を変換関数ｆ_{k,j(k),j’(k)}として学習することも可能である（例えば、参考文献１「A. Kain and M.W. Macon, “Spectral voice conversion for text-to-speech synthesis,” 1998 ICASSP, pp.285-288, 1998.」等参照）。この手法では、まず２名の話者の同一発話の特徴量からGMMを学習する。変換関数ｆ_{k,j(k),j’(k)}は、学習したGMMの平均ベクトル、共分散行列により得られる。一般的に、この手法は２名の話者の音声を変換するための手法であるが、クラスタ内には複数名の話者が存在する場合がある。そのため、GMMの学習データとして、各クラスタに対応する話者の音声データの特徴量の組合せを用いてGMMを学習する。例えば、クラスタＣＦ（ｋ，ｊ（ｋ））に属する特徴量に対応する話者がA,Bの２名であり、クラスタＣＦ（ｋ，ｊ’（ｋ））に属する特徴量に対応する話者がA’,B’の２名であった場合、以下の４通りの特徴量の組み合わせが学習データとされる。
(1)話者Aの音声データの特徴量と話者A’の音声データの特徴量との組み合わせ。
(2)話者Aの音声データの特徴量と話者B’の音声データの特徴量との組み合わせ。
(3)話者Bの音声データの特徴量と話者A’の音声データの特徴量との組み合わせ。
(4)話者Bの音声データの特徴量と話者B’の音声データの特徴量との組み合わせ。
この手法では参考文献１の式(5)が変換関数ｆ_{k,j(k),j’(k)}となる。 Alternatively, the conversion function f_{k,j(k),j′(k)} may be learned by training a statistical model (HMM: Hidden Markov Model) for each cluster and applying the speaker adaptation method of Non-Patent Document 1 with the feature amounts of the conversion-destination cluster. In this method, an HMM is first trained for each cluster using the feature amounts of the speakers in that cluster. Then, using the trained HMM of the conversion-source cluster and the feature amounts of the conversion-destination cluster, a regression matrix W (Equation (4) of Non-Patent Document 1) for converting the HMM of the conversion-source cluster to the conversion-destination cluster is obtained by maximum likelihood estimation. This regression matrix W corresponds to the conversion function f_{k,j(k),j′(k)}. If the speech data D(n) (n = 1, ..., N) of all speakers were obtained by uttering the same text, a GMM-based feature conversion function can also be learned as the conversion function f_{k,j(k),j′(k)} (see, for example, Reference 1: A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," 1998 ICASSP, pp. 285-288, 1998). In this method, a GMM is first trained from the feature amounts of the same utterances by two speakers. The conversion function f_{k,j(k),j′(k)} is obtained from the mean vectors and covariance matrices of the trained GMM. This method is generally used to convert between the voices of two speakers, but a cluster may contain several speakers. For this reason, the GMM is trained on combinations of the feature amounts of the speech data of the speakers corresponding to each cluster. For example, if the speakers corresponding to the feature amounts belonging to the cluster CF(k, j(k)) are the two speakers A and B, and the speakers corresponding to the feature amounts belonging to the cluster CF(k, j′(k)) are the two speakers A′ and B′, the following four combinations of feature amounts are used as training data.
(1) A combination of the feature amount of the speech data of the speaker A and the feature amount of the speech data of the speaker A ′.
(2) A combination of the feature amount of the speech data of the speaker A and the feature amount of the speech data of the speaker B ′.
(3) A combination of the feature amount of the speech data of the speaker B and the feature amount of the speech data of the speaker A ′.
(4) A combination of the feature amount of the speech data of the speaker B and the feature amount of the speech data of the speaker B ′.
In this method, Equation (5) in Reference Document 1 becomes the conversion function f _{k, j (k), j ′ (k)} .
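The construction of the GMM training data from speaker combinations can be sketched as follows. This is a minimal sketch under stated assumptions: the feature sequences of the same utterance are taken to be already time-aligned frame by frame, and the helper names `cross_cluster_pairs` and `joint_training_vectors` are hypothetical. Training the GMM itself is outside the scope of the sketch.

```python
from itertools import product

def cross_cluster_pairs(src_speakers, dst_speakers):
    """All (source speaker, destination speaker) combinations across two clusters."""
    return list(product(src_speakers, dst_speakers))

def joint_training_vectors(src_feats, dst_feats, src_speakers, dst_speakers):
    """Concatenate aligned source/destination frames for every speaker pair.

    src_feats[s] and dst_feats[d] are lists of frame vectors of the same
    utterance, assumed to be time-aligned in advance (an assumption).
    """
    data = []
    for s, d in product(src_speakers, dst_speakers):
        for x, y in zip(src_feats[s], dst_feats[d]):
            data.append(x + y)  # joint vector [x; y]
    return data

pairs = cross_cluster_pairs(["A", "B"], ["A'", "B'"])

src_feats = {"A": [[1.0], [2.0]], "B": [[3.0], [4.0]]}
dst_feats = {"A'": [[10.0], [20.0]], "B'": [[30.0], [40.0]]}
data = joint_training_vectors(src_feats, dst_feats, ["A", "B"], ["A'", "B'"])
```

With two speakers in each cluster, `pairs` lists the four combinations (1) to (4) in the same order as above.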

図６を用いて、変換関数ｆ_{k,j(k),j’(k)}の生成手順を例示する。図６の例では、ｊ’（ｋ）＝１，．．．，Ｊ（ｋ）について変換関数ｆ_{k,j(k),j’(k)}を学習する処理をｊ（ｋ）＝１，．．．，Ｊ（ｋ）について行うループ処理を、ｋ＝１，．．．，Ｋのループ処理として実行する（ステップＳ１４１〜Ｓ１４７）。この例ではｊ（ｋ）＝ｊ’（ｋ）の変換関数ｆ_{k,j(k),j’(k)}も生成されるが、ｊ（ｋ）＝ｊ’（ｋ）の変換関数ｆ_{k,j(k),j’(k)}は生成されなくてもよい。 A procedure for generating the conversion functions f_{k,j(k),j′(k)} is illustrated with reference to FIG. 6. In the example of FIG. 6, the process of learning the conversion function f_{k,j(k),j′(k)} for j′(k) = 1, ..., J(k) is repeated for j(k) = 1, ..., J(k), and this nested loop is in turn executed for k = 1, ..., K (steps S141 to S147). In this example, conversion functions with j(k) = j′(k) are also generated, but they need not be.
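The triple loop over k, j(k) and j′(k) can be sketched as follows. This is an illustrative sketch only: the nested-dictionary layout `clusters[k][j]`, the scalar feature amounts, and the names `learn_offset` and `learn_all_conversions` are assumptions made for the example, and any learner with the same call shape could be substituted.

```python
from statistics import mean

def learn_offset(src, dst):
    """Hypothetical learner: offset between cluster means (scalar features)."""
    return mean(dst) - mean(src)

def learn_all_conversions(clusters, learn, include_identity=False):
    """clusters[k][j] holds the feature amounts in CF(k, j).

    Returns a table keyed by (k, j, j_dst), one entry per conversion function.
    """
    table = {}
    for k, per_feature in clusters.items():            # loop over features k
        for j, src in per_feature.items():             # loop over source clusters j(k)
            for j_dst, dst in per_feature.items():     # loop over destination clusters j'(k)
                if j == j_dst and not include_identity:
                    continue                           # identity conversions are optional
                table[(k, j, j_dst)] = learn(src, dst)
    return table

clusters = {
    1: {1: [0.0, 2.0], 2: [10.0]},
    2: {1: [1.0], 2: [2.0], 3: [4.0]},
}
table = learn_all_conversions(clusters, learn_offset)
```

Setting `include_identity=True` reproduces the behavior of FIG. 6, in which the conversions with j(k) = j′(k) are generated as well.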

＜目標話者学習処理＞
目標話者学習処理では、入力された目標話者の音声データからその話者のモデルを学習する。以下、図５Ｂに従って本形態の目標話者学習処理を説明する。
目標話者の音声が収録され、目標話者の音声を表す音声データＤ（Ｔ）（Ｔ≠１，．．．，Ｎ）が目標話者学習部１２０（図３）の目標話者音声記憶部１２１ａに格納される。本形態では、目標話者と音声データＤ（Ｔ）とが一対一で対応する（ステップＳ２１）。 <Target speaker learning process>
In the target speaker learning process, a model of the speaker is learned from the input target speaker's voice data. Hereinafter, the target speaker learning process of this embodiment will be described with reference to FIG. 5B.
The speech of the target speaker is recorded, and speech data D(T) (T ≠ 1, ..., N) representing that speech is stored in the target speaker speech storage unit 121a of the target speaker learning unit 120 (FIG. 3). In this embodiment, the target speaker and the speech data D(T) are in one-to-one correspondence (step S21).

特徴量抽出部１２２ｂは、目標話者音声記憶部１２１ａに格納された音声データＤ（Ｔ）から、Ｋ個の特徴Ｆ（ｋ）（ｋ＝１，．．．，Ｋ）を表す特徴量Ｆ（ｋ，Ｔ）（ｋ＝１，．．．，Ｋ）を抽出し、特徴量記憶部１２１ｂに格納する（ステップＳ２２）。 The feature amount extraction unit 122b represents the feature amount F representing K features F (k) (k = 1,..., K) from the speech data D (T) stored in the target speaker speech storage unit 121a. (K, T) (k = 1,..., K) is extracted and stored in the feature amount storage unit 121b (step S22).

クラスタ選択部１２２ｃは、特徴量記憶部１２１ｂに格納された目標話者の音声データＤ（Ｔ）の特徴量Ｆ（ｋ，Ｔ）を用い、ステップＳ１３で設定されたクラスタＣＦ（ｋ，ｊ（ｋ））から、目標話者の音声データＤ（Ｔ）のＫ個の特徴量Ｆ（ｋ，Ｔ）（ｋ＝１，．．．，Ｋ）が属するＫ個のクラスタＣＦ（ｋ，ｊ（ｋ，Ｔ））（ｋ＝１，．．．，Ｋ、ｊ（ｋ，Ｔ）＝１，．．．，Ｊ（ｋ））の組み合わせを選択する。選択されたＫ個のクラスタＣＦ（ｋ，ｊ（ｋ，Ｔ））の組み合わせからなる多次元クラスタＣ（ｊ（１，Ｔ），．．．，ｊ（Ｋ，Ｔ））＝（ＣＦ（１，ｊ（１，Ｔ）），．．．，ＣＦ（Ｋ，ｊ（Ｋ，Ｔ）））を表す情報は、所属クラスタ記憶部１２１ｃに格納される（ステップＳ２３）。 The cluster selection unit 122c uses the feature amounts F(k, T) of the target speaker's speech data D(T) stored in the feature amount storage unit 121b to select, from the clusters CF(k, j(k)) set in step S13, the combination of K clusters CF(k, j(k, T)) (k = 1, ..., K, j(k, T) = 1, ..., J(k)) to which the K feature amounts F(k, T) (k = 1, ..., K) of the target speaker's speech data D(T) belong. Information representing the multidimensional cluster C(j(1, T), ..., j(K, T)) = (CF(1, j(1, T)), ..., CF(K, j(K, T))) formed by the selected combination of K clusters CF(k, j(k, T)) is stored in the assigned cluster storage unit 121c (step S23).

特徴量Ｆ（ｋ，Ｔ）が属するクラスタＣＦ（ｋ，ｊ（ｋ，Ｔ））の選択は特徴Ｆ（ｋ）ごとに独立に行われ、最終的にＫ個の特徴量Ｆ（ｋ，Ｔ）の組み合わせが属する多次元クラスタＣ（ｊ（１，Ｔ），．．．，ｊ（Ｋ，Ｔ））＝（ＣＦ（１，ｊ（１，Ｔ）），．．．，ＣＦ（Ｋ，ｊ（Ｋ，Ｔ）））が選択される。図９Ｂの例の場合、目標話者の音声データＤ（Ｔ）の２個の特徴量Ｆ（１，Ｔ），Ｆ（２，Ｔ）はそれぞれクラスタＣＦ（１，１），ＣＦ（２，３）に属し、特徴量Ｆ（１，Ｔ），Ｆ（２，Ｔ）の組み合わせが多次元クラスタＣ（１，３）＝（ＣＦ（１，１），ＣＦ（２，３））に属している。 The cluster CF(k, j(k, T)) to which the feature amount F(k, T) belongs is selected independently for each feature F(k), and as a result the multidimensional cluster C(j(1, T), ..., j(K, T)) = (CF(1, j(1, T)), ..., CF(K, j(K, T))) to which the combination of the K feature amounts F(k, T) belongs is selected. In the example of FIG. 9B, the two feature amounts F(1, T) and F(2, T) of the target speaker's speech data D(T) belong to the clusters CF(1, 1) and CF(2, 3), respectively, and the combination of F(1, T) and F(2, T) belongs to the multidimensional cluster C(1, 3) = (CF(1, 1), CF(2, 3)).

クラスタの選択手法としては、例えば、目標話者の音声データＤ（Ｔ）のＫ個の特徴量Ｆ（ｋ，Ｔ）からステップＳ１３と同様に話者ベクトルを算出し、話者ベクトルとの距離が最も近い代表値を持つクラスタを選択する手法や、入力された特徴量が各クラスタに属する確率を出力するＧＭＭ等の統計モデルをクラスタごとに学習しておき、目標話者の音声データＤ（Ｔ）の各特徴量Ｆ（ｋ，Ｔ）を当該統計モデルに入力して各特徴量Ｆ（ｋ，Ｔ）が属する確率が最も高い（尤度が最も高い）クラスタをＣＦ（ｋ，ｊ（ｋ，Ｔ））として選択する手法等がある。 As a cluster selection method, for example, a speaker vector is calculated from the K feature values F (k, T) of the target speaker's voice data D (T) in the same manner as in step S13, and the distance from the speaker vector is calculated. A method for selecting a cluster having the closest representative value and a statistical model such as GMM that outputs the probability that the input feature value belongs to each cluster are learned for each cluster, and voice data D ( Each feature value F (k, T) of T) is input to the statistical model, and a cluster having the highest probability (highest likelihood) to which each feature value F (k, T) belongs is designated CF (k, j ( k, T)).
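The first selection technique, choosing the cluster whose representative value is nearest, can be sketched as follows. This is a minimal sketch that assumes Euclidean distance and treats each feature amount directly as the vector to be compared; the dictionary layout and the names `select_cluster` and `select_multidimensional_cluster` are hypothetical. The toy values are chosen so that the result matches the C(1, 3) example of FIG. 9B.

```python
import math

def select_cluster(feature, centroids):
    """Return the cluster index j whose representative value is nearest to `feature`.

    centroids maps j -> representative vector of CF(k, j).
    """
    return min(centroids, key=lambda j: math.dist(feature, centroids[j]))

def select_multidimensional_cluster(target_features, centroids_per_feature):
    """Select one cluster per feature k, independently, and return the tuple of indices."""
    return tuple(
        select_cluster(target_features[k], centroids_per_feature[k])
        for k in sorted(target_features)
    )

# Toy data: K = 2 features with 1-D representative values.
centroids_per_feature = {
    1: {1: [0.0], 2: [1.0]},
    2: {1: [1.0], 3: [3.0], 5: [5.0]},
}
target_features = {1: [0.1], 2: [2.9]}   # F(1, T), F(2, T)
selected = select_multidimensional_cluster(target_features, centroids_per_feature)
```

The likelihood-based alternative would replace the distance in `select_cluster` with the probability output by a per-cluster statistical model and take the maximum instead of the minimum.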

話者選択部１２２ｄは、話者クラスタリング部１１０（図３）の多数話者音声ＤＢ記憶部１１１ａに格納されたＮ人の話者の音声データＤ（ｎ）（ｎ＝１，．．．，Ｎ）から、所属クラスタ記憶部１２１ｃに格納されたＫ個のクラスタＣＦ（ｋ，ｊ（ｋ，Ｔ））の組み合わせに対応する音声データＤ（Ｓ）を選択し、選択した音声データＤ（Ｓ）を表す情報を選択話者記憶部１２１ｄに格納する（ステップＳ２４）。 The speaker selection unit 122d selects, from the speech data D(n) (n = 1, ..., N) of the N speakers stored in the multi-speaker speech DB storage unit 111a of the speaker clustering unit 110 (FIG. 3), speech data D(S) corresponding to the combination of the K clusters CF(k, j(k, T)) stored in the assigned cluster storage unit 121c, and stores information representing the selected speech data D(S) in the selected speaker storage unit 121d (step S24).

話者選択部１２２ｄは、例えば、以下のように音声データＤ（Ｓ）を選択する。
（１）Ｎ人の話者の音声データＤ（ｎ）に音声データＤ（Ｓ’）が１個のみ含まれる場合、話者選択部１２２ｄは、当該音声データＤ（Ｓ’）を音声データＤ（Ｓ）とする。ただし、「音声データＤ（Ｓ’）」は、ステップＳ２３で選択された多次元クラスタＣ（ｊ（１，Ｔ），．．．，ｊ（Ｋ，Ｔ））＝（ＣＦ（１，ｊ（１，Ｔ）），．．．，ＣＦ（Ｋ，ｊ（Ｋ，Ｔ）））を構成するＫ個のクラスタＣＦ（ｋ，ｊ（ｋ，Ｔ））に属するＫ個の特徴量Ｆ（ｋ，Ｓ’）（ｋ＝１，．．．，Ｋ）を持つ音声データを表す。
（２）Ｎ人の話者の音声データＤ（ｎ）に上記音声データＤ（Ｓ’）が複数含まれる場合、話者選択部１２２ｄは、これら複数の音声データＤ（Ｓ’）から選択された１個を音声データＤ（Ｓ）とする。
（３）Ｎ人の話者の音声データＤ（ｎ）に上記音声データＤ（Ｓ’）が含まれない場合、話者選択部１２２ｄは、Ｋ個のクラスタＣＦ（ｋ，ｊ（ｋ，Ｔ））の組み合わせと異なるＫ個のクラスタＣＦ（ｋ，ｊ（ｋ，Ｓ”））（ｋ＝１，．．．，Ｋ、ｊ（ｋ，Ｓ”）＝１，．．．，Ｊ（ｋ））の組み合わせをなす、Ｋ個のクラスタＣＦ（ｋ，ｊ（ｋ，Ｓ”））に属するＫ個の特徴量Ｆ（ｋ，Ｓ”）（ｋ＝１，．．．，Ｋ）を持つ音声データＤ（Ｓ”）を、音声データＤ（Ｓ）として選択する。 For example, the speaker selection unit 122d selects the voice data D (S) as follows.
(1) When the speech data D(n) of the N speakers include exactly one item of speech data D(S′), the speaker selection unit 122d uses that speech data D(S′) as the speech data D(S). Here, "speech data D(S′)" denotes speech data having K feature amounts F(k, S′) (k = 1, ..., K) that belong to the K clusters CF(k, j(k, T)) constituting the multidimensional cluster C(j(1, T), ..., j(K, T)) = (CF(1, j(1, T)), ..., CF(K, j(K, T))) selected in step S23.
(2) When the speech data D(n) of the N speakers include a plurality of items of such speech data D(S′), the speaker selection unit 122d uses one item selected from among them as the speech data D(S).
(3) When the speech data D(n) of the N speakers include no such speech data D(S′), the speaker selection unit 122d selects, as the speech data D(S), speech data D(S″) having K feature amounts F(k, S″) (k = 1, ..., K) that belong to K clusters CF(k, j(k, S″)) (k = 1, ..., K, j(k, S″) = 1, ..., J(k)) forming a combination different from the combination of the K clusters CF(k, j(k, T)).

次に図７を用いて音声データＤ（Ｓ）の選択手法を例示する。
話者選択部１２２ｄは、特徴量ＤＢ記憶部１１１ｂに格納された各音声データＤ（ｎ）の特徴量を参照し、多数話者音声ＤＢ記憶部１１１ａに格納されたＮ人の話者の音声データＤ（ｎ）（ｎ＝１，．．．，Ｎ）のうち、所属クラスタ記憶部１２１ｃに格納されたＫ個のクラスタＣＦ（ｋ，ｊ（ｋ，Ｔ））に属するＫ個の特徴量Ｆ（ｋ，Ｓ’）（ｋ＝１，．．．，Ｋ）を持つ音声データＤ（Ｓ’）の個数をカウントする。言い換えると、話者選択部１２２ｄは、ステップＳ２３で選択された多次元クラスタＣ（ｊ（１，Ｔ），．．．，ｊ（Ｋ，Ｔ））＝（ＣＦ（１，ｊ（１，Ｔ）），．．．，ＣＦ（Ｋ，ｊ（Ｋ，Ｔ）））に属するＫ個の特徴量Ｆ（ｋ，Ｓ’）（ｋ＝１，．．．，Ｋ）を持つ音声データＤ（Ｓ’）の個数をカウントする（ステップＳ２４１）。 Next, a method for selecting the audio data D (S) will be illustrated with reference to FIG.
The speaker selection unit 122d refers to the feature amount of each voice data D (n) stored in the feature amount DB storage unit 111b, and the voices of N speakers stored in the multi-speaker speech DB storage unit 111a. Of the data D (n) (n = 1,..., N), K feature quantities belonging to K clusters CF (k, j (k, T)) stored in the assigned cluster storage unit 121c. The number of audio data D (S ′) having F (k, S ′) (k = 1,..., K) is counted. In other words, the speaker selection unit 122d selects the multidimensional cluster C (j (1, T),..., J (K, T)) = (CF (1, j (1, T) selected in step S23. )),..., CF (K, j (K, T))) and K feature values F (k, S ′) (k = 1,..., K). The number of S ′) is counted (step S241).

上記のＮ人の話者の音声データＤ（ｎ）が上記の音声データＤ（Ｓ’）を１個のみ含む場合、話者選択部１２２ｄは当該１個の音声データＤ（Ｓ’）を音声データＤ（Ｓ）として選択する（ステップＳ２４２）。 When the voice data D (n) of the N speakers includes only one voice data D (S ′), the speaker selection unit 122d uses the voice data D (S ′) as a voice. It selects as data D (S) (step S242).

上記のＮ人の話者の音声データＤ（ｎ）が上記の音声データＤ（Ｓ’）を２個以上含む場合、話者選択部１２２ｄは当該音声データＤ（Ｓ’）の何れかを音声データＤ（Ｓ）として選択する。この例の話者選択部１２２ｄは、各音声データＤ（Ｓ’）の特徴量Ｆ（ｋ，Ｓ’）（ｋ＝１，．．．，Ｋ）と目標話者の音声データＤ（Ｔ）の特徴量Ｆ（ｋ，Ｔ）（ｋ＝１，．．．，Ｋ）との類似度（距離）を算出し（ステップＳ２４３）、類似度が最も高い（最も近い）特徴量Ｆ（ｋ，Ｓ’）を持つ音声データＤ（Ｓ’）を、音声データＤ（Ｓ）として選択する（ステップＳ２４４）。 When the voice data D (n) of the N speakers includes two or more voice data D (S ′), the speaker selection unit 122d uses one of the voice data D (S ′) as voice. Select as data D (S). The speaker selection unit 122d in this example includes the feature amount F (k, S ′) (k = 1,..., K) of each voice data D (S ′) and the voice data D (T) of the target speaker. The similarity (distance) with the feature quantity F (k, T) (k = 1,..., K) is calculated (step S243), and the feature quantity F (k, The audio data D (S ′) having S ′) is selected as the audio data D (S) (step S244).

上記のＮ人の話者の音声データＤ（ｎ）が上記の音声データＤ（Ｓ’）を含まない場合、話者選択部１２２ｄは、以下の条件１，２を満たす、ステップＳ２３で選択された多次元クラスタＣ（ｊ（１，Ｔ），．．．，ｊ（Ｋ，Ｔ））＝（ＣＦ（１，ｊ（１，Ｔ）），．．．，ＣＦ（Ｋ，ｊ（Ｋ，Ｔ）））に最も近い、１個の多次元クラスタＣ（ｊ（１，Ｓ”），．．．，ｊ（Ｋ，Ｓ”））＝（ＣＦ（１，ｊ（１，Ｓ”）），．．．，ＣＦ（Ｋ，ｊ（Ｋ，Ｓ”）））を選択する。多次元クラスタ間の距離の比較は、例えば、各多次元クラスタを構成するＫ個のクラスタの代表値を要素として並べたベクトル間の距離を多次元クラスタ間の距離として行われる。
［条件１］多次元クラスタＣ（ｊ（１，Ｓ”），．．．，ｊ（Ｋ，Ｓ”））を構成するＫ個のクラスタＣＦ（ｋ，ｊ（ｋ，Ｓ”））（ｋ＝１，．．．，Ｋ）に属するＫ個の特徴量Ｆ（ｋ，Ｓ”）（ｋ＝１，．．．，Ｋ）を持つ音声データＤ（Ｓ”）がＮ人の話者の音声データＤ（ｎ）（ｎ＝１，．．．，Ｎ）に含まれる。
［条件２］多次元クラスタＣ（ｊ（１，Ｔ），．．．，ｊ（Ｋ，Ｔ））を構成するＫ個のクラスタＣＦ（ｋ，ｊ（ｋ，Ｔ））（ｋ＝１，．．．，Ｋ）が含む一部のクラスタＣＦ（ｗ，ｊ（ｗ，Ｔ））（ｗ∈｛１，．．．，Ｋ｝）と、多次元クラスタＣ（ｊ（１，Ｓ”），．．．，ｊ（Ｋ，Ｓ”））を構成するＫ個のクラスタＣＦ（ｋ，ｊ（ｋ，Ｓ”））（ｋ＝１，．．．，Ｋ）が含む一部のクラスタＣＦ（ｗ，ｊ（ｗ，Ｓ”））とが等しい。ただし、特徴ｆ（ｗ）（ｗ∈｛１，．．．，Ｋ｝）は、特徴量の変換による合成音声の自然性低下への影響が大きい特徴（例えば、前述の特徴量１，２）であり、その他の特徴ｆ（ｒ）（ｒ∈｛１，．．．，Ｋ｝，ｒ≠ｗ）は、特徴量の変換による合成音声の自然性低下への影響が小さい特徴（例えば、前述の特徴量３〜５）である（ステップＳ２４５）。
図９Ｂ及び図９Ｃの例において、特徴Ｆ（１）が特徴量の変換による合成音声の自然性低下への影響が大きい特徴であり、特徴Ｆ（２）が特徴量の変換による合成音声の自然性低下への影響が小さい特徴であるとする。この場合、話者選択部１２２ｄは、条件１，２を満たす多次元クラスタＣ（１，１），Ｃ（１，２），Ｃ（１，５）のうち、多次元クラスタＣ（１，３）に最も近いＣ（１，５）を選択する。 When the speech data D(n) of the N speakers include no such speech data D(S′), the speaker selection unit 122d selects the one multidimensional cluster C(j(1, S″), ..., j(K, S″)) = (CF(1, j(1, S″)), ..., CF(K, j(K, S″))) that satisfies the following conditions 1 and 2 and is closest to the multidimensional cluster C(j(1, T), ..., j(K, T)) = (CF(1, j(1, T)), ..., CF(K, j(K, T))) selected in step S23. The distance between two multidimensional clusters is taken to be, for example, the distance between the vectors obtained by arranging the representative values of the K clusters constituting each multidimensional cluster as elements.
[Condition 1] Speech data D(S″) having K feature amounts F(k, S″) (k = 1, ..., K) that belong to the K clusters CF(k, j(k, S″)) (k = 1, ..., K) constituting the multidimensional cluster C(j(1, S″), ..., j(K, S″)) are included in the speech data D(n) (n = 1, ..., N) of the N speakers.
[Condition 2] The subset of clusters CF(w, j(w, T)) (w ∈ {1, ..., K}) among the K clusters CF(k, j(k, T)) (k = 1, ..., K) constituting the multidimensional cluster C(j(1, T), ..., j(K, T)) is equal to the corresponding subset of clusters CF(w, j(w, S″)) among the K clusters CF(k, j(k, S″)) (k = 1, ..., K) constituting the multidimensional cluster C(j(1, S″), ..., j(K, S″)). Here, the features F(w) (w ∈ {1, ..., K}) are features whose conversion greatly degrades the naturalness of the synthesized speech (for example, the feature amounts 1 and 2 described above), and the other features F(r) (r ∈ {1, ..., K}, r ≠ w) are features whose conversion has little effect on the naturalness of the synthesized speech (for example, the feature amounts 3 to 5 described above) (step S245).
In the examples of FIGS. 9B and 9C, suppose that the feature F(1) is a feature whose conversion greatly degrades the naturalness of the synthesized speech, and that the feature F(2) is a feature whose conversion has little effect on naturalness. In this case, the speaker selection unit 122d selects, from among the multidimensional clusters C(1, 1), C(1, 2), and C(1, 5) that satisfy conditions 1 and 2, the cluster C(1, 5), which is closest to the multidimensional cluster C(1, 3).

話者選択部１２２ｄは当該音声データＤ（Ｓ”）の何れかを音声データＤ（Ｓ）として選択する。この例の話者選択部１２２ｄは、各音声データＤ（Ｓ”）の特徴量Ｆ（ｋ，Ｓ”）（ｋ＝１，．．．，Ｋ）と目標話者の音声データＤ（Ｔ）の特徴量Ｆ（ｋ，Ｔ）（ｋ＝１，．．．，Ｋ）との類似度（距離）を算出し（ステップＳ２４６）、類似度が最も高い（最も近い）特徴量Ｆ（ｋ，Ｓ”）を持つ音声データＤ（Ｓ”）を、音声データＤ（Ｓ）として選択する（ステップＳ２４７）。 The speaker selection unit 122d selects any one of the voice data D (S ″) as the voice data D (S). The speaker selection unit 122d in this example uses the feature amount F of each voice data D (S ″). (K, S ″) (k = 1,..., K) and the feature amount F (k, T) (k = 1,..., K) of the target speaker's speech data D (T). The similarity (distance) is calculated (step S246), and the audio data D (S ″) having the highest (closest) similarity F (k, S ″) is selected as the audio data D (S). (Step S247).
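The selection flow of steps S241 to S247 can be sketched as follows. This is only an illustrative sketch under several assumptions: feature amounts and cluster representative values are scalars, distances are Euclidean, features are indexed from 0, and `protected` lists the indices w of the features whose conversion would harm naturalness. The function name `select_speaker` and the data layout are hypothetical. The toy values are arranged so that, as in FIG. 9C, C(1, 5) is the nearest admissible multidimensional cluster to C(1, 3).

```python
import math

def select_speaker(target, speakers, centroids, protected):
    """Return (speaker id, needs_conversion).

    target / speakers[n]: {"clusters": (j_1, ..., j_K), "features": [x_1, ..., x_K]}
    centroids[k][j]: representative value of cluster j of the k-th feature (0-based k)
    protected: indices w whose clusters must equal the target's (condition 2)
    """
    t_cl = target["clusters"]

    def closest(ids):
        # Steps S243-S244 / S246-S247: nearest speaker by feature distance.
        return min(ids, key=lambda n: math.dist(speakers[n]["features"],
                                                target["features"]))

    # Step S241: speakers whose K clusters all match the target's.
    same = [n for n, s in speakers.items() if s["clusters"] == t_cl]
    if same:
        return closest(same), False

    def rep(cl):
        # Vector of representative values of the K clusters of a multidimensional cluster.
        return [centroids[k][j] for k, j in enumerate(cl)]

    # Step S245: clusters that hold at least one speaker (condition 1)
    # and agree with the target on every protected feature (condition 2).
    candidates = {s["clusters"] for s in speakers.values()
                  if all(s["clusters"][w] == t_cl[w] for w in protected)}
    if not candidates:
        return None, False  # no admissible speaker in this toy setting
    nearest = min(candidates, key=lambda cl: math.dist(rep(cl), rep(t_cl)))
    pool = [n for n, s in speakers.items() if s["clusters"] == nearest]
    return closest(pool), True

centroids = [{1: 0.0, 2: 5.0}, {1: 0.0, 2: 1.0, 3: 5.0, 5: 6.0}]
speakers = {
    "s1": {"clusters": (1, 1), "features": [0.1, 0.2]},
    "s2": {"clusters": (1, 2), "features": [0.0, 1.1]},
    "s3": {"clusters": (1, 5), "features": [0.2, 6.1]},
    "s4": {"clusters": (1, 5), "features": [0.1, 5.8]},
    "s5": {"clusters": (2, 3), "features": [5.0, 5.0]},
}
target = {"clusters": (1, 3), "features": [0.1, 5.0]}
result = select_speaker(target, speakers, centroids, protected=[0])
```

The second element of the returned pair corresponds to the determination in step S25: it is true only when the fallback branch was taken, in which case the unprotected feature amounts still have to be converted.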

制御部１４０（図１）は、ステップＳ２３で選択された多次元クラスタＣ（ｊ（１，Ｔ），．．．，ｊ（Ｋ，Ｔ））に対応する音声データＤ（Ｓ）が存在しなかったかを判定する。言い換えると、制御部１４０は、ステップＳ２３で選択された多次元クラスタＣ（ｊ（１，Ｔ），．．．，ｊ（Ｋ，Ｔ））と、ステップＳ２４で選択された音声データＤ（Ｓ）のＫ個の特徴量Ｆ（ｋ，Ｓ）が属する多次元クラスタＣ（ｊ（１，Ｓ），．．．，ｊ（Ｋ，Ｓ））とが異なるか（図７の例では、ステップＳ２４５〜Ｓ２４７が実行されたか）を判定する（ステップＳ２５）。 The control unit 140 (FIG. 1) determines whether speech data D(S) corresponding to the multidimensional cluster C(j(1, T), ..., j(K, T)) selected in step S23 was absent. In other words, the control unit 140 determines whether the multidimensional cluster C(j(1, T), ..., j(K, T)) selected in step S23 differs from the multidimensional cluster C(j(1, S), ..., j(K, S)) to which the K feature amounts F(k, S) of the speech data D(S) selected in step S24 belong (in the example of FIG. 7, whether steps S245 to S247 were executed) (step S25).

ステップＳ２３で選択された多次元クラスタＣ（ｊ（１，Ｔ），．．．，ｊ（Ｋ，Ｔ））に対応する音声データＤ（Ｓ）が存在した場合、目標話者学習処理が終了する。この場合、ステップＳ２４で選択された音声データＤ（Ｓ）の特徴量（「目標話者の特徴量」となる）、音声データＤ（Ｓ）及びそのラベルデータ等、又は、目標話者の特徴量に対応するＨＭＭなどの統計モデルが、音声合成部１３０での目標話者の音声合成処理に利用される。 If the speech data D (S) corresponding to the multidimensional cluster C (j (1, T),..., J (K, T)) selected in step S23 exists, the target speaker learning process ends. To do. In this case, the feature amount of the speech data D (S) selected in step S24 (becomes “target speaker feature amount”), speech data D (S) and its label data, or the feature of the target speaker. A statistical model such as an HMM corresponding to the amount is used for speech synthesis processing of the target speaker in the speech synthesizer 130.

ステップＳ２３で選択された多次元クラスタＣ（ｊ（１，Ｔ），．．．，ｊ（Ｋ，Ｔ））に対応する音声データＤ（Ｓ）が存在しなかった場合、特徴量変換部１２２ｅ（図３）が以下の特徴量変換処理を実行する。
特徴量変換部１２２ｅは、変換関数ｆ_{r,j(r,S),j(r,T)}を用い、ステップＳ２４で選択された音声データＤ（Ｓ）のＫ個の特徴量Ｆ（ｋ，Ｓ）のうちクラスタＣＦ（ｒ，ｊ（ｒ，Ｓ））に属する一部の特徴量Ｆ（ｒ，Ｓ）（ｒ∈｛１，．．．，Ｋ｝）を、ステップＳ２３で選択された多次元クラスタＣ（ｊ（１，Ｔ），．．．，ｊ（Ｋ，Ｔ））を構成するＫ個のクラスタＣＦ（１，ｊ（１，Ｔ）），．．．，ＣＦ（Ｋ，ｊ（Ｋ，Ｔ））の一部のクラスタＣＦ（ｒ，ｊ（ｒ，Ｔ））〔ＣＦ（ｒ，ｊ（ｒ，Ｔ））≠ＣＦ（ｒ，ｊ（ｒ，Ｓ））〕に属する特徴量ＴＦ（ｒ，Ｓ）に変換する。特徴量Ｆ（ｒ，Ｓ）は、特徴量の変換による合成音声の自然性低下の影響が小さいものである（例えば、前述の特徴量３〜５を表す特徴量）。以上により、目標話者の音声の特徴量Ｆ（ｋ’，Ｓ）（ｋ’＝１，．．．，Ｋ、ｋ’≠ｒ），ＴＦ（ｒ，Ｓ）が得られる。図９Ｃの例の場合、特徴量変換部１２２ｅは、変換関数ｆ_2,5,3を用い、ステップＳ２４で選択された音声データＤ（Ｓ）の２個の特徴量Ｆ（１，Ｓ），Ｆ（２，Ｓ）のうち、クラスタＣＦ（２，５）に属する一部の特徴量Ｆ（２，Ｓ）を、ステップＳ２３で選択された多次元クラスタＣ（１，３）を構成する２個のクラスタＣＦ（１，１），ＣＦ（２，３）の一部のクラスタＣＦ（２，３）に属する特徴量ＴＦ（２，Ｓ）に変換する。これにより、Ｋ個の特徴量Ｆ（ｋ’，Ｓ）（ｋ’∈｛１，．．．，Ｋ｝、ｋ’≠r），ＴＦ（ｒ，Ｓ）が得られる。得られた特徴量Ｆ（ｋ’，Ｓ）（ｋ’＝１，．．．，Ｋ、ｋ’≠ｒ），ＴＦ（ｒ，Ｓ）、音声データＤ（Ｓ）及びそのラベルデータ等、又は、特徴量Ｆ（ｋ’，Ｓ）（ｋ’＝１，．．．，Ｋ、ｋ’≠ｒ），ＴＦ（ｒ，Ｓ）に対応するＨＭＭなどの統計モデルは、音声合成部１３０での目標話者の音声合成処理に利用される。変換された特徴量ＴＦ（ｒ，Ｓ）は、特徴量の変換による合成音声の自然性低下の影響が小さいが、音声の類似性には寄与する。よって、このように音声データＤ（Ｓ）の特徴量の一部を変換したものを目標話者の音声の特徴量とし、それを含む情報を音声合成処理に利用することで、自然性を低下させることなく目標話者の音声を合成できる（ステップＳ２６）。特徴量４，５は、スペクトル（ケプストラム）より得られる特徴量であるため、音声を合成する際には、これらの特徴量を用いて合成音声のスペクトル（ケプストラム）が変換される。スペクトル（ケプストラム）の変換は特徴量によって異なり、声道長正規化のワーピングパラメータが特徴量である場合（特徴量５）、１次オールパス関数を用いて合成音声のケプストラムが変換される。各帯域のスペクトルのパワー比が特徴量である場合（特徴量４）、変換前後のスペクトルパワー比から得られる各帯域のフィルタFIL_iを用いて、合成音声のスペクトルを変換する。
FIL_i=BSP’_i/BSP_i
ただし、BSP_iは変換前のi番目の帯域のパワー比であり、BSP’_iは変換後のi番目の帯域のパワー比である。変換後のi番目の帯域のスペクトルは、変換前のi番目の帯域のスペクトルにFIL_iを乗ずることにより得られる。 If the audio data D (S) corresponding to the multidimensional cluster C (j (1, T),..., J (K, T)) selected in step S23 does not exist, the feature amount conversion unit 122e. (FIG. 3) executes the following feature amount conversion processing.
The feature amount conversion unit 122e uses the conversion function f_{r,j(r,S),j(r,T)} to convert, among the K feature amounts F(k, S) of the speech data D(S) selected in step S24, some feature amounts F(r, S) (r ∈ {1, ..., K}) belonging to the cluster CF(r, j(r, S)) into feature amounts TF(r, S) belonging to the cluster CF(r, j(r, T)) [CF(r, j(r, T)) ≠ CF(r, j(r, S))], which is one of the K clusters CF(1, j(1, T)), ..., CF(K, j(K, T)) constituting the multidimensional cluster C(j(1, T), ..., j(K, T)) selected in step S23. The feature amounts F(r, S) are those whose conversion has little effect on degrading the naturalness of the synthesized speech (for example, the feature amounts 3 to 5 described above). This yields the feature amounts F(k′, S) (k′ = 1, ..., K, k′ ≠ r) and TF(r, S) of the target speaker's speech. In the example of FIG. 9C, the feature amount conversion unit 122e uses the conversion function f_{2,5,3} to convert, of the two feature amounts F(1, S) and F(2, S) of the speech data D(S) selected in step S24, the feature amount F(2, S) belonging to the cluster CF(2, 5) into a feature amount TF(2, S) belonging to the cluster CF(2, 3), which is one of the two clusters CF(1, 1) and CF(2, 3) constituting the multidimensional cluster C(1, 3) selected in step S23. As a result, K feature amounts F(k′, S) (k′ ∈ {1, ..., K}, k′ ≠ r) and TF(r, S) are obtained. The obtained feature amounts F(k′, S) (k′ = 1, ..., K, k′ ≠ r) and TF(r, S), the speech data D(S) and its label data, and the like, or a statistical model such as an HMM corresponding to the feature amounts F(k′, S) and TF(r, S), are used in the speech synthesis processing for the target speaker in the speech synthesis unit 130.
The converted feature amount TF(r, S) has little effect on degrading the naturalness of the synthesized speech, yet contributes to the similarity of the voice. Therefore, by treating the partially converted feature amounts of the speech data D(S) as the feature amounts of the target speaker's speech and using information including them in the speech synthesis processing, the target speaker's speech can be synthesized without reducing naturalness (step S26). Since the feature amounts 4 and 5 are obtained from the spectrum (cepstrum), the spectrum (cepstrum) of the synthesized speech is converted using these feature amounts when speech is synthesized. How the spectrum (cepstrum) is converted depends on the feature amount. When the feature amount is the warping parameter of vocal tract length normalization (feature amount 5), the cepstrum of the synthesized speech is converted using a first-order all-pass function. When the feature amount is the spectral power ratio of each band (feature amount 4), the spectrum of the synthesized speech is converted using a filter FIL_i for each band, obtained from the spectral power ratios before and after conversion.
FIL_i = BSP′_i / BSP_i
Here, BSP_i is the power ratio of the i-th band before conversion, and BSP′_i is the power ratio of the i-th band after conversion. The spectrum of the i-th band after conversion is obtained by multiplying the spectrum of the i-th band before conversion by FIL_i.
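The band-wise filtering for feature amount 4 can be sketched as follows. This is a minimal sketch, assuming that the spectrum has already been split into bands, that each band is a plain list of spectral values, and that every BSP_i is nonzero; the function names `band_filters` and `apply_band_filters` are hypothetical.

```python
def band_filters(bsp_before, bsp_after):
    """FIL_i = BSP'_i / BSP_i for each band i (assumes every BSP_i is nonzero)."""
    return [after / before for before, after in zip(bsp_before, bsp_after)]

def apply_band_filters(band_spectra, filters):
    """Multiply every spectral value of band i by FIL_i."""
    return [[x * g for x in band] for band, g in zip(band_spectra, filters)]

bsp_before = [0.5, 0.25, 0.25]    # power ratios BSP_i before conversion
bsp_after = [0.25, 0.5, 0.25]     # power ratios BSP'_i after conversion
filters = band_filters(bsp_before, bsp_after)

band_spectra = [[2.0, 4.0], [1.0], [3.0]]   # toy spectrum split into three bands
converted = apply_band_filters(band_spectra, filters)
```

A band whose power ratio is unchanged receives a gain of 1 and passes through unmodified, as the third band does here.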

＜音声合成処理＞
音声合成処理部１３０は、目標話者学習部１２０で得られた目的話者の特徴量、音声データ及びラベルデータ等、又は、当該特徴量から得られるＨＭＭなどの統計モデル等を用い、公知の波形接続型音声合成方式（例えば「特許２７６１５５２」「特開２００９−１２２３８１」等参照）、又は、ＨＭＭ音声合成方式（例えば「益子貴史，徳田恵一，小林隆夫，今井聖，“動的特徴を用いたHMMに基づく音声合成，” 信学論（D-II），vol.J79-D-II, no.12, pp.2184-2190, 1996.」等参照）等に従い、入力されたテキストに対応する目標話者の音声を合成する。 <Speech synthesis processing>
The speech synthesis processing unit 130 synthesizes the target speaker's speech corresponding to the input text using the target speaker's feature amounts, speech data, label data, and the like obtained by the target speaker learning unit 120, or a statistical model such as an HMM obtained from those feature amounts, in accordance with a known waveform-concatenation speech synthesis method (see, for example, Japanese Patent No. 2761552 and JP 2009-122381 A) or an HMM-based speech synthesis method (see, for example, T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, "HMM-based speech synthesis using dynamic features," IEICE Trans. (D-II), vol. J79-D-II, no. 12, pp. 2184-2190, 1996).

図４を用い、目標話者学習部１２０で得られた特徴量、音声データ及びラベルデータ等を含む目標話者の音声データベースTDBを用い、波形接続型音声合成方式に従って音声合成を行う例を示す。図４の例の場合、入力されたテキスト（Text）がテキスト記憶部１３１ａに格納され、テキスト解析部１３２ｂがテキスト記憶部１３１ａに記憶されたテキストを読み込み、このテキストを形態素解析し、テキストに対応したコンテキスト情報（読み、アクセント等の情報）を生成し、これをコンテキスト記憶部１３１ｂに格納する。 FIG. 4 shows an example in which speech synthesis is performed according to a waveform-connected speech synthesis method using a target speaker speech database TDB including feature amounts, speech data, label data, and the like obtained by the target speaker learning unit 120. . In the case of the example of FIG. 4, the input text (Text) is stored in the text storage unit 131a, the text analysis unit 132b reads the text stored in the text storage unit 131a, parses this text, and corresponds to the text. Context information (information such as reading and accent) is generated and stored in the context storage unit 131b.

韻律生成部１３２ｄは、韻律モデルＤＢ記憶部１３１ｃに格納された韻律モデルを用い、コンテキスト記憶部１３１ｂに格納されたコンテキスト情報に対応する韻律パラメータ（Ｆ０パターン、音素継続時間長、パワー情報等）を生成（推定）し、これを韻律パラメータ記憶部１３１ｄに格納する。 The prosody generation unit 132d uses the prosody model stored in the prosody model DB storage unit 131c, and uses the prosody parameters (F0 pattern, phoneme duration, power information, etc.) corresponding to the context information stored in the context storage unit 131b. It is generated (estimated) and stored in the prosodic parameter storage unit 131d.

素片選択部１３２ｅには、コンテキスト記憶部１３１ｂから読み出したコンテキスト情報、韻律パラメータ記憶部１３１ｄから読み出した韻律パラメータ、目標話者学習部１２０で得られた目標話者の音声データベースTDBが入力される。素片選択部１３２ｅは、例えば、音声データベースTDBの音声データ及びラベルデータから特定される各音声素片を音声素片候補とし、公知の素片選択方式に従って、コンテキスト情報及び韻律パラメータに対する各音声素片候補の評価コストを求め、評価コストが最良となる音声素片候補を音声素片として抽出する。例えば、参考文献２「波形編集型合成方式におけるスペクトル連続性を考慮した波形選択法、日本音響学会講演論文集、2-6-10, pp.239-240, 1990/9」に記載された各サブコスト関数の線形和からなる評価コストが用いられる場合には、評価コストが最小となる音声素片候補が音声素片として選択される。さらに素片選択部１３２ｅは、公知の素片接続方式に従い、韻律パラメータと音声データベースTDBの音声データの特徴量とを用い、抽出した各音声素片に対応する音声データを接続して目標話者の合成音声Voiceを生成する。特徴量が変換されている場合は、抽出した各音声素片に対応する音声データを接続するのではなく、変換された特徴量（スペクトル、非周期性指標等）から得られる音声データを接続し、目標話者の合成音声Voiceを生成する。生成された合成音声Voiceは合成音声記憶部１３１ｅに格納され、必要に応じて読み出されて出力される。 The unit selection unit 132e receives the context information read from the context storage unit 131b, the prosodic parameters read from the prosody parameter storage unit 131d, and the target speaker speech database TDB obtained by the target speaker learning unit 120. . The unit selection unit 132e uses, for example, each speech unit specified from the speech data and label data of the speech database TDB as a speech unit candidate, and uses each speech unit for context information and prosodic parameters according to a known unit selection method. The evaluation cost of the piece candidate is obtained, and the speech unit candidate having the best evaluation cost is extracted as a speech unit. For example, each reference described in Reference 2 “Waveform Selection Method Considering Spectral Continuity in Waveform Editing Type Synthesis Method, Proc. Of the Acoustical Society of Japan, 2-6-10, pp.239-240, 1990/9” When an evaluation cost consisting of a linear sum of sub-cost functions is used, a speech unit candidate that minimizes the evaluation cost is selected as a speech unit. 
Further, in accordance with a known unit concatenation method, the unit selection unit 132e uses the prosodic parameters and the feature amounts of the speech data in the speech database TDB to concatenate the speech data corresponding to the extracted speech units and generate the synthesized speech Voice of the target speaker. When feature amounts have been converted, the unit selection unit 132e concatenates speech data obtained from the converted feature amounts (spectrum, aperiodicity index, etc.) instead of the speech data corresponding to the extracted speech units, and generates the synthesized speech Voice of the target speaker. The generated synthesized speech Voice is stored in the synthesized speech storage unit 131e, and is read out and output as necessary.
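Unit selection with an evaluation cost formed as a linear sum of sub-costs can be sketched as follows. This is a hedged sketch only: the two sub-costs (F0 mismatch and duration mismatch), their weights, and the dictionary layout of a speech unit candidate are assumptions made for illustration and are not the sub-cost functions of Reference 2.

```python
def evaluation_cost(candidate, target, weights, subcosts):
    """Linear sum of weighted sub-costs for one speech unit candidate."""
    return sum(w * c(candidate, target) for w, c in zip(weights, subcosts))

def select_unit(candidates, target, weights, subcosts):
    """Return the candidate with the minimum evaluation cost."""
    return min(candidates,
               key=lambda u: evaluation_cost(u, target, weights, subcosts))

# Hypothetical sub-costs: absolute mismatch in F0 (Hz) and in duration (s).
def f0_cost(u, t):
    return abs(u["f0"] - t["f0"])

def duration_cost(u, t):
    return abs(u["dur"] - t["dur"])

candidates = [
    {"id": "u1", "f0": 120.0, "dur": 0.10},
    {"id": "u2", "f0": 100.0, "dur": 0.50},
    {"id": "u3", "f0": 105.0, "dur": 0.12},
]
target = {"f0": 100.0, "dur": 0.10}    # prosodic parameters for this position
best = select_unit(candidates, target, [1.0, 100.0], [f0_cost, duration_cost])
```

The weights set the trade-off between the sub-costs. Here the duration weight is large because durations are expressed in seconds and would otherwise be dwarfed by the F0 term.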

＜変形例等＞
本発明は上述の実施の形態に限定されるものではない。例えば、上記実施形態のステップＳ２６では、音声データＤ（Ｓ）のＫ個の特徴量Ｆ（ｋ，Ｓ）のうち、特徴量の変換による合成音声の自然性低下の影響が小さい特徴量Ｆ（ｒ，Ｓ）のみを特徴量ＴＦ（ｒ，Ｓ）に変換することとした。しかしながら、音声データＤ（Ｓ）のＫ個の特徴量Ｆ（ｋ，Ｓ）のうち、特徴量の変換による合成音声の自然性の影響は多少大きいが音声の類似性への寄与度が大きい特徴量のみを変換する等、用途に応じて変換する特徴量が選択されることとしてもよい。また上述の各種の処理は、記載に従って時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されてもよい。その他、本発明の趣旨を逸脱しない範囲で適宜変更が可能であることはいうまでもない。 <Modifications>
The present invention is not limited to the above-described embodiment. For example, in step S26 of the above embodiment, only the feature amounts F(r, S) whose conversion has little effect on degrading the naturalness of the synthesized speech, among the K feature amounts F(k, S) of the speech data D(S), are converted into the feature amounts TF(r, S). However, the feature amounts to be converted may be selected according to the application; for example, only feature amounts whose conversion affects the naturalness of the synthesized speech somewhat more but which contribute strongly to voice similarity may be converted. The various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capacity of the apparatus executing them or as needed. Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.

上述の構成をコンピュータによって実現する場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、上記処理機能がコンピュータ上で実現される。この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体の例は、非一時的な（non-transitory）記録媒体である。このような記録媒体の例は、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等である。このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。 When the above configuration is realized by a computer, the processing contents of the functions that each device should have are described by a program. The processing functions are realized on the computer by executing the program on the computer. The program describing the processing contents can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium are a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory, and the like. This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Furthermore, the program may be distributed by storing the program in a storage device of the server computer and transferring the program from the server computer to another computer via a network.

このようなプログラムを実行するコンピュータは、例えば、まず、可搬型記録媒体に記録されたプログラムもしくはサーバコンピュータから転送されたプログラムを、一旦、自己の記憶装置に格納する。そして、処理の実行時、このコンピュータは、自己の記録装置に格納されたプログラムを読み取り、読み取ったプログラムに従った処理を実行する。また、このプログラムの別の実行形態として、コンピュータが可搬型記録媒体から直接プログラムを読み取り、そのプログラムに従った処理を実行することとしてもよく、さらに、このコンピュータにサーバコンピュータからプログラムが転送されるたびに、逐次、受け取ったプログラムに従った処理を実行することとしてもよい。また、サーバコンピュータから、このコンピュータへのプログラムの転送は行わず、その実行指示と結果取得のみによって処理機能を実現する、いわゆるＡＳＰ（Application Service Provider）型のサービスによって、上述の処理を実行する構成としてもよい。なお、本形態におけるプログラムには、電子計算機による処理の用に供する情報であってプログラムに準ずるもの（コンピュータに対する直接の指令ではないがコンピュータの処理を規定する性質を有するデータ等）を含むものとする。 A computer that executes such a program first stores, for example, a program recorded on a portable recording medium or a program transferred from a server computer in its own storage device. When executing the process, this computer reads the program stored in its own recording device and executes the process according to the read program. As another execution form of the program, the computer may directly read the program from a portable recording medium and execute processing according to the program, and the program is transferred from the server computer to the computer. Each time, the processing according to the received program may be executed sequentially. Also, the program is not transferred from the server computer to the computer, and the above-described processing is executed by a so-called ASP (Application Service Provider) type service that realizes the processing function only by the execution instruction and result acquisition. It is good. Note that the program in this embodiment includes information that is used for processing by an electronic computer and that conforms to the program (data that is not a direct command to the computer but has a property that defines the processing of the computer).

In the above embodiment, the apparatus is configured by executing a predetermined program on a computer; however, at least part of the processing contents may be implemented in hardware.

1 Speech synthesis apparatus
110 Speaker clustering unit
120 Target speaker learning unit
130 Speech synthesis unit

Claims

A target speaker learning method in which feature quantities F(k, n) representing features F(k) (k = 1, ..., K, K ≥ 2) of speech data D(n) (n = 1, ..., N, N ≥ 2) of N speakers are clustered independently for each feature F(k), so that J(k) clusters CF(k, j(k)) (k = 1, ..., K, j(k) = 1, ..., J(k), J(k) ≥ 2) are set for each feature F(k), and the K feature quantities F(k, n) of each piece of the speech data D(n) belong to some K clusters CF(k, j(k, n)) (k = 1, ..., K, j(k, n) = 1, ..., J(k), J(k, n) ≥ 2), the method comprising:
a cluster selection step of selecting, from the set clusters CF(k, j(k)), the combination of K clusters CF(k, j(k, T)) (k = 1, ..., K, j(k, T) = 1, ..., J(k)) to which the K feature quantities F(k, T) of speech data D(T) (T ≠ 1, ..., N) of a target speaker belong;
a speaker selection step of selecting, from the speech data D(n) of the N speakers, speech data D(S) corresponding to the combination of the K clusters CF(k, j(k, T));
a feature quantity conversion step of, when the combination of K clusters CF(k, j(k, S)) (k = 1, ..., K, j(k, S) = 1, ..., J(k)) to which the K feature quantities F(k, S) (k = 1, ..., K) of the speech data D(S) belong differs from the combination of the K clusters CF(k, j(k, T)), converting a partial feature quantity F(r, S) (r ∈ {1, ..., K}) of the K feature quantities F(k, S) into a feature quantity TF(r, S) using a conversion function, thereby obtaining K feature quantities F(k′, S) (k′ ∈ {1, ..., K}, k′ ≠ r) and TF(r, S); and
a step of setting, when the combination of the K clusters CF(k, j(k, S)) is equal to the combination of the K clusters CF(k, j(k, T)), the feature quantities F(k, S) as the feature quantities of the target speaker,
wherein the speech data D(S) is speech data D(S′) having K feature quantities F(k, S′) (k = 1, ..., K) belonging to the K clusters CF(k, j(k, T)), or speech data D(S″) having K feature quantities F(k, S″) (k = 1, ..., K) belonging to K clusters CF(k, j(k, S″)), some clusters CF(w, j(w, T)) (w ∈ {1, ..., K}, w ≠ r) included in the K clusters CF(k, j(k, T)) (k = 1, ..., K) are equal to some clusters CF(w, j(w, S″)) included in the K clusters CF(k, j(k, S″)) (k = 1, ..., K), and
the conversion function converts a feature quantity belonging to the cluster CF(r, j(r, S)) to which the feature quantity F(r, S) belongs into a feature quantity belonging to the cluster CF(r, j(r, T)) [CF(r, j(r, T)) ≠ CF(r, j(r, S))] to which a partial feature quantity F(r, T) of the K feature quantities F(k, T) belongs.

The target speaker learning method according to claim 1, wherein
the partial feature quantity F(r, S) includes either:
a feature quantity that can be converted into a feature quantity F(r, s′) of speech data D(s′) (s′ ∈ {1, ..., N}, s′ ≠ S) of a different speaker by any one of a first-order all-pass function, a high-frequency emphasis filter, and a filter representing the spectral power ratio of speech data of different speakers, or a feature quantity for which the average power over the entire band on the frequency axis of the speech data D(S) affects the similarity.

The target speaker learning method according to claim 1 or 2, wherein
the partial feature quantity F(r, S) represents an aperiodicity index of the speech data D(S), the ratio of the spectral power in each band to the spectral power in the entire band of the speech data D(S), or a warping parameter for normalizing the vocal tract length of the speech data D(S).
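The band-wise power ratio named in this claim can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions: the power spectrum is assumed to be already computed, the bands are assumed to be contiguous and of (nearly) equal width, and the function name `band_power_ratios` is hypothetical rather than taken from the specification.

```python
from typing import List, Sequence


def band_power_ratios(power_spectrum: Sequence[float], num_bands: int) -> List[float]:
    """Ratio of the spectral power in each band to the power in the entire band.

    Assumption: bands are contiguous, nearly equal-width slices of the spectrum.
    """
    n = len(power_spectrum)
    if num_bands <= 0 or num_bands > n:
        raise ValueError("num_bands must be between 1 and len(power_spectrum)")
    total = sum(power_spectrum)
    if total <= 0:
        raise ValueError("total spectral power must be positive")
    # Band boundaries as bin indices: edges[i] .. edges[i + 1] is band i.
    edges = [i * n // num_bands for i in range(num_bands + 1)]
    return [sum(power_spectrum[lo:hi]) / total for lo, hi in zip(edges, edges[1:])]


# Usage: a four-bin power spectrum split into two bands.
ratios = band_power_ratios([1.0, 1.0, 2.0, 4.0], 2)
```

Because every bin falls in exactly one band, the ratios sum to one, which makes the feature independent of the overall signal level.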

The target speaker learning method according to any one of claims 1 to 3, wherein
the speaker selection step includes:
a step of, when the speech data D(n) of the N speakers does not include speech data D(S′) having K feature quantities F(k, S′) (k = 1, ..., K) belonging to the K clusters CF(k, j(k, T)), selecting, as the speech data D(S), speech data D(S″) (S″ ∈ {1, ..., N}) having K feature quantities F(k, S″) (k = 1, ..., K) belonging to a combination of K clusters CF(k, j(k, S″)) (k = 1, ..., K, j(k, S″) = 1, ..., J(k)) that differs from the combination of the K clusters CF(k, j(k, T)).

The target speaker learning method according to claim 4, wherein
some clusters CF(w, j(k, T)) (w ∈ {1, ..., K}, w ≠ r) included in the K clusters CF(k, j(k, T)) (k = 1, ..., K) are equal to some clusters CF(w, j(k, S″)) included in the K clusters CF(k, j(k, S″)) (k = 1, ..., K).

The target speaker learning method according to claim 4, wherein
the speaker selection step includes:
a step of, when the speech data D(n) of the N speakers includes only one piece of the speech data D(S′), setting that speech data D(S′) as the speech data D(S); and
a step of, when the speech data D(n) of the N speakers includes a plurality of pieces of the speech data D(S′), setting one piece selected from the speech data D(S′) as the speech data D(S).

A target speaker learning device in which feature quantities F(k, n) representing features F(k) (k = 1, ..., K, K ≥ 2) of speech data D(n) (n = 1, ..., N, N ≥ 2) of N speakers are clustered independently for each feature F(k), so that J(k) clusters CF(k, j(k)) (k = 1, ..., K, j(k) = 1, ..., J(k), J(k) ≥ 2) are set for each feature F(k), and the K feature quantities F(k, n) of each piece of the speech data D(n) belong to some K clusters CF(k, j(k, n)) (k = 1, ..., K, j(k, n) = 1, ..., J(k), J(k, n) ≥ 2), the device comprising:
a cluster selection unit that selects, from the set clusters CF(k, j(k)), the combination of K clusters CF(k, j(k, T)) (k = 1, ..., K, j(k, T) = 1, ..., J(k)) to which the K feature quantities F(k, T) of speech data D(T) (T ≠ 1, ..., N) of a target speaker belong;
a speaker selection unit that selects, from the speech data D(n) of the N speakers, speech data D(S) corresponding to the combination of the K clusters CF(k, j(k, T));
a feature quantity conversion unit that, when the combination of K clusters CF(k, j(k, S)) (k = 1, ..., K, j(k, S) = 1, ..., J(k)) to which the K feature quantities F(k, S) (k = 1, ..., K) of the speech data D(S) belong differs from the combination of the K clusters CF(k, j(k, T)), converts a partial feature quantity F(r, S) (r ∈ {1, ..., K}) of the K feature quantities F(k, S) into a feature quantity TF(r, S) using a conversion function, thereby obtaining K feature quantities F(k′, S) (k′ ∈ {1, ..., K}, k′ ≠ r) and TF(r, S); and
a setting unit that, when the combination of the K clusters CF(k, j(k, S)) is equal to the combination of the K clusters CF(k, j(k, T)), sets the feature quantities F(k, S) as the feature quantities of the target speaker,
wherein the speech data D(S) is speech data D(S′) having K feature quantities F(k, S′) (k = 1, ..., K) belonging to the K clusters CF(k, j(k, T)), or speech data D(S″) having K feature quantities F(k, S″) (k = 1, ..., K) belonging to K clusters CF(k, j(k, S″)), some clusters CF(w, j(w, T)) (w ∈ {1, ..., K}, w ≠ r) included in the K clusters CF(k, j(k, T)) (k = 1, ..., K) are equal to some clusters CF(w, j(w, S″)) included in the K clusters CF(k, j(k, S″)) (k = 1, ..., K), and
the conversion function converts a feature quantity belonging to the cluster CF(r, j(r, S)) to which the feature quantity F(r, S) belongs into a feature quantity belonging to the cluster CF(r, j(r, T)) [CF(r, j(r, T)) ≠ CF(r, j(r, S))] to which a partial feature quantity F(r, T) of the K feature quantities F(k, T) belongs.

A program for causing a computer to execute the processing of each step of the target speaker learning method according to any one of claims 1 to 6.
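
As a rough illustration of what such a program might look like, the following Python sketch walks through cluster selection, speaker selection, and feature conversion. It is a sketch under stated assumptions, not the implementation of the specification: each feature is a scalar, each cluster is represented by a hypothetical centroid, cluster membership is decided by the nearest centroid, the fallback speaker is the one sharing the most clusters with the target, and the conversion function is a simple centroid shift standing in for the conversion functions named in the claims. All function names are illustrative.

```python
from typing import List, Sequence, Tuple

Combo = Tuple[int, ...]


def nearest_cluster(value: float, centroids: Sequence[float]) -> int:
    """Index j of the centroid closest to value (assumed membership rule)."""
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))


def cluster_combination(features: Sequence[float],
                        centroids: Sequence[Sequence[float]]) -> Combo:
    """Cluster index for each of the K features of one speaker."""
    return tuple(nearest_cluster(f, c) for f, c in zip(features, centroids))


def select_speaker(target_combo: Combo, speaker_combos: Sequence[Combo]) -> int:
    """Pick a speaker whose combination equals the target's if one exists,
    otherwise the speaker sharing the most clusters (assumed fallback)."""
    def shared(combo: Combo) -> int:
        return sum(a == b for a, b in zip(combo, target_combo))
    return max(range(len(speaker_combos)), key=lambda n: shared(speaker_combos[n]))


def shift_to_target_cluster(k: int, value: float, j_s: int, j_t: int,
                            centroids: Sequence[Sequence[float]]) -> float:
    """Stand-in conversion function: move value by the centroid offset."""
    return value + (centroids[k][j_t] - centroids[k][j_s])


def learn_target_features(target: Sequence[float],
                          speakers: Sequence[Sequence[float]],
                          centroids: Sequence[Sequence[float]]
                          ) -> Tuple[int, List[float]]:
    """Return the selected speaker index and the learned target features."""
    target_combo = cluster_combination(target, centroids)
    combos = [cluster_combination(f, centroids) for f in speakers]
    s = select_speaker(target_combo, combos)
    learned = list(speakers[s])
    for r, (j_s, j_t) in enumerate(zip(combos[s], target_combo)):
        if j_s != j_t:  # convert only the features whose cluster differs
            learned[r] = shift_to_target_cluster(r, learned[r], j_s, j_t, centroids)
    return s, learned


# Usage with K = 3 features, two clusters per feature, and N = 2 speakers.
centroids = [[0.0, 10.0], [1.0, 5.0], [100.0, 200.0]]
speakers = [[0.5, 1.2, 110.0], [9.0, 4.8, 190.0]]
target = [9.5, 1.1, 195.0]
selected, learned = learn_target_features(target, speakers, centroids)
print(selected, learned)
```

When the selected speaker's cluster combination already equals the target's, the loop converts nothing and the speaker's features are used unchanged, which mirrors the setting step. A centroid shift does not guarantee in general that the converted value lands in the target cluster; it is used here only to keep the example short.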