JP4817250B2 - Voice quality conversion model generation device and voice quality conversion system

Info

Publication number: JP4817250B2
Application number: JP2006236422A
Authority: JP
Inventors: 智基戸田; 大和大谷; 剛志舛田
Original assignee: Asahi Kasei Corp
Current assignee: Asahi Kasei Corp
Priority date: 2006-08-31
Filing date: 2006-08-31
Publication date: 2011-11-16
Anticipated expiration: 2026-08-31
Also published as: JP2008058696A

Abstract

PROBLEM TO BE SOLVED: To provide a voice quality conversion model generation device that suitably generates a voice quality conversion model adaptable to voices with various properties, and a voice quality conversion system that suitably converts speech in the voice quality of an arbitrary source speaker into speech in the voice quality of an arbitrary target speaker by using the generated voice quality conversion model and a predetermined adaptation method.

SOLUTION: The voice quality conversion model generation device 1 learns from voice data of at least one of N source speakers and M target speakers as learning data (normal learning or adaptive learning), and generates a voice quality conversion model comprising one or two models common to at least one of the N source speakers and the M target speakers. The voice quality conversion system then adapts the generated voice quality conversion model to the voice of at least one of the arbitrary source speaker and the arbitrary target speaker by using the predetermined adaptation method, and converts the voice of the arbitrary or specified source speaker into speech in the voice quality of the arbitrary target speaker.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a voice quality conversion model generation device, a voice quality conversion model generation program, and a voice quality conversion model generation method for generating a voice quality conversion model that converts speech in the voice quality of a predetermined source speaker into speech in the voice quality of a predetermined target speaker, and to a voice quality conversion system, a voice quality conversion program, and a voice quality conversion method that use the voice quality conversion model to convert speech in the voice quality of an arbitrary source speaker into speech in the voice quality of an arbitrary target speaker.

Conventionally, there is a technique in which a voice quality conversion model for converting speech in the voice quality of one source speaker into speech in the voice quality of one target speaker is generated from voice data of the same utterance content spoken by both speakers. This model is then adapted, by a predetermined adaptation method, to the voices of another source speaker and another target speaker, and the adapted model is used to convert arbitrary speech of the other source speaker into speech in the voice quality of the other target speaker (see, for example, Non-Patent Document 1).
Non-Patent Document 1: A. Mouchtaris, J. Van der Spiegel, and P. Mueller, "Non-parallel training for voice conversion by maximum likelihood constrained adaptation," ICASSP 2004, vol. 1, pp. 1-4

However, in the conventional technique of Non-Patent Document 1, the voice quality conversion model to be adapted is generated from the voice data of only two speakers, one source speaker and one target speaker, so the quality of another person's speech converted by the adapted model depends solely on the conversion rule between those two speakers. Consequently, for voices whose properties differ greatly from the voice data of those two speakers, the quality of the converted speech may be poor. In particular, when only a small amount of the other person's voice data is available for adaptation, the quality of the converted speech may degrade markedly.

The present invention has been made in view of this unsolved problem of the conventional technique. Its object is to provide a voice quality conversion model generation device, a voice quality conversion model generation program, and a voice quality conversion model generation method suitable for generating a voice quality conversion model that can be adapted to voices with various properties, and to provide a voice quality conversion system, a voice quality conversion program, and a voice quality conversion method suitable for converting speech in the voice quality of an arbitrary source speaker into speech in the voice quality of an arbitrary target speaker by using the voice quality conversion model and a predetermined adaptation method.

To achieve the above object, the voice quality conversion model generation device according to claim 1 of the present invention is
a voice quality conversion model generation device that generates a statistical model for voice quality conversion, comprising:
initial voice quality conversion model generation means for generating an initial voice quality conversion model, which is a statistical model common to N predetermined source speakers (N is an integer of 1 or more) and M predetermined target speakers (M is an integer of 1 or more), by using first voice data, which is voice data of the N source speakers, and second voice data, which is voice data of the M target speakers having the same utterance content as the first voice data;
specific speaker model creation means for creating, from the voice data of the N source speakers and the voice data of the M target speakers, specific speaker models that convert the voice quality of each of the N source speakers into the voice quality of each of the M target speakers and that correspond one-to-one to each of the source speakers and each of the target speakers; and
voice quality conversion model generation means for generating a voice quality conversion model having a speaker control parameter by applying a result, obtained by using the specific speaker models created by the specific speaker model creation means in the eigenvoice technique, to the parameters constituting the initial voice quality conversion model generated by the initial voice quality conversion model generation means, thereby adding to those parameters the speaker control parameter, which is a weight vector for converting the voice of any of the N source speakers into the voice of any of the M target speakers,
wherein at least one of N and M is an integer of 2 or more.

With this configuration, the initial voice quality conversion model generation means can perform learning with a predetermined learning algorithm, using as learning data, for example, the first voice data, which is the voice data of the N predetermined source speakers (N is an integer of 1 or more), and the second voice data, which is the voice data of the M predetermined target speakers (M is an integer of 1 or more) having the same utterance content as the first voice data. Through this learning, it can generate an initial voice quality conversion model, a statistical model common to the N source speakers and the M target speakers, that converts speech in the voice quality of the N source speakers into speech in the voice quality of the M target speakers (at least one of N and M being 2 or more).
Furthermore, the specific speaker model creation means can create, from the voice data of the N source speakers and the voice data of the M target speakers, specific speaker models that convert the voice quality of each of the N source speakers into the voice quality of each of the M target speakers and that correspond one-to-one to each of the source speakers and each of the target speakers.
Furthermore, the voice quality conversion model generation means can generate a voice quality conversion model having a speaker control parameter. It does so by applying the result obtained by using the specific speaker models in the eigenvoice technique to the parameters constituting the initial voice quality conversion model, thereby adding to those parameters the speaker control parameter, which is a weight vector for converting the voice of any of the N source speakers into the voice of any of the M target speakers.

Therefore, for example, when a voice quality conversion model is generated from the voice data of N (two or more) source speakers and the voice data of one target speaker, a single voice quality conversion model can convert the voices of source speakers with various voice qualities into speech in the voice quality of that one target speaker.
Likewise, when a voice quality conversion model is generated from the voice data of one source speaker and the voice data of M (two or more) target speakers, a single voice quality conversion model can convert speech in the voice quality of that one source speaker into speech in the voice qualities of various target speakers.

Also, for example, when a voice quality conversion model common to the speakers is generated from the voice data of N (two or more) source speakers and the voice data of M (two or more) target speakers, one or two voice quality conversion models can convert the voices of source speakers with various voice qualities into speech in the voice qualities of various target speakers. Here, the voice quality conversion model that converts speech in the voice quality of the N source speakers into speech in the voice quality of the M target speakers may consist of a single model, or of two models: a first voice quality conversion model that converts the voices of the N source speakers into the voice of one target speaker A, and a second voice quality conversion model that converts speech in the voice quality of one source speaker A into speech in the voice quality of the M target speakers. In either configuration, one voice quality conversion model corresponds to at least one of the N source speakers and the M target speakers.

This greatly reduces the amount of memory required for the voice quality conversion model, which lowers the cost of building a voice quality conversion system that uses the model. It also makes it easier to build a voice quality conversion system in an environment with limited memory resources. For example, on a mobile device such as a mobile phone, a voice quality conversion system can be built simply with the device's existing memory resources, without adding internal memory or connecting external memory.

In addition, consider the case where the voice quality conversion model is adapted, by a predetermined adaptation method, to at least one of the voice of the source speaker to be converted and the voice of the target speaker that is the conversion target, and is then used for voice quality conversion. Compared with a model generated from the voice data of one source speaker and one target speaker, this model can be adapted more appropriately to at least one of the voices of source speakers with various voice qualities and the voices of target speakers with various voice qualities. The quality of the converted speech can therefore be improved. In particular, the more source speakers or target speakers provide learning data, the more varied the characteristics reflected in the generated voice quality conversion model, so increasing the number of speakers further improves the quality of the converted speech.
In addition, because the voice quality conversion model can be generated by adaptive learning using the known eigenvoice technique, the voice quality of the target speaker at conversion time can be controlled simply by changing the speaker control parameters of the voice quality conversion model. Controlling the speaker control parameters thus allows the voice quality conversion model to be adapted easily to the voice of an arbitrary target speaker. Moreover, with the eigenvoice technique, the voice quality conversion model can be adapted to the voice of a new source speaker with a small amount of voice data (for example, one to several utterances). Because the adaptation then takes little time, the adaptation processing and the voice quality conversion processing can be performed, for example, in real time (online).
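
As a rough illustration of how a speaker control parameter can steer the target voice quality, the following sketch assumes the form commonly used in the eigenvoice literature, in which a target mean vector is a bias vector plus a weighted sum of basis vectors. The names `bias`, `eigenvectors`, and `adapted_mean`, and the toy values, are hypothetical and are not taken from this patent.

```python
from typing import List, Sequence


def adapted_mean(bias: Sequence[float],
                 eigenvectors: Sequence[Sequence[float]],
                 weights: Sequence[float]) -> List[float]:
    """Return bias + sum_j weights[j] * eigenvectors[j] (assumed eigenvoice form)."""
    if len(weights) != len(eigenvectors):
        raise ValueError("one weight is needed per eigenvector")
    mean = list(bias)
    for w, vec in zip(weights, eigenvectors):
        if len(vec) != len(mean):
            raise ValueError("eigenvector and bias dimensions differ")
        for d, value in enumerate(vec):
            mean[d] += w * value
    return mean


# Toy example: a 3-dimensional mean controlled by a 2-dimensional weight vector.
bias = [1.0, 2.0, 3.0]
eigenvectors = [[1.0, 0.0, 0.0],
                [0.0, 1.0, 1.0]]

voice_a = adapted_mean(bias, eigenvectors, [0.5, -1.0])  # one target voice quality
voice_b = adapted_mean(bias, eigenvectors, [0.0, 0.0])   # zero weights give the bias
```

Changing only the short weight vector moves the target mean, which is why a single stored model can serve many target voice qualities.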

Here, the statistical model may be of any kind, such as a GMM (Gaussian mixture model) or a neural network, as long as a single model (a function or the like) can express a conversion rule that, for example, converts speech in the voice quality of N source speakers into speech in the voice quality of one target speaker, converts speech in the voice quality of one source speaker into speech in the voice quality of M target speakers (M is an integer of 2 or more), or converts speech in the voice quality of N source speakers into speech in the voice quality of M target speakers. The same applies to the inventions relating to the voice quality conversion model generation device, the voice quality conversion model generation program, and the voice quality conversion model generation method described below.

The learning corresponds to, for example, learning with the known EM algorithm, which performs maximum likelihood parameter estimation. For example, for a GMM, which is a statistical model using a mixture of normal distributions, the model parameters are the mixture weights λ, the means μ, and the variances σ². These parameters are determined using the learning data. The same applies to the inventions relating to the voice quality conversion model generation device, the voice quality conversion model generation program, and the voice quality conversion model generation method described below.
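
The EM-based estimation of GMM parameters mentioned above can be sketched as follows. This is a minimal toy under stated assumptions: one-dimensional data, exactly two mixture components, and a simple deterministic initialisation. The function names and sample values are hypothetical, and a practical voice quality conversion model would use multivariate features rather than scalars.

```python
import math
from typing import List, Tuple


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a one-dimensional normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data: List[float], n_iter: int = 50
               ) -> Tuple[List[float], List[float], List[float]]:
    """Fit a two-component 1-D GMM with EM; returns (weights, means, variances)."""
    weights = [0.5, 0.5]                      # mixture weights (lambda in the text)
    means = [min(data), max(data)]            # simple deterministic initialisation
    overall_mean = sum(data) / len(data)
    overall_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    variances = [overall_var, overall_var]
    for _ in range(n_iter):
        # E-step: posterior probability (responsibility) of each component.
        resp = []
        for x in data:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the parameters from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)     # floor to avoid a collapsed component
    return weights, means, variances


# Two well-separated clusters around -5 and +5.
samples = [-5.2, -4.9, -5.1, -4.8, -5.0, 4.9, 5.1, 5.0, 5.2, 4.8]
weights, means, variances = fit_gmm_1d(samples)
```

On this toy data the estimated means settle near the two cluster centres, with roughly equal mixture weights.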

The second voice data of the M predetermined target speakers having the same utterance content as the first voice data is to be understood as follows. Suppose the first voice data contains voice data for utterance sentence sets a, b, and c, each consisting of different sentences (for example, 50 sentences per set), and the M predetermined target speakers are the three speakers A, B, and C (M = 3). The second voice data must then contain voice data of sentence sets a, b, and c uttered by target speakers A, B, and C. For example, it may contain sets a, b, and c for each of target speakers A, B, and C, or it may contain set a from target speaker A, set b from target speaker B, and set c from target speaker C. As long as everything in the source speakers' utterance content (sets a, b, and c) is covered, the sentence sets may be distributed among target speakers A, B, and C in any way. However, voice data (any of a, b, or c) from all three target speakers must be included; combinations in which some target speaker does not speak are excluded. The same applies to the inventions relating to the voice quality conversion model generation device, the voice quality conversion model generation program, and the voice quality conversion model generation method described below.
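
The two conditions on the second voice data described above (every sentence set is covered, and every target speaker contributes something) can be checked mechanically. The sketch below is illustrative only; the function name and the dictionary layout are assumptions made for this example.

```python
from typing import Dict, Set


def covers_source_content(source_sets: Set[str],
                          target_sets: Dict[str, Set[str]]) -> bool:
    """Check the two conditions described above for the second voice data.

    1. Every utterance sentence set spoken by the source speakers is spoken
       by at least one target speaker.
    2. Every target speaker contributes at least one of those sets.
    """
    spoken = set()
    for sets in target_sets.values():
        spoken |= sets & source_sets
    every_set_covered = spoken == source_sets
    every_speaker_speaks = all(sets & source_sets for sets in target_sets.values())
    return every_set_covered and every_speaker_speaks


source = {"a", "b", "c"}
full = {"A": {"a", "b", "c"}, "B": {"a", "b", "c"}, "C": {"a", "b", "c"}}
split = {"A": {"a"}, "B": {"b"}, "C": {"c"}}
silent_c = {"A": {"a", "b"}, "B": {"c"}, "C": set()}   # speaker C never speaks
missing_c = {"A": {"a"}, "B": {"b"}, "C": {"a"}}       # set c is never spoken

print(covers_source_content(source, full))       # True
print(covers_source_content(source, split))      # True
print(covers_source_content(source, silent_c))   # False
print(covers_source_content(source, missing_c))  # False
```
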
The eigenvoice technique is a speaker adaptation technique that uses an eigenvoice space (also called an eigenspace) composed of the eigenvectors of many speakers (preferably 100 to 200 speakers). To generate the eigenvoice space, for example, a speaker-specific model (specific speaker model) is first trained for each speaker from the voice data (learning data) of M (two or more) speakers, and a supervector is generated for each speaker from the trained specific speaker model. The specific speaker model consists of an HMM or the like, and the supervector is a list in which at least some of the parameters constituting the HMM (for example, mixture components, mixture coefficients, Gaussian densities, mean vectors, and covariance matrices), such as the means of the Gaussian distributions, are arranged in a predetermined order that is common to all supervectors. Once the supervectors are formed, their dimensionality is reduced; that is, each high-dimensional supervector is mapped to a lower-dimensional one. Dimensionality reduction methods include principal component analysis, linear discriminant analysis, factor analysis, and singular value decomposition. Dimensionality reduction by any of these methods yields an eigenvector (the reduced supervector) for each speaker, and the space formed by these eigenvectors is the eigenvoice space. For details, see, for example, R. Kuhn et al., "Rapid speaker adaptation in eigenvoice space," IEEE Trans. Speech and Audio Processing, Vol. 8, No. 6, pp. 695-707, 2000. The same applies to the inventions relating to the voice quality conversion model generation device, the voice quality conversion model generation program, and the voice quality conversion model generation method described below.
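
The supervector construction and dimensionality reduction described above can be sketched as follows. This is a simplified illustration, not the procedure of the patent: it assumes the supervector is the concatenation of Gaussian mean vectors, uses principal component analysis (one of the listed options), keeps only the leading axis, and finds it by power iteration. All names and the toy data are hypothetical; a practical system would use a linear algebra library and keep several axes.

```python
import math
from typing import List


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def first_principal_axis(supervectors: List[List[float]], n_iter: int = 100) -> List[float]:
    """Leading principal axis of the centred supervectors, by power iteration."""
    mu = mean_vector(supervectors)
    centred = [[x - m for x, m in zip(v, mu)] for v in supervectors]
    dim = len(mu)
    # Start vector; it must not be orthogonal to the leading axis.
    axis = [1.0 / math.sqrt(dim)] * dim
    for _ in range(n_iter):
        # Multiply by the scatter matrix X^T X without forming it explicitly.
        proj = [sum(a * c for a, c in zip(axis, row)) for row in centred]
        new_axis = [sum(p * row[d] for p, row in zip(proj, centred)) for d in range(dim)]
        norm = math.sqrt(sum(a * a for a in new_axis))
        if norm == 0.0:
            break
        axis = [a / norm for a in new_axis]
    return axis


def project(supervector: List[float], mu: List[float], axis: List[float]) -> float:
    """Coordinate of one speaker along the axis (a one-dimensional weight)."""
    return sum(a * (x - m) for a, x, m in zip(axis, supervector, mu))


# Toy data: 4 speakers, each with 2 Gaussian means of dimension 2.
speaker_means = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[1.0, 0.0], [2.0, 1.0]],
    [[2.0, 0.0], [3.0, 1.0]],
    [[3.0, 0.0], [4.0, 1.0]],
]
# Supervector: the means concatenated in the same order for every speaker.
supervectors = [[x for mean in speaker for x in mean] for speaker in speaker_means]

mu = mean_vector(supervectors)
axis = first_principal_axis(supervectors)
coords = [project(v, mu, axis) for v in supervectors]
```

In this toy set the speakers differ only along one direction of the 4-dimensional supervector space, so a single coordinate per speaker is enough to tell them apart.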

Furthermore, the invention according to claim 2 is the voice quality conversion model generation device according to claim 1, wherein
M is an integer of 2 or more,
the device further comprises adaptive learning means for performing adaptive learning using the first voice data of the N source speakers and the second voice data of the M target speakers, and
the adaptive learning means performs, based on the eigenvoice technique, adaptive learning that adapts an initial voice quality conversion model, which converts speech in the voice quality of a predetermined source speaker into speech in the voice quality of a predetermined target speaker, to at least the voices of the M target speakers, out of the voices of the N source speakers and the voices of the M target speakers, and thereby generates a voice quality conversion model having a speaker control parameter with which voice quality can be controlled.

With this configuration, the adaptive learning means can perform, based on the eigenvoice technique, adaptive learning that adapts an initial voice quality conversion model, which converts speech in the voice quality of a predetermined source speaker into speech in the voice quality of a predetermined target speaker, to at least the voices of the M target speakers, out of the voices of the N source speakers and the voices of the M target speakers. It can thereby generate a voice quality conversion model having a speaker control parameter with which voice quality can be controlled.

Therefore, because the voice quality conversion model can be generated by adaptive learning using the known eigenvoice technique, the voice quality of the target speaker at conversion time can be controlled simply by changing the speaker control parameters of the voice quality conversion model. Controlling the speaker control parameters thus allows the voice quality conversion model to be adapted easily to the voice of an arbitrary target speaker. Moreover, with the eigenvoice technique, the voice quality conversion model can be adapted to the voice of a new source speaker with a small amount of voice data (for example, one to several utterances). Because the adaptation then takes little time, the adaptation processing and the voice quality conversion processing can be performed, for example, in real time (online).

Here, the eigenvoice technique is a speaker adaptation technique that uses an eigenvoice space (also called an eigenspace) composed of the eigenvectors of many speakers (preferably 100 to 200 speakers). To generate the eigenvoice space, for example, a speaker-specific model (specific speaker model) is first trained for each speaker from the voice data (learning data) of M (two or more) speakers, and a supervector is generated for each speaker from the trained specific speaker model. The specific speaker model consists of an HMM or the like, and the supervector is a list in which at least some of the parameters constituting the HMM (for example, mixture components, mixture coefficients, Gaussian densities, mean vectors, and covariance matrices), such as the means of the Gaussian distributions, are arranged in a predetermined order that is common to all supervectors. Once the supervectors are formed, their dimensionality is reduced; that is, each high-dimensional supervector is mapped to a lower-dimensional one. Dimensionality reduction methods include principal component analysis, linear discriminant analysis, factor analysis, and singular value decomposition. Dimensionality reduction by any of these methods yields an eigenvector (the reduced supervector) for each speaker, and the space formed by these eigenvectors is the eigenvoice space. For details, see, for example, R. Kuhn et al., "Rapid speaker adaptation in eigenvoice space," IEEE Trans. Speech and Audio Processing, Vol. 8, No. 6, pp. 695-707, 2000. The same applies to the inventions relating to the voice quality conversion model generation device, the voice quality conversion model generation program, and the voice quality conversion model generation method described below.

Meanwhile, to achieve the above object, the voice quality conversion model generation device according to claim 2 is
a voice quality conversion model generation device that generates a statistical model for voice quality conversion, comprising
voice quality conversion model generation means that performs learning using, as learning data, first voice data, which is voice data of N predetermined source speakers (N is an integer of 2 or more), and third voice data, which is voice data of one predetermined intermediate speaker having the same utterance content as the first voice data, to generate a first voice quality conversion model, which is a single statistical model common to the N source speakers and the one intermediate speaker, for converting speech in the voice quality of the N source speakers into the voice of the one intermediate speaker, and that also performs learning using, as learning data, the third voice data of the one intermediate speaker and second voice data, which is voice data of M predetermined target speakers (M is an integer of 2 or more) having the same utterance content as the third voice data, to generate M second voice quality conversion models, which are statistical models corresponding one-to-one to the one intermediate speaker and each of the target speakers, for converting the voice of the one intermediate speaker into speech in the voice quality of the M target speakers.

With this configuration, the voice quality conversion model generation means can perform learning using, as learning data, the first voice data of the N predetermined source speakers and the third voice data of the one predetermined intermediate speaker, and can generate a first voice quality conversion model, which is a single statistical model common to the N source speakers and the one intermediate speaker, for converting speech in the voice quality of the N source speakers into the voice of the one intermediate speaker.

Furthermore, the voice quality conversion model generation means can perform learning using, as learning data, the third voice data of the one intermediate speaker and the second voice data of the M target speakers, and can generate M second voice quality conversion models, which are statistical models corresponding one-to-one to the one intermediate speaker and each of the target speakers, for converting the voice of the one intermediate speaker into speech in the voice quality of the M target speakers.
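
One way to picture how the first model and one of the M second models would be used together is as a two-stage chain: source features are first mapped to the intermediate speaker, then to the chosen target speaker. The sketch below assumes this chaining; the converters are hypothetical stand-in functions, not trained statistical models, and all names are invented for illustration.

```python
from typing import Callable, Dict, List

Features = List[float]
Converter = Callable[[Features], Features]


def make_two_stage_converter(first_model: Converter,
                             second_models: Dict[str, Converter],
                             target: str) -> Converter:
    """Chain source -> intermediate -> target conversion for one target speaker."""
    second_model = second_models[target]     # KeyError if the target is unknown

    def convert(features: Features) -> Features:
        return second_model(first_model(features))

    return convert


# Hypothetical stand-in for the single first model (any source -> intermediate).
def to_intermediate(features: Features) -> Features:
    return [x - 1.0 for x in features]


# Hypothetical stand-ins for the M second models (intermediate -> each target).
second_models: Dict[str, Converter] = {
    "target_1": lambda f: [x + 0.5 for x in f],
    "target_2": lambda f: [x * 2.0 for x in f],
}

convert_to_target_2 = make_two_stage_converter(to_intermediate, second_models, "target_2")
converted = convert_to_target_2([1.0, 2.0, 3.0])
```

Only the second stage depends on the target speaker, so adding a target speaker adds one second model and leaves the shared first model untouched.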

The conventional method generates a separate voice quality conversion model for each speaker pair in order to convert voice of N source speakers' voice quality into voice of M target speakers' voice quality, so N × M voice quality conversion models must be generated. In contrast, the above configuration of the present invention goes through an intermediate speaker and generates a single first voice quality conversion model that converts voice of the N source speakers' voice quality into the voice of the one intermediate speaker. The configuration therefore consists of this one first voice quality conversion model plus M second voice quality conversion models, one per target speaker, that convert the voice of the one intermediate speaker into voice of the M target speakers' voice quality: (M + 1) models in total. Compared with the conventional method, the memory capacity needed to store the voice quality conversion models can thus be greatly reduced.
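The model counts compared above can be checked with a short calculation. The following is only an illustrative sketch; the function names are hypothetical and do not appear in the specification.

```python
def conventional_model_count(n_sources: int, m_targets: int) -> int:
    """One model per (source speaker, target speaker) pair: N x M models."""
    return n_sources * m_targets


def intermediate_model_count(m_targets: int) -> int:
    """One shared first model plus one second model per target: M + 1 models."""
    return 1 + m_targets


if __name__ == "__main__":
    n, m = 200, 200  # illustrative speaker counts
    print(conventional_model_count(n, m), intermediate_model_count(m))
```

The saving grows with N, because the number of models in the intermediate-speaker configuration does not depend on the number of source speakers.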

To achieve the above object, the voice quality conversion system according to claim 3 is
a voice quality conversion system for converting the voice of an arbitrary source speaker into voice of another voice quality, characterized by comprising:
voice quality conversion model storage means for storing a voice quality conversion model having a speaker control parameter, generated by the voice quality conversion model generation device according to claim 1;
source speaker voice data acquisition means for acquiring voice data of an arbitrary source speaker;
target speaker voice data acquisition means for acquiring voice data of an arbitrary target speaker;
speaker control parameter value specifying means for specifying a speaker control parameter value relating to the voice of the M target speakers' voice quality in the voice quality conversion model;
second adaptation means for creating a new voice quality conversion model, on the basis of the voice data acquired by the source speaker voice data acquisition means, the parameter value specified by the speaker control parameter value specifying means, and the voice quality conversion model stored in the voice quality conversion model storage means, by adapting the voice quality conversion model to the voice of the arbitrary source speaker's voice quality and to the specified parameter value using a predetermined adaptation method; and
voice quality conversion means for converting the voice data of the arbitrary source speaker into voice data of another voice quality on the basis of the new voice quality conversion model created by the second adaptation means and the voice data acquired by the source speaker voice data acquisition means.

With this configuration, when the source speaker voice data acquisition means acquires voice data to be used for adaptation to an arbitrary source speaker and the speaker control parameter value specifying means specifies a control parameter value, the second adaptation means adapts the voice quality conversion model, using a predetermined adaptation method, to the voice of the arbitrary source speaker's voice quality and to the specified parameter value, on the basis of this voice data, the speaker control parameter value, and the voice quality conversion model stored in the voice quality conversion model storage means. Then, when the source speaker voice data acquisition means acquires voice data of arbitrary utterance content of the arbitrary source speaker that is to be converted, the voice quality conversion means uses the new voice quality conversion model adapted by the second adaptation means to convert that voice data into voice data of the voice quality of the arbitrary target speaker (the other voice quality) given by the speaker control parameter value.

As a result, by controlling the value of the speaker control parameter with the speaker control parameter value specifying means, the voice quality conversion model generated by the voice quality conversion model generation device according to claim 1 can easily be adapted to voice of another voice quality.

This configuration also makes conversion to an arbitrary target speaker possible by means of the speaker control parameter even when the target speaker's voice is unavailable. By manipulating the speaker control parameter, it is also possible to convert to the voice of a speaker who does not exist.
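As a rough illustration of how manipulating a weight-vector control parameter could yield the voice quality of a speaker who does not exist, the sketch below assumes a common eigenvoice-style formulation in which a target mean vector is a bias vector plus a weighted sum of basis vectors. This formulation and the names `target_mean` and `blend_weights` are assumptions made for illustration; this passage does not spell out the model's internal form.

```python
from typing import List, Sequence


def target_mean(bias: Sequence[float],
                basis: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Assumed form: mean = bias + sum_j weights[j] * basis[j]."""
    if len(basis) != len(weights):
        raise ValueError("one weight is required per basis vector")
    mean = list(bias)
    for w, vec in zip(weights, basis):
        for k, v in enumerate(vec):
            mean[k] += w * v
    return mean


def blend_weights(w_a: Sequence[float], w_b: Sequence[float],
                  alpha: float) -> List[float]:
    """Interpolate two speakers' weight vectors (alpha=0 gives w_a, 1 gives w_b)."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(w_a, w_b)]


if __name__ == "__main__":
    bias = [1.0, 1.0]                      # hypothetical bias vector
    basis = [[1.0, 0.0], [0.0, 1.0]]       # hypothetical basis vectors
    virtual = blend_weights([0.0, 0.0], [2.0, 4.0], 0.5)
    print(target_mean(bias, basis, virtual))
```

A weight vector halfway between two real speakers corresponds to neither of them, which is one simple way to reach a voice quality for which no recording exists.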

To achieve the above object, the voice quality conversion system according to claim 4 is
a voice quality conversion system for converting voice of arbitrary utterance content of an arbitrary source speaker into voice of the voice quality of predetermined M (M is an integer of 2 or more) target speakers, characterized by comprising:
voice quality conversion model storage means for storing a first voice quality conversion model that converts voice of the voice quality of predetermined N (N is an integer of 2 or more) source speakers into the voice of one intermediate speaker, and a second voice quality conversion model that converts the voice of the one intermediate speaker into voice of the M target speakers' voice quality, both generated by the voice quality conversion model generation device according to claim 2;
source speaker voice data acquisition means for acquiring voice data of the arbitrary source speaker;
adaptation means for adapting the first voice quality conversion model to the voice of the arbitrary source speaker's voice quality using a predetermined adaptation method, on the basis of the voice data acquired by the source speaker voice data acquisition means and the first voice quality conversion model stored in the voice quality conversion model storage means; and
voice quality conversion means for converting, using the adapted first voice quality conversion model, the voice data of arbitrary utterance content of the arbitrary source speaker acquired by the source speaker voice data acquisition means into voice data of the intermediate speaker's voice quality, and for converting, using the second voice quality conversion model stored in the voice quality conversion model storage means, the voice data converted to the intermediate speaker's voice quality into voice data of the arbitrary target speaker.

With this configuration, when the source speaker voice data acquisition means acquires voice data to be used for adaptation to an arbitrary source speaker, the adaptation means adapts the first voice quality conversion model to the voice of the arbitrary source speaker's voice quality using a predetermined adaptation method, on the basis of this voice data and the first voice quality conversion model stored in the voice quality conversion model storage means. Then, when the source speaker voice data acquisition means acquires voice data of arbitrary utterance content of the arbitrary source speaker that is to be converted (the voice data used for adaptation may itself be the data to be converted), the voice quality conversion means uses the adapted first voice quality conversion model to convert that voice data into voice data of the one intermediate speaker's voice quality. The voice quality conversion means then converts this voice data of the intermediate speaker's voice quality into voice data of the target speaker using the second voice quality conversion model.
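The two-stage flow described above (source to intermediate speaker, then intermediate to target speaker) can be sketched as a simple composition. In this sketch the models are plain callables operating on feature frames, and the names `convert_via_intermediate`, `adapted_first_model`, and `second_models` are hypothetical; real models would be statistical models, not the toy functions used here.

```python
from typing import Callable, Dict, List

Frame = List[float]
Model = Callable[[Frame], Frame]


def convert_via_intermediate(frames: List[Frame],
                             adapted_first_model: Model,
                             second_models: Dict[str, Model],
                             target_id: str) -> List[Frame]:
    """Apply the adapted first model, then the second model of the chosen target."""
    second_model = second_models[target_id]          # one second model per target
    intermediate = [adapted_first_model(f) for f in frames]
    return [second_model(f) for f in intermediate]


if __name__ == "__main__":
    # Toy stand-ins for the statistical models (assumptions for illustration).
    first = lambda f: [x + 1.0 for x in f]
    seconds = {"target_a": lambda f: [x * 2.0 for x in f]}
    print(convert_via_intermediate([[1.0, 2.0]], first, seconds, "target_a"))
```

Only the first model needs adapting to a new source speaker; the per-target second models are used unchanged.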

This provides operations and effects equivalent to those described above for the voice quality conversion model generation device according to claim 2.
Furthermore, the invention according to claim 5 is the voice quality conversion system according to claim 4, characterized in that
the adaptation means estimates, using the predetermined adaptation method, adaptation parameter values that adapt the first voice quality conversion model to the voice of the arbitrary source speaker's voice quality, for the parameters of the first voice quality conversion model that relate to the voice of the N source speakers' voice quality, and converts the values of those parameters of the first voice quality conversion model into the estimated adaptation parameter values.

With this configuration, the adaptation means can use the predetermined adaptation method to convert the values of the parameters of the first voice quality conversion model that relate to the voice of the N source speakers' voice quality into values suitable for converting the voice of the arbitrary source speaker's voice quality into the voice of the one intermediate speaker. This provides operations and effects equivalent to those described above for the voice quality conversion model generation device according to claim 2.

To achieve the above object, the voice quality conversion model generation program according to claim 6 is
a voice quality conversion model generation program for generating a statistical model for voice quality conversion, including a program for causing a computer to execute:
an initial voice quality conversion model generation step of generating an initial voice quality conversion model, which is a statistical model common to the N source speakers and the M target speakers, using first voice data that is voice data of predetermined N (N is an integer of 1 or more) source speakers and second voice data that is voice data of predetermined M (M is an integer of 1 or more) target speakers having the same utterance content as the first voice data;
a specific speaker model creation step of creating, from the voice data of the N source speakers and the voice data of the M target speakers, specific speaker models that convert the voice quality of each of the N source speakers into the voice quality of each of the M target speakers and that correspond one-to-one to each of the source speakers and each of the target speakers; and
a voice quality conversion model generation step of generating a voice quality conversion model having a speaker control parameter, by applying the result obtained by using the specific speaker models created in the specific speaker model creation step in the eigenvoice technique to the parameters constituting the initial voice quality conversion model generated in the initial voice quality conversion model generation step, and by adding to those parameters the speaker control parameter, which is a weight vector for converting the voice of any of the N source speakers into the voice of any of the M target speakers,
wherein at least one of N and M is an integer of 2 or more.

With this configuration, when a computer reads the program and executes processing in accordance with it, operations and effects equivalent to those of the voice quality conversion model generation device according to claim 1 are obtained.
To achieve the above object, the voice quality conversion model generation method according to claim 7 is
a voice quality conversion model generation method for generating a statistical model for voice quality conversion, including:
an initial voice quality conversion model generation step of generating an initial voice quality conversion model, which is a statistical model common to the N source speakers and the M target speakers, using first voice data that is voice data of predetermined N (N is an integer of 1 or more) source speakers and second voice data that is voice data of predetermined M (M is an integer of 1 or more) target speakers having the same utterance content as the first voice data;
a specific speaker model creation step of creating, from the voice data of the N source speakers and the voice data of the M target speakers, specific speaker models that convert the voice quality of each of the N source speakers into the voice quality of each of the M target speakers and that correspond one-to-one to each of the source speakers and each of the target speakers; and
a voice quality conversion model generation step of generating a voice quality conversion model having a speaker control parameter, by applying the result obtained by using the specific speaker models created in the specific speaker model creation step in the eigenvoice technique to the parameters constituting the initial voice quality conversion model generated in the initial voice quality conversion model generation step, and by adding to those parameters the speaker control parameter, which is a weight vector for converting the voice of any of the N source speakers into the voice of any of the M target speakers,
wherein at least one of N and M is an integer of 2 or more.
This provides operations and effects equivalent to those of the voice quality conversion model generation device according to claim 1.
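The eigenvoice step named above is commonly realized as a principal component analysis over per-speaker "supervectors". The sketch below assumes that formulation: each specific speaker model is summarized as one supervector, the mean supervector serves as a bias vector, and the leading principal direction serves as one eigenvoice. The function names, the use of power iteration, and the choice of a single component are assumptions made for illustration, not details taken from this passage.

```python
import math
from typing import List, Sequence, Tuple


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def first_eigenvoice(supervectors: Sequence[Sequence[float]],
                     iterations: int = 100) -> Tuple[List[float], List[float]]:
    """Return (bias vector, unit-length leading principal direction)."""
    bias = mean_vector(supervectors)
    centered = [[x - b for x, b in zip(v, bias)] for v in supervectors]
    direction = [1.0] * len(bias)  # start vector; must not be orthogonal to the data
    for _ in range(iterations):
        # Multiply by the (unnormalized) covariance: sum_s (c_s . d) * c_s
        dots = [sum(c * d for c, d in zip(cs, direction)) for cs in centered]
        new = [sum(p * cs[k] for p, cs in zip(dots, centered))
               for k in range(len(bias))]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            raise ValueError("no variance along the start vector")
        direction = [x / norm for x in new]
    return bias, direction


def speaker_weight(supervector: Sequence[float], bias: Sequence[float],
                   direction: Sequence[float]) -> float:
    """Project one speaker's centered supervector onto the eigenvoice."""
    return sum((x - b) * d for x, b, d in zip(supervector, bias, direction))


if __name__ == "__main__":
    speakers = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]  # toy supervectors
    bias, eig = first_eigenvoice(speakers)
    print(bias, eig, speaker_weight(speakers[2], bias, eig))
```

In this picture, the weight obtained by projection plays the role of one element of the weight vector that the text calls the speaker control parameter.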

To achieve the above object, the voice quality conversion client-server system according to claim 8 is
a voice quality conversion client-server system in which a client computer and a server computer are connected via a network and which converts the voice of an arbitrary source speaker into voice of another voice quality, characterized in that
the server computer comprises:
voice quality conversion model storage means for storing the voice quality conversion model generated by the voice quality conversion model generation device according to claim 1; and
voice quality conversion model transmission means for transmitting the voice quality conversion model stored in the voice quality conversion model storage means to the client computer,
and the client computer comprises:
source speaker voice data acquisition means for acquiring voice data of the arbitrary source speaker;
speaker control parameter value specifying means for specifying a value of the speaker control parameter in the voice quality conversion model;
voice quality conversion model reception means for receiving the voice quality conversion model from the server computer;
adaptation means for creating a new voice quality conversion model, on the basis of the voice data acquired by the source speaker voice data acquisition means, the parameter value specified by the speaker control parameter value specifying means, and the voice quality conversion model received by the voice quality conversion model reception means, by adapting the voice quality conversion model to the voice of the arbitrary source speaker's voice quality and to the specified parameter value using a predetermined adaptation method; and
voice quality conversion means for converting the voice data of the arbitrary source speaker into voice data of another voice quality on the basis of the new voice quality conversion model created by the adaptation means and the voice data acquired by the source speaker voice data acquisition means.

With this configuration, the server computer can store, in the voice quality conversion model storage means, the voice quality conversion model generated by the voice quality conversion model generation device according to claim 1, which converts voice of the voice quality of predetermined N (N is an integer of 2 or more) source speakers into voice of the voice quality of predetermined M (M is an integer of 2 or more) target speakers, and can transmit the voice quality conversion model stored in the voice quality conversion model storage means to the client computer by means of the voice quality conversion model transmission means.

The client computer, for its part, can acquire the voice data of the arbitrary source speaker with the source speaker voice data acquisition means and can receive the voice quality conversion model from the server computer with the voice quality conversion model reception means. The source speaker voice data acquisition means can acquire voice data to be used for adaptation to the arbitrary source speaker, and the speaker control parameter value specifying means can specify a control parameter value. On the basis of this voice data, the speaker control parameter value, and the voice quality conversion model stored in the voice quality conversion model storage means, the adaptation means can then adapt the voice quality conversion model, using a predetermined adaptation method, to the voice of the arbitrary source speaker's voice quality and to the voice of the arbitrary target speaker's voice quality.

The client computer can also use the adapted voice quality conversion model, by means of the voice quality conversion means, to convert the voice data of arbitrary utterance content of the arbitrary source speaker acquired by the source speaker voice data acquisition means into voice data of the arbitrary target speaker.
This provides operations and effects equivalent to those of the voice quality conversion system according to claim 3.
In addition, because the server computer holds relatively large data such as the voice quality conversion model, the client computer on the user side needs memory for the voice quality conversion model only temporarily, during voice quality conversion, and has the advantage of not needing to keep holding the model. Furthermore, because the voice quality conversion model can be managed on the server computer side, the model can easily be upgraded, and the upgraded model can easily be provided to users.
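The division of roles described above can be sketched as follows. This is only an in-process illustration: a dictionary copy stands in for network transmission, and `ModelServer`, `ConversionClient`, `toy_adapt`, and `toy_convert` are hypothetical names with toy behavior, not the adaptation or conversion methods of the specification.

```python
from typing import Callable, Dict, List


class ModelServer:
    """Holds the (relatively large) voice quality conversion model."""

    def __init__(self, model: Dict[str, float]) -> None:
        self._model = dict(model)
        self.version = 1

    def update_model(self, model: Dict[str, float]) -> None:
        """Upgrade the model on the server side only."""
        self._model = dict(model)
        self.version += 1

    def send_model(self) -> Dict[str, float]:
        # A copy stands in for transmission over the network.
        return dict(self._model)


class ConversionClient:
    """Fetches the model only for the duration of one conversion."""

    def __init__(self, server: ModelServer,
                 adapt: Callable[[Dict[str, float], List[float], float], Dict[str, float]],
                 convert: Callable[[Dict[str, float], List[float]], List[float]]) -> None:
        self._server, self._adapt, self._convert = server, adapt, convert

    def convert_voice(self, adaptation_data: List[float], control_value: float,
                      frames: List[float]) -> List[float]:
        model = self._server.send_model()          # held temporarily
        adapted = self._adapt(model, adaptation_data, control_value)
        return self._convert(adapted, frames)      # model is dropped on return


def toy_adapt(model, adaptation_data, control_value):
    """Toy stand-in: shift the model offset by the control value."""
    return {"offset": model["offset"] + control_value}


def toy_convert(model, frames):
    """Toy stand-in: add the adapted offset to every frame value."""
    return [x + model["offset"] for x in frames]
```

Because the client asks the server for the model at each conversion, an upgrade made with `update_model` takes effect on the next call without any change on the client side.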

Here, the "client computer" may be a desktop computer such as a PC or workstation (WS), a portable computer such as a notebook PC, PDA, or mobile phone, or a computer game machine such as a portable game machine. The same applies to the client computers mentioned below.

As described above, according to the voice quality conversion model generation device according to claims 1 and 2, the voice quality conversion model generation program according to claim 6, and the voice quality conversion model generation method according to claim 7 of the present invention, the amount of memory required for the voice quality conversion model can be greatly reduced, so the cost of building a voice quality conversion system that uses the model can be reduced. A voice quality conversion system also becomes easier to build in environments where memory resources are limited.

Further, according to the voice quality conversion model generation device according to claim 1 of the present invention, the voice quality conversion model can be generated by adaptive learning using the known eigenvoice technique, so the model can easily be adapted to the voice of an arbitrary target speaker's voice quality by controlling the speaker control parameter.

Further, according to the voice quality conversion system according to claim 4 or 5 of the present invention, the voice quality conversion model generated by the voice quality conversion model generation device according to claim 2 is adapted, using a predetermined adaptation method, to at least one of the voice of the source speaker to be converted and the voice of the target speaker that is the conversion target, and is then used for voice quality conversion. In addition to the effect of claim 2, the model can therefore be adapted appropriately to at least one of source speakers with various voice qualities and target speakers with various voice qualities, more so than when a voice quality conversion model generated from the voice data of a single source speaker and a single target speaker is adapted and used. The sound quality of the converted voice can thus be improved.

Further, according to the voice quality conversion system according to claim 3 of the present invention, the voice quality conversion model based on the eigenvoice technique, generated by the voice quality conversion model generation device according to claim 1, is used for voice quality conversion. In addition to the effect of claim 1, using the eigenvoice technique shortens the time required for adaptation, so the adaptation processing and the voice quality conversion processing can, for example, be performed in real time (online).

Further, according to the voice quality conversion client-server system according to claim 8 of the present invention, in addition to the effects of the voice quality conversion system according to claim 3, the voice quality conversion model can easily be upgraded, and the upgraded model can easily be provided to users.

[First Embodiment]
Embodiments of the present invention will now be described with reference to the drawings. FIGS. 1 to 5 show a first embodiment of the voice quality conversion model generation device, voice quality conversion model generation program, and voice quality conversion model generation method according to the present invention.
First, the configuration of the voice quality conversion model generation device according to the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the voice quality conversion model generation device 1 according to the first embodiment of the present invention.

The voice quality conversion model generation device 1 includes a source speaker voice data storage unit 10 that stores voice data of source speakers, who are the speakers to be converted from; a target speaker voice data storage unit 11 that stores voice data of target speakers, who are the speakers to be converted to; a voice quality conversion model generation unit 12 that generates a voice quality conversion model for converting the voice quality of a source speaker's voice into that of a target speaker; and a voice quality conversion model storage unit 13 that stores the voice quality conversion model.

The source speaker voice data storage unit 10 has a function of storing, for each source speaker, voice data of sentences uttered by a large number of source speakers (for example, 200 or more) (hereinafter referred to as first voice data). The first voice data is voice waveform data obtained by having each of these source speakers utter the sentences of utterance sentence sets a to c (each set consisting of 50 sentences), inputting the speech to an information processing device such as a PC (personal computer) through a sound collection device such as a microphone, and applying signal processing. The first voice data may instead be data consisting of a series of feature parameters (a feature parameter series) extracted from the voice waveform data, such as cepstrum coefficients (CC) or mel-cepstrum coefficients (MFCC).

The target speaker voice data storage unit 11 has a function of storing, for each target speaker, voice data of sentences uttered by a large number of target speakers (for example, 200 or more) (hereinafter referred to as second voice data). The second voice data covers all of the utterance sentence sets a to c uttered by the source speakers. It is voice waveform data obtained by having each target speaker utter the sentences of utterance sentence sets a to c (each set consists of 50 sentences with different content, and the content also differs between sets), inputting the speech to an information processing device such as a PC through a sound collection device such as a microphone, and applying signal processing. The second voice data may instead be stored as a series of feature parameters extracted in advance from the voice waveform data.

The voice quality conversion model generation unit 12 uses, as learning data, the first voice data of N (N ≥ 2) source speakers or one source speaker stored in the source speaker voice data storage unit 10 and the second voice data of M (M ≥ 2) target speakers or one target speaker stored in the target speaker voice data storage unit 11. It learns a conversion rule (statistical model) that converts speech with the voice quality of the N source speakers or the one source speaker into speech with the voice quality of the one target speaker or the M target speakers, and generates a single voice quality conversion model common to all of the source and target speakers. The models to be generated fall into three broad types: (1) a model that converts speech of N source speakers into speech with the voice quality of one target speaker (hereinafter, N-to-1 voice quality conversion model); (2) a model that converts speech of one source speaker into speech with the voice quality of M target speakers (hereinafter, 1-to-M voice quality conversion model); and (3) a model that converts speech of N source speakers into speech with the voice quality of M target speakers (hereinafter, N-to-M voice quality conversion model).

As another form of the N-to-M voice quality conversion model, the voice quality conversion model generation unit 12 can also generate (4) a set of (1 + M) voice quality conversion models (hereinafter, N-to-M intermediate voice quality conversion model). This set consists of one model that converts speech of N source speakers into the speech of one intermediate speaker (hereinafter, N-to-1 intermediate voice quality conversion model) and M models, one for each of the M target speakers, that convert the speech of the intermediate speaker into speech with the voice quality of that target speaker (hereinafter, 1-to-M intermediate voice quality conversion models). In this case, the voice data of the speaker serving as the intermediate speaker (hereinafter, third voice data) must be acquired from the source speaker voice data storage unit 10 or the target speaker voice data storage unit 11. Alternatively, synthesized speech data of the intermediate speaker may be generated from the text of utterance sets a to c by, for example, an external text-to-speech (TTS) synthesizer.
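The four configurations differ in how many models they produce: types (1) to (3) yield a single shared model, while type (4) yields 1 + M models. The following is a minimal sketch of that bookkeeping. The names `ModelPlan` and `plan_models` and the configuration strings are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPlan:
    """Models that one configuration produces (hypothetical helper)."""
    configuration: str
    models: tuple  # names of the models to be trained


def plan_models(configuration: str, m_targets: int = 1) -> ModelPlan:
    if configuration in ("N-to-1", "1-to-M", "N-to-M"):
        # Types (1) to (3): one model common to all source and target speakers.
        return ModelPlan(configuration, (configuration,))
    if configuration == "N-to-M-intermediate":
        # Type (4): one N-to-1 intermediate model plus one model per target speaker.
        per_target = tuple(
            f"intermediate-to-target-{j}" for j in range(1, m_targets + 1)
        )
        return ModelPlan(configuration, ("N-to-1-intermediate",) + per_target)
    raise ValueError(f"unknown configuration: {configuration}")


plan = plan_models("N-to-M-intermediate", m_targets=3)
print(len(plan.models))  # 1 + M models for M = 3
```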

The voice quality conversion model generation unit 12 has two learning modes, a normal learning mode and an adaptive learning mode. It selects whichever mode is set as the learning condition, performs learning according to that mode, and generates the voice quality conversion models (1) to (4) described above.
The voice quality conversion model storage unit 13 stores the voice quality conversion models (1) to (4) generated by the voice quality conversion model generation unit 12, and also stores the initial voice quality conversion model used in the adaptive learning described later.

Next, the detailed configuration of the voice quality conversion model generation unit 12 is described with reference to FIG. 2, which is a block diagram of that unit.
As shown in FIG. 2, the voice quality conversion model generation unit 12 includes a feature extraction unit 12a, a normal learning unit 12b, a specific speaker model generation unit 12c, and an adaptive learning unit 12d.
The feature extraction unit 12a extracts feature quantities (also called feature parameters) of four or more dimensions (for example, 40 dimensions) from the first to third voice data acquired from the source speaker voice data storage unit 10 and the target speaker voice data storage unit 11, by analysis such as cepstrum analysis or linear prediction analysis.
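As an illustration of cepstrum analysis, the sketch below computes the real cepstrum of one frame as the inverse DFT of the log magnitude spectrum. It is a toy, assuming a single unwindowed frame and a naive O(n²) DFT. The specification does not state which analysis variant the feature extraction unit 12a uses, and the function names are hypothetical.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform (adequate for a short illustrative frame)."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, n_coeffs):
    """Return the first n_coeffs real-cepstrum coefficients of a frame."""
    n = len(frame)
    # Log magnitude spectrum; the floor avoids log(0) for empty bins.
    log_mag = [math.log(max(abs(s), 1e-12)) for s in dft(frame)]
    # Inverse DFT of the log magnitude spectrum, keeping the real part.
    cepstrum = [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
        for q in range(n)
    ]
    return cepstrum[:n_coeffs]


features = real_cepstrum([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], n_coeffs=4)
```

For the impulse frame used here, the spectrum is flat, so only the zeroth coefficient is non-zero.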

The normal learning unit 12b operates when the normal learning mode is set as the learning condition. Using the feature quantities of the first to third voice data extracted by the feature extraction unit 12a as learning data, it learns a conversion rule (statistical model) that converts speech with the voice quality of the source speaker into speech with the voice quality of the target speaker, and generates a voice quality conversion model of one of the configurations (1) to (4). Specifically, in this embodiment the parameters of a GMM (Gaussian mixture model) are determined by learning (estimation) with the well-known EM algorithm, which yields a voice quality conversion model composed of a GMM.
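The EM estimation of GMM parameters mentioned above can be illustrated with a one-dimensional toy. This is only a sketch under simplifying assumptions: the embodiment works on feature vectors of four or more dimensions, whereas the hypothetical `fit_gmm_1d` below fits scalar data with caller-supplied initial parameters.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, means, variances, weights, n_iter=50):
    """Estimate weights, means and variances of a 1-D GMM with the EM algorithm."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample.
        resp = []
        for x in data:
            likes = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M-step: re-estimate the parameters from the weighted samples.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density well defined
    return weights, means, variances


samples = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2]
weights, means, variances = fit_gmm_1d(samples, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
```

With these samples the two component means settle near the two clusters of the data.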

The specific speaker model generation unit 12c operates when the adaptive learning mode is set as the learning condition. Using the feature quantities of the first to third voice data extracted by the feature extraction unit 12a as learning data, it learns a separate conversion rule for converting the speech of each source speaker into the speech of each target speaker, and generates specific speaker models in which each source speaker corresponds one-to-one to a target speaker.

The adaptive learning unit 12d operates when the adaptive learning mode is set as the learning condition. It acquires an initial voice quality conversion model from the voice quality conversion model storage unit 13 and adaptively trains it, using a predetermined adaptation method, so that it adapts to the specific speaker models generated by the specific speaker model generation unit 12c. In this embodiment, a well-known adaptation method such as the eigenvoice technique or speaker adaptive training (SAT) is used to estimate the parameters of the initial voice quality conversion model from the specific speaker models, and a voice quality conversion model of one of the configurations (1) to (4) is generated from the estimate. An iterative learning mode can also be set as a learning condition. When it is set, the adaptive learning unit 12d repeats adaptive learning on the initial voice quality conversion model until a predetermined condition is satisfied, using the model obtained in each round as the initial voice quality conversion model for the next round.

In this embodiment, the voice quality conversion model generation device 1 includes a processor (not shown), a RAM (Random Access Memory), and a ROM (Read Only Memory) storing a dedicated program, and the processor executes the dedicated program to implement the functions of the units described above. Some of these units implement their functions by the dedicated program alone, while others do so by having the dedicated program control hardware.

Next, the flow of the voice quality conversion model generation process in the voice quality conversion model generation device 1 is described with reference to FIG. 3, which is a flowchart of that process.
As shown in FIG. 3, the process starts at step S100, where the voice quality conversion model generation unit 12 determines whether the user has issued an instruction to generate a voice quality conversion model via an operation unit (not shown). If an instruction has been issued (Yes), the process proceeds to step S102; otherwise (No), the determination is repeated until an instruction is issued.

The generation instruction includes the information set as learning conditions (hereinafter, setting information). The setting information includes the configuration of the voice quality conversion model (N-to-1, 1-to-M, N-to-M, or intermediate speaker), information identifying the source, target, and intermediate speakers (for example, the speaker numbers attached to the voice data of the source and target speakers), and the learning mode (normal learning mode, or adaptive learning mode together with the on/off state of the iterative learning mode and the predetermined iteration condition).

In step S102, the voice quality conversion model generation unit 12 determines from the setting information whether the configuration of the voice quality conversion model is N-to-1 (source speakers to target speaker). If it is N-to-1 (Yes), the process proceeds to step S104; otherwise (No), it proceeds to step S114.
In step S104, the voice quality conversion model generation unit 12 acquires, based on the setting information, the first voice data of the designated N source speakers from the source speaker voice data storage unit 10 and the second voice data of the designated one target speaker from the target speaker voice data storage unit 11, and the process proceeds to step S106.

In step S106, the voice quality conversion model generation unit 12 determines from the setting information whether the normal learning mode is set. If it is set (Yes), the process proceeds to step S108; otherwise (No), it proceeds to step S112.
In step S108, the voice quality conversion model generation unit 12 executes the voice quality conversion model generation process by normal learning, generates a voice quality conversion model corresponding to the model configuration in the setting information, and proceeds to step S110.

In step S110, the voice quality conversion model generation unit 12 stores the voice quality conversion model generated in step S108 or step S112 in the voice quality conversion model storage unit 13, and the process ends.
If, on the other hand, the learning mode in step S106 is not the normal learning mode and the process has proceeded to step S112, the voice quality conversion model generation unit 12 executes the voice quality conversion model generation process by adaptive learning, generates a voice quality conversion model corresponding to the model configuration in the setting information, and proceeds to step S110.

If the configuration in step S102 is not N-to-1 and the process has proceeded to step S114, the voice quality conversion model generation unit 12 determines whether the configuration of the voice quality conversion model is 1-to-M (source speaker to target speakers). If it is 1-to-M (Yes), the process proceeds to step S116; otherwise (No), it proceeds to step S118.
In step S116, the voice quality conversion model generation unit 12 acquires, based on the setting information, the first voice data of the designated one source speaker from the source speaker voice data storage unit 10 and the second voice data of the designated M target speakers from the target speaker voice data storage unit 11, and the process proceeds to step S106.

In step S118, the voice quality conversion model generation unit 12 determines from the setting information whether the configuration of the voice quality conversion model uses an intermediate speaker. If it does (Yes), the process proceeds to step S120; otherwise (No), it proceeds to step S122.
In step S120, the voice quality conversion model generation unit 12 acquires, based on the setting information, the first voice data of the designated N source speakers from the source speaker voice data storage unit 10, the third voice data of the designated one intermediate speaker from the source speaker voice data storage unit 10 or the target speaker voice data storage unit 11, and the second voice data of the designated M target speakers from the target speaker voice data storage unit 11, and the process proceeds to step S106.

In step S122, the voice quality conversion model generation unit 12 acquires, based on the setting information, the first voice data of the designated N source speakers from the source speaker voice data storage unit 10 and the second voice data of the designated M target speakers from the target speaker voice data storage unit 11, and the process proceeds to step S106.
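The top-level flow of FIG. 3 (steps S102 to S122) can be sketched as a single dispatch function. This is a minimal sketch: the `Settings` fields, the dictionary-based stores, and the trainer callables are hypothetical stand-ins for the setting information, the storage units 10, 11 and 13, and the learning processes of steps S108 and S112.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    configuration: str           # "N-to-1", "1-to-M", "N-to-M" or "intermediate"
    source_ids: list             # designated source speakers
    target_ids: list             # designated target speakers
    intermediate_id: object = None
    mode: str = "normal"         # "normal" or "adaptive"


def generate_model(settings, source_store, target_store,
                   train_normal, train_adaptive, model_store):
    # S104 / S116 / S120 / S122: load the designated first and second voice data.
    data = {
        "source": [source_store[i] for i in settings.source_ids],
        "target": [target_store[i] for i in settings.target_ids],
    }
    if settings.configuration == "intermediate":
        # S120: the third voice data may come from either storage unit.
        sid = settings.intermediate_id
        data["intermediate"] = source_store[sid] if sid in source_store else target_store[sid]
    # S106: choose normal learning (S108) or adaptive learning (S112).
    trainer = train_normal if settings.mode == "normal" else train_adaptive
    model = trainer(settings.configuration, data)
    model_store.append(model)    # S110: store the generated model
    return model
```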
Next, the flow of the voice quality conversion model generation process by normal learning in the voice quality conversion model generation unit 12 is described with reference to FIG. 4, which is a flowchart of that process.

The voice quality conversion model generation process by normal learning is executed in step S108. As shown in FIG. 4, the process first proceeds to step S200, where the feature extraction unit 12a extracts feature quantities from the first to third voice data acquired from the source speaker voice data storage unit 10 and the target speaker voice data storage unit 11, and then proceeds to step S202.
In step S202, the normal learning unit 12b determines whether the set configuration of the voice quality conversion model is N-to-1. If it is N-to-1 (Yes), the process proceeds to step S204; otherwise (No), it proceeds to step S208.

In step S204, the normal learning unit 12b uses the feature quantities of the N source speakers and of the one target speaker extracted in step S200 as learning data, learns a conversion rule (statistical model) that converts speech with the voice quality of the N source speakers into speech with the voice quality of the one target speaker, and proceeds to step S206.
In step S206, based on the learning result of step S204, the normal learning unit 12b generates one N-to-1 voice quality conversion model, common to the N source speakers and the one target speaker, that converts speech with the voice quality of the N source speakers into speech with the voice quality of the one target speaker. The series of processes then ends and control returns to the calling process.

If, on the other hand, the set configuration in step S202 is not N-to-1 and the process has proceeded to step S208, the normal learning unit 12b determines whether the set configuration of the voice quality conversion model is 1-to-M. If it is 1-to-M (Yes), the process proceeds to step S210; otherwise (No), it proceeds to step S214.
In step S210, the normal learning unit 12b uses the feature quantities of the one source speaker and of the M target speakers extracted in step S200 as learning data, learns a conversion rule (statistical model) that converts speech with the voice quality of the one source speaker into speech with the voice quality of the M target speakers, and proceeds to step S212.

In step S212, based on the learning result of step S210, the normal learning unit 12b generates one 1-to-M voice quality conversion model, common to the one source speaker and the M target speakers, that converts speech with the voice quality of the one source speaker into speech with the voice quality of the M target speakers. The series of processes then ends and control returns to the calling process.
If, on the other hand, the set configuration in step S208 is not 1-to-M and the process has proceeded to step S214, the normal learning unit 12b determines whether the set configuration of the voice quality conversion model uses an intermediate speaker. If it does (Yes), the process proceeds to step S216; otherwise (No), it proceeds to step S224.

In step S216, the normal learning unit 12b uses the feature quantities of the N source speakers and of the one intermediate speaker extracted in step S200 as learning data, learns a conversion rule (statistical model) that converts speech with the voice quality of the N source speakers into the speech of the one intermediate speaker, and proceeds to step S218.
In step S218, based on the learning result of step S216, the normal learning unit 12b generates one N-to-1 intermediate voice quality conversion model, common to the N source speakers and the one intermediate speaker, that converts speech with the voice quality of the N source speakers into the speech of the one intermediate speaker, and proceeds to step S220.

In step S220, the normal learning unit 12b uses the feature quantities of the one intermediate speaker and of the M target speakers extracted in step S200 as learning data, learns conversion rules (statistical models) that convert the speech of the one intermediate speaker into the speech of each of the M target speakers, and proceeds to step S222.
In step S222, based on the learning result of step S220, the normal learning unit 12b generates M 1-to-M intermediate voice quality conversion models, one for each of the M target speakers, that convert the speech of the one intermediate speaker into the speech of the corresponding target speaker. The series of processes then ends and control returns to the calling process.

If, on the other hand, the set configuration in step S214 does not use an intermediate speaker and the process has proceeded to step S224, the normal learning unit 12b uses the feature quantities of the N source speakers and of the M target speakers extracted in step S200 as learning data, learns a conversion rule (statistical model) that converts speech with the voice quality of the N source speakers into speech with the voice quality of the M target speakers, and proceeds to step S226.

In step S226, based on the learning result of step S224, the normal learning unit 12b generates an N-to-M voice quality conversion model, common to the N source speakers and the M target speakers, that converts speech with the voice quality of the N source speakers into speech with the voice quality of the M target speakers. The series of processes then ends and control returns to the calling process. The N-to-M voice quality conversion model may consist of a single common voice quality conversion model, or of a combination of two models, an N-to-1 voice quality conversion model and a 1-to-M voice quality conversion model. When the two are combined, the target speaker of the N-to-1 model and the source speaker of the 1-to-M model are the same speaker.
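The two-model form of the N-to-M model amounts to chaining the two conversions through the shared speaker. The sketch below illustrates that composition with toy callables; `compose_n_to_m` and the offset-based toy models are hypothetical and only stand in for trained conversion models.

```python
def compose_n_to_m(n_to_one, one_to_m):
    """Chain an N-to-1 model and a 1-to-M model that share a common speaker."""
    def convert(features, target_id):
        common = n_to_one(features)          # any source speaker -> common speaker
        return one_to_m(common, target_id)   # common speaker -> chosen target speaker
    return convert


# Toy stand-ins: each "model" just shifts the feature values.
target_offsets = {"A": 5.0, "B": 10.0}
toy_n_to_one = lambda feats: [x - 1.0 for x in feats]
toy_one_to_m = lambda feats, target_id: [x + target_offsets[target_id] for x in feats]

convert = compose_n_to_m(toy_n_to_one, toy_one_to_m)
converted = convert([1.0, 2.0], "B")
```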

Next, the flow of the voice quality conversion model generation process by adaptive learning in the voice quality conversion model generation unit 12 is described with reference to FIG. 5, which is a flowchart of that process.
The voice quality conversion model generation process by adaptive learning is executed in step S112. As shown in FIG. 5, the process first proceeds to step S300, where the specific speaker model generation unit 12c acquires from the voice quality conversion model storage unit 13 one initial voice quality conversion model corresponding to the model configuration designated in the setting information, and then proceeds to step S302. Specifically, an N-to-1 voice quality conversion model is acquired as the initial voice quality conversion model when the set configuration is N-to-1, a 1-to-M voice quality conversion model when it is 1-to-M, and an N-to-M voice quality conversion model when it is N-to-M. The initial voice quality conversion model is a statistical model such as a neural network or a GMM.

In step S302, the feature extraction unit 12a extracts feature quantities from the first to third voice data acquired from the source speaker voice data storage unit 10 and the target speaker voice data storage unit 11, and the process proceeds to step S304.
In step S304, the specific speaker model generation unit 12c determines from the setting information whether the configuration of the voice quality conversion model is N-to-1. If it is N-to-1 (Yes), the process proceeds to step S306; otherwise (No), it proceeds to step S322.

In step S306, the specific speaker model generation unit 12c uses the feature quantities of the N source speakers and of the one target speaker extracted in step S302 as learning data, learns conversion rules that convert the speech of each of the N source speakers into speech with the voice quality of the one target speaker, generates N specific speaker models corresponding to the N source speakers, and proceeds to step S308. Specifically, for each of the N source speakers, training starts from the acquired initial voice quality conversion model and uses the feature quantities of that speaker and of the one target speaker, producing a voice quality conversion model for that speaker (the specific speaker model).

In step S308, the adaptive learning unit 12d executes, using a predetermined adaptation method, adaptation processing that applies the N specific speaker models generated in step S306 to the initial voice quality conversion model, and proceeds to step S310. Specifically, all N specific speaker models are used to estimate the parameters of the one acquired initial voice quality conversion model by the predetermined adaptation method.

In step S310, the adaptive learning unit 12d determines whether the adaptation processing has finished. If it has finished (Yes), the process proceeds to step S312; otherwise (No), the adaptation processing continues until it finishes.
In step S312, the adaptive learning unit 12d determines from the setting information whether the iterative learning mode is set. If it is set (Yes), the process proceeds to step S314; otherwise (No), it proceeds to step S318.

In step S314, the adaptive learning unit 12d evaluates the adapted model and proceeds to step S316. For example, when cepstrum coefficients are used as the spectral feature quantities of the speech in the voice quality conversion model, cepstral distortion can be used as the measure of conversion accuracy. The cepstral distortion is obtained by computing the distortion between the cepstrum converted from the source speaker and the cepstrum of the target speaker. It is expressed by equation (1) below, and a smaller value indicates a better evaluation.

In equation (1) above, C_i^x is the cepstrum coefficient of the target speaker's speech, C_i^y is the cepstrum coefficient of the converted speech, and p is the order of the cepstrum coefficients.
In step S316, the adaptive learning unit 12d determines whether the evaluation result of step S314 satisfies a predetermined condition. If it does (Yes), the process proceeds to step S318; otherwise (No), it proceeds to step S320. For example, a threshold on the cepstral distortion is given as the predetermined condition, and adaptive learning is repeated until the distortion falls to or below that threshold.
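The evaluation and threshold test of steps S314 and S316 can be sketched as follows. The sketch assumes the widely used decibel form of cepstral distortion, (10 / ln 10) · sqrt(2 · Σ_{i=1..p} (C_i^x − C_i^y)²), which is consistent with the symbols defined above but is not confirmed to be the exact form of equation (1). Averaging over frames and the function names are also assumptions.

```python
import math


def cepstral_distortion(target, converted):
    """Cepstral distortion in dB between two coefficient vectors of order p
    (assumed common form; not necessarily identical to equation (1))."""
    if len(target) != len(converted):
        raise ValueError("coefficient vectors must have the same order")
    squared = sum((x - y) ** 2 for x, y in zip(target, converted))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared)


def satisfies_condition(target_frames, converted_frames, threshold_db):
    """S316-style check: mean distortion over frames at or below the threshold."""
    per_frame = [
        cepstral_distortion(t, c) for t, c in zip(target_frames, converted_frames)
    ]
    return sum(per_frame) / len(per_frame) <= threshold_db
```

A smaller value means the converted cepstrum is closer to the target speaker's cepstrum, matching the evaluation rule stated in the text.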

In step S318, the adaptive learning unit 12d generates one voice quality conversion model common to all source speakers and target speakers, based on the result of the adaptation processing using the predetermined adaptation method. The series of processes then ends, and control returns to the original process. Specifically, the voice quality conversion model is generated by replacing the parameter values of the initial voice quality conversion model with the parameter values estimated by the adaptation processing. Depending on the model configuration specified in the setting information, the resulting model is one to which the N source speakers, the M target speakers, or both have been adapted.

On the other hand, if the predetermined condition is not satisfied in step S316 and the process proceeds to step S320, the adaptive learning unit 12d replaces the initial voice quality conversion model with the voice quality conversion model obtained by adaptive learning, and the process returns to step S304.
If, in step S304, the configuration of the voice quality conversion model specified in the setting information is not N-to-1 and the process proceeds to step S322, the specific speaker model generation unit 12c determines whether the specified configuration is 1-to-M. If it is 1-to-M (Yes), the process proceeds to step S324; otherwise (No), the process proceeds to step S328.

In step S324, the specific speaker model generation unit 12c uses the features of the one source speaker and the features of the M target speakers, extracted in step S300, as learning data. It learns conversion rules that convert the source speaker's speech into the speech of each of the M target speakers, generates M specific speaker models corresponding to the M target speakers, and the process proceeds to step S326. Specifically, for each of the M target speakers, training starts from the acquired initial voice quality conversion model and uses that speaker's features together with the features of the one source speaker, yielding one specific speaker model per target speaker.

In step S326, the adaptive learning unit 12d uses a predetermined adaptation method to execute adaptation processing that adapts the M specific speaker models generated in step S324 to the initial voice quality conversion model, and the process proceeds to step S310. Specifically, the parameters of the single acquired initial voice quality conversion model are estimated with the predetermined adaptation method using all M specific speaker models.

On the other hand, if the configuration specified in the setting information is not 1-to-M in step S322 and the process proceeds to step S328, the adaptive learning unit 12d determines whether the specified configuration uses an intermediate speaker. If it does (Yes), the process proceeds to step S330; otherwise (No), the process proceeds to step S334.

In step S330, the specific speaker model generation unit 12c uses the features of the N source speakers and the features of the one intermediate speaker, extracted in step S300, as learning data. It learns conversion rules that convert the speech of each of the N source speakers into the speech of the one intermediate speaker, generates N specific speaker models corresponding to the N source speakers, and the process proceeds to step S332.
In step S332, the adaptive learning unit 12d uses a predetermined adaptation method to execute adaptation processing that adapts the N specific speaker models generated in step S330 to the initial voice quality conversion model, and the process proceeds to step S310.

On the other hand, if the configuration specified in the setting information does not use an intermediate speaker in step S328 and the process proceeds to step S334, the specific speaker model generation unit 12c uses the features of the N source speakers and the features of the M target speakers, extracted in step S300, as learning data. It learns conversion rules that convert the speech of each of the N source speakers into the speech of each of the M target speakers, generates N × M specific speaker models corresponding to the pairs of source and target speakers, and the process proceeds to step S336. Specifically, for every combination of one of the N source speakers and one of the M target speakers, training starts from the acquired initial voice quality conversion model and uses the features of that source speaker and that target speaker, yielding one specific speaker model per source-target pair.

In step S336, the adaptive learning unit 12d uses a predetermined adaptation method to execute adaptation processing that adapts the N × M specific speaker models generated in step S334 to the initial voice quality conversion model, and the process proceeds to step S310. Specifically, the parameters of the single acquired initial voice quality conversion model are estimated with the predetermined adaptation method using all N × M specific speaker models.
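The branching over model configurations in steps S304 through S336 determines how many specific speaker models are trained before adaptation. A minimal sketch of that bookkeeping follows; the configuration strings and the function name are hypothetical labels introduced only for illustration.

```python
def count_specific_speaker_models(configuration, n_sources, m_targets):
    """Return how many specific speaker models are trained for a configuration.

    The configuration labels are illustrative; the source describes the
    branches only in terms of steps S304, S322 and S328.
    """
    if configuration == "N-to-1":
        return n_sources              # one model per source speaker
    if configuration == "1-to-M":
        return m_targets              # step S324: one model per target speaker
    if configuration == "N-to-1-intermediate":
        return n_sources              # step S330: sources to one intermediate speaker
    if configuration == "N-to-M":
        return n_sources * m_targets  # step S334: one model per source-target pair
    raise ValueError(f"unknown configuration: {configuration!r}")
```

For example, with 10 source speakers and 20 target speakers the single-model N-to-M branch would train 200 pairwise models, which is why the two-model composition through a shared speaker can be attractive.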

Next, the operation of this embodiment will be described.
When the user operates an operation unit (not shown) to input a generation instruction for a voice quality conversion model, including setting information (the "Yes" branch of step S100), the voice quality conversion model generation device 1 first determines the configuration of the model to be generated from the setting information. Here, the model to be generated is assumed to be a 1-to-M voice quality conversion model (the "Yes" branch of step S114).

Once the model configuration has been determined, the voice quality conversion model generation unit 12 acquires the voice data of the one source speaker specified in the setting information (the voice data of utterance sentence sets a, b, and c described above) from the source speaker voice data storage unit 10, and acquires the voice data of the M target speakers specified in the setting information (the voice data of utterance sentence sets a, b, and c) from the target speaker voice data storage unit 11 (step S116). In other words, the voice data of the one source speaker and the voice data of the M target speakers form data sets with the same utterance content (hereinafter referred to as a parallel voice data set).

Once the parallel voice data set has been acquired, the device determines from the setting information whether the generation mode of the voice quality conversion model is the normal learning mode or the adaptive learning mode. Here, the adaptive learning mode is assumed to be set (the "No" branch of step S106). The adaptive learning unit 12d then executes the process of generating a 1-to-M voice quality conversion model by adaptive learning using a predetermined adaptation method (step S112).

When the generation of the voice quality conversion model by adaptive learning starts, the adaptive learning unit 12d first acquires the initial voice quality conversion model from the voice quality conversion model storage unit 13 (step S300). This initial voice quality conversion model consists of a GMM trained on a pre-training parallel voice data set of one source speaker and M target speakers.
A GMM-based voice quality conversion method (see, for example, T. Toda et al., Proc. ICASSP, Vol. 1, pp. 9-12, Philadelphia, USA, Mar. 2005) is described in detail below. The static and dynamic features X_t of the source speaker's speech (the conversion source) and the static and dynamic features Y_t of the target speaker's speech (the conversion destination), aligned frame by frame in the time domain, are written as equations (2) and (3), respectively.

X_t = [x_t^T, Δx_t^T]^T   (2)

Y_t = [y_t^T, Δy_t^T]^T   (3)

In equations (2) and (3), p is the dimensionality of the features and T denotes transposition. In the GMM, the joint probability distribution p(X_t, Y_t | λ) of the speech features X_t and Y_t is expressed by equation (4) below.

In equation (4), α_i is the weight of the i-th distribution and m is the number of mixture components. X_t and Y_t are each 2D-dimensional. N(x; μ_i, Σ_i) is the normal distribution with the mean vector μ_i and covariance matrix Σ_i of the i-th distribution, and is expressed by equation (5) below.
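Equations (4) and (5) are not reproduced in this text. From the definitions above, a mixture density of the form sum over i of α_i * N(z; μ_i, Σ_i) can be sketched as below. To stay short, the sketch assumes diagonal covariance matrices, whereas the document's Σ_i is a full matrix with cross-covariance blocks; the function names are hypothetical.

```python
import math

def diag_gaussian_pdf(x, mean, variances):
    """N(x; mean, diag(variances)) for a vector x (diagonal covariance assumed)."""
    log_density = 0.0
    for value, mu, var in zip(x, mean, variances):
        log_density -= 0.5 * (math.log(2.0 * math.pi * var) + (value - mu) ** 2 / var)
    return math.exp(log_density)

def gmm_pdf(z, weights, means, variances):
    """Mixture density: sum_i alpha_i * N(z; mu_i, Sigma_i)."""
    return sum(
        alpha * diag_gaussian_pdf(z, mu, var)
        for alpha, mu, var in zip(weights, means, variances)
    )
```

Here `z` would be the joint vector formed from X_t and Y_t, and `weights` plays the role of the α_i, which are expected to sum to one.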

Next, letting the input and output feature sequence vectors be X = [X_1^T, ..., X_T^T]^T and Y = [Y_1^T, ..., Y_T^T]^T, respectively, the likelihood for the static and dynamic feature sequence is expressed by equation (6) below.

In frame t, the conditional probability density of the output feature Y_t given the input feature X_t is modeled by the GMM shown in equations (7) and (8) below.

Here, the mean vector E_i(m_i) and the covariance matrix D(m_i) of the conditional probability distribution for mixture component m_i in equations (7) and (8) are given by equations (9) and (10) below, respectively.
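Equations (7) through (10) are not reproduced in this text. The sketch below illustrates the quantities they describe using the standard identities for a joint Gaussian, restricted to scalar x and y for brevity: the component posterior given the input, and the conditional mean and variance of the output. Whether these match the patent's equations term for term is an assumption, and the dictionary keys are hypothetical names.

```python
import math

def normal_pdf(x, mean, var):
    """Scalar normal density N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def component_posteriors(x, components):
    """P(m_i | x): weight times input-side likelihood, normalised over components."""
    scores = [c["alpha"] * normal_pdf(x, c["mu_x"], c["var_xx"]) for c in components]
    total = sum(scores)
    return [score / total for score in scores]

def conditional_mean_and_variance(x, c):
    """Mean and variance of y given x within one joint-Gaussian component."""
    gain = c["cov_yx"] / c["var_xx"]
    mean = c["mu_y"] + gain * (x - c["mu_x"])   # counterpart of the mean vector E
    variance = c["var_yy"] - gain * c["cov_yx"]  # counterpart of the covariance D
    return mean, variance
```

In the vector case the division by `var_xx` becomes multiplication by the inverse of the input covariance block, which is why the input-side and cross-covariance parameters appear among the conversion parameters.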

The static feature sequence y = [y_1^T, ..., y_T^T]^T that maximizes the likelihood of equation (6) is taken as the converted feature sequence.

y = argmax p(Y | X, λ) subject to Y = Wy   (11)

Here, W is the transformation matrix that expands a static feature sequence into a static and dynamic feature sequence.
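To make the role of W concrete, the following sketch builds such a matrix for scalar static features and checks that Y = Wy interleaves each static value with its dynamic value. The source does not give the delta window, so the sketch assumes Δy_t = 0.5 * (y_{t+1} - y_{t-1}) with out-of-range neighbours dropped; the function names are hypothetical.

```python
def build_expansion_matrix(num_frames):
    """Build W (2T x T) for scalar features: rows alternate static, delta.

    Assumed delta window: 0.5 * (y[t+1] - y[t-1]); neighbours outside the
    sequence are simply omitted.
    """
    rows = []
    for t in range(num_frames):
        static_row = [0.0] * num_frames
        static_row[t] = 1.0
        delta_row = [0.0] * num_frames
        if t > 0:
            delta_row[t - 1] = -0.5
        if t + 1 < num_frames:
            delta_row[t + 1] = 0.5
        rows.append(static_row)
        rows.append(delta_row)
    return rows

def expand(static_sequence):
    """Compute Y = W y as a flat list [y_1, dy_1, y_2, dy_2, ...]."""
    w = build_expansion_matrix(len(static_sequence))
    return [sum(coef * y for coef, y in zip(row, static_sequence)) for row in w]
```

Because Y is a linear function of y, maximizing the likelihood under the constraint Y = Wy ties neighbouring frames together through the dynamic features.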

GMM training is performed by estimating the conversion parameters (α_i, μ_i^(X), μ_i^(Y), Σ_i^(X,X), Σ_i^(Y,X)). The joint feature vector z of X_t and Y_t is defined by equation (12) below.

z = [X_t^T, Y_t^T]^T   (12)

The probability distribution p(z) of z is then expressed with the GMM as equation (13) below.

In equation (13), the covariance matrix Σ_i(z) and the mean vector μ_i(z) of the i-th distribution of z are given by equations (14) and (15) below, respectively.

Here, the conversion parameters (α_i, μ_i^(X), μ_i^(Y), Σ_i^(X,X), Σ_i^(Y,X)) are estimated using the well-known EM algorithm.
The feature X_t of equation (2) is taken to be the feature extracted from the voice data of the one source speaker, and the feature Y_t of equation (3) is taken to be the feature extracted from the voice data of the M target speakers. Estimation by the EM algorithm is then performed using these features X_t and Y_t. The single GMM common to the M target speakers, whose conversion parameter values are determined by this estimation, becomes the initial voice quality conversion model for adaptive learning (hereinafter also referred to as GMM λ^0).

Note that generating a voice quality conversion model by GMM training (parameter estimation) with the EM algorithm as described above is the same as generating a voice quality conversion model in the normal learning mode.
That is, for an N-to-1 voice quality conversion model, the features extracted from the parallel voice data set of N source speakers and one target speaker are used as learning data in the GMM training. For an N-to-M voice quality conversion model, the features extracted from the parallel voice data set of N source speakers and M target speakers are used as learning data. For an N-to-1 intermediate voice quality conversion model, the features extracted from the parallel voice data set of N source speakers and one intermediate speaker are used as learning data.

After acquiring the single voice quality conversion model common to the M target speakers, generated as described above, as the initial voice quality conversion model (GMM λ^0), the voice quality conversion model generation unit 12 next has the feature extraction unit 12a extract features from each item of voice data in the acquired parallel voice data set of the one source speaker and the M target speakers (step S302). Here, cepstrum analysis is used to extract 20-dimensional cepstrum coefficients from the source speaker's voice data and from the target speakers' voice data.

The acquired initial voice quality conversion model (GMM λ^0) is then trained using the extracted features of each target speaker's voice data as learning data, updating only the output mean vectors μ_i^(Y) of the initial model. This generates a specific speaker model (hereinafter also referred to as GMM λ^s), which is a voice quality conversion model that depends on each target speaker (step S324).

Here, the adaptive learning unit 12d uses the eigenvoice technique as the predetermined adaptation method for adaptive learning.
A voice quality conversion method using the eigenvoice technique is described below.
In the 1-to-M voice quality conversion model generated by adaptive learning with the eigenvoice technique (hereinafter also referred to as EV-GMM λ^EV), the output mean vector μ_i^(Y) in equation (15) is given by equation (16) below.

μ_i^(Y) = B_i w + b_i^(0)   (16)

Here, b_i^(0) is the bias vector for the i-th distribution, and B_i = [b_i(1), ..., b_i(J)] is the matrix composed of the J representative vectors b_i(j) for the i-th distribution. The mean vector of an output speaker (target speaker) is controlled, on the subspace spanned by B_i, by the J-dimensional weight vector w = [w(1), ..., w(J)]^T. That is, EV-GMM λ^EV has a free parameter w that depends on each output speaker (target speaker) and parameters (α_i, μ_i^(X), b_i^(0), B_i, Σ_i^(X,Y)) that are common to all output speakers.
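Equation (16) is a plain affine map from the weight vector to the output mean of one mixture component. A minimal sketch follows, with the matrix stored as a list of rows so that each row holds the J basis coefficients for one dimension of the mean; the function name and argument layout are hypothetical.

```python
def eigenvoice_mean(basis, weights, bias):
    """Compute mu_i^(Y) = B_i w + b_i^(0) for one mixture component.

    basis:   list of rows, each with J entries (columns are b_i(1)..b_i(J))
    weights: the J-dimensional speaker weight vector w
    bias:    the bias vector b_i^(0), one entry per row of basis
    """
    if len(basis) != len(bias):
        raise ValueError("basis and bias must have the same number of rows")
    if any(len(row) != len(weights) for row in basis):
        raise ValueError("every basis row needs one entry per weight")
    return [
        sum(b * w for b, w in zip(row, weights)) + b0
        for row, b0 in zip(basis, bias)
    ]
```

Changing only `weights` moves every component mean at once, which is what allows a target speaker to be described by J numbers instead of a full set of mean vectors.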

Once the specific speaker models (GMM λ^s) of the M target speakers have been generated, the adaptive learning unit 12d concatenates their output mean vectors μ_i^(Y)(s) to construct, for each target speaker, a 2DM-dimensional supervector SV^s (equation (17) below).

SV^s = [μ_1^(Y)(s)^T, ..., μ_s^(Y)(s)^T]^T   (17)

Principal component analysis is then performed on the supervectors of the M target speakers to determine the bias vector b_i^(0) and the matrix B_i formed from the eigenvectors. SV^s is thereby expressed by equation (18) below.

Here, S is the number of target speakers (here, M), and w^s consists of the J (< S << 2DM) principal components for the s-th target speaker. The b_i^(0) and B_i obtained above are applied to GMM λ^0 (step S326) to obtain EV-GMM λ^EV (the "Yes" branch of step S310).
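Equation (18) is not reproduced in this text. The sketch below illustrates the principal component analysis step under simplifying assumptions: the bias is taken as the mean supervector, only the first principal axis is kept (J = 1), and the axis is found by power iteration rather than a full eigendecomposition. All function names are hypothetical, and a practical implementation would use a linear algebra library.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (used here as the bias)."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]

def first_principal_axis(supervectors, iterations=100):
    """Return (bias, unit axis) with the axis found by power iteration."""
    bias = mean_vector(supervectors)
    centered = [[x - b for x, b in zip(v, bias)] for v in supervectors]
    dim = len(bias)
    # Start from a non-zero centered vector so the start lies in the data span.
    start = next((row for row in centered if any(x != 0.0 for x in row)), None)
    if start is None:
        return bias, [0.0] * dim  # all speakers identical: no variation to model
    axis = list(start)
    for _ in range(iterations):
        # Apply the scatter matrix without forming it: sum_s c_s * (c_s . axis).
        projections = [sum(c * a for c, a in zip(row, axis)) for row in centered]
        product = [sum(p * row[d] for p, row in zip(projections, centered)) for d in range(dim)]
        norm = math.sqrt(sum(x * x for x in product))
        axis = [x / norm for x in product]
    return bias, axis

def speaker_weight(supervector, bias, axis):
    """Project one speaker's centered supervector onto the axis."""
    return sum((x - b) * a for x, b, a in zip(supervector, bias, axis))
```

With more than one retained axis, the per-speaker weights form the vector w^s, and slicing the axes by mixture component gives the per-component matrices B_i.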

If the iterative learning mode is set in the setting information (the "Yes" branch of step S312), the cepstral distortion is calculated using evaluation voice data and the EV-GMM λ^EV obtained by adaptive learning. The calculated value is then compared with the threshold set in advance as the predetermined condition (step S314). If the calculated cepstral distortion is at or below the threshold, the condition is regarded as satisfied and the generated EV-GMM λ^EV becomes the final 1-to-M voice quality conversion model (the "Yes" branch of step S316).

If the condition is not satisfied (the "No" branch of step S316), the generated EV-GMM λ^EV replaces the current initial voice quality conversion model (GMM λ^0). Using it as the new initial model, the specific speaker models are generated again from the acquired voice data set of the target speakers, and EV-GMM λ^EV is generated again by adaptive learning with the eigenvoice technique. In this retraining of the specific speaker models, the features already extracted during the first round of learning are reused. In other words, until the predetermined condition is satisfied, EV-GMM λ^EV is regenerated by repeated adaptive learning, each time substituting the generated EV-GMM λ^EV for the initial voice quality conversion model.
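The iterative learning mode described above can be sketched as a loop over steps S324, S326, S314, S316 and S320. The callables passed in are hypothetical stand-ins for the device's units, and the `max_rounds` guard is an assumption added here so the sketch always terminates; the source simply repeats until the condition is met.

```python
def iterative_adaptive_learning(initial_model, train_speaker_models, adapt,
                                evaluate, threshold, max_rounds=10):
    """Repeat adaptive learning until the distortion is at or below threshold.

    Returns (model, distortion, rounds_used). All callables are hypothetical:
    train_speaker_models(model) -> speaker-specific models (e.g. step S324)
    adapt(model, speaker_models) -> adapted model          (e.g. step S326)
    evaluate(model)              -> cepstral distortion    (step S314)
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    model = initial_model
    distortion = None
    for round_index in range(1, max_rounds + 1):
        speaker_models = train_speaker_models(model)
        adapted = adapt(model, speaker_models)
        distortion = evaluate(adapted)
        if distortion <= threshold:       # step S316, "Yes" branch
            return adapted, distortion, round_index
        model = adapted                   # step S320: replace the initial model
    return model, distortion, max_rounds
```

In this sketch the caller would extract features once and close over them inside `train_speaker_models`, mirroring the reuse of first-round features.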

The 1-to-M voice quality conversion model that satisfies the predetermined condition is one in which appropriate weight vectors have been set for converting the speech of a specific speaker (the one source speaker) into the speech of an unspecified speaker (any of the M target speakers). In addition, because the probability density distribution on the input side is constrained to be common to all output speakers (the M target speakers), the phoneme space modeled by each distribution is unified. This yields a supervector space in which phonetic properties and speaker characteristics are separated.

The 1-to-M voice quality conversion model generated in this way is stored in the voice quality conversion model storage unit 13 (step S110).
When an N-to-1 voice quality conversion model is generated by adaptive learning, the GMM training with the EM algorithm described above takes the feature X_t of equation (2) to be the static and dynamic features extracted from the voice data of the N source speakers in the pre-training parallel voice data set of N source speakers and one target speaker, and takes the feature Y_t of equation (3) to be the static and dynamic features extracted from the voice data of the one target speaker in that data set. The GMM generated by training on these features is used as the initial voice quality conversion model (GMM λ^0).

In the N-to-1 voice quality conversion model (EV-GMM λ^EV) generated by adaptive learning with the eigenvoice technique, the input mean vector μ_i^(X) in equation (15) is given by equation (19) below.

μ_i^(X) = D_i w + d_i^(0)   (19)

Here, d_i^(0) is the bias vector for the i-th distribution, and D_i = [d_i(1), ..., d_i(K)] is the matrix composed of the K representative vectors d_i(k) for the i-th distribution. The mean vector of an input speaker (source speaker) is controlled, on the subspace spanned by D_i, by the K-dimensional weight vector w = [w(1), ..., w(K)]^T. That is, EV-GMM λ^EV has a free parameter w that depends on each input speaker (source speaker) and parameters (α_i, μ_i^(Y), d_i^(0), D_i, Σ_i^(X,Y)) that are common to all input speakers.

Training is then performed using the features X_t of the voice data of the N source speakers, updating only the input mean vectors μ_i^(X). This generates a specific speaker model (hereinafter also referred to as GMM λ^v), which is a voice quality conversion model that depends on each source speaker (step S306).
Once the specific speaker models (GMM λ^v) of the N source speakers have been generated, the adaptive learning unit 12d concatenates their input mean vectors μ_i^(X)(v) to construct, for each source speaker, a 2DM-dimensional supervector SV^v (equation (20) below).

SV^v = [μ_1^(X)(v)^T, ..., μ_v^(X)(v)^T]^T   (20)

Principal component analysis is then performed on the supervectors of the N source speakers to determine the bias vector d_i^(0) and the matrix D_i formed from the eigenvectors. SV^v is thereby expressed by equation (21) below.

Here, V is the number of source speakers (here, N), and w^v consists of the K (< V << 2DM) principal components for the v-th source speaker. The d_i^(0) and D_i obtained above are applied to GMM λ^0 (step S326) to obtain the N-to-1 voice quality conversion model (EV-GMM λ^EV) (the "Yes" branch of step S310). In this N-to-1 voice quality conversion model, because the probability density distribution on the output side is constrained to be common to all input speakers (the N source speakers), the phoneme space modeled by each distribution is unified. This yields a supervector space in which phonetic properties and speaker characteristics are separated.

The generation of the N-to-1 intermediate voice quality conversion model is the same as that of the N-to-1 voice quality conversion model above, except that the voice data of one intermediate speaker is used in place of the voice data of the one target speaker.
The N-to-M voice quality conversion model is a combination of the 1-to-M voice quality conversion model and the N-to-1 voice quality conversion model described above. That is, the N-to-M voice quality conversion model consists of two models: one N-to-1 voice quality conversion model and one 1-to-M voice quality conversion model. In this case, the one target speaker of the N-to-1 model and the one source speaker of the 1-to-M model are the same speaker. The N-to-M voice quality conversion model can also be generated as a single N-to-M voice quality conversion model common to the N source speakers and the M target speakers.
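The two-model form of the N-to-M configuration amounts to chaining two conversions through the shared speaker. A minimal sketch, with hypothetical function names and the two converters passed in as plain callables on feature sequences:

```python
def compose_n_to_m(convert_source_to_shared, convert_shared_to_target):
    """Chain an N-to-1 converter and a 1-to-M converter via the shared speaker.

    Both arguments are hypothetical callables that map a feature sequence to
    a converted feature sequence; the shared speaker is the target of the
    first and the source of the second.
    """
    def convert(features):
        shared_features = convert_source_to_shared(features)
        return convert_shared_to_target(shared_features)
    return convert
```

Each half can be adapted independently, the first to a new source speaker and the second to a new target speaker, which is the practical appeal of routing through one common speaker.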

As described above, the voice quality conversion model generation device 1 of this embodiment can use the voice data of at least one of the N source speakers and the M target speakers to generate: an N-to-1 voice quality conversion model that converts the speech of any of the N source speakers into the speech of one target speaker; a 1-to-M voice quality conversion model that converts the speech of one source speaker into the speech of any of the M target speakers; and an N-to-M voice quality conversion model that converts the speech of any of the N source speakers into the speech of any of the M target speakers. It can also generate an N-to-M intermediate voice quality conversion model that converts the speech of any of the N source speakers into the speech of one intermediate speaker and converts the speech of that intermediate speaker into the speech of any of the M target speakers.

Consequently, an N-to-1 voice quality conversion model can be adapted to the speech of an arbitrary (new) source speaker using a predetermined adaptation method, so that the speech of that arbitrary source speaker can be converted into the speech of the one target speaker. Likewise, a 1-to-M voice quality conversion model can be adapted to the speech of an arbitrary target speaker using a predetermined adaptation method, so that the speech of the one source speaker can be converted into the speech of that arbitrary target speaker. An N-to-M voice quality conversion model can be adapted to the speech of an arbitrary source speaker and an arbitrary target speaker, so that the speech of the arbitrary source speaker can be converted into the speech of the arbitrary target speaker.

Adaptation to the speech of an arbitrary speaker can be performed with a predetermined adaptation method using only a small amount of that speaker's voice data. In particular, when the voice quality conversion model is generated with the eigenvoice technique, the subspace is spanned with inter-speaker variation taken into account, so speaker characteristics can be controlled with a small number of parameters, and the model can be adapted to the speech of an arbitrary speaker more easily.

In the first embodiment, the voice quality conversion model generation unit 12 corresponds to the voice quality conversion model generation means recited in claim 1 or 2.
In the first embodiment, steps S104 to S112, S116, and S112 correspond to the voice quality conversion model generation step recited in claim 6.

[Second Embodiment]
Next, a second embodiment of the present invention will be described with reference to the drawings. FIGS. 6 to 9 show the second embodiment of the voice quality conversion system, voice quality conversion program, and voice quality conversion method according to the present invention.
First, the configuration of the voice quality conversion system according to the present invention will be described with reference to FIG. 6. FIG. 6 is a block diagram showing the configuration of the voice quality conversion system 2 according to the second embodiment of the present invention.

As shown in FIG. 6, the voice quality conversion system 2 includes: a voice quality conversion model storage unit 20 that stores voice quality conversion models; a target speaker voice data acquisition unit 21 that acquires voice data of the target speaker; a source speaker voice data acquisition unit 22 that acquires voice data of the source speaker; a voice quality conversion model adaptation unit 23 that adapts a voice quality conversion model to the voice of at least one of the source speaker and the target speaker; a voice quality conversion unit 24 that converts the source speaker's voice into the target speaker's voice using the adapted voice quality conversion model; and a speaker characteristic control parameter designation unit 25 that designates speaker characteristic control parameter values. The system has two modes: a normal mode, in which the voice quality conversion model is adapted using the voice data of the source speaker and the voice data of the target speaker before voice quality conversion is performed; and a parameter control mode, in which values are designated for the parameters of the target-speaker-side model part (the speaker characteristic control parameters) and the voice quality conversion model is adapted on the basis of the designated values before voice quality conversion is performed.

The voice quality conversion model storage unit 20 stores at least one type of model among the N-to-1, 1-to-M, and N-to-M voice quality conversion models and the N-to-M intermediate voice quality conversion model generated by the voice quality conversion model generation device 1 of the first embodiment. For each stored type, the unit may hold a single model or, depending on the model generation conditions, a plurality of models for each condition.

The target speaker voice data acquisition unit 21 acquires the target speaker's voice data used to adapt the voice quality conversion model. Specifically, it receives the target speaker's voice through a microphone and applies signal processing to the input voice signal to generate voice data. It can also acquire the target speaker's voice data for adaptation from an external device or the like via a network or the like.

The source speaker voice data acquisition unit 22 acquires both the source speaker's voice data used to adapt the voice quality conversion model and the source speaker's voice data that is to undergo voice quality conversion. Specifically, it receives the source speaker's voice through a microphone and applies signal processing to the input voice signal to generate the source speaker's voice data for adaptation or for voice quality conversion. It can also acquire the source speaker's voice data for adaptation and for voice quality conversion from an external device or the like via a network or the like.

The voice quality conversion model adaptation unit 23 acquires the voice quality conversion model to be used for voice quality conversion from the voice quality conversion model storage unit 20. Depending on the configuration of the acquired model, and on the basis of at least one of the source speaker's voice data for adaptation, the target speaker's voice data for adaptation, and the designated speaker characteristic control parameter values, it uses a predetermined adaptation method to adapt the acquired model to at least one of the source speaker's voice, the target speaker's voice, and the voice represented by the speaker characteristic control parameter values, thereby generating a post-adaptation voice quality conversion model.

The voice quality conversion unit 24 uses the source speaker's voice data for voice quality conversion acquired by the source speaker voice data acquisition unit 22 and the post-adaptation voice quality conversion model generated by the voice quality conversion model adaptation unit 23 to convert the source speaker's voice to be converted into a voice having the target speaker's voice quality. It then outputs the converted voice (hereinafter referred to as the converted voice) through an amplifier and a speaker (not shown).

The speaker characteristic control parameter designation unit 25 operates in the parameter control mode. It designates arbitrary values (speaker characteristic control parameter values) for the speaker characteristic control parameters (a weight vector) that constitute the target-speaker-side model part of the 1-to-M or N-to-M voice quality conversion model acquired from the voice quality conversion model storage unit 20 for use in voice quality conversion. Specifically, values entered by the user through an operation unit (not shown) can be designated as the speaker characteristic control parameter values.

In the present embodiment, the voice quality conversion system 2 includes a processor (not shown), a RAM (Random Access Memory), and a ROM (Read Only Memory) storing a dedicated program, and the processor executes the dedicated program to realize the functions of the units described above. Some of these units realize their functions by the dedicated program alone, while others realize them by having the dedicated program control hardware.

Next, the detailed configuration of the voice quality conversion model adaptation unit 23 will be described with reference to FIG. 7. FIG. 7 is a block diagram showing the detailed configuration of the voice quality conversion model adaptation unit 23.
As shown in FIG. 7, the voice quality conversion model adaptation unit 23 includes a model acquisition unit 23a, an adaptation target extraction unit 23b, a feature extraction unit 23c, and an adaptation unit 23d.
The model acquisition unit 23a reads the corresponding voice quality conversion model from the voice quality conversion model storage unit 20, on the basis of the information about the model to be used that is included in the voice quality conversion start instruction information, and outputs the model to the adaptation target extraction unit 23b.

The adaptation target extraction unit 23b extracts, according to the configuration of the voice quality conversion model input from the model acquisition unit 23a, the model part of that model whose parameters are to be adapted. Specifically, the voice quality conversion model has, for example, 40-dimensional parameters in total: 20 dimensions on the source speaker side and 20 dimensions on the target speaker side. Of these 40-dimensional parameters, those to be adapted are the parameter part generated from the voice data of the N source speakers or the M target speakers. Accordingly, for an N-to-1 voice quality conversion model, the source-speaker-side model part, which consists of the parameters of the N source speakers, is extracted as the adaptation target. Similarly, for a 1-to-M voice quality conversion model, the model on the side of the M target speakers is extracted, and for an N-to-M intermediate voice quality conversion model, the model on the side of the N source speakers in the N-to-1 intermediate voice model is extracted. For an N-to-M voice quality conversion model, on the other hand, both the source speaker side and the target speaker side are to be adapted, so both are extracted. The model parts extracted in this way are output to the adaptation unit 23d.
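The selection rule just described can be sketched as a small lookup. The `ModelType` enum, the `"source"` and `"target"` labels, and the helper `split_parameters` are hypothetical names introduced only for illustration, and the 20 + 20 split follows the example dimensions given above.

```python
from enum import Enum

class ModelType(Enum):
    N_TO_1 = "N-to-1"
    ONE_TO_M = "1-to-M"
    N_TO_M = "N-to-M"
    N_TO_M_INTERMEDIATE = "N-to-M intermediate"

def adaptation_targets(model_type):
    """Return which side(s) of the model are extracted for adaptation."""
    if model_type in (ModelType.N_TO_1, ModelType.N_TO_M_INTERMEDIATE):
        return ("source",)            # source-speaker-side model part only
    if model_type is ModelType.ONE_TO_M:
        return ("target",)            # target-speaker-side model part only
    if model_type is ModelType.N_TO_M:
        return ("source", "target")   # both sides are adapted
    raise ValueError(f"unknown model type: {model_type!r}")

def split_parameters(params, source_dim=20):
    """Split a joint parameter vector into source-side and target-side parts."""
    return list(params[:source_dim]), list(params[source_dim:])

source_part, target_part = split_parameters(list(range(40)))
```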

Like the feature extraction unit 12a of the voice quality conversion model generation device 1 of the first embodiment, the feature extraction unit 23c extracts features (feature parameters) from the source speaker voice data and target speaker voice data acquired from the source speaker voice data acquisition unit 22 and the target speaker voice data acquisition unit 21. It does so by analysis processing such as cepstrum analysis or linear prediction analysis, chosen according to the parameter characteristics of the voice quality conversion model used for voice quality conversion. The extracted features of the source speaker (hereinafter referred to as source speaker features) and of the target speaker (hereinafter referred to as target speaker features) are output to the adaptation unit 23d.
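To make the cepstrum analysis mentioned above concrete, the following is a minimal sketch that computes a real cepstrum for a single frame with NumPy. The function name `real_cepstrum`, the Hanning window, the 400-sample frame at 16 kHz, and the choice of the first 20 coefficients are assumptions made for illustration; the embodiment does not specify the analysis settings, and a practical system would likely use a more elaborate analysis.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=20):
    """Real cepstrum of one frame: inverse FFT of the log magnitude spectrum."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hanning(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed))
    log_magnitude = np.log(magnitude + 1e-10)  # small floor avoids log(0)
    cepstrum = np.fft.irfft(log_magnitude, n=len(frame))
    return cepstrum[:n_coeffs]

# Synthetic 25 ms frame at 16 kHz made of two sinusoids.
sample_rate = 16000
t = np.arange(400) / sample_rate
frame = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
features = real_cepstrum(frame)
```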

The adaptation unit 23d estimates, using a predetermined adaptation method, the parameters constituting the source-speaker-side model part and the target-speaker-side model part input from the adaptation target extraction unit 23b, on the basis of the source speaker features and target speaker features input from the feature extraction unit 23c. It then generates a post-adaptation voice quality conversion model from the estimated parameter values.

When parameter values have been designated by the speaker characteristic control parameter designation unit 25, the adaptation unit 23d also adapts the target-speaker-side parameters to the designated values and generates a post-adaptation voice quality conversion model. The generated post-adaptation voice quality conversion model is output to the voice quality conversion unit 24.
Next, the flow of the voice quality conversion process of the voice quality conversion system 2 in the normal mode will be described with reference to FIG. 8. FIG. 8 is a flowchart showing the voice quality conversion process of the voice quality conversion system 2 in the normal mode.

As shown in FIG. 8, the voice quality conversion process in the normal mode starts at step S400, where the voice quality conversion model adaptation unit 23 determines whether a voice quality conversion start instruction (including start instruction information) has been received from the user via an operation unit (not shown). If an instruction has been received (Yes), the process proceeds to step S402; otherwise (No), the determination is repeated until an instruction is received.

In step S402, the model acquisition unit 23a acquires the voice quality conversion model having the configuration specified by the start instruction information from the voice quality conversion model storage unit 20, outputs the acquired model to the adaptation target extraction unit 23b, and the process proceeds to step S404.
In step S404, the voice quality conversion model adaptation unit 23 determines whether the configuration of the voice quality conversion model specified by the start instruction information is an N-to-1 configuration or a configuration that uses an intermediate speaker. If it is either of these (Yes), the process proceeds to step S406; otherwise (No), the process proceeds to step S426.

In step S406, the adaptation target extraction unit 23b extracts the source-speaker-side model part from the N-to-1 voice quality conversion model or N-to-1 intermediate voice quality conversion model acquired in step S402, outputs the extracted model part to the adaptation unit 23d, and the process proceeds to step S408.
In step S408, the voice quality conversion model adaptation unit 23 requests the user to provide source speaker voice data for adaptation, and the process proceeds to step S410. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from an audio output unit (not shown).

In step S410, the source speaker voice data acquisition unit 22 determines whether the source speaker's voice data for adaptation has been acquired (or generated from input voice). If it has (Yes), the unit outputs the acquired (generated) source speaker voice data for adaptation to the feature extraction unit 23c, and the process proceeds to step S412.
In step S412, the feature extraction unit 23c extracts features from the source speaker voice data for adaptation input from the source speaker voice data acquisition unit 22, outputs the extracted source speaker features to the adaptation unit 23d, and the process proceeds to step S414.

In step S414, the adaptation unit 23d estimates the parameter values of the source-speaker-side model part using a predetermined adaptation method, on the basis of the source speaker features input from the feature extraction unit 23c, and the process proceeds to step S416.
In step S416, the adaptation unit 23d generates a post-adaptation voice quality conversion model on the basis of the parameter values estimated in step S414, outputs the generated model to the voice quality conversion unit 24, and the process proceeds to step S418.

In step S418, the voice quality conversion unit 24 requests the user to provide source speaker voice data for voice quality conversion, and the process proceeds to step S420. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from an audio output unit (not shown).
In step S420, the source speaker voice data acquisition unit 22 determines whether the source speaker's voice data for voice quality conversion has been acquired (or generated from input voice). If it has (Yes), the unit outputs the acquired (generated) voice data to the voice quality conversion unit 24 and the process proceeds to step S422; otherwise (No), the determination is repeated (or the generation process is performed) until the data is acquired (generated).

In step S422, the voice quality conversion unit 24 converts the voice data to be converted, which has the source speaker's voice quality and was input from the source speaker voice data acquisition unit 22, into voice data having the target speaker's voice quality, and the process proceeds to step S424.
In step S424, the voice quality conversion unit 24 outputs the converted voice on the basis of the voice data obtained by the voice quality conversion, and the process ends.

On the other hand, if it is determined in step S404 that the configuration of the voice quality conversion model is neither an N-to-1 configuration nor a configuration that uses an intermediate speaker, and the process proceeds to step S426, the voice quality conversion model adaptation unit 23 determines whether the configuration specified by the start instruction information is a 1-to-M configuration. If it is (Yes), the process proceeds to step S428; otherwise (No), the process proceeds to step S438.

In step S428, the adaptation target extraction unit 23b extracts the target-speaker-side model part from the 1-to-M voice quality conversion model acquired in step S402, outputs the extracted model part to the adaptation unit 23d, and the process proceeds to step S430.
In step S430, the voice quality conversion model adaptation unit 23 requests the user to provide target speaker voice data for adaptation, and the process proceeds to step S432. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from an audio output unit (not shown).

In step S432, the target speaker voice data acquisition unit 21 determines whether the target speaker's voice data for adaptation has been acquired (or generated from input voice). If it has (Yes), the unit outputs the acquired (generated) target speaker voice data for adaptation to the feature extraction unit 23c, and the process proceeds to step S434.
In step S434, the feature extraction unit 23c extracts features from the target speaker voice data for adaptation input from the target speaker voice data acquisition unit 21, outputs the extracted target speaker features to the adaptation unit 23d, and the process proceeds to step S436.

In step S436, the adaptation unit 23d estimates the parameter values of the target-speaker-side model part using a predetermined adaptation method, on the basis of the target speaker features input from the feature extraction unit 23c, and the process proceeds to step S416.
If it is determined in step S426 that the configuration of the voice quality conversion model is not 1-to-M, and the process proceeds to step S438, the adaptation target extraction unit 23b extracts both the source-speaker-side model part and the target-speaker-side model part from the N-to-M voice quality conversion model acquired in step S402, outputs the extracted model parts to the adaptation unit 23d, and the process proceeds to step S440.

In step S440, the voice quality conversion model adaptation unit 23 requests the user to provide source speaker voice data and target speaker voice data for adaptation, and the process proceeds to step S442. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from an audio output unit (not shown).
In step S442, the source speaker voice data acquisition unit 22 and the target speaker voice data acquisition unit 21 each determine whether the voice data for adaptation of the source speaker and of the target speaker has been acquired (or generated from input voice). If it has (Yes), the units output the acquired (generated) source speaker voice data and target speaker voice data for adaptation to the feature extraction unit 23c, and the process proceeds to step S444.

In step S444, the feature extraction unit 23c extracts features from the source speaker voice data for adaptation input from the source speaker voice data acquisition unit 22 and from the target speaker voice data for adaptation input from the target speaker voice data acquisition unit 21, outputs the extracted source speaker features and target speaker features to the adaptation unit 23d, and the process proceeds to step S446.
In step S446, the adaptation unit 23d estimates the parameter values of the source-speaker-side model part on the basis of the source speaker features input from the feature extraction unit 23c, and estimates the parameter values of the target-speaker-side model part on the basis of the target speaker features input from the feature extraction unit 23c, in each case using a predetermined adaptation method, and the process proceeds to step S416.
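The three branches of the normal-mode flow of FIG. 8 can be summarized as the sequence of step labels visited for each model configuration. The sketch below is only a compact restatement of the flow described above: the function name `normal_mode_steps` and the string labels for the configurations are hypothetical, and the waiting loops at steps S400 and S420 are not modeled.

```python
# Steps shared by every branch once the parameters have been estimated.
COMMON_TAIL = ["S416", "S418", "S420", "S422", "S424"]

def normal_mode_steps(model_type):
    """Return the FIG. 8 step labels visited for a model configuration.

    model_type is a hypothetical string label: "N-to-1",
    "N-to-M intermediate", "1-to-M", or "N-to-M".
    """
    steps = ["S400", "S402", "S404"]
    if model_type in ("N-to-1", "N-to-M intermediate"):
        # S404 Yes: adapt the source-speaker side only.
        steps += ["S406", "S408", "S410", "S412", "S414"]
    elif model_type == "1-to-M":
        # S404 No, S426 Yes: adapt the target-speaker side only.
        steps += ["S426", "S428", "S430", "S432", "S434", "S436"]
    elif model_type == "N-to-M":
        # S404 No, S426 No: adapt both sides.
        steps += ["S426", "S438", "S440", "S442", "S444", "S446"]
    else:
        raise ValueError(f"unknown model type: {model_type!r}")
    return steps + COMMON_TAIL
```

All three branches rejoin at step S416, so model generation, acquisition of the voice to be converted, conversion, and output are shared regardless of which side was adapted.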

Next, the flow of the voice quality conversion process of the voice quality conversion system 2 in the parameter control mode will be described with reference to FIG. 9. FIG. 9 is a flowchart showing the voice quality conversion process of the voice quality conversion system 2 in the parameter control mode.
As shown in FIG. 9, the voice quality conversion process in the parameter control mode starts at step S500, where the voice quality conversion model adaptation unit 23 determines whether a voice quality conversion start instruction has been received from the user via an operation unit (not shown). If an instruction has been received (Yes), the process proceeds to step S502; otherwise (No), the determination is repeated until an instruction is received.

In step S502, the model acquisition unit 23a acquires the voice quality conversion model having the configuration specified by the start instruction information (either a 1-to-M or an N-to-M voice quality conversion model) from the voice quality conversion model storage unit 20, and the process proceeds to step S504.
In step S504, the voice quality conversion model adaptation unit 23 determines whether the configuration of the voice quality conversion model specified by the start instruction information is a 1-to-M configuration. If it is (Yes), the process proceeds to step S506; otherwise (No), the process proceeds to step S522.

In step S506, the adaptation target extraction unit 23b extracts the target-speaker-side model part from the 1-to-M voice quality conversion model acquired in step S502, outputs the extracted model part to the adaptation unit 23d, and the process proceeds to step S508.
In step S508, the voice quality conversion model adaptation unit 23 requests the user to designate the speaker characteristic control parameter values of the target-speaker-side model to be adapted, and the process proceeds to step S510. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from an audio output unit (not shown).

In step S510, the speaker characteristic control parameter designation unit 25 determines whether speaker characteristic control parameters have been designated via an operation unit (not shown) or the like. If they have (Yes), the unit outputs the designated control parameter values to the adaptation unit 23d and the process proceeds to step S512; otherwise (No), the determination is repeated until they are designated.
In step S512, the adaptation unit 23d generates, on the basis of the speaker characteristic control parameter values input from the speaker characteristic control parameter designation unit 25, a post-adaptation voice quality conversion model adapted to the target speaker's voice represented by those parameter values, outputs the generated model to the voice quality conversion unit 24, and the process proceeds to step S514.
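In the parameter control mode no target speaker voice data is analyzed; the designated weight vector is applied to the target-speaker-side model part directly. The sketch below shows one way this could look if each mixture component of the target-side model stores a bias vector and a set of basis vectors. That storage layout, the function name `apply_speaker_weights`, and the array shapes are assumptions for illustration and are not taken from the embodiment.

```python
import numpy as np

def apply_speaker_weights(biases, bases, weights):
    """Target-side mean vectors for a user-designated weight vector.

    biases:  (n_mix, dim)           bias vector per mixture component.
    bases:   (n_mix, dim, n_basis)  basis vectors per mixture component.
    weights: (n_basis,)             designated speaker characteristic values.
    """
    biases = np.asarray(biases, dtype=float)
    bases = np.asarray(bases, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if weights.shape != (bases.shape[2],):
        raise ValueError("weight vector length must match the number of basis vectors")
    # (n_mix, dim, n_basis) @ (n_basis,) -> (n_mix, dim)
    return biases + bases @ weights

rng = np.random.default_rng(1)
n_mix, dim, n_basis = 4, 20, 3
biases = rng.normal(size=(n_mix, dim))
bases = rng.normal(size=(n_mix, dim, n_basis))
user_weights = [0.2, -0.5, 1.0]  # values a user might enter via the operation unit
target_means = apply_speaker_weights(biases, bases, user_weights)
```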

In step S514, the voice quality conversion unit 24 requests the user to provide source speaker voice data for voice quality conversion, and the process proceeds to step S516. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from an audio output unit (not shown).
In step S516, the source speaker voice data acquisition unit 22 determines whether the source speaker's voice data for voice quality conversion has been acquired (or generated from input voice). If it has (Yes), the unit outputs the acquired (generated) voice data to the voice quality conversion unit 24 and the process proceeds to step S518; otherwise (No), the determination is repeated (or the generation process is performed) until the data is acquired (generated).

In step S518, the voice quality conversion unit 24 converts the voice data to be converted, which has the source speaker's voice quality and was input from the source speaker voice data acquisition unit 22, into voice data having the target speaker's voice quality, and the process proceeds to step S520.
In step S520, the voice quality conversion unit 24 outputs the converted voice on the basis of the voice data obtained by the voice quality conversion, and the process ends.

On the other hand, if it is determined in step S504 that the acquired voice quality conversion model does not have the 1-to-M configuration and the process proceeds to step S522, the adaptation target extraction unit 23b extracts both the original-speaker-side model part and the target-speaker-side model part from the N-to-M voice quality conversion model acquired in step S502, outputs both extracted model parts to the adaptation unit 23d, and the process proceeds to step S524.

In step S524, the voice quality conversion model adaptation unit 23 requests the user to provide original speaker voice data for adaptation, and the process proceeds to step S526. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from a voice output unit (not shown).
In step S526, the original speaker voice data acquisition unit 22 determines whether the original speaker's voice data for adaptation has been acquired (or generated from input speech). If it has (Yes), the acquired (generated) adaptation data is output to the feature amount extraction unit 23c, and the process proceeds to step S528.

In step S528, the feature amount extraction unit 23c extracts feature amounts from the original speaker voice data for adaptation input from the original speaker voice data acquisition unit 22, outputs the extracted original speaker feature amounts to the adaptation unit 23d, and the process proceeds to step S530.
In step S530, the voice quality conversion model adaptation unit 23 requests the user to specify the speaker control parameter value of the target-speaker-side model for adaptation, and the process proceeds to step S532. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from a voice output unit (not shown).

In step S532, the speaker control parameter specifying unit 25 determines whether a speaker control parameter has been specified via an operation unit (not shown). If it has (Yes), the specified control parameter value is output to the adaptation unit 23d and the process proceeds to step S534; otherwise (No), the determination is repeated until a value is specified.
In step S534, the adaptation unit 23d estimates the parameter values of the original-speaker-side model part using a predetermined adaptation method, based on the original speaker feature amounts input from the feature amount extraction unit 23c, and the process proceeds to step S536.
In step S536, the adaptation unit 23d generates a post-adaptation voice quality conversion model based on the parameter values estimated in step S534 and the speaker control parameter value specified in step S532, outputs the generated model to the voice quality conversion unit 24, and the process proceeds to step S514.

Next, the operation of this embodiment will be described.
Assume that the voice quality conversion model storage unit 20 stores a 1-to-M voice quality conversion model (EV-GMM λ^(EV)) generated using the eigenvoice technique by the voice quality conversion model generation device 1 of the first embodiment. The specific operation of the voice quality conversion system 2 holding this 1-to-M model is described below.

First, the voice quality conversion process in the normal mode will be described.
When the user enters a voice quality conversion start instruction (including start instruction information) through an operation unit (not shown) ("Yes" branch of step S400), the model acquisition unit 23a of the voice quality conversion system 2 acquires, from the voice quality conversion model storage unit 20, the voice quality conversion model designated by the start instruction information (step S402). Here, it is assumed that the 1-to-M voice quality conversion model generated using the eigenvoice technique is designated ("Yes" branch of step S426).

When the 1-to-M voice quality conversion model has been acquired, the adaptation target extraction unit 23b of the voice quality conversion model adaptation unit 23 extracts the target-speaker-side model part (step S428). The target-speaker-side model part contains the output mean vectors μ_i^(Y), each composed of b_i^(0) and B_i determined from the supervectors of the M target speakers.
When the target-speaker-side model part has been extracted (or in parallel with the extraction), the target speaker voice data acquisition unit 21 acquires the target speaker's voice data for adaptation (step S430). The target speaker may utter any content; for example, it may differ from the parallel speech data set used when the model was generated (that is, it may be non-parallel). About one to several utterances are enough: compared with, say, the 50-sentence utterance set used for model generation, a very small amount of voice data (for example, one tenth or less) suffices.

When the target speaker's voice data for adaptation has been acquired ("Yes" branch of step S432), the feature amount extraction unit 23c extracts feature amounts from the acquired voice data (step S434), and the optimum weight vector for the target-speaker-side model part is estimated using the extracted feature amounts (step S436). Because the acquired 1-to-M voice quality conversion model was generated using the eigenvoice technique, an estimate of the weight vector can be obtained easily by maximum likelihood estimation with the EM algorithm, using the extracted feature amounts as training data.

The adaptation process using the EM algorithm is described below.
In voice quality conversion based on the eigenvoice technique, maximum likelihood conversion using the EM algorithm is performed in the same way as in the conventional method, based on an EV-GMM λ^(EV) whose weight vector has been set appropriately.
The method for determining the weight vector is as follows.
In unsupervised conversion model training, the voice quality conversion model is adapted to a desired output speaker (target speaker) by maximum likelihood estimation of the weight vector w of the EV-GMM λ^(EV) from the target speaker's voice data alone, based on the following equation (22).

In equation (22) above, Y_t^(tar) denotes the feature vector sequence of the output speaker (target speaker). The estimated weight vector is given by the following equations (23) to (25).
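Equations (22) to (25) are not reproduced in this text. As a rough guide only, the block below sketches one form that is consistent with the surrounding description, under the assumption that each output mean vector is modeled as μ_i^(Y) = B_i w + b_i^(0) and that frames are treated as independent. The symbols γ_{i,t}, \bar{γ}_i, \bar{Y}_i, m_i and Σ_i^(YY) are notation introduced here for illustration, and the lines are not claimed to match the patent's own numbering.

```latex
% Maximum likelihood criterion for the weight vector (assumed form)
\hat{w} = \arg\max_{w} \prod_{t=1}^{T} P\left(Y_t^{(tar)} \mid \lambda^{(EV)}\right)

% Closed-form update obtained by setting the gradient of the EM auxiliary function to zero
\hat{w} = \left[ \sum_{i} \bar{\gamma}_i \, B_i^{\top} \left(\Sigma_i^{(YY)}\right)^{-1} B_i \right]^{-1}
          \sum_{i} B_i^{\top} \left(\Sigma_i^{(YY)}\right)^{-1} \bar{Y}_i

% Sufficient statistics, with posteriors computed under the current weight vector
\bar{\gamma}_i = \sum_{t=1}^{T} \gamma_{i,t}, \qquad
\bar{Y}_i = \sum_{t=1}^{T} \gamma_{i,t} \left( Y_t^{(tar)} - b_i^{(0)} \right), \qquad
\gamma_{i,t} = P\left(m_i \mid Y_t^{(tar)}, \lambda^{(EV)}\right)
```

The update follows from maximizing \sum_t \sum_i γ_{i,t} \log N(Y_t^{(tar)}; B_i w + b_i^{(0)}, Σ_i^{(YY)}) with respect to w, which is quadratic in w and therefore has a closed-form solution.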

When the adaptation unit 23d has estimated, by maximum likelihood estimation using the EM algorithm, the weight vector for the output mean vectors μ_i^(Y) given by equations (23) to (25), it replaces the weight vector of the acquired 1-to-M voice quality conversion model (the output mean vectors μ_i^(Y)) with the estimated weight vector to generate the post-adaptation voice quality conversion model (step S416).
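The weight-vector estimation described above can be sketched as a single EM iteration. This is a minimal illustration, not the patent's implementation: it assumes diagonal covariances, an output mean of the form `B w + b0` per mixture, and a model stored as a list of dictionaries with the hypothetical keys `alpha`, `B`, `b0` and `var`.

```python
import math
from typing import Dict, List, Sequence


def log_gauss_diag(y: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (a - m) ** 2 / v
                      for a, m, v in zip(y, mean, var))


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def em_step(frames: List[Sequence[float]], mixtures: List[Dict], w: Sequence[float]) -> List[float]:
    """One EM iteration: posteriors under the current w, then a closed-form update of w."""
    J = len(w)
    A = [[0.0] * J for _ in range(J)]
    rhs = [0.0] * J
    for y in frames:
        # E-step: mixture posteriors for this frame, computed in the log domain.
        logp = []
        for m in mixtures:
            mean = [sum(b * wj for b, wj in zip(row, w)) + c
                    for row, c in zip(m["B"], m["b0"])]
            logp.append(math.log(m["alpha"]) + log_gauss_diag(y, mean, m["var"]))
        top = max(logp)
        post = [math.exp(lp - top) for lp in logp]
        z = sum(post)
        # M-step accumulators: A += g * B^T S^-1 B, rhs += g * B^T S^-1 (y - b0).
        for m, g in zip(mixtures, post):
            g /= z
            for row, c, v, yd in zip(m["B"], m["b0"], m["var"], y):
                for j in range(J):
                    rhs[j] += g * row[j] * (yd - c) / v
                    for k in range(J):
                        A[j][k] += g * row[j] * row[k] / v
    return solve(A, rhs)
```

In practice the step would be repeated from an initial weight vector (for example, all zeros) until the likelihood stops improving; because only the weight vector is free, a few utterances can already constrain it.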

Once the post-adaptation voice quality conversion model has been generated, the original speaker voice data acquisition unit 22 requests the voice data of an arbitrary original speaker to be converted (step S418) and acquires that voice data ("Yes" branch of step S420).
When the voice data of the arbitrary original speaker has been acquired, the voice quality conversion unit 24 uses the generated post-adaptation model to convert the voice data having the original speaker's voice quality into voice data having the voice quality of the target speaker to which the model was adapted (step S422). The converted speech is then output through an amplifier and a speaker (not shown) based on the converted voice data (step S424).

A generated voice quality conversion model can also be adapted using known adaptation methods such as maximum likelihood linear regression (MLLR), maximum a posteriori estimation (MAP), constrained MLLR (CMLLR), or a combination of MLLR and MAP.
Specifically, in the case of an N-to-1 voice quality conversion model, the model is adapted to the voice of an arbitrary original speaker (steps S414 and S416), and the adapted model is used to convert voice data having that original speaker's voice quality into voice data having the voice quality of the single target speaker.

In the case of an N-to-M voice quality conversion model, the N-to-1 voice quality conversion model is adapted to the voice of an arbitrary original speaker, and the 1-to-M voice quality conversion model is adapted to the voice of an arbitrary target speaker (steps S446 and S416). First, the adapted N-to-1 model is used to convert the voice data having the arbitrary original speaker's voice quality into voice data having the voice quality of the single target speaker. Next, the adapted 1-to-M model is used to convert that intermediate voice data into voice data having the voice quality of the arbitrary target speaker. In other words, voice quality conversion is performed in two stages to convert the voice data of an arbitrary original speaker into voice data having the voice quality of an arbitrary target speaker.
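The two-stage conversion can be pictured as function composition over feature frames. The sketch below is only an illustration: the two converters are hypothetical stand-ins (simple affine maps) for the adapted N-to-1 and 1-to-M models, which in the described system would be GMM-based mappings.

```python
from typing import Callable, List, Sequence

Frame = List[float]
Converter = Callable[[Sequence[float]], Frame]


def two_stage_convert(frames: List[Sequence[float]],
                      n_to_1: Converter,
                      one_to_m: Converter) -> List[Frame]:
    """Stage 1 maps each frame to the single intermediate speaker; stage 2 maps it to the target."""
    return [one_to_m(n_to_1(f)) for f in frames]


def affine(scale: float, shift: float) -> Converter:
    """Hypothetical toy converter: applies scale * v + shift to every coefficient."""
    return lambda f: [scale * v + shift for v in f]


source_to_reference = affine(2.0, 0.0)    # stand-in for the adapted N-to-1 model
reference_to_target = affine(1.0, -1.0)   # stand-in for the adapted 1-to-M model
converted = two_stage_convert([[0.5, 1.0], [1.5, 2.0]],
                              source_to_reference, reference_to_target)
```

Keeping the two stages separate means each model can be adapted independently, one from the original speaker's data and the other from the target speaker's data.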

In the case of an N-to-M voice quality conversion model, it is also possible to adapt a single N-to-M model, common to the N original speakers and the M target speakers, to both the voice of an arbitrary original speaker and the voice of an arbitrary target speaker.
As described above, the voice quality conversion system 2 can also be set to the parameter control mode for a voice quality conversion model generated using the eigenvoice technique. In this mode, the weight vector of the output mean vectors μ_i^(Y) in the 1-to-M voice quality conversion model is treated as the speaker control parameter, and the user can set its value freely, so that the voice quality conversion system 2 behaves much like an equalizer.

Specifically, after the target-speaker-side model part has been extracted from the acquired 1-to-M voice quality conversion model, the user is requested to specify the speaker control parameter (step S508). When the user manually enters a speaker control parameter value through an operation unit (not shown), the speaker control parameter specifying unit 25 outputs (specifies) the entered value to the adaptation unit 23d ("Yes" branch of step S510). The adaptation unit 23d replaces the weight vector of the output mean vectors μ_i^(Y) in the acquired 1-to-M model with the specified control parameter value to generate the post-adaptation voice quality conversion model (step S512).
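As a rough sketch of the parameter control mode, the code below applies a user-specified weight vector to every mixture's output mean, assuming the mean has the form `B w + b0`. The dictionary keys `B` and `b0`, the function names, and the allowed range of -1.0 to 1.0 are all hypothetical; the text only says that the value is set within a predetermined range.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def clamp_weights(w: Sequence[float], lo: float, hi: float) -> Vector:
    """Limit each user-specified control value to the allowed range [lo, hi]."""
    return [min(max(v, lo), hi) for v in w]


def adapted_mean(B: List[List[float]], b0: Sequence[float], w: Sequence[float]) -> Vector:
    """Output mean of one mixture, assuming mu = B w + b0 (B has one row per feature dimension)."""
    return [sum(b * wj for b, wj in zip(row, w)) + bias
            for row, bias in zip(B, b0)]


def apply_control(model: List[Dict], w_user: Sequence[float],
                  lo: float = -1.0, hi: float = 1.0) -> List[Vector]:
    """Return the output mean vectors of all mixtures for the user-specified weights."""
    w = clamp_weights(w_user, lo, hi)
    return [adapted_mean(m["B"], m["b0"], w) for m in model]


model = [
    {"B": [[1.0, 0.0], [0.0, 2.0]], "b0": [0.5, -0.5]},
    {"B": [[0.5, 0.5], [1.0, -1.0]], "b0": [0.0, 1.0]},
]
means = apply_control(model, [0.5, 3.0])  # the second value is clamped to 1.0
```

Because every mixture shares the same weight vector, moving one control value shifts all output means along one basis direction at once, which is what gives the equalizer-like behaviour.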

As described above, the voice quality conversion system 2 of this embodiment can take an N-to-1, 1-to-M, or N-to-M voice quality conversion model, generated using at least one of the voice data of N original speakers and the voice data of M target speakers, adapt it to the voice of a desired (arbitrary) original speaker or a desired (arbitrary) target speaker, and use it for voice quality conversion. As a result, voice data of any utterance content from a specific original speaker can easily be converted into voice data having a desired target speaker's voice quality, and voice data of any utterance content from a desired original speaker can easily be converted into voice data having the voice quality of a specific or desired target speaker.

In addition, an N-to-1, 1-to-M, or N-to-M voice quality conversion model generated using the eigenvoice technique, and using at least one of the voice data of N original speakers and the voice data of M target speakers, can be adapted to the voice of a desired (arbitrary) original speaker or a desired (arbitrary) target speaker and used for voice quality conversion. The model can therefore be adapted easily to a desired speaker's voice with the EM algorithm, using only a small amount of voice data (which may be non-parallel) from at least one of the desired original speaker and the desired target speaker.

Furthermore, for a 1-to-M or N-to-M voice quality conversion model generated using the eigenvoice technique, and using at least one of the voice data of N original speakers and the voice data of M target speakers, the user can set the value of the weight vector (speaker control parameter) to any value within a predetermined range. This makes it possible to fine-tune the voice quality of the converted speech, and, even when no voice data of the target speaker is available, to control the speaker control parameter value manually and obtain converted speech with the desired voice quality or one close to it. If other speech parameters such as F0 are to be controlled at the same time, the supervector space may be built from the mean vectors of all speech parameters to be controlled, not only the spectrum.

In the second embodiment, the target speaker voice data acquisition unit 21 corresponds to the target speaker voice data acquisition means of claim 3 or 4; the original speaker voice data acquisition unit 22 corresponds to the original speaker voice data acquisition means of claim 3 or 4; the voice quality conversion model adaptation unit 23 corresponds to the adaptation means of claim 4 or 5, or to the second adaptation means of claim 3; and the voice quality conversion unit 24 corresponds to the voice quality conversion means of claim 3 or 4.
Also in the second embodiment, the speaker control parameter specifying unit 25 corresponds to the speaker control parameter value specifying means of claim 3.

[Example 1]
Example 1 of the voice quality conversion system 2 of the present invention is described with reference to FIG. 10. FIG. 10 shows the mel-cepstral distortion, on the evaluation data, between the converted speech and the natural speech of the output speaker, for the case where adaptation is applied to the 1-to-M voice quality conversion model generated by the method of the present invention and the case where a voice quality conversion model is trained by the conventional method.

Voice quality conversion using the 1-to-M voice quality conversion model of the present invention was evaluated with one male speaker as the input speaker.
For the evaluation, 160 speakers in the JNAS data (80 male and 80 female) were used as output speakers for pre-training, and 10 other speakers not included in those 160 (5 male and 5 female) were used as output speakers for evaluation. The data of each pre-training speaker is one of subsets A to G, each consisting of 50 of the 503 phonetically balanced sentences, and the data of each evaluation speaker is subset J, consisting of 53 sentences. For the input speaker, subsets A to G were used for pre-training and subset J for evaluation.

When training the voice quality conversion model for an evaluation speaker, 1 to 32 of the 53 sentences were used for training and the remaining 21 sentences were used for evaluation. In the conventional method, the voice quality conversion model was trained using parallel speech data with the input speaker (that is, supervised training). In the method of the present invention, the 1-to-M voice quality conversion model based on the eigenvoice technique, the EV-GMM was pre-trained and the weight vector was then estimated using only the evaluation speaker's data (that is, unsupervised training). The number of mixtures in the EV-GMM was 512 and the number of representative vectors was 1591. For the conventional method, the number of mixtures was chosen after the fact, for each speaker and each amount of training data, so as to minimize the mel-cepstral distortion on the evaluation data. For both the 1-to-M voice quality conversion model of the present invention (EV-GMM λ^(EV)) and the conventional model, only the diagonal components of the covariance and cross-covariance matrices were used. The analysis conditions are the same as in the literature (T. Toda et al., Proc. ICASSP, Vol. 1, pp. 9-12, Philadelphia, USA, Mar. 2005).

In the graph of FIG. 10, the vertical axis shows the cepstral distortion [dB] and the horizontal axis shows the number of training utterances used for adaptation. The dotted line shows the cepstral distortion of the conversion results obtained with the conventional method (Conventional VC (Voice Conversion)), and the solid line shows that of the method of the present invention (EVC (Eigenvoice Conversion)). As FIG. 10 shows, when only a small amount of training data is available for adaptation, the cepstral distortion obtained with the EV-GMM λ^(EV) of the present invention is far lower than that of the conventional method. This is because the EV-GMM λ^(EV) exploits the information from the pre-training output speakers and therefore yields a more accurate post-adaptation voice quality conversion model.
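The text does not give the formula used for the distortion measure. A commonly used definition of mel-cepstral distortion in dB is sketched below under that assumption; the function name is hypothetical, the frames are assumed to be already time-aligned, and the caller is assumed to have chosen which coefficients to include (the 0th, energy-related coefficient is often left out).

```python
import math
from typing import List, Sequence


def mel_cepstral_distortion(converted: List[Sequence[float]],
                            target: List[Sequence[float]]) -> float:
    """Mean over frames of (10 / ln 10) * sqrt(2 * sum_d (c_d - t_d)^2), in dB."""
    if not converted or len(converted) != len(target):
        raise ValueError("frame sequences must be non-empty and of equal length")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for c, t in zip(converted, target):
        squared_error = sum((a - b) ** 2 for a, b in zip(c, t))
        total += scale * math.sqrt(2.0 * squared_error)
    return total / len(converted)
```

A lower value means the converted speech is spectrally closer to the target speaker's natural speech, which is how the two curves in the figure are compared.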

With the conventional method, the cepstral distortion decreases substantially as the amount of training data used for adaptation increases, because more mixture components become available and the joint probability density can be modeled more precisely. With the EV-GMM λ^(EV), by contrast, the distortion decreases only slightly when the training data is increased beyond two sentences.
This is because the dimensionality of the weight vector, the only free parameter of the EV-GMM λ^(EV), is fixed; conversely, it means that about two sentences of training data are enough to estimate the weight vector adequately. When several tens of sentences of training data are used, the EV-GMM λ^(EV) is less accurate than the conventional method, but it retains the major advantage that it can be trained without supervision.

[Third Embodiment]
Next, a third embodiment of the present invention will be described with reference to the drawings. FIGS. 11 to 16 show the third embodiment, a voice quality conversion client-server system according to the present invention.
First, the configuration of the voice quality conversion client-server system according to the present invention is described with reference to FIG. 11. FIG. 11 is a block diagram showing the schematic configuration of the voice quality conversion client-server system 3 according to the third embodiment of the present invention.

As shown in FIG. 11, the voice quality conversion client-server system 3 comprises a server computer 30 that transmits a voice quality conversion model in response to an acquisition request from a client computer 31, client computers 31 that perform voice quality conversion using the model received from the server computer 30, and a network 32 that connects the server computer 30 and the client computers 31 so that they can exchange data. Although FIG. 11 shows three client computers 31 connected to the network 32, the configuration is not limited to this; fewer or more than three client computers 31 may be connected.

Next, the detailed configuration of the server computer 30 is described with reference to FIG. 12, which is a block diagram showing that configuration. Components identical to those of the voice quality conversion system 2 in the second embodiment are given the same reference numerals, and their description is omitted.
As shown in FIG. 12, the server computer 30 includes a data communication unit 30a, a voice quality conversion model transmission unit 30b, and the voice quality conversion model storage unit 20.

The data communication unit 30a receives voice quality conversion model acquisition requests from the client computers 31 via the network 32 and, on command from the voice quality conversion model transmission unit 30b, transmits a voice quality conversion model via the network 32 to the client computer 31 that issued the request.
In response to an acquisition request from a client computer 31 received by the data communication unit 30a, the voice quality conversion model transmission unit 30b reads the voice quality conversion model indicated by the request from the voice quality conversion model storage unit 20 and transmits it to the client computer 31 via the data communication unit 30a.
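The server-side behaviour can be sketched as a lookup keyed by the requested model configuration. Everything named here is hypothetical: the `ModelStore` class stands in for the voice quality conversion model storage unit 20, the request and reply are plain dictionaries, and network transport is left out entirely, since the text does not specify a protocol or message format.

```python
from typing import Any, Dict, Optional


class ModelStore:
    """In-memory stand-in for the voice quality conversion model storage unit."""

    def __init__(self, models: Dict[str, Any]) -> None:
        self._models = dict(models)

    def read(self, config: str) -> Optional[Any]:
        """Return the stored model for a configuration such as '1-to-M', or None."""
        return self._models.get(config)


def handle_acquisition_request(store: ModelStore, request: Dict[str, Any]) -> Dict[str, Any]:
    """Build the reply for one acquisition request from a client."""
    config = request.get("config")
    model = store.read(config) if config is not None else None
    if model is None:
        # The source does not describe error handling; this branch is an assumption.
        return {"status": "not_found", "config": config}
    return {"status": "ok", "config": config, "model": model}


store = ModelStore({"1-to-M": {"mixtures": 512}, "N-to-1": {"mixtures": 128}})
reply = handle_acquisition_request(store, {"config": "1-to-M"})
```

On the client side, the reply's model would be handed to the model acquisition step before adaptation and conversion begin.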

The server computer 30 includes a processor (not shown), a RAM (Random Access Memory), and a ROM (Read Only Memory) storing a dedicated program, and the functions of the units described above are realized by the processor executing the dedicated program. Some of these units realize their functions by the dedicated program alone, while others do so by having the dedicated program control hardware.

Next, the detailed configuration of the client computer 31 is described with reference to FIG. 13, which is a block diagram showing that configuration. Components identical to those of the voice quality conversion system 2 in the second embodiment are given the same reference numerals, and their description is omitted.
As shown in FIG. 13, the client computer 31 includes a data communication unit 31a, a voice quality conversion model reception unit 31b, the target speaker voice data acquisition unit 21, the original speaker voice data acquisition unit 22, the voice quality conversion model adaptation unit 23, the voice quality conversion unit 24, and the speaker control parameter specifying unit 25. In other words, it is the voice quality conversion system 2 of the second embodiment with the data communication unit 31a and the voice quality conversion model reception unit 31b added and the voice quality conversion model storage unit 20 removed.

On command from the voice quality conversion model reception unit 31b, the data communication unit 31a transmits a voice quality conversion model acquisition request to the server computer 30 via the network 32, and it receives the voice quality conversion model sent from the server computer 30 via the network 32.
In response to a voice quality conversion start instruction entered by the user through an operation unit (not shown), the voice quality conversion model reception unit 31b transmits an acquisition request for a voice quality conversion model of the designated configuration to the server computer 30 via the data communication unit 31a, and it passes the voice quality conversion model received from the server computer 30 by the data communication unit 31a to the model acquisition unit 23a.

Like the voice quality conversion system 2 of the second embodiment, the client computer 31 can be set to either a normal mode or a parameter control mode as its voice quality conversion mode.
The client computer 31 also includes a processor (not shown), a RAM (Random Access Memory), and a ROM (Read Only Memory) storing a dedicated program, and the functions of the above units are realized by the processor executing the dedicated program. Some of these units realize their functions by the dedicated program alone, while others do so by having the dedicated program control hardware.

Next, the flow of the voice quality conversion model transmission process, one of the operations of the server computer 30, will be described with reference to FIG. 14, which is a flowchart of this process.
As shown in FIG. 14, the process first proceeds to step S600, where the voice quality conversion model transmission unit 30b determines whether a voice quality conversion model acquisition request has been received from a client computer 31 via the data communication unit 30a. If a request has been received (Yes), the process proceeds to step S602; otherwise (No), the determination is repeated until one is received.

In step S602, the voice quality conversion model transmission unit 30b reads the voice quality conversion model of the configuration indicated by the acquisition request received in step S600 from the voice quality conversion model storage unit 20, and the process proceeds to step S604. The acquisition request contains information on the requesting client computer (such as its IP address) and information specifying the configuration of the voice quality conversion model.

In step S604, the voice quality conversion model read in step S602 is transmitted via the data communication unit 30a to the requesting client computer 31, and the process returns to step S600.
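
The server-side loop of steps S600 to S604 can be sketched as follows. This is only an illustrative sketch, not the implementation described in the patent: the names `AcquisitionRequest`, `serve_models`, `storage`, and `send` are hypothetical, the model storage unit 20 is modeled as a plain dictionary, the network transport is abstracted as a callback, and the repeated wait in step S600 is stood in for by iterating over incoming requests.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class AcquisitionRequest:
    """Hypothetical request format: the text only says it carries client
    information (such as an IP address) and the model configuration."""
    client_address: str
    configuration: str


def serve_models(requests: Iterable[AcquisitionRequest],
                 storage: Dict[str, bytes],
                 send: Callable[[str, bytes], None]) -> None:
    """Answer each acquisition request with the stored model it names."""
    for request in requests:                    # S600: next received request
        model = storage[request.configuration]  # S602: read the requested model
        send(request.client_address, model)     # S604: send it, then back to S600
```

Keeping the transport behind a callback keeps the sketch independent of any particular network API, which the text does not specify.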
Next, the flow of the voice quality conversion process of the client computer 31 in the normal mode will be described with reference to FIG. 15, which is a flowchart of this process.

As shown in FIG. 15, the voice quality conversion process in the normal mode first proceeds to step S700, where the voice quality conversion model reception unit 31b determines whether the user has issued a voice quality conversion start instruction (including start instruction information) through an operation unit (not shown). If so (Yes), the process proceeds to step S702; otherwise (No), the determination is repeated until an instruction is issued.

In step S702, the voice quality conversion model reception unit 31b generates an acquisition request for a voice quality conversion model of the configuration designated by the start instruction information and transmits the request to the server computer 30 via the data communication unit 31a, and the process proceeds to step S704. The client computer 31 holds information such as the IP address of the server computer 30 in advance.

In step S704, the voice quality conversion model reception unit 31b determines whether the voice quality conversion model has been received from the server computer 30. If it has (Yes), the received model is passed to the model acquisition unit 23a and the process proceeds to step S706; otherwise (No), the determination is repeated until it is received.
In step S706, the voice quality conversion model adaptation unit 23 determines whether the configuration of the voice quality conversion model designated by the start instruction information is an N-to-1 configuration or a configuration using an intermediate speaker. If it is either of these (Yes), the process proceeds to step S708; otherwise (No), the process proceeds to step S728.

In step S708, the adaptation target extraction unit 23b extracts the original-speaker-side model part from the N-to-1 voice quality conversion model or N-to-1 intermediate voice quality conversion model received in step S704 and outputs it to the adaptation unit 23d, and the process proceeds to step S710.
In step S710, the voice quality conversion model adaptation unit 23 requests the user to provide original speaker voice data for adaptation, and the process proceeds to step S712. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from a voice output unit (not shown).

In step S712, the original speaker voice data acquisition unit 22 determines whether original speaker voice data for adaptation has been acquired (or generated from input speech). If so (Yes), the acquired (generated) data is output to the feature amount extraction unit 23c, and the process proceeds to step S714.
In step S714, the feature amount extraction unit 23c extracts feature amounts from the original speaker voice data for adaptation input from the original speaker voice data acquisition unit 22 and outputs the extracted original speaker feature amounts to the adaptation unit 23d, and the process proceeds to step S716.

In step S716, the adaptation unit 23d estimates the parameter values of the original-speaker-side model part from the original speaker feature amounts input from the feature amount extraction unit 23c, using a predetermined adaptation method, and the process proceeds to step S718.
In step S718, the adaptation unit 23d generates a post-adaptation model based on the parameter values estimated in step S716 and outputs the resulting post-adaptation voice quality conversion model to the voice quality conversion unit 24, and the process proceeds to step S720.

In step S720, the voice quality conversion unit 24 requests the user to provide original speaker voice data for voice quality conversion, and the process proceeds to step S722. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from a voice output unit (not shown).
In step S722, the original speaker voice data acquisition unit 22 determines whether original speaker voice data for voice quality conversion has been acquired (or generated from input speech). If so (Yes), the acquired (generated) data is output to the voice quality conversion unit 24 and the process proceeds to step S724; otherwise (No), the determination is repeated (or the generation process is carried out) until the data is acquired (generated).

In step S724, the voice quality conversion unit 24 converts the voice data of the original speaker's voice quality, input from the original speaker voice data acquisition unit 22 as the conversion target, into voice data of the target speaker's voice quality, and the process proceeds to step S726.
In step S726, the voice quality conversion unit 24 outputs converted speech based on the converted voice data, and the process ends.

On the other hand, if it is determined in step S706 that the configuration of the voice quality conversion model is neither an N-to-1 configuration nor a configuration using an intermediate speaker and the process proceeds to step S728, the voice quality conversion model adaptation unit 23 determines whether the configuration designated by the start instruction information is a 1-to-M configuration. If it is (Yes), the process proceeds to step S730; otherwise (No), the process proceeds to step S740.

In step S730, the adaptation target extraction unit 23b extracts the target-speaker-side model part from the 1-to-M voice quality conversion model received in step S704 and outputs it to the adaptation unit 23d, and the process proceeds to step S732.
In step S732, the voice quality conversion model adaptation unit 23 requests the user to provide target speaker voice data for adaptation, and the process proceeds to step S734. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from a voice output unit (not shown).

In step S734, the target speaker voice data acquisition unit 21 determines whether target speaker voice data for adaptation has been acquired (or generated from input speech). If so (Yes), the acquired (generated) data is output to the feature amount extraction unit 23c, and the process proceeds to step S736.
In step S736, the feature amount extraction unit 23c extracts feature amounts from the target speaker voice data for adaptation input from the target speaker voice data acquisition unit 21 and outputs the extracted target speaker feature amounts to the adaptation unit 23d, and the process proceeds to step S738.

In step S738, the adaptation unit 23d estimates the parameter values of the target-speaker-side model part from the target speaker feature amounts input from the feature amount extraction unit 23c, using a predetermined adaptation method, and the process proceeds to step S718.
If it is determined in step S728 that the configuration of the voice quality conversion model is not 1-to-M and the process proceeds to step S740, the adaptation target extraction unit 23b extracts both the original-speaker-side model part and the target-speaker-side model part from the N-to-M voice quality conversion model received in step S704 and outputs them to the adaptation unit 23d, and the process proceeds to step S742.

In step S742, the voice quality conversion model adaptation unit 23 requests the user to provide original speaker voice data and target speaker voice data for adaptation, and the process proceeds to step S744. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from a voice output unit (not shown).
In step S744, the original speaker voice data acquisition unit 22 and the target speaker voice data acquisition unit 21 each determine whether voice data for adaptation of the original speaker and of the target speaker, respectively, has been acquired (or generated from input speech). If so (Yes), the acquired (generated) original speaker voice data and target speaker voice data for adaptation are output to the feature amount extraction unit 23c, and the process proceeds to step S746.

In step S746, the feature amount extraction unit 23c extracts feature amounts from the original speaker voice data for adaptation input from the original speaker voice data acquisition unit 22 and from the target speaker voice data for adaptation input from the target speaker voice data acquisition unit 21, and outputs the extracted original speaker feature amounts and target speaker feature amounts to the adaptation unit 23d, and the process proceeds to step S748.
In step S748, the adaptation unit 23d uses a predetermined adaptation method to estimate the parameter values of the original-speaker-side model part from the original speaker feature amounts and those of the target-speaker-side model part from the target speaker feature amounts, both input from the feature amount extraction unit 23c, and the process proceeds to step S718.
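
The branching of steps S706, S728, and S740 in the normal mode amounts to choosing which side of the received model is adapted from voice data. A minimal sketch of that decision follows; the enum `ModelConfig` and the function `parts_to_adapt` are hypothetical names introduced only for illustration, and the adaptation itself (the "predetermined adaptation method") is deliberately left out.

```python
from enum import Enum
from typing import Tuple


class ModelConfig(Enum):
    """Hypothetical labels for the model configurations named in the text."""
    N_TO_1 = "N-to-1"
    N_TO_1_INTERMEDIATE = "N-to-1 intermediate"
    ONE_TO_M = "1-to-M"
    N_TO_M = "N-to-M"


def parts_to_adapt(config: ModelConfig) -> Tuple[str, ...]:
    """Return the model parts adapted from voice data in the normal mode."""
    if config in (ModelConfig.N_TO_1, ModelConfig.N_TO_1_INTERMEDIATE):
        return ("original",)            # S706 Yes -> S708 to S716
    if config is ModelConfig.ONE_TO_M:
        return ("target",)              # S728 Yes -> S730 to S738
    return ("original", "target")       # S728 No  -> S740 to S748
```

In all three cases the flow then rejoins at step S718, where the post-adaptation voice quality conversion model is generated.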

Next, the flow of the voice quality conversion process of the client computer 31 in the parameter control mode will be described with reference to FIG. 16, which is a flowchart of this process.
As shown in FIG. 16, the process first proceeds to step S800, where the voice quality conversion model reception unit 31b determines whether the user has issued a voice quality conversion start instruction through an operation unit (not shown). If so (Yes), the process proceeds to step S802; otherwise (No), the determination is repeated until an instruction is issued.

In step S802, the voice quality conversion model reception unit 31b generates an acquisition request for a voice quality conversion model of the configuration designated by the start instruction information and transmits the request to the server computer 30 via the data communication unit 31a, and the process proceeds to step S804. The client computer 31 holds information such as the IP address of the server computer 30 in advance.

In step S804, the voice quality conversion model reception unit 31b determines whether the voice quality conversion model has been received from the server computer 30. If it has (Yes), the received model is passed to the model acquisition unit 23a and the process proceeds to step S806; otherwise (No), the determination is repeated until it is received.
In step S806, the voice quality conversion model adaptation unit 23 determines whether the configuration of the voice quality conversion model designated by the start instruction information is a 1-to-M configuration. If it is (Yes), the process proceeds to step S808; otherwise (No), the process proceeds to step S824.

In step S808, the adaptation target extraction unit 23b extracts the target-speaker-side model part from the 1-to-M voice quality conversion model received in step S804 and outputs it to the adaptation unit 23d, and the process proceeds to step S810.
In step S810, the voice quality conversion model adaptation unit 23 requests the user to designate the speaker control parameter values of the target-speaker-side model for adaptation, and the process proceeds to step S812. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from a voice output unit (not shown).

In step S812, the speaker control parameter designation unit 25 determines whether speaker control parameters have been designated, for example through an operation unit (not shown). If so (Yes), the designated control parameter values are output to the adaptation unit 23d and the process proceeds to step S814; otherwise (No), the determination is repeated until they are designated.
In step S814, the adaptation unit 23d generates, based on the speaker control parameter values input from the speaker control parameter designation unit 25, a post-adaptation voice quality conversion model adapted to the voice of the target speaker represented by those parameter values, and outputs it to the voice quality conversion unit 24, and the process proceeds to step S816.

In step S816, the voice quality conversion unit 24 requests the user to provide original speaker voice data for voice quality conversion, and the process proceeds to step S818. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from a voice output unit (not shown).
In step S818, the original speaker voice data acquisition unit 22 determines whether original speaker voice data for voice quality conversion has been acquired (or generated from input speech). If so (Yes), the acquired (generated) data is output to the voice quality conversion unit 24 and the process proceeds to step S820; otherwise (No), the determination is repeated (or the generation process is carried out) until the data is acquired (generated).

In step S820, the voice quality conversion unit 24 converts the voice data of the original speaker's voice quality, input from the original speaker voice data acquisition unit 22 as the conversion target, into voice data of the target speaker's voice quality, and the process proceeds to step S822.
In step S822, the voice quality conversion unit 24 outputs converted speech based on the converted voice data, and the process ends.

On the other hand, if it is determined in step S806 that the configuration of the acquired voice quality conversion model is not 1-to-M and the process proceeds to step S824, the adaptation target extraction unit 23b extracts both the original-speaker-side model part and the target-speaker-side model part from the N-to-M voice quality conversion model received in step S804 and outputs them to the adaptation unit 23d, and the process proceeds to step S826.

In step S826, the voice quality conversion model adaptation unit 23 requests the user to provide original speaker voice data for adaptation, and the process proceeds to step S828. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from a voice output unit (not shown).
In step S828, the original speaker voice data acquisition unit 22 determines whether original speaker voice data for adaptation has been acquired (or generated from input speech). If so (Yes), the acquired (generated) data is output to the feature amount extraction unit 23c, and the process proceeds to step S830.

In step S830, the feature amount extraction unit 23c extracts feature amounts from the original speaker voice data for adaptation input from the original speaker voice data acquisition unit 22 and outputs the extracted original speaker feature amounts to the adaptation unit 23d, and the process proceeds to step S832.
In step S832, the voice quality conversion model adaptation unit 23 requests the user to designate the speaker control parameter values of the target-speaker-side model for adaptation, and the process proceeds to step S834. The request is made, for example, by displaying a message on a display unit (not shown) or by outputting a voice message from a voice output unit (not shown).

In step S834, the speaker control parameter designation unit 25 determines whether speaker control parameters have been designated, for example through an operation unit (not shown). If so (Yes), the designated control parameter values are output to the adaptation unit 23d and the process proceeds to step S836; otherwise (No), the determination is repeated until they are designated.
In step S836, the adaptation unit 23d estimates the parameter values of the original-speaker-side model part from the original speaker feature amounts input from the feature amount extraction unit 23c, using a predetermined adaptation method, and the process proceeds to step S838.
In step S838, the adaptation unit 23d generates a post-adaptation voice quality conversion model based on the parameter values estimated in step S836 and the speaker control parameter values designated in step S834, and outputs it to the voice quality conversion unit 24, and the process proceeds to step S816.
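
In the parameter control mode, the two branches of step S806 differ only in whether the original-speaker side is additionally adapted from voice data; the target side is always set through the designated speaker control parameter values. The sketch below summarizes this under stated assumptions: `adaptation_plan` and its string labels are hypothetical, and any configuration other than 1-to-M is treated as N-to-M, as the flowchart of FIG. 16 does.

```python
from typing import Dict


def adaptation_plan(configuration: str) -> Dict[str, str]:
    """Map each adapted model part to how its parameters are obtained
    in the parameter control mode (hypothetical labels)."""
    if configuration == "1-to-M":
        # S806 Yes -> S808 to S814: only the target side is adapted,
        # using the user-designated speaker control parameter values.
        return {"target": "designated control parameters"}
    # S806 No -> S824 to S838: the original side is estimated from
    # adaptation voice data and the target side from control parameters.
    return {
        "original": "estimated from voice data",
        "target": "designated control parameters",
    }
```

Both branches then continue at step S816 with the same conversion steps.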

Next, the operation of the present embodiment will be described.
First, the operation of the voice quality conversion process in the normal mode will be described.
When a voice quality conversion start instruction (including start instruction information) is input through the user's operation of an operation unit (not shown) (the "Yes" branch of step S700), the voice quality conversion model reception unit 31b of the client computer 31 generates, based on the start instruction information and the information on the server computer 30, an acquisition request for a voice quality conversion model of the configuration designated by the start instruction information. The generated acquisition request is then transmitted to the server computer 30 via the data communication unit 31a (step S702).

Meanwhile, when the voice quality conversion model transmission unit 30b of the server computer 30 receives the voice quality conversion model acquisition request from the client computer 31 via the data communication unit 30a (the "Yes" branch of step S600), it reads the voice quality conversion model corresponding to the received request from the voice quality conversion model storage unit 20 (step S602). The read voice quality conversion model is then transmitted via the data communication unit 30a to the requesting client computer 31 (step S604).

When the client computer 31 receives the voice quality conversion model from the server computer 30 (the "Yes" branch of step S704), it passes the received model to the model acquisition unit 23a. The model acquisition unit 23a outputs the voice quality conversion model from the voice quality conversion model reception unit 31b to the adaptation target extraction unit 23b.
This series of operations is the same in the parameter control mode.
The subsequent adaptation of the voice quality conversion model and the voice quality conversion itself are the same as in the voice quality conversion system 2 of the second embodiment, so their description is omitted.

As described above, with the voice quality conversion client-server system 3 of the present embodiment, an N-to-1, 1-to-M, or N-to-M voice quality conversion model generated using at least one of the voice data of N original speakers and the voice data of M target speakers can be adapted to the voice of a desired (arbitrary) original speaker or a desired (arbitrary) target speaker and used for voice quality conversion. This makes it possible to easily convert voice data of arbitrary utterance content from a specific original speaker into voice data with the voice quality of a desired target speaker, or to easily convert voice data of arbitrary utterance content from a desired original speaker into voice data with the voice quality of a specific or desired target speaker.

In addition, the voice quality conversion model can be held by the server computer 30 and transmitted to the client computer 31 in response to an acquisition request from that client computer 31. The client computer 31 therefore does not need to hold the voice quality conversion model, so no memory has to be reserved for it and memory can be used more efficiently.
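The request-and-transmit exchange described above can be sketched as follows. This is a minimal in-process illustration under stated assumptions: `ModelServer`, `ModelClient`, the `model_id` key, and the use of `pickle` for serialization are hypothetical names and choices, and a direct method call stands in for the network link between the server computer 30 and the client computer 31.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class ModelServer:
    """Stands in for server computer 30: holds models, returns them on request."""
    models: dict = field(default_factory=dict)

    def store(self, model_id, model):
        # Storing under an existing id replaces the model (e.g. an upgrade).
        self.models[model_id] = model

    def handle_request(self, model_id):
        # Serialize the stored model for transmission; None if it is unknown.
        model = self.models.get(model_id)
        return None if model is None else pickle.dumps(model)


@dataclass
class ModelClient:
    """Stands in for client computer 31: keeps no model of its own."""
    server: ModelServer

    def acquire_model(self, model_id):
        payload = self.server.handle_request(model_id)
        if payload is None:
            raise LookupError(f"no voice quality conversion model: {model_id}")
        return pickle.loads(payload)
```

Because the client fetches the model on every request in this sketch, a model replaced on the server is what the client receives next, which mirrors the upgrade benefit noted in the text.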

Furthermore, because the voice quality conversion model can be held and managed on the server computer 30 side, the model can easily be upgraded, and the upgraded voice quality conversion model can easily be provided to the client computer 31.
In the third embodiment, the voice quality conversion model transmission processing performed by the data communication unit 30a and the voice quality conversion model transmission unit 30b corresponds to the voice quality conversion model transmission means according to claim 8, and the voice quality conversion model storage unit 20 corresponds to the voice quality conversion model storage means according to claim 8.

Also in the third embodiment, the original speaker voice data acquisition unit 22 corresponds to the original speaker voice data acquisition means according to claim 8, the voice quality conversion model adaptation unit 23 corresponds to the adapting means according to claim 8, and the voice quality conversion unit 24 corresponds to the voice quality conversion means according to claim 8.

In the third embodiment, the client computer 31 acquires the voice quality conversion model from the server computer 30 and performs both the adaptation processing of the acquired model and the voice quality conversion processing using the adapted model. The configuration is not limited to this, however. Other configurations are possible, such as one in which the server computer 30 performs the adaptation processing of the voice quality conversion model, or one in which it performs the voice quality conversion processing in addition to the adaptation processing.

For example, when the server computer 30 performs both the adaptation processing and the voice quality conversion processing, the client computer 31 only needs to acquire the voice data of the target speaker and the voice data of the original speaker with the target speaker voice data acquisition unit 21 and the original speaker voice data acquisition unit 22 and transmit them to the server computer 30. It no longer needs to acquire the voice quality conversion model from the server computer 30. Because the client computer 31 then performs no processing that uses the voice quality conversion model, voice quality conversion is possible even when the client computer 31 has limited memory resources, as is the case with a portable device.

Block diagram showing the configuration of the voice quality conversion model generation device 1 according to the first embodiment of the present invention.
Block diagram showing the detailed configuration of the voice quality conversion model generation unit 12.
Flowchart showing the voice quality conversion model generation processing in the voice quality conversion model generation device 1.
Flowchart showing the voice quality conversion model generation processing by normal learning in the voice quality conversion model generation unit 12.
Flowchart showing the voice quality conversion model generation processing by adaptive learning in the voice quality conversion model generation unit 12.
Block diagram showing the configuration of the voice quality conversion system 2 according to the second embodiment of the present invention.
Block diagram showing the detailed configuration of the voice quality conversion model adaptation unit 23.
Flowchart showing the voice quality conversion processing in the normal mode in the voice quality conversion system 2.
Flowchart showing the voice quality conversion processing in the parameter control mode in the voice quality conversion system 2.
Diagram showing the mel-cepstral distortion, on evaluation data, between the converted speech and the natural speech of the output speaker for the method of the present invention and for the conventional method.
Block diagram showing the schematic configuration of the voice quality conversion client-server system 3 according to the third embodiment of the present invention.
Block diagram showing the detailed configuration of the server computer 30.
Block diagram showing the detailed configuration of the client computer 31.
Flowchart showing the voice quality conversion model transmission processing of the server computer 30.
Flowchart showing the voice quality conversion processing in the normal mode in the client computer 31.
Flowchart showing the voice quality conversion processing in the parameter control mode in the client computer 31.

Explanation of symbols

1 Voice quality conversion model generation device
2 Voice quality conversion system
3 Voice quality conversion client-server system
10 Original speaker voice data storage unit
11 Target speaker voice data storage unit
12 Voice quality conversion model generation unit
13 Voice quality conversion model storage unit
12a, 23c Feature extraction unit
12b Normal learning unit
12c Specific speaker model generation unit
12d Adaptive learning unit
20 Voice quality conversion model storage unit
21 Target speaker voice data acquisition unit
22 Original speaker voice data acquisition unit
23 Voice quality conversion model adaptation unit
24 Voice quality conversion unit
25 Speaker control parameter designation unit
23a Model acquisition unit
23b Adaptation target extraction unit
23d Adaptation unit
30a, 31a Data communication unit
30b Voice quality conversion model transmission unit
31b Voice quality conversion model reception unit

Claims

A voice quality conversion model generation device for generating a statistical model for voice quality conversion, comprising:
initial voice quality conversion model generating means for generating an initial voice quality conversion model, which is a statistical model common to N original speakers and M target speakers, using first voice data, which is voice data of the predetermined N (N is an integer of 1 or more) original speakers, and second voice data, which is voice data of the predetermined M (M is an integer of 1 or more) target speakers having the same utterance content as the first voice data;
specific speaker model creating means for creating, from the voice data of the N original speakers and the voice data of the M target speakers, specific speaker models that each correspond one-to-one to one of the original speakers and one of the target speakers and that convert the voice quality of each of the N original speakers into the voice quality of each of the M target speakers; and
voice quality conversion model generating means for applying, to the parameters constituting the initial voice quality conversion model generated by the initial voice quality conversion model generating means, the result obtained by using the specific speaker models created by the specific speaker model creating means in the eigenvoice technique, thereby adding to the parameters constituting the initial voice quality conversion model a speaker control parameter, which is a weight vector for converting any of the N original speakers to any of the M target speakers, and generating a voice quality conversion model having the speaker control parameter,
wherein at least one of N and M is an integer of 2 or more.

A voice quality conversion model generation device for generating a statistical model for voice quality conversion, comprising:
voice quality conversion model generating means for performing learning using, as learning data, first voice data, which is voice data of predetermined N (N is an integer of 2 or more) original speakers, and third voice data, which is voice data of one predetermined intermediate speaker having the same utterance content as the first voice data, to generate a first voice quality conversion model, which is one statistical model common to the original speakers and the one intermediate speaker and which converts voice of the voice quality of the N original speakers into voice of the one intermediate speaker, and for performing learning using, as learning data, the third voice data of the one intermediate speaker and second voice data, which is voice data of predetermined M (M is an integer of 2 or more) target speakers having the same utterance content as the third voice data, to generate M second voice quality conversion models, which are statistical models each corresponding one-to-one to the one intermediate speaker and one of the target speakers and which convert the voice of the one intermediate speaker into voice of the voice quality of that target speaker.

A voice quality conversion system for converting voice of an arbitrary original speaker into voice of another voice quality, comprising:
voice quality conversion model storage means for storing a voice quality conversion model having a speaker control parameter generated by the voice quality conversion model generation device according to claim 1;
original speaker voice data acquisition means for acquiring voice data of the arbitrary original speaker;
target speaker voice data acquisition means for acquiring voice data of an arbitrary target speaker;
speaker control parameter value specifying means for specifying a parameter value of the speaker control parameter related to the voice of the voice quality of the M target speakers in the voice quality conversion model;
second adapting means for creating a new voice quality conversion model by adapting, using a predetermined adaptation method, the voice quality conversion model to the voice of the voice quality of the arbitrary original speaker and to the specified parameter value, based on the voice data acquired by the original speaker voice data acquisition means, the parameter value specified by the speaker control parameter value specifying means, and the voice quality conversion model stored in the voice quality conversion model storage means; and
voice quality conversion means for converting the voice data of the arbitrary original speaker into voice data of another voice quality, based on the new voice quality conversion model created by the second adapting means and the voice data acquired by the original speaker voice data acquisition means.

A voice quality conversion system that converts voice of arbitrary utterance content of an arbitrary original speaker into voice of the voice quality of predetermined M (M is an integer of 2 or more) target speakers, comprising:
voice quality conversion model storage means for storing a first voice quality conversion model, generated by the voice quality conversion model generation device according to claim 2, for converting voice of the voice quality of predetermined N (N is an integer of 2 or more) original speakers into voice of one intermediate speaker, and a second voice quality conversion model for converting the voice of the one intermediate speaker into voice of the voice quality of the M target speakers;
original speaker voice data acquisition means for acquiring voice data of the arbitrary original speaker;
adapting means for adapting, using a predetermined adaptation method, the first voice quality conversion model to the voice of the voice quality of the arbitrary original speaker, based on the voice data acquired by the original speaker voice data acquisition means and the first voice quality conversion model stored in the voice quality conversion model storage means; and
voice quality conversion means for converting, using the adapted first voice quality conversion model, the voice data of the arbitrary utterance content of the arbitrary original speaker acquired by the original speaker voice data acquisition means into voice data of the voice quality of the intermediate speaker, and for converting, using the second voice quality conversion model stored in the voice quality conversion model storage means, the voice data converted into the voice quality of the intermediate speaker into voice data of the voice quality of the arbitrary target speaker.

The voice quality conversion system according to claim 4, wherein the adapting means estimates, using the predetermined adaptation method and with respect to the parameters related to the voice of the voice quality of the N original speakers that constitute the first voice quality conversion model, an adaptation parameter value for adapting the first voice quality conversion model to the voice of the voice quality of the arbitrary original speaker, and converts the parameter value related to the voice of the voice quality of the N original speakers in the first voice quality conversion model to the estimated adaptation parameter value.

A voice quality conversion model generation program for generating a statistical model for voice quality conversion, the program causing a computer to execute:
an initial voice quality conversion model generation step of generating an initial voice quality conversion model, which is a statistical model common to N original speakers and M target speakers, using first voice data, which is voice data of the predetermined N (N is an integer of 1 or more) original speakers, and second voice data, which is voice data of the predetermined M (M is an integer of 1 or more) target speakers having the same utterance content as the first voice data;
a specific speaker model creating step of creating, from the voice data of the N original speakers and the voice data of the M target speakers, specific speaker models that each correspond one-to-one to one of the original speakers and one of the target speakers and that convert the voice quality of each of the N original speakers into the voice quality of each of the M target speakers; and
a voice quality conversion model generation step of applying, to the parameters constituting the initial voice quality conversion model generated in the initial voice quality conversion model generation step, the result obtained by using the specific speaker models created in the specific speaker model creating step in the eigenvoice technique, thereby adding to the parameters constituting the initial voice quality conversion model a speaker control parameter, which is a weight vector for converting any of the N original speakers to any of the M target speakers, and generating a voice quality conversion model having the speaker control parameter,
wherein at least one of N and M is an integer of 2 or more.

A voice quality conversion model generation method for generating a statistical model for voice quality conversion, comprising:
an initial voice quality conversion model generation step of generating an initial voice quality conversion model, which is a statistical model common to N original speakers and M target speakers, using first voice data, which is voice data of the predetermined N (N is an integer of 1 or more) original speakers, and second voice data, which is voice data of the predetermined M (M is an integer of 1 or more) target speakers having the same utterance content as the first voice data;
a specific speaker model creating step of creating, from the voice data of the N original speakers and the voice data of the M target speakers, specific speaker models that each correspond one-to-one to one of the original speakers and one of the target speakers and that convert the voice quality of each of the N original speakers into the voice quality of each of the M target speakers; and
a voice quality conversion model generation step of applying, to the parameters constituting the initial voice quality conversion model generated in the initial voice quality conversion model generation step, the result obtained by using the specific speaker models created in the specific speaker model creating step in the eigenvoice technique, thereby adding to the parameters constituting the initial voice quality conversion model a speaker control parameter, which is a weight vector for converting any of the N original speakers to any of the M target speakers, and generating a voice quality conversion model having the speaker control parameter,
wherein at least one of N and M is an integer of 2 or more.

A voice quality conversion client-server system in which a client computer and a server computer are connected via a network and which converts voice of an arbitrary original speaker into voice of another voice quality, wherein
the server computer comprises:
voice quality conversion model storage means for storing the voice quality conversion model generated by the voice quality conversion model generation device according to claim 1; and
voice quality conversion model transmission means for transmitting the voice quality conversion model stored in the voice quality conversion model storage means to the client computer, and
the client computer comprises:
original speaker voice data acquisition means for acquiring voice data of the arbitrary original speaker;
speaker control parameter value specifying means for specifying a parameter value of the speaker control parameter in the voice quality conversion model;
voice quality conversion model receiving means for receiving the voice quality conversion model from the server computer;
adapting means for creating a new voice quality conversion model by adapting, using a predetermined adaptation method, the voice quality conversion model to the voice of the voice quality of the arbitrary original speaker and to the specified parameter value, based on the voice data acquired by the original speaker voice data acquisition means, the parameter value specified by the speaker control parameter value specifying means, and the voice quality conversion model received by the voice quality conversion model receiving means; and
voice quality conversion means for converting the voice data of the arbitrary original speaker into voice data of another voice quality, based on the new voice quality conversion model created by the adapting means and the voice data acquired by the original speaker voice data acquisition means.