JP2017003622A

JP2017003622A - Vocal quality conversion method and vocal quality conversion device

Info

Publication number: JP2017003622A
Application number: JP2015114238A
Authority: JP
Inventors: 亘中鹿; Toru Nakashika; 哲也滝口; Tetsuya Takiguchi; 康雄有木; Yasuo Ariki
Original assignee: Kobe University NUC
Current assignee: Kobe University NUC
Priority date: 2015-06-04
Filing date: 2015-06-04
Publication date: 2017-01-05
Anticipated expiration: 2035-06-04
Also published as: JP6543820B2

Abstract

PROBLEM TO BE SOLVED: To provide a vocal quality conversion method with which it is possible to convert the vocal quality of an input speaker to the vocal quality of an arbitrary speaker without using parallel data. SOLUTION: The vocal quality conversion method includes: a step S1 for estimating a weight not dependent on a speaker as an independent weight among coupling weights between a visible element layer and a hidden element layer constituting a Restricted Boltzmann Machine (RBM) that is a probability model; a step S2 for estimating, while the independent weight is fixed, each of a weight dependent on an input speaker and a weight dependent on a target speaker as a dependent weight among the coupling weights; a step S3 for estimating the hidden element layer on the basis of the voice of the input speaker inputted to the visible element layer and the dependent weight of the input speaker; and a step S4 for estimating the voice of the target speaker outputted as the visible element layer on the basis of the hidden element layer and the dependent weight of the target speaker. SELECTED DRAWING: Figure 4

Description

The present invention relates to a method and an apparatus for converting the voice quality of an input speaker's speech into the voice quality of a person other than the input speaker.

In recent years, voice quality conversion has been actively studied in the field of speech signal processing. Voice quality conversion converts only the speaker-identity information of an input speaker's speech into that of an output speaker (that is, a target speaker) while preserving the phonological information of the input speech. The interest is motivated by its possible applications: improving speech recognition accuracy in noisy environments or for emotional speech, assisting speech for people with speech impairments, and various other tasks.

Approaches based on statistical methods have been widely studied for voice quality conversion. Among them, methods using a GMM (Gaussian mixture model) are the most widely used, and various improvements have been made. As an approach other than GMM, a voice quality conversion method using NMF (non-negative matrix factorization) has recently been proposed (see Non-Patent Document 1) and is attracting attention as a method with little over-smoothing.

A voice quality conversion technique using an RBM (restricted Boltzmann machine), which consists of two layers, a visible layer and a hidden layer, has also been disclosed (see Non-Patent Document 2).

Non-Patent Document 1: R. Takashima, T. Takiguchi and Y. Ariki: "Exemplar-based voice conversion in noisy environment", SLT, pp. 313-317 (2012)
Non-Patent Document 2: Toru Nakashika, Tetsuya Takiguchi and Yasuo Ariki: "Voice conversion using speaker-dependent Recurrent Temporal Restricted Boltzmann Machine", Proceedings of the Acoustical Society of Japan (September 2012)

However, the techniques described in Non-Patent Documents 1 and 2 require parallel data between an input speaker and a specific output speaker (target speaker).

That is, both techniques require parallel data (pairs of utterances with the same content spoken by the input speaker and the output speaker) when training the model, and creating parallel data is subject to various restrictions. First, because the utterances of the input speaker and the output speaker must have the same content, there is little freedom in choosing (or creating) the training data set. Second, because the two utterances must be synchronized frame by frame, they are aligned using dynamic programming or the like, but there is no guarantee that the frames are perfectly synchronized. Consequently, the speech may be distorted when it is stretched or compressed during alignment. Furthermore, an existing conversion model cannot be used for a speaker pair on which it has not been trained. In other words, the input cannot be converted into the voice quality of an arbitrary speaker.

The present invention has been made in view of these problems, and its object is to provide a voice quality conversion method and apparatus capable of converting the voice quality of an input speaker into the voice quality of an arbitrary speaker without using parallel data.

To achieve the above object, a voice quality conversion method according to the present invention converts the voice quality of an input speaker's speech into the voice quality of a target speaker, and includes: a first step of estimating, as an independent weight, a speaker-independent weight among the coupling weights between the two layers, a visible element layer and a hidden element layer, that constitute an RBM (restricted Boltzmann machine), which is a probabilistic model; a second step of estimating, with the independent weight fixed in the RBM, a weight that depends on the input speaker and a weight that depends on the target speaker among the coupling weights, each as a dependent weight; a third step of estimating the hidden element layer based on the input speaker's speech input to the visible element layer and the dependent weight of the input speaker; and a fourth step of estimating the target speaker's speech, output as the visible element layer, based on the hidden element layer and the dependent weight of the target speaker. For example, the coupling weight is expressed by an operation using a vector whose elements indicate 0 or 1 for each of S speakers (S is an integer of 2 or more), the independent weight, and the dependent weight of each of the S speakers.

In this way, an extended RBM is used: an adaptive RBM in which the coupling weights between the visible element layer and the hidden element layer constituting the RBM are separated into speaker-independent weights (independent weights) and speaker-dependent weights (dependent weights). Therefore, with the independent weight fixed, speaker identity can be controlled easily through the dependent weights. As a result, the voice quality of the input speaker can be converted without using parallel data. In addition, even when only a small amount of speech is available from the input speaker and the target speaker, their dependent weights can be estimated appropriately in the second step. Because the dependent weight of any target speaker can thus be estimated easily, reusing the independent weight estimated in the first step allows the voice quality of the input speaker to be converted into that of an arbitrary speaker.

For example, in the first step, the independent weight may be estimated from speech with mutually different utterance contents uttered by a plurality of speakers, and in the second step, the dependent weight of the input speaker and the dependent weight of the target speaker may be estimated from speech with mutually different utterance contents uttered by the input speaker and the target speaker.

Because no parallel data is used in the first and second steps, voice quality can be converted appropriately without being constrained by utterance content.

These general or specific aspects may be implemented as a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or as any combination of systems, methods, integrated circuits, computer programs, and recording media.

The voice quality conversion method of the present invention can convert the voice quality of an input speaker into the voice quality of an arbitrary speaker without using parallel data.

FIG. 1 is a diagram showing the graph structure of an RBM.
FIG. 2 is a diagram showing the graph structure of the adaptive RBM in the embodiment.
FIG. 3 is a block diagram showing the configuration of the voice quality conversion apparatus in the embodiment.
FIG. 4 is a flowchart showing the processing operation of the voice quality conversion apparatus in the embodiment.
FIG. 5 is a diagram showing the results of voice quality conversion by the voice quality conversion method in the embodiment.
FIG. 6 is a diagram showing parameters actually estimated by the voice quality conversion method in the embodiment.
FIG. 7 is a diagram showing an example in which a female speaker's speech is converted into a male speaker's speech by the voice quality conversion method in the embodiment.

Embodiments are described in detail below with reference to the drawings.

Each of the embodiments described below shows a general or specific example. The numerical values, shapes, materials, constituent elements, arrangement and connection of the constituent elements, steps, order of steps, and the like shown in the following embodiments are merely examples and are not intended to limit the present invention. Among the constituent elements of the following embodiments, those not recited in the independent claims, which represent the broadest concept, are described as optional constituent elements.

(Overview)
First, an outline of the voice quality conversion method according to the present embodiment is given.

The voice quality conversion method of the present embodiment uses an adaptive RBM (adaptive restricted Boltzmann machine; ARBM), a model that extends the RBM (restricted Boltzmann machine), which is one type of probabilistic model. It requires neither parallel data between the input speaker and the output speaker nor even parallel data among reference speakers.

The adaptive RBM is a probabilistic model that extracts latent features from speech data in which multiple speakers are mixed, while separating speaker-independent information from speaker-dependent information. The model is represented by an undirected graph consisting of a visible element layer and a hidden element layer; there are no connections between elements in the same layer, and connections exist only between elements in different layers, with a strength (weight) that depends on the speaker. This weight is expressed by a speaker-dependent term and a speaker-independent term, and both are estimated simultaneously by unsupervised learning from speech data containing multiple speakers (the data need not be parallel). As a result, latent features (hidden elements) are obtained with the weights separated into speaker-dependent and speaker-independent parts. To convert to the voice quality of an arbitrary speaker, the speaker-dependent and speaker-independent weights are first estimated simultaneously, as described above, from the data of multiple speakers (reference speakers). Next, using a (small) amount of data from the speaker whose speech is to be converted (input speaker), a new speaker-dependent weight is estimated with the speaker-independent weight held fixed. The speaker-dependent weight of the destination speaker (output speaker) is estimated in the same way. Then, latent features are estimated from the speech to be converted (the input speaker's speech) using the input speaker's speaker-dependent weight and the speaker-independent weight, and the converted speech is obtained by inversely estimating the acoustic feature vector using the output speaker's speaker-dependent weight and the speaker-independent weight.

Many conventional voice quality conversion methods, such as those based on GMM and NMF, rely on linear transformation, which limits conversion accuracy. Because the shape of the human vocal tract is nonlinear, nonlinear modeling is considered more appropriate than linear modeling for capturing the voice quality characteristics contained in a speech signal accurately. The voice quality conversion method of the present embodiment also uses a conversion formula based on a nonlinear function and can therefore convert voice quality with high accuracy.

(RBM)
Next, the RBM underlying the adaptive RBM of the present embodiment is described.

FIG. 1 is a diagram showing the graph structure of an RBM.

The RBM is a two-layer network with a special structure; as shown in FIG. 1, it is an undirected graphical model that represents the probability distribution of random variables in a visible layer (visible element layer) and a hidden layer (hidden element layer). The RBM was originally proposed as a model for binary input data, and a model accepting continuous-valued input (Gaussian-Bernoulli RBM; GBRBM) was devised later. However, because training of that model becomes unstable owing to the variance term, an improved version of the GBRBM (Improved GBRBM; ImpGBRBM) was proposed. In the ImpGBRBM, the joint probability $p(\mathbf{v}, \mathbf{h})$ of continuous-valued visible elements $\mathbf{v} = [v_1, \ldots, v_I]^T$ and binary hidden elements $\mathbf{h} = [h_1, \ldots, h_J]^T$ is expressed as follows.

$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})} \tag{1}$$

$$E(\mathbf{v}, \mathbf{h}) = \frac{1}{2} \left\| \frac{\mathbf{v} - \mathbf{b}}{\boldsymbol{\sigma}} \right\|^2 - \mathbf{c}^T \mathbf{h} - \left( \frac{\mathbf{v}}{\boldsymbol{\sigma}^2} \right)^T \mathbf{W} \mathbf{h} \tag{2}$$

$$Z = \int \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} \, d\mathbf{v} \tag{3}$$

Here, $\|\cdot\|^2$ denotes the squared L2 norm, and the fraction bar denotes element-wise division.

$\mathbf{W}$, $\boldsymbol{\sigma}$, $\mathbf{b}$, and $\mathbf{c}$ denote the weight matrix between the visible layer and the hidden layer, the standard deviations of the visible elements, the biases of the visible elements, and the biases of the hidden elements, respectively; all of them are parameters to be estimated.

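In an RBM there are no connections among the visible elements or among the hidden elements. That is, the visible elements are conditionally independent of one another given the hidden elements, and the hidden elements are conditionally independent of one another given the visible elements. The conditional probabilities $p(h_j = 1 \mid \mathbf{v})$ and $p(v_i = v \mid \mathbf{h})$ are therefore expressed by the following simple functions.

$$p(h_j = 1 \mid \mathbf{v}) = \mathcal{S}\!\left( c_j + \left( \frac{\mathbf{v}}{\boldsymbol{\sigma}^2} \right)^T \mathbf{W}_{:j} \right) \tag{4}$$

$$p(v_i = v \mid \mathbf{h}) = \mathcal{N}\!\left( v \mid b_i + \mathbf{W}_{i:} \mathbf{h}, \; \sigma_i^2 \right) \tag{5}$$

Here, $\mathbf{W}_{:j}$ and $\mathbf{W}_{i:}$ denote the j-th column vector and the i-th row vector of $\mathbf{W}$. $\mathcal{S}(\cdot)$ denotes the element-wise sigmoid function, and $\mathcal{N}(\cdot \mid \mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$.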
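The two conditional distributions of the RBM can be sketched in a few lines of code. The sketch below is illustrative only: it assumes the usual improved Gaussian-Bernoulli RBM conditionals, namely p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i W_ij / sigma_i^2) and v_i | h distributed as a normal with mean b_i + sum_j W_ij h_j and variance sigma_i^2. The function names, the toy parameter values, and the use of plain Python lists instead of an array library are all assumptions made for the example, not part of the embodiment.

```python
import math
import random


def sigmoid(x):
    """Logistic function S(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, c, sigma2):
    """p(h_j = 1 | v) = S(c_j + sum_i (v_i / sigma_i^2) * W[i][j])."""
    scaled = [v_i / s2 for v_i, s2 in zip(v, sigma2)]
    return [
        sigmoid(c[j] + sum(scaled[i] * W[i][j] for i in range(len(v))))
        for j in range(len(c))
    ]


def visible_mean(h, W, b):
    """Mean of p(v_i | h): b_i + sum_j W[i][j] * h_j."""
    return [b[i] + sum(W[i][j] * h[j] for j in range(len(h))) for i in range(len(b))]


def sample_visible(h, W, b, sigma2, rng=random):
    """Draw v_i ~ N(mean_i, sigma_i^2) independently for each visible element."""
    means = visible_mean(h, W, b)
    return [rng.gauss(m, math.sqrt(s2)) for m, s2 in zip(means, sigma2)]


# Toy model with I = 2 visible and J = 3 hidden elements (arbitrary values).
W = [[0.5, -0.2, 0.0], [0.1, 0.3, -0.4]]
b = [0.0, 1.0]
c = [0.0, 0.0, 0.0]
sigma2 = [1.0, 4.0]

p_h = hidden_probs([2.0, 4.0], W, c, sigma2)   # activation probabilities of h
v_mean = visible_mean([1.0, 0.0, 1.0], W, b)   # expected visible vector given h
```

Because the elements of each layer are conditionally independent given the other layer, both functions reduce to a single matrix-vector product followed by an element-wise operation, which is what makes block Gibbs sampling in an RBM inexpensive.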

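The RBM parameters $\Theta = \{\mathbf{W}, \boldsymbol{\sigma}, \mathbf{b}, \mathbf{c}\}$ are estimated so as to maximize the log-likelihood $\mathcal{L} = \log \prod_{n} p(\mathbf{v}_n)$ of the random variable, given N observations $\{\mathbf{v}_n\}_{n=1}^{N}$. Partially differentiating this log-likelihood with respect to each parameter gives

$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = \left\langle \frac{v_i h_j}{\sigma_i^2} \right\rangle_{\mathrm{data}} - \left\langle \frac{v_i h_j}{\sigma_i^2} \right\rangle_{\mathrm{model}} \tag{6}$$

$$\frac{\partial \mathcal{L}}{\partial b_i} = \left\langle \frac{v_i}{\sigma_i^2} \right\rangle_{\mathrm{data}} - \left\langle \frac{v_i}{\sigma_i^2} \right\rangle_{\mathrm{model}} \tag{7}$$

$$\frac{\partial \mathcal{L}}{\partial c_j} = \langle h_j \rangle_{\mathrm{data}} - \langle h_j \rangle_{\mathrm{model}} \tag{8}$$

where $\langle \cdot \rangle_{\mathrm{data}}$ and $\langle \cdot \rangle_{\mathrm{model}}$ denote expectations over the observed data and over the model, respectively. Because the latter expectation is generally intractable, the expectation $\langle \cdot \rangle_{\mathrm{recon}}$ over data reconstructed by equations (4) and (5) is used instead (contrastive divergence; CD). In addition, to constrain the variance to non-negative values and stabilize training, the ImpGBRBM substitutes $\sigma_i^2 = e^{z_i}$. The gradient with respect to $z_i$ is then computed as follows.

$$\frac{\partial \mathcal{L}}{\partial z_i} = \left\langle e^{-z_i} \left( \frac{1}{2} (v_i - b_i)^2 - v_i \mathbf{W}_{i:} \mathbf{h} \right) \right\rangle_{\mathrm{data}} - \left\langle e^{-z_i} \left( \frac{1}{2} (v_i - b_i)^2 - v_i \mathbf{W}_{i:} \mathbf{h} \right) \right\rangle_{\mathrm{model}} \tag{9}$$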

Each parameter is updated iteratively from equations (6), (7), and (8) using the stochastic gradient method (initial values are set randomly). That is, each RBM parameter $\theta \in \Theta$ is updated as

$$\theta^{(\mathrm{new})} = \theta^{(\mathrm{old})} + \gamma_\theta \frac{\partial \mathcal{L}}{\partial \theta} \tag{10}$$

where $\gamma_\theta$ denotes the learning rate.
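A single contrastive-divergence update can be sketched as follows. This is a minimal illustration under stated assumptions, not the training procedure of the embodiment: it performs one CD-1 step for one observation, uses mean-field values (probabilities and conditional means) instead of samples, keeps the variances fixed, and assumes the gradients take the "data expectation minus reconstruction expectation" form with v_i h_j / sigma_i^2 for the weights, v_i / sigma_i^2 for the visible biases, and h_j for the hidden biases. The function name `cd1_step` and the default learning rate are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, c, sigma2):
    """p(h_j = 1 | v) for every hidden element j."""
    return [
        sigmoid(c[j] + sum(v[i] / sigma2[i] * W[i][j] for i in range(len(v))))
        for j in range(len(c))
    ]


def visible_mean(h, W, b):
    """Conditional mean of every visible element given h."""
    return [b[i] + sum(W[i][j] * h[j] for j in range(len(h))) for i in range(len(b))]


def cd1_step(v_data, W, b, c, sigma2, lr=0.01):
    """One CD-1 gradient-ascent update of W, b, c (variances held fixed)."""
    h_data = hidden_probs(v_data, W, c, sigma2)     # statistics from the data
    v_recon = visible_mean(h_data, W, b)            # mean-field reconstruction
    h_recon = hidden_probs(v_recon, W, c, sigma2)   # statistics from the reconstruction
    n_vis, n_hid = len(b), len(c)
    new_W = [
        [
            W[i][j] + lr * (v_data[i] * h_data[j] - v_recon[i] * h_recon[j]) / sigma2[i]
            for j in range(n_hid)
        ]
        for i in range(n_vis)
    ]
    new_b = [b[i] + lr * (v_data[i] - v_recon[i]) / sigma2[i] for i in range(n_vis)]
    new_c = [c[j] + lr * (h_data[j] - h_recon[j]) for j in range(n_hid)]
    return new_W, new_b, new_c


# Start from all-zero parameters and apply one update for a single frame.
W0 = [[0.0, 0.0], [0.0, 0.0]]
W1, b1, c1 = cd1_step([1.0, -1.0], W0, [0.0, 0.0], [0.0, 0.0], [1.0, 1.0], lr=0.01)
```

In practice the step would be averaged over mini-batches and repeated for many iterations; a single-sample version is shown only to make the sign and form of each update visible.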

(Adaptive RBM and its application to voice quality conversion)
In the present embodiment, an adaptive RBM (adaptive restricted Boltzmann machine; ARBM) is defined as a model extending the RBM described above and is applied to the voice quality conversion task.

(Definition of the adaptive RBM)
FIG. 2 is a diagram showing the graph structure of the adaptive RBM.

As shown in FIG. 2, the adaptive RBM is a model that has, in addition to the visible elements and hidden elements of an ordinary RBM, identification elements $\mathbf{s} = [s_1, \ldots, s_R]^T$ with $s_k \in \{0, 1\}$ ($R$ is the number of identification elements). In voice quality conversion, for example, when the input $\mathbf{v}$ is an utterance of speaker k, $s_k = 1$ and all other elements are 0. In this model, coupling weights controlled by the identification elements $\mathbf{s}$ exist between the visible elements and the hidden elements. Denoting this coupling weight by $\tilde{\mathbf{W}}$, the present embodiment defines it as follows.

$$\tilde{\mathbf{W}} = (\mathcal{A} \circ_3 \mathbf{s}) \, \mathbf{W} + \mathcal{B} \circ_3 \mathbf{s} \tag{11}$$

Here, $\mathcal{A}$ and $\mathcal{B}$ are both speaker-dependent weights: third-order tensor parameters for specializing (adapting) the unspecified weight matrix $\mathbf{W}$. The operator $\circ_d$ takes the inner product of each matrix of a third-order tensor, unfolded along mode d, with a vector. For voice quality conversion, $\mathbf{W}$ is the coupling weight for unspecified speakers, that is, the speaker-independent weight, and $\mathbf{A}_k$ and $\mathbf{B}_k$ denote the adaptation matrix and the bias matrix of speaker k (where $\mathbf{A}_k$ denotes the k-th matrix of the third-order tensor $\mathcal{A}$ along mode 3).
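The speaker-adapted coupling weight can be illustrated with a small sketch. It rests on an assumption about the form of equation (11): that the adapted weight for a one-hot identification vector selecting speaker k is A_k W + B_k, with A_k an adaptation matrix applied to the speaker-independent weight W and B_k an additive bias matrix. The tensors are stored as Python lists of per-speaker matrices, so `A[k]` plays the role of the k-th matrix along mode 3; the function names and toy values are hypothetical.

```python
def matmul(X, Y):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(X[i][m] * Y[m][j] for m in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def adapted_weight(A, B, W, s):
    """Return (sum_k s_k A[k]) W + sum_k s_k B[k] for an identification vector s."""
    n_vis, n_hid = len(W), len(W[0])
    # Contract the speaker axis of each tensor with s.
    A_s = [
        [sum(s[k] * A[k][i][m] for k in range(len(s))) for m in range(n_vis)]
        for i in range(n_vis)
    ]
    B_s = [
        [sum(s[k] * B[k][i][j] for k in range(len(s))) for j in range(n_hid)]
        for i in range(n_vis)
    ]
    AW = matmul(A_s, W)
    return [[AW[i][j] + B_s[i][j] for j in range(n_hid)] for i in range(n_vis)]


# Two speakers: speaker 0 leaves W unchanged, speaker 1 rescales and shifts it.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 0.5]]]
B = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [-1.0, -1.0]]]

W_speaker0 = adapted_weight(A, B, W, [1, 0])
W_speaker1 = adapted_weight(A, B, W, [0, 1])
```

With a one-hot `s`, the contraction simply picks out one speaker's matrices, so switching speakers amounts to swapping `A[k]` and `B[k]` while `W` stays shared.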

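In the adaptive RBM, using the coupling weight $\tilde{\mathbf{W}}$ defined in equation (11), the joint probability $p(\mathbf{v}, \mathbf{h}, \mathbf{s})$ of the visible elements $\mathbf{v}$, the hidden elements $\mathbf{h}$, and the identification elements $\mathbf{s}$ is defined as follows.

$$p(\mathbf{v}, \mathbf{h}, \mathbf{s}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h}, \mathbf{s})} \tag{12}$$

$$E(\mathbf{v}, \mathbf{h}, \mathbf{s}) = \frac{1}{2} \left\| \frac{\mathbf{v} - \mathbf{b}}{\boldsymbol{\sigma}} \right\|^2 - \mathbf{c}^T \mathbf{h} - \left( \frac{\mathbf{v}}{\boldsymbol{\sigma}^2} \right)^T \tilde{\mathbf{W}} \mathbf{h} \tag{13}$$

$$Z = \int \sum_{\mathbf{h}} \sum_{\mathbf{s}} e^{-E(\mathbf{v}, \mathbf{h}, \mathbf{s})} \, d\mathbf{v} \tag{14}$$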

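With these definitions, the conditional probabilities $p(h_j = 1 \mid \mathbf{v}, \mathbf{s})$ and $p(v_i = v \mid \mathbf{h}, \mathbf{s})$ can be computed as follows.

$$p(h_j = 1 \mid \mathbf{v}, \mathbf{s}) = \mathcal{S}\!\left( c_j + \left( \frac{\mathbf{v}}{\boldsymbol{\sigma}^2} \right)^T \tilde{\mathbf{W}}_{:j} \right) \tag{15}$$

$$p(v_i = v \mid \mathbf{h}, \mathbf{s}) = \mathcal{N}\!\left( v \mid b_i + \tilde{\mathbf{W}}_{i:} \mathbf{h}, \; \sigma_i^2 \right) \tag{16}$$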

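The parameters $\Theta = \{\mathbf{W}, \mathcal{A}, \mathcal{B}, \boldsymbol{\sigma}, \mathbf{b}, \mathbf{c}\}$ of the adaptive RBM are estimated so as to maximize the log-likelihood $\mathcal{L} = \log \prod_{n} p(\mathbf{v}_n, \mathbf{s}_n)$, using N training samples $\{\mathbf{v}_n, \mathbf{s}_n\}_{n=1}^{N}$. The partial derivatives of this log-likelihood with respect to the elements $W_{ij}$, $A_{ii'k}$, and $B_{ijk}$ of $\mathbf{W}$, $\mathcal{A}$, and $\mathcal{B}$ are computed as

$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = \left\langle \sum_{i', k} \frac{A_{i'ik} \, v_{i'} h_j s_k}{\sigma_{i'}^2} \right\rangle_{\mathrm{data}} - \left\langle \sum_{i', k} \frac{A_{i'ik} \, v_{i'} h_j s_k}{\sigma_{i'}^2} \right\rangle_{\mathrm{model}} \tag{17}$$

$$\frac{\partial \mathcal{L}}{\partial A_{ii'k}} = \left\langle \sum_{j} \frac{W_{i'j} \, v_i h_j s_k}{\sigma_i^2} \right\rangle_{\mathrm{data}} - \left\langle \sum_{j} \frac{W_{i'j} \, v_i h_j s_k}{\sigma_i^2} \right\rangle_{\mathrm{model}} \tag{18}$$

$$\frac{\partial \mathcal{L}}{\partial B_{ijk}} = \left\langle \frac{v_i h_j s_k}{\sigma_i^2} \right\rangle_{\mathrm{data}} - \left\langle \frac{v_i h_j s_k}{\sigma_i^2} \right\rangle_{\mathrm{model}} \tag{19}$$

The other parameters $\mathbf{b}$, $\boldsymbol{\sigma}$, and $\mathbf{c}$ are obtained in the same manner as in equations (7), (9), and (8), respectively. Because the CD method can also be applied to the adaptive RBM, the parameters can be estimated efficiently by computing the second term $\langle \cdot \rangle_{\mathrm{model}}$ of each partial derivative as the expectation $\langle \cdot \rangle_{\mathrm{recon}}$ over the reconstructed observation data.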

(Voice quality conversion using the adaptive RBM)
The voice quality conversion apparatus of the present embodiment uses the adaptive RBM described above to convert the voice quality of an input speaker's speech into the voice quality of an arbitrary output speaker (target speaker).

FIG. 3 is a block diagram showing the configuration of the voice quality conversion apparatus according to the present embodiment.

The voice quality conversion apparatus 10 of the present embodiment converts the voice quality of an input speaker's speech into the voice quality of a target speaker, and includes an independent weight estimation unit 11, a dependent weight estimation unit 12, a hidden element layer estimation unit 13, and a visible element layer estimation unit 14.

As described above, the independent weight estimation unit 11 uses N training samples to estimate the parameters of the adaptive RBM so as to maximize the log-likelihood. That is, the independent weight estimation unit 11 estimates, as the speaker-independent weight, the weight that does not depend on the speaker among the coupling weights between the visible element layer and the hidden element layer constituting the RBM. As shown in equation (11), the coupling weight is expressed by an operation using a vector whose elements indicate 0 or 1 for each of S speakers (S is an integer of 2 or more), the speaker-independent weight, and the speaker-dependent weight of each of the S speakers. The N training samples may be speech with mutually different utterance contents uttered by N speakers (reference speakers), that is, non-parallel data.

With the speaker-independent weight fixed in the RBM, the dependent weight estimation unit 12 estimates, among the coupling weights, the weight that depends on the input speaker and the weight that depends on the target speaker, each as a speaker-dependent weight. Specifically, the dependent weight estimation unit 12 estimates the speaker-dependent weight of the input speaker from the input speaker's speech and the speaker-dependent weight of the target speaker from the target speaker's speech. The dependent weight estimation unit 12 may estimate these speaker-dependent weights from speech with mutually different utterance contents uttered by the input speaker and the target speaker, that is, from non-parallel data.

The hidden element layer estimation unit 13 estimates the hidden element layer based on the input speaker's speech input to the visible element layer and the speaker-dependent weight of the input speaker.

The visible element layer estimation unit 14 estimates the target speaker's speech, which is output as the visible element layer, based on the hidden element layer and the speaker-dependent weight of the target speaker.

FIG. 4 is a flowchart showing the processing operation of the voice quality conversion apparatus 10 according to the present embodiment.

First, as shown in FIG. 4, the independent weight estimation unit 11 of the voice quality conversion apparatus 10 simultaneously estimates the parameters of the adaptive RBM using data (speech) from a plurality of (S) reference speakers (step S1).

Next, the dependent weight estimation unit 12 fixes the parameters that do not depend on the speaker, such as the speaker-independent weight, and estimates the speaker-dependent weights of the input speaker and the target speaker as adaptation parameters from equations (18) and (19), using adaptation data consisting of speech of the input speaker and the target speaker (step S2).

Then, the hidden element layer estimation unit 13 estimates the latent features (hidden element layer) $\hat{\mathbf{h}}$ from the frame acoustic features $\mathbf{v}^{(\mathrm{in})}$ of the input speaker's speech to be converted, as in the following equation (step S3).

$$\hat{\mathbf{h}} = \mathbb{E}\!\left[ \mathbf{h} \mid \mathbf{v}^{(\mathrm{in})}, \mathbf{s}^{(\mathrm{in})} \right] \tag{20}$$

Here, $\mathbf{s}^{(\mathrm{in})}$ is a vector in which only the element corresponding to the input speaker is 1 and all other elements are 0. At the same time, the identification vector $\mathbf{s}$ is lengthened to accommodate the new speakers, and the adaptation and bias matrices of the new speakers are appended to the speaker-dependent tensors along mode 3. Rewriting equation (20) gives

$$\hat{\mathbf{h}} = \mathcal{S}\!\left( \mathbf{c} + \left( \mathbf{A}_{\mathrm{in}} \mathbf{W} + \mathbf{B}_{\mathrm{in}} \right)^T \frac{\mathbf{v}^{(\mathrm{in})}}{\boldsymbol{\sigma}^2} \right) \tag{21}$$

which shows that the latent features are estimated using coupling weights obtained by adapting the speaker-independent term $\mathbf{W}$ to the input speaker. Equation (21) also suggests that $\hat{\mathbf{h}}$ is a latent feature that does not depend on the speaker, because once the adaptive RBM has been trained, the adapted coupling weight is a function of the identification variable alone. That is, speaker identity is controlled solely by the speaker-dependent terms, and $\hat{\mathbf{h}}$ is considered to represent speaker-independent information close to phonemes. Therefore, to obtain speech with the speaker identity of the output speaker (target speaker), it suffices to restore the acoustic features from the phonological information $\hat{\mathbf{h}}$ using the speaker-dependent weights $\mathbf{A}_{\mathrm{out}}$ and $\mathbf{B}_{\mathrm{out}}$ of the output speaker. That is, the visible element layer estimation unit 14 computes the converted frame features $\hat{\mathbf{v}}^{(\mathrm{out})}$ of the output speaker as follows (step S4).

$$\hat{\mathbf{v}}^{(\mathrm{out})} = \mathbf{b} + \left( \mathbf{A}_{\mathrm{out}} \mathbf{W} + \mathbf{B}_{\mathrm{out}} \right) \hat{\mathbf{h}} \tag{22}$$

This means that, based on the phonological information obtained from the input speaker's speech, the acoustic features of the output speaker are generated using a basis obtained by adapting the speaker-independent weight to the output speaker (target speaker). Moreover, as equations (21) and (22) show, a nonlinear function is used to estimate the latent features when the acoustic features of the input speaker are converted into those of the output speaker, so the voice quality conversion method of the present embodiment can be regarded as voice quality conversion based on nonlinear transformation.
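Steps S3 and S4 can be sketched as one short function. The sketch assumes that each speaker's adapted weight has the form A_k W + B_k, that the latent features are the sigmoid activation probabilities computed with the input speaker's adapted weight, and that the converted frame is the conditional mean of the visible layer under the target speaker's adapted weight. All names (`convert_frame`, `speaker_weight`) and the toy values are hypothetical and chosen only to make the data flow concrete.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def speaker_weight(A_k, B_k, W):
    """Adapted coupling weight A_k W + B_k for one speaker."""
    n_vis, n_hid = len(W), len(W[0])
    return [
        [sum(A_k[i][m] * W[m][j] for m in range(n_vis)) + B_k[i][j] for j in range(n_hid)]
        for i in range(n_vis)
    ]


def convert_frame(v_in, W, b, c, sigma2, A_in, B_in, A_out, B_out):
    """Map one input-speaker frame to latent features, then to a target-speaker frame."""
    W_in = speaker_weight(A_in, B_in, W)
    W_out = speaker_weight(A_out, B_out, W)
    n_vis, n_hid = len(b), len(c)
    # Step S3: latent features from the input frame and the input speaker's weights.
    h = [
        sigmoid(c[j] + sum(v_in[i] / sigma2[i] * W_in[i][j] for i in range(n_vis)))
        for j in range(n_hid)
    ]
    # Step S4: visible-layer mean under the target speaker's weights.
    v_out = [b[i] + sum(W_out[i][j] * h[j] for j in range(n_hid)) for i in range(n_vis)]
    return h, v_out


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]

h_hat, v_hat = convert_frame(
    [0.0, 0.0], identity, [0.0, 0.0], [0.0, 0.0], [1.0, 1.0],
    A_in=identity, B_in=zeros, A_out=doubled, B_out=zeros,
)
```

The only nonlinearity in the mapping is the sigmoid in step S3; the input and target speakers differ solely in which pair of matrices is passed, while `W`, `b`, `c`, and `sigma2` are shared.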

When an adaptive RBM is trained on real speech data, there are often many speakers but little speech from each of them. In that case, there is enough data to estimate the speaker-independent weight, but only a small amount of data is available for estimating the adaptation parameters of each speaker, which can cause misestimation or overfitting. Therefore, in the evaluation experiments described below, the number of parameters is reduced by approximating each speaker's adaptation matrix by a diagonal matrix and each speaker's bias matrix by a matrix whose columns are all equal.
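The effect of this approximation on the parameter count can be made concrete with a small calculation. The sketch assumes that a speaker's adaptation matrix is square with one row per visible element and that the bias matrix has one row per visible element and one column per hidden element; under the approximation, each speaker then needs only one diagonal vector and one column vector. The number of hidden elements used below (64) is a hypothetical value for illustration, and the function names are not from the source.

```python
def adaptation_param_count(num_visible, num_hidden, reduced):
    """Speaker-dependent parameters per speaker, with or without the approximation."""
    if reduced:
        # Diagonal adaptation matrix: num_visible entries.
        # Bias matrix with identical columns: one column of num_visible entries.
        return 2 * num_visible
    return num_visible * num_visible + num_visible * num_hidden


def expand_reduced(a_diag, u_col, num_hidden):
    """Rebuild full matrices: diag(a_diag), and u_col repeated in every column."""
    n_vis = len(a_diag)
    A_k = [[a_diag[i] if i == m else 0.0 for m in range(n_vis)] for i in range(n_vis)]
    B_k = [[u_col[i]] * num_hidden for i in range(n_vis)]
    return A_k, B_k


full = adaptation_param_count(32, 64, reduced=False)
small = adaptation_param_count(32, 64, reduced=True)
A_k, B_k = expand_reduced([1.0, 2.0], [0.5, -0.5], num_hidden=3)
```

For 32-dimensional features the reduced form needs 64 values per speaker regardless of the hidden layer size, which is why a few sentences of adaptation data can be enough.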

(Evaluation experiment)
An evaluation experiment of the voice quality conversion method according to the present embodiment is described in detail below with reference to FIGS. 5 to 7.

（実験条件）
本実験では、英語圏の複数の話者による音声が含まれたコーパスであるＴＩＭＩＴを用いて、本実施の形態における適応型ＲＢＭを用いた声質変換方法の精度を調べた。なお、ＴＩＭＩＴについては、文献「J. S. Garofolo, L. D. Consortium, et al.: "TIMIT: acoustic-phonetic continuous speech corpus", Linguistic Data Consortium (1993)」に詳細に記述されている。 (Experimental conditions)
In this experiment, the accuracy of the voice quality conversion method using the adaptive RBM according to the present embodiment was examined using TIMIT, a corpus containing speech from multiple English speakers. TIMIT is described in detail in J. S. Garofolo, L. D. Consortium, et al.: "TIMIT: acoustic-phonetic continuous speech corpus", Linguistic Data Consortium (1993).

このコーパスから、話者非依存パラメータ（話者非依存重み）の推定のために、参照話者として３８名（内女性１４名、男性２４名）を選んだ。各話者からは、５文の発話データを学習に用いている（学習に用いた総フレーム数はおよそ２７万）。本実施の形態における声質変換方法を評価するために、女性４名、男性４名の音声を用いて入力話者・出力話者のペア（計２８ペア）を作成し、異性間及び同性間の声質変換の性能比較を行った。このとき、入力・出力話者のパラレルデータ（同一発話内容による、学習データには含まれない２文のデータから動的計画法によって作成）を用いてＳＤＩＲ（ｓｐｅｃｔｒａｌｄｉｓｔｏｒｔｉｏｎｉｍｐｒｏｖｅｍｅｎｔｒａｔｉｏ）による評価をおこなっている。音響特徴量として、ＳＴＲＡＩＧＨＴスペクトルから計算された３２次元のＭＦＣＣ（Ｍｅｌ−ＦｒｅｑｕｅｎｃｙＣｅｐｓｔｒｕｍＣｏｅｆｆｉｃｉｅｎｔｓ）を用いた。なお、ＳＴＲＡＩＧＨＴスペクトルについては、文献「H. Kawahara, M. Morise, T. Takahashi, R. Nisimura,T. Irino and H. Banno: "TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation", ICASSP, pp. 3933-3936 (2008)」に詳細に記述されている。 From this corpus, 38 speakers (14 women and 24 men) were selected as reference speakers for estimating the speaker-independent parameters (speaker-independent weights). Five sentences of utterance data from each speaker were used for training (approximately 270,000 frames in total). To evaluate the voice quality conversion method of this embodiment, pairs of input and output speakers (28 pairs in total) were created using the voices of four women and four men, and conversion performance was compared between cross-gender and same-gender pairs. The evaluation used the SDIR (spectral distortion improvement ratio), computed on parallel data of the input and output speakers (created by dynamic programming from two sentences with the same utterance content that were not included in the training data). As the acoustic features, 32-dimensional MFCCs (Mel-Frequency Cepstrum Coefficients) calculated from the STRAIGHT spectrum were used. The STRAIGHT spectrum is described in detail in H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino and H. Banno: "TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation", ICASSP, pp. 3933-3936 (2008).
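The parallel evaluation data mentioned above is created by dynamic programming. As a minimal sketch of what such an alignment can look like, the following code applies plain dynamic time warping (DTW) to two sequences of feature frames. The function names, the Euclidean frame distance, and the three-way step pattern are assumptions for illustration; the source does not specify which variant was used.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(src, tgt):
    """Return (i, j) index pairs aligning frames of src to frames of tgt."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning src[:i] with tgt[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the last frame pair to the first one.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # advance both sequences
            (cost[i - 1][j], i - 1, j),          # advance src only
            (cost[i][j - 1], i, j - 1),          # advance tgt only
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


src = [[0.0], [1.0], [2.0]]
tgt = [[0.0], [0.0], [1.0], [2.0]]
print(dtw_align(src, tgt))  # [(0, 0), (0, 1), (1, 2), (2, 3)]
```

Each returned pair gives one source frame and the target frame it is matched to, so frame-wise measures can then be computed on the aligned pairs.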

適応型ＲＢＭにおける学習率、バッチサイズ、繰り返し回数はそれぞれ０：００５、５０、５００とした。隠れ素子数を１２８、１９２、２５６、５１２と変えて比較を行った。 The learning rate, batch size, and number of iterations for the adaptive RBM were set to 0.005, 50, and 500, respectively. The number of hidden elements was varied among 128, 192, 256, and 512 for comparison.

（実験結果と考察）
図５は、本実施の形態における声質変換方法による声質変換の結果を示す図である。例えば、ｆｅｍａｌｅ−ｔｏ−ｆｅｍａｌｅでは、評価用の女性４名の音声を、それぞれ他の女性３名へ変換し、全フレームのＳＤＩＲの平均をとったものを表す。「ａｖｇ．」は全組み合わせの平均値である。図５から、一部を除いて隠れ素子数が増加すれば変換精度が向上していることが分かる。隠れ素子数が５１２と２５６の結果を比較すると、５１２の場合は男性への変換（ｆｅｍａｌｅ−ｔｏ−ｍａｌｅ，ｍａｌｅ−ｔｏ−ｍａｌｅ）で優っている。しかし、女性への変換（ｆｅｍａｌｅ−ｔｏ−ｆｅｍａｌｅ，ｍａｌｅ−ｔｏ−ｆｅｍａｌｅ）で精度が下がってしまい、結果として全平均のＳＤＩＲ値が低くなってしまっている。この理由として、パラメータ数の増加に伴い、モデルが過学習しているためだと考えられる（男性と女性の話者数は２４対１４であり、隠れ素子数５１２のモデルでは変換音声が男性側へ強く反応していることからも過学習が窺える）。 (Experimental results and discussion)
FIG. 5 shows the results of voice quality conversion by the voice quality conversion method in the present embodiment. For example, female-to-female denotes the SDIR averaged over all frames when the voice of each of the four female evaluation speakers is converted to each of the other three female speakers. "avg." is the average over all combinations. FIG. 5 shows that, with some exceptions, conversion accuracy improves as the number of hidden elements increases. Comparing the results for 512 and 256 hidden elements, 512 is superior for conversion to male speakers (female-to-male, male-to-male). However, its accuracy drops for conversion to female speakers (female-to-female, male-to-female), and as a result its overall average SDIR is lower. This is thought to be because the model overfits as the number of parameters increases (the ratio of male to female reference speakers is 24 to 14, and the fact that the converted speech of the model with 512 hidden elements responds strongly toward the male side also suggests overfitting).
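The frame-averaged SDIR reported in FIG. 5 can be illustrated with a short sketch. The formula below is an assumption: it is one commonly used form of the spectral distortion improvement ratio (the distortion between target and source divided by the distortion between target and converted features, in dB), and this section of the source does not state the exact definition. The function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def frame_sdir(src, tgt, conv):
    """SDIR in dB for one aligned frame (assumed definition, see lead-in).

    Positive values mean the converted frame is closer to the target than
    the unconverted source frame was. Undefined if conv equals tgt exactly.
    """
    return 10.0 * math.log10(sq_dist(tgt, src) / sq_dist(tgt, conv))


def mean_sdir(frames):
    """Average SDIR over an iterable of (src, tgt, conv) aligned frames."""
    values = [frame_sdir(s, t, c) for s, t, c in frames]
    return sum(values) / len(values)


frames = [
    ([0.0, 0.0], [2.0, 0.0], [1.0, 0.0]),  # distortion halved in amplitude
    ([0.0, 0.0], [2.0, 0.0], [0.0, 0.0]),  # no improvement: 0 dB
]
print(round(mean_sdir(frames), 3))  # 3.01
```

Under this assumed definition, a value of 0 dB means the conversion did not move the features any closer to the target, and larger values mean a larger reduction in spectral distortion.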

図６は、本実施の形態における声質変換方法によって、実際に推定されたパラメータを示す図である。図６における（ａ）、（ｂ）および（ｃ）はそれぞれ、
の一部を示す。 FIG. 6 shows parameters actually estimated by the voice quality conversion method according to the present embodiment. Parts (a), (b), and (c) of FIG. 6 each show a part of the corresponding estimated parameter.

に関しては、対角行列として近似した
の対角成分を列ベクトルとして話者ごとに並べた行列を示しており、
も同様に話者ごとに並べた列ベクトルを示している。図６の（ｂ）および（ｃ）において、左１４列ベクトルは女性話者、右２４列ベクトルは男性話者に相当する。この図６から分かるように、
の各々の列ベクトルは同性間で類似性が高く、異性間で類似性が低いベクトルとなっている。これは、音声を聴いて話者の違いを認識する際、個人の差異よりも性別の違いをより大きく感じ取っているという直感と一致する。 For the adaptation parameter approximated as a diagonal matrix, the figure shows a matrix in which its diagonal components are arranged as column vectors, one per speaker; the other adaptation parameter is likewise shown as column vectors arranged per speaker. In (b) and (c) of FIG. 6, the left 14 column vectors correspond to female speakers and the right 24 column vectors correspond to male speakers. As can be seen from FIG. 6, the column vectors are highly similar between speakers of the same sex and less similar between speakers of different sexes. This is consistent with the intuition that, when listening to speech and recognizing differences between speakers, differences in gender are perceived as larger than individual differences.

図７は、本実施の形態における声質変換方法によって女性話者の音声（コーパスではＦＣＪＦ０）を男性話者の音声（ＭＷＡＲ０）へ変換した例を示す図である。この例では、ＦＣＪＦ０のある時刻における対数スペクトル（図７の（ａ）における点線）からＭＦＣＣを計算し、ＦＣＪＦ０の適応型ＲＢＭによって、
を推定した後、ＭＷＡＲ０の適応パラメータを用いて変換された音響特徴量を対数スペクトルへ復元した（図７の（ｂ）における実線）。参考として、
の推定後ＦＣＪＦ０の適応パラメータによって復元したスペクトル（図７の（ａ）における実線）、目標となるＭＷＡＲ０のスペクトル（図７の（ｂ）における点線）を載せている。この図７より、ＦＣＪＦ０の音声からＦＣＪＦ０の音声へ再構築したスペクトルのみならず、別の話者であるＭＷＡＲ０へ変換した音声スペクトルにおいても、約３．５ｋＨｚ未満の帯域（低域）におけるスペクトルピークの周波数（フォルマント）がおおよそ目標と一致するなど、その話者の特徴を捉えていることが分かる。約３．５ｋＨｚ以上の帯域（高周波数域）に関してはいずれも目標と大きく異なっているが、ＭＦＣＣからスペクトルを復元しているため、高域における情報が損失してしまうことに起因する。 FIG. 7 shows an example in which the voice of a female speaker (FCJF0 in the corpus) is converted into the voice of a male speaker (MWAR0) by the voice quality conversion method according to the present embodiment. In this example, the MFCC was calculated from the logarithmic spectrum of FCJF0 at a certain time (dotted line in (a) of FIG. 7), the hidden element layer was estimated by the adaptive RBM of FCJF0, and the acoustic feature converted using the adaptation parameters of MWAR0 was then restored to a logarithmic spectrum (solid line in (b) of FIG. 7). For reference, the spectrum restored with the adaptation parameters of FCJF0 after the same estimation (solid line in (a) of FIG. 7) and the spectrum of the target MWAR0 (dotted line in (b) of FIG. 7) are also shown. FIG. 7 shows that not only the spectrum reconstructed from FCJF0's voice back to FCJF0's voice, but also the spectrum converted to a different speaker, MWAR0, captures the characteristics of the respective speaker: for example, the frequencies of the spectral peaks (formants) in the band below about 3.5 kHz (low band) roughly match the target. In the band at or above about 3.5 kHz (high band), both spectra differ greatly from the target; this is because the spectrum is restored from the MFCC, so information in the high band is lost.

このように、本実施の形態では、パラレルデータを学習時に一切使用せず、かつＦＣＪＦ０からＭＷＡＲ０への変換モデルを学習していないにも関わらずＦＣＪＦ０からＭＷＡＲ０へ変換することができる。 Thus, in this embodiment, speech can be converted from FCJF0 to MWAR0 even though no parallel data is used during training and no conversion model from FCJF0 to MWAR0 has been trained.

（まとめ）
以上のように、本実施の形態における声質変換方法は、図４に示すように、ステップＳ１〜Ｓ４を含む。ステップＳ１では、ＲＢＭを構成する可視素子層と隠れ素子層の２つの異層素子間の結合重みのうち、話者に依存しない重みを非依存重み（上述の話者非依存重み）として推定する。ステップＳ２では、ＲＢＭにおいて非依存重みを固定した状態で、結合重みのうち、入力話者に依存する重みと、目標話者に依存する重みとをそれぞれ依存重み（上述の話者依存重み、または適応パラメータ）として推定する。ステップＳ３では、可視素子層に入力される入力話者の音声と、入力話者の依存重みとに基づいて、隠れ素子層を推定する。ステップＳ４では、隠れ素子層と、目標話者の依存重みとに基づいて、可視素子層として出力される目標話者の音声を推定する。また、本実施の形態では、結合重みは、Ｓ（Ｓは２以上の整数）人の話者のそれぞれに対する０または１を示す要素からなるベクトルと、非依存重みと、Ｓ人のそれぞれの話者の依存重みとを用いた演算によって表わされる。 (Summary)
As described above, the voice quality conversion method according to the present embodiment includes steps S1 to S4, as shown in FIG. 4. In step S1, among the coupling weights between the two different layers constituting the RBM, namely the visible element layer and the hidden element layer, the weight that does not depend on the speaker is estimated as the independent weight (the speaker-independent weight described above). In step S2, with the independent weight fixed in the RBM, the weight that depends on the input speaker and the weight that depends on the target speaker among the coupling weights are each estimated as a dependent weight (the speaker-dependent weight, or adaptation parameter, described above). In step S3, the hidden element layer is estimated based on the voice of the input speaker input to the visible element layer and the dependent weight of the input speaker. In step S4, the voice of the target speaker, output as the visible element layer, is estimated based on the hidden element layer and the dependent weight of the target speaker. Further, in the present embodiment, the coupling weight is expressed by a calculation using a vector composed of elements each indicating 0 or 1 for a respective one of S speakers (S is an integer of 2 or more), the independent weight, and the dependent weight of each of the S speakers.
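Steps S3 and S4 can be sketched in a few lines of code. This is an illustration under stated assumptions, not the formulation of the embodiment: the speaker-adapted weight is assumed to be the independent weight scaled row-wise by a per-speaker vector `a` (a diagonal matrix) plus a per-speaker vector `b` added to every column (a matrix with equal columns); the visible units are assumed to have unit variance; and the biases are assumed to be shared across speakers. All function and parameter names are hypothetical.

```python
import math


def speaker_weights(W, a, b):
    """Adapt independent weights W (visible x hidden) to one speaker.

    Assumed form: W_spk[i][j] = a[i] * W[i][j] + b[i].
    """
    return [[a[i] * w + b[i] for w in row] for i, row in enumerate(W)]


def estimate_hidden(v, W_spk, c):
    """Step S3: hidden activation probabilities via a sigmoid (nonlinear)."""
    return [
        1.0 / (1.0 + math.exp(-(c[j] + sum(W_spk[i][j] * v[i] for i in range(len(v))))))
        for j in range(len(c))
    ]


def estimate_visible(h, W_spk, bias):
    """Step S4: expected visible features given the hidden layer."""
    return [
        bias[i] + sum(W_spk[i][j] * h[j] for j in range(len(h)))
        for i in range(len(bias))
    ]


def convert(v, W, c, bias, src_params, tgt_params):
    """Convert one frame v from the input speaker to the target speaker."""
    h = estimate_hidden(v, speaker_weights(W, *src_params), c)
    return estimate_visible(h, speaker_weights(W, *tgt_params), bias)


W = [[1.0, -1.0], [0.5, 0.0]]       # independent weights (step S1), toy values
src = ([1.0, 1.0], [0.0, 0.0])      # input-speaker (a, b) from step S2, toy values
tgt = ([2.0, 2.0], [0.0, 0.0])      # target-speaker (a, b) from step S2, toy values
print(convert([0.0, 0.0], W, [0.0, 0.0], [0.0, 0.0], src, tgt))  # [0.0, 0.5]
```

Only the per-speaker pair `(a, b)` changes between speakers in this sketch, which mirrors the idea that the independent weight stays fixed while the dependent weights carry the speaker characteristics.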

これにより、本実施の形態では、ＲＢＭを構成する可視素子層と隠れ素子層の２つの異層素子間の結合重みが、非依存重みと依存重みとに分離された適応型ＲＢＭが用いられる。したがって、非依存重みを固定させておけば、依存重みを用いて話者性を容易に制御することができる。その結果、パラレルデータを用いることなく入力話者の声質を変換することができる。また、入力話者および目標話者のそれぞれの音声が少なくても、ステップＳ２において入力話者および目標話者のそれぞれの依存重みを適切に推定することができる。その結果、何れの目標話者の依存重みでも簡単に推定することができるため、ステップＳ１で推定された非依存重みを流用すれば、入力話者の声質を任意の話者の声質に変換することができる。 Thus, the present embodiment uses an adaptive RBM in which the coupling weights between the two different layers constituting the RBM, the visible element layer and the hidden element layer, are separated into an independent weight and dependent weights. Therefore, if the independent weight is kept fixed, the speaker characteristics can be easily controlled using the dependent weights. As a result, the voice quality of the input speaker can be converted without using parallel data. Further, even if only a small amount of speech is available for each of the input speaker and the target speaker, the dependent weights of the input speaker and the target speaker can be appropriately estimated in step S2. As a result, the dependent weight of any target speaker can be easily estimated, so by reusing the independent weight estimated in step S1, the voice quality of the input speaker can be converted into the voice quality of an arbitrary speaker.

つまり、本実施の形態では、潜在的な特徴量を抽出するＲＢＭを拡張して、話者に依存する項（依存重み）と依存しない項（非依存重み）に分離してモデル化することで学習時にパラレルデータを必要としない、任意話者に適応可能な声質変換を行うことができる。 In other words, in the present embodiment, the RBM, which extracts latent features, is extended and modeled with its weights separated into a speaker-dependent term (dependent weight) and a speaker-independent term (independent weight). This enables voice quality conversion that requires no parallel data during training and can be adapted to any speaker.

なお、本実施の形態におけるＲＢＭの拡張モデル（適応型ＲＢＭ）は声質変換のみならず、音声の感情付与や物体認識など、様々なタスクへの応用が考えられる。また、このモデルにおいて識別素子
を推定することで、例えば話者認識へ応用することも可能である。音韻情報と話者情報が混在した音声からそれぞれを分離し、話者性を制御できる。 Note that the extended RBM model (adaptive RBM) in this embodiment can be applied not only to voice quality conversion but also to various other tasks, such as adding emotion to speech and object recognition. In addition, by estimating the identification element in this model, the model can also be applied to, for example, speaker recognition. Phonological information and speaker information can be separated from speech in which they are mixed, and the speaker characteristics can be controlled.

なお、上記実施の形態において、非依存重み推定部１１、依存重み推定部１２、隠れ素子層推定部１３および可視素子層推定部１４などの各構成要素は、専用のハードウェアで構成されるか、各構成要素に適したソフトウェアプログラムを実行することによって実現されてもよい。各構成要素は、ＣＰＵまたはプロセッサなどのプログラム実行部が、ハードディスクまたは半導体メモリなどの記録媒体に記録されたソフトウェアプログラムを読み出して実行することによって実現されてもよい。ここで、上記各実施の形態の声質変換装置１０などを実現するソフトウェアは、例えば図４に示すフローチャートに含まれる各ステップをコンピュータに実行させるプログラムである。また、上記実施の形態における声質変換装置１０は、プロセッサ、メモリおよび入出力ポートを有するコンピュータ、あるいは、論理回路などで実現されてもよい。また、上記実施の形態における各隠れ素子は、例えば０または１であり、その隠れ素子に対応する発話中の音素または音韻の有無を表していると考えられる。 In the above embodiment, each component, such as the independent weight estimation unit 11, the dependent weight estimation unit 12, the hidden element layer estimation unit 13, and the visible element layer estimation unit 14, may be configured with dedicated hardware or may be realized by executing a software program suitable for that component. Each component may be realized by a program execution unit, such as a CPU or a processor, reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory. Here, the software that realizes the voice quality conversion device 10 and the like of the above embodiment is, for example, a program that causes a computer to execute each step included in the flowchart shown in FIG. 4. Further, the voice quality conversion device 10 in the above embodiment may be realized by a computer having a processor, a memory, and an input/output port, or by a logic circuit or the like. Further, each hidden element in the above embodiment takes a value of, for example, 0 or 1, and is considered to represent the presence or absence of the phoneme or phonological unit in the utterance that corresponds to that hidden element.

以上、一つまたは複数の態様に係る声質変換方法について、実施の形態に基づいて説明したが、本発明は、この実施の形態に限定されるものではない。本発明の趣旨を逸脱しない限り、当業者が思いつく各種変形を本実施の形態に施したものや、異なる実施の形態における構成要素を組み合わせて構築される形態も、本発明の範囲に含まれてもよい。 The voice quality conversion method according to one or more aspects has been described above based on the embodiment, but the present invention is not limited to this embodiment. As long as they do not depart from the gist of the present invention, forms obtained by applying various modifications conceivable by those skilled in the art to this embodiment, and forms constructed by combining components of different embodiments, may also be included in the scope of the present invention.

本発明にかかる声質変換方法は、パラレルデータを用いることなく入力話者の声質を任意の話者の声質に変換することができるという効果を奏し、例えば、ボイスチェンジャー、発話支援装置またはアミューズメント機器などの声質変換装置に適用することができる。 The voice quality conversion method according to the present invention has the effect that the voice quality of an input speaker can be converted into the voice quality of an arbitrary speaker without using parallel data, and can be applied to voice quality conversion devices such as voice changers, speech support devices, and amusement devices.

１０声質変換装置
１１非依存重み推定部
１２依存重み推定部
１３隠れ素子層推定部
１４可視素子層推定部 DESCRIPTION OF SYMBOLS: 10 voice quality conversion device; 11 independent weight estimation unit; 12 dependent weight estimation unit; 13 hidden element layer estimation unit; 14 visible element layer estimation unit

Claims

A voice quality conversion method for converting the voice quality of an input speaker's voice into the voice quality of a target speaker, the method comprising:
a first step of estimating, as an independent weight, a speaker-independent weight among coupling weights between two different layers, a visible element layer and a hidden element layer, that constitute an RBM (Restricted Boltzmann Machine), which is a probabilistic model;
a second step of estimating, as dependent weights, a weight dependent on the input speaker and a weight dependent on the target speaker among the coupling weights, with the independent weight fixed in the RBM;
a third step of estimating the hidden element layer based on the voice of the input speaker input to the visible element layer and the dependent weight of the input speaker; and
a fourth step of estimating the voice of the target speaker, output as the visible element layer, based on the hidden element layer and the dependent weight of the target speaker.

The voice quality conversion method according to claim 1, wherein the coupling weight is expressed by a calculation using a vector composed of elements each indicating 0 or 1 for a respective one of S speakers (S being an integer of 2 or more), the independent weight, and the dependent weight of each of the S speakers.

The voice quality conversion method according to claim 1 or 2, wherein, in the first step, the independent weight is estimated based on speech with mutually different utterance content uttered by a plurality of speakers, and
in the second step, the dependent weight of the input speaker and the dependent weight of the target speaker are estimated based on speech with mutually different utterance content uttered by the input speaker and the target speaker, respectively.

A voice quality conversion device that converts the voice quality of an input speaker's voice into the voice quality of a target speaker, the device comprising:
an independent weight estimation unit that estimates, as an independent weight, a speaker-independent weight among coupling weights between two different layers, a visible element layer and a hidden element layer, that constitute an RBM (Restricted Boltzmann Machine), which is a probabilistic model;
a dependent weight estimation unit that estimates, as dependent weights, a weight dependent on the input speaker and a weight dependent on the target speaker among the coupling weights, with the independent weight fixed in the RBM;
a hidden element layer estimation unit that estimates the hidden element layer based on the voice of the input speaker input to the visible element layer and the dependent weight of the input speaker; and
a visible element layer estimation unit that estimates the voice of the target speaker, output as the visible element layer, based on the hidden element layer and the dependent weight of the target speaker.