JP6764851B2

JP6764851B2 - Series data converter, learning device, and program

Info

Publication number: JP6764851B2
Application number: JP2017248427A
Authority: JP
Inventors: 卓弘金子; 弘和亀岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-12-07
Filing date: 2017-12-07
Publication date: 2020-10-14
Anticipated expiration: 2037-12-07
Also published as: JP2019101391A

Description

The present invention relates to a series data conversion device, a learning device, and a program.

[Chapter 1: Introduction]
Voice conversion is a technique that converts only the non-linguistic and paralinguistic information of input speech (such as speaker identity and speaking style) while retaining its linguistic information (the uttered sentence). It can be applied to speaker conversion for text-to-speech synthesis, speaking assistance, speech enhancement, and pronunciation conversion. The voice conversion problem can be formulated as a regression problem of estimating a mapping function from the features of the source speech to the features of the target speech. Among conventional voice conversion methods, those based on the Gaussian mixture model (GMM) are widely used because of their effectiveness and versatility. In recent years, neural network (NN) based methods, such as restricted Boltzmann machines, feed-forward NNs, recurrent NNs (RNNs), and convolutional NNs (CNNs), and exemplar-based methods using, for example, non-negative matrix factorization (NMF) have also been studied.

Non-Patent Document 1: Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. In Proc. ICML, 2017.
Non-Patent Document 2: James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. In Proc. ICLR, 2017.
Non-Patent Document 3: Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proc. ICML, pages 933-941, 2017.
Non-Patent Document 4: Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proc. NIPS, pages 2672-2680, 2014.
Non-Patent Document 5: Takuhiro Kaneko, Hirokazu Kameoka, Kaoru Hiramatsu, and Kunio Kashino. Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks. In Proc. INTERSPEECH, pages 1283-1287, 2017.
Non-Patent Document 6: Kun Liu, Jianping Zhang, and Yonghong Yan. High quality voice conversion through phoneme-based linear mapping functions with STRAIGHT for Mandarin. In Proc. FSKD, pages 410-414, 2007.
Non-Patent Document 7: Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proc. ICCV, 2017.
Non-Patent Document 8: Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst., 99(7):1877-1884, 2016.
Non-Patent Document 9: Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. In Proc. ICLR, 2017.
Non-Patent Document 10: Shinnosuke Takamichi, Tomoki Toda, Graham Neubig, Sakriani Sakti, and Satoshi Nakamura. A postfilter to modify the modulation spectrum in HMM-based speech synthesis. In Proc. ICASSP, pages 290-294, 2014.
Non-Patent Document 11: Tomoki Toda, Alan W. Black, and Keiichi Tokuda. Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE/ACM Trans. Audio Speech Lang. Process., 15(8):2222-2235, 2007.
Non-Patent Document 12: Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio, Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi. The Voice Conversion Challenge 2016. In Proc. INTERSPEECH, pages 1632-1636, 2016.
Non-Patent Document 13: Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi. Analysis of the Voice Conversion Challenge 2016 evaluation results. In Proc. INTERSPEECH, pages 1637-1641, 2016.
Non-Patent Document 14: Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. ICCV, pages 2223-2232, 2017.

Most of these methods learn the conversion function from parallel data so that the features of the converted speech become as close as possible to those of the target speech. However, depending on the application, it is often difficult to prepare paired data of source and target speech with the same utterance content. Even when such paired data is available, highly accurate time alignment is required, and when the alignment is performed automatically, visual or manual pre-screening is needed to correct alignment errors.

An object of the present invention is to provide a parallel-data-free voice conversion method, that is, one that does not require parallel data.

The series data conversion device according to the present invention handles series data of two domains and includes: an input unit that receives series data; a forward conversion unit that uses a converter to convert forward conversion input data, which is data of one domain, into forward conversion output data, which is data of the other domain; an inverse conversion unit that uses a converter to apply the inverse conversion to the forward conversion output data, converting it into inverse conversion output data, which is data of the input domain of the forward conversion unit; a state determination unit that uses a state determiner to determine whether the forward conversion output data is appropriate as series data of the domain targeted by the forward conversion output data; a forward-inverse conversion distance measuring unit that uses a distance measurer to measure the distance between the inverse conversion output data and the forward conversion input data; a learning unit that updates the parameters of the converter and the state determination unit according to the results of the state determination unit and the forward-inverse conversion distance measuring unit; a conversion unit that converts the data received by the input unit using the converter learned by the learning unit; and an output unit that outputs the data converted by the conversion unit.

The learning device according to the present invention handles series data of two domains and includes: an input unit that receives series data; a forward conversion unit that uses a converter to convert forward conversion input data, which is data of one domain, into forward conversion output data, which is data of the other domain; an inverse conversion unit that uses a converter to apply the inverse conversion to the forward conversion output data, converting it into inverse conversion output data, which is data of the input domain of the forward conversion unit; a state determination unit that uses a state determiner to determine whether the forward conversion output data is appropriate as series data of the domain targeted by the forward conversion output data; and a forward-inverse conversion distance measuring unit that uses a distance measurer to measure the distance between the inverse conversion output data and the forward conversion input data. The parameters of the converter and the state determination unit are updated according to the results of the state determination unit and the forward-inverse conversion distance measuring unit.

Further, the program according to the present invention is a program for causing a computer to function as each unit of the series data conversion device according to the above invention.

The series data conversion device, learning device, and program of the present invention have the effect of providing a parallel-data-free voice conversion method that does not require parallel data.

A diagram showing the learning process of CycleGAN.
A diagram showing the overall configuration of the data conversion device.
A diagram showing the overall configuration of the data conversion device according to Outline 1.
A diagram showing the overall configuration of the data conversion device according to Outline 2.
A diagram showing the processing routine of the data conversion device during learning.
A diagram showing the processing routine of the data conversion device during conversion.
A diagram showing a comparison of GV for each order of the mel-cepstrum.
A diagram showing a comparison of MS for each modulation frequency.
A diagram showing a comparison of similarity to the source speech and the target speech (S: source, T: target, P: proposed method, B: baseline method).

[Overview]
This document proposes a parallel-data-free series data conversion method. The proposed method has two advantages: it enables series data conversion without parallel data of source and target series data, and it is less prone to the over-smoothing of series data (for example, acoustic parameters) that is often a problem in many conventional series data conversion methods (for example, voice conversion methods). These advantages are realized by using a cycle-consistent adversarial network (CycleGAN). CycleGAN was originally proposed for image style conversion. It simultaneously learns a forward conversion function from source data to target data and an inverse conversion function from target data to source data, which enables the desired conversion without paired source and target data. The proposed method applies CycleGAN to the series data conversion problem and uses the sum of an adversarial loss, a cycle-consistency loss, and an identity-mapping loss as the learning criterion, which makes it possible to learn a function that converts the feature sequence of the source series data into that of the target series data. The cycle-consistency loss measures how closely the inverse conversion of the forward-converted source data matches the original source data, and how closely the forward conversion of the inverse-converted target data matches the original target data. The adversarial loss measures how easily a discriminator can distinguish converted data from real target data; the smaller it is, the closer the probability distribution of the converted data is to that of the real target data. The identity-mapping loss measures how closely the converted data matches the source data. In addition, both the forward and inverse conversion functions are treated as functions from a feature sequence to a feature sequence and are described by gated convolutional neural networks, so that temporal dependencies are reflected in the feature conversion rule. In experiments, the proposed method was applied to a voice conversion task and evaluated. Quantitative evaluation confirmed that speech converted by the proposed method has global variance (GV) and modulation spectra (MS) close to those of real target speech. Subjective evaluation confirmed that the method achieves naturalness and similarity to the target speaker equal to or better than those of a voice conversion method that uses parallel data.

The proposed method has three advantages: (1) it requires no extra data such as text labels or reference speech and no extra modules such as speech recognition; (2) it is less prone to the over-smoothing of acoustic parameters that is often a problem in many conventional voice conversion methods; and (3) it can perform conversion that captures the time-frequency structure of the source and target speech. These advantages are realized by using a cycle-consistent adversarial network (CycleGAN), which is also known under the names Disco-GAN and DualGAN. CycleGAN was originally proposed for image style conversion. It simultaneously learns a forward conversion function from source data to target data and an inverse conversion function from target data to source data, which enables the desired conversion without paired source and target data. The proposed method applies CycleGAN to the voice conversion problem and uses the sum of an adversarial loss, a cycle-consistency loss, and an identity-mapping loss as the learning criterion, which makes it possible to learn a function that converts the speech features of the source speech into those of the target speech. The cycle-consistency loss measures how closely the inverse conversion of the forward-converted source data matches the original source data, and how closely the forward conversion of the inverse-converted target data matches the original target data. The adversarial loss measures how easily a discriminator can distinguish converted data from real target data; the smaller it is, the closer the probability distribution of the converted data is to that of the real target data. The identity-mapping loss measures how closely the converted data matches the source data. In addition, both the forward and inverse conversion functions are treated as functions from a feature sequence to a feature sequence and are described by gated convolutional neural networks, so that temporal dependencies are reflected in the feature conversion rule. The discussion above focuses on voice conversion as a representative example of series data conversion, but the same issues arise in more general series data conversion (for example, musical style conversion and text conversion). For these tasks as well, the advantages of the proposed method can be exploited: (1) no extra data or modules are required; (2) over-smoothing of the feature sequence is unlikely to occur; and (3) conversion that captures the sequential and hierarchical structure of the source and target series data is possible.

[Chapter 2: Related Research]
This chapter reviews related work on voice conversion, a representative task of converting series data into series data. As described above, many conventional voice conversion methods assume the use of parallel data, but several methods that do not necessarily require parallel data have recently been proposed. One example uses speech recognition. This approach constructs parallel data by pairing the speech features of time frames that are recognized as the same phoneme in the source and target speech. It assumes that speech recognition can be performed with very high accuracy, which may require a large speech corpus to train the recognizer itself and can therefore be a drawback in some usage scenarios. Another example uses speaker adaptation techniques. This approach does not need parallel data of the source and target speech, but it does need parallel data of reference speech to learn the speaker space. More recently, methods that require no extra data (such as text labels or reference speech), no extra modules (such as speech recognition), and no parallel data at all have also been studied. These methods assume that both the source and target speech lie in a low-dimensional embedding space, which makes it difficult to model the fine details of speech spectrograms. In contrast, the proposed method directly learns the mapping from source series data to target series data. This property is particularly beneficial in tasks such as voice conversion, where the realism of fine details and detailed structure in the converted data is important.

[Chapter 3: Mode for carrying out the invention]
Hereinafter, embodiments of the present invention will be described. First, the principle of the series data conversion device of the present invention is described.

3. Parallel-data-free series data conversion using CycleGAN

The goal of this work is to learn a conversion function from series data x ∈ X of domain X to series data y ∈ Y of domain Y without requiring parallel data. This problem is solved on the basis of CycleGAN (Non-Patent Document 14). Section 3.1 first outlines CycleGAN. The original CycleGAN paper dealt with image data, whereas this work targets series data such as speech. Section 3.2 describes the key modifications for handling series data, that is, the proposed parallel-data-free series data conversion method.

3.1 CycleGAN

In CycleGAN, the conversion function $G_{X \to Y}$ is learned using two loss functions, an adversarial loss and a cycle-consistency loss. The learning process is shown in FIG. 1, where (a) shows the criterion that measures how closely the inverse conversion of the forward-converted source data matches the original source data, and (b) shows the criterion that measures how closely the forward conversion of the inverse-converted target data matches the original target data.

Adversarial loss: The adversarial loss measures how plausible the converted data $G_{X \to Y}(x)$ is as data $y$ of the target domain. Its value becomes smaller as the distribution of the converted data, $P_{G_{X \to Y}}(x)$, approaches the data distribution $P_{Data}(y)$ of the target domain. When the generative adversarial network (GAN) (Non-Patent Document 4) is used to formulate the adversarial loss, the objective function is as follows.

$$\mathcal{L}_{adv}(G_{X \to Y}, D_Y) = \mathbb{E}_{y \sim P_{Data}(y)}\left[\log D_Y(y)\right] + \mathbb{E}_{x \sim P_{Data}(x)}\left[\log\left(1 - D_Y(G_{X \to Y}(x))\right)\right] \tag{1}$$
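As a rough illustration of how a GAN-style adversarial loss behaves, the following Python sketch estimates the standard cross-entropy GAN objective, E[log D_Y(y)] + E[log(1 - D_Y(G_{X→Y}(x)))], from lists of discriminator outputs. The function name `adversarial_loss`, the `eps` stabilizer, and the toy probabilities are assumptions made for illustration; they are not taken from the patent.

```python
import math


def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Sample estimate of the cross-entropy GAN objective.

    d_real: discriminator outputs D_Y(y) in (0, 1) for real target-domain data.
    d_fake: discriminator outputs D_Y(G_{X->Y}(x)) for converted data.
    eps:    small constant (an assumption here) that avoids log(0).
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Discriminator that separates real data from converted data well.
confident = adversarial_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])

# Discriminator that cannot tell them apart (outputs 0.5 everywhere).
fooled = adversarial_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

# The discriminator tries to push the value up, the generator tries to push it down.
print(confident, fooled)
```

In this toy setting the value is higher when the discriminator separates the two kinds of data and lower when it is fooled, which matches the roles of maximizing discriminator and minimizing generator.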

Here, by minimizing this objective function, the generator $G_{X \to Y}$ learns to generate data that the discriminator $D_Y$ cannot distinguish from data $y$ of the target domain. The discriminator $D_Y$, on the other hand, maximizes this objective function so as not to be fooled by $G_{X \to Y}$. Although GAN is used here to formulate the adversarial loss, any extension of GAN can be used instead, such as least squares GAN (LSGAN) (Non-Patent Document 7) or Wasserstein GAN (WGAN) (Non-Patent Document 1). For example, when LSGAN is used, the cross-entropy in Eq. (1) is replaced by a least-squares loss. Whereas GAN brings the generated data distribution closer to the true data distribution under the Jensen-Shannon divergence, WGAN does so under the Earth Mover's distance.
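The least-squares variant can be sketched in Python as follows. This is a minimal sketch that assumes the common 0/1 target coding of LSGAN (0 for converted data, 1 for real data); the function names and sample values are hypothetical and only illustrate how a least-squares loss takes the place of the cross-entropy.

```python
def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares discriminator loss: push D(real) to 1 and D(converted) to 0."""
    real = sum((p - 1.0) ** 2 for p in d_real) / len(d_real)
    fake = sum(p ** 2 for p in d_fake) / len(d_fake)
    return 0.5 * (real + fake)


def lsgan_generator_loss(d_fake):
    """Least-squares generator loss: push D(converted) to 1."""
    return 0.5 * sum((p - 1.0) ** 2 for p in d_fake) / len(d_fake)


# Hypothetical discriminator scores for real and converted samples.
d_loss = lsgan_discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.2, 0.1])
g_loss = lsgan_generator_loss(d_fake=[0.2, 0.1])
print(d_loss, g_loss)
```

Unlike the cross-entropy form, both networks minimize their own loss here, and the penalty grows quadratically with the distance from the target score.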

Cycle-consistency loss: The adversarial loss alone only constrains $G_{X \to Y}(x)$ to follow the data distribution of the target domain, so it does not guarantee that contextual information is preserved between $x$ and $G_{X \to Y}(x)$. CycleGAN addresses this problem by adding two further constraints. The first is the adversarial loss for the inverse conversion $G_{Y \to X}$, that is, $\mathcal{L}_{adv}(G_{Y \to X}, D_X)$. The second is the cycle-consistency loss, which is given below.

$$\mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P_{Data}(x)}\left[\left\| G_{Y \to X}(G_{X \to Y}(x)) - x \right\|_1\right] + \mathbb{E}_{y \sim P_{Data}(y)}\left[\left\| G_{X \to Y}(G_{Y \to X}(y)) - y \right\|_1\right]$$
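To make the round-trip idea concrete, here is a minimal Python sketch of a cycle-consistency loss with an L1 distance: each x is converted forward and back and compared with itself, and likewise for each y. The converters `shift_up`, `shift_down`, and `halve` are hypothetical toy stand-ins for the learned functions G_{X→Y} and G_{Y→X}, not the networks of the patent.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length sequences."""
    return sum(abs(u - v) for u, v in zip(a, b))


def cycle_consistency_loss(xs, ys, g_xy, g_yx):
    """Average L1 error of the X -> Y -> X and Y -> X -> Y round trips."""
    forward = sum(l1_distance(g_yx(g_xy(x)), x) for x in xs) / len(xs)
    backward = sum(l1_distance(g_xy(g_yx(y)), y) for y in ys) / len(ys)
    return forward + backward


# Hypothetical toy converters acting on sequences of scalars.
def shift_up(seq):
    return [v + 1.0 for v in seq]


def shift_down(seq):
    return [v - 1.0 for v in seq]


def halve(seq):
    return [0.5 * v for v in seq]


xs = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
ys = [[1.0, 2.0, 3.0]]

# shift_down exactly undoes shift_up, so the round trips reproduce the inputs.
consistent = cycle_consistency_loss(xs, ys, shift_up, shift_down)

# halve does not undo shift_up, so the round trips drift away from the inputs.
inconsistent = cycle_consistency_loss(xs, ys, shift_up, halve)
print(consistent, inconsistent)
```

A pair of converters that are mutual inverses gives zero loss, whereas a mismatched pair is penalized, which is the sense in which this term discourages conversions that discard the content of the input.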

The above equation uses the L1 distance to measure the distance between two pieces of data, but any distance measure can be used, such as the L2 distance or the Kullback-Leibler divergence. Alternatively, an arbitrary feature extractor may be prepared and the distance measured between the features it extracts. The feature extractor can be built, for example, with a neural network. For instance, the discriminator described above can serve as the feature extractor, in which case the distance is measured in the feature space inside the discriminator.
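The interchangeable distance measures can be sketched as below. The extractor `deltas` (frame-to-frame differences) is a hypothetical hand-written stand-in for a learned feature extractor such as a discriminator's hidden layer; it is chosen only to show that a distance in feature space can ignore differences that a raw L1 or L2 distance would penalize.

```python
import math


def l1(a, b):
    """L1 distance between two equal-length sequences."""
    return sum(abs(u - v) for u, v in zip(a, b))


def l2(a, b):
    """L2 (Euclidean) distance between two equal-length sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def feature_distance(a, b, extractor, base=l1):
    """Distance measured between extracted features rather than raw data."""
    return base(extractor(a), extractor(b))


def deltas(seq):
    """Hypothetical feature extractor: differences between neighbouring frames."""
    return [seq[i + 1] - seq[i] for i in range(len(seq) - 1)]


a = [0.0, 1.0, 3.0]
b = [1.0, 2.0, 4.0]  # same shape as a, shifted by a constant offset

print(l1(a, b), l2(a, b), feature_distance(a, b, deltas))
```

Here the two sequences differ by a constant offset, so the raw distances are non-zero while the distance between their delta features is zero.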

These additional terms encourage $G_{X \to Y}$ and $G_{Y \to X}$ to find, in a pseudo manner, pairs $(x, y)$ that share similar contextual information from among the many possible conversion candidates.

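The full objective function is written using a trade-off parameter $\lambda_{cyc}$ as follows.

$$\mathcal{L}_{full} = \mathcal{L}_{adv}(G_{X \to Y}, D_Y) + \mathcal{L}_{adv}(G_{Y \to X}, D_X) + \lambda_{cyc} \, \mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X})$$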
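The role of the trade-off parameter can be illustrated with a few lines of Python that combine the two adversarial terms and the weighted cycle-consistency term. The loss values and the two settings of `lambda_cyc` below are hypothetical numbers chosen for illustration; the patent text here does not state a specific value.

```python
def full_objective(adv_xy, adv_yx, cyc, lambda_cyc):
    """Sum of both adversarial terms plus the weighted cycle-consistency term."""
    return adv_xy + adv_yx + lambda_cyc * cyc


# Hypothetical loss values measured on one mini-batch.
adv_xy, adv_yx, cyc = -1.2, -1.3, 0.4

loose = full_objective(adv_xy, adv_yx, cyc, lambda_cyc=1.0)
strict = full_objective(adv_xy, adv_yx, cyc, lambda_cyc=10.0)

# A larger lambda_cyc makes round-trip errors dominate the objective.
print(loose, strict)
```

With a small weight the adversarial terms dominate, and with a large weight the converters are pushed more strongly toward reproducing their inputs after a round trip.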

3.2 CycleGAN for parallel-data-free series data conversion

To apply CycleGAN to parallel-data-free series data conversion, this work proposes two modifications. The first is modeling of series data with a gated CNN (Non-Patent Document 3), and the second is preservation of linguistic information with an identity-mapping loss (Non-Patent Document 9). Although the present invention is described mainly using voice conversion as an example, note that the proposed method is effective for series data in general and is not limited to speech data.

Gated CNN: Series data has two characteristic properties: a sequential structure and a hierarchical structure. Speech data, for example, has sequential and hierarchical structures such as voiced and unvoiced segments, and phonemes and morphemes. When a neural network is used to capture such structures, the network architecture is one of the keys. This work therefore proposes introducing into CycleGAN a model that can represent sequential and hierarchical relationships. Specifically, a gated CNN is used. RNNs (such as LSTMs) could also be used, but because their recurrent structure makes parallelization difficult and computation costly, a gated CNN is used here. What matters is using a model that can capture sequential and hierarchical structures, so other models may also be used, such as the quasi-RNN (Non-Patent Document 2), a recently proposed hybrid of CNN and RNN.

In the original paper (Non-Patent Document 3), the Gated CNN achieved state-of-the-art performance in language modeling, and it has recently also been shown to be effective in speech modeling (Non-Patent Document 5). A Gated CNN uses gated linear units (GLUs) as its activation function. The output $H_{l+1}$ of layer $(l + 1)$ is computed from the output $H_l$ of layer $l$ and the model parameters $W_l$, $V_l$, $b_l$, and $c_l$ by the following equation.

$$H_{l+1} = (H_l * W_l + b_l) \otimes \sigma(H_l * V_l + c_l)$$

Here, $\otimes$ is the element-wise product and $\sigma$ is the sigmoid function. With this gating mechanism, when information is propagated through the network, it can be propagated selectively depending on the information in the previous layer.
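The gating can be illustrated with a minimal NumPy sketch. It assumes a 1-D convolution along the time axis with zero padding so that the sequence length is preserved; the function names `conv1d` and `glu_layer`, the array shapes, and the padding scheme are illustrative assumptions and are not taken from the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d(h, w, b):
    """Length-preserving 1-D convolution over time (cross-correlation form).

    h: (T, C_in) input sequence
    w: (K, C_in, C_out) kernel with odd K (assumed layout)
    b: (C_out,) bias
    """
    k = w.shape[0]
    pad = k // 2
    hp = np.pad(h, ((pad, pad), (0, 0)))  # zero-pad the time axis only
    out = np.empty((h.shape[0], w.shape[2]))
    for t in range(h.shape[0]):
        # Contract the (K, C_in) window against the kernel.
        out[t] = np.tensordot(hp[t:t + k], w, axes=([0, 1], [0, 1])) + b
    return out

def glu_layer(h, w, b, v, c):
    """Gated linear unit: (h * w + b) multiplied element-wise by sigmoid(h * v + c)."""
    linear = conv1d(h, w, b)         # content path
    gate = sigmoid(conv1d(h, v, c))  # values in (0, 1) that scale the content
    return linear * gate

# Example: a sequence of 5 frames with 3 channels mapped to 4 channels.
rng = np.random.default_rng(0)
h = rng.standard_normal((5, 3))
w = rng.standard_normal((3, 3, 4))
v = rng.standard_normal((3, 3, 4))
out = glu_layer(h, w, np.zeros(4), v, np.zeros(4))
```

When the gate path outputs strongly negative values the sigmoid approaches 0 and the corresponding content is suppressed, which is the selective propagation described above.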

Identity-mapping loss: When converting sequence data, preserving semantic information is another important requirement. In voice conversion, for example, what should be converted is the speaker identity, while the utterance content (linguistic information) must be preserved. As described above, the cycle-consistency loss in CycleGAN contributes to preserving context information, but it is only a loose constraint that requires the data to return to the original after a forward conversion followed by an inverse conversion, and it is not sufficient for preserving linguistic information. To solve this problem without requiring an external module such as a speech recognizer, this work proposes using an identity-mapping loss (Non-Patent Document 9). The identity-mapping loss is expressed by the following equation.

$$L_{id}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{y \sim P_{Data}(y)}\left[\lVert G_{X \to Y}(y) - y \rVert_1\right] + \mathbb{E}_{x \sim P_{Data}(x)}\left[\lVert G_{Y \to X}(x) - x \rVert_1\right]$$

This loss function imposes a constraint so that the composition of the data is preserved between input and output. In practice, a trade-off parameter $\lambda_{id}$ is introduced, and the weighted loss function $\lambda_{id} L_{id}(G_{X \to Y}, G_{Y \to X})$ is used together with equation (3).

The above equation uses the L1 distance to measure the distance between two pieces of data, but any distance measure may be used, for example the L2 distance or the Kullback-Leibler divergence. Alternatively, an arbitrary feature extractor may be prepared and the distance measured between the features it extracts. The feature extractor can be implemented with, for example, a neural network. For instance, the discriminator described above can serve as the feature extractor, with the distance measured in the feature space inside the discriminator. Note that the identity-mapping loss is a constraint that guides the direction of learning; instead of being applied throughout training, it may be applied only in the initial stage of training.
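The identity-mapping loss with a pluggable distance can be sketched as follows. This is a minimal illustration in which the converters are plain Python callables; the names `l1_distance` and `identity_mapping_loss` are hypothetical, and the expectations are approximated by averaging over small batches.

```python
import numpy as np

def l1_distance(a, b):
    """L1 distance between two feature sequences of the same shape."""
    return float(np.sum(np.abs(a - b)))

def identity_mapping_loss(g_xy, g_yx, x_batch, y_batch, distance=l1_distance):
    """Average of d(G_XY(y), y) over y_batch plus average of d(G_YX(x), x) over x_batch.

    g_xy, g_yx: callables standing in for the converters G_{X→Y} and G_{Y→X}.
    distance:   any distance measure; L1 is only the default.
    """
    # A converter fed data that is already in its target domain should leave it unchanged.
    loss_y = np.mean([distance(g_xy(y), y) for y in y_batch])
    loss_x = np.mean([distance(g_yx(x), x) for x in x_batch])
    return float(loss_y + loss_x)
```

Passing a different callable as `distance`, for example an L2 distance or a distance computed on features from a separate extractor, corresponds to the alternatives listed above.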

4. Overall configuration and processing flows

4.1
The overall configuration is shown in FIG. 2, and each unit is described below.

Functionally, the data conversion device comprises an input unit 100, a control unit 200, and an output unit 300.

The input unit 100 receives data belonging to data group X and data belonging to data group Y.

Specifically, it receives data x ∈ X belonging to data group X and data y ∈ Y belonging to data group Y.

The control unit 200 comprises a forward conversion unit 210, a state determination unit 220, an inverse conversion unit 230, a forward-inverse conversion distance measurement unit 240, a self-conversion unit 250, a self-conversion distance measurement unit 260, a neural network storage unit 270, a learning unit 280, and a conversion unit 290.

The forward conversion unit 210 converts input data of data group X into data of converted data group XY using the converter G_{X→Y}. The forward conversion unit 210 also converts input data of data group Y into data of converted data group YX using the converter G_{Y→X}.

Specifically, the forward conversion unit 210 converts a data sample x of data group X into data G_{X→Y}(x) of converted data group XY using the converter G_{X→Y} stored in the neural network storage unit 270. The forward conversion unit 210 also converts a data sample y of data group Y into data G_{Y→X}(y) of converted data group YX using the converter G_{Y→X} stored in the neural network storage unit 270.

The state determination unit 220 performs state determination on both the data of converted data group XY obtained by the forward conversion unit 210 and the input data y, using the state determiner D_Y. The state determination unit 220 also performs state determination on both the data of converted data group YX obtained by the forward conversion unit 210 and the input data x, using the state determiner D_X.

Specifically, using the state determiner D_Y for data group Y stored in the neural network storage unit 270, the state determination unit 220 determines the state of the data G_{X→Y}(x) of converted data group XY and of the input data y, and passes the respective results D_Y(G_{X→Y}(x)) and D_Y(y) to the learning unit 280. Likewise, using the state determiner D_X for data group X stored in the neural network storage unit 270, the state determination unit 220 determines the state of the data G_{Y→X}(y) of converted data group YX and of the input data x, and passes the respective results D_X(G_{Y→X}(y)) and D_X(x) to the learning unit 280.

The inverse conversion unit 230 converts the data of converted data group XY obtained by the forward conversion unit 210 into data of converted data group XYX using the converter G_{Y→X}. The inverse conversion unit 230 also converts the data of converted data group YX obtained by the forward conversion unit 210 into data of converted data group YXY using the converter G_{X→Y}.

Specifically, the inverse conversion unit 230 converts the data G_{X→Y}(x) of converted data group XY into data G_{Y→X}(G_{X→Y}(x)) of converted data group XYX using the converter G_{Y→X} stored in the neural network storage unit 270. The inverse conversion unit 230 also converts the data G_{Y→X}(y) of converted data group YX into data G_{X→Y}(G_{Y→X}(y)) of converted data group YXY using the converter G_{X→Y} stored in the neural network storage unit 270.

The forward-inverse conversion distance measurement unit 240 measures, with the distance measurer M₁, the distance between the input data of data group X and the data of converted data group XYX obtained by the inverse conversion unit 230. It also measures, with the distance measurer M₁, the distance between the input data of data group Y and the data of converted data group YXY obtained by the inverse conversion unit 230.

Specifically, the forward-inverse conversion distance measurement unit 240 measures, with the distance measurer M₁, the distance between the input data x of data group X and the data G_{Y→X}(G_{X→Y}(x)) of converted data group XYX obtained by the inverse conversion unit 230, and passes the measurement result M₁(x, G_{Y→X}(G_{X→Y}(x))) to the learning unit 280. It also measures, with the distance measurer M₁, the distance between the input data y of data group Y and the data G_{X→Y}(G_{Y→X}(y)) of converted data group YXY obtained by the inverse conversion unit 230, and passes the measurement result M₁(y, G_{X→Y}(G_{Y→X}(y))) to the learning unit 280.

The distance criterion of the distance measurer M₁ is, for example, the L1 distance, the L2 distance, or a distance in the feature space of a neural network. When a neural network is used, features are extracted with the neural network stored in the neural network storage unit 270 as a feature extractor, and the distance is measured between those features.
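The forward-inverse distance measurement can be sketched as below. This is an illustration only: `cycle_distance` is a hypothetical name, the converters and the optional feature extractor are plain callables, and only the L1 and L2 criteria are covered.

```python
import numpy as np

def cycle_distance(x, g_forward, g_backward, feature=None, order=1):
    """Distance between x and its reconstruction g_backward(g_forward(x)).

    feature: optional callable standing in for a stored feature extractor;
             when given, the distance is measured between extracted features.
    order:   1 for the L1 distance, 2 for the L2 distance.
    """
    reconstructed = g_backward(g_forward(x))
    if feature is not None:
        x, reconstructed = feature(x), feature(reconstructed)
    diff = np.asarray(x) - np.asarray(reconstructed)
    if order == 1:
        return float(np.sum(np.abs(diff)))
    if order == 2:
        return float(np.sqrt(np.sum(diff ** 2)))
    raise ValueError("order must be 1 or 2")
```

Calling it once with (x, G_{X→Y}, G_{Y→X}) and once with (y, G_{Y→X}, G_{X→Y}) yields the two results that unit 240 passes to the learning unit 280.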

The self-conversion unit 250 converts input data of data group Y into data of converted data group YY using the converter G_{X→Y}. The self-conversion unit 250 also converts input data of data group X into data of converted data group XX using the converter G_{Y→X}.

Specifically, the self-conversion unit 250 converts the input data y of data group Y into data G_{X→Y}(y) of converted data group YY using the converter G_{X→Y} stored in the neural network storage unit 270. The self-conversion unit 250 also converts the input data x of data group X into data G_{Y→X}(x) of converted data group XX using the converter G_{Y→X} stored in the neural network storage unit 270.

The self-conversion distance measurement unit 260 measures, with the distance measurer M₂, the distance between the input data of data group Y and the data of converted data group YY obtained by the self-conversion unit 250. It also measures, with the distance measurer M₂, the distance between the input data of data group X and the data of converted data group XX obtained by the self-conversion unit 250.

Specifically, the self-conversion distance measurement unit 260 measures, with the distance measurer M₂, the distance between the input data y of data group Y and the data G_{X→Y}(y) of converted data group YY obtained by the self-conversion unit 250, and passes the measurement result M₂(y, G_{X→Y}(y)) to the learning unit 280. It also measures, with the distance measurer M₂, the distance between the input data x of data group X and the data G_{Y→X}(x) of converted data group XX obtained by the self-conversion unit 250, and passes the measurement result M₂(x, G_{Y→X}(x)) to the learning unit 280.

The distance criterion of the distance measurer M₂ is, for example, the L1 distance, the L2 distance, or a distance in the feature space of a neural network. When a neural network is used, features are extracted with the neural network stored in the neural network storage unit 270 as a feature extractor, and the distance is measured between those features.

The neural network storage unit 270 stores the neural networks serving as converters and the neural networks serving as state determiners. When the forward-inverse conversion distance measurement unit 240 or the self-conversion distance measurement unit 260 uses a distance in the feature space of a neural network, the storage unit also stores the neural network serving as the feature extractor.

The neural networks serving as converters and as state determiners are networks that can represent time-series structure and hierarchical structure, for example a Gated CNN or an LSTM.

When the forward-inverse conversion distance measurement unit 240 or the self-conversion distance measurement unit 260 uses a distance in the feature space of a neural network, the neural network serving as the feature extractor is likewise one that can represent time-series structure and hierarchical structure, for example a Gated CNN or an LSTM.

The learning unit 280 trains the neural networks serving as converters so that, among the results from the state determination unit 220, the state determination result for the data of converted data group XY becomes close to that for the input data y, the state determination result for the data of converted data group YX becomes close to that for the input data x, the distance measured by the forward-inverse conversion distance measurement unit 240 is minimized, and the distance measured by the self-conversion distance measurement unit 260 is minimized. It also trains the neural networks serving as state determiners so that the difference between the state determination result for the data of converted data group XY and that for the input data y becomes clear, and so that the difference between the state determination result for the data of converted data group YX and that for the input data x becomes clear.

Specifically, the learning unit 280 trains the converter neural networks G_{X→Y} and G_{Y→X} so that the values of D_Y(G_{X→Y}(x)) and D_Y(y) obtained by the state determination unit 220 become close, the values of D_X(G_{Y→X}(y)) and D_X(x) obtained by the state determination unit 220 become close, the distances M₁(x, G_{Y→X}(G_{X→Y}(x))) and M₁(y, G_{X→Y}(G_{Y→X}(y))) measured by the forward-inverse conversion distance measurement unit 240 are minimized, and the distances M₂(y, G_{X→Y}(y)) and M₂(x, G_{Y→X}(x)) measured by the self-conversion distance measurement unit 260 are minimized.

More specifically, as an objective function in the learning unit 280 for making the values of D_Y(G_{X→Y}(x)) and D_Y(y) obtained by the state determination unit 220 equal, consider for example a state determiner D_Y that outputs probability p when given input data y and probability 1-p when given converted data G_{X→Y}(x); in that case, L_adv(G_{X→Y}, D_Y) (equation (1) of the present invention) is minimized. Similarly, as an objective function for making the values of D_X(G_{Y→X}(y)) and D_X(x) obtained by the state determination unit 220 equal, consider for example a state determiner D_X that outputs probability p when given input data x and probability 1-p when given converted data G_{Y→X}(y); in that case, L_adv(G_{Y→X}, D_X) is minimized.

In the learning unit 280, the constraint of minimizing the distance measured by the self-conversion distance measurement unit 260 may be used only to stabilize training in its initial stage, and may be dropped once training has stabilized.

This constraint plays an auxiliary role in training; if training is stable without it, it need not be used at all.
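One simple way to apply the self-conversion constraint only in the initial stage is to switch its weight off after a fixed number of iterations. The sketch below assumes such a step schedule; the function name `identity_weight` and the default values of `lambda_id` and `warmup_iterations` are hypothetical placeholders, not values stated in this description.

```python
def identity_weight(iteration, lambda_id=5.0, warmup_iterations=10_000):
    """Weight of the self-conversion (identity-mapping) term at a given iteration.

    lambda_id and warmup_iterations are placeholder values chosen for illustration.
    """
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    # Full weight during the initial stage, then the term is switched off.
    return lambda_id if iteration < warmup_iterations else 0.0
```

Setting `warmup_iterations=0` corresponds to never using the constraint, which the text allows when training is already stable.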

The learning unit 280 then trains the state determiner neural networks D_Y and D_X so that the difference between D_Y(G_{X→Y}(x)) and D_Y(y) obtained by the state determination unit 220 becomes clear, and so that the difference between D_X(G_{Y→X}(y)) and D_X(x) obtained by the state determination unit 220 becomes clear.

Specifically, as an objective function in the learning unit 280 for making the difference between D_Y(G_{X→Y}(x)) and D_Y(y) clear, consider for example a state determiner D_Y that outputs probability p when given input data y and probability 1-p when given converted data G_{X→Y}(x); in that case, L_adv(G_{X→Y}, D_Y) (equation (1) of the present invention) is maximized. Similarly, as an objective function for making the difference between D_X(G_{Y→X}(y)) and D_X(x) clear, consider for example a state determiner D_X that outputs probability p when given input data x and probability 1-p when given converted data G_{Y→X}(y); in that case, L_adv(G_{Y→X}, D_X) is maximized.

In [Equation 1] of the present invention, cross entropy is used in the objective function, but a Euclidean distance, the Earth Mover's distance, or a distance based on an energy function may be used instead.
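The opposing roles of the converter and the state determiner can be sketched numerically. The code assumes that equation (1) has the standard cross-entropy adversarial form, the mean of log D(real) plus the mean of log(1 - D(converted)); that form and the function name `adversarial_loss` are assumptions made for illustration.

```python
import numpy as np

def adversarial_loss(d_real, d_converted, eps=1e-12):
    """Cross-entropy adversarial objective (assumed form of equation (1)).

    d_real:      state determiner outputs in (0, 1) for real data of the target domain
    d_converted: state determiner outputs in (0, 1) for converted data
    """
    d_real = np.clip(d_real, eps, 1.0 - eps)            # avoid log(0)
    d_converted = np.clip(d_converted, eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_converted)))

# The state determiner is trained to raise this value (it separates real from converted),
# while the converter is trained to lower it (its outputs get scored like real data).
separating = adversarial_loss([0.99], [0.01])
undecided = adversarial_loss([0.5], [0.5])
fooled = adversarial_loss([0.5], [0.99])
```

In this toy comparison `separating` is the largest value and `fooled` the smallest, matching the maximization by the state determiner and the minimization by the converter described above.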

The learning unit 280 then passes the training results to the neural network storage unit 270.

The conversion unit 290 converts input data to be converted, using the converter trained by the learning unit 280.

Specifically, when the input unit 100 receives data x of data group X as input data, the conversion unit 290 obtains the neural network serving as the converter G_{X→Y} from the neural network storage unit 270. The conversion unit 290 then uses the neural network of the converter G_{X→Y} to convert the data x into converted data G_{X→Y}(x). Similarly, when the input unit 100 receives data y of data group Y as input data, the conversion unit 290 obtains the neural network serving as the converter G_{Y→X} from the neural network storage unit 270. The conversion unit 290 then uses the neural network of the converter G_{Y→X} to convert the data y into converted data G_{Y→X}(y).

The output unit 300 outputs the converted data produced by the conversion unit 290.

Specifically, when the input unit 100 receives data x of data group X as input data, the output unit 300 outputs the converted data G_{X→Y}(x) produced by the conversion unit 290. Similarly, when the input unit 100 receives data y of data group Y as input data, the output unit 300 outputs the converted data G_{Y→X}(y) produced by the conversion unit 290.

Embodiments are described below.

[Summary 1]
A sequence data conversion device for sequence data of two domains, comprising:
an input unit that receives the sequence data;
a forward conversion unit that uses a converter to convert data of one domain (forward conversion input data) into data of the other domain (forward conversion output data);
an inverse conversion unit that uses a converter to apply the reverse conversion to the forward conversion output data, converting it into data of the input domain of the forward conversion unit (inverse conversion output data);
a state determination unit that uses a state determiner to determine whether the forward conversion output data is appropriate as sequence data of the domain targeted by the forward conversion output data;
a forward-inverse conversion distance measurement unit that uses a distance measurer to measure the distance between the inverse conversion output data and the forward conversion input data;
a learning unit that updates the parameters of the converter and of the state determination unit according to the results of the state determination unit and the forward-inverse conversion distance measurement unit;
a conversion unit that converts data received by the input unit, using the converter trained by the learning unit; and
an output unit that outputs the data converted by the conversion unit.

[Summary 2]
The sequence data conversion device described above, further comprising:
a self-conversion unit that takes data of the domain that is the conversion target of the converter of the forward conversion unit (self-conversion input data) and converts it with that converter to obtain data (self-conversion output data); and
a self-conversion distance measurement unit that measures the distance between the self-conversion input data and the self-conversion output data.

4.2
The processing routine during training is shown in FIG. 5, and each step is described below.

1. When data of data group X and data of data group Y are input to the input unit 100, the data conversion device executes the training processing flow.

2. First, in step S100, the forward conversion unit 210 and the self-conversion unit 250 obtain data of data group X and data of data group Y from the input unit 100.

3. Specifically, the input unit 100 passes data x ∈ X randomly selected from data group X and data y ∈ Y randomly selected from data group Y to the forward conversion unit 210 and the self-conversion unit 250. When the data are selected at random, x and y need not correspond to each other. For voice data, for example, x and y need not have the same utterance content.

4. In step S110, the forward conversion unit 210 converts x into G_{X→Y}(x) using the converter G_{X→Y}. The forward conversion unit 210 also converts y into G_{Y→X}(y) using the converter G_{Y→X}.

5. In step S120, the state determination unit 220 uses the state determiner D_Y to obtain the state determination result D_Y(G_{X→Y}(x)) for G_{X→Y}(x) and the state determination result D_Y(y) for y. The state determination unit 220 also uses the state determiner D_X to obtain the state determination result D_X(G_{Y→X}(y)) for G_{Y→X}(y) and the state determination result D_X(x) for x.

6. In step S130, the inverse conversion unit 230 converts G_{X→Y}(x) into G_{Y→X}(G_{X→Y}(x)) using the converter G_{Y→X}. The inverse conversion unit 230 also converts G_{Y→X}(y) into G_{X→Y}(G_{Y→X}(y)) using the converter G_{X→Y}.

7. In step S140, the forward-inverse conversion distance measurement unit 240 uses the distance measurer M₁ to measure the distance M₁(x, G_{Y→X}(G_{X→Y}(x))) between x and G_{Y→X}(G_{X→Y}(x)). It also uses the distance measurer M₁ to measure the distance M₁(y, G_{X→Y}(G_{Y→X}(y))) between y and G_{X→Y}(G_{Y→X}(y)).

8. In step S150, the self-conversion unit 250 converts y into G_{X→Y}(y) using the converter G_{X→Y}. The self-conversion unit 250 also converts x into G_{Y→X}(x) using the converter G_{Y→X}.

9. In step S160, the self-conversion distance measurement unit 260 uses the distance measurer M₂ to measure the distance M₂(y, G_{X→Y}(y)) between y and G_{X→Y}(y). It also uses the distance measurer M₂ to measure the distance M₂(x, G_{Y→X}(x)) between x and G_{Y→X}(x).

10.ステップ S１７０において、学習部２８０は、状態判断部２２０によって判断した結果D_Y(G_X→Y(x))とD_Y(y)の値が近くなるように、状態判断部２２０によって判断した結果D_X(G_Y→X(y))とD_X(x)の値が近くなるように、順逆変換距離測定部２４０によって測定された距離M₁(x、G_Y→X (G_X→Y(x)))とM₁(y、G_X→Y (G_Y→X(y)))とを最小化するように、自己変換距離測定部２６０によって測定された距離M₂(y、G_X→Y(y))とM₂(x、G_Y→X(x))とを最小化するように、変換器としてのニューラルネットワークG_X→YとG_Y→Xを学習し、ニューラルネットワーク記憶部２７０に記憶されている、変換器としてのニューラルネットワークG_X→YとG_Y→Xのパラメータを更新する。 10. In step S170, the learning unit 280 determines by the state determination unit 220 so that the values of _DY (G _{X → Y} (x)) and _DY (y) are close to each other as a result of the determination by the state determination unit 220. As a result, the distance M ₁ (x, G _{Y → X} (G _X) measured by the forward / reverse conversion distance measuring unit 240 so that the values of D _X (G _{Y → X} (y)) and D _X (x) are close to each other. _{→ Y} (x))) and _{_{M 1 (y, G X →}} Y (G Y → X (y))) and so as to minimize the distance M ₂ measured by self converting the distance measurement unit 2 6 0 Learn the neural networks G _{X → Y} and G _{Y → X} as converters to minimize (y, G _{X → Y} (y)) and M ₂ (x, G _{Y → X} (x)) and it is stored in the neural network's rating憶部270, updates the parameters of the neural network G _{X → Y} and G _{Y → X} as transducers.

11.また、学習部２８０は、状態判断部２２０によって判断した結果D_Y(G_X→Y(x))とD_Y(y)の差異が明確になるように、状態判断部２２０によって判断した結果D_X(G_Y→X(y))とD_X(x)の差異が明確になるように、状態判断器としてのニューラルネットワークD_YとD_Xとを学習し、ニューラルネットワーク記憶部２７０に記憶されている、状態判断器としてのニューラルネットワークD_YとD_Xのパラメータを更新する。 11. Further, the learning unit 280 made a judgment by the state judgment unit 220 so that the difference between D _Y (G _{X → Y} (x)) and D _Y (y) becomes clear as a result of the judgment by the state judgment unit 220. differences as becomes clear results _{_{D X (G Y → X (}} y)) and D _X (x), to learn the neural network D _Y and D _X as the state determining unit, the neural network's rating憶The parameters of the neural networks D _Y and D _X as the state judgment device stored in the part 270 are updated.

12. In step S180, it is determined whether all the data has been processed.

13. If not all the data has been processed (NO in step S180), the routine returns to step S100 and repeats steps S100 to S170.

14. If all the data has been processed (YES in step S180), the routine ends.
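Steps 9 to 11 can be read as two competing objectives, one for the converters and one for the state determination devices. The following is a minimal sketch of that reading, not the patented implementation. It assumes a mean absolute difference for both M₁ and M₂, squared-error adversarial terms with target scores 1 (real) and 0 (converted), and the weights `lam_cyc` and `lam_id`; none of these choices is fixed by this section, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence

Seq = Sequence[float]
Converter = Callable[[Seq], List[float]]
Judge = Callable[[Seq], float]  # score in [0, 1]; 1 means "looks like real data"


def l1_distance(a: Seq, b: Seq) -> float:
    """Mean absolute difference; stands in for M1 and M2 (an assumption)."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


def converter_loss(x: Seq, y: Seq, g_xy: Converter, g_yx: Converter,
                   d_x: Judge, d_y: Judge,
                   lam_cyc: float = 10.0, lam_id: float = 5.0) -> float:
    """Quantity the converters would minimize in step S170 (assumed weighting)."""
    # Adversarial term: push D_Y(G_X→Y(x)) and D_X(G_Y→X(y)) toward the "real" score 1.
    adv = (d_y(g_xy(x)) - 1.0) ** 2 + (d_x(g_yx(y)) - 1.0) ** 2
    # Forward/reverse term: M1(x, G_Y→X(G_X→Y(x))) + M1(y, G_X→Y(G_Y→X(y))).
    cyc = l1_distance(x, g_yx(g_xy(x))) + l1_distance(y, g_xy(g_yx(y)))
    # Self-conversion term: M2(y, G_X→Y(y)) + M2(x, G_Y→X(x)).
    ident = l1_distance(y, g_xy(y)) + l1_distance(x, g_yx(x))
    return adv + lam_cyc * cyc + lam_id * ident


def judge_loss(x: Seq, y: Seq, g_xy: Converter, g_yx: Converter,
               d_x: Judge, d_y: Judge) -> float:
    """Quantity the state determination devices would minimize (step 11)."""
    # Real data should score 1, converted data should score 0.
    real = (d_y(y) - 1.0) ** 2 + (d_x(x) - 1.0) ** 2
    fake = d_y(g_xy(x)) ** 2 + d_x(g_yx(y)) ** 2
    return real + fake
```

In an actual training loop the two losses would be minimized alternately with a gradient-based optimizer over the network parameters; the sketch only shows which measured quantities enter each objective.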

4.3 Processing routine at conversion time

The processing routine at conversion time is shown in FIG. 6, and each step is described below.

1. When data x ∈ X or data y ∈ Y to be converted is input to the input unit 100, the data conversion device executes the data conversion processing flow. The case where x ∈ X is input is described here; the processing is the same when y ∈ Y is input.

2. In step S200, the conversion unit 290 acquires the input data x to be converted from the input unit 100.

3. In step S210, the conversion unit 290 acquires, from the neural network storage unit 270, the neural network of the converter G_X→Y learned by the learning unit 280.

4. In step S220, the conversion unit 290 uses the converter G_X→Y to convert the input data x into G_X→Y(x).

5. In step S230, the output unit 300 outputs the converted data G_X→Y(x) obtained when the conversion unit 290 converted the data x.
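The conversion-time routine above amounts to looking up the learned converter for the input's domain and applying it once. The sketch below illustrates that flow under stated assumptions: `NetworkStore` is a hypothetical stand-in for the neural network storage unit 270, converters are plain callables rather than trained networks, and the direction keys `"X->Y"` and `"Y->X"` are invented for illustration.

```python
from typing import Callable, Dict, List

Converter = Callable[[List[float]], List[float]]


class NetworkStore:
    """Hypothetical stand-in for the neural network storage unit 270."""

    def __init__(self) -> None:
        self._converters: Dict[str, Converter] = {}

    def save(self, direction: str, converter: Converter) -> None:
        self._converters[direction] = converter

    def load(self, direction: str) -> Converter:
        return self._converters[direction]


def convert(data: List[float], source_domain: str, store: NetworkStore) -> List[float]:
    """Mirror steps S200 to S230 for input from domain "X" or "Y"."""
    # S210: fetch the learned converter that matches the input's domain.
    direction = "X->Y" if source_domain == "X" else "Y->X"
    converter = store.load(direction)
    # S220: apply the converter; S230: return the result to the caller,
    # which plays the role of the output unit.
    return converter(data)
```

The discriminators D_X and D_Y and the distance measures are not needed at this stage; only the converter parameters are read from storage.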

5 Evaluation experiments
5.1 Experimental setup

The proposed method is applicable to series data conversion in general. In the experiments, it was applied to parallel-data-free voice conversion as one example and evaluated. The VCC 2016 dataset (Non-Patent Document 12) was used as the data. This dataset contains recordings of professional US English speakers, five male and five female. Each speaker's data consists of 216 short sentences (about 13 minutes), of which 162 are used for training and 54 for evaluation. To evaluate the proposed method without parallel data, the first 81 of the 162 training sentences were used as source speech and the remaining 81 as target speech during training. In other words, training was performed under the condition that no utterance is shared between the source and target speech. The speech data was downsampled to 16 kHz, and 24-dimensional mel-cepstral coefficients (MCEP), log fundamental frequency (log F₀), and aperiodicity (AP) were extracted every 5 ms using the WORLD analysis system (Non-Patent Document 8). Of these speech features, the mel-cepstrum was converted with the proposed method. The fundamental frequency was converted with the logarithm Gaussian normalized transformation (Non-Patent Document 6). The aperiodicity of the source speech was used as is, because converting it has been shown to make no significant difference.
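The data split and the F₀ conversion described above can be sketched as follows. This is an illustrative sketch rather than the experimental code. It assumes the commonly used form of the logarithm Gaussian normalized transformation, in which log F₀ is shifted and scaled from the source speaker's statistics to the target speaker's, and it assumes that unvoiced frames are marked by F₀ = 0 and that the log F₀ standard deviations are non-zero. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def split_non_parallel(sentences: Sequence[str]) -> Tuple[List[str], List[str]]:
    """First half for the source speaker, second half for the target speaker."""
    half = len(sentences) // 2
    return list(sentences[:half]), list(sentences[half:])


def log_f0_stats(f0: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard deviation of log F0 over voiced frames (F0 > 0)."""
    logs = [math.log(v) for v in f0 if v > 0]
    return mean(logs), pstdev(logs)


def convert_f0(f0: Sequence[float],
               src_stats: Tuple[float, float],
               tgt_stats: Tuple[float, float]) -> List[float]:
    """Map source F0 to the target speaker's log-F0 mean and variance."""
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    out = []
    for v in f0:
        if v > 0:
            # Normalize in the log domain, then rescale to the target statistics.
            out.append(math.exp((math.log(v) - mu_s) / sd_s * sd_t + mu_t))
        else:
            out.append(0.0)  # keep unvoiced frames unvoiced
    return out
```

With 162 training sentences, `split_non_parallel` yields two disjoint sets of 81 sentence identifiers, which matches the condition that the source and target training speech share no utterance.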

5.2 Objective evaluation

Because the proposed method was applied to the mel-cepstrum in this experiment, the quality of the converted mel-cepstrum was evaluated objectively. The comparison method was GMM-based voice conversion (Non-Patent Document 11), one of the representative methods for voice conversion with parallel data. Since GMM-based voice conversion requires parallel data for training, all 162 training sentences were used for it. Note that the proposed method was therefore trained at a disadvantage: without parallel data and with half the amount of data. For the evaluation data, SF1 and SM1 were used as source speakers and TF2 and TM3 as target speakers.

The evaluation metrics were global variance (GV) (Non-Patent Document 11) and modulation spectra (MS) (Non-Patent Document 10), both of which are said to correlate highly with subjective evaluations of speech quality. FIG. 7 compares the GV for each mel-cepstrum order among the proposed method (Proposed), the comparison method (Conventional), and the target speech (Target). The result shows that the proposed method yields a GV closer to that of the target speech than the comparison method does. FIG. 8 compares the MS at each modulation frequency among the proposed method (Proposed), the comparison method (Conventional), and the target speech (Target). The result shows that the proposed method yields an MS closer to that of the target speech than the comparison method does. Table 1 compares the root mean square error (RMSE) between the log MS of the target speech and that of the converted speech. A smaller value indicates that the converted speech is closer to the target speech, and the results show that the proposed method yields a log MS closer to the target speech than the comparison method does.
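The three quantities in this evaluation can be computed along the following lines. This is a minimal sketch under stated assumptions, not the evaluation code behind FIGS. 7 and 8 or Table 1: GV is taken as the per-order variance of the mel-cepstrum over time, the modulation spectrum as the log power of a discrete Fourier transform of each coefficient trajectory, and a small `eps` is added before the logarithm. The exact definitions in Non-Patent Documents 10 and 11 may differ in details such as segment length and averaging.

```python
import cmath
import math
from typing import List, Sequence

Frames = Sequence[Sequence[float]]  # frames[t][d]: coefficient d at frame t


def global_variance(frames: Frames) -> List[float]:
    """Variance over time of each mel-cepstrum order."""
    n = len(frames)
    out = []
    for d in range(len(frames[0])):
        track = [frame[d] for frame in frames]
        m = sum(track) / n
        out.append(sum((v - m) ** 2 for v in track) / n)
    return out


def log_modulation_spectrum(track: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Log power of a naive DFT of one coefficient trajectory, bins 0 to n // 2."""
    n = len(track)
    spec = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(track))
        spec.append(math.log(abs(coeff) ** 2 + eps))
    return spec


def rmse(a: Sequence[float], b: Sequence[float]) -> float:
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))
```

Applying `rmse` to the log modulation spectra of the target and converted trajectories gives a per-order distance of the kind reported in Table 1; a smaller value means the converted speech is closer to the target.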

5.3 Subjective evaluation

In the subjective evaluation experiments, naturalness and speaker similarity were evaluated following the VCC 2016 protocol (Non-Patent Document 13). The comparison method was the GMM-based conversion method with parallel data (Non-Patent Document 11). Naturalness was evaluated with a mean opinion score (MOS) test. For the evaluation data, 20 sentences with durations of 2 to 5 seconds were randomly selected from the evaluation set. Nine listeners well educated in English participated as subjects. For conversion between speakers of the same gender (SF1-TF2), the MOS was 2.4 for the proposed method and 1.3 for the comparison method. For conversion between speakers of different genders (SF1-TM3), the MOS was 2.3 for the proposed method and 1.4 for the comparison method. A higher score indicates higher naturalness, so the proposed method also outperforms the comparison method in the subjective evaluation of naturalness.

Speaker similarity was evaluated by asking whether two utterances with different content sound as if they were spoken by the same person. For the evaluation data, 10 sets were randomly selected from the evaluation set. Nine listeners well educated in English participated as subjects. FIG. 9 shows the results for conversion between speakers of the same gender (SF1-TF2). In this figure, the proportion of responses judging the converted speech to be "absolutely the same as the target speech" is higher for the proposed method than for the comparison method. This result shows that the proposed method is also superior in terms of speaker similarity.

Claims

A series data conversion device comprising:
an input unit that receives series data of two domains;
a forward conversion unit that uses a converter to convert forward conversion input data, which is data of one domain, into forward conversion output data, which is data of the other domain;
a reverse conversion unit that uses a converter to apply a reverse conversion to the forward conversion output data, converting it into reverse conversion output data, which is data of the input domain of the forward conversion unit;
a state determination unit that uses a state determination device to determine whether the forward conversion output data is appropriate as series data of the domain targeted by the forward conversion output data;
a forward/reverse conversion distance measuring unit that uses a distance measuring device to measure a distance between the reverse conversion output data and the forward conversion input data;
a learning unit that updates the parameters of the converter and the state determination device according to the results of the state determination unit and the forward/reverse conversion distance measuring unit;
a conversion unit that converts data received by the input unit using the converter learned by the learning unit; and
an output unit that outputs the data converted by the conversion unit.

The series data conversion device according to claim 1, further comprising:
a self-conversion unit that takes, as self-conversion input data, data of the domain into which the converter of the forward conversion unit converts, and obtains self-conversion output data, which is the data converted by that converter; and
a self-conversion distance measuring unit that measures a distance between the self-conversion input data and the self-conversion output data.

The series data conversion device according to claim 1 or 2, wherein the converter and the state determination device are constructed using a neural network that can capture relationships within time-series data.

The series data conversion device according to claim 3, wherein a model having a Gated CNN, LSTM, or Attention structure is used as part of the neural network.

A learning device comprising:
an input unit that receives series data of two domains;
a forward conversion unit that uses a converter to convert forward conversion input data, which is data of one domain, into forward conversion output data, which is data of the other domain;
a reverse conversion unit that uses a converter to apply a reverse conversion to the forward conversion output data, converting it into reverse conversion output data, which is data of the input domain of the forward conversion unit;
a state determination unit that uses a state determination device to determine whether the forward conversion output data is appropriate as series data of the domain targeted by the forward conversion output data; and
a forward/reverse conversion distance measuring unit that uses a distance measuring device to measure a distance between the reverse conversion output data and the forward conversion input data,
wherein the learning device updates the parameters of the converter and the state determination device according to the results of the state determination unit and the forward/reverse conversion distance measuring unit.

A program for causing a computer to function as each unit of the series data conversion device according to any one of claims 1 to 4.