JP6722810B2

JP6722810B2 - Speech synthesis learning device

Info

Publication number: JP6722810B2
Application number: JP2019149850A
Authority: JP
Inventors: 卓弘金子; 弘和亀岡; 薫平松; 邦夫柏野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2019-08-19
Filing date: 2019-08-19
Publication date: 2020-07-15
Anticipated expiration: 2036-08-30
Also published as: JP2019211782A

Description

The present invention relates to a speech synthesis learning device, and more particularly to a speech synthesis learning device for synthesizing speech.

Features representing the vocal-cord source information of speech (fundamental frequency, aperiodicity index, etc.) and the vocal tract spectrum information can be obtained by speech analysis methods such as STRAIGHT and Mel-Generalized Cepstral Analysis (MGC). Many text-to-speech synthesis systems and voice conversion systems predict a sequence of such speech features from input text or source speech and then generate a speech signal according to the vocoder method.

In existing vocoder-based speech synthesis, speech is generated by using a vocoder to convert a speech feature sequence, such as vocal-cord source information and vocal tract spectrum information. FIG. 35 shows a conceptual diagram of vocoder-based speech synthesis. The vocoder referred to here is a model of the sound generation process based on knowledge of the mechanism of human vocalization. A representative vocoder model is the source-filter model, which describes the sound generation process with two components: a sound source and a digital filter. Specifically, a voice is assumed to be generated by applying the digital filter, as needed, to the signal produced by the source (represented as a pulse signal). Because vocoder-based speech synthesis models the vocalization mechanism abstractly in this way, it can represent speech compactly (in low dimension). On the other hand, as a result of this abstraction, the naturalness of the speech is often lost, giving the mechanical sound quality characteristic of vocoders.
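The source-filter model described above can be illustrated with a minimal Python sketch. The sampling rate, pitch, pole radius, and resonance frequency below are hypothetical values chosen only for illustration; a practical vocoder uses time-varying filters estimated by speech analysis.

```python
import math

def pulse_train(n_samples, period):
    """Excitation: a unit impulse every `period` samples (a crude voiced source)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def all_pole_filter(excitation, a):
    """All-pole digital filter: y[n] = e[n] - sum_k a[k] * y[n-1-k]."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for k, coeff in enumerate(a):
            if n - 1 - k >= 0:
                acc -= coeff * y[n - 1 - k]
        y.append(acc)
    return y

# Hypothetical settings: 16 kHz sampling, 125 Hz pitch, one resonance near 700 Hz.
fs = 16000
period = fs // 125                 # 128 samples between pulses
r, freq = 0.97, 700.0              # pole radius (< 1 for stability), resonance in Hz
theta = 2.0 * math.pi * freq / fs
a = [-2.0 * r * math.cos(theta), r * r]   # denominator 1 + a1 z^-1 + a2 z^-2

speech = all_pole_filter(pulse_train(1024, period), a)
```

The whole signal is determined by a pitch period and two filter coefficients, which is the compactness the text refers to; the same rigidity is what tends to make the result sound mechanical.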

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, "Generative Adversarial Nets," 2014.
Emily Denton, Soumith Chintala, Arthur Szlam, Rob Fergus, "Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks," 2015.

The problem of predicting appropriate speech features from input text or source speech is a kind of regression (machine learning) problem. Particularly when only a limited number of training samples is available, a compact (low-dimensional) feature representation is advantageous for statistical prediction. Many text-to-speech synthesis systems and voice conversion systems use a vocoder method based on speech features (rather than trying to predict the waveform or spectrum directly) in order to exploit this advantage. On the other hand, speech generated by the vocoder method often has the mechanical sound quality characteristic of vocoders, which places a potential limit on the sound quality of conventional text-to-speech synthesis and voice conversion systems.

The present invention has been made to solve the above problem, and its object is to provide a speech synthesis learning device that can train a neural network capable of synthesizing more natural speech.

To achieve the above object, a speech synthesis learning device according to the present invention is a speech synthesis learning device that trains a neural network for synthesizing speech from arbitrary speech data or an arbitrary speech feature sequence. The device includes a learning unit that accepts input speech data or a speech feature sequence together with true speech data for learning. The learning unit comprises a neural network serving as a first generator, trained in advance to generate intermediate speech data from the speech data or speech feature sequence and the true speech data for learning, and a neural network serving as a second generator, which is trained to generate synthesized speech data from the intermediate speech data and the true speech data for learning. The learning unit inputs the speech data or speech feature sequence to the first generator to obtain the intermediate speech data, and inputs the obtained intermediate speech data to the second generator to generate the synthesized speech data. It trains the second generator either so as to optimize an objective function representing the distance between the generated synthesized speech data and the true speech data for learning, or so that the second generator and a neural network serving as a discriminator, which determines whether or not the generated synthesized speech data follows the same distribution as the true speech data for learning, follow optimization conditions under which they compete with each other. The first generator has been trained in advance either by optimizing an objective function representing the distance between the intermediate speech data and the true speech data for learning, or according to optimization conditions under which the first generator competes with a neural network serving as a discriminator that determines whether or not the intermediate speech data follows the same distribution as the true speech data for learning.

The speech synthesis learning device of the present invention has the effect that a neural network capable of synthesizing more natural speech can be trained.

FIG. 1 is a conceptual diagram of the processing of the first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the speech synthesizer according to the first embodiment of the present invention.
FIG. 3 is a conceptual diagram of the learning process of the first embodiment of the present invention.
FIG. 4 is a flowchart showing the learning processing routine in the speech synthesizer according to the first and second embodiments of the present invention.
FIG. 5 is a flowchart showing the generation processing routine in the speech synthesizer according to the first and second embodiments of the present invention.
FIG. 6 is a conceptual diagram of the processing of the second embodiment of the present invention.
FIG. 7 is a block diagram showing the configuration of the speech synthesizer according to the second embodiment of the present invention.
FIG. 8 is a conceptual diagram of the learning process of the second embodiment of the present invention.
FIG. 9 is a conceptual diagram of the processing of the third embodiment of the present invention.
FIG. 10 is a block diagram showing the configuration of the speech synthesizer according to the third embodiment of the present invention.
FIG. 11 is a conceptual diagram of the learning process of the third embodiment of the present invention.
FIG. 12 is a flowchart showing the learning processing routine in the speech synthesizer according to the third and fourth embodiments of the present invention.
FIG. 13 is a flowchart showing the generation processing routine in the speech synthesizer according to the third and fourth embodiments of the present invention.
FIG. 14 is a conceptual diagram of the processing of the fourth embodiment of the present invention.
FIG. 15 is a block diagram showing the configuration of the speech synthesizer according to the fourth embodiment of the present invention.
FIG. 16 is a conceptual diagram of the learning process of the fourth embodiment of the present invention.
FIG. 17 is a conceptual diagram of the fifth embodiment of the present invention.
FIG. 18 is a block diagram showing the configuration of the speech synthesizer according to the fifth embodiment of the present invention.
FIG. 19 is a flowchart showing the learning processing routine in the speech synthesizer according to the fifth embodiment of the present invention.
FIG. 20 is a flowchart showing the generation processing routine in the speech synthesizer according to the fifth embodiment of the present invention.
FIG. 21 is a diagram showing an implementation example of the learning method of the third embodiment in the experimental example.
FIG. 22 is a diagram showing an implementation example of the learning method of the fourth embodiment in the experimental example.
FIG. 23 is a diagram showing an implementation example of the generation method of the third embodiment in the experimental example.
FIG. 24 is a diagram showing an implementation example of the generation method of the fourth embodiment in the experimental example.
FIG. 25 is a diagram showing the network structure of the third embodiment in the experimental example.
FIG. 26 is a diagram showing the network structure of the fourth embodiment in the experimental example.
FIG. 27 is a diagram showing an example of the waveform of the speech signal on which the inputs and outputs in the experimental example are based.
FIG. 28 is a diagram showing the experimental results for Volume change.
FIG. 29 is a diagram showing the experimental results for Pre-emphasis.
FIG. 30 is a diagram showing the experimental results for LPC.
FIG. 31 is a diagram showing the experimental results for LPC+pulse.
FIG. 32 is a diagram showing the network structure of the first embodiment in the experimental example.
FIG. 33 is a diagram showing the network structure of the second embodiment in the experimental example.
FIG. 34 is a diagram showing the results of speech restoration by the methods of the first and second embodiments in the experimental example.
FIG. 35 is a conceptual diagram of vocoder-based speech synthesis processing.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Outline of the First Embodiment of the Present Invention>

First, an outline of the first embodiment of the present invention will be described.

Existing vocoder-based speech synthesis abstractly models the sound generation process based on knowledge of the human vocalization mechanism; it is not directly optimized to reproduce speech data (a speech signal or a speech spectrum sequence; the same applies hereinafter) from a speech feature sequence.

The first embodiment of the present invention solves this problem by directly optimizing the mapping between the speech feature sequence and the speech data. A conceptual diagram of the processing is shown in FIG. 1. The target speech data is obtained by applying, to the input speech feature sequence, a neural network optimized for the mapping between speech feature sequences and speech data. When a speech signal is used as the speech data, the target speech signal is obtained directly. When a speech spectrum sequence is used as the speech data, the output is also a speech spectrum sequence; in that case, the target speech signal is obtained by phase restoration, for example with the Griffin-Lim algorithm.

<Configuration of the Speech Synthesizer According to the First Embodiment of the Present Invention>

Next, the configuration of the speech synthesizer according to the first embodiment of the present invention will be described. As shown in FIG. 2, the speech synthesizer 100 according to the first embodiment of the present invention can be implemented as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the learning processing routine and the generation processing routine described later. Functionally, as shown in FIG. 2, the speech synthesizer 100 includes an input unit 10, a calculation unit 20, and an output unit 90.

The input unit 10 receives human speech data x as learning data. The input unit 10 also receives an arbitrary speech feature sequence f from which synthesized speech data is to be generated.

The calculation unit 20 includes a learning unit 30, a neural network storage unit 40, and a generation unit 50.

As described below, the learning unit 30 takes as input a speech feature sequence f used for a vocoder, obtained by speech analysis of the speech data x, and the true speech data x for learning. It includes a neural network serving as a generator that generates synthesized speech data from the speech feature sequence f, and it trains this generator so as to optimize an objective function representing the distance between the synthesized speech data and the true speech data x for learning.

The learning unit 30 first obtains the speech feature sequence f by performing speech analysis on the speech data x received by the input unit 10. It then trains the neural network so that the original true speech data x is generated from the speech feature sequence f. Specifically, when the speech feature sequence f is input to the neural network, synthesized speech data is output; the weights of the neural network are optimized so that the distance between the true speech data x and the output synthesized speech data is minimized under some distance measure. The distance measure is, for example, the least squares error. When the least squares error is used as the distance measure, the objective function L_2 is expressed by the following equation (1).

... (1)
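For reference, one way to write a least squares objective of this kind is sketched below. It is an assumed form, not a reproduction of equation (1): the symbols $G$ (the generator) and $\hat{x}$ (the synthesized speech data) are introduced here for illustration, and any scaling factor or averaging over training samples is left out.

```latex
L_2 = \left\| x - \hat{x} \right\|_2^2 , \qquad \hat{x} = G(f)
```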

FIG. 3 shows a conceptual diagram of the learning process of the first embodiment.

The neural network serving as the generator, trained so as to optimize the objective function of equation (1), is stored in the neural network storage unit 40.

The generation unit 50 inputs the arbitrary speech feature sequence f received by the input unit 10 to the neural network stored in the neural network storage unit 40, and outputs the synthesized speech data produced by the neural network to the output unit 90.

<Operation of the Speech Synthesizer According to the First Embodiment of the Present Invention>

Next, the operation of the speech synthesizer 100 according to the first embodiment of the present invention will be described. The speech synthesizer 100 executes the learning processing routine and the generation processing routine described below.

First, the learning processing routine will be described. When the input unit 10 receives human speech data x as learning data, the speech synthesizer 100 executes the learning processing routine shown in FIG. 4.

First, in step S100, speech analysis is performed on the speech data x received by the input unit 10 to obtain the speech feature sequence f.

Next, in step S102, the speech feature sequence f obtained in step S100 and the speech data x received by the input unit 10 are taken as input and, in accordance with equation (1), the neural network serving as the generator that generates synthesized speech data from the speech feature sequence f is trained so as to optimize the objective function representing the distance between the synthesized speech data and the speech data x. The trained neural network is stored in the neural network storage unit 40, and the processing ends.

Next, the generation processing routine will be described. When the input unit 10 receives an arbitrary speech feature sequence f from which synthesized speech data is to be generated, the speech synthesizer 100 executes the generation processing routine shown in FIG. 5.

In step S200, the arbitrary speech feature sequence f received by the input unit 10 is input to the neural network stored in the neural network storage unit 40, the synthesized speech data output from the neural network is output to the output unit 90, and the processing ends.
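The flow of steps S100, S102, and S200 can be sketched in Python as follows. This is a toy illustration under loudly stated assumptions, not the device's implementation: the "speech analysis" is a hypothetical per-frame mean, and the generator is a single linear map trained by plain gradient descent on the squared error, standing in for the neural network.

```python
import random

FRAME = 4  # hypothetical frame length in samples

def analyze(x):
    """Stand-in for step S100: one feature (the frame mean) per frame of x."""
    return [sum(x[i:i + FRAME]) / FRAME for i in range(0, len(x), FRAME)]

def generate(weights, f):
    """Step S200: map each feature to FRAME output samples with a linear generator."""
    return [w * feat for feat in f for w in weights]

def l2_loss(x, x_hat):
    """Squared-error distance between true and synthesized speech data."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def train(x, steps=200, lr=1.0):
    """Step S102: gradient descent on the squared error with respect to the weights."""
    f = analyze(x)
    weights = [0.0] * FRAME
    for _ in range(steps):
        x_hat = generate(weights, f)
        grad = [0.0] * FRAME
        for t, feat in enumerate(f):
            for j in range(FRAME):
                grad[j] += 2.0 * (x_hat[t * FRAME + j] - x[t * FRAME + j]) * feat
        weights = [w - lr * g / len(f) for w, g in zip(weights, grad)]
    return weights

random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(64)]   # toy "speech data"
f = analyze(x)
before = l2_loss(x, generate([0.0] * FRAME, f))      # untrained generator
weights = train(x)
after = l2_loss(x, generate(weights, f))             # trained generator
```

Comparing `before` and `after` shows the distance to the true data shrinking during training; a linear map of one feature per frame cannot drive it to zero, which is why the text uses a neural network.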

As described above, the speech synthesizer according to the first embodiment of the present invention takes the speech feature sequence f and the true speech data x for learning as input and, in accordance with equation (1), trains the neural network serving as the generator that generates synthesized speech data from the speech feature sequence f so as to optimize the objective function representing the distance between the synthesized speech data and the true speech data x for learning. A neural network capable of synthesizing more natural speech can thereby be trained.

Furthermore, by synthesizing speech with the trained neural network, more natural speech can be synthesized from a speech feature sequence.

<Outline of the Second Embodiment of the Present Invention>

Next, an outline of the second embodiment of the present invention will be described.

The first embodiment reproduces a voice from the speech feature sequence alone. In the second embodiment, a naturalness component is newly added as an input to the neural network in order to express the naturalness of speech. A conceptual diagram of the processing is shown in FIG. 6. The speech feature sequence mentioned here is obtained by speech analysis, whereas the naturalness component is given independently of it (for example, as random numbers).

<Configuration of the Speech Synthesizer According to the Second Embodiment of the Present Invention>

Next, the configuration of the speech synthesizer according to the second embodiment of the present invention will be described. Parts that are the same as in the first embodiment are given the same reference numerals, and their description is omitted.

As shown in FIG. 7, the speech synthesizer 200 according to the second embodiment of the present invention can be implemented as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the learning processing routine and the generation processing routine described later. Functionally, as shown in FIG. 7, the speech synthesizer 200 includes an input unit 10, a calculation unit 220, and an output unit 90.

The calculation unit 220 includes a learning unit 230, a neural network storage unit 40, and a generation unit 250.

As described below, the learning unit 230 takes as input the speech feature sequence f used for a vocoder, obtained by speech analysis of the speech data x, a naturalness component z given in advance, and the true speech data x for learning. It includes a neural network serving as a generator that generates synthesized speech data (a synthesized speech signal or a synthesized speech spectrum sequence) from the speech feature sequence f, and a neural network serving as a discriminator that identifies whether or not the synthesized speech data follows the same distribution as the true speech data. The generator and the discriminator are trained according to optimization conditions under which they compete with each other.

The learning unit 230 first obtains the speech feature sequence f from the speech data x received by the input unit 10. Based on the obtained speech feature sequence f, the naturalness component z, and the true speech data x for learning, it trains the neural network serving as the generator so that the original true speech data x is generated. A partially modified version of the speech feature sequence f may be used here. Specifically, the fundamental frequency is one representative speech feature, and a version of it multiplied by a random constant may be used. The naturalness component z is a random number generated according to some distribution (for example, a uniform distribution).
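The two randomized inputs described above can be sketched as follows. The scaling range, the dimension of z, the interval of the uniform distribution, and the convention that 0.0 marks an unvoiced frame are all hypothetical choices made for this illustration; the text only states that the fundamental frequency may be multiplied by a random constant and that z may be drawn from, for example, a uniform distribution.

```python
import random

def scale_f0(f0_track, low=0.8, high=1.25, rng=random):
    """Multiply the whole F0 contour by one random constant (0.0 stays 0.0)."""
    factor = rng.uniform(low, high)
    return [v * factor for v in f0_track], factor

def sample_naturalness(dim, rng=random):
    """Draw the naturalness component z from a uniform distribution on [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

rng = random.Random(0)                    # seeded for repeatability
f0 = [0.0, 120.0, 125.0, 130.0, 0.0]      # toy contour in Hz
f0_aug, factor = scale_f0(f0, rng=rng)    # modified feature for training
z = sample_naturalness(8, rng=rng)        # extra generator input
```

Because z is drawn independently of f, the generator can produce different outputs for the same feature sequence, which is the extra degree of freedom the naturalness component is meant to supply.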

In addition, based on the true speech data x and the synthesized speech data generated by the neural network serving as the generator, the learning unit 230 trains a neural network serving as a discriminator that identifies whether or not synthesized speech data is true speech data. This discriminator identifies whether the speech data input to it is true or synthesized, and outputs the result.

In this embodiment, the evaluation function of the neural network serving as the generator and the neural network serving as the discriminator is optimized according to equation (2) below, where G denotes the generator and D denotes the discriminator. In equation (2), the discriminator maximizes the evaluation function so that it can distinguish true speech from synthesized speech as well as possible, while the generator minimizes the evaluation function so that the discriminator identifies synthesized speech as true speech as far as possible. Optimization proceeds as the discriminator and the generator compete.

... (2)
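For reference, a minimax evaluation function matching this description can be sketched as below. It is an assumed form that follows the standard adversarial objective of Goodfellow et al. (cited in this document), with the generator taking the speech feature sequence f and the naturalness component z as inputs; it is not a reproduction of equation (2), and the expectation notation is introduced here for illustration.

```latex
\min_G \max_D \;
  \mathbb{E}_{x}\!\left[ \log D(x) \right]
  + \mathbb{E}_{f, z}\!\left[ \log\!\left( 1 - D\!\left( G(f, z) \right) \right) \right]
```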

FIG. 8 shows a conceptual diagram of the learning process of the second embodiment.

The neural network serving as the generator and the neural network serving as the discriminator, trained so as to optimize the evaluation function of equation (2), are stored in the neural network storage unit 40.

Alternatively, as in equation (3) below, the generator and the discriminator may be trained so as to optimize an evaluation function that uses a discriminator which also takes the speech feature sequence f into account.

... (3)
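For reference, an evaluation function in which the discriminator also receives the speech feature sequence f can be sketched as below. It is an assumed form based on the standard conditional adversarial objective, with G the generator, D the discriminator, and z the naturalness component; it is not a reproduction of equation (3).

```latex
\min_G \max_D \;
  \mathbb{E}_{x, f}\!\left[ \log D(x, f) \right]
  + \mathbb{E}_{f, z}\!\left[ \log\!\left( 1 - D\!\left( G(f, z),\, f \right) \right) \right]
```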

When training the neural networks, the neural network serving as the generator may also be pre-trained using the method of the first embodiment.

The generation unit 250 inputs the arbitrary speech feature sequence f received by the input unit 10 and the naturalness component z given in advance to the neural network stored in the neural network storage unit 40, and outputs the synthesized speech data produced by the neural network to the output unit 90.

<Operation of the Speech Synthesizer According to the Second Embodiment of the Present Invention>

Next, the operation of the speech synthesizer 200 according to the second embodiment of the present invention will be described. The speech synthesizer 200 executes the learning processing routine and the generation processing routine described below.

First, the learning processing routine will be described. When the input unit 10 receives human speech data x as learning data, the speech synthesizer 200 executes the learning processing routine shown in FIG. 4 above.

In the learning processing routine of the second embodiment, in step S102, the speech feature sequence f obtained in step S100, the naturalness component z given in advance, and the speech data x received by the input unit 10 are taken as input and, in accordance with equation (2), the neural network serving as the generator and the neural network serving as the discriminator are trained according to optimization conditions under which they compete with each other. The trained neural networks are stored in the neural network storage unit 40, and the processing ends.
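The competing roles in step S102 can be made concrete with a small Python sketch that evaluates a standard adversarial value function from discriminator outputs. The function name, the use of the standard log-likelihood form, and the toy probabilities are assumptions made for illustration; the networks themselves are not modeled.

```python
import math

def value_function(d_real, d_fake):
    """Mean log D(x) over true samples plus mean log(1 - D(.)) over synthesized ones."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Each number is the discriminator's probability that its input is true speech.
sharp_d = value_function(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])   # D separates well
fooled_d = value_function(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])    # G fools D
```

The value is higher when the discriminator separates true from synthesized data (`sharp_d`) and lower when the generator has made them indistinguishable (`fooled_d`), which is why the discriminator's update increases it and the generator's update decreases it.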

In the generation processing routine of the second embodiment, as shown in FIG. 5 above, in step S200 the arbitrary speech feature sequence f received by the input unit 10 and the naturalness component z given in advance are input to the neural network stored in the neural network storage unit 40, the synthesized speech data output from the neural network is output to the output unit 90, and the processing ends.

As described above, the speech synthesizer according to the second embodiment of the present invention takes the speech feature sequence f, the naturalness component z, and the true speech data x for learning as input and, in accordance with equation (2), trains the neural network serving as the generator that generates synthesized speech data from the speech feature sequence f and the neural network serving as the discriminator that identifies whether or not the synthesized speech data follows the same distribution as the true speech data x, according to optimization conditions under which they compete with each other. A neural network capable of synthesizing more natural speech can thereby be trained.

Furthermore, by synthesizing speech with the trained neural network serving as the generator, more natural speech can be synthesized.

＜本発明の第３の実施の形態に係る概要＞ <Outline of Third Embodiment of the Present Invention>

次に、本発明の第３の実施の形態における概要を説明する。 Next, an outline of the third embodiment of the present invention will be described.

第１及び第２の実施の形態は、音声特徴量系列と高音質音声の間のマッピングを行うものであり、既存のボコーダの代わりになる技術である。一方、第３の実施の形態は、音声特徴量系列から一度合成した音声と高品質音声の間のマッピングを行う方法である。ここで、音声特徴量系列から一度音声を合成するためには、既存のボコーダ、あるいは、第１及び第２の実施の形態を用いれば良い。処理の概念図を図９に示す。 The first and second embodiments are for performing mapping between a voice feature amount sequence and high-quality voice, and are technologies that replace existing vocoders. On the other hand, the third embodiment is a method of performing mapping between a voice synthesized once from a voice feature quantity sequence and high quality voice. Here, in order to synthesize the voice once from the voice feature quantity sequence, an existing vocoder or the first and second embodiments may be used. A conceptual diagram of the processing is shown in FIG.

音声特徴量系列が与えられると、まずボコーダ、あるいは、第１又は第２の実施の形態の手法で学習した生成器としてのニューラルネットワークを用いることによって中間音声信号を得る。この中間音声信号を、ニューラルネットワークに入力し、変換することによって、目的となる音声データを得る。 Given a voice feature quantity sequence, an intermediate voice signal is first obtained by using a vocoder or a neural network as a generator learned by the method of the first or second embodiment. By inputting this intermediate voice signal into the neural network and converting it, the target voice data is obtained.

＜本発明の第３の実施の形態に係る音声合成装置の構成＞ <Configuration of speech synthesizer according to third embodiment of the present invention>

次に、本発明の第３の実施の形態に係る音声合成装置の構成について説明する。なお、第２の実施の形態と同様となる箇所については同一符号を付して説明を省略する。 Next, the configuration of the speech synthesizer according to the third embodiment of the present invention will be described. The same parts as those in the second embodiment are designated by the same reference numerals and the description thereof will be omitted.

As shown in FIG. 10, the speech synthesizer 300 according to the third embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores programs for executing the learning processing routine and generation processing routine described later, together with various data. Functionally, as shown in FIG. 10, the speech synthesizer 300 includes an input unit 10, a calculation unit 320, and an output unit 90.

演算部３２０は、学習部３３０と、ニューラルネットワーク記憶部４０と、中間音声変換部３３２と、生成部３５０とを含んで構成されている。 The calculation unit 320 includes a learning unit 330, a neural network storage unit 40, an intermediate voice conversion unit 332, and a generation unit 350.

As described below, the learning unit 330 takes as input intermediate speech data x′ (an intermediate speech signal or intermediate speech spectrum sequence), obtained by synthesizing speech from the speech feature sequence produced by analyzing the speech data x, together with the naturalness component z and the true speech data x for training. It includes a neural network serving as a generator that produces synthesized speech data from the intermediate speech data x′, and it trains this network so as to optimize an objective function representing the distance between the synthesized speech data and the true speech data x for training.

The learning unit 330 first obtains the speech feature sequence f from the speech data x received by the input unit 10. The speech feature sequence f and the naturalness component z are input to a neural network serving as a generator trained as in the second embodiment, which yields the intermediate speech data x′. A neural network serving as a generator is then trained so that the original true speech data x is produced from the intermediate speech data x′. Specifically, when the intermediate speech data x′ is input to the neural network, synthesized speech data is output, and the weights of the neural network are optimized so that the distance between the true speech data x and the output synthesized speech data is minimized under some distance measure. The distance measure is, for example, the least-squares error. When the least-squares error is used as the distance measure, the objective function L_2 is given by equation (4) below.

・・・（４） ...(4)
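The body of equation (4) did not survive extraction, so the following is only a minimal sketch of a least-squares objective of the kind the text describes, not the patent's exact formula. The scalar-gain `toy_generator`, the learning rate, and all names are assumptions made for illustration; the actual generator is a neural network.

```python
# Hypothetical illustration of training a generator G so that G(x') approaches x
# under a squared-error distance. All names and values here are assumptions.

def squared_error(x, x_hat):
    """Sum of squared differences between two equal-length sequences."""
    if len(x) != len(x_hat):
        raise ValueError("sequences must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def toy_generator(x_prime, weight):
    """Stand-in for the neural network: a single scalar gain."""
    return [weight * v for v in x_prime]

def train_gain(x, x_prime, steps=100, lr=0.5):
    """Gradient descent on the gain so that the squared error shrinks."""
    w = 0.0
    for _ in range(steps):
        # d/dw sum((x - w * x')^2) = -2 * sum(x' * (x - w * x'))
        grad = -2.0 * sum(p * (a - w * p) for a, p in zip(x, x_prime))
        w -= lr * grad
    return w

x_true = [0.2, -0.4, 0.6, -0.8]   # true speech data x (toy values)
x_mid = [0.1, -0.2, 0.3, -0.4]    # intermediate speech data x' (half amplitude)
w = train_gain(x_true, x_mid)
loss = squared_error(x_true, toy_generator(x_mid, w))
```

In this toy setting the learned gain approaches 2, which exactly undoes the halved amplitude of the intermediate data.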

図１１に第３の実施の形態の学習処理の概念図を示す。 FIG. 11 shows a conceptual diagram of the learning process of the third embodiment.

上記（４）式の目的関数を最適化するように学習された生成器としてのニューラルネットワークはニューラルネットワーク記憶部４０に記憶される。 The neural network as a generator learned so as to optimize the objective function of the equation (4) is stored in the neural network storage unit 40.

The intermediate speech conversion unit 332 obtains intermediate speech data x′ (an intermediate speech signal or intermediate speech spectrum sequence) by inputting an arbitrary speech feature sequence f received by the input unit 10 to the neural network of the second embodiment (not shown).

The generation unit 350 inputs the intermediate speech data x′ obtained by the intermediate speech conversion unit 332 to the neural network stored in the neural network storage unit 40, and outputs the resulting synthesized speech data to the output unit 90.

＜本発明の第３の実施の形態に係る音声合成装置の作用＞ <Operation of speech synthesis device according to third embodiment of the present invention>

次に、本発明の第３の実施の形態に係る音声合成装置３００の作用について説明する。音声合成装置３００は、以下に説明する学習処理ルーチンと生成処理ルーチンを実行する。 Next, the operation of the speech synthesizer 300 according to the third embodiment of the present invention will be described. The speech synthesizer 300 executes a learning processing routine and a generation processing routine described below.

まず、学習処理ルーチンについて説明する。入力部１０において学習データとして、人間の音声データｘを受け付けると、音声合成装置３００は、図１２に示す学習処理ルーチンを実行する。 First, the learning processing routine will be described. When the input unit 10 receives human voice data x as learning data, the voice synthesizer 300 executes a learning processing routine shown in FIG. 12.

まず、ステップＳ３００では、入力部１０で受け付けた音声データｘを音声分析し、音声特徴量系列ｆを得る。 First, in step S300, the voice data x received by the input unit 10 is voice analyzed to obtain a voice feature amount sequence f.

Next, in step S302, the speech feature sequence f obtained in step S300, the naturalness component z, and the speech data x received by the input unit 10 are taken as input, and intermediate speech data x′ (an intermediate speech signal or intermediate speech spectrum sequence) is obtained by feeding them to a neural network serving as a generator trained as in the second embodiment.

In step S304, the intermediate speech data x′ obtained in step S302 and the speech data x received by the input unit 10 are taken as input. In accordance with equation (4) above, the neural network serving as a generator that produces synthesized speech data from the intermediate speech data x′ is trained so as to optimize the objective function. The trained neural network is stored in the neural network storage unit 40, and the processing ends.

次に、生成処理ルーチンについて説明する。入力部１０において合成音声データの生成対象となる任意の音声特徴量系列ｆを受け付けると、音声合成装置３００は、図１３に示す生成処理ルーチンを実行する。 Next, the generation processing routine will be described. When the input unit 10 receives an arbitrary speech feature amount sequence f for which synthetic speech data is to be generated, the speech synthesis device 300 executes the generation processing routine shown in FIG. 13.

In step S400, intermediate speech data x′ is obtained by inputting the arbitrary speech feature sequence f received by the input unit 10 to a neural network (not shown) serving as a generator trained as in the second embodiment.

In step S402, the intermediate speech data x′ obtained in step S400 is input to the neural network stored in the neural network storage unit 40, the resulting synthesized speech data is output to the output unit 90, and the processing ends.
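The two-step generation routine (steps S400 and S402) can be sketched as a simple function composition. This is an illustrative sketch only: `toy_vocoder` and `toy_refiner` are hypothetical stand-ins for the first-stage converter (a vocoder or a trained generator) and the stored neural network, and their arithmetic has no acoustic meaning.

```python
# Hypothetical two-stage pipeline: features -> intermediate speech -> refined speech.

def toy_vocoder(features):
    """Assumed first stage: map a feature sequence f to intermediate speech x'."""
    return [0.5 * f for f in features]

def toy_refiner(x_prime):
    """Assumed second stage standing in for the stored neural network."""
    return [2.0 * v for v in x_prime]

def generate(features, first_stage, second_stage):
    """Run both stages in order and return the synthesized speech."""
    x_prime = first_stage(features)   # corresponds to step S400
    return second_stage(x_prime)      # corresponds to step S402

features = [1.0, 2.0, 3.0]
speech = generate(features, toy_vocoder, toy_refiner)
```

Passing the stages as arguments reflects the text's point that the first stage may be swapped between a vocoder and a generator trained as in the first or second embodiment without changing the second stage.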

As described above, the speech synthesizer according to the third embodiment of the present invention takes as input the intermediate speech data x′, obtained by synthesizing speech from the speech feature sequence, and the true speech data for training. In accordance with equation (4) above, the neural network serving as a generator that produces synthesized speech data from the intermediate speech data x′ is trained so as to optimize the objective function. This makes it possible to train a neural network capable of synthesizing more natural speech.

また、学習したニューラルネットワークを用いて音声を合成することにより、より自然な音声を合成することができる。 Further, by synthesizing the voice using the learned neural network, a more natural voice can be synthesized.

Although the case of using a neural network trained as in the second embodiment for conversion into intermediate speech data has been described as an example, the invention is not limited to this. The speech feature sequence may instead be converted into intermediate speech data using a vocoder or a neural network trained as in the first embodiment.

When a neural network trained as in the first or second embodiment is used for conversion into intermediate speech data, the neural networks obtained after the learning processing of this embodiment may be regarded as pre-trained, and the entire neural network may then be optimized again.

＜本発明の第４の実施の形態に係る概要＞ <Outline of Fourth Embodiment of the Present Invention>

次に、本発明の第４の実施の形態における概要を説明する。 Next, the outline of the fourth embodiment of the present invention will be described.

The third embodiment converts intermediate speech data directly into natural speech, whereas the fourth embodiment adds a naturalness component to the intermediate speech data and converts it into realistic speech. A conceptual diagram of the processing is shown in FIG. 14. The naturalness component described here is given independently of the synthesized speech (for example, a random number).

＜本発明の第４の実施の形態に係る音声合成装置の構成＞ <Structure of speech synthesizer according to fourth embodiment of the present invention>

次に、本発明の第４の実施の形態に係る音声合成装置の構成について説明する。なお、第３の実施の形態と同様となる箇所については同一符号を付して説明を省略する。 Next, the configuration of the speech synthesizer according to the fourth embodiment of the present invention will be described. The same parts as those in the third embodiment are designated by the same reference numerals and the description thereof will be omitted.

As shown in FIG. 15, the speech synthesizer 400 according to the fourth embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores programs for executing the learning processing routine and generation processing routine described later, together with various data. Functionally, as shown in FIG. 15, the speech synthesizer 400 includes an input unit 10, a calculation unit 420, and an output unit 90.

演算部４２０は、学習部４３０と、ニューラルネットワーク記憶部４０と、中間音声変換部３３２と、生成部４５０とを含んで構成されている。 The calculation unit 420 includes a learning unit 430, a neural network storage unit 40, an intermediate voice conversion unit 332, and a generation unit 450.

As described below, the learning unit 430 takes as input intermediate speech data x′ (an intermediate speech signal or intermediate speech spectrum sequence), obtained by synthesizing speech from the speech feature sequence produced by analyzing the speech data x, together with the naturalness component z_2 corresponding to the intermediate speech data x′ and the true speech data x for training. It includes a neural network serving as a generator that produces synthesized speech data from the intermediate speech data x′, and a discriminator that determines whether the synthesized speech data follows the same distribution as the true speech data x. The neural network serving as the generator and the neural network serving as the discriminator are trained under mutually competing optimization conditions.

The learning unit 430 first obtains the speech feature sequence f from the speech data x received by the input unit 10. The speech feature sequence f and the naturalness component z_1 corresponding to it are input to a neural network serving as a generator trained as in the second embodiment, which yields the intermediate speech data x′. Based on the intermediate speech data x′, the naturalness component z_2, and the true speech data x for training, a neural network serving as a generator is trained so that the original true speech data x is produced. A partially modified version of the speech feature sequence f may be used here. For example, the fundamental frequency, which is a typical speech feature, may be multiplied by a randomly chosen constant. The naturalness components z_1 and z_2 are random numbers generated according to some distribution (for example, a uniform distribution).
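The two randomized ingredients mentioned above, scaling the fundamental frequency by a random constant and drawing the naturalness components from a uniform distribution, can be sketched as follows. The scaling range (0.8 to 1.2), the interval of the uniform distribution, the vector length, and the convention that unvoiced frames are marked with 0 are all assumptions for illustration; the text does not specify them.

```python
import random

def scale_f0(f0_track, rng, low=0.8, high=1.2):
    """Multiply a whole F0 contour by one random constant (range is assumed)."""
    factor = rng.uniform(low, high)
    # Frames marked 0 (assumed to mean unvoiced) stay 0 after scaling.
    return [factor * f for f in f0_track], factor

def sample_naturalness(length, rng, low=-1.0, high=1.0):
    """Draw a naturalness component from a uniform distribution (interval assumed)."""
    return [rng.uniform(low, high) for _ in range(length)]

rng = random.Random(0)                 # seeded only to make the sketch repeatable
f0 = [120.0, 0.0, 125.0, 130.0]        # toy F0 contour in Hz
f0_scaled, factor = scale_f0(f0, rng)
z_1 = sample_naturalness(4, rng)       # used when producing x'
z_2 = sample_naturalness(4, rng)       # used when refining x'
```

Drawing z_1 and z_2 separately keeps them independent of each other and of the speech content, in line with the statement that the naturalness component is given independently of the synthesized speech.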

In addition, based on the true speech data x and the synthesized speech data produced by the neural network serving as the generator, a neural network serving as a discriminator is trained to determine whether its input follows the same distribution as the true speech data x. This discriminator network judges whether the input speech data is true or synthesized and outputs the result.

In this embodiment, the evaluation function for the neural network serving as the generator and the neural network serving as the discriminator is optimized according to equation (5) below. In equation (5), G denotes the generator and D denotes the discriminator. The discriminator maximizes the evaluation function so that it can distinguish true speech from synthesized speech as well as possible, while the generator minimizes the evaluation function so that the discriminator judges the synthesized speech to be true speech as often as possible. Optimization thus proceeds as the discriminator and the generator compete.

・・・（５） ...(5)
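The body of equation (5) is not recoverable from this text. As a hedged illustration, the sketch below evaluates the standard adversarial value function, E[log D(x)] + E[log(1 − D(G(x′, z_2)))], which matches the description above (D maximizes it, G minimizes it) but is an assumption rather than the patent's confirmed formula. The discriminator outputs are made-up probabilities.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(x', z_2)))].

    d_real: discriminator outputs in (0, 1) for true speech samples
    d_fake: discriminator outputs in (0, 1) for generated samples
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Discriminator separates real from generated well: the value is high.
confident = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# Generator fools the discriminator: the value drops.
fooled = gan_value(d_real=[0.9, 0.95], d_fake=[0.8, 0.9])
```

The comparison between the two cases shows the competition: a discriminator update tries to move toward the first case, and a generator update tries to move toward the second.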

図１６に第４の実施の形態の学習処理の概念図を示す。 FIG. 16 shows a conceptual diagram of the learning process of the fourth embodiment.

上記（５）式の評価関数を最適化するように学習された、生成器としてのニューラルネットワーク及び識別器としてのニューラルネットワークはニューラルネットワーク記憶部４０に記憶される。 The neural network as a generator and the neural network as a discriminator learned so as to optimize the evaluation function of the equation (5) are stored in the neural network storage unit 40.

なお、以下（６）式のように、中間音声データｘ’も考慮した識別器（Discriminator）を用いた評価関数を最適化するように、生成器としてのニューラルネットワーク及び識別器としてのニューラルネットワークを学習しても良い。 As shown in the following equation (6), a neural network as a generator and a neural network as a discriminator are used so as to optimize an evaluation function using a discriminator that also considers intermediate speech data x′. You may learn.

・・・（６） ...(6)

また、ニューラルネットワークを学習するときに、第３の実施の形態の手法を用いて、生成器としてのニューラルネットワークをPre-trainingしてもよい。 Further, when learning the neural network, the neural network as the generator may be pre-trained by using the method of the third embodiment.

The generation unit 450 inputs the intermediate speech data x′ obtained by the intermediate speech conversion unit 332 and the naturalness component z_2 given in advance to the neural network stored in the neural network storage unit 40, and outputs the resulting synthesized speech data to the output unit 90.

＜本発明の第４の実施の形態に係る音声合成装置の作用＞ <Operation of Speech Synthesis Device According to Fourth Embodiment of the Present Invention>

次に、本発明の第４の実施の形態に係る音声合成装置４００の作用について説明する。音声合成装置４００は、以下に説明する学習処理ルーチンと生成処理ルーチンを実行する。 Next, the operation of the speech synthesizer 400 according to the fourth embodiment of the present invention will be described. The speech synthesizer 400 executes a learning processing routine and a generation processing routine described below.

First, the learning processing routine will be described. When the input unit 10 receives human speech data x as training data, the speech synthesizer 400 executes the learning processing routine shown in FIG. 12 above.

In the learning processing routine of the fourth embodiment, step S304 takes as input the intermediate speech data x′ obtained in step S302, the naturalness component z_2, and the speech data x received by the input unit 10. In accordance with equation (5) above, the neural network serving as the generator and the neural network serving as the discriminator are trained under mutually competing optimization conditions. The trained neural networks are stored in the neural network storage unit 40, and the processing ends.

In the generation processing routine of the fourth embodiment, as shown in FIG. 13 above, step S402 inputs the intermediate speech data x′ obtained in step S400 and the naturalness component z_2 to the neural network stored in the neural network storage unit 40, outputs the resulting synthesized speech data to the output unit 90, and ends the processing.

第４の実施の形態の生成処理ルーチンは、第３の実施の形態と同様であるため説明を省略する。 The generation processing routine of the fourth embodiment is the same as that of the third embodiment, so the description thereof is omitted.

As described above, the speech synthesizer according to the fourth embodiment of the present invention takes as input the intermediate speech data x′, obtained by synthesizing speech from the speech feature sequence, the naturalness component z_2, and the true speech data x for training. In accordance with equation (5) above, a neural network serving as a generator that produces synthesized speech data from the intermediate speech data x′ and a neural network serving as a discriminator that determines whether the synthesized speech data follows the same distribution as the true speech data x are trained under mutually competing optimization conditions. This makes it possible to train a neural network capable of synthesizing more natural speech.

＜本発明の第５の実施の形態に係る概要＞ <Outline of Fifth Embodiment of the Present Invention>

次に、本発明の第５の実施の形態における概要を説明する。 Next, an outline of the fifth embodiment of the present invention will be described.

The speech feature sequence used in the first to fourth embodiments can be one obtained by existing speech analysis, but a speech feature sequence obtained by a neural network can also be used as input. This is because the first to fourth embodiments learn the mapping between speech feature sequences and speech signals in a data-driven manner.

＜本発明の第５の実施の形態に係る音声合成装置の構成＞ <Structure of speech synthesizer according to fifth embodiment of the present invention>

次に、本発明の第５の実施の形態に係る音声合成装置の構成について説明する。なお、第２の実施の形態と同様の構成となる箇所については同一符号を付して説明を省略する。 Next, the configuration of the speech synthesizer according to the fifth embodiment of the present invention will be described. In addition, the same reference numerals are given to the portions having the same configurations as those in the second embodiment, and the description thereof will be omitted.

As shown in FIG. 18, the speech synthesizer 500 according to the fifth embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores programs for executing the learning processing routine and generation processing routine described later, together with various data. Functionally, as shown in FIG. 18, the speech synthesizer 500 includes an input unit 510, a calculation unit 520, and an output unit 90.

入力部５１０は、学習データとして、人間の音声データｘを受け付ける。また、入力部５１０は、合成音声データの生成対象となる任意の音声データを受け付ける。 The input unit 510 receives human voice data x as learning data. The input unit 510 also receives arbitrary voice data that is a target for generating synthetic voice data.

演算部５２０は、音声特徴量生成部５２８と、学習部５３０と、ニューラルネットワーク記憶部４０と、音声特徴量変換部５３２と、生成部２５０とを含んで構成されている。 The calculation unit 520 includes a voice feature amount generation unit 528, a learning unit 530, a neural network storage unit 40, a voice feature amount conversion unit 532, and a generation unit 250.

The speech feature generation unit 528 inputs the speech data x received by the input unit 510 to an Auto Encoder, which is a neural network trained in advance, and outputs the speech feature sequence f produced by the Auto Encoder to the learning unit 530. The neural network used here may be a Variational Auto Encoder trained in advance.

The learning unit 530 takes as input the speech feature sequence f output from the speech feature generation unit 528, the naturalness component z, and the true speech data x for training. It includes a neural network serving as a generator that produces synthesized speech data from the speech feature sequence f and a neural network serving as a discriminator that determines whether the synthesized speech data follows the same distribution as the true speech data x. By the same processing as in the second embodiment, the generator network and the discriminator network are trained under mutually competing optimization conditions.

図１５に第５の実施の形態の学習処理の概念図を示す。 FIG. 15 shows a conceptual diagram of the learning process of the fifth embodiment.

The speech feature conversion unit 532, like the speech feature generation unit 528, inputs the arbitrary speech data received by the input unit 510 as the target for synthesized speech generation to the Auto Encoder, which is a neural network trained in advance, and outputs the speech feature sequence f produced by the Auto Encoder to the generation unit 250.

なお、第５の実施の形態の他の構成は、第２の実施の形態と同様となるため説明を省略する。 The rest of the configuration of the fifth embodiment is the same as that of the second embodiment, so the explanation is omitted.

In the fifth embodiment, the case where the learning unit 530 performs the same processing as in the second embodiment has been described, but the invention is not limited to this. For example, the learning unit 530 may take as input the speech feature sequence f output from the speech feature generation unit 528 and the true speech data x for training, and train a neural network serving as a generator by the same processing as in the first embodiment. Alternatively, intermediate speech data x′ may be obtained from the speech feature sequence f output from the speech feature generation unit 528 by the same processing as in the third embodiment, and a neural network serving as a generator may be trained with the intermediate speech data x′ and the true speech data x for training as input. As a further alternative, intermediate speech data x′ may be obtained from the speech feature sequence f output from the speech feature generation unit 528 and the naturalness component z_1 by the same processing as in the fourth embodiment, and a neural network serving as a generator, or neural networks serving as a generator and a discriminator, may be trained with the intermediate speech data x′, the naturalness component z_2, and the true speech data x for training as input.

＜本発明の第５の実施の形態に係る音声合成装置の作用＞ <Operation of Speech Synthesis Device According to Fifth Embodiment of Present Invention>

次に、本発明の第５の実施の形態に係る音声合成装置５００の作用について説明する。音声合成装置５００は、以下に説明する学習処理ルーチンと生成処理ルーチンを実行する。なお、第２の実施の形態と同様となる箇所については同一符号を付して説明を省略する。 Next, the operation of the speech synthesizer 500 according to the fifth embodiment of the present invention will be described. The speech synthesizer 500 executes a learning processing routine and a generation processing routine described below. The same parts as those in the second embodiment are designated by the same reference numerals and the description thereof will be omitted.

First, the learning processing routine will be described. When the input unit 510 receives human speech data x as training data, the speech synthesizer 500 executes the learning processing routine shown in FIG. 19.

ステップＳ５００では、入力部５１０で受け付けた音声データｘを、予め学習されたニューラルネットワークであるAuto Encoderに入力し、Auto Encoderから出力された音声特徴量系列ｆを学習部５３０に出力する。 In step S500, the voice data x received by the input unit 510 is input to Auto Encoder, which is a neural network learned in advance, and the voice feature amount sequence f output from the Auto Encoder is output to the learning unit 530.

In step S102, the speech feature sequence f obtained in step S500, the naturalness component z given in advance, and the speech data x received by the input unit 510 are taken as input. In accordance with equation (2) above, the neural network serving as the generator and the neural network serving as the discriminator are trained under mutually competing optimization conditions. The trained neural networks are stored in the neural network storage unit 40, and the processing ends.

Next, the generation processing routine will be described. When the input unit 510 receives the speech data that is the target for synthesized speech generation, the speech synthesizer 500 executes the generation processing routine shown in FIG. 20.

ステップＳ５００では、入力部５１０で受け付けた音声データを、予め学習されたニューラルネットワークであるAuto Encoderに入力し、Auto Encoderから出力された音声特徴量系列ｆを生成部２５０に出力する。 In step S500, the voice data received by the input unit 510 is input to Auto Encoder, which is a neural network learned in advance, and the voice feature amount sequence f output from the Auto Encoder is output to the generation unit 250.

なお、第５の実施の形態の他の作用は、第２の実施の形態と同様であるため説明を省略する。 Note that the other operations of the fifth embodiment are similar to those of the second embodiment, and therefore description thereof will be omitted.

As described above, in the speech synthesizer according to the fifth embodiment of the present invention, the speech data x is input to an Auto Encoder, which is a neural network trained in advance, and the speech feature sequence f produced by the Auto Encoder is output. The speech feature sequence f, the naturalness component z, and the true speech data x for training are then taken as input. In accordance with equation (2) above, a neural network serving as a generator that synthesizes speech data from the speech feature sequence f and a discriminator that determines whether the synthesized speech data follows the same distribution as the true speech data x are trained under mutually competing optimization conditions. This makes it possible to train a neural network capable of synthesizing more natural speech.

<Experimental Results 1>

To demonstrate the effectiveness of the third and fourth embodiments, an experiment was conducted using one implementation.

Experimental data: 115 conversational sentences spoken by a single speaker, taken from ATR Speech Data, were used for the experiment. 90% of this data was used for model training and the remaining 10% for testing. The sampling frequency of the speech signals is 16,000 Hz.

In the third and fourth embodiments, the input to the generator is a speech signal or speech spectrum sequence generated by a vocoder or by a neural network with equivalent inputs and outputs. In this experiment, a vocoder was used to generate the speech signal, and the speech spectrum sequence x' obtained by applying the preprocessing described below to that signal was used as the input. LPC analysis-synthesis was used as the specific analysis-synthesis method. The role of the generator, which is a neural network, is to convert the sound produced by this analysis-synthesis into a real voice like the original speech signal.

The preprocessing mentioned above is as follows. First, a short-time Fourier transform (STFT) was applied to each speech signal to convert it into a complex spectrum sequence. The Fourier transform window width was 512 and the shift width was 128, and a Blackman window was used as the window function. Next, the absolute value of the complex spectrum sequence was taken to obtain an amplitude spectrum sequence. The base-10 logarithm of this amplitude spectrum was then taken and multiplied by 20 to obtain a log-amplitude spectrum. Finally, segments of a fixed number of frames were cut out of the resulting spectrum sequence and used as the input to the generator. In the experiment, the segment length was 21 frames.
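The preprocessing steps can be sketched in NumPy as follows. The window width (512), shift width (128), Blackman window, 20·log10 scaling, and 21-frame segment length come from the description. The function names, the small constant `eps` that avoids log(0), the absence of padding at the signal edges, and the use of non-overlapping segments are assumptions made for illustration.

```python
import numpy as np

def log_amplitude_spectrogram(signal, win_length=512, hop=128, eps=1e-10):
    """STFT -> amplitude -> 20*log10, returning an array of shape (frames, bins)."""
    signal = np.asarray(signal, dtype=float)
    window = np.blackman(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop
    frames = np.stack([
        signal[i * hop : i * hop + win_length] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)   # complex spectrum sequence
    amplitude = np.abs(spectrum)             # amplitude spectrum sequence
    return 20.0 * np.log10(amplitude + eps)  # log-amplitude spectrum

def cut_segments(log_spec, segment_frames=21):
    """Cut the spectrum sequence into non-overlapping fixed-length segments."""
    n_segments = log_spec.shape[0] // segment_frames
    return [
        log_spec[k * segment_frames : (k + 1) * segment_frames]
        for k in range(n_segments)
    ]
```

With a window width of 512, each frame has 257 frequency bins, so a 21-frame segment holds 21 × 257 values.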

The output of the generator is a log-amplitude spectrum of the same kind as the input, so the following processing was performed to convert it back into a speech signal. First, the log-amplitude spectrum was divided by 20, and 10 was raised to the power of the resulting value to obtain an amplitude spectrum. The phase was then reconstructed using Griffin-Lim, and the result was converted into a speech signal.
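The inverse processing can be sketched as follows. The division by 20 and the power of 10 follow the description. The Griffin-Lim loop is a minimal NumPy version written for illustration, not the implementation used in the experiment. The iteration count, the random phase initialization, the helper names, and the normalization threshold in `istft` are all assumptions.

```python
import numpy as np

def db_to_amplitude(log_spec):
    """Undo the 20*log10 scaling: amplitude = 10 ** (log_spec / 20)."""
    return 10.0 ** (np.asarray(log_spec, dtype=float) / 20.0)

def stft(x, window, hop):
    n = len(window)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[i * hop : i * hop + n] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, window, hop):
    """Weighted overlap-add inverse STFT (least-squares estimate)."""
    n = len(window)
    length = n + hop * (spec.shape[0] - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n, axis=1)
    for i, frame in enumerate(frames):
        x[i * hop : i * hop + n] += frame * window
        norm[i * hop : i * hop + n] += window ** 2
    # Samples with almost no window coverage (signal edges) are set to zero.
    return np.where(norm > 1e-3, x / np.maximum(norm, 1e-3), 0.0)

def griffin_lim(amplitude, window, hop, n_iter=50, seed=0):
    """Estimate a signal whose STFT amplitude approximates `amplitude`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(amplitude.shape))
    for _ in range(n_iter):
        signal = istft(amplitude * phase, window, hop)
        phase = np.exp(1j * np.angle(stft(signal, window, hop)))
    return istft(amplitude * phase, window, hop)
```

Each iteration keeps the given amplitude and replaces only the phase with that of a signal that is actually realizable as an STFT, which is the core idea of Griffin-Lim.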

FIG. 21 shows an implementation example of the learning method of the third embodiment, and FIG. 22 shows an implementation example of the learning method of the fourth embodiment.

FIG. 23 shows an implementation example of the generation method of the third embodiment, and FIG. 24 shows an implementation example of the generation method of the fourth embodiment.

As for the network structure, in both the third and fourth embodiments the generator and the discriminator each had three hidden layers with 500 units per layer, and the layers were fully connected. FIG. 25 and FIG. 26 show the specific network structures of the third and fourth embodiments, respectively.
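A fully connected network with three hidden layers of 500 units can be sketched in NumPy as follows. Only the layer count, unit count, and fully connected structure come from the description. The ReLU activation, the weight initialization, the function names, and the flattened 21 × 257 input size used in the usage note are assumptions.

```python
import numpy as np

def init_mlp(n_in, n_out, n_hidden=500, n_layers=3, seed=0):
    """Create (weight, bias) pairs for n_layers hidden layers plus an output layer."""
    rng = np.random.default_rng(seed)
    sizes = [n_in] + [n_hidden] * n_layers + [n_out]
    return [
        (rng.normal(0.0, 0.01, size=(a, b)), np.zeros(b))
        for a, b in zip(sizes[:-1], sizes[1:])
    ]

def forward(params, x):
    """Forward pass: ReLU on hidden layers (assumed), linear output layer."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    return h @ w + b
```

For example, `init_mlp(21 * 257, 21 * 257)` would give a generator that maps one flattened 21-frame log-amplitude segment to another of the same size.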

The purpose of this method is to convert analysis-synthesis sound into sound close to a real voice. To demonstrate the effectiveness of the proposed framework, the following four types of synthesized sound were considered.

1. Volume change: the original sound with its level halved
2. Pre-emphasis: the original sound with high-frequency emphasis applied
3. LPC: LPC analysis-synthesis sound
4. LPC+pulse: sound generated by combining the LPC coefficients obtained by LPC analysis with a pulse signal generated at regular intervals (every 128 samples)
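Three of the signal manipulations in the list above can be sketched as follows. The halving of the level and the 128-sample pulse interval come from the description. The first-order pre-emphasis filter and its coefficient 0.97 are assumptions, since the description does not state how the high-frequency emphasis was done. The LPC analysis itself is omitted, and all function names are hypothetical.

```python
import numpy as np

def volume_change(x, gain=0.5):
    """Scale the waveform amplitude (0.5 halves it)."""
    return gain * np.asarray(x, dtype=float)

def pre_emphasis(x, coeff=0.97):
    """First-order high-frequency emphasis: y[n] = x[n] - coeff * x[n-1]."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] = x[1:] - coeff * x[:-1]
    return y

def pulse_train(n_samples, period=128):
    """Unit pulses at regular intervals, usable as an excitation source."""
    pulses = np.zeros(n_samples)
    pulses[::period] = 1.0
    return pulses
```

At the 16,000 Hz sampling frequency of the experiment, one pulse every 128 samples corresponds to a fixed fundamental frequency of 16000 / 128 = 125 Hz.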

FIG. 27 shows an example of the waveform of the speech signal from which the inputs and outputs were derived.

FIG. 28 shows the experimental results for Volume change. Looking at the amplitude of the waveform data of the speech signal, the synthesized sound has half the amplitude of the original sound, whereas with the methods of the third and fourth embodiments an amplitude equivalent to that of the original sound is reproduced in both cases.

FIG. 29 shows the experimental results for Pre-emphasis. Looking at the speech spectrum sequence, the synthesized sound has smaller values in the low-frequency region than the original sound, whereas with the methods of the third and fourth embodiments the values return to roughly the same magnitude as in the original sound. Looking at the amplitude of the waveform data of the speech signal, the synthesized sound is smaller overall than the original sound, whereas with the methods of the third and fourth embodiments an amplitude equivalent to that of the original sound is reproduced in both cases.

FIG. 30 shows the experimental results for LPC. Looking at the speech spectrum sequence, the synthesized sound differs from the original sound in that it has values in the lowest frequency region (0) and its values also extend into the high-frequency region, whereas with the methods of the third and fourth embodiments the spectrum returns to a shape equivalent to that of the original sound. Looking at the amplitude of the waveform data of the speech signal, the synthesized sound is larger overall than the original sound, whereas with the methods of the third and fourth embodiments an amplitude equivalent to that of the original sound is reproduced in both cases.

FIG. 31 shows the experimental results for LPC+pulse. Looking at the speech spectrum sequence, the synthesized sound differs from the original sound in that it shows evenly spaced stripes, whereas with the methods of the third and fourth embodiments the spectrum in both cases returns to a shape closer to the original sound than the synthesized sound.

<Experimental Results 2>
Next, the experimental results for the first and second embodiments are shown.

Experimental data: the data set was the same as in the experiment for the third and fourth embodiments described above. In the first and second embodiments a speech feature sequence is used as the input, and in this experiment speech features obtained by LPC analysis were used, specifically pitch and LPC. Pitch is a 1-dimensional feature per frame, and LPC is a 26-dimensional feature because the order of the LPC analysis was set to 25. Together they form a 27-dimensional feature per frame. The log-amplitude spectrum was used as the output. In this experiment, the frame length of the data actually processed was 1. The ultimate goal is to obtain a speech signal, which requires reconstructing the speech signal from the log-amplitude spectrum obtained as output. This was done with the same method as described in the experiment for the third and fourth embodiments.
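The assembly of the 27-dimensional input can be sketched as follows. The dimensions (1 for pitch, 26 for LPC) come from the description. The function name `build_features` and the ordering of pitch before LPC are assumptions, and the LPC analysis itself is not shown.

```python
import numpy as np

def build_features(pitch, lpc):
    """Concatenate per-frame pitch (n_frames,) and LPC (n_frames, 26) features."""
    pitch = np.asarray(pitch, dtype=float).reshape(-1, 1)
    lpc = np.asarray(lpc, dtype=float)
    if lpc.shape[0] != pitch.shape[0]:
        raise ValueError("pitch and LPC must have the same number of frames")
    return np.concatenate([pitch, lpc], axis=1)
```

Since the frame length was 1 in this experiment, each row of the returned array would be one network input, mapped to one frame of the log-amplitude spectrum.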

FIG. 32 and FIG. 33 show implementation examples of the network structures of the first and second embodiments. In both embodiments, the network had three hidden layers with 500 units per layer, and the layers were fully connected.

FIG. 34 shows the results of speech reconstruction by the methods of the first and second embodiments. These results show that, although the speech features used as input have only 27 dimensions, the networks of the first and second embodiments reproduce a harmonic structure (the striped pattern in the spectrum) with characteristics similar to the original sound.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, in the embodiments described above, the device is configured as a speech synthesis device including a learning unit that trains the neural network and a generation unit that synthesizes speech. The invention is not limited to this configuration, and the device may be divided into a speech synthesis learning device that includes the learning unit and a speech synthesis device that includes the generation unit.

A CNN, an RNN, or the like may also be used as the neural network in the embodiments described above.

10, 510 Input unit
20, 220, 320, 420, 520 Arithmetic unit
30, 230, 330, 430, 530 Learning unit
40 Neural network storage unit
50, 250, 350, 450 Generation unit
90 Output unit
100, 200, 300, 400, 500 Speech synthesis device
332 Intermediate speech conversion unit
528, 532 Speech feature generation unit

Claims

A speech synthesis learning device that learns a neural network for synthesizing speech from arbitrary speech data or a speech feature sequence, the device comprising:
a learning unit that receives, as input, speech data or a speech feature sequence and true speech data for learning, and that,
for a neural network as a first generator trained in advance to generate intermediate speech data from the speech data or the speech feature sequence and the true speech data for learning, and
a neural network as a second generator trained to generate synthesized speech data from the intermediate speech data and the true speech data for learning,
inputs the speech data or the speech feature sequence to the neural network as the first generator to obtain the intermediate speech data,
inputs the obtained intermediate speech data to the neural network as the second generator to generate the synthesized speech data, and
trains the neural network as the second generator either so as to optimize an objective function representing the distance between the generated synthesized speech data and the true speech data for learning, or so that the neural network as the second generator and a neural network as a discriminator that discriminates whether the generated synthesized speech data follows the same distribution as the true speech data for learning follow mutually competing optimization conditions,
wherein the neural network as the first generator
has been trained in advance either so as to optimize an objective function representing the distance between the intermediate speech data and the true speech data for learning, or so that the neural network as the first generator and a neural network as a discriminator that discriminates whether the intermediate speech data follows the same distribution as the true speech data for learning follow mutually competing optimization conditions.

The speech synthesis learning device according to claim 1, wherein the neural network as the first generator has been trained in advance to generate the intermediate speech data from the speech data or the speech feature sequence, a naturalness component given independently of the speech data sequence, and the true speech data for learning.

The speech synthesis learning device according to claim 1, wherein the intermediate speech data and a naturalness component given independently of the intermediate speech data are input to the neural network as the second generator to generate the synthesized speech data, and
the learning unit trains the neural network as the second generator either so as to optimize an objective function representing the distance between the generated synthesized speech data and the true speech data for learning, or so that the neural network as the second generator and a neural network as a discriminator that discriminates whether the synthesized speech data and the true speech data for learning follow the same distribution follow mutually competing optimization conditions.

The speech synthesis learning device according to any one of claims 1 to 3, wherein the neural network as the first generator has been trained in advance so that the neural network as the first generator and a neural network as a discriminator that discriminates whether the intermediate speech data and the true speech data for learning follow the same distribution follow mutually competing optimization conditions.

The speech synthesis learning device according to any one of claims 1 to 3, wherein the neural network as the second generator is trained so that the neural network as the second generator and a neural network as a discriminator that discriminates whether the intermediate speech data and the true speech data for learning follow the same distribution follow mutually competing optimization conditions.

The speech synthesis learning device according to any one of claims 1 to 5, wherein, in the learning unit, the neural network as the first generator and the neural network as the second generator are each trained in advance, and the entire neural network comprising the neural network as the first generator and the neural network as the second generator is optimized.

The speech synthesis learning device according to any one of claims 1 to 5, wherein the neural network as the first generator is trained in advance, and the entire neural network comprising the neural network as the first generator and the neural network as the second generator is optimized.