JP2020027168A

JP2020027168A - Learning device, learning method, voice synthesis device, voice synthesis method and program

Info

Publication number: JP2020027168A
Application number: JP2018151611A
Authority: JP
Inventors: ティルオンヒュウ (T Luon Hugh); 山岸　順一 (Junichi Yamagishi)
Original assignee: Research Organization of Information and Systems
Current assignee: Research Organization of Information and Systems
Priority date: 2018-08-10
Filing date: 2018-08-10
Publication date: 2020-02-20
Anticipated expiration: 2038-08-10
Also published as: JP7109071B2

Abstract

To provide a speech synthesis technique for unknown speakers using a neural network structure that can handle both supervised adaptation and unsupervised adaptation. SOLUTION: The learning device includes a text modality neural network (text modality NN) that converts text data into a first vector, a speech modality NN that converts speech waveform data into a second vector, and a common NN that generates, from the first or second vector, an acoustic feature corresponding to a speaker code vector in a speaker space. The text modality NN and the common NN are trained on first training data composed of text data and acoustic features, and the speech modality NN and the common NN are trained on second training data composed of speech waveform data and acoustic features. The speaker code vector for a given speaker is estimated by selectively using either the text modality NN and the common NN or the speech modality NN and the common NN, depending on third training data of that speaker. SELECTED DRAWING: Figure 1

Description

The present invention relates generally to speech synthesis technology, and more particularly to speaker adaptation technology for unknown speakers using neural networks.

With recent advances in deep learning, research and development of speech synthesis systems that use neural networks has been progressing.

One example of a speech synthesis system is a system for a specific speaker. In such a system, pairs of the specific speaker's speech data and text data are used as training data to train a neural network that converts text data into speech data corresponding to that speaker, and the trained neural network is then used to reproduce input text data in the voice of that speaker.

Another example is a speech synthesis system for multiple speakers. In such a system, pairs of speech data and text data from multiple speakers are used as training data to train a neural network that converts text data into speech data corresponding to any designated one of those speakers, and the trained neural network is then used to reproduce input text data in the voice of the designated speaker.

Yet another example is a speech synthesis system for unknown speakers. Typically, starting from the multi-speaker speech synthesis system described above, speech data and/or text data of an unknown speaker is used as training data to train a neural network that converts text data into speech data corresponding to that unknown speaker. The trained neural network is then used to reproduce input text data in the voice of the unknown speaker.

Speech synthesis systems for unknown speakers include those that use pairs of the unknown speaker's speech data and text data as training data (called supervised adaptation) and those that use only the unknown speaker's speech data as training data (called unsupervised adaptation).

"Neural Voice Cloning with a Few Samples", Sercan O. Arik, et al., arXiv: 1802.06006, Mar. 20, 2018.
"Fitting New Speakers Based on a Short Untranscribed Sample", Eliya Nachmani, et al., arXiv: 1802.06984, Feb. 20, 2018.

In the prior art, speech synthesis systems for unknown speakers based on supervised adaptation and those based on unsupervised adaptation have been designed independently of each other, and no speech synthesis system currently exists that can handle both supervised and unsupervised adaptation. A speech synthesis system for unknown speakers with a neural network structure that can handle either case is therefore desired.

In view of the above problem, an object of the present invention is to provide a speech synthesis technique for unknown speakers that uses a neural network structure capable of handling both supervised adaptation and unsupervised adaptation.

To solve the above problem, one aspect of the present invention relates to a learning device having a memory and a processor. The memory stores a text modality neural network that converts text data into a first vector, a speech modality neural network that converts speech waveform data into a second vector, and a common neural network that is connected to the text modality neural network and the speech modality neural network and generates, from the first vector or the second vector, acoustic features corresponding to a speaker code vector in a speaker space. The processor trains the text modality neural network and the common neural network with first training data composed of text data and acoustic features, trains the speech modality neural network and the common neural network with second training data composed of speech waveform data and acoustic features, and, depending on third training data of a given speaker, selectively uses either the text modality neural network and the common neural network or the speech modality neural network and the common neural network to estimate the speaker code vector for the given speaker.

According to the present invention, a speech synthesis technique for unknown speakers can be provided that uses a neural network structure capable of handling both supervised adaptation and unsupervised adaptation.

FIG. 1 is a schematic diagram of a neural network structure according to one embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of a learning device and a speech synthesis device according to one embodiment of the present invention.
FIG. 3 is a schematic diagram showing a learning process according to one embodiment of the present invention.
FIG. 4 is a flowchart showing a learning process according to one embodiment of the present invention.
FIG. 5 is a schematic diagram showing a learning process according to another embodiment of the present invention.
FIG. 6 is a flowchart showing a learning process according to another embodiment of the present invention.
FIG. 7 is a schematic diagram showing an unknown speaker adaptation process according to one embodiment of the present invention.
FIG. 8 is a flowchart showing an unknown speaker adaptation process according to one embodiment of the present invention.
FIG. 9 is a schematic diagram showing a speech synthesis process according to one embodiment of the present invention.
FIG. 10 is a flowchart showing a speech synthesis process according to one embodiment of the present invention.
FIG. 11 is a diagram showing experimental results of learning processes according to various embodiments of the present invention.

The following embodiments disclose a learning device 100 that trains a neural network capable of handling both supervised adaptation and unsupervised adaptation, and a speech synthesis device 200 for unknown speakers that uses this neural network.
[Overview]
In outline, the learning device 100 trains a text modality neural network 20 that converts text data into a vector, a speech modality neural network 30 that converts speech waveform data into a vector, and a common neural network 40 that generates, from the vectors output by the text modality neural network 20 and the speech modality neural network 30, acoustic features corresponding to a speaker code vector (latent variable) that represents a given unknown speaker in a speaker space.

First, in the learning process for the neural network structure 10, which consists of the text modality neural network 20, the speech modality neural network 30, and the common neural network 40, the learning device 100 takes training data composed of pairs of text data and acoustic features, inputs the text data to the text modality neural network 20, and inputs the vector output from the text modality neural network 20 to the common neural network 40. Likewise, for training data composed of pairs of speech waveform data and acoustic features, the learning device 100 inputs the speech waveform data to the speech modality neural network 30 and inputs the vector obtained from the speech modality neural network 30 to the common neural network 40. Then, as described in detail in the embodiments below, the learning device 100 trains the text modality neural network 20, the speech modality neural network 30, and the common neural network 40 based on the acoustic features output from the common neural network 40 and the acoustic features in the training data.

Next, in the unknown speaker adaptation process, the learning device 100 uses the trained text modality neural network 20, speech modality neural network 30, and common neural network 40 to estimate a speaker code vector indicating the position of the unknown speaker in the speaker space. That is, given training data of a given speaker, the learning device 100 selectively uses either the text modality neural network 20 and the common neural network 40 or the speech modality neural network 30 and the common neural network 40, depending on whether the training data is speech data with text or speech data only. It thereby estimates the speaker code vector (latent variable) that represents that speaker in the speaker space of the common neural network 40, and generates a per-speaker common neural network 40 in which the estimated latent variable is embedded.
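The path selection and speaker code estimation described above can be illustrated with a minimal sketch. This section of the text does not state how the estimate is computed, so the approach here (gradient descent on the speaker code alone, with all network weights frozen) is an assumption, as are the scalar toy "networks", the parameter values, and every name in the code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Frozen toy parameters standing in for the trained networks (hypothetical values).
W_TEXT, W_SPEECH = 0.8, 0.5      # one-weight "encoders" for text and speech
W_C, B_C, W_D = 1.2, -0.1, 0.7   # common layer: weight, bias, speaker-code weight

def encode(sample):
    """Use the text path when a transcript is present, otherwise the speech path."""
    if sample.get("text") is not None:
        return sigmoid(W_TEXT * sample["text"])
    return sigmoid(W_SPEECH * sample["waveform"])

def predict(sample, d):
    """Common layer output for speaker code d."""
    return sigmoid(W_C * encode(sample) + B_C + W_D * d)

def estimate_speaker_code(samples, steps=500, lr=5.0):
    """Gradient descent on d only; all network weights stay fixed (an assumption)."""
    d = 0.0
    for _ in range(steps):
        grad = 0.0
        for s in samples:
            y = predict(s, d)
            # Derivative of 0.5 * (y - target)^2 with respect to d.
            grad += (y - s["acoustic"]) * y * (1.0 - y) * W_D
        d -= lr * grad / len(samples)
    return d

# Synthetic adaptation data generated with a "true" speaker code of 0.6.
TRUE_D = 0.6
supervised = [{"text": 1.0, "waveform": 2.0}, {"text": -0.5, "waveform": 0.3}]
unsupervised = [{"text": None, "waveform": 2.0}, {"text": None, "waveform": 0.3}]
for s in supervised + unsupervised:
    s["acoustic"] = predict(s, TRUE_D)

d_sup = estimate_speaker_code(supervised)      # estimated through the text path
d_unsup = estimate_speaker_code(unsupervised)  # estimated through the speech path
```

Both calls optimize the same quantity through the same common layer; only the encoder that feeds it differs, which is what lets one structure serve both adaptation cases.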

The speech synthesis device 200 uses the trained text modality neural network 20 and common neural network 40 of the neural network structure 10, trained in this way for each unknown speaker by the learning device 100, to generate speech data corresponding to the unknown speaker from given text data.
[Neural network structure]
First, a neural network structure 10 according to one embodiment of the present invention is described with reference to FIG. 1. FIG. 1 is a schematic diagram of the neural network structure 10 according to one embodiment of the present invention.

As shown in FIG. 1, the neural network structure 10 has a text modality neural network 20, a speech modality neural network 30, and a common neural network 40. The text modality neural network 20 and the speech modality neural network 30 are each connected to the common neural network 40.

The text modality neural network 20 is a neural network with any layer configuration that converts input text data (for example, linguistic features) into a vector for input to the common neural network 40. In the illustrated embodiment, the text modality neural network 20 is a feedforward neural network with $N^{(L)}$ layers; it receives the text data vector $l$ at its input layer and passes it to the hidden layers. Each of the $N^{(L)}$ hidden layers linearly transforms the vector passed from the preceding layer using a matrix $W_L$ and a bias vector $b_L$, feeds the transformed vector to an activation function $\sigma$ (for example, a sigmoid function), and passes the vector output by $\sigma$ to the following layer. The output layer passes the vector received from the last hidden layer to the input layer of the common neural network 40.

Formally, given the text data vector $l$, the first hidden layer outputs the vector $h_1$ by

$$h_1 = \sigma(W_{L,1}\, l + b_{L,1}).$$

Each subsequent hidden layer performs the same kind of transformation, and the $N^{(L)}$-th hidden layer, given the vector $h_{N^{(L)}-1}$ from the preceding hidden layer, outputs the vector $h_{N^{(L)}}$ by

$$h_{N^{(L)}} = \sigma(W_{L,N^{(L)}}\, h_{N^{(L)}-1} + b_{L,N^{(L)}})$$

and passes it to the output layer. These vectors and matrices are learned in the learning process described later.
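As a minimal sketch of the layer-by-layer computation $h_n = \sigma(W_{L,n} h_{n-1} + b_{L,n})$, the following applies a stack of sigmoid layers to a toy input. The layer sizes, weight values, and function names are illustrative assumptions, not values from the source.

```python
import math

def sigmoid(x):
    """Logistic activation, used here as the example sigma."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(W, b, x):
    """One hidden layer: sigma(W x + b), with W given as a list of rows."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)]

def encode_text(l, layers):
    """Apply the hidden layers in order; `layers` is a list of (W, b) pairs."""
    h = l
    for W, b in layers:
        h = dense(W, b, h)
    return h

# Hypothetical toy sizes: 3 linguistic features -> 2 units -> 2 units.
layers = [
    ([[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
]
h_out = encode_text([1.0, 0.0, 1.0], layers)  # vector handed to the common network
```

The same routine would serve for the speech modality network, since the text states that its hidden layers perform the same processing on a different input vector.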

The speech modality neural network 30 is a neural network with any layer configuration that converts input speech data (for example, a speech waveform) into a vector for input to the common neural network 40. In the illustrated embodiment, the speech modality neural network 30 is a feedforward neural network with $N^{(S)}$ layers; it receives the speech data vector $s$ at its input layer and passes it to the hidden layers. Each of the $N^{(S)}$ hidden layers linearly transforms the vector passed from the preceding layer using a matrix $W_S$ and a bias vector $b_S$, feeds the transformed vector to an activation function $\sigma$ (for example, a sigmoid function), and passes the vector output by $\sigma$ to the following layer. The output layer passes the vector received from the last hidden layer to the input layer of the common neural network 40. The processing in each hidden layer is the same as in the text modality neural network 20 described above, so the description is not repeated.

The common neural network 40 is a neural network with any layer configuration that converts the vectors passed from the text modality neural network 20 and the speech modality neural network 30 into acoustic features. In the illustrated embodiment, the common neural network 40 is a feedforward neural network with $N^{(C)}$ layers; it receives the vector input from the text modality neural network 20 or the speech modality neural network 30 at its input layer and passes it to the hidden layers. Each of the $N^{(C)}$ hidden layers linearly transforms the vector passed from the preceding layer using a matrix $W_C$ and a bias vector $b_C$, feeds the transformed vector to an activation function $\sigma$ (for example, a sigmoid function), and passes the vector output by $\sigma$ to the following layer. The output layer outputs the vector received from the last hidden layer, which represents the acoustic features.
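The wiring of the three networks can be sketched as follows: two modality-specific encoders with different input sizes map into the same hidden size, so a single common layer can accept either. Each network is reduced to one layer here, and the dimensions, weights, and the `forward` helper are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(W, b, x):
    """sigma(W x + b) with W given as a list of rows."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)]

def forward(x, modality, params):
    """Route x through the encoder for `modality`, then the shared common layer."""
    W, b = params[modality]      # modality-specific encoder ("text" or "speech")
    h = dense(W, b, x)
    Wc, bc = params["common"]    # the same common layer is used for both inputs
    return dense(Wc, bc, h)

# Hypothetical sizes: text input 3, speech input 4, shared hidden 2, output 1.
params = {
    "text": ([[0.2, -0.1, 0.4], [0.3, 0.0, -0.2]], [0.0, 0.1]),
    "speech": ([[0.1, 0.1, 0.1, 0.1], [-0.2, 0.2, -0.2, 0.2]], [0.0, 0.0]),
    "common": ([[0.5, -0.5]], [0.1]),
}
y_text = forward([1.0, 0.0, 1.0], "text", params)
y_speech = forward([0.1, 0.2, 0.3, 0.4], "speech", params)
```

The only structural constraint this exposes is that both encoders must emit vectors of the size the common network expects.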

The common neural network 40 further includes a speaker code vector (latent variable) representing a given speaker, estimated by the unknown speaker adaptation process described later. In other words, the common neural network 40 is trained for each speaker by the learning device 100 in the unknown speaker adaptation process. A hidden layer that is given the estimated speaker code vector, which represents the given speaker in the speaker space, applies a linear transformation to the vector passed from the preceding layer and to the speaker code vector, feeds the transformed vector to an activation function $\sigma$ (for example, a sigmoid function), and passes the vector output by $\sigma$ to the following layer.

Formally, a hidden layer that is given the speaker code vector receives the vector $h_{n-1}$ from the preceding layer and the speaker code vector $d^{(i)}$ of speaker $i$, and obtains the vector $h_n$ by

$$h_n = \sigma(W_{C,n}\, h_{n-1} + b_{C,n} + W_D\, d^{(i)}).$$

Here, $W_D$ is the weight matrix for the speaker code. The processing in each hidden layer that does not receive the speaker code vector is the same as in the text modality neural network 20 described above, so the description is not repeated.
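The speaker-code layer differs from an ordinary hidden layer only by the extra term $W_D d^{(i)}$ inside the activation. A minimal sketch, with hypothetical toy matrices and function names:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def speaker_layer(W_c, b_c, W_d, h_prev, d):
    """h_n = sigma(W_C,n h_{n-1} + b_C,n + W_D d), where d is the speaker code."""
    a = matvec(W_c, h_prev)
    s = matvec(W_d, d)
    return [sigmoid(ai + bi + si) for ai, bi, si in zip(a, b_c, s)]

# Toy example: 2 hidden units, 1-dimensional speaker code (illustrative values).
W_c = [[1.0, 0.0], [0.0, 1.0]]
b_c = [0.0, 0.0]
W_d = [[1.0], [-1.0]]
h_prev = [0.0, 0.0]

h_neutral = speaker_layer(W_c, b_c, W_d, h_prev, [0.0])  # code has no effect at zero
h_shifted = speaker_layer(W_c, b_c, W_d, h_prev, [2.0])  # same input, different speaker
```

With the layer weights held fixed, changing only `d` shifts the pre-activation of every unit, which is how a single low-dimensional vector can move the output toward a particular speaker.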

Note that although the speaker code vector is passed to a single hidden layer in the illustrated embodiment, the invention is not limited to this, and the speaker code vector may be passed to multiple hidden layers depending on the layer configuration of the common neural network 40.
[Hardware configuration]
The learning device 100 and the speech synthesis device 200 may have, for example, a hardware configuration as shown in FIG. 2, comprising a processor 101 such as a CPU (Central Processing Unit) or GPU (Graphics Processing Unit), a memory 102 such as RAM (Random Access Memory) or flash memory, a hard disk 103, and an input/output (I/O) interface 104.

The processor 101 executes the various processes of the learning device 100 and the speech synthesis device 200.

The memory 102 stores various data and programs of the learning device 100 and the speech synthesis device 200, and in particular functions as working memory for working data, running programs, and the like. Specifically, the memory 102 stores a program that implements the neural network structure 10 loaded from the hard disk 103, programs for executing and controlling various processes, and so on, and functions as working memory while the processor 101 executes these programs.

The hard disk 103 stores various data and programs of the learning device 100 and the speech synthesis device 200.

The I/O interface 104 is an interface for receiving commands, input data, and the like from the user, displaying or playing back output results, and exchanging data with external devices. For example, the I/O interface 104 comprises devices for inputting and outputting various data, such as USB (Universal Serial Bus), a communication line, a keyboard, a mouse, a display, a microphone, and a speaker.

However, the learning device 100 and the speech synthesis device 200 according to the present invention are not limited to the hardware configuration described above and may have any other suitable hardware configuration. For example, one or more of the processes performed by the learning device 100 and the speech synthesis device 200 may be implemented by hard-wired processing circuitry or electronic circuitry.
[First Learning Process of Neural Network Structure]
Next, a learning process for the neural network structure 10 according to one embodiment of the present invention is described with reference to FIGS. 3 and 4. As can be understood from the internal configuration of the neural network structure 10 described above, the learning device 100 needs to train the neural network structure 10 so that the common neural network 40 properly accepts input from the two different modalities, text data and speech data.

FIG. 3 is a schematic diagram showing a learning process according to one embodiment of the present invention. In this embodiment, as shown in FIG. 3, the learning device 100 has the text modality neural network 20 and the speech modality neural network 30 share the common neural network 40, and trains the two copies of the common neural network 40 simultaneously, that is, synchronously so that the parameters (for example, the hidden-layer weight matrices) of the two copies remain identical.

Specifically, for training data composed of pairs of text data and acoustic features, the learning device 100 inputs the text data to the text modality neural network 20 and obtains the vector output from the text modality neural network 20. The learning device 100 then inputs the obtained vector to the common neural network 40, obtains the acoustic features output from the common neural network 40, and calculates the error ($\mathrm{loss}_{\mathrm{main}}$) between the obtained acoustic features and the acoustic features of the training data.

Likewise, for training data composed of pairs of speech waveform data and acoustic features, the learning device 100 inputs the speech waveform data to the speech modality neural network 30 and obtains the vector output from the speech modality neural network 30. The learning device 100 then inputs the obtained vector to the common neural network 40, obtains the acoustic features output from the common neural network 40, and calculates the error ($\mathrm{loss}_{\mathrm{sub}}$) between the obtained acoustic features and the acoustic features of the training data.

The learning device 100 then trains the text modality neural network 20, the speech modality neural network 30, and the common neural network 40 based on a weighted sum of the two calculated errors ($\mathrm{loss}_{\mathrm{main}}$, $\mathrm{loss}_{\mathrm{sub}}$). For example, the learning device 100 may calculate the weighted sum ($\mathrm{loss}$) of the error $\mathrm{loss}_{\mathrm{main}}$ from the text modality neural network 20 and the common neural network 40 and the error $\mathrm{loss}_{\mathrm{sub}}$ from the speech modality neural network 30 and the common neural network 40 according to

$$\mathrm{loss} = \mathrm{loss}_{\mathrm{main}} + \alpha\,\mathrm{loss}_{\mathrm{sub}}$$

where $\alpha$ is a scalar value.
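A short numeric sketch of the weighted sum follows. The source does not say which error measure is used, so mean squared error is an assumption here, as are the function names and the toy feature vectors.

```python
def mse(pred, target):
    """Mean squared error between two equal-length feature vectors (assumed measure)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def combined_loss(loss_main, loss_sub, alpha):
    """loss = loss_main + alpha * loss_sub."""
    return loss_main + alpha * loss_sub

# Text path: predicted vs. reference acoustic features (toy values).
loss_main = mse([0.5, 0.5], [1.0, 0.0])
# Speech path: predicted vs. reference acoustic features (toy values).
loss_sub = mse([1.0, 0.0], [1.0, 1.0])

total = combined_loss(loss_main, loss_sub, alpha=0.5)
```

Setting `alpha` below 1 makes the text path dominate the update; setting it to 0 would train only the text modality network and the common network.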

The learning device 100 updates the parameters (for example, the hidden-layer weight matrices) of the text modality neural network 20, the speech modality neural network 30, and the common neural network 40 so that the calculated weighted sum of errors ($\mathrm{loss}$) decreases, for example by backpropagation, while keeping the parameters of the two shared copies of the common neural network 40 identical.
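One way to see why the shared parameters stay identical is to write out the gradients of the weighted sum. The symbols $\theta_T$, $\theta_S$, and $\theta_C$ (parameters of the text modality, speech modality, and common networks) and the learning rate $\eta$ are notation introduced here for illustration, and the plain gradient descent step is an assumption; the text only names backpropagation as an example.

$$\frac{\partial\,\mathrm{loss}}{\partial\theta_T} = \frac{\partial\,\mathrm{loss}_{\mathrm{main}}}{\partial\theta_T},\qquad
\frac{\partial\,\mathrm{loss}}{\partial\theta_S} = \alpha\,\frac{\partial\,\mathrm{loss}_{\mathrm{sub}}}{\partial\theta_S},\qquad
\frac{\partial\,\mathrm{loss}}{\partial\theta_C} = \frac{\partial\,\mathrm{loss}_{\mathrm{main}}}{\partial\theta_C} + \alpha\,\frac{\partial\,\mathrm{loss}_{\mathrm{sub}}}{\partial\theta_C}$$

$$\theta \leftarrow \theta - \eta\,\frac{\partial\,\mathrm{loss}}{\partial\theta}\qquad\text{for } \theta \in \{\theta_T,\ \theta_S,\ \theta_C\}$$

Because $\theta_C$ is a single set of parameters that receives the sum of both gradient contributions, the two copies of the common network cannot drift apart.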

FIG. 4 is a flowchart showing a learning process according to one embodiment of the present invention. The learning process is executed by the learning device 100, specifically by the processor 101 of the learning device 100.

As shown in FIG. 4, in step S101, the learning device 100 acquires training data. For example, when the training data is speech data with text from multiple speakers, the learning device 100 may, as preprocessing, convert the speech data into corresponding speech waveform data and acoustic features, and generate from the training data both pairs of text data and acoustic features and pairs of speech waveform data and acoustic features.

In step S102, the learning device 100 proceeds to step S103 if the training data being processed is a pair of text data and acoustic features, and proceeds to step S106 if it is a pair of speech waveform data and acoustic features.

In step S103, the learning device 100 inputs the text data of the training data to the text modality neural network 20 and obtains the vector output from the text modality neural network 20.

In step S104, the learning device 100 inputs the obtained vector to the common neural network 40 and obtains the acoustic features output from the common neural network 40.

In step S105, the learning device 100 calculates the error ($\mathrm{loss}_{\mathrm{main}}$) between the acoustic features obtained from the common neural network 40 and the acoustic features of the training data.

In step S106, on the other hand, the learning device 100 inputs the speech waveform data of the training data to the speech modality neural network 30 and obtains the vector output from the speech modality neural network 30.

In step S107, the learning device 100 inputs the obtained vector to the common neural network 40 and obtains the acoustic features output from the common neural network 40.

In step S108, the learning device 100 calculates the error ($\mathrm{loss}_{\mathrm{sub}}$) between the acoustic features obtained from the common neural network 40 and the acoustic features of the training data.

ステップＳ１０９において、学習装置１００は、ステップＳ１０５及びＳ１０８において取得した２つの誤差の加重和（ｌｏｓｓ）を計算し、計算した加重和（ｌｏｓｓ）が減少するように、例えば、バックプロパゲーションに従ってテキストモダリティニューラルネットワーク２０、音声モダリティニューラルネットワーク３０及び共通ニューラルネットワーク４０のパラメータ（例えば、隠れ層の重み行列）を更新し、具体的には、共有される２つの共通ニューラルネットワーク４０のパラメータが同一のものに更新されるように、２つの共通ニューラルネットワーク４０を同期的に学習する。 In step S109, the learning device 100 calculates the weighted sum (loss) of the two errors acquired in steps S105 and S108, and reduces the calculated weighted sum (loss), for example, according to the backpropagation. The parameters of the neural network 20, the speech modality neural network 30, and the common neural network 40 (for example, the weight matrix of the hidden layer) are updated, and specifically, the parameters of the two shared common neural networks 40 are made the same. The two common neural networks 40 are learned synchronously so as to be updated.
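The joint update of steps S103 to S109 can be illustrated with a deliberately tiny sketch. Everything below is an assumption made for illustration: each network is reduced to a single scalar weight, the errors are squared errors, the weight `alpha` on the speech-path error and the learning rate `lr` are hypothetical values, and the gradients are derived by hand instead of by a backpropagation library. The point it shows is that a single shared common weight receives gradient from both paths, which keeps the two instances of the common network identical.

```python
def joint_step(params, text_pair, speech_pair, alpha=0.5, lr=0.01):
    """One gradient step on loss = loss_main + alpha * loss_sub (scalar toy model)."""
    a, b, c = params["text"], params["speech"], params["common"]
    x_text, y_text = text_pair
    x_speech, y_speech = speech_pair

    # Forward passes: both paths go through the same common weight c.
    err_main = c * (a * x_text) - y_text          # text path (S103 to S105)
    err_sub = c * (b * x_speech) - y_speech       # speech path (S106 to S108)
    loss = err_main ** 2 + alpha * err_sub ** 2   # weighted sum (S109)

    # Hand-derived gradients of the weighted sum.
    grad_a = 2 * err_main * c * x_text
    grad_b = 2 * alpha * err_sub * c * x_speech
    # The shared weight accumulates gradient from both paths, so one
    # update keeps both uses of the common network in sync.
    grad_c = 2 * err_main * a * x_text + 2 * alpha * err_sub * b * x_speech

    updated = {
        "text": a - lr * grad_a,
        "speech": b - lr * grad_b,
        "common": c - lr * grad_c,
    }
    return updated, loss


# Toy data: one (input, target) pair per modality.
params = {"text": 0.5, "speech": 0.5, "common": 0.5}
losses = []
for _ in range(200):
    params, loss = joint_step(params, (1.0, 2.0), (2.0, 2.0))
    losses.append(loss)
```

In a real system the scalar weights would be full networks and the gradients would come from automatic differentiation; storing the common network once and calling it from both paths is one simple way to obtain the synchronous update described above.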

The learning device 100 repeats steps S101 to S109 described above for each item of training data until a predetermined termination condition is satisfied. The predetermined termination condition may be, for example, that a predetermined number of iterations has been completed, that the error (loss) has fallen to or below a predetermined threshold, or that the error (loss) has converged.
[Second learning process of the neural network structure]
Next, a learning process for the neural network structure 10 according to another embodiment of the present invention will be described with reference to FIGS. 5 and 6. As can be understood from the neural network structure 10 described above, the learning device 100 is required to train the neural network structure 10, in particular the lower layers of the common neural network 40 close to its input layer, so that the common neural network 40 appropriately accepts input from the different modalities of text data and speech data.

FIG. 5 is a schematic diagram showing a learning process according to another embodiment of the present invention. In this embodiment, as shown in FIG. 5, the learning device 100 uses, as a loss or penalty (loss_sub), the distance between the vectors that some of the hidden layers of the common neural network 40 (for example, a hidden layer a predetermined number of layers from the input layer) output for the respective vectors input from the text modality neural network 20 and the speech modality neural network 30. It then trains the text modality neural network 20, the speech modality neural network 30, and the common neural network 40 based on a weighted sum of the error (loss_main) in the text modality neural network 20 and the common neural network 40 described above and the distance (loss_sub) between the vectors output from those hidden layers. In the learning process of the embodiment described above with reference to FIGS. 3 and 4, the hidden-layer weight matrices are identical in the shared common neural network 40, but this does not explicitly guarantee that the vectors output from the hidden layers of the common neural network 40 for the respective vectors input from the text modality neural network 20 and the speech modality neural network 30 are close to each other. For this reason, training the common neural network 40 so that the vectors output from the hidden layers near its input layer become similar is expected to enable more accurate conversion.

Meanwhile, for training data composed of a pair of speech waveform data and acoustic features, the learning device 100 inputs the speech waveform data to the speech modality neural network 30 and acquires the vector output from the speech modality neural network 30. The learning device 100 then inputs the acquired vector to the common neural network 40 and acquires the vector (h^l_sub) output from a sub-neural network composed of some of the layers of the common neural network 40 (for example, the L-th hidden layer from the input layer). It likewise acquires the vector (h^l_main) output from that sub-neural network for the vector input to the common neural network 40 from the text modality neural network 20.

Thereafter, the learning device 100 calculates an error (loss_sub) based on the distance between the two vectors (h^l_main, h^l_sub), and trains the text modality neural network 20, the speech modality neural network 30, and the common neural network 40 based on a weighted sum of the error (loss_main) and the error (loss_sub). For example, the learning device 100 may calculate the weighted sum loss of the two errors (loss_main, loss_sub) according to
$$\mathrm{loss} = \mathrm{loss}_{\mathrm{main}} + \beta \sum_{l}^{L} \mathrm{distance}\left(h^{l}_{\mathrm{main}}, h^{l}_{\mathrm{sub}}\right)$$
where β is a scalar value. Here, the distance may be, for example, the cosine distance.
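As a rough illustration of the weighted sum above, the following sketch computes the hidden-layer penalty from plain Python lists. It assumes that "cosine distance" means one minus the cosine similarity, which is a common convention but is not spelled out in the text; the function names and the sample value of `beta` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def total_loss(loss_main, h_main, h_sub, beta):
    """loss = loss_main + beta * sum over layers of distance(h_main[l], h_sub[l]).

    h_main and h_sub are lists with one hidden-layer output vector per
    layer of the sub-neural network, for the text and speech paths.
    """
    penalty = sum(cosine_distance(hm, hs) for hm, hs in zip(h_main, h_sub))
    return loss_main + beta * penalty


# Two layers: the first pair points the same way, the second is orthogonal.
example = total_loss(
    loss_main=0.5,
    h_main=[[1.0, 2.0], [1.0, 0.0]],
    h_sub=[[2.0, 4.0], [0.0, 1.0]],
    beta=0.1,
)
```

The penalty is zero when the text-path and speech-path hidden vectors point in the same direction, so minimizing it pushes the two modalities toward a shared representation in the lower layers of the common network.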

The learning device 100 updates the parameters (for example, the weight matrices of the hidden layers) of the text modality neural network 20, the speech modality neural network 30, and the common neural network 40, for example by backpropagation, so that the calculated weighted sum of the errors decreases.

FIG. 6 is a flowchart showing a learning process according to another embodiment of the present invention. The learning process is executed by the learning device 100, specifically by the processor 101 of the learning device 100.

As shown in FIG. 6, in step S201, the learning device 100 acquires training data.

In step S202, when the training data to be processed is a pair of text data and acoustic features, the learning device 100 proceeds to step S203; when the training data to be processed is a pair of speech waveform data and acoustic features, it proceeds to step S206.

In step S203, the learning device 100 inputs the text data of the training data to the text modality neural network 20 and acquires the vector output from the text modality neural network 20.

In step S204, the learning device 100 inputs the acquired vector to the common neural network 40.

In step S205, the learning device 100 acquires the acoustic features output from the common neural network 40 and also acquires the vector (h^l_main) output from a sub-neural network of the common neural network 40 (for example, a hidden layer a predetermined number of layers from the input layer).

Meanwhile, in step S206, the learning device 100 inputs the speech waveform data of the training data to the speech modality neural network 30 and acquires the vector output from the speech modality neural network 30.

In step S207, the learning device 100 inputs the acquired vector to the common neural network 40.

In step S208, the learning device 100 acquires the vector (h^l_sub) output from the sub-neural network of the common neural network 40.

In step S209, the learning device 100 calculates the error (loss_main) between the acoustic features acquired from the common neural network 40 and the acoustic features of the training data, and the distance (loss_sub) between the two vectors (h^l_main, h^l_sub).

In step S210, the learning device 100 calculates the weighted sum (loss) of the error (loss_main) and the distance (loss_sub) calculated in step S209 and updates the parameters (for example, the weight matrices of the hidden layers) of the text modality neural network 20, the speech modality neural network 30, and the common neural network 40, for example by backpropagation, so that the weighted sum (loss) decreases.

The learning device 100 repeats steps S201 to S210 described above for each item of training data until a predetermined termination condition is satisfied. The predetermined termination condition may be, for example, that a predetermined number of iterations has been completed, that the error (loss) has fallen to or below a predetermined threshold, or that the error (loss) has converged.
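The three example termination conditions named above can be checked with a small helper like the one below. This is only a sketch: the function name, the default limits, and the choice to treat "converged" as a small change between two consecutive loss values are all assumptions, since the text leaves these details open.

```python
def should_stop(losses, max_iters=1000, threshold=1e-3, tol=1e-6):
    """Return True when any of the three example conditions holds.

    losses is the history of loss values, one per completed iteration.
    """
    # Condition 1: a predetermined number of iterations has been completed.
    if len(losses) >= max_iters:
        return True
    # Condition 2: the loss has fallen to or below a predetermined threshold.
    if losses and losses[-1] <= threshold:
        return True
    # Condition 3: the loss has converged (assumed here to mean that the
    # last two values differ by less than tol).
    if len(losses) >= 2 and abs(losses[-2] - losses[-1]) < tol:
        return True
    return False
```

A training loop would append the loss after each pass over steps S201 to S210 and break as soon as `should_stop(losses)` returns `True`.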

Although the two types of learning process for the neural network structure have been described separately, those skilled in the art will understand that the two can be combined. In this case, the error (loss) is calculated, for example, according to
$$\mathrm{loss} = \mathrm{loss}_{\mathrm{main}} + \alpha\,\mathrm{loss}_{\mathrm{sub}} + \beta \sum_{l}^{L} \mathrm{distance}\left(h^{l}_{\mathrm{main}}, h^{l}_{\mathrm{sub}}\right)$$
and the parameters of the text modality neural network 20, the speech modality neural network 30, and the common neural network 40 are updated so as to reduce the error, while the parameters of the two instances of the common neural network 40 are trained synchronously. In this combined form, the distance term appears separately, so loss_sub is understood to be the speech-path error of step S108.
[Speaker adaptation process for the common neural network 40]
Next, a speaker adaptation process for the common neural network 40 according to an embodiment of the present invention will be described with reference to FIGS. 7 and 8. In this embodiment, after the neural network structure 10 has been trained according to the learning processes described above, when training data of a given unknown speaker is provided, the learning device 100 selectively uses, depending on that training data, either the text modality neural network 20 and the common neural network 40 or the speech modality neural network 30 and the common neural network 40 to estimate the speaker code vector representing the unknown speaker in the speaker space of the common neural network 40.

FIG. 7 is a schematic diagram showing an unknown-speaker adaptation process according to an embodiment of the present invention. In this embodiment, as shown in FIG. 7, when the given training data is a pair of text data and acoustic features of a given unknown speaker, the learning device 100 estimates the speaker code vector of the unknown speaker using the text modality neural network 20 and the common neural network 40. When the given training data is instead a pair of speech waveform data and acoustic features of the unknown speaker, the learning device 100 estimates the speaker code vector of the unknown speaker using the speech modality neural network 30 and the common neural network 40.

Specifically, when the training data of a given unknown speaker is composed of text data and acoustic features, the learning device 100 inputs the text data to the text modality neural network 20, inputs the vector acquired from the text modality neural network 20 to the common neural network 40, and determines the speaker code vector of the speaker based on the error between the acoustic features acquired from the common neural network 40 and the acoustic features of the training data. When the training data of the unknown speaker is instead composed of speech waveform data and acoustic features, the learning device 100 inputs the speech waveform data to the speech modality neural network 30, inputs the vector acquired from the speech modality neural network 30 to the common neural network 40, and determines the speaker code vector of the speaker based on the error between the acoustic features acquired from the common neural network 40 and the acoustic features of the training data.

For example, according to the specific example shown in FIG. 1, the speaker code vector $d^{(i)}$ is updated according to
$$d^{(i)} = d^{(i)} + \varepsilon W_{D}^{\top} f_{n-1}$$
Here, $\varepsilon$ is a small value not exceeding a predetermined value, and $f$ is a function for error propagation, defined as
$$f_{N-1}^{(C)} = W_{C,N}^{(C),\top}\, \sigma^{-1}(e')$$
where $\sigma^{-1}$ is a propagation function determined by the activation function, and $e'$ is the derivative of the error between the acoustic features acquired from the common neural network 40 and the acoustic features of the training data. Note that in this unknown-speaker adaptation process, the weight matrices $W$ and bias vectors $b$ of the common neural network 40 are not updated.
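The idea of adapting only the speaker code while the trained weights stay fixed can be sketched as follows. This is a simplified stand-in, not the update rule quoted above: it assumes a single linear output layer of the form `y = base_out + W_D d`, a squared-error objective, and plain gradient descent, so the sign convention and the propagation function of the original formula do not appear. The names `adapt_speaker_code`, `base_out`, `eps`, and `steps` are hypothetical.

```python
def adapt_speaker_code(d, W_D, base_out, target, eps=0.1, steps=200):
    """Fit the speaker code d by gradient descent; W_D and base_out stay fixed.

    Model (an assumption for this sketch): y = base_out + W_D @ d, where
    base_out stands for the contribution of the frozen network to the output.
    """
    d = list(d)  # work on a copy so the caller's list is untouched
    for _ in range(steps):
        # Forward pass with the current speaker code.
        y = [b + sum(w * dj for w, dj in zip(row, d))
             for b, row in zip(base_out, W_D)]
        err = [yi - ti for yi, ti in zip(y, target)]
        # Gradient of 0.5 * ||err||^2 with respect to d is W_D^T err.
        grad = [sum(W_D[i][j] * err[i] for i in range(len(err)))
                for j in range(len(d))]
        # Only d moves; the weight matrix is never written to.
        d = [dj - eps * g for dj, g in zip(d, grad)]
    return d


W_D = [[1.0, 0.0], [0.0, 1.0]]
code = adapt_speaker_code([0.0, 0.0], W_D, base_out=[0.0, 0.0], target=[1.0, -2.0])
```

Because only the low-dimensional code is optimized, a small amount of data from the new speaker can be enough to place that speaker in the speaker space without disturbing the trained networks.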

By identifying the speaker code vector (latent variable) in the common neural network 40 in this way, the trained neural network structure 10 can be adapted to a specific unknown speaker.

FIG. 8 is a flowchart showing an unknown-speaker adaptation process according to an embodiment of the present invention. The process is executed by the learning device 100, specifically by the processor 101 of the learning device 100.

As shown in FIG. 8, in step S301, the learning device 100 acquires training data of a given unknown speaker.

In step S302, the learning device 100 determines whether the training data is composed of a pair of text data and acoustic features or a pair of speech waveform data and acoustic features. When the training data is composed of a pair of text data and acoustic features, the process proceeds to step S303; when it is composed of a pair of speech waveform data and acoustic features, the process proceeds to step S306.

In step S303, the learning device 100 inputs the text data of the training data to the text modality neural network 20 and acquires the vector output from the text modality neural network 20.

In step S304, the learning device 100 inputs the acquired vector to the common neural network 40 and acquires the acoustic features output from the common neural network 40.

In step S305, the learning device 100 calculates the error between the acoustic features acquired from the common neural network 40 and the acoustic features of the training data.

Meanwhile, in step S306, the learning device 100 inputs the speech waveform data of the training data to the speech modality neural network 30 and acquires the vector output from the speech modality neural network 30.

In step S307, the learning device 100 inputs the acquired vector to the common neural network 40 and acquires the acoustic features output from the common neural network 40.

In step S308, the learning device 100 calculates the error between the acoustic features acquired from the common neural network 40 and the acoustic features of the training data.

In step S309, the learning device 100 updates the speaker code vector of the common neural network 40, for example by backpropagation using the update formula described above, so that the error calculated in step S305 or S308 decreases.
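The branch at step S302 amounts to picking an encoder based on what the adaptation data contains. The sketch below illustrates that dispatch with toy stand-ins; the dictionary keys (`"text"`, `"waveform"`, `"acoustic"`), the function names, and the three toy networks are all hypothetical and exist only to make the example runnable.

```python
def select_path(sample, text_net, speech_net):
    """Return (encoder, input) for one adaptation sample (step S302)."""
    if "text" in sample:
        return text_net, sample["text"]          # supervised case
    if "waveform" in sample:
        return speech_net, sample["waveform"]    # unsupervised case
    raise ValueError("sample needs a 'text' or 'waveform' entry")


def adaptation_error(sample, text_net, speech_net, common_net, speaker_code):
    """Squared error between predicted and reference acoustic features."""
    encoder, model_input = select_path(sample, text_net, speech_net)
    predicted = common_net(encoder(model_input), speaker_code)
    return sum((p - t) ** 2 for p, t in zip(predicted, sample["acoustic"]))


# Toy stand-ins for the three trained networks.
def toy_text_net(text):
    return [float(len(text))]


def toy_speech_net(waveform):
    return [sum(waveform) / len(waveform)]


def toy_common_net(vector, speaker_code):
    return [vector[0] + speaker_code[0]]


supervised = {"text": "abc", "acoustic": [4.0]}
unsupervised = {"waveform": [1.0, 3.0], "acoustic": [2.0]}
err_text = adaptation_error(supervised, toy_text_net, toy_speech_net,
                            toy_common_net, [1.0])
err_speech = adaptation_error(unsupervised, toy_text_net, toy_speech_net,
                              toy_common_net, [1.0])
```

Either path yields an error on the same acoustic-feature scale, which is what lets one speaker-code update rule serve both the supervised and the unsupervised case.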

The learning device 100 repeats steps S301 to S309 described above for each item of training data until a predetermined termination condition is satisfied. The predetermined termination condition may be, for example, that a predetermined number of iterations has been completed, that the error has fallen to or below a predetermined threshold, or that the error has converged.
[Speech synthesis process using the trained neural network structure]
Next, a speech synthesis process according to an embodiment of the present invention will be described with reference to FIGS. 9 to 11. In this embodiment, the speech synthesis device 200 uses the text modality neural network 20 and the common neural network 40 trained for a specific speaker by the learning device 100 described above to generate and play back, from text data to be synthesized, speech data corresponding to that speaker.

FIG. 9 is a schematic diagram showing a speech synthesis process according to an embodiment of the present invention. In this embodiment, when text data to be synthesized is given, the speech synthesis device 200 uses, as shown in FIG. 9, the text modality neural network 20 and the common neural network 40 trained for a specific speaker by the learning device 100 described above to generate acoustic features corresponding to that speaker from the text data. Specifically, the speech synthesis device 200 may acquire the text data via the input/output interface 104 and play back the acoustic features generated from the text data for that speaker.

FIG. 10 is a flowchart showing a speech synthesis process according to an embodiment of the present invention. The speech synthesis process is executed by the speech synthesis device 200, specifically by the processor 101 of the speech synthesis device 200.

As shown in FIG. 10, in step S401, the speech synthesis device 200 acquires text data to be synthesized. For example, the text data may be input via the input/output interface 104 of the speech synthesis device 200.

In step S402, the speech synthesis device 200 inputs the acquired text data to the trained text modality neural network 20 and acquires the vector output from the text modality neural network 20.

In step S403, the speech synthesis device 200 inputs the acquired vector to the trained common neural network 40 and acquires the acoustic features output from the common neural network 40.

In step S404, the speech synthesis device 200 converts the acoustic features corresponding to the specific speaker, acquired from the common neural network 40, into some speech data format and plays back the converted speech data. For example, the converted speech data can be a rendering of the input text data in a voice close to that speaker's voice, tempo, accent, and so on.
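Steps S402 to S404 form a straight pipeline, which the following sketch makes explicit. The three callables are placeholders assumed for illustration: `text_net` and `common_net` stand for the trained networks, `to_audio` stands for whatever converts acoustic features into a playable format, and passing the adapted speaker code as an explicit argument is a design choice of this sketch, not something the text prescribes.

```python
def synthesize(text, text_net, common_net, speaker_code, to_audio):
    """Text in, audio samples out, for one adapted speaker."""
    vector = text_net(text)                        # S402: text -> vector
    features = common_net(vector, speaker_code)    # S403: vector -> acoustic features
    return to_audio(features)                      # S404: features -> audio data


# Toy stand-ins so the pipeline can be run end to end.
def toy_text_net(text):
    return [float(len(text))]


def toy_common_net(vector, speaker_code):
    return [x + speaker_code[0] for x in vector]


def toy_to_audio(features):
    return [2 * x for x in features]


audio = synthesize("hello", toy_text_net, toy_common_net, [0.5], toy_to_audio)
```

Note that the speech modality network does not appear here: it is needed only during training and unsupervised adaptation, while synthesis always runs through the text path.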

FIG. 11 is a diagram showing experimental results of the learning processes according to the various embodiments of the present invention. In FIG. 11, VL, SS, JG, TL, and JG+TL denote various speech synthesis systems, including those that use the trained neural network structure described above.

VL is a system that uses a conventional neural network structure rather than the neural network structure 10 composed of three neural networks. SS is a system trained by simply swapping the modality neural networks of the neural network structure 10. JG is a system that uses the neural network structure 10 trained by the learning process described with reference to FIGS. 3 and 4. TL is a system that uses the neural network structure 10 trained by the learning process described with reference to FIGS. 5 and 6. JG+TL is a system that uses the neural network structure 10 trained by a learning process combining JG and TL.

FIG. 11 shows simulation results for the error (MCD) of each speech synthesis system when trained with 10, 40, 160, and 320 items of unknown-speaker training data. As can be seen from the figure, in both supervised learning, where speech data and text data are given as training data, and unsupervised learning, where only speech data is given, JG, TL, and JG+TL according to the embodiments described above significantly reduced the error relative to VL and SS.

In the embodiments described above, text data and speech data are treated as the different modalities, but the present invention is not limited to this, and it will be understood that it can be applied in the same way to combinations of other types of modalities.

Although embodiments of the present invention have been described in detail above, the present invention is not limited to the specific embodiments described, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

10 Neural network structure
20 Text modality neural network
30 Speech modality neural network
40 Common neural network
100 Learning device
200 Speech synthesis device

Claims

A learning device comprising:
a memory; and
a processor,
wherein the memory stores:
a text modality neural network that converts text data into a first vector;
a speech modality neural network that converts speech waveform data into a second vector; and
a common neural network that is connected to the text modality neural network and the speech modality neural network and that generates, from the first vector or the second vector, acoustic features corresponding to a speaker code vector in a speaker space, and
wherein the processor:
trains the text modality neural network and the common neural network using first training data composed of text data and acoustic features;
trains the speech modality neural network and the common neural network using second training data composed of speech waveform data and acoustic features; and
estimates the speaker code vector for a given speaker by selectively using, depending on third training data of the given speaker, either the text modality neural network and the common neural network or the speech modality neural network and the common neural network.

The learning device according to claim 1, wherein the processor:
inputs the text data of the first training data to the text modality neural network, inputs the first vector acquired from the text modality neural network to the common neural network, and calculates a first error between the acoustic features acquired from the common neural network and the acoustic features of the first training data;
inputs the speech waveform data of the second training data to the speech modality neural network, inputs the second vector acquired from the speech modality neural network to the common neural network, and calculates a second error between the acoustic features acquired from the common neural network and the acoustic features of the second training data; and
trains the text modality neural network, the speech modality neural network, and the common neural network based on a weighted sum of the first error and the second error.

The learning device according to claim 1, wherein the processor:
inputs the text data of the first training data to the text modality neural network, inputs the first vector acquired from the text modality neural network to the common neural network, and calculates a first error between the acoustic features acquired from the common neural network and the acoustic features of the first training data;
inputs the speech waveform data of the second training data to the speech modality neural network, inputs the second vector acquired from the speech modality neural network to the common neural network, acquires a third vector from a sub-neural network composed of some of the layers of the common neural network, acquires a fourth vector from the sub-neural network for the first vector input to the common neural network, and calculates a third error based on the distance between the third vector and the fourth vector; and
trains the text modality neural network, the speech modality neural network, and the common neural network based on a weighted sum of the first error and the third error.

The learning device according to claim 1, wherein, when the third training data is composed of text data and acoustic features, the processor inputs the text data to the text modality neural network, inputs the first vector acquired from the text modality neural network to the common neural network, and determines the speaker code vector of the given speaker based on a fourth error between the acoustic features acquired from the common neural network and the acoustic features of the third training data.

The learning device according to any one of claims 1 to 4, wherein, when the third training data is composed of speech waveform data and acoustic features, the processor inputs the speech waveform data to the speech modality neural network, inputs the second vector acquired from the speech modality neural network to the common neural network, and determines the speaker code vector of the given speaker based on a fifth error between the acoustic features acquired from the common neural network and the acoustic features of the third training data.

A speech synthesis device comprising:
a memory; and
a processor,
wherein the memory stores:
a trained text modality neural network; and
a common neural network trained for a given speaker, and
wherein the processor, upon acquiring text data, generates acoustic features corresponding to the given speaker from the text data using the stored text modality neural network and common neural network.

The speech synthesizer according to claim 6, further comprising an input / output interface that acquires text data and reproduces an acoustic feature generated from the text data corresponding to the given speaker.

A learning method implemented by a computer having a memory and a processor,
wherein the memory stores:
a text modality neural network for converting text data into a first vector;
a speech modality neural network for converting speech waveform data into a second vector; and
a common neural network connected to the text modality neural network and the speech modality neural network and configured to generate an acoustic feature corresponding to a speaker code vector in a speaker space from the first vector or the second vector,
the learning method comprising:
the processor training the text modality neural network and the common neural network with first training data composed of text data and acoustic features;
the processor training the speech modality neural network and the common neural network with second training data composed of speech waveform data and acoustic features; and
the processor estimating the speaker code vector for a given speaker by selectively using, according to third training data of the given speaker, either the text modality neural network and the common neural network or the speech modality neural network and the common neural network.
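
The selective use of the two modality networks in the estimating step can be pictured with a short sketch. The dictionary layout of a sample and the function name are hypothetical; the networks are passed in as plain callables.

```python
def encode_sample(sample, text_nn, speech_nn):
    """Route one adaptation sample to the matching modality network.

    sample is a hypothetical dict with either a "text" entry or a
    "waveform" entry (the layout is an assumption for illustration).
    """
    if "text" in sample:
        return text_nn(sample["text"])        # yields the first vector
    if "waveform" in sample:
        return speech_nn(sample["waveform"])  # yields the second vector
    raise ValueError("sample needs a 'text' or 'waveform' entry")
```

Either result is then fed to the common neural network, so the downstream estimation code does not need to know which modality produced it.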

A speech synthesis method implemented by a computer having a memory and a processor,
wherein the memory stores:
a trained text modality neural network; and
a common neural network trained for a given speaker,
the speech synthesis method comprising the processor, upon acquiring text data, generating an acoustic feature corresponding to the given speaker from the text data by using the stored text modality neural network and the common neural network.

A program that causes a processor, connected to a memory storing a text modality neural network for converting text data into a first vector, a speech modality neural network for converting speech waveform data into a second vector, and a common neural network that is connected to the text modality neural network and the speech modality neural network and generates an acoustic feature corresponding to a speaker code vector in a speaker space from the first vector or the second vector, to execute:
training the text modality neural network and the common neural network with first training data composed of text data and acoustic features;
training the speech modality neural network and the common neural network with second training data composed of speech waveform data and acoustic features; and
estimating the speaker code vector for a given speaker by selectively using, according to third training data of the given speaker, either the text modality neural network and the common neural network or the speech modality neural network and the common neural network.

A program that causes a processor, connected to a memory storing a trained text modality neural network and a common neural network trained for a given speaker, to generate, when text data is acquired, an acoustic feature corresponding to the given speaker from the text data by using the stored text modality neural network and the common neural network.