JP7124373B2

JP7124373B2 - LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM

Info

Publication number: JP7124373B2
Application number: JP2018056905A
Authority: JP
Inventors: 大輝日暮
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2018-03-23
Filing date: 2018-03-23
Publication date: 2022-08-24
Anticipated expiration: 2038-03-23
Also published as: JP2019168608A

Description

The present disclosure relates to sound processing technology.

Conventional speech synthesis has been dominated by the waveform concatenation type and the hidden Markov model type. With the development of deep learning, neural network type speech synthesis methods have also been proposed. WaveNet, a representative example of the neural network type, is used for text-to-speech and can achieve more natural, higher-quality speech synthesis than the waveform concatenation type and the hidden Markov model type.

"WAVENET: A GENERATIVE MODEL FOR RAW AUDIO" (https://arxiv.org/pdf/1609.03499.pdf)

On the other hand, WaveNet has problems, such as the difficulty of training it efficiently using a loss function.

The same problem arises not only when synthesizing waveform data of human speech from text, but also when generating waveform data (acoustic data) of sounds belonging to a specific group from various kinds of source information.

An object of the present disclosure is to provide a sound processing technique for effectively generating waveform data (acoustic data) of sounds belonging to a specific group.

In order to solve the above problem, one aspect of the present disclosure relates to a sound generation device including: a generator that inputs source information to a first neural network and generates, as an output from the first neural network, waveform data corresponding to the source information input to the first neural network; an extraction unit that extracts first frequency information, as differentiable numerical information, from the waveform data output from the first neural network of the generator; a discriminator that inputs the first frequency information to a second neural network and outputs, as an output from the second neural network, a first discrimination value as differentiable numerical information indicating the degree of likelihood that the first frequency information is frequency information extracted from waveform data belonging to a first group; and a control unit that trains the first neural network, based on a loss function that takes as input the first discrimination value output by the discriminator, so that the first discrimination value output by the discriminator indicates a higher likelihood.

According to the present disclosure, it is possible to provide a sound processing technique for effectively generating waveform data (acoustic data) of sounds belonging to a specific group.

FIG. 1 is a schematic diagram showing a sound generation device having a trained sound separation model according to an embodiment of the present disclosure.
FIG. 2 is a block diagram showing the functional configuration of a learning device according to an embodiment of the present disclosure.
FIG. 3 is a schematic diagram showing learning processing by a generator and a discriminator according to an embodiment of the present disclosure.
FIG. 4 is a block diagram showing the hardware configuration of a learning device according to an embodiment of the present disclosure.
FIG. 5 is a flowchart showing learning processing of a sound generation model according to an embodiment of the present disclosure.
FIG. 6 is a flowchart showing details of the learning processing of a sound generation model according to an embodiment of the present disclosure.
FIG. 7 is a schematic diagram showing learning processing by a generator and a discriminator according to an embodiment of the present disclosure.
FIG. 8 is a schematic diagram showing learning processing by a generator and a discriminator according to an embodiment of the present disclosure.
FIG. 9 is a block diagram showing the functional configuration of a sound generation device according to an embodiment of the present disclosure.
FIG. 10 is a block diagram showing the hardware configuration of a sound generation device according to an embodiment of the present disclosure.
FIG. 11 is a flowchart showing sound generation processing according to an embodiment of the present disclosure.

The following embodiments disclose a sound processing technique that trains a sound generation model for generating, from source information, a waveform that is plausible as a waveform of a given data set, and that generates waveform data using the trained sound generation model.

A learning device according to the present disclosure has a model to be trained that includes a generator that generates waveform data from source information and a discriminator that generates an output value from a spectrogram. The learning device inputs source information to the generator, acquires waveform data from the generator, inputs to the discriminator the spectrograms obtained by converting the acquired waveform data and the training waveform data according to an acoustic image conversion method (constant-Q transform, Fourier transform, etc.), and trains the generator and the discriminator based on a loss function that takes the output values of the discriminator as input. A waveform generation device and a sound generation device according to the present disclosure use the trained generator to generate waveform data whose spectrogram is plausible as a spectrogram of waveform data in the data set.

First, a sound generation device having a trained generator according to an embodiment of the present disclosure will be described with reference to FIG. 1. FIG. 1 is a schematic diagram showing a sound generation device having a trained generator according to an embodiment of the present disclosure.

As shown in FIG. 1, a sound generation device 200 according to an embodiment of the present disclosure has a generator implemented as a neural network and uses the generator trained by a learning device 100 to generate, from source information, waveform data that plausibly belongs to the same group as the waveform data of the data set. Specifically, for example, when the generator is trained to generate waveform data of human speech, waveform data of human speech is used as the training data set. To generate waveform data belonging to some other specific group, such as sounds of musical instruments or animal voices, waveform data belonging to that group may be used as the training data set. The learning device 100 according to an embodiment of the present disclosure trains the generator and the discriminator with a data set of the desired audio data (sound waveform data) stored in a database 50 and provides the trained generator to the sound generation device 200.

Next, a learning device according to an embodiment of the present disclosure will be described with reference to FIGS. 2 to 4. FIG. 2 is a block diagram showing the functional configuration of the learning device according to an embodiment of the present disclosure.

As shown in FIG. 2, the learning device 100 has a generator 110, a conversion unit 120, a discriminator 130, and a learning unit 140. The learning device 100 has two types of neural networks, the generator 110 and the discriminator 130, and, following the GAN (Generative Adversarial Networks) scheme, trains the generator 110 and the discriminator 130 so that, based on feedback information from the discriminator 130, the generator 110 generates waveform data that is plausible as waveform data of the given data set.

The generator (generation unit) and the discriminator (discrimination unit) are realized by a control unit (CPU, GPU) simulating neural networks, and are implemented as models that execute generation processing and discrimination processing according to predetermined stored information in memory. The stored information used by these models consists of the parameters (weight values) of the neural networks and changes through training.

The generator 110 generates waveform data from input source information. The source information may be data of a different type from the waveform data to be generated, such as random numbers, audio, text, or utterances. For example, as shown in FIG. 3, the generator 110 inputs random numbers to its neural network and acquires waveform data from that neural network. The random numbers may follow a normal distribution.

The conversion unit 120 converts the acquired waveform data and the training waveform data into respective spectrograms according to an acoustic image conversion method. Specifically, the conversion unit 120 converts each piece of waveform data into a spectrogram representing time, frequency, and the intensity of audio components according to a predetermined acoustic image conversion method that is differentiable with respect to its input (for example, the constant-Q transform or the Fourier transform), and provides the resulting spectrograms to the discriminator 130. A spectrogram according to an embodiment of the present disclosure may be implemented as a data array containing data components in multiple dimensions.
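
As an illustration of the waveform-to-spectrogram conversion described above, the following is a minimal sketch of a magnitude spectrogram computed with a plain discrete Fourier transform over sliding frames. The function name `spectrogram`, the frame size, the hop size, and the absence of a window function are assumptions made for this example and are not taken from the disclosure. An actual implementation would presumably use an automatic differentiation framework so that gradients can flow through the transform.

```python
import cmath
import math


def spectrogram(waveform, frame_size=8, hop=4):
    """Return a list of frames; each frame lists magnitudes for bins 0..frame_size//2."""
    frames = []
    for start in range(0, len(waveform) - frame_size + 1, hop):
        frame = waveform[start:start + frame_size]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            # Discrete Fourier transform of one frame at frequency bin k.
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            magnitudes.append(abs(coeff))
        frames.append(magnitudes)
    return frames


# A sine completing 2 cycles per 8-sample frame concentrates its energy in bin 2.
wave = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = spectrogram(wave)
```

The result is a two-dimensional array indexed by time frame and frequency bin, which matches the description of a spectrogram as a data array with components in multiple dimensions.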

The discriminator 130 calculates an output value for each of the spectrogram representing the waveform data generated by the generator 110 and the training spectrogram from the database 50. Specifically, as shown in FIG. 3, the discriminator 130 inputs the spectrogram representing the waveform data generated by the generator 110 to its neural network and acquires a real value from that neural network, and likewise inputs the spectrogram representing the training waveform data to its neural network and acquires a real value. The output value of the discriminator 130 represents how plausible the input is as a spectrogram of a waveform sampled from the training data set (waveform data belonging to the first group).

The learning unit 140 trains the generator 110 and the discriminator 130 based on the error of the output values.

That is, the learning unit 140 changes the parameters of the neural network (first stored information) so that the generator 110 generates waveform data belonging to the first group, which is the group to which the training data set belongs (and does not generate waveform data that does not belong to the first group).

The learning unit 140 also changes the parameters of the neural network (second stored information) so that the discriminator 130 can correctly distinguish waveform data belonging to the first group from waveform data not belonging to the first group.
Specifically, the learning unit 140 controls the learning processing described later. Let x_real be the spectrogram representing the training waveform data, x_fake be the spectrogram representing the waveform data generated by the generator 110, and D be the output value of the discriminator 130. The learning unit 140 may update the parameters of the neural network of the discriminator 130 so as to maximize
log D(x_real) + log(1 - D(x_fake)),
and may update the parameters of the neural network of the generator 110 so as to minimize
log(1 - D(x_fake)).
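
The two quantities above can be written out as a small sketch. The function names are hypothetical, and the example assumes that the discriminator output D lies strictly between 0 and 1 so that both logarithms are defined. The disclosure only states that D is a real value.

```python
import math


def discriminator_objective(d_real, d_fake):
    """log D(x_real) + log(1 - D(x_fake)); the discriminator update maximizes this."""
    return math.log(d_real) + math.log(1.0 - d_fake)


def generator_loss(d_fake):
    """log(1 - D(x_fake)); the generator update minimizes this."""
    return math.log(1.0 - d_fake)


# A discriminator that separates real from generated data scores higher
# than one that outputs 0.5 for everything.
confident = discriminator_objective(d_real=0.9, d_fake=0.1)
undecided = discriminator_objective(d_real=0.5, d_fake=0.5)

# The generator loss falls as the discriminator is fooled more strongly.
fooled = generator_loss(d_fake=0.9)
caught = generator_loss(d_fake=0.1)
```

The discriminator objective grows as D(x_real) approaches 1 and D(x_fake) approaches 0, while the generator loss shrinks as D(x_fake) approaches 1, which is the adversarial relationship the text describes.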

For example, as shown in FIG. 4, the learning device 100 may have a hardware configuration including a CPU (Central Processing Unit) 101, a GPU (Graphics Processing Unit) 102, a RAM (Random Access Memory) 103, a communication interface (IF) 104, a hard disk 105, a display device 106, and an input device 107. The CPU 101 and the GPU 102 execute the various processes of the learning device 100 described later and function as processors that realize the generator 110, the conversion unit 120, the discriminator 130, and the learning unit 140 described above. In particular, the CPU 101 controls execution of the learning processing in the learning device 100, and the GPU 102 executes the computations of the learning processing, such as matrix operations in machine learning. The RAM 103 and the hard disk 105 function as memories that store various data and programs in the learning device 100. In particular, the RAM 103 functions as a working memory that stores working data of the CPU 101 and the GPU 102, and the hard disk 105 stores the control programs of the CPU 101 and the GPU 102 and/or the training data. The communication IF 104 is a communication interface for acquiring the training data from the database 50. The display device 106 displays various kinds of information, such as the content, progress, and results of processing, and the input device 107 is a device for inputting information and data, such as a keyboard or a mouse. However, the learning device 100 according to the present disclosure is not limited to the hardware configuration described above and may have any other appropriate hardware configuration.

Next, learning processing in the learning device 100 according to an embodiment of the present disclosure will be described with reference to FIGS. 5 and 6. FIG. 5 is a flowchart showing learning processing of a sound generation model according to an embodiment of the present disclosure. In the illustrated embodiment, random numbers are used as the source information, without limitation.

As shown in FIG. 5, in step S101, the generator 110 acquires waveform data from random numbers. Specifically, the generator 110 inputs random numbers to its neural network and acquires waveform data from that neural network. The generator 110 generates waveform data from the input source information according to the first stored information. The discriminator 130, in turn, determines according to the second stored information whether input frequency information is frequency information extracted from waveform data belonging to the first group. Here, the first stored information is changed, based on the result of discriminating the frequency information extracted from the generated waveform data, so that waveform data belonging to the first group is generated. The second stored information is changed, based on the result of discriminating the frequency information extracted from waveform data belonging to the first group and the result of discriminating the frequency information extracted from the generated waveform data, so that waveform data belonging to the first group and waveform data not belonging to the first group can be correctly distinguished.

In step S102, the conversion unit 120 converts the waveform data generated by the generator 110 and the training waveform data into respective spectrograms. Specifically, the conversion unit 120 converts each piece of waveform data into a spectrogram representing time, frequency, and the intensity of audio components according to a predetermined acoustic image conversion method that is differentiable with respect to its input (for example, the constant-Q transform or the Fourier transform). The conversion unit 120 also converts the waveform data into image data in which one of a plurality of axes is a logarithmic frequency axis. The image data obtained by converting the waveform data with the conversion unit 120 may be given to the discriminator 130 as the frequency information to be discriminated.
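
To make the logarithmic frequency axis concrete, the sketch below computes geometrically spaced bin center frequencies, which is how a constant-Q transform places its bins. The function name and the parameters `f_min`, `bins_per_octave`, and `n_bins` are assumptions for illustration. The disclosure does not specify the spacing or range of the axis.

```python
def log_frequency_axis(f_min, bins_per_octave, n_bins):
    """Center frequencies spaced geometrically, so they are uniform on a log axis."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


# With 12 bins per octave, every 12th bin doubles the frequency.
freqs = log_frequency_axis(f_min=55.0, bins_per_octave=12, n_bins=25)
```

On such an axis a shift in pitch moves a harmonic pattern by a fixed number of bins, which is one reason a logarithmic frequency axis can suit image-style processing of sound.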

In step S103, the discriminator 130 calculates an output value from each of the converted spectrograms. Specifically, the discriminator 130 inputs the spectrogram representing the waveform data generated by the generator 110 and the spectrogram representing the training waveform data to its neural network and acquires a real value for each from that neural network.

In step S104, the learning unit 140 trains the generator 110 and the discriminator 130 based on the error of the output values. Specifically, the learning unit 140 updates the parameters of the neural network of the generator 110 and the parameters of the neural network of the discriminator 130 based on the error of the output values. That is, the learning unit 140 may input each of a plurality of pieces of source information to the generator 110 to generate a plurality of pieces of waveform data, cause the discriminator 130 to discriminate both the plurality of pieces of image data obtained by converting each generated piece of waveform data with the conversion unit 120 and the plurality of pieces of image data obtained by converting each of a plurality of pieces of waveform data belonging to the first group with the conversion unit 120, and train the model by changing the first stored information and the second stored information based on the resulting discrimination results.

Steps S101 to S104 described above may be executed a predetermined number of times, and the neural network of the generator 110 finally obtained may be adopted as the trained sound generation model provided to the sound generation device 200.

The learning processing described above may be implemented, for example, according to the procedure shown in FIG. 6.

As shown in FIG. 6, in step S201, the learning unit 140 initializes an iteration counter (iteration) to 0.

In step S202, the learning unit 140 determines whether the iteration counter is less than a specified number. If it is (step S202: YES), the learning unit 140 initializes a step counter (step) to 0 in step S203. If the iteration counter has reached the specified number (step S202: NO), the learning unit 140 ends the learning processing.

In step S204, the learning unit 140 determines whether the step counter is less than a specified number. If it is (step S204: YES), the generator 110 generates waveform data from random numbers in step S205.

In step S206, the learning unit 140 samples a training waveform from the database 50 and generates training waveform data.

In step S207, the conversion unit 120 converts the waveform data generated in step S205 and the training waveform data generated in step S206 into respective spectrograms according to a predetermined acoustic image conversion method. In this embodiment, for example, the conversion unit 120 converts the waveform data into spectrograms by the Fourier transform, but the acoustic image conversion method of the present disclosure is not limited to this, and another acoustic image conversion method that is differentiable with respect to its input, such as the constant-Q transform, may be applied.

In step S208, the discriminator 130 uses its neural network to calculate a real value from each of the converted spectrograms, and the learning unit 140 calculates an error from the calculated real values. For example, the learning unit 140 may calculate
log D(x_real) + log(1 - D(x_fake))
as the error.

In step S209, the learning unit 140 updates the parameters of the neural network of the discriminator 130 so as to maximize the calculated error.

In step S210, the learning unit 140 increments the step counter and returns to step S204.

On the other hand, if the step counter has reached the specified number (step S204: NO), the generator 110 generates waveform data from random numbers in step S211.

In step S212, the conversion unit 120 converts the waveform data generated in step S211 into a spectrogram according to a predetermined acoustic image conversion method. In this embodiment, for example, the conversion unit 120 converts the waveform data into a spectrogram by the Fourier transform, but the acoustic image conversion method of the present disclosure is not limited to this, and another acoustic image conversion method that is differentiable with respect to its input, such as the constant-Q transform, may be applied.

In step S213, the discriminator 130 uses its neural network to calculate a real value from the converted spectrogram, and the learning unit 140 calculates the error log(1 - D(x_fake)) from the calculated real value.

In step S214, the learning unit 140 updates the parameters of the neural network of the generator 110 so as to minimize the calculated error.

In step S215, the learning unit 140 increments the iteration counter and returns to step S202.
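
The control flow of steps S201 to S215 can be summarized in a short sketch. The function `train` and its two callbacks are hypothetical placeholders. Each callback stands in for the generation, conversion, discrimination, and parameter update steps, which are not implemented here.

```python
def train(num_iterations, num_steps, update_discriminator, update_generator):
    """Run num_steps discriminator updates, then one generator update, per iteration."""
    iteration = 0                      # S201
    while iteration < num_iterations:  # S202
        step = 0                       # S203
        while step < num_steps:        # S204
            update_discriminator()     # S205 to S209
            step += 1                  # S210
        update_generator()             # S211 to S214
        iteration += 1                 # S215


# Record the order of updates for 3 iterations with 2 discriminator steps each.
calls = []
train(3, 2, lambda: calls.append("D"), lambda: calls.append("G"))
```

With these arguments the recorded order is two discriminator updates followed by one generator update, repeated three times.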

Next, learning processing by a generator and a discriminator according to another embodiment of the present disclosure will be described with reference to FIGS. 7 and 8. FIGS. 7 and 8 are schematic diagrams showing learning processing by a generator and a discriminator according to an embodiment of the present disclosure. In the illustrated embodiment, the learning device 100 trains the generator 110 and the discriminator 130 according to the CycleGAN scheme.

As shown in FIG. 7, the generator 110 has two neural networks, G_AtoB and G_BtoA. G_AtoB performs conversion from domain A to domain B, and G_BtoA performs conversion from domain B to domain A. For example, domain A may be a data set of male voices and domain B a data set of female voices. In this case, G_AtoB converts waveform data of a male voice into waveform data of a female voice, and G_BtoA converts waveform data of a female voice into waveform data of a male voice.

The discriminator 130 likewise has two neural networks, D_A and D_B. As shown in FIG. 8, D_A determines whether an input spectrogram is plausible as a spectrogram of waveform data in the domain A data set, and D_B determines whether an input spectrogram is plausible as a spectrogram of waveform data in the domain B data set. For example, when domain A is a data set of male voices and domain B is a data set of female voices, D_A determines whether the input spectrogram is plausible as a spectrogram of a male voice, and D_B determines whether the input spectrogram is plausible as a spectrogram of a female voice. That is, the waveform data belonging to the first group may be waveform data corresponding to speech data of spoken words. Also, the waveform data belonging to the first group may be waveform data corresponding to the voice of a specific person, and the waveform data belonging to the second group may be waveform data corresponding to the voice of a person different from the specific person.

In this embodiment, as illustrated, G_AtoB and G_BtoA each output converted waveform data, and the conversion unit 120 converts each piece of waveform data into a spectrogram according to a predetermined acoustic image conversion method (for example, the constant-Q transform or the Fourier transform) and inputs the spectrograms to D_B and D_A, respectively. As in the embodiment described above, D_A and D_B each calculate output values for the spectrogram representing the training waveform data of the data set of their own domain and for the spectrogram representing the waveform data converted by G_BtoA and G_AtoB, respectively. Based on the error of these output values, the learning unit 140 updates the parameters of G_AtoB and G_BtoA and of D_A and D_B as described above.

Further, in this embodiment, as illustrated, the waveform data converted by G_AtoB and G_BtoA is input to the other network, G_BtoA and G_AtoB, respectively. G_BtoA and G_AtoB each convert the input waveform data, and the converted waveform data is supplied to D_A and D_B, respectively, via the conversion unit 120. As above, the conversion unit 120 converts each piece of waveform data into a spectrogram according to the predetermined acoustic image conversion method and inputs each spectrogram to the corresponding one of D_A and D_B. As in the embodiment described above, D_A and D_B each calculate output values for the spectrogram representing the training waveform data of the data set of their own domain and for the spectrogram representing the waveform data converted by G_BtoA and G_AtoB, respectively. Based on the error of these output values, the learning unit 140 updates the parameters of G_AtoB and G_BtoA and of D_A and D_B as described above.

By converting and inversely converting waveform data in the generator 110 in this way, it becomes possible to obtain, for example, waveform data in which the utterance content is unchanged and only the voice quality differs.

In one embodiment, the discriminator 130 may output, as the discrimination result, an output value corresponding to the likelihood that the input frequency information is frequency information extracted from waveform data belonging to the first group. The learning unit 140 may change the second stored information so that the output value for frequency information extracted from waveform data belonging to the first group indicates a higher likelihood and the output value for frequency information extracted from waveform data generated by the generator 110 indicates a lower likelihood. The learning unit 140 may also change the first stored information so that the output value for frequency information extracted from waveform data generated by the generator 110 indicates a higher likelihood.
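The opposing objectives described above may be sketched as follows. Binary cross-entropy is assumed here purely for illustration; the present disclosure does not name a specific loss function, and the function names are hypothetical. The discriminator outputs are taken to be likelihoods in the range (0, 1).

```python
import math


def bce(prob, target):
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    eps = 1e-12  # guards against log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def discriminator_loss(d_real, d_fake):
    """Small when first-group inputs score high and generated inputs score low.

    Minimizing this would correspond to changing the second stored information.
    """
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_loss(d_fake):
    """Small when generated inputs score high.

    Minimizing this would correspond to changing the first stored information.
    """
    return bce(d_fake, 1.0)
```

Under this assumption, a discriminator that separates the two kinds of input well has a lower loss than one that outputs 0.5 for everything, while the generator's loss falls as the discriminator's output for generated data rises.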

In one embodiment, the generator 110 may generate waveform data belonging to the first group from waveform data belonging to the second group and may generate waveform data belonging to the second group from waveform data, and the discriminator 130 may determine whether the input frequency information is frequency information extracted from waveform data belonging to the first group. In this case, the learning unit 140 may cause the generator 110 to generate first converted waveform data belonging to the first group from first original waveform data belonging to the second group, and then cause the generator 110 to generate first reconstructed waveform data belonging to the second group from the first converted waveform data. The learning unit 140 may then train the generator 110 so as to reduce the error between the discrimination result for the frequency information extracted from the first original waveform data and the discrimination result for the frequency information extracted from the first reconstructed waveform data.

In one embodiment, the discriminator 130 may determine whether the input frequency information is frequency information extracted from waveform data belonging to the second group. In this case, the learning unit 140 may cause the generator 110 to generate second converted waveform data belonging to the second group from second original waveform data belonging to the first group, and then cause the generator 110 to generate second reconstructed waveform data belonging to the first group from the second converted waveform data. The learning unit 140 may then train the generator 110 so as to reduce the error between the discrimination result for the frequency information extracted from the second original waveform data and the discrimination result for the frequency information extracted from the second reconstructed waveform data.
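The round-trip training described in the two preceding embodiments might be sketched as below. The squared difference is an assumed error measure, and the generators, the frequency extraction, and the discriminator are toy stand-ins with hypothetical names (gain changes, mean energy, and a sigmoid), not the neural networks of the present disclosure.

```python
import math


def cycle_discrimination_error(original, g_forward, g_backward, extract, discriminate):
    """Squared gap between discriminator outputs for original and reconstructed data."""
    converted = g_forward(original)          # e.g. second group -> first group
    reconstructed = g_backward(converted)    # back to the original group
    d_original = discriminate(extract(original))
    d_reconstructed = discriminate(extract(reconstructed))
    return (d_original - d_reconstructed) ** 2


# Toy stand-ins (all hypothetical): a gain change plays the role of conversion.
def to_first_group(wave):
    return [2.0 * x for x in wave]


def to_second_group_exact(wave):
    return [0.5 * x for x in wave]   # exact inverse of to_first_group


def to_second_group_lossy(wave):
    return [0.4 * x for x in wave]   # imperfect inverse


def mean_energy(wave):
    return sum(x * x for x in wave) / len(wave)


def sigmoid_score(feature):
    return 1.0 / (1.0 + math.exp(-feature))


wave = [0.0, 1.0, 0.0, -1.0]
exact_error = cycle_discrimination_error(
    wave, to_first_group, to_second_group_exact, mean_energy, sigmoid_score)
lossy_error = cycle_discrimination_error(
    wave, to_first_group, to_second_group_lossy, mean_energy, sigmoid_score)
```

When the backward generator undoes the forward one exactly, the error vanishes; an imperfect inverse leaves a positive error that training would reduce.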

Next, a sound generation device according to one embodiment of the present disclosure will be described with reference to FIGS. 9 and 10. FIG. 9 is a block diagram showing the functional configuration of the sound generation device according to one embodiment of the present disclosure.

As shown in FIG. 9, the sound generation device 200 has an acquisition unit 210 and a generator 220. The sound generation device 200 acquires the sound generation model of the generator 110 from the learning device 100 and uses that model as the generator 220 to generate waveform data from source information.

The acquisition unit 210 acquires source information. The source information may be data of a type different from the audio data representing the waveform data to be generated, such as random numbers, audio, text, or utterances, and is of a type corresponding to the information input to the generator 110 trained in the learning device 100. That is, the source information may be label information corresponding to words or text information representing the words; it may be waveform data corresponding to voice data of words being uttered; it may be waveform data belonging to a second group different from the first group described later; or it may be random numbers.

The generator 220 inputs the source information to the trained sound generation model and acquires waveform data from the model. The sound generation model is trained in the learning device 100 according to the procedure described above. That is, the learning device 100 inputs source information to the generator 110 and acquires the corresponding waveform data from the generator 110. The learning device 100 then converts that waveform data and the training waveform data into spectrograms according to the acoustic image conversion method, inputs them to the discriminator 130, and trains the generator 110 and the discriminator 130 based on the error in the output values for these spectrograms. The generator 220 also performs predetermined output processing on the acquired waveform data, such as converting it into audio data.

Here, the sound generation device 200 may have, for example, a hardware configuration including a CPU 201, a ROM (Read-Only Memory) 202, a RAM 203, a USB (Universal Serial Bus) memory port 204, and a playback device 205, as shown in FIG. 10. The CPU 201 executes the various processes of the sound generation device 200 described later and functions as a processor that implements the acquisition unit 210 and the generator 220 described above. The ROM 202 and the RAM 203 function as memories that store various data and programs in the sound generation device 200. In particular, the RAM 203 functions as a working memory that stores work data of the CPU 201, and the ROM 202 stores control programs and/or data for the CPU 201. The USB memory port 204 acquires source information stored in a USB memory inserted by the user. The playback device 205 plays back, under instruction from the CPU 201, the audio data generated from the source information. However, the sound generation device 200 according to the present disclosure is not limited to this hardware configuration and may have any other suitable hardware configuration. For example, one or more of the acquisition unit 210 and the generator 220 described above may be implemented by an electronic circuit such as a filter circuit.

Next, sound generation processing in the sound generation device 200 according to one embodiment of the present disclosure will be described with reference to FIG. 11. FIG. 11 is a flowchart showing the sound generation processing according to one embodiment of the present disclosure.

As shown in FIG. 11, in step S301, the acquisition unit 210 acquires source information. Specifically, the acquisition unit 210 acquires source information of the kind that was input to the generator 110 for training in the learning device 100.

In step S302, the generator 220 inputs the source information to the trained sound generation model and acquires waveform data from the model.

In step S303, the generator 220 performs predetermined output processing on the acquired waveform data, such as converting it into audio data.
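Steps S301 to S303 can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the "audio data" of step S303 is taken to be 16-bit mono PCM in a WAV container, the sample rate is an arbitrary value, and `acquire_source_information` and `trained_model` are hypothetical stand-ins for the acquisition unit 210 and the trained sound generation model.

```python
import io
import struct
import wave


def acquire_source_information():
    """Step S301 (stand-in): return source information, here a fixed list of numbers."""
    return [0.0, 0.25, 0.5, 0.75, 1.0]


def trained_model(source):
    """Step S302 (stand-in for the trained generator): map source values to [-1, 1]."""
    return [2.0 * s - 1.0 for s in source]


def to_wav_bytes(samples, sample_rate=16000):
    """Step S303: clip to [-1, 1], quantize to 16-bit PCM, and wrap in a WAV container."""
    pcm = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
    return buffer.getvalue()


audio = to_wav_bytes(trained_model(acquire_source_information()))
```

The resulting bytes could be handed to a playback device such as the playback device 205; any other output format would serve equally well.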

In one aspect of the present disclosure, there is provided a waveform generation device comprising:
a generation unit that generates waveform data from input source information according to first stored information;
a discrimination unit that determines, according to second stored information, whether input frequency information is frequency information extracted from waveform data belonging to a first group; and
a control unit that changes the first stored information, based on the discrimination result obtained when the discrimination unit discriminates frequency information extracted from the waveform data generated by the generation unit, so that the generation unit generates waveform data belonging to the first group.

In one embodiment, the control unit may change the second stored information so that the discrimination unit can correctly distinguish waveform data belonging to the first group from waveform data not belonging to the first group, based on the discrimination result obtained when the discrimination unit discriminates frequency information extracted from waveform data belonging to the first group and the discrimination result obtained when the discrimination unit discriminates frequency information extracted from the waveform data generated by the generation unit.

In one embodiment, the device may have a conversion unit that converts waveform data into image data in which one of a plurality of axes is a logarithmic frequency axis, and the control unit may perform control so that the discrimination unit discriminates, as the frequency information, the image data obtained by converting the waveform data with the conversion unit.
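A logarithmic frequency axis can be illustrated with the geometric bin spacing used by constant-Q style analyses. This is only a sketch: the function name `log_frequency_bins`, the lowest frequency of 55 Hz, and the 12 bins per octave are assumptions for illustration, not values given in the present disclosure.

```python
def log_frequency_bins(f_min, bins_per_octave, n_bins):
    """Center frequencies spaced geometrically, as on a logarithmic frequency axis."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


# Hypothetical settings: two octaves above 55 Hz with 12 bins per octave.
centers = log_frequency_bins(f_min=55.0, bins_per_octave=12, n_bins=25)
```

With this spacing each octave occupies the same number of rows in the image, so a change in pitch appears as a vertical shift of the pattern rather than as a stretch.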

In one embodiment, the device may have a storage unit that stores the first stored information and the second stored information as a model to be trained that includes the generation unit and the discrimination unit. The control unit may input each of a plurality of pieces of source information to the generation unit to generate a plurality of pieces of waveform data, and cause the discrimination unit to discriminate both a plurality of pieces of image data obtained by converting each of the generated pieces of waveform data with the conversion unit and a plurality of pieces of image data obtained by converting each of a plurality of pieces of waveform data belonging to the first group with the conversion unit. The control unit may then train the model by changing the first stored information and the second stored information based on the plurality of discrimination results thus obtained.

In one embodiment, the discrimination unit may output, as the discrimination result, an output value corresponding to the likelihood that the input frequency information is frequency information extracted from waveform data belonging to the first group. The control unit may change the second stored information so that the output value for frequency information extracted from waveform data belonging to the first group indicates a higher likelihood and the output value for frequency information extracted from the waveform data generated by the generation unit indicates a lower likelihood, and may change the first stored information so that the output value for frequency information extracted from the waveform data generated by the generation unit indicates a higher likelihood.

In one embodiment, the waveform data belonging to the first group may be waveform data corresponding to voice data of words being uttered.

In one embodiment, the source information may be label information corresponding to the words or text information representing the words.

In one embodiment, the source information may be waveform data corresponding to voice data of words being uttered.

In one embodiment, the source information may be waveform data belonging to a second group different from the first group.

In one embodiment, the waveform data belonging to the first group may be waveform data corresponding to the voice of a specific person, and the waveform data belonging to the second group may be waveform data corresponding to the voice of a person different from the specific person.

In one embodiment, the generation unit may have a first generation unit that generates waveform data belonging to the first group from waveform data belonging to the second group and a second generation unit that generates waveform data belonging to the second group from the waveform data, and the discrimination unit may have a first discrimination unit that determines whether the input frequency information is frequency information extracted from waveform data belonging to the first group. The control unit may cause the first generation unit to generate first converted waveform data belonging to the first group from first original waveform data belonging to the second group, and then cause the second generation unit to generate first reconstructed waveform data belonging to the second group from the first converted waveform data. The control unit may then train the generation unit so as to reduce the error between the discrimination result obtained when the first discrimination unit discriminates frequency information extracted from the first original waveform data and the discrimination result obtained when the first discrimination unit discriminates frequency information extracted from the first reconstructed waveform data.

In one embodiment, the discrimination unit may further have a second discrimination unit that determines whether the input frequency information is frequency information extracted from waveform data belonging to the second group. The control unit may cause the second generation unit to generate second converted waveform data belonging to the second group from second original waveform data belonging to the first group, and then cause the first generation unit to generate second reconstructed waveform data belonging to the first group from the second converted waveform data. The control unit may then train the generation unit so as to reduce the error between the discrimination result obtained when the second discrimination unit discriminates frequency information extracted from the second original waveform data and the discrimination result obtained when the second discrimination unit discriminates frequency information extracted from the second reconstructed waveform data.

In one embodiment, the source information may be random numbers.

In one aspect of the present disclosure, there is provided a learning device comprising:
a memory that stores a model to be trained; and
a processor connected to the memory,
wherein the model to be trained includes a generator that generates waveform data from source information and a discriminator that generates an output value from a spectrogram, and
the processor:
inputs the source information to the generator and acquires first waveform data from the generator;
converts the first waveform data and second waveform data for training into a first spectrogram and a second spectrogram, respectively, according to an acoustic image conversion method;
inputs the first spectrogram and the second spectrogram to the discriminator and acquires an output value for each of the first spectrogram and the second spectrogram; and
trains the generator and the discriminator based on the error in the output values.

In one embodiment, the acoustic image conversion method may be one that is differentiable with respect to its input.
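Differentiability with respect to the input is what allows the discriminator's error to propagate back through the conversion to the generator. As a minimal sketch, assuming the conversion is the power of a single discrete Fourier transform bin (an illustrative simplification, with hypothetical function names), the gradient with respect to each input sample has a closed form that can be cross-checked with finite differences.

```python
import math


def power_bin(x, k):
    """Return (|X_k|^2, Re X_k, Im X_k) for DFT bin k of a real-valued signal x."""
    n_samples = len(x)
    re = sum(x[n] * math.cos(2 * math.pi * k * n / n_samples) for n in range(n_samples))
    im = -sum(x[n] * math.sin(2 * math.pi * k * n / n_samples) for n in range(n_samples))
    return re * re + im * im, re, im


def power_bin_gradient(x, k):
    """Analytic gradient of |X_k|^2 with respect to every input sample."""
    n_samples = len(x)
    _, re, im = power_bin(x, k)
    return [2.0 * (re * math.cos(2 * math.pi * k * n / n_samples)
                   - im * math.sin(2 * math.pi * k * n / n_samples))
            for n in range(n_samples)]


def numeric_gradient(x, k, step=1e-6):
    """Central finite differences, used only to cross-check the analytic gradient."""
    grads = []
    for n in range(len(x)):
        plus = x[:n] + [x[n] + step] + x[n + 1:]
        minus = x[:n] + [x[n] - step] + x[n + 1:]
        grads.append((power_bin(plus, k)[0] - power_bin(minus, k)[0]) / (2 * step))
    return grads


signal = [0.3, -0.7, 0.5, 0.1, -0.2, 0.9, -0.4, 0.6]
analytic = power_bin_gradient(signal, k=2)
numeric = numeric_gradient(signal, k=2)
```

The two gradients agree to within numerical precision, which is the property a training framework relies on when it backpropagates through a Fourier-based conversion.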

In one embodiment, the processor may train the generator and the discriminator according to a GAN (Generative Adversarial Network) method.

In one embodiment, the model to be trained may include a plurality of generators and a plurality of discriminators, and the processor may train the plurality of generators and the plurality of discriminators according to a cycle GAN method.

In one aspect of the present disclosure, there is provided a sound generation device comprising:
a memory that stores a trained model; and
a processor connected to the memory,
wherein the processor:
acquires source information; and
inputs the source information to the trained model and acquires waveform data from the trained model, and
the trained model is a generator obtained by:
inputting the source information to a generator and acquiring first waveform data from the generator;
converting the first waveform data and second waveform data for training into a first spectrogram and a second spectrogram, respectively, according to an acoustic image conversion method;
inputting the first spectrogram and the second spectrogram to a discriminator and acquiring an output value for each of the first spectrogram and the second spectrogram; and
training the generator and the discriminator based on the error in the output values.

In one embodiment, the generator and the discriminator may be trained according to the GAN method.

In one embodiment, the trained model may be obtained by training a plurality of generators and a plurality of discriminators according to the cycle GAN method.

In one aspect of the present disclosure, there is provided a method of training a model to be trained that includes a generator that generates waveform data from source information and a discriminator that generates an output value from a spectrogram, the method comprising:
a processor inputting the source information to the generator and acquiring first waveform data from the generator;
the processor converting the first waveform data and second waveform data for training into a first spectrogram and a second spectrogram, respectively, according to an acoustic image conversion method;
the processor inputting the first spectrogram and the second spectrogram to the discriminator and acquiring an output value for each of the first spectrogram and the second spectrogram; and
the processor training the generator and the discriminator based on the error in the output values.

In one aspect of the present disclosure, there is provided a method in which:
a processor acquires source information; and
the processor inputs the source information to a trained model and acquires waveform data from the trained model,
wherein the trained model is a generator obtained by:
inputting the source information to a generator and acquiring first waveform data from the generator;
converting the first waveform data and second waveform data for training into a first spectrogram and a second spectrogram, respectively, according to an acoustic image conversion method;
inputting the first spectrogram and the second spectrogram to a discriminator and acquiring an output value for each of the first spectrogram and the second spectrogram; and
training the generator and the discriminator based on the error in the output values.

In one aspect of the present disclosure, there is provided a program or a computer-readable storage medium that causes a processor to implement the method described above.

Although embodiments of the present disclosure have been described in detail above, the present disclosure is not limited to the specific embodiments described above, and various modifications and changes are possible within the scope of the gist of the present disclosure as set forth in the claims.

50 database
100 learning device
110, 220 generator
120 conversion unit
130 discriminator
140 learning unit
200 sound generation device
210 acquisition unit

Claims

1. A sound generation device comprising:
a generator that inputs source information into a first neural network and generates, as an output from the first neural network, waveform data corresponding to the source information input into the first neural network;
an extraction unit that extracts first frequency information, as differentiable numerical information, from the waveform data output from the first neural network of the generator;
a discriminator that inputs the first frequency information into a second neural network and outputs, as an output from the second neural network, a first discriminant value as differentiable numerical information indicating the degree of likelihood that the first frequency information is frequency information extracted from waveform data belonging to a first group; and
a control unit that trains the first neural network, based on a loss function that takes as input the first discriminant value output by the discriminator, so that the first discriminant value output by the discriminator indicates a higher likelihood.

2. The sound generation device according to claim 1, further comprising an acquisition unit that acquires waveform data belonging to the first group, wherein
the extraction unit extracts second frequency information, as differentiable numerical information, from the waveform data acquired by the acquisition unit,
the discriminator inputs the second frequency information into the second neural network and outputs, as an output from the second neural network, a second discriminant value as differentiable numerical information indicating the degree of likelihood that the second frequency information is frequency information extracted from waveform data belonging to the first group, and
the control unit, based on a loss function that takes as input the first discriminant value and the second discriminant value output by the discriminator, trains the first neural network so that the first discriminant value output by the discriminator indicates a higher likelihood, and trains the second neural network so that the first discriminant value output by the discriminator indicates a lower likelihood and the second discriminant value output by the discriminator indicates a higher likelihood.

3. The sound generation device according to claim 1 or 2, further comprising a conversion unit that converts waveform data into image data in which one of a plurality of axes is a logarithmic frequency axis, wherein
the extraction unit extracts, as the frequency information that is differentiable numerical information, the image data obtained by converting the waveform data with the conversion unit.

4. The sound generation device according to any one of claims 1 to 3, wherein
the source information is text information representing words, and
the waveform data belonging to the first group is waveform data corresponding to voice data of the words being uttered.

5. The sound generation device according to any one of claims 1 to 4, wherein
the waveform data belonging to the first group is waveform data corresponding to the voice of a specific person, and
the source information is waveform data corresponding to the voice of a person different from the specific person.

6. A sound generation device comprising:
a first generator that generates waveform data belonging to a first group from waveform data belonging to a second group;
a second generator that generates waveform data belonging to the second group from waveform data belonging to the first group;
a first discriminator that determines whether input frequency information is frequency information extracted from waveform data belonging to the first group; and
a control unit that causes the first generator to generate first converted waveform data belonging to the first group from first original waveform data belonging to the second group, then causes the second generator to generate first reconstructed waveform data belonging to the second group from the first converted waveform data, and trains the first generator and the second generator so as to reduce the error between the discrimination result obtained when the first discriminator discriminates frequency information extracted from the first original waveform data and the discrimination result obtained when the first discriminator discriminates frequency information extracted from the first reconstructed waveform data.

7. The sound generation device according to claim 6, further comprising a second discriminator that determines whether input frequency information is frequency information extracted from waveform data belonging to the second group, wherein
the control unit causes the second generator to generate second converted waveform data belonging to the second group from second original waveform data belonging to the first group, then causes the first generator to generate second reconstructed waveform data belonging to the first group from the second converted waveform data, and trains the first generator and the second generator so as to reduce the error between the discrimination result obtained when the second discriminator discriminates frequency information extracted from the second original waveform data and the discrimination result obtained when the second discriminator discriminates frequency information extracted from the second reconstructed waveform data.

8. The sound generation device according to claim 6, wherein
the waveform data belonging to the first group is waveform data corresponding to the voice of a specific person, and
the waveform data belonging to the second group is waveform data corresponding to the voice of a person different from the specific person.

9. The sound generation device according to claim 6, wherein
the waveform data belonging to the first group is waveform data corresponding to a male voice, and
the waveform data belonging to the second group is waveform data corresponding to a female voice.

A method in which a processor executes:
a generation process of inputting source information to a first neural network and generating, as an output from the first neural network, waveform data corresponding to the source information input to the first neural network;
an extraction process of extracting first frequency information as differentiable numerical information from the waveform data output from the first neural network by the generation process;
a discrimination process of inputting the first frequency information to a second neural network and outputting, as an output from the second neural network, a first discriminant value as differentiable numerical information indicating the degree of likelihood that the first frequency information is frequency information extracted from waveform data belonging to a first group; and
a control process of training the first neural network, based on a loss function that takes as input the first discriminant value output by the discrimination process, so that the first discriminant value output by the discrimination process indicates a higher likelihood.
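The four processes recited above can be illustrated with a minimal sketch. It is not the claimed implementation: a linear map stands in for the first neural network, a single logistic unit stands in for the second neural network, the frequency information is taken to be a one-sided power spectrum, and the loss is assumed to be the negative log of the discriminant value. All names and numeric values (`generate`, `dft`, `discriminate`, `loss_and_grad`, `train_step`, `W_init`, `s`, `v`, `b`, the learning rate) are hypothetical. Because every stage is differentiable, the gradient of the loss reaches the generator weights by the chain rule.

```python
import cmath
import math


def generate(W, s):
    """Generation process: a linear map standing in for the first neural network."""
    return [sum(w * si for w, si in zip(row, s)) for row in W]


def dft(x):
    """One-sided discrete Fourier transform (bins 0 .. n/2) of a real waveform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]


def discriminate(P, v, b):
    """Discrimination process: a logistic unit standing in for the second neural network."""
    z = sum(vk * pk for vk, pk in zip(v, P)) + b
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(W, s, v, b):
    """Return the assumed loss -log(d) and its gradient with respect to W."""
    x = generate(W, s)
    X = dft(x)
    P = [abs(c) ** 2 for c in X]  # extraction process: power spectrum
    d = discriminate(P, v, b)     # first discriminant value in (0, 1)
    loss = -math.log(d)
    n = len(x)
    # Chain rule: dL/dz = -(1 - d), dz/dP_k = v_k,
    # dP_k/dx_t = 2 * Re(conj(X_k) * exp(-2*pi*i*k*t/n)).
    dL_dx = [
        sum(-(1.0 - d) * v[k] * 2.0
            * (X[k].conjugate() * cmath.exp(-2j * math.pi * k * t / n)).real
            for k in range(len(X)))
        for t in range(n)
    ]
    # x_t = sum_m W[t][m] * s[m], so dL/dW[t][m] = dL/dx_t * s[m].
    grad = [[g * si for si in s] for g in dL_dx]
    return loss, grad


def train_step(W, s, v, b, lr):
    """Control process: one gradient-descent update of the generator only."""
    loss, grad = loss_and_grad(W, s, v, b)
    new_W = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, grad)]
    return new_W, loss


# Hypothetical toy values: 8-sample waveform, 2-element source vector.
s = [1.0, 0.5]
W_init = [[0.1 * math.sin(t + m) for m in range(2)] for t in range(8)]
v = [-0.5, 1.0, -0.5, -0.5, -0.5]  # this toy discriminator rewards energy in bin 1
b = 0.0

W = W_init
losses = []
for _ in range(10):
    W, step_loss = train_step(W, s, v, b, lr=0.01)
    losses.append(step_loss)  # loss measured before each update

print(f"loss at first step: {losses[0]:.4f}, at last step: {losses[-1]:.4f}")
```

In this sketch the discriminator parameters `v` and `b` are held fixed and only the generator weights change, which mirrors the control process recited above; how the second neural network itself is trained is outside the sketch.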

A method in which a processor obtains source information, inputs the source information to a trained model, and obtains waveform data from the trained model,
wherein the trained model comprises the first neural network obtained by performing:
a generation process of inputting source information to a first neural network and generating, as an output from the first neural network, waveform data corresponding to the source information input to the first neural network;
an extraction process of extracting first frequency information as differentiable numerical information from the waveform data output from the first neural network by the generation process;
a discrimination process of inputting the first frequency information to a second neural network and outputting, as an output from the second neural network, a first discriminant value as differentiable numerical information indicating the degree of likelihood that the first frequency information is frequency information extracted from waveform data belonging to a first group; and
a control process of training the first neural network, based on a loss function that takes as input the first discriminant value output by the discrimination process, so that the first discriminant value output by the discrimination process indicates a higher likelihood.

A method in which a processor executes:
a first generating process of generating waveform data belonging to a first group from waveform data belonging to a second group according to first stored information;
a second generating process of generating waveform data belonging to the second group from waveform data belonging to the first group according to second stored information;
a first discrimination process of discriminating whether or not input frequency information is frequency information extracted from waveform data belonging to the first group; and
a control process of generating, by the first generating process, first converted waveform data belonging to the first group from first original waveform data belonging to the second group, then generating, by the second generating process, first reconstructed waveform data belonging to the second group from the first converted waveform data, and changing the first stored information and the second stored information so as to reduce the error between a discrimination result obtained by the first discrimination process discriminating frequency information extracted from the first original waveform data and a discrimination result obtained by the first discrimination process discriminating frequency information extracted from the first reconstructed waveform data.
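The cycle recited above (convert, reconstruct, then compare the two discrimination results) can be sketched in a deliberately small form. This is an illustration under strong assumptions, not the claimed implementation: each generating process is reduced to a scalar gain, so the "stored information" is just the numbers `a` and `b`; the discrimination process is a fixed logistic unit over a one-sided power spectrum; and the error is assumed to be the squared difference of the two discrimination results. All names and values (`dft`, `discriminate`, `cycle_step`, `x_orig`, `v_cyc`, `c_cyc`, the learning rate) are hypothetical.

```python
import cmath
import math


def dft(x):
    """One-sided discrete Fourier transform (bins 0 .. n/2) of a real waveform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]


def discriminate(X, v, c):
    """First discrimination process applied to the power spectrum |X_k|^2."""
    z = sum(vk * abs(Xk) ** 2 for vk, Xk in zip(v, X)) + c
    return 1.0 / (1.0 + math.exp(-z))


def cycle_step(a, b, x, v, c, lr):
    """One update of the stored information (a, b) that reduces the cycle error."""
    n = len(x)
    converted = [a * xt for xt in x]              # first generating process
    reconstructed = [b * yt for yt in converted]  # second generating process
    d_orig = discriminate(dft(x), v, c)           # result for the original data
    X_rec = dft(reconstructed)
    d_rec = discriminate(X_rec, v, c)             # result for the reconstruction
    err = d_orig - d_rec
    loss = err ** 2
    # Chain rule through the discriminator and the power spectrum.
    dL_dz = -2.0 * err * d_rec * (1.0 - d_rec)
    dL_dxrec = [
        dL_dz * sum(v[k] * 2.0
                    * (X_rec[k].conjugate()
                       * cmath.exp(-2j * math.pi * k * t / n)).real
                    for k in range(len(X_rec)))
        for t in range(n)
    ]
    # reconstructed[t] = a * b * x[t], so d/da = b * x[t] and d/db = a * x[t].
    grad_a = sum(g * b * xt for g, xt in zip(dL_dxrec, x))
    grad_b = sum(g * a * xt for g, xt in zip(dL_dxrec, x))
    return a - lr * grad_a, b - lr * grad_b, loss


# Hypothetical toy data: one period of a sine wave over 8 samples.
x_orig = [0.25 * math.sin(2.0 * math.pi * t / 8) for t in range(8)]
v_cyc = [-0.5, 1.0, -0.5, -0.5, -0.5]
c_cyc = 0.0

a, b = 1.5, 0.5  # initial stored information; a * b != 1, so the cycle is lossy
cycle_losses = []
for _ in range(50):
    a, b, step_loss = cycle_step(a, b, x_orig, v_cyc, c_cyc, lr=1.0)
    cycle_losses.append(step_loss)  # error measured before each update

print(f"a*b after training: {a * b:.4f}")
```

With scalar gains the error vanishes when the product `a * b` has magnitude one, that is, when the round trip restores the spectrum that the discriminator sees. Real generators would be neural networks, and this cycle term would normally be combined with other loss terms.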

A program that causes a processor to implement the method according to any one of claims 10 to 12.

A learning device for carrying out the method according to claim 10 or 12.

A sound generating device for performing the method of claim 11.