JP6989951B2

JP6989951B2 - Speech chain device, computer program and DNN speech recognition / synthesis mutual learning method

Info

Publication number: JP6989951B2
Application number: JP2018001538A
Authority: JP
Inventors: チャンドラアンドロス; サクティサクリアニ; 哲中村
Original assignee: Nara Institute of Science and Technology NUC
Current assignee: Nara Institute of Science and Technology NUC
Priority date: 2018-01-09
Filing date: 2018-01-09
Publication date: 2022-01-12
Anticipated expiration: 2038-01-09
Also published as: JP2019120841A

Description

Article 30, Paragraph 2 of the Patent Act applies: published on July 16, 2017 at https://arxiv.org/pdf/1707.04879.pdf.

Article 30, Paragraph 2 of the Patent Act applies: presented on December 18, 2017 at the 2017 IEEE Automatic Speech Recognition and Understanding Workshop.

The present invention relates to automatic speech recognition (ASR) and text-to-speech synthesis (TTS), and in particular to a technique for mutually training a speech recognition unit and a speech synthesis unit, each constructed as a deep neural network (DNN).

In recent years, spoken language processing technology based on ASR and TTS has advanced, and machines and humans are becoming able to communicate through speech. For ASR, approaches grounded in acoustic phonetics have been tried, ranging from template-based schemes using dynamic time warping (DTW) to data-driven methods using rigorous statistical models such as the hidden Markov model with Gaussian mixture model (HMM-GMM). For TTS, the field has been shifting from rule-based systems using waveform coding and analysis-synthesis methods toward more flexible approaches using waveform unit concatenation and the hidden semi-Markov model with Gaussian mixture model (HSMM-GMM).

Furthermore, remarkable improvements in computer hardware performance in recent years have made DNNs practical in a variety of fields, and deep learning using DNNs is being adopted for ASR and TTS as well (see, for example, Non-Patent Documents 1 and 2 below).

Non-Patent Document 1: W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960-4964.
Non-Patent Document 2: Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: A fully end-to-end text-to-speech synthesis model,” arXiv preprint arXiv:1703.10135, 2017.

People speak while listening to their own voice. That is, the human brain decides what to utter next based on the loudness, tone, clarity, and so on of its own voice as heard through the ears, and sends instructions to the vocal organs. In human speech recognition and speech production, the speech chain, a closed loop consisting of the auditory organs, the brain, and the vocal organs, thus plays a very important role. For example, it is known that children who lose their hearing become unable to speak well because the speech chain no longer functions. Although human speech recognition and speech production are so closely related, research and development of ASR and TTS have progressed independently of each other.

This separation of ASR and TTS has not changed even since deep learning using DNNs was adopted. The separation of ASR and TTS gives rise to the following problems.
1. To train ASR and TTS each to a sufficient level, a large amount of supervised data consisting of speech-text pairs must be prepared. Supervised data must be created by hand, which takes a great deal of labor and cost.
2. In the actual inference stage, noise is mixed into signals input online, which can increase the output error of the trained ASR and TTS or prevent any output from being obtained. ASR and TTS therefore need to be retrained based on the signals input online, but such signals are unsupervised data to begin with, and no mechanism has been established for training ASR and TTS using unsupervised data.

In view of the above problems, an object of the present invention is to provide a speech chain device that reproduces the mechanism of the human speech chain by machine.

According to one aspect of the present invention, there is provided a speech chain device comprising: a speech recognition unit constructed as a deep neural network that takes speech feature sequence data as input and outputs character sequence data; a speech synthesis unit constructed as a deep neural network that takes character sequence data as input and outputs speech feature sequence data; a speech feature extraction unit that processes input speech to generate the speech feature sequence data input to the speech recognition unit; a text generation unit that generates, based on the character sequence data output from the speech recognition unit, text corresponding to the speech input to the speech feature extraction unit; a text feature extraction unit that processes input text to generate the character sequence data input to the speech synthesis unit; a speech generation unit that generates, based on the speech feature sequence data output from the speech synthesis unit, speech corresponding to the text input to the text feature extraction unit; a first learning control unit that inputs the speech feature sequence data output from the speech synthesis unit to the speech recognition unit as training data and trains the speech recognition unit using the character sequence data generated by the text feature extraction unit as teacher data; and a second learning control unit that inputs the character sequence data output from the speech recognition unit to the speech synthesis unit as training data and trains the speech synthesis unit using the speech feature sequence data generated by the speech feature extraction unit as teacher data.

Specifically, the speech feature sequence data input to the speech recognition unit may be mel-spectrum features; the speech feature sequence data output from the speech synthesis unit may be linear-spectrum features and mel-spectrum features; the speech feature extraction unit may generate, as the speech feature sequence data, linear-spectrum features and mel-spectrum features of the speech input to the speech feature extraction unit; and the second learning control unit may train the speech synthesis unit using the linear-spectrum features and the mel-spectrum features generated by the speech feature extraction unit as teacher data.

Also specifically, the speech synthesis unit may have an output layer representing the probability of the end of the utterance, and the second learning control unit may train the speech synthesis unit additionally using the probability of the end of the utterance as teacher data.

Also specifically, the speech synthesis unit may have an input layer to which speaker identification information is input.

According to another aspect of the present invention, there is provided a computer program for causing a computer to implement each component of the above speech chain device.

According to yet another aspect of the present invention, there is provided a DNN speech recognition/synthesis mutual learning method for mutually training a speech recognition unit constructed as a deep neural network that takes speech feature sequence data as input and outputs character sequence data, and a speech synthesis unit constructed as a deep neural network that takes character sequence data as input and outputs speech feature sequence data, the method comprising: a first step of, when a speech-text pair is given as supervised data, inputting the speech feature sequence data of the speech to the speech recognition unit as training data and training the speech recognition unit using the character sequence data of the text as teacher data, and also inputting the character sequence data of the text to the speech synthesis unit as training data and training the speech synthesis unit using the speech feature sequence data of the speech as teacher data; a second step of, when only speech is given as unsupervised data, inputting the speech feature sequence data of the speech to the speech recognition unit, inputting the character sequence data output from the speech recognition unit to the speech synthesis unit as training data, and training the speech synthesis unit using the speech feature sequence data of the speech as teacher data; and a third step of, when only text is given as unsupervised data, inputting the character sequence data of the text to the speech synthesis unit, inputting the speech feature sequence data output from the speech synthesis unit to the speech recognition unit as training data, and training the speech recognition unit using the character sequence data of the text as teacher data.
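The three steps can be sketched as a routine that, for each kind of data, decides which model is trained, what it receives as input, and what serves as teacher data. This is only an illustrative sketch: the function name `training_examples` and the toy stand-ins `toy_asr` and `toy_tts` are hypothetical and replace real DNN models with simple string functions.

```python
def training_examples(paired, speech_only, text_only, asr, tts):
    """Return (model, input, teacher) triples for steps 1 to 3."""
    examples = []
    # Step 1: a speech-text pair trains both models directly.
    for x, y in paired:
        examples.append(("ASR", x, y))  # input: speech, teacher: text
        examples.append(("TTS", y, x))  # input: text, teacher: speech
    # Step 2: speech only. The ASR output becomes the TTS input,
    # and the original speech is the teacher data for the TTS.
    for x in speech_only:
        examples.append(("TTS", asr(x), x))
    # Step 3: text only. The TTS output becomes the ASR input,
    # and the original text is the teacher data for the ASR.
    for y in text_only:
        examples.append(("ASR", tts(y), y))
    return examples


def toy_asr(x):
    # Hypothetical stand-in for the speech recognition unit.
    return "text(" + x + ")"


def toy_tts(y):
    # Hypothetical stand-in for the speech synthesis unit.
    return "speech(" + y + ")"


examples = training_examples([("x1", "y1")], ["x2"], ["y3"], toy_asr, toy_tts)
for model, model_input, teacher in examples:
    print(model, model_input, teacher)
```

In an actual system each triple would drive one parameter update of the named model rather than being collected in a list.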

The above DNN speech recognition/synthesis mutual learning method may further comprise: a fourth step of taking a fixed amount of each type of data from a dataset in which three types of data are mixed, namely speech-text pairs, text only, and speech only; and a fifth step of performing batch training of the speech recognition unit and the speech synthesis unit by repeating the first through third steps in order using each type of data taken from the dataset.
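The fourth step, taking a fixed amount of each data type from a mixed dataset, might look like the following sketch. The representation of the dataset as `(speech, text)` tuples with `None` marking a missing side, and the name `draw_batches`, are assumptions made for illustration only.

```python
def draw_batches(dataset, batch_size):
    """Yield (paired, speech_only, text_only) batches of equal size.

    Each dataset item is a (speech, text) tuple; None marks a missing side.
    """
    paired = [d for d in dataset if d[0] is not None and d[1] is not None]
    speech_only = [d[0] for d in dataset if d[0] is not None and d[1] is None]
    text_only = [d[1] for d in dataset if d[0] is None and d[1] is not None]
    # Stop when any of the three kinds can no longer fill a whole batch.
    num_batches = min(len(paired), len(speech_only), len(text_only)) // batch_size
    for i in range(num_batches):
        window = slice(i * batch_size, (i + 1) * batch_size)
        yield paired[window], speech_only[window], text_only[window]


dataset = [
    ("x1", "y1"), ("x3", None), (None, "y5"),
    ("x2", "y2"), ("x4", None), (None, "y6"),
]
batches = list(draw_batches(dataset, batch_size=1))
for paired, speech_only, text_only in batches:
    print(paired, speech_only, text_only)
```

Each yielded triple would then be consumed by the first through third steps in order, which corresponds to the fifth step.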

According to the present invention, the mechanism of the human speech chain can be reproduced by machine. This makes it possible to perform online training of speech synthesis and speech recognition using, as unsupervised data, speech input for speech recognition and text input for speech synthesis, which reduces the labor and cost of preparing large numbers of speech-text pairs as supervised data. Furthermore, the more the speech chain device according to the present invention is used as a speech recognition device and a speech synthesis device, the further training progresses and the more the accuracy of speech recognition and speech synthesis improves.

FIG. 1 is a diagram showing the architecture of the machine speech chain on which the speech chain device according to the present invention is based.
FIG. 2 is a schematic diagram showing how the TTS is trained in the machine speech chain using the output of the ASR as training data for the TTS.
FIG. 3 is a schematic diagram showing how the ASR is trained in the machine speech chain using the output of the TTS as training data for the ASR.
FIG. 4 is a schematic diagram of a DNN speech recognition model according to one example.
FIG. 5 is a schematic diagram of a DNN speech synthesis model according to one example.
FIG. 6 is a block diagram of a speech chain device according to an embodiment of the present invention.
FIG. 7 is a flowchart of speech feature sequence data generation processing.
FIG. 8 is a flowchart of character sequence data generation processing.
FIG. 9 is an overall flowchart of DNN speech recognition/synthesis mutual learning.
FIG. 10 is a flowchart of batch training processing for the ASR and the TTS.

Embodiments are described in detail below with reference to the drawings as appropriate. However, unnecessarily detailed description may be omitted. For example, detailed description of matters that are already well known, or duplicate description of substantially identical configurations, may be omitted. This is to keep the following description from becoming unnecessarily redundant and to make it easier for those skilled in the art to understand.

Note that the inventors provide the accompanying drawings and the following description so that those skilled in the art can fully understand the present invention, and do not intend them to limit the subject matter recited in the claims.

Also note that where this specification simply says "speech", it refers to a speech waveform signal.

1. Machine Speech Chain
FIG. 1 is a diagram showing the architecture of the machine speech chain on which the speech chain device according to the present invention is based. The machine speech chain 1 reproduces by machine the human speech chain mechanism described above, and includes a speech recognition unit (hereinafter sometimes simply "ASR") 10 that receives speech x and converts it into text ŷ, and a speech synthesis unit (hereinafter sometimes simply "TTS") 20 that receives text y and converts it into speech x̂. In the machine speech chain 1, the ASR 10 and the TTS 20 are connected to each other so that the output of the ASR 10 (text ŷ) is input to the TTS 20 and the output of the TTS 20 (speech x̂) is input to the ASR 10, forming a closed loop.

The ASR 10 and the TTS 20 are both configured as sequence-to-sequence models whose inputs and outputs are sequence data. Specifically, the ASR 10 is configured as a model that takes speech feature sequence data as input and outputs character sequence data, and the TTS 20 is configured as a model that takes character sequence data as input and outputs speech feature sequence data. Because both the ASR 10 and the TTS 20 are configured as sequence-to-sequence models, the output of either one can be fed into the other.

In addition, because the ASR 10 and the TTS 20 form a closed loop in the machine speech chain 1, each model can be trained using the output of the other model as training data. For example, the speech x̂ output from the TTS 20 during speech synthesis can be used as training data to train the ASR 10, and conversely the text ŷ output from the ASR 10 during speech recognition can be used as training data to train the TTS 20.

FIG. 2 is a schematic diagram showing how the TTS 20 is trained in the machine speech chain 1 using the output of the ASR 10 as training data for the TTS 20. As shown in FIG. 2, when speech recognition is performed in the machine speech chain 1, the ASR 10 receives speech x and converts it into text ŷ. The TTS 20 receives the text ŷ produced by the ASR 10 and converts it back into speech x̂. At this point, using the text ŷ produced by the ASR 10 as training data and the original speech x input to the ASR 10 as teacher data, the parameters of the TTS 20 are adjusted, that is, deep learning is performed, so that the error between the output of the TTS 20 (speech x̂) and the teacher data (speech x) becomes small (so that the value of the loss function Loss_TTS(x, x̂) becomes small).
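The speech-only path of FIG. 2 can be illustrated with a small numeric sketch. The form of Loss_TTS is not given in this passage, so the mean squared error used here is an assumption, as are the names `squared_error_loss`, `tts_loss_from_speech`, `toy_asr`, and `toy_tts`; the toy models return fixed values purely to make the data flow visible, and both feature sequences are assumed to have the same length.

```python
def squared_error_loss(x, x_hat):
    """Mean squared error over two equal-length sequences of feature frames."""
    total, count = 0.0, 0
    for frame, frame_hat in zip(x, x_hat):
        for value, value_hat in zip(frame, frame_hat):
            total += (value - value_hat) ** 2
            count += 1
    return total / count


def tts_loss_from_speech(x, asr, tts):
    """Speech x -> ASR -> text y_hat -> TTS -> speech x_hat, compared with x."""
    y_hat = asr(x)      # training data for the TTS
    x_hat = tts(y_hat)  # TTS output to be compared with teacher data x
    return squared_error_loss(x, x_hat)


def toy_asr(x):
    # Hypothetical stand-in: always recognizes the same text.
    return "ab"


def toy_tts(y):
    # Hypothetical stand-in: always synthesizes the same single frame.
    return [[1.0, 4.0]]


loss = tts_loss_from_speech([[1.0, 2.0]], toy_asr, toy_tts)
print(loss)
```

In training, this loss would be minimized with respect to the TTS parameters only; the ASR merely supplies the input text.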

FIG. 3 is a schematic diagram showing how the ASR 10 is trained in the machine speech chain 1 using the output of the TTS 20 as training data for the ASR 10. As shown in FIG. 3, when speech synthesis is performed in the machine speech chain 1, the TTS 20 receives text y and converts it into speech x̂. The ASR 10 receives the speech x̂ produced by the TTS 20 and converts it back into text ŷ. At this point, using the speech x̂ produced by the TTS 20 as training data and the original text y input to the TTS 20 as teacher data, the parameters of the ASR 10 are adjusted, that is, deep learning is performed, so that the error between the output of the ASR 10 (text ŷ, or more precisely the occurrence probability p_y of each character making up ŷ) and the teacher data (text y) becomes small (so that the value of the loss function Loss_ASR(y, p_y) becomes small).

If, as in the conventional case, the speech recognition model and the speech synthesis model are not interconnected, speech-text pairs must be prepared as supervised data and each model must be trained offline (for training the speech recognition model, speech is used as training data and text as teacher data; for training the speech synthesis model, text is used as training data and speech as teacher data). In contrast, the machine speech chain 1 can of course train the ASR 10 and the TTS 20 offline in teacher-forcing mode using supervised data, and can in addition train the TTS 20 using speech input online for speech recognition and train the ASR 10 using text input online for speech synthesis. In other words, the machine speech chain 1 can train the ASR 10 and the TTS 20 online while performing speech recognition or speech synthesis.

2. DNN Speech Recognition and Synthesis Models
Next, the ASR 10 and the TTS 20 that make up the machine speech chain 1 are described in detail. In the embodiments of the present invention, the ASR 10 and the TTS 20 are both constructed as deep neural networks (DNNs).

First, the DNN speech recognition model is described. FIG. 4 is a schematic diagram of a DNN speech recognition model according to one example. The ASR 10 is configured as a sequence-to-sequence model that obtains the conditional probability p(y|x), where speech x is speech feature sequence data of length S (that is, x = [x_1, ..., x_S]) and text y is character sequence data of length T (that is, y = [y_1, ..., y_T]). Specifically, the ASR 10 can be constructed as an autoencoder based on a recurrent neural network (RNN). Each element x_s of the speech feature sequence data is a D-dimensional real-valued vector. Each element y_t of the character sequence data is a phoneme or a grapheme.

More specifically, the ASR 10 includes an encoder 11, a decoder 12, and an attention 13. The encoder 11 includes three bidirectional LSTM (Bi-LSTM: Bidirectional Long Short-Term Memory) layers 111, 112, and 113. In the encoder 11, speech feature sequence data x_1, ..., x_S represented as log-mel-spectrum features is input to the first-stage bidirectional LSTM layer 111, and hidden-layer vectors h^e_s (s = 1, ..., S) are output from the final-stage bidirectional LSTM layer 113.

The decoder 12 includes a character embedding (Char Emb.: Character Embed) layer 121 and an LSTM layer 122. In the decoder 12, character sequence data y_0, ..., y_{T-1} is input to the character embedding layer 121, and character sequence data y_1, ..., y_T is output from the LSTM layer 122. The character sequence data y_t input to the decoder 12 is not the phoneme or grapheme itself but the id or index number of the phoneme or grapheme. The output y_t of the decoder 12 at time t is computed by weighting, with a predetermined linear operator, the vector obtained by concatenating the hidden-layer vector h^d_t output from the LSTM layer 122 with the context vector c_t computed by the attention 13, and then feeding the result into a predetermined activation function. Although not shown, the character sequence data y_1, ..., y_T output from the LSTM layer 122 is normalized by a softmax function into the occurrence probabilities p_y1, ..., p_yT of each character.
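The output computation of the decoder 12 (concatenate h^d_t and c_t, apply a linear operator, apply an activation function, normalize with softmax) can be sketched in plain Python as follows. The text only says "predetermined" for the linear operator and the activation function, so the weight matrix `W` and the choice of `tanh` here are assumptions, and the function names are hypothetical.

```python
import math


def softmax(z):
    """Normalize scores into probabilities that sum to 1."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def decoder_output(h_d, c, W, activation=math.tanh):
    """Concatenate h_d and c, weight with W, then apply the activation."""
    v = h_d + c  # list concatenation of the two vectors
    return [activation(sum(w * a for w, a in zip(row, v))) for row in W]


h_d = [0.5, -0.5]  # toy decoder hidden-layer vector
c = [1.0]          # toy context vector
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]  # toy linear operator: one row per output class

p_y = softmax(decoder_output(h_d, c, W))
print(p_y)
```

With this identity-like `W`, the class aligned with the context vector receives the highest probability, which simply reflects the toy inputs chosen here.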

The attention 13 is a module that computes the context vector c_t. More specifically, the attention 13 computes a value a_t from the hidden-layer vector h^d_t at time t output from the LSTM layer 122 of the decoder 12 and the hidden-layer vectors h^e_1, ..., h^e_S held by the bidirectional LSTM layer 113 of the encoder 11, and then computes the context vector c_t from the value a_t and the hidden-layer vectors h^e_1, ..., h^e_S. The formulas for the value a_t and the context vector c_t are well known, so their description is omitted here.
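Since the formulas are omitted, the following is only one common way to realize such an attention module, not necessarily the one intended here: a_t is a softmax over alignment scores between h^d_t and each h^e_s, and c_t is the a_t-weighted sum of the encoder vectors. The dot-product score and the function name `attention` are assumptions for illustration.

```python
import math


def attention(h_d, h_e):
    """Return (a_t, c_t) for one decoder step.

    h_d: decoder hidden-layer vector at time t.
    h_e: list of encoder hidden-layer vectors h^e_1 ... h^e_S.
    """
    # Alignment scores (assumed here to be dot products).
    scores = [sum(a * b for a, b in zip(h_d, h)) for h in h_e]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    a_t = [e / total for e in exps]  # attention weights over s = 1 ... S
    dim = len(h_e[0])
    # Context vector: weighted sum of the encoder vectors.
    c_t = [sum(a_t[s] * h_e[s][k] for s in range(len(h_e))) for k in range(dim)]
    return a_t, c_t


a_t, c_t = attention([1.0, 0.0], [[0.0, 1.0], [0.0, 3.0]])
print(a_t, c_t)
```

Here both encoder vectors score equally against the decoder vector, so the weights are uniform and the context vector is their average.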

The parameters of the ASR 10 are adjusted using stochastic gradient descent, error backpropagation, or the like so that the value of the following loss function is minimized.

[Equation image not reproduced in the source text]

Here, C is the number of output classes and y is the ground-truth text.
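The loss function referred to above does not survive in this text. One standard form that is consistent with the surrounding definitions (T characters, C output classes, ground-truth text y, and the softmax probabilities p_y) is the per-character cross-entropy below. It is offered as a reconstruction under that assumption, not as the exact expression of the original; in particular the 1/T normalization is assumed.

```latex
\mathrm{Loss}_{ASR}(y, p_y)
  = -\frac{1}{T} \sum_{t=1}^{T} \sum_{c=1}^{C}
    \mathbb{1}(y_t = c)\, \log p_{y_t}[c]
```

Here \mathbb{1}(y_t = c) is 1 when the ground-truth character at position t belongs to class c and 0 otherwise, and p_{y_t}[c] is the probability the ASR assigns to class c at position t.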

When speech-text pairs are given as supervised data during offline training, the ASR 10 can be trained using the speech feature sequence data of the speech as training data and the character sequence data of the text as teacher data. On the other hand, when only text is given as unsupervised data, for example when using text input online for speech synthesis, the ASR 10 can be trained, as described with reference to FIG. 3, using the speech feature sequence data output from the TTS 20 as training data and the character sequence data of the text input online for speech synthesis as teacher data.

Next, the DNN speech synthesis model is described. FIG. 5 is a schematic diagram of a DNN speech synthesis model according to one example. The TTS 20 is constructed as a sequence-to-sequence model that obtains the conditional probability p(x|y), where text y is character sequence data of length T (that is, y = [y_1, ..., y_T]) and speech x is speech feature sequence data of length S (that is, x = [x_1, ..., x_S]). Specifically, the TTS 20 can be constructed as an autoencoder based on a recurrent neural network. Each element x_s of the speech feature sequence data is a D-dimensional real-valued vector. Each element y_t of the character sequence data is a phoneme or a grapheme.

More specifically, the TTS 20 includes an encoder 21, a decoder 22, and an attention 23. The encoder 21 includes a character embedding layer 211, a fully connected (FC) layer 212, and a CBHG (1-D Convolution Bank + Highway network + bidirectional GRU) 213. In the encoder 21, character sequence data y_1, ..., y_T is input to the character embedding layer 211, and hidden-layer vectors h^e_t (t = 1, ..., T) are output from the CBHG 213. The character sequence data y_t input to the encoder 21 is not the phoneme or grapheme itself but the id or index number of the phoneme or grapheme.

The decoder 22 includes a fully connected layer 221, an LSTM layer 222, a CBHG 223, and a fully connected layer 224. In the decoder 22, speech feature sequence data x^M_0, ..., x^M_{S-1} represented as log-mel-spectrum features is input to the fully connected layer 221, and speech feature sequence data x^M_1, ..., x^M_S represented as log-mel-spectrum features is output from the LSTM layer 222. Also in the decoder 22, the speech feature sequence data x^M_1, ..., x^M_S output from the LSTM layer 222 is input to the CBHG 223, and speech feature sequence data x^R_s (s = 1, ..., S) represented as linear-spectrum features is output from the fully connected layer 224. The input x^M_s to the CBHG 223 is computed by weighting, with a predetermined linear operator, the vector obtained by concatenating the hidden-layer vector h^d_s output from the LSTM layer 222 with the context vector c_s computed by the attention 23, and then feeding the result into a predetermined activation function. Although not shown, the speech feature sequence data x^R_s (s = 1, ..., S) output from the fully connected layer 224 is processed according to the Griffin-Lim algorithm to reconstruct the speech.

The decoder 22 further includes an output layer 225 and an input layer 226. The output layer 225 outputs the probability that the utterance has ended. It is provided because the speech feature sequence data x^M_s and x^R_s (s = 1, ..., S) output from the decoder 22 are both real-valued vectors, from which the end of the utterance cannot be determined. Without the output layer 225, speech synthesis would have to be forcibly terminated once the speech feature sequence data output from the TTS 20 reached a predetermined length, which could make the end of the utterance sound unnatural. With the output layer 225, the end of the utterance can be detected, so synthesis can stop at that point instead of being cut off at a predetermined length, and speech with a natural ending can be synthesized.
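The role of the output layer 225 can be illustrated with a minimal sketch of a generation loop. The names `step`, `max_frames`, and `end_threshold` are hypothetical, and the 0.5 threshold is an assumption; the text only says that the layer outputs an end-of-utterance probability.

```python
from typing import Callable, List, Tuple

Frame = List[float]


def synthesize_frames(
    step: Callable[[int], Tuple[Frame, float]],
    max_frames: int = 1000,
    end_threshold: float = 0.5,
) -> List[Frame]:
    """Collect decoder frames until the end-of-utterance probability is high.

    `step(s)` is a hypothetical stand-in for one decoder step: it returns the
    frame for time s and the end probability emitted by output layer 225.
    """
    frames: List[Frame] = []
    for s in range(max_frames):  # max_frames is only a safety cap
        frame, end_prob = step(s)
        frames.append(frame)
        if end_prob > end_threshold:  # utterance judged to have ended
            break
    return frames


def toy_step(s: int) -> Tuple[Frame, float]:
    # Toy decoder: the end probability jumps to 0.9 at the fifth frame.
    return [float(s)], (0.9 if s >= 4 else 0.1)
```

With `toy_step`, the loop stops after five frames; if `max_frames` is set lower, the output is truncated at the cap, which corresponds to the forced termination the output layer is meant to avoid.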

Speaker identification information is input to the input layer 226. A speaker id can be used as the speaker identification information. The speaker id input to the input layer 226 is passed to an embedding (embed) function, converted into a distributed representation as a real-valued vector, and input to the LSTM layer 222, the fully connected layer 224, and the like. To handle unknown speakers as well, the speaker id may be mapped to an i-vector used for speaker recognition. Providing the input layer 226, to which speaker identification information is input, allows the TTS 20 to synthesize speech resembling that speaker's voice.
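The speaker-id embedding can be sketched as a table lookup followed by conditioning of a layer input. Everything here is an illustrative assumption: the function names are hypothetical, the table is randomly initialized rather than learned, and concatenation is only one possible way of feeding the speaker vector to a layer.

```python
import random
from typing import Dict, List

Vector = List[float]


def make_speaker_table(num_speakers: int, dim: int, seed: int = 0) -> Dict[int, Vector]:
    """Build a toy embedding table: one real-valued vector per speaker id.

    In a trained system these vectors would be learned parameters.
    """
    rng = random.Random(seed)
    return {
        spk: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        for spk in range(num_speakers)
    }


def embed_speaker(table: Dict[int, Vector], speaker_id: int) -> Vector:
    """Map a speaker id to its distributed representation."""
    return table[speaker_id]


def condition_on_speaker(layer_input: Vector, speaker_vec: Vector) -> Vector:
    """Append the speaker vector to a layer input (assumed conditioning scheme)."""
    return layer_input + speaker_vec
```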

As described above, online learning of the ASR 10 uses the speech feature sequence data output from the TTS 20 as training data. If the voice quality of the speech synthesized by the TTS 20 differs from that of the speech input for speech recognition, learning of the ASR 10 may not proceed correctly. Providing the input layer 226 so that speech resembling the speaker's voice can be synthesized therefore improves the quality of online learning of the ASR 10. The input layer 226 is especially desirable when the machine speech chain 1 must recognize the speech of multiple speakers.

The attention 23 is a module that computes the context vector c_s. More specifically, the attention 23 computes a value a_s from the intermediate-layer vector h^d_s at time s output from the LSTM layer 222 of the decoder 22 and the intermediate-layer vectors h^e_1, ..., h^e_T held by the CBHG 213 of the encoder 21, and then computes the context vector c_s from the value a_s and the intermediate-layer vectors h^e_1, ..., h^e_T. The formulas for the value a_s and the context vector c_s are well known, so their description is omitted here.
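Since the formulas are omitted, the following is a minimal sketch of one common form of attention: a_s is a softmax over scores between h^d_s and each h^e_t, and c_s is the resulting weighted sum of the h^e_t. The dot-product scoring function is an assumption; the text does not say which scoring function the attention 23 uses.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attend(h_d: Vector, h_e: List[Vector]) -> Tuple[List[float], Vector]:
    """Return (a_s, c_s) for one decoder step.

    h_d is the decoder vector h^d_s; h_e holds the encoder vectors
    h^e_1 .. h^e_T. Dot-product scoring is assumed for illustration.
    """
    scores = [sum(d * e for d, e in zip(h_d, h_t)) for h_t in h_e]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(sc - top) for sc in scores]
    total = sum(exps)
    a_s = [x / total for x in exps]  # softmax over encoder time steps
    dim = len(h_e[0])
    # Context vector: attention-weighted sum of the encoder vectors.
    c_s = [sum(a * h_t[i] for a, h_t in zip(a_s, h_e)) for i in range(dim)]
    return a_s, c_s
```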

The parameters of the TTS 20 are adjusted using stochastic gradient descent, error backpropagation, or the like so as to minimize the value of the following loss function.

Here, x̂^M, x̂^R, and b̂ are the log-mel spectral features, the linear spectral features, and the end-of-utterance probability output from the TTS 20, respectively, and x^M, x^R, and b are their respective ground truths.
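The loss function itself is not reproduced in this text. One form consistent with the symbols defined above, given here purely as an assumption, combines a squared error on each of the two spectral features with a binary cross-entropy on the end-of-utterance probability. The averaging over S frames and the equal weighting of the three terms are likewise assumptions.

```latex
\[
L_{TTS} = \frac{1}{S} \sum_{s=1}^{S} \Bigl[
  \bigl\| x^{M}_{s} - \hat{x}^{M}_{s} \bigr\|^{2}
  + \bigl\| x^{R}_{s} - \hat{x}^{R}_{s} \bigr\|^{2}
  - \bigl( b_{s} \log \hat{b}_{s} + (1 - b_{s}) \log (1 - \hat{b}_{s}) \bigr)
\Bigr]
\]
```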

When a speech-text pair is given as supervised data during offline learning, the TTS 20 can be trained using the character sequence data of the text as training data and the speech feature sequence data of the speech (log-mel spectral features and linear spectral features) as teacher data. When only speech is given as unsupervised data, for example when speech input online for speech recognition is used, the TTS 20 can be trained, as described with reference to FIG. 2, using the character sequence data output from the ASR 10 as training data and the speech feature sequence data (log-mel spectral features and linear spectral features) of the speech input online as teacher data.

3. Embodiment
Next, the configuration of a speech chain device according to an embodiment of the present invention is described. FIG. 6 is a block diagram of a speech chain device 100 according to an embodiment of the present invention. The speech chain device 100 includes a speech recognition unit (ASR) 10, a speech synthesis unit (TTS) 20, a speech feature extraction unit 30, a text generation unit 40, a text feature extraction unit 50, a speech generation unit 60, an ASR learning control unit 70, and a TTS learning control unit 80. These elements can be implemented as hardware, software, or a combination of the two. For example, installing dedicated computer software on a computer device such as a personal computer or a smartphone allows that device to function as the speech chain device 100. The speech chain device 100 can also be implemented on a server in the cloud and offered as SaaS (software as a service). Alternatively, the components of the speech chain device 100 can be distributed across multiple computer devices and connected to one another over a telecommunication network. The ASR 10 and the TTS 20, which require a large amount of computation, are preferably processed by a dedicated processor such as a GPU (Graphics Processing Unit), with the remaining components processed by a CPU (Central Processing Unit).

Next, each component of the speech chain device 100 is described in detail. The ASR 10 and the TTS 20 are as described above, so their description is not repeated.

The speech feature extraction unit 30 is a module that processes input speech and generates the speech feature sequence data (x = [x_1, ..., x_S]) input to the ASR 10. The text generation unit 40 is a module that generates text corresponding to the speech input to the speech feature extraction unit 30, based on the character sequence data (ŷ = [y_1, ..., y_T]) output from the ASR 10. The speech feature extraction unit 30 can receive speech captured by a microphone (not shown) in real time, as well as recorded speech held in a storage device or memory device (not shown). The text output from the text generation unit 40 can be displayed in real time on a display device (not shown) or saved to a storage device or memory device (not shown).

FIG. 7 is a flowchart of the speech feature sequence data generation process performed by the speech feature extraction unit 30. When speech is input to the speech chain device 100 (S11), the speech feature extraction unit 30 applies pre-emphasis to the input speech (S12) and then applies a short-time Fourier transform (S13). The speech feature extraction unit 30 thereby computes the linear spectral features of the input speech (S14) and outputs them (S15). The output linear spectral features are temporarily stored in a memory device or the like (not shown). The speech feature extraction unit 30 then computes log-mel spectral features from the linear spectral features (S16) and outputs them (S17). The output log-mel spectral features are likewise temporarily stored in a memory device or the like (not shown).
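Steps S12 to S16 can be sketched in plain Python as follows. The pre-emphasis coefficient 0.97, the Hann window, the frame and hop sizes, the triangular mel filterbank, and all function names are assumptions chosen for illustration; the text specifies none of them.

```python
import cmath
import math
from typing import List


def pre_emphasis(signal: List[float], alpha: float = 0.97) -> List[float]:
    """S12: boost high frequencies, y[n] = x[n] - alpha * x[n-1]."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


def stft_magnitude(signal: List[float], frame_len: int = 8, hop: int = 4) -> List[List[float]]:
    """S13-S14: Hann-windowed short-time Fourier transform, magnitudes only."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):  # keep non-negative frequencies
            acc = sum(
                seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def log_mel(linear: List[List[float]], sample_rate: int, n_mels: int = 4,
            eps: float = 1e-10) -> List[List[float]]:
    """S16: apply triangular mel filters to each linear frame, then take the log."""
    n_bins = len(linear[0])
    bin_hz = [k * (sample_rate / 2) / (n_bins - 1) for k in range(n_bins)]
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    out = []
    for frame in linear:
        row = []
        for m in range(1, n_mels + 1):
            lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
            energy = 0.0
            for k, f in enumerate(bin_hz):
                if lo < f <= mid:
                    energy += frame[k] * (f - lo) / (mid - lo)
                elif mid < f < hi:
                    energy += frame[k] * (hi - f) / (hi - mid)
            row.append(math.log(energy + eps))  # eps avoids log(0)
        out.append(row)
    return out
```

The linear spectral features returned by `stft_magnitude` correspond to the output of S15, and the result of `log_mel` to the output of S17. A practical system would use an FFT library rather than the direct DFT shown here.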

Returning to FIG. 6, the text feature extraction unit 50 is a module that processes input text and generates the character sequence data (y = [y_1, ..., y_T]) input to the TTS 20. The speech generation unit 60 is a module that generates speech corresponding to the text input to the text feature extraction unit 50, based on the speech feature sequence data (x̂ = [x_1, ..., x_S]) output from the TTS 20. The text feature extraction unit 50 can receive, in real time, text entered through an input device (not shown) or text read by an OCR (Optical Character Recognition) device or the like, as well as text in documents held in a storage device or memory device (not shown). The speech output from the speech generation unit 60 can be played in real time through a speaker (not shown) or saved to a storage device or memory device (not shown).

FIG. 8 is a flowchart of the character sequence data generation process performed by the text feature extraction unit 50. When text is input to the speech chain device 100 (S21), the text feature extraction unit 50 normalizes the letters, symbols, and numerals contained in the input text (S22). Specifically, it converts all uppercase letters to lowercase, replaces some symbols such as double quotation marks with other symbols such as single quotation marks, and converts numerals into text representing their reading (for example, "5" → "five"). The text feature extraction unit 50 then splits the normalized text into individual characters (S23) (for example, "five" → "f", "i", "v", "e"), converts each character into an index (S24) (for example, "f" → 6, "i" → 9, "v" → 22, "e" → 5), and outputs the normalized text and the character indices (S25). The output normalized text and character indices are temporarily stored in a memory device or the like (not shown).
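Steps S22 to S24 can be sketched as below. The index mapping "a" → 1 through "z" → 26 is inferred from the example in the text ("f" → 6, "i" → 9, "v" → 22, "e" → 5). Mapping every other character to 0 and reading numbers digit by digit are simplifying assumptions, and the function names are hypothetical.

```python
import re
from typing import List, Tuple

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def normalize(text: str) -> str:
    """S22: lowercase, unify quotation marks, spell out digits."""
    text = text.lower()
    text = text.replace('"', "'")
    # Digit-by-digit reading is a simplification ("42" becomes "four two").
    text = re.sub(r"[0-9]", lambda m: " " + DIGIT_WORDS[m.group()] + " ", text)
    return re.sub(r"\s+", " ", text).strip()


def char_index(ch: str) -> int:
    """S24: 'a' -> 1 ... 'z' -> 26; any other character -> 0 (assumed)."""
    if "a" <= ch <= "z":
        return ord(ch) - ord("a") + 1
    return 0


def text_to_indices(text: str) -> Tuple[str, List[int]]:
    """S22-S25: return the normalized text and its character indices."""
    norm = normalize(text)
    return norm, [char_index(ch) for ch in norm]  # S23 splits into characters
```

For the input "5", `text_to_indices` returns the normalized text "five" and the indices [6, 9, 22, 5], matching the example in the flowchart description.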

Returning to FIG. 6, the ASR learning control unit 70 is a module that controls learning of the ASR 10. Either the speech feature sequence data x generated by the speech feature extraction unit 30 or the speech feature sequence data x̂ output from the TTS 20 is selectively input to the ASR 10. When text is input to the speech chain device 100 and the device operates as a speech synthesis device, the ASR learning control unit 70 inputs the speech feature sequence data x̂ generated by the TTS 20 to the ASR 10 as training data and, using the character sequence data y generated by the text feature extraction unit 50 as teacher data, adjusts the parameters of the ASR 10 by the method described above.

The TTS learning control unit 80 is a module that controls learning of the TTS 20. Either the character sequence data y generated by the text feature extraction unit 50 or the character sequence data ŷ output from the ASR 10 is selectively input to the TTS 20. When speech is input to the speech chain device 100 and the device operates as a speech recognition device, the TTS learning control unit 80 inputs the character sequence data ŷ generated by the ASR 10 to the TTS 20 as training data and, using the speech feature sequence data x generated by the speech feature extraction unit 30 as teacher data, adjusts the parameters of the TTS 20 by the method described above.

FIG. 9 is an overall flowchart of the DNN speech recognition/synthesis mutual learning performed in the speech chain device 100. When data is input to the speech chain device 100 (S31) and the data is a speech-text pair (YES in S32), the speech feature extraction unit 30 generates speech feature sequence data x from the input speech, and the text feature extraction unit 50 generates character sequence data y from the input text (S33). The generation of the speech feature sequence data and the character sequence data is as described with reference to FIGS. 7 and 8. Once these sequence data have been generated, the ASR learning control unit 70 inputs the speech feature sequence data x to the ASR 10 as training data and trains the ASR 10 using the character sequence data y as teacher data, while the TTS learning control unit 80 inputs the character sequence data y to the TTS 20 as training data and trains the TTS 20 using the speech feature sequence data x as teacher data (S34).

In this way, when supervised data in the form of a speech-text pair is given, the ASR learning control unit 70 and the TTS learning control unit 80 can use that supervised data to train the ASR 10 and the TTS 20, respectively, offline in teacher-forcing mode.

When the data input to the speech chain device 100 is speech only (NO in S32, YES in S35), the speech feature extraction unit 30 generates speech feature sequence data x from the input speech (S36), and the ASR 10 receives it and performs speech recognition (S37). The TTS learning control unit 80 then inputs the character sequence data ŷ output from the ASR 10 to the TTS 20 as training data and trains the TTS 20 using the speech feature sequence data x generated by the speech feature extraction unit 30 as teacher data (S38). When, on the other hand, the data input to the speech chain device 100 is text only (NO in S32, NO in S35), the text feature extraction unit 50 generates character sequence data y from the input text (S39), and the TTS 20 receives it and performs speech synthesis (S40). The ASR learning control unit 70 then inputs the speech feature sequence data x̂ output from the TTS 20 to the ASR 10 as training data and trains the ASR 10 using the character sequence data y generated by the text feature extraction unit 50 as teacher data (S41).
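The branching of FIG. 9 (S31 to S41) can be sketched as a small routing function. The four helper functions are toy placeholders that only build descriptive strings, and the log entries stand in for actual training calls; none of these names comes from the text.

```python
from typing import List, Optional, Tuple

# Each log entry is (model being trained, training input, teacher data).
Log = List[Tuple[str, str, str]]


def extract_speech(speech: str) -> str:   # stands in for unit 30
    return f"x({speech})"


def extract_text(text: str) -> str:       # stands in for unit 50
    return f"y({text})"


def asr(x: str) -> str:                   # stands in for ASR 10
    return f"asr({x})"


def tts(y: str) -> str:                   # stands in for TTS 20
    return f"tts({y})"


def route(speech: Optional[str], text: Optional[str], log: Log) -> None:
    """Follow S31-S41 of FIG. 9 for one input item."""
    if speech is not None and text is not None:            # S32: YES
        x, y = extract_speech(speech), extract_text(text)  # S33
        log.append(("train_asr", x, y))                    # S34: input x, teacher y
        log.append(("train_tts", y, x))                    # S34: input y, teacher x
    elif speech is not None:                               # S35: YES
        x = extract_speech(speech)                         # S36
        y_hat = asr(x)                                     # S37
        log.append(("train_tts", y_hat, x))                # S38
    elif text is not None:                                 # S35: NO
        y = extract_text(text)                             # S39
        x_hat = tts(y)                                     # S40
        log.append(("train_asr", x_hat, y))                # S41
```

The sketch makes the symmetry explicit: speech-only input trains the TTS on the ASR's output, and text-only input trains the ASR on the TTS's output.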

Thus, when only speech is given, the TTS learning control unit 80 can train the TTS 20 using the speech recognition result of the ASR 10 as training data for the TTS 20. When only text is given, the ASR learning control unit 70 can train the ASR 10 using the speech synthesis result of the TTS 20 as training data for the ASR 10. In other words, online learning of the ASR 10 and the TTS 20 becomes possible using unsupervised data.

As described above, the ASR 10 and the TTS 20 in the speech chain device 100 can be trained with either supervised or unsupervised data. Batch learning of the ASR 10 and the TTS 20 can therefore be performed on a dataset that mixes three types of data: speech-text pairs, text only, and speech only.

FIG. 10 is a flowchart of the batch learning process for the ASR 10 and the TTS 20. The ASR learning control unit 70 and the TTS learning control unit 80 take a fixed amount of speech-text pairs from a dataset stored in a storage device or the like (not shown), input them to the speech feature extraction unit 30 and the text feature extraction unit 50, respectively (S51), and train the ASR 10 and the TTS 20 using the speech feature sequence data x generated by the speech feature extraction unit 30 and the character sequence data y generated by the text feature extraction unit 50 (S52). Next, the TTS learning control unit 80 takes a fixed amount of speech-only data from the dataset and inputs it to the speech feature extraction unit 30 (S53), inputs the character sequence data ŷ generated by the ASR 10 to the TTS 20 as training data, and trains the TTS 20 using the speech feature sequence data x generated by the speech feature extraction unit 30 as teacher data (S54). Next, the ASR learning control unit 70 takes a fixed amount of text-only data from the dataset and inputs it to the text feature extraction unit 50 (S55), inputs the speech feature sequence data x̂ generated by the TTS 20 to the ASR 10 as training data, and trains the ASR 10 using the character sequence data y generated by the text feature extraction unit 50 as teacher data (S56). The ASR learning control unit 70 and the TTS learning control unit 80 repeat these steps until the data in the dataset is exhausted.
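The order in which batches are consumed in FIG. 10 can be sketched as follows. The dictionary keys, the function name, and the rule of skipping a data type once it is exhausted are assumptions; the text only states that the steps repeat until the dataset has no data left.

```python
from typing import Dict, List, Tuple


def batch_schedule(dataset: Dict[str, List[str]], batch_size: int) -> List[Tuple[str, List[str]]]:
    """Return the sequence of (data type, batch) pairs for S51-S56.

    Each pass takes one batch of speech-text pairs (S51-S52), then one batch
    of speech-only data (S53-S54), then one batch of text-only data
    (S55-S56), and repeats until every list is used up.
    """
    order = ["paired", "speech_only", "text_only"]
    cursors = {kind: 0 for kind in order}
    plan: List[Tuple[str, List[str]]] = []
    while any(cursors[kind] < len(dataset[kind]) for kind in order):
        for kind in order:
            start = cursors[kind]
            batch = dataset[kind][start:start + batch_size]
            if batch:  # skip a data type that has run out (assumed behavior)
                plan.append((kind, batch))
                cursors[kind] = start + len(batch)
    return plan
```

In a real training loop, each entry of the returned plan would trigger the corresponding training step instead of merely being recorded.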

As described above, the speech chain device 100 according to this embodiment allows the mechanism of the human speech chain to be reproduced by a machine. As a result, speech input for speech recognition and text input for speech synthesis can be used as unsupervised data for online learning of speech synthesis and speech recognition, which reduces the labor and cost of preparing large numbers of speech-text pairs as supervised data. Moreover, the more the speech chain device 100 according to this embodiment is used as a speech recognition device and a speech synthesis device, the further learning progresses and the more the accuracy of speech recognition and speech synthesis improves.

The embodiment has been described above as an illustration of the technique of the present invention. The accompanying drawings and the detailed description have been provided for that purpose.

Accordingly, the components shown in the accompanying drawings and described in the detailed description may include not only components essential to solving the problem but also components that are not essential to solving the problem and are included only to illustrate the above technique. The mere fact that such non-essential components appear in the accompanying drawings or the detailed description should therefore not be taken to mean that they are essential.

Furthermore, because the embodiment described above is intended to illustrate the technique of the present invention, various changes, substitutions, additions, omissions, and the like can be made within the scope of the claims or their equivalents.

100 ... speech chain device, 10 ... speech recognition unit, 20 ... speech synthesis unit, 30 ... speech feature extraction unit, 40 ... text generation unit, 50 ... text feature extraction unit, 60 ... speech generation unit, 70 ... ASR learning control unit, 80 ... TTS learning control unit, 225 ... output layer, 226 ... input layer

Claims

A speech chain device comprising:
a speech recognition unit constructed with a deep neural network that receives speech feature sequence data and outputs character sequence data;
a speech synthesis unit constructed with a deep neural network that receives character sequence data and outputs speech feature sequence data;
a speech feature extraction unit that processes input speech and generates the speech feature sequence data input to the speech recognition unit;
a text generation unit that generates text corresponding to the speech input to the speech feature extraction unit, based on the character sequence data output from the speech recognition unit;
a text feature extraction unit that processes input text and generates the character sequence data input to the speech synthesis unit;
a speech generation unit that generates speech corresponding to the text input to the text feature extraction unit, based on the speech feature sequence data output from the speech synthesis unit;
a first learning control unit that inputs the speech feature sequence data output from the speech synthesis unit to the speech recognition unit as training data and trains the speech recognition unit using the character sequence data generated by the text feature extraction unit as teacher data; and
a second learning control unit that inputs the character sequence data output from the speech recognition unit to the speech synthesis unit as training data and trains the speech synthesis unit using the speech feature sequence data generated by the speech feature extraction unit as teacher data.

The speech chain device according to claim 1, wherein
the speech feature sequence data input to the speech recognition unit is mel spectral features,
the speech feature sequence data output from the speech synthesis unit is linear spectral features and mel spectral features,
the speech feature extraction unit generates, as the speech feature sequence data, the linear spectral features and the mel spectral features of the speech input to the speech feature extraction unit, and
the second learning control unit trains the speech synthesis unit using the linear spectral features and the mel spectral features generated by the speech feature extraction unit as teacher data.

The speech chain device according to claim 1 or 2, wherein
the speech synthesis unit has an output layer representing the probability of the end of an utterance, and
the second learning control unit trains the speech synthesis unit further using the probability of the end of the utterance as teacher data.

The speech chain device according to claim 1, wherein the speech synthesis unit has an input layer into which speaker identification information is input.

A computer program for causing a computer to function as:
a speech recognition unit constructed with a deep neural network that receives speech feature sequence data and outputs character sequence data;
a speech synthesis unit constructed with a deep neural network that receives character sequence data and outputs speech feature sequence data;
a speech feature extraction unit that processes input speech and generates the speech feature sequence data input to the speech recognition unit;
a text generation unit that generates text corresponding to the speech input to the speech feature extraction unit, based on the character sequence data output from the speech recognition unit;
a text feature extraction unit that processes input text and generates the character sequence data input to the speech synthesis unit;
a speech generation unit that generates speech corresponding to the text input to the text feature extraction unit, based on the speech feature sequence data output from the speech synthesis unit;
a first learning control unit that inputs the speech feature sequence data output from the speech synthesis unit to the speech recognition unit as training data and trains the speech recognition unit using the character sequence data generated by the text feature extraction unit as teacher data; and
a second learning control unit that inputs the character sequence data output from the speech recognition unit to the speech synthesis unit as training data and trains the speech synthesis unit using the speech feature sequence data generated by the speech feature extraction unit as teacher data.

A DNN speech recognition/synthesis mutual learning method for mutually training a speech recognition unit constructed with a deep neural network that receives speech feature sequence data and outputs character sequence data and a speech synthesis unit constructed with a deep neural network that receives character sequence data and outputs speech feature sequence data, the method comprising:
a first step of, when a speech-text pair is given as supervised data, inputting the speech feature sequence data of the speech to the speech recognition unit as training data and training the speech recognition unit using the character sequence data of the text as teacher data, and inputting the character sequence data of the text to the speech synthesis unit as training data and training the speech synthesis unit using the speech feature sequence data of the speech as teacher data;
a second step of, when only speech is given as unsupervised data, inputting the speech feature sequence data of the speech to the speech recognition unit, inputting the character sequence data output from the speech recognition unit to the speech synthesis unit as training data, and training the speech synthesis unit using the speech feature sequence data of the speech as teacher data; and
a third step of, when only text is given as unsupervised data, inputting the character sequence data of the text to the speech synthesis unit, inputting the speech feature sequence data output from the speech synthesis unit to the speech recognition unit as training data, and training the speech recognition unit using the character sequence data of the text as teacher data.

The DNN speech recognition/synthesis mutual learning method according to claim 6, further comprising:
a fourth step of extracting a fixed amount of each type of data from a dataset containing three types of data: speech-text pairs, text-only data, and speech-only data; and
a fifth step of performing batch learning of the speech recognition unit and the speech synthesis unit by repeating the first to third steps in order using each type of data extracted from the dataset.