JP2019159654A

JP2019159654A - Time-series information learning system, method, and neural network model

Info

Publication number: JP2019159654A
Application number: JP2018044134A
Authority: JP
Inventors: Ryoichi Takashima; Sheng Li; Hisashi Kawai
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2018-03-12
Filing date: 2018-03-12
Publication date: 2019-09-19
Anticipated expiration: 2038-03-12
Also published as: JP7070894B2

Abstract

PROBLEM TO BE SOLVED: To obtain a model that achieves both real-time performance and a high speech recognition rate by bringing the recognition rate of a first model close to that of a second model with a more complex structure, without changing the structure of the first model.

SOLUTION: A learning system 10 comprises a first model 23, a second model 13 whose structure is more complex than that of the first model, and a first model learning unit that trains the first model by knowledge distillation, with the first model as the student model and the second model as the teacher model. The learning unit evaluates the probability of each candidate in the groups of label sequence candidates that the first and second models output for first time-series data containing a plurality of frame data, and trains the first model based on the evaluation result.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique for recognizing information input as a time series using a neural network. More specifically, it relates to a technique for training a neural network model used in a system that processes time-series information, and to the neural network model obtained by that training.

Audio information, video information, and the like vary over time and are input continuously as time passes.

Speech recognition systems acquire audio from human speech or other sound sources and recognize it. A speech recognition system analyzes the waveform of an utterance and matches it against databases called the acoustic model, the pronunciation dictionary, and the language model to output the content of the utterance (a sentence).

The DNN-HMM model is a conventional acoustic model and the mainstream approach to neural-network-based speech recognition. It is expressed as two models: a DNN (deep neural network), which models which label (for example, a phoneme) is most probable given the speech features at a particular time, and an HMM (hidden Markov model), which models how labels change over time.

The end-to-end model was proposed after DNN-HMM. In the end-to-end approach, the acoustic model is expressed as a single model instead of being split into two as in DNN-HMM. Because it does not use an HMM, the end-to-end model has the advantage that speech recognition processing is simpler and faster than with DNN-HMM. Examples of end-to-end models include CTC (Connectionist Temporal Classification) and attention models. In the following, the end-to-end model is described using the CTC acoustic model as an example.

A technique related to the recognition techniques described above is a DNN training method called knowledge distillation (KD). Knowledge distillation is used to transfer the information in a trained, complex, high-performance model to a simple, low-performance model. For example, suppose there are two models: one that performs well but is structurally complex and difficult to deploy in a system, and one that is structurally simple but performs poorly. In knowledge distillation, the former is defined as the teacher model and the latter as the student model, and the student model is trained using the output of the teacher model in place of the correct labels. In this way, the knowledge of the teacher model can be propagated to the student model.
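The difference between training on correct labels and training on teacher outputs can be illustrated with a small sketch. This is not taken from the patent: the three-label scores are made-up values, and `softmax` and `cross_entropy` are hypothetical helper names. It only shows that the distillation loss uses the teacher's probability distribution as the target where ordinary training would use a one-hot vector.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy H(target, predicted) between two distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Hypothetical outputs for one input over three labels (assumed values).
teacher_probs = softmax([2.0, 1.0, 0.1])   # trained teacher: soft target
student_probs = softmax([1.0, 1.0, 1.0])   # untrained student: uniform
hard_label = [1.0, 0.0, 0.0]               # one-hot correct label

loss_hard = cross_entropy(hard_label, student_probs)    # ordinary training target
loss_kd = cross_entropy(teacher_probs, student_probs)   # distillation target
```

In an actual training loop the student's parameters would be updated to reduce `loss_kd`; that step is omitted here.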

Patent Document 1 and Patent Document 2 below disclose speech recognition apparatuses that use CTC.

Non-Patent Document 1 below discloses a technique related to CTC. Non-Patent Document 2 below discloses a technique related to knowledge distillation.

Patent Document 1: JP 2017-16131 A
Patent Document 2: JP 2017-40919 A

Non-Patent Document 1: Alex Graves, Santiago Fernandez, Faustino Gomez, and Jurgen Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks", ICML2006, pp. 369-376, 2006
Non-Patent Document 2: Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the knowledge in a neural network", in NIPS Deep Learning and Representation Learning Workshop, 2014

Because CTC handles time-series data, it must contain an RNN (recurrent neural network). There are two types of RNN: the Unidirectional RNN, which considers only past information, and the Bidirectional RNN, which uses both past and future information. A Bidirectional RNN gives a high speech recognition rate, but real-time processing is difficult because it uses future information. A Unidirectional RNN can be applied to real-time processing, but the speech recognition rate is lower.
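The dependence on future input can be seen in a toy example. The sketch below is an illustration under simplifying assumptions, not the patent's network: it uses a scalar recurrent unit with fixed, made-up weights (`w_in`, `w_rec`), and the function names are hypothetical. The bidirectional output at the first frame changes when only the last frame is altered, whereas the unidirectional output does not.

```python
import math

def rnn_pass(inputs, w_in=0.5, w_rec=0.8):
    """Scalar recurrent unit: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def unidirectional(inputs):
    """Each state depends only on the current and past frames."""
    return rnn_pass(inputs)

def bidirectional(inputs):
    """Pair a left-to-right state with a right-to-left state for each frame."""
    forward = rnn_pass(inputs)
    backward = rnn_pass(inputs[::-1])[::-1]
    return list(zip(forward, backward))

frames = [1.0, 0.5, -0.3, 0.8]
changed = frames[:-1] + [-2.0]   # only the last (future) frame differs

same_uni = unidirectional(frames)[0] == unidirectional(changed)[0]
same_bi = bidirectional(frames)[0] == bidirectional(changed)[0]
print(same_uni, same_bi)
```

Because the backward pass must start from the final frame, a bidirectional model cannot emit its output for the first frame until the whole input has arrived, which is the real-time limitation described above.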

An object of the present invention is to obtain a model that achieves both real-time performance and a high speech recognition rate by bringing the recognition rate of the model being trained close to that of a teacher model with a more complex structure, without changing the structure of the model being trained.

To solve the above problem, the learning system of the present embodiment is configured as follows. The learning system of the present embodiment trains a neural network in order to construct a system that recognizes time-series information.

The learning system of the present embodiment comprises: a first model that internally has a neural network capable of representing time-series information; a second model that internally has a neural network capable of representing time-series information, has been trained with correct labels, and has a more complex structure than the first model; and a first model learning unit that trains the first model using knowledge distillation, with the first model as the student model and the second model as the teacher model.

The first model learning unit comprises: a first output unit that inputs first time-series data containing a plurality of frame data to the first model and obtains, as a first output result of the first model, the probability of each candidate in a group of label sequence candidates; a second output unit that inputs the first time-series data containing the plurality of frame data to the second model and obtains, as a second output result of the second model, the probability of each candidate in a group of label sequence candidates; an evaluation unit that evaluates the difference between the first output result and the second output result; and a first model learning unit that trains the first model based on the evaluation result of the evaluation unit.

The first model is trained by extended knowledge distillation in the learning system of the present embodiment. That is, training accuracy is improved by evaluating the probabilities of the label sequence candidates output over the time series, rather than the probabilities of the output values produced for each frame.
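A toy sketch may help to show what evaluating label-sequence probabilities, rather than per-frame probabilities, could look like. Everything here is an assumption made for illustration: this passage does not state the exact difference measure, so the sketch uses a cross entropy over the teacher's top `n_best` sequences. The per-frame probabilities are made-up values, the function names are hypothetical, and sequence probabilities are computed by brute-force enumeration, which is feasible only for very short inputs.

```python
import math
from itertools import product

BLANK = "-"

def collapse(path):
    """CTC-style collapse: merge repeated labels, then drop blanks."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

def sequence_probs(frame_probs):
    """Probability of every label sequence, summed over all frame alignments."""
    labels = list(frame_probs[0].keys())
    totals = {}
    for path in product(labels, repeat=len(frame_probs)):
        p = 1.0
        for t, label in enumerate(path):
            p *= frame_probs[t][label]
        seq = collapse(path)
        totals[seq] = totals.get(seq, 0.0) + p
    return totals

def sequence_kd_loss(teacher_frames, student_frames, n_best=3):
    """Cross entropy between teacher and student over the teacher's N-best sequences."""
    teacher = sequence_probs(teacher_frames)
    student = sequence_probs(student_frames)
    candidates = sorted(teacher, key=teacher.get, reverse=True)[:n_best]
    return -sum(teacher[s] * math.log(student[s]) for s in candidates)

# Made-up per-frame label probabilities for three frames.
teacher_frames = [
    {"a": 0.8, "k": 0.1, "-": 0.1},
    {"a": 0.1, "k": 0.2, "-": 0.7},
    {"a": 0.1, "k": 0.8, "-": 0.1},
]
student_frames = [{"a": 1 / 3, "k": 1 / 3, "-": 1 / 3} for _ in range(3)]

loss = sequence_kd_loss(teacher_frames, student_frames)
```

A practical implementation would obtain the candidate sequences by beam search and compute their probabilities with a dynamic-programming pass instead of enumerating every alignment.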

Therefore, although the first model is structurally simpler than the second model, which is the teacher model, it retains high recognition accuracy. Because the first model is structurally simpler than the second model, the circuit scale can be reduced when it is implemented in hardware, and it does not demand high performance from resources such as the CPU and memory when it is implemented in software. A recognition system that implements the first model can therefore also be used on terminals such as smartphones and tablets.

In the time-series information learning system of the present embodiment, the first model and the second model include models that internally have a recurrent neural network.

In the time-series information learning system of the present embodiment, the first model and the second model include CTC (Connectionist Temporal Classification) models.

In the time-series information learning system of the present embodiment, the first model is a Unidirectional-CTC model and the second model is a Bidirectional-CTC model. Unlike the second model (the Bidirectional-CTC model), the trained first model (the Unidirectional-CTC model) does not require future input, so processing with high real-time performance can be achieved. The first model is also advantageous for hardware and software implementation.

In the time-series information learning system of the present embodiment, the time-series information is speech information. The trained first model can recognize speech information with a high recognition rate, and real-time performance is improved compared with using a structurally complex acoustic model.

The present embodiment also covers a neural network model trained by the above time-series information learning system. By building a recognition system that uses a neural network model trained by that learning system, highly accurate recognition results can be obtained without placing a heavy load on hardware or software.

The learning method of the present embodiment comprises the following steps. It is a method for training a first model that internally has a neural network capable of representing time-series information, in order to construct a system that recognizes time-series information.

The learning method of the present embodiment comprises: (a) a second model learning step of training, using correct labels, a second model that internally has a neural network capable of representing time-series information and has a more complex structure than the first model; and (b) a first model learning step of training the first model using knowledge distillation, with the first model as the student model and the second model as the teacher model.

The first model learning step (b) includes: (b-1) a step of inputting first time-series data containing a plurality of frame data to the first model; (b-2) a step of inputting the first time-series data containing the plurality of frame data to the second model; (b-3) a step of obtaining, as a first output result of the first model for the first time-series data containing the plurality of frame data, the probability of each candidate in a group of label sequence candidates; (b-4) a step of obtaining, as a second output result of the second model for the first time-series data containing the plurality of frame data, the probability of each candidate in a group of label sequence candidates; (b-5) an evaluation step of evaluating the difference between the first output result obtained in step (b-3) and the second output result obtained in step (b-4); and (b-6) a step of training the first model based on the evaluation result of the evaluation step.

The first model trained by the learning system or learning method of the present embodiment is structurally simpler than the second model, which is the teacher model, yet retains high recognition accuracy. Because the first model is structurally simpler than the second model, the performance required of the computer or device can be kept low whether the model is implemented in hardware or in software. For the same reason, real-time performance is improved compared with executing recognition processing with the second model.

FIG. 1 is a diagram showing the flow of processing in the learning phase and the recognition phase of the time-series information processing system according to the present embodiment.
FIG. 2 is a block diagram of the learning system included in the time-series information processing system according to the present embodiment.
FIG. 3 is a diagram showing a method of training a neural network with correct labels.
FIG. 4 is a diagram showing the structure of a general DNN.
FIG. 5 is a diagram showing the Bidirectional-RNN of the teacher model.
FIG. 6 is a diagram showing a method of training a neural network by knowledge distillation.
FIG. 7 is a diagram showing the Unidirectional-RNN of the student model.
FIG. 8 is a diagram showing the training method based on extended knowledge distillation according to the present embodiment.
FIG. 9 is a block diagram of the recognition system included in the time-series information processing system according to the present embodiment.
FIG. 10 is a diagram showing an experimental example that demonstrates the training effect of the present embodiment.

The time-series information processing system according to the present embodiment is described below with reference to the accompanying drawings. The system takes time-series information as input for training, and also takes time-series information as input and outputs a recognition result for it. Time-series information is information that is input continuously as time passes. In the present embodiment, speech information is used as the example of time-series information. However, the time-series information processing system of the present embodiment can also be used to recognize other time-series information, such as video information or sensing information that is input continuously as time passes.

The time-series information processing system of the present embodiment consists of a learning system that trains a model for recognizing time-series information, and a recognition system that recognizes time-series information using the model trained by the learning system.

In the following description, speech information is used as the example of time-series information. That is, the time-series information processing system of the present embodiment is described using, as an example, a learning system 10 that trains an acoustic model 23 for recognizing speech information, and a recognition system 20 that recognizes speech information using the acoustic model 23 trained by the learning system 10.

{1. Flow of processing in the learning phase and the recognition phase of the time-series information processing system}
FIG. 1 is a diagram showing the overall flow of the time-series information processing system according to the present embodiment. The system has two phases, a learning phase and a recognition phase. In the learning phase, the learning system 10 trains the acoustic model 23. In the recognition phase, the recognition system 20 recognizes speech information using the trained acoustic model 23. The learning system 10 and the recognition system 20 may be implemented on the same computer or device, or on different computers or devices.

As shown in FIG. 1, the feature amount calculation unit 11 calculates features of the speech data input to the learning system 10. The calculated features are input to the acoustic model learning unit 12 frame by frame, and the acoustic model learning unit 12 trains the acoustic model 23. As described in detail later, the acoustic model learning unit 12 trains the acoustic model 23 using the already trained acoustic model 13 as the teacher model. Specifically, it uses the multiple label sequences output by the acoustic model 13, together with their output probabilities, as the training target for the acoustic model 23.

The feature amount calculation unit 21 calculates features of the speech data input to the recognition system 20. The calculated features are analyzed frame by frame in the decoder 22. The decoder 22 outputs the recognition result for the speech data using the acoustic model 23 trained in the learning phase, the pronunciation dictionary 24, and the language model 25.

{2. Configuration of the learning system}
Next, the configuration of the learning system and the method of the learning process are described with reference to FIGS. 2 to 8.

FIG. 2 is a functional block diagram of the learning system 10. As also shown in FIG. 1, the learning system 10 includes the feature amount calculation unit 11 and the acoustic model learning unit 12. The acoustic model learning unit 12 includes a label estimation unit 121, a label estimation unit 122, a label sequence evaluation unit 123, and a learning unit 124. The learning system 10 also includes the acoustic model 13 and the acoustic model 23.

The feature amount calculation unit 11 receives speech data as time-series information. It divides the waveform of the speech data into frames of 20 ms to 30 ms and extracts features for each frame. The features are extracted by conventional methods, such as mel filter bank analysis or mel-frequency cepstrum analysis.
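The framing step can be sketched as follows. The text gives only the frame length (20 ms to 30 ms), so the 25 ms frame, the 10 ms hop between frame starts, and the 16 kHz sampling rate are assumptions, and `split_into_frames` is a hypothetical name. Feature extraction itself (mel filter bank or cepstrum analysis) is not shown.

```python
def split_into_frames(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Split a waveform into fixed-length, overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    hop_len = sample_rate * hop_ms // 1000       # samples between frame starts
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

# One second of a dummy (silent) signal sampled at 16 kHz.
signal = [0.0] * 16000
frames = split_into_frames(signal, sample_rate=16000)
print(len(frames), len(frames[0]))
```

Each resulting frame would then be converted into one feature vector, which is the unit the acoustic model consumes.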

<2-1. Output of labels (phonemes) by the acoustic model 13 (teacher model)>
The features of each frame calculated by the feature amount calculation unit 11 are input to the label estimation unit 121. The label estimation unit 121 uses the acoustic model 13 to convert the input frame data into a probability for each label.

Labels are defined in advance in units such as phonemes (for example, a and i), kana, characters, or words. In a CTC model, the label set additionally includes a "blank (-)" label indicating that none of the other labels applies. In this embodiment, the label set consists of phonemes, noise, and the blank. The label estimation unit 121 outputs a probability value for each of these labels for each input frame.

In the present embodiment, the acoustic model 13 uses Bidirectional-CTC. Bidirectional-CTC is a CTC (Connectionist Temporal Classification) model that internally has a Bidirectional-RNN (recurrent neural network), a kind of DNN (deep neural network) that handles time-series information. CTC is an example of an end-to-end model. It is a framework that can convert between input and output sequences of different lengths (in this embodiment, a speech frame sequence and a label sequence). In CTC, the label sequence that constitutes the recognition result is obtained by merging consecutive identical labels and removing blank labels (-) from the labels assigned to each frame. For example, if "a a - k - i -" is assigned to seven frames of input data, the label sequence "aki" is output as the recognition result.
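The collapse rule described above can be written in a few lines. This is a minimal sketch; the function name `ctc_collapse` and the use of "-" as the blank symbol are choices made for the example.

```python
BLANK = "-"

def ctc_collapse(frame_labels):
    """Merge consecutive duplicate labels, then remove blank labels."""
    result = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame's label
        # and is not the blank.
        if label != previous and label != BLANK:
            result.append(label)
        previous = label
    return "".join(result)

print(ctc_collapse(["a", "a", "-", "k", "-", "i", "-"]))  # prints "aki"
```

Because duplicates are merged before blanks are removed, a blank between two identical labels keeps them separate: `["a", "-", "a"]` collapses to "aa", not "a".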

A feature of the present embodiment is that an end-to-end model is used as the acoustic model 13. CTC, which is one example of an end-to-end model, is used here, but an attention model may be used instead.

In the present embodiment, a model that internally has a Bidirectional-RNN is used as the acoustic model 13, but this is not a limitation: any neural network that can handle time-series information is applicable. Besides the Bidirectional-RNN, examples include the Unidirectional-RNN and the time-delay neural network. The RNN may also be replaced with a similar model such as an LSTM (long short-term memory). However, the benefit of the training method of the present embodiment comes from the acoustic model 13 being structurally more complex and higher performing than the acoustic model 23.

The acoustic model 13 is a teacher model whose training is already complete. It has been trained in advance using training data consisting of pairs of speech data and correct label sequences.

The acoustic model 13 is trained by the conventional CTC training method. That is, training data is input and the model is trained so that the probability of the correct phoneme sequence is maximized. The conventional forward-backward algorithm is used to compute the probability, and error backpropagation is used to update the model parameters.
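The probability of a label sequence under CTC can be computed with the forward half of the forward-backward algorithm. The sketch below is a plain-probability illustration with made-up per-frame values; `ctc_forward` is a hypothetical name. It omits the backward pass and the gradient computation, and practical implementations typically work in the log domain to avoid underflow.

```python
def ctc_forward(frame_probs, target, blank="-"):
    """Probability that the per-frame labels collapse to `target` under CTC."""
    # Interleave blanks around the labels: "ak" -> ["-", "a", "-", "k", "-"].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # alpha[s]: total probability of all partial alignments ending at ext[s].
    alpha = [0.0] * size
    alpha[0] = frame_probs[0][ext[0]]
    if size > 1:
        alpha[1] = frame_probs[0][ext[1]]

    for t in range(1, len(frame_probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one symbol
            # Skipping over a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * frame_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)

# Made-up per-frame label probabilities for three frames.
frame_probs = [
    {"a": 0.8, "k": 0.1, "-": 0.1},
    {"a": 0.1, "k": 0.2, "-": 0.7},
    {"a": 0.1, "k": 0.8, "-": 0.1},
]
p_ak = ctc_forward(frame_probs, "ak")
```

Training maximizes this quantity for the correct sequence, usually by minimizing its negative logarithm.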

FIG. 3 shows how a general neural network is trained using correct labels. Training data (one frame of speech data in this embodiment) is input to the input layer of the neural network, and the probability of each label (a probability distribution) is output from the output layer as the result of the network's computation. The correct probability distribution is taken to be a vector in which the label corresponding to the training data has probability 1 and all other labels have probability 0, and the neural network is trained so that the distance between the two probability distributions becomes small. Cross entropy or Euclidean distance is used as the distance measure.
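The two distance measures named above can be computed directly for a single frame. In this sketch the network output `[0.7, 0.2, 0.1]` is a made-up value and the helper names are hypothetical.

```python
import math

def one_hot(index, size):
    """Correct distribution: probability 1 for the correct label, 0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def cross_entropy(target, output, eps=1e-12):
    """Cross entropy between the correct distribution and the network output."""
    return -sum(t * math.log(o + eps) for t, o in zip(target, output))

def euclidean(target, output):
    """Euclidean distance between the two probability vectors."""
    return math.sqrt(sum((t - o) ** 2 for t, o in zip(target, output)))

output = [0.7, 0.2, 0.1]          # assumed network output over three labels
target = one_hot(0, 3)            # the first label is the correct one

ce = cross_entropy(target, output)    # equals -log(0.7) for a one-hot target
dist = euclidean(target, output)
```

With a one-hot target, the cross entropy reduces to the negative log probability that the network assigns to the correct label.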

FIG. 4 is a diagram showing a general DNN 33. The DNN 33 has an input layer, a plurality of intermediate layers (hidden layers), and an output layer. The example in FIG. 4 has an input layer 331, four intermediate layers 332, 333, 334, and 335, and an output layer 336. A general DNN 33 is used here to keep the figure simple, but the acoustic model 13 of the present embodiment uses an RNN that also has connections between preceding and following frames.

The feature vector for one frame, calculated by the feature amount calculation unit 11, is input to the input layer 331. That is, the number of input layer nodes 331(1) to 331(n1) corresponds to the number of dimensions of the features.

In the present embodiment, the intermediate layer 332 has n2 nodes 332(1), 332(2), ..., 332(n2); the intermediate layer 333 has n3 nodes 333(1), 333(2), ..., 333(n3); the intermediate layer 334 has n4 nodes 334(1), 334(2), ..., 334(n4); and the intermediate layer 335 has n5 nodes 335(1), 335(2), ..., 335(n5). The number of nodes may differ from one intermediate layer to another, and may also differ from the number of nodes in the input layer.

In the present embodiment, the output layer 336 contains one node for each label, so the number of nodes in the output layer corresponds to the number of labels.
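The layer structure of FIG. 4 can be sketched as a forward pass through an input layer, four intermediate layers, and an output layer with one node per label. The layer sizes, the ReLU activation, the softmax output, and the random initialization are all assumptions made for illustration; the text does not specify them, and the recurrent connections of the actual acoustic model are not modeled.

```python
import math
import random

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(features, layers):
    """Propagate one frame's feature vector to per-label probabilities."""
    h = features
    for weights, biases in layers[:-1]:
        h = relu(dense(h, weights, biases))      # intermediate layers
    weights, biases = layers[-1]
    return softmax(dense(h, weights, biases))    # output layer

rng = random.Random(0)
sizes = [40, 16, 16, 16, 16, 5]   # n1, n2..n5, number of labels (assumed values)
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
probs = forward([0.0] * 40, layers)
```

The length of the input vector matches the first entry of `sizes` (the feature dimension n1), and the length of `probs` matches the last entry (the number of labels).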

The feature amount of each frame of the speech data is input to the input layer nodes 331(1), 331(2), ..., 331(n1) as n1-dimensional data (a feature vector). As noted above, a general DNN 33 is used here for illustration; the acoustic model 13 used in this embodiment, a Bidirectional-CTC, performs the computation in each intermediate layer while referring to information from both past and future frame data, and outputs a probability value for each label at the output layer. For example, it calculates probability values indicating which label the frame data corresponds to, such as:
a: 0.12
b: 0.05
c: 0.03
...
z: 0.09
blank: 0.02

FIG. 5 shows the processing of the Bidirectional-RNN in the acoustic model 13 of this embodiment. The horizontal axis of FIG. 5 is time. Each vertical column of blocks represents the Bidirectional-RNN at one time, and each layer of the Bidirectional-RNN at that time is drawn as a single block. In other words, each block in FIG. 5 represents one layer of a neural network consisting of a plurality of nodes, as in FIG. 4.

Frame data input to the input layer 131 at a given time propagates through the intermediate layers 132, 133, ... and is output from the output layer 136. As indicated by the lines extending horizontally from each block in the figure, the intermediate layers 132, 133, ... also receive the outputs of the intermediate layers 132, 133, ... at the preceding and following times.
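The difference between referring only to past frames and referring to both past and future frames can be illustrated with a toy recurrence. This is a sketch under strong simplifying assumptions: a single scalar hidden state, arbitrary weights `w_in` and `w_rec`, and a plain tanh cell instead of the LSTM layers a practical model would use.

```python
import math

def run_rnn(inputs, w_in=0.5, w_rec=0.8):
    """Scalar recurrent layer: the state at time t depends on inputs up to t."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional(inputs):
    """Pair a forward pass (past context) with a backward pass (future context)."""
    forward = run_rnn(inputs)
    backward = run_rnn(inputs[::-1])[::-1]
    return list(zip(forward, backward))

# Two inputs that differ only in the last (future) frame.
x1 = [1.0, 0.0, 0.0]
x2 = [1.0, 0.0, 5.0]

uni_same = run_rnn(x1)[0] == run_rnn(x2)[0]              # first state ignores the future
bi_differs = bidirectional(x1)[0] != bidirectional(x2)[0]  # backward state sees the future
```

The unidirectional state at the first frame is identical for both inputs, while the bidirectional state changes. This is also why a bidirectional model must wait for later frames before it can produce an output, which limits real-time use.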

In the example of FIG. 5, a blank (-) is output as the phoneme candidate at time t1. The output layer 136 outputs a probability for each label, and the figure indicates that the blank had the highest probability among them.

Similarly, the figure shows that "a" is output as the phoneme candidate at time t2, a blank (-) at time t3, "k" at time t4, "i" at time t5, and a blank (-) at time t6.
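The per-frame candidates in this example can be turned into a label sequence with the standard CTC collapsing rule (merge consecutive repeats, then remove blanks). The sketch below assumes a four-label inventory and invented probability values; only the resulting candidates at t1 to t6 follow the figure description.

```python
from itertools import groupby

BLANK = "-"

def best_path(frame_probs):
    """Pick the most probable label in every frame."""
    return [max(dist, key=dist.get) for dist in frame_probs]

def collapse(path, blank=BLANK):
    """CTC collapsing rule: merge repeated labels, then drop blanks."""
    merged = [label for label, _ in groupby(path)]
    return "".join(label for label in merged if label != blank)

# Hypothetical per-frame distributions for times t1..t6.
frame_probs = [
    {"a": 0.10, "k": 0.05, "i": 0.05, "-": 0.80},  # t1 -> blank
    {"a": 0.70, "k": 0.10, "i": 0.10, "-": 0.10},  # t2 -> a
    {"a": 0.20, "k": 0.10, "i": 0.10, "-": 0.60},  # t3 -> blank
    {"a": 0.05, "k": 0.75, "i": 0.10, "-": 0.10},  # t4 -> k
    {"a": 0.05, "k": 0.10, "i": 0.65, "-": 0.20},  # t5 -> i
    {"a": 0.05, "k": 0.05, "i": 0.10, "-": 0.80},  # t6 -> blank
]

path = best_path(frame_probs)   # per-frame candidates
decoded = collapse(path)        # collapsed label sequence
```

Taking the per-frame maximum is only the simplest decoding strategy; the recognition system described later combines the acoustic model output with a pronunciation dictionary and a language model.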

<2-2. Output of labels (phonemes) by acoustic model 23 (student model)>
Refer again to FIG. 2. The feature amounts calculated by the feature amount calculation unit 11 are also input to the label estimation unit 122. Like the label estimation unit 121, the label estimation unit 122 receives the feature amounts of the speech data frame by frame and uses the acoustic model 23 to convert each frame into a probability for each label. The label estimation unit 122 thus outputs each frame of the speech data as a set of per-label probability values.

Like the label estimation unit 121, the label estimation unit 122 outputs a probability value for each label. The labels are defined identically for the acoustic model 13, the label estimation unit 121, the acoustic model 23, and the label estimation unit 122.

The acoustic model 23 is trained using the acoustic model 13 as a teacher model. The acoustic model 23, the student model, has a less complex structure than the acoustic model 13. Here, a model with a complex structure is, for example, a model with many intermediate layers (hidden layers) or a model with many nodes. Other examples of structurally complex models are models containing computationally expensive layers, such as a CNN (convolutional neural network), and models with a recurrent structure.

The acoustic model 23, the student model, is not trained using correct labels. It is trained by knowledge distillation using the acoustic model 13 as the teacher model.

In this embodiment, the acoustic model 23 uses Unidirectional-CTC, a CTC model that contains a Unidirectional-RNN. Like the acoustic model 13, the acoustic model 23 is an example of an end-to-end model; an attention model can be used instead of CTC, and the internal neural network can be changed to an RNN, an LSTM, a time-delay neural network, or the like. However, the learning method of this embodiment is effective when the acoustic model 23 is simpler and lower in performance than the acoustic model 13.

In this embodiment, the teacher model is a recurrent neural network that refers to information from both preceding and following times (Bidirectional-CTC), whereas the student model is a recurrent neural network that refers only to information from past times (Unidirectional-CTC). The acoustic model 13, the teacher model, therefore has a more complex structure than the acoustic model 23, the student model. This is only one example of the acoustic models 13 and 23. It suffices that both models are neural networks capable of expressing time-series information, that is, end-to-end neural network models, and that the acoustic model 23 is structurally less complex than the acoustic model 13; other models may be used. For example, both acoustic models 13 and 23 may be Bidirectional-CTC models, with the acoustic model 23 having a less complex structure than the acoustic model 13. Alternatively, both may be Unidirectional-CTC models, again with the acoustic model 23 having a less complex structure than the acoustic model 13.

The acoustic model 23, the student model, is trained by transferring to it the recognition ability of the acoustic model 13, the teacher model. The student model is, for example, a relatively simple model intended to run on a computer, smartphone, or other device with relatively low processing power. Transferring the recognition ability of the structurally complex acoustic model 13, which was trained using correct labels, to the acoustic model 23 carries over the high recognition accuracy of the teacher model.

FIG. 6 illustrates general frame-level knowledge distillation. One frame of speech data is input to the input layer of the student model, and the same frame data is also input to the input layer of the teacher model.

The input frame of speech data propagates through the intermediate layers of the teacher model and the student model, and each output layer outputs a probability value for each label. In training by knowledge distillation, the student model is trained so that the per-label probability values (label probability distributions) output by the two models become close. Cross entropy or Kullback-Leibler divergence is used as the measure of closeness between the distributions. If conventional knowledge distillation were applied to CTC training as is, the acoustic model 23 would be trained so that the label probability distribution it outputs for each frame becomes close to the distribution the acoustic model 13 outputs for the same frame. In other words, conventional knowledge distillation uses a frame-independent training criterion. The learning system of this embodiment, however, does not simply evaluate the per-label probability values of each frame; it evaluates the difference between the two models by a new method, referred to as extended knowledge distillation. This evaluation method is described in detail later.
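The conventional frame-independent criterion can be sketched as the Kullback-Leibler divergence between teacher and student outputs, averaged over frames. This is an illustrative sketch only: the averaging, the three-frame example, and the probability values are assumptions, not values from the embodiment.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def frame_level_kd_loss(teacher_frames, student_frames):
    """Average per-frame KL divergence; each frame is scored independently."""
    per_frame = [kl_divergence(t, s)
                 for t, s in zip(teacher_frames, student_frames)]
    return sum(per_frame) / len(per_frame)

# Hypothetical outputs over the labels [a, k, i, blank] for three frames.
teacher_frames = [[0.8, 0.1, 0.05, 0.05],
                  [0.1, 0.7, 0.10, 0.10],
                  [0.1, 0.1, 0.70, 0.10]]
student_frames = [[0.6, 0.2, 0.10, 0.10],
                  [0.2, 0.5, 0.10, 0.20],
                  [0.1, 0.2, 0.50, 0.20]]

frame_loss = frame_level_kd_loss(teacher_frames, student_frames)
```

Because each frame contributes separately, this loss penalizes the student whenever it emits a label at a different frame than the teacher, even if the collapsed label sequence is the same. The sequence-level evaluation described later avoids comparing frame positions directly.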

FIG. 7 shows the Unidirectional-RNN in the acoustic model 23. In the figure, a blank (-) is output as the phoneme candidate at time t1, "a" at time t2, a blank (-) at time t3, "k" at time t4, "i" at time t5, and a blank (-) at time t6.

<2-3. Training by extended knowledge distillation>
Refer again to FIG. 2. The label sequence evaluation unit 123 receives the per-label probability values of the frame data output by the label estimation unit 121. As described above, the label estimation unit 121 uses the teacher model 13 to output a probability value for each label for each frame, and the label sequence evaluation unit 123 receives these per-frame, per-label probability values.

FIG. 8 shows an example of the per-label probability values input to the label sequence evaluation unit 123. The right side of FIG. 8 shows the per-label probability values output from the acoustic model 13, the teacher model. In the example, per-label probability values are shown for each of the times t1, t2, and t3. At time t1, the probability value of the label "a" is higher than those of the other labels, indicating that the frame data at time t1 is likely to be the label "a". Similarly, at time t2 the frame data is likely to be the label "k".

Refer again to FIG. 2. The label sequence evaluation unit 123 also receives the per-label probability values of the frame data output by the label estimation unit 122. As described above, the label estimation unit 122 uses the student model 23 to output a probability value for each label for each frame, and the label sequence evaluation unit 123 receives these per-frame, per-label probability values.

The left side of FIG. 8 shows the per-label probability values output from the acoustic model 23, the student model. In the example, per-label probability values are shown for each of the times t1, t2, and t3. At time t1, the probability value of the label "a" is higher than those of the other labels, indicating that the frame data at time t1 is likely to be the label "a". Similarly, at time t2 the frame data is likely to be the label "k".

The label sequence evaluation unit 123 calculates the probability values of a label sequence candidate group from the per-label probability values output by the acoustic model 13, the teacher model. In the example of FIG. 8, the label sequence evaluation unit 123 calculates the following probability values for the label sequence candidate group from the per-label probability values output by the acoustic model 13:
aki: 0.5
akai: 0.004
ai: 0.03
...

Because enumerating every label sequence is impractical, the label sequence evaluation unit 123 adopts, for example, the ten label sequences with the highest probability values as the label sequence candidates.

The method of calculating the probability value of a label sequence is not particularly limited, but one example is as follows. Many frame-level patterns of speech data correspond to the label sequence "aki". For example, if the speech data consists of seven frames, the patterns corresponding to "aki" include
aakk--i
aa-kkki
akk--ii
akkii--
and many others. The probability value of one pattern (for example, aakk--i) is obtained by multiplying the per-label probability values of its frames. The probability value of the label sequence "aki" can then be taken as the sum of the probability values of the individual patterns, or as the probability value of the single most probable pattern.
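The calculation described above can be sketched by brute-force enumeration of every frame-level pattern. This is an illustration under stated assumptions: a four-label inventory, a four-frame input with invented probabilities, and exhaustive enumeration, which is only feasible for very short inputs (practical CTC systems compute the sum with a dynamic-programming forward algorithm instead).

```python
from itertools import groupby, product
from math import prod

LABELS = ["a", "k", "i", "-"]  # hypothetical label inventory; "-" is the blank

def collapse(path, blank="-"):
    """Merge repeated labels, then drop blanks (CTC collapsing rule)."""
    merged = [label for label, _ in groupby(path)]
    return "".join(label for label in merged if label != blank)

def sequence_probability(frame_probs, target, mode="sum"):
    """Score a label sequence from per-frame label probabilities.

    mode="sum": add the probabilities of all patterns that collapse to target.
    mode="max": keep only the most probable such pattern.
    """
    scores = [
        prod(dist[label] for dist, label in zip(frame_probs, path))
        for path in product(LABELS, repeat=len(frame_probs))
        if collapse(path) == target
    ]
    if not scores:
        return 0.0
    return sum(scores) if mode == "sum" else max(scores)

# Hypothetical per-frame distributions for a four-frame input.
frame_probs = [
    {"a": 0.7, "k": 0.1, "i": 0.1, "-": 0.1},
    {"a": 0.2, "k": 0.6, "i": 0.1, "-": 0.1},
    {"a": 0.1, "k": 0.3, "i": 0.4, "-": 0.2},
    {"a": 0.1, "k": 0.1, "i": 0.5, "-": 0.3},
]

p_sum = sequence_probability(frame_probs, "aki", mode="sum")
p_max = sequence_probability(frame_probs, "aki", mode="max")
```

The summed score is never smaller than the best-pattern score, since the best pattern is one of the terms in the sum.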

Once the label sequence evaluation unit 123 has calculated the probability values of the label sequences "aki", "akai", "ai", and so on, it adopts, as described above, for example the ten label sequences with the highest probability values as the label sequence candidates.

The label sequence evaluation unit 123 likewise calculates the probability values of a label sequence candidate group from the per-label probability values output by the acoustic model 23, the student model. In the example of FIG. 8, the label sequence evaluation unit 123 calculates the following probability values for the label sequence candidate group from the per-label probability values output by the acoustic model 23:
aki: 0.3
akai: 0.1
ai: 0.05
...

The method of calculating the probability values of the label sequence candidate group from the per-label probability values output by the acoustic model 23, the student model, is the same as for the teacher model described above, so its description is omitted.

When the label sequence evaluation unit 123 has calculated the probability values of the label sequence candidate group (the probability distribution over label sequences) for each of the acoustic model 13 and the acoustic model 23, it uses a loss function to calculate the distance between the two label sequence probability distributions. Examples of the loss function include cross entropy and Kullback-Leibler divergence. The important point in this embodiment is that the evaluation is made not on the difference between the per-frame label probability values of the two models, but on the difference between the probability values of their label sequence candidate groups.
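A sequence-level distance of this kind can be sketched with the example values given for FIG. 8. The description does not specify how the two candidate groups are aligned or normalized, so the choices below are assumptions made for illustration: the teacher's top-N sequences define the candidate set, both score sets are renormalized over that set, and Kullback-Leibler divergence is the loss.

```python
import math

def top_n(seq_probs, n=10):
    """Return the n label sequences with the highest probability values."""
    ranked = sorted(seq_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [seq for seq, _ in ranked[:n]]

def normalize(seq_probs, candidates, eps=1e-12):
    """Renormalize scores over the candidate set (an assumed design choice)."""
    values = [seq_probs.get(c, 0.0) for c in candidates]
    total = sum(values) or eps
    return [v / total for v in values]

def sequence_level_kd_loss(teacher, student, n=10, eps=1e-12):
    """KL divergence between teacher and student over the candidate sequences."""
    candidates = top_n(teacher, n)
    p = normalize(teacher, candidates)
    q = normalize(student, candidates)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Label sequence probability values from the FIG. 8 example.
teacher = {"aki": 0.5, "akai": 0.004, "ai": 0.03}
student = {"aki": 0.3, "akai": 0.1, "ai": 0.05}

seq_loss = sequence_level_kd_loss(teacher, student)
```

The loss depends only on how much probability each model gives to whole label sequences, so two models that agree on the sequences but emit labels at different frames incur no penalty.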

The acoustic models used in the time-series information processing system 1 of this embodiment are end-to-end models, that is, neural networks capable of expressing time-series information. The inventors confirmed that, for such models, evaluating the per-label probability values output for each frame does not improve training accuracy. Training accuracy can instead be improved by evaluating the probability values of the label sequence candidate group output at the sequence level, rather than the per-label probability values output for each frame.

Once the label sequence evaluation unit 123 has calculated the distance between the probability distributions of the label sequence candidate groups, the learning unit 124 trains the acoustic model 23 so as to minimize that distance. A conventional method, such as error backpropagation, is used for the training.

{3. Configuration of recognition system}
FIG. 9 shows the configuration of the recognition system 20 according to this embodiment. The recognition system 20 includes a feature amount calculation unit 21, a decoder 22, an acoustic model 23, a pronunciation dictionary 24, and a language model 25.

The feature amount calculation unit 21 receives speech data as time-series information. It splits the waveform of the speech data into frame data and extracts a feature amount for each frame, using a conventional method. Examples of the feature extraction method include mel filter bank analysis and mel-frequency cepstrum analysis; the analysis conditions must match those of the feature amount calculation unit 11 used during training.

The per-frame feature amounts calculated by the feature amount calculation unit 21 are input to the decoder 22. The decoder 22 includes the acoustic model 23 trained by the learning process described above. The decoder 22 inputs the feature amounts of the speech data to the acoustic model 23 frame by frame and converts each frame into a probability value for each label. Because the labels in this embodiment are defined as phonemes, the decoder 22 uses the acoustic model 23 to convert each frame into a probability value for each phoneme.

Based on the per-phoneme probability values obtained from the acoustic model 23, the decoder 22 refers to the pronunciation dictionary database 24 and the language model 25 and outputs the recognition result with the highest probability. The pronunciation dictionary database 24 consists of words and the phoneme strings that constitute them. For example, for the word "こんにちは" ("hello"), the phoneme string /k/o/N/n/i/ch/i/w/a/ is defined.

The language model 25 models the connections between words. For example, it models which word is likely to appear next after the word "こんにちは" ("hello"). Examples of language modeling methods include conventional n-gram models and RNN models. Based on the probability values obtained from the acoustic model 23, the pronunciation dictionary database 24, and the probability values given by the language model 25, the decoder 22 outputs the word sequence with the highest probability as the speech recognition result. Beam search is one example of a method for finding the word sequence with the highest probability value.

In this way, the recognition system 20 uses the acoustic model 23 trained by the learning system 10. As described above, the acoustic model 23 is trained in the learning system 10 of this embodiment by extended knowledge distillation. That is, training accuracy is improved by evaluating the probability values of the label sequence candidate group output at the sequence level, rather than the per-label probability values output for each frame. The acoustic model 23 therefore retains high recognition accuracy even though its structure is simpler than that of the acoustic model 13, the teacher model. Because the acoustic model 23 is structurally simpler than the acoustic model 13, the circuit scale can be reduced when it is implemented in hardware, and it does not demand high performance from resources such as the CPU and memory when it is implemented in software. The recognition system 20 of this embodiment can therefore also be used on terminals such as smartphones and tablets. The simpler structure of the acoustic model 23 also improves real-time performance.

As described above, the learning system 10 of this embodiment addresses the problem of obtaining a model that combines real-time performance with a high speech recognition rate by bringing the recognition rate of the acoustic model 23, the model being trained, close to that of the structurally more complex acoustic model 13, the teacher model, without changing the structure of the acoustic model 23. Specifically, by bringing the recognition rate of a Unidirectional RNN-based CTC (Uni-CTC) close to that of a Bidirectional RNN-based CTC (Bi-CTC) without changing the Uni-CTC structure, an end-to-end acoustic model that combines real-time performance with a high speech recognition rate can be obtained.

{4. Experimental results}
FIG. 10 shows experimental results for the learning method of this embodiment. The evaluation data is an English speech database called the WSJ corpus. The features are 40-dimensional mel filter bank features together with their first-order and second-order delta features (120 dimensions in total). The labels consist of 72 phonemes, 2 noise labels, and the blank. The teacher model is a Bidirectional-CTC containing a Bidirectional-LSTM, and the student model is a Unidirectional-CTC containing a Unidirectional-LSTM. Each model has three intermediate layers, and each intermediate layer has 512 memory cells. The upper part of the figure shows the results of training on train_si84, a 15-hour training set in the WSJ corpus. The word error rate of the teacher model (Bidirectional-CTC) trained by the ordinary method using correct labels is 10.35%. The word error rate of the student model (Unidirectional-CTC) trained by the ordinary method using correct labels is 11.77%.
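The feature and label dimensions stated above can be checked with a small sketch. The description does not give the delta formula, so the central difference with edge padding used here is an assumed, simplified stand-in for the regression-based deltas common in speech toolkits; the input values are dummy data.

```python
def deltas(frames):
    """Assumed simplified delta: central difference over time, edges padded."""
    n_frames = len(frames)
    out = []
    for t in range(n_frames):
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, n_frames - 1)]
        out.append([(n - p) / 2.0 for n, p in zip(next_f, prev_f)])
    return out

def stack_features(static):
    """Concatenate static, first-order delta, and second-order delta features."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Dummy 40-dimensional filter bank features for five frames.
static = [[float(t + k) for k in range(40)] for t in range(5)]
features = stack_features(static)

FEATURE_DIM = len(features[0])  # 40 static + 40 delta + 40 delta-delta
NUM_LABELS = 72 + 2 + 1         # phonemes + noise labels + blank
```

Under the architecture described for FIG. 4, `FEATURE_DIM` would correspond to the number of input layer nodes and `NUM_LABELS` to the number of output layer nodes.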

In contrast, the word error rate of the student model trained by conventional frame-level knowledge distillation is 16.04%, worse than with the ordinary training method. With sequence-level knowledge distillation, the learning method of this embodiment, the word error rate of the student model is 10.83%, which reduces the performance gap between the student model and the teacher model by 66.2%.

The lower part of FIG. 10 shows the results of training on train_si284, an 81-hour training set in the WSJ corpus. The word error rate of the teacher model (Bidirectional-CTC) trained by the ordinary method using correct labels is 8.70%. The word error rate of the student model (Unidirectional-CTC) trained by the ordinary method using correct labels is 10.37%.

In contrast, the word error rate of the student model trained by conventional frame-level knowledge distillation is 12.71%, worse than with the ordinary training method. With sequence-level knowledge distillation, the learning method of this embodiment, the word error rate of the student model is 9.57%, which reduces the performance gap between the student model and the teacher model by 47.9%.
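The improvement figures can be reproduced from the reported word error rates if the "performance difference" is read as the gap between the ordinarily trained student and the teacher. That reading is an interpretation, but it matches both reported percentages; the function name is illustrative.

```python
def gap_reduction(teacher_wer, baseline_student_wer, distilled_student_wer):
    """Percentage of the teacher-student WER gap closed by distillation."""
    gap_before = baseline_student_wer - teacher_wer
    gap_after = distilled_student_wer - teacher_wer
    return 100.0 * (gap_before - gap_after) / gap_before

# Word error rates (%) reported for FIG. 10.
si84 = gap_reduction(10.35, 11.77, 10.83)   # 15-hour train_si84 condition
si284 = gap_reduction(8.70, 10.37, 9.57)    # 81-hour train_si284 condition

print(round(si84, 1), round(si284, 1))
```

For train_si84 the gap shrinks from 1.42 to 0.48 points, and for train_si284 from 1.67 to 0.87 points.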

In the experiments above, the teacher model and the student model each have three intermediate layers, but the number of intermediate layers and the number of nodes need not be the same in the two models. For example, the same effect can be expected when the teacher model has four intermediate layers and the student model has two.

The specific configuration of the present invention is not limited to the embodiment described above, and various changes and modifications can be made without departing from the gist of the invention.

1 Time-series information processing system
10 Learning system
13 Teacher model
20 Recognition system
23 Student model

Claims

A time-series information learning system that trains a neural network in order to construct a system for recognizing time-series information, the learning system comprising:
a first model having therein a neural network capable of representing time-series information;
a second model having therein a neural network capable of representing time-series information, the second model having been trained using correct labels and having a more complex structure than the first model; and
a first model learning unit that trains the first model using knowledge distillation, with the first model as a student model and the second model as a teacher model,
wherein the first model learning unit comprises:
a first output unit that inputs first time-series data including a plurality of frame data to the first model and obtains, as a first output result of the first model, respective probabilities of a group of label sequence candidates;
a second output unit that inputs the first time-series data including the plurality of frame data to the second model and obtains, as a second output result of the second model, respective probabilities of a group of label sequence candidates;
an evaluation unit that evaluates a difference between the first output result and the second output result; and
a first model learning unit that trains the first model based on an evaluation result of the evaluation unit.
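
The evaluation unit compares the probabilities that the student and the teacher assign to the same group of label sequence candidates. The following is a minimal sketch of one such comparison. It assumes that each model's scores are renormalized over the candidate group and that the difference is measured as a KL divergence; the claim itself does not fix the measure. The function names and log-probability values are hypothetical.

```python
import math


def normalize(log_probs):
    """Convert candidate log-scores into probabilities that sum to 1."""
    peak = max(log_probs)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - peak) for lp in log_probs]
    total = sum(weights)
    return [w / total for w in weights]


def candidate_kl(teacher_log_probs, student_log_probs):
    """KL(teacher || student) over one shared group of label sequence candidates."""
    if len(teacher_log_probs) != len(student_log_probs):
        raise ValueError("both models must score the same candidates")
    p = normalize(teacher_log_probs)
    q = normalize(student_log_probs)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical log-probabilities for three candidate label sequences.
teacher_scores = [-1.2, -2.5, -4.0]
student_scores = [-2.0, -2.1, -2.3]
loss = candidate_kl(teacher_scores, student_scores)
```

The value is zero when the two models rank and weight the candidates identically and grows as the student's distribution departs from the teacher's.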

The time-series information learning system according to claim 1,
wherein the first model and the second model include a model having a recurrent neural network therein.

The time-series information learning system according to claim 2,
wherein the first model and the second model include a CTC (Connectionist Temporal Classification) model.

The time-series information learning system according to claim 3,
wherein the first model is a Unidirectional-CTC model and the second model is a Bidirectional-CTC model.

The time-series information learning system according to any one of claims 1 to 4,
wherein the time-series information includes voice information.

A learning method for time-series information that trains a first model having therein a neural network capable of representing time-series information, in order to construct a system for recognizing time-series information, the learning method comprising:
(a) a second model learning step of training, using correct labels, a second model that has therein a neural network capable of representing time-series information and has a more complex structure than the first model; and
(b) a first model learning step of training the first model using knowledge distillation, with the first model as a student model and the second model as a teacher model,
wherein the first model learning step (b) includes:
(b-1) a step of inputting first time-series data including a plurality of frame data to the first model;
(b-2) a step of inputting the first time-series data including the plurality of frame data to the second model;
(b-3) a step of obtaining respective probabilities of a group of label sequence candidates as a first output result of the first model obtained for the first time-series data including the plurality of frame data;
(b-4) a step of obtaining respective probabilities of a group of label sequence candidates as a second output result of the second model obtained for the first time-series data including the plurality of frame data;
(b-5) an evaluation step of evaluating a difference between the first output result obtained in step (b-3) and the second output result obtained in step (b-4); and
(b-6) a step of training the first model based on an evaluation result of the evaluation step.
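
Steps (b-3) to (b-6) can be illustrated with a toy update loop. The sketch below is not the CTC training of the embodiments. It assumes that the student is reduced to one trainable score per candidate, that the difference is a KL divergence, and that plain gradient descent is used. All names, values, and the learning rate are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw candidate scores into probabilities that sum to 1."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def kl(p, q):
    """KL(p || q) for two distributions over the same candidates."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distill(teacher_probs, student_scores, lr=0.5, steps=200):
    """Gradient descent on KL(teacher || softmax(student_scores))."""
    scores = list(student_scores)
    history = []
    for _ in range(steps):
        q = softmax(scores)                   # (b-3) student candidate probabilities
        history.append(kl(teacher_probs, q))  # (b-5) evaluate the difference
        # (b-6) the gradient of the KL term with respect to score i is q_i - p_i
        scores = [s - lr * (qi - pi)
                  for s, qi, pi in zip(scores, q, teacher_probs)]
    return scores, history


teacher_probs = [0.7, 0.2, 0.1]  # (b-4) hypothetical teacher output
trained_scores, history = distill(teacher_probs, [0.0, 0.0, 0.0])
```

In this toy setting the recorded difference shrinks over the iterations and the student's candidate probabilities approach the teacher's. In an actual model the scores would come from the network and the gradient would be propagated back through its parameters.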