JP7209330B2

JP7209330B2 - classifier, trained model, learning method

Info

Publication number: JP7209330B2
Application number: JP2018142418A
Authority: JP
Inventors: 勝李; シュガンルー; 遼一高島; 鵬沈; 恒河井
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2018-07-30
Filing date: 2018-07-30
Publication date: 2023-01-20
Anticipated expiration: 2038-07-30
Also published as: JP2020020872A

Description

The present technology relates to a classifier that outputs a sequence of labels for an input signal, a trained model for the classifier, and a learning method for the classifier.

In the field of speech recognition, very deep convolutional networks are known to significantly outperform conventional deep neural networks (DNNs).

For speech recognition tasks, very deep residual time-delay neural networks have been proposed (see, for example, Non-Patent Document 1). Unlike time-delay neural networks (TDNNs) and feedforward sequential memory networks (FSMNs) with few layers, a very deep residual time-delay neural network can learn longer context dependencies without recurrent feedback. It can therefore avoid problems such as the latency that can arise when a bidirectional long short-term memory (BLSTM) network is used. For this reason, its application to end-to-end (E2E) training, which integrates the training of the acoustic model and the language model, is considered promising.

A model using the connectionist temporal classification (CTC) framework is known as an effective E2E framework for speech recognition (see, for example, Non-Patent Document 2). The CTC framework focuses on solving the sequence labeling problem that arises between variable-length input speech frames and output labels (units such as phones, characters, and syllables). The CTC modeling technique greatly simplifies the acoustic model pipeline. Consequently, the CTC framework requires neither frame-level labels nor an initial Gaussian mixture model and hidden Markov model (GMM-HMM) model (corresponding to an acoustic model).

The present inventors have previously proposed training a CTC-based E2E model using a very deep residual time-delay structure (see, for example, Non-Patent Document 3).

Non-Patent Document 1: S. Zhang, M. Li, Z. Yan, and L. Dai, "Deep-FSMN for large vocabulary continuous speech recognition," arXiv preprint arXiv:1803.05030 (accepted for ICASSP 2018), 2018.
Non-Patent Document 2: A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML, 2006.
Non-Patent Document 3: S. Li, X. Lu, R. Takashima, P. Shen, and H. Kawai, "Improving CTC-based acoustic model with very deep residual neural network," in Proc. INTERSPEECH, 2018.

The enormous number of parameters that define a very deep model complicates optimization and degrades generalization performance. The present inventors' research shows that a very deep model tuned well for a particular system cannot be applied as-is to a system with a different data configuration. This is because finding a structure that performs well is not easy and requires numerous experiments on all candidate network structures.

An object of the present technology is to provide a model capable of providing an appropriate network structure for the target system.

According to one aspect of the present invention, a classifier that outputs a sequence of labels for an input signal is provided. The classifier includes an input layer that sequentially generates a first feature vector for each frame of a predetermined time width from the input signal, a plurality of stacked residual blocks following the input layer, and an output layer connected to the output side of the plurality of residual blocks. Each of the plurality of residual blocks includes a plurality of stacked time delay layers, a shortcut path bypassing the plurality of time delay layers, and an attention module that adjusts weights between the path passing through the plurality of time delay layers and the shortcut path. The plurality of time delay layers have delay elements that apply a delay of a predetermined time step to the input. The attention module updates the weights at each time step based on the output obtained by passing the input given to the corresponding residual block through the corresponding plurality of time delay layers, and on the input given to that residual block.

The attention module may include a fully connected layer connected to the output of the corresponding residual block and to the shortcut path, and a softmax function connected to the fully connected layer.

The attention module may calculate a first weight for the path passing through the plurality of time delay layers and a second weight for the shortcut path such that the sum of the first weight and the second weight is 1.

For an input vector, each of the time delay layers may generate a first internal vector corresponding to a past frame, which is earlier by the time step than the current frame (the frame corresponding to the input vector), and a second internal vector corresponding to a future frame, which is later by the time step than the current frame.

The input signal may be a speech signal, and the classifier may output labels indicating a speech recognition result for the speech signal.

According to another aspect of the present invention, a trained model for causing a computer to function so as to output a sequence of labels for an input signal is provided. The trained model includes an input layer that sequentially generates a first feature vector for each frame of a predetermined time width from the input signal, a plurality of stacked residual blocks following the input layer, and an output layer connected to the output side of the plurality of residual blocks. Each of the plurality of residual blocks includes a plurality of stacked time delay layers, a shortcut path bypassing the plurality of time delay layers, and an attention module that adjusts weights between the path passing through the plurality of time delay layers and the shortcut path. The plurality of time delay layers have delay elements that apply a delay of a predetermined time step to the input. The attention module is configured to update the weights at each time step based on the output obtained by passing the input given to the corresponding residual block through the corresponding plurality of time delay layers, and on the input given to that residual block.

According to yet another aspect of the present invention, a learning method for a classifier that outputs a sequence of labels for an input signal is provided. The classifier includes an input layer that sequentially generates a first feature vector for each frame of a predetermined time width from the input signal, a plurality of stacked residual blocks following the input layer, and an output layer connected to the output side of the plurality of residual blocks. Each of the plurality of residual blocks includes a plurality of stacked time delay layers and a shortcut path bypassing the plurality of time delay layers. The plurality of time delay layers have delay elements that apply a delay of a predetermined time step to the input. The learning method includes a first training step of determining, using a training data set, the parameters that define the network of the classifier, and an adding step of adding to the classifier an attention module that adjusts weights between the path passing through the plurality of time delay layers and the shortcut path. The attention module is configured to update the weights at each time step based on the output obtained by passing the input given to the corresponding residual block through the corresponding plurality of time delay layers, and on the input given to that residual block. The learning method further includes a second training step of determining, using the training data set, the parameters that define the attention module.

The second training step may include a step of determining anew the values of all parameters that define the network of the classifier, including the parameters that define the attention module.

The second training step may include a step of determining only the parameters that define the attention module while keeping the parameters determined in the first training step fixed.
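The two variants of the second training step can be contrasted with a minimal sketch. It assumes, purely for illustration, that the parameters are held in a flat dictionary, that attention-module parameters are recognizable by a hypothetical `attention.` name prefix, and that plain gradient descent is used; none of these details come from the description.

```python
def second_training_step(params, grads, lr, freeze_base):
    """One gradient-descent update over a flat {name: value} parameter dict.

    freeze_base=False: every parameter is updated (full retraining).
    freeze_base=True:  only parameters whose name starts with the
                       hypothetical prefix "attention." are updated.
    """
    updated = {}
    for name, value in params.items():
        is_attention = name.startswith("attention.")
        if freeze_base and not is_attention:
            updated[name] = value  # keep the value from the first training step
        else:
            updated[name] = value - lr * grads[name]
    return updated


# Illustrative values only.
params = {"td1.w": 1.0, "attention.fc.w": 0.5}
grads = {"td1.w": 0.2, "attention.fc.w": 0.4}
frozen = second_training_step(params, grads, lr=0.1, freeze_base=True)
full = second_training_step(params, grads, lr=0.1, freeze_base=False)
```

With `freeze_base=True` the entry `td1.w` is returned unchanged, whereas with `freeze_base=False` both entries move along their gradients.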

The learning method may further include a step of deleting some of the plurality of time delay layers based on changes in the weight values calculated by the attention module when an input signal is given to the classifier to which the attention module has been added.
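A minimal sketch of one possible selection rule follows. It assumes, hypothetically, that a residual block becomes a candidate for deletion when the weight its attention module assigns to the time-delay path stays below a threshold at every observed time step. Both the rule and the threshold value are illustrative assumptions and are not specified in the passage above.

```python
def blocks_to_prune(td_weight_history, threshold=0.1):
    """Return indices of residual blocks whose time-delay-path weight
    never reaches `threshold` over the observed time steps.

    td_weight_history[i] is the list of weights that the attention module
    of block i assigned to its time-delay path, one value per time step.
    """
    return [
        index
        for index, weights in enumerate(td_weight_history)
        if max(weights) < threshold
    ]


# Illustrative weights for three blocks over three time steps.
history = [
    [0.60, 0.70, 0.50],  # block 0: time-delay path is used heavily
    [0.02, 0.05, 0.01],  # block 1: time-delay path is almost never used
    [0.30, 0.08, 0.40],  # block 2: usage varies over time
]
candidates = blocks_to_prune(history)
```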

According to the present technology, an appropriate network structure can be provided for the target system.

FIG. 1 is a schematic diagram showing an application example using a trained model according to the present embodiment.
FIG. 2 is a schematic diagram for explaining a learning method of the speech recognition system S shown in FIG. 1.
FIG. 3 is a diagram outlining the processing in the basic CTC-based model according to the present embodiment.
FIG. 4 is a schematic diagram showing an example of the network structure of the basic CTC-based model according to the present embodiment.
FIG. 5 is a schematic diagram showing the processing structure of a time delay layer included in the basic CTC-based model according to the present embodiment.
FIG. 6 is a schematic diagram showing an example of a network structure equivalent to the network structure shown in FIG. 4 when it consists of three residual blocks.
FIG. 7 is a schematic diagram showing the main part of the network structure of the improved CTC-based model according to the present embodiment.
FIG. 8 is a schematic diagram showing an example of a hardware configuration that implements the speech recognition system S according to the present embodiment.
FIG. 9 is a flowchart showing the processing procedure of a learning method (retraining method) of the improved CTC-based model according to the present embodiment.
FIG. 10 is a flowchart showing the processing procedure of a learning method (pruning method) of the improved CTC-based model according to the present embodiment.
FIG. 11 is a diagram showing an example of the distribution of data transfer in the improved CTC-based model according to the present embodiment.
FIG. 12 is a diagram for explaining the processing procedure of a learning method (network reconstruction method) of the improved CTC-based model according to the present embodiment.
FIG. 13 is a diagram showing an example of temporal changes in the scale factors calculated using the improved CTC-based model according to the present embodiment.
FIG. 14 is a flowchart showing the processing procedure of the learning method (network reconstruction method) of the improved CTC-based model according to the present embodiment.
FIG. 15 is a flowchart showing the processing procedure of a decoding method of the improved CTC-based model according to the present embodiment.
FIG. 16 is a diagram showing an example of changes in the attention scores of the improved CTC-based model according to the present embodiment.

Embodiments of the present invention will be described in detail with reference to the drawings. Identical or corresponding parts in the drawings are denoted by the same reference numerals, and their description will not be repeated.

[A. Application example]
First, an application example using a trained model according to the present embodiment will be described.

FIG. 1 is a schematic diagram showing an application example using a trained model according to the present embodiment. FIG. 1 shows a speech recognition system S as the application example. The speech recognition system S receives a speech signal and outputs a recognition result. More specifically, the speech recognition system S includes a feature extraction unit 2 that receives the speech signal and extracts a feature vector from the time-series data of each predetermined section (hereinafter also referred to as a "speech frame"), and a recognition engine 4 that receives the vectors from the feature extraction unit 2 and outputs a recognition result such as text.

The feature extraction unit 2 sequentially generates a feature vector for each speech frame from the input speech signal. Each feature vector output from the feature extraction unit 2 has a predetermined number of dimensions and reflects the features of the part of the input speech signal that corresponds to the speech frame. Feature vectors are output one after another according to the length of the input speech signal. Hereinafter, all or part of such a series of feature vectors is also collectively referred to as an "acoustic feature sequence".
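The per-frame behavior of the feature extraction unit 2 can be illustrated with a minimal sketch. The two features used here (log energy and zero-crossing count), the frame length, and the hop size are assumptions chosen only to keep the example short; the description does not state which features are extracted.

```python
import math


def frame_features(samples, frame_len, hop):
    """Split `samples` into frames and return one feature vector per frame.

    Each feature vector has a fixed number of dimensions (two here):
    [log energy, zero-crossing count]. Both features are illustrative.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        log_energy = math.log(energy + 1e-10)  # small offset avoids log(0)
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        features.append([log_energy, float(zero_crossings)])
    return features


# A synthetic signal stands in for a speech signal.
signal = [math.sin(0.3 * n) for n in range(100)]
acoustic_feature_sequence = frame_features(signal, frame_len=25, hop=10)
```

The number of feature vectors grows with the length of the signal, which matches the statement that feature vectors are output one after another according to the input length.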

The recognition engine 4 inputs the feature vector of each speech frame output from the feature extraction unit 2 to the trained model and outputs text. The recognition engine 4 is thus composed of the trained model according to the present embodiment and functions as a decoder. That is, the recognition engine 4 is an E2E framework for speech recognition (in which the acoustic model and the language model are integrated) that receives speech frames and outputs the corresponding text.

FIG. 2 is a schematic diagram for explaining the learning method of the speech recognition system S shown in FIG. 1. Referring to FIG. 2, a training data set 40 consisting of speech signals 42 and corresponding texts 44 is prepared. A speech signal 42 is input to the feature extraction unit 2, and the feature vectors sequentially generated by the feature extraction unit 2 are input to the recognition engine 4, which yields a recognition result (text). The network is optimized by successively updating the parameters of the network that defines the recognition engine 4 based on the error between the recognition result from the recognition engine 4 and the label (text 44) corresponding to the input speech signal 42.

[B. Basic network structure]
The trained model according to the present embodiment achieves learning and optimization of the network structure by adding attention modules, as appropriate, to the basic network structure described below. The basic network structure according to the present embodiment is described first.

(b1: Overview)
The present embodiment uses a basic network structure that is classified as a model using the CTC framework (hereinafter also referred to as the "basic CTC-based model"). The basic CTC-based model is a classifier that outputs a sequence of labels for an input signal. The following description mainly covers an example in which a speech signal is used as the input signal and the basic CTC-based model outputs labels indicating a speech recognition result for the speech signal, although the basic CTC-based model can also be applied to tasks other than speech recognition.

As a typical example, the basic CTC-based model 1 according to the present embodiment receives the features of a window (containing 10 to 15 speech frames) that is set successively over a sentence of the input speech signal. Here, a sentence is a linguistically meaningful segment and usually contains a plurality of speech frames, each of a predetermined length.

The output of the basic CTC-based model 1 according to the present embodiment is a frame-level sequence called a path (hereinafter also referred to as a "CTC output sequence"). The output sequence includes blanks (hereinafter also written as "φ"), which carry no CTC label.

FIG. 3 is a diagram outlining the processing in the basic CTC-based model 1 according to the present embodiment. Referring to FIG. 3, a window (containing 10 to 15 speech frames) is set at the beginning of a sentence of the input speech signal, and the CTC output is estimated while the window is slid to successive positions. As shown in FIG. 3, the input to the basic CTC-based model 1 has only a forward path. That is, only past information is required as input, so there is no need to wait for the input speech to end.
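The windowing described above can be sketched as follows. The window length of 10 frames lies within the stated range of 10 to 15 frames, while the shift of 5 frames between consecutive windows is an assumed value, since the description does not state how far the window is slid each time.

```python
def sliding_windows(frames, window_len, shift):
    """Return the successive windows obtained by sliding a window of
    `window_len` frames over `frames` in steps of `shift` frames."""
    return [
        frames[start:start + window_len]
        for start in range(0, len(frames) - window_len + 1, shift)
    ]


# Integers stand in for the per-frame feature vectors of one sentence.
sentence_frames = list(range(30))
windows = sliding_windows(sentence_frames, window_len=10, shift=5)
```

Because each window covers only frames that have already arrived, the windows can be produced while the speech is still being received.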

In the following description, the basic CTC-based model 1 according to the present embodiment is also referred to as "VResTD-CTC" (very deep residual time-delay neural network - CTC).

Training the basic CTC-based model 1 according to the present embodiment yields a trained model that implements the recognition engine 4 of FIG. 1. The feature extraction unit 2 may be designed in advance based on empirical rules or the like.

FIG. 4 is a schematic diagram showing an example of the network structure of the basic CTC-based model 1 according to the present embodiment. Referring to FIG. 4, the basic CTC-based model 1 receives the feature vectors (acoustic feature sequence) that the feature extraction unit 2 (FIG. 1) generates from the time-series data (speech frames) obtained by extracting the speech signal in predetermined sections. For the sequentially input feature vectors, the basic CTC-based model 1 sequentially outputs the corresponding text (subword sequence).

More specifically, the basic CTC-based model 1 includes a fully connected layer 10 (hereinafter also referred to as "FC" or "FC layers") as an input layer, a plurality of residual blocks 20, and an output layer 30.

The fully connected layer 10, serving as the input layer, receives the feature vectors and generates internal vectors of the required number of dimensions.

The plurality of residual blocks 20 are arranged following the fully connected layer 10. The residual blocks 20 are stacked on one another to form a multi-stage arrangement of residual blocks 20.

Each residual block 20 includes a time delay block 22. The time delay block 22 includes a plurality of stacked time delay layers 24 (also written as "TD layer"). Each residual block 20 further includes a shortcut path 26 (also written as "Short-cut path") that bypasses the time delay block 22, and an adder 29 that combines the output of the time delay block 22 with the output of the shortcut path 26.

The output layer 30 is connected to the output side of the plurality of residual blocks 20 and includes a fully connected layer 32 and a mapping function 34. The fully connected layer 32 is coupled to the output nodes of the final residual block 20, normalizes the probabilities for the output feature vector, and outputs the most probable label. Because the output layer 30 outputs a label for each frame, labels are output one after another in correspondence with the input speech signal. FIG. 4 shows an example in which phones are used as labels (here, the term covers units such as phones, characters, and syllables). The series of labels estimated frame by frame (a sequence of phones) forms the CTC output sequence. The estimation result of the basic CTC-based model 1 may include blanks (shown as "φ" in FIG. 4), for which no corresponding label exists.
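The per-frame behavior of the fully connected layer 32 can be sketched as an affine transform followed by softmax normalization and selection of the most probable label. The three-entry label set, the weights, and the use of softmax as the normalization are assumptions made for illustration.

```python
import math

LABELS = ["φ", "a", "b"]  # hypothetical label set; index 0 is the blank


def softmax(scores):
    """Normalize scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum improves numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def most_probable_label(vector, weights, bias):
    """Apply a fully connected layer, normalize, and pick the best label."""
    scores = [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Illustrative weights: one row per label, one column per input dimension.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]
label, probs = most_probable_label([0.2, 2.0], weights, bias)
```

Applying `most_probable_label` to the output of the final residual block at every frame yields the frame-level CTC output sequence.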

The mapping function 34 sequentially determines the corresponding text (subword sequence) from the CTC output sequence.
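A minimal sketch of such a mapping follows, assuming the standard CTC collapsing rule (merge consecutive repeated labels, then remove blanks). The description does not spell out the rule used by the mapping function 34, so this is an illustration of the usual convention rather than the patented implementation.

```python
BLANK = "φ"


def ctc_collapse(path):
    """Map a frame-level CTC output sequence to a label sequence by
    merging consecutive repeats and then dropping blanks."""
    collapsed = []
    previous = None
    for label in path:
        if label != previous and label != BLANK:
            collapsed.append(label)
        previous = label
    return collapsed


# A blank between the two "l" frames keeps them as separate labels.
path = ["φ", "h", "h", "φ", "e", "l", "φ", "l", "o", "o"]
text = ctc_collapse(path)
```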

As described above, the basic CTC-based model 1 according to the present embodiment outputs text (a subword sequence) as the recognition result for the speech signal input frame by frame.

(b2: Processing in a single time delay layer 24)
FIG. 5 is a schematic diagram showing the processing structure of a time delay layer 24 included in the basic CTC-based model 1 according to the present embodiment. Referring to FIG. 5, the time delay layer 24 includes two delay elements 241 and 242, each of which applies a delay of a predetermined time step t_i to its input.

Each of the delay elements 241 and 242 delays its input by the time step t_i. The input sequence given to the time delay layer 24 is delayed by the time step t_i in the delay element 241. The output of the delay element 241 is then given to the delay element 242, which delays it by a further time step t_i. These two stages of delay elements produce three contexts whose timing differs by the time step t_i from one to the next.

Treating the input frame as the future context, the output of the delay element 241 as the current context, and the output of the delay element 242 as the past context effectively extends the time step in both directions.

As shown in FIG. 5, for an input sequence (input vector), each time delay layer 24 generates a past context (first internal vector) corresponding to a past frame, which is earlier by the time step t_i than the current frame (the frame corresponding to the input vector), and a future context (second internal vector) corresponding to a future frame, which is later by the time step t_i than the current frame.
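The generation of the three contexts can be illustrated with a minimal sketch. Zero padding at the sequence boundaries is an assumption; the description does not state how positions before the start or after the end of the sequence are handled.

```python
def splice_contexts(sequence, step):
    """For each position t, return (past, current, future), that is,
    the vectors at t - step, t, and t + step.

    Positions outside the sequence are filled with zero vectors
    (an assumption made for this sketch).
    """
    length = len(sequence)
    zero = [0.0] * len(sequence[0])
    contexts = []
    for t in range(length):
        past = sequence[t - step] if t - step >= 0 else zero
        future = sequence[t + step] if t + step < length else zero
        contexts.append((past, sequence[t], future))
    return contexts


# One-dimensional vectors whose value equals the frame index.
frames = [[float(t)] for t in range(6)]
contexts = splice_contexts(frames, step=2)
```

With `step=2`, position 3 is paired with positions 1 and 5, which corresponds to the output of the second delay element (past), the output of the first delay element (current), and the undelayed input (future).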

In the basic CTC-based model 1 according to the present embodiment, the entire input sequence H^l given to the l-th time delay layer 24 can be expressed as the following equation (1).

First, every h^l_t input to the l-th time delay layer 24 is linearly transformed using the standard weight matrix W^l and the bias b^l of the l-th time delay layer 24, as in the following equation (2).

Next, the deviation e^l_t at time step t in the l-th time delay layer 24 can be expressed as the following equation (3).

The output of each time delay layer 24 can be expressed as the following equation (4).

Equation (4) above shows an example in which a rectified linear unit (ReLU) is used as the activation function of the residual block 20, but the activation function is not limited to this, and any activation function can be used. In the following description, the rectified linear unit is also referred to as "ReLU".
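As a rough illustration of an affine transform followed by ReLU, the sketch below concatenates the past, current, and future contexts before applying a weight matrix and a bias. This concatenation is a common time-delay-layer formulation and is an assumption here; it is not necessarily the exact form of equations (2) to (4), and all numeric values are illustrative.

```python
def relu(vector):
    """Rectified linear unit applied element by element."""
    return [max(0.0, x) for x in vector]


def time_delay_layer_output(past, current, future, weights, bias):
    """Concatenate the three contexts, apply W and b, then ReLU."""
    spliced = past + current + future  # list concatenation
    pre_activation = [
        sum(w * x for w, x in zip(row, spliced)) + b
        for row, b in zip(weights, bias)
    ]
    return relu(pre_activation)


# Illustrative one-dimensional contexts and a 2 x 3 weight matrix.
weights = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
bias = [0.5, 0.0]
output = time_delay_layer_output([1.0], [2.0], [3.0], weights, bias)
```

Replacing `relu` with another function changes only the final step, which reflects the remark that any activation function can be used.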

(b3: Processing in the residual block 20)
Viewed as a whole, a residual block 20 consisting of a plurality of stacked time delay layers 24 outputs the combination (the result of the adder 29) of the output of the multi-layer transform f_i and the shortcut output that bypasses the multi-layer transform f_i. The multi-layer transform f_i is a function formed by connecting time delay layers 24 and activation functions (ReLU) in series.

A network in which a plurality of residual blocks 20 are stacked on one another behaves like an ensemble network.

FIG. 6 is a schematic diagram showing an example of a network structure equivalent to the network structure shown in FIG. 4 when it consists of three residual blocks 20. The network structure shown in FIG. 6(A) can be expressed as the equivalent network structure shown in FIG. 6(B) by expanding the shortcut paths 26 and the adders 29 that combine the outputs. As shown in FIG. 6(B), a plurality of paths (eight in FIG. 6) passing through different numbers of residual blocks 20 exist in parallel. As a result, the outputs subjected to all the different time-step delays are combined at the end.
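The count of eight parallel paths follows from each of the three blocks offering two choices (transform or shortcut), giving 2^3 = 8. The sketch below checks this numerically under a simplifying assumption: each block's transform is a scalar gain, so the expansion is exact. With nonlinear transforms such as ReLU the expanded view is a structural interpretation rather than an exact identity.

```python
from itertools import product


def stacked(x, gains):
    """Pass x through residual blocks in sequence: x + f_i(x), f_i(x) = a_i * x."""
    for gain in gains:
        x = x + gain * x
    return x


def unrolled(x, gains):
    """Sum the contributions of every path; each path either takes the
    transform or the shortcut in each block."""
    total = 0.0
    path_count = 0
    for choice in product([False, True], repeat=len(gains)):
        contribution = x
        for use_transform, gain in zip(choice, gains):
            if use_transform:
                contribution = gain * contribution
        total += contribution
        path_count += 1
    return total, path_count


gains = [0.5, -0.2, 2.0]  # illustrative gains for three residual blocks
y_stacked = stacked(1.0, gains)
y_unrolled, n_paths = unrolled(1.0, gains)
```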

In the basic CTC-based model 1 shown in FIGS. 4 and 6, every output from the final residual block 20 contains the outputs of the other residual blocks 20 over a certain period of time. For example, for a basic CTC-based model 1 consisting of three residual blocks 20, the output y^3_t of the final residual block 20 at a given time step can be expressed as the following equation (6).

[C. Improved network structure]
Next, the improved network structure according to the present embodiment will be described. The improved network structure according to the present embodiment corresponds to the basic CTC-based model 1 shown in FIGS. 4 and 6 above with its residual blocks 20 modified. Hereinafter, it is also referred to as the "improved CTC-based model" in contrast to the "basic CTC-based model". In contexts where the "basic CTC-based model" and the "improved CTC-based model" need not be distinguished, they may be collectively referred to simply as the "CTC-based model".

図７は、本実施の形態に従う改良ＣＴＣベースドモデルのネットワーク構造の要部を示す模式図である。図７（Ａ）は、本実施の形態に従う基本ＣＴＣベースドモデルの残差ブロック２０の構造例を示し、図７（Ｂ）は、本実施の形態に従う改良ＣＴＣベースドモデルの残差ブロック２０Ａの構造例を示す。 FIG. 7 is a schematic diagram showing the main part of the network structure of the improved CTC-based model according to this embodiment. FIG. 7(A) shows an example structure of the residual block 20 of the basic CTC-based model according to this embodiment, and FIG. 7(B) shows an example structure of the residual block 20A of the improved CTC-based model according to this embodiment.

図７（Ａ）と図７（Ｂ）とを比較して、残差ブロック２０Ａは、残差ブロック２０に比較して、アテンションモジュール２８（Attention module）をさらに含む。アテンションモジュール２８は、残差ブロック２０Ａの出力層の後段に配置される。アテンションモジュール２８は、残差ブロック２０Ａに含まれる２つの経路（ショートカット経路２６側の経路および時間遅延ブロック２２側の経路）に対する重みを調整する。このようなアテンションモジュール２８を採用することで、基本ＣＴＣベースドモデル１をより動的に振る舞わせることができるため、学習性能および識別性能を高めることができる。 Comparing FIG. 7(A) with FIG. 7(B), the residual block 20A further includes an attention module 28 in addition to the components of the residual block 20. The attention module 28 is arranged after the output layer of the residual block 20A. The attention module 28 adjusts the weights for the two paths included in the residual block 20A (the path on the shortcut path 26 side and the path on the time delay block 22 side). Adopting such an attention module 28 allows the basic CTC-based model 1 to behave more dynamically, so that learning performance and identification performance can be improved.

本実施の形態において、アテンションモジュール２８は、時間遅延を実現する残差ブロック（時間遅延層２４）の後段に配置されることで後述するような顕著な効果を奏する。 In the present embodiment, the attention module 28 is arranged after the residual block (time delay layer 24) that implements the time delay, thereby producing remarkable effects as will be described later.

以下の説明においては、以下の（７）式に示すような、それぞれの経路の重みを変更するためのアテンションスコアα_ｔ ^ｉ（ベクトル量）を用いる。 In the following description, an attention score α _t ⁱ (vector quantity) for changing the weight of each route as shown in the following equation (7) is used.

アテンションスコアα_ｔ ^ｉ（ベクトル量）は、重みα_ｔ ^ｉおよび重みβ_ｔ ^ｉ（＝１－α_ｔ ^ｉ）を要素として含む。重みα_ｔ ^ｉは、任意のタイムステップｔにおいて、ｉ番目の残差ブロック２０Ａのショートカット経路２６を伝達されるデータに対するスケールファクタを意味し、重みβ_ｔ ^ｉ（＝１－α_ｔ ^ｉ）は、任意のタイムステップｔにおいて、ｉ番目の残差ブロック２０Ａの時間遅延ブロック２２を伝達されるデータに対するスケールファクタを意味する。 The attention score α _t ⁱ (vector quantity) includes the weight α _t ⁱ and the weight β _t ⁱ (=1−α _t ⁱ ) as elements. The weight α _t ⁱ is the scale factor applied, at an arbitrary time step t, to the data transmitted through the shortcut path 26 of the i-th residual block 20A, and the weight β _t ⁱ (=1−α _t ⁱ ) is the scale factor applied, at an arbitrary time step t, to the data transmitted through the time delay block 22 of the i-th residual block 20A.

より具体的には、図７（Ｂ）に示すように、アテンションモジュール２８は、全結合層２８２と、ｓｏｆｔｍａｘ関数２８４と、乗算器２８６，２８８とを含む。 More specifically, as shown in FIG. 7B, the attention module 28 includes a fully connected layer 282, a softmax function 284, and multipliers 286,288.

アテンションモジュール２８の全結合層２８２は、対応する残差ブロック２０Ａの出力とショートカット経路２６とに接続される。ｓｏｆｔｍａｘ関数２８４は、全結合層２８２に接続される。 A fully connected layer 282 of attention module 28 is connected to the output of corresponding residual block 20A and shortcut path 26 . A softmax function 284 is connected to the fully connected layer 282 .

時間遅延ブロック２２からの出力経路２８５は乗算器２８６に入力され、乗算器２８６において重みβ_ｔ ^ｉを乗じられた上で加算器２９に出力される。一方、ショートカット経路２６は乗算器２８８に入力され、乗算器２８８において重みα_ｔ ^ｉを乗じられた上で加算器２９に出力される。なお、α_ｔ ^ｉ＋β_ｔ ^ｉ＝１である。このように、アテンションモジュール２８は、重みα_ｔ ^ｉ（第１の重み）と重みβ_ｔ ^ｉ（第２の重み）の合計が１となるように、複数の時間遅延層２４を通過する経路に対する重みα_ｔ ^ｉ（第１の重み）と、ショートカット経路２６に対する重みβ_ｔ ^ｉ（第２の重み）とを算出する。 The output path 285 from the time delay block 22 is input to the multiplier 286, multiplied by the weight β _t ⁱ in the multiplier 286, and output to the adder 29. The shortcut path 26, on the other hand, is input to the multiplier 288, multiplied by the weight α _t ⁱ in the multiplier 288, and output to the adder 29. Note that α _t ⁱ +β _t ⁱ =1. In this way, the attention module 28 calculates the weight α _t ⁱ (first weight) for the path passing through the plurality of time delay layers 24 and the weight β _t ⁱ (second weight) for the shortcut path 26 such that the sum of the weight α _t ⁱ (first weight) and the weight β _t ⁱ (second weight) is 1.

重みα_ｔ ^ｉおよび重みβ_ｔ ^ｉが動的に変更されることで、残差ブロック２０Ａからの出力に含まれる、多層変換ｆ_ｉの出力と多層変換ｆ_ｉをバイパスするショートカット出力との比率を動的に調整できる。 Because the weight α _t ⁱ and the weight β _t ⁱ are changed dynamically, the ratio between the output of the multi-layer transform f _i and the shortcut output bypassing the multi-layer transform f _i , both of which are included in the output from the residual block 20A, can be adjusted dynamically.

このように、アテンションモジュール２８は、複数の時間遅延層２４を通過する出力経路２８５とショートカット経路２６との間の重みを調整する。 Thus, the attention module 28 adjusts the weights between the output path 285, which passes through the plurality of time delay layers 24, and the shortcut path 26.

図７に示すような残差ブロック２０Ａからの出力は、上述の（５）式に示す関係式に対する重み付けを変更することで、以下の（８）式のように定義できる。 The output from the residual block 20A as shown in FIG. 7 can be defined as the following equation (8) by changing the weighting for the relational expression shown in the above equation (5).
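Equation (8) is not reproduced in this text. A hedged reading that follows the surrounding definitions (α _t ⁱ scaling the shortcut path 26, β _t ⁱ scaling the time delay block 22, and the two summing to 1) is sketched below. The bracket notation for the value of f_i at time step t is introduced here for illustration and is not the specification's notation.

```latex
y^{i}_{t} \;=\; \alpha^{i}_{t}\, y^{i-1}_{t} \;+\; \beta^{i}_{t}\, \bigl[f_{i}(y^{i-1})\bigr]_{t},
\qquad \alpha^{i}_{t} + \beta^{i}_{t} = 1 .
```

With α _t ⁱ and β _t ⁱ both set to 1, this reduces to the unweighted residual sum of the basic model.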

このように、アテンションモジュール２８は、対応する残差ブロック２０Ａに与えられる入力が対応する複数の時間遅延層２４を通過して得られる結果出力と、当該対応する残差ブロック２０Ａに与えられる入力とに基づいて、タイムステップごとに重みα_ｔ ^ｉおよび重みβ_ｔ ^ｉ（スケールファクタ）を更新する。 In this way, the attention module 28 updates the weight α _t ⁱ and the weight β _t ⁱ (scale factors) at each time step, based on the output obtained when the input given to the corresponding residual block 20A passes through the corresponding plurality of time delay layers 24 and on the input given to that residual block 20A.

より具体的には、重みα_ｔ ^ｉおよび重みβ_ｔ ^ｉは、全結合層２８２およびｓｏｆｔｍａｘ関数２８４を用いて、以下の（９）式に従って算出される。 More specifically, the weight α _t ⁱ and the weight β _t ⁱ are calculated using the fully connected layer 282 and the softmax function 284 according to the following equation (9).
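Equation (9) is not reproduced in this text, so the following is only a minimal sketch of how a fully connected layer followed by a softmax could produce the two weights for one time step. Concatenating the two inputs, the layer shapes, the ordering of the logits, and all function and parameter names are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(block_out, shortcut, fc_weights, fc_bias):
    """Return (alpha, beta) for one time step.

    block_out: output of the time delay path, shortcut: input to the block.
    fc_weights has two rows (assumed: row 0 -> shortcut logit, row 1 -> time
    delay logit), each as long as the concatenated feature vector.
    """
    features = list(block_out) + list(shortcut)  # assumed: plain concatenation
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(fc_weights, fc_bias)
    ]
    alpha, beta = softmax(logits)  # softmax guarantees alpha + beta = 1
    return alpha, beta


def gated_output(block_out, shortcut, alpha, beta):
    """Weighted sum formed by the two multipliers and the adder."""
    return [alpha * s + beta * f for s, f in zip(shortcut, block_out)]


block_out, shortcut = [2.0, 4.0], [1.0, 1.0]
zero_fc = [[0.0] * 4, [0.0] * 4]
alpha, beta = attention_weights(block_out, shortcut, zero_fc, [0.0, 0.0])
print(alpha, beta, gated_output(block_out, shortcut, alpha, beta))
```

With all-zero layer parameters the two logits are equal, so both paths receive weight 0.5; a larger shortcut logit shifts the weight toward the shortcut path.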

［Ｄ．ハードウェア構成］ [D. Hardware configuration]
次に、本実施の形態に従う学習済モデルを用いた音声認識システムＳを実現するためのハードウェア構成の一例について説明する。 Next, an example of a hardware configuration for realizing speech recognition system S using a trained model according to this embodiment will be described.

図８は、本実施の形態に従う音声認識システムＳを実現するハードウェア構成の一例を示す模式図である。音声認識システムＳは、典型的には、コンピュータの一例である情報処理装置５００を用いて実現される。 FIG. 8 is a schematic diagram showing an example of a hardware configuration for realizing speech recognition system S according to this embodiment. The speech recognition system S is typically implemented using an information processing device 500, which is an example of a computer.

図８を参照して、音声認識システムＳを実現する情報処理装置５００は、主要なハードウェアコンポーネントとして、ＣＰＵ（central processing unit）５０２と、ＧＰＵ（graphics processing unit）５０４と、主メモリ５０６と、ディスプレイ５０８と、ネットワークインターフェイス（Ｉ／Ｆ：interface）５１０と、二次記憶装置５１２と、入力デバイス５２２と、光学ドライブ５２４とを含む。これらのコンポーネントは、内部バス５２８を介して互いに接続される。 Referring to FIG. 8, an information processing apparatus 500 realizing speech recognition system S includes, as main hardware components, a CPU (central processing unit) 502, a GPU (graphics processing unit) 504, a main memory 506, a display 508, a network interface (I/F) 510, a secondary storage device 512, an input device 522, and an optical drive 524. These components are connected to each other via an internal bus 528.

ＣＰＵ５０２および／またはＧＰＵ５０４は、後述するような各種プログラムを実行することで、本実施の形態に従う音声認識システムＳの実現に必要な処理を実行するプロセッサである。ＣＰＵ５０２およびＧＰＵ５０４は、複数個配置されてもよいし、複数のコアを有していてもよい。 CPU 502 and/or GPU 504 is a processor that executes various programs to be described later to perform processing necessary for realizing speech recognition system S according to the present embodiment. A plurality of CPUs 502 and GPUs 504 may be arranged, or may have a plurality of cores.

主メモリ５０６は、プロセッサ（ＣＰＵ５０２および／またはＧＰＵ５０４）が処理を実行するにあたって、プログラムコードやワークデータなどを一時的に格納（あるいは、キャッシュ）する記憶領域であり、例えば、ＤＲＡＭ（dynamic random access memory）やＳＲＡＭ（static random access memory）などの揮発性メモリデバイスなどで構成される。 The main memory 506 is a storage area that temporarily stores (or caches) program code, work data, and the like while the processor (CPU 502 and/or GPU 504) executes processing, and is configured by, for example, a volatile memory device such as a DRAM (dynamic random access memory) or an SRAM (static random access memory).

ディスプレイ５０８は、処理に係るユーザインターフェイスや処理結果などを出力する表示部であり、例えば、ＬＣＤ（liquid crystal display）や有機ＥＬ（electroluminescence）ディスプレイなどで構成される。 A display 508 is a display unit that outputs a user interface related to processing, processing results, and the like, and is configured by, for example, an LCD (liquid crystal display) or an organic EL (electroluminescence) display.

ネットワークインターフェイス５１０は、インターネット上またはイントラネット上の任意の情報処理装置などとの間でデータを遣り取りする。ネットワークインターフェイス５１０としては、例えば、イーサネット（登録商標）、無線ＬＡＮ（local area network）、Ｂｌｕｅｔｏｏｔｈ（登録商標）などの任意の通信方式を採用できる。 A network interface 510 exchanges data with any information processing device on the Internet or an intranet. As the network interface 510, for example, any communication method such as Ethernet (registered trademark), wireless LAN (local area network), Bluetooth (registered trademark), or the like can be adopted.

入力デバイス５２２は、ユーザからの指示や操作などを受付けるデバイスであり、例えば、キーボード、マウス、タッチパネル、ペンなどで構成される。また、入力デバイス５２２は、学習およびデコーディングに必要な音声信号を収集するための集音デバイスを含んでいてもよいし、集音デバイスにより収集された音声信号の入力を受付けるためのインターフェイスを含んでいてもよい。 The input device 522 is a device that receives instructions, operations, and the like from the user, and includes, for example, a keyboard, mouse, touch panel, pen, and the like. The input device 522 may also include a sound collecting device for collecting the audio signals needed for learning and decoding, or may include an interface for accepting input of audio signals collected by a sound collecting device.

光学ドライブ５２４は、ＣＤ－ＲＯＭ（compact disc read only memory）、ＤＶＤ（digital versatile disc）などの光学ディスク５２６に格納されている情報を読出して、内部バス５２８を介して他のコンポーネントへ出力する。光学ディスク５２６は、非一過的（non-transitory）な記録媒体の一例であり、任意のプログラムを不揮発的に格納した状態で流通する。光学ドライブ５２４が光学ディスク５２６からプログラムを読み出して、二次記憶装置５１２などにインストールすることで、コンピュータが情報処理装置５００として機能するようになる。したがって、本発明の主題は、二次記憶装置５１２などにインストールされたプログラム自体、または、本実施の形態に従う機能や処理を実現するためのプログラムを格納した光学ディスク５２６などの記録媒体でもあり得る。 The optical drive 524 reads information stored in an optical disc 526 such as a CD-ROM (compact disc read only memory) or DVD (digital versatile disc) and outputs it to other components via an internal bus 528 . The optical disc 526 is an example of a non-transitory recording medium, and is distributed in a state in which any program is stored in a non-volatile manner. The optical drive 524 reads the program from the optical disk 526 and installs it in the secondary storage device 512 or the like, so that the computer functions as the information processing device 500 . Therefore, the subject of the present invention can also be a program itself installed in secondary storage device 512 or the like, or a recording medium such as optical disc 526 storing a program for realizing the functions and processes according to the present embodiment. .

図８には、非一過的な記録媒体の一例として、光学ディスク５２６などの光学記録媒体を示すが、これに限らず、フラッシュメモリなどの半導体記録媒体、ハードディスクまたはストレージテープなどの磁気記録媒体、ＭＯ（magneto-optical disk）などの光磁気記録媒体を用いてもよい。 FIG. 8 shows an optical recording medium such as the optical disc 526 as an example of a non-transitory recording medium, but the medium is not limited to this; a semiconductor recording medium such as a flash memory, a magnetic recording medium such as a hard disk or storage tape, or a magneto-optical recording medium such as an MO (magneto-optical disk) may be used.

二次記憶装置５１２は、コンピュータを情報処理装置５００として機能させるために必要なプログラムおよびデータを格納する。例えば、ハードディスク、ＳＳＤ（solid state drive）などの不揮発性記憶装置で構成される。 Secondary storage device 512 stores programs and data necessary for the computer to function as information processing device 500 . For example, it is composed of a non-volatile storage device such as a hard disk or an SSD (solid state drive).

より具体的には、二次記憶装置５１２は、図示しないＯＳ（operating system）の他、学習処理を実現するためのトレーニングプログラム５１４と、学習対象のネットワーク構造を定義するモデル定義データ５１６と、学習済モデルを規定するためのネットワークパラメータ５１８と、トレーニングデータセット５２０とを格納している。 More specifically, the secondary storage device 512 stores, in addition to an OS (operating system) (not shown), a training program 514 for realizing the learning process, model definition data 516 that defines the network structure to be trained, network parameters 518 for defining the trained model, and a training data set 520.

トレーニングプログラム５１４は、プロセッサ（ＣＰＵ５０２および／またはＧＰＵ５０４）により実行されることで、ネットワークパラメータ５１８を決定するための学習処理を実現する。モデル定義データ５１６は、学習対象となる基本ＣＴＣベースドモデル１および改良ＣＴＣベースドモデル１Ａのネットワーク構造を構成するコンポーネントおよび接続関係などを定義するための情報を含む。ネットワークパラメータ５１８は、学習対象のモデル（ネットワーク）を構成する要素ごとのパラメータを含む。ネットワークパラメータ５１８に含まれる各パラメータの値は、トレーニングプログラム５１４の実行により最適化される。トレーニングデータセット５２０は、例えば、後述するようなＣＳＪに含まれるデータセットを用いることができる。例えば、学習対象の基本ＣＴＣベースドモデル１および改良ＣＴＣベースドモデル１Ａが音声認識タスクに向けられたものである場合には、トレーニングデータセット５２０は、講演などの音声信号と、当該音声信号に対応する発話内容を示す転記テキストとを含む。 The training program 514 is executed by the processor (CPU 502 and/or GPU 504) to implement a learning process for determining the network parameters 518. The model definition data 516 includes information for defining the components, connection relationships, and the like that constitute the network structures of the basic CTC-based model 1 and the improved CTC-based model 1A to be trained. The network parameters 518 include parameters for each element that constitutes the model (network) to be trained. The value of each parameter contained in the network parameters 518 is optimized by running the training program 514. As the training data set 520, for example, a data set included in CSJ, as described later, can be used. For example, if the basic CTC-based model 1 and the improved CTC-based model 1A to be trained are directed to a speech recognition task, the training data set 520 includes speech signals, such as lectures, and transcribed text indicating the utterance content corresponding to those speech signals.

プロセッサ（ＣＰＵ５０２および／またはＧＰＵ５０４）がプログラムを実行する際に必要となるライブラリや機能モジュールの一部を、ＯＳが標準で提供するライブラリまたは機能モジュールにより代替してもよい。この場合には、プログラム単体では、対応する機能を実現するために必要なプログラムモジュールのすべてを含むものにはならないが、ＯＳの実行環境下にインストールされることで、目的の処理を実現できる。このような一部のライブラリまたは機能モジュールを含まないプログラムであっても、本発明の技術的範囲に含まれ得る。 Some of the libraries and functional modules required when the processor (CPU 502 and/or GPU 504) executes programs may be replaced with libraries or functional modules provided as standard by the OS. In this case, the program alone does not include all of the program modules necessary to implement the corresponding functions, but the intended processing can be achieved by installing it under the execution environment of the OS. Even a program that does not include some of such libraries or functional modules can be included in the technical scope of the present invention.

また、これらのプログラムは、上述したようないずれかの記録媒体に格納されて流通するだけでなく、インターネットまたはイントラネットを介してサーバ装置などからダウンロードすることで配布されてもよい。 Moreover, these programs may be distributed not only by being stored on any of the recording media described above, but also by being downloaded from a server device or the like via the Internet or an intranet.

図８には、単一のコンピュータを用いて情報処理装置５００を構成する例を示すが、これに限らず、コンピュータネットワークを介して接続された複数のコンピュータが明示的または黙示的に連携して、情報処理装置５００および情報処理装置５００を含む音声認識システムＳを実現するようにしてもよい。 FIG. 8 shows an example in which the information processing apparatus 500 is configured using a single computer, but the configuration is not limited to this; a plurality of computers connected via a computer network may cooperate, explicitly or implicitly, to realize the information processing apparatus 500 and the speech recognition system S including the information processing apparatus 500.

プロセッサ（ＣＰＵ５０２および／またはＧＰＵ５０４）がプログラムを実行することで実現される機能の全部または一部を、集積回路などのハードワイヤード回路（hard-wired circuit）を用いて実現してもよい。例えば、ＡＳＩＣ（application specific integrated circuit）やＦＰＧＡ（field-programmable gate array）などを用いて実現してもよい。 All or part of the functions realized by the processor (CPU 502 and/or GPU 504) executing the program may be realized using a hard-wired circuit such as an integrated circuit. For example, it may be implemented using an ASIC (application specific integrated circuit), an FPGA (field-programmable gate array), or the like.

当業者であれば、本発明が実施される時代に応じた技術を適宜用いて、本実施の形態に従う情報処理装置５００を実現できるであろう。 A person skilled in the art will be able to realize information processing apparatus 500 according to the present embodiment by appropriately using technology suitable for the era in which the present invention is implemented.

説明の便宜上、同一の情報処理装置５００を用いて、学習（ＣＴＣベースドモデルの構築）およびデコーディング（ＣＴＣベースドモデルを含むモデルによる音声認識）を実行する例を示したが、学習およびデコーディングを異なるハードウェアを用いて実現してもよい。 For convenience of explanation, an example has been shown in which the same information processing apparatus 500 is used to perform both learning (construction of a CTC-based model) and decoding (speech recognition using a model including the CTC-based model), but learning and decoding may be implemented using different hardware.

［Ｅ．学習方法］ [E. Learning method]
次に、本実施の形態に従う改良ＣＴＣベースドモデル１Ａの学習方法について説明する。 Next, a learning method for improved CTC-based model 1A according to the present embodiment will be described.

（ｅ１：概要） (e1: Overview)
本実施の形態に従うＣＴＣベースドモデルは、Ｅ２Ｅフレームワークを提供するものであり、音響モデルおよび言語モデルを別々に学習する必要はない。すなわち、ＣＴＣベースドモデルは、入力される音声信号に対応するテキストを直接出力するものであり、学習処理においては、音声信号と対応するテキストとからなるトレーニングデータセットを用いる。 The CTC-based model according to this embodiment provides an E2E framework and does not require separate training of acoustic and language models. That is, the CTC-based model directly outputs the text corresponding to the input speech signal, and the learning process uses a training data set consisting of speech signals and the corresponding text.

本実施の形態に従うＣＴＣベースドモデルの学習処理は、ニューラルネットワークの一般的な学習処理と同様に、教師有り学習を用いることができる。具体的には、ＣＴＣベースドモデルを構成する各コンポーネントのパラメータに任意の初期値を設定する。その上で、トレーニングデータセットに含まれる音声信号（音響特徴シーケンス）をＣＴＣベースドモデルに順次入力するとともに、ＣＴＣベースドモデルから順次出力されるＣＴＣ出力シーケンス（テキスト）と入力された音声信号に対応するテキストとの誤差を算出し、その算出した誤差に基づいて、ＣＴＣベースドモデルを構成する各コンポーネントのパラメータを逐次更新する。 The learning process of the CTC-based model according to the present embodiment can use supervised learning in the same way as the general learning process of neural networks. Specifically, arbitrary initial values are set for the parameters of each component that constitutes the CTC-based model. Then, the speech signals (acoustic feature sequences) included in the training data set are sequentially input to the CTC-based model, the error between the CTC output sequence (text) sequentially output from the CTC-based model and the text corresponding to the input speech signal is calculated, and the parameters of each component constituting the CTC-based model are successively updated based on the calculated error.

このような学習処理によって、トレーニングデータからＣＴＣベースドモデルに対応する学習済モデルを構築できる。 Through such learning processing, a trained model corresponding to the CTC-based model can be constructed from the training data.

本実施の形態においては、改良ＣＴＣベースドモデル１Ａを適切に学習させることで、音声認識性能を改善することができる。上述したようなアテンションモジュール２８を含む改良ＣＴＣベースドモデル１Ａに特徴ベクトルを入力することで、任意のタイムステップｔにおける、それぞれのショートカット経路２６についてのスケールファクタ（α_ｔ ^１，α_ｔ ^２，…α_ｔ ^ｉ，…α_ｔ ^Ｎ）を取得できる。 In this embodiment, speech recognition performance can be improved by appropriately training the improved CTC-based model 1A. By inputting feature vectors into the improved CTC-based model 1A, which includes the attention modules 28 described above, the scale factors (α _t ¹ , α _t ² , . . . α _t ⁱ , . . . α _t ^N ) for the respective shortcut paths 26 at an arbitrary time step t can be obtained.

本願発明者らの研究によれば、それぞれの残差ブロック２０Ａにおいて、時間遅延ブロック２２をデータが通過する経路の重みと、ショートカット経路２６をデータが通過する経路の重みとは、適用されるシステムによって様々である。 According to research by the inventors of the present application, in each residual block 20A, the weight of the path through which data passes via the time delay block 22 and the weight of the path through which data passes via the shortcut path 26 vary depending on the system to which the model is applied.

そこで、本実施の形態においては、以下に示すような、再トレーニング法（Retrain-based method）、切り落とし法（Prune-based method）またはネットワーク再構成法という学習方法を採用できる。 Therefore, in this embodiment, a learning method such as a retrain-based method, a prune-based method, or a network reconstruction method can be employed as described below.

（ｅ２：再トレーニング法） (e2: retraining method)
再トレーニング法は、超深層畳み込みネットワークである改良ＣＴＣベースドモデル１Ａを規定するすべてのパラメータ（アテンションモジュール２８のパラメータも含む）を再度トレーニングする方法である。より具体的には、基本ＣＴＣベースドモデル１をトレーニングすることで学習済モデルを取得し、この取得された学習済モデルに対して、アテンションモジュール２８を付加して改良ＣＴＣベースドモデル１Ａを構成した上で、再度トレーニングを実行する。 The retraining method is a method of retraining all parameters (including the parameters of the attention module 28) that define the improved CTC-based model 1A, which is an ultra-deep convolutional network. More specifically, a trained model is obtained by training the basic CTC-based model 1, the attention module 28 is added to this trained model to configure the improved CTC-based model 1A, and training is then executed again.

基本ＣＴＣベースドモデル１および改良ＣＴＣベースドモデル１Ａの両方をトレーニングしなければならないので、トレーニングに要する時間は約２倍になるが、音声認識性能を確実に向上させることができる。 Since both the basic CTC-based model 1 and the improved CTC-based model 1A have to be trained, the time required for training is approximately doubled, but the speech recognition performance can be definitely improved.

なお、スケールファクタ（α_ｔ ^１，α_ｔ ^２，…α_ｔ ^ｉ，…α_ｔ ^Ｎ）は、タイムステップｔごとに変化することになる。 Note that the scale factors (α _t ¹ , α _t ² , . . . α _t ⁱ , . . . α _t ^N ) change at each time step t.

図９は、本実施の形態に従う改良ＣＴＣベースドモデル１Ａの学習方法（再トレーニング法）の処理手順を示すフローチャートである。図９に示す各ステップは、典型的には、情報処理装置５００のプロセッサ（ＣＰＵ５０２および／またはＧＰＵ５０４）がトレーニングプログラム５１４を実行することで実現される。 FIG. 9 is a flow chart showing a processing procedure of a learning method (retraining method) for improved CTC-based model 1A according to the present embodiment. Each step shown in FIG. 9 is typically implemented by the processor (CPU 502 and/or GPU 504 ) of information processing apparatus 500 executing training program 514 .

図９を参照して、情報処理装置５００には、音声信号４２と対応するテキスト４４とからなるトレーニングデータセット４０が入力される（ステップＳ１００）。情報処理装置５００は、基本ＣＴＣベースドモデル１を規定するパラメータの初期値をランダムに決定する（ステップＳ１０２）。 Referring to FIG. 9, information processing apparatus 500 is supplied with training data set 40 including speech signal 42 and corresponding text 44 (step S100). The information processing apparatus 500 randomly determines initial values of parameters that define the basic CTC based model 1 (step S102).

情報処理装置５００は、トレーニングデータセット４０に含まれる音声信号４２からフレームごとに特徴ベクトルを生成する（ステップＳ１０４）。そして、情報処理装置５００は、生成した特徴ベクトルを基本ＣＴＣベースドモデル１に入力して推定結果を算出する（ステップＳ１０６）。 The information processing device 500 generates a feature vector for each frame from the speech signal 42 included in the training data set 40 (step S104). Then, the information processing apparatus 500 inputs the generated feature vector to the basic CTC-based model 1 to calculate an estimation result (step S106).

情報処理装置５００は、算出された推定結果が予め定められた数に到達したか否かを判断する（ステップＳ１０８）。算出された推定結果が予め定められた数に到達していなければ（ステップＳ１０８においてＮＯ）、ステップＳ１０４以下の処理が繰返される。 The information processing device 500 determines whether or not the calculated estimation result has reached a predetermined number (step S108). If the calculated estimation result has not reached the predetermined number (NO in step S108), the processing from step S104 onward is repeated.

算出された推定結果が予め定められた数に到達していれば（ステップＳ１０８においてＹＥＳ）、情報処理装置５００は、算出された一連の推定結果（出力シーケンス）と対応するテキスト４４（ラベルシーケンス）との間の誤差に基づいて、学習処理の収束条件が満たされているか否かを判断する（ステップＳ１１０）。 If the number of calculated estimation results has reached the predetermined number (YES in step S108), the information processing apparatus 500 determines whether or not the convergence condition of the learning process is satisfied, based on the error between the series of calculated estimation results (output sequence) and the corresponding text 44 (label sequence) (step S110).

学習処理の収束条件が満たされていなければ（ステップＳ１１０においてＮＯ）、情報処理装置５００は、ミニバッチとしてまとめて算出された一連の推定結果（出力シーケンス）と対応するテキスト４４（ラベルシーケンス）との間の誤差に基づいて、基本ＣＴＣベースドモデル１を規定するパラメータの値を更新し（ステップＳ１１２）、ステップＳ１０４以下の処理を繰返す。 If the convergence condition of the learning process is not satisfied (NO in step S110), the information processing device 500 compares a series of estimation results (output sequence) collectively calculated as a mini-batch and the corresponding text 44 (label sequence). Based on the error between the two, the values of the parameters defining the basic CTC based model 1 are updated (step S112), and the processing from step S104 onward is repeated.

これに対して、学習処理の収束条件が満たされていれば（ステップＳ１１０においてＹＥＳ）、現在のパラメータを学習結果として出力する（ステップＳ１１４）。すなわち、現在のパラメータにより規定される基本ＣＴＣベースドモデル１が学習済モデルとして出力される。 On the other hand, if the convergence condition of the learning process is satisfied (YES in step S110), the current parameters are output as the learning result (step S114). That is, the basic CTC-based model 1 defined by the current parameters is output as a trained model.

上述のステップＳ１００～Ｓ１１４において、情報処理装置５００は、トレーニングデータセット４０を用いて基本ＣＴＣベースドモデル１（識別器）のネットワークを規定するパラメータを決定する第１のトレーニングステップを実行する。 In the steps S100 to S114 described above, the information processing device 500 performs a first training step of determining the parameters defining the network of the basic CTC-based model 1 (classifier) using the training data set 40 .

続いて、情報処理装置５００は、学習済の基本ＣＴＣベースドモデル１に対してアテンションモジュール２８を付加して改良ＣＴＣベースドモデル１Ａを生成する（ステップＳ１１６）。すなわち、情報処理装置５００は、基本ＣＴＣベースドモデル１（識別器）に、複数の時間遅延層２４を通過する経路とショートカット経路２６との間の重みを調整するアテンションモジュール２８を付加する付加ステップを実行する。 Subsequently, the information processing apparatus 500 adds the attention module 28 to the trained basic CTC-based model 1 to generate the improved CTC-based model 1A (step S116). That is, the information processing apparatus 500 executes an addition step of adding, to the basic CTC-based model 1 (discriminator), the attention module 28 that adjusts the weights between the path passing through the plurality of time delay layers 24 and the shortcut path 26.

情報処理装置５００は、改良ＣＴＣベースドモデル１Ａに付加されたアテンションモジュール２８のパラメータの初期値をランダムに決定する（ステップＳ１１８）。そして、再度トレーニングを開始する。 The information processing apparatus 500 randomly determines initial values of the parameters of the attention module 28 added to the improved CTC based model 1A (step S118). Then start training again.

具体的には、情報処理装置５００は、トレーニングデータセット４０に含まれる音声信号４２からフレームごとに特徴ベクトルを生成する（ステップＳ１２０）。そして、情報処理装置５００は、生成した特徴ベクトルを改良ＣＴＣベースドモデル１Ａに入力して推定結果を算出する（ステップＳ１２２）。 Specifically, the information processing device 500 generates a feature vector for each frame from the speech signal 42 included in the training data set 40 (step S120). The information processing apparatus 500 then inputs the generated feature vector to the improved CTC-based model 1A to calculate an estimation result (step S122).

情報処理装置５００は、算出された推定結果が予め定められた数に到達したか否かを判断する（ステップＳ１２４）。算出された推定結果が予め定められた数に到達していなければ（ステップＳ１２４においてＮＯ）、ステップＳ１２０以下の処理が繰返される。 The information processing device 500 determines whether or not the calculated estimation result has reached a predetermined number (step S124). If the calculated estimation result has not reached the predetermined number (NO in step S124), the processing from step S120 onward is repeated.

算出された推定結果が予め定められた数に到達していれば（ステップＳ１２４においてＹＥＳ）、情報処理装置５００は、算出された一連の推定結果（出力シーケンス）と対応するテキスト４４（ラベルシーケンス）との間の誤差に基づいて、学習処理の収束条件が満たされているか否かを判断する（ステップＳ１２６）。 If the number of calculated estimation results has reached the predetermined number (YES in step S124), the information processing apparatus 500 determines whether or not the convergence condition of the learning process is satisfied, based on the error between the series of calculated estimation results (output sequence) and the corresponding text 44 (label sequence) (step S126).

学習処理の収束条件が満たされていなければ（ステップＳ１２６においてＮＯ）、情報処理装置５００は、ミニバッチとしてまとめて算出された一連の推定結果（出力シーケンス）と対応するテキスト４４（ラベルシーケンス）との間の誤差に基づいて、改良ＣＴＣベースドモデル１Ａを規定するパラメータの値を更新し（ステップＳ１２８）、ステップＳ１２０以下の処理を繰返す。 If the convergence condition of the learning process is not satisfied (NO in step S126), the information processing device 500 compares a series of estimation results (output sequence) collectively calculated as a mini-batch and the corresponding text 44 (label sequence). Based on the error between them, the values of the parameters that define the improved CTC-based model 1A are updated (step S128), and the processes from step S120 onward are repeated.

これに対して、学習処理の収束条件が満たされていれば（ステップＳ１２６においてＹＥＳ）、現在のパラメータを学習結果として出力する（ステップＳ１３０）。すなわち、現在のパラメータにより規定される改良ＣＴＣベースドモデル１Ａが学習済モデルとして出力される。そして、処理は終了する。 On the other hand, if the convergence condition of the learning process is satisfied (YES in step S126), the current parameters are output as the learning result (step S130). That is, the improved CTC-based model 1A defined by the current parameters is output as a trained model. Then the process ends.

上述のステップＳ１１８～Ｓ１３０において、情報処理装置５００は、トレーニングデータセット４０を用いてアテンションモジュール２８を規定するパラメータを決定する第２のトレーニングステップを実行する。この第２のトレーニングステップにおいて、情報処理装置５００は、アテンションモジュール２８を規定するパラメータを含む、改良ＣＴＣベースドモデル１Ａ（識別器）のネットワークを規定するすべてのパラメータの値を再度決定することになる。 In steps S118 to S130 described above, the information processing apparatus 500 executes a second training step of determining the parameters defining the attention module 28 using the training data set 40. In this second training step, the information processing apparatus 500 determines again the values of all the parameters defining the network of the improved CTC-based model 1A (discriminator), including the parameters defining the attention module 28.

このようにパラメータ全体の決定処理を２回実行することで、高い識別精度を実現できる。 By executing the process of determining all parameters twice in this manner, high identification accuracy can be achieved.
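The two-stage flow of the retraining method can be sketched on a toy problem. This is not the patented network: the "model" is a single scalar residual block, the gate is a softmax over two free logits rather than an input-dependent attention module, and the gradient is a finite-difference estimate. All names and data values are hypothetical.

```python
import math
import random

DATA = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # toy (input, target) pairs


def predict(params, x):
    """Toy residual block: shortcut x plus transform w * x, optionally gated."""
    if "g_short" in params:  # improved model: softmax over two gate logits
        e_s, e_f = math.exp(params["g_short"]), math.exp(params["g_trans"])
        alpha, beta = e_s / (e_s + e_f), e_f / (e_s + e_f)
        return alpha * x + beta * params["w"] * x
    return x + params["w"] * x  # basic model: plain residual sum


def loss(params, data):
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)


def train(params, trainable, data, lr=0.05, tol=1e-8, max_steps=2000, h=1e-5):
    """Gradient descent on the listed parameters until the loss converges."""
    params = dict(params)
    for _ in range(max_steps):
        if loss(params, data) < tol:  # convergence check (cf. S110 / S126)
            break
        grads = {}
        for name in trainable:  # central-difference gradient estimate
            up = {**params, name: params[name] + h}
            down = {**params, name: params[name] - h}
            grads[name] = (loss(up, data) - loss(down, data)) / (2 * h)
        for name in trainable:  # parameter update (cf. S112 / S128)
            params[name] -= lr * grads[name]
    return params


rng = random.Random(0)
# First training step: random initial value, then train the basic model.
base = train({"w": rng.uniform(-0.1, 0.1)}, ["w"], DATA)
# Addition step: attach gate parameters with random initial values.
improved_init = dict(base, g_short=rng.uniform(-0.1, 0.1), g_trans=rng.uniform(-0.1, 0.1))
# Second training step: retrain every parameter, including the base weight.
improved = train(improved_init, ["w", "g_short", "g_trans"], DATA)
print(loss(improved_init, DATA), loss(improved, DATA))
```

Passing all parameter names as `trainable` in the second call is what distinguishes this flow from a variant that keeps the first-stage parameters fixed.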

（ｅ３：切り落とし法） (e3: clipping method)
上述の再トレーニング法では、改良ＣＴＣベースドモデル１Ａを規定するすべてのパラメータ（アテンションモジュール２８のパラメータも含む）を更新対象としたが、切り落とし法では、生成された学習済モデルのパラメータについては固定した上で、より少ないトレーニングデータでアテンションモジュール２８のパラメータのみをトレーニングするようにしてもよい。 In the retraining method described above, all parameters that define the improved CTC-based model 1A (including the parameters of the attention module 28) are updated. In the clipping method (prune-based method), by contrast, the parameters of the generated trained model may be fixed, and only the parameters of the attention module 28 may be trained with a smaller amount of training data.

アテンションモジュール２８のパラメータのみをトレーニングすることで、学習処理に要する時間を短縮できる。 By training only the parameters of the attention module 28, the time required for the learning process can be shortened.

図１０は、本実施の形態に従う改良ＣＴＣベースドモデル１Ａの学習方法（切り落とし法）の処理手順を示すフローチャートである。図１０に示す各ステップは、典型的には、情報処理装置５００のプロセッサ（ＣＰＵ５０２および／またはＧＰＵ５０４）がトレーニングプログラム５１４を実行することで実現される。図１０に示す処理のうち、図９に示す処理と同一のものは、同一のステップ番号を付している。 FIG. 10 is a flow chart showing a processing procedure of a learning method (clipping method) for improved CTC-based model 1A according to the present embodiment. Each step shown in FIG. 10 is typically implemented by the processor (CPU 502 and/or GPU 504 ) of information processing apparatus 500 executing training program 514 . Among the processes shown in FIG. 10, the same steps as those shown in FIG. 9 are given the same step numbers.

図１０を参照して、情報処理装置５００には、音声信号４２と対応するテキスト４４とからなるトレーニングデータセット４０が入力される（ステップＳ１００）。情報処理装置５００は、基本ＣＴＣベースドモデル１を規定するパラメータの初期値をランダムに決定する（ステップＳ１０２）。 Referring to FIG. 10, information processing apparatus 500 is supplied with training data set 40 including speech signal 42 and corresponding text 44 (step S100). The information processing apparatus 500 randomly determines initial values of parameters that define the basic CTC based model 1 (step S102).

学習処理の収束条件が満たされていなければ（ステップＳ１１０においてＮＯ）、情報処理装置５００は、算出された一連の推定結果（出力シーケンス）と対応するテキスト４４（ラベルシーケンス）との間の誤差に基づいて、基本ＣＴＣベースドモデル１を規定するパラメータの値を更新し（ステップＳ１１２）、ステップＳ１０４以下の処理を繰返す。 If the convergence condition of the learning process is not satisfied (NO in step S110), the information processing device 500 determines the error between the calculated series of estimation results (output sequence) and the corresponding text 44 (label sequence). Based on this, the values of the parameters that define the basic CTC-based model 1 are updated (step S112), and the processing from step S104 onward is repeated.

情報処理装置５００は、改良ＣＴＣベースドモデル１Ａに付加されたアテンションモジュール２８のパラメータの初期値をランダムに決定する（ステップＳ１１８）。そして、アテンションモジュール２８に対するトレーニングを開始する。 The information processing apparatus 500 randomly determines initial values of the parameters of the attention module 28 added to the improved CTC based model 1A (step S118). Then, training for the attention module 28 is started.

具体的には、情報処理装置５００は、基本ＣＴＣベースドモデル１のトレーニングに用いたトレーニングデータセット４０の一部からなる縮小トレーニングデータセット４０Ｓを取得する（ステップＳ１１９）。そして、情報処理装置５００は、縮小トレーニングデータセット４０Ｓに含まれる音声信号４２からフレームごとに特徴ベクトルを生成する（ステップＳ１２１）。そして、情報処理装置５００は、生成した特徴ベクトルを改良ＣＴＣベースドモデル１Ａに入力して推定結果を算出する（ステップＳ１２２）。 Specifically, the information processing apparatus 500 acquires a reduced training data set 40S that is part of the training data set 40 used for training the basic CTC-based model 1 (step S119). The information processing device 500 then generates a feature vector for each frame from the speech signal 42 included in the reduced training data set 40S (step S121). The information processing apparatus 500 then inputs the generated feature vector to the improved CTC-based model 1A to calculate an estimation result (step S122).

学習処理の収束条件が満たされていなければ（ステップＳ１２６においてＮＯ）、情報処理装置５００は、算出された一連の推定結果（出力シーケンス）と対応するテキスト４４（ラベルシーケンス）との間の誤差に基づいて、改良ＣＴＣベースドモデル１Ａに含まれるアテンションモジュール２８を規定するパラメータの値を更新し（ステップＳ１２９）、ステップＳ１２０以下の処理を繰返す。 If the convergence condition of the learning process is not satisfied (NO in step S126), the information processing apparatus 500 updates the values of the parameters that define the attention module 28 included in the improved CTC-based model 1A based on the error between the calculated series of estimation results (output sequence) and the corresponding text 44 (label sequence) (step S129), and repeats the processing from step S120 onward.

上述のステップＳ１１８～Ｓ１３０において、情報処理装置５００は、トレーニングデータセット４０を用いてアテンションモジュール２８を規定するパラメータを決定する第２のトレーニングステップを実行する。この第２のトレーニングステップにおいて、情報処理装置５００は、第１のトレーニングステップにおいて決定されたパラメータ（基本ＣＴＣベースドモデル１を規定するパラメータ）を固定した状態で、アテンションモジュール２８を規定するパラメータのみを決定する処理を実行する。 In steps S118 to S130 described above, the information processing apparatus 500 executes a second training step of determining the parameters that define the attention module 28 using the training data set 40. In this second training step, the information processing apparatus 500 determines only the parameters that define the attention module 28 while keeping fixed the parameters determined in the first training step (the parameters that define the basic CTC-based model 1).

このように、基本ＣＴＣベースドモデル１を規定するパラメータを固定することで、第２のトレーニングを短時間で実現できる。 By fixing the parameters defining the basic CTC-based model 1 in this way, the second training can be achieved in a short time.
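The two-stage procedure above can be illustrated with a minimal sketch. The parameter names, numeric values, and the plain gradient step are all hypothetical and serve only to show the idea of freezing the base-model parameters while the newly added attention parameters are updated; the actual optimizer and parameter layout are not specified at this point in the description.

```python
from dataclasses import dataclass


@dataclass
class Param:
    value: float
    trainable: bool = True


def sgd_step(params, grads, lr=0.1):
    """Apply one gradient step, skipping parameters marked as frozen."""
    for name, p in params.items():
        if p.trainable:
            p.value -= lr * grads[name]


# Result of the first training step: base-model parameters (made-up values).
params = {
    "base.tdnn.w": Param(0.8),
    "base.fc.w": Param(-0.3),
}

# Second training step: freeze the base model, then add the attention
# parameters with a (here fixed, normally random) initial value.
for p in params.values():
    p.trainable = False
params["attention.fc.w"] = Param(0.05)

# Hypothetical gradients; only the attention parameter should move.
grads = {"base.tdnn.w": 1.0, "base.fc.w": 1.0, "attention.fc.w": 1.0}
sgd_step(params, grads)
```

Because the frozen parameters are skipped in every update, each iteration touches far fewer values, which is one way to read the statement that the second training can be completed in a short time.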

（ｅ４：ネットワーク再構成法） (e4: network reconfiguration method)
本実施の形態に従う改良ＣＴＣベースドモデル１Ａのアテンションモジュール２８が示す各残差ブロック２０Ａのアテンションスコアあるいはスケールファクタ（重み）の平均値は、データを伝達可能なすべての経路からの情報を示すことになる。 The average value of the attention score, or scale factor (weight), of each residual block 20A indicated by the attention module 28 of the improved CTC-based model 1A according to the present embodiment reflects information from all paths capable of transmitting data.

図１１は、本実施の形態に従う改良ＣＴＣベースドモデル１Ａにおけるデータ伝達の分布例を示す図である。図１１に示す例では、１番目の残差ブロック２０Ａにおいては、データはショートカット経路２６を主体的に通過し、２番目以降の残差ブロック２０Ａにおいては、データは時間遅延ブロック２２を主体的に通過していることが分かる。 FIG. 11 is a diagram showing an example distribution of data transmission in the improved CTC-based model 1A according to the present embodiment. In the example shown in FIG. 11, it can be seen that data mainly passes through the shortcut path 26 in the first residual block 20A, and mainly passes through the time delay block 22 in the second and subsequent residual blocks 20A.

このような改良ＣＴＣベースドモデル１Ａにおけるデータ伝達の状態を事前知識として利用することで、改良ＣＴＣベースドモデル１Ａのネットワーク構造を改良し得る。ネットワーク構造の改良によって、音声認識性能も向上させることができる。 The network structure of the improved CTC-based model 1A can be improved by using the state of data transmission in the improved CTC-based model 1A as prior knowledge. Improvements in the network structure can also improve speech recognition performance.

例えば、ｉ番目の残差ブロック２０Ａについての重みα_ｔ ^ｉを２値化（「０」または「１」）することで、ネットワーク構造自体をチューニングできる。すなわち、重みα_ｔ ^ｉが「１」であれば、対応する残差ブロック２０Ａの時間遅延ブロック２２にはデータが伝達されないので、時間遅延ブロック２２を削除してもよいと判断できる。一方、重みα_ｔ ^ｉが「０」であれば、ショートカット経路２６を削除してもよいと判断できる。 For example, the network structure itself can be tuned by binarizing the weight α_t^i for the i-th residual block 20A (to "0" or "1"). That is, if the weight α_t^i is "1", no data is transmitted to the time delay block 22 of the corresponding residual block 20A, so it can be determined that the time delay block 22 may be deleted. Conversely, if the weight α_t^i is "0", it can be determined that the shortcut path 26 may be deleted.

このような重みα_ｔ ^ｉを２値化する方法としては、トレーニングデータセットの一部からなるデータセットを用いて順次生成される特徴ベクトルに対して、タイムステップごとに算出される重みα_ｔ ^ｉの代表値（平均値、最大値、最小値、中間値など）が予め定められたしきい値を超えるか否かに基づいて決定できる。重みα_ｔ ^ｉの二値化に用いられるデータセットは、例えば、後述するような開発データセット（ＣＳＪ－Ｄｅｖ）の音声部分を用いることができる。 As a method of binarizing such a weight α_t^i, the decision can be made based on whether a representative value (mean, maximum, minimum, median, etc.) of the weights α_t^i calculated at each time step, for feature vectors sequentially generated from a data set that is a part of the training data set, exceeds a predetermined threshold. The data set used for binarizing the weight α_t^i can be, for example, the speech portion of the development data set (CSJ-Dev) described later.
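The binarization rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the default threshold of 0.5 (a value that appears later in the description as a typical threshold), and the use of a plain list of per-time-step weights are assumptions.

```python
from statistics import mean, median

# Representative values named in the description: mean, max, min, median.
REPRESENTATIVES = {"mean": mean, "max": max, "min": min, "median": median}


def binarize_weight(alphas, threshold=0.5, representative="mean"):
    """Return 1 if the representative value of the per-time-step weights
    exceeds the threshold, otherwise 0."""
    value = REPRESENTATIVES[representative](alphas)
    return 1 if value > threshold else 0


def prunable_path(binary_weight):
    """Weight 1: the time delay block carries no data and may be deleted.
    Weight 0: the shortcut path may be deleted."""
    return "time_delay_block" if binary_weight == 1 else "shortcut_path"
```

For example, a trace such as `[0.9, 0.8, 0.7]` binarizes to 1 under the mean, which marks the time delay block of that residual block as a deletion candidate.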

図１２は、本実施の形態に従う改良ＣＴＣベースドモデル１Ａの学習方法（ネットワーク再構成法）の処理手順を説明するための図である。まず、図１２（Ａ）に示すように、基本ＣＴＣベースドモデル１を通常のトレーニングデータセットを用いてトレーニングする。続いて、図１２（Ｂ）に示すように、基本ＣＴＣベースドモデル１にアテンションモジュール２８を追加して、通常のトレーニングの一部からなる縮小トレーニングデータセットを用いてアテンションモジュール２８をトレーニングする。 FIG. 12 is a diagram for explaining the processing procedure of the learning method (network reconfiguration method) for the improved CTC-based model 1A according to the present embodiment. First, as shown in FIG. 12(A), the basic CTC-based model 1 is trained using the normal training data set. Then, as shown in FIG. 12(B), the attention module 28 is added to the basic CTC-based model 1, and the attention module 28 is trained using a reduced training data set consisting of a part of the normal training data set.

基本ＣＴＣベースドモデル１に対応するパラメータおよびアテンションモジュール２８に対応するパラメータをトレーニングした後、改良ＣＴＣベースドモデル１Ａに対して、開発データセットなどの音声部分から生成される特徴ベクトルを入力し、各残差ブロック２０Ａにおけるスケールファクタの時間的変化を算出する。各残差ブロック２０Ａにおいては、次の図１３に示すような時間的変化を算出できる。 After the parameters corresponding to the basic CTC-based model 1 and the parameters corresponding to the attention module 28 have been trained, feature vectors generated from the speech portion of, for example, the development data set are input to the improved CTC-based model 1A, and the temporal change of the scale factor in each residual block 20A is calculated. For each residual block 20A, a temporal change such as that shown in FIG. 13 can be calculated.

図１３は、本実施の形態に従う改良ＣＴＣベースドモデル１Ａを用いて算出されるスケールファクタの時間的変化の一例を示す図である。図１３に示すスケールファクタである重みα_ｔ ^１の値は、入力される音節ごとに大きく変化している。 FIG. 13 is a diagram showing an example of the temporal change of the scale factor calculated using the improved CTC-based model 1A according to the present embodiment. The value of the weight α_t^1, which is the scale factor shown in FIG. 13, varies greatly from one input syllable to the next.

各残差ブロック２０Ａについて算出されるスケールファクタの時間的変化に基づいて、各残差ブロック２０Ａにおけるデータの伝達状態を評価する。このデータの伝達状態は、各残差ブロック２０Ａにおける安定度に対応していると考えることもできる。そして、対応するスケールファクタの時間的変化が予め定められた条件を満たした残差ブロック２０Ａについては、図１２（Ｃ）に示すように、改良ＣＴＣベースドモデル１Ａから削除される。 Based on the temporal change in the scale factor calculated for each residual block 20A, the data transfer state in each residual block 20A is evaluated. This data transmission state can also be considered to correspond to the stability in each residual block 20A. Residual blocks 20A whose corresponding scale factor changes over time satisfy a predetermined condition are deleted from the improved CTC-based model 1A, as shown in FIG. 12(C).

最終的に、状況に応じていくつかの時間遅延ブロック２２が削除された後の改良ＣＴＣベースドモデル１Ａを規定するすべてのパラメータ（アテンションモジュール２８のパラメータも含む）を再度のトレーニングにより決定する。 Finally, all the parameters (including the attention module 28 parameters) that define the improved CTC-based model 1A after some time delay blocks 22 have been removed, depending on the situation, are determined by retraining.

このように、アテンションモジュール２８が付加された改良ＣＴＣベースドモデル１Ａ（識別器）に入力信号を与えることで、アテンションモジュール２８により算出されるスケールファクタである重みα_ｔ ^１の値の変化に基づいて、複数の時間遅延層２４の一部を削除する処理を実行してもよい。 In this way, a process may be executed in which an input signal is given to the improved CTC-based model 1A (discriminator) to which the attention module 28 has been added, and some of the plurality of time delay layers 24 are deleted based on changes in the value of the weight α_t^1, the scale factor calculated by the attention module 28.

ここで、時間遅延ブロック２２の各々を削除すべきか否かの条件としては、ショートカット経路２６についてのスケールファクタである重みα_ｔ ^ｉの絶対値が相対的に大きい場合、あるいは、値のバラツキが相対的に大きい場合などが挙げられる。すなわち、対象となる音声信号に対して、ショートカット経路２６を通過するデータが相対的に大きい、あるいは、ショートカット経路２６を通過するデータ量の変動が相対的に大きい場合には、残差ブロック２０Ａの安定性が低いことを意味し、このような安定性の低い残差ブロック２０Ａについては削除することで、学習およびデコーディングをより安定化できる。 Here, conditions for deciding whether each time delay block 22 should be deleted include the case where the absolute value of the weight α_t^i, which is the scale factor for the shortcut path 26, is relatively large, and the case where the variation in its value is relatively large. That is, when, for the target speech signal, the amount of data passing through the shortcut path 26 is relatively large, or the fluctuation in the amount of data passing through the shortcut path 26 is relatively large, the residual block 20A has low stability; deleting such low-stability residual blocks 20A makes learning and decoding more stable.

時間遅延ブロック２２を削除するか否かの具体的な条件としては、以下のようなものが挙げられる。 Specific conditions for whether or not to delete the time delay block 22 include the following.

（１）特定の音声入力について、重みα_ｔ ^ｉ（スケールファクタ）の値が予め定められたしきい値（典型的には、「０．５」）を超える数（あるいは、現れているピーク）が予め定められた数以上である場合。 (1) When, for a specific speech input, the number of times the value of the weight α_t^i (scale factor) exceeds a predetermined threshold (typically 0.5), or the number of peaks that appear, is greater than or equal to a predetermined number.

（２）特定の音声入力に含まれるラベル（単音、文字、音節などの単位）に対して、重みα_ｔ ^ｉ（スケールファクタ）の値が予め定められたしきい値（典型的には、「０．５」）を超える数の比率が予め定められた数（例えば、３０％）以上である場合。 (2) When, among the labels (units such as phones, characters, or syllables) contained in a specific speech input, the proportion for which the value of the weight α_t^i (scale factor) exceeds a predetermined threshold (typically 0.5) is greater than or equal to a predetermined value (for example, 30%).

（３）重みα_ｔ ^ｉ（スケールファクタ）の時間的変化が示すグラフの面積が予め定められたしきい値以上である場合。 (3) When the area under the curve of the temporal change of the weight α_t^i (scale factor) is greater than or equal to a predetermined threshold.

（４）特定の音声入力について、重みα_ｔ ^ｉ（スケールファクタ）の時間変動の変動幅（標準偏差、分散、最大値と最小値との差）が予め定められたしきい値以上である場合。 (4) When, for a specific speech input, the range of temporal variation of the weight α_t^i (scale factor) (standard deviation, variance, or difference between the maximum and minimum values) is greater than or equal to a predetermined threshold.

上述した以外の任意の判断基準を用いることができる。 Any criteria other than those mentioned above can also be used.
以上のような手順によって、ネットワーク構造を最適化した上で、学習処理を実行することになる。 Through the procedure described above, the learning process is executed after the network structure has been optimized.
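The four deletion conditions (1) to (4) listed above can each be written as a small predicate over a trace of scale-factor values. The sketch below is an illustration under assumptions: the function names are hypothetical, condition (2) is modeled with one weight value per label, the area in condition (3) is approximated with a rectangle rule, and every threshold other than 0.5 and 30% is left to the caller because the description does not fix them.

```python
from statistics import pstdev, pvariance


def count_over(alphas, threshold=0.5):
    """Number of values in the trace that exceed the threshold."""
    return sum(1 for a in alphas if a > threshold)


def meets_count_criterion(alphas, min_count, threshold=0.5):
    """Condition (1): the threshold is exceeded at least min_count times."""
    return count_over(alphas, threshold) >= min_count


def meets_ratio_criterion(label_alphas, min_ratio=0.3, threshold=0.5):
    """Condition (2): share of labels whose weight exceeds the threshold.
    Assumes one weight value per label."""
    return count_over(label_alphas, threshold) / len(label_alphas) >= min_ratio


def meets_area_criterion(alphas, min_area, frame_step=1.0):
    """Condition (3): area under the trace (rectangle-rule approximation)."""
    return sum(alphas) * frame_step >= min_area


def meets_variation_criterion(alphas, min_width, measure="range"):
    """Condition (4): variation width as range, standard deviation, or variance."""
    if measure == "range":
        width = max(alphas) - min(alphas)
    elif measure == "std":
        width = pstdev(alphas)
    elif measure == "variance":
        width = pvariance(alphas)
    else:
        raise ValueError(f"unknown measure: {measure}")
    return width >= min_width
```

Keeping each condition as a separate predicate makes it straightforward to swap in other criteria, which the description explicitly allows.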

図１４は、本実施の形態に従う改良ＣＴＣベースドモデル１Ａの学習方法（ネットワーク再構成法）の処理手順を示すフローチャートである。図１４に示す各ステップは、典型的には、情報処理装置５００のプロセッサ（ＣＰＵ５０２および／またはＧＰＵ５０４）がトレーニングプログラム５１４を実行することで実現される。 FIG. 14 is a flow chart showing a processing procedure of a learning method (network reconstruction method) for improved CTC-based model 1A according to the present embodiment. Each step shown in FIG. 14 is typically implemented by the processor (CPU 502 and/or GPU 504 ) of information processing apparatus 500 executing training program 514 .

図１４を参照して、情報処理装置５００は、トレーニングデータセット４０を用いて、基本ＣＴＣベースドモデル１のパラメータを決定する（ステップＳ１５０）。このステップＳ１５０の処理は、図９に示す再トレーニング法のステップＳ１００～Ｓ１１４と実質的に同一である。 Referring to FIG. 14, the information processing apparatus 500 determines the parameters of the basic CTC-based model 1 using the training data set 40 (step S150). The processing of step S150 is substantially the same as steps S100 to S114 of the retraining method shown in FIG. 9.

続いて、情報処理装置５００は、学習済の基本ＣＴＣベースドモデル１に対してアテンションモジュール２８を付加して改良ＣＴＣベースドモデル１Ａを生成する（ステップＳ１５２）。そして、情報処理装置５００は、改良ＣＴＣベースドモデル１Ａに付加されたアテンションモジュール２８のパラメータを決定する（ステップＳ１５４）。このステップＳ１５４の処理は、図１０に示す切り落とし法のステップＳ１１８～Ｓ１３０の処理と実質的に同一である。 Next, the information processing apparatus 500 adds the attention module 28 to the trained basic CTC-based model 1 to generate the improved CTC-based model 1A (step S152). The information processing apparatus 500 then determines the parameters of the attention module 28 added to the improved CTC-based model 1A (step S154). The processing of step S154 is substantially the same as the processing of steps S118 to S130 of the pruning method shown in FIG. 10.

続いて、情報処理装置５００は、開発データセットの音声部分から生成される特徴ベクトルを改良ＣＴＣベースドモデル１Ａに入力して、各残差ブロック２０Ａにおけるスケールファクタの時間的変化を算出する（ステップＳ１５６）。そして、情報処理装置５００は、各残差ブロック２０Ａにおけるスケールファクタの時間的変化に基づいて、改良ＣＴＣベースドモデル１Ａに含まれる時間遅延ブロック２２のうち削除すべきものが存在するか否かを判断する（ステップＳ１５８）。時間遅延ブロック２２のうち削除すべきものが存在する場合（ステップＳ１５８においてＹＥＳ）、情報処理装置５００は、改良ＣＴＣベースドモデル１Ａから対象の時間遅延ブロック２２を削除する（ステップＳ１６０）。時間遅延ブロック２２のうち削除すべきものが存在しない場合（ステップＳ１５８においてＮＯ）、ステップＳ１６０の処理はスキップされる。 Subsequently, the information processing device 500 inputs the feature vector generated from the speech portion of the development data set to the improved CTC-based model 1A, and calculates the temporal change of the scale factor in each residual block 20A (step S156). ). Then, the information processing device 500 determines whether or not there is a time delay block 22 included in the improved CTC-based model 1A that should be deleted, based on the temporal change in the scale factor in each residual block 20A. (Step S158). If there is a time delay block 22 to be deleted (YES in step S158), information processing apparatus 500 deletes the target time delay block 22 from improved CTC based model 1A (step S160). If there is no time delay block 22 to be deleted (NO in step S158), the process of step S160 is skipped.

最終的に、情報処理装置５００は、（状況に応じて時間遅延ブロック２２が削除された後の）改良ＣＴＣベースドモデル１Ａのすべてのパラメータを再度決定する（ステップＳ１６２）。このステップＳ１６０の処理は、図９に示す再トレーニング法のステップＳ１２０～Ｓ１３０と実質的に同一である。 Finally, the information processing apparatus 500 determines all parameters of the improved CTC-based model 1A again (after any time delay blocks 22 have been deleted, as the case may be) (step S162). The processing of step S160 is substantially the same as steps S120 to S130 of the retraining method shown in FIG. 9.

以上のような手順によって、改良ＣＴＣベースドモデル１Ａの学習済モデルが生成される。 A learned model of the improved CTC-based model 1A is generated by the above procedure.
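The pruning decision in steps S156 to S160 can be sketched as a small routine. This is a hedged illustration: the block names, the example traces, and the mean-based deletion rule are hypothetical, and the training steps S150 to S154 and the final retraining in S162 are only indicated in comments.

```python
def prune_time_delay_blocks(block_names, scale_factor_traces, should_delete):
    """Decide, per residual block, whether its time delay block is deleted.

    block_names: ordered names of the residual blocks.
    scale_factor_traces: mapping from block name to the scale-factor values
        computed on development-set speech (S156).
    should_delete: predicate applied to each trace (S158).
    Returns (blocks keeping their time delay block, blocks losing it).
    """
    kept, deleted = [], []
    for name in block_names:
        if should_delete(scale_factor_traces[name]):
            deleted.append(name)  # S160: remove this block's time delay block
        else:
            kept.append(name)
    return kept, deleted


# Hypothetical traces: the first block mostly uses the shortcut path.
traces = {
    "block1": [0.7, 0.6, 0.8],
    "block2": [0.1, 0.0, 0.2],
    "block3": [0.0, 0.1, 0.0],
}
kept, deleted = prune_time_delay_blocks(
    ["block1", "block2", "block3"],
    traces,
    should_delete=lambda trace: sum(trace) / len(trace) > 0.5,
)
# After this decision, all remaining parameters would be retrained (S162).
```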

［Ｆ．デコーディング方法］ [F. Decoding method]
次に、本実施の形態に従う改良ＣＴＣベースドモデル１Ａを用いたデコーディング方法について説明する。本実施の形態に従う改良ＣＴＣベースドモデル１Ａは、Ｅ２Ｅフレームワークであるので、音声信号から順次生成される特徴ベクトルを入力するだけで、対応するテキスト（サブワードシーケンス）が順次出力されることになる。 Next, a decoding method using the improved CTC-based model 1A according to the present embodiment will be described. Since the improved CTC-based model 1A according to the present embodiment is an E2E framework, simply inputting the feature vectors sequentially generated from a speech signal causes the corresponding text (subword sequence) to be output sequentially.

図１５は、本実施の形態に従う改良ＣＴＣベースドモデル１Ａのデコーディング方法の処理手順を示すフローチャートである。図１５に示す各ステップは、典型的には、情報処理装置５００のプロセッサ（ＣＰＵ５０２および／またはＧＰＵ５０４）がトレーニングプログラム５１４を実行することで実現される。 FIG. 15 is a flow chart showing the processing procedure of the decoding method for improved CTC-based model 1A according to this embodiment. Each step shown in FIG. 15 is typically implemented by the processor (CPU 502 and/or GPU 504 ) of information processing apparatus 500 executing training program 514 .

図１５を参照して、情報処理装置５００は、入力される音声信号からフレームごとに特徴ベクトルを生成する（ステップＳ２００）。そして、情報処理装置５００は、生成した特徴ベクトルを改良ＣＴＣベースドモデル１Ａに入力して推定結果を算出および出力する（ステップＳ２０２）。 Referring to FIG. 15, information processing apparatus 500 generates a feature vector for each frame from an input audio signal (step S200). The information processing apparatus 500 then inputs the generated feature vector to the improved CTC-based model 1A to calculate and output an estimation result (step S202).

そして、情報処理装置５００は、音声信号の入力が継続しているか否かを判断する（ステップＳ２０４）。音声信号の入力が継続していれば（ステップＳ２０４においてＹＥＳ）、ステップＳ２００以下の処理が繰返される。 The information processing device 500 then determines whether or not the input of the audio signal continues (step S204). If the input of the audio signal continues (YES in step S204), the processing from step S200 onwards is repeated.

一方、音声信号の入力が継続していなければ（ステップＳ２０４においてＮＯ）、デコーディングの処理は終了する。 On the other hand, if the input of the audio signal has not continued (NO in step S204), the decoding process ends.
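The decoding loop of steps S200 to S204 can be sketched as a generator. The feature extractor and the model are passed in as callables because their internals are outside this passage; both stand-ins in the usage lines are toy placeholders, not the actual filterbank front end or network.

```python
def decode_stream(frames, extract_features, model):
    """Yield one estimation result per input frame.

    The loop ends when the frame source is exhausted, which corresponds
    to the NO branch of step S204.
    """
    for frame in frames:
        features = extract_features(frame)  # S200: per-frame feature vector
        yield model(features)               # S202: compute and output result


# Toy stand-ins for illustration only.
toy_frames = [[1, 2], [3, 4]]
results = list(decode_stream(toy_frames, sum, lambda x: "a" if x < 5 else "b"))
```

A generator fits the description of text being output sequentially as features arrive, since each result is produced before the next frame is read.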

［Ｇ．評価実験］ [G. Evaluation experiment]
本願発明者らは、上述した本実施の形態に従う改良ＣＴＣベースドモデル１Ａの性能について評価実験を行なった。以下、評価実験について説明する。 The inventors of the present application conducted evaluation experiments on the performance of the improved CTC-based model 1A according to the present embodiment described above. The evaluation experiments are described below.

（ｇ１：データおよびタスクの説明） (g1: description of data and tasks)
評価実験には、トレーニングデータおよび評価データとして、国立国語研究所が提供している「日本語話し言葉コーパス（Corpus of Spontaneous Japanese：ＣＳＪ）」を用いた。 In the evaluation experiments, the "Corpus of Spontaneous Japanese (CSJ)" provided by the National Institute for Japanese Language and Linguistics was used as training data and evaluation data.

先行研究における知見に従って、ＣＳＪに含まれる２４０時間分の講演の音声をトレーニングデータセット（以下、「ＣＳＪ－Ｔｒａｉｎ」とも称す。）として構成した。ＣＳＪは、３個の公式の評価データセット（ＣＳＪ－Ｅｖａｌ０１、ＣＳＪ－Ｅｖａｌ０２、ＣＳＪ－Ｅｖａｌ０３）を含む。各評価データセットは、１０講演分の音声を含む。これらの評価データセットを音声認識結果の評価に用いた。また、１０講演分の音声からなる開発データセット（ＣＳＪ－Ｄｅｖ）をトレーニング中の評価用として用いた。 Following the findings of previous studies, 240 hours of lecture speech included in the CSJ were used as the training data set (hereinafter also referred to as "CSJ-Train"). The CSJ includes three official evaluation data sets (CSJ-Eval01, CSJ-Eval02, CSJ-Eval03). Each evaluation data set contains speech from 10 lectures. These evaluation data sets were used to evaluate speech recognition results. In addition, a development data set (CSJ-Dev) consisting of speech from 10 lectures was used for evaluation during training.

さらに、ウオームアップ初期化およびパラメータチューニングのためのシードモデルのトレーニング用に、ＣＳＪに含まれる２７．６時間分のデータセット（以下、「ＣＳＪ－Ｔｒａｉｎ_{ｓｍａｌｌ}」とも称す。）を選択した。 In addition, a 27.6-hour data set included in the CSJ (hereinafter also referred to as "CSJ-Train_{small}") was selected for training a seed model used for warm-up initialization and parameter tuning.

これらのデータセットに含まれる講演の数および時間は、以下のＴａｂｌｅ１に示す通りである。 The number and duration of talks included in these datasets are shown in Table 1 below.

（ｇ２：ベースラインモデル） (g2: baseline model)
まず、ＣＳＪ－Ｔｒａｉｎを用いて、評価基準となるベースラインモデルをトレーニングした。第１のベースラインモデルとして、ＤＮＮ－ＨＭＭ－ＣＥ（deep neural network and hidden Markov model cross entropy）モデルを取り上げる。ＤＮＮ－ＨＭＭ－ＣＥモデルを構築するにあたって、まず、音響モデルに相当するＧＭＭ－ＨＭＭ（Gaussian mixture model and hidden Markov model）モデルをトレーニングし、続いて、５個の隠れ層（各層は２０４８個の隠れノードを有する）からなるＤＮＮモデル（言語モデルに相当する）をトレーニングした。出力層は、約８５００個のノードを有しており、これは、ＧＭＭ－ＨＭＭモデルの結合トライフォン（triphone）状態に対応する。これらのトレーニングにおいて、７２次元のフィルタバンク特徴（２４次元のスタティック＋Δ＋ΔΔ）を用いた。フィルタバンク特徴は、話者ごとに平均化および正規化が行なわれた結果であり、分割された１１フレーム（過去５フレーム、現在フレーム、未来５フレーム）からなる。ＤＮＮモデルは、交差エントロピー損失基準に基づく標準的な確率的勾配降下法（ＳＧＤ：stochastic gradient descent）を用いてトレーニングした。 First, baseline models serving as evaluation references were trained using CSJ-Train. The first baseline model is a DNN-HMM-CE (deep neural network and hidden Markov model cross entropy) model. To build the DNN-HMM-CE model, a GMM-HMM (Gaussian mixture model and hidden Markov model) model corresponding to the acoustic model was trained first, and then a DNN model (corresponding to the language model) consisting of five hidden layers (each with 2048 hidden nodes) was trained. The output layer has approximately 8500 nodes, which correspond to the tied triphone states of the GMM-HMM model. For this training, 72-dimensional filterbank features (24-dimensional static + Δ + ΔΔ) were used. The filterbank features were averaged and normalized per speaker and spliced over 11 frames (5 past frames, the current frame, and 5 future frames). The DNN model was trained using standard stochastic gradient descent (SGD) based on the cross-entropy loss criterion.
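The feature dimensions above follow from simple arithmetic: 24 static coefficients plus Δ and ΔΔ give 24 × 3 = 72 dimensions per frame, and an 11-frame context window (5 past, current, 5 future) gives 72 × 11 = 792 values per input, a derived figure that the description itself does not state. The sketch below shows one way to build such a window; padding the edges by repeating the first and last frame is an assumption, as the description does not say how utterance boundaries are handled.

```python
def splice(frames, left=5, right=5):
    """Concatenate each frame with its left/right neighbours.

    Edge frames are padded by repeating the first or last frame
    (an assumption made for this sketch).
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to valid frame range
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


STATIC_DIM = 24
FRAME_DIM = STATIC_DIM * 3  # static + delta + delta-delta = 72

# Dummy utterance of 20 frames; frame t is filled with the value t.
frames = [[float(t)] * FRAME_DIM for t in range(20)]
spliced = splice(frames)
```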

デコードに関して、４グラム単語言語モデル（ＷＬＭ：word language model）を、５９１時間分のＣＳＪトレーニングデータセットの転記テキストによりトレーニングした。ＷＬＭの語彙サイズは９８×１０^３である。 For decoding, a 4-gram word language model (WLM) was trained on the transcripts of the 591-hour CSJ training data set. The vocabulary size of the WLM is 98×10^3.

（ｇ３：改良ＣＴＣベースドモデル１Ａのトレーニングのための設定） (g3: settings for training the improved CTC-based model 1A)
本実施の形態に従う改良ＣＴＣベースドモデル１Ａは、７２次元のフィルタバンク特徴（２４次元のスタティック＋Δ＋ΔΔ）（非分割）を用いてトレーニングした。このトレーニングにおいては、日本語の２６３音節（日本語書き言葉の基本単位であるかな）と、非発話ノイズと、発話ノイズと、ブランク（φ）とを基本音響モデル単位として用いた。 The improved CTC-based model 1A according to the present embodiment was trained using 72-dimensional filterbank features (24-dimensional static + Δ + ΔΔ) (non-spliced). In this training, 263 Japanese syllables (kana, the basic units of written Japanese), non-speech noise, speech noise, and blank (φ) were used as the basic acoustic modeling units.

対象したネットワーク（ＣＳＪ－Ｔｒａｉｎ_{ｓｍａｌｌ}によりトレーニングされた単音ベースのシードシステムを用いてチューニングされている）は、次のように規定される。すなわち、入力層に引き続く９個の全結合層と、それに続く１５個の時間遅延層２４（３つの残差ブロック２０Ａ全体として）と、ｓｏｆｔｍａｘ出力の前段に配置された２つの全結合層とからなる。 The target network (tuned using a phone-based seed system trained on CSJ-Train_{small}) is defined as follows: nine fully connected layers following the input layer, followed by fifteen time delay layers 24 (in total across the three residual blocks 20A), and two fully connected layers placed before the softmax output.

積層された３つの残差ブロック２０Ａのそれぞれにおけるウィンドウサイズの変化を以下のＴａｂｌｅ２に示す。 Table 2 below shows the change in window size in each of the three stacked residual blocks 20A.

ＣＳＪに含まれる２７．６時間分のデータセット（ＣＳＪ－Ｔｒａｉｎ_{ｓｍａｌｌ}）を用いて、交差エントロピー損失基準に従ってシードモデルをトレーニングし、それにより得られたモデルパラメータを用いてＣＴＣモデルを初期化した。ＣＴＣのトレーニングには、ＦｓＡｄａＧｒａｄアルゴリズムを用いた。２４０時間分の講演の音声を含むトレーニングデータセット（ＣＳＪ－Ｔｒａｉｎ）を用いたトレーニングを高速化するために、ＢＭＵＦ（block-wise model update filtering）を適用した。各フレームに対する学習レートの初期値は０．００００１とし、ＣＳＪ－Ｄｅｖについての検定結果に応じて学習レートを自動的に調整した。ミニバッチサイズは２０４８とし、同一のミニバッチにおいて並列処理されるシーケンス数は１６とした。エポック数の最大値は２５とした。 Using the 27.6-hour data set included in the CSJ (CSJ-Train_{small}), a seed model was trained according to the cross-entropy loss criterion, and the resulting model parameters were used to initialize the CTC model. The FsAdaGrad algorithm was used for CTC training. Block-wise model update filtering (BMUF) was applied to speed up training with the training data set containing 240 hours of lecture speech (CSJ-Train). The initial learning rate per frame was set to 0.00001, and the learning rate was adjusted automatically according to the validation results on CSJ-Dev. The mini-batch size was set to 2048, and the number of sequences processed in parallel within one mini-batch was set to 16. The maximum number of epochs was set to 25.

ネットワークで算出されるスケール化された対数尤度をＥＥＳＥＮデコーダに与えることで、改良ＣＴＣベースドモデル１Ａをデコードする。 The improved CTC-based model 1A is decoded by feeding the scaled log-likelihoods computed by the network to the EESEN decoder.

また、本実施の形態に従う改良ＣＴＣベースドモデル１Ａと同一の構造を有し、ＭｉｃｒｏｓｏｆｔのＣｏｍｐｕｔａｔｉｏｎａｌＮｅｔｗｏｒｋＴｏｏｌｋｉｔ（ＣＮＴＫ）により特徴量が設定された交差エントロピーモデル（ＶＲｅｓＴＤ－ＣＥ）についてもトレーニングした。このトレーニングにおいて、ＤＮＮ－ＨＭＭ－ＣＥモデルと同一のラベルを用いた。 A cross-entropy model (VResTD-CE), which has the same structure as the improved CTC-based model 1A according to the present embodiment and whose features are set by Microsoft's Computational Network Toolkit (CNTK), was also trained. In this training, we used the same labels as the DNN-HMM-CE model.

（ｇ４：アテンションモジュールの付加による改良ＣＴＣベースドモデル１Ａのチューニング） (g4: tuning of the improved CTC-based model 1A by adding the attention module)
上述したように、基本ＣＴＣベースドモデル１（ＶＲｅｓＴＤ－ＣＴＣ）に対して、アテンションモジュール２８を付加することで、改良ＣＴＣベースドモデル１Ａを構成する。改良ＣＴＣベースドモデル１Ａを規定するすべてのパラメータ（アテンションモジュール２８のパラメータも含む）をＣＳＪ－Ｔｒａｉｎを用いてトレーニングすることで得られた学習済モデルを「ＶＲｅｓＴＤＭ－ＣＴＣ_{ｒｅｔｒａｉｎ}」と称する。 As described above, the improved CTC-based model 1A is constructed by adding the attention module 28 to the basic CTC-based model 1 (VResTD-CTC). The trained model obtained by training all parameters that define the improved CTC-based model 1A (including the parameters of the attention module 28) using CSJ-Train is referred to as "VResTDM-CTC_{retrain}".

ＶＲｅｓＴＤＭ－ＣＴＣ_{ｒｅｔｒａｉｎ}を得るために用いた学習レートの初期値は０．００００１とした。ミニバッチサイズは２０４８とした。各エポックのトレーニングが完了するごとにＣＳＪ－Ｄｅｖを用いて性能を評価した。結果的に、性能が低下する直前の１７回目のエポックの開始直前でトレーニングを終了した。 The initial learning rate used to obtain VResTDM-CTC_{retrain} was 0.00001. The mini-batch size was 2048. Performance was evaluated on CSJ-Dev each time an epoch of training completed. As a result, training was terminated just before the start of the 17th epoch, immediately before performance began to degrade.
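The stopping rule described above (evaluate on the development set after every epoch and stop before performance degrades) can be sketched as follows. This is a simplified illustration: the function takes a precomputed list of development-set error rates instead of running training, the error values in the usage line are invented, and the cap of 25 epochs is borrowed from the CTC training settings as an assumption.

```python
def epochs_to_keep(dev_errors, max_epochs=25):
    """Return how many epochs to keep: stop as soon as the development-set
    error of an epoch is worse than that of the previous epoch."""
    previous = None
    considered = dev_errors[:max_epochs]
    for epoch, error in enumerate(considered, start=1):
        if previous is not None and error > previous:
            return epoch - 1  # keep the model from before the degradation
        previous = error
    return len(considered)


# Invented error rates: the fourth epoch is worse than the third.
kept_epochs = epochs_to_keep([10.0, 9.0, 8.0, 8.5, 7.0])
```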

図１６は、本実施の形態に従う改良ＣＴＣベースドモデル１Ａのアテンションスコアの変化例を示す図である。図１６（Ａ）および（Ｂ）は、入力される音声フレームに対する先頭の残差ブロック２０Ａにおけるアテンションスコアの変化を示し、図１６（Ｃ）および（Ｄ）は、入力される音声フレームに対する最終の残差ブロック２０Ａにおけるアテンションスコアの変化を示す。入力される音声フレームとしては、ＣＳＪ－Ｅｖａｌ０１を用いた。 FIG. 16 is a diagram showing examples of changes in the attention score of the improved CTC-based model 1A according to the present embodiment. FIGS. 16(A) and 16(B) show changes in the attention score in the first residual block 20A for input speech frames, and FIGS. 16(C) and 16(D) show changes in the attention score in the final residual block 20A for input speech frames. CSJ-Eval01 was used as the input speech frames.

２つの異なるシステム（音節ベース（syllable system）および単音ベース（ci-phone system））の先頭の残差ブロック２０Ａを通過する際の振る舞いは、互いに異なるものとなっている。具体的には、図１６（Ａ）に示すように、音節ベースにおいては、音声セグメントはショートカット経路を通過する傾向が強い。一方、図１６（Ｂ）に示すように、単音ベースにおいては、そのような傾向は見られない。 The two different systems (the syllable-based system and the phone-based (ci-phone) system) behave differently when data passes through the first residual block 20A. Specifically, as shown in FIG. 16(A), in the syllable-based system, speech segments tend strongly to pass through the shortcut path. In contrast, as shown in FIG. 16(B), no such tendency is seen in the phone-based system.

評価として、音節ベースにおいては、ＣＳＪ－Ｅｖａｌ０１について、音声セグメントに対するアテンションスコアの平均値は０．６であり、ブランクに対するアテンションスコアの平均値は０．３６であった。一方、単音ベースにおいては、アテンションスコアの平均値はいずれもそれらの値より十分に小さい。 In the evaluation, for the syllable-based system on CSJ-Eval01, the average attention score for speech segments was 0.6 and the average attention score for blanks was 0.36. For the phone-based system, by contrast, both average attention scores were well below those values.

最終の残差ブロック２０Ａにおいては、いずれのシステムについても、ショートカット経路を避ける傾向が強い。具体的には、ＣＳＪ－Ｅｖａｌ０１についての音声フレームに対するアテンションスコアの平均値は、いずれのシステムについてもほぼ０．０であった。 In the final residual block 20A, there is a strong tendency to avoid shortcut paths for both systems. Specifically, the average attention score for speech frames for CSJ-Eval01 was approximately 0.0 for both systems.

これらの実験結果に基づいて、基本ＣＴＣベースドモデル１（ＶＲｅｓＴＤ－ＣＴＣ）に含まれる残差ブロック２０に対する重みを調整した改良ＣＴＣベースドモデル１Ａを用意した。より具体的には、音声セグメントに対するアテンションスコアα_ｔ ^ｉをしきい値「０．５」で二値化することで、一部の時間遅延ブロック２２を削除した。すなわち、上述した切り落とし法により生成された学習済モデルを「ＶＲｅｓＴＤＭ－ＣＴＣ_{ｐｒｕｎｅ}」と称する。 Based on these experimental results, an improved CTC-based model 1A was prepared in which the weights for the residual blocks 20 included in the basic CTC-based model 1 (VResTD-CTC) were adjusted. More specifically, some time delay blocks 22 were deleted by binarizing the attention score α_t^i for speech segments with a threshold of 0.5. The trained model generated by the pruning method described above is referred to as "VResTDM-CTC_{prune}".

（ｇ５：音声認識性能） (g5: speech recognition performance)
次に、本実施の形態に従う改良ＣＴＣベースドモデル１Ａの音声認識性能の評価結果の一例について説明する。音声認識性能の評価には、ＣＳＪに含まれる３個の評価データセット（ＣＳＪ－Ｅｖａｌ０１、ＣＳＪ－Ｅｖａｌ０２、ＣＳＪ－Ｅｖａｌ０３）を用いた。音声認識性能の評価には、上述したベースラインモデル（ＤＮＮ－ＨＭＭ－ＣＥおよびＶＲｅｓＴＤ－ＣＥ）と比較した。この音声認識性能の評価結果を以下のＴａｂｌｅ３に示す。 Next, an example of the evaluation results for the speech recognition performance of the improved CTC-based model 1A according to the present embodiment will be described. The three evaluation data sets included in the CSJ (CSJ-Eval01, CSJ-Eval02, and CSJ-Eval03) were used to evaluate speech recognition performance. Performance was compared against the baseline models described above (DNN-HMM-CE and VResTD-CE). The evaluation results are shown in Table 3 below.

上述の評価結果においては、評価指標として、自動音声認識（ＡＳＲ：Automatic Speech Recognition）の単語誤り率（ＷＥＲ：word error rate）を用いた。ＷＥＲは、評価対象のモデルに音声を入力したときに出力されるテキストについて、当該入力された音声に対応する正解テキストに対する誤り率を示す。ＷＥＲの値が小さいほど性能が高いことを示す。 In the evaluation results described above, the word error rate (WER) of automatic speech recognition (ASR) was used as an evaluation index. WER indicates the error rate of the text output when speech is input to the evaluation target model with respect to the correct text corresponding to the input speech. A smaller WER value indicates higher performance.
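For reference, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the recognized text and the reference text, divided by the number of reference words. The sketch below follows that standard definition; it takes pre-tokenized word lists because Japanese text is not whitespace-delimited, and the tokenization actually used in the experiments is not described here.

```python
def word_error_rate(reference, hypothesis):
    """Edit distance between two word lists divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)
```

For instance, a reference of four words recognized with one substitution and one deletion yields a WER of 2 / 4 = 0.5.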

上述の評価結果によれば、ＶＲｅｓＴＤＭ－ＣＴＣ_{ｐｒｕｎｅ}およびＶＲｅｓＴＤＭ－ＣＴＣ_{ｒｅｔｒａｉｎ}の両方とも、すべての評価データセットにおいて、ベースラインモデル（ＤＮＮ－ＨＭＭ－ＣＥ）および基本ＣＴＣベースドモデル１（ＶＲｅｓＴＤ－ＣＴＣ）に比較して、著しい改善が見られる。また、ＶＲｅｓＴＤＭ－ＣＴＣ_{ｒｅｔｒａｉｎ}については、２つの評価データセットにおいて、ＶＲｅｓＴＤ－ＣＥと同等の性能を発揮するとともに、３番目の評価データセットにおいてはより高い性能を発揮している。 According to the evaluation results above, both VResTDM-CTC_{prune} and VResTDM-CTC_{retrain} show marked improvement over the baseline model (DNN-HMM-CE) and the basic CTC-based model 1 (VResTD-CTC) on all evaluation data sets. In addition, VResTDM-CTC_{retrain} performs on par with VResTD-CE on two of the evaluation data sets and better on the third.

［Ｈ．まとめ］ [H. Summary]
本実施の形態に従う改良ＣＴＣベースドモデル１Ａによれば、複数の時間遅延層２４を通過する経路に対する重み（第１の重み）と、ショートカット経路２６に対する重み（第２の重み）とをタイムステップごとに更新できる。このようなタイムステップ毎の重みの更新によって、ネットワーク全体を動的に振る舞わせることができ、これによって、対象のシステムに応じた適切なネットワーク構造を実現できる。 According to the improved CTC-based model 1A of the present embodiment, the weight for the path passing through the plurality of time delay layers 24 (first weight) and the weight for the shortcut path 26 (second weight) can be updated at each time step. Updating the weights at each time step allows the entire network to behave dynamically, which in turn realizes a network structure suited to the target system.
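The per-time-step weighting of the two paths can be illustrated with a short sketch. It assumes that the first and second weights come from a two-way softmax and therefore sum to 1, which is consistent with the fully connected layer and softmax function listed for the attention module but is still an assumption about the exact form; the scalar scores passed in stand in for the fully connected layer's outputs and are hypothetical.

```python
import math


def softmax2(score_td, score_shortcut):
    """Two-way softmax returning (time-delay-path weight, shortcut weight)."""
    m = max(score_td, score_shortcut)  # subtract the max for numerical stability
    e_td = math.exp(score_td - m)
    e_sc = math.exp(score_shortcut - m)
    total = e_td + e_sc
    return e_td / total, e_sc / total


def residual_block_output(x, td_out, score_td, score_shortcut):
    """Weighted sum of the time-delay-path output and the shortcut input
    for one time step."""
    w_td, w_sc = softmax2(score_td, score_shortcut)
    return [w_td * f + w_sc * xi for f, xi in zip(td_out, x)]
```

With equal scores both paths contribute half each; as the shortcut score grows, the block output approaches the unmodified input, which corresponds to the case where data mainly passes through the shortcut path.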

また、本実施の形態に従う改良ＣＴＣベースドモデル１Ａによれば、アテンションモジュール２８が更新する重み（スケールファクタ）の時間的な変化を監視することで、不安定な時間遅延層２４などを特定することができ、これによって、高精度かつ高速な学習を実現できる。 Further, according to the improved CTC-based model 1A of the present embodiment, monitoring the temporal change of the weights (scale factors) updated by the attention module 28 makes it possible to identify, for example, unstable time delay layers 24, which enables highly accurate and fast learning.

今回開示された実施の形態は、すべての点で例示であって制限的なものではないと考えられるべきである。本発明の範囲は、上記した実施の形態の説明ではなくて特許請求の範囲によって示され、特許請求の範囲と均等の意味および範囲内でのすべての変更が含まれることが意図される。 The embodiments disclosed this time should be considered as examples and not restrictive in all respects. The scope of the present invention is indicated by the scope of the claims rather than the description of the above-described embodiments, and is intended to include all modifications within the scope and meaning equivalent to the scope of the claims.

１基本ＣＴＣベースドモデル、１Ａ改良ＣＴＣベースドモデル、２特徴量抽出部、４認識エンジン、１０，３２，２８２全結合層、２０，２０Ａ残差ブロック、２２時間遅延ブロック、２４時間遅延層、２６ショートカット経路、２８アテンションモジュール、２９加算器、３０出力層、３４マッピング関数、４０，５２０トレーニングデータセット、４０Ｓ縮小トレーニングデータセット、４２音声信号、４４テキスト、２４１，２４２遅延要素、２８４ｓｏｆｔｍａｘ関数、２８５出力経路、２８６，２８８乗算器、５００情報処理装置、５０２ＣＰＵ、５０４ＧＰＵ、５０６主メモリ、５０８ディスプレイ、５１０ネットワークインターフェイス、５１２二次記憶装置、５１４トレーニングプログラム、５１６モデル定義データ、５１８ネットワークパラメータ、５２２入力デバイス、５２４光学ドライブ、５２６光学ディスク、５２８内部バス、Ｓ音声認識システム。 1 basic CTC based model, 1A improved CTC based model, 2 feature extractor, 4 recognition engine, 10, 32, 282 fully connected layer, 20, 20A residual block, 22 time delay block, 24 time delay layer, 26 shortcut path, 28 attention module, 29 adder, 30 output layer, 34 mapping function, 40,520 training dataset, 40S reduced training dataset, 42 audio signal, 44 text, 241, 242 delay element, 284 softmax function, 285 output path, 286, 288 multiplier, 500 information processing device, 502 CPU, 504 GPU, 506 main memory, 508 display, 510 network interface, 512 secondary storage device, 514 training program, 516 model definition data, 518 network parameter, 522 Input Device, 524 Optical Drive, 526 Optical Disc, 528 Internal Bus, S Speech Recognition System.

Claims

A discriminator that outputs a sequence of labels for an input signal,
an input layer that sequentially generates a first feature vector from the input signal for each frame of a predetermined time width;
a plurality of stacked residual blocks following the input layer;
an output layer connected to the output side of the plurality of residual blocks;
each of the plurality of residual blocks comprising:
a plurality of stacked time delay layers;
a shortcut path bypassing the plurality of time delay layers;
an attention module that adjusts weights between paths passing through the plurality of time delay layers and the shortcut paths;
The plurality of time delay layers have delay elements that provide a delay of a predetermined timestep with respect to the input,
Each of the time delay layers generates, for an input vector, a first internal vector corresponding to a past frame obtained by moving the current frame, which is the frame corresponding to the input vector, back in time by the time step, and a second internal vector corresponding to a future frame advanced in time by the time step;
wherein the attention module updates the weights at each time step based on the output obtained by passing the input provided to the corresponding residual block through the corresponding plurality of time delay layers and on the input provided to the corresponding residual block.

The discriminator of claim 1, wherein the attention module includes:
a fully connected layer connected to the output of the corresponding residual block and the shortcut path; and
a softmax function connected to the fully connected layer.
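As a rough illustration of this attention module, the following Python sketch feeds both path vectors to a fully connected layer, applies a softmax, and uses the two resulting weights to scale the time delay path and the shortcut path before they are added. Concatenating the two vectors, producing exactly two logits, and the names `attention_weights`, `gated_residual_output`, `fc_w` and `fc_b` are assumptions, not details taken from the claims.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(z: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def attention_weights(delay_out: Vector, shortcut: Vector,
                      fc_w: List[Vector], fc_b: Vector) -> Tuple[float, float]:
    """Fully connected layer over both path vectors, followed by a softmax."""
    joined = delay_out + shortcut  # concatenation (assumption)
    logits = [sum(w * x for w, x in zip(row, joined)) + b
              for row, b in zip(fc_w, fc_b)]
    a_delay, a_shortcut = softmax(logits)
    return a_delay, a_shortcut


def gated_residual_output(delay_seq: List[Vector], shortcut_seq: List[Vector],
                          fc_w: List[Vector], fc_b: Vector) -> List[Vector]:
    """Recompute the two weights at every time step and mix the two paths."""
    out = []
    for f_t, x_t in zip(delay_seq, shortcut_seq):
        a_delay, a_shortcut = attention_weights(f_t, x_t, fc_w, fc_b)
        out.append([a_delay * f + a_shortcut * x for f, x in zip(f_t, x_t)])
    return out
```

Because the softmax output sums to one, the two weights trade off against each other: a frame that relies heavily on the shortcut path necessarily relies less on the time delay path.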

A discriminator that outputs a sequence of labels for an input signal, the discriminator comprising:
an input layer that sequentially generates a first feature vector from the input signal for each frame of a predetermined time width;
a plurality of stacked residual blocks following the input layer;
an output layer connected to the output side of the plurality of residual blocks;
each of the plurality of residual blocks comprising:
a plurality of stacked time delay layers;
a shortcut path bypassing the plurality of time delay layers;
an attention module that adjusts weights between the path passing through the plurality of time delay layers and the shortcut path;
wherein the plurality of time delay layers have delay elements that delay the input by a predetermined time step,
the attention module updates the weights at each time step, based on the output obtained by passing the input provided to the corresponding residual block through the corresponding plurality of time delay layers and on the input provided to the corresponding residual block,
the parameters defining the network of the discriminator are determined by training with a training data set in the absence of the attention module, and
the parameters defining the attention module are determined by training with a training data set with the attention module in place.

A trained model for causing a computer to output a sequence of labels for an input signal, said trained model comprising:
an input layer that sequentially generates a first feature vector from the input signal for each frame of a predetermined time width;
a plurality of stacked residual blocks following the input layer;
an output layer connected to the output side of the plurality of residual blocks;
each of the plurality of residual blocks comprising:
a plurality of stacked time delay layers;
a shortcut path bypassing the plurality of time delay layers;
an attention module that adjusts weights between the path passing through the plurality of time delay layers and the shortcut path;
wherein the plurality of time delay layers have delay elements that delay the input by a predetermined time step,
each of the time delay layers generates, for an input vector, a first internal vector corresponding to a past frame that precedes the current frame (the frame corresponding to the input vector) by the time step, and a second internal vector corresponding to a future frame that follows the current frame by the time step, and
the trained model is configured such that the attention module updates the weights at each time step, based on the output obtained by passing the input provided to the corresponding residual block through the corresponding plurality of time delay layers and on the input provided to the corresponding residual block.

A learning method for a discriminator that outputs a sequence of labels for an input signal, wherein the discriminator comprises:
an input layer that sequentially generates a first feature vector from the input signal for each frame of a predetermined time width;
a plurality of stacked residual blocks following the input layer;
an output layer connected to the output side of the plurality of residual blocks;
each of the plurality of residual blocks comprising:
a plurality of stacked time delay layers;
a shortcut path that bypasses the plurality of time delay layers;
the plurality of time delay layers have delay elements that delay the input by a predetermined time step, and
the learning method includes:
a first training step of determining parameters defining the network of the discriminator using a training data set;
a step of adding to the discriminator an attention module that adjusts weights between the path passing through the plurality of time delay layers and the shortcut path, wherein the attention module updates the weights at each time step based on the output obtained by passing the input provided to the corresponding residual block through the corresponding plurality of time delay layers and on the input provided to the corresponding residual block; and
a second training step of determining parameters defining the attention module using a training data set.
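The two training steps can be sketched on a toy scalar model in Python. This is only an illustration under several assumptions: a single weight `w` stands in for the time delay path, a squared error replaces the actual training loss, finite differences replace backpropagation, and the network parameter is kept fixed during the second step. All names (`forward`, `train`, `two_stage_training`) are hypothetical.

```python
import math

ATTENTION_PARAMS = ["v0", "u0", "b0", "v1", "u1", "b1"]


def forward(x, params, use_attention):
    """Toy residual block: f = w * x is the time delay path, x the shortcut."""
    f = params["w"] * x
    if not use_attention:
        return f + x  # plain residual sum, attention module absent
    # Two logits from a tiny fully connected layer, then a softmax.
    z0 = params["v0"] * f + params["u0"] * x + params["b0"]
    z1 = params["v1"] * f + params["u1"] * x + params["b1"]
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    a_delay, a_shortcut = e0 / (e0 + e1), e1 / (e0 + e1)
    return a_delay * f + a_shortcut * x


def loss(data, params, use_attention):
    """Mean squared error (stand-in for the real training criterion)."""
    return sum((forward(x, params, use_attention) - y) ** 2
               for x, y in data) / len(data)


def train(data, params, trainable, use_attention, steps=200, lr=0.05, eps=1e-5):
    """Gradient descent with central finite differences on `trainable` only."""
    for _ in range(steps):
        grads = {}
        for name in trainable:
            old = params[name]
            params[name] = old + eps
            up = loss(data, params, use_attention)
            params[name] = old - eps
            down = loss(data, params, use_attention)
            params[name] = old
            grads[name] = (up - down) / (2 * eps)
        for name in trainable:
            params[name] -= lr * grads[name]
    return params


def two_stage_training(data):
    params = {"w": 0.0}
    params.update({name: 0.0 for name in ATTENTION_PARAMS})
    # First training step: no attention module, fit the network parameter.
    train(data, params, ["w"], use_attention=False)
    # Second training step: attention module added, fit only its parameters.
    train(data, params, ATTENTION_PARAMS, use_attention=True)
    return params
```

Calling `two_stage_training([(x, 3.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)])` first fits `w` and then adjusts only the attention parameters, leaving `w` untouched.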

The learning method according to claim 5, further comprising a step of providing an input signal to the discriminator to which the attention module has been added, and removing some of the plurality of time delay layers based on changes in the weight values calculated by the attention module.
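One way such a removal decision might look is sketched below in Python. The claim does not specify a criterion, so the rule used here is an assumption: a residual block is flagged when the weight its attention module assigns to the time delay path stays below a threshold at every time step. The function names and the threshold value are likewise hypothetical.

```python
from statistics import mean
from typing import Dict, List


def summarize_weights(weights: List[float]) -> Dict[str, float]:
    """Mean and range of the time-delay-path weight over all time steps."""
    return {"mean": mean(weights), "range": max(weights) - min(weights)}


def select_blocks_to_prune(delay_path_weights: Dict[str, List[float]],
                           threshold: float = 0.1) -> List[str]:
    """Flag blocks whose time-delay-path weight never reaches `threshold`.

    `delay_path_weights` maps a block name to the weights recorded at each
    time step while an input signal passes through the discriminator.
    """
    return [name for name, weights in delay_path_weights.items()
            if max(weights) < threshold]
```

A block that is flagged contributes almost nothing through its time delay layers for the observed input, so removing those layers leaves mainly the shortcut path in place. A block whose weight is low on average but occasionally rises is kept under this rule.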