JP6691501B2

JP6691501B2 - Acoustic model learning device, model learning device, model learning method, and program

Info

Publication number: JP6691501B2
Application number: JP2017074365A
Authority: JP
Inventors: 祐太河内; 太一浅見; 伸克北条
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-04-04
Filing date: 2017-04-04
Publication date: 2020-04-28
Anticipated expiration: 2037-04-04
Also published as: JP2018180045A

Description

The present invention relates to sequence modeling techniques, and in particular to a technique for interpolating the weight parameters of a linear transformation used in a neural network.

Probabilistic sequence modeling includes, for example, the n-gram model (see Non-Patent Document 1). A language model based on probabilistic sequence modeling is called, for example, a probabilistic language model. Sequence modeling using a neural network includes, for example, the simple recurrent neural network (Elman network), which is a type of recurrent neural network (RNN), and long short-term memory (LSTM) (see Non-Patent Document 2). A language model based on sequence modeling using a neural network is called, for example, a neural language model.

[Non-Patent Document 1] Manabu Okumura, "Introduction to Machine Learning for Language Processing", Corona Publishing Co., Ltd., 2010.
[Non-Patent Document 2] Hochreiter, S., Schmidhuber, J., "Long short-term memory", Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

A probabilistic language model can hold the probability of the next word given the previous word as a categorical distribution. An n-gram model can therefore describe all of its parameters explicitly as a tensor of rank n. At the same time, each parameter can be given a meaning: the probability of the next word given the previous n-1 words. For example, a 2-gram (bigram) model can express all of its information as a rank-2 tensor (that is, a matrix). Each column of this matrix can be interpreted as the distribution of the next word given the corresponding previous word. As a result, a practically important operation, such as adapting a probabilistic language model trained on text data belonging to a domain A by using text data belonging to another domain B, can be carried out by a method such as linearly interpolating the tensors of the probabilistic language models computed from each data set.
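As a minimal sketch of the tensor interpolation described above, the following Python code linearly interpolates two bigram matrices. The function names, the interpolation weight `lam`, and the toy probabilities are assumptions made for illustration; the text does not specify them.

```python
def interpolate_bigram(p_a, p_b, lam):
    """Linearly interpolate two bigram matrices, with weight lam on p_b.

    p_a[i][j] and p_b[i][j] hold P(next word i | previous word j), so each
    column is a categorical distribution, as described in the text.
    """
    return [
        [(1.0 - lam) * a + lam * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(p_a, p_b)
    ]


def column_sums(matrix):
    """Sum each column; every sum should be 1 for a valid bigram matrix."""
    return [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]


# Toy 2-word vocabulary; the values are illustrative only.
P_A = [[0.9, 0.2],
       [0.1, 0.8]]   # hypothetical model from domain A
P_B = [[0.5, 0.6],
       [0.5, 0.4]]   # hypothetical model from domain B
P_MIX = interpolate_bigram(P_A, P_B, lam=0.25)
```

Because every column of both inputs sums to 1, every column of the interpolated matrix also sums to 1, which is why this operation is straightforward for n-gram models.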

A neural language model requires, as its minimum elements, a linear transformation of the input vector and a nonlinear transformation (typically the softmax function) of the linearly transformed vector. The vector after the nonlinear transformation can be interpreted as the posterior probabilities of the next word, but no such meaning can be attached to the weight parameters used in the linear transformation. Consequently, the operation of adapting a neural language model trained on text data of a domain A by using text data of another domain B cannot be carried out.

An object of the present invention is to realize a model learning technique that, in sequence modeling using a neural network, can perform model learning using learning data whose domain differs from that of the learning data used at training time.

To solve the above problem, an acoustic model learning device according to a first aspect of the present invention includes: an acoustic model storage unit that stores an acoustic model using a neural network that takes an acoustic feature vector as input and outputs a posterior probability vector over the output symbols corresponding to the acoustic feature vector and an empty symbol probability representing the probability that the output symbol is the empty symbol; a posterior probability calculation unit that inputs an acoustic feature vector extracted from learning speech into the neural network to obtain the posterior probability vector and the empty symbol probability; a context-preserving vector calculation unit that calculates a context-preserving vector that, based on the empty symbol probability, selects and holds either the posterior probability vector output by the neural network at a previous time or the posterior probability vector output by the neural network at the current time; a context-preserving vector concatenation unit that, each time the context-preserving vector is calculated, linearly transforms the context-preserving vector using weight parameters, adds the result to the output of the final hidden layer of the neural network, and connects it to the output layer of the neural network; a probabilistic model generation unit that generates probabilistic model parameters, using a probabilistic model, from text data that is a set of word strings; a weight parameter inverse calculation unit that performs an inverse calculation of the nonlinear transformation included in the neural network on the probabilistic model parameters to calculate a weight parameter equivalent amount; and a weight parameter interpolation unit that interpolates between the weight parameters used in the linear transformation and the weight parameter equivalent amount.

A model learning device according to a second aspect of the present invention includes: a neural network storage unit that stores a neural network that performs a linear transformation on an input vector using weight parameters learned from first learning data, which is a set of discrete value sequences belonging to a first domain, and performs a nonlinear transformation on the linearly transformed input vector; a probabilistic model generation unit that generates probabilistic model parameters, using a probabilistic model, from second learning data, which is a set of discrete value sequences belonging to a second domain; a weight parameter inverse calculation unit that performs an inverse calculation of the nonlinear transformation on the probabilistic model parameters to calculate a weight parameter equivalent amount; and a weight parameter interpolation unit that interpolates between the weight parameters used in the linear transformation and the weight parameter equivalent amount.

According to the present invention, a weight parameter equivalent amount can be inversely calculated from probabilistic model parameters and interpolated with the weight parameters of the linear transformation used in the neural network. Therefore, in sequence modeling using a neural network, model learning can be performed using learning data of a domain different from that of the learning data used at training time.

FIG. 1 is a diagram illustrating the functional configuration of the model learning device.
FIG. 2 is a diagram illustrating the processing procedure of the model learning method.
FIG. 3 is a diagram illustrating the functional configuration of the acoustic model learning device.
FIG. 4 is a diagram illustrating the functional configuration of the speech recognition device.
FIG. 5 is a diagram illustrating the processing procedure of the acoustic model learning method.
FIG. 6 is a diagram illustrating the processing procedure of the speech recognition method.
FIG. 7 is a diagram illustrating the method of calculating the context-preserving vector.
FIG. 8 is a diagram illustrating a configuration in which the context-preserving vector is linearly transformed and connected.
FIG. 9 is a diagram illustrating the functional configuration of the weight parameter interpolation device.
FIG. 10 is a diagram illustrating the processing procedure of the weight parameter interpolation method.

The symbol "^" used in this specification should properly appear directly above the character that follows it, but because of the limitations of text notation it is written immediately before that character. In the equations, these symbols are written in their proper position, that is, directly above the character. For example, the notation "^x" represents x with a circumflex directly above it ($\hat{x}$).

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and duplicate description is omitted.

[First embodiment]
The first embodiment is a model learning device and method that perform model learning on a neural language model already trained with text data of one domain, using text data of another domain. The first embodiment targets a model equivalent to a 2-gram (bigram) model. This model uses, for example, a tensor of rank 2 (that is, a matrix). Its input is a vector whose number of dimensions equals the vocabulary size, with a 1 in the dimension corresponding to the observed word and 0 in all other dimensions. The model multiplies the input vector by the weight parameter matrix, applies the softmax function, and outputs the posterior probability vector of the next word. The rank of the tensor may, however, be any natural number.
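The bigram-equivalent model described above can be sketched as follows. This is an illustration under stated assumptions: the column convention (`weights[i][j]` is the weight for next word `i` given previous word `j`), the function names, and the toy weights are not taken from the text.

```python
import math


def softmax(x):
    """Softmax with max-subtraction for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [v / total for v in exps]


def one_hot(index, size):
    """Vector with 1 at the observed word's dimension and 0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def next_word_posterior(weights, prev_word):
    """Multiply the weight matrix by the one-hot input, then apply softmax.

    Assumed layout: weights[i][j] is the weight for next word i given
    previous word j, so the one-hot input selects column prev_word.
    """
    x = one_hot(prev_word, len(weights))
    z = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(z)


# Toy 2-word vocabulary; the weights are illustrative only.
W = [[0.0, math.log(3.0)],
     [0.0, 0.0]]
POSTERIOR_AFTER_WORD_1 = next_word_posterior(W, prev_word=1)
```

With these toy weights, the column selected by word 1 is [ln 3, 0], so the posterior is [0.75, 0.25].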

As shown in FIG. 1, the model learning device of the first embodiment includes a first learning data storage unit 10, a neural language model generation unit 11, a neural language model storage unit 12, a second learning data storage unit 20, a frequency tensor generation unit 21, a probabilistic language model generation unit 22, a weight parameter inverse calculation unit 23, a weight parameter correction unit 24, and a weight parameter interpolation unit 25. The weight parameter correction unit 24 is not an essential component and may be omitted. The model learning method of the first embodiment is realized by the model learning device of the first embodiment performing the processing of each step shown in FIG. 2.

The model learning device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. The model learning device executes each process under the control of the central processing unit, for example. Data input to the model learning device and data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device is read out as needed and used in other processes. At least some of the processing units of the model learning device may be implemented in hardware such as an integrated circuit. Each storage unit of the model learning device can be configured, for example, as a main storage device such as RAM, as an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or as middleware such as a relational database or a key-value store. The storage units of the model learning device need only be logically separate, and may be stored in a single physical storage device.

The processing procedure of the model learning method executed by the model learning device of the first embodiment is described below with reference to FIG. 2.

The first learning data storage unit 10 stores first domain learning data, which is a collection of text data belonging to a certain domain (hereinafter, the first domain). The second learning data storage unit 20 stores second domain learning data, which is a collection of text data belonging to a domain different from the first domain (hereinafter, the second domain). Each set of text data need only be large enough to train a language model.

In step S11, the neural language model generation unit 11 reads the first domain learning data stored in the first learning data storage unit 10 and trains a neural language model using that data. The neural language model generation unit 11 stores the trained neural language model in the neural language model storage unit 12.

In step S21, the frequency tensor generation unit 21 reads the second domain learning data stored in the second learning data storage unit 20, counts the frequencies with which words in the second domain learning data occur in succession, and generates a frequency tensor. The frequency tensor generation unit 21 outputs the generated frequency tensor to the probabilistic language model generation unit 22. For a bigram model, for example, this frequency tensor is a frequency table.
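For the bigram case, the counting in step S21 might look like the following sketch. The function names, the tokenized toy sentences, and the table layout (rows indexed by the next word, columns by the previous word) are assumptions for illustration.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count how often each (previous word, next word) pair occurs."""
    counts = Counter()
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            counts[(prev, nxt)] += 1
    return counts


def frequency_table(counts, vocab):
    """Build table[i][j] = number of times vocab[i] follows vocab[j].

    Pairs that never occur are reported as 0 by Counter.
    """
    return [[counts[(prev, nxt)] for prev in vocab] for nxt in vocab]


# Hypothetical second-domain data, already split into words.
SENTENCES = [["a", "b", "a"], ["a", "b", "b"]]
VOCAB = ["a", "b"]
TABLE = frequency_table(bigram_counts(SENTENCES), VOCAB)
```

Here "b" follows "a" twice, "a" follows "b" once, and "b" follows "b" once, giving the table [[0, 1], [2, 1]].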

In step S22, the probabilistic language model generation unit 22 receives the frequency tensor from the frequency tensor generation unit 21 and generates a probabilistic model parameter tensor from it using a probabilistic language model. A common probabilistic language model such as an n-gram model may be used here, and common operations such as smoothing may be included. The probabilistic language model generation unit 22 outputs the generated probabilistic model parameter tensor to the weight parameter inverse calculation unit 23.
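One way to turn a bigram frequency table into a probabilistic model parameter tensor is column normalization with additive smoothing, sketched below. Add-k smoothing is only one example of the "common operations such as smoothing" mentioned above; the choice of add-k, the parameter `k`, and the table layout are assumptions.

```python
def bigram_probabilities(table, k=1.0):
    """Column-normalize a frequency table with add-k smoothing.

    table[i][j] is the count of next word i after previous word j.
    The result probs[i][j] approximates P(next word i | previous word j).
    With k > 0 every probability is strictly positive, so its logarithm
    is defined when the table is later inverted through the softmax.
    """
    size = len(table)
    probs = [[0.0] * size for _ in range(size)]
    for j in range(size):
        total = sum(table[i][j] for i in range(size)) + k * size
        for i in range(size):
            probs[i][j] = (table[i][j] + k) / total
    return probs


# Toy frequency table for a 2-word vocabulary (illustrative values).
TABLE = [[0, 1],
         [2, 1]]
PROBS = bigram_probabilities(TABLE, k=1.0)
```

For the toy table, the first column becomes [0.25, 0.75] and the second becomes [0.5, 0.5].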

In step S23, the weight parameter inverse calculation unit 23 receives the probabilistic model parameter tensor from the probabilistic language model generation unit 22, performs on it an inverse calculation of the nonlinear transformation included in the neural language model, and calculates an amount corresponding to the weight parameters used in the linear transformation included in the neural language model (hereinafter, the weight parameter equivalent amount). The weight parameter inverse calculation unit 23 outputs the calculated weight parameter equivalent amount to the weight parameter correction unit 24. When the weight parameter correction unit 24 is omitted, the calculated weight parameter equivalent amount is output to the weight parameter interpolation unit 25.

The inverse calculation of the weight parameter equivalent amount is performed as follows. First, the softmax function is given by equation (1), where D is the number of dimensions.

The softmax function has the property that its output does not change when a constant is added to its input. This relationship is shown concretely below. Suppose that the output vector of the probabilistic model is a fixed vector P=[p₁, p₂, …, p_d, …, p_D]^T, where […]^T denotes the transpose of a vector. Substituting the output vector P into the softmax function of equation (1) gives equation (2).
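The shift invariance stated above can be checked numerically with a short sketch. The input values and the added constant are arbitrary choices for illustration.

```python
import math


def softmax(x):
    """Softmax with max-subtraction for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [v / total for v in exps]


X = [0.5, -1.0, 2.0]
SHIFTED = [v + 3.7 for v in X]   # add the same constant to every element

# Largest elementwise difference between the two outputs.
MAX_DIFF = max(abs(a - b) for a, b in zip(softmax(X), softmax(SHIFTED)))
```

`MAX_DIFF` is zero up to floating-point rounding, because the common factor exp(3.7) cancels between numerator and denominator.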

Since the denominator of equation (2) is constant with respect to the dimension d, setting it to a constant as in equation (3) yields equations (4) to (6).

With c as a parameter, equation (6) is the equation of a straight line in the D-dimensional space of the input vector X=[x₁, x₂, …, x_D]^T that passes through the point [ln(p₁), ln(p₂), …, ln(p_D)]^T and is parallel to the vector [1, 1, …, 1]^T. In other words, every point on this line gives the same output vector P=[p₁, p₂, …, p_D]^T. An appropriate vector ^X=[^x₁, ^x₂, …, ^x_D]^T can be chosen using any criterion; here, the squared-norm minimization criterion is used, and the vector that minimizes ||[x₁, x₂, …, x_D]^T||₂, that is, the shortest one, is adopted. Denoting the optimal parameter c by ^c, equations (7) to (9) hold.

Substituting equation (9) into equation (6) gives the final inverse calculation formula, equation (10).
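The derivation described in the text can be written out as follows. This is a reconstruction from the prose: equations (1), (2), (6), and (10) follow directly from the description, while the exact form of the intermediate steps (3) to (5) and (7) to (9) is an assumption chosen to be consistent with them. The `align` environment and `\tag` assume the amsmath package.

```latex
\begin{align}
y_d &= \frac{\exp(x_d)}{\sum_{d'=1}^{D}\exp(x_{d'})} \tag{1}\\
p_d &= \frac{\exp(x_d)}{\sum_{d'=1}^{D}\exp(x_{d'})} \tag{2}\\
\sum_{d'=1}^{D}\exp(x_{d'}) &= \exp(c) \tag{3}\\
p_d &= \frac{\exp(x_d)}{\exp(c)} \tag{4}\\
\ln p_d &= x_d - c \tag{5}\\
x_d &= \ln p_d + c \tag{6}\\
\hat{c} &= \operatorname*{arg\,min}_{c}\sum_{d=1}^{D}\left(\ln p_d + c\right)^2 \tag{7}\\
0 &= \sum_{d=1}^{D} 2\left(\ln p_d + \hat{c}\right) \tag{8}\\
\hat{c} &= -\frac{1}{D}\sum_{d=1}^{D}\ln p_d \tag{9}\\
\hat{x}_d &= \ln p_d - \frac{1}{D}\sum_{d'=1}^{D}\ln p_{d'} \tag{10}
\end{align}
```

Under this reading, the minimum-norm solution is the vector of log-probabilities with its mean subtracted, so its elements sum to zero.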

That is, using equation (11) below, the inverse calculation from the probabilistic model parameter tensor to the weight parameter equivalent amount ^X=[^x₁, ^x₂, …, ^x_D]^T can be performed for each posterior probability vector P=[p₁, p₂, …, p_D]^T of the next word given the preceding word string.
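The inverse calculation can be sketched in Python as below, assuming the minimum squared-norm solution takes the form "log-probability minus the mean log-probability", which is what the line-and-shortest-vector argument above implies. The function names and the toy distribution are illustrative assumptions.

```python
import math


def softmax(x):
    """Softmax with max-subtraction for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [v / total for v in exps]


def inverse_softmax_min_norm(p):
    """Return the shortest vector x_hat such that softmax(x_hat) == p.

    All solutions lie on the line ln(p) + c * [1, ..., 1]; the squared
    norm is minimized at c = -mean(ln(p)). Requires every p_d > 0.
    """
    logs = [math.log(v) for v in p]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# One column of a hypothetical probabilistic model parameter matrix.
P = [0.7, 0.2, 0.1]
X_HAT = inverse_softmax_min_norm(P)
RECOVERED = softmax(X_HAT)
```

Applying the softmax to `X_HAT` recovers `P`, and the elements of `X_HAT` sum to zero, which characterizes the minimum-norm point on the line. For a bigram model the function would be applied to each column of the probability matrix.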

In step S24, the weight parameter correction unit 24 receives the weight parameter equivalent amount from the weight parameter inverse calculation unit 23 and corrects it so that its distribution parameters (mean and variance) match those of the weight parameters of the neural language model stored in the neural language model storage unit 12. The weight parameter correction unit 24 outputs the corrected weight parameter equivalent amount to the weight parameter interpolation unit 25.

In theory, the amount calculated by equation (11) corresponds to the weight parameters, but in practice its range of values can deviate substantially. The mean and variance are therefore heuristically matched to those of the weight parameters of the neural language model. Specifically, the weight parameter equivalent amount obtained by the weight parameter inverse calculation unit 23 is normalized to mean 0 and standard deviation 1, multiplied by the standard deviation calculated from the corresponding weight parameters of the neural language model, and then shifted by adding the mean calculated from the same weight parameters. That is, letting μ_p and σ_p be the mean and standard deviation of the weight parameter equivalent amount obtained by the weight parameter inverse calculation unit 23, and μ_n and σ_n be the mean and standard deviation of the corresponding weight parameters, equation (12) is calculated for d=1, 2, …, D to correct the weight parameter equivalent amount ^X=[^x₁, ^x₂, …, ^x_D]^T.
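The correction described above amounts to standardizing each element and rescaling it, that is, $\hat{x}_d \leftarrow \sigma_n (\hat{x}_d - \mu_p)/\sigma_p + \mu_n$. A minimal sketch follows; the use of the population standard deviation, the per-vector scope of the statistics, and the function names are assumptions, since the text does not fix them.

```python
import math


def mean_std(values):
    """Mean and population standard deviation of a list of numbers."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)


def match_mean_std(equiv, weights):
    """Rescale equiv so its mean and std match those of weights.

    equiv:   weight parameter equivalent amounts (mu_p, sigma_p)
    weights: corresponding neural-model weights  (mu_n, sigma_n)
    Assumes equiv is not constant, so sigma_p > 0.
    """
    mu_p, sigma_p = mean_std(equiv)
    mu_n, sigma_n = mean_std(weights)
    return [(v - mu_p) / sigma_p * sigma_n + mu_n for v in equiv]


# Illustrative values only.
EQUIV = [-2.0, 0.0, 2.0]
WEIGHTS = [0.1, 0.2, 0.3]
CORRECTED = match_mean_std(EQUIV, WEIGHTS)
```

In this toy case the corrected values coincide with `WEIGHTS`, because `EQUIV` already has the same shape and differs only in scale and offset.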

In step S25, the weight parameter interpolation unit 25 receives the weight parameter equivalent amount from the weight parameter correction unit 24 or the weight parameter inverse calculation unit 23, and obtains new weight parameters by linearly interpolating between the weight parameters of the neural language model stored in the neural language model storage unit 12 and the received weight parameter equivalent amount. The weight parameter interpolation unit 25 updates the weight parameters of the neural language model stored in the neural language model storage unit 12 to the new weight parameters.
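Step S25 can be sketched as an elementwise linear interpolation of two matrices of the same shape. The interpolation coefficient `lam` and the toy matrices are assumptions; the text does not state how the coefficient is chosen.

```python
def interpolate_weights(weights, equiv, lam):
    """Return (1 - lam) * weights + lam * equiv, element by element.

    weights: weight matrix of the trained neural language model
    equiv:   weight parameter equivalent amounts of the same shape
    lam:     hypothetical interpolation coefficient in [0, 1]
    """
    return [
        [(1.0 - lam) * w + lam * e for w, e in zip(row_w, row_e)]
        for row_w, row_e in zip(weights, equiv)
    ]


# Illustrative values only.
W_FIRST_DOMAIN = [[1.0, -1.0],
                  [0.0, 2.0]]
W_EQUIVALENT = [[0.0, 1.0],
                [2.0, 0.0]]
W_NEW = interpolate_weights(W_FIRST_DOMAIN, W_EQUIVALENT, lam=0.5)
```

With `lam=0` the original weights are kept unchanged, and with `lam=1` they are replaced entirely by the equivalent amounts.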

[Modification 1]
When there is a word that is not included in the second domain learning data but is included in the neural language model (hereinafter, a mismatch word), the frequency tensor generation unit 21 determines a virtual frequency for the mismatch word by the following method and generates the frequency tensor. For words not included in the second domain learning data, it is desirable that the posterior probability output by the probabilistic language model not change greatly from the posterior probability output by the original neural language model. The constraint used is therefore that the posterior probability of the mismatch word, obtained by applying the softmax function to the weight parameters of the neural language model, does not change. This constraint is easy to solve when there is only one mismatch word. Specifically, let the input vector of the softmax function be X=[x₁, x₂, …, x_D]^T, and let the last element x_D be the frequency of the mismatch word, whose virtual value is to be determined. For the remaining elements x₁, x₂, …, x_{D-1}, concrete frequencies are assumed to have been obtained by counting the second domain learning data. Let the output vector be Y=[y₁, y₂, …, y_D]^T, and let the last element y_D be the posterior probability of the mismatch word obtained by applying the softmax function to the weight parameters of the neural language model; this value is kept unchanged. This relationship is expressed by equation (13).

Solving this for x_D gives equation (14). Equation (14) is calculated for each condition on the preceding word, and the resulting x_D is used as the virtual frequency of the mismatch word. When there are an arbitrary number of mismatch words rather than one, the simultaneous equations set up from the above relationship may be solved, and the solutions used as the virtual frequencies of the respective mismatch words.

[Modification 2]
In the first embodiment, the weight parameter inverse calculation unit 23 uses the natural logarithm ln, but a logarithm with any base may be used.

[Modification 3]
In the first embodiment, the weight parameter inverse calculation unit 23 uses the squared-norm minimization criterion and takes the vector ^X=[^x₁, ^x₂, …, ^x_D]^T that minimizes ||[x₁, x₂, …, x_D]^T||₂ as the weight parameter equivalent amount, but any criterion may be used to determine the vector ^X. For example, an arbitrary Lp norm may be used, or the error may be minimized so that the vector approaches the weight parameters of the neural language model.

[Modification 4]
The weight parameter correction unit 24 may change the range of values that the weight parameter equivalent amount can take by using an arbitrary function, such as a fixed function or a function that uses the parameters of the neural language model.

[Modification 5]
In the first embodiment, the weight parameter interpolation unit 25 linearly interpolates between the weight parameter equivalent amount and the weight parameters of the neural language model, but the interpolation may be performed by any common interpolation method, such as the geometric mean.

[Modification 6]
In the first embodiment, the co-occurrence frequencies are counted over word strings, but any discrete value sequence, such as a character string or a phoneme string, may be used instead.

[Modification 7]
The first embodiment targets a neural language model, that is, a language model based on a neural network, but some or all of the weight parameters of any other neural network intended for discriminating discrete value sequences may be targeted.

[Modification 8]
In the first embodiment, the probabilistic language model generation unit 22 outputs a probabilistic model parameter tensor using a probabilistic language model. Alternatively, any modeling method that outputs the probability of the next word predicted from the preceding words, such as an RNN or LSTM, may be used, and the sequence of probability values it outputs may be used in place of the probabilistic model parameter tensor.

[Modification 9]
The first embodiment describes a configuration in which model learning starts from the weight parameters of the neural network learned from the first domain learning data. Alternatively, model learning may start from the weight parameter equivalent amount obtained by the weight parameter inverse calculation unit 23 (that is, the amount inversely calculated from the output vectors of the probabilistic model learned from the second domain learning data). Model learning may also be performed using, as the prior distribution, a Gaussian distribution whose mean is the weight parameter equivalent amount obtained by the weight parameter inverse calculation unit 23.
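One common way to use a Gaussian prior centered on the weight parameter equivalent amount is to add its negative log density, a quadratic penalty, to the training loss. The sketch below assumes an isotropic Gaussian with a hypothetical standard deviation `sigma` and flattened weight vectors; none of these details are specified in the text.

```python
def gaussian_prior_penalty(weights, prior_mean, sigma):
    """Negative log of an isotropic Gaussian prior, up to a constant.

    weights:    current weight parameters (flattened)
    prior_mean: weight parameter equivalent amounts (flattened)
    sigma:      hypothetical prior standard deviation
    """
    squared = sum((w - m) ** 2 for w, m in zip(weights, prior_mean))
    return squared / (2.0 * sigma ** 2)


def gaussian_prior_gradient(weights, prior_mean, sigma):
    """Gradient of the penalty with respect to each weight."""
    return [(w - m) / sigma ** 2 for w, m in zip(weights, prior_mean)]


# Illustrative values only.
PENALTY = gaussian_prior_penalty([1.0, 2.0], [0.0, 0.0], sigma=1.0)
GRADIENT = gaussian_prior_gradient([1.0, 2.0], [0.0, 0.0], sigma=1.0)
```

Adding this gradient to the gradient of the data loss pulls the weights toward the equivalent amounts, with a smaller `sigma` giving a stronger pull.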

[Second embodiment]
The first embodiment describes, as an example of sequence modeling, a general configuration in which the present invention is applied to a language model based on a neural network. The second embodiment describes, as a more practical example, a configuration in which the present invention is applied to an acoustic model using a CTC neural network.

Connectionist Temporal Classification (hereinafter, CTC), which is used mainly for speech recognition, is a kind of sequence transduction model based on machine learning with a neural network (NN). It is a framework that lets the neural network perform a function equivalent to that of a hidden Markov model (HMM). In the NN-HMM hybrid approach currently in common use for speech recognition, the acoustic model that converts sound into symbols is constrained so that the input sequence and the output sequence have the same length. CTC, in contrast, introduces a blank symbol in addition to the ordinary output symbols, which allows the NN acoustic model to perform a conversion that shortens the sequence. In speech recognition, an acoustic model or a speech recognizer can therefore be learned that takes an acoustic feature vector for each unit time (hereinafter also called a frame) as input and directly outputs a sequence of phonemes, characters, words, or the like (see Reference 1).
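As an illustration of how the blank symbol lets the output be shorter than the input, the sketch below applies the collapsing rule of the standard CTC formulation to a per-frame label sequence: consecutive repeats are merged and blanks are then dropped. The merging of repeats comes from the standard CTC formulation rather than from this description, and the function name and the `_` blank marker are hypothetical.

```python
BLANK = "_"  # hypothetical marker for the blank symbol


def ctc_collapse(frame_labels, blank=BLANK):
    """Map a per-frame label sequence to an output symbol sequence.

    A label is emitted only when it differs from the previous frame's label
    and is not the blank symbol, so a blank between two identical labels
    keeps them as two separate output symbols.
    """
    output = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output
```

For example, the 14 frames `__hh_e_ll_l_oo` collapse to the 5 symbols `h e l l o`; the blank between the two runs of `l` is what preserves the double letter.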

[Reference 1] Yajie Miao, Mohammad Gowayyed, and Florian Metze, "EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding", 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), IEEE, 2015.

The speech recognition system of the second embodiment includes, for example, an acoustic model learning device and a speech recognition device. The acoustic model learning device uses learning data that includes learning speech and text information about each learning speech (symbol information that the speech is to be converted into, for example, characters, phonemes, or HMM states). It inputs the acoustic feature vectors generated from the learning speech together with the text information, which serves as the correct target sequence for those vectors, and learns a CTC acoustic model from these pairs. The speech recognition device recognizes input speech using the acoustic model learned by the acoustic model learning device.

As shown in FIG. 3, the acoustic model learning device includes a learning data storage unit 30, a context-preserving vector generation unit 31, a posterior probability calculation unit 32, a context-preserving vector calculation unit 33, a context-preserving vector connection unit 34, an acoustic model storage unit 35, a text data storage unit 40, a frequency tensor generation unit 41, a probabilistic language model generation unit 42, a weight parameter inverse calculation unit 43, a weight parameter correction unit 44, and a weight parameter interpolation unit 45. The weight parameter correction unit 44 is not essential and may be omitted. The acoustic model learning method of the second embodiment is realized by the acoustic model learning device performing the processing of each step shown in FIG. 5.

As shown in FIG. 4, the speech recognition device includes an acoustic model storage unit 35, a language model storage unit 36, and a speech recognition unit 37. The speech recognition method of the second embodiment is realized by the speech recognition device performing the processing of each step shown in FIG. 6.

The acoustic model learning device and the speech recognition device are each a special device configured, for example, by loading a special program into a known or dedicated computer that has a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. Each device executes its processing under the control of the central processing unit, for example. Data input to each device and data obtained in each process are stored, for example, in the main storage device, and are read out as needed for use in other processing. At least some of the processing units of each device may be implemented in hardware such as an integrated circuit. Each storage unit of the devices can be implemented, for example, by a main storage device such as RAM, by an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or by middleware such as a relational database or a key-value store. The storage units need only be logically separate, and may be held in a single physical storage device.

The acoustic model learning method executed by the acoustic model learning device of the second embodiment is described below with reference to FIG. 5.

The learning data storage unit 30 of the acoustic model learning device stores the learning data used to learn the acoustic model. The learning data includes learning speech and text information about the content of each learning speech (for example, HMM states, phonemes, characters, or words). The learning data may be collected manually or generated automatically with a known learning data generation technique. A sufficient amount of learning data is prepared in advance and stored in the learning data storage unit 30.

The acoustic model storage unit 35 of the acoustic model learning device stores a CTC acoustic model. In the initial state, a CTC acoustic model obtained with conventional techniques may be prepared and stored.

In step S31, the context-preserving vector generation unit 31 generates the context-preserving vector K_0 = [k_{0,1}, k_{0,2}, k_{0,3}, k_{0,4}, …]^T. Each dimension of K_0 is initialized to an arbitrary value (for example, 0 or 1). The context-preserving vector generation unit 31 can operate without any input, but it may also take the initialization value (for example, 0 or 1) as input. The generated context-preserving vector K_0 is connected at an arbitrary position in the input layer or a hidden layer of the CTC neural network of the acoustic model stored in the acoustic model storage unit 35. Here, connecting means arranging for the information in the context-preserving vector to influence the parameters of the CTC neural network. For example, the context-preserving vector is attached before or after a hidden layer of the original CTC neural network, which creates a new vector whose length is the sum of the lengths of the hidden layer vector and the context-preserving vector. Alternatively, a separate matrix that converts each context-preserving vector to the size of a hidden layer or the output layer is prepared, and the result of the conversion is added to the output layer vector of the original CTC neural network.
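The initialization and the first connection method in step S31 can be sketched as follows, assuming vectors are plain Python lists. The function names and the `before` flag are hypothetical; the description only states that the context-preserving vector is attached before or after the hidden layer vector.

```python
def init_context_vector(dim, fill=0.0):
    """Create K_0 with every dimension set to the same initial value (e.g. 0 or 1)."""
    return [fill] * dim


def concat_context(hidden, context, before=False):
    """Attach the context-preserving vector before or after a hidden layer vector.

    The result has length len(hidden) + len(context).
    """
    if before:
        return list(context) + list(hidden)
    return list(hidden) + list(context)
```

With concatenation, the layer that consumes the result needs a weight matrix with `len(context)` additional input columns, which is where the context information starts to influence the network.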

In step S32, the posterior probability calculation unit 32 inputs the acoustic feature vector X_{t+1} = [x_{t+1,1}, x_{t+1,2}, x_{t+1,3}, x_{t+1,4}, …]^T at time t+1 (t ≥ 0), extracted from the learning speech stored in the learning data storage unit 30, together with the context-preserving vector K_t at the preceding time t, into the CTC neural network of the acoustic model stored in the acoustic model storage unit 35. It obtains the posterior probability vector for the output symbols, C_{t+1} = [c_{t+1,1}, c_{t+1,2}, c_{t+1,3}, c_{t+1,4}, …]^T (hereinafter, the output posterior probability vector), and the blank symbol probability φ_{t+1}, which represents the probability that the output symbol is the blank symbol. The output posterior probability vector C_{t+1} and the correct sequence are input to the error function of the CTC neural network and used to update the parameters of the CTC neural network. The posterior probability calculation unit 32 outputs the output posterior probability vector C_{t+1} and the blank symbol probability φ_{t+1} to the context-preserving vector calculation unit 33.

In step S33, the context-preserving vector calculation unit 33 receives the output posterior probability vector C_{t+1} and the blank symbol probability φ_{t+1} from the posterior probability calculation unit 32. Based on φ_{t+1}, it updates the context-preserving vector K_t = [k_{t,1}, k_{t,2}, k_{t,3}, k_{t,4}, …]^T at the preceding time t to generate the context-preserving vector K_{t+1} = [k_{t+1,1}, k_{t+1,2}, k_{t+1,3}, k_{t+1,4}, …]^T at the current time t+1. For example, the unit calculates K_{t+1} so that, when φ_{t+1} indicates a blank symbol, it retains the output posterior probability vector from the last time the CTC neural network output a non-blank symbol, and, when φ_{t+1} indicates a non-blank symbol, it records the output posterior probability vector C_{t+1} that the CTC neural network has just output. The context-preserving vector calculation unit 33 outputs the calculated context-preserving vector K_{t+1} to the context-preserving vector connection unit 34.

The context-preserving vector is calculated, for example, with an update rule similar to a flip-flop circuit in electronics. The blank symbol probability φ_{t+1} output by the CTC neural network takes a value from 0 to 1, and a value closer to 1 indicates a higher likelihood that the symbol is the blank symbol. For simplicity, consider the two extreme cases:

$$K_{t+1} = \begin{cases} C_{t+1} & (\varphi_{t+1} = 0) \\ K_t & (\varphi_{t+1} = 1) \end{cases} \tag{15}$$

That is, when the symbol is not the blank symbol (φ_{t+1} = 0), the contents of the output posterior probability vector C_{t+1} at the current time t+1 are recorded, and when it is the blank symbol (φ_{t+1} = 1), the contents of the context-preserving vector K_t at the preceding time t are retained. Specifically, the blank symbol probability φ_{t+1} is binarized, for example by setting φ_{t+1} = 1 if the output value is greater than or equal to a predetermined threshold and φ_{t+1} = 0 if it is less than the threshold, and the update rule of equation (15) is then applied. Alternatively, equation (16) may be defined and used as an update rule that extends equation (15) naturally and continuously so that both extremes are included:

$$K_{t+1} = \left([1, \ldots, 1]^T - \Phi_{t+1}\right) \odot C_{t+1} + \Phi_{t+1} \odot K_t \tag{16}$$

Here, ⊙ denotes the element-wise product, and [1, …, 1]^T denotes a column vector of ones. Φ_{t+1} is the column vector in which the blank symbol probability φ_{t+1} is repeated as many times as the output posterior probability vector C_{t+1} has dimensions, that is, Φ_{t+1} = [φ_{t+1}, …, φ_{t+1}]^T.

Equation (16) can also be written out element by element as equation (17):

$$k_{t+1,i} = (1 - \varphi_{t+1})\, c_{t+1,i} + \varphi_{t+1}\, k_{t,i} \tag{17}$$
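The two update rules can be sketched in code as follows. This is an illustrative reading rather than the patent's reference implementation: the binary rule keeps the previous vector on a blank and records the new posterior otherwise, and the continuous rule is taken to be the convex combination that reduces to the binary rule at φ = 0 and φ = 1. The function names and the default threshold of 0.5 are assumptions; the description only says "a predetermined threshold".

```python
def update_context_binary(k_prev, c_new, phi, threshold=0.5):
    """Flip-flop style update after binarizing the blank symbol probability.

    phi >= threshold is treated as a blank: keep the previous context vector.
    Otherwise record the newly output posterior probability vector.
    """
    is_blank = phi >= threshold
    return list(k_prev) if is_blank else list(c_new)


def update_context_soft(k_prev, c_new, phi):
    """Continuous update: each element is (1 - phi) * c + phi * k."""
    return [(1.0 - phi) * c + phi * k for c, k in zip(c_new, k_prev)]
```

The soft rule needs no threshold and stays differentiable with respect to `phi`, which can matter if the update is placed inside the network during learning.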

Instead of the posterior probability vector C_{t+1} for the output symbols, the context-preserving vector calculation unit 33 may use other internal parameters of the neural network, such as the input layer, the first hidden layer, or the final hidden layer shown in FIG. 7, or the input acoustic feature vector X_{t+1}. The context-preserving vector calculation unit 33 may also reduce the dimensionality of the context-preserving vector K_{t+1} with a known dimension reduction technique before outputting it, or may output it after applying a fixed, predetermined function such as averaging, normalization, or discretization.

In step S34, the context-preserving vector connection unit 34 receives the context-preserving vector K_{t+1} from the context-preserving vector calculation unit 33 and connects it to the CTC neural network stored in the acoustic model storage unit 35. The connection is performed each time an updated context-preserving vector is received, that is, at every time step. When the context-preserving vector is connected, as shown in FIG. 8, a CTC neural network is created in which a linear transformation using a matrix that represents the Markov property of the output symbol sequence is applied to the vector, and the result is then integrated directly into the output layer, for example by addition. The matrix that receives the context-preserving vector in this linear transformation is a matrix representing the Markov property of the output symbol sequence that reflects information from external language resources or the like. The matrix is made to reflect such information by an operation that modifies its values from outside (for example, linear interpolation), using symbol transition probabilities calculated by aggregating the external language resources.
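The additive integration in step S34 can be sketched as follows. The sketch assumes that the linearly transformed context-preserving vector is added to the pre-softmax activations of the output layer; the description says only that it is integrated into the output layer "by addition or the like", so the exact point of addition, like the function and parameter names, is an assumption.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def output_with_context(logits, context, markov_weights):
    """Add markov_weights @ context to the output layer activations, then normalize.

    markov_weights[i][j] is the contribution of context dimension j
    (the previous symbol) to output symbol i (the next symbol).
    """
    bias = [sum(w * k for w, k in zip(row, context)) for row in markov_weights]
    return softmax([l + b for l, b in zip(logits, bias)])
```

With an all-zero matrix the output is unchanged, so the matrix acts purely as a symbol-transition bias on top of the acoustic evidence.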

The matrix representing the Markov property of the output symbol sequence, which is used to linearly transform the context-preserving vector K_{t+1}, is learned by the processing of steps S41 to S45. This matrix may be learned at any time before the context-preserving vector K_{t+1} is connected to the CTC neural network. Hereinafter, the matrix representing the Markov property of the output symbol sequence is also called the weight parameter.

The text data storage unit 40 stores a plurality of pieces of text data, each containing at least one word string. The text data may be collected manually, collected automatically from language resources available on the Internet or elsewhere, or generated automatically with a known learning data generation technique. A sufficient amount of text data is prepared in advance and stored in the text data storage unit 40.

In step S41, the frequency tensor generation unit 41 reads the text data stored in the text data storage unit 40, counts how often the words in the text data occur consecutively, and generates a frequency tensor. The frequency tensor generation unit 41 outputs the generated frequency tensor to the probabilistic language model generation unit 42.
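For the rank-2 (bigram) case, the counting in step S41 can be sketched as follows. The sketch assumes already tokenized text and stores the frequency tensor sparsely as a mapping from (previous word, next word) to a count; the function name and this representation are illustrative choices, not taken from the description.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count how often each pair of adjacent words occurs.

    sentences: iterable of token lists. Returns Counter[(prev, next)] -> count,
    a sparse form of the rank-2 frequency tensor.
    """
    counts = Counter()
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            counts[(prev, nxt)] += 1
    return counts
```

For a higher-order n-gram model the key would become an n-tuple, which corresponds to a rank-n tensor.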

In step S42, the probabilistic language model generation unit 42 receives the frequency tensor from the frequency tensor generation unit 41 and generates a probabilistic model parameter tensor from it using a probabilistic language model. A common probabilistic language model, such as an n-gram model, may be used, and common operations such as smoothing may be included. The probabilistic language model generation unit 42 outputs the generated probabilistic model parameter tensor to the weight parameter inverse calculation unit 43.
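Step S42 can be sketched for the bigram case as a normalization of the counts into conditional probabilities. The sketch assumes add-alpha (Laplace) smoothing as the smoothing operation; the description mentions smoothing only in general terms, so the method, the function name, and the `alpha` parameter are assumptions.

```python
def bigram_probabilities(counts, vocab, alpha=1.0):
    """Turn bigram counts into P(next | prev) with add-alpha smoothing.

    counts: mapping (prev, next) -> count. alpha must be positive so that
    every probability is nonzero and unseen contexts stay well defined.
    Returns table[prev][next].
    """
    size = len(vocab)
    table = {}
    for prev in vocab:
        total = sum(counts.get((prev, nxt), 0) for nxt in vocab)
        table[prev] = {
            nxt: (counts.get((prev, nxt), 0) + alpha) / (total + alpha * size)
            for nxt in vocab
        }
    return table
```

Keeping every probability strictly positive matters for the later inverse calculation, because a zero probability has no finite pre-softmax counterpart.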

In step S43, the weight parameter inverse calculation unit 43 receives the probabilistic model parameter tensor from the probabilistic language model generation unit 42 and applies to it the inverse calculation of the nonlinear transformation contained in the CTC neural network. This generates a quantity that corresponds to the matrix representing the Markov property of the output symbol sequence, that is, to the weight parameter (hereinafter, the weight parameter equivalent). The inverse calculation may be performed with equation (11), in the same way as in the weight parameter inverse calculation unit of the first embodiment. The weight parameter inverse calculation unit 43 outputs the calculated weight parameter equivalent to the weight parameter correction unit 44. If the weight parameter correction unit 44 is omitted, the calculated weight parameter equivalent is output to the weight parameter interpolation unit 45.
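Equation (11) is defined in the first embodiment and is not reproduced here. As one illustration of what an inverse calculation of softmax can look like, the sketch below assumes the element-wise logarithm, which maps a probability vector to pre-softmax values that softmax maps back to the same vector. This choice, the clipping constant `eps`, and the function names are assumptions and may differ from equation (11).

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def inverse_softmax(p, eps=1e-12):
    """Return pre-softmax values whose softmax is p (defined up to an additive constant).

    Probabilities are clipped at eps so that log stays finite.
    """
    return [math.log(max(v, eps)) for v in p]
```

For a probabilistic model parameter matrix, the same operation would be applied to each conditional distribution, for example column by column.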

In step S44, the weight parameter correction unit 44 receives the weight parameter equivalent from the weight parameter inverse calculation unit 43 and corrects it so that its distribution parameters (mean and variance) match those of the weight parameter stored in the acoustic model storage unit 35. The correction may be performed with equation (12), in the same way as in the weight parameter correction unit of the first embodiment. The weight parameter correction unit 44 outputs the corrected weight parameter equivalent to the weight parameter interpolation unit 45.
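Equation (12) is likewise defined in the first embodiment and is not reproduced here. The sketch below shows one plausible way to match mean and variance: standardize the weight parameter equivalents, then rescale and shift them to the statistics of the stored weights. The formula and all names are assumptions and may differ from equation (12); values are treated as a flattened list.

```python
import math


def mean_and_std(values):
    """Population mean and standard deviation of a list of floats."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)


def match_distribution(equiv, reference):
    """Shift and scale equiv so its mean and variance equal those of reference."""
    mu_e, sd_e = mean_and_std(equiv)
    mu_r, sd_r = mean_and_std(reference)
    if sd_e == 0.0:
        # All values identical: only the mean can be matched.
        return [mu_r for _ in equiv]
    return [(v - mu_e) / sd_e * sd_r + mu_r for v in equiv]
```

Without such a correction, log-probability values can lie on a very different scale from learned weights, so the interpolation that follows would be dominated by one of the two terms.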

In step S45, the weight parameter interpolation unit 45 receives the weight parameter equivalent from the weight parameter correction unit 44 or the weight parameter inverse calculation unit 43. It obtains a new weight parameter by linearly interpolating between the weight parameter of the CTC neural network stored in the acoustic model storage unit 35 and the received weight parameter equivalent. The weight parameter interpolation unit 45 then replaces the weight parameter of the CTC neural network stored in the acoustic model storage unit 35 with the new weight parameter.
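The linear interpolation in step S45 can be sketched as follows, with matrices as lists of rows. The interpolation coefficient `lam` and the function name are hypothetical; this part of the description does not state how the coefficient is chosen.

```python
def interpolate_weights(w_old, w_equiv, lam=0.5):
    """Element-wise linear interpolation of two matrices of the same shape.

    lam = 0 keeps the stored weights; lam = 1 replaces them with the
    weight parameter equivalents derived from the text data.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return [
        [(1.0 - lam) * a + lam * b for a, b in zip(row_old, row_equiv)]
        for row_old, row_equiv in zip(w_old, w_equiv)
    ]
```

In practice `lam` would typically be tuned on held-out data from the target domain.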

The speech recognition method executed by the speech recognition device of the second embodiment is described below with reference to FIG. 6.

The acoustic model storage unit 35 of the speech recognition device stores the CTC acoustic model learned by the acoustic model learning device.

The language model storage unit 36 of the speech recognition device stores the language model used for speech recognition. Any type of language model may be used as long as the speech recognition unit 37 can use it when performing speech recognition.

In step S37, the speech recognition unit 37 recognizes the input speech using the acoustic model stored in the acoustic model storage unit 35 and the language model stored in the language model storage unit 36, and outputs the recognition result. The speech recognition unit 37 may be a speech recognizer that uses the CTC acoustic model alone, or one that combines the CTC acoustic model with a weighted finite-state transducer (WFST).

The second embodiment described a speech recognition system in which the acoustic model learning device and the speech recognition device are separate devices. They may instead be combined into a single speech recognition device that has all the functions of both. That is, it is also possible to configure a speech recognition device that includes the learning data storage unit 30, the context-preserving vector generation unit 31, the posterior probability calculation unit 32, the context-preserving vector calculation unit 33, the context-preserving vector connection unit 34, the acoustic model storage unit 35, the language model storage unit 36, the speech recognition unit 37, the text data storage unit 40, the frequency tensor generation unit 41, the probabilistic language model generation unit 42, the weight parameter inverse calculation unit 43, the weight parameter correction unit 44, and the weight parameter interpolation unit 45.

With the above configuration, the acoustic model learning device of the second embodiment can learn the matrix representing the Markov property of the output symbol sequence from text data, an external language resource that is distinct from the learning data used to learn the CTC acoustic model. The learning data consists of learning speech and its transcribed text, so it is costly to collect. Text data, in contrast, can be drawn from existing resources such as the Internet, so a sufficient amount can be collected at low cost. The acoustic model learning device of the second embodiment can therefore efficiently learn a highly accurate CTC acoustic model.

[Third embodiment]
The weight parameter inverse calculation unit, the weight parameter correction unit, and the weight parameter interpolation unit described in the above embodiments can also be configured as an independent device: a weight parameter interpolation device that takes probabilistic model parameters as input and interpolates the weight parameters of a neural network.

As shown in FIG. 9, the weight parameter interpolation device of the third embodiment includes a weight parameter storage unit 50, a weight parameter inverse calculation unit 51, a weight parameter correction unit 52, and a weight parameter interpolation unit 53. The weight parameter correction unit 52 is not essential and may be omitted. The weight parameter interpolation method of the third embodiment is realized by the weight parameter interpolation device performing the processing of each step shown in FIG. 10.

The weight parameter interpolation device is a special device configured, for example, by loading a special program into a known or dedicated computer that has a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. The device executes its processing under the control of the central processing unit, for example. Data input to the device and data obtained in each process are stored, for example, in the main storage device, and are read out as needed for use in other processing. At least some of the processing units of the device may be implemented in hardware such as an integrated circuit. Each storage unit of the device can be implemented, for example, by a main storage device such as RAM, by an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or by middleware such as a relational database or a key-value store.

The procedure of the weight parameter interpolation method executed by the weight parameter interpolation device of the third embodiment is described below with reference to FIG. 10.

The weight parameter storage unit 50 stores the weight parameters used in a linear transformation contained in a neural network learned from learning data belonging to a certain domain (hereinafter, first learning data).

In step S51, the weight parameter inverse calculation unit 51 applies the inverse calculation of the nonlinear transformation of the neural network to the probabilistic model parameter tensor input to the weight parameter interpolation device, and generates a weight parameter equivalent. The probabilistic model parameter tensor is prepared as follows. First, a frequency tensor is generated from learning data belonging to a domain different from that of the first learning data (hereinafter, second learning data); it represents how often processing units (for example, words, morphemes, characters, phonemes, or the elements of any other discrete-valued sequence) occur consecutively. The probabilistic model parameter tensor is then generated from that frequency tensor using a probabilistic model. The inverse calculation may be performed with equation (11), in the same way as in the weight parameter inverse calculation unit of the first embodiment. The weight parameter inverse calculation unit 51 outputs the calculated weight parameter equivalent to the weight parameter correction unit 52. If the weight parameter correction unit 52 is omitted, the calculated weight parameter equivalent is output to the weight parameter interpolation unit 53.

In step S52, the weight parameter correction unit 52 receives the weight parameter equivalent amount from the weight parameter inverse calculation unit 51 and corrects it so that its distribution parameters (mean and variance) match those of the weight parameters stored in the weight parameter storage unit 50. The correction of the weight parameter equivalent amount may be performed by equation (12), as in the weight parameter correction unit of the first embodiment. The weight parameter correction unit 52 outputs the corrected weight parameter equivalent amount to the weight parameter interpolation unit 53.
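One way to match the mean and variance in step S52 is an affine rescaling of the weight parameter equivalent amount. The sketch below assumes that statistics are taken over all elements of each matrix; equation (12) is not reproduced in this section and may differ (for example, it may work per row or per column). The name `match_moments` is hypothetical.

```python
import math

def match_moments(equiv, reference):
    """Shift and scale `equiv` so that its overall mean and variance
    equal those of `reference` (both are lists of rows of floats)."""
    def mean_std(values):
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        return mu, math.sqrt(var)

    mu_e, sd_e = mean_std([v for row in equiv for v in row])
    mu_r, sd_r = mean_std([v for row in reference for v in row])
    if sd_e == 0.0:
        raise ValueError("equiv has zero variance and cannot be rescaled")
    scale = sd_r / sd_e
    return [[(v - mu_e) * scale + mu_r for v in row] for row in equiv]

# Toy values standing in for the stored weights and the equivalent amount.
stored_weights = [[0.0, 2.0], [4.0, 6.0]]
equiv = [[1.0, 1.5], [2.0, 2.5]]
corrected = match_moments(equiv, stored_weights)
```

Note that rescaling logits changes the sharpness of the distribution they produce under softmax, so the corrected amount no longer reproduces the original probabilities exactly; the correction trades that for compatibility with the scale of the stored weights.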

In step S53, the weight parameter interpolation unit 53 receives the weight parameter equivalent amount from the weight parameter correction unit 52 or the weight parameter inverse calculation unit 51, and obtains new weight parameters by linearly interpolating between the weight parameters stored in the weight parameter storage unit 50 and the received weight parameter equivalent amount. The weight parameter interpolation unit 53 outputs the new weight parameters as the output of the weight parameter interpolation device.
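The linear interpolation of step S53 can be illustrated with a minimal sketch. The interpolation coefficient `lam` and the function name are assumptions; this section does not state how the coefficient is chosen.

```python
def interpolate_weights(weights, equiv, lam):
    """Element-wise linear interpolation between two matrices of equal shape.

    lam = 0 keeps the stored weights; lam = 1 uses only the
    weight parameter equivalent amount.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [
        [(1.0 - lam) * w + lam * e for w, e in zip(w_row, e_row)]
        for w_row, e_row in zip(weights, equiv)
    ]

# Toy values: stored weights from the first domain, equivalent amount from the second.
new_weights = interpolate_weights([[0.0, 2.0], [4.0, 6.0]],
                                  [[1.0, 1.0], [1.0, 1.0]],
                                  lam=0.25)
```

A small `lam` keeps the adapted model close to the original network, while a larger value moves it toward the statistics of the second learning data.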

[Points of the Invention]
The softmax function used in a neural network is not injective, so it has no inverse function. However, by considering a corresponding operation (that is, an inverse calculation), a probability vector can be converted into a quantity equivalent to the weight parameters used in the neural network. Interpolating between the weight parameters of the original neural network and the parameter tensor of a probabilistic model computed from text data of another domain, after that tensor has been converted into a weight parameter equivalent amount, adapts the model to the other domain. Furthermore, training that exploits this property, for example retraining with the converted quantity as an initial value or as a constraint, improves the model itself rather than merely rescoring its output.
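The lack of an inverse function for softmax can be made explicit with a short derivation. The notation (logit vector z, scalar constant c, all-ones vector 1, probability vector p) is introduced here for illustration and is not taken from the numbered equations of the specification.

```latex
\mathrm{softmax}(\mathbf{z} + c\,\mathbf{1})_i
  = \frac{e^{z_i + c}}{\sum_j e^{z_j + c}}
  = \frac{e^{c}\, e^{z_i}}{e^{c} \sum_j e^{z_j}}
  = \mathrm{softmax}(\mathbf{z})_i
\qquad \text{for every } c \in \mathbb{R}.
```

Every vector of the form z + c1 therefore maps to the same probability vector, and for any p with strictly positive entries, z = log p + c1 is a pre-image for every real c. An inverse calculation has to select one representative from this family; how equation (11) makes that selection is not stated in this section.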

Embodiments of the present invention have been described above, but the specific configuration is not limited to these embodiments; it goes without saying that design changes made as appropriate without departing from the gist of the invention are included in the invention. The various processes described in the embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

[Program, Recording Medium]
When the various processing functions of each device described in the above embodiments are realized by a computer, the processing contents of the functions that each device should have are described by a program. Executing this program on the computer realizes the various processing functions of each device on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer temporarily in its own storage device. When executing a process, the computer reads the program stored in its own storage device and executes the process according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program sequentially each time a program is transferred to it from the server computer. Alternatively, the above processing may be executed by a so-called ASP (Application Service Provider) type service, which does not transfer the program from the server computer to the computer but realizes the processing functions solely through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing contents may instead be realized in hardware.

10 First learning data storage unit
11 Neural language model generation unit
12 Neural language model storage unit
20 Second learning data storage unit
21 Frequency tensor generation unit
22 Probabilistic language model generation unit
23 Weight parameter inverse calculation unit
24 Weight parameter correction unit
25 Weight parameter interpolation unit
30 Learning data storage unit
31 Context-preserving vector generation unit
32 Posterior probability calculation unit
33 Context-preserving vector calculation unit
34 Context-preserving vector connection unit
35 Acoustic model storage unit
36 Language model storage unit
37 Speech recognition unit
40 Text data storage unit
41 Frequency tensor generation unit
42 Probabilistic language model generation unit
43 Weight parameter inverse calculation unit
44 Weight parameter correction unit
45 Weight parameter interpolation unit
50 Weight parameter storage unit
51 Weight parameter inverse calculation unit
52 Weight parameter correction unit
53 Weight parameter interpolation unit

Claims

An acoustic model learning device comprising:
an acoustic model storage unit that stores an acoustic model using a neural network that receives an acoustic feature vector as input and outputs a posterior probability vector for an output symbol corresponding to the acoustic feature vector and an empty symbol probability representing the probability that the output symbol is an empty symbol;
a posterior probability calculation unit that inputs the acoustic feature vector extracted from learning speech into the neural network to obtain the posterior probability vector and the empty symbol probability;
a context-preserving vector calculation unit that calculates a context-preserving vector that, based on the empty symbol probability, selects and retains either the posterior probability vector output by the neural network at the previous time or the posterior probability vector output by the neural network at the current time;
a context-preserving vector connection unit that, each time the context-preserving vector is calculated, linearly transforms the context-preserving vector using a weight parameter, adds the result to the output of the final hidden layer of the neural network, and connects it to the output layer of the neural network;
a probabilistic model generation unit that generates a probabilistic model parameter using a probabilistic model from text data that is a set of word strings;
a weight parameter inverse calculation unit that performs an inverse calculation of the nonlinear transformation included in the neural network on the probabilistic model parameter to calculate a weight parameter equivalent amount; and
a weight parameter interpolation unit that interpolates between the weight parameter used in the linear transformation and the weight parameter equivalent amount.

A model learning device comprising:
a neural network storage unit that stores a neural network that performs a linear transformation on an input vector using a weight parameter learned from first learning data, which is a set of discrete value sequences belonging to a first domain, and performs a nonlinear transformation on the linearly transformed input vector;
a probabilistic model generation unit that generates a probabilistic model parameter using a probabilistic model from second learning data, which is a set of discrete value sequences belonging to a second domain;
a weight parameter inverse calculation unit that performs an inverse calculation of the nonlinear transformation on the probabilistic model parameter to calculate a weight parameter equivalent amount; and
a weight parameter interpolation unit that interpolates between the weight parameter used in the linear transformation and the weight parameter equivalent amount.

The model learning device according to claim 2, further comprising
a weight parameter correction unit that corrects the weight parameter equivalent amount so that the distribution parameters of the weight parameter equivalent amount match the distribution parameters of the weight parameter.

The model learning device according to claim 2 or 3, wherein
the weight parameter interpolation unit interpolates between the weight parameter and the weight parameter equivalent amount by linear interpolation or by a geometric mean.

The model learning device according to any one of claims 2 to 4, wherein,
for mismatch data that is included in the first learning data but not in the second learning data, the probabilistic model generation unit determines the frequency of the mismatch data so that the appearance probability of the mismatch data obtained by applying the nonlinear transformation to the weight parameter does not change, and generates the probabilistic model parameter.

The model learning device according to any one of claims 2 to 5, wherein
the neural network is trained with the weight parameter equivalent amount as an initial value, or with a Gaussian distribution whose mean is the weight parameter equivalent amount as a prior probability.

A model learning method in which:
a neural network storage unit stores a neural network that performs a linear transformation on an input vector using a weight parameter learned from first learning data, which is a set of discrete value sequences belonging to a first domain, and performs a nonlinear transformation on the linearly transformed input vector;
a probabilistic model generation unit generates a probabilistic model parameter using a probabilistic model from second learning data, which is a set of discrete value sequences belonging to a second domain;
a weight parameter inverse calculation unit performs an inverse calculation of the nonlinear transformation on the probabilistic model parameter to calculate a weight parameter equivalent amount; and
a weight parameter interpolation unit interpolates between the weight parameter used in the linear transformation and the weight parameter equivalent amount.

A program for causing a computer to function as the acoustic model learning device according to claim 1 or the model learning device according to any one of claims 2 to 6.