JP7274441B2 - LEARNING DEVICE, LEARNING METHOD AND LEARNING PROGRAM

Info

Publication number: JP7274441B2
Application number: JP2020066879A
Authority: JP
Inventors: 成樹苅田; 厚徳小川; 晋治渡部
Original assignee: Johns Hopkins University
Current assignee: Johns Hopkins University
Priority date: 2020-04-02
Filing date: 2020-04-02
Publication date: 2023-05-16
Anticipated expiration: 2040-04-02
Also published as: JP2021162798A

Description

The present invention relates to a learning device, a learning method, and a learning program.

Conventional speech recognition models train an acoustic model and a language model as separate systems. In contrast, techniques for training end-to-end speech recognition models using neural networks have been attracting attention (see Non-Patent Document 1). Because this approach optimizes, as a whole, the system that takes speech as input and outputs information specifying a symbol string, it enables speech recognition with higher accuracy than conventional speech recognition.

In general, increasing the amount of training data is expected to improve the accuracy of the model obtained by training. For example, when a speech recognition model is trained on ideal training data consisting of pairs of clean, noise-free speech data and their transcripts, speech recognition accuracy improves as the amount of training data increases.

Non-Patent Document 1: J. Chorowski et al., “Attention-Based Models for Speech Recognition”, 2015, Advances in Neural Information Processing Systems 28 (NIPS 2015), pp. 577-585

However, when training data containing noise and the like is used for training, it can be difficult to improve the accuracy of speech recognition. In real-world speech recognition, much of the training data contains noise and the like, and it is difficult to prepare a large amount of clean, noise-free speech data. Moreover, training on a larger amount of data that contains noise, errors, and the like may actually lower the accuracy of speech recognition.

The present invention has been made in view of the above, and its object is to make it possible to improve the accuracy of speech recognition even when training data containing noise and the like is used for training.

To solve the above problem and achieve the object, a learning device according to the present invention comprises: a conversion unit that uses a first neural network to convert the features of an input training speech signal into encoded intermediate features; a first calculation unit that uses a second neural network to calculate, from the intermediate features, a predicted symbol string and a posterior probability of that symbol string based on CTC (Connectionist Temporal Classification); a second calculation unit that uses a third neural network to calculate, from a correct symbol string and the intermediate features, a predicted symbol string and a posterior probability of that symbol string; and an update unit that, when the CTC-based posterior probability is greater than a predetermined threshold, updates the parameters of the first neural network, the second neural network, and the third neural network using a loss function value calculated from the posterior probability calculated by the second calculation unit and the CTC-based posterior probability.

According to the present invention, the accuracy of speech recognition can be improved even when training data containing noise and the like is used for training.

FIG. 1 is a schematic diagram illustrating the schematic configuration of the learning device of this embodiment.
FIG. 2 is a schematic diagram illustrating the schematic configuration of a learning device according to another embodiment.
FIG. 3 is a flowchart showing the learning processing procedure.
FIG. 4 is a diagram showing an example of a computer that executes the learning program.

An embodiment of the present invention is described in detail below with reference to the drawings. The present invention is not limited by this embodiment. In the drawings, identical parts are denoted by the same reference numerals.

[Configuration of learning device]
FIG. 1 is a schematic diagram illustrating the schematic configuration of the learning device of this embodiment. As illustrated in FIG. 1, the learning device 10 of this embodiment is implemented by a general-purpose computer such as a personal computer, and includes a storage unit 11 and a control unit 12.

The storage unit 11 is implemented by a semiconductor memory device such as a RAM (Random Access Memory) or flash memory, or by a storage device such as a hard disk or an optical disk. The storage unit 11 stores, in advance or temporarily each time processing is performed, a processing program that operates the learning device 10, data used while the processing program is running, and the like.

In this embodiment, the storage unit 11 stores the parameters 11a of an end-to-end neural network described later. These parameters 11a are updated by the learning processing described later.

The control unit 12 is implemented using a CPU (Central Processing Unit) or the like, and executes a processing program stored in memory. As illustrated in FIG. 1, the control unit 12 thereby functions as a data selection unit 12a, an encoder 12b, a first decoder 12c, a second decoder 12d, a data cleansing unit 12e, an update unit 12f, and a termination determination unit 12g. Each of these functional units, or some of them, may be implemented on different hardware. The control unit 12 may also include other functional units.

The data selection unit 12a accepts input of training speech signals. Specifically, the data selection unit 12a selects, from the input set of training data, a speech signal to be used in the learning processing described later, and inputs it to the encoder 12b described later. When all speech signals in the training data have been input to the encoder 12b, the processing of the update unit 12f described later may be executed.

The encoder 12b is an example of the conversion unit, and uses the first neural network to convert the features of an input training speech signal into encoded intermediate features. The encoder 12b is, for example, a Transformer encoder. It accepts as input the features X^sub, which are obtained by using a preprocessing neural network to reduce the length and the like of the log-mel filterbank features X^fbank, the features of the speech signal per unit time. The encoder 12b then converts the features X^sub into intermediate features with the first neural network and outputs them.

Let e denote the total number of layers of the first neural network constituting the encoder 12b, and let X_i and X_{i+1} denote the input and output of the i-th layer (i = 0, 1, ..., e-1). As shown in equation (1), each layer i converts its input features X_i into intermediate features X_{i+1} and outputs them. The final layer, layer e-1, outputs the speech features X_e as the intermediate features.
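Equation (1) itself is not reproduced in this text. As a hedged sketch only, a standard Transformer encoder layer consistent with the definitions of PE, MHA, and FF in this section could be written as follows. The residual connections, the primed intermediate X'_i, and the per-layer subscripts on MHA and FF are assumptions and are not taken from the source.

```latex
\begin{align}
X_0 &= X^{\mathrm{sub}} + \mathrm{PE}(n^{\mathrm{sub}}) \\
X'_i &= X_i + \mathrm{MHA}_i(X_i, X_i, X_i) \\
X_{i+1} &= X'_i + \mathrm{FF}_i(X'_i), \qquad i = 0, 1, \dots, e-1
\end{align}
```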

Here, PE is a neural network that takes the frame numbers 1, 2, ..., n^sub as input and outputs d^att-dimensional features. MHA is a neural network that takes three feature sequences as input and outputs a feature sequence with the same dimension and length as the first of them. FF is a neural network consisting of two fully connected layers and a ReLU (Rectified Linear Units) activation layer, which outputs, for each time step, a feature sequence of the same dimension as the input features.

In addition to equation (1), the first neural network constituting the encoder 12b may include, as the preprocessing neural network, for example two CNN (Convolutional Neural Network) layers and a ReLU activation layer. In that case, letting n^sub be the length of the CNN output and d^att the number of channels, each intermediate feature X_i is a vector of n^sub × d^att dimensions.

The encoder 12b is not limited to a Transformer encoder, and may be, for example, an RNN (Recurrent Neural Network) encoder or the like.

The first decoder 12c is an example of the first calculation unit, and uses the second neural network to calculate, from the intermediate features X_e, a predicted symbol string and the CTC-based posterior probability of that symbol string. Here, the predicted symbol string is a new symbol string that includes the symbols following the correct symbol string given as teacher data. The first decoder 12c is, for example, a CTC decoder. Using the second neural network, it calculates posterior probabilities over all alignments, where an alignment is a symbol string in which a symbol is placed at each time (frame) of the intermediate features X_e.

Specifically, using X_e, the output of the encoder 12b, the first decoder 12c calculates and outputs the CTC-based posterior probability p_ctc(Y|X_e) as shown in equation (2).
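Equation (2) is not reproduced in this text. A hedged reconstruction in three expressions, consistent with the descriptions of W^ctc, b^ctc, C, and B^-1 in this section and with the standard CTC formulation, is sketched below. The softmax output layer and the upper limit T on the frame index are assumptions.

```latex
\begin{align}
C &= \mathrm{softmax}\left(X_e W^{\mathrm{ctc}} + b^{\mathrm{ctc}}\right) \\
p(\pi \mid X_e) &= \prod_{t=1}^{T} C[t, \pi[t]] \\
p_{\mathrm{ctc}}(Y \mid X_e) &= \sum_{\pi \in B^{-1}(Y)} p(\pi \mid X_e)
\end{align}
```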

Here, the weight matrix W^ctc and the bias vector b^ctc are parameters of the second neural network and have been learned in advance.

The CTC-based posterior probability p_ctc(Y|X_e) is the posterior probability over arbitrary alignments between X_e and Y. An alignment is a sequence in which the symbols of the symbol string Y are placed at each time t of the input sequence data. For example, alignments π for an input sequence of five frames include aabcc, abbbc, aaabc, and so on.

C is the output of the first decoder 12c, and C[t, π[t]] is the alignment between the output symbol π[t] and the t-th frame of X_e.

The many-to-one mapping function B(π) removes redundant symbols from an alignment π. For example, if φ is the blank symbol, B(aaφb) = ab. The one-to-many mapping function B^-1 takes a symbol string as input and outputs the set of all the alignments described above.
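The two mapping functions can be illustrated with a small Python sketch. It assumes the usual CTC convention that B first merges runs of repeated symbols and then drops blanks. The names `collapse` and `inverse_collapse` are hypothetical, and the brute-force enumeration is for illustration only.

```python
from itertools import groupby, product
from typing import List

BLANK = "φ"  # blank symbol, written φ as in the text


def collapse(alignment: str) -> str:
    """B(π): merge runs of repeated symbols, then remove blank symbols."""
    merged = [symbol for symbol, _ in groupby(alignment)]
    return "".join(s for s in merged if s != BLANK)


def inverse_collapse(target: str, num_frames: int, vocab: str) -> List[str]:
    """B^-1(Y): every alignment of length num_frames that collapses to target.

    Enumerates all strings over vocab plus the blank, so it is only
    practical for very short inputs.
    """
    symbols = vocab + BLANK
    candidates = ("".join(p) for p in product(symbols, repeat=num_frames))
    return [pi for pi in candidates if collapse(pi) == target]


if __name__ == "__main__":
    print(collapse("aaφb"))
    print("aabcc" in inverse_collapse("abc", 5, "abc"))
```

Under this convention, the five-frame alignments aabcc, abbbc, and aaabc mentioned earlier all collapse to the symbol string abc.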

In the second expression of equation (2), the posterior probability of each alignment π given the observation X_e is calculated as "the product, over all times, of the probability C[t, π[t]] of placing the symbol π[t] at time t".

In the third expression of equation (2), the posterior probability of the symbol string Y given the observation X_e is calculated as "the sum of the posterior probabilities from the second expression over all alignments, which enumerate the ways in which Y can occur".
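The product-then-sum computation described above can be sketched directly in Python. This is a brute-force illustration under stated assumptions: C is given as a list of per-frame dictionaries mapping each symbol (including the blank φ) to its probability, and the function names are hypothetical. Practical CTC implementations use a dynamic-programming forward algorithm rather than enumerating every alignment.

```python
from itertools import groupby, product
from math import prod

BLANK = "φ"  # blank symbol


def collapse(alignment):
    """B(π): merge repeated symbols, then remove blanks."""
    merged = [symbol for symbol, _ in groupby(alignment)]
    return "".join(s for s in merged if s != BLANK)


def alignment_posterior(C, alignment):
    """Second expression: product over all times t of C[t][π[t]]."""
    return prod(C[t][symbol] for t, symbol in enumerate(alignment))


def ctc_posterior(C, target):
    """Third expression: sum over every alignment that collapses to target."""
    symbols = list(C[0])
    return sum(
        alignment_posterior(C, pi)
        for pi in product(symbols, repeat=len(C))
        if collapse(pi) == target
    )


if __name__ == "__main__":
    # Two frames, one non-blank symbol; each row sums to 1.
    C = [{"a": 0.6, BLANK: 0.4}, {"a": 0.5, BLANK: 0.5}]
    print(ctc_posterior(C, "a"))
```

In this toy case the alignments aa, aφ, and φa all collapse to the string a, so their probabilities 0.3, 0.3, and 0.2 are summed.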

The second decoder 12d is an example of the second calculation unit, and uses the third neural network to calculate, from the correct symbol string and the intermediate features X_e, a predicted symbol string and the posterior probability of that symbol string.

For example, the second decoder 12d is the decoder of a Transformer. It takes as input the speech features X_e obtained by the conversion in the encoder 12b and the already predicted symbol string Y[1:u] = Y[1], ..., Y[u], and, as shown in equation (3), predicts and outputs the subsequent symbol string Y[2:u+1].
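Equation (3) is not reproduced in this text. As a hedged sketch, a standard Transformer decoder consistent with the inputs named above could take the following form, where Z_j and Z_{j+1} are the input and output of decoder layer j. The residual connections, the self-attention and source-attention split (MHA^self, MHA^src), and the primed intermediates are assumptions, not taken from the source.

```latex
\begin{align}
Z_0 &= \mathrm{Embed}(Y[1:u]) + \mathrm{PE}(u) \\
Z'_j &= Z_j + \mathrm{MHA}^{\mathrm{self}}_j(Z_j, Z_j, Z_j) \\
Z''_j &= Z'_j + \mathrm{MHA}^{\mathrm{src}}_j(Z'_j, X_e, X_e) \\
Z_{j+1} &= Z''_j + \mathrm{FF}_j(Z''_j)
\end{align}
```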

Here, Embed is a function representing a computation by a neural network similar to PE. Instead of the times (frames) used in PE, it takes the symbol sequence Y[1:u] as input and outputs d^att-dimensional features.

Let d denote the total number of layers of the third neural network constituting the second decoder 12d, and let Z_j and Z_{j+1} denote the input and output of the j-th layer (j = 0, 1, ..., d-1). In this case, as shown in equation (4), given Y[1:u] and X_e, the second decoder 12d calculates and outputs the Transformer-based posterior probability p_s2s(Y|X_e), that is, the posterior probability that the next symbol is Y[u+1].
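Equation (4) is not reproduced in this text. A hedged reconstruction, consistent with the weight matrix W^att and bias vector b^att of the third neural network, is sketched below. The softmax output layer and the factorization of p_s2s(Y|X_e) into per-symbol terms are assumptions based on the standard attention-based sequence-to-sequence formulation.

```latex
\begin{align}
p(Y[u+1] \mid Y[1:u], X_e) &= \mathrm{softmax}\left(Z_d W^{\mathrm{att}} + b^{\mathrm{att}}\right) \\
p_{\mathrm{s2s}}(Y \mid X_e) &= \prod_{u} p(Y[u+1] \mid Y[1:u], X_e)
\end{align}
```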

Here, the weight matrix W^att and the bias vector b^att are parameters of the third neural network and have been learned in advance.

The learning device 10 trains the first neural network, the second neural network, and the third neural network by treating them as a whole as a single end-to-end neural network.

The second decoder 12d is not limited to a Transformer decoder, and may be, for example, an RNN decoder or the like.

Based on the posterior probability calculated by the first decoder 12c, the data cleansing unit 12e selects the data to be used in the processing of the update unit 12f described later. Specifically, when the CTC-based posterior probability is greater than a predetermined threshold, the data cleansing unit 12e causes the update unit 12f described later to execute its processing.

For example, the data cleansing unit 12e stores in the storage unit 11, as an index set I, the indices of the data whose CTC-based posterior probability is greater than the predetermined threshold.

When the CTC-based posterior probability is less than or equal to the predetermined threshold, the data cleansing unit 12e causes the data selection unit 12a to select another speech signal.
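The selection rule above amounts to a strict threshold comparison per training example. A minimal Python sketch follows; the function name `select_indices` and the representation of the posteriors as a plain list are assumptions made for illustration.

```python
def select_indices(ctc_posteriors, threshold):
    """Return the index set I of examples whose CTC-based posterior
    probability is strictly greater than the threshold.

    Examples at or below the threshold are excluded from the update.
    """
    return {k for k, p in enumerate(ctc_posteriors) if p > threshold}


if __name__ == "__main__":
    posteriors = [0.9, 0.1, 0.5, 0.7]
    print(sorted(select_indices(posteriors, threshold=0.5)))
```

Because the comparison is strict, an example whose posterior equals the threshold is treated the same as one below it, matching the "less than or equal to" case in the text.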

When the CTC-based posterior probability is greater than the predetermined threshold, the update unit 12f updates the parameters 11a of the first neural network, the second neural network, and the third neural network using a loss function value calculated from the posterior probability calculated by the second decoder 12d and the CTC-based posterior probability.

Specifically, for the speech signals selected by the data cleansing unit 12e, the update unit 12f calculates the loss for the output of the first decoder 12c and the loss for the output of the second decoder 12d, and updates the parameters 11a of the first neural network, the second neural network, and the third neural network based on their sum.

Here, the loss for the output of the first decoder 12c is the CTC loss shown in equation (6), calculated from the decoder outputs for the input data whose indices are contained in the index set I shown in equation (5).

The loss for the output of the second decoder 12d is the cross-entropy loss shown in equation (7), calculated from the decoder outputs for the input data whose indices are contained in the index set I of equation (5).
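Equations (5) to (7) are not reproduced in this text. A hedged reconstruction consistent with the surrounding description is sketched below. The example index k, the threshold symbol θ, the per-example notation Y_k and X_{e,k}, and the use of negative log-likelihood for both losses are assumptions.

```latex
\begin{align}
I &= \left\{\, k \;\middle|\; p_{\mathrm{ctc}}(Y_k \mid X_{e,k}) > \theta \,\right\} \\
L_{\mathrm{ctc}} &= -\sum_{k \in I} \log p_{\mathrm{ctc}}(Y_k \mid X_{e,k}) \\
L_{\mathrm{s2s}} &= -\sum_{k \in I} \log p_{\mathrm{s2s}}(Y_k \mid X_{e,k})
\end{align}
```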

The update unit 12f takes the weighted sum of the losses in equations (6) and (7) as the loss function value, calculates the parameter values of the end-to-end neural network using a well-known technique such as error backpropagation, and updates the parameters 11a stored in the storage unit 11.
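The weighted loss over the selected examples can be sketched in Python as follows. The source only says "weighted sum", so the convex weighting with `ctc_weight` and `1 - ctc_weight`, the default value 0.3, and the function name `joint_loss` are all assumptions. The gradient computation by backpropagation is omitted.

```python
from math import log


def joint_loss(p_ctc, p_s2s, selected, ctc_weight=0.3):
    """Weighted sum of the CTC loss and the cross-entropy loss.

    p_ctc[k]  : CTC-based posterior probability of the transcript of example k
    p_s2s[k]  : per-symbol posterior probabilities from the second decoder
    selected  : index set I of examples that passed data cleansing
    """
    loss_ctc = -sum(log(p_ctc[k]) for k in selected)
    loss_s2s = -sum(log(p) for k in selected for p in p_s2s[k])
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_s2s


if __name__ == "__main__":
    p_ctc = [0.5, 0.01]
    p_s2s = [[0.5, 0.5], [0.1]]
    # Example 1 is excluded, so it contributes nothing to the loss.
    print(joint_loss(p_ctc, p_s2s, selected={0}, ctc_weight=0.5))
```

Examples outside the index set contribute nothing, which is how the cleansing step keeps low-confidence data from influencing the parameter update.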

In this way, the learning device 10 can perform training while carrying out, during training, data cleansing that excludes data whose CTC-based posterior probability is less than or equal to the predetermined threshold and which therefore should not be used as training data.

After the parameters 11a have been updated, the learning device 10 again accepts input of a training speech signal and predicts a symbol string using the end-to-end neural network.

The termination determination unit 12g terminates the updating of the parameters 11a when a predetermined termination condition is satisfied. For example, the termination determination unit 12g terminates the updating of the parameters 11a in at least one of the following cases: the loss function value has become less than or equal to a predetermined threshold, the number of updates of the parameters 11a has reached a predetermined number, or the update amount of the parameters 11a has become less than or equal to a predetermined threshold.
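The three termination conditions combine with a logical OR. A minimal Python sketch follows; the function name, the parameter names, and the idea of passing the update amount as a single scalar are assumptions.

```python
def should_stop(loss, num_updates, update_amount,
                loss_threshold, max_updates, update_threshold):
    """Return True if at least one termination condition holds."""
    return (
        loss <= loss_threshold               # loss function value is small enough
        or num_updates >= max_updates        # update count reached the limit
        or update_amount <= update_threshold # parameters barely changed
    )


if __name__ == "__main__":
    print(should_stop(loss=2.0, num_updates=100, update_amount=0.5,
                      loss_threshold=0.1, max_updates=100,
                      update_threshold=1e-6))
```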

In the learning device 10 shown in FIG. 1, the processing of the first decoder 12c and the processing of the second decoder 12d are executed in parallel. FIG. 2 is a schematic diagram illustrating the schematic configuration of the learning device 10 of another embodiment. As shown in FIG. 2, the learning device 10 may input only the data selected by the data cleansing unit 12e to the second decoder 12d. That is, the data cleansing unit 12e may cause the processing of the second decoder 12d described above to be executed only when the CTC-based posterior probability is greater than the predetermined threshold. In this case, the processing load of the second decoder 12d is reduced.

[Learning processing]
Next, the learning processing performed by the learning device 10 according to this embodiment is described with reference to FIG. 3. FIG. 3 is a flowchart showing the learning processing procedure. The flowchart of FIG. 3 starts, for example, when the user performs an operation input instructing the start.

First, the encoder 12b accepts the training speech signal input from the data selection unit 12a (step S1). The encoder 12b then uses the first neural network to convert the features of the accepted speech signal into encoded intermediate features (step S2).

The first decoder 12c uses the second neural network to calculate, from the intermediate features, a predicted symbol string and the CTC-based posterior probability of that symbol string (step S3). The second decoder 12d uses the third neural network to calculate, from the correct symbol string and the intermediate features, a predicted symbol string and the posterior probability of that symbol string (step S4).

Next, the data cleansing unit 12e checks whether the CTC-based posterior probability is greater than the predetermined threshold. If it is greater than the predetermined threshold (step S5, Yes), the processing proceeds to step S6. If the CTC-based posterior probability is less than or equal to the predetermined threshold (step S5, No), the data cleansing unit 12e returns the processing to step S1.

The update unit 12f updates the parameters 11a of the end-to-end neural network using the loss function value calculated from the posterior probability calculated by the second decoder 12d and the CTC-based posterior probability (step S6).

The termination determination unit 12g then checks whether a predetermined termination condition is satisfied (step S7). For example, the termination determination unit 12g determines that the termination condition is satisfied in at least one of the following cases: the loss function value has become less than or equal to a predetermined threshold, the number of updates of the parameters 11a has reached a predetermined number, or the update amount of the parameters 11a has become less than or equal to a predetermined threshold.

If the termination determination unit 12g determines that the predetermined termination condition is not satisfied (step S7, No), it returns the processing to step S1, and the prediction of the symbol string and the updating of the parameters 11a are repeated. If the termination determination unit 12g determines that the predetermined termination condition is satisfied (step S7, Yes), the series of learning processing ends.
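The control flow of steps S1 to S7 can be sketched as a Python loop in which each functional unit is passed in as a callable. This is an illustrative skeleton only: the function name `train`, the callable signatures, and the `max_epochs` guard (added so the loop always terminates even if every example is rejected) are assumptions and do not appear in the source.

```python
def train(dataset, encode, ctc_posterior, s2s_posteriors,
          update, should_stop, threshold, max_epochs=10):
    """Run the learning procedure of FIG. 3 and return the number of updates.

    dataset is an iterable of (features, transcript) pairs; the other
    arguments stand in for the encoder, the two decoders, the update
    unit, and the termination determination unit.
    """
    num_updates = 0
    for _ in range(max_epochs):
        for features, transcript in dataset:           # S1: accept a signal
            x_e = encode(features)                      # S2: intermediate features
            p_ctc = ctc_posterior(x_e, transcript)      # S3: first decoder
            p_s2s = s2s_posteriors(x_e, transcript)     # S4: second decoder
            if p_ctc <= threshold:                      # S5: No, back to S1
                continue
            loss = update(p_ctc, p_s2s)                 # S6: update parameters
            num_updates += 1
            if should_stop(loss, num_updates):          # S7: termination check
                return num_updates
    return num_updates
```

In the FIG. 2 variant, the call standing in for the second decoder would move below the threshold check so that it runs only for examples that pass data cleansing.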

As described above, in the learning device 10 of this embodiment, the encoder 12b uses the first neural network to convert the features of an input training speech signal into encoded intermediate features. The first decoder 12c uses the second neural network to calculate, from the intermediate features, a predicted symbol string and the CTC-based posterior probability of that symbol string. The second decoder 12d uses the third neural network to calculate, from the correct symbol string and the intermediate features, a predicted symbol string and the posterior probability of that symbol string. When the CTC-based posterior probability is greater than the predetermined threshold, the update unit 12f updates the parameters of the first neural network, the second neural network, and the third neural network using the loss function value calculated from the posterior probability calculated by the second decoder 12d and the CTC-based posterior probability.

In this way, the learning device 10 can perform, during training, data cleansing that excludes data whose CTC-based posterior probability is less than or equal to the predetermined threshold and which could lower the accuracy of speech recognition if used for training. As a result, the accuracy of speech recognition can be improved even when training data containing noise, errors, and the like is used for training.

The learning device 10 may also perform the processing of the second decoder 12d only when the CTC-based posterior probability is greater than the predetermined threshold. This reduces the processing load of the second decoder 12d.

The learning device 10 trains the first neural network, the second neural network, and the third neural network by treating them as a whole as a single end-to-end neural network. This optimizes the speech recognition processing and enables speech recognition with higher accuracy.

In the learning device 10, the termination determination unit 12g terminates the updating of the parameters 11a in at least one of the following cases: the loss function value has become less than or equal to a predetermined threshold, the number of updates of the parameters 11a has reached a predetermined number, or the update amount of the parameters 11a has become less than or equal to a predetermined threshold. This makes it possible to limit the processing load of the learning processing.

[Program]
It is also possible to create a program in which the processing executed by the learning device 10 according to the above embodiment is described in a computer-executable language. As one embodiment, the learning device 10 can be implemented by installing, on a desired computer, a learning program that executes the above speech recognition processing as package software or online software. For example, by causing an information processing device to execute the learning program, the information processing device can be made to function as the learning device 10.

The information processing device referred to here includes desktop and notebook personal computers. The category also includes smartphones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone System) devices, and slate terminals such as PDAs (Personal Digital Assistants). The functions of the learning device 10 may also be implemented on a cloud server.

図４は、学習プログラムを実行するコンピュータの一例を示す図である。コンピュータ１０００は、例えば、メモリ１０１０と、ＣＰＵ１０２０と、ハードディスクドライブインタフェース１０３０と、ディスクドライブインタフェース１０４０と、シリアルポートインタフェース１０５０と、ビデオアダプタ１０６０と、ネットワークインタフェース１０７０とを有する。これらの各部は、バス１０８０によって接続される。 FIG. 4 is a diagram showing an example of a computer that executes a learning program. Computer 1000 includes, for example, memory 1010 , CPU 1020 , hard disk drive interface 1030 , disk drive interface 1040 , serial port interface 1050 , video adapter 1060 and network interface 1070 . These units are connected by a bus 1080 .

メモリ１０１０は、ＲＯＭ（Read Only Memory）１０１１およびＲＡＭ１０１２を含む。ＲＯＭ１０１１は、例えば、ＢＩＯＳ（Basic Input Output System）等のブートプログラムを記憶する。ハードディスクドライブインタフェース１０３０は、ハードディスクドライブ１０３１に接続される。ディスクドライブインタフェース１０４０は、ディスクドライブ１０４１に接続される。ディスクドライブ１０４１には、例えば、磁気ディスクや光ディスク等の着脱可能な記憶媒体が挿入される。シリアルポートインタフェース１０５０には、例えば、マウス１０５１およびキーボード１０５２が接続される。ビデオアダプタ１０６０には、例えば、ディスプレイ１０６１が接続される。 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012 . The ROM 1011 stores a boot program such as BIOS (Basic Input Output System). Hard disk drive interface 1030 is connected to hard disk drive 1031 . Disk drive interface 1040 is connected to disk drive 1041 . A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041, for example. A mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example. For example, a display 1061 is connected to the video adapter 1060 .

Here, the hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. Each piece of information described in the above embodiment is stored in, for example, the hard disk drive 1031 or the memory 1010.

The learning program is stored in the hard disk drive 1031 as, for example, the program module 1093, in which instructions to be executed by the computer 1000 are written. Specifically, the program module 1093 describing each process executed by the learning device 10 of the above embodiment is stored in the hard disk drive 1031.

Data used for information processing by the learning program is stored as the program data 1094 in, for example, the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the hard disk drive 1031 into the RAM 1012 as necessary and executes each of the procedures described above.

Note that the program module 1093 and the program data 1094 related to the learning program need not be stored in the hard disk drive 1031. For example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 related to the learning program may be stored in another computer connected via a network such as a LAN or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.

Embodiments to which the invention made by the present inventors is applied have been described above, but the present invention is not limited by the description and drawings that form part of this disclosure. That is, all other embodiments, examples, operational techniques, and the like devised by those skilled in the art on the basis of this embodiment fall within the scope of the present invention.

REFERENCE SIGNS LIST
10 learning device
11 storage unit
11a parameters
12 control unit
12a data selection unit
12b encoder
12c first decoder (CTC decoder)
12d second decoder
12e data cleansing unit
12f update unit
12g end determination unit

Claims

1. A learning device comprising:
a conversion unit that converts a feature quantity of an input speech signal for learning into an encoded intermediate feature quantity using a first neural network;
a first calculation unit that calculates, from the intermediate feature quantity, a predicted symbol string and a posterior probability of the symbol string based on CTC (Connectionist Temporal Classification) using a second neural network;
a second calculation unit that calculates, from a correct symbol string and the intermediate feature quantity, a predicted symbol string and a posterior probability of the symbol string using a third neural network; and
an update unit that, when the posterior probability based on the CTC is greater than a predetermined threshold, updates parameters of the first neural network, the second neural network, and the third neural network using a loss function value calculated from the posterior probability calculated by the second calculation unit and the posterior probability based on the CTC.
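The threshold gating in this claim can be illustrated with a minimal Python sketch. The claim says only that the loss is calculated from the two posterior probabilities. The weighted sum of negative log posteriors used here, the `ctc_weight` parameter, the function names, and the numeric values are all assumptions made for illustration and are not taken from the specification. Log probabilities are used because sequence posteriors are typically very small.

```python
import math
from typing import Optional


def joint_loss(log_p_ctc: float, log_p_att: float, ctc_weight: float = 0.5) -> float:
    """Weighted sum of negative log posteriors.

    Assumed form: the claim does not fix the loss formula, and
    ctc_weight is a hypothetical interpolation weight.
    """
    return -(ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att)


def gated_loss(
    log_p_ctc: float,
    log_p_att: float,
    log_threshold: float,
    ctc_weight: float = 0.5,
) -> Optional[float]:
    """Return a loss only when the CTC posterior exceeds the threshold."""
    if log_p_ctc <= log_threshold:
        # The pair is treated as unreliable: no loss, so no parameter update.
        return None
    return joint_loss(log_p_ctc, log_p_att, ctc_weight)


# Made-up posteriors: the first sample passes the gate, the second does not.
kept = gated_loss(math.log(0.6), math.log(0.8), math.log(0.5))
skipped = gated_loss(math.log(0.4), math.log(0.8), math.log(0.5))
```

In a full training loop, a returned value would be backpropagated through all three networks, and a `None` result would leave the parameters unchanged for that sample.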

2. The learning device according to claim 1, wherein the processing of the second calculation unit is performed when the posterior probability based on the CTC is greater than a predetermined threshold.

3. The learning device according to claim 1, wherein learning is performed by treating the first neural network, the second neural network, and the third neural network as a single end-to-end neural network as a whole.

4. The learning device according to claim 1, further comprising an end determination unit that terminates the updating of the parameters when the loss function value becomes equal to or less than a predetermined threshold, when the number of parameter updates reaches a predetermined number, or when the update amount of the parameters becomes equal to or less than a predetermined threshold.
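The three termination conditions in this claim can be sketched as a single predicate. This is a minimal illustration in Python: the names `StopCriteria` and `should_stop` and the concrete threshold values are hypothetical, and how the update amount is measured (for example, a norm of the parameter change) is left to the caller because the claim does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopCriteria:
    """Hypothetical container for the three predetermined values."""
    loss_threshold: float    # stop when loss <= this value
    max_updates: int         # stop when the update count reaches this value
    update_threshold: float  # stop when the update amount <= this value


def should_stop(
    loss: float, num_updates: int, update_amount: float, criteria: StopCriteria
) -> bool:
    """Return True if any one of the three conditions holds."""
    return (
        loss <= criteria.loss_threshold
        or num_updates >= criteria.max_updates
        or update_amount <= criteria.update_threshold
    )


# Illustrative values only.
criteria = StopCriteria(loss_threshold=0.1, max_updates=1000, update_threshold=1e-6)
still_training = not should_stop(1.0, 10, 1.0, criteria)
```

The conditions are combined with a logical OR, matching the claim wording in which any one condition is sufficient to terminate the updating.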

5. A learning method executed by a learning device, the method comprising:
a conversion step of converting a feature quantity of an input speech signal for learning into an encoded intermediate feature quantity using a first neural network;
a first calculation step of calculating, from the intermediate feature quantity, a predicted symbol string and a posterior probability of the symbol string based on CTC (Connectionist Temporal Classification) using a second neural network;
a second calculation step of calculating, from a correct symbol string and the intermediate feature quantity, a predicted symbol string and a posterior probability of the symbol string using a third neural network; and
an update step of, when the posterior probability based on the CTC is greater than a predetermined threshold, updating parameters of the first neural network, the second neural network, and the third neural network using a loss function value calculated from the posterior probability calculated in the second calculation step and the posterior probability based on the CTC.

6. A learning program for causing a computer to execute:
a conversion step of converting a feature quantity of an input speech signal for learning into an encoded intermediate feature quantity using a first neural network;
a first calculation step of calculating, from the intermediate feature quantity, a predicted symbol string and a posterior probability of the symbol string based on CTC (Connectionist Temporal Classification) using a second neural network;
a second calculation step of calculating, from a correct symbol string and the intermediate feature quantity, a predicted symbol string and a posterior probability of the symbol string using a third neural network; and
an update step of, when the posterior probability based on the CTC is greater than a predetermined threshold, updating parameters of the first neural network, the second neural network, and the third neural network using a loss function value calculated from the posterior probability calculated in the second calculation step and the posterior probability based on the CTC.