JP7359028B2 - Learning device, learning method, and learning program

Info

Publication number: JP7359028B2
Application number: JP2020028869A
Authority: JP
Inventors: 直弘俵; 厚徳小川; 具治岩田; 陽祐樋口; 哲則小林; 哲司小川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2020-02-21
Filing date: 2020-02-21
Publication date: 2023-10-11
Anticipated expiration: 2040-02-21
Also published as: JP2021135314A

Description

The present invention relates to a learning device, a speech recognition device, a learning method, and a learning program.

Conventionally, techniques are known for training a model that uses a neural network (hereinafter abbreviated as NN where appropriate) by machine learning. For example, a method is known for training, by machine learning, a speech recognition model that uses an end-to-end NN to convert speech data into the information (posterior probabilities) that the speech data represents.

This end-to-end NN includes, for example, an encoder that outputs intermediate features of the speech, an attention mechanism that determines which part of the intermediate features output by the encoder to focus on (the weights), and a decoder that estimates the characters represented by the speech using the weights determined by the attention mechanism (see Non-Patent Document 1).

Speech recognition models of this kind are often trained on clean speech that contains no noise. In real environments, however, noise is frequently present, so speech recognition must be performed under noisy conditions.

For this reason, the technique described in Non-Patent Document 1, for example, trains the speech recognition model using both clean speech and noisy speech as training data in order to increase robustness against noise. In this case, paired data is first prepared, consisting of speech signals with the same content (text), one containing noise (noisy speech) and one not containing noise (clean speech). Next, the model parameters are trained so that the encoder output and the output of the decoder's intermediate layer obtained when the noisy speech is input approach, respectively, the encoder output and the output of the decoder's intermediate layer obtained when the corresponding clean speech is input.

Non-Patent Document 1: Davis Liang, Zhiheng Huang, and Zachary C. Lipton, “Learning Noise-Invariant Representations for Robust Speech Recognition,” in Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp. 56-63, 2018.

However, even a model trained with the above technique does not necessarily recognize noisy speech with high accuracy. An object of the present invention is therefore to solve this problem and provide a speech recognition means that is highly robust against noise.

To solve the above problem, the present invention is characterized by comprising: a first learning unit that uses, as first teacher data, data associating speech data with correct-answer data of information specifying the symbol string represented by the speech data, and that trains a speech recognition model which, when converting speech data into information specifying the symbol string represented by the speech data, includes an encoder that outputs intermediate features of the speech data and an attention mechanism that outputs weights indicating which of the elements constituting the intermediate features to focus on, together with the weighted sum of the intermediate features computed with those weights; a distance calculation unit that, based on second teacher data associating the speech data, noisy speech data obtained by adding noise to the speech data, and correct-answer data of information specifying the symbol string represented by the speech data, calculates a first distance indicating how much the distribution of the weights output by the attention mechanism of the speech recognition model trained by the first learning unit differs between when the speech data is input and when the noisy speech data is input, and a second distance indicating how much the information output by the decoder of the speech recognition model when the noisy speech data is input differs from the correct-answer data for the speech data; and a second learning unit that, when training the speech recognition model trained by the first learning unit using the second teacher data, takes the sum of the first distance and the second distance as a loss and updates the parameters of the encoder and the attention mechanism of the speech recognition model so that the loss decreases.

According to the present invention, it is possible to provide a speech recognition means that is highly robust against noise.

FIG. 1 is a diagram showing an example of a speech recognition model using an end-to-end NN.
FIG. 2 is a diagram for explaining an overview of the operation of the learning device.
FIG. 3 is a diagram for explaining an overview of the operation of the learning device.
FIG. 4 is a diagram for explaining an overview of the operation of the learning device.
FIG. 5 is a diagram for explaining an overview of the operation of the learning device.
FIG. 6 is a diagram showing a configuration example of the learning device.
FIG. 7 is a flowchart showing an example of the processing procedure of the learning device.
FIG. 8 is a diagram showing a configuration example of the speech recognition device.
FIG. 9 is a flowchart showing an example of the processing procedure of the speech recognition device.
FIG. 10 is a diagram showing an example of a computer that executes the learning program.

Hereinafter, modes for carrying out the present invention (embodiments) are described with reference to the drawings, divided into a first embodiment and a second embodiment. The present invention is not limited to these embodiments. The description takes as an example the case where the model trained by the learning device in each embodiment is a speech recognition model using an end-to-end NN that includes an encoder, an attention mechanism, and a decoder.

[First embodiment]
First, an overview of the operation of the learning device of the first embodiment is described with reference to FIGS. 1 to 5. The learning device includes a speech recognition model (speech recognition unit) using the above-mentioned end-to-end NN. As shown in FIG. 1, this speech recognition model includes an encoder (encoding unit), an attention mechanism (attention mechanism unit), and a decoder (decoding unit).

The encoder converts the input speech data (x_1, ..., x_T) into intermediate features (h_2, ..., h_T). Based on the intermediate features and the hidden state of the decoder, the attention mechanism calculates weight values (α_0, ..., α_T) indicating which of the elements constituting the intermediate features to focus on, and outputs the values (c_0, ..., c_L) obtained as the weighted sum of the intermediate features with those weights. Based on the decoder output up to the previous step (y_{i-1}) and the output value c_i from the attention mechanism, the decoder estimates and outputs the information (y_i) specifying the character that follows the decoder output (y_{i-1}).
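The attention computation described above can be sketched for a single decoder step as follows. This is a minimal illustration, not the patent's implementation: the dot-product scoring, the softmax normalization, and the names `softmax` and `attention_step` are assumptions introduced here, since the text only states that weights and a weighted sum are produced from the intermediate features and the decoder hidden state.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_step(h, s):
    """Return (alpha, c) for one decoder step.

    h: list of intermediate feature vectors from the encoder.
    s: decoder hidden state with the same dimension as each h[t].
    Dot-product scoring is an assumption made for this sketch.
    """
    scores = [sum(h_k * s_k for h_k, s_k in zip(h_t, s)) for h_t in h]
    alpha = softmax(scores)  # weights over encoder positions, summing to 1
    dim = len(h[0])
    # Context vector: weighted sum of the intermediate features.
    c = [sum(alpha[t] * h[t][k] for t in range(len(h))) for k in range(dim)]
    return alpha, c

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy intermediate features
s = [0.5, -0.5]                           # toy decoder hidden state
alpha, c = attention_step(h, s)
```

Because the weights form a distribution over encoder positions, they can be compared between clean and noisy inputs with a divergence measure.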

The learning device first pre-trains the above speech recognition model using clean speech data (speech data that contains no noise) ((1) in FIG. 2). For example, the learning device pre-trains the speech recognition model using, as teacher data, data in which clean speech data is associated with information (correct-answer data) specifying the symbol string represented by that speech data. As a result, the encoder, attention mechanism, and decoder of the speech recognition model are set with parameters for accurately recognizing clean speech data.

Next, using paired data of clean speech data and noisy speech data (for example, speech data obtained by adding noise to the clean speech data), the learning device extracts the attention weights for the clean speech data and the attention weights for the noisy speech data ((2) in FIG. 3).

Here, an inter-distribution loss is defined between the attention weights for the clean speech data and those for the noisy speech data ((3) in FIG. 4). For example, the inter-distribution loss L_att between the attention weights of the clean speech data (α_i) and the attention weights of the noisy speech data (α'_i) is defined by the KL divergence, so that L_att = D_KL(α_i || α'_i).
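The KL-divergence-based attention loss can be sketched as follows. The text gives only D_KL(α_i || α'_i); summing over decoder steps i, the small constant `eps` for numerical safety, and the function names are assumptions of this sketch.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions of equal length."""
    return sum(
        p_t * math.log((p_t + eps) / (q_t + eps))
        for p_t, q_t in zip(p, q)
        if p_t > 0.0
    )

def attention_loss(alpha_clean, alpha_noisy):
    """Sum of D_KL(alpha_i || alpha'_i) over decoder steps i (assumed reduction)."""
    return sum(kl_divergence(a, b) for a, b in zip(alpha_clean, alpha_noisy))

# Toy attention weights: one row per decoder step, one column per encoder position.
alpha_clean = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
alpha_noisy = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]
l_att = attention_loss(alpha_clean, alpha_noisy)
```

The loss is zero when the noisy-input weights match the clean-input weights exactly and grows as the two distributions diverge.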

Next, the learning device trains the attention mechanism and the encoder (performs the second learning) so that the sum of the character identification loss L_char and the attention loss L_att becomes less than or equal to a predetermined threshold ((4) in FIG. 5). The character identification loss L_char is the loss between the information (y_i') that the decoder outputs when the noisy speech data is input and the correct-answer data.
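The combined objective L_char + L_att can be sketched as follows. The text does not specify the form of L_char; the mean negative log-likelihood (cross-entropy) used here, and the names `char_loss` and `second_stage_loss`, are assumptions for illustration. The attention loss is passed in as an already computed number.

```python
import math

def char_loss(probs_noisy, targets):
    """Mean negative log-likelihood of the correct symbols.

    probs_noisy[i] is the decoder's output distribution at step i for the
    noisy input; targets[i] is the index of the correct symbol.
    Cross-entropy is an assumed choice for L_char.
    """
    return -sum(math.log(p[t]) for p, t in zip(probs_noisy, targets)) / len(targets)

def second_stage_loss(probs_noisy, targets, l_att):
    """Loss minimized in the second learning: L_char + L_att."""
    return char_loss(probs_noisy, targets) + l_att

probs_noisy = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # toy decoder outputs
targets = [0, 1]                                  # toy correct symbols
loss = second_stage_loss(probs_noisy, targets, l_att=0.05)
```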

By training the speech recognition model in this way, the learning device can bring the attention weights for noisy speech data closer to the attention weights for clean speech data in the speech recognition model. As a result, the learning device can train a speech recognition model that is highly robust against noise.

[Configuration]
Next, a configuration example of the learning device 10 of the first embodiment is described with reference to FIG. 6. The learning device 10 includes an input unit 11, an output unit 12, a storage unit 13, and a control unit 14.

The input unit 11 receives input of data used when the control unit 14 performs various processes. For example, the input unit 11 receives input of the teacher data (first teacher data and second teacher data) for the speech recognition model (speech recognition unit 143). The output unit 12 outputs the results of the processing performed by the control unit 14. For example, the output unit 12 outputs the speech recognition results produced by the speech recognition unit 143.

The storage unit 13 is realized by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or by a storage device such as a hard disk or an optical disk, and stores the program that operates the learning device 10 and the data used while the program is running. For example, the storage unit 13 stores the first teacher data and the second teacher data. The storage unit 13 also stores the values of the parameters set in the speech recognition unit 143, and the like.

The first teacher data is data in which clean speech data is associated with information (correct-answer data) specifying the symbol string represented by that speech data. The first teacher data is used when the first learning unit 1471 pre-trains the speech recognition unit 143.

The second teacher data is data in which clean speech data, noisy speech data obtained from that speech data, and information (correct-answer data) specifying the symbol string represented by that speech data are associated with one another.

The clean speech data in the second teacher data may be the same as the clean speech data included in the first teacher data. The noisy speech data may be clean speech data to which noise has been added artificially, or speech data recorded in an environment where noise is present. The second teacher data is used when the second learning unit 1474 performs the second learning of the speech recognition unit 143.
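One common way to add noise artificially is to scale a noise signal to a target signal-to-noise ratio before mixing. The following is a minimal sketch of that idea; the SNR-based scaling, the function name `add_noise`, and the synthetic signals are assumptions, since the text does not say how the noise is added.

```python
import math
import random

def add_noise(clean, noise, snr_db):
    """Mix `noise` into `clean` so that the result has the given SNR in dB.

    Both inputs are equal-length lists of samples. SNR-based mixing is an
    assumed procedure, not one stated in the source.
    """
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Scale the noise so that p_clean / (scale^2 * p_noise) = 10^(snr_db / 10).
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [x + scale * n for x, n in zip(clean, noise)]

random.seed(0)
clean = [math.sin(0.1 * i) for i in range(200)]      # toy "clean speech"
noise = [random.gauss(0.0, 1.0) for _ in range(200)]  # toy noise
noisy = add_noise(clean, noise, snr_db=10.0)
```

Each resulting (clean, noisy, correct text) triple would form one entry of the second teacher data.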

The control unit 14 controls the learning device 10 as a whole. For example, the control unit 14 trains the speech recognition unit 143 and performs speech recognition using the trained speech recognition unit 143.

The control unit 14 includes a first data input unit 141, a second data input unit 142, a speech recognition unit 143, and a learning unit 147.

In the pre-learning mode, the first data input unit 141 selects speech data that has not yet been selected from the first teacher data, inputs it to the speech recognition unit 143, and causes the speech recognition unit 143 to execute its computation.

In the second learning mode, the second data input unit 142 selects, from the second teacher data, a pair of clean speech data and noisy speech data that has not yet been selected. The second data input unit 142 then inputs, for example, the clean speech data of the selected pair to the speech recognition unit 143 and causes the speech recognition unit 143 to execute its computation. The second data input unit 142 also inputs the noisy speech data paired with that clean speech data to the speech recognition unit 143 set with the same parameters, and causes the speech recognition unit 143 to execute its computation.

The speech recognition unit 143 performs speech recognition on the input speech data based on the speech recognition model. Specifically, the speech recognition unit 143 converts the input speech data into information specifying the symbol string represented by the speech data, and outputs the converted information. The speech recognition unit 143 includes an encoding unit (encoder) 144, an attention mechanism unit (attention) 145, and a decoding unit (decoder) 146. Parameters used for speech recognition are set in each of the encoding unit 144, the attention mechanism unit 145, and the decoding unit 146. These parameters are updated as appropriate when the speech recognition unit 143 is trained.

The encoding unit 144 converts the input speech data into intermediate features. The attention mechanism unit 145 takes the intermediate features and the hidden state S_n of the decoding unit 146 as input, and outputs the weights (α_0, ..., α_T) and the value c_n obtained as the weighted sum of the intermediate features with those weights. The decoding unit 146 takes the decoder outputs up to the previous step, y_{1:n-1}, and the output value c_n from the attention mechanism unit 145 as input, and estimates and outputs the information y_n specifying the next character.

The learning unit 147 trains the speech recognition unit 143. The learning unit 147 includes a first learning unit 1471, a first distance calculation unit 1472 and a second distance calculation unit 1473 (distance calculation unit), and a second learning unit 1474. The third distance calculation unit 1475, indicated by a broken line, may or may not be provided; the case where it is provided is described in the second embodiment.

The first learning unit 1471 pre-trains the speech recognition unit 143 when the learning mode is the pre-learning mode. For example, until a predetermined condition is satisfied, the first learning unit 1471 updates the parameters of the encoding unit 144, the attention mechanism unit 145, and the decoding unit 146 of the speech recognition unit 143 based on the loss (distance) between the output y_n of the speech recognition unit 143 for the speech data of the first teacher data and the information (correct answer) specifying the symbol string represented by that speech data. The parameter values of each unit of the speech recognition unit 143 after this pre-training are stored in the storage unit 13.

The predetermined condition used by the first learning unit 1471 is, for example, that the number of loss-based updates of the parameters of the speech recognition unit 143 has reached a predetermined number of iterations, that the loss has become less than or equal to a predetermined threshold, or that the amount of parameter update has become less than or equal to a predetermined threshold.
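The three example conditions can be combined into a single stopping check, as in the sketch below. The function name `should_stop` and the default threshold values are hypothetical; the text names the kinds of conditions but gives no concrete values.

```python
def should_stop(iteration, loss, update_norm,
                max_iters=100, loss_tol=1e-3, update_tol=1e-6):
    """Return True when any of the example stopping conditions holds.

    iteration:   number of parameter updates performed so far.
    loss:        current value of the loss.
    update_norm: magnitude of the most recent parameter update.
    All default thresholds are placeholder values for illustration.
    """
    return (
        iteration >= max_iters        # predetermined number of iterations reached
        or loss <= loss_tol           # loss at or below its threshold
        or update_norm <= update_tol  # parameter update at or below its threshold
    )
```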

When the learning unit 147 determines that the processing by the first learning unit 1471 has satisfied the above predetermined condition, it ends the pre-learning mode and switches to the second learning mode.

In the second learning mode, the first distance calculation unit 1472 calculates a first distance indicating how much the distribution of the weights output by the attention mechanism unit 145 of the pre-trained speech recognition unit 143 differs between when clean speech data is input and when noisy speech data is input.

For example, the first distance calculation unit 1472 uses the KL divergence to calculate the inter-distribution distance (first distance) between the attention weights output by the pre-trained attention mechanism unit 145 when clean speech data in the second teacher data is input and the attention weights output by the attention mechanism unit 145 when the noisy speech data paired with that clean speech data is input.

The second distance calculation unit 1473 calculates a second distance indicating how much the information output by the decoder of the pre-trained speech recognition unit 143 when noisy speech data is input differs from the correct-answer data (for example, the correct characters) for that speech data in the second teacher data.

For example, the second distance calculation unit 1473 calculates the distance (second distance) between the information output by the decoding unit 146 when noisy speech data in the second teacher data is input and the correct characters.

The second learning unit 1474 performs the second learning on the speech recognition unit 143 when the learning mode is the second learning mode. For example, until a predetermined condition is satisfied, the second learning unit 1474 updates the parameters of the attention mechanism unit 145 and the decoding unit 146 of the pre-trained speech recognition unit 143 based on the sum of the first distance calculated by the first distance calculation unit 1472 and the second distance calculated by the second distance calculation unit 1473.

For example, when the second learning unit 1474 uses the second teacher data to update the parameters of the attention mechanism unit 145 and the decoding unit 146 of the pre-trained speech recognition unit 143, it takes the sum of the first distance and the second distance as the loss and updates the parameters so that the loss decreases.

The predetermined condition used by the second learning unit 1474 is the same as that used by the first learning unit 1471: for example, that the number of updates of the parameters of the speech recognition unit 143 has reached a predetermined number of iterations, that the loss has become less than or equal to a predetermined threshold, or that the amount of parameter update has become less than or equal to a predetermined threshold. The parameter values of each unit of the speech recognition unit 143 after the second learning are stored in the storage unit 13.

After performing the second learning on the speech recognition unit 143 as described above, the learning device 10 may execute speech recognition processing on input speech data using the trained speech recognition unit 143. For example, the learning device 10 may use the trained speech recognition unit 143 to convert input speech data into information specifying the symbol string represented by that speech data, and output the information.

[Processing procedure]
The processing procedure of the learning device 10 is described with reference to FIG. 7. First, the first data input unit 141 of the learning device 10 inputs the speech data of the first teacher data to the speech recognition unit 143, and the first learning unit 1471 pre-trains the speech recognition unit 143 using the results output by the speech recognition unit 143 (S1). After that, the second data input unit 142 of the learning device 10 inputs the speech data of the second teacher data to the speech recognition unit 143 trained in S1, and the second learning unit 1474 performs the second learning of the speech recognition unit 143 using the results output by each unit of the speech recognition unit 143 (S2).
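The two-step flow (S1, then S2) and the shape of the two kinds of teacher data can be sketched as follows. Everything here is schematic: `train_two_stage`, the step callbacks, the file names, and the returned loss values are hypothetical placeholders standing in for real model updates.

```python
def train_two_stage(first_data, second_data, pretrain_step, second_step):
    """Run pre-training (S1) and then the second learning (S2).

    first_data:  iterable of (clean_speech, correct_text) pairs.
    second_data: iterable of (clean_speech, noisy_speech, correct_text) triples.
    The step callbacks perform one update and return the loss.
    """
    history = []
    mode = "pretrain"
    for speech, text in first_data:            # S1: pre-learning mode
        history.append((mode, pretrain_step(speech, text)))
    mode = "second"                            # switch modes once S1 is done
    for clean, noisy, text in second_data:     # S2: second learning mode
        history.append((mode, second_step(clean, noisy, text)))
    return history

first_data = [("clean_a.wav", "hello")]
second_data = [("clean_b.wav", "noisy_b.wav", "world")]
history = train_two_stage(
    first_data,
    second_data,
    pretrain_step=lambda speech, text: 1.0,       # dummy loss value
    second_step=lambda clean, noisy, text: 0.5,   # dummy loss value
)
```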

With this learning device 10, the first learning unit 1471 first uses the first teacher data to set the parameters of each unit of the speech recognition unit 143 so that the speech recognition unit 143 outputs the correct-answer data for clean speech data.

After that, the second learning unit 1474 uses the second teacher data to update the parameters of the attention mechanism unit 145 and the encoding unit 144 of the speech recognition unit 143 so that the sum of the first distance and the second distance, obtained by inputting clean speech data and noisy speech data to the speech recognition unit 143, decreases.
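Updating only some parameter groups while leaving the others at their pre-trained values can be sketched as follows. Plain gradient descent, the function name `sgd_update`, the group names, and the toy numbers are assumptions; the text does not state which optimizer is used.

```python
def sgd_update(params, grads, trainable, lr=0.1):
    """Apply one gradient step to the parameter groups listed in `trainable`.

    params, grads: dicts mapping a group name to a list of floats.
    Groups not in `trainable` are copied unchanged (kept at pre-trained values).
    """
    updated = {}
    for name, weights in params.items():
        if name in trainable:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)
    return updated

params = {"encoder": [1.0, 2.0], "attention": [0.5], "decoder": [3.0]}
grads = {"encoder": [0.1, -0.2], "attention": [1.0], "decoder": [5.0]}
# Second learning as described here: update the encoder and attention groups only.
new_params = sgd_update(params, grads, trainable={"encoder", "attention"})
```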

As a result, the learning device 10 can make the speech recognition unit 143 output the correct-answer data for noisy speech while bringing the weights that the attention mechanism unit 145 outputs for noisy speech closer to the weights that it outputs for clean speech. Consequently, the learning device 10 can train a speech recognition unit 143 that is highly robust against noise.

[Second embodiment]
The learning device 10 may include a third distance calculation unit 1475 (see FIG. 6). This case is described as the second embodiment. In the following, components that are the same as in the first embodiment are given the same reference numerals and their description is omitted. The third distance calculation unit 1475 calculates a third distance indicating how much the information (intermediate features) output by the encoding unit 144 of the pre-trained speech recognition unit 143 differs between when clean speech data is input and when noisy speech data obtained from that speech data is input.

When training the pre-trained speech recognition unit 143, the second learning unit 1474 then takes the sum of the first distance, the second distance, and the third distance as the loss, and updates the parameters of the encoding unit 144 and the attention mechanism unit 145 of the speech recognition unit 143 so that the loss decreases. In this way, the learning device 10 can train a speech recognition unit 143 that is even more robust against noise.
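The three-term loss of the second embodiment can be sketched as follows. The text does not say how the third distance is measured; the mean squared difference used here, and the names `encoder_distance` and `total_loss`, are assumptions for illustration only.

```python
def encoder_distance(h_clean, h_noisy):
    """Mean squared difference between two sequences of feature vectors.

    h_clean, h_noisy: encoder outputs for the clean and noisy inputs, given as
    equal-shaped lists of vectors. The squared-error metric is an assumption.
    """
    total, count = 0.0, 0
    for v_clean, v_noisy in zip(h_clean, h_noisy):
        for a, b in zip(v_clean, v_noisy):
            total += (a - b) ** 2
            count += 1
    return total / count

def total_loss(first_distance, second_distance, third_distance):
    """Loss of the second embodiment: the sum of the three distances."""
    return first_distance + second_distance + third_distance

d3 = encoder_distance([[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
loss = total_loss(0.25, 0.5, d3)  # toy values for the first and second distances
```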

[Other embodiments]
The speech recognition unit 143 (speech recognition model) trained by the learning device 10 described above may be used for speech recognition by the learning device 10 itself, or may be used for speech recognition by another device.

For example, the speech recognition unit 143 trained by the learning device 10 may be installed in a speech recognition device 100 (see FIG. 8). As shown in FIG. 8, the speech recognition device 100 includes, for example, an input unit 11, an output unit 12, a storage unit 13, and a control unit 14. The control unit 14 includes the speech recognition unit 143 trained by the learning device 10. When the input unit 11 of the speech recognition device 100 receives input of the speech data to be recognized (S11 in FIG. 9), the device performs speech recognition using the speech recognition unit 143 trained by the learning device 10 (S12) and outputs the recognition result (S13).
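The recognition step S12 can be sketched as a loop that repeatedly asks the model for the next symbol given the symbols output so far. Greedy selection of the most probable symbol, the end-of-sequence marker, and all names below are assumptions; the text does not describe the decoding strategy.

```python
def greedy_decode(next_symbol_probs, vocab, eos="<eos>", max_len=50):
    """Pick the most probable next symbol until `eos` or `max_len` is reached.

    next_symbol_probs: callable mapping the symbols output so far to a list of
    probabilities over `vocab` (a stand-in for the trained model).
    """
    output = []
    for _ in range(max_len):
        probs = next_symbol_probs(output)
        best = max(range(len(probs)), key=probs.__getitem__)
        if vocab[best] == eos:
            break
        output.append(vocab[best])
    return "".join(output)

vocab = ["h", "i", "<eos>"]

def toy_model(prefix):
    """Hypothetical model that emits 'h', then 'i', then the end marker."""
    probs = [0.1, 0.1, 0.1]
    probs[min(len(prefix), 2)] = 0.8
    return probs

text = greedy_decode(toy_model, vocab)  # S12; the result would be output in S13
```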

[Experimental results]
The results of an experiment using the learning device 10 of the second embodiment described above are explained below. In this experiment, the data used by the first learning unit 1471 of the learning device 10 for pre-training the speech recognition model was WSJ1si284. WSJ1si284 is a corpus of read speech from the Wall Street Journal; it contains 37416 utterances with a total length of 80 hours. The data used by the second learning unit 1474 for the second training of the speech recognition model was WSJ1si84 and tr05_simu of CHiME-4. WSJ1si84 is also a corpus of read speech from the Wall Street Journal; it contains 7138 utterances with a total length of 15 hours. tr05_simu of CHiME-4 is a corpus obtained by superimposing noise on WSJ1si84.

Under the above conditions, the misrecognition rates for the speech data of CHiME-4 et05_simu and CHiME-4 et05_real were measured using the speech recognition unit 143 after the learning device 10 of the second embodiment had performed the pre-training and the second training, and the following results were obtained. CHiME-4 et05_simu is a corpus obtained by superimposing noise on read speech from the Wall Street Journal. CHiME-4 et05_real is a corpus of speech recorded while reading the Wall Street Journal aloud in noisy environments.

For example, when speech recognition was performed using a speech recognition model trained by the method described in Non-Patent Document 1 (comparative example), the misrecognition rate was 28.5% for the speech data of CHiME-4 et05_simu and 32.8% for the speech data of CHiME-4 et05_real. In contrast, when speech recognition was performed using the speech recognition model trained by the learning device 10 of the second embodiment, the misrecognition rate was 27.8% for the speech data of CHiME-4 et05_simu and 32.5% for the speech data of CHiME-4 et05_real.
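To put the reported figures side by side, the short calculation below derives the absolute and relative reduction in misrecognition rate for each test set. Only the four percentages come from the text; the function name and the choice to report a relative reduction are illustrative assumptions.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Relative reduction of the misrecognition rate, in percent of the baseline."""
    return (baseline - proposed) / baseline * 100.0


# (comparative example, second embodiment) misrecognition rates in percent.
results = {
    "et05_simu": (28.5, 27.8),
    "et05_real": (32.8, 32.5),
}

for name, (baseline, proposed) in results.items():
    absolute = baseline - proposed
    relative = relative_reduction(baseline, proposed)
    print(f"{name}: {absolute:.1f} points absolute, {relative:.1f}% relative")

# Prints:
# et05_simu: 0.7 points absolute, 2.5% relative
# et05_real: 0.3 points absolute, 0.9% relative
```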

From the above, it was confirmed that the speech recognition model trained by the learning device 10 of the second embodiment achieves a lower misrecognition rate on noisy speech data than the model trained by the method described in Non-Patent Document 1. In other words, it was confirmed that the learning device 10 of the second embodiment can create a speech recognition model that is more robust against noise than the method described in Non-Patent Document 1.

[Program]
An example of a computer that executes the above program (learning program) is described with reference to FIG. 10. As shown in FIG. 10, the computer 1000 includes, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These components are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100, for example. A mouse 1110 and a keyboard 1120, for example, are connected to the serial port interface 1050. A display 1130, for example, is connected to the video adapter 1060.

Here, as shown in FIG. 10, the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. The storage unit 13 described in the above embodiments is provided in, for example, the hard disk drive 1090 or the memory 1010.

The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the hard disk drive 1090 into the RAM 1012 as necessary and executes each of the procedures described above.

Note that the program module 1093 and the program data 1094 of the above learning program are not limited to being stored in the hard disk drive 1090. For example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 of the above program may be stored in another computer connected via a network such as a LAN or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.

10 Learning device
11 Input unit
12 Output unit
13 Storage unit
14 Control unit
100 Speech recognition device
141 First data input unit
142 Second data input unit
143 Speech recognition unit
144 Encoding unit
145 Attention mechanism unit
146 Decoding unit
147 Learning unit
1471 First learning unit
1472 First distance calculation unit
1473 Second distance calculation unit
1474 Second learning unit
1475 Third distance calculation unit

Claims

A learning device comprising:
a first learning unit that uses, as first teacher data, data associating speech data with correct answer data of information specifying the symbol string indicated by the speech data, to train a speech recognition model used for converting the speech data into information specifying the symbol string indicated by the speech data, the speech recognition model comprising an encoder that outputs an intermediate feature of the speech data, and an attention mechanism that calculates a weight indicating which of the elements constituting the intermediate feature to focus on and a weighted sum of the intermediate feature using the weight, and outputs the calculated weighted sum;
a distance calculation unit that, based on second teacher data associating speech data, noisy speech data obtained by adding noise to the speech data, and correct answer data of information specifying the symbol string indicated by the speech data, calculates a first distance indicating how much the distribution of weights output from the attention mechanism of the speech recognition model trained by the first learning unit differs between when the speech data is input and when the noisy speech data is input, and a second distance indicating how much the information output from the decoder of the speech recognition model for the noisy speech data differs from the correct answer data for the speech data;
a second learning unit that, when training the speech recognition model trained by the first learning unit using the second teacher data, takes the sum of the first distance and the second distance as a loss and updates the parameters of the encoder and the attention mechanism of the speech recognition model so that the loss decreases; and
a speech recognition unit that uses the speech recognition model trained by the second learning unit to convert input speech data into information specifying the symbol string indicated by the speech data.

The learning device according to claim 1, further comprising a third distance calculation unit that calculates a third distance indicating how much the information output from the encoder of the speech recognition model trained by the first learning unit differs between when speech data is input and when noisy speech data, obtained by adding noise to the speech data, is input,
wherein the second learning unit, when training the speech recognition model trained by the first learning unit, takes the sum of the first distance, the second distance, and the third distance as the loss and updates the parameters of the encoder and the attention mechanism of the speech recognition model so that the loss decreases.

The learning device according to claim 1 or 2, wherein the second learning unit, when training the speech recognition model trained by the first learning unit, repeats updating the parameters of the encoder and the attention mechanism of the speech recognition model until the loss becomes equal to or less than a predetermined threshold.

A learning method performed by a learning device, the method comprising:
a first learning step of using, as first teacher data, data associating speech data with correct answer data of information specifying the symbol string indicated by the speech data, to train a speech recognition model used for converting the speech data into information specifying the symbol string indicated by the speech data, the speech recognition model comprising an encoder that outputs an intermediate feature of the speech data, and an attention mechanism that calculates a weight indicating which of the elements constituting the intermediate feature to focus on and a weighted sum of the intermediate feature using the weight, and outputs the calculated weighted sum;
a distance calculation step of calculating, based on second teacher data associating speech data, noisy speech data obtained by adding noise to the speech data, and correct answer data of information specifying the symbol string indicated by the speech data, a first distance indicating how much the distribution of weights output from the attention mechanism of the speech recognition model trained in the first learning step differs between when the speech data is input and when the noisy speech data is input, and a second distance indicating how much the information output from the decoder of the speech recognition model for the noisy speech data differs from the correct answer data for the speech data;
a second learning step of, when training the speech recognition model trained in the first learning step using the second teacher data, taking the sum of the first distance and the second distance as a loss and updating the parameters of the encoder and the attention mechanism of the speech recognition model so that the loss decreases; and
a speech recognition step of converting input speech data into information specifying the symbol string indicated by the speech data, using the speech recognition model trained in the second learning step.

A learning program that causes a computer to execute:
a first learning step of using, as first teacher data, data associating speech data with correct answer data of information specifying the symbol string indicated by the speech data, to train a speech recognition model used for converting the speech data into information specifying the symbol string indicated by the speech data, the speech recognition model comprising an encoder that outputs an intermediate feature of the speech data, and an attention mechanism that calculates a weight indicating which of the elements constituting the intermediate feature to focus on and a weighted sum of the intermediate feature using the weight, and outputs the calculated weighted sum;
a distance calculation step of calculating, based on second teacher data associating speech data, noisy speech data obtained by adding noise to the speech data, and correct answer data of information specifying the symbol string indicated by the speech data, a first distance indicating how much the distribution of weights output from the attention mechanism of the speech recognition model trained in the first learning step differs between when the speech data is input and when the noisy speech data is input, and a second distance indicating how much the information output from the decoder of the speech recognition model for the noisy speech data differs from the correct answer data for the speech data;
a second learning step of, when training the speech recognition model trained in the first learning step using the second teacher data, taking the sum of the first distance and the second distance as a loss and updating the parameters of the encoder and the attention mechanism of the speech recognition model so that the loss decreases; and
a speech recognition step of converting input speech data into information specifying the symbol string indicated by the speech data, using the speech recognition model trained in the second learning step.