JP7385900B2 - Inference machine, inference program and learning method

Info

Publication number: JP7385900B2
Application number: JP2019163555A
Authority: JP
Inventors: 勝李; シュガンルー; 塵辰丁; 達也河原; 恒河井
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2019-09-09
Filing date: 2019-09-09
Publication date: 2023-11-24
Anticipated expiration: 2039-09-09
Also published as: JP2021043272A

Description

The present technology relates to an inference device, an inference program, and a learning method for realizing a speech recognition task.

In the field of speech recognition, end-to-end models, which are neural networks that integrate an acoustic model, a language model, and a lexicon, have been studied and proposed (see, e.g., Non-Patent Documents 1 and 2). Transformer-based automatic speech recognition (ASR) systems are attracting attention as end-to-end models directed to speech recognition tasks (see, e.g., Non-Patent Document 3). Using a Transformer-based end-to-end model facilitates the construction and training of ASR systems.

Non-Patent Documents 4 and 5 disclose research results on acoustic modeling in Transformer-based end-to-end speech recognition systems for Chinese.

Non-Patent Documents 6 and 7 disclose methods for efficiently training a multilingual end-to-end speech recognition system using a single model. More specifically, training is performed using a dataset in which a specific word <LanguageMark> indicating the language of the utterance (for example, <English>, <Mandarin>, <Japanese>, or <German>) is added to the beginning of each utterance. The <LanguageMark> is treated as a label.
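As a rough illustration of the labeling scheme attributed to Non-Patent Documents 6 and 7, the sketch below prepends a language tag to the label sequence of an utterance. The function name `add_language_mark` and the character-level tokenization are assumptions made for this example and are not taken from the cited documents.

```python
def add_language_mark(tokens, language):
    """Prepend a <LanguageMark> token to the label sequence of one utterance.

    `tokens` is the character-level label sequence and `language` is a tag
    name such as "English" or "Mandarin" (both hypothetical inputs).
    """
    return [f"<{language}>"] + list(tokens)


# Example: a two-character Mandarin utterance labeled with its language.
labels = add_language_mark(["你", "好"], "Mandarin")
print(labels)
```

Because the tag is an ordinary label, it enlarges the output vocabulary by only one token per language; the vocabulary growth discussed in the following paragraphs comes from the characters themselves.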

Non-Patent Document 1: A. Graves and N. Jaitly, "Towards End-to-End speech recognition with recurrent neural networks," in Proc. ICML, 2014.
Non-Patent Document 2: A. W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. IEEE-ICASSP, 2016.
Non-Patent Document 3: A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in CoRR abs/1706.03762, 2017.
Non-Patent Document 4: S. Zhou, L. Dong, S. Xu, and B. Xu, "A comparison of modeling units in sequence-to-sequence speech recognition with the transformer on Mandarin Chinese," in CoRR abs/1805.06239, 2018.
Non-Patent Document 5: S. Zhou, L. Dong, S. Xu, and B. Xu, "Syllable-based sequence-to-sequence speech recognition with the transformer in mandarin Chinese," in Proc. INTERSPEECH, 2018.
Non-Patent Document 6: S. Zhou, S. Xu, and B. Xu, "Multilingual end-to-end speech recognition with a single transformer on low-resource languages," in CoRR abs/1806.05059, 2018.
Non-Patent Document 7: B. Li et al., "Multi-dialect speech recognition with a single sequence-to-sequence model," in CoRR abs/1806.05059, 2018.

The methods disclosed in Non-Patent Documents 6 and 7 above perform training at the character level. When multiple languages are trained simultaneously (that is, when a multilingual speech recognition system is to be built using a single model), the number of tokens becomes enormous and the parameter size becomes huge.

An object of the present technology is to provide a technique for realizing a multilingual end-to-end speech recognition system using a model with a smaller parameter size.

According to one embodiment, an inference device is provided that receives as input a speech signal uttered in any of a plurality of languages and outputs the corresponding text. The inference device includes: a trained model that receives an input sequence representing speech features of the speech signal and outputs a representation, at a level different from the character level, that indicates features of the characters included in the corresponding text; and a reconstruction unit that reconstructs the corresponding text from the representation output by the trained model by referring to a predetermined correspondence between characters and the features of those characters.

The representation output from the trained model may include information specifying the structure of each character included in the corresponding text.

The information specifying the structure of a character may include information specifying one or more character parts that constitute the corresponding character.

The information specifying the structure of a character may include information specifying the arrangement of the one or more character parts.

The correspondence may define, for each language, a correspondence between one or more character parts and the corresponding character.
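The reconstruction from character parts described above can be sketched as a table lookup. This is a minimal illustration under stated assumptions, not the patented implementation: the table contents, the name `PART_TABLE`, the function `reconstruct`, and the use of the Unicode ideographic description character "⿰" (left-to-right arrangement) to encode the layout are all hypothetical choices made for this example.

```python
# Hypothetical per-language correspondence: (arrangement, parts) -> character.
# "⿰" (U+2FF0) is used here to mean "parts placed left to right".
PART_TABLE = {
    "Japanese": {
        ("⿰", ("木", "木")): "林",
        ("⿰", ("日", "月")): "明",
    },
    "Mandarin": {
        ("⿰", ("木", "木")): "林",
        ("⿰", ("日", "月")): "明",
    },
}


def reconstruct(language, units, unknown="?"):
    """Rebuild text from a sequence of (arrangement, parts) units.

    Each unit stands for one character as output by the model at the
    character-part level; units missing from the table map to `unknown`.
    """
    table = PART_TABLE[language]
    return "".join(
        table.get((arrangement, tuple(parts)), unknown)
        for arrangement, parts in units
    )


print(reconstruct("Japanese", [("⿰", ["日", "月"]), ("⿰", ["木", "木"])]))
```

In this toy table, two characters are covered by three distinct parts plus one arrangement symbol; the parameter saving the document aims at comes from many characters, across languages, sharing a much smaller inventory of parts.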

The representation output from the trained model may include information specifying the pronunciation of each character included in the corresponding text.

The information specifying the pronunciation of a character may include information specifying the pronunciation of the corresponding character based on universal features that express phonological structure.

The information specifying the pronunciation of a character may include information defining the pronunciation for each character obtained by further decomposing the words included in the corresponding text.

The correspondence may define, for each language, a correspondence between the information specifying pronunciation and the corresponding character.

According to another embodiment, an inference program for realizing the above inference device on a computer is provided.

According to yet another embodiment, a learning method is provided for training an inference device that receives as input a speech signal uttered in any of a plurality of languages and outputs the corresponding text. The learning method includes the steps of: preparing a speech signal and the corresponding text; generating a representation, at a level different from the character level, that indicates features of the characters included in the text; and optimizing the parameters defining the inference device based on the error between the inference result obtained by inputting an input sequence representing speech features of the speech signal into the inference device and the corresponding representation.
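The second and third steps of the learning method can be sketched as follows. This is an illustrative outline only: the decomposition table `DECOMPOSITION`, the helper names, and the choice of cross-entropy as the error measure are assumptions introduced here, since the text above does not name a specific loss function. The parameter update itself (for example, by gradient descent) is omitted.

```python
import math

# Hypothetical decomposition of characters into an arrangement symbol and parts.
DECOMPOSITION = {
    "明": ["⿰", "日", "月"],
    "林": ["⿰", "木", "木"],
}


def to_target(text):
    """Generate the sub-character target representation for a transcript.

    Characters without a table entry are kept as single tokens.
    """
    target = []
    for ch in text:
        target.extend(DECOMPOSITION.get(ch, [ch]))
    return target


def cross_entropy(predicted, target):
    """Mean negative log-likelihood of the target tokens.

    `predicted` holds one dict per output step, mapping token -> probability,
    standing in for the inference result of the model.
    """
    total = -sum(math.log(dist[tok]) for dist, tok in zip(predicted, target))
    return total / len(target)


target = to_target("明林")
# A stand-in inference result that assigns probability 0.5 to every target token.
predicted = [{tok: 0.5} for tok in target]
loss = cross_entropy(predicted, target)
```

An optimizer would then adjust the model parameters to reduce `loss` over the whole training dataset.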

According to yet another embodiment, a learning program for causing a computer to execute the above learning method is provided.

According to the present technology, a multilingual end-to-end speech recognition system can be realized using a model with a smaller parameter size.

FIG. 1 is a schematic diagram showing an example of a Transformer according to related technology of the present invention.
FIG. 2 is a schematic diagram showing an example of a hardware configuration for realizing the speech recognition system according to the present embodiment.
FIG. 3 is a schematic diagram showing an overview of the speech recognition system according to the first example.
FIG. 4 is a diagram for explaining the method of decomposition into character parts in the speech recognition system according to the first example.
FIG. 5 is a diagram showing an example of the character part correspondence table used in the character synthesis unit of the speech recognition system according to the first example.
FIG. 6 is a schematic diagram for explaining the learning process of the speech recognition system according to the first example.
FIG. 7 is a flowchart showing the procedure of the learning process of the speech recognition system according to the first example.
FIG. 8 is a flowchart showing the procedure of the inference process of the speech recognition system according to the first example.
FIG. 9 is a schematic diagram showing an overview of the speech recognition system according to the second example.
FIG. 10 is a schematic diagram for explaining the learning process and the inference process in the speech recognition system according to the second example.
FIG. 11 is a diagram for explaining processing related to the universal speech representation in the speech recognition system according to the second example.
FIG. 12 is a diagram showing an example of the speech feature correspondence table used in the character conversion unit of the speech recognition system according to the second example.
FIG. 13 is a flowchart showing the procedure of the learning process of the speech recognition system according to the second example.
FIG. 14 is a flowchart showing the procedure of the inference process of the speech recognition system according to the second example.

Embodiments of the present invention will be described in detail with reference to the drawings. The same or corresponding parts in the figures are designated by the same reference numerals, and their description will not be repeated.

[A. Overview]
Conventional models used for speech recognition tasks (typically DNN-HMM models) allow only one token to be used as the label for one frame of an utterance. In contrast, end-to-end models such as the Transformer can associate a series of tokens with one frame of an utterance, and thereby offer stronger representational power.

The speech recognition system according to the present embodiment executes a multilingual speech recognition task using an end-to-end model. Unlike existing speech recognition systems, which operate at the character level, the speech recognition system according to the present embodiment uses a representation at a different level.

More specifically, the parameter size is reduced by using a representation that focuses on similarities between languages. As examples of such similarities, the following describes an example that focuses on the structure of ideograms, in which each character represents a meaning (typically kanji) (the first example), and an example that focuses on the structure of phonograms (or phonetic symbols), in which each character represents a phoneme or a syllable (the second example). The technical scope of the present invention is not limited to ideograms and phonograms, and encompasses speech recognition systems that use any similarity between languages.

The first example (ideograms) assumes that a single model is used for multiple languages that use similar ideograms (typically kanji). The trained model is built by treating each kanji as a combination of one or more character parts, such as the "hen" (left-side component) and the "tsukuri" (right-side component).

The second example (phonograms) assumes that a single model is used for multiple languages that use similar phonograms. The trained model is built by treating each character as a combination of one or more articulatory features.

Adopting such a trained model makes it possible to realize a multilingual, real-time speech recognition system while keeping the scale of the model (the parameter size) small. An improvement in recognition performance can also be expected.

The details of the speech recognition system according to the present embodiment are described below.
[B. Transformer]
Any end-to-end model may be used in the speech recognition system according to the present embodiment. At present, examples include the Transformer, models using LSTM (long short-term memory), and the model called BERT. The following description adopts a Transformer-based end-to-end model as a typical example. However, if new end-to-end models are developed as technology advances, the present technology is evidently applicable to such new models as well.

A general Transformer is described below.
FIG. 1 is a schematic diagram showing an example of a Transformer 10 according to related technology of the present invention. Referring to FIG. 1, the Transformer 10 is a trained model and corresponds to one form of neural network.

The Transformer 10 includes N stacked layers of encoder blocks 20 and M stacked layers of decoder blocks 40. The N stacked layers of encoder blocks 20 are collectively referred to as the encoder 200. The M stacked layers of decoder blocks 40 are collectively referred to as the decoder 400.

The encoder 200 outputs an intermediate sequence from an input sequence 2. The decoder 400 outputs an output sequence 70 based on the intermediate sequence output from the encoder 200 and the previously output output sequence.

The encoder 200 (that is, the first of the N layers of encoder blocks 20) receives an input token sequence generated by an input embedding layer 4, a positional embedding layer 6, and an adder 8. The encoder 200 (that is, the last of the N layers of encoder blocks 20) outputs an intermediate sentence representation as its calculation result.

The input embedding layer 4 divides the input sequence 2, such as a sentence, into one or more tokens in predetermined units, and generates a vector of a predetermined dimension indicating the value of each token. The positional embedding layer 6 outputs a positional embedding, which is a value indicating the position of each token within the input sequence 2. The adder 8 adds the positional embedding from the positional embedding layer 6 to the sequence from the input embedding layer 4.
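The combination of token vectors and positional embeddings performed by the adder can be sketched as below. The text above does not specify how the positional embedding is computed; this sketch assumes the sinusoidal encoding from Non-Patent Document 3, and the function names `positional_embedding` and `embed` are hypothetical.

```python
import math


def positional_embedding(position, dim):
    """Sinusoidal positional embedding (an assumed choice, see lead-in).

    Even indices use sine and odd indices use cosine, with each sine/cosine
    pair sharing one frequency.
    """
    values = []
    for i in range(dim):
        pair_index = i - (i % 2)  # 0, 0, 2, 2, 4, 4, ...
        angle = position / (10000 ** (pair_index / dim))
        values.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return values


def embed(token_vectors):
    """Add a positional embedding to each token vector (role of the adder)."""
    dim = len(token_vectors[0])
    return [
        [v + p for v, p in zip(vec, positional_embedding(pos, dim))]
        for pos, vec in enumerate(token_vectors)
    ]


# Two tokens with 4-dimensional (made-up) embedding vectors.
encoder_input = embed([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])
```

Because the attention layers that follow treat their input as an unordered set of vectors, this addition is what lets the model distinguish the first token from the second.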

Each encoder block 20 includes an MHA (Multi-head Attention) layer 22, a feed-forward layer 26, and addition/normalization (Add & Norm) layers 24 and 28.

The MHA layer 22 calculates attention for the input token sequence (vectors). The addition/normalization layer 24 adds the vector output from the MHA layer 22 to the input token sequence (vectors) and then normalizes the result by an arbitrary method. The feed-forward layer 26 shifts the position (that is, the input time) of the input vector. The addition/normalization layer 28 adds the vector output from the feed-forward layer 26 to the vector output from the addition/normalization layer 24 and then normalizes the result by an arbitrary method.
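The attention calculation followed by the Add & Norm step can be sketched in plain Python as below. Several simplifications are assumed: a single attention head without learned projection matrices instead of full multi-head attention, scaled dot-product attention as in Non-Patent Document 3, and layer normalization without learned gain or bias as the "arbitrary" normalization method. The feed-forward layer is left out of this sketch.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual addition followed by per-vector layer normalization."""
    result = []
    for xi, si in zip(x, sublayer_out):
        summed = [a + b for a, b in zip(xi, si)]
        mean = sum(summed) / len(summed)
        var = sum((s - mean) ** 2 for s in summed) / len(summed)
        result.append([(s - mean) / math.sqrt(var + eps) for s in summed])
    return result


tokens = [[1.0, 0.0], [0.0, 1.0]]
attended = attention(tokens, tokens, tokens)  # self-attention
normalized = add_and_norm(tokens, attended)
```

In self-attention the queries, keys, and values all come from the same token sequence, which is why `tokens` is passed three times.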

The decoder 400 (that is, the first of the M layers of decoder blocks 40) receives an output token sequence generated by an output embedding layer 14, a positional embedding layer 16, and an adder 18. The decoder 400 (that is, the last of the M layers of decoder blocks 40) outputs an output sequence as its calculation result.

The output embedding layer 14 divides the already-output sequence 12 (Outputs (shifted right), that is, the previous output sequence shifted so that its timing is aligned) into one or more tokens in predetermined units, and generates a vector of a predetermined dimension indicating the value of each token. The positional embedding layer 16 outputs a positional embedding, which is a value indicating the position of each token within the already-output sequence 12. The adder 18 adds the positional embedding from the positional embedding layer 16 to the token sequence from the output embedding layer 14.

Each decoder block 40 includes an MMHA (Masked Multi-head Attention) layer 42, an MHA (Multi-head Attention) layer 46, a feed-forward layer 50, and addition/normalization (Add & Norm) layers 44, 48, and 52. That is, the decoder block 40 has a configuration similar to that of the encoder block 20, but differs in that it additionally includes the MMHA layer 42 and the addition/normalization layer 44.

The MMHA layer 42 applies masking to those previously calculated vectors that cannot exist. The addition/normalization layer 44 adds the vector output from the MMHA layer 42 to the output token sequence (vectors) and then normalizes the result by an arbitrary method.
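One common way to realize the masking described above is a causal mask, under which each output position may attend only to itself and to earlier positions. The sketch below assumes that interpretation; the helper names `causal_mask` and `masked_softmax` are hypothetical.

```python
import math


def causal_mask(n):
    """Return an n x n table where entry [i][j] is True if position i
    may attend to position j (that is, j is not in the future of i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax that gives exactly zero weight to disallowed positions.

    At least one position must be allowed, which always holds for a causal
    mask because every position may attend to itself.
    """
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
# Attention weights for the second output position: the third is hidden.
weights = masked_softmax([0.0, 0.0, 0.0], mask[1])
```

Setting masked scores to negative infinity before the softmax removes future positions from the weighted sum without changing the shape of the computation.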

The MHA layer 46 calculates attention for the intermediate sentence representation output from the addition/normalization layer 28 of the encoder block 20 and the vector output from the addition/normalization layer 44. The basic processing of the MHA layer 46 is the same as that of the MHA layer 22. The addition/normalization layer 48 adds the vector output from the MHA layer 46 to the vector output from the addition/normalization layer 44 and then normalizes the result by an arbitrary method. The feed-forward layer 50 shifts the position (that is, the input time) of the input vector. The addition/normalization layer 52 adds the vector output from the feed-forward layer 50 to the vector output from the MHA layer 46 and then normalizes the result by an arbitrary method.

The Transformer 10 includes a softmax layer 60 as its output layer. The softmax layer 60 inputs the vector output from the decoder 400 into the softmax function and outputs the result as the output sequence 70.
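How the decoder consumes its own previous output can be sketched as a decoding loop. Greedy selection of the most probable token, the start and end markers `<s>` and `</s>`, and the `step` callback (standing in for the decoder 400 plus the softmax layer 60) are all assumptions made for this example.

```python
def greedy_decode(step, memory, bos="<s>", eos="</s>", max_len=50):
    """Generate an output sequence one token at a time.

    `step(memory, output)` must return a dict mapping each candidate token
    to its probability, given the encoder output `memory` and the tokens
    generated so far.
    """
    output = [bos]
    for _ in range(max_len):
        probs = step(memory, output)
        token = max(probs, key=probs.get)  # most probable next token
        if token == eos:
            break
        output.append(token)
    return output[1:]  # drop the start marker


def toy_step(memory, output):
    """Stand-in model that emits "a", "b", then the end marker."""
    script = ["a", "b", "</s>"]
    return {script[len(output) - 1]: 0.9, "x": 0.1}


decoded = greedy_decode(toy_step, memory=None)
```

In the system described in this document, the decoded tokens would be units of the sub-character representation, which a later reconstruction stage converts into text.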

[C. Hardware Configuration]
Next, an example of a hardware configuration for realizing the speech recognition system according to the present embodiment is described.

FIG. 2 is a schematic diagram showing an example of a hardware configuration for realizing the speech recognition system according to the present embodiment. The speech recognition system is typically realized using an information processing device 500, which is an example of a computer.

Referring to FIG. 2, the information processing device 500 that realizes the speech recognition system includes, as its main hardware components, a CPU (central processing unit) 502, a GPU (graphics processing unit) 504, a main memory 506, a display 508, a network interface (I/F) 510, a secondary storage device 512, an input device 522, and an optical drive 524. These components are connected to one another via an internal bus 528.

The CPU 502 and/or the GPU 504 are processors that execute the processing necessary to realize the speech recognition system according to the present embodiment. A plurality of CPUs 502 and GPUs 504 may be provided, and each may have a plurality of cores.

The main memory 506 is a storage area that temporarily stores (or caches) program code, work data, and the like while the processor (the CPU 502 and/or the GPU 504) executes processing. It is composed of, for example, a volatile memory device such as a DRAM (dynamic random access memory) or an SRAM (static random access memory).

The display 508 is a display unit that outputs the user interface for the processing, the processing results, and the like. It is composed of, for example, an LCD (liquid crystal display) or an organic EL (electroluminescence) display.

The network interface 510 exchanges data with any information processing device on the Internet or an intranet. Any communication method, such as Ethernet (registered trademark), wireless LAN (local area network), or Bluetooth (registered trademark), can be adopted for the network interface 510.

The input device 522 is a device that accepts instructions and operations from the user, and is composed of, for example, a keyboard, a mouse, a touch panel, or a pen. The input device 522 may also include a sound collection device for collecting the speech signals necessary for training and decoding, or an interface for accepting speech signals collected by a sound collection device.

The optical drive 524 reads information stored on an optical disc 526, such as a CD-ROM (compact disc read only memory) or a DVD (digital versatile disc), and outputs it to other components via the internal bus 528. The optical disc 526 is an example of a non-transitory recording medium and is distributed with an arbitrary program stored on it in a non-volatile manner. When the optical drive 524 reads the program from the optical disc 526 and installs it in the secondary storage device 512 or the like, the computer comes to function as the information processing device 500. Accordingly, the subject matter of the present invention may also be the program itself installed in the secondary storage device 512 or the like, or a recording medium such as the optical disc 526 that stores a program for realizing the functions and processing according to the present embodiment.

Although FIG. 2 shows an optical recording medium such as the optical disc 526 as an example of a non-transitory recording medium, the recording medium is not limited to this. A semiconductor recording medium such as a flash memory, a magnetic recording medium such as a hard disk or storage tape, or a magneto-optical recording medium such as an MO (magneto-optical disk) may also be used.

The secondary storage device 512 stores the programs and data necessary for the computer to function as the information processing device 500. It is composed of, for example, a non-volatile storage device such as a hard disk or an SSD (solid state drive).

More specifically, the secondary storage device 512 stores, in addition to an OS (operating system) not shown, a learning program 514 for realizing the learning process, model definition data 516 that defines the structure of the model used in the speech recognition system, a parameter set 518 consisting of a plurality of parameters that define the trained model used in the speech recognition system, an inference program 520, and a training dataset 530.

The learning program 514 is executed by the processor (the CPU 502 and/or the GPU 504) to realize the learning process for determining the parameter set 518. That is, the learning program 514 causes the computer to execute the learning process for training the inference device (the speech recognition system).

The model definition data 516 includes information for defining the components included in the model that constitutes the speech recognition system, the connections between those components, and the like.

The parameter set 518 includes the parameters for each component that constitutes the speech recognition system. Each parameter included in the parameter set 518 is optimized by executing the learning program 514.

The inference program 520 executes the inference process using the model defined by the parameter set 518. That is, the inference program 520 realizes, on a computer, an inference device as described later. The training dataset 530 consists of combinations of data such as those shown in FIG. 4.

プロセッサ（ＣＰＵ５０２および／またはＧＰＵ５０４）がプログラムを実行する際に必要となるライブラリや機能モジュールの一部を、ＯＳが標準で提供するライブラリまたは機能モジュールにより代替してもよい。この場合には、プログラム単体では、対応する機能を実現するために必要なプログラムモジュールのすべてを含むものにはならないが、ＯＳの実行環境下にインストールされることで、目的の処理を実現できる。このような一部のライブラリまたは機能モジュールを含まないプログラムであっても、本発明の技術的範囲に含まれ得る。 A part of the library or function module required when the processor (CPU 502 and/or GPU 504) executes a program may be replaced by a library or function module provided as standard by the OS. In this case, although a single program does not include all of the program modules necessary to implement the corresponding function, it is possible to implement the desired processing by installing it in the execution environment of the OS. Even programs that do not include some of these libraries or functional modules may be included in the technical scope of the present invention.

また、これらのプログラムは、上述したようないずれかの記録媒体に格納されて流通するだけでなく、インターネットまたはイントラネットを介してサーバ装置などからダウンロードすることで配布されてもよい。 Furthermore, these programs may not only be stored and distributed in any of the recording media as described above, but may also be distributed by being downloaded from a server device or the like via the Internet or an intranet.

図２には、単一のコンピュータを用いて情報処理装置５００を構成する例を示すが、これに限らず、コンピュータネットワークを介して接続された複数のコンピュータが明示的または黙示的に連携して、音声認識システムを構成する学習済モデルおよび学習済モデルを用いた推論器を実現するようにしてもよい。 Although FIG. 2 shows an example in which the information processing device 500 is configured using a single computer, the configuration is not limited to this. A plurality of computers connected via a computer network may cooperate explicitly or implicitly to realize the trained model constituting the speech recognition system and the inference device using the trained model.

プロセッサ（ＣＰＵ５０２および／またはＧＰＵ５０４）がプログラムを実行することで実現される機能の全部または一部を、集積回路などのハードワイヤード回路（hard-wired circuit）を用いて実現してもよい。例えば、ＡＳＩＣ（application specific integrated circuit）やＦＰＧＡ（field-programmable gate array）などを用いて実現してもよい。 All or part of the functions realized by the processor (CPU 502 and/or GPU 504) executing a program may be realized using a hard-wired circuit such as an integrated circuit. For example, it may be realized using an ASIC (application specific integrated circuit), an FPGA (field-programmable gate array), or the like.

当業者であれば、本発明が実施される時代に応じた技術を適宜用いて、本実施の形態に従う情報処理装置５００を実現できるであろう。 Those skilled in the art will be able to implement information processing device 500 according to this embodiment by appropriately using techniques appropriate to the era in which the present invention is implemented.

説明の便宜上、同一の情報処理装置５００を用いて、学習処理および推論処理を実行する例を示すが、学習処理および推論処理を異なるハードウェアを用いて実現してもよい。 For convenience of explanation, an example is shown in which the same information processing device 500 is used to perform the learning process and the inference process, but the learning process and the inference process may be implemented using different hardware.

［Ｄ．第１の実施例（表意文字）］ [D. First embodiment (ideograms)]
第１の実施例として、漢字などの表意文字を用いる複数の言語に対して単一のモデルを用いた音声認識システムについて説明する。 As a first embodiment, a speech recognition system that uses a single model for multiple languages written with ideographic characters such as Chinese characters will be described.

（ｄ１：概要） (d1: Overview)
図３は、第１の実施例に従う音声認識システム１００Ａの概要を示す模式図である。図３を参照して、音声認識システム１００Ａは、音声特徴を示す入力シーケンス２の入力を受けて、対応するテキストを出力シーケンス７０として出力する。すなわち、音声認識システム１００Ａは、複数の言語のうち任意の言語で発話された音声信号の入力を受けて、対応するテキストを出力する推論器に相当する。 FIG. 3 is a schematic diagram showing an overview of the speech recognition system 100A according to the first embodiment. Referring to FIG. 3, the speech recognition system 100A receives an input sequence 2 indicating speech features and outputs the corresponding text as an output sequence 70. That is, the speech recognition system 100A corresponds to an inference device that receives a speech signal uttered in any of a plurality of languages and outputs the corresponding text.

出力シーケンス７０の先頭には、いずれの言語であるかを示す言語ラベル７２（＜ＴＷ＞，＜ＨＫ＞，＜ＭＡ＞など）が付加されている。このような言語ラベル７２が付加されることによって、いずれの言語であるかを一意に特定できる。 At the beginning of the output sequence 70, a language label 72 (<TW>, <HK>, <MA>, etc.) indicating which language it is in is added. By adding such a language label 72, the language can be uniquely identified.

音声認識システム１００Ａは、Ｔｒａｎｓｆｏｒｍｅｒ１０と、文字合成部８０とを含む。 The speech recognition system 100A includes a Transformer 10 and a character synthesis section 80.

Ｔｒａｎｓｆｏｒｍｅｒ１０は、音声信号の音声特徴を示す入力シーケンス２を受けて、対応するテキストに含まれる文字の特徴を示す、文字（character）レベルとは異なるレベルの表現を出力する学習済モデルに相当する。より具体的には、Ｔｒａｎｓｆｏｒｍｅｒ１０は、漢字を構成する１または複数の文字部品を示す、文字レベルではなく、異なるレベルの表現（以下、「文字部品表現８２」あるいは「Decomposed Character representation」とも称す。）を用いる。文字部品表現８２は、対応するテキストに含まれる各文字の構造を特定する情報を含む（詳細については後述する）。 The Transformer 10 corresponds to a trained model that receives the input sequence 2 indicating the speech features of the speech signal and outputs a representation, at a level different from the character level, that indicates the features of the characters included in the corresponding text. More specifically, instead of a character-level representation, the Transformer 10 uses a representation at a different level that indicates the one or more character parts constituting a kanji (hereinafter also referred to as "character part representation 82" or "Decomposed Character representation"). The character part representation 82 includes information specifying the structure of each character included in the corresponding text (details will be described later).

本明細書において、「文字部品」は、出力すべきテキストを構成する少なくとも一部分を構成する要素を意味し、言語体系などに応じて任意に決定できる単位で規定される。 In this specification, a "character component" means an element that constitutes at least a part of a text to be output, and is defined in units that can be arbitrarily determined depending on the language system and the like.

文字合成部８０は、予め定められた文字と当該文字の特徴との対応関係を参照して、Ｔｒａｎｓｆｏｒｍｅｒ１０（学習済モデル）から出力される表現から対応するテキストを再構成する再構成部に相当する。より具体的には、文字合成部８０は、Ｔｒａｎｓｆｏｒｍｅｒ１０から出力される文字部品表現８２の入力を受けて、出力すべき文字（漢字）に合成して、出力シーケンス７０として出力する。 The character synthesis unit 80 corresponds to a reconstruction unit that reconstructs the corresponding text from the representation output by the Transformer 10 (trained model), with reference to a predetermined correspondence between characters and the features of those characters. More specifically, the character synthesis unit 80 receives the character part representation 82 output from the Transformer 10, synthesizes it into the characters (kanji) to be output, and outputs them as the output sequence 70.

第１の実施例においては、漢字を構成する１または複数の文字部品に分解した状態を示す表現を用いてモデルの学習を行う。 In the first embodiment, a model is trained using an expression that shows a state in which a kanji is broken down into one or more character parts.

（ｄ２：文字部品表現８２） (d2: Character part representation 82)
図３に示す文字部品表現８２は、典型的には、以下のようなデータ構造のシーケンスとして出力される。 The character part representation 82 shown in FIG. 3 is typically output as a sequence with one of the following data structures.

（１）＜言語ラベル＞［部品特定情報］，［部品特定情報］，・・・，＜区切文字＞，［部品特定情報］，［部品特定情報］，・・・ (1) <Language label> [Part identification information], [Part identification information], ..., <Delimiter>, [Part identification information], [Part identification information], ...
（２）＜言語ラベル＞［構造特定情報］，［部品特定情報］，［部品特定情報］，・・・，＜区切文字＞，［部品特定情報］，［部品特定情報］，・・・ (2) <Language label> [Structure identification information], [Part identification information], [Part identification information], ..., <Delimiter>, [Part identification information], [Part identification information], ...
文字部品表現８２に含まれる＜言語ラベル＞は、いずれの言語であるかを特定するための情報を含む。＜言語ラベル＞としては、例えば、＜ＴＷ＞（台湾），＜ＨＫ＞（香港），＜ＭＡ＞（中国標準語）などが用いられる。 The <language label> included in the character part representation 82 includes information for specifying the language. As the <language label>, for example, <TW> (Taiwan), <HK> (Hong Kong), <MA> (Mandarin Chinese), and the like are used.

文字部品表現８２に含まれる［部品特定情報］は、対応する文字を構成する文字部品を特定するための情報を含む。文字部品表現８２に含まれる＜区切文字＞は、出力される文字の区切りを意味し、＜区切文字＞から次の＜区切文字＞までに存在する［部品特定情報］に基づいて、出力すべき文字が再構成される。＜区切文字＞としては、単にブランク（無出力）を用いてもよい。このように、文字部品表現８２は、対応する文字を構成する１または複数の文字部品を特定する情報を含む。 The [part identification information] included in the character part representation 82 includes information for specifying the character parts that constitute the corresponding character. The <delimiter> included in the character part representation 82 marks a boundary between output characters; each character to be output is reconstructed from the [part identification information] located between one <delimiter> and the next. A simple blank (no output) may be used as the <delimiter>. In this way, the character part representation 82 includes information that specifies the one or more character parts constituting the corresponding character.

文字部品表現８２に含まれる［構造特定情報］は、対応する文字を構成する文字部品の組み合わせに係る構造を特定するための情報を含む。例えば、ある文字が横並びで配置された２つの文字部品で構成されている場合において、［構造特定情報］は、横並びで配置されていることを示す情報を含むことになる。このように、文字部品表現８２は、１または複数の文字部品の配置を特定する情報を含んでいてもよい。 [Structure identification information] included in the character part expression 82 includes information for specifying a structure related to a combination of character parts forming a corresponding character. For example, in the case where a certain character is composed of two character parts arranged side by side, the [structure specifying information] includes information indicating that the characters are arranged side by side. In this way, the character part representation 82 may include information specifying the arrangement of one or more character parts.

なお、上述した文字部品表現８２のデータ構造は一例であり、文字を再構成できるものであれば、どのようなデータ構造を採用してもよい。さらに、文字部品表現８２には、より多くの情報を含めるようにしてもよい。 Note that the data structure of the character part representation 82 described above is just an example, and any data structure may be employed as long as it allows characters to be reconstructed. Furthermore, the character part representation 82 may include more information.

（ｄ３：文字部品への分解） (d3: Decomposition into character parts)
次に、文字を文字部品に分解する方法の一例について説明する。 Next, an example of a method for decomposing characters into character parts will be described.

図４は、第１の実施例に従う音声認識システム１００Ａにおける文字部品への分解の方法を説明するための図である。図４を参照して、複数の文字の構造８０２が規定されており、各文字についていずれの構造８０２に該当するのかが決定された上で、決定された構造８０２に応じて、各文字が１または複数の文字部品８０４に分解される。 FIG. 4 is a diagram for explaining a method of decomposition into character parts in the speech recognition system 100A according to the first embodiment. Referring to FIG. 4, a plurality of character structures 802 are defined. For each character, the applicable structure 802 is first determined, and the character is then decomposed into one or more character parts 804 according to the determined structure 802.

したがって、各文字からは、決定された構造８０２の情報と、当該決定された構造８０２の情報に基づいて分解された１または複数の文字部品８０４との情報が生成される（単純分解８０６）。 Therefore, from each character, information on the determined structure 802 and one or more character parts 804 decomposed based on the information on the determined structure 802 is generated (simple decomposition 806).

さらに、文字によっては、複数の構造８０２を有していると決定され、それぞれの構造８０２に従って文字部品８０４の情報が生成されてもよい（混合構造８０８）。 Further, some characters may be determined to have multiple structures 802, and information on character parts 804 may be generated according to each structure 802 (mixed structure 808).

文字の構造８０２については、漢字の構造に基づいて任意のパターンを決定すればよいが、典型例としては、１２種類の構造８０２を予め用意すればよい。 Regarding the character structure 802, any pattern may be determined based on the structure of Chinese characters, but as a typical example, 12 types of structures 802 may be prepared in advance.
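
The decomposition described in (d2) and (d3) can be illustrated with the following minimal Python sketch. It is not the implementation disclosed here: the table `DECOMPOSITION`, the delimiter token `<sp>`, and the function name `decompose_text` are hypothetical, and the use of Unicode Ideographic Description Characters (such as ⿰ for left-right and ⿱ for top-bottom) as the structure identification information is an assumption made only for illustration.

```python
# Hypothetical decomposition table: character -> (structure, character parts).
# The structure symbols are Unicode Ideographic Description Characters,
# used here only as one possible encoding of "structure identification information".
DECOMPOSITION = {
    "好": ("⿰", ["女", "子"]),  # left-right arrangement
    "明": ("⿰", ["日", "月"]),  # left-right arrangement
    "花": ("⿱", ["艹", "化"]),  # top-bottom arrangement
}

DELIMITER = "<sp>"  # assumed delimiter token between characters


def decompose_text(text, language_label, with_structure=True):
    """Convert text into a token sequence shaped like data structure (1) or (2)."""
    tokens = [language_label]
    for index, char in enumerate(text):
        if index > 0:
            tokens.append(DELIMITER)
        # Characters missing from the table are kept as a single part.
        structure, parts = DECOMPOSITION.get(char, (None, [char]))
        if with_structure and structure is not None:
            tokens.append(structure)
        tokens.extend(parts)
    return tokens


print(decompose_text("明好", "<MA>"))
print(decompose_text("明好", "<MA>", with_structure=False))
```

With `with_structure=True` the output follows data structure (2); with `with_structure=False` it follows data structure (1).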

（ｄ４：文字合成部８０） (d4: Character synthesis unit 80)
次に、第１の実施例に従う音声認識システム１００Ａの文字合成部８０（図３参照）における処理例について説明する。 Next, an example of processing in the character synthesis unit 80 (see FIG. 3) of the speech recognition system 100A according to the first embodiment will be described.

上述したように、文字部品表現８２は、出力すべき文字を構成する１または複数の文字部品を特定するための部品特定情報からなる。文字合成部８０は、文字部品表現８２に含まれる文字ごとに規定される１または複数の部品特定情報に基づいて、出力すべき文字を再構成する。文字部品表現８２は文字部品対応テーブル８４を有しており、文字部品対応テーブル８４に基づいて、文字が再構成される。 As described above, the character part expression 82 consists of part specifying information for specifying one or more character parts that constitute a character to be output. The character synthesis unit 80 reconstructs characters to be output based on one or more pieces of part specifying information defined for each character included in the character part representation 82. The character part expression 82 has a character part correspondence table 84, and characters are reconstructed based on the character part correspondence table 84.

文字部品対応テーブル８４は、言語ごとに、１または複数の文字部品と対応する文字との対応関係を規定する。 The character parts correspondence table 84 defines the correspondence between one or more character parts and corresponding characters for each language.

図５は、第１の実施例に従う音声認識システム１００Ａの文字合成部８０において利用される文字部品対応テーブル８４の一例を示す図である。図５を参照して、文字部品対応テーブル８４は、１または複数の文字部品の組み合わせを規定する組み合わせ定義８４２と、対応する文字８４４との組を複数含む。 FIG. 5 is a diagram showing an example of the character-component correspondence table 84 used in the character synthesis section 80 of the speech recognition system 100A according to the first embodiment. Referring to FIG. 5, character component correspondence table 84 includes a plurality of sets of combination definitions 842 that define combinations of one or more character components and corresponding characters 844.

文字合成部８０は、Ｔｒａｎｓｆｏｒｍｅｒ１０から出力される文字部品表現８２に含まれる区切文字の位置で区切って、１または複数の部品特定情報を抽出する。そして、文字合成部８０は、抽出した１または複数の部品特定情報をキーにして文字部品対応テーブル８４を参照することで、対応する文字を決定する。文字部品対応テーブル８４を参照した文字の決定処理を繰り返すことで、入力シーケンス２に対応するテキストを出力シーケンス７０として出力する。 The character synthesis unit 80 separates the character part expression 82 output from the Transformer 10 at the position of the delimiter and extracts one or more parts specifying information. Then, the character synthesis unit 80 determines a corresponding character by referring to the character-component correspondence table 84 using the extracted one or more pieces of component specifying information as a key. By repeating the character determination process with reference to the character-component correspondence table 84, the text corresponding to the input sequence 2 is output as the output sequence 70.

文字部品対応テーブル８４は、言語ごとに用意されてもよい。この場合には、文字合成部８０は、Ｔｒａｎｓｆｏｒｍｅｒ１０から出力される文字部品表現８２のシーケンスの先頭に含まれる言語ラベルの値に基づいて、対応する言語の文字部品対応テーブル８４を選択する。 The character component correspondence table 84 may be prepared for each language. In this case, the character synthesis unit 80 selects the character-component correspondence table 84 of the corresponding language based on the value of the language label included at the beginning of the sequence of character-component expressions 82 output from the Transformer 10.

さらに、文字部品対応テーブル８４は、各データに関連付けて構造特定情報（対応する文字を構成する文字部品の組み合わせに係る構造を特定するための情報）を含んでいてもよい。構造特定情報を付加することで、同じ文字部品で構成されるものの、配置が異なる文字同士を区別することができる。 Further, the character-component correspondence table 84 may include structure identification information (information for specifying a structure related to a combination of character parts constituting a corresponding character) in association with each data. By adding structure identification information, it is possible to distinguish between characters that are composed of the same character parts but that are arranged differently.

上述のような文字部品対応テーブル８４を参照することで、Ｔｒａｎｓｆｏｒｍｅｒ１０から出力される文字部品表現８２から出力シーケンス７０を生成できる。 By referring to the character-component correspondence table 84 as described above, the output sequence 70 can be generated from the character-component representation 82 output from the Transformer 10.
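
As a rough illustration of the lookup performed by the character synthesis unit 80, the following Python sketch splits a token sequence at the delimiters, selects a per-language table from the leading language label, and uses each group of tokens as a key. The table contents, the `<sp>` delimiter, the function name `synthesize`, and the fallback for unknown keys are all assumptions, not details taken from this description. The two entries 叻 and 另 share the same parts in the same order and differ only in arrangement, which shows why structure identification information can be needed in the key.

```python
DELIMITER = "<sp>"  # assumed delimiter token between characters

# Hypothetical per-language tables: (structure, part, part, ...) -> character.
TABLES = {
    "<HK>": {
        ("⿰", "口", "力"): "叻",  # same parts, left-right arrangement
        ("⿱", "口", "力"): "另",  # same parts, top-bottom arrangement
        ("⿰", "女", "子"): "好",
    },
    "<MA>": {
        ("⿰", "日", "月"): "明",
        ("⿰", "女", "子"): "好",
    },
}


def synthesize(tokens, tables):
    """Rebuild text from a token sequence that starts with a language label."""
    language_label, body = tokens[0], tokens[1:]
    table = tables[language_label]  # select the table for the labelled language
    chars, group = [], []
    # A trailing delimiter acts as a sentinel so the last group is flushed.
    for token in body + [DELIMITER]:
        if token == DELIMITER:
            if group:
                # Assumed fallback: emit the raw parts when the key is unknown.
                chars.append(table.get(tuple(group), "".join(group)))
                group = []
        else:
            group.append(token)
    return language_label, "".join(chars)


print(synthesize(["<HK>", "⿰", "口", "力", "<sp>", "⿱", "口", "力"], TABLES))
```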

（ｄ５：学習処理） (d5: Learning process)
次に、第１の実施例に従う音声認識システム１００Ａの学習処理についての一例について説明する。 Next, an example of the learning process of the speech recognition system 100A according to the first embodiment will be described.

図６は、第１の実施例に従う音声認識システム１００Ａの学習処理を説明するための模式図である。図６を参照して、トレーニングデータセットとして、音声特徴を示す入力シーケンス２と対応するテキスト６４との組が用意される。テキスト６４には、いずれの言語であるかを示す言語ラベルを含んでいてもよい。 FIG. 6 is a schematic diagram for explaining the learning process of the speech recognition system 100A according to the first embodiment. Referring to FIG. 6, a set of input sequence 2 indicating voice features and corresponding text 64 is prepared as a training data set. The text 64 may include a language label indicating which language it is in.

学習処理においては、前処理として、テキスト６４に含まれる各文字を１または複数の文字部品に分解した文字部品表現８２が生成される。文字部品表現８２の生成に際して、文字部品対応テーブル８４が必要に応じて参照されるとともに、文字部品対応テーブル８４の内容が適宜更新されてもよい。 In the learning process, as a preprocess, a character part representation 82 is generated in which each character included in the text 64 is decomposed into one or more character parts. When generating the character-component representation 82, the character-component correspondence table 84 may be referred to as necessary, and the contents of the character-component correspondence table 84 may be updated as appropriate.

そして、入力シーケンス２と対応する文字部品表現８２との組をトレーニングデータとして用いて、モデル（Ｔｒａｎｓｆｏｒｍｅｒ１０）を学習する。モデルの学習方法自体については、公知の技術を適宜採用することができる。 Then, the model (Transformer 10) is learned using the set of the input sequence 2 and the corresponding character part representation 82 as training data. As for the model learning method itself, publicly known techniques can be adopted as appropriate.

図７は、第１の実施例に従う音声認識システム１００Ａの学習処理の手順を示すフローチャートである。図７に示す主要なステップは、典型的には、情報処理装置５００のプロセッサ（ＣＰＵ５０２および／またはＧＰＵ５０４）が学習プログラム５１４を実行することで実現される。 FIG. 7 is a flowchart showing the procedure of the learning process of the speech recognition system 100A according to the first embodiment. The main steps shown in FIG. 7 are typically realized by the processor (CPU 502 and/or GPU 504) of the information processing device 500 executing the learning program 514.

図７を参照して、情報処理装置５００は、音声特徴を示す入力シーケンス２と対応するテキストとの組からなるトレーニングデータセットの入力を受け付ける（ステップＳ１００）。情報処理装置５００は、受け付けたトレーニングデータセットのテキストに含まれる各文字を、所定規則に従って１または複数の文字部品の組み合わせに分解することで、文字部品表現８２を生成する（ステップＳ１０２）。このように、情報処理装置５００は、テキストに含まれる文字の特徴を示す、文字レベルとは異なるレベルの表現を生成する。そして、情報処理装置５００は、音声特徴を示す入力シーケンス２と対応する文字部品表現８２との組み合わせからなるトレーニングデータセットを生成する（ステップＳ１０４）。 Referring to FIG. 7, information processing device 500 receives an input of a training data set consisting of a pair of input sequence 2 indicating voice features and corresponding text (step S100). The information processing device 500 generates a character part representation 82 by decomposing each character included in the text of the received training data set into a combination of one or more character parts according to a predetermined rule (step S102). In this way, the information processing device 500 generates an expression at a level different from the character level that indicates the characteristics of characters included in the text. Then, the information processing device 500 generates a training data set consisting of a combination of the input sequence 2 indicating the voice feature and the corresponding character part representation 82 (step S104).

続いて、情報処理装置５００は、Ｔｒａｎｓｆｏｒｍｅｒ１０のパラメータを初期化する（ステップＳ１０６）。そして、パラメータの最適化が実行される。すなわち、トレーニングデータセットを用いてＴｒａｎｓｆｏｒｍｅｒ１０に含まれるパラメータが最適化される。 Subsequently, the information processing device 500 initializes the parameters of the Transformer 10 (step S106). Parameter optimization is then performed. That is, the parameters included in the Transformer 10 are optimized using the training data set.

より具体的には、情報処理装置５００は、トレーニングデータセットに含まれる入力シーケンス２をＴｒａｎｓｆｏｒｍｅｒ１０に入力して出力シーケンス（文字部品表現８２の推論結果）を演算する（ステップＳ１０８）。そして、情報処理装置５００は、出力シーケンス（推論結果）と、トレーニングデータセットの対応する文字部品表現８２（正解データ）とを比較して誤差情報を演算し（ステップＳ１１０）、当該演算した誤差情報に基づいてＴｒａｎｓｆｏｒｍｅｒ１０のパラメータを最適化する（ステップＳ１１２）。 More specifically, the information processing device 500 inputs the input sequence 2 included in the training data set to the Transformer 10 and calculates the output sequence (the inference result for the character part representation 82) (step S108). Then, the information processing device 500 compares the output sequence (inference result) with the corresponding character part representation 82 (correct data) of the training data set to calculate error information (step S110), and optimizes the parameters of the Transformer 10 based on the calculated error information (step S112).

情報処理装置５００は、予め定められた学習処理の終了条件が満たされているか否かを判断する（ステップＳ１１４）。予め定められた学習処理の終了条件が満たされていなければ（ステップＳ１１４においてＮＯ）、情報処理装置５００は、トレーニングデータセットに含まれるトレーニングデータを選択して、ステップＳ１０８以下の処理を再度実行する。 The information processing device 500 determines whether a predetermined termination condition for the learning process is satisfied (step S114). If the termination condition is not satisfied (NO in step S114), the information processing device 500 selects training data included in the training data set and executes the processing from step S108 onward again.

これに対して、予め定められた学習処理の終了条件が満たされていれば（ステップＳ１１４においてＹＥＳ）、情報処理装置５００は、当該時点のパラメータ値で規定されるＴｒａｎｓｆｏｒｍｅｒ１０を学習済モデルとして決定する（ステップＳ１１６）。このときのパラメータ値が、学習済モデルを規定するパラメータセット５１８として出力される。そして、処理は終了する。 On the other hand, if the predetermined termination condition for the learning process is satisfied (YES in step S114), the information processing device 500 determines the Transformer 10 defined by the parameter values at that time as the trained model (step S116). The parameter values at this time are output as the parameter set 518 that defines the trained model. Then, the process ends.
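
The control flow of steps S106 to S116 can be sketched in Python as follows. This is only an outline under stated assumptions: the function `train` and its callback parameters are hypothetical names, and a one-parameter toy model stands in for the Transformer 10 so that the loop can run without any external library. The description leaves the actual optimization method to publicly known techniques.

```python
def train(dataset, init_params, forward, loss_fn, update, is_done, max_steps=10_000):
    """Generic loop mirroring steps S106 to S116 (all callbacks are hypothetical)."""
    params = init_params()                                     # step S106: initialize
    for step in range(max_steps):
        features, target = dataset[step % len(dataset)]       # select training data
        prediction = forward(params, features)                 # step S108: inference
        error = loss_fn(prediction, target)                    # step S110: error information
        params = update(params, features, prediction, target)  # step S112: optimize
        if is_done(step, error):                               # step S114: termination check
            break
    return params                                              # step S116: trained parameters


# Toy stand-in for the model: one scalar weight w with prediction w * x.
toy_dataset = [(1.0, 2.0), (2.0, 4.0)]
learned_w = train(
    toy_dataset,
    init_params=lambda: 0.0,
    forward=lambda w, x: w * x,
    loss_fn=lambda y_hat, y: (y_hat - y) ** 2,
    # Gradient step on the squared error with an assumed learning rate of 0.1.
    update=lambda w, x, y_hat, y: w - 0.1 * 2.0 * (y_hat - y) * x,
    is_done=lambda step, error: error < 1e-3,
)
print(learned_w)
```

In an actual system the callbacks would wrap the Transformer 10 and the pairs of input sequence 2 and character part representation 82; the returned value would correspond to the parameter set 518.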

（ｄ６：推論処理） (d6: Inference process)
図８は、第１の実施例に従う音声認識システム１００Ａの推論処理の手順を示すフローチャートである。図８に示す主要なステップは、典型的には、情報処理装置５００のプロセッサ（ＣＰＵ５０２および／またはＧＰＵ５０４）が推論プログラム５２０を実行することで実現される。 FIG. 8 is a flowchart showing the procedure of the inference process of the speech recognition system 100A according to the first embodiment. The main steps shown in FIG. 8 are typically realized by the processor (CPU 502 and/or GPU 504) of the information processing device 500 executing the inference program 520.

図８を参照して、情報処理装置５００は、入力される音声信号から音声特徴を演算することで入力シーケンスを生成する（ステップＳ１５０）。情報処理装置５００は、生成した入力シーケンスをＴｒａｎｓｆｏｒｍｅｒ１０に入力して、推論結果の出力シーケンスとして、文字部品表現８２を演算する（ステップＳ１５２）。続いて、情報処理装置５００は、文字部品対応テーブル８４を参照して、文字部品表現８２からテキストを再構成する（ステップＳ１５４）。この再構成したテキストが出力シーケンスとして出力される。 Referring to FIG. 8, information processing device 500 generates an input sequence by calculating audio features from an input audio signal (step S150). The information processing device 500 inputs the generated input sequence to the Transformer 10 and calculates the character part representation 82 as an output sequence of the inference result (step S152). Subsequently, the information processing device 500 refers to the character-component correspondence table 84 and reconstructs the text from the character-component representation 82 (step S154). This reconstructed text is output as an output sequence.

そして、情報処理装置５００は、音声信号の入力が継続しているか否かを判断する（ステップＳ１５６）。音声信号の入力が継続していれば（ステップＳ１５６においてＹＥＳ）、ステップＳ１５０以下の処理が繰り返される。 The information processing device 500 then determines whether or not the audio signal continues to be input (step S156). If the input of the audio signal continues (YES in step S156), the processes from step S150 onwards are repeated.

一方、音声信号の入力が継続していなければ（ステップＳ１５６においてＮＯ）、推論処理は一旦終了する。 On the other hand, if the input of the audio signal is not continuing (NO in step S156), the inference process is temporarily terminated.

（ｄ７：性能評価結果） (d7: Performance evaluation results)
次に、第１の実施例に従う音声認識システム１００Ａの性能評価を行った結果の一例を示す。 Next, an example of the results of a performance evaluation of the speech recognition system 100A according to the first embodiment will be shown.

第１の実験例では、漢字を用いる言語として、台湾＜ＴＷ＞、香港＜ＨＫ＞、中国標準語＜ＭＡ＞の３言語のトレーニングデータセットを用いた評価を行った。評価対象の音声認識システムとしては、文字（character）レベルで処理する音声認識システム（関連技術）（表中「（ｃ）」で示される）と、第１の実施例に従う音声認識システム１００Ａ（文字部品表現を用いる）（表中「（ｒ）」で示される）とを比較した。 In the first experimental example, evaluation was performed using training data sets for three languages that use Chinese characters: Taiwan <TW>, Hong Kong <HK>, and Mandarin Chinese <MA>. Two speech recognition systems were compared: a speech recognition system that processes at the character level (related technology; indicated by "(c)" in the table) and the speech recognition system 100A according to the first embodiment, which uses the character part representation (indicated by "(r)" in the table).

また、各言語単体で学習を行った場合と、単一のモデルを３つの言語で学習した場合とを比較した。評価としては、各言語のデータセットの一部をテストデータとして用いた。 We also compared the case in which each language was trained alone and the case in which a single model was trained in three languages. For evaluation, part of the dataset for each language was used as test data.

認識性能の評価指標として、文字誤り率（ＣＥＲ％：Character Error Rate）を用いている。 Character error rate (CER%) is used as an evaluation index of recognition performance.
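
The description does not state how the character error rate is computed. A common definition divides the character-level edit distance (substitutions, deletions, and insertions) between the reference and the recognized text by the reference length. The following Python sketch assumes that definition; the function names `edit_distance` and `cer_percent` are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two character sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer_percent(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of five.
print(cer_percent("今天天氣好", "今天天气好"))  # → 20.0
```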

表１に示すように、文字レベルの音声認識システムを単一の言語で学習した場合、当該学習した言語については高い性能を示している（ＭＡ（ｃ），ＨＫ（ｃ），ＴＷ（ｃ））。これに対して、第１の実施例に従う音声認識システム１００Ａにおいては、単一の言語で学習した場合の性能はやや劣っている（ＭＡ（ｒ），ＨＫ（ｒ），ＴＷ（ｒ））。 As shown in Table 1, when a character-level speech recognition system is trained in a single language, it shows high performance in the learned language (MA (c), HK (c), TW (c) ). On the other hand, in the speech recognition system 100A according to the first embodiment, the performance is slightly inferior when trained in a single language (MA(r), HK(r), TW(r)).

しかしながら、単一のモデルを３つの言語で学習した場合には、第１の実施例に従う音声認識システム１００Ａ（ＭＡ＋ＨＫ＋ＴＷ（ｒ））は、関連技術に従う音声認識システム（ＭＡ＋ＨＫ＋ＴＷ（ｃ））に比較して、高い認識性能を示していることが分かる。 However, when a single model is trained on the three languages, it can be seen that the speech recognition system 100A according to the first embodiment (MA+HK+TW(r)) exhibits higher recognition performance than the speech recognition system according to the related technology (MA+HK+TW(c)).

次に、第２の実験例では、関連技術に従う音声認識システムにおいて、文字（character）単位および単語（word）単位で学習を行った場合と比較した。このとき、他の音声認識システムと比較可能となるように、第１の実施例に従う音声認識システム１００Ａを、台湾＜ＴＷ＞、香港＜ＨＫ＞、中国標準語＜ＭＡ＞の３言語のトレーニングデータセットに加えて、日本語のトレーニングデータセットを用いて学習した。日本語のトレーニングデータセットとしては、日本語話し言葉コーパス（Corpus of Spontaneous Japanese：ＣＳＪ）を用いた。なお、表２において、「Ｅ０１」，「Ｅ０２」，「Ｅ０３」は、ＣＳＪ－Ｅｖａｌ０１，ＣＳＪ－Ｅｖａｌ０２，ＣＳＪ－Ｅｖａｌ０３をそれぞれ意味する。 Next, in the second experimental example, a comparison was made with speech recognition systems according to the related technology that were trained in character units and in word units. To allow comparison with other speech recognition systems, the speech recognition system 100A according to the first embodiment was trained using a Japanese training data set in addition to the training data sets for the three languages Taiwan <TW>, Hong Kong <HK>, and Mandarin Chinese <MA>. The Corpus of Spontaneous Japanese (CSJ) was used as the Japanese training data set. In Table 2, "E01", "E02", and "E03" mean CSJ-Eval01, CSJ-Eval02, and CSJ-Eval03, respectively.

このとき、日本語については、漢字に加えて、かなに相当する文字部品を含む文字部品表現を用いた。 At this time, for Japanese, we used a character part representation that includes character parts corresponding to kana in addition to kanji.

また、表２中において、ＷＰＭ（Wordpiece Model）についても比較例として示す。 Furthermore, in Table 2, WPM (Wordpiece Model) is also shown as a comparative example.

表２に示すように、第１の実施例に従う音声認識システム１００Ａの認識性能は、最新のモデルの認識性能と同等あるいはそれ以上となっている。 As shown in Table 2, the recognition performance of the speech recognition system 100A according to the first example is equal to or higher than the recognition performance of the latest model.

次に、第３の実験例では、関連技術に従う音声認識システムのパラメータサイズについて評価を行った。第１の実施例に従う音声認識システム１００Ａ（表中「（ｒ）」で示される）および関連技術に従う音声認識システム（表中「（ｃ）」で示される）を、中国標準語＜ＭＡ＞および日本語＜ＪＰ＞のトレーニングデータセットを用いて学習した。 Next, in the third experimental example, the parameter size was evaluated against the speech recognition system according to the related technology. The speech recognition system 100A according to the first embodiment (indicated by "(r)" in the table) and the speech recognition system according to the related technology (indicated by "(c)" in the table) were trained using the Mandarin Chinese <MA> and Japanese <JP> training data sets.

第１の実施例に従う音声認識システム１００Ａと関連技術に従う音声認識システムとの間でほぼ同一の認識性能を発揮するまで学習した状態を比較すると、以下の表３のようになる。 A comparison of the state in which the speech recognition system 100A according to the first embodiment and the speech recognition system according to the related technology have been trained to achieve almost the same recognition performance is as shown in Table 3 below.

表３に示すように、文字誤り率（ＣＥＲ％）がほぼ同じ状態のモデル同士を比較すると、第１の実施例に従う音声認識システム１００Ａのパラメータサイズは、関連技術に従う音声認識システムの１／２以下であり、パラメータサイズが大幅に抑制されていることが分かる。 As shown in Table 3, when models with almost the same character error rate (CER%) are compared, the parameter size of the speech recognition system 100A according to the first embodiment is 1/2 or less of that of the speech recognition system according to the related technology, which shows that the parameter size is significantly reduced.

［Ｅ．第２の実施例（表音文字）］ [E. Second embodiment (phonetic characters)]
第２の実施例として、類似した発音体系を有する複数の言語に対して単一のモデルを用いた音声認識システムについて説明する。 As a second embodiment, a speech recognition system that uses a single model for multiple languages having similar pronunciation systems will be described.

（ｅ１：概要） (e1: Overview)
図９は、第２の実施例に従う音声認識システム１００Ｂの概要を示す模式図である。図９を参照して、音声認識システム１００Ｂは、音声特徴を示す入力シーケンス２の入力を受けて、対応するテキストを出力シーケンス７０として出力する。すなわち、音声認識システム１００Ｂは、複数の言語のうち任意の言語で発話された音声信号の入力を受けて、対応するテキストを出力する推論器に相当する。 FIG. 9 is a schematic diagram showing an overview of the speech recognition system 100B according to the second embodiment. Referring to FIG. 9, the speech recognition system 100B receives an input sequence 2 indicating speech features and outputs the corresponding text as an output sequence 70. That is, the speech recognition system 100B corresponds to an inference device that receives a speech signal uttered in any of a plurality of languages and outputs the corresponding text.

出力シーケンス７０の先頭には、いずれの言語であるかを示す言語ラベル７２（＜ＭＹ＞，＜ＫＨ＞，＜ＳＩ＞，＜ＮＥ＞など）が付加されている。このような言語ラベル７２が付加されることによって、いずれの言語であるかを一意に特定できる。 At the beginning of the output sequence 70, a language label 72 (<MY>, <KH>, <SI>, <NE>, etc.) indicating which language it is in is added. By adding such a language label 72, the language can be uniquely identified.

音声認識システム１００Ｂは、Ｔｒａｎｓｆｏｒｍｅｒ１０と、文字変換部９０とを含む。 The speech recognition system 100B includes a Transformer 10 and a character conversion section 90.

Ｔｒａｎｓｆｏｒｍｅｒ１０は、音声信号の音声特徴を示す入力シーケンス２を受けて、対応するテキストに含まれる文字の特徴を示す、文字（character）レベルとは異なるレベルの表現を出力する学習済モデルに相当する。より具体的には、Ｔｒａｎｓｆｏｒｍｅｒ１０は、文字レベルではなく、異なるレベルの表現（以下、「ユニバーサル音声表現９２」あるいは「Universal Articulatory representation」とも称す。）を用いる。ユニバーサル音声表現９２は、対応するテキストに含まれる各文字の発音を特定する情報を含む（詳細については後述する）。 The Transformer 10 corresponds to a trained model that receives the input sequence 2 representing the audio characteristics of the audio signal and outputs an expression at a level different from the character level, which represents the characteristics of the characters included in the corresponding text. More specifically, the Transformer 10 uses a different level of expression (hereinafter also referred to as "universal audio representation 92" or "Universal Articulatory representation") instead of the character level. Universal phonetic representation 92 includes information that specifies the pronunciation of each character included in the corresponding text (details will be described later).

文字変換部９０は、予め定められた文字と当該文字の特徴との対応関係を参照して、Ｔｒａｎｓｆｏｒｍｅｒ１０（学習済モデル）から出力される表現から対応するテキストを再構成する再構成部に相当する。より具体的には、文字変換部９０は、Ｔｒａｎｓｆｏｒｍｅｒ１０から出力されるユニバーサル音声表現９２の入力を受けて、出力すべき文字に変換して、出力シーケンス７０として出力する。 The character conversion unit 90 corresponds to a reconstruction unit that reconstructs the corresponding text from the representation output by the Transformer 10 (trained model), with reference to a predetermined correspondence between characters and the features of those characters. More specifically, the character conversion unit 90 receives the universal phonetic representation 92 output from the Transformer 10, converts it into the characters to be output, and outputs them as the output sequence 70.

第２の実施例においては、文字が示す音声を示す表現を用いてモデルの学習を行う。 In the second embodiment, the model is trained using a representation that indicates the sound represented by each character.

（ｅ２：ユニバーサル音声表現９２） (e2: Universal phonetic representation 92)
ユニバーサル音声表現９２は、テキストの発音を規定する表現である。テキストの発音は、国際音声記号（ＩＰＡ：International Phonetic Alphabet）を用いて規定されることが一般的である。ここで、異なる言語間では単音セット（phone-sets）が異なるが、ＩＰＡを用いた場合にはこのような異なる単音セットを適切に規定することが難しい。 The universal phonetic representation 92 is a representation that defines the pronunciation of text. The pronunciation of text is generally defined using the International Phonetic Alphabet (IPA). However, phone sets differ between languages, and it is difficult to define such differing phone sets appropriately using the IPA.

そこで、第２の実施例に従う音声認識システム１００Ｂにおいては、さまざまな言語の音韻構造を表現するユニバーサル特徴に基づく、ユニバーサル音声表現９２を用いる。ユニバーサル特徴としては、（１）円／非円唇、（２）舌（低、中央、高）、（３）舌（前、中、後）、（４）有無声音（声帯震動）、（５）子音（気流）、（６）唇、舌頂、舌背、咽喉音の６種類が想定される。さらに、ユニバーサル特徴として、声調などのその他の要因を加えてもよい。 Therefore, the speech recognition system 100B according to the second embodiment uses the universal phonetic representation 92, which is based on universal features that express the phonological structure of various languages. Six types of universal features are assumed: (1) rounded/unrounded lips, (2) tongue height (low, mid, high), (3) tongue position (front, central, back), (4) voiced/voiceless (vocal fold vibration), (5) consonant (airflow), and (6) lips, tongue tip, tongue dorsum, and guttural sounds. Other factors such as tone may also be added as universal features.

より具体的には、以下の表４のユニバーサル音声テーブルに示すように、３つのカテゴリごとに複数の属性（Attributes）が規定されている。３つのカテゴリは、子音の位置（consonants(position)）、子音の態様（consonants(manner)）、母音（vowel）を含む。 More specifically, as shown in the universal audio table in Table 4 below, a plurality of attributes are defined for each of the three categories. The three categories include consonants(position), consonants(manner), and vowel.

The universal phonetic representation 92 is generated by assigning a combination of one or more attributes to each character.

The universal phonetic representation 92 is typically output as a sequence with the following data structure.

<language label> [attribute], [attribute], ..., <delimiter>, [attribute], [attribute], ...
The <language label> included in the universal phonetic representation 92 contains information identifying the language.

Each [attribute] included in the universal phonetic representation 92 contains information identifying a universal feature defined according to the universal phonetic table of Table 4. In this way, the universal phonetic representation 92 includes information that specifies the pronunciation of the corresponding character based on universal features that express phonological structure.

The <delimiter> included in the universal phonetic representation 92 marks a boundary between output characters. Each character to be output is reconstructed from the [attribute] entries that lie between one <delimiter> and the next. A simple blank (no output) may be used as the <delimiter>.

Note that the data structure of the universal phonetic representation 92 described above is only an example, and any data structure may be adopted as long as the characters can be reconstructed from it.
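As an illustration of the sequence layout described above, the following is a minimal sketch of how such a sequence might be split into a language label and per-character attribute groups. The token spellings (`<KH>`, `<sep>`, `[bilabial]` and so on), the class name, and the function names are assumptions made for this example; the attribute names are placeholders and are not taken from Table 4.

```python
from dataclasses import dataclass

DELIMITER = "<sep>"  # hypothetical spelling of the <delimiter> token


@dataclass
class UniversalPhoneticSequence:
    language: str                      # language label, e.g. "<KH>"
    characters: list[tuple[str, ...]]  # one attribute group per output character


def parse_sequence(tokens: list[str]) -> UniversalPhoneticSequence:
    """Split a flat token list into a language label and attribute groups."""
    if (not tokens or tokens[0] == DELIMITER
            or not (tokens[0].startswith("<") and tokens[0].endswith(">"))):
        raise ValueError("sequence must start with a language label")
    groups: list[tuple[str, ...]] = []
    current: list[str] = []
    for token in tokens[1:]:
        if token == DELIMITER:
            # A delimiter closes the attribute group of the current character.
            if current:
                groups.append(tuple(current))
            current = []
        else:
            current.append(token)
    if current:
        groups.append(tuple(current))
    return UniversalPhoneticSequence(tokens[0], groups)


def flatten_sequence(seq: UniversalPhoneticSequence) -> list[str]:
    """Inverse of parse_sequence: rebuild the flat token list."""
    tokens = [seq.language]
    for index, group in enumerate(seq.characters):
        if index > 0:
            tokens.append(DELIMITER)
        tokens.extend(group)
    return tokens
```

Keeping the attribute groups separate from the language label makes it straightforward to select a per-language table before converting the groups back into characters.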

As described above, the speech recognition system 100B according to the second embodiment performs the learning process and the inference process not at the character level but at the level of the universal features that define the pronunciation of each character.

(e3: Processing details)
Next, details of the processing in the speech recognition system 100B according to the second embodiment will be described.

FIG. 10 is a schematic diagram for explaining the learning process and the inference process in the speech recognition system 100B according to the second embodiment. Referring to FIG. 10, the learning process uses a training data set 530 that includes multilingual speech data 531 and multilingual text data 532. The multilingual text data 532 may include a language label indicating the language of the text.

Speech features extracted from the multilingual speech data 531 are input to the Transformer 10 as the input sequence.

In addition, the universal feature conversion 91 is applied to the multilingual text data 532 to output, for each character included in the multilingual text data 532, a combination of one or more attributes indicating its pronunciation. The language label included in the multilingual text data 532 is also extracted.

The universal phonetic representation 92, which includes the language label and the combinations of one or more attributes, is input to the Transformer 10 as the corresponding label (ground-truth data).

That is, the parameters of the Transformer 10 are optimized based on pairs of speech features and universal phonetic representations 92 generated from pairs of multilingual speech data 531 and multilingual text data 532.

In the inference process, on the other hand, speech features extracted from the multilingual speech data 533 to be recognized are input to the Transformer 10 as the input sequence. The Transformer 10 outputs a universal phonetic representation 92 as the inference result. The character conversion unit 90 converts the universal phonetic representation 92 into text data 534 and outputs it as the inference result.

FIG. 11 is a diagram for explaining the processing related to the universal phonetic representation in the speech recognition system 100B according to the second embodiment. FIG. 11 shows the processing in association with the learning process and the inference process shown in FIG. 10.

Referring to FIG. 11, in the learning process, the text included in the multilingual text data 532 is divided into words 96 and then further divided into characters 97. Finally, a combination 98 of one or more attributes is assigned to each character 97, with reference to the speech feature correspondence table 94. In this way, the universal phonetic representation 92 includes information that defines the pronunciation of each character 97 obtained by further decomposing the words 96 included in the corresponding text.

The speech feature correspondence table 94 defines, for each language, the correspondence between information specifying pronunciation and the corresponding characters. More specifically, the speech feature correspondence table 94 defines the correspondence between each character and one or more attributes.

FIG. 12 is a diagram showing an example of the speech feature correspondence table 94 used in the character conversion unit 90 of the speech recognition system 100B according to the second embodiment. Referring to FIG. 12, the speech feature correspondence table 94 defines characters and, for each character, the corresponding combination of one or more attributes of the universal features. A separate speech feature correspondence table 94 may be prepared for each language.

Referring again to FIG. 11, in the inference process, the attribute combinations 98 included in the inference result for the input sequence of speech features are sequentially converted into the corresponding characters 97 by referring to the speech feature correspondence table 94. Words 96 are then reconstructed from the characters 97 obtained by the conversion and output as the inference result.
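The table-based conversion in the inference direction can be sketched as follows. This is an illustrative sketch only: the table contents are invented placeholders rather than the entries of FIG. 12, and the `word_break` marker used to separate words is an assumption, since the text does not state how word boundaries are carried in the inference result.

```python
# Hypothetical excerpt of a speech feature correspondence table:
# character -> combination of universal-feature attributes.
TABLE = {
    "p": ("bilabial", "plosive", "voiceless"),
    "b": ("bilabial", "plosive", "voiced"),
    "a": ("low", "central", "unrounded"),
    "i": ("high", "front", "unrounded"),
}
WORD_BREAK = ("word_break",)  # assumed marker for a word boundary


def invert_table(table):
    """Build the attribute-combination -> character lookup used at inference."""
    inverse = {}
    for char, attrs in table.items():
        key = frozenset(attrs)  # attribute order is treated as irrelevant
        if key in inverse:
            raise ValueError(f"ambiguous attributes for {char!r} and {inverse[key]!r}")
        inverse[key] = char
    return inverse


def decode(groups, table, unknown="?"):
    """Convert attribute combinations to characters, then rebuild words."""
    inverse = invert_table(table)
    words, current = [], []
    for attrs in groups:
        if tuple(attrs) == WORD_BREAK:
            if current:
                words.append("".join(current))
            current = []
            continue
        # Combinations missing from the table map to a placeholder character.
        current.append(inverse.get(frozenset(attrs), unknown))
    if current:
        words.append("".join(current))
    return " ".join(words)
```

The inversion step also checks that no two characters share the same attribute combination, which is a precondition for unambiguous reconstruction with a table of this kind.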

The processing procedure described above makes it possible to build and operate the speech recognition system.
(e4: Learning process)
Next, an example of the learning process of the speech recognition system 100B according to the second embodiment will be described.

FIG. 13 is a flowchart showing the procedure of the learning process of the speech recognition system 100B according to the second embodiment. The main steps shown in FIG. 13 are typically realized by a processor (CPU 502 and/or GPU 504) of the information processing device 500 executing the learning program 514.

Referring to FIG. 13, the information processing device 500 receives a training data set consisting of pairs of an input sequence 2 indicating speech features and the corresponding text (step S200). The information processing device 500 divides the text of the received training data set into words (step S202) and divides each word into characters (step S204). The information processing device 500 then determines, for each character, a combination of one or more attributes of the universal features (step S206). The universal phonetic representation 92 serving as the label is generated from the determined combinations of one or more attributes. At this time, the speech feature correspondence table 94 corresponding to the language of the target text may be referred to. In this way, the information processing device 500 generates a representation, at a level different from the character level, that indicates the features of the characters included in the text.

The information processing device 500 generates a training data set consisting of pairs of the input sequence 2 indicating speech features and the corresponding one or more attributes (step S208).
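Steps S202 to S208 can be sketched roughly as below. The function names, the token spellings, and the use of whitespace to split words are assumptions made for illustration; many languages need a dedicated word segmenter, and this sketch does not encode word boundaries in the label.

```python
def text_to_label(text, language, table, delimiter="<sep>"):
    """Turn a transcript into a universal-phonetic label sequence (S202 to S206)."""
    label = [f"<{language}>"]
    first = True
    for word in text.split():       # S202: split the text into words (assumed whitespace)
        for char in word:           # S204: split each word into characters
            attrs = table[char]     # S206: look up attributes; KeyError if unknown
            if not first:
                label.append(delimiter)
            label.extend(f"[{attr}]" for attr in attrs)
            first = False
    return label


def build_training_set(pairs, language, table):
    """Pair each feature sequence with its label sequence (S208)."""
    return [(features, text_to_label(text, language, table))
            for features, text in pairs]
```

Raising `KeyError` for a character missing from the table is a deliberate choice in this sketch, so that gaps in the table surface during data preparation rather than silently producing incomplete labels.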

Subsequently, the information processing device 500 initializes the parameters of the Transformer 10 (step S210). Parameter optimization is then performed; that is, the parameters of the Transformer 10 are optimized using the training data set.

More specifically, the information processing device 500 inputs the input sequence 2 included in the training data set to the Transformer 10 and computes the output sequence (universal phonetic representation 92) (step S212). The information processing device 500 then compares the output sequence (inference result) with the corresponding universal phonetic representation 92 (ground-truth data) of the training data set to compute error information (step S214), and optimizes the parameters of the Transformer 10 based on the computed error information (step S216).

The information processing device 500 determines whether a predetermined termination condition for the learning process is satisfied (step S218). If the termination condition is not satisfied (NO in step S218), the information processing device 500 selects training data included in the training data set and executes the processing from step S212 onward again.

If, on the other hand, the predetermined termination condition is satisfied (YES in step S218), the information processing device 500 determines the Transformer 10 defined by the parameter values at that time to be the trained model (step S220). These parameter values are output as the parameter set 518 that defines the trained model. The process then ends.

(e5: Inference process)
FIG. 14 is a flowchart showing the procedure of the inference process of the speech recognition system 100B according to the second embodiment. The main steps shown in FIG. 14 are typically realized by a processor (CPU 502 and/or GPU 504) of the information processing device 500 executing the inference program 520.

Referring to FIG. 14, the information processing device 500 generates an input sequence by computing speech features from the input audio signal (step S250). The information processing device 500 inputs the generated input sequence to the Transformer 10 and computes the universal phonetic representation 92 as the output sequence of the inference result (step S252). Subsequently, the information processing device 500 refers to the speech feature correspondence table 94 to convert the universal phonetic representation 92 into characters (step S254), and reconstructs words from the converted characters (step S256). Finally, it generates text consisting of the reconstructed words (step S258). The generated text is output as the output sequence.

The information processing device 500 then determines whether input of the audio signal is continuing (step S260). If input of the audio signal is continuing (YES in step S260), the processing from step S250 onward is repeated.

If, on the other hand, input of the audio signal is not continuing (NO in step S260), the inference process ends for the time being.

(e6: Performance evaluation results)
Next, an example of the results of evaluating the performance of the speech recognition system 100B according to the second embodiment is shown.

In the second experimental example, the evaluation used training data sets for four languages used in Asia, namely Malaysian <MY>, Khmer <KH>, Sinhalese <SI>, and Nepali <NE>, as languages that use kanji. The following speech recognition systems were evaluated: a speech recognition system that processes at the word level (related art; indicated by "(w)" in the tables), a speech recognition system that processes at the character level (related art; indicated by "(c)" in the tables), a speech recognition system that processes at the level of phonetic symbols according to the International Phonetic Alphabet (IPA) (related art; indicated by "(p)" in the tables), and the speech recognition system 100B according to the second embodiment, which uses the universal phonetic representation (indicated by "(a)" in the tables).

Table 5 shows how the parameter size changes when training is performed on each language alone and on all four languages.

As shown in Table 5, the parameter size of the speech recognition system 100B according to the second embodiment is the smallest in every evaluation case.

Table 6 shows how the recognition performance changes when training is performed on each language alone and on all four languages. The character error rate (CER %) is used as the evaluation index for recognition performance.
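The text does not spell out how the CER is computed. A commonly used definition, assumed here, divides the character-level edit distance (substitutions, deletions, and insertions) between the reference and the recognized text by the number of reference characters. The function names below are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Because insertions are counted, the value under this definition can exceed 100% when the recognized text is much longer than the reference.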

As shown in Table 6, the recognition performance of the speech recognition system 100B according to the second embodiment is equal to or better than that of the speech recognition system (related art) that processes at the level of phonetic symbols according to the International Phonetic Alphabet (IPA). Considering that the parameter size can be reduced significantly, as shown in Table 5, it can be seen that using the universal phonetic representation makes it possible to realize a multilingual end-to-end speech recognition system with a model of smaller parameter size.

[F. Application examples and modifications]
As an application of the speech recognition system according to the present embodiment, an automatic speech translation system or the like may be realized. This can be achieved by further adding a speech synthesis unit that outputs speech corresponding to the text output from the speech recognition system according to the present embodiment.

The first embodiment and the second embodiment described above can also be realized with a single model. In this case, the number of dimensions of the output layer of the Transformer 10 may be set so that both the character component representation and the universal phonetic representation can be output. Furthermore, in addition to the first embodiment and/or the second embodiment, languages that are trained at the character level or the word level can also be added.

[G. Summary]
According to the learning process of the present embodiment, using a trained model that works with a representation at a level different from the character level makes it possible to realize an inference device that improves recognition performance while suppressing an increase in parameter size. This provides a technique for realizing a multilingual end-to-end speech recognition system using a model with a smaller parameter size.

The embodiments disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims rather than by the above description of the embodiments, and is intended to include all modifications within the meaning and scope equivalent to the claims.

2 input sequence; 4 input embedding layer; 6, 16 position embedding layer; 8, 18 adder; 10 Transformer; 14 output embedding layer; 20 encoder block; 22, 46 MHA layer; 24, 28, 44, 48, 52 addition/regularization layer; 26, 50 feedforward layer; 40 decoder block; 42 MMHA layer; 60 softmax layer; 64 text; 70 output sequence; 72 language label; 80 character synthesis unit; 82 character component representation; 84 character component correspondence table; 90 character conversion unit; 91 universal feature conversion; 92 universal phonetic representation; 94 speech feature correspondence table; 96 word; 97, 844 character; 98 attribute combination; 100A, 100B speech recognition system; 200 encoder; 400 decoder; 500 information processing device; 502 CPU; 504 GPU; 506 main memory; 508 display; 510 network interface; 512 secondary storage device; 514 learning program; 516 model definition data; 518 parameter set; 520 inference program; 522 input device; 524 optical drive; 526 optical disc; 528 internal bus; 530 training data set; 531, 533 multilingual speech data; 532 multilingual text data; 534 text data; 802 structure; 804 character component; 806 simple decomposition; 808 mixed structure; 842 combination definition.

Claims

1. An inference device that receives an audio signal uttered in any of a plurality of languages and outputs a corresponding text, the inference device comprising:
a trained model that receives an input sequence indicating speech features of the audio signal and outputs a representation, at a level different from the character level, that indicates features of the characters included in the corresponding text; and
a reconstruction unit that reconstructs the corresponding text from the representation output from the trained model by referring to a predetermined correspondence between characters and features of the characters,
wherein the representation output from the trained model includes information specifying the structure of each character included in the corresponding text.

2. An inference device that receives an audio signal uttered in any of a plurality of languages and outputs a corresponding text, the inference device comprising:
a trained model that receives an input sequence indicating speech features of the audio signal and outputs a representation, at a level different from the character level, that indicates features of the characters included in the corresponding text; and
a reconstruction unit that reconstructs the corresponding text from the representation output from the trained model by referring to a predetermined correspondence between characters and features of the characters,
wherein the representation output from the trained model includes information identifying the language of the corresponding text.

3. The inference device according to claim 2, wherein the representation output from the trained model includes information specifying the pronunciation of each character included in the corresponding text.

4. The inference device according to claim 3, wherein the information specifying the pronunciation of the character includes information that specifies the pronunciation of the corresponding character based on universal features expressing phonological structure.

5. An inference program for causing a computer to implement the inference device according to any one of claims 1 to 4.

6. A learning method for training an inference device that receives an audio signal uttered in any of a plurality of languages and outputs a corresponding text, the method comprising:
providing an audio signal and a corresponding text;
generating a representation, at a level different from the character level, that indicates features of the characters included in the text; and
optimizing parameters defining the inference device based on an error between the corresponding representation and an inference result obtained by inputting an input sequence indicating speech features of the audio signal to the inference device,
wherein the representation at a level different from the character level includes information specifying the structure of each character included in the corresponding text or information identifying the language of the corresponding text.