JP2019020597A

JP2019020597A - End-to-end Japanese voice recognition model learning device and program

Info

Publication number: JP2019020597A
Application number: JP2017139177A
Authority: JP
Inventors: Hitoshi Ito (伊藤 均); Shoe Sato (佐藤 庄衛); Akio Kobayashi (小林 彰夫)
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp; NHK Engineering System Inc
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2017-07-18
Filing date: 2017-07-18
Publication date: 2019-02-07
Anticipated expiration: 2037-07-18
Also published as: JP6941494B2

Abstract

To improve the Japanese speech recognition rate. SOLUTION: An end-to-end Japanese speech recognition model learning device 2 comprises: label creating means 20 for creating, from a text 1b in learning data 1, a class label that assigns a plurality of characters whose appearance frequency in the text is lower than a predetermined appearance-frequency reference to a class representing those characters, and single labels each attached to a single character whose appearance frequency is higher than the reference; text creating means 5 for converting the characters included in the text 1b into the class label on the basis of a character/label conversion table 4 that assigns the characters to the class label, and creating a converted text 1c, which is the text after conversion; and acoustic model learning means 6 for learning from the speech 1a, the converted text 1c, the class label, and the plurality of single labels, converting the speech 1a into a label string of the class label and the single labels through this learning, and learning the acoustic model on the basis of the converted label string. SELECTED DRAWING: Figure 1

Description

The present invention relates to an end-to-end Japanese speech recognition model learning device and a program therefor.

Conventionally, training a speech recognition model has generally required the following processes (1) to (3).
(1) Converting the input speech into feature vectors (acoustic features); (2) training an acoustic model that converts the feature vectors into phonemes manually assigned to each word; and (3) training a language model that converts phoneme sequences into words.

Of these, process (2) was conventionally carried out by first training an acoustic model with an HMM/GMM (Hidden Markov Model / Gaussian Mixture Model), which is robust to changes in the sequence length of the input speech. More recently, high-accuracy acoustic models have been produced by training this acoustic model with a DNN (Deep Neural Network) (see FIG. 12(a)).

As a technique for reducing the complexity of such model training, an acoustic model training method using CTC (Connectionist Temporal Classification) (see Non-Patent Document 1) and a DNN is known. This method directly learns the correspondence between speech and labels such as phonemes or characters. It is robust to changes in the sequence length of the input speech, and by replacing HMM/GMM-based training with this CTC-and-DNN-based method, the acoustic model can be trained in a single step (end-to-end). In particular, various acoustic model training methods using CTC and an RNN (Recurrent Neural Network) are known, and methods have been proposed that, by using large amounts of data (so-called big data), take the feature vectors of the input speech as input and output characters (character labels) directly from them (see Non-Patent Documents 2 and 3). As shown in FIG. 12(b), end-to-end acoustic model training uses no intermediate representation such as phonemes.

Non-Patent Document 1: Graves, A., et al., "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pp. 369-376 (2006)
Non-Patent Document 2: Miao, Y., et al., "EESEN: End-to-End Speech Recognition Using Deep RNN Models and WFST-Based Decoding," 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 167-174 (2015)
Non-Patent Document 3: Hannun, A., et al., "Deep Speech: Scaling up end-to-end speech recognition," Cornell University Library, arXiv:1412.5567, 19 Dec 2014

However, most prior work that uses characters as the DNN output targets English speech recognition. When Japanese is handled, the large number of Japanese character types causes the following two problems.
The first is that the number of output labels is large, so the number of parameters is enormous compared with English. When an end-to-end neural network (NN) that outputs characters is built for English, the number of output labels is only about 100 even when digits and symbols are added to the alphabet, whereas Japanese has more than 3,000 character types, including kanji, hiragana, and katakana. The large number of character types increases the number of connection parameters between the layers of the network, yet the number of distinct pronunciations is small relative to the number of character types. The representations inside the network therefore become redundant, the model lacks robustness, and training becomes difficult.

The other problem is data sparsity. In Japanese, the larger number of character types means that the average number of training samples per character type is smaller, and some characters appear extremely rarely. In an acoustic model whose output labels include such low-frequency characters, the acoustic features of those characters are hardly learned, and the recognition result tends, for example, to contain unnecessary low-frequency characters as insertion errors. This has made it difficult to improve the speech recognition rate.

The present invention has been made in view of the above problems, and its object is to provide an end-to-end Japanese speech recognition model learning device and program capable of improving the Japanese speech recognition rate.

To solve the above problem, an end-to-end Japanese speech recognition model learning device according to the present invention learns, as a speech recognition model, an end-to-end acoustic model that outputs character or word labels from speech, or from acoustic features of the speech, in learning data that includes text and the speech or its acoustic features, and comprises:
label creating means for creating, from the text in the learning data, a class label that assigns a plurality of characters or words whose appearance frequency in the text is lower than a predetermined appearance-frequency criterion to a class representing those characters or words, and single labels each attached to a single character or single word whose appearance frequency is higher than the criterion;
text creating means for converting, on the basis of a conversion table that assigns the plurality of characters or words to the class label, the plurality of characters or words included in the text in the learning data into the class label, thereby creating a converted text, which is the text after conversion; and
acoustic model learning means for learning from the speech or its acoustic features in the learning data, the converted text, the class label, and the plurality of single labels, converting the speech or acoustic features into a label string of the class label and the single labels through this learning, and learning the acoustic model on the basis of the converted label string.

The present invention provides the following advantageous effects.
With the end-to-end Japanese speech recognition model learning device according to the present invention, a plurality of low-frequency characters or words can be learned together as a single class label.
Consequently, the acoustic features of those low-frequency characters or words can be learned from more samples than with conventional methods, which improves the speech recognition rate.
Such learning also mitigates the redundancy in network representations caused by a large number of output labels, as in Japanese, which further improves the speech recognition rate.

FIG. 1 is a block diagram schematically showing the end-to-end Japanese speech recognition model learning device according to the first embodiment of the present invention.
FIG. 2 is a block diagram schematically showing the configuration of the label creating means of the end-to-end Japanese speech recognition model learning device according to the first embodiment of the present invention.
FIG. 3 shows schematic diagrams of the acoustic model: (a) shows labels being output from input speech, and (b) shows a class label also being output from input speech.
FIG. 4 shows (a) an example of the character/label conversion table, (b) a conceptual diagram of the acoustic model converting characters into class labels, and (c) a conceptual diagram of the language model restoring class labels in a text to characters.
FIG. 5 is a flowchart showing the flow of class label creation by the label creating means according to the first embodiment.
FIG. 6 is a block diagram schematically showing the configuration of the label creating means of the end-to-end Japanese speech recognition model learning device according to the second embodiment of the present invention.
FIG. 7 shows (a) a schematic diagram of an acoustic model that outputs a plurality of class labels from input speech, (b) a conceptual diagram of the acoustic model converting characters into class labels, and (c) a conceptual diagram of the language model restoring class labels in a text to characters.
FIG. 8 is a schematic diagram of a morpheme list and a reading list.
FIG. 9 shows (a) a conceptual diagram of an example of processing by the speech recognition means, and (b) a schematic diagram of a converter that outputs words and that constitutes the language model learning means used for the speech recognition in (a).
FIG. 10 is a flowchart showing the flow of class label creation by the label creating means according to the second embodiment.
FIG. 11 is a flowchart showing the flow of processing when a kanji is selected in the processing of FIG. 10.
FIG. 12 shows (a) a schematic diagram of the flow of conventional speech recognition using a pronunciation dictionary, and (b) a schematic diagram of the flow of conventional end-to-end speech recognition in English.

Hereinafter, a Japanese speech recognition model learning device according to embodiments of the present invention will be described with reference to the drawings.
(First embodiment)
[Configuration of the end-to-end Japanese speech recognition model learning device]
The end-to-end Japanese speech recognition model learning device 2 learns, as a speech recognition model, an end-to-end acoustic model that outputs labels 3 of characters or words (hereinafter simply "characters") from the speech 1a in learning data 1, which includes a text 1b and the speech 1a or acoustic features of the speech (hereinafter simply "speech").

In the present embodiment, the learning data 1 for creating a Japanese acoustic model is described as pairs of speech 1a and text 1b. The speech 1a and the text 1b represent a large amount of Japanese speech data and a large amount of Japanese text, respectively. For example, the program audio of broadcast programs used for advance training can be used as the speech 1a, and a strict transcription of that program audio, or something equivalent to it, can be used as the text 1b. In FIG. 1, the converted text 1c, the label 3, and the character/label conversion table 4 each denote data.

The end-to-end Japanese speech recognition model learning device 2 comprises label creating means 20, text creating means 5, acoustic model learning means 6, language model learning means 7, acoustic model storage means 8, and language model storage means 9. Here, the device 2 also includes speech recognition means 10.
The device 2 creates, from the text 1b in the learning data 1, a class label that assigns a plurality of characters whose appearance frequency in the text 1b is lower than a predetermined appearance-frequency criterion to a class representing those characters, and single labels for single characters whose appearance frequency is higher than the criterion. On the basis of the character/label conversion table 4, which assigns the plurality of characters to the class label, it converts those characters in the text 1b into the class label and creates the converted text 1c, which is the text 1b after conversion. It then learns from the speech 1a in the learning data 1, the converted text 1c, the class label, and the plurality of single labels, converts the speech 1a into a label string of the class label and the single labels through this learning, and learns the acoustic model on the basis of the converted label string.

The label creating means 20 creates, from the text 1b in the learning data 1, a class label that assigns a plurality of characters whose appearance frequency in the text 1b is lower than a predetermined appearance-frequency criterion to a class representing those characters, and single labels for single characters whose appearance frequency is higher than the criterion. A single label (hereinafter, a "character label") and the class label are collectively denoted label 3. The label 3 is the set of labels used as the output of the acoustic model. A character label handles a single character, whereas a class label handles a plurality of characters as a group.

The label creating means 20 creates and outputs, from the text 1b, the labels 3 suited to model training and the character/label conversion table 4, a table that specifies which characters correspond to which class label. Here, the label creating means 20 first creates the character labels from the text 1b in the learning data 1 and adds the class label afterwards. As shown schematically in FIG. 3(a), the character labels include hiragana, katakana, kanji, alphabetic characters, and so on. In the schematic diagram of FIG. 3(b) the class label is shown as a star symbol, but it is not limited to this.

In the present embodiment, as shown in FIG. 2, the label creating means 20 comprises morpheme dividing means 21, character list creating means 22, label determining means 23, and storage means 24.

The morpheme dividing means 21 divides the text 1b of the learning data 1 into morphemes and assigns a reading to each of them. General-purpose software for Japanese morphological analysis (for example, MeCab or ChaSen) can be used as the morpheme dividing means 21. In the following, the list obtained by dividing the text 1b of the learning data 1 into morpheme units is called the morpheme unit list W (also referred to simply as the morpheme list W). The morpheme unit list W is stored in the storage means 24.

The character list creating means 22 counts the appearance frequency of each character in the text 1b of the learning data 1 and creates a list of characters whose appearance frequency is higher than a predetermined criterion and a list of the remaining characters.
Here, the list consisting of the N most frequent character types is called the character list F, and the list consisting of the remaining low-frequency characters is called the character list R. The character lists F and R are stored in the storage means 24.

The value of N can be set as desired. For example, it can be made smaller than half the total number of character types in the text 1b of the learning data 1. In other words, the number of character types whose appearance frequency in the text 1b is lower than the predetermined criterion can be made at least 1/2 of the total number of character types in the text 1b.
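The split into character lists F and R can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_character_lists`, the sample text, and the tie-breaking rule (code-point order among equally frequent characters) are assumptions introduced here.

```python
from collections import Counter

def build_character_lists(text, n):
    """Return (list F, list R): the n most frequent character types and the rest."""
    # Count every non-whitespace character in the training text.
    counts = Counter(ch for ch in text if not ch.isspace())
    # Rank by descending frequency; ties are broken by code point (an assumption)
    # so that the result is deterministic.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return ranked[:n], ranked[n:]

# Hypothetical miniature stand-in for text 1b.
text_1b = "今日の雨は昨日の雨より強い雨"
list_f, list_r = build_character_lists(text_1b, n=3)
print(list_f)
print(list_r)
```

With N = 3 and ten distinct character types in the sample, list R holds seven types, so more than half of all character types fall on the low-frequency side, in line with the choice of N described above.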

The label determining means 23 controls the label creating means 20 as a whole. It designates each high-frequency character as a character label and assigns the low-frequency characters selected from the character list R to the class label. It stores the class label and the character labels in the storage means 24 as the label 3.

The label determining means 23 also creates the character/label conversion table 4 and outputs it to the text creating means 5. The character/label conversion table 4 is a conversion table for assigning a plurality of characters to the class label. An example is shown in FIG. 4(a). This example corresponds to the schematic diagram of FIG. 3(b): each of the characters "璃, 鷲, 劉, …" is converted into the star symbol "☆".

The storage means 24 stores the data and other items created by the processing of the label creating means 20, and is a general storage medium such as a hard disk. The storage means 24 stores the morpheme list W, the character list F, the character list R, the data of the label 3, and so on.

On the basis of the character/label conversion table 4, the text creating means 5 converts the plurality of characters included in the text 1b in the learning data 1 into the class label and creates the converted text 1c, which is the text 1b after conversion. Specifically, the text creating means 5 takes the text 1b as input and, using the character/label conversion table 4, rewrites each group of characters in the text 1b that has been assigned to a class label with that class label, thereby creating the converted text 1c.

"今日の尾鷲市は雨" ("It is raining in Owase City today"), shown in the first line of FIG. 4(b), is an example of the text 1b input to the text creating means 5.
"今日の尾☆市は雨", shown in the second line of FIG. 4(b), is an example of the converted text 1c that the text creating means 5 outputs in this case.
Note that the text input to the text creating means 5 may be a text different from the text 1b of the learning data 1.
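The conversion performed by the text creating means 5 amounts to a per-character lookup in the character/label conversion table 4. The sketch below reproduces the FIG. 4(b) example. The function name `convert_text` and the dictionary representation of the table are assumptions made for illustration.

```python
def convert_text(text, conversion_table):
    """Replace every character listed in the table with its class label."""
    # Characters absent from the table keep their own (single) label.
    return "".join(conversion_table.get(ch, ch) for ch in text)

# Character/label conversion table 4 from the FIG. 4(a) example.
table_4 = {"璃": "☆", "鷲": "☆", "劉": "☆"}

converted_1c = convert_text("今日の尾鷲市は雨", table_4)
print(converted_1c)  # 今日の尾☆市は雨
```

Because the lookup is independent of where the text came from, the same function applies unchanged to a text other than the text 1b of the learning data.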

The acoustic model learning means 6 learns from the speech 1a in the learning data 1, the converted text 1c, the class label, and the plurality of single labels (character labels), converts the speech 1a into a label string of the class label and the character labels through this learning, and learns the acoustic model on the basis of the converted label string. Using the labels 3, the speech 1a, and the converted text 1c, it trains a model that outputs which of the labels 3 the speech corresponds to, and stores the model in the acoustic model storage means 8. The acoustic model learning means 6 is applicable to any end-to-end acoustic model that identifies a character sequence, such as the one described in Non-Patent Document 2.

This acoustic model is obtained by modeling acoustic features (mel-frequency cepstral coefficients, filter bank outputs, etc.) extracted in advance from a large amount of speech data, for each of the configured labels, using a deep neural network (DNN), connectionist temporal classification (CTC), and the like. As long as the output consists of graphemes including kanji, the likelihood calculation for the acoustic features may use either a recurrent neural network (RNN) or a long short-term memory (LSTM) network.
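For readers unfamiliar with CTC, the following sketch shows the standard CTC collapsing rule from Non-Patent Document 1, which turns a per-frame label sequence into a label string by merging consecutive repeats and then dropping blanks. The patent text does not describe decoding at this level of detail. The blank symbol `"_"`, the function name `ctc_collapse`, and the frame sequence are assumptions chosen for illustration.

```python
BLANK = "_"  # hypothetical symbol standing for the CTC blank label

def ctc_collapse(frame_labels):
    """Collapse a per-frame label sequence: merge repeats, then drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame and is not blank.
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# Hypothetical best label per frame; "☆" is the class label for low-frequency characters.
frames = ["今", "今", "_", "日", "の", "の", "_", "尾", "☆", "☆", "_", "市", "は", "_", "雨"]
label_string = ctc_collapse(frames)
print(label_string)  # 今日の尾☆市は雨
```

A blank between two identical labels keeps them apart, so a genuinely doubled character survives the collapse.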

The language model learning means 7 learns from the converted text 1c and, through this learning, trains as a speech recognition model a language model that converts a label string of the class label and the single labels (character labels) into a word string. The language model models, for example, the occurrence probabilities of output sequences (words etc.) learned in advance from a large amount of text. A general N-gram language model, for example, can be used as this language model.

Using the labels 3 and the converted text 1c, the language model learning means 7 trains a model that outputs a word string from the labels 3, and stores it in the language model storage means 9. As in Non-Patent Document 2, this model takes the output of the acoustic model stored in the acoustic model storage means 8 as input and estimates and outputs a word string from the relationships between neighboring words. In addition, for characters that are not among the labels 3 used by the acoustic model learning means 6, it restores the characters from the character/label conversion table 4 and the relationships between neighboring words.

The acoustic model storage means 8 stores the acoustic model created through learning by the acoustic model learning means 6, and is a general storage medium such as a hard disk.
The language model storage means 9 stores the language model created through learning by the language model learning means 7, and is a general storage medium such as a hard disk.

The speech recognition means 10 recognizes input speech (evaluation speech) for each utterance section spoken by a person, and outputs the resulting word string to a display device or the like (not shown).
The speech recognition means 10 converts the input speech into features (feature vectors) and converts those features into labels one after another using the acoustic model stored in the acoustic model storage means 8, thereby creating a label string. At this point it creates, for example, a label string such as "今日の尾☆市は雨", shown in the first line of FIG. 4(c).
The speech recognition means 10 then converts the label string into words one after another using the language model stored in the language model storage means 9, thereby creating a word string. At this point it creates, for example, a word string such as "今日の尾鷲市は雨", shown in the second line of FIG. 4(c).
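The restoration of a class label to a concrete character can be illustrated with a toy model. The sketch below is not the word-level language model described above. It substitutes character bigram counts from a tiny hypothetical corpus as a stand-in, and the function name `restore_class_labels` and all sample data are assumptions.

```python
from collections import Counter

def restore_class_labels(label_string, class_members, corpus, class_label="☆"):
    """Replace each class label with the member character that best fits its neighbours."""
    # Character bigram counts stand in for the language model (an assumption).
    bigrams = Counter(a + b for line in corpus for a, b in zip(line, line[1:]))
    chars = list(label_string)
    for i, ch in enumerate(chars):
        if ch != class_label:
            continue
        prev_ch = chars[i - 1] if i > 0 else ""
        next_ch = chars[i + 1] if i + 1 < len(chars) else ""
        # Score each candidate by how often it follows prev_ch and precedes next_ch.
        chars[i] = max(
            class_members,
            key=lambda c: bigrams[prev_ch + c] + bigrams[c + next_ch],
        )
    return "".join(chars)

# Hypothetical text used only to collect bigram statistics.
corpus = ["尾鷲市の天気", "劉備の伝記", "瑠璃色の空"]
restored = restore_class_labels("今日の尾☆市は雨", ["璃", "鷲", "劉"], corpus)
print(restored)  # 今日の尾鷲市は雨
```

The candidates come from the character/label conversion table 4, and the surrounding context selects among them, which mirrors the division of roles between the table and the language model in the description.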

[Flow of class label creation processing]
The flow of class label creation by the label creating means 20 of the end-to-end Japanese speech recognition model learning device 2 according to the first embodiment is described with reference to FIG. 5.
First, the label creating means 20 uses the morpheme dividing means 21 to create the morpheme unit list W by dividing the text 1b of the learning data 1 into morphemes (step S101).

Next, the label creating means 20 uses the character list creating means 22 to create the character list F, consisting of the N most frequent character types in the text 1b of the learning data 1, and the character list R, consisting of the remaining low-frequency characters (step S102).
The label creating means 20 then uses the label determining means 23 to select a low-frequency character from the character list R (step S103), add the selected low-frequency character to the class label (step S104), and update the character/label conversion table 4 (step S105).

The label determining means 23 then determines whether all low-frequency characters have been selected (step S106). If an unselected low-frequency character remains (step S106: No), the label determining means 23 returns to step S103. If all low-frequency characters have been selected (step S106: Yes), the label creating means 20 merges the class label with the character list F to create the labels 3 (step S107), outputs the character/label conversion table 4 to the text creating means 5, and ends the processing.
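Steps S103 to S107 of the flow above can be sketched as a simple loop. This is an illustrative reading of the flowchart description, not the actual implementation. The function name `create_labels`, the list and dictionary data structures, and the sample inputs are assumptions, and the character lists F and R are taken as already computed in step S102.

```python
CLASS_LABEL = "☆"  # star symbol, as in the FIG. 3(b) and FIG. 4(a) example

def create_labels(list_f, list_r):
    """Build labels 3 and the character/label conversion table 4 from lists F and R."""
    conversion_table = {}   # character/label conversion table 4
    class_members = []      # characters grouped under the class label
    remaining = list(list_r)
    while remaining:                        # S106: repeat until none is left unselected
        ch = remaining.pop(0)               # S103: select a low-frequency character
        class_members.append(ch)            # S104: add it to the class label
        conversion_table[ch] = CLASS_LABEL  # S105: update the conversion table
    # S107: merge the class label with character list F to form labels 3.
    labels = list(list_f) + ([CLASS_LABEL] if class_members else [])
    return labels, conversion_table

# Hypothetical lists F and R.
labels_3, table_4 = create_labels(["今", "日", "の"], ["璃", "鷲", "劉"])
print(labels_3)
print(table_4)
```

Three low-frequency characters end up sharing one output label, so the output layer needs four labels here instead of six.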

According to the present embodiment, grouping characters with a low appearance frequency (low-frequency characters) into a single class label reduces the number of learning parameters and increases the number of learning samples per label. The acoustic features of low-frequency characters therefore become easier to learn, which improves Japanese speech recognition accuracy.

(Second Embodiment)
Next, a Japanese speech recognition model learning device according to a second embodiment of the present invention will be described with reference to FIG. 6. The end-to-end Japanese speech recognition model learning device according to the second embodiment differs from the first embodiment in that the label creating means 20B creates a plurality of class labels. Because the other components are the same as in the first embodiment, a drawing of the overall configuration is omitted. In the label creating means 20B shown in FIG. 6, components identical to those of the label creating means 20 shown in FIG. 2 are given the same reference numerals, and their description is omitted as appropriate.
The label creating means 20B creates a plurality of class labels by assigning each of a plurality of predetermined characters, whose appearance frequency is lower than a predetermined criterion, to one of a plurality of classes partitioned according to a predetermined criterion.

Here, as one example, the label creating means 20B reflects the phonological features of characters: as shown in FIG. 7(a), it creates a plurality of class labels by assigning each character to one of a plurality of classes partitioned by character reading. In FIG. 7(a), for example, 「＜あ＞」, assigned to the class whose reading is 「あ」, denotes a class label by the character 「あ」 representing the reading together with the two symbols 「＜」 and 「＞」 written on either side of it.
In the present embodiment, when the text 1b input to the text creating means 5 is, for example, 「今日の尾鷲市は雨」 ("It is raining in Owase City today") shown in the first line of FIG. 7(b), the text creating means 5 outputs a converted text 1c such as 「今日の尾＜わ＞市は雨」 shown in the second line of FIG. 7(b).
Likewise, when the speech recognition means 10 creates from the input speech a label string such as 「今日の尾＜わ＞市は雨」 shown in the first line of FIG. 7(c), it uses the language model to create a word string such as 「今日の尾鷲市は雨」 shown in the second line of FIG. 7(c).
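The text conversion of FIG. 7(b) amounts to a character-by-character lookup. The following minimal sketch assumes that the character/label conversion table 4 can be represented as a dictionary; the function name `convert_text` and the one-entry table are hypothetical and serve only to reproduce the example above.

```python
def convert_text(text, table):
    """Replace each character by its label; characters not in the table are kept."""
    return "".join(table.get(ch, ch) for ch in text)


# Hypothetical one-entry table: the low-frequency kanji 鷲 belongs to class ＜わ＞.
table = {"鷲": "＜わ＞"}
converted = convert_text("今日の尾鷲市は雨", table)
```

Here `converted` holds the converted text 1c of the example, in which only the low-frequency character has been replaced by its class label.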
Each component of the label creating means 20B is described below with reference to FIG. 6.

As shown in FIG. 6, the label creating means 20B includes a morpheme dividing means 21, a character list creating means 22, a label determining means 23B, a storage means 24, a morpheme list creating means 25, an edit distance calculating means 26, a reading delimiter estimating means 27, and a reading list creating means 28.

The morpheme list creating means 25 creates a morpheme list J consisting of those morphemes in the morpheme unit list W that contain a low-frequency character from the character list R. The morpheme list J is stored in the storage means 24. General-purpose software for Japanese morphological analysis, for example, can be used as the morpheme list creating means 25.

In the present embodiment, the morpheme list creating means 25 creates, based on the morpheme unit list W and for each kanji s of interest, a morpheme list J_s, which is the list of morphemes containing that kanji s. The morpheme list J_s is stored in the storage means 24 while the kanji s is being processed.
FIG. 8 shows an example of the morpheme list J_s created by the morpheme list creating means 25 when the kanji s of interest is 「生」.

Here, the morpheme list creating means 25 also creates a single-kanji list: a list of all stand-alone readings of every kanji contained in each morpheme j_s appearing in the morpheme list J_s for the kanji s of interest (each morpheme section j_s in the morpheme list J_s). Specifically, when the morpheme j_s is 「生」, the morpheme list creating means 25 creates a list containing, for example, 「せい」, 「しょう」, 「き」, and 「なま」 as readings of 「生」.

The edit distance calculating means 26 calculates an edit distance D_x for each morpheme j_s appearing in the morpheme list J_s for the kanji s of interest. It does so by comparing every combination obtained by assigning a stand-alone reading to each kanji constituting the morpheme j_s against the reading j^r_s of the whole morpheme j_s assigned by the morpheme dividing means 21.
Here, the edit distance D_x between the reading formed by a combination of kanji readings and the reading of the whole morpheme is the minimum number of operations (insertion, deletion, and substitution) needed to edit one reading into the other. The edit distance calculating means 26 calculates the edit distance D_x by counting these deleted, inserted, and substituted characters.
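The edit distance defined above is the standard Levenshtein distance. As a minimal sketch (the function name `edit_distance` is an assumption, and the specification does not state which algorithm the edit distance calculating means 26 uses), it can be computed with dynamic programming over two rows:

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

A reading combination that reproduces the whole-morpheme reading exactly gives a distance of 0, and the distance grows as the concatenated readings diverge from it.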

Specifically, suppose the morpheme j_s is 「生物」 shown in FIG. 8. All the combinations obtained by assigning a stand-alone reading to each of 「生」 and 「物」 are produced by combining the readings of the individual characters.
Here, the readings of 「生」 are taken to be, for example, 「せい」, 「しょう」, 「き」, and 「なま」, and the readings of 「物」 to be, for example, 「ぶつ」 and 「もの」.
In this case, the combinations j^i_{s,x} are the following eight: 「せい-もの」, 「せい-ぶつ」, 「しょう-もの」, 「しょう-ぶつ」, 「き-もの」, 「き-ぶつ」, 「なま-もの」, and 「なま-ぶつ」.
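The eight combinations are the Cartesian product of the per-character reading lists. A short sketch, assuming the single-kanji list is held as a dictionary from character to readings (the names `readings` and `reading_combinations` are hypothetical):

```python
from itertools import product

# Example readings taken from the text above.
readings = {"生": ["せい", "しょう", "き", "なま"], "物": ["ぶつ", "もの"]}


def reading_combinations(morpheme, readings):
    """All ways to assign one stand-alone reading to each character of the morpheme."""
    per_char = [readings[ch] for ch in morpheme]
    return [list(combo) for combo in product(*per_char)]


combos = reading_combinations("生物", readings)
```

With four readings for 「生」 and two for 「物」, `combos` contains 4 × 2 = 8 entries, matching the count given above.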

The reading delimiter estimating means 27 finds the combination of kanji readings j^i_{s,x} that minimizes the edit distance D_x and estimates the boundary j^r_{s,s} of the stand-alone reading of the kanji s of interest within the morpheme j_s.
The reading j^r_s of the whole of 「生物」 shown in FIG. 8 is assigned by the morpheme dividing means 21 as 「せいぶつ」. However, the morpheme dividing means 21 assigns readings at the word level and provides no information on whether, within the symbol 「生物」, the reading of the symbol 「生」 is 「せ」 or 「せい」. The reading delimiter estimating means 27 therefore determines, for example from the edit distances D_x of the eight combinations above, that the reading of 「生」 in 「生物」 is most probably 「せい」, and thereby estimates the boundary j^r_{s,s} of the stand-alone reading of 「生」, the kanji s of interest, in 「生物」.
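The estimation described above can be sketched by scoring every reading combination against the whole-morpheme reading and keeping the closest one. The function names, the dictionary representation of the single-kanji list, and the choice to return the distance alongside the reading are assumptions made for illustration; ties are resolved by taking the first minimum, which the specification does not address.

```python
from itertools import product


def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def estimate_reading(morpheme, whole_reading, readings, target):
    """Return (reading of target, distance) for the combination closest to whole_reading."""
    candidates = product(*(readings[ch] for ch in morpheme))
    best = min(candidates, key=lambda c: edit_distance("".join(c), whole_reading))
    distance = edit_distance("".join(best), whole_reading)
    return best[morpheme.index(target)], distance


readings = {"生": ["せい", "しょう", "き", "なま"], "物": ["ぶつ", "もの"]}
result = estimate_reading("生物", "せいぶつ", readings, "生")
```

Returning the distance lets a caller reject an estimate whose distance is too large, for instance by comparing it with a threshold.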

The reading list creating means 28 refers to the single-kanji list described above (the list of all stand-alone readings of each character) and determines which of the stand-alone readings j^i_x of the kanji s of interest corresponds to the reading boundary j^r_{s,s} estimated for that kanji in the morpheme j_s. According to the result, it stores the morpheme j_s in a reading list L^r_s, where the reading lists are organized by reading j^i_s of the kanji s of interest.
The reading list L^r_s is stored in the storage means 24 while the kanji s is being processed.

In the example shown in FIG. 8, the reading of 「生」 in the first three entries from the top is 「せい」, in the fourth entry it is 「しょう」, and in the fifth entry it is 「い」.
In this case, the reading list creating means 28 therefore stores 「生物」, 「生徒」, and 「生活」 in the reading list L^r_s corresponding to the reading 「せい」 of the kanji 「生」.
It stores 「生涯」 in the reading list L^r_s corresponding to the reading 「しょう」 of the kanji 「生」.
It further stores 「生き物」 in the reading list L^r_s corresponding to the reading 「い」 of the kanji 「生」.

Like the label determining means 23, the label determining means 23B controls the label creating means 20B as a whole. It determines each high-frequency character to be a character label and each low-frequency character selected from the character list R to be a class label, and stores the class labels and character labels in the storage means 24 as the label 3.
When the label determining means 23B selects a low-frequency character from the character list R that is not a kanji, it assigns the character to the corresponding class.
For each reading list L^r_s, organized by reading j^i_s of the kanji s of interest, the label determining means 23B counts the number L^{r,c}_s of morphemes j_s stored in that list and determines the first character of the reading whose reading list has the largest number of elements.

Specifically, in the example shown in FIG. 8, the number L^{r,c}_s of morphemes stored in the reading list L^r_s corresponding to the reading 「せい」 of the kanji 「生」 is 3.
The number L^{r,c}_s of morphemes stored in the reading list L^r_s corresponding to the reading 「しょう」 is 1.
The number L^{r,c}_s of morphemes stored in the reading list L^r_s corresponding to the reading 「い」 is 1.
Because the reading list L^r_s corresponding to the reading 「せい」 has the largest number of elements, the label determining means 23B determines from its first character 「せ」 that the kanji 「生」 is to be assigned to the class 「＜せ＞」.
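The selection in this example reduces to picking the reading whose list is longest and taking its first character. The sketch below assumes the reading lists are held as a dictionary from reading to morphemes; the name `choose_class_label` is hypothetical, and the tie-breaking rule (first maximum wins) is an assumption because the specification does not say how ties are handled.

```python
def choose_class_label(reading_lists):
    """Pick the class label from the reading with the most morphemes."""
    # max() returns the first reading among equally long lists (assumed tie rule).
    best_reading = max(reading_lists, key=lambda r: len(reading_lists[r]))
    return "＜" + best_reading[0] + "＞"


# Reading lists for the kanji 生, taken from the FIG. 8 example.
reading_lists = {
    "せい": ["生物", "生徒", "生活"],
    "しょう": ["生涯"],
    "い": ["生き物"],
}
label = choose_class_label(reading_lists)
```

For the counts 3, 1, and 1 above, `label` is the class label built from the first character of 「せい」.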

In the present embodiment, the storage means 24 stores the morpheme list J, the morpheme lists J_s, and the reading lists L^r_s in addition to data such as the morpheme unit list W, the character list F, the character list R, and the label 3.

The processing by which the language model learning means 7 restores characters from labels will now be described with reference to FIGS. 9(a) and 9(b).
As one example, FIG. 9(a) schematically shows an acoustic model learned with the kanji 「奏」, 「創」, 「遭」, and 「送」 assigned to the class label 「＜そ＞」, which corresponds to 「そ」, the first character of their readings. It also schematically shows a language model learned so that 「演奏会」 ("concert") in the original text 1b can be restored from 「演＜そ＞会に出る」, an example of the converted text 1c.

FIG. 9(b) is a schematic diagram of a weighted finite state transducer (WFST) for the word 「演奏会」 in this language model when the kanji 「奏」 has been assigned to the class label 「＜そ＞」 for learning. A WFST is a converter whose transitions are each described by a pair of an input signal and an output signal together with a weight; it is also used for the language model learning means in Non-Patent Documents 2 and 3. In FIG. 9(b), the transition probability in the notation "input signal : output signal (transition probability)" on each WFST arrow is omitted. "eps" denotes a transition with no input or output, and "space" denotes a blank transition.

In the present embodiment, the language model learning means 7 is expressed as the composition of three transducers: a character string creating means (hereinafter, converter T), a word string creating means (hereinafter, converter L), and a sentence creating means (hereinafter, converter G). Decoding is performed by composing these three: the converter T from CTC labels to characters, the converter L from characters to words, and the converter G from words to sentences.

The converter T generates a character string expressed with the label 3.
The converter L uses the character/label conversion table 4 to restore Japanese words from a label string that contains class labels. As an example, FIG. 9(b) shows the word string output when the kanji 「奏」 has been assigned to the label 「＜そ＞」.
When converting the token string estimated by the converter T into words, the converter L is responsible for converting the class labels assigned during acoustic model learning into words containing the original characters.
The converter G outputs the most plausible recognition result from the word string candidates obtained by the converter L, using statistical information on word sequences (n-grams).
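The role of the converter L can be illustrated without a WFST library by a plain dictionary lookup. This is a deliberately simplified stand-in, not the WFST composition described above: the function name `build_label_lexicon`, the small word list, and the dictionary form of the conversion table are all assumptions.

```python
def build_label_lexicon(words, table):
    """Map each label string to the words that yield it after conversion."""
    lexicon = {}
    for word in words:
        key = "".join(table.get(ch, ch) for ch in word)
        lexicon.setdefault(key, []).append(word)
    return lexicon


# The four kanji of FIG. 9(a) share the class label ＜そ＞.
table = {"奏": "＜そ＞", "創": "＜そ＞", "遭": "＜そ＞", "送": "＜そ＞"}
lexicon = build_label_lexicon(["演奏会", "創造", "送る"], table)
candidates = lexicon["演＜そ＞会"]
```

When several words collapse onto the same label string, the lookup returns more than one candidate, and choosing among them is the part that the n-gram statistics of the converter G would handle.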

[Flow of class label creation processing]
Next, the flow of class label creation processing by the label creating means 20B of the end-to-end Japanese speech recognition model learning device 2 according to the second embodiment will be described with reference to FIG. 10. Steps S101 and S102 shown in FIG. 10 are the same as the processing shown in FIG. 5, so their description is omitted.

Following step S102, the label creating means 20B uses the morpheme list creating means 25 to create the morpheme list J, consisting of those morphemes in the morpheme unit list W that contain a low-frequency character from the character list R (step S201).
The label creating means 20B then uses the label determining means 23B to select a low-frequency character from the character list R (step S202).
If the label determining means 23B has selected a kanji as the low-frequency character, the label creating means 20B estimates the reading of that kanji and adds the kanji to the list for the first character of its reading (step S203). The details are described later.

If the label determining means 23B has selected a hiragana or katakana character as the low-frequency character, it adds the character to the list for the corresponding reading (step S204).
If it has selected a digit or an alphabetic character, whose reading cannot be estimated, it adds the character to the unknown-reading list (step S205).
If it has selected a symbol, which has no reading, it adds the character to the symbol list (step S206).
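The branching of steps S203 to S206 depends on the kind of character selected. A rough sketch of such a dispatch is given below; the Unicode ranges, the label names `＜unknown＞` and `＜symbol＞`, and both function names are assumptions, and details such as the treatment of small kana are left open because the specification does not describe them.

```python
import unicodedata


def char_category(ch):
    """Classify a character by script using Unicode code point ranges (assumed)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:      # CJK Unified Ideographs
        return "kanji"
    if 0x3041 <= code <= 0x3096:      # hiragana letters
        return "hiragana"
    if 0x30A1 <= code <= 0x30F6:      # katakana letters with a hiragana counterpart
        return "katakana"
    norm = unicodedata.normalize("NFKC", ch)  # folds full-width forms to ASCII
    if norm.isascii() and norm.isalnum():
        return "unknown_reading"      # digits and alphabetic characters
    return "symbol"


def class_for(ch):
    """Return a class label for non-kanji characters, or None for kanji (step S203)."""
    category = char_category(ch)
    if category == "hiragana":
        return "＜" + ch + "＞"                    # step S204
    if category == "katakana":
        return "＜" + chr(ord(ch) - 0x60) + "＞"   # step S204, via the hiragana form
    if category == "unknown_reading":
        return "＜unknown＞"                       # step S205
    if category == "symbol":
        return "＜symbol＞"                        # step S206
    return None  # kanji require the reading estimation of step S203
```

Katakana are mapped to hiragana by the fixed code point offset between the two Unicode blocks, so that both scripts share one class per reading.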

Following any one of steps S203 to S206, the label creating means 20B uses the label determining means 23B to update the character/label conversion table 4 (step S207). The label determining means 23B then determines whether all low-frequency characters have been selected (step S208). If an unselected low-frequency character remains (step S208: No), the label determining means 23B returns to step S202. If all low-frequency characters have been selected (step S208: Yes), the class label set L_i and the character list F are merged to create the label 3 (step S209), the character/label conversion table 4 is output to the text creating means 5, and the processing ends.

Next, the processing performed when the label creating means 20B selects a kanji as the low-frequency character will be described with reference to FIG. 11 (and FIG. 10 as appropriate). The description here partly overlaps the processing shown in FIG. 10, starting partway through it. First, the label creating means 20B uses the label determining means 23B to select a kanji s as the low-frequency character from the character list R (initial value s = 1: step S202). Although s (s = 1, 2, ...) is an index identifying a kanji, it is referred to below simply as the kanji s.

The label creating means 20B then uses the morpheme list creating means 25 to create, from the morpheme list J of morphemes containing low-frequency characters created in step S201 (FIG. 10), the morpheme list J_s of morphemes containing the kanji s (the s-th kanji) (step S231).
The label creating means 20B uses the label determining means 23B to select a morpheme j_s (the j_s-th morpheme) from the morpheme list J_s (step S232). Although j_s (j_s = 1, 2, ...) is an index identifying a morpheme, it is referred to below simply as the morpheme j_s.

Next, the label creating means 20B uses, for example, the edit distance calculating means 26 and the reading delimiter estimating means 27 to estimate the reading j^r_{s,s} of the kanji s in the morpheme j_s, and uses the reading list creating means 28 to store the morpheme j_s in the reading list L^r_s prepared for each reading r of the kanji s (step S233).

The label determining means 23B then determines whether all morpheme sections have been selected (step S234). If an unselected morpheme section remains (step S234: No), the label determining means 23B adds 1 to the morpheme section value j_s (j_s = j_s + 1: step S235) and returns to step S232.

If all morpheme sections have been selected (step S234: Yes), the label creating means 20B uses the reading list creating means 28 to count the number L^{r,c}_s of morphemes stored in each reading list L^r_s, organized by reading r of the kanji s, and obtains the first character r_t of the reading r whose reading list L^r_s has the largest number of elements (step S236). The label determining means 23B then adds the kanji s to the list L^r_{rt} for the first character r_t (step S237).
However, if the reading cannot be estimated in step S236, for example because the edit distance exceeds a specified value, the kanji is added to the unknown-reading list as in step S205.

The label creating means 20B then uses the label determining means 23B to update the character/label conversion table 4 (step S207). The label determining means 23B then determines whether all kanji among the low-frequency characters have been selected (step S208). If an unselected kanji section remains (step S208: No), the label determining means 23B adds 1 to the kanji section value s (s = s + 1: step S238) and returns to step S202. If all kanji sections among the low-frequency characters have been selected (step S208: Yes), the label determining means 23B merges the class label set L_i with the character list F to create the label 3 (step S209), outputs the character/label conversion table 4 to the text creating means 5, and ends the processing. In step S209, a class label in the class label set L_i that has no corresponding kanji may be omitted.

According to the present embodiment, Japanese speech recognition accuracy improves as in the first embodiment. In addition, because the low-frequency characters are divided into a plurality of class labels according to their readings, the reading serves as a hint when a character is restored from a class label, which further improves word recognition accuracy.

Embodiments of the present invention have been described above, but the present invention is not limited to them and can be practiced within a scope that does not depart from its gist. For example, the acoustic model learning means 6 and the speech recognition means 10 have been described as receiving speech and converting it into features internally, but acoustic features obtained by converting the speech beforehand may be used as the input instead.

Each of the above embodiments has been described as an end-to-end Japanese speech recognition model learning device, but it can also be regarded as an end-to-end Japanese speech recognition model learning program written in a general-purpose or special-purpose computer language so as to enable the processing of each component of the device.

Regarding estimation of the first character of the reading, the morpheme list creating means 25 has been described as creating a single-kanji list, that is, a list of all stand-alone readings of every kanji contained in each morpheme j_s appearing in the morpheme list J_s for each kanji s of interest, but the method is not limited to this. For example, the character reading list in the kakasi dictionary may be searched one character at a time for a forward or backward match according to the position of the kanji, and the matching reading adopted. kakasi is a program and dictionary created for converting text written in mixed kanji and kana into hiragana or romaji text.

The end-to-end Japanese speech recognition model learning device 2 has been described as learning an end-to-end speech recognition model that directly outputs character labels 3 from the input speech 1a, but it may instead output word labels directly. Because Japanese has more words than character types, the number of parameters is also larger; nevertheless, systems that directly output words have been reported to be feasible even with a vocabulary of, for example, about 100,000 words.

In the second embodiment, the reading of a character was used as the criterion for dividing into a plurality of class labels, but the division may instead be based on context, for example the surrounding characters or words, or the part of speech.

A speech recognition experiment was conducted to verify the performance of the end-to-end Japanese speech recognition model learning device according to the present invention.
The word error rate was measured for speech recognition using the acoustic models and language models generated by the end-to-end Japanese speech recognition model learning devices according to the first and second embodiments. These are referred to below as Example 1 and Example 2, respectively. As a comparative example, the word error rate was also measured for speech recognition without class labels.

<Experimental conditions>
The Kaldi-based EESEN framework (https://github.com/srvk/eesen) was used. Kaldi is described in the following reference.
(Reference)
D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in Proc. ASRU, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.

In the experiment, the Kaldi-based EESEN framework was used with the following modifications so that Japanese characters could be output.
The acoustic model is a four-layer BLSTM (Bi-directional Long Short Term Memory) trained with the CTC criterion. It was trained on 712 hours of paired NHK (registered trademark) program audio and subtitles as the speech 1a and text 1b of the learning data 1.
The features are 120-dimensional in total: 40-dimensional log mel filter bank features and their Δ and ΔΔ coefficients.
The LSTM has 320 memory cells in each direction.
The language model is a 3-gram WFST in ARPA format, created from NHK (registered trademark) manuscripts and subtitles with a vocabulary of 200,000 words.
The evaluation data is 5 hours of the NHK (registered trademark) information program 「ひるまえほっと」.

The 712 hours of data used as the learning data 1 contain 3,476 character types, including kanji and katakana. It was examined what percentage of all characters in the learning data can be covered when these 3,476 character types are collected in descending order of frequency. Most characters in the learning data turned out to be concentrated in roughly the top 30% of character types, and the top 42% (1,452 character types) cover 99% of all characters appearing in the learning data. In the experiment, the top 1,500 character types were selected as the high-frequency character types, as described below.
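The coverage survey described above can be reproduced on any corpus with a cumulative frequency count. The sketch below is illustrative only: the function name `coverage_of_top` is hypothetical, and the toy string stands in for the learning data, which is not available here.

```python
from collections import Counter


def coverage_of_top(text, k):
    """Fraction of all characters in text covered by the k most frequent types."""
    counts = Counter(text)
    total = sum(counts.values())  # assumes text is non-empty
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total


# Toy corpus: 4 x "a", 2 x "b", 1 x "c".
toy_coverage = coverage_of_top("aaaabbc", 2)

# Share of character types reported above: 1,452 of 3,476 types.
type_share_percent = 1452 / 3476 * 100
```

The second calculation confirms that 1,452 of 3,476 character types is about 42% of the types, consistent with the figure quoted above.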

As a comparative example, an acoustic model was trained with 3,477 output labels: one label for each of the 3,476 character types appearing in the learning data 1, plus a blank label.
In Example 1, the 1,500 most frequent characters in the learning data 1 were extracted and given 1,500 single-character output labels. The remaining 1,976 characters were assigned to one class label, so 1,501 labels were used for acoustic model learning. A language model that restores the original characters from the assigned class label was also used.
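The label assignment of Example 1 can be sketched as a conversion table that maps each high-frequency character to its own label and every other character to one shared class label. This is an illustration under assumptions, not the device's actual implementation: the label name `<class>`, the function names, and the toy corpus are hypothetical.

```python
from collections import Counter

CLASS_LABEL = "<class>"  # hypothetical name for the single class label


def build_conversion_table(text, num_single):
    """Map each character to itself (high frequency) or to CLASS_LABEL."""
    counts = Counter(ch for ch in text if not ch.isspace())
    ranked = [ch for ch, _ in counts.most_common()]
    table = {ch: ch for ch in ranked[:num_single]}
    table.update({ch: CLASS_LABEL for ch in ranked[num_single:]})
    return table


def convert_text(text, table):
    """Turn a text into the label sequence used as the training target."""
    return [table[ch] for ch in text if ch in table]


# Toy corpus; the experiment used num_single=1500 on 712 hours of subtitles.
corpus = "ああああいいいうう漢字"
table = build_conversion_table(corpus, num_single=3)
labels = convert_text("あい漢字う", table)
print(labels)
print(len(set(table.values())))  # number of distinct output labels
```

With `num_single=1500` on the real data, the same construction would yield the 1,500 single labels plus one class label described above.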

In Example 2, the 1,500 most frequent characters in the learning data 1 were likewise extracted and given 1,500 single-character output labels. The remaining 1,976 characters were assigned to 73 class labels. The 73 classes are: one class for the reading of each of the 46 character types of the Japanese syllabary table (gojūon), including "を" and the moraic nasal; one class for the reading of each of the 25 character types with voiced and semi-voiced sounds; a class for characters whose reading is unknown, such as digits and letters of the alphabet; and a class for symbols. In practice, the class labels to which none of the 1,976 characters were assigned (three class labels) were excluded, so only 1,570 labels were used for acoustic model learning. A language model that restores the original characters from the 70 assigned class labels was also used. The experimental results are shown in Table 1.
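The 73-class inventory of Example 2 can be written out explicitly, which also makes the arithmetic (46 + 25 + 1 + 1 = 73) easy to check. The sketch below is hypothetical in several respects: the label names, the `reading_of` lookup, and the rule of keying a character on a single kana of its reading are assumptions, since the text here does not state how a character's reading is obtained.

```python
BASIC = ("あいうえおかきくけこさしすせそたちつてと"
         "なにぬねのはひふへほまみむめもやゆよらりるれろわをん")  # 46 kana
VOICED = "がぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽ"      # 25 kana
KANA = set(BASIC + VOICED)
UNKNOWN_CLASS = "<unk_reading>"  # hypothetical label names
SYMBOL_CLASS = "<symbol>"

ALL_CLASSES = [f"<{k}>" for k in BASIC + VOICED] + [UNKNOWN_CLASS, SYMBOL_CLASS]


def class_for(ch, reading_of):
    """Pick a class label for one low-frequency character.

    reading_of is a hypothetical dict from a character to one kana of its
    reading; characters missing from it fall into the unknown-reading class.
    """
    if not ch.isalnum():
        return SYMBOL_CLASS
    kana = reading_of.get(ch)
    if kana in KANA:
        return f"<{kana}>"
    return UNKNOWN_CLASS


# Toy data: three kanji with a reading, one letter, one symbol.
reading_of = {"梟": "ふ", "鷲": "わ", "蛾": "が"}
low_freq = ["梟", "鷲", "蛾", "Ｑ", "※"]
table = {ch: class_for(ch, reading_of) for ch in low_freq}
used_classes = sorted(set(table.values()))
print(len(ALL_CLASSES), table, used_classes)
```

Keeping only `used_classes` mirrors the exclusion of class labels to which no character was assigned.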

Using classes improves the speech recognition word error rate (WER) compared with not using classes. Moreover, dividing the low-frequency characters among multiple class labels improves the WER further than assigning all of them to a single class label.
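For reference, WER is conventionally the word-level edit distance between the reference and the recognition result, divided by the number of reference words. The sketch below uses that standard definition; the exact scoring tool and text normalization used in the experiment are not stated here, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def wer(ref_words, hyp_words):
    """Word error rate: edit distance divided by the reference length."""
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


ref = ["今日", "は", "晴れ", "です"]
hyp = ["今日", "は", "雨", "です", "ね"]
print(wer(ref, hyp))
```

Here one substitution and one insertion against four reference words give a WER of 0.5.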

2 End-to-end Japanese speech recognition model learning device
5 Text creation means
6 Acoustic model learning means
7 Language model learning means
8 Acoustic model storage means
9 Language model storage means
10 Speech recognition means
20, 20B Label generation means
21 Morphological division means
22 Character list creation means
23, 23B Label determination means
24 Storage means
25 Morphological list creation means
26 Edit distance calculation means
27 Reading delimitation estimation means
28 Reading list creation means

Claims

1. An end-to-end Japanese speech recognition model learning device that learns, as a speech recognition model, an end-to-end acoustic model that outputs labels of characters or words from speech or acoustic features of the speech, using learning data that includes text and the speech or the acoustic features of the speech, the device comprising:
label creating means for creating, from the text in the learning data, a class label that assigns a plurality of characters or a plurality of words whose appearance frequency in the text is lower than a predetermined appearance frequency reference to a class representing the plurality of characters or the plurality of words, and single labels each attached to a single character or a single word whose appearance frequency is higher than the reference;
text creating means for converting the plurality of characters or the plurality of words included in the text in the learning data into the class label on the basis of a conversion table that assigns the plurality of characters or the plurality of words to the class label, and for creating a converted text, which is the text after the conversion; and
acoustic model learning means for learning from the speech or the acoustic features of the speech in the learning data, the converted text, the class label, and the plurality of single labels, so that the speech or the acoustic features are converted into a label string of the class label and the single labels, and for learning the acoustic model on the basis of the converted label string.

2. The end-to-end Japanese speech recognition model learning device according to claim 1, wherein the label creating means creates the class labels by assigning the plurality of characters or the plurality of words whose appearance frequency is lower than the predetermined reference to any of a plurality of classes classified according to a predetermined criterion.

3. The end-to-end Japanese speech recognition model learning device according to claim 2, wherein the label creating means creates the class labels by assigning the plurality of characters or the plurality of words whose appearance frequency is lower than the predetermined reference to any of a plurality of classes divided according to the reading of each character or word.

4. The end-to-end Japanese speech recognition model learning device according to any one of the preceding claims, further comprising language model learning means for learning from the converted text and thereby learning, as the speech recognition model, a language model that converts a label string of the class label and the single labels into a word string.

5. A program for causing a computer to function as the end-to-end Japanese speech recognition model learning device according to any one of claims 1 to 4.