JP2014102345A - Text creation device for acoustic model learning, method of the same, and program

Info

Publication number: JP2014102345A
Application number: JP2012253587A
Authority: JP
Inventors: Narichika Nomoto; 済央野本; Satoru Kobashigawa; 哲小橋川; Yuji Aono; 裕司青野; Hirokazu Masataki; 浩和政瀧
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-11-19
Filing date: 2012-11-19
Publication date: 2014-06-05
Anticipated expiration: 2032-11-19
Also published as: JP5980101B2

Abstract

PROBLEM TO BE SOLVED: To provide a text creation device for acoustic model learning that is capable of extracting a text containing "a phoneme that ought to be learnt". SOLUTION: A voice recognition processing section recognizes voice in development voice data input from outside by referring to a language model and an existing acoustic model, and outputs a recognition result text and phonemic series information. A recognition result tabulation section calculates a phoneme recognition rate from the phonemic series information and a correct interpretation text of the development voice data. A weak-point phoneme extracting section extracts, as a weak-point phoneme, a phoneme whose phoneme recognition rate is equal to or below a threshold level, and produces a weak-point phoneme list. A supplementary candidate text corpus stores a large amount of text that is candidate text for acoustic model learning. A weak-point phoneme containing text selecting section refers to the weak-point phoneme list to select a text containing the weak-point phoneme from the supplementary candidate text corpus, and outputs the selected text as an acoustic model learning text.

Description

The present invention relates to an acoustic model learning text creation device that creates training text for use in acoustic model learning, and to a corresponding method and program.

Recent speech recognition systems use an acoustic model and a language model. The acoustic model is a dictionary holding the acoustic features of each phoneme, such as /a/ and /k/, and is used to estimate which phoneme sequence the input speech corresponds to.

Training an acoustic model requires speech and its corresponding text (a speech database). Statistically training an accurate acoustic model requires building a large-scale speech database, and collecting a large amount of speech with its corresponding text is costly in time and labor.

For this reason, text creation methods for improving the training efficiency of acoustic models have long been studied. FIG. 10 shows the functional configuration of the acoustic model learning label creation device 900 disclosed in Patent Document 1, and its operation is briefly described here. The acoustic model learning label creation device 900 includes a first phoneme environment frequency calculation unit 923, a second phoneme environment frequency calculation unit 933, a storage unit 934, a new phoneme environment extraction unit 935, a text selection unit 936, an accumulation unit 937, and so on.

Based on the phoneme sequence input from the phoneme conversion unit 922, the first phoneme environment frequency calculation unit 923 counts the number of occurrences of each phoneme environment, and calculates and outputs the phoneme environment frequency of the existing speech DB 910. Based on the phoneme sequence input from the phoneme conversion unit 932, the second phoneme environment frequency calculation unit 933 counts the number of occurrences of each phoneme environment, and calculates and outputs the phoneme environment frequency of the original text DB 930.

The existing speech DB phoneme environment frequency and the original text DB phoneme environment frequency output from the first and second phoneme environment frequency calculation units 923 and 933, respectively, are input to the new phoneme environment extraction unit 935. From these two frequencies, the new phoneme environment extraction unit 935 extracts the new phoneme environments that are contained in the original text DB 930 but not in the existing speech DB 910, and outputs them as additional recording phoneme environments.

The additional recording phoneme environments output from the new phoneme environment extraction unit 935 are input to the text selection unit 936. From the texts of the original text DB 930, which are stored in the storage unit 934 paired with their readings and phoneme sequences, the text selection unit 936 selects the texts that contain an additional recording phoneme environment. Selection is performed by determining, for each text, whether it contains an additional recording phoneme environment. The texts selected in this way are output as a label set for additional recording.

Patent Document 1: JP 2011-248001 A

The prior art uses phoneme environment coverage information, such as the number of phonemes in the text to be read aloud and the number of phonemes contained in the existing speech DB 910. In other words, it preferentially selects texts containing phonemes that occur infrequently in the training data. However, collecting a large amount of speech containing a phoneme with little training data does not necessarily improve the recognition accuracy of that phoneme. Some phonemes may have little training data yet already show sufficiently high recognition performance. Conversely, some phonemes may have ample training data yet still leave room for improvement in recognition performance. Thus, phoneme environment coverage alone cannot accurately identify the "phonemes to be learned".

The present invention has been made in view of this problem, and its object is to provide an acoustic model learning text creation device, method, and program that can accurately extract the "phonemes to be learned".

The acoustic model learning text creation device of the present invention includes a speech recognition processing unit, a recognition result tallying unit, a weak phoneme extraction unit, an additional candidate text corpus, and a weak-phoneme-containing text selection unit. The speech recognition processing unit performs speech recognition on externally input development speech data with reference to a language model and an existing acoustic model, and outputs a recognition result text and phoneme sequence information. The recognition result tallying unit calculates a phoneme recognition rate from the phoneme sequence information and the correct text of the development speech data. The weak phoneme extraction unit extracts phonemes whose phoneme recognition rate is equal to or less than a threshold as weak phonemes and generates a weak phoneme list. The additional candidate text corpus stores a large amount of text that is candidate text for acoustic model learning. The weak-phoneme-containing text selection unit refers to the weak phoneme list, selects texts containing weak phonemes from the additional candidate text corpus, and outputs them as acoustic model learning text.

According to the acoustic model learning text creation device of the present invention, evaluation speech data is recognized using an existing acoustic model trained on an existing speech database, and texts containing weak phonemes with low recognition performance are selected and output. Acoustic model learning text containing the "phonemes to be learned" can therefore be extracted.

FIG. 1 is a diagram showing an example functional configuration of the acoustic model learning text creation device 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the acoustic model learning text creation device 100.
FIG. 3 is a diagram showing an example of correct phoneme sequence information and recognition-result phoneme sequence information.
FIG. 4 is a diagram showing an example functional configuration of the acoustic model learning text creation device 200 of the present invention.
FIG. 5 is a diagram showing the operation flow of the acoustic model learning text creation device 200.
FIG. 6 is a diagram showing an example of a phoneme inclusion matrix.
FIG. 7 is a diagram showing the operation flow of the text selection unit 230.
FIG. 8 is a diagram showing an example functional configuration of the acoustic model learning text creation device 300 of the present invention.
FIG. 9 is a diagram showing an example functional configuration of the acoustic model learning text creation device 400 of the present invention.
FIG. 10 is a diagram showing the functional configuration of the acoustic model learning label creation device 900 disclosed in Patent Document 1.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in multiple drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example functional configuration of the acoustic model learning text creation device 100 of the present invention, and FIG. 2 shows its operation flow. The acoustic model learning text creation device 100 includes a speech recognition processing unit 10, a recognition result tallying unit 40, a weak phoneme extraction unit 50, a weak-phoneme-containing text selection unit 60, an additional candidate text corpus 70, and a control unit 80. The acoustic model learning text creation device 100 is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program. The same applies to the other embodiments described below.

The speech recognition processing unit 10 performs speech recognition on externally input development speech data with reference to the language model 20 and the existing acoustic model 30, and outputs a recognition result text and phoneme sequence information (step S10). The language model 20 stores data that models the characteristics of the language by statistical methods, and assigns a linguistic likelihood to each speech recognition result candidate during continuous speech recognition. The existing acoustic model 30 stores a large number of acoustic models trained on the acoustic characteristics of phonemes using an existing speech database. A hidden Markov model (HMM) with Gaussian mixture output probabilities is commonly used as the acoustic model, and it is expressed in triphones, each consisting of a triplet of phonemes. For example, when "会社" (kaisha, "company"; /k/a/i/sh/a/) is expressed in triphones, the acoustic model is composed of the five phonemes "/*-k+a/k-a+i/a-i+sh/i-sh+a/a-sh+*/". The development speech data is a set of speech data different from the existing speech database and is used to evaluate the existing acoustic model 30. The development speech data may be smaller in volume than the existing speech database.

For each utterance contained in the development speech data, the speech recognition processing unit 10 performs speech recognition using the language model 20 and the existing acoustic model 30, and outputs a recognition result text and phoneme sequence information. If the recognition result text is, for example, "会社" (kaisha, "company"), its phoneme sequence information is "/*-k+a/k-a+i/a-i+sh/i-sh+a/a-sh+*/". The speech recognition performed by the speech recognition processing unit 10 with the language model 20 and the existing acoustic model 30 is the same as ordinary speech recognition.

The recognition result tallying unit 40 calculates a phoneme recognition rate from the phoneme sequence information obtained by the speech recognition processing unit 10 and the correct text of the development speech data (step S40). For example, as shown in FIG. 3, assume that the correct text is "会社" (kaisha, "company") and the recognition result is "外車" (gaisha, "foreign car"). The correct phoneme sequence "/*-k+a/k-a+i/a-i+sh/i-sh+a/a-sh+*/" then corresponds to the recognized phoneme sequence "/*-g+a/k-a+i/a-i+sh/i-sh+a/a-sh+*/".

The recognized phoneme "/*-g+a/" does not match (×) the correct phoneme "/*-k+a/", so if this were the only data for the phoneme "/*-k+a/", its phoneme recognition rate would be calculated as 0%. The phoneme recognition rates of the other phonemes would be calculated as 100%. The same processing is performed for every phoneme instance contained in the development speech data, and the tallied results give the phoneme recognition rate of each phoneme.

Phonemes have been described here using context-dependent triphones consisting of phoneme triplets, but context-independent monophones, which do not depend on the surrounding phonemes, may be used instead. Alternatively, a phoneme may be counted as correct whenever its center phoneme matches.
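The tally of step S40 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the correct and recognized phoneme sequences are already aligned one-to-one (as in the FIG. 3 example), and the function names `center` and `phoneme_recognition_rates` are hypothetical.

```python
from collections import defaultdict


def center(triphone: str) -> str:
    """Return the center phoneme of an 'l-c+r' triphone label (assumed format)."""
    return triphone.split("-", 1)[1].split("+", 1)[0]


def phoneme_recognition_rates(pairs, center_only=False):
    """Tally per-phoneme recognition rates.

    pairs: iterable of (correct, recognized) labels, assumed pre-aligned.
    center_only: if True, a match on the center phoneme counts as correct.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for ref, hyp in pairs:
        total[ref] += 1
        if center_only:
            hit = center(ref) == center(hyp)
        else:
            hit = ref == hyp
        correct[ref] += int(hit)
    return {p: correct[p] / total[p] for p in total}


# The example from the text: correct "kaisha" recognized as "gaisha".
reference = ["*-k+a", "k-a+i", "a-i+sh", "i-sh+a", "a-sh+*"]
recognized = ["*-g+a", "k-a+i", "a-i+sh", "i-sh+a", "a-sh+*"]
rates = phoneme_recognition_rates(zip(reference, recognized))
```

With only this one utterance, `rates["*-k+a"]` is 0.0 and every other entry is 1.0, matching the 0% and 100% figures in the description.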

The weak phoneme extraction unit 50 extracts phonemes whose phoneme recognition rate is equal to or less than a threshold as weak phonemes and generates a weak phoneme list (step S50). The threshold is an arbitrary value in the range 0 to 1. The closer the threshold is to 0, the lower the recognition accuracy at which a phoneme is judged to be weak. Conversely, a value close to 1 makes weak phonemes hard to extract. The average of the phoneme recognition rates of all phonemes may be used as the threshold. Alternatively, the phonemes ranked lowest in recognition rate, up to a predetermined rank, may be extracted as weak phonemes to generate the weak phoneme list.
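The three ways of forming the weak phoneme list in step S50 (fixed threshold, mean-rate threshold, lowest-ranked phonemes) might be sketched as below. The function name `extract_weak_phonemes` and its parameters are assumptions made for illustration.

```python
def extract_weak_phonemes(rates, threshold=None, bottom_n=None):
    """Return the weak phoneme list from a {phoneme: recognition_rate} dict.

    bottom_n: if given, take the bottom_n phonemes with the lowest rates.
    threshold: rate at or below which a phoneme is weak; when omitted,
               the mean recognition rate over all phonemes is used.
    """
    if bottom_n is not None:
        ranked = sorted(rates, key=lambda p: rates[p])  # lowest rate first
        return ranked[:bottom_n]
    if threshold is None:
        threshold = sum(rates.values()) / len(rates)
    return [p for p, r in rates.items() if r <= threshold]


# Illustrative rates only; these numbers are not from the source.
example_rates = {"k": 0.4, "a": 0.9, "sh": 0.7, "g": 0.2}
weak_fixed = extract_weak_phonemes(example_rates, threshold=0.5)
weak_mean = extract_weak_phonemes(example_rates)
weak_rank = extract_weak_phonemes(example_rates, bottom_n=1)
```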

The weak-phoneme-containing text selection unit 60 refers to the weak phoneme list generated by the weak phoneme extraction unit 50, selects at least a predetermined number of texts containing weak phonemes from the additional candidate text corpus, which stores a large amount of candidate text for acoustic model learning together with its phoneme sequence information, and outputs them as acoustic model learning text (step S60). The predetermined number may be given to the weak-phoneme-containing text selection unit 60 in advance as a constant, or it may be supplied externally. It is set, for example, to a value that yields a text volume of about 10% of the text volume of the training data of the existing acoustic model 30.

The processing of step S60 is repeated until it has been completed for all phonemes in the weak phoneme list. The control unit 80 controls this repetition. The control unit 80 controls the time-series operation of each unit of the acoustic model learning text creation device 100.
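A rough sketch of step S60 and its repetition over the weak phoneme list follows. The source does not specify how the "predetermined number" interacts with the loop, so this sketch simply collects every text containing any weak phoneme and reports whether the minimum was reached. The corpus layout (text paired with a monophone sequence) and the name `select_weak_phoneme_texts` are assumptions.

```python
def select_weak_phoneme_texts(corpus, weak_phonemes, min_texts):
    """Select texts that contain at least one weak phoneme.

    corpus: list of (text, phoneme_sequence) pairs.
    Returns (selected_texts, reached_minimum).
    """
    selected = []
    for phoneme in weak_phonemes:            # repeat S60 for each weak phoneme
        for text, phonemes in corpus:
            if phoneme in phonemes and text not in selected:
                selected.append(text)
    return selected, len(selected) >= min_texts


# Toy corpus with romanized texts and monophone sequences (illustrative only).
corpus = [
    ("kaisha", ["k", "a", "i", "sh", "a"]),
    ("gaisha", ["g", "a", "i", "sh", "a"]),
    ("ame", ["a", "m", "e"]),
]
texts, enough = select_weak_phoneme_texts(corpus, ["k", "g"], min_texts=2)
```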

As described above, the acoustic model learning text creation device 100 of the present invention recognizes the development speech data using the existing acoustic model 30, lists the phonemes with a low phoneme recognition rate, and selects texts containing those phonemes from the additional candidate text corpus 70 for output as acoustic model learning text. Texts containing the "phonemes to be learned" can therefore be extracted.

FIG. 4 shows an example functional configuration of the acoustic model learning text creation device 200 of the present invention, and FIG. 5 shows its operation flow. The acoustic model learning text creation device 200 differs from the acoustic model learning text creation device 100 in that it has a phoneme extraction rate calculation unit 210 in place of the weak phoneme extraction unit 50, a text selection unit 230 in place of the weak-phoneme-containing text selection unit 60, and an additional phoneme inclusion matrix generation unit 220. Only the differences from the acoustic model learning text creation device 100 are described.

Based on the per-phoneme recognition rates calculated by the recognition result tallying unit 40, the phoneme extraction rate calculation unit 210 calculates and outputs the "phoneme text extraction ratio rat_p", which indicates how many texts containing which phoneme should be selected. rat_p is calculated by, for example, Equation (1) (step S210).

Here, cor_p is the phoneme recognition rate of phoneme p, where "p" denotes an arbitrary phoneme. The phoneme text extraction ratio rat_p takes a larger value for phonemes with a lower phoneme recognition rate. By listing the rat_p values and sorting that phoneme text extraction ratio list in descending order, the phonemes and their phoneme text extraction ratios are obtained in order from the lowest phoneme recognition rate upward. The phoneme text extraction ratio list is a list of pairs of a phoneme p and the phoneme recognition rate of that phoneme p.
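Equation (1) itself is not reproduced in this text, so the sketch below uses a stand-in formula that is purely an assumption: rat_p = (1 - cor_p) / Σ_q (1 - cor_q). It is chosen only because it has the stated property that a lower recognition rate gives a larger rat_p. The function name `extraction_ratios` is hypothetical.

```python
def extraction_ratios(cor):
    """Compute a normalized extraction ratio per phoneme.

    ASSUMED stand-in for Equation (1): rat_p = (1 - cor_p) / sum_q (1 - cor_q).
    cor: {phoneme: recognition_rate in [0, 1]}.
    """
    weights = {p: 1.0 - c for p, c in cor.items()}
    total = sum(weights.values())
    if total == 0:                      # every phoneme recognized perfectly
        return {p: 0.0 for p in cor}
    return {p: w / total for p, w in weights.items()}


# Illustrative recognition rates only.
cor = {"k": 0.5, "a": 1.0, "sh": 0.75}
ratios = extraction_ratios(cor)
# Sort in descending order of rat_p, i.e. lowest recognition rate first.
ranking = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)
```

Here `ranking` lists "k" first, then "sh", then "a", which is the order of increasing recognition rate.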

The phoneme inclusion matrix generation unit 220 generates a phoneme inclusion matrix that summarizes which phonemes appear in each text stored in the additional candidate text corpus 70 (step S220). In FIG. 5, step S220 is drawn in parallel with the phoneme extraction rate calculation step (step S210), but the phoneme inclusion matrix only needs to be ready before the text selection step. The phoneme inclusion matrix may also be generated in advance.

FIG. 6 shows an example of a phoneme inclusion matrix. The first column of FIG. 6 holds the texts, and the second and subsequent columns correspond to phonemes. The value at the intersection of a text and a phoneme is the number of times that phoneme appears in the text. For the text "会社" (kaisha, "company"), whose phoneme sequence is "/*-k+a/k-a+i/a-i+sh/i-sh+a/a-sh+*/", the cells for each of these phonemes are set to 1 and the cells for phonemes not contained in "会社" are set to 0. The phoneme inclusion matrix may also be arranged with reference to the phoneme text extraction ratio rat_p calculated by the phoneme extraction rate calculation unit 210, for example with the texts ordered in descending order of that value.
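A phoneme inclusion matrix of the kind shown in FIG. 6 could be built as in the following sketch. The nested-dict representation, the romanized text keys, and the name `build_inclusion_matrix` are assumptions for illustration; the triphone labels are taken from the example in the description.

```python
from collections import Counter


def build_inclusion_matrix(corpus):
    """Build {text: {phoneme: count}} over all phonemes seen in the corpus.

    corpus: list of (text, phoneme_sequence) pairs.
    Returns (sorted list of phoneme columns, matrix).
    """
    columns = sorted({p for _, seq in corpus for p in seq})
    matrix = {}
    for text, seq in corpus:
        counts = Counter(seq)
        # Phonemes absent from the text get an explicit 0, as in FIG. 6.
        matrix[text] = {p: counts.get(p, 0) for p in columns}
    return columns, matrix


corpus = [
    ("kaisha", ["*-k+a", "k-a+i", "a-i+sh", "i-sh+a", "a-sh+*"]),
    ("gaisha", ["*-g+a", "k-a+i", "a-i+sh", "i-sh+a", "a-sh+*"]),
]
columns, matrix = build_inclusion_matrix(corpus)
```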

The text selection unit 230 selects a phoneme p according to the value of the phoneme text extraction ratio rat_p calculated by the phoneme extraction rate calculation unit 210, and, referring to the phoneme inclusion matrix, selects texts that contain the selected phoneme p.

The operation of the text selection unit 230 is described with reference to its operation flow in FIG. 7. The text selection unit 230 selects a phoneme p by referring to the phoneme text extraction ratio list (step S231). From the phoneme text extraction ratio list, which is arranged in descending order of the phoneme text extraction ratio rat_p calculated by the phoneme extraction rate calculation unit 210, the text selection unit 230 selects phonemes p, for example, in descending order.

Next, referring to the phoneme inclusion matrix, a text in which the selected phoneme p appears many times is selected (step S232). The selected text is output externally as acoustic model learning text (step S233). The selected text is then deleted from the phoneme inclusion matrix (step S234), and the number of selected texts ext_p is incremented (step S235).

These operations are repeated until the number of selected texts ext_p equals the text selection count num_p (No in step S236). The text selection count num_p may be supplied externally or set in advance in the text selection unit 230 as a constant.

Steps S231 to S236 are repeated until processing has finished for the phonemes p up to a predetermined rank in the phoneme text extraction ratio list (No in step S237). Like the text selection count num_p, this predetermined rank may be supplied externally or set in advance as a constant.
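The loop of steps S231 to S237 can be sketched as follows, under several assumptions: the matrix is a {text: {phoneme: count}} dict, "a text in which p appears many times" is read as the text with the highest count for p, and the loop for a phoneme ends early when no remaining text contains it (a case the description does not address). The names `select_texts` and `top_k` are hypothetical.

```python
def select_texts(ranked_phonemes, matrix, num_p, top_k):
    """Greedy text selection following steps S231 to S237.

    ranked_phonemes: phonemes sorted by descending rat_p.
    matrix: {text: {phoneme: count}} phoneme inclusion matrix.
    num_p: number of texts to select per phoneme.
    top_k: process only the top_k ranked phonemes (the "predetermined rank").
    """
    remaining = {t: dict(row) for t, row in matrix.items()}  # leave input intact
    output = []
    for p in ranked_phonemes[:top_k]:            # S231, loop ended by S237
        ext_p = 0
        while ext_p < num_p:                     # S236
            candidates = [t for t, row in remaining.items() if row.get(p, 0) > 0]
            if not candidates:
                break                            # assumption: nothing left for p
            best = max(candidates, key=lambda t: remaining[t][p])  # S232
            output.append(best)                  # S233: emit as learning text
            del remaining[best]                  # S234: remove from matrix
            ext_p += 1                           # S235
    return output


# Illustrative matrix with monophone columns.
matrix = {
    "t1": {"k": 2, "g": 0},
    "t2": {"k": 1, "g": 1},
    "t3": {"k": 0, "g": 3},
    "t4": {"k": 1, "g": 0},
}
chosen = select_texts(["g", "k"], matrix, num_p=2, top_k=2)
```

Deleting each chosen text from the working copy ensures that a text selected for one phoneme is not selected again for a later one.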

As described above, the acoustic model learning text creation device 200 can adopt, as acoustic model learning text, texts that contain many instances of a phoneme, taking phonemes in order from the poorest phoneme recognition rate. As a result, the training efficiency of the existing acoustic model can be improved.

The acoustic model learning text may be selected using not only the phoneme recognition rate but also phoneme frequency information. A phoneme with a low phoneme recognition rate may be one for which "the amount of training data is sufficient, but the recognition accuracy is low". In other words, some phonemes may be inherently difficult to recognize.

In that case, no matter how much the amount of data is increased, a commensurate improvement in performance cannot be expected. An acoustic model learning text creation device 300 is therefore conceivable that selects acoustic model learning text more efficiently by using the phoneme frequency information of the existing acoustic model, which was created from the existing speech database, together with the phoneme recognition rate.

FIG. 8 shows an example functional configuration of the acoustic model learning text creation device 300. The acoustic model learning text creation device 300 differs from the acoustic model learning text creation device 100 only in that the weak phoneme extraction unit 50 is replaced with a weak phoneme extraction unit 350.

When extracting phonemes whose phoneme recognition rate, as calculated by the recognition result tallying unit 40, is equal to or less than the threshold as weak phonemes, the weak phoneme extraction unit 350 also refers to phoneme frequency information to generate the weak phoneme list. The phoneme frequency information consists of pairs of each phoneme and its number of occurrences in the existing speech database.

To extract as weak phonemes those phonemes that have both a low phoneme recognition rate and a small number of occurrences, the weak phoneme extraction unit 350 outputs, as the weak phoneme list, the phonemes whose frequency in the phoneme frequency information is below a frequency threshold. The acoustic model learning text creation device 300 can thus preferentially select, as acoustic model learning text, texts containing phonemes for which the amount of training data is insufficient.
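The combined filter of the weak phoneme extraction unit 350 might look like the sketch below: a phoneme is kept only if its recognition rate is at or below the rate threshold and its occurrence count in the existing speech database is below the frequency threshold. The function name, parameter names, and sample numbers are assumptions.

```python
def extract_weak_phonemes_with_frequency(rates, counts, rate_threshold, freq_threshold):
    """Return phonemes that are poorly recognized and under-represented.

    rates: {phoneme: recognition_rate}.
    counts: {phoneme: occurrences in the existing speech database}.
    """
    return [
        p
        for p, r in rates.items()
        if r <= rate_threshold and counts.get(p, 0) < freq_threshold
    ]


# Illustrative values: "k" is poorly recognized but already frequent,
# so only "g" is reported as a weak phoneme.
rates = {"k": 0.4, "g": 0.3, "a": 0.9}
counts = {"k": 5000, "g": 120, "a": 9000}
weak = extract_weak_phonemes_with_frequency(rates, counts, 0.5, 1000)
```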

The phoneme frequency information may be supplied externally. Alternatively, the acoustic model learning text creation device 300 may include an internal phoneme frequency calculation unit 390 that generates the phoneme frequency information, consisting of pairs of each phoneme and its number of occurrences in the existing speech database.

For the acoustic model learning text creation device 200 as well, an embodiment that uses phoneme frequency information is conceivable, as in Embodiment 3. FIG. 9 shows an example functional configuration of an acoustic model learning text creation device 400 that also uses phoneme frequency information.

The acoustic model learning text creation device 400 differs from the acoustic model learning text creation device 200 only in that the phoneme extraction rate calculation unit 210 is replaced with a phoneme extraction rate calculation unit 410. Based on the per-phoneme recognition rates calculated by the recognition result tallying unit 40, the phoneme extraction rate calculation unit 410 calculates and outputs, according to Equation (2), the "phoneme text extraction ratio rat_p", which indicates how many texts containing which phoneme should be selected.

Here, occ_p is the frequency of occurrence of phoneme p. The phoneme frequency information may be supplied externally. Alternatively, a phoneme frequency calculation unit 390 may be provided inside the acoustic model learning text creation device 300 to generate the phoneme frequency information, consisting of pairs of each phoneme and its number of occurrences in the existing speech database.
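Equation (2) is not reproduced in this text. The sketch below therefore uses an assumed stand-in, rat_p = ((1 - cor_p) / occ_p) / Σ_q ((1 - cor_q) / occ_q), chosen only so that rat_p grows as the recognition rate cor_p falls and as the occurrence count occ_p falls. The function name `extraction_ratios_with_frequency` is hypothetical.

```python
def extraction_ratios_with_frequency(cor, occ):
    """Extraction ratio that also favors under-represented phonemes.

    ASSUMED stand-in for Equation (2):
        rat_p = ((1 - cor_p) / occ_p) / sum_q ((1 - cor_q) / occ_q)
    cor: {phoneme: recognition_rate}; occ: {phoneme: occurrence count > 0}.
    """
    weights = {p: (1.0 - cor[p]) / occ[p] for p in cor}
    total = sum(weights.values())
    if total == 0:                      # every phoneme recognized perfectly
        return {p: 0.0 for p in cor}
    return {p: w / total for p, w in weights.items()}


# Two phonemes with the same recognition rate but different frequencies:
# the rarer phoneme "g" receives the larger extraction ratio.
cor = {"k": 0.5, "g": 0.5}
occ = {"k": 1000, "g": 100}
ratios = extraction_ratios_with_frequency(cor, occ)
```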

The acoustic model learning text creation device 400 can adopt, as acoustic model learning text, texts containing phonemes for which the amount of training data is insufficient, taking phonemes in order from the poorest phoneme recognition rate. As a result, the training efficiency of the existing acoustic model can be improved.

As described above, the acoustic model learning text creation device of the present invention recognizes evaluation speech data using an existing acoustic model trained on an existing speech database, and selects texts containing weak phonemes with low recognition performance from the additional candidate text corpus 70 for output. Acoustic model learning text containing the "phonemes to be learned" can therefore be extracted accurately. This has the effect that easily misrecognized phonemes can be reduced efficiently even with a limited amount of acoustic model learning text.

Furthermore, the acoustic model learning text creation devices 300 and 400 of the present invention, which also use phoneme frequency information, can select texts that do not contain phonemes whose amount of training data is small but whose recognition performance is already sufficiently high. They can also select texts containing phonemes whose amount of training data is large but whose recognition performance is low.

The equation for obtaining the phoneme text extraction ratio rat_p is not limited to Equations (1) and (2). Any function may be used as long as the value of rat_p becomes higher as the phoneme recognition rate becomes lower. The phoneme text extraction ratio rat_p may also be a value based on likelihood values. The denominator of each equation may be omitted. Including the denominator normalizes rat_p, which makes it possible to bound the range of its values.
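One function that satisfies this condition can be sketched as follows. It is not Equation (1) or Equation (2), neither of which is reproduced here. The choice of 1 minus the recognition rate as the raw score, the function name, and the sample rates are assumptions made for illustration.

```python
def extraction_ratios(recognition_rate, normalize=True):
    """Map per-phoneme recognition rates to extraction ratios.

    The raw score 1 - rate grows as the recognition rate falls (assumed form).
    With normalize=True the scores are divided by their sum, so they lie in
    [0, 1] and add up to 1; with normalize=False the denominator is omitted.
    """
    raw = {p: 1.0 - r for p, r in recognition_rate.items()}
    if not normalize:
        return raw
    total = sum(raw.values())
    if total == 0:
        # Every phoneme is recognized perfectly; nothing needs extra text.
        return {p: 0.0 for p in raw}
    return {p: v / total for p, v in raw.items()}


# Hypothetical recognition rates.
rates = {"a": 0.9, "k": 0.6, "s": 0.7}
rat = extraction_ratios(rates)
```

With these rates the phoneme "k", which is recognized worst, receives the largest ratio, and the normalized values sum to 1.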

When the processing means of the above devices are implemented by a computer, the processing content of the functions that each device should have is described by a program. The processing means of each device are then realized on the computer by executing this program on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

Claims

An acoustic model learning text creation device comprising:
a speech recognition processing unit that performs speech recognition on externally input development speech data with reference to a language model and an existing acoustic model, and outputs a recognition result text and phoneme sequence information;
a recognition result totaling unit that calculates a phoneme recognition rate from the phoneme sequence information and the correct text of the development speech data;
a weak phoneme extraction unit that extracts, as weak phonemes, phonemes whose phoneme recognition rate is equal to or less than a threshold, and generates a weak phoneme list;
an additional candidate text corpus that stores a large number of texts that are candidates for acoustic model learning text; and
a weak phoneme inclusion text selection unit that refers to the weak phoneme list, selects texts containing a weak phoneme from the additional candidate text corpus, and outputs them as acoustic model learning text.

The acoustic model learning text creation device according to claim 1,
wherein the weak phoneme extraction unit
receives as input the phoneme recognition rate and externally supplied phoneme frequency information, which consists of each phoneme and appearance frequency information for that phoneme, extracts as weak phonemes those phonemes whose phoneme frequency information does not satisfy a frequency threshold, and generates the weak phoneme list.

An acoustic model learning text creation device comprising:
a speech recognition processing unit that performs speech recognition on externally input development speech data with reference to a language model and an existing acoustic model, and outputs a recognition result text and phoneme sequence information;
a recognition result totaling unit that calculates a phoneme recognition rate from the phoneme sequence information and the correct text of the development speech data;
a phoneme extraction rate calculation unit that, based on the phoneme recognition rate for each phoneme, calculates and outputs a phoneme text extraction ratio rat_p indicating which phonemes the selected texts should contain and how many such texts should be selected;
an additional candidate text corpus that stores a large number of texts that are candidates for acoustic model learning text;
a phoneme inclusion matrix generating unit that generates a phoneme inclusion matrix summarizing which phonemes appear in each text stored in the additional candidate text corpus; and
a text selection unit that selects a phoneme according to the value of the phoneme text extraction ratio rat_p, and selects texts containing the selected phoneme with reference to the phoneme inclusion matrix.

The acoustic model learning text creation device according to claim 3,
wherein the phoneme extraction rate calculation unit
receives as input the phoneme recognition rate and externally supplied phoneme frequency information, which consists of each phoneme and information on the number of appearances of that phoneme, and calculates and outputs the phoneme text extraction ratio rat_p indicating which phonemes the selected texts should contain and how many such texts should be selected.

The acoustic model learning text creation device according to claim 3 or 4,
wherein the phoneme text extraction ratio rat_p is given by a function whose value becomes higher as the phoneme recognition rate becomes lower.

A method for creating acoustic model learning text, comprising:
a speech recognition process of performing speech recognition on externally input development speech data with reference to a language model and an existing acoustic model, and outputting a recognition result text and phoneme sequence information;
a recognition result totaling process of calculating a phoneme recognition rate from the phoneme sequence information and the correct text of the development speech data;
a weak phoneme extraction process of extracting, as weak phonemes, phonemes whose phoneme recognition rate is equal to or less than a threshold, and generating a weak phoneme list; and
a weak phoneme inclusion text selection process of referring to the weak phoneme list, selecting texts containing a weak phoneme from an additional candidate text corpus that stores a large number of texts that are candidates for acoustic model learning text, and outputting them as acoustic model learning text.

A method for creating acoustic model learning text, comprising:
a speech recognition process of performing speech recognition on externally input development speech data with reference to a language model and an existing acoustic model, and outputting a recognition result text and phoneme sequence information;
a recognition result totaling process of calculating a phoneme recognition rate from the phoneme sequence information and the correct text of the development speech data;
a phoneme extraction rate calculation process of calculating and outputting, based on the phoneme recognition rate for each phoneme, a phoneme text extraction ratio rat_p indicating which phonemes the selected texts should contain and how many such texts should be selected;
a phoneme inclusion matrix generation process of generating a phoneme inclusion matrix summarizing which phonemes appear in each text stored in an additional candidate text corpus that stores a large number of texts that are candidates for acoustic model learning text; and
a text selection process of selecting a phoneme according to the value of the phoneme text extraction ratio rat_p, and selecting texts containing the selected phoneme with reference to the phoneme inclusion matrix.

A program for causing a computer to function as the acoustic model learning text creation device according to any one of claims 1 to 5.