JP2012118441A

JP2012118441A - Method, device, and program for creating acoustic model

Info

Publication number: JP2012118441A
Application number: JP2010270174A
Authority: JP
Inventors: Satoru Kobashigawa; 哲小橋川; Atsunori Ogawa; 厚徳小川; Taichi Asami; 太一浅見; Yoshikazu Yamaguchi; 義和山口; Hirokazu Masataki; 浩和政瀧; Satoshi Takahashi; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-12-03
Filing date: 2010-12-03
Publication date: 2012-06-21
Anticipated expiration: 2030-12-03
Also published as: JP5369079B2

Abstract

PROBLEM TO BE SOLVED: To automatically optimize the size of a speech recognition result lattice, which includes speech recognition errors, in a discriminative learning method.
SOLUTION: A partial learning data selection unit selects partial learning speech data from a learning speech database, and a partial recognition parameter determination unit obtains a decision recognition parameter with which the partial learning speech data yields a partial lattice of a prescribed size. A lattice creation recognition unit then uses the decision recognition parameter to generate a speech recognition result lattice. A discriminative learning unit performs discriminative learning by comparing the speech recognition result lattice with a correct symbol sequence, creating a discriminatively trained acoustic model.

Description

The present invention relates to an acoustic model creation method that creates an acoustic model using a discriminative learning method, and to an apparatus and a program therefor.

As a method for training acoustic models, discriminative learning, which improves the ability to discriminate between symbols such as phonemes, is increasingly used in place of conventional methods based on maximum likelihood estimation. Discriminative learning improves the training of the model by using, at the same time, a reference word string that contains no speech recognition errors and a recognized word string that does contain them. The recognized word string is a speech recognition result expressed in a form such as a word lattice (a directed acyclic graph that compactly represents multiple recognized word strings).

Discriminative learning methods are disclosed in, for example, Patent Documents 1 and 2 and Non-Patent Document 1. A conventional acoustic model creation apparatus 900 based on a discriminative learning method is briefly described with reference to FIG. 11. The acoustic model creation apparatus 900 includes a learning speech database 90, a language model storage unit 91, a learning acoustic model storage unit 92, a lattice creation recognition unit 93, and a discriminative learning unit 94.

The learning speech database 90 stores learning speech data, each item of which pairs speech data with its correct symbol sequence. The language model storage unit 91 stores a grammar or the like (including a pronunciation dictionary) that expresses how words connect to one another. The learning acoustic model storage unit 92 stores an acoustic model for learning that associates phonemes with speech feature quantities. The lattice creation recognition unit 93 performs speech recognition on all of the learning speech data based on the language model, the acoustic model, and the recognition parameters, and generates a word lattice of the speech recognition results (hereinafter simply called a "speech recognition result lattice" or a "word lattice"). The discriminative learning unit 94 performs discriminative learning by comparing the speech recognition result lattice with the correct symbol sequence, and generates a discriminatively trained acoustic model.

FIG. 12 shows an example of a speech recognition result lattice. It represents, as a directed acyclic graph, multiple recognized word strings, including erroneous ones, obtained by recognizing the utterance "Thank you for calling." By performing discriminative learning that compares this speech recognition result lattice with a correct lattice containing no errors, the acoustic model can be trained efficiently.
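To make the idea of a lattice as a compact directed acyclic graph concrete, the following minimal sketch stores a toy lattice as a list of edges and enumerates the word strings it encodes. The node numbering, the edge list, and the helper name `enumerate_paths` are hypothetical and chosen only for illustration; they are not taken from FIG. 12.

```python
from collections import defaultdict

# Hypothetical toy lattice: each edge is (start_node, end_node, word).
# Node 0 is the start node and node 3 is the end node.
EDGES = [
    (0, 1, "thank"),
    (0, 1, "tank"),
    (1, 2, "you"),
    (1, 2, "ewe"),
    (2, 3, "for calling"),
]


def enumerate_paths(edges, start, end):
    """Return every word sequence encoded by the lattice between start and end."""
    outgoing = defaultdict(list)
    for src, dst, word in edges:
        outgoing[src].append((dst, word))

    def walk(node):
        if node == end:
            return [[]]
        paths = []
        for dst, word in outgoing[node]:
            for tail in walk(dst):
                paths.append([word] + tail)
        return paths

    return walk(start)


paths = enumerate_paths(EDGES, 0, 3)
print(len(EDGES), "edges encode", len(paths), "word sequences")
```

Five edges are enough to encode four competing word strings here, which is why a lattice is compact per hypothesis yet can still become very large in total when many alternatives are kept.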

Patent Document 1: JP 2007-322984 A
Patent Document 2: JP 2006-201553 A

Non-Patent Document 1: Erik McDermott and Atsushi Nakamura, “String and Lattice based Discriminative Training for the Corpus of Spontaneous Japanese Lecture Transcription Task”, INTERSPEECH, pp. 2081-2084, Aug. 2007.

Because the conventional acoustic model creation apparatus 900 uses a large amount of memory to create word lattices, it puts pressure on memory capacity (disk capacity). As storage capacities have grown in recent years, the amount of speech data stored in the learning speech database has been reaching a large scale, on the order of several hundred hours. When speech recognition is performed on such large-scale speech data to generate word lattices, the amount of lattice data becomes enormous, because, as described above, a word lattice contains many possible speech recognition results. In addition, the size of a word lattice depends on the sound quality of the learning speech data, the acoustic model, the language model, the speech recognition parameters, and so on, so it is difficult to predict in advance.

For this reason, if word lattices are generated directly from large-scale speech data, the disk capacity may be used up and the apparatus may run out of memory, leaving the acoustic model creation apparatus 900 inoperable. If, to prevent this, a recognition parameter such as the search beam width is narrowed, some speech data will yield no word lattice at all, and the accuracy of the trained acoustic model decreases.

The present invention has been made in view of these points. Its object is to provide an acoustic model creation method, and an apparatus and a program therefor, that automatically optimize the speech recognition parameters used when generating a speech recognition result lattice so that word lattices of an appropriate size are generated.

The acoustic model creation method of the present invention includes a partial learning data selection process, a partial lattice creation recognition process, a partial recognition parameter determination process, a lattice creation recognition process, and a discriminative learning process. The partial learning data selection process selects partial learning speech data from the learning speech data stored in a learning speech database, each item of which pairs speech data with its correct symbol sequence. The partial lattice creation recognition process generates a partial lattice by performing speech recognition on the partial learning speech data using the language model stored in a language model storage unit, the learning acoustic model stored in a learning acoustic model storage unit, and the control recognition parameter obtained in the partial recognition parameter determination process. The partial recognition parameter determination process evaluates the size of the partial lattice, adjusts the control recognition parameter accordingly, and outputs, as the decision recognition parameter, the control recognition parameter with which a partial lattice of a predetermined size was obtained. The lattice creation recognition process performs speech recognition on all of the learning speech data based on the language model, the learning acoustic model, and the decision recognition parameter, and generates a speech recognition result lattice. The discriminative learning process performs discriminative learning by comparing the speech recognition result lattice with the correct symbol sequence, and generates a discriminatively trained acoustic model.

The acoustic model creation method of the present invention selects partial learning speech data from the learning speech database and obtains a decision recognition parameter with which that partial learning speech data yields a partial lattice of a predetermined size. Because the speech recognition result lattice is then generated using that decision recognition parameter, the size of the speech recognition result lattice can be made appropriate automatically.

FIG. 1 is a diagram showing an example of the functional configuration of the acoustic model creation apparatus 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the acoustic model creation apparatus 100.
FIG. 3 is a diagram showing an example of the functional configuration of the acoustic model creation apparatus 200 of the present invention.
FIG. 4 is a diagram showing the operation flow of the acoustic model creation apparatus 200.
FIG. 5 is a diagram showing an example of the functional configuration of the acoustic model creation apparatus 300 of the present invention.
FIG. 6 is a diagram showing the operation flow of the acoustic model creation apparatus 300.
FIG. 7 is a diagram showing an example of the functional configuration of the acoustic model creation apparatus 400 of the present invention.
FIG. 8 is a diagram showing the operation flow of the acoustic model creation apparatus 400.
FIG. 9 is a diagram showing an example of the functional configuration of the acoustic model creation apparatus 500 of the present invention.
FIG. 10 is a diagram showing an example of the functional configuration of the acoustic model creation apparatus 600.
FIG. 11 is a diagram showing the functional configuration of the conventional acoustic model creation apparatus 900.
FIG. 12 is a diagram showing an example of a speech recognition result lattice.

Embodiments of the present invention are described below with reference to the drawings. Components that are the same across multiple drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example of the functional configuration of the acoustic model creation apparatus 100 of the present invention, and FIG. 2 shows its operation flow. The acoustic model creation apparatus 100 includes a learning speech database 90, a partial learning data selection unit 11, a partial lattice creation recognition unit 12, a partial recognition parameter determination unit 13, a language model storage unit 91, a learning acoustic model storage unit 92, a lattice creation recognition unit 93, and a discriminative learning unit 94. The functions of the units other than the database and the storage units are realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The acoustic model creation apparatus 100 differs from the conventional acoustic model creation apparatus 900 in that it additionally includes the partial learning data selection unit 11, the partial lattice creation recognition unit 12, and the partial recognition parameter determination unit 13.

The acoustic model creation apparatus 100 performs speech recognition processing on digital speech signals, and the learning speech database 90 holds speech data converted into digital signals as a plurality of audio files. The speech data is processed frame by frame, where one frame is a time interval of, for example, 20 ms.

The learning speech database 90 records learning speech data, each item of which pairs speech data with its correct symbol sequence. The partial learning data selection unit 11 selects partial learning speech data from the learning speech database 90 (step S11). The selection may be based on the size of the learning speech data, that is, audio files with a large amount of data may be chosen as the partial learning speech data; alternatively, audio files may be chosen at random, with their data amounts treated as known.
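The two selection strategies for step S11 can be sketched as follows. This is only an illustrative sketch: the function name `select_partial_data`, the `strategy` argument, and the file names are assumptions, and the number of files to pick (`k`) is not specified in the text.

```python
import random


def select_partial_data(file_sizes, k, strategy="largest", seed=0):
    """Pick k audio files for the partial lattice pass.

    file_sizes maps a file name to its data amount (for example bytes or frames).
    "largest" takes the k biggest files; "random" samples k files uniformly.
    """
    names = sorted(file_sizes)
    if strategy == "largest":
        return sorted(names, key=lambda n: file_sizes[n], reverse=True)[:k]
    if strategy == "random":
        return random.Random(seed).sample(names, k)
    raise ValueError(f"unknown strategy: {strategy}")


sizes = {"a.wav": 1200, "b.wav": 5400, "c.wav": 300, "d.wav": 4100}
print(select_partial_data(sizes, 2))            # the two largest files
print(select_partial_data(sizes, 2, "random"))  # two files chosen at random
```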

The partial lattice creation recognition unit 12 generates a partial lattice by performing speech recognition on the partial learning speech data using the language model stored in the language model storage unit 91, the learning acoustic model stored in the learning acoustic model storage unit 92, and the control recognition parameter input from the partial recognition parameter determination unit 13 (step S12).

When the partial lattice is generated for the first time, no control recognition parameter has been determined yet, so the initial recognition parameter 120 set in advance in the partial lattice creation recognition unit 12 is used as the recognition parameter. Recognition parameters include the search beam width and the language weight. The search beam width is the cutoff width applied to hypotheses of the speech recognition result; as the initial recognition parameter 120, for example, 1000 hypotheses are searched. The language weight is the weight by which the language score is multiplied when the confidence score is expressed as the sum of the acoustic score and the language score; as the initial recognition parameter 120 it is set to a value such as 10. Word lattices are in common use and are also described in Non-Patent Document 1 cited above. Because the generation of the partial lattice itself is not the main part of the present invention, a detailed description is omitted.
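As a rough illustration of how these two recognition parameters act, the sketch below combines an acoustic score and a weighted language score and then keeps only the best hypotheses. It assumes log-domain scores and interprets the beam width as a cap on the number of surviving hypotheses; the function names and the example hypotheses are hypothetical, and a real decoder applies pruning during search rather than on a finished list.

```python
def combined_score(acoustic_score, language_score, language_weight=10.0):
    """Log-domain hypothesis score: acoustic score plus weighted language score."""
    return acoustic_score + language_weight * language_score


def prune_to_beam(hypotheses, beam_width=1000):
    """Keep at most beam_width hypotheses with the highest combined score.

    Each hypothesis is a tuple (text, acoustic_score, language_score).
    """
    ranked = sorted(
        hypotheses,
        key=lambda h: combined_score(h[1], h[2]),
        reverse=True,
    )
    return ranked[:beam_width]


hyps = [
    ("thank you", -120.0, -2.0),
    ("tank you", -118.0, -6.0),
    ("thank ewe", -125.0, -7.5),
]
kept = prune_to_beam(hyps, beam_width=2)
print([h[0] for h in kept])
```

A wider beam keeps more competing hypotheses and therefore produces a larger lattice, which is why the beam width is a natural control knob for lattice size.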

The partial recognition parameter determination unit 13 evaluates the size of the partial lattice output by the partial lattice creation recognition unit 12, adjusts the control recognition parameter accordingly, and outputs, as the decision recognition parameter, the control recognition parameter with which a partial lattice of a predetermined size was obtained (step S13). Specifically, the partial recognition parameter determination unit 13 compares the partial lattice size L_E with the target lattice size L_T and adjusts the control recognition parameter so that L_E approaches the target.

When the search beam width B is used as the control recognition parameter, B is widened if the partial lattice size L_E is smaller than the target lattice size L_T, and narrowed if L_E is larger. For example, the updated search beam width B′ may be obtained as B′ = B / r, where r = L_E / L_T is the ratio of the partial lattice size to the target lattice size. The control recognition parameter with which a partial lattice of the predetermined size was obtained is then output as the decision recognition parameter.
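The update rule B′ = B / r can be written directly in code. The function name `update_beam_width` is an illustrative choice; only the formula itself comes from the text.

```python
def update_beam_width(beam_width, lattice_size, target_size):
    """Return B' = B / r, where r = L_E / L_T.

    A lattice larger than the target (r > 1) narrows the beam;
    a lattice smaller than the target (r < 1) widens it.
    """
    if lattice_size <= 0 or target_size <= 0:
        raise ValueError("lattice_size and target_size must be positive")
    r = lattice_size / target_size
    return beam_width / r


print(update_beam_width(1000, 200, 100))  # lattice twice too large: beam halves
print(update_beam_width(1000, 50, 100))   # lattice half the target: beam doubles
```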

Because the target lattice size L_T varies with the size of the partial learning speech data, it is set to a value that takes into account the number of frames in the partial learning speech data (Equation (1)).

Here, L_E is the size of the partial lattice generated from the partial learning speech data, N_E is the number of frames (file length) of the partial learning speech data, and N is the total number of frames of all the learning speech data.
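Equation (1) itself is not reproduced in this text. One plausible reading, offered purely as an assumption, is that a lattice budget for the whole corpus is scaled by the fraction of frames N_E / N covered by the partial learning speech data. The sketch below implements that assumed proportional rule; the name `target_lattice_size` and the `total_budget` parameter do not appear in the original.

```python
def target_lattice_size(total_budget, partial_frames, total_frames):
    """Assumed form of the target: L_T = total_budget * N_E / N.

    total_budget   : lattice size allowed for the whole corpus (assumption)
    partial_frames : N_E, frames in the partial learning speech data
    total_frames   : N, frames in all learning speech data
    """
    if total_frames <= 0 or not 0 < partial_frames <= total_frames:
        raise ValueError("need 0 < partial_frames <= total_frames")
    return total_budget * partial_frames / total_frames


# Example: a 100 GB budget and a partial set covering 1% of all frames.
print(target_lattice_size(100 * 10**9, 1_000, 100_000))
```

Under this assumption, a partial lattice that hits its target implies that the full-corpus lattices would stay near the overall budget if lattice size grows roughly in proportion to frame count.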

The target lattice size L_T may also be obtained from the ratio between the remaining capacity of the disk to which the word lattices are written and the total size of the learning speech data. Alternatively, when correct labels are supplied as input, L_T may be determined relative to the lattice size obtained when the correct labels are given as the language model (for example, 10 times that reference size).

The partial recognition parameter determination unit 13 compares the partial lattice size L_E with the target lattice size L_T and outputs, as the decision recognition parameter, the control recognition parameter with which L_E became approximately equal to L_T (or with which the difference fell below a fixed threshold, for example 1% or less). A limit on the number of adjustment iterations may also be imposed.
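Putting steps S12 and S13 together, the adjustment loop might look like the sketch below. It uses the B′ = B / r update, a 1% relative tolerance, and an iteration limit as mentioned in the text. The recognizer is replaced by a stand-in function, since actual lattice generation is outside the scope of this sketch; `determine_beam_width`, `fake_recognize`, and the growth law used in the stand-in are all hypothetical.

```python
def determine_beam_width(recognize, target_size, initial_beam=1000.0,
                         tolerance=0.01, max_iterations=20):
    """Adjust the beam width until the partial lattice size is near the target.

    recognize(beam) must return the partial lattice size L_E for that beam.
    Returns (beam, lattice_size). If the iteration limit is reached, the last
    beam is returned even though it may not satisfy the tolerance.
    """
    beam = initial_beam
    for _ in range(max_iterations):
        size = recognize(beam)
        if abs(size - target_size) <= tolerance * target_size:
            return beam, size
        beam = beam * target_size / size  # B' = B / r with r = L_E / L_T
    return beam, recognize(beam)


def fake_recognize(beam):
    """Stand-in for the recognizer: lattice size grows with the beam width."""
    return 50.0 * beam ** 1.5


beam, size = determine_beam_width(fake_recognize, target_size=1_000_000.0)
print(round(beam, 1), round(size))
```

The iteration limit matters because the simple ratio update is not guaranteed to settle for every relationship between beam width and lattice size.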

The lattice creation recognition unit 93 performs speech recognition on all of the learning speech data based on the language model stored in the language model storage unit 91, the learning acoustic model stored in the learning acoustic model storage unit 92, and the decision recognition parameter, and generates a speech recognition result lattice (step S93).

The discriminative learning unit 94 performs discriminative learning by comparing the speech recognition result lattice generated by the lattice creation recognition unit 93 with the correct symbol sequence stored in the learning speech database 90, and generates a discriminatively trained acoustic model (step S94).

In this way, the acoustic model creation apparatus 100 generates a partial lattice from part of the learning speech data and determines the recognition parameter for speech recognition so that the partial lattice reaches a predetermined size. It can therefore automatically generate a speech recognition result lattice of an appropriate size.

The speech recognition result lattice may be replaced with an N-best list of speech recognition results. In that case, the partial lattice creation recognition unit 12 and the partial recognition parameter determination unit 13 adjust the control recognition parameter so that an N-best list of an appropriate size is obtained instead of a word lattice, and thereby obtain the decision recognition parameter. The lattice creation recognition unit 93 then uses that decision recognition parameter to generate the N-best list of speech recognition results, so that an N-best list of an appropriate size is generated automatically.

FIG. 3 shows an example of the functional configuration of the acoustic model creation apparatus 200 of the present invention, and FIG. 4 shows its operation flow. The acoustic model creation apparatus 200 differs from the acoustic model creation apparatus 100 described above only in its partial learning data selection unit 20 and its partial lattice creation recognition unit 12′.

The partial learning data selection unit 20 includes initial lattice size calculation means 201. The initial lattice size calculation means 201 performs speech recognition on the audio files of the learning speech data recorded in the learning speech database 90 using the initial recognition parameter 120 (not shown; it may be the same as the one shown in FIG. 1), and calculates the initial lattice size of each audio file (step S201). The initial lattice size may be obtained for all of the learning speech data or only for a limited number of files. The partial learning data selection unit 20 then selects audio files with a large initial lattice size as the partial learning speech data (step S202).

The partial lattice creation recognition unit 12′ passes the initial lattice size information to the partial recognition parameter determination unit 13, and then creates a partial lattice of the partial learning speech data using the control recognition parameter input from the partial recognition parameter determination unit 13 (step S12′). The partial recognition parameter determination unit 13 adjusts the control recognition parameter so that the partial lattice reaches the predetermined size, and outputs the decision recognition parameter (step S13).

Because the decision recognition parameter is set using audio files with a large initial lattice size, the risk of exceeding the disk capacity when generating the speech recognition result lattice is reduced, and the speech recognition result lattice can be generated more efficiently.

[Modification]
The partial learning data selection unit 20′ may instead select, as the partial learning speech data, audio files whose partial lattice size changes greatly in response to a change in the recognition parameter. The partial learning data selection unit 20′ includes initial lattice size calculation means 201′.

The initial lattice size calculation means 201′ performs speech recognition using the initial recognition parameter 120 to calculate the initial lattice size of each audio file, and also calculates a second lattice size of each audio file using a second recognition parameter obtained by changing the initial recognition parameter 120 (step S201′).

The partial learning data selection unit 20′ selects audio files with a large difference between the initial lattice size and the second lattice size as the partial learning speech data (step S20′). By setting the second recognition parameter to a value only slightly different from the initial recognition parameter (for example, a search beam width changed by about 10%), the partial learning data selection unit 20′ can select highly sensitive audio files, that is, files whose lattice size changes greatly in response to a change in the recognition parameter.
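The sensitivity-based selection of this modification can be sketched as follows, assuming the two lattice sizes per file have already been measured. The function name `select_most_sensitive`, the number of files `k`, and the example sizes are hypothetical.

```python
def select_most_sensitive(initial_sizes, second_sizes, k):
    """Pick the k files whose lattice size changes most between two settings.

    initial_sizes : lattice size per file with the initial recognition parameter
    second_sizes  : lattice size per file with the slightly changed parameter
    """
    diffs = {
        name: abs(second_sizes[name] - initial_sizes[name])
        for name in initial_sizes
    }
    return sorted(diffs, key=diffs.get, reverse=True)[:k]


initial = {"a.wav": 100, "b.wav": 400, "c.wav": 250}
second = {"a.wav": 105, "b.wav": 520, "c.wav": 300}  # e.g. beam widened by 10%
print(select_most_sensitive(initial, second, 2))
```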

The partial recognition parameter determination unit 13 sets the decision recognition parameter using audio files that are highly sensitive to changes in the recognition parameter. The lattice creation recognition unit 93 can therefore generate a speech recognition result lattice of a more appropriate size.

FIG. 5 shows an example of the functional configuration of the acoustic model creation apparatus 300 of the present invention, and FIG. 6 shows its operation flow. The acoustic model creation apparatus 300 differs from the acoustic model creation apparatuses 100 and 200 described above only in its partial learning data selection unit 30.

The partial learning data selection unit 30 includes feature extraction means 301 and confidence score calculation means 302. The feature extraction means 301 converts the speech data of each audio file of the learning speech data into a speech feature o_t for each frame (step S301). As the speech feature o_t, for example, the 1st to 12th order MFCCs (Mel-Frequency Cepstrum Coefficients), dynamic parameters such as ΔMFCC (their amount of change), power, and Δpower are used. Processing such as cepstral mean normalization (CMN) may also be applied.

The confidence score calculation means 302 converts the acoustic score and the language score of the speech recognition result for the sequence of speech features o_t into an estimated confidence, and from that confidence calculates, for each audio file, a confidence score normalized by the file length of that file (step S302). The confidence score is estimated from the speech recognition result, and a measure conventionally used in speech recognition apparatuses may be used.

The partial learning data selection unit 30 selects audio files with a low confidence score as the partial learning speech data (step S303). The confidence scores may be calculated for all of the learning speech data, or the partial learning speech data may be selected by comparing confidence scores obtained for only a limited number of files.

When a speech recognition result lattice is generated from an audio file with a low confidence score, there are more competing word candidates, so the lattice becomes larger. Obtaining the decision recognition parameter from audio files with low confidence scores therefore keeps the total size of the speech recognition result lattices down and reduces the risk that the total lattice size exceeds the disk capacity.

FIG. 7 shows an example of the functional configuration of the acoustic model creation apparatus 400 of the present invention, and FIG. 8 shows its operation flow. The acoustic model creation apparatus 400 differs from the acoustic model creation apparatus 300 described above in that its partial learning data selection unit 40 includes fast prior confidence score calculation means 401.

For the speech feature of each frame, the fast prior confidence score calculation means 401 takes the product of the output probability b_s obtained from a monophone GMM (Gaussian Mixture Model) and the occurrence probability of the state to which that GMM belongs, and uses the largest such product as the monophone maximum likelihood value P(ŝ) b_ŝ(o_t). It then calculates the fast prior confidence score by averaging, over each audio file, the difference between the logarithm of the monophone maximum likelihood value and the logarithm of the speech/pause maximum likelihood value b_ĝ(o_t) (step S401). Equation (2) gives the per-frame fast prior confidence score c(o_t), and Equation (3) gives the fast prior confidence score C averaged over an audio file.
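Equations (2) and (3) are not reproduced in this text. Based solely on the verbal description above, they presumably take the following form; this is a reconstruction from the prose, and details such as the logarithm base or the exact definition of the speech/pause model are assumptions.

```latex
% Equation (2): per-frame fast prior confidence score (reconstructed)
c(o_t) = \log\bigl( P(\hat{s})\, b_{\hat{s}}(o_t) \bigr) - \log b_{\hat{g}}(o_t),
\qquad
\hat{s} = \arg\max_{s} P(s)\, b_{s}(o_t),
\quad
\hat{g} = \arg\max_{g} b_{g}(o_t)

% Equation (3): score averaged over the T frames of one audio file (reconstructed)
C = \frac{1}{T} \sum_{t=1}^{T} c(o_t)
```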

Here, T is the total number of frames in each audio file. The high-speed prior reliability score C may also be calculated using the method described in Reference 1: Kobashikawa, Asami, Yamaguchi, Masataki, Takahashi, "Speech recognition target data selection based on prior confidence estimation" (in Japanese), Proceedings of the Acoustical Society of Japan, March 2010, or in Reference 2: Kobashikawa, Asami, Yamaguchi, Masataki, Takahashi, "Efficient Data Selection for Speech Recognition Based on Prior Confidence Estimation Using Speech and Context Independent Models", INTERSPEECH 2010, pp. 238-241, September 2010.

The partial learning data selection unit 40 selects audio files with a low high-speed prior reliability score C as the partial learning speech data (step S403). The high-speed prior reliability score C may be calculated for all of the learning speech data, or it may be calculated for only a limited number of audio files, whose scores are then compared to select the partial learning speech data.
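
The score calculation and file selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes that the log prior log P(s) and log output probability log b_s(o_t) of every monophone state, and the log output probabilities of the speech and pause GMMs, have already been computed for each frame. All function and variable names are hypothetical.

```python
def frame_score(state_log_priors, state_log_likes, speech_pause_log_likes):
    """Per-frame score c(o_t): log of the best P(s) * b_s(o_t)
    minus log of the best speech/pause likelihood."""
    monophone_best = max(p + l for p, l in zip(state_log_priors, state_log_likes))
    speech_pause_best = max(speech_pause_log_likes)
    return monophone_best - speech_pause_best


def file_score(frames):
    """File-level score C: mean of c(o_t) over the T frames of one audio file.

    Each frame is a tuple (state_log_priors, state_log_likes, speech_pause_log_likes).
    """
    return sum(frame_score(*frame) for frame in frames) / len(frames)


def select_partial_learning_files(file_scores, num_files):
    """Return the names of the num_files audio files with the smallest score C."""
    return sorted(file_scores, key=file_scores.get)[:num_files]
```

Sorting in ascending order puts the least reliable files first, which matches the stated goal of tuning the recognition parameter on the files expected to produce the largest lattices.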

Calculating the reliability score from the output probabilities of a monophone GMM in this way makes the processing faster than when an acoustic model such as a triphone or biphone model is used. The acoustic model creation device 400 can therefore generate the speech recognition result lattice faster than the acoustic model creation device 300.

FIG. 9 shows an example functional configuration of an acoustic model creation device 500 of the present invention. The acoustic model creation device 500 differs from the acoustic model creation device 100 only in that its language model storage unit is a lattice creation language model storage unit 50, which stores a lattice creation language model created from the audio files of the learning speech database.

Creating the language model used for word lattice generation from the same learning data that is used for the acoustic model eliminates unknown words and reduces the number of competing candidates, so the capacity of the speech recognition result lattice can be reduced. As a result, the disk can be used efficiently.

Although the acoustic model creation device 500 has been described on the basis of the acoustic model creation device 100, the same effect can be expected by replacing the language model storage unit of any of the other acoustic model creation devices 200 to 400 with the lattice creation language model storage unit 50.

FIG. 9 also shows, with a broken line, a lattice creation language model creation unit 51 that creates a language model from the correct symbol sequences recorded in the learning speech database 90. The lattice creation language model creation unit 51 may be integrated into the acoustic model creation device 500 in this way so that the lattice creation language model is generated as needed. Alternatively, the lattice creation language model creation unit 51 may be provided separately, and only the lattice creation language model storage unit 50, which stores a lattice creation language model created in advance, may be used.
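
As a rough illustration of building a lattice creation language model from the correct symbol sequences alone, the sketch below estimates a maximum-likelihood bigram model. The text does not specify the type of language model, so the bigram form, the sentence-boundary tokens `<s>` and `</s>`, and all names are assumptions.

```python
from collections import Counter, defaultdict


def build_bigram_lm(transcripts):
    """Estimate maximum-likelihood bigram probabilities from token sequences."""
    counts = defaultdict(Counter)
    for tokens in transcripts:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1
    lm = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        lm[prev] = {word: n / total for word, n in followers.items()}
    return lm


def vocabulary(transcripts):
    """Vocabulary taken only from the transcripts, so the learning data has no unknown words."""
    return {word for tokens in transcripts for word in tokens}
```

Because the vocabulary is taken from the transcripts themselves, every word in the learning data is in-vocabulary, which is the property the text relies on to reduce competing candidates.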

FIG. 10 shows an example functional configuration of an acoustic model creation device 600 of the present invention. The acoustic model creation device 600 differs from the acoustic model creation device 500 in that it further includes a language model creation data preparation unit 60.

The language model creation data preparation unit 60 extracts, from the learning speech database 90, the katakana or hiragana reading labels corresponding to the correct symbol sequences. The lattice creation language model creation unit 51 creates the language model from these correct reading labels.

Using the correct reading labels reduces notational variation and curbs growth of the vocabulary size, so the lattice capacity can be reduced.

Depending on how the reading labels are segmented into words, a word boundary may split a sound that should be treated as a diphthong. Diphthongs are, for example, "ei", "ou", and "iu", each of which is a single sound. If such a diphthong is split as "e/i", the result is notational variation, which increases the vocabulary size.

Therefore, the frames of a section that should be a diphthong are concatenated so that speech with the same acoustic features is not assigned to different phonemes. This reduces confusion between phonemes and improves the accuracy of the acoustic model. It can also reduce the capacity of the speech recognition result lattice.

Specifically, the diphthongs to be joined are registered in a diphthong list, each candidate diphthong is checked against that list, and if it is in the list, no word boundary is placed within it. Another conceivable method is to create the language model after a pause insertion decision that compares the likelihood obtained when the sound is split and a pause is inserted with the likelihood obtained when it is joined.
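
The list-based boundary check described above can be sketched as follows. This is a minimal illustration that assumes romanized reading labels and a hypothetical three-entry diphthong list taken from the examples in the text. The function name and the rule of merging the two words on either side of the boundary are assumptions, and the likelihood-based pause insertion decision is not covered.

```python
# Hypothetical diphthong list, using only the examples given in the text.
DIPHTHONGS = {"ei", "ou", "iu"}


def merge_diphthong_boundaries(words, diphthongs=DIPHTHONGS):
    """Remove word boundaries that would split a listed diphthong.

    words is a list of romanized reading labels, one per word.
    """
    merged = []
    for word in words:
        # Sound formed by the last character of the previous word
        # and the first character of the current word.
        if merged and word and (merged[-1][-1:] + word[:1]) in diphthongs:
            merged[-1] += word  # join instead of starting a new word
        else:
            merged.append(word)
    return merged
```

For example, the segmentation `["se", "i", "to"]` would become `["sei", "to"]`, so "ei" stays inside a single vocabulary entry.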

As described above, the acoustic model creation devices 100 to 500 of the present invention can limit the capacity of the speech recognition result lattice they generate. Because the capacity of the speech recognition result lattice can be roughly predicted in advance, the free disk space to be prepared can be decided appropriately. Appropriate speech recognition parameters can also be determined automatically, so a highly accurate discriminatively trained acoustic model can be generated automatically in a way that is adapted to the disk capacity. Furthermore, because the acoustic model creation device 600 of the present invention creates the language model from the learning speech database 90, it has the further advantage, among others, that no language model needs to be prepared separately.
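
The automatic parameter determination summarized above (evaluate the partial lattice capacity, adjust the control recognition parameter, and stop once a predetermined capacity is obtained) can be sketched as follows. This is a minimal illustration under stated assumptions: the bisection strategy, the single beam-like numeric parameter, the tolerance, and the `measure_lattice_size` callback are all hypothetical and are not taken from the patent. The sketch also assumes that the lattice capacity does not decrease as the parameter grows.

```python
def determine_recognition_parameter(measure_lattice_size, target_size,
                                    low, high, tolerance=0.05, max_iter=20):
    """Bisect a beam-like parameter until the partial lattice capacity
    is within tolerance of target_size.

    measure_lattice_size(param) is a hypothetical callback that runs
    recognition on the partial learning speech data and returns the
    capacity of the resulting partial lattice.
    """
    param = (low + high) / 2.0
    for _ in range(max_iter):
        param = (low + high) / 2.0
        size = measure_lattice_size(param)
        if abs(size - target_size) <= tolerance * target_size:
            break  # a partial lattice of the predetermined capacity was obtained
        if size > target_size:
            high = param  # lattice too large: narrow the search
        else:
            low = param   # lattice too small: widen the search
    return param
```

The returned value plays the role of the determination recognition parameter, which would then be used to recognize all of the learning speech data.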

The word lattice of the speech recognition result described above may be replaced with the N-best list of the speech recognition result. Even in that case, an acoustic model creation method with the same effect can be provided.

When the processing means of the above devices are implemented by a computer, the processing of the functions that each device should have is described by a program. Executing this program on the computer implements the processing means of each device on the computer.

The processes described for the above methods and devices need not be executed in time series in the order described. They may be executed in parallel or individually, depending on the processing capability of the device executing them or as needed.

The program describing this processing can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing may be implemented in hardware.

Claims

A partial learning data selection process of selecting partial learning speech data from learning speech data, each item of which is a pair of speech data stored in a learning speech database and its correct symbol sequence;
A partial lattice creation recognition process of generating a partial lattice by performing speech recognition on the partial learning speech data using a language model stored in a language model storage unit, a learning acoustic model stored in a learning acoustic model storage unit, and a control recognition parameter obtained in a partial recognition parameter determination process;
The partial recognition parameter determination process of controlling the control recognition parameter by evaluating the capacity of the partial lattice, and outputting, as a determination recognition parameter, the control recognition parameter with which a partial lattice of a predetermined capacity is obtained;
A lattice creation recognition process of performing speech recognition on all of the learning speech data based on the language model, the learning acoustic model, and the determination recognition parameter to generate a speech recognition result lattice;
A discriminative learning process of performing discriminative learning by comparing the speech recognition result lattice with the correct symbol sequence to generate a discriminatively trained acoustic model;
An acoustic model creation method comprising:

The acoustic model creation method according to claim 1,
The partial learning data selection process includes:
An initial lattice capacity calculation step of performing speech recognition on the audio files of the learning speech data using an initial recognition parameter and calculating an initial lattice capacity for each of the audio files;
The partial learning data selection process being a process of selecting audio files having a large initial lattice capacity as the partial learning speech data.

In the acoustic model creation method according to claim 2,
The initial lattice capacity calculating step further calculates a second lattice capacity using a second recognition parameter obtained by changing the initial recognition parameter.
The acoustic model creation method, wherein the partial learning data selection process is a process of selecting an audio file having a large difference between the initial lattice capacity and the second lattice capacity as the partial learning audio data.

The acoustic model creation method according to claim 1,
The partial learning data selection process includes:
A feature analysis step of converting the speech data of each audio file of the learning speech data into speech features for each frame; and
A reliability score calculation step of converting the speech features into a reliability composed of an acoustic score and a language score, and calculating, for each audio file, a reliability score obtained by normalizing that reliability by the length of the file;
The partial learning data selection process being a process of selecting audio files having a small reliability score as the partial learning speech data.

The acoustic model creation method according to claim 1,
The partial learning data selection process includes:
A feature analysis step of converting the speech data of each audio file of the learning speech data into speech features for each frame; and
A high-speed prior reliability score calculation step of obtaining, as a monophone maximum likelihood value, the largest product of the output probability obtained from a monophone GMM for the speech features of each frame and the occurrence probability of the state to which that GMM belongs, and calculating a high-speed prior reliability score by averaging, over each audio file, the difference between the logarithm of the monophone maximum likelihood value and the logarithm of a speech/pause maximum likelihood value;
The partial learning data selection process being a process of selecting audio files having a small high-speed prior reliability score as the partial learning speech data.

The acoustic model creation method according to any one of claims 1 to 5,
The acoustic model creation method, wherein the language model is a lattice creation language model created from the audio files of the learning speech database.

In the acoustic model creation method according to claim 6,
The acoustic model creation method, wherein the lattice creation language model is a language model created from a correct reading label.

In the acoustic model creation method according to claim 7,
The acoustic model creation method, wherein no word boundary is placed within a diphthong in the correct reading labels.

A speech database for learning that records speech data for training that includes speech data and a correct symbol sequence thereof;
A language model storage unit storing a grammar expressing a connection relation between words as a language model;
An acoustic model storage unit for learning that stores an acoustic model for learning that associates phonemes and feature quantities of speech;
A partial learning data selection unit for selecting partial learning voice data from the learning voice data;
A partial lattice creation recognition unit that generates a partial lattice by performing speech recognition on the partial learning speech data using the language model, the learning acoustic model, and a control recognition parameter input from a partial recognition parameter determination unit;
A partial recognition parameter determination unit that evaluates the capacity of the partial lattice to control the recognition parameter for control and outputs the recognition parameter for control from which the partial lattice of a predetermined capacity is obtained as a determination recognition parameter;
A lattice creation recognizing unit that performs speech recognition on all the learning speech data based on the language model, the learning acoustic model, and the determination recognition parameter, and generates a speech recognition result lattice;
A discriminative learning unit that performs discriminative learning by comparing the speech recognition result lattice with the correct symbol sequence to generate a discriminatively trained acoustic model;
An acoustic model creation device comprising:

In the acoustic model creation device according to claim 9,
The partial learning data selection unit includes an initial lattice capacity calculation unit that performs a speech recognition process using an initial recognition parameter for the sound file of the learning sound data and calculates an initial lattice capacity of each of the sound files,
An acoustic model generation apparatus, wherein an audio file having a large initial lattice capacity is selected as the partial learning audio data.

The acoustic model creation device according to claim 10,
The initial lattice capacity calculating means calculates a second lattice capacity using a second recognition parameter obtained by changing the initial recognition parameter.
The acoustic model creation device, wherein the partial learning data selection unit selects a voice file having a large difference between the initial lattice capacity and the second lattice capacity as the partial learning voice data.

In the acoustic model creation device according to claim 9,
The partial learning data selection unit includes:
Feature analysis means for converting the speech data of each audio file of the learning speech data into speech features for each frame; and
Reliability score calculation means for converting the speech features into a reliability composed of an acoustic score and a language score, and calculating, for each audio file, a reliability score obtained by normalizing that reliability by the length of the file;
The partial learning data selection unit selecting audio files having a small reliability score as the partial learning speech data.

In the acoustic model creation device according to claim 9,
The partial learning data selection unit includes:
Feature analysis means for converting the speech data of each audio file of the learning speech data into speech features for each frame; and
High-speed prior reliability score calculation means for obtaining, as a monophone maximum likelihood value, the largest product of the output probability obtained from a monophone GMM for the speech features of each frame and the occurrence probability of the state to which that GMM belongs, and calculating a high-speed prior reliability score by averaging, over each audio file, the difference between the logarithm of the monophone maximum likelihood value and the logarithm of a speech/pause maximum likelihood value;
The partial learning data selection unit selecting audio files having a small high-speed prior reliability score as the partial learning speech data.

The acoustic model creation device according to any one of claims 9 to 13,
The language model storage unit
An acoustic model creation apparatus characterized by storing a lattice creation language model created from the learning speech database.

A program for causing a computer to execute the acoustic model creation method according to any one of claims 1 to 8.