JP4951035B2

JP4951035B2 - Likelihood ratio model creation device by speech unit, Likelihood ratio model creation method by speech unit, speech recognition reliability calculation device, speech recognition reliability calculation method, program

Info

Publication number: JP4951035B2
Application number: JP2009161463A
Authority: JP
Inventors: 義和山口; 哲小橋川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-07-08
Filing date: 2009-07-08
Publication date: 2012-06-13
Anticipated expiration: 2029-07-08
Also published as: JP2011017818A

Description

The present invention relates to a technique for creating a likelihood ratio model for each speech unit and for calculating the reliability of speech recognition using that likelihood ratio model.

In speech recognition, the reliability of a recognition result is often calculated and used to judge, for example, whether the result is correct. Such reliability is therefore one of the important indices.

For example, Non-Patent Document 1 describes a reliability calculation method that uses the likelihood of a syllable recognition result (syllable likelihood) as the reference likelihood. In this technique, as shown in FIG. 1, a feature analysis unit 901 obtains acoustic features for each frame of the input digital speech signal; a recognition processing unit 902 performs speech recognition on the per-frame acoustic features using an acoustic model 903 and a dictionary/language model 904, and calculates a speech recognition result and its recognition likelihood (task likelihood); a syllable likelihood calculation unit 905 obtains the syllable likelihood from the per-frame acoustic features; and a reliability calculation unit 906 obtains the reliability by normalizing the task likelihood with the syllable likelihood. In this way, the technique normalizes the task likelihood, obtained by speech recognition with the recognition target dictionary, by the syllable likelihood obtained by syllable recognition, which improves the ability to discriminate correct from incorrect results for out-of-vocabulary input.

Patent Document 1 discloses an N-best-based reliability calculation method. As the reference likelihood, it uses the difference between the recognition likelihood (score) of the first-ranked word and the highest recognition likelihood among the words ranked second or lower that differ from the first-ranked word.

In addition, open-source general-purpose large-vocabulary continuous speech recognition engines, such as Julius (see Non-Patent Document 2), are available for developing speech recognition systems.

Patent Document 1: Japanese Patent No. 3819896

Non-Patent Document 1: Takao Watanabe, Satoshi Tsukada, "Rejection of unknown utterances by likelihood correction using syllable recognition", IEICE Transactions D, Vol. J75-D2, No. 12, pp. 2002-2009, 1992.
Non-Patent Document 2: Julius, [retrieved July 2, 2009], Internet <http://julius.sourceforge.jp/index.php>

With the technique disclosed in Non-Patent Document 1, the syllable likelihood calculation is added on top of the recognition processing, so the processing load of the recognition process as a whole is large.

With the technique disclosed in Patent Document 1, the reference likelihood follows directly from obtaining the N-best list, so little extra computation is needed. However, its ability to discriminate correct from incorrect results for out-of-vocabulary input is not necessarily good.

Furthermore, because the reliability calculation in the prior art depends on the frame length and the number of frames, the ranges of the recognition likelihood and the reference likelihood are not fixed. It is therefore difficult to normalize the reliability universally (for example, to a value from 0 to 100) regardless of the linguistic unit (for example, a word) or the utterance duration.

In view of this situation, an object of the present invention is to provide techniques for creating a normalized likelihood ratio model for each speech unit and for calculating a normalized speech recognition reliability using that likelihood ratio model.

With regard to creating a normalized likelihood ratio model for each speech unit, the present invention uses an acoustic model, development data consisting of speech data and correct labels associated with that speech data, and a Gaussian mixture model (GMM), and proceeds as follows. The acoustic features of the speech data are calculated for each frame (feature analysis). For the per-frame acoustic features, the correct likelihood of each speech unit contained in the correct label is calculated using the correct label and the acoustic model (correct likelihood calculation). For the per-frame acoustic features, the likelihood under the GMM (GMM likelihood) is calculated (GMM likelihood calculation). For each frame, the ratio of the correct likelihood to the GMM likelihood (speech unit likelihood ratio) is calculated (speech unit likelihood ratio calculation). Finally, for each speech unit contained in the development data, a normalized probability distribution function (speech unit likelihood ratio model) is created by normalizing a probability distribution function whose random variable is the speech unit likelihood ratio corresponding to that speech unit (speech unit likelihood ratio model creation).

This GMM may be a Gaussian mixture model trained on the voiced sections of training speech data, and the GMM likelihood calculation may additionally use an unvoiced model, trained on the unvoiced sections of training speech data, to calculate the GMM likelihood.

With regard to calculating a normalized speech recognition reliability using this likelihood ratio model, the present invention uses an acoustic model, the speech unit likelihood ratio models created as above, and a Gaussian mixture model (GMM), and proceeds as follows. The acoustic features of the speech signal to be recognized are calculated for each frame (feature analysis). For the per-frame acoustic features, a speech recognition result and the recognition result likelihood of each speech unit contained in that result are calculated using the acoustic model (recognition processing). For the per-frame acoustic features, the likelihood under the GMM (reference likelihood) is calculated (reference likelihood calculation). For each frame, the ratio of the recognition result likelihood to the reference likelihood (speech unit likelihood ratio) is calculated (speech unit likelihood ratio calculation). Then, for each speech unit contained in the speech recognition result, the output value of the speech unit likelihood ratio model (frame reliability) is obtained for the speech unit likelihood ratio of each frame corresponding to that speech unit, and the average of these frame reliabilities over the frames is taken as the speech unit reliability (speech unit reliability calculation).

The average of the speech unit reliabilities corresponding to the speech units contained in the speech recognition result may be calculated as the reliability of the speech recognition result (reliability calculation).
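The two averaging steps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the two stand-in models for "t" and "ou", and all numeric values are hypothetical, and each model is assumed to be a callable that maps a likelihood ratio to a value between 0 and 1.

```python
import math
from statistics import fmean


def speech_unit_reliability(model, frame_ratios):
    """Average the model output (frame reliability) over the frames of one unit."""
    return fmean(model(ratio) for ratio in frame_ratios)


def recognition_reliability(models, recognized_units):
    """Average the speech unit reliabilities over all units in the result."""
    return fmean(
        speech_unit_reliability(models[unit], ratios)
        for unit, ratios in recognized_units
    )


# Hypothetical stand-in models: bell curves with a peak value of 1.
models = {
    "t": lambda r: math.exp(-0.5 * ((r - 1.0) / 0.2) ** 2),
    "ou": lambda r: math.exp(-0.5 * ((r - 1.1) / 0.3) ** 2),
}

# Hypothetical recognition result: (speech unit, per-frame likelihood ratios).
recognized_units = [("t", [1.0, 1.0]), ("ou", [1.1])]
overall = recognition_reliability(models, recognized_units)
```

Because every frame reliability lies between 0 and 1, both averages stay in that range regardless of how many frames a unit spans, which is the property the summary relies on.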

In the reliability calculation technique, the acoustic model is preferably the same acoustic model that was used to create the speech unit likelihood ratio models, and the GMM is preferably the same GMM that was used to create the speech unit likelihood ratio models.

A program that causes a computer to function as the speech unit likelihood ratio model creation device of the present invention can make the computer operate as that device. Similarly, a program that causes a computer to function as the speech recognition reliability calculation device of the present invention can make the computer operate as that device. As described in detail in the embodiment, the speech unit likelihood ratio model creation device and the speech recognition reliability calculation device can also be realized as a single device. In that case, a program written to cause a computer to function as both devices can make the computer operate as both the speech unit likelihood ratio model creation device and the speech recognition reliability calculation device.

According to the present invention, a normalized likelihood ratio model is created for each speech unit and a normalized speech recognition reliability is calculated using that model. A universally normalized reliability can therefore be obtained regardless of the linguistic unit or the utterance duration, and the reliability is convenient to handle.

FIG. 1: Diagram explaining a conventional reliability calculation technique that uses the syllable recognition result likelihood (syllable likelihood) as the reference likelihood.
FIG. 2: Diagram showing an example functional configuration for the phoneme-specific likelihood ratio model creation processing of the embodiment.
FIG. 3: Diagram showing the processing flow of the phoneme-specific likelihood ratio model creation processing of the embodiment.
FIG. 4: Diagram showing a specific example of the phoneme-specific likelihood ratio model creation processing of the embodiment.
FIG. 5: Diagram showing an example functional configuration for the speech recognition reliability calculation processing of the embodiment.
FIG. 6: Diagram showing the processing flow of the speech recognition reliability calculation processing of the embodiment.
FIG. 7: Diagram showing a specific example of the speech recognition reliability calculation processing of the embodiment.
FIG. 8: Diagram showing a dictionary configuration using the GMM and the unvoiced model.
FIG. 9: Diagram showing an example of a recognition result and its phoneme reliabilities.
FIG. 10: Comparison table of recognition speed including reliability calculation.
FIG. 11: Comparison table of equal error rates, which evaluate the degree of correct/incorrect discrimination ability.

Embodiments of the present invention will be described with reference to the drawings.
In practice, rather than existing independently on its own, the speech unit likelihood ratio model creation device 1 according to an embodiment of the present invention may exist as a component of a device that performs speech recognition using the created speech unit likelihood ratio models (the speech recognition reliability calculation device 2 according to an embodiment of the present invention). Put further, the speech unit likelihood ratio model creation device 1 can be regarded not as a component that is readily separable from the speech recognition reliability calculation device 2, but as the speech recognition reliability calculation device 2 itself viewed from the standpoint of one particular function. In short, for most practical purposes the speech unit likelihood ratio model creation device 1 is the speech recognition reliability calculation device 2 itself.
However, this is not intended to exclude the speech unit likelihood ratio model creation device 1 existing as a standalone component, or being a component that is readily separable from the speech recognition reliability calculation device 2. For example, if the aim is simply to create speech unit likelihood ratio models, nothing prevents the speech unit likelihood ratio model creation device 1 from being implemented as a standalone component.
Here, the speech recognition reliability calculation device 2 is assumed to be implemented by a computer, for example a dedicated machine built from dedicated hardware or a general-purpose machine such as a personal computer. The same applies when the speech unit likelihood ratio model creation device 1 is implemented as a standalone component.

A hardware configuration example is described below for the case where the speech recognition reliability calculation device 2 is implemented on a computer as a standalone component. The speech unit likelihood ratio model creation device 1 is described as a component of the speech recognition reliability calculation device 2.

<Hardware Configuration Example of Speech Recognition Reliability Calculation Device 2>
The speech recognition reliability calculation device 2 includes an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a CPU (Central Processing Unit), which may include a cache memory or the like, memory in the form of RAM (Random Access Memory) and ROM (Read Only Memory), an external storage device such as a hard disk, and a bus that connects the input unit, output unit, CPU, RAM, ROM, and external storage device so that they can exchange data. If necessary, the speech recognition reliability calculation device 2 may also be provided with a device (drive) that can read from and write to a storage medium such as a CD-ROM. A general-purpose computer is one example of a physical entity with these hardware resources.

The external storage device of the speech recognition reliability calculation device 2 stores a program for creating speech unit likelihood ratio models, a program for calculating the speech recognition reliability, and the data required for processing by these programs. (Storage is not limited to the external storage device; for example, the programs may be stored in ROM, which is a read-only storage device.) Data obtained through the processing of these programs is stored as appropriate in RAM, the external storage device, or the like. Hereinafter, a storage device that stores data, the addresses of its storage areas, and so on is simply called a "storage unit".

In the embodiment, development data 200 is stored as data in a predetermined storage area of the storage unit. The development data 200 contains multiple development data lists, each consisting of speech data (analog data of a real voice) and a correct label, associated with that speech data, expressed in speech units (for example, phonemes, syllables, or demisyllables; phonemes in this embodiment). The configuration is not limited to this. For example, a development data list may consist of an acoustic analysis result and a correct label in speech units associated with that result, or of digitized speech data and a correct label in speech units associated with that speech data. The development data 200 may be the same as existing training data used for acoustic model training and the like, and it desirably amounts to, for example, more than 100 hours of total speech duration.

A GMM (Gaussian Mixture Model) 210 and an unvoiced model 220 are also stored as data in predetermined storage areas of the storage unit. The GMM 210 is a kind of acoustic model trained on all speech sections, excluding unvoiced sections, of training speech data such as that used to train an acoustic model for speech recognition (there is no particular restriction on this training speech data). In other words, it is trained to cover the characteristics of as many phonemes as possible (ideally all phonemes). The unvoiced model 220 is a kind of acoustic model trained on the unvoiced sections of training speech data such as that used to train an acoustic model for speech recognition (there is no particular restriction on this training speech data, and it may differ from the training speech data used to train the GMM).

The storage unit of the speech recognition reliability calculation device 2 stores a program for calculating acoustic features, a program for calculating the correct likelihood, a program for calculating the GMM likelihood, a program for calculating the phoneme-specific likelihood ratio from the correct likelihood and the GMM likelihood, a program for creating phoneme-specific likelihood ratio models, a program for performing speech recognition, a program for calculating the reference likelihood, a program for calculating the phoneme-specific likelihood ratio from the recognition result likelihood and the reference likelihood, a program for calculating the reliability of each phoneme, and a program for calculating the reliability of the recognition result.

In the speech recognition reliability calculation device 2, each program stored in the storage unit and the data needed for its processing are read into RAM as required and are interpreted, executed, and processed by the CPU. As a result, the CPU realizes predetermined functions (a feature analysis unit, a correct likelihood calculation unit, a GMM likelihood calculation unit, a first phoneme-specific likelihood ratio calculation unit, a phoneme-specific likelihood ratio model creation unit, a recognition processing unit, a reference likelihood calculation unit, a second phoneme-specific likelihood ratio calculation unit, a phoneme reliability calculation unit, and a recognition reliability calculation unit), and thereby the creation of phoneme-specific likelihood ratio models and the calculation of the speech recognition reliability are achieved.
Note that the correct likelihood calculation unit, the GMM likelihood calculation unit, the first phoneme-specific likelihood ratio calculation unit, and the phoneme-specific likelihood ratio model creation unit are not essential components of the speech recognition reliability calculation device 2. Likewise, the speech unit likelihood ratio model creation device 1 of the embodiment includes the feature analysis unit, the correct likelihood calculation unit, the GMM likelihood calculation unit, the first phoneme-specific likelihood ratio calculation unit, and the phoneme-specific likelihood ratio model creation unit, whereas the recognition processing unit, the reference likelihood calculation unit, the second phoneme-specific likelihood ratio calculation unit, the phoneme reliability calculation unit, and the recognition reliability calculation unit are not essential components of the speech unit likelihood ratio model creation device 1.

Next, as an embodiment, the flow of the speech recognition reliability calculation processing performed by the speech recognition reliability calculation device 2, including the speech unit likelihood ratio model creation processing performed by the speech unit likelihood ratio model creation device 1, is described step by step with reference to FIGS. 2 to 8.

<Phoneme-Specific Likelihood Ratio Model Creation Processing>
The following processing is performed for each development data list (steps S1-S6).
Taking as the input speech signal the speech data contained in the development data list, digitized by an A/D conversion unit or the like (not shown), the feature analysis unit 101 calculates, for each group of multiple samples of the input speech signal (hereinafter, a frame), acoustic features of the same kind as those used in speech recognition processing, such as the cepstrum and speech power (see, for example, Reference 1) (step S2).
(Reference 1) Sadaoki Furui, "Digital Speech Processing", Tokai University Press

Next, the correct likelihood calculation unit 102 performs speech recognition on the per-frame acoustic features using the correct label corresponding to the input speech signal and an acoustic model 280 (a phoneme-context-dependent model, a phoneme-context-independent model, or the like). It detects the utterance section positions of the phonemes contained in the correct label (hereinafter, phoneme alignment) A_correct(f, T), where f is the file name and T is the frame number, and for each frame it calculates the correct likelihood P_correct(f, T) of the acoustic model 280 of phoneme α(f, T) for the acoustic features of frame T (step S3). The acoustic model 280 used here is the same as the acoustic model used at recognition time.

A specific example is described with reference to FIG. 4. Suppose that speech data with the utterance content "Tokyo" (for convenience, the data is named 1.pcm; it is 29 frames long) is given together with the phoneme sequence of its correct label, "t/ou/k/y/ou" (the units separated by "/" are phonemes). For frames 5 to 7, which are the utterance section of the phoneme "t" contained in the correct label, the correct likelihood calculation unit 102 detects the phoneme alignment A_correct('1.pcm', 5≦T≦7) = "t". At the same time it obtains, in order, the correct likelihoods P_correct('1.pcm', T) [T = 5, 6, 7]: the correct likelihood P_correct('1.pcm', 5) of the acoustic model of phoneme "t" for the acoustic features of frame 5, the correct likelihood P_correct('1.pcm', 6) of the acoustic model of phoneme "t" for the acoustic features of frame 6, and so on. Since the following frames 8 to 13 are the utterance section of the phoneme "ou" contained in the correct label, the correct likelihood calculation unit 102 detects the phoneme alignment A_correct('1.pcm', 8≦T≦13) = "ou" for frames 8 to 13, and obtains the correct likelihoods P_correct('1.pcm', 8), ..., P_correct('1.pcm', 13) of the acoustic model of phoneme "ou" for the acoustic features of each frame in that section. The subsequent frames are handled in the same way.
Note that "sil" in FIG. 4 stands for "silence", that is, an unvoiced section. The phoneme alignment A_correct(f, T) and the correct likelihood P_correct(f, T) are obtained for this unvoiced section in the same way.
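As a small illustration of the phoneme alignment A_correct(f, T) in this example, the sketch below expands segment boundaries into per-frame labels. Only the two segments stated in the text ("t" for frames 5 to 7 and "ou" for frames 8 to 13) are filled in; the boundaries of the remaining segments appear only in FIG. 4 and are left out. The function name and the data layout are assumptions made for illustration.

```python
def expand_alignment(segments):
    """Map each frame number T to its phoneme, given (phoneme, first, last) segments."""
    labels = {}
    for phoneme, first_frame, last_frame in segments:
        for frame in range(first_frame, last_frame + 1):  # inclusive range
            labels[frame] = phoneme
    return labels


# Segments stated in the text for '1.pcm' (frame numbers are 1-based).
segments = [("t", 5, 7), ("ou", 8, 13)]
alignment = expand_alignment(segments)
```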

Next, the GMM likelihood calculation unit 103 performs speech recognition using the per-frame acoustic features, the GMM 210, and the unvoiced model 220, and calculates the GMM likelihood P_GMM(f, T) for the acoustic features of each frame (step S4). The GMM 210 is a kind of acoustic model trained on all speech sections of the training data excluding unvoiced sections, whereas in actual speech recognition the speech to be recognized usually contains unvoiced sections. The GMM likelihood calculation unit 103 therefore also uses the unvoiced model 220 when performing speech recognition and obtaining the likelihood. Specifically, using a dictionary (grammar) consisting of the GMM 210 and the unvoiced model 220 as shown in FIG. 8, the GMM likelihood calculation unit 103 obtains the likelihood of the acoustic features under each of the two models and takes the larger one as the GMM likelihood P_GMM(f, T). The GMM likelihood calculation itself may use, for example, the method used in Non-Patent Document 2.
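The "take the larger of the two likelihoods" rule can be sketched as below. This is an illustrative sketch only: it assumes diagonal-covariance Gaussian mixtures and works with log-likelihoods for numerical stability (the maximum is unchanged because the logarithm is monotonic). The function names and the toy one-dimensional parameters are hypothetical and are not taken from the patent or from Julius.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture(x, components):
    """Log-likelihood of x under a mixture given as (weight, mean, var) tuples."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    peak = max(logs)  # log-sum-exp trick to avoid underflow
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


def gmm_log_likelihood(x, speech_gmm, unvoiced_model):
    """Larger of the two model log-likelihoods, used as the GMM likelihood."""
    return max(log_mixture(x, speech_gmm), log_mixture(x, unvoiced_model))


# Toy one-dimensional models (hypothetical parameters).
speech_gmm = [(0.5, [1.0], [1.0]), (0.5, [3.0], [1.0])]
unvoiced_model = [(1.0, [-2.0], [0.5])]
```

With these toy parameters, a feature value near -2.0 is scored by the unvoiced model and a value near 2.0 by the speech GMM, so silent frames do not receive an artificially low reference likelihood.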

In the specific example shown in FIG. 4, the GMM likelihood calculation unit 103 obtains the GMM likelihoods P_GMM('1.pcm', 1), P_GMM('1.pcm', 2), ..., P_GMM('1.pcm', 29) for the acoustic features calculated from the speech data 1.pcm.

Next, for each frame, the first phoneme-specific likelihood ratio calculation unit 104 calculates the ratio of the previously obtained correct likelihood to the GMM likelihood, P_correct(f, T) ÷ P_GMM(f, T), as the phoneme-specific likelihood ratio P_ratio(f, T), and stores it together with the phoneme alignment A_correct(f, T) in the phoneme-specific likelihood ratio storage unit 230 (step S5).
This processing is performed for all frames of the input speech signal.

In the specific example shown in FIG. 4, the phoneme-specific likelihood ratio P_ratio('1.pcm', 5) of frame 5 (phoneme "t"), for example, is calculated as the correct likelihood P_correct('1.pcm', 5) ÷ the GMM likelihood P_GMM('1.pcm', 5). Likewise, the phoneme-specific likelihood ratio P_ratio('1.pcm', 8) of frame 8 (phoneme "ou") is calculated as the correct likelihood P_correct('1.pcm', 8) ÷ the GMM likelihood P_GMM('1.pcm', 8). The same applies to the other frames.
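Step S5 can be sketched as follows: the ratio is computed per frame and grouped by the aligned phoneme, which plays the role of the phoneme-specific likelihood ratio storage unit 230. The function name, the dictionary layout, and the likelihood values are hypothetical; the values are placeholders, not data from the patent. If log-likelihoods were used instead, the division would become a subtraction.

```python
from collections import defaultdict


def collect_ratios(alignment, p_correct, p_gmm):
    """Group P_ratio = P_correct / P_GMM by phoneme, one value per frame."""
    ratios_by_phoneme = defaultdict(list)
    for frame, phoneme in sorted(alignment.items()):
        ratios_by_phoneme[phoneme].append(p_correct[frame] / p_gmm[frame])
    return dict(ratios_by_phoneme)


# Placeholder values for two frames of '1.pcm' (not real likelihoods).
alignment = {5: "t", 8: "ou"}
p_correct = {5: 0.2, 8: 0.3}
p_gmm = {5: 0.4, 8: 0.6}
ratios = collect_ratios(alignment, p_correct, p_gmm)
```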

After steps S2 to S5 have been performed for every development data list included in the development data 200 (step S6), the phoneme-specific likelihood ratio model creation unit 105 creates a likelihood ratio model for each phoneme (steps S7 to S9). Specifically, for each type of phoneme included in the development data 200, the phoneme-specific likelihood ratio model creation unit 105 reads from the phoneme-specific likelihood ratio storage unit 230 the phoneme-specific likelihood ratios P_ratio(f, T) corresponding to the phoneme α, takes a probability distribution function (for example, a normal distribution) that has these ratios as its random variable, normalizes it by its maximum value (for example, the cumulative occurrence value), and stores the resulting normalized probability distribution function D(α) in the phoneme-specific likelihood ratio model storage unit 240 as the phoneme-specific likelihood ratio model. This processing is performed for every type of phoneme included in the development data 200. For example, if the development data 200 contains 30 types of phonemes such as "t" and "ou", 30 phoneme-specific likelihood ratio models such as D('t') and D('ou') are created.
The series of processes described above is the <phoneme-specific likelihood ratio model creation processing>.

With the phoneme-specific likelihood ratio models created by this processing, given the phoneme-specific likelihood ratio of a phoneme α obtained in the speech recognition processing described later, the output value of the likelihood ratio model for phoneme α (the output of the probability distribution function when the phoneme-specific likelihood ratio is input as the random variable) is a high value with an upper bound of 1, while the output values of the likelihood ratio models for phonemes other than α are low values with a lower bound of 0.
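One way to read the model creation step is sketched below, assuming the normal-distribution example given in the text and assuming that "normalized by its maximum value" means dividing the density by its peak. Under those assumptions the model output reduces to exp(-(x - μ)² / (2σ²)), which lies between 0 and 1 and equals 1 at the mean. The function names, the `(mean, std)` tuple, and the `min_std` floor are hypothetical.

```python
import math
from statistics import fmean, pstdev


def fit_ratio_model(ratios, min_std=1e-3):
    """Fit a normal distribution to one phoneme's likelihood ratios.

    Returns (mean, std). The floor on std is an assumption added here to
    avoid division by zero when a phoneme has a single or constant sample.
    """
    mu = fmean(ratios)
    sigma = max(pstdev(ratios), min_std)
    return mu, sigma


def model_output(model, ratio):
    """Normal density divided by its peak value, so the output is in (0, 1]."""
    mu, sigma = model
    return math.exp(-0.5 * ((ratio - mu) / sigma) ** 2)


def build_models(ratios_by_phoneme):
    """Create one likelihood ratio model per phoneme type (steps S7 to S9)."""
    return {ph: fit_ratio_model(r) for ph, r in ratios_by_phoneme.items()}
```

For instance, `build_models({"t": [1.0, 2.0, 3.0]})` yields a model for "t" whose output is 1 for a ratio of 2.0 and falls toward 0 as the ratio moves away from the mean.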

<Speech recognition reliability calculation processing>
The feature amount analysis unit 101 takes as its input speech signal the digital speech signal to be recognized, digitized by an A/D conversion unit (not shown) or the like, and, for each group of multiple samples of the input speech signal (hereinafter referred to as a frame), calculates acoustic feature amounts used for speech recognition processing, such as the cepstrum and speech power (see Reference 1 above) (step S1p).

Next, the recognition processing unit 106 performs speech recognition processing using the acoustic feature amount of each frame, the acoustic model 280, and the speech recognition dictionary 282 (which may include a language model), and obtains the recognition result, the phoneme alignment A_result(T) of the phonemes included in the recognition result, and the recognition result likelihood P_result(T) of each frame (step S2p). T denotes the frame number.

A specific example is described with reference to FIG. 7. Suppose the utterance is "Tokyo" and the first-ranked phoneme sequence of the recognition result for the input speech signal (27 frames long) is "t/ou/k/y/ou". For frames 3 to 5, the utterance section of the recognition result phoneme "t", the recognition processing unit 106 obtains the phoneme alignment A_result(3≦T≦5) = "t" together with the recognition result likelihoods P_result(3), ..., P_result(5) given by the acoustic model of the recognition result phoneme "t" for the acoustic feature amount of each frame in that section. Similarly, for frames 6 to 12, the utterance section of the next recognition result phoneme "ou", the recognition processing unit 106 obtains the phoneme alignment A_result(6≦T≦12) = "ou" together with the recognition result likelihoods P_result(6), ..., P_result(12) given by the acoustic model of the recognition result phoneme "ou" for the acoustic feature amount of each frame in that section. The same applies to the subsequent frames.
The phoneme alignment A_result(T) and the recognition result likelihood P_result(T) are obtained in the same way for the unvoiced section sil in FIG. 4.

The reference likelihood calculation unit 107 performs speech recognition using the acoustic feature amount of each frame, the GMM 210, and the unvoiced model 220, and calculates the reference likelihood P_ref(T) for the acoustic feature amount of each frame (step S3p). This processing is the same as that of the GMM likelihood calculation unit 103: using a dictionary (grammar) consisting of the GMM 210 and the unvoiced model 220 as shown in FIG. 8, it obtains the likelihood of the acoustic feature amount under each of the two models and takes the larger one as the reference likelihood P_ref(T). The GMM 210 and the unvoiced model 220 used in the speech recognition reliability calculation processing are the same as those used in the phoneme-specific likelihood ratio model creation processing.

In the specific example shown in FIG. 7, the reference likelihood calculation unit 107 obtains the reference likelihoods P_ref(1), ..., P_ref(27) for the acoustic feature amounts calculated from the input speech signal.

Next, for each frame, the second phoneme-specific likelihood ratio calculation unit 108 calculates the ratio of the recognition result likelihood to the reference likelihood, P_result(T) ÷ P_ref(T), as the phoneme-specific likelihood ratio P_ratio(T), and stores it in the phoneme-specific likelihood ratio storage unit 250 together with the phoneme alignment A_result(T) (step S4p).

In the specific example shown in FIG. 7, the phoneme-specific likelihood ratio P_ratio(3) of the third frame (recognition result phoneme "t"), for example, is calculated as "recognition result likelihood P_result(3)" ÷ "reference likelihood P_ref(3)". Likewise, the phoneme-specific likelihood ratio P_ratio(6) of the sixth frame (recognition result phoneme "ou") is calculated as "recognition result likelihood P_result(6)" ÷ "reference likelihood P_ref(6)". The same applies to the other frames.

Subsequently, the phoneme reliability calculation unit 109 performs the following processing on the phoneme alignment A_result(T_start(α)≦T≦T_end(α)) = "α" corresponding to one or more consecutive occurrences of the same recognition result phoneme α, and does so for every recognition result phoneme appearing in the recognition result (steps S5p to S7p). Here, T_start(α) is the start frame number of the recognition result phoneme α and T_end(α) is its end frame number. Recognition result phonemes of the same type are processed separately when their start frame numbers differ. In the example shown in FIG. 7, the recognition result phoneme "ou" appears twice in the recognition result (frames 6 to 12 and frames 19 to 24). Although both are the same type of phoneme, their start frame numbers differ, so the processing is applied to each separately, with T_start('ou') = 6 and T_end('ou') = 12 for frames 6 to 12, and T_start('ou') = 19 and T_end('ou') = 24 for frames 19 to 24.
The processing itself is as follows. For each frame T = T_start(α), T_start(α)+1, ..., T_end(α)-1, T_end(α) within the phoneme alignment A_result(T_start(α)≦T≦T_end(α)) = "α", the phoneme reliability calculation unit 109 uses the phoneme-specific likelihood ratio model D(α) of the phoneme α stored in the phoneme-specific likelihood ratio model storage unit 240 to obtain the output value D(α, P_ratio(T)) of the model for the phoneme-specific likelihood ratio P_ratio(T) stored in the phoneme-specific likelihood ratio storage unit 250 (hereinafter referred to as the frame reliability). Below, the frame reliability D(α, P_ratio(T)) (T_start(α)≦T≦T_end(α)) is written C_frame(T) (T_start(α)≦T≦T_end(α)). For this phoneme alignment, the phoneme reliability calculation unit 109 then divides the sum of the frame reliabilities C_frame(T) (T_start(α)≦T≦T_end(α)) by the number of frames (T_end(α)-T_start(α)+1) and takes the resulting per-frame average as the phoneme reliability C_phone[T_start(α):T_end(α)] of the phoneme alignment A_result(T_start(α)≦T≦T_end(α)) = "α".

In the specific example shown in FIG. 7, the second phoneme alignment from the beginning is A_result(3≦T≦5) = "t". The phoneme-specific likelihood ratio model D('t') of the phoneme "t" stored in the phoneme-specific likelihood ratio model storage unit 240 is therefore used to obtain, for each frame T, the frame reliabilities C_frame(3), C_frame(4), and C_frame(5), which are the values of D('t', P_ratio(T)) for the phoneme-specific likelihood ratios P_ratio(3), P_ratio(4), and P_ratio(5). Their per-frame average is taken as the phoneme reliability C_phone[3:5]. The same processing is performed for the other phoneme alignments.
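Steps S5p to S7p can be sketched as follows. This is an illustrative reading, not the embodiment's code: the alignment is assumed to be given as one phoneme label per frame, each model is assumed to be a hypothetical `(mean, std)` pair of a peak-normalized normal distribution, and the function names are made up for the example. Frame numbers are 1-based and inclusive, as in the text.

```python
import math
from itertools import groupby


def segments(alignment):
    """Yield (phoneme, T_start, T_end) for each run of identical frame labels."""
    frame = 1
    for phoneme, run in groupby(alignment):
        length = len(list(run))
        yield phoneme, frame, frame + length - 1
        frame += length


def phoneme_reliabilities(alignment, ratios, models):
    """Average the frame reliabilities C_frame(T) over each phoneme segment."""
    results = []
    for phoneme, start, end in segments(alignment):
        mu, sigma = models[phoneme]
        frame_scores = [
            math.exp(-0.5 * ((ratios[t - 1] - mu) / sigma) ** 2)
            for t in range(start, end + 1)
        ]
        results.append((phoneme, start, end, sum(frame_scores) / len(frame_scores)))
    return results
```

Because segments are keyed by their position rather than by phoneme type, two separate occurrences of "ou" in one utterance produce two separate entries, matching the separate handling described for frames 6 to 12 and 19 to 24.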

Next, the reliability calculation unit 110 calculates the average of the phoneme reliabilities C_phone calculated by the phoneme reliability calculation unit 109 over all phoneme alignments as the reliability C of the recognition result (step S8p).
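Step S8p amounts to an unweighted mean over phoneme segments, which can be written as the short sketch below. The function name is hypothetical, and raising an error for an empty recognition result is an assumption added for the example.

```python
def recognition_reliability(phoneme_scores):
    """Reliability C: mean of the phoneme reliabilities C_phone (step S8p)."""
    scores = list(phoneme_scores)
    if not scores:
        raise ValueError("the recognition result contains no phoneme segments")
    return sum(scores) / len(scores)
```

Since the average is taken per segment rather than per frame, a phoneme that spans many frames contributes the same weight to C as a phoneme that spans only a few.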

Although the specific example shown in FIG. 7 illustrates the calculation of the reliability C for the first-ranked recognition result, embodiments that likewise obtain the reliability for each of the second-ranked and lower recognition results are also permitted.
The series of processes described above is the <speech recognition reliability calculation processing>.

Regarding normalization of the reliability: when the recognition result for the input speech signal is correct, as with the first-ranked recognition result shown in FIG. 9, the frame reliability C_frame(T) is a high value with an upper bound of 1 in every phoneme alignment, so the phoneme reliability C_phone, and in turn the reliability C, are also high.
When the recognition result for the input speech signal is incorrect, on the other hand, the situation differs. For example, as with the second-ranked recognition result shown in FIG. 9, if the recognition result for an input speech signal whose utterance is "Tokyo (t/ou/k/y/ou)" is "t/ou/ch/y/ou", the phoneme "k" has been misrecognized as the phoneme "ch". In this phoneme alignment, the phoneme-specific likelihood ratio model of the recognition result phoneme "ch" is expected to output a low frame reliability C_frame(T) for the acoustic features of what was actually uttered as "k". The reliability of the recognition result as a whole therefore takes a lower value than when the result is correct. The magnitude of the reliability thus expresses the degree of confidence in the recognition result: a high reliability indicates that the result is likely correct, and a low reliability indicates that it is likely a misrecognition.

≪Supplementary notes≫
In the embodiment, when only the speech unit-specific likelihood ratio model is to be created, the processing of steps S1p to S8p can be omitted. When only the phoneme reliability is required, the processing of step S8p can be omitted. Phonemes have been used here as the example speech unit; when another speech unit such as the syllable is used, "phoneme" in the above description may be read as "syllable" or the like.

When, for example, the speech unit-specific likelihood ratio model creation device 1 and the speech recognition reliability calculation device 2 are configured as separate devices, the speech unit-specific likelihood ratio model created by the phoneme-specific likelihood ratio model creation unit 105 of the speech unit-specific likelihood ratio model creation device 1 can be stored (for example, via a recording medium) in the storage unit of the speech recognition reliability calculation device 2, and the phoneme reliability calculation unit 109 can obtain the phoneme reliability C_phone using this stored model. In this case, the acoustic model 280 used by the speech recognition reliability calculation device 2 is preferably the same as the acoustic model 280 used by the speech unit-specific likelihood ratio model creation device 1, but the two need not be identical. As one example, different acoustic models can be used when the training data used to train them overlap but their structures differ (for example, different numbers of HMM (Hidden Markov Model) states or mixtures).

The speech unit-specific likelihood ratio model creation device and method and the speech recognition reliability calculation device and method of the present invention are not limited to the embodiments described above, and may be modified as appropriate without departing from the spirit of the present invention. The processes described in each embodiment need not be executed in time series in the order described; they may be executed in parallel or individually according to the processing capability of the device executing them or as otherwise required.

When the processing functions of the speech unit-specific likelihood ratio model creation device or the speech recognition reliability calculation device are implemented by a computer, the processing content of the functions that the device should have is described by a program. Executing this program on a computer implements the processing functions of the device on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are implemented solely through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the speech unit-specific likelihood ratio model creation device or the speech recognition reliability calculation device is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be implemented in hardware.

Table 1 in FIG. 10 compares recognition speed including reliability calculation (RTF: Real Time Factor, the recognition processing time normalized by the length of the speech). The present invention uses a GMM, and even when a GMM consisting of a 64-mixture Gaussian distribution is used, as in this example, the amount of computation in this example is far smaller than that of the conventional method based on the syllable recognition result likelihood. Indeed, the comparison table shows that this example needs only about the same amount of computation as the conventional N-best method, which has a low computational cost.

Table 2 in FIG. 11 compares the equal error rate (EER: Equal Error Rate), which evaluates the ability to discriminate correct from incorrect results. The EER is the value at which the false rejection rate (correct results wrongly rejected) equals the false acceptance rate (misrecognitions and out-of-vocabulary inputs wrongly accepted), and a smaller EER is considered better. Because the present invention uses a GMM that encompasses the phoneme models, this example achieves an effect equal to or better than that of the conventional method based on the syllable recognition result likelihood, and it has a lower equal error rate, and thus better discrimination ability, than the conventional N-best method.
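For readers unfamiliar with the metric, the sketch below shows one common way to estimate an equal error rate from reliability scores and correctness labels by sweeping an acceptance threshold. It is a generic illustration, not the evaluation procedure behind Table 2; the function name and the rule "accept when score ≥ threshold" are assumptions.

```python
def equal_error_rate(scores, is_correct):
    """Return (threshold, FRR, FAR) at the threshold where FRR and FAR are closest.

    scores     -- reliability of each recognition result
    is_correct -- True if that result was actually correct
    A result is accepted when its score is greater than or equal to the threshold.
    """
    n_correct = sum(1 for ok in is_correct if ok)
    n_wrong = len(is_correct) - n_correct
    if n_correct == 0 or n_wrong == 0:
        raise ValueError("need at least one correct and one incorrect result")
    best = None
    for threshold in sorted(set(scores)):
        # False rejection: a correct result scored below the threshold.
        frr = sum(1 for s, ok in zip(scores, is_correct) if ok and s < threshold) / n_correct
        # False acceptance: an incorrect result scored at or above the threshold.
        far = sum(1 for s, ok in zip(scores, is_correct) if not ok and s >= threshold) / n_wrong
        if best is None or abs(frr - far) < abs(best[1] - best[2]):
            best = (threshold, frr, far)
    return best
```

With a finite set of scores the two rates rarely coincide exactly, so the sketch reports the threshold at which they are closest rather than interpolating.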

Claims

A speech unit-specific likelihood ratio model creation device comprising:
storage means for storing an acoustic model, development data composed of speech data and correct labels associated with the speech data, and a Gaussian mixture model (GMM);
feature amount analysis means for calculating the acoustic feature amount of the speech data for each frame;
correct likelihood calculation means for calculating, for the acoustic feature amount of each frame, the correct likelihood of the phonemes included in the correct label using the correct label and the acoustic model;
GMM likelihood calculation means for calculating, for the acoustic feature amount of each frame, the likelihood given by the GMM (GMM likelihood);
speech unit-specific likelihood ratio calculation means for calculating, for each frame, the ratio of the correct likelihood to the GMM likelihood as a first phoneme-specific likelihood ratio; and
speech unit-specific likelihood ratio model creation means for creating, for each type of phoneme included in the development data, a normalized probability distribution function (phoneme-specific likelihood ratio model) obtained by normalizing a probability distribution function whose random variable is the first phoneme-specific likelihood ratio corresponding to that phoneme.

The speech unit-specific likelihood ratio model creation device according to claim 1,
wherein the GMM is a Gaussian mixture model trained on the voiced sections of training speech data.

The speech unit-specific likelihood ratio model creation device according to claim 1 or 2,
wherein the GMM likelihood calculation means calculates the GMM likelihood also using an unvoiced model trained on the unvoiced sections of training speech data.

A speech unit-specific likelihood ratio model creation method in which storage means stores an acoustic model, development data composed of speech data and correct labels associated with the speech data, and a Gaussian mixture model (GMM), the method comprising:
a feature amount analysis step of calculating the acoustic feature amount of the speech data for each frame;
a correct likelihood calculation step of calculating, for the acoustic feature amount of each frame, the correct likelihood of the phonemes included in the correct label using the correct label and the acoustic model;
a GMM likelihood calculation step of calculating, for the acoustic feature amount of each frame, the likelihood given by the GMM (GMM likelihood);
a speech unit-specific likelihood ratio calculation step of calculating, for each frame, the ratio of the correct likelihood to the GMM likelihood as a first phoneme-specific likelihood ratio; and
a speech unit-specific likelihood ratio model creation step of creating, for each type of phoneme included in the development data, a normalized probability distribution function (phoneme-specific likelihood ratio model) obtained by normalizing a probability distribution function whose random variable is the first phoneme-specific likelihood ratio corresponding to that phoneme.

A speech recognition reliability calculation device comprising:
storage means for storing an acoustic model, the phoneme-specific likelihood ratio model created by the speech unit-specific likelihood ratio model creation device according to any one of claims 1 to 3, and a Gaussian mixture model (GMM);
feature amount analysis means for calculating the acoustic feature amount of the speech signal to be recognized for each frame;
recognition processing means for calculating, for the acoustic feature amount of each frame, a speech recognition result and the recognition result likelihood of the phonemes included in the speech recognition result using the acoustic model;
reference likelihood calculation means for calculating, for the acoustic feature amount of each frame, the likelihood given by the GMM (reference likelihood);
speech unit-specific likelihood ratio calculation means for calculating, for each frame, the ratio of the recognition result likelihood to the reference likelihood as a second phoneme-specific likelihood ratio; and
speech unit reliability calculation means for obtaining, for each phoneme included in the speech recognition result, the output value of the phoneme-specific likelihood ratio model (frame reliability) when the second phoneme-specific likelihood ratio of each frame corresponding to that phoneme is input, and obtaining the per-frame average of the frame reliabilities as a speech unit reliability.

The speech recognition reliability calculation device according to claim 5,
further comprising reliability calculation means for calculating the average of the speech unit reliabilities corresponding to the phonemes included in the speech recognition result as the reliability of the speech recognition result.

The speech recognition reliability calculation device according to claim 5 or 6,
wherein the acoustic model is the same acoustic model as that used when creating the phoneme-specific likelihood ratio model, and
the GMM is the same GMM as that used when creating the phoneme-specific likelihood ratio model.

A speech recognition reliability calculation method in which storage means stores an acoustic model, the phoneme-specific likelihood ratio model created by the speech unit-specific likelihood ratio model creation device according to any one of claims 1 to 3, and a Gaussian mixture model (GMM), the method comprising:
a feature amount analysis step of calculating the acoustic feature amount of the speech signal to be recognized for each frame;
a recognition processing step of calculating, for the acoustic feature amount of each frame, a speech recognition result and the recognition result likelihood of the phonemes included in the speech recognition result using the acoustic model;
a reference likelihood calculation step of calculating, for the acoustic feature amount of each frame, the likelihood given by the GMM (reference likelihood);
a speech unit-specific likelihood ratio calculation step of calculating, for each frame, the ratio of the recognition result likelihood to the reference likelihood as a second phoneme-specific likelihood ratio; and
a speech unit reliability calculation step of obtaining, for each phoneme included in the speech recognition result, the output value of the phoneme-specific likelihood ratio model (frame reliability) when the second phoneme-specific likelihood ratio of each frame corresponding to that phoneme is input, and obtaining the per-frame average of the frame reliabilities as a speech unit reliability.

A program for causing a computer to function as the speech unit-specific likelihood ratio model creation device according to any one of claims 1 to 3 and/or the speech recognition reliability calculation device according to any one of claims 5 to 7.