JP6148150B2

JP6148150B2 - Acoustic analysis frame reliability calculation device, acoustic model adaptation device, speech recognition device, their program, and acoustic analysis frame reliability calculation method

Info

Publication number: JP6148150B2
Application number: JP2013220132A
Authority: JP
Inventors: 太一浅見; 浩和政瀧
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-10-23
Filing date: 2013-10-23
Publication date: 2017-06-14
Anticipated expiration: 2033-10-23
Also published as: JP2015082036A

Description

The present invention relates to techniques for adapting the acoustic models used in speech recognition. In particular, it relates to an acoustic analysis frame reliability calculation device and method that select the feature quantities used for unsupervised acoustic model adaptation, to an acoustic model adaptation device and a speech recognition device that use that device, and to programs for them.

When an acoustic model used for speech recognition is updated, the model parameters are optimized so that the model accounts for as many of the examples in the training data as possible. This process is called "acoustic model adaptation". It generally uses training (adaptation) data consisting of speech files and correct text representing the utterance content of those files. Acoustic model adaptation is broadly divided into two types: supervised adaptation, in which the correct text is obtained by a human transcribing the speech file, and unsupervised adaptation, in which it is obtained as the speech recognition result for the speech file.

Unsupervised adaptation is superior in cost and time because it requires no manual work. However, speech recognition results can contain recognition errors, so the adaptation process may actually lower the accuracy of the acoustic model. To address this problem, a method has been proposed in which a confidence measure indicating reliability is attached to each speech recognition result, adaptation data are selected according to that confidence measure, and the acoustic model is adapted using the selected recognition results (Patent Document 1).

That method attaches a confidence measure to the speech recognition result of each utterance in a speech file and performs unsupervised adaptation using only the utterances whose confidence measure exceeds a threshold. Here, an utterance is a speech section of several seconds to several tens of seconds in the speech file, for example one spoken in a single breath. Its speech recognition result usually contains several words to several tens of words.

Patent Document 1: JP 2007-248730 A

The prior art assumes that an utterance with a low confidence measure, that is, a low recognition rate, contains many recognition errors, and it does not select the recognition result of that utterance as adaptation data. Yet even a low-confidence utterance that is not selected is rarely misrecognized over its entire length, and it often contains correctly recognized portions.

Because the prior art selects data in units of utterances, it cannot exclude only the misrecognized sections. In short, the prior art cannot identify misrecognized sections accurately. As a result, the cost of collecting adaptation data increases, the adaptation effect is reduced, and the improvement in speech recognition accuracy is small.

The present invention has been made in view of this problem. Its object is to provide an acoustic analysis frame reliability calculation device and method that can accurately identify misrecognized sections in adaptation data, an acoustic model adaptation device and a speech recognition device that use that device, and programs for them.

The acoustic analysis frame reliability calculation device of the present invention comprises a speech recognition unit and a frame confidence measure calculation unit. The speech recognition unit divides a speech signal into acoustic analysis frames of a predetermined length, extracts an acoustic feature for each frame, calculates acoustic likelihoods from the acoustic features and an initial acoustic model, assigns the maximum-likelihood phoneme ID to each frame, and outputs an acoustic feature sequence with phoneme IDs and acoustic likelihoods, in which the acoustic feature, acoustic likelihood, and phoneme ID of each frame are associated with one another. The frame confidence measure calculation unit takes that sequence as input, groups the acoustic features by phoneme ID to form acoustic feature sets, performs outlier detection on each set to obtain outlier scores, obtains the value of a function of the outlier score as a measure of reliability, and outputs an acoustic feature sequence with frame reliability, in which that function value is attached to each acoustic analysis frame.

The acoustic model adaptation device of the present invention comprises the acoustic analysis frame reliability calculation device described above, a feature selection unit, and an acoustic model adaptation unit. The feature selection unit takes as input the acoustic feature sequence with frame reliability output by the acoustic analysis frame reliability calculation device and outputs an acoustic feature sequence with selection flags, in which a selection flag marks each acoustic analysis frame whose reliability is at or above a selection threshold. The acoustic model adaptation unit takes the acoustic feature sequence with selection flags as input, multiplies the statistics computed from the acoustic features of frames that have no selection flag by a non-selection weight, which is a real value from 0 to 1 inclusive, updates the parameters of the initial acoustic model based on the resulting statistics, and outputs an adapted acoustic model.

The speech recognition device of the present invention comprises the acoustic analysis frame reliability calculation device described above, a feature selection unit, and a speech recognition unit. The feature selection unit takes as input the acoustic feature sequence with frame reliability output by the acoustic analysis frame reliability calculation device and outputs an acoustic feature sequence with selection flags, in which a selection flag marks each acoustic analysis frame whose reliability is at or above a selection threshold. The speech recognition unit takes the acoustic feature sequence with selection flags as input, performs speech recognition with an increased language model weight for the acoustic analysis frames that have no selection flag, and outputs the speech recognition result.

The acoustic analysis frame reliability calculation device of the present invention obtains the value of a function of the outlier score as a measure of reliability for each acoustic analysis frame, which is the smallest unit used when handling speech data. Misrecognized sections can therefore be identified accurately.

The acoustic model adaptation device of the present invention adapts the acoustic model using the acoustic feature sequence with frame reliability output by the acoustic analysis frame reliability calculation device of the present invention. It can therefore reduce the cost of collecting adaptation data and adapt the acoustic model efficiently.

The speech recognition device of the present invention performs speech recognition with an increased language model weight for the acoustic analysis frames that have no selection flag, so speech recognition accuracy can be improved.

A diagram showing an example of the functional configuration of the acoustic analysis frame reliability calculation device 100 of the present invention.
A diagram showing the operation flow of the acoustic analysis frame reliability calculation device 100.
A diagram showing the correspondence between a speech signal and the frame confidence measure.
A diagram showing an example of the functional configuration of the acoustic analysis frame reliability calculation device 200 of the present invention.
A diagram showing an example of the functional configuration of the acoustic model adaptation device 300 of the present invention.
A diagram showing an example of the functional configuration of the speech recognition device 400 of the present invention.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example of the functional configuration of the acoustic analysis frame reliability calculation device 100 of the present invention, and FIG. 2 shows its operation flow. The acoustic analysis frame reliability calculation device 100 comprises an initial acoustic model 10, a speech recognition unit 11, and a frame confidence measure calculation unit 12. The device is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The speech recognition unit 11 divides the speech signal into acoustic analysis frames of a predetermined length, extracts an acoustic feature for each frame, performs speech recognition using the acoustic features and the initial acoustic model 10, assigns the maximum-likelihood phoneme ID to each frame, and outputs the result as an acoustic feature sequence with phoneme IDs and acoustic likelihoods, in which the acoustic feature, acoustic likelihood, and phoneme ID are associated with one another (step S11). The speech signal is, for example, a digital signal sampled at 10 kHz, and is assumed to be several tens of minutes to several hours long. An acoustic analysis frame is, for example, a group of 100 such samples, which corresponds to 10 ms. With a frame length of 10 ms, a 30-minute speech signal yields acoustic features for 180,000 frames.
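The frame arithmetic in this paragraph can be checked with a short sketch. It assumes non-overlapping frames, which is what the 180,000-frame figure implies; the function name and its parameters are illustrative and do not come from the patent.

```python
def frames_in_signal(duration_s: float,
                     sample_rate_hz: int = 10_000,
                     samples_per_frame: int = 100) -> int:
    """Count the whole, non-overlapping analysis frames in a signal."""
    total_samples = int(round(duration_s * sample_rate_hz))
    return total_samples // samples_per_frame


# 100 samples at 10 kHz span 10 ms.
frame_ms = 1000 * 100 / 10_000
print(frame_ms)

# A 30-minute signal contains 180,000 such frames.
print(frames_in_signal(30 * 60))
```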

An acoustic feature is a real-valued vector. Any extraction method, such as LPC cepstrum or MFCC, may be used. A phoneme ID is an identifier that associates a phoneme of the text obtained by recognizing the speech signal with an acoustic analysis frame.

FIG. 3 illustrates the relationship between acoustic analysis frames and phoneme IDs. The first row is the speech signal, the second row is the speech recognition result, the third row is the acoustic feature sequence, the fourth row is the phoneme IDs, the fifth row is the acoustic likelihoods, and the sixth and seventh rows are the frame confidence measure and the selection flag, both described later. In this example the speech signal is "こんにちは" (konnichiwa) and the recognition result is "こんにきは" (koNnikiwa).

The phoneme ID of a frame is the phoneme, among the phonemes of the text obtained as the speech recognition result (in this example k, o, N, n, i, k, i, w, a), for which the acoustic likelihood calculated with the initial acoustic model 10 is largest. That phoneme ID is associated with each acoustic analysis frame of the acoustic feature sequence. The first to third frames are assigned "k", the fourth to ninth frames are assigned "o", and the subsequent frames are assigned "N", "n", "i", "k", "i", "w", and "a" in turn. The acoustic feature sequence with phoneme IDs and acoustic likelihoods thus consists of the mutually aligned time series of acoustic features, phoneme IDs, and acoustic likelihoods.

The speech recognition result, the acoustic feature sequence, and the acoustic likelihoods can be obtained with conventional speech recognition technology, for example that described in Reference 1 (Masataki et al., "VoiceRex, a spontaneous speech recognition technology that listens to natural conversations with customers", NTT Technical Journal, Vol. 18, No. 11, pp. 15-18, 2006).

The frame confidence measure calculation unit 12 takes as input the acoustic feature sequence with phoneme IDs and acoustic likelihoods output by the speech recognition unit 11, groups the acoustic features by phoneme ID to form acoustic feature sets, performs outlier detection on each set to obtain outlier scores, obtains the value of a function of the outlier score as a measure of reliability, and outputs an acoustic feature sequence with frame reliability, in which that function value is attached to each acoustic analysis frame (step S12 in FIG. 2).

This processing, which outputs the acoustic feature sequence with frame reliability, is repeated until no speech signal has been input for a fixed period or until an operation stop signal (not shown) is input to the control unit 15 (No in step S15). The control unit 15 controls the sequencing of steps S11 and S12 and the termination of the operation. The function of the control unit 15 is a general one and is not a special technical feature of this embodiment.

The frame confidence measure is a value indicating how likely the association between an acoustic analysis frame and its phoneme ID is to be correct. Because of recognition errors in the speech recognition unit 11, some acoustic analysis frames are associated with the wrong phoneme ID. In FIG. 3, for example, the 27th to 29th acoustic analysis frames, which are associated with the k shown in italics, are such frames. The frame confidence measure is used to detect them.

The frame confidence measure is calculated on the idea that, when the acoustic features associated with the same phoneme are collected, a feature that is an outlier cannot be trusted. The specific procedure is as follows.

First, the acoustic analysis frames of the acoustic feature sequence with phoneme IDs and acoustic likelihoods are grouped by their associated phoneme ID, producing one set of acoustic features per phoneme ID.
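A minimal sketch of this grouping step follows. The tuple layout `(phoneme_id, feature, likelihood)` and the function name are assumptions made for illustration; the patent does not specify a data structure. The frame index is kept so that a score computed within a set can later be written back to the right frame.

```python
from collections import defaultdict


def group_by_phoneme(frames):
    """Group frames by phoneme ID.

    frames: list of (phoneme_id, feature_vector, acoustic_likelihood).
    Returns {phoneme_id: [(frame_index, feature_vector), ...]}.
    """
    groups = defaultdict(list)
    for index, (phoneme_id, feature, _likelihood) in enumerate(frames):
        groups[phoneme_id].append((index, feature))
    return dict(groups)


# Toy sequence: two "k" frames, one "o" frame, then another "k" frame.
frames = [
    ("k", [0.1, 0.2], -3.0),
    ("k", [0.2, 0.1], -2.5),
    ("o", [1.0, 1.1], -1.0),
    ("k", [5.0, 5.0], -9.0),
]
groups = group_by_phoneme(frames)
print(sorted(groups))
print([index for index, _ in groups["k"]])
```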

Next, outlier detection is performed on the set D of acoustic features of a given phoneme ID (X), and an outlier score is attached to each acoustic feature. An existing method such as LOF (Local Outlier Factor) or a one-class SVM (One-Class Support Vector Machine) is used to calculate the outlier score of each vector in a set of real-valued vectors.

For example, when LOF is used, the outlier score LOF(d_i) is calculated by the following equation.
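The equation itself is not reproduced in this text. Assuming the standard LOF definition, which matches the symbols explained in the following paragraphs (N_k(d_i) in the denominator, the local density lrd), it would read:

```latex
\mathrm{LOF}(d_i) \;=\; \frac{1}{\lvert N_k(d_i)\rvert}
  \sum_{d_j \in N_k(d_i)} \frac{\mathrm{lrd}(d_j)}{\mathrm{lrd}(d_i)}
```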

Here, d_i is the i-th acoustic feature, and the set D contains M of them (1 ≤ i ≤ M). k is an integer of 1 or more and is a parameter used when calculating the outlier score LOF(d_i). It is normally given in advance as a constant with a value of about 10 to 20.

N_k(d_i) in the denominator is the set of acoustic features in D from the nearest to d_i up to the k-th nearest to d_i. Euclidean distance is used as the distance. |N_k(d_i)| is the number of acoustic features contained in N_k(d_i), and normally |N_k(d_i)| = k.

lrd(x), which represents the density of the data around an acoustic feature x, is calculated by the following equation.
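This equation is also missing from the text. Assuming the standard local reachability density, written with the kdist and dist symbols defined in the next paragraph, it would be:

```latex
\mathrm{lrd}(x) \;=\; \left(
  \frac{1}{\lvert N_k(x)\rvert}
  \sum_{y \in N_k(x)} \max\bigl\{\mathrm{kdist}(y),\ \mathrm{dist}(x, y)\bigr\}
\right)^{-1}
```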

Here, kdist(x) is the distance from an acoustic feature x to the acoustic feature k-th nearest to it, and dist(x, y) is the distance between acoustic features x and y. When a one-class SVM is used, the outlier score is the distance from the class boundary.
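As a rough illustration of the LOF-based score, the sketch below implements the standard LOF algorithm in plain Python for one phoneme's feature set D. It is not taken from the patent: the function name is hypothetical, ties at the k-th distance are broken arbitrarily so that |N_k| = k, and duplicate points (zero reachability distance) are not handled.

```python
import math


def lof_scores(D, k):
    """Standard Local Outlier Factor for every vector in D (requires k < len(D))."""
    M = len(D)
    # N_k(d_i): the k nearest other points by Euclidean distance, as (distance, index).
    neigh = []
    for i in range(M):
        others = sorted((math.dist(D[i], D[j]), j) for j in range(M) if j != i)
        neigh.append(others[:k])
    # kdist(d_i): distance from d_i to its k-th nearest neighbour.
    kdist = [n[-1][0] for n in neigh]
    # lrd(d_i): inverse of the mean reachability distance to the neighbours.
    lrd = []
    for i in range(M):
        reach = [max(kdist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF(d_i): mean neighbour density divided by the point's own density.
    return [sum(lrd[j] for _, j in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(M)]


# Five clustered features and one feature far from the cluster.
D = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5], [10.0, 10.0]]
scores = lof_scores(D, k=3)
print(max(range(len(D)), key=lambda i: scores[i]))  # index of the most outlying feature
```

Scores near 1 indicate a feature whose neighbourhood is about as dense as its neighbours' neighbourhoods; markedly larger scores indicate outliers.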

Next, the weighted sum of the acoustic likelihood and the outlier score attached to each acoustic feature is taken as the frame confidence measure of each acoustic analysis frame. With acoustic likelihood L_t and outlier score O_t, the frame confidence measure C_t is calculated by the following equation. C_t is the value of the function defined as the measure of reliability, and it is attached to every acoustic analysis frame, as shown in the sixth row of FIG. 3.

Here, α is the acoustic likelihood weight. It is a real value from 0 to 1 inclusive. Setting it to 0 makes the outlier score itself the frame confidence measure, and setting it to 1 makes the acoustic likelihood L_t itself the frame confidence measure. It is normally set to about 0.5. The acoustic likelihood weight α may be set in advance as a constant in the frame confidence measure calculation unit 12, or it may be supplied from outside, as indicated by the broken line in FIG. 1.
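The equation for C_t is not reproduced in this text. From the description (a weighted sum in which α = 0 yields the outlier score and α = 1 yields L_t), it is presumably C_t = αL_t + (1 − α)O_t. The sketch below assumes that form. It also assumes that O_t has already been oriented and scaled so that larger values mean a more reliable frame, which the text does not spell out.

```python
def frame_confidence(likelihood: float, outlier_score: float,
                     alpha: float = 0.5) -> float:
    """Assumed form C_t = alpha * L_t + (1 - alpha) * O_t."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * likelihood + (1.0 - alpha) * outlier_score


# alpha = 0 returns the outlier score, alpha = 1 returns the likelihood.
print(frame_confidence(2.0, 4.0, alpha=0.0))
print(frame_confidence(2.0, 4.0, alpha=1.0))
print(frame_confidence(2.0, 4.0))
```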

In this way, the acoustic analysis frame reliability calculation device 100 can obtain the value of the function defined as the measure of reliability for each acoustic analysis frame, which is the smallest unit used when handling speech data. In other words, misrecognized sections can be detected at the granularity of acoustic analysis frames.

Other configurations are possible for the speech recognition unit 11, which outputs the acoustic feature sequence with phoneme IDs and acoustic likelihoods. The operation of an acoustic analysis frame reliability calculation device 200, in which the speech recognition unit 11 is replaced by a speech recognition unit 21, is described next.

FIG. 4 shows an example of the functional configuration of the acoustic analysis frame reliability calculation device 200 of the present invention. The device 200 is the acoustic analysis frame reliability calculation device 100 (FIG. 1) with the speech recognition unit 11 replaced by the speech recognition unit 21.

The speech recognition unit 21 consists of speech recognition means 211 and acoustic feature alignment means 212. The speech recognition means 211 recognizes the speech signal using the initial acoustic model and outputs speech recognition text. It can be built with the conventional speech recognition technology disclosed in Reference 1 above.

The acoustic feature alignment means 212 takes as input the speech signal and the speech recognition text output by the speech recognition means 211. It divides the speech signal into acoustic analysis frames of a predetermined length and extracts the acoustic feature of each frame to generate an acoustic feature sequence. It then obtains the phoneme sequence corresponding to that acoustic feature sequence from the input speech recognition text. Among the possible assignments of the frames of the acoustic feature sequence to the phonemes of the phoneme sequence, it selects the one that maximizes the acoustic likelihood calculated with the initial acoustic model 10, associates each frame with a phoneme accordingly, and assigns a phoneme ID to each acoustic analysis frame.

In other words, the acoustic feature alignment means 212 determines how many frames each phoneme of the phoneme sequence fixed by the speech recognition text lasts. The acoustic likelihood of each acoustic analysis frame under this maximum-likelihood assignment is attached to the frame at the same time, and the resulting time series are output to the frame confidence measure calculation unit 12 as the acoustic feature sequence with phoneme IDs and acoustic likelihoods.
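The alignment described above can be sketched as a small dynamic program. This is an illustration under simplifying assumptions, not the patent's implementation: `loglik[t][p]` is a hypothetical matrix holding the log-likelihood of frame t under the p-th phoneme of the fixed sequence, every phoneme must cover at least one frame, and HMM transition probabilities are ignored.

```python
def force_align(loglik):
    """Return, for each frame, the index of the phoneme it is aligned to.

    loglik[t][p]: log-likelihood of frame t under the p-th phoneme.
    Requires at least as many frames as phonemes.
    """
    T, P = len(loglik), len(loglik[0])
    NEG = float("-inf")
    best = [[NEG] * P for _ in range(T)]   # best[t][p]: best score ending at (t, p)
    back = [[0] * P for _ in range(T)]     # back[t][p]: phoneme index at frame t - 1
    best[0][0] = loglik[0][0]              # the first frame belongs to the first phoneme
    for t in range(1, T):
        for p in range(min(t + 1, P)):
            stay = best[t - 1][p]                       # same phoneme continues
            move = best[t - 1][p - 1] if p > 0 else NEG  # advance to the next phoneme
            if move > stay:
                best[t][p], back[t][p] = move + loglik[t][p], p - 1
            else:
                best[t][p], back[t][p] = stay + loglik[t][p], p
    # Trace back from the last frame, which must belong to the last phoneme.
    path = [P - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Five frames, two phonemes: the first three frames fit phoneme 0 better.
loglik = [[-1, -5], [-1, -5], [-1, -5], [-5, -1], [-5, -1]]
alignment = force_align(loglik)
print(alignment)
```

The run lengths in the returned path correspond to the number of frames each phoneme lasts.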

A configuration with the speech recognition unit 21 can likewise obtain the value of the function defined as the measure of reliability for each acoustic analysis frame, the smallest unit used when handling speech data, and can improve the accuracy with which misrecognized sections are detected. An acoustic model adaptation device 300 and a speech recognition device 400 that use the acoustic analysis frame reliability calculation devices 100 and 200 described above are also possible. These devices are described next.

[Acoustic model adaptation device]
FIG. 5 shows an example of the functional configuration of the acoustic model adaptation device 300 of the present invention. The acoustic model adaptation device 300 comprises the acoustic analysis frame reliability calculation device 100 or 200, a feature selection unit 316, and an acoustic model adaptation unit 317.

The acoustic analysis frame reliability calculation device 100, 200 is either the acoustic analysis frame reliability calculation device 100 (FIG. 1) or the acoustic analysis frame reliability calculation device 200 (FIG. 4). The feature selection unit 316 takes as input the acoustic feature sequence with frame reliability output by the device 100 or 200 and outputs an acoustic feature sequence with selection flags, in which a selection flag marks each acoustic analysis frame whose reliability is at or above the selection threshold θ.

The bottom row of FIG. 3 shows the selection flags. A "○" indicates that the acoustic analysis frame is selected, and a "×" indicates that it is not. In the example of FIG. 3, the 27th to 29th acoustic analysis frames, where "ち" (chi) was misrecognized as "き" (ki), carry the "×" flag and are not selected.

The selection threshold θ may be set in advance as a constant in the feature selection unit 316, or it may be supplied to the feature selection unit 316 from outside. A small θ selects more acoustic analysis frames but also lets through more misrecognized frames. Conversely, a large θ makes misrecognized frames less likely to be selected but reduces the total number of selected frames.
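The thresholding and the trade-off in θ can be shown in a few lines. The function name and the sample confidence values are illustrative only.

```python
def select_frames(confidences, theta):
    """Return one selection flag per frame: True when C_t >= theta."""
    return [c >= theta for c in confidences]


confidences = [0.9, 0.8, 0.2, 0.7, 0.1]

# A low threshold keeps most frames, including doubtful ones.
loose = select_frames(confidences, theta=0.15)
# A high threshold rejects doubtful frames but keeps fewer frames overall.
strict = select_frames(confidences, theta=0.75)
print(sum(loose), sum(strict))
```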

The selection threshold θ may be a fixed value chosen from experience, or it may be calculated automatically, for example so that the frames in the lowest 10% of the frame confidence measure are not selected. To obtain a θ that excludes an arbitrary percentage, the mean μ and standard deviation σ of all frame confidence measures may be computed and θ calculated automatically by a statistical method.
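Two ways of deriving θ automatically are sketched below. The first simply ranks the confidence values. The second uses μ and σ and assumes that the confidence values are roughly normally distributed; that distributional assumption is this sketch's own, since the text says only "a statistical method". Function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def threshold_by_rank(confidences, reject_fraction=0.10):
    """Pick theta so that the lowest reject_fraction of frames falls below it."""
    ordered = sorted(confidences)
    n_reject = int(len(ordered) * reject_fraction)
    return ordered[n_reject]  # frames with C_t >= theta are selected


def threshold_by_normal_fit(confidences, reject_fraction=0.10):
    """Pick theta as a quantile of a normal distribution fitted to mu and sigma.

    Requires sigma > 0, i.e. the confidences must not all be identical.
    """
    mu, sigma = mean(confidences), pstdev(confidences)
    return NormalDist(mu, sigma).inv_cdf(reject_fraction)


confidences = [float(c) for c in range(100)]
theta_rank = threshold_by_rank(confidences)
theta_fit = threshold_by_normal_fit(confidences)
print(theta_rank, round(theta_fit, 2))
```

With tied confidence values at the threshold, the rank-based version may keep slightly more than the intended share of frames.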

The acoustic model adaptation unit 317 takes as input the acoustic feature sequence with selection flags output by the feature selection unit 316, multiplies the statistics computed from the acoustic features of frames that have no selection flag by a non-selection weight, which is a real value from 0 to 1 inclusive, updates the parameters of the initial acoustic model 10 based on the resulting statistics, and outputs the result as the adapted acoustic model. The initial acoustic model 10 is the one inside the acoustic analysis frame reliability calculation device 100 or 200, but the adaptation may instead be applied to a different base acoustic model.

The acoustic model parameters are updated using the update formulas of existing methods such as maximum likelihood estimation or maximum a posteriori (MAP) estimation. In these formulas, the statistic computed from the acoustic feature quantity of each acoustic analysis frame is multiplied by the non-selection weight w when that acoustic feature quantity is not selected.

The non-selection weight w is a real value from 0 to 1. With w = 0, the unselected acoustic feature quantities are not used at all when the updated parameters are computed. With w = 1, the updated parameters are the same as those of ordinary acoustic model adaptation, which ignores the selection flags. With a value between 0 and 1, the updated parameters are computed with the influence of the unselected acoustic feature quantities reduced. The non-selection weight w is normally set to a value of about 0 to 0.5.

Specifically, when maximum likelihood estimation is used, the formulas for the updated mean and variance parameters of the acoustic model given in Reference 2 (Koichi Shinoda, "Speaker Adaptation Techniques for Speech Recognition Using Probabilistic Models", D-II, Information and Systems, II-Pattern Processing, J87-D-II(2), 371-386, 2004-02-01) are changed to the following equations.

The changes to the equations are that the statistic (inside the summation) for the acoustic feature quantity x_t of each acoustic analysis frame is multiplied by the weight w_t of the t-th acoustic analysis frame, and that the total number of frames T, the denominator of equations (5) and (6), is replaced by T_select + wT_reject. The weight w_t is 1 if the t-th acoustic analysis frame is selected, and the supplied non-selection weight w if it is not. T_select is the total number of selected acoustic analysis frames and T_reject is the total number of unselected acoustic analysis frames, so that T_select + T_reject = T. T_select + wT_reject is therefore a total frame count in which the unselected acoustic analysis frames are discounted. The prime (′) in equation (6) denotes transposition.
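
Equations (5) and (6) themselves are not reproduced in this text. The following Python sketch is one reading of the description above, assuming the usual sample mean and covariance form for a single Gaussian: each frame's contribution is multiplied by w_t and the sums are divided by T_select + wT_reject. The function name `weighted_ml_update` and its argument names are hypothetical, and details such as state occupancy weighting in Reference 2 are not modelled.

```python
def weighted_ml_update(frames, selected, w=0.3):
    """Weighted maximum likelihood mean and covariance for one Gaussian.

    frames:   list of T feature vectors x_t (each a list of D floats)
    selected: list of T booleans (True = frame carries the selection flag)
    w:        non-selection weight, a real value in [0, 1]
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be in [0, 1]")
    # w_t = 1 for selected frames, w for unselected frames
    weights = [1.0 if s else w for s in selected]
    # sum of w_t equals T_select + w * T_reject
    denom = sum(weights)
    if denom == 0.0:
        raise ValueError("no frame has a non-zero weight")
    dim = len(frames[0])
    mean = [sum(wt * x[d] for wt, x in zip(weights, frames)) / denom
            for d in range(dim)]
    # weighted sum of outer products (x_t - mean)(x_t - mean)'
    cov = [[sum(wt * (x[i] - mean[i]) * (x[j] - mean[j])
                for wt, x in zip(weights, frames)) / denom
            for j in range(dim)]
           for i in range(dim)]
    return mean, cov


frames = [[0.0, 0.0], [2.0, 2.0], [100.0, 100.0]]
selected = [True, True, False]          # the third frame is an outlier
mean_w0, cov_w0 = weighted_ml_update(frames, selected, w=0.0)
mean_w1, cov_w1 = weighted_ml_update(frames, selected, w=1.0)
```

With w = 0 the outlying third frame has no effect on the estimate, while with w = 1 the result is the ordinary unweighted estimate, matching the two limiting cases described for the non-selection weight.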

When maximum a posteriori estimation is used, the formulas for the updated mean and variance parameters of the acoustic model given in Reference 2 are changed to the following equations.

As in the maximum likelihood case, the changes are that the statistic for the acoustic feature quantity x_t of each acoustic analysis frame is multiplied by the weight w_t of the t-th acoustic analysis frame, and that the total number of frames T in the denominator is replaced by T_select + wT_reject.
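
The MAP equations are likewise not reproduced in this text. As a rough sketch only, the following Python code applies the same two changes to the textbook MAP estimate of a Gaussian mean, in which a prior mean is interpolated with the data using a prior weight. This textbook form, the parameter `tau`, and the name `weighted_map_mean` are assumptions made for illustration and may differ from the formulas in Reference 2. The variance update is omitted.

```python
def weighted_map_mean(frames, selected, prior_mean, tau=10.0, w=0.3):
    """Weighted MAP estimate of a Gaussian mean (illustrative form only).

    frames:     list of T feature vectors x_t (each a list of D floats)
    selected:   list of T booleans (True = frame carries the selection flag)
    prior_mean: mean of the model before adaptation (list of D floats)
    tau:        hypothetical prior weight (>= 0); larger values trust the prior more
    w:          non-selection weight, a real value in [0, 1]
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be in [0, 1]")
    if tau < 0.0:
        raise ValueError("tau must be non-negative")
    weights = [1.0 if s else w for s in selected]   # w_t
    denom = tau + sum(weights)                      # tau + T_select + w * T_reject
    if denom == 0.0:
        raise ValueError("no prior weight and no weighted frames")
    return [(tau * m0 + sum(wt * x[d] for wt, x in zip(weights, frames))) / denom
            for d, m0 in enumerate(prior_mean)]


frames = [[2.0], [4.0], [100.0]]
selected = [True, True, False]
map_w0 = weighted_map_mean(frames, selected, prior_mean=[0.0], tau=2.0, w=0.0)
ml_like = weighted_map_mean(frames, selected, prior_mean=[0.0], tau=0.0, w=0.0)
```

With `tau = 0` the prior drops out and the sketch reduces to the weighted maximum likelihood mean.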

In the acoustic model adaptation device 300, the feature quantity selection unit 316 attaches to the acoustic feature quantity sequence selection flags indicating whether each frame is to be used for acoustic model adaptation, and the acoustic model adaptation unit 317 performs acoustic model adaptation using acoustic feature quantities from which misrecognized segments have been correctly excluded. A high adaptation effect is therefore obtained in the acoustic model adaptation unit 317, and the adapted acoustic model can achieve high speech recognition accuracy.

[Speech recognition device]
FIG. 6 shows an example of the functional configuration of the speech recognition device 400 of the present invention. The speech recognition device 400 comprises the acoustic analysis frame reliability calculation device 100, 200, the feature quantity selection unit 316, a speech recognition unit 418, an acoustic model 419, and a language model 420. As the reference numerals indicate, the acoustic analysis frame reliability calculation device 100, 200 and the feature quantity selection unit 316 are the same as those of the acoustic model adaptation device 300 described above.

The speech recognition unit 418 takes as input the acoustic feature quantity sequence with selection flags output by the feature quantity selection unit 316 and outputs a speech recognition result using the acoustic model 419 and the language model 420. For acoustic analysis frames without a selection flag, it performs speech recognition with the weight of the language model 420 increased. That is, because the acoustic feature quantities of such frames are likely to contain errors, the part of the score that depends on the acoustic feature quantities is given less importance: the weight of the language model 420 is made larger than the weight of the acoustic model 419 when the score of that acoustic analysis frame is computed. As a result, improved speech recognition accuracy can be expected.
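
The text does not spell out how the language model weight enters the score. One way to picture it, assuming a log-linear combination of acoustic and language model log scores (a common convention in speech decoders, not something stated in this document), is the Python sketch below. The function `frame_score`, the parameter `lm_boost`, and all numeric values are hypothetical.

```python
def frame_score(am_loglik, lm_logprob, selected,
                am_weight=1.0, lm_weight=1.0, lm_boost=2.0):
    """Combine acoustic and language model log scores for one frame.

    For frames without the selection flag, the language model weight is
    multiplied by `lm_boost` (> 1), so the acoustic evidence counts
    relatively less in the combined score.
    """
    if lm_boost < 1.0:
        raise ValueError("lm_boost should be >= 1 to favour the language model")
    effective_lm_weight = lm_weight if selected else lm_weight * lm_boost
    return am_weight * am_loglik + effective_lm_weight * lm_logprob


# Same raw scores, with and without the selection flag.
score_selected = frame_score(-5.0, -2.0, selected=True)
score_unselected = frame_score(-5.0, -2.0, selected=False)
```

Scaling down `am_weight` for unselected frames instead would have a similar effect on the relative ranking of hypotheses. Which of the two a real decoder uses is an implementation choice that the text leaves open.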

According to the acoustic analysis frame reliability calculation devices 100 and 200 described above, the value of a function of the outlier score is obtained as a reliability measure for each acoustic analysis frame, the smallest unit in which speech data is handled, so misrecognized segments can be identified accurately. The acoustic model adaptation device 300, which includes the acoustic analysis frame reliability calculation device 100, 200, adapts the acoustic model using the acoustic feature quantity sequence with frame reliability, so the cost of collecting adaptation data is reduced and the acoustic model can be adapted efficiently. The speech recognition device 400 performs speech recognition with the language model weight increased for acoustic analysis frames without a selection flag, so speech recognition accuracy can be improved.

When the processing means of the above devices are implemented by a computer, the processing of the functions that each device should have is described by a program. Executing this program on the computer realizes the processing means of each device on the computer.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

Each means may be implemented by executing a predetermined program on a computer, or at least part of the processing may be implemented in hardware.

Claims

A speech recognition unit that divides a speech signal into acoustic analysis frames of a predetermined time length, extracts an acoustic feature quantity for each acoustic analysis frame, calculates an acoustic likelihood using the acoustic feature quantity and an initial acoustic model, assigns the maximum-likelihood phoneme ID to each acoustic analysis frame, and outputs an acoustic feature quantity sequence with phoneme ID and acoustic likelihood in which the acoustic feature quantity, the acoustic likelihood, and the phoneme ID are associated with one another;
A frame confidence measure calculation unit that takes the acoustic feature quantity sequence with phoneme ID and acoustic likelihood as input, classifies the acoustic feature quantities by phoneme ID to create acoustic feature quantity sets, performs outlier detection on each acoustic feature quantity set to obtain an outlier score for each acoustic feature quantity, obtains the value of a function of the outlier score as a measure of reliability, and outputs an acoustic feature quantity sequence with frame reliability in which the value of the function is assigned to each acoustic analysis frame;
An acoustic analysis frame reliability calculation device comprising:

In the acoustic analysis frame reliability calculation device according to claim 1,
The speech recognition unit
Speech recognition means for recognizing a speech signal using an initial acoustic model and outputting speech recognition text;
Acoustic feature quantity alignment means that takes the speech signal and the speech recognition text as input, divides the speech signal into acoustic analysis frames of a predetermined time length, extracts the acoustic feature quantity of each acoustic analysis frame to generate an acoustic feature quantity sequence, acquires a phoneme sequence from the speech recognition text, assigns to each acoustic analysis frame, as its phoneme ID, the phoneme with the maximum acoustic likelihood under the initial acoustic model, and outputs an acoustic feature quantity sequence with phoneme ID and acoustic likelihood in which the acoustic feature quantity, the acoustic likelihood, and the phoneme ID are associated with one another;
An acoustic analysis frame reliability calculation device characterized by comprising:

The acoustic analysis frame reliability calculation device according to claim 1 or 2,
A feature quantity selection unit that takes as input the acoustic feature quantity sequence with frame reliability output by the acoustic analysis frame reliability calculation device, attaches a selection flag, indicating that the frame is selected, to each acoustic analysis frame whose reliability is equal to or higher than a selection threshold, and outputs an acoustic feature quantity sequence with selection flags;
An acoustic model adaptation unit that takes the acoustic feature quantity sequence with selection flags as input, multiplies the statistics computed from the acoustic feature quantities of the acoustic analysis frames without the selection flag by a non-selection weight that is a real value from 0 to 1 to obtain updated statistics, updates the parameters of the initial acoustic model based on the updated statistics, and outputs an adapted acoustic model;
An acoustic model adaptation apparatus comprising:

The acoustic analysis frame reliability calculation device according to claim 1 or 2,
A feature quantity selection unit that takes as input the acoustic feature quantity sequence with frame reliability output by the acoustic analysis frame reliability calculation device, attaches a selection flag, indicating that the frame is selected, to each acoustic analysis frame whose reliability is equal to or higher than a selection threshold, and outputs an acoustic feature quantity sequence with selection flags;
A speech recognition unit that receives the acoustic feature quantity sequence with the selection flag as input, increases the weight of the language model for the acoustic analysis frame without the selection flag, and performs speech recognition processing and outputs a speech recognition result;
A speech recognition apparatus comprising:

A speech recognition process in which a speech recognition unit divides a speech signal into acoustic analysis frames of a predetermined time length, extracts an acoustic feature quantity for each acoustic analysis frame, calculates an acoustic likelihood using the acoustic feature quantity and an initial acoustic model, assigns the maximum-likelihood phoneme ID to each acoustic analysis frame, and outputs an acoustic feature quantity sequence with phoneme ID and acoustic likelihood in which the acoustic feature quantity, the acoustic likelihood, and the phoneme ID are associated with one another;
A frame confidence measure calculation process in which a frame confidence measure calculation unit takes the acoustic feature quantity sequence with phoneme ID and acoustic likelihood as input, classifies the acoustic feature quantities by phoneme ID to create acoustic feature quantity sets, performs outlier detection on each acoustic feature quantity set to obtain an outlier score for each acoustic feature quantity, obtains the value of a function of the outlier score as a measure of reliability, and outputs an acoustic feature quantity sequence with frame reliability in which the value of the function is assigned to each acoustic analysis frame;
Acoustic analysis frame reliability calculation method including

A program for causing a computer to execute the functions of each unit of the acoustic analysis frame reliability calculation device according to claim 1, the acoustic model adaptation device according to claim 3, and the speech recognition device according to claim 4.