JP5427140B2 - Speech recognition method, speech recognition apparatus, and speech recognition program

Info

Publication number: JP5427140B2
Application number: JP2010171020A
Authority: JP
Inventors: 哲小橋川; 太一浅見; 義和山口; 浩和政瀧; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-07-29
Filing date: 2010-07-29
Publication date: 2014-02-26
Anticipated expiration: 2030-07-29
Also published as: JP2012032538A

Description

The present invention relates to a speech recognition method, a speech recognition apparatus, and a speech recognition program that efficiently recognize speech data of varying sound quality.

In recent years, memory devices for recording speech data have become inexpensive, making large amounts of speech data easy to obtain. When such speech data is recognized, the recognition accuracy and processing time vary greatly with the quality of the data.

For this reason, methods that attach a reliability to each speech recognition result have been studied as a way to limit the problems caused by recognition errors. Patent Document 1, for example, is known prior art that attaches a reliability to speech recognition results. FIG. 1 shows the functional configuration of the speech recognition apparatus 900 of Patent Document 1. The speech recognition apparatus 900 includes an acoustic analysis unit 120, an acoustic model storage unit 140, a dictionary/language model storage unit 150, a search unit 160, and a reliability calculation unit 190.

The acoustic analysis unit 120 analyzes the input speech signal 110 in units called frames, each several tens of milliseconds long, for example by mel-frequency cepstral coefficient (MFCC) analysis, and generates an acoustic feature parameter sequence 130. The search unit 160 searches for speech recognition result candidates for the acoustic feature parameter sequence 130 using the acoustic model storage unit 140 and the dictionary/language model storage unit 150. The search outputs the top N speech recognition results 170 and a score 180 for each result.

The reliability calculation unit 190 calculates and outputs a reliability score 195 for each of the speech recognition results 170, based on the results 170 and the scores 180. The reliability score 195 is obtained, for example, from the N-best candidates obtained as speech recognition results and from simple differences and arithmetic means of their scores.

By referring to the reliability score 195, the corresponding speech recognition result 170 can be discarded, or the speaker can be asked to confirm the result. Such processing limited the problems caused by misrecognition.

JP 2005-148342 A (Japanese Unexamined Patent Application Publication No. 2005-148342)

However, the conventional speech recognition apparatus 900 calculates the reliability score from the speech recognition result, and from the scores accompanying it, after speech recognition has been performed. Obtaining a reliability score therefore costs the full processing time of speech recognition. Some speech data has low recognition accuracy, for example because of a poor S/N ratio, and is effectively unusable: even if the beam width of the search is widened or unsupervised adaptation is applied, the output consists almost entirely of misrecognitions and the accuracy cannot be improved. Calculating the reliability score from post-recognition scores therefore wastes processing time on unusable speech data. In addition, when a large number of audio files are processed, files with low recognition accuracy take a long time, processing of the other files with high recognition accuracy is held up, and the efficiency of the recognition process as a whole drops. Finally, because the processing is based on recognition results obtained with a language model, the value of the reliability score depends on that language model.

The present invention has been made in view of these problems. Its object is to provide a speech recognition apparatus, a speech recognition method, and a speech recognition program that can calculate a reliability score in a short processing time without performing speech recognition, and that output a reliability score independent of the language model.

To solve the above problems, the speech recognition method according to the present invention analyzes the speech features of a digital speech signal frame by frame to obtain a speech feature sequence. For each frame, using the speech feature o_t, it finds the state s of the monophone HMMs for which the product of the output probability b_s(o_t), obtained from the GMM belonging to that state, and the appearance probability P(s) of that state is highest. The difference between the logarithm of this highest product P(s^)b_{s^}(o_t), or the logarithm of the output probability b_{s^}(o_t), and the logarithm of the highest output probability b_{g^}(o_t) obtained for the same input from the GMM belonging to the state of the speech model or the GMMs belonging to the states of the pause model HMM is taken as the prior reliability of the frame. The prior reliabilities are averaged to obtain a reliability score for each audio file, and speech recognition is performed on the speech feature sequence based on the reliability score.

The speech recognition apparatus according to the present invention includes a feature analysis unit, a prior reliability score calculation unit, and a speech recognition processing unit. The feature analysis unit analyzes the speech features of an input digital speech signal frame by frame and outputs a speech feature sequence. The prior reliability score calculation unit takes the speech feature of each frame as input and finds the state s of the monophone HMMs for which the product of the output probability b_s(o_t), obtained from the GMM belonging to that state, and the appearance probability P(s) is highest. It takes as the prior reliability of the frame the difference between the logarithm of this highest product P(s^)b_{s^}(o_t), or the logarithm of the output probability b_{s^}(o_t), and the logarithm of the highest output probability b_{g^}(o_t) obtained for the same input from the GMM belonging to the state of the speech model or the GMMs belonging to the states of the pause model HMM, and it averages the prior reliabilities to output a reliability score for each audio file. The speech recognition processing unit takes the speech feature sequence as input and performs speech recognition based on the reliability score.

The present invention estimates, before speech recognition is performed, the reliability of the recognition result that speech recognition would produce, and performs speech recognition based on the estimated reliability. This reduces the processing time spent on unusable speech data. It also allows speech data with high reliability, that is, speech data expected to be recognized accurately, to be processed first, which improves the efficiency of the recognition process as a whole. Furthermore, because no language model is used in obtaining the reliability, the resulting (prior) reliability does not depend on the language model.

FIG. 1 is a diagram showing the functional configuration of the conventional speech recognition apparatus 900 disclosed in Patent Document 1.
FIG. 2 is a diagram showing an example of a phoneme model.
FIG. 3 is a diagram schematically showing one state of a phoneme model.
FIG. 4 is a diagram showing a functional configuration example of the speech recognition apparatuses 100 and 200.
FIG. 5 is a diagram showing the operation flow of the speech recognition apparatus 100.
FIG. 6 is a diagram showing a functional configuration example of the prior reliability score calculation units 30 and 30'.
FIG. 7 is a diagram schematically showing the time course of the monophone maximum likelihood state and the speech/pause maximum likelihood state.
FIG. 8 is a diagram showing the case of FIG. 7 with two kinds of acoustic models.
FIG. 9 is a diagram showing experimental results.
FIG. 10 is a diagram showing a functional configuration example of the prior reliability score calculation unit 230.
FIG. 11 is a diagram schematically showing the relationship between speech features and likelihood (or output probability), to explain the basic idea of the second embodiment.
FIG. 12 is a diagram showing a functional configuration example of the speech recognition apparatus 300.
FIG. 13 is a diagram showing an example of the relationship between the reliability score C and the beam search width N(C).
FIG. 14 is a diagram showing a functional configuration example of the speech recognition apparatus 400.
FIG. 15 is a diagram showing the operation flow of the speech recognition apparatus 400.
FIG. 16 is a diagram showing a functional configuration example of the speech recognition apparatus 500.
FIG. 17 is a diagram showing the operation flow of the speech recognition apparatus 500.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings are given the same reference numerals, and their description is not repeated. Before the embodiments are described, the basic concept of the invention is explained.
[Basic concept of the invention]
A general reliability measure is expressed by the following word posterior probability P(W^|O).
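Equation (1) is not reproduced in this text. Assuming the usual Bayes-rule form of a word posterior, which is consistent with the symbols O, W, and P(W) used in this passage, it would read as follows (\hat{W} corresponds to W^ in the text):

```latex
P(\hat{W} \mid O) = \frac{P(O \mid \hat{W})\, P(\hat{W})}{\sum_{W} P(O \mid W)\, P(W)} \tag{1}
```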

Here, O is the acoustic feature sequence (O = (o_1, o_2, ..., o_T)), W is a word sequence of the speech recognition result, and P(W) is the appearance probability of the word sequence W, obtained from the speech recognition result. The hat "^" marks the word or state with the highest likelihood, or the word sequence or state sequence with the highest likelihood.

Obtaining the word sequence W with a language model that includes a large-vocabulary dictionary requires the enormous computation of full speech recognition. To reduce this computation, the present invention does not use a language model and uses a state sequence S in place of the word sequence W. The word posterior probability P(W^|O) is then approximated by the following equation.
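Equation (2) does not survive in this text. Assuming a Bayes-rule form in which the state sequence S takes the place of the word sequence W, as the paragraph describes, it would be:

```latex
P(\hat{W} \mid O) \approx P(\hat{S} \mid O) = \frac{P(O \mid \hat{S})\, P(\hat{S})}{\sum_{S} P(O \mid S)\, P(S)} \tag{2}
```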

Ideally, the state sequences S comprise every state sequence that can arise from all possible states s_j (j = 1, 2, ..., J). To reduce computation, however, speed-up techniques used in speech recognition may be introduced so that unlikely states are excluded from the calculation in advance.

To speed up the calculation further, each state s_j in the state sequence S is restricted to the states contained in monophones. A monophone is a context-independent phoneme model: unlike a context-dependent phoneme model (for example, a triphone), which is constrained by the preceding and following phonemes, it carries no such constraint, so there are few distinct models. For example, with 30 phonemes, a monophone acoustic model contains 30 phoneme models, whereas a triphone model contains 30^3 (27,000). The monophones here are assumed to include a pause model, which models the non-speech portion of the signal. A monophone phoneme model is built as a probabilistic chain of one or more states (usually about three) and is represented as a monophone HMM (hidden Markov model). The monophone HMM is, for example, a left-to-right HMM as shown in FIG. 2. FIG. 2 shows three states s_1 (first state), s_2 (second state), and s_3 (third state) in sequence; the state transitions consist of the self-transitions a_11, a_22, and a_33 and the transitions to the next state a_12, a_23, and a_34.
Each state s consists of a mixture distribution made up of one or more component distributions (hereinafter, "mixture distribution" includes the Gaussian mixture model, GMM) and is represented, for example, as the Gaussian mixture M shown in FIG. 3. The Gaussian mixture M consists, for example, of three component normal distributions N(μ_{s,1}, Σ_{s,1}), N(μ_{s,2}, Σ_{s,2}), and N(μ_{s,3}, Σ_{s,3}), where μ_{s,m} is the mean vector and Σ_{s,m} the covariance matrix of the normal distribution m belonging to state s.

To reduce the computation in equation (2) further, the present invention ignores the transition probabilities, just as many speech recognition decoders do (see Reference 1), and estimates the reliability of each frame using only the output probabilities obtained from the GMMs belonging to the states of the monophone HMMs (hereinafter simply "monophone GMMs").
[Reference 1] J. R. Glass, "A probabilistic framework for segment-based speech recognition", Computer Speech and Language, Elsevier, 2003, Vol. 17, No. 2-3, pp. 137-152
The state posterior probability P(S^|O) of equation (2) is therefore calculated approximately, as follows, from the per-frame state posterior probabilities P(s^|o_t) for the acoustic feature o_t at time t.
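Equation (3) is missing from this text. One form consistent with the description, assuming the frames are treated as independent once the transition probabilities are dropped, is a product over the T frames; the exact form in the original publication may differ:

```latex
P(\hat{S} \mid O) \approx \prod_{t=1}^{T} P(\hat{s} \mid o_t) \tag{3}
```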

Here, T is the total number of frames. The per-frame state posterior probability P(s^|o_t) is in turn calculated for each frame from the output probability b_s(o_t) of each state s, as follows.
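The two equations referred to here are not reproduced in this text. Judging from the symbols defined immediately afterwards (the maximum likelihood state, the mixture count M_s, the weights w_{s,m}, and the component densities N_{s,m}) and from the later reference to the denominator of equation (4), they can be sketched as below. The number (5) for the second equation is an assumption.

```latex
P(\hat{s} \mid o_t) = \frac{P(\hat{s})\, b_{\hat{s}}(o_t)}{\sum_{s} P(s)\, b_{s}(o_t)} \tag{4}

b_{s}(o_t) = \sum_{m=1}^{M_s} w_{s,m}\, N_{s,m}(o_t \mid \mu_{s,m}, \Sigma_{s,m}) \tag{5}
```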

Here, s^ is the state for which P(s)·b_s(o_t) is highest at time t (hereinafter the "maximum likelihood state s^"), M_s is the number of mixture components belonging to state s, w_{s,m} is the mixture weight of the normal distribution m, N_{s,m}(·) is the Gaussian density function of the normal distribution m, and N_{s,m}(o_t | μ_{s,m}, Σ_{s,m}) is the output probability of the normal distribution m belonging to state s for the acoustic feature o_t at time t. The weight w_{s,m} is determined by acoustic model training and takes a value in the range 0 ≤ w_{s,m} ≤ 1. For example, if the number of mixture components M_s is 16, the weights average 1/16.
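As an illustration of the output probability b_s(o_t) of one state, the following minimal Python sketch evaluates a weighted sum of Gaussian component densities. It assumes diagonal covariance matrices, which the text does not specify, and all function names and parameter values are hypothetical.

```python
import math

def log_gaussian_diag(o, mean, var):
    """Log density of a normal distribution with diagonal covariance (an assumption)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )

def gmm_output_prob(o, weights, means, variances):
    """b_s(o_t): weighted sum of the component densities of one state."""
    return sum(
        w * math.exp(log_gaussian_diag(o, m, v))
        for w, m, v in zip(weights, means, variances)
    )

# Hypothetical one-dimensional state with three components (M_s = 3).
weights = [0.5, 0.3, 0.2]
means = [[0.0], [1.0], [-1.0]]
variances = [[1.0], [1.0], [1.0]]
b = gmm_output_prob([0.0], weights, means, variances)
```

A practical implementation would normally stay in the log domain throughout to avoid underflow with high-dimensional features; the direct sum is used here only to mirror the formula.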

Reference 2 speeds up recognition by using monophones to reduce the amount of acoustic likelihood computation, on the assumption that a monophone is an approximation of a context-dependent phoneme model (triphone). The present invention likewise achieves a speed-up by using only monophones in the calculation of equation (4).
[Reference 2] A. Lee, T. Kawahara, K. Shikano, "Gaussian mixture selection using context-independent HMM", in Proceedings of ICASSP, 2001, vol. 1, pp. 69-72
The denominator Σ_s P(s)b_s(o_t) of equation (4) is approximated as follows, using a speech model consisting of a speech GMM trained on the features of all phonemes other than pauses.
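The approximation itself, equation (6), is not reproduced in this text. Given that the paragraph after it introduces a single speech-model state g with appearance probability P(g), a consistent reconstruction would be:

```latex
\sum_{s} P(s)\, b_{s}(o_t) \approx P(g)\, b_{g}(o_t) \tag{6}
```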

Here, g is the state belonging to the speech model; it is trained on the acoustic features of all phonemes, in other words, of all states. If the speech model is built to have only this single state g, the appearance probability P(g) of g is 1 in a speech frame. Therefore,
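The resulting equation (6)' is missing from this text. With P(g) = 1 as stated, the denominator of equation (4) would reduce to the output probability of the speech model alone:

```latex
\sum_{s} P(s)\, b_{s}(o_t) \approx b_{g}(o_t) \tag{6'}
```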

From equations (4) and (6)', the per-frame state posterior probability P(s^|o_t) is therefore calculated approximately by the following equation.
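Equation (7) is not reproduced in this text. Substituting the approximated denominator b_g(o_t) into the per-frame posterior would give:

```latex
P(\hat{s} \mid o_t) \approx \frac{P(\hat{s})\, b_{\hat{s}}(o_t)}{b_{g}(o_t)} \tag{7}
```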

Because speech recognition normally works with probability values converted to the log-score domain, the prior reliability c(o_t) of each frame is defined as the per-frame state posterior probability P(s^|o_t), obtained approximately by equation (7), converted to the log-score domain as in the following equation.
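Equation (8) is missing from this text, but its form can be recovered from the worked examples for individual frames given in the first embodiment, which take the logarithm of P(s)b_s(o_t) minus the logarithm of the speech or pause output probability:

```latex
c(o_t) = \log\bigl(P(\hat{s})\, b_{\hat{s}}(o_t)\bigr) - \log b_{g}(o_t) \tag{8}
```

In the first embodiment, b_g(o_t) is replaced by the highest output probability b_{g^}(o_t) among the speech model state and the pause model states.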

If the speech model is regarded as a UBM (universal background model) and the state appearance probability P(s^) is ignored, the per-frame prior reliability c(o_t) is equivalent to the logarithm of the likelihood ratio often used in speaker verification, as seen for example in Reference 3. In the present invention, introducing the state appearance probability P(s^) means that the estimation of the maximum likelihood state s^ takes into account how often each state, and hence each phoneme, appears.
[Reference 3] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models", Digital Signal Processing, 2000, vol. 10, pp. 19-41
The reliability score C is calculated from the per-frame prior reliabilities c(o_t). To allow speech data of different lengths to be compared, it is normalized by the total number of frames T, as follows.
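The normalization equation is not reproduced in this text. Since the first embodiment describes C as the average prior reliability per frame, it would be the mean over the T frames; the equation number (9) is an assumption.

```latex
C = \frac{1}{T} \sum_{t=1}^{T} c(o_t) \tag{9}
```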

Based on this concept, the present invention obtains a reliability score from monophones and speech data, without using a speech recognition result.
Embodiments of the present invention are described in detail below.

<Speech recognition apparatus 100>
The speech recognition apparatus 100 according to the first embodiment is described with reference to FIGS. 4 and 5. The speech recognition apparatus 100 includes an A/D conversion unit 10, a feature analysis unit 20, a prior reliability score calculation unit 30, a speech recognition processing unit 40, an acoustic model parameter memory 50, and a language model parameter memory 60. The speech recognition apparatus 100 is realized by loading a predetermined program into a computer made up of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The A/D conversion unit 10 discretizes the speech signal x(u), for example at a sampling frequency of 16 kHz, and converts it into a digital speech signal x(v) (step S10). Here, u denotes continuous time and v denotes discrete time. When the digital speech signal x(v) is input directly, the A/D conversion unit 10 is unnecessary.

The feature analysis unit 20 takes the digital speech signal x(v) as input, treats, for example, 320 samples of x(v) as one frame (for example, 20 ms), analyzes the speech feature o_t of each frame, and outputs the speech feature sequence O (step S20). The speech features used are, for example, MFCCs (mel-frequency cepstral coefficients) of orders 1 to 12, dynamic parameters such as their rate of change ΔMFCC, and power and Δpower. Processing such as cepstral mean normalization (CMN) may also be applied.
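The framing arithmetic in this step can be checked with a short Python sketch: 320 samples at 16 kHz correspond to 20 ms. The sketch assumes consecutive, non-overlapping frames and drops an incomplete final frame; the text does not state the frame shift, and the function name is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000   # example sampling frequency from the text
FRAME_LENGTH = 320        # samples per frame, i.e. 20 ms at 16 kHz

def split_into_frames(samples, frame_length=FRAME_LENGTH):
    """Cut a sample sequence into consecutive, non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped (an assumption).
    """
    n_frames = len(samples) // frame_length
    return [samples[i * frame_length:(i + 1) * frame_length] for i in range(n_frames)]

frame_ms = 1000.0 * FRAME_LENGTH / SAMPLE_RATE_HZ   # duration of one frame in ms
one_second = [0.0] * SAMPLE_RATE_HZ                 # one second of silence
frames = split_into_frames(one_second)
```

Each resulting frame would then be passed to the feature analysis (for example, MFCC extraction), which is outside the scope of this sketch.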

The prior reliability score calculation unit 30 takes the speech feature sequence O as input and, for the speech feature o_t of each frame, finds the highest product of the output probability b_s(o_t) obtained from a monophone GMM and the appearance probability P(s) of the state s to which that GMM belongs (hereinafter the "monophone maximum likelihood value P(s^)b_{s^}(o_t)"). The unit 30 also finds the highest output probability obtained for the input o_t from the GMM belonging to the state of the speech model or the GMMs belonging to the states of the pause model HMM (hereinafter the "speech/pause GMMs"); this is the "speech/pause maximum likelihood value b_{g^}(o_t)". As described above, the speech model is trained on the features of all phonemes other than pauses. The unit then takes the difference between the logarithm of the monophone maximum likelihood value P(s^)b_{s^}(o_t) and the logarithm of the speech/pause maximum likelihood value b_{g^}(o_t) as the prior reliability c(o_t) of the frame (see equation (8)), averages the prior reliabilities c(o_t) to obtain the reliability score C of the audio file, and outputs it (step S30).
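Step S30 can be sketched in Python as follows, assuming the per-state output probabilities for each frame have already been computed. The function names, the tiny three-state inventory, and the probability values are hypothetical and serve only to show the arithmetic.

```python
import math

def frame_prior_reliability(mono_probs, state_priors, bg_probs):
    """c(o_t) = log(max_s P(s) b_s(o_t)) - log(max_g b_g(o_t)) for one frame.

    mono_probs:   b_s(o_t) for each monophone state s
    state_priors: P(s) for the same states, in the same order
    bg_probs:     b_g(o_t) for the speech-model state and the pause-model states
    """
    mono_best = max(p * b for p, b in zip(state_priors, mono_probs))
    bg_best = max(bg_probs)
    return math.log(mono_best) - math.log(bg_best)

def reliability_score(frame_values):
    """C: mean of the per-frame prior reliabilities over the audio file."""
    return sum(frame_values) / len(frame_values)

# Hypothetical two-frame file with three monophone states.
priors = [0.5, 0.25, 0.25]
frames = [
    ([0.2, 0.8, 0.1], [0.4, 0.1]),   # (monophone b_s values, speech/pause b_g values)
    ([0.6, 0.1, 0.1], [0.3, 0.3]),
]
c_values = [frame_prior_reliability(b, priors, g) for b, g in frames]
C = reliability_score(c_values)
```

In the first frame the best product is 0.25 × 0.8 = 0.2 against a best speech/pause value of 0.4, so c is log 0.5; in the second the two values are equal and c is 0. A real system would work with log probabilities directly rather than multiplying raw densities.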

The speech recognition processing unit 40 takes the speech feature sequence O and the reliability score C as input and performs speech recognition based on the reliability score. For example, it decides from the reliability score C whether to perform speech recognition (step S40a). If it decides to do so, it refers to the acoustic model stored in the acoustic model parameter memory 50 and the language model stored in the language model parameter memory 60, performs speech recognition on the speech feature sequence O, and outputs the speech recognition result W and the reliability score C (step S40b).
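The decision in steps S40a and S40b can be illustrated with the Python sketch below. It assumes a simple threshold on C as the decision rule; the text says only that the decision depends on C. The threshold value, function names, and stand-in recognizer are all hypothetical.

```python
def recognize_if_reliable(features, score, threshold, recognizer):
    """Run the recognizer only when the reliability score C reaches the threshold.

    Returns (result, score); result is None when recognition is skipped.
    """
    if score < threshold:                   # step S40a: decide whether to recognize
        return None, score
    return recognizer(features), score      # step S40b: output result W together with C

def dummy_recognizer(features):
    """Hypothetical stand-in; a real recognizer would use the acoustic and
    language models held in memories 50 and 60."""
    return f"<{len(features)} frames recognized>"

THRESHOLD = -1.0  # hypothetical value; the text does not give one
accepted = recognize_if_reliable([[0.0]] * 3, -0.4, THRESHOLD, dummy_recognizer)
skipped = recognize_if_reliable([[0.0]] * 3, -2.5, THRESHOLD, dummy_recognizer)
```

Because C is an averaged log posterior, larger values indicate higher expected recognition accuracy, which is why files below the threshold are the ones skipped.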

The speech recognition process of steps S40a and S40b is repeated until all frames of the audio file have been processed.

In the speech recognition apparatus 100, the prior reliability score calculation unit 30 assigns a prior reliability c(o_t) to each frame and averages these values (that is, computes the average prior reliability per frame) to obtain the reliability score C of the audio file. A reliability score C based on the speech feature sequence O requires less computation than the conventional approach of deriving a reliability score from the speech recognition result. In addition, when multiple audio files are processed, deciding from the value of C whether to perform speech recognition avoids spending time on recognizing audio files whose reliability score C, and hence expected recognition accuracy, is low. A more specific configuration example of the prior reliability score calculation unit 30, the main part of the first embodiment, is described next.

<Prior reliability score calculation unit 30>
The prior reliability score calculation unit 30 is described with reference to FIG. 6. It includes a monophone maximum likelihood detection means 32, a speech/pause maximum likelihood detection means 33, a prior reliability calculation means 34, and a reliability score calculation means 35.

FIG. 7 schematically shows the time course of the output probabilities of the monophones and of the pause model and speech model. The horizontal axis represents time in frames t. The vertical axis shows, for each frame t, the states of the monophones and of the speech model. For example, each monophone consists of three states; the monophone "*-a+*" consists of states a_1, a_2, and a_3. A thick circle marks the monophone maximum likelihood state s^, which corresponds to the monophone maximum likelihood value P(s^)b_{s^}(o_t). A hatched circle marks the speech/pause maximum likelihood state g^, which corresponds to the speech/pause maximum likelihood value b_{g^}(o_t). When the monophone maximum likelihood state s^ and the speech/pause maximum likelihood state g^ coincide (s^ = g^), the state is shown as a thick hatched circle.

At times t_1 to t_3, the monophone maximum likelihood state s^ is the first state p_1 to the third state p_3 of the pause model, respectively. The speech/pause maximum likelihood state g^ is likewise the first state p_1 to the third state p_3 of the pause model. Times t_1 to t_3 are therefore non-speech. At time t_1, for example, equation (8) gives the prior reliability c(o_{t_1}) as the difference between the logarithm of the product of the appearance probability P(p_1) of the first state p_1 of the monophone "*-pause+*" and the output probability b_{p_1}(o_{t_1}) of the GMM belonging to state p_1, and the logarithm of the output probability b_{p_1}(o_{t_1}) of the GMM belonging to state p_1 of the pause model. That is:
c(o_{t_1}) = log(P(p_1) b_{p_1}(o_{t_1})) - log b_{p_1}(o_{t_1})

At time t_4, the monophone maximum likelihood state ŝ is the third state a_3 of the monophone "*-a+*" and the speech/pause maximum likelihood state ĝ is the state g of the speech model, so the frame is considered to be speech. Equation (8) gives the prior reliability c(o_t4) as the difference between (a) the logarithm of the product of the occurrence probability P(a_3) of the third state a_3 of the monophone "*-a+*" and the output probability b_a3(o_t4) of the GMM belonging to state a_3, and (b) the logarithm of the output probability b_g(o_t4) of the GMM belonging to state g of the speech model. That is:
c(o_t4) = log(P(a_3) b_a3(o_t4)) - log b_g(o_t4)

At time t_19, the monophone maximum likelihood state ŝ is the second state i_2 of the monophone "*-i+*", while the speech/pause maximum likelihood state ĝ is the third state p_3 of the pause model. In this case, Equation (8) gives the prior reliability c(o_t19) as the difference between (a) the logarithm of the product of the occurrence probability P(i_2) of the second state i_2 of the monophone "*-i+*" and the output probability b_i2(o_t19) of the GMM belonging to state i_2, and (b) the logarithm of the output probability b_p3(o_t19) of the GMM belonging to the third state p_3 of the pause model. That is:
c(o_t19) = log(P(i_2) b_i2(o_t19)) - log b_p3(o_t19)
FIG. 7 shows only part of the time axis. An audio file is, for example, several minutes long (for example, 30,000 frames). The processing of each means is described in detail below.

(Monophone maximum likelihood detection means 32)
For the speech feature o_t of each frame t, the monophone maximum likelihood detection means 32 takes the products P(s) b_s(o_t) of the output probability b_s(o_t) obtained from each monophone GMM and the occurrence probability P(s) of the state s to which that GMM belongs, finds the monophone maximum likelihood value P(ŝ) b_ŝ(o_t) among them, and outputs its logarithm log(P(ŝ) b_ŝ(o_t)) to the prior reliability calculation means 34. The monophone maximum likelihood detection means 32 can obtain each monophone GMM and the occurrence probability P(s) of each state s by referring to the acoustic model parameter memory 50. Alternatively, it may acquire them from the acoustic model parameter memory 50 in advance and store them.
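The per-frame search performed by the monophone maximum likelihood detection means can be sketched as below. This is a minimal illustration, not the patented implementation: it assumes diagonal-covariance GMMs, and the function names, the `(weight, mean, var)` component layout, and the dictionary-based state tables are hypothetical choices made for the example.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm(x, components):
    """Log output probability log b_s(x) for a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    peak = max(terms)
    # log-sum-exp keeps the mixture sum numerically stable
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def monophone_max_likelihood(x, state_gmms, state_priors):
    """Return (s_hat, log(P(s_hat) * b_s_hat(x))) over all monophone states."""
    scores = {
        s: math.log(state_priors[s]) + log_gmm(x, gmm)
        for s, gmm in state_gmms.items()
    }
    s_hat = max(scores, key=scores.get)
    return s_hat, scores[s_hat]
```

Working in the log domain avoids underflow when the feature vector has many dimensions, and because the logarithm is monotonic the state that maximizes log P(s) + log b_s(o_t) is the same state that maximizes P(s) b_s(o_t).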

The occurrence probability P(ŝ) of the monophone maximum likelihood state ŝ may be obtained approximately by Equation (10) below, on the assumption that the occurrence probability of each state does not differ between the training data of the acoustic model and the evaluation speech data that is the target of recognition.

The denominator of Equation (10) is the sum of the occurrence frequencies of all states s in the training data of the acoustic model, and the numerator is the occurrence frequency of the maximum likelihood state ŝ in that training data. If the expected occurrence frequency Γ(s) of each state s obtained when training the acoustic model is stored in the acoustic model parameter memory 50, this is easily realized by using the stored values.
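Consistent with the description of its numerator and denominator, Equation (10) can be written as follows. This is a reconstruction from the surrounding text, with Γ(s) denoting the expected occurrence frequency of state s in the training data.

```latex
P(\hat{s}) \approx \frac{\Gamma(\hat{s})}{\sum_{s} \Gamma(s)} \tag{10}
```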

(Speech/pause maximum likelihood detection means 33)
For the speech feature o_t of each frame t, the speech/pause maximum likelihood detection means 33 finds the speech/pause maximum likelihood value b_ĝ(o_t) among the output probabilities obtained from the speech/pause GMMs and outputs its logarithm log b_ĝ(o_t) to the prior reliability calculation means 34. The speech/pause maximum likelihood detection means 33 can obtain the speech/pause GMMs by referring to the acoustic model parameter memory 50. Alternatively, it may acquire them from the acoustic model parameter memory 50 in advance and store them.

(Prior reliability calculation means 34)
The prior reliability calculation means 34 receives the logarithm log(P(ŝ) b_ŝ(o_t)) of the monophone maximum likelihood value and the logarithm log b_ĝ(o_t) of the speech/pause maximum likelihood value, obtains their difference by Equation (11) below as the prior reliability c(o_t) of the frame, and outputs it to the reliability score calculation means 35.
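Equation (11) can be written out as follows. This is a reconstruction from the surrounding text; it matches the form of the worked examples given for frames t_1, t_4, and t_19, with ŝ the monophone maximum likelihood state and ĝ the speech/pause maximum likelihood state.

```latex
c(o_t) = \log\bigl(P(\hat{s})\, b_{\hat{s}}(o_t)\bigr) - \log b_{\hat{g}}(o_t) \tag{11}
```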

(Reliability score calculation means 35)
The reliability score calculation means 35 receives the per-frame prior reliabilities c(o_t), averages them by Equation (9) to obtain a per-file value (in other words, the prior reliability c(o_t) accumulated over the duration T of the audio file, in total frames, and then averaged), and outputs the result as the reliability score C.
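The averaging step can be sketched as follows. The function names are hypothetical, and the sketch assumes the per-frame log values have already been produced by the two maximum likelihood detection means.

```python
import math

def prior_reliability(log_mono_max, log_speech_pause_max):
    """Per-frame prior reliability: difference of the two log likelihood values."""
    return log_mono_max - log_speech_pause_max

def reliability_score(frame_reliabilities):
    """File-level score C: mean of c(o_t) over all T frames of one audio file."""
    total_frames = len(frame_reliabilities)
    if total_frames == 0:
        raise ValueError("audio file has no frames")
    # fsum limits rounding error when summing tens of thousands of frames
    return math.fsum(frame_reliabilities) / total_frames
```

Because the score is a plain running sum divided by the frame count, it can also be accumulated frame by frame without storing the whole sequence.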

In this way, the prior reliability score calculation unit 30 calculates the reliability score C, which expresses reliability per audio file, by averaging the per-frame prior reliabilities c(o_t) over the total number of frames T of the audio file. Because the reliability score C is obtained per audio file, no elaborate processing is required. In addition, with this configuration the processing speed does not vary with the quality of the input speech signal or its match with the acoustic model, so the prior reliability calculation runs at a stable speed. Next, the speech recognition processing unit 40 is described in detail.

<Speech recognition processing unit 40>
The speech recognition processing unit 40 receives the speech feature sequence O (= o_1, o_2, ..., o_T) output by the feature analysis unit 20 and the reliability score C, performs speech recognition processing with reference to the acoustic model parameter memory 50 and the language model parameter 60, and outputs a speech recognition result W. The reliability score C may be output at the same time. The speech recognition processing here uses all acoustic models recorded in the acoustic model parameter memory 50. The speech recognition processing unit 40 switches speech recognition processing on or off according to the value of the reliability score C.

For example, the speech recognition processing unit 40 stops speech recognition processing when the reliability score C is at or below a fixed value C_th. Because the reliability score C is calculated for each audio file, the unit switches recognition on or off per audio file. The fixed value C_th can be calculated, for example, from the distribution of reliability scores over the training data of the acoustic model: with mean μ and standard deviation σ of that distribution, C_th = μ - 2σ, for example. Alternatively, the speech recognition processing unit 40 may compute and accumulate the reliability scores C of a plurality of audio files and perform speech recognition only on the top N audio files (for example, a number corresponding to 20% of all audio files targeted for recognition).
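Both selection rules described here, the threshold C_th = μ - 2σ and the top-N selection, can be sketched as below. The function names and the file-name-to-score dictionary are hypothetical, and the use of the population standard deviation is an assumption of this sketch, since the text does not say which estimator is intended.

```python
import statistics

def threshold_from_training(training_scores):
    """C_th = mu - 2*sigma over reliability scores of the training data.

    Assumption: population standard deviation (statistics.pstdev).
    """
    mu = statistics.mean(training_scores)
    sigma = statistics.pstdev(training_scores)
    return mu - 2.0 * sigma

def files_to_recognize(file_scores, c_th):
    """Keep files whose score exceeds C_th; files with C <= C_th are skipped."""
    return [name for name, c in file_scores.items() if c > c_th]

def top_fraction(file_scores, fraction=0.2):
    """Keep the top-N files by score, with N a fraction of all files."""
    n = max(1, int(len(file_scores) * fraction))
    ranked = sorted(file_scores, key=file_scores.get, reverse=True)
    return ranked[:n]
```

The threshold rule can be applied to each file as soon as its score is known, whereas the top-N rule needs the scores of all candidate files before any recognition starts.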

<Effects>
As described above, the speech recognition apparatus of the present invention obtains prior reliabilities based on speech features and averages the per-frame prior reliabilities to calculate a reliability score per audio file. The reliability score is therefore obtained with lighter processing than in conventional speech recognition apparatuses. Because the processing is based on speech features, the resulting reliability score does not depend on the language model. Deciding whether to perform speech recognition according to the reliability score also resolves the problem of time being spent recognizing audio files whose recognition accuracy is low, for example because of a poor S/N ratio. Furthermore, whereas reliability has conventionally been calculated per word or keyword, or per utterance (sentence), the speech recognition apparatus 100 of this embodiment can calculate a reliability score per audio file consisting of a plurality of utterances.

<Modification 1>
Only the parts that differ from Embodiment 1 are described, with reference to FIG. 4 and FIG. 6. The processing of the prior reliability score calculation unit 30′ differs from Embodiment 1.

<Prior reliability score calculation unit 30′>
The prior reliability score calculation unit 30′ includes a monophone maximum likelihood detection means 32′, a speech/pause maximum likelihood detection means 33′, the prior reliability calculation means 34, and the reliability score calculation means 35. The processing of the monophone maximum likelihood detection means 32′ and the speech/pause maximum likelihood detection means 33′ differs from Embodiment 1. The prior reliability score calculation unit 30′ averages per-frame prior reliabilities calculated from the monophones and speech models contained in two or more types of acoustic models to calculate the reliability score C per audio file. FIG. 8 shows an example of the time course of the output probabilities when the two or more types of acoustic models are a male acoustic model and a female acoustic model.

(Monophone maximum likelihood detection means 32′)
For the speech feature o_t of each frame t, the monophone maximum likelihood detection means 32′ first takes the products P(s_m) b_sm(o_t) of the output probability b_sm(o_t) obtained from the GMM belonging to state s_m of the male monophone HMM (hereinafter "male monophone GMM") and the occurrence probability P(s_m) of the state s_m to which that GMM belongs, and finds the highest value (hereinafter "male monophone maximum likelihood value P(ŝ_m) b_ŝm(o_t)"). Next, it takes the products P(s_f) b_sf(o_t) of the output probability b_sf(o_t) obtained from the GMM belonging to state s_f of the female monophone HMM (hereinafter "female monophone GMM") and the occurrence probability P(s_f) of the state s_f to which that GMM belongs, and finds the highest value (hereinafter "female monophone maximum likelihood value P(ŝ_f) b_ŝf(o_t)"). The larger of the male monophone maximum likelihood value P(ŝ_m) b_ŝm(o_t) and the female monophone maximum likelihood value P(ŝ_f) b_ŝf(o_t) is taken as the monophone maximum likelihood value P(ŝ) b_ŝ(o_t), and its logarithm is output to the prior reliability calculation means 34.

(Speech/pause maximum likelihood detection means 33′)
For the speech feature o_t of each frame t, the speech/pause maximum likelihood detection means 33′ first finds the male speech/pause maximum likelihood value b_ĝm(o_t) among the output probabilities obtained from the male speech/pause GMMs. Next, it finds the female speech/pause maximum likelihood value b_ĝf(o_t) among the output probabilities obtained from the female speech/pause GMMs. The larger of the male speech/pause maximum likelihood value b_ĝm(o_t) and the female speech/pause maximum likelihood value b_ĝf(o_t) is taken as the speech/pause maximum likelihood value b_ĝ(o_t), and its logarithm is output to the prior reliability calculation means 34.

The prior reliability calculation means 34 obtains the prior reliability c(o_t) of the frame by Equation (11) as the difference between the logarithm log(P(ŝ) b_ŝ(o_t)) of the monophone maximum likelihood value and the logarithm log b_ĝ(o_t) of the speech/pause maximum likelihood value. The reliability score calculation means 35 receives the per-frame prior reliabilities c(o_t) and averages them by Equation (9) to obtain the reliability score C per audio file.
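For the two-model case, the combination reduces to taking a maximum on each side before forming the difference, as in the sketch below. The function name and argument layout are hypothetical; the inputs are assumed to be the per-gender log maximum likelihood values for one frame.

```python
def prior_reliability_two_models(log_mono_male, log_mono_female,
                                 log_sp_male, log_sp_female):
    """Per-frame prior reliability when male and female models are both scored.

    The logarithm is monotonic, so the larger log value corresponds to the
    larger likelihood value on each side.
    """
    log_mono = max(log_mono_male, log_mono_female)   # monophone side
    log_sp = max(log_sp_male, log_sp_female)         # speech/pause side
    return log_mono - log_sp
```

Note that the two maxima are taken independently, so the monophone side and the speech/pause side may come from different genders in the same frame.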

With this configuration, even when the subsequent speech recognition processing uses a plurality of acoustic models, using the same plurality of acoustic model types in the prior reliability score calculation allows the reliability score C to be obtained accurately in line with the speech recognition processing. The prior reliability score calculation unit 30′ may use three or more types of acoustic models.

The reliability score C may also be a value calculated by comparing, for the speech feature sequence, the output probabilities of the maximum likelihood states of the two or more types of speech model or pause model, and restricting the calculation to monophones of the type with the larger output probability. That is, instead of obtaining the male and female monophone maximum likelihood values P(ŝ_m) b_ŝm(o_t) and P(ŝ_f) b_ŝf(o_t) for every frame as in the example above, for frames in which the output probability of the speech model or pause model is higher for male (female) than for female (male), the calculation may be restricted to male (female) monophones.

That is, the speech/pause maximum likelihood detection means 33″ takes the larger of the male and female speech/pause maximum likelihood values b_ĝm(o_t) and b_ĝf(o_t) as the speech/pause maximum likelihood value b_ĝ(o_t). The monophone maximum likelihood detection means 32″ receives that decision as input and obtains the monophone maximum likelihood value P(ŝ) b_ŝ(o_t) for only one of the two types. In this case, the products P(s) b_s(o_t) of the monophone output probability b_s(o_t) and the state occurrence probability P(s) are not calculated for every type, so a reduction in computation can be expected.
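The gated variant can be sketched by passing the monophone scoring as deferred callables, so that only the model type selected by the speech/pause comparison is ever evaluated. The function name, the dictionary keys, and the callable interface are hypothetical conveniences of this sketch.

```python
def gated_prior_reliability(log_sp_by_type, mono_scorers):
    """Pick the model type by speech/pause likelihood, then score only it.

    log_sp_by_type: {type_name: log speech/pause maximum likelihood value}
    mono_scorers:   {type_name: zero-argument callable returning the log
                     monophone maximum likelihood value for that type}
    """
    chosen = max(log_sp_by_type, key=log_sp_by_type.get)
    log_mono = mono_scorers[chosen]()  # the other types are never evaluated
    return chosen, log_mono - log_sp_by_type[chosen]
```

With two model types this roughly halves the monophone GMM evaluations per frame, at the cost of committing to the type that the much smaller speech/pause models prefer.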

<Other modifications>
A speech section determination unit (not shown) may be provided upstream of the feature analysis unit 20. For example, the speech section determination unit judges a section to be non-speech when frames whose power is at or below a predetermined value continue for a predetermined time or longer. When a section is judged to be non-speech, the unit outputs an instruction signal to stop all subsequent processing for that section. With this configuration, speech recognition processing can be skipped for non-speech sections. Loud noise and the like cannot be excluded by the speech section determination unit, but because the monophone maximum likelihood detection means 32 and the speech/pause maximum likelihood detection means 33 determine whether a frame is speech or non-speech (pause), misrecognition can be prevented.
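A power-based rule of the kind described for the speech section determination unit can be sketched as follows. The function name and parameters are hypothetical, and the sketch expresses the "predetermined time" as a minimum number of consecutive frames.

```python
def non_speech_flags(frame_powers, power_threshold, min_run):
    """Flag frames lying in runs of at least min_run consecutive frames
    whose power is at or below power_threshold."""
    flags = [False] * len(frame_powers)
    run_start = None
    # An infinite-power sentinel closes a low-power run that reaches the end.
    for i, power in enumerate(list(frame_powers) + [float("inf")]):
        if power <= power_threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                for j in range(run_start, i):
                    flags[j] = True
            run_start = None
    return flags
```

Short low-power dips, shorter than `min_run` frames, are left unflagged, so brief pauses inside an utterance still reach the later stages.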

The occurrence frequency or occurrence probability of each state s used by the monophone maximum likelihood detection means 32 is not used in the actual speech recognition processing, so some acoustic model parameter memories 50 do not hold this information. In that case, all occurrence frequencies may be set to 1 (P(s) = 1) and the per-frame prior reliability c(o_t) obtained by Equation (8). Some acoustic model parameter memories 50 store occurrence frequencies or occurrence probabilities for only some of the states. In that case, the mean of the stored values may be calculated and used in place of the occurrence frequency or occurrence probability of the other states (those for which no value is stored).
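The two fallbacks for missing state statistics can be sketched as below. The function name and the dictionary representation of the stored values are hypothetical.

```python
def fill_state_priors(all_states, stored):
    """Return a value for every state, filling in those that are not stored.

    - Nothing stored: use 1 for every state (P(s) = 1).
    - Partly stored: use the mean of the stored values for the missing states.
    """
    if not stored:
        return {s: 1.0 for s in all_states}
    mean_value = sum(stored.values()) / len(stored)
    return {s: stored.get(s, mean_value) for s in all_states}
```

With P(s) = 1 the term log P(s) vanishes, so the per-frame value depends only on the GMM output probabilities.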

When the prior reliability score calculation unit uses a plurality of acoustic models, it may be configured to estimate utterance sections and estimate the optimal acoustic model for each utterance section. For example, as in Reference 4, the gender is estimated in advance using the speech/pause GMM, and the acoustic model matching the estimated gender (the male acoustic model or the female acoustic model) is used.
[Reference 4] S. Kobashikawa, A. Ogawa, Y. Yamaguchi, and S. Takahashi, "Rapid unsupervised adaptation using frame independent output probabilities of gender and context independent phoneme models", INTERSPEECH, 2009, pp.1615-1618.
The monophone maximum likelihood detection means 32 and the speech/pause maximum likelihood detection means 33 may output P(ŝ) b_ŝ(o_t) and b_ĝ(o_t) instead of the logarithms log(P(ŝ) b_ŝ(o_t)) and log b_ĝ(o_t), respectively, with the prior reliability calculation means 34 then computing log(P(ŝ) b_ŝ(o_t)) and log b_ĝ(o_t).

The processes described for the above method and apparatus need not be executed in time series in the order described; they may also be executed in parallel or individually, according to the processing capability of the apparatus executing them or as needed.

<Experimental results>
The acoustic analysis conditions in this experiment were a sampling frequency of 16 kHz, a Hamming window with a width of 20 msec, and a window shift of 10 msec; the features were 25-dimensional (12 MFCC, 12 ΔMFCC, and ΔPOWER). The evaluation task consisted of 240 calls in total (19.81 hours, 17,672 utterances) by 48 speakers (17 male, 31 female), and the content was spontaneous speech in one-to-one dialogue. The acoustic models were gender-dependent, speaker-independent models with 1,958 states in total and 26,567 (male) and 29,836 (female) distributions in total. Gender selection was performed in advance using the speech/pause GMM, as in Reference 4. The language model was a word trigram built from transcriptions of dialogue speech, with a vocabulary of 59,676 words. The speech recognition engine VoiceRex (see Reference 5) was used as the decoder.
[Reference 5] H. Masataki, D. Shibata, Y. Nakazawa, S. Kobashikawa, A. Ogawa, and K. Ohtsuki, "VoiceRex - Spontaneous speech recognition technology for contact-center conversations," NTT Tech. Rev., 2007, vol. 5, no. 1, pp. 22-27

To show the effectiveness of the proposed selection of recognition target data by prior reliability estimation, the average recognition rate (character-based) of the selected call speech was evaluated against the data selection rate per call, and four conditions were compared. Ideal condition: calls selected in descending order of recognition accuracy. Average recognition rate: the average recognition rate of all call speech used in the experiment. Conventional technique: calls selected in descending order of a posterior reliability score computed from the speech recognition results after recognition processing. Proposed technique: calls selected in descending order of the proposed prior reliability (speech recognition apparatus 100 of Embodiment 1). As the conventional technique, a method that estimates reliability from the N-best speech recognition results, as in Reference 6, was adopted. The speed of prior reliability estimation was evaluated by comparison with the conventional technique, which includes speech recognition processing.
[Reference 6] B. Rueber, "Obtaining confidence measures from sentence probabilities", in EUROSPEECH-1997, pp.739-742

FIG. 9 shows the effect of selecting recognition target data with the proposed method. Although the speech recognition apparatus 100 of Embodiment 1 falls short of the ideal condition, it achieves a recognition rate above the average recognition rate at every selection rate, showing that selection improves the recognition rate. It also performs on a par with the method based on posterior reliability after speech recognition processing (conventional technique). The processing time of prior reliability estimation is only 0.0184 times that of the conventional technique, a speed-up of more than 50 times. When limited computing resources make it impossible to run recognition on all call speech, the selection by posterior reliability shown in FIG. 9 cannot be realized, so selection based on the proposed prior reliability is effective.
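The speed-up factor follows directly from the processing-time ratio: taking the reciprocal of 0.0184 gives a factor of roughly 54, which is consistent with the stated improvement of more than 50 times.

```latex
\frac{1}{0.0184} \approx 54.3
```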

This paper proposed a method that selects the speech data to be recognized before speech recognition processing, based on fast prior reliability estimation using context-independent phoneme models and a speech model. In the experiments, it achieved selection performance equivalent to posterior reliability estimation after speech recognition processing at about 54 times the speed.

<Speech recognition apparatus 200>
The speech recognition apparatus 200 according to Embodiment 2 is described with reference to FIG. 4, covering only the parts that differ from Embodiment 1. The speech recognition apparatus 200 differs from Embodiment 1 in the processing of the prior reliability score calculation unit 230.

<Prior reliability score calculation unit 230>
The prior reliability score calculation unit 230 is described with reference to FIG. 10. The prior reliability score calculation unit 230 includes a monophone maximum likelihood detection means 232, the speech/pause maximum likelihood detection means 33, a prior reliability calculation means 234, and the reliability score calculation means 35. The processing of the monophone maximum likelihood detection means 232 and the prior reliability calculation means 234 differs from Embodiment 1.

(Monophone maximum likelihood detection means 232)
For the speech feature o_t of each frame t, the monophone maximum likelihood detection means 232 takes the products P(s) b_s(o_t) of the output probability b_s(o_t) obtained from each monophone GMM and the occurrence probability P(s) of the state s to which that GMM belongs, finds the monophone maximum likelihood value P(ŝ) b_ŝ(o_t) among them, and outputs the logarithm log b_ŝ(o_t) of the output probability b_ŝ(o_t) obtained from the GMM belonging to the monophone maximum likelihood state ŝ to the prior reliability calculation means 234.

(Prior reliability calculation means 234)
The prior reliability calculation means 234 receives the logarithm log b_ŝ(o_t) of the output probability b_ŝ(o_t) obtained from the GMM belonging to the monophone maximum likelihood state ŝ and the logarithm log b_ĝ(o_t) of the speech/pause maximum likelihood value, obtains their difference by Equation (12) as the prior reliability c(o_t) of the frame, and outputs it to the reliability score calculation means 35.

式（１１）に代えて、式（１２）を用いても、実施例１と同様に事前信頼度ｃ（ｏ_ｔ）を求めることができる。 Even if the equation (12) is used instead of the equation (11), the prior reliability c (o _t ) can be obtained as in the first embodiment.
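Expressions (11) and (12) themselves are not reproduced in this text. As a reading aid, the following is a reconstruction from the wording of the description and the claims, not a copy of the published formulas; the frame count $T$ and the exact form of the averaging are assumptions.

```latex
% Maximum likelihood monophone state and speech/pause state for frame t
\hat{s} = \operatorname*{arg\,max}_{s} \; P(s)\, b_s(o_t), \qquad
\hat{g} = \operatorname*{arg\,max}_{g} \; b_g(o_t)

% Expression (11), first embodiment (reconstructed)
c(o_t) = \log\bigl( P(\hat{s})\, b_{\hat{s}}(o_t) \bigr) - \log b_{\hat{g}}(o_t)

% Expression (12), second embodiment (reconstructed)
c(o_t) = \log b_{\hat{s}}(o_t) - \log b_{\hat{g}}(o_t)

% Reliability score of an audio file: average of the per-frame values
% (T = number of frames averaged over, an assumed symbol)
C = \frac{1}{T} \sum_{t=1}^{T} c(o_t)
```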

なお、式（１２）が以下の観点からも、事前信頼度として有効であることがわかる。図１１に、音声特徴量と尤度との関係を示す。尤度は、尤もらしさを表す値であり、出力確率値で代用しても良い。横軸が音声特徴量、縦軸が尤度である。図中に、音響モデル中に含まれる音声モデル(破線)とモノフォンの音素モデル「＊−ａ＋＊」，「＊−ｉ＋＊」，「＊−ｕ＋＊」のそれぞれの分布を表す。なお、−は左側依存、＋は右側依存を表し、＊はどのような音素でもよいことを表す。図１１では、簡略化のため音素モデルの状態数を１、混合分布数を１として表現している。 In addition, it turns out that Formula (12) is effective as prior reliability also from the following viewpoints. FIG. 11 shows the relationship between the voice feature quantity and the likelihood. Likelihood is a value representing likelihood, and an output probability value may be substituted. The horizontal axis is the voice feature amount, and the vertical axis is the likelihood. In the figure, the respective distributions of the speech model (broken line) and monophone phoneme models “* -a + *”, “* -i + *”, “* -u + *” included in the acoustic model are shown. Note that-represents left side dependence, + represents right side dependence, and * represents any phoneme. In FIG. 11, the number of states of the phoneme model is represented as 1 and the number of mixture distributions is represented as 1 for simplification.

音声モデルに用いるＧＭＭは、全ての音声すなわち全ての音素の学習データに基づき学習されたモデルである。そのため、その分布は、音声特徴量に対する尤度の値が比較的なだらかな分布となる。それに対して、モノフォンは、各音素の学習データで学習されたモデルである。そのため、当該音素に対応する音声特徴量に対する尤度の値が急峻な分布である。 The GMM used for the speech model is a model learned based on learning data of all speech, that is, all phonemes. Therefore, the distribution is a distribution in which the likelihood value with respect to the voice feature amount is relatively gentle. On the other hand, the monophone is a model learned from learning data of each phoneme. Therefore, the likelihood value for the speech feature amount corresponding to the phoneme has a steep distribution.

従って、ある音声特徴量に対する音声モデルの尤度と、同じ音声特徴量に対するモノフォンの尤度を比較することで、音声ファイルの信頼度を判定することが可能である。つまり、雑音の影響を受けずに収録された音素ａの音声特徴量ｏ_ｔ ^clean（ａ）に対するモノフォン「＊−ａ＋＊」の尤度ｂ_ｓ（ｏ_ｔ ^clean（ａ））は大きな値を示す。一方、同じ音声特徴量ｏ_ｔ ^clean（ａ）に対する音声モデルの尤度ｂ_ｇ（ｏ_ｔ ^clean（ａ））は相対的に小さな値を示す。その結果、それらの値の間には大きな差が生じる。 Therefore, the reliability of an audio file can be judged by comparing the likelihood of the speech model for a given speech feature amount with the likelihood of the monophone for the same speech feature amount. That is, for the speech feature amount $o_t^{\mathrm{clean}}(a)$ of phoneme a recorded without the influence of noise, the likelihood $b_s(o_t^{\mathrm{clean}}(a))$ of the monophone "*-a+*" is large, whereas the likelihood $b_g(o_t^{\mathrm{clean}}(a))$ of the speech model for the same speech feature amount is relatively small. As a result, there is a large difference between the two values.

これに対して、雑音の影響を強く受けて収録された音素ａの音声特徴量ｏ_ｔ ^noisy（ａ）は、本来の特徴量とは異なるのでモノフォンでの尤度ｂ_ｓ（ｏ_ｔ ^noisy（ａ））と、音声モデルにおける尤度ｂ_ｇ（ｏ_ｔ ^noisy（ａ））との間の差が小さくなる。 In contrast, the speech feature amount $o_t^{\mathrm{noisy}}(a)$ of phoneme a recorded under strong influence of noise differs from the original feature amount, so the difference between the monophone likelihood $b_s(o_t^{\mathrm{noisy}}(a))$ and the speech model likelihood $b_g(o_t^{\mathrm{noisy}}(a))$ becomes small.

このように音声特徴量に対するモノフォンの尤度ｂ_ｓ（ｏ_ｔ）と、音声モデルの尤度ｂ_ｇ（ｏ_ｔ）との差を見ることで、収録音声の品質を評価することができる。よって式（１２）により事前信頼度ｃ（ｏ_ｔ）を求めることができることがわかる。 In this way, the quality of the recorded speech can be evaluated by looking at the difference between the monophone likelihood $b_s(o_t)$ and the speech model likelihood $b_g(o_t)$ for a speech feature amount. It follows that the prior reliability $c(o_t)$ can be obtained by Expression (12).

このような構成とすることで、実施例１と同様の効果を得ることができる。また、実施例１で用いる式（１１）では第１項に、最尤状態ｓ＾の出現確率Ｐ（ｓ＾）(＜１)を含むため、事前信頼度ｃ（ｏ_ｔ）の値が小さくなり、負の領域になる可能性が高い。実施例２で用いる式（１２）では、第１項と第２項とも同様の出力確率の対数スコアであり、かつ前述の通り音声モデルの分布がモノフォンの分布に比べてなだらかなことから、第２項の値は第１項に比べて小さくなり、正の領域になる可能性が高い。すなわち、事前信頼度ｃ（ｏ_ｔ）、ひいては信頼度スコアＣの値の取り得る値の範囲が制限される。従って、後段で音声認識処理制御を行う場合、音声認識処理を制御する閾値Ｃ_ｔｈの設定が容易になる。 By adopting such a configuration, the same effect as in the first embodiment can be obtained. In Expression (11) used in the first embodiment, the first term includes the appearance probability $P(\hat{s})$ (< 1) of the maximum likelihood state $\hat{s}$, so the value of the prior reliability $c(o_t)$ becomes smaller and is likely to fall in the negative range. In Expression (12) used in the second embodiment, the first and second terms are both log scores of output probabilities of the same kind, and, as described above, the distribution of the speech model is gentler than that of the monophone; the second term is therefore smaller than the first, and the value is likely to fall in the positive range. In other words, the range of values that the prior reliability $c(o_t)$, and hence the reliability score $C$, can take is restricted. This makes it easier to set the threshold $C_{th}$ that controls speech recognition processing when such control is performed at a later stage.
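The difference between the two expressions can be illustrated with a small Python sketch. It assumes that per-frame log output probabilities and log state appearance probabilities are already available as plain lists; the function names, argument layout, and numbers are hypothetical and chosen only for illustration.

```python
import math

def frame_prior_reliability(mono_log_b, mono_log_prior, sp_log_b, use_prior=False):
    """Prior reliability c(o_t) of one frame (illustrative sketch).

    mono_log_b[s]     : log b_s(o_t) for each monophone state s
    mono_log_prior[s] : log P(s) for each monophone state s
    sp_log_b[g]       : log b_g(o_t) for each speech/pause model state g
    use_prior=True    : Expression (11) style, first term log P(s^) b_s^(o_t)
    use_prior=False   : Expression (12) style, first term log b_s^(o_t)
    """
    # The maximum likelihood monophone state s^ maximizes P(s) * b_s(o_t).
    s_hat = max(range(len(mono_log_b)),
                key=lambda s: mono_log_prior[s] + mono_log_b[s])
    first = mono_log_b[s_hat]
    if use_prior:
        first += mono_log_prior[s_hat]
    # The second term is the highest speech/pause output probability.
    second = max(sp_log_b)
    return first - second

def file_reliability_score(frames, use_prior=False):
    """Average the per-frame prior reliabilities into a per-file score C."""
    if not frames:
        raise ValueError("at least one frame is required")
    values = [frame_prior_reliability(*f, use_prior=use_prior) for f in frames]
    return sum(values) / len(values)

# Made-up numbers: a "clean" frame where one monophone fits sharply,
# and a "noisy" frame where no monophone stands out.
priors = [math.log(0.5), math.log(0.3), math.log(0.2)]
clean = ([math.log(0.8), math.log(0.05), math.log(0.01)], priors,
         [math.log(0.2), math.log(0.01)])
noisy = ([math.log(0.12), math.log(0.1), math.log(0.09)], priors,
         [math.log(0.1), math.log(0.05)])

c12_clean = frame_prior_reliability(*clean)
c12_noisy = frame_prior_reliability(*noisy)
c11_noisy = frame_prior_reliability(*noisy, use_prior=True)
```

With these illustrative values the clean frame scores higher than the noisy one, and the Expression (11) style value for the noisy frame is negative while the Expression (12) style value stays positive, which matches the tendency described above.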

＜音声認識装置３００＞
図１２を用いて実施例３に係る音声認識装置３００を説明する。音声認識装置３００は、認識処理制御部３８０を備える点、及び音声認識装置３４０の処理内容が音声認識装置１００、２００と異なる。
＜認識処理制御部３８０＞
認識処理制御部３８０は、制御信号としてビーム探索幅Ｎ（Ｃ）を出力する。その一例を式（１３）に示す。 <Voice recognition apparatus 300>
A speech recognition apparatus 300 according to the third embodiment will be described with reference to FIG. 12. The speech recognition apparatus 300 differs from the speech recognition apparatuses 100 and 200 in that it includes a recognition processing control unit 380 and in the processing performed by the speech recognition processing unit 340.
<Recognition processing control unit 380>
The recognition processing control unit 380 outputs the beam search width N (C) as a control signal. An example is shown in equation (13).

図１３に信頼度スコアＣとビーム探索幅Ｎ（Ｃ）との関係を例示する。横軸は信頼度スコアＣであり、縦軸はビーム探索幅Ｎ（Ｃ）である。
図１３に示すように式（１３）は、所定の範囲の信頼度スコアＣ（Ｃ_ｍｉｎ〜Ｃ_ｍａｘ）に対応するビーム探索幅Ｎ（Ｃ）（Ｎ_ｍｉｎ〜Ｎ_ｍａｘ）を、信頼度スコアＣの値で比例配分する考えである。ここでは、比例係数が負の値なので、信頼度スコアＣが小でビーム探索幅Ｎ（Ｃ）が大であり、Ｃが大でＮ（Ｃ）が小となる関係である。もちろん、信頼度スコアＣとビーム探索幅Ｎ（Ｃ）との関係は、非線形な関数で表せる関係であっても良い。また、制御信号としてビーム探索幅Ｎ（Ｃ）を用いる場合、ビーム探索幅は、個数ビーム幅に限定したものではなく、例えばスコアビーム幅、単語終端スコアビーム幅や、単語終端個数ビーム幅等であっても良い。 FIG. 13 illustrates the relationship between the reliability score C and the beam search width N (C). The horizontal axis is the reliability score C, and the vertical axis is the beam search width N (C).
As shown in FIG. 13, Expression (13) proportionally allocates the beam search width $N(C)$ over its range ($N_{min}$ to $N_{max}$) according to the value of the reliability score $C$ within a predetermined range ($C_{min}$ to $C_{max}$). Because the proportionality coefficient is negative, the beam search width $N(C)$ is large when the reliability score $C$ is small, and small when $C$ is large. Of course, the relationship between the reliability score $C$ and the beam search width $N(C)$ may also be expressed by a non-linear function. Further, when the beam search width $N(C)$ is used as the control signal, the beam search width is not limited to a count-based beam width; it may be, for example, a score beam width, a word-end score beam width, or a word-end count beam width.

ここで、例えばＣ_ｍａｘ＝μ＋σ、Ｃ_ｍｉｎ＝μ―σとして、Ｎ_ｍａｘを通常用いるビーム幅の１.５倍、Ｎ_ｍｉｎを通常用いるビーム幅の半分等としても良い。また、平均音質が極端に悪い場合（例えばＣ＜Ｃ_ｍｉｎ）には、ビーム探索幅を拡大しても精度向上が望めず処理時間ばかり掛かるので、ビーム探索幅を小さく、例えばＮ_ｍｉｎにしても良い。また、制御信号に認識対象外指示信号を含ませて音声認識処理を行わせないようにしても良い。また、音声認識処理を停止させる信号とビーム探索幅の制御信号を並存させても良い。 Here, for example, with $C_{max} = \mu + \sigma$ and $C_{min} = \mu - \sigma$, $N_{max}$ may be set to 1.5 times the beam width normally used and $N_{min}$ to half the beam width normally used. When the average sound quality is extremely poor (for example, $C < C_{min}$), widening the beam search width cannot be expected to improve accuracy and only increases processing time, so the beam search width may instead be made small, for example $N_{min}$. The control signal may also include a signal designating the file as excluded from recognition, so that speech recognition processing is not performed. A signal for stopping the speech recognition processing and a control signal for the beam search width may also coexist.
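Expression (13) is not reproduced in this text. The following Python sketch shows one linear mapping that is consistent with the description (negative slope between $C_{min}$ and $C_{max}$, optional narrowing for $C < C_{min}$). The function name, the clamping outside the range, the rounding, and the `narrow_when_very_low` flag are assumptions made for illustration.

```python
def beam_search_width(c, c_min, c_max, n_min, n_max, narrow_when_very_low=False):
    """Map a reliability score C to a beam search width N(C) (sketch).

    Inside [c_min, c_max] the width falls linearly from n_max to n_min.
    Outside that range the width is clamped (assumed behaviour), except
    that very low scores may optionally get the narrow width n_min.
    """
    if c_max <= c_min:
        raise ValueError("c_max must be greater than c_min")
    if c < c_min:
        # Extremely poor quality: a wide beam only costs time.
        return n_min if narrow_when_very_low else n_max
    if c > c_max:
        return n_min
    ratio = (c - c_min) / (c_max - c_min)
    return round(n_max - (n_max - n_min) * ratio)

# Example settings following the text: C_max = mu + sigma, C_min = mu - sigma,
# N_max = 1.5 x and N_min = 0.5 x the usual beam width (values are made up).
mu, sigma, usual_width = 1.0, 0.5, 1000
c_min, c_max = mu - sigma, mu + sigma
n_min, n_max = usual_width // 2, usual_width * 3 // 2

width_mid = beam_search_width(1.0, c_min, c_max, n_min, n_max)
width_low = beam_search_width(0.2, c_min, c_max, n_min, n_max, narrow_when_very_low=True)
```

A non-linear mapping could be substituted by replacing the `ratio` line, in line with the remark that the relationship need not be linear.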

＜音声認識処理部３４０＞
音声認識処理部３４０は、音声認識処理を停止させる信号か、ビーム探索幅の制御信号の少なくとも一方に基づき、前記音声特徴量系列を入力として音声認識処理を行う。例えば、認識処理制御部３８０から音声認識処理を停止させる信号を受信した場合には、対応する音声ファイルについては、音声認識処理を停止させる。また、ビーム探索幅Ｎ（Ｃ）の制御信号を受信した場合には、そのビーム探索幅Ｎ（Ｃ）に基づき、音声認識処理を行う。 <Voice recognition processing unit 340>
The speech recognition processing unit 340 performs speech recognition processing using the speech feature amount series as an input based on at least one of a signal for stopping the speech recognition processing or a control signal for beam search width. For example, when a signal for stopping the voice recognition process is received from the recognition process control unit 380, the voice recognition process is stopped for the corresponding voice file. When a control signal for the beam search width N (C) is received, speech recognition processing is performed based on the beam search width N (C).

＜効果＞
このように、認識処理制御部３８０を備えた音声認識装置３００は、複数の音声ファイルの音声認識処理の効率化と、認識精度の向上を図ることができる。なお、認識処理制御部３８０の機能は、音声認識処理部４０に持たせても良い。 <Effect>
As described above, the speech recognition apparatus 300 including the recognition processing control unit 380 can improve the efficiency of the speech recognition processing of a plurality of audio files and improve the recognition accuracy. Note that the speech recognition processing unit 40 may have the function of the recognition processing control unit 380.

＜音声認識装置４００＞
図１４及び図１５を用いて実施例４に係る音声認識装置４００を説明する。
音声認識装置４００は、音声ファイル処理部４０１と、ソート音声認識処理部４４０と、を備える点で音声認識装置１００、２００と異なる。 <Voice recognition apparatus 400>
A speech recognition apparatus 400 according to the fourth embodiment will be described with reference to FIGS. 14 and 15.
The voice recognition apparatus 400 is different from the voice recognition apparatuses 100 and 200 in that the voice recognition apparatus 400 includes a voice file processing unit 401 and a sort voice recognition processing unit 440.

＜音声ファイル処理部４０１＞
音声ファイル処理部４０１は、複数の音声ファイルの信頼度スコアＣの高い順番に複数の音声ファイルを並び替える（ステップＳ４０１）。
＜ソート音声認識処理部４４０＞
ソート音声認識処理部４４０は、信頼度スコアＣの高い順番に音声認識処理を行う（ステップＳ４４０）。 <Audio file processing unit 401>
The audio file processing unit 401 rearranges the plurality of audio files in descending order of the reliability score C of the plurality of audio files (step S401).
<Sort voice recognition processing unit 440>
The sorted speech recognition processing unit 440 performs speech recognition processing in descending order of the reliability score C (step S440).

＜効果＞
このような構成とすることで、実施例１と同様の効果を得ることができる。さらに、このように信頼度スコアＣの大きさ順に音声認識処理を実行することで、複数の音声ファイルの音声認識処理を行う場合の処理効率を向上させることができる。例えば、全音声ファイルに対して音声認識処理を行うことが、計算機資源や処理時間の関係等によって難しい場合には、信頼度スコアＣが小さい音声ファイルは音声認識処理が行われず、音声認識精度が高い事が期待される信頼度スコアＣが大きな音声ファイルにのみ音声認識処理が行われることになり、高精度な音声認識結果を収集することが可能になる。なお、音声ファイル処理部４０１の機能は、ソート音声認識処理部４４０の機能に含めても良い。なお、実施例３の音声認識装置３００と音声ファイル処理部４０１及びソート音声認識処理部４４０を組み合わせても、同様の効果をえることができる。 <Effect>
By adopting such a configuration, the same effect as in the first embodiment can be obtained. Furthermore, executing speech recognition processing in descending order of the reliability score C improves processing efficiency when speech recognition is performed on a plurality of audio files. For example, when it is difficult to perform speech recognition processing on all audio files because of limits on computer resources or processing time, audio files with a small reliability score C are not processed, and speech recognition is performed only on audio files with a large reliability score C, for which high recognition accuracy is expected; highly accurate speech recognition results can thus be collected. Note that the function of the audio file processing unit 401 may be included in the function of the sorted speech recognition processing unit 440. The same effect can also be obtained by combining the speech recognition apparatus 300 according to the third embodiment with the audio file processing unit 401 and the sorted speech recognition processing unit 440.
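The ordering described for steps S401 and S440 can be sketched in Python as follows. The `recognize` callback and the `max_files` limit (standing in for a resource or time budget) are hypothetical; the text does not specify how the budget is expressed.

```python
def recognize_in_score_order(scores, recognize, max_files=None):
    """Run recognition on audio files in descending order of score C (sketch).

    scores    : dict mapping an audio file name to its reliability score C
    recognize : callable taking a file name and returning a recognition result
                (hypothetical stand-in for the speech recognition processing)
    max_files : optional cap on how many files are processed (assumed budget)
    """
    # Corresponds to step S401: sort files from high to low reliability score.
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    if max_files is not None:
        ordered = ordered[:max_files]
    # Corresponds to step S440: recognize in that order.
    return [(name, recognize(name)) for name in ordered]

# Illustrative data and a dummy recognizer.
example_scores = {"a.wav": 0.3, "b.wav": 1.7, "c.wav": 0.9}
results = recognize_in_score_order(example_scores,
                                   recognize=lambda name: "text of " + name,
                                   max_files=2)
```

With the cap set to two files, only the two highest-scoring files are recognized and the lowest-scoring file is skipped.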

＜音声認識装置５００＞
図１６及び図１７を用いて実施例５に係る音声認識装置５００を説明する。
音声認識装置５００は、教師なし適応制御部５０１と、教師なし適応部５０２と、適応後音響モデルパラメータメモリ５０３と、第２認識処理部５０４とを備える点で音声認識装置１００、２００と異なる。 <Voice recognition apparatus 500>
A speech recognition apparatus 500 according to the fifth embodiment will be described with reference to FIGS. 16 and 17.
The speech recognition apparatus 500 differs from the speech recognition apparatuses 100 and 200 in that it includes an unsupervised adaptive control unit 501, an unsupervised adaptation unit 502, a post-adaptation acoustic model parameter memory 503, and a second recognition processing unit 504.

＜教師なし適応制御部５０１＞
教師なし適応制御部５０１は、事前信頼度Ｃを入力として、その事前信頼度Ｃの値が一定範囲内（例えばＣ＞Ｃ_ｔｈ２であり、Ｃ_ｔｈ２＞Ｃ_ｔｈとする。ここで、Ｃ_ｔｈ２は、前述の信頼度スコア分布の平均値μ、標準偏差σを用いて、例えばＣ_ｔｈ２＝μ―σ等としてもよい）か否かを判定して教師なし適応制御信号ｐを出力する（ステップＳ５０１）。事前信頼度Ｃの値が一定範囲内でない場合、その音声ファイルの処理を終了する（ステップＳ５０１のＮＯ）。教師なし適応制御信号とは、音声認識処理部４０が出力する音声認識結果を適応用ラベルとして用いるか否かを制御する信号である。 <Unsupervised adaptive control unit 501>
The unsupervised adaptive control unit 501 receives the prior reliability C, determines whether its value is within a certain range (for example, $C > C_{th2}$, where $C_{th2} > C_{th}$; using the mean μ and standard deviation σ of the reliability score distribution described above, $C_{th2}$ may be set to, for example, $C_{th2} = \mu - \sigma$), and outputs an unsupervised adaptive control signal p (step S501). If the value of the prior reliability C is not within the range, processing of that audio file ends (NO in step S501). The unsupervised adaptive control signal is a signal that controls whether the speech recognition result output by the speech recognition processing unit 40 is used as an adaptation label.

＜教師なし適応部５０２＞
教師なし適応部５０２は、教師なし適応制御信号ｐが、音声認識処理部４０が出力する音声認識結果Ｗを適応用ラベルとして用いることを指示していた場合、音声認識結果Ｗを適応用ラベルとして音響モデルパラメータメモリ５０に記録された音響モデルを学習して、適応後音響モデルを生成する（ステップＳ５０２）。適応後音響モデルは、適応後音響モデルパラメータメモリ５０３に記録される。 <Unsupervised adaptation unit 502>
When the unsupervised adaptive control signal p instructs that the speech recognition result W output by the speech recognition processing unit 40 be used as an adaptation label, the unsupervised adaptation unit 502 trains the acoustic model recorded in the acoustic model parameter memory 50 using the speech recognition result W as the adaptation label and generates a post-adaptation acoustic model (step S502). The post-adaptation acoustic model is recorded in the post-adaptation acoustic model parameter memory 503.

＜第２認識処理部５０４＞
第２認識処理部５０４は、適応後音響モデルパラメータメモリ５０３に記録された適応後音響モデルを用いて音声特徴量系列Ｏの音声認識処理を行い、音声認識結果Ｗ’を出力する（ステップＳ５０４）。なお、このとき、事前信頼度スコア計算部３０で求めた信頼度スコアＣを一緒に出力してもよい。 <Second recognition processing unit 504>
The second recognition processing unit 504 performs speech recognition processing on the speech feature amount sequence O using the post-adaptation acoustic model recorded in the post-adaptation acoustic model parameter memory 503, and outputs a speech recognition result W′ (step S504). At this time, the reliability score C obtained by the prior reliability score calculation unit 30 may be output together.

＜効果＞
このような構成とすることで実施例１と同様の効果を得ることができる。さらに、音声認識装置５００は、事前信頼度Ｃの値が一定範囲内にある場合に限って、音声認識結果Ｗを適応用ラベルとして音響モデルを学習し、さらに音声認識処理を行う。事前信頼度スコアＣが低く音声ファイルの認識精度の低い場合には、そのときの音声認識処理結果Ｗは、教師なし適応における適応用ラベルとしてふさわしくなく、教師なし適応による音響モデルの精度向上が期待できない。そのような場合に、教師なし適応や第２音声認識処理を省略することで、その計算時間を削減できる。また、信頼度スコアＣが高く音声ファイルの認識精度の高い音声認識結果Ｗを適応用ラベルとして音響モデルを学習するので、音響モデルの精度を自動的に向上させることができる。なお、実施例３、４の音声認識装置３００、４００と教師なし適応制御部５０１、教師なし適応部５０２、適応後音響モデルパラメータメモリ５０３及び第２認識処理部５０４を組み合わせても、同様の効果をえることができる。 <Effect>
By adopting such a configuration, the same effect as in the first embodiment can be obtained. Furthermore, only when the value of the prior reliability C is within a certain range does the speech recognition apparatus 500 train the acoustic model using the speech recognition result W as an adaptation label and then perform speech recognition processing again. When the prior reliability score C is low and the recognition accuracy for the audio file is low, the speech recognition result W is not suitable as an adaptation label for unsupervised adaptation, and unsupervised adaptation cannot be expected to improve the accuracy of the acoustic model. In such a case, omitting the unsupervised adaptation and the second speech recognition processing saves their computation time. In addition, because the acoustic model is trained using, as adaptation labels, speech recognition results W from audio files with a high reliability score C and high recognition accuracy, the accuracy of the acoustic model can be improved automatically. The same effect can also be obtained by combining the speech recognition apparatuses 300 and 400 according to the third and fourth embodiments with the unsupervised adaptive control unit 501, the unsupervised adaptation unit 502, the post-adaptation acoustic model parameter memory 503, and the second recognition processing unit 504.
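The gating of unsupervised adaptation by the score can be sketched in Python as follows. The `recognize` and `adapt` callbacks are hypothetical placeholders for the recognition and adaptation processing, and returning the first-pass result when adaptation is skipped is an assumption about what "ending the processing" leaves as output.

```python
def two_pass_recognition(features, score, c_th2, base_model, recognize, adapt):
    """First-pass recognition, then score-gated unsupervised adaptation (sketch).

    features   : speech feature amount sequence O of one audio file
    score      : reliability score C of the file
    c_th2      : threshold C_th2 for using the result as an adaptation label
    recognize  : callable (features, model) -> recognition result (hypothetical)
    adapt      : callable (model, features, labels) -> adapted model (hypothetical)
    Returns (result, adapted_flag).
    """
    first_result = recognize(features, base_model)
    # Corresponds to step S501: adapt only when C exceeds the threshold.
    if not score > c_th2:
        return first_result, False
    # Corresponds to step S502: the first-pass result W is the adaptation label.
    adapted_model = adapt(base_model, features, first_result)
    # Corresponds to step S504: second recognition with the adapted model.
    return recognize(features, adapted_model), True

# Dummy callbacks that only record which model was used.
def dummy_recognize(features, model):
    return "result from " + model

def dummy_adapt(model, features, labels):
    return model + "+adapted"

high = two_pass_recognition([0.1, 0.2], 1.4, 0.5, "base", dummy_recognize, dummy_adapt)
low = two_pass_recognition([0.1, 0.2], 0.2, 0.5, "base", dummy_recognize, dummy_adapt)
```

For the low-scoring file neither the adaptation nor the second recognition pass is executed, which is where the saving in computation time comes from.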

＜プログラム＞
また、前記装置における処理手段をコンピュータによって実現する場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、各装置における処理手段がコンピュータ上で実現される。 <Program>
Further, when the processing means in the device is realized by a computer, the processing contents of the functions that each device should have are described by a program. Then, by executing this program on the computer, the processing means in each apparatus is realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。具体的には、例えば、磁気記録装置として、ハードディスク装置、フレキシブルディスク、磁気テープ等を、光ディスクとして、ＤＶＤ（Digital Versatile Disc）、ＤＶＤ−ＲＡＭ（Random Access Memory）、ＣＤ−ＲＯＭ（Compact Disc Read Only Memory）、ＣＤ−Ｒ（Recordable）/ＲＷ（ReWritable）等を、光磁気記録媒体として、ＭＯ（Magneto Optical disc）等を、半導体メモリとしてＥＥＰ−ＲＯＭ（Electronically Erasable and Programmable-Read Only Memory）等を用いることができる。 The program describing the processing contents can be recorded on a computer-readable recording medium. As the computer-readable recording medium, for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used. Specifically, for example, as a magnetic recording device, a hard disk device, a flexible disk, a magnetic tape or the like, and as an optical disk, a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only). Memory), CD-R (Recordable) / RW (ReWritable), etc., magneto-optical recording medium, MO (Magneto Optical disc), etc., semiconductor memory, EEP-ROM (Electronically Erasable and Programmable-Read Only Memory), etc. Can be used.

また、このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記録装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。 The program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Further, the program may be distributed by storing the program in a recording device of a server computer and transferring the program from the server computer to another computer via a network.

また、各手段は、コンピュータ上で所定のプログラムを実行させることにより構成することにしてもよいし、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。 Each means may be configured by executing a predetermined program on a computer, or at least a part of these processing contents may be realized by hardware.

１００、２００、３００、４００、５００、９００音声認識装置
２０特徴量分析部
３０、３０’、２３０事前信頼度スコア計算部
４０、３４０音声認識処理部
５０音響モデルパラメータメモリ
６０言語モデルパラメータメモリ
３８０認識処理制御部
４０１音声ファイル処理部
４４０ソート音声認識処理部
５０１教師なし適応制御部
５０２教師なし適応部
５０３適応後音響モデルパラメータメモリ
５０４第２認識処理部
100, 200, 300, 400, 500, 900 Speech recognition apparatus
20 Feature amount analysis unit
30, 30′, 230 Prior reliability score calculation unit
40, 340 Speech recognition processing unit
50 Acoustic model parameter memory
60 Language model parameter memory
380 Recognition processing control unit
401 Audio file processing unit
440 Sorted speech recognition processing unit
501 Unsupervised adaptive control unit
502 Unsupervised adaptation unit
503 Post-adaptation acoustic model parameter memory
504 Second recognition processing unit

Claims

A feature amount analysis process for obtaining a speech feature amount sequence by analyzing a speech feature amount of a speech digital signal in units of frames;
A prior reliability score calculation process of, using the speech feature amount sequence, taking for each frame, as the prior reliability of the frame, the difference between the logarithm of the highest product $P(\hat{s})\,b_{\hat{s}}(o_t)$ among the products of the output probability $b_s(o_t)$ obtained from the GMM belonging to each state of the monophone HMMs for the input and the appearance probability $P(s)$ of each state $s$, and the logarithm of the highest output probability $b_{\hat{g}}(o_t)$ obtained from the GMM belonging to each state of the speech model HMM or the pause model HMM for the input, and averaging the prior reliabilities to obtain a reliability score for each audio file;
Using the speech feature amount sequence, a speech recognition processing step of performing speech recognition processing based on the reliability score,
A speech recognition method characterized by the above.

A feature amount analysis process for obtaining a speech feature amount sequence by analyzing a speech feature amount of a speech digital signal in units of frames;
A prior reliability score calculation process of, using the speech feature amount sequence, taking for each frame, as the prior reliability of the frame, the difference between the logarithm of the output probability $b_{\hat{s}}(o_t)$ at the state for which the product of the output probability $b_s(o_t)$ obtained from the GMM belonging to each state of the monophone HMMs for the input and the appearance probability $P(s)$ of each state $s$ is highest, and the logarithm of the highest output probability $b_{\hat{g}}(o_t)$ obtained from the GMM belonging to each state of the speech model HMM or the pause model HMM for the input, and averaging the prior reliabilities to obtain a reliability score for each audio file;
Using the speech feature amount sequence, a speech recognition processing step of performing speech recognition processing based on the reliability score,
A speech recognition method characterized by the above.

In the voice recognition method according to claim 1 or 2,
A recognition processing control step for obtaining at least one of a signal for stopping speech recognition processing or a control signal for beam search width using the reliability score ;
The speech recognition processing step performs speech recognition processing of the speech feature amount series based on at least one of a signal for stopping speech recognition processing or a control signal for beam search width.
A speech recognition method characterized by the above.

The speech recognition method according to any one of claims 1 to 3,
An audio file processing step of rearranging a plurality of audio files in descending order of reliability score, based on the reliability scores of the plurality of audio files;
A sorted speech recognition processing step of performing speech recognition processing in descending order of reliability score;
A speech recognition method, further comprising:

In the voice recognition method according to any one of claims 1 to 4,
An unsupervised adaptive control step of determining, using the reliability score, whether the value of the reliability score is within a predetermined range and obtaining an unsupervised adaptive control signal;
An unsupervised adaptation step of, using the speech recognition result and the unsupervised adaptive control signal, learning an acoustic model with the speech recognition result as an adaptation label and generating a post-adaptation acoustic model;
A second recognition process for performing speech recognition processing of the speech feature amount sequence using the post-adaptation acoustic model when the post-adaptation acoustic model is generated;
A speech recognition method, further comprising:

A feature amount analysis unit that analyzes a speech feature amount of an input speech digital signal in units of frames and outputs a speech feature amount sequence;
A prior reliability score calculation unit that, for the input speech feature amount sequence, takes for each frame, as the prior reliability of the frame, the difference between the logarithm of the highest product $P(\hat{s})\,b_{\hat{s}}(o_t)$ among the products of the output probability $b_s(o_t)$ obtained from the GMM belonging to each state of the monophone HMMs and the appearance probability $P(s)$ of each state $s$, and the logarithm of the highest output probability $b_{\hat{g}}(o_t)$ obtained from the GMM belonging to each state of the speech model HMM or the pause model HMM for the input, averages the prior reliabilities, and outputs a reliability score for each audio file;
A speech recognition processing unit that performs speech recognition processing based on the reliability score using the speech feature amount series as an input;
A speech recognition apparatus characterized by that.

A feature amount analysis unit that analyzes a speech feature amount of an input speech digital signal in units of frames and outputs a speech feature amount sequence;
A prior reliability score calculation unit that, for the input speech feature amount sequence, takes for each frame, as the prior reliability of the frame, the difference between the logarithm of the output probability $b_{\hat{s}}(o_t)$ at the state for which the product of the output probability $b_s(o_t)$ obtained from the GMM belonging to each state of the monophone HMMs and the appearance probability $P(s)$ of each state $s$ is highest, and the logarithm of the highest output probability $b_{\hat{g}}(o_t)$ obtained from the GMM belonging to each state of the speech model HMM or the pause model HMM for the input, averages the prior reliabilities, and outputs a reliability score for each audio file;
A speech recognition processing unit that performs speech recognition processing based on the reliability score using the speech feature amount series as an input;
A speech recognition apparatus characterized by that.

In the voice recognition device according to claim 6 or 7,
A recognition processing control unit for obtaining at least one of a signal for stopping speech recognition processing or a control signal for beam search width, using the reliability score as an input;
The speech recognition processing unit performs speech recognition processing using the speech feature amount series as an input based on at least one of a signal for stopping speech recognition processing or a control signal for beam search width,
A speech recognition apparatus characterized by that.

The speech recognition apparatus according to any one of claims 6 to 8,
An audio file processing unit that rearranges a plurality of audio files in descending order of reliability score, based on the reliability scores of the plurality of audio files;
A sorted speech recognition processing unit that performs speech recognition processing in descending order of reliability score;
A speech recognition apparatus further comprising:

The speech recognition apparatus according to any one of claims 6 to 9,
An unsupervised adaptive control unit that receives the reliability score as input, determines whether the value of the reliability score is within a predetermined range, and outputs an unsupervised adaptive control signal;
An unsupervised adaptation unit that receives the speech recognition result and the unsupervised adaptive control signal as input, learns an acoustic model using the speech recognition result as an adaptation label, and generates a post-adaptation acoustic model;
A second recognition processing unit for performing speech recognition processing of the speech feature amount sequence using the post-adaptation acoustic model when the post-adaptation acoustic model is generated;
A speech recognition apparatus further comprising:

A program for causing a computer to execute the speech recognition method according to any one of claims 1 to 5.