JP4074543B2

JP4074543B2 - Audio processing apparatus, audio processing method, audio processing program, and program recording medium

Info

Publication number: JP4074543B2
Application number: JP2003118305A
Authority: JP
Inventors: 建一熊谷
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2003-04-23
Filing date: 2003-04-23
Publication date: 2008-04-09
Anticipated expiration: 2023-04-23
Also published as: JP2004325635A

Description

【０００１】
【発明の属する技術分野】
この発明は、音声認識システム等に利用される音声処理装置,音声処理方法,音声処理プログラムおよびプログラム記録媒体に関する。
【０００２】
【従来の技術】
現在、音声認識システムの認識性能は、書き起こし文を読み上げた朗読音声であれば、不特定話者タスクであっても高い単語認識性能を有している。これは、多数話者データベースの利用が可能であり、殆どの話者の音響特性を学習できるためである。また、Maximum a posteriori(以下、ＭＡＰと略称する)やMaximum Likelihood Linear Regression(以下、ＭＬＬＲと略称する)等の話者適応技術によって、少ない音声サンプルから話者の音響特性を学習することも可能である。
【０００３】
ここで、上記話者の音響特性とは、話者の発声器官の違い等、発声器官の物理特性の違いによって起こる音響特性のことである。例えば、声道長の違い等によって、音声のスペクトルが話者毎に異なる。尚、上述したＭＡＰやＭＬＬＲは、S.Young他著“The HTKBOOK”に詳しく述べられている。
【０００４】
しかしながら、自然に且つ自由に発声された音声(以下、自然音声と言う)に対する認識性能は不十分である(篠崎他、音講論、pp17-18、Mar.2002)。自然音声認識が難しい理由は、発話スタイルの要因が大きいといわれている(山本他、信学論pp2438−2447、Nov．2000)。また、自然音声と朗読音とを使ってモデルを学習した場合でも、自然音声の認識率はかなり低下する。この原因は、総ての発話速度に対応したモデルを作成することが難しいことと、自然音声においては特に母音をはっきりと発音しない(なまける)傾向があるためであると考えられる。
【０００５】
前者の原因に対しては、発話速度毎に遷移パスを分離するマルチパス隠れマルコフモデル(以下、ＨＭＭと略称する)(李他、音講文, pp.89-90, Mar.2002)等が提案されている。しかしながら、計算コストに見合った認識精度は得られていない。また、後者の問題に対しては、自然音声を上記ＭＡＰやＭＬＬＲ等の話者適応技術によって音響モデルを学習することが考えられる。しかしながら、そうすると、逆に母音モデルの特徴空間が大きくなってしまい、結果として自然音声の認識率が向上しても、朗読音声の認識精度が悪くなり兼ねない。
【０００６】
ここで、上記発話スタイルとは、上記「話者の音響特性」のような発声器官の物理特性の違いではなく、話者の環境や文化等によって起こる音響特性のことである。例えば、方言,早口,ゆっくりしゃべる,はっきりと発音しない等である。
【０００７】
さらに、あらゆる騒音環境下において高性能な認識性能を保証することはできない。予め収録した騒音を学習音声に重畳した音声をモデル(マッチドモデル)化する方法によって良い認識性能が得られるが、全環境の騒音を収録するのは不可能である。そのために、騒音環境の場合も、上記話者適応の場合と同様に少数の騒音データから上記ＭＡＰやＭＬＬＲ等によって適応処理を行う方法がなされている。しかしながら、その場合であっても上記マッチドモデル化する方法と比較すると認識性能は劣る。また、利用者が手当たり次第に環境適応を行うと、音響モデルがどのようになるか予測がつかないために好ましくない。
【０００８】
利用者にとって、利用者自身の音声の音響特性は如何にもならないが、周りの騒音や発話スタイルに対しては対応が簡単である。例えば、騒音に対しては静かな場所に移動できるし、発話スタイルに対しては標準的な話し方をすればよい。したがって、誤認識の原因が、話者の音響特性によるものか発話スタイルによるものか騒音によるものかを判定して、判定結果を利用者に知らせることができれば、誤認識による不快感を少なくすることができることになる。また、発話スタイルへの適応を行わないことで、認識性能が向上しない無駄な適応処理を回避することができる。同様に、対応していない環境を通知してやることによって、無駄な環境適応処理を回避することができる。
【０００９】
しかしながら、多くの音声認識システムにおいては、利用者に誤認識理由すら通知してはいない。その理由は、誤認識の原因を一般の人が理解できるように説明するのが難しいためである。具体的には、上記ＨＭＭを用いた音声認識システムにおいては、入力音声の音韻性以外の情報を含んだ「Mel-frequency cepstral coefficients(以下、ＭＦＣＣと略称する)」等の特徴ベクトルと標準モデルとの確率統計距離を基準としたマッチングスコアの大小によって認識結果が判定されるので、誤認識の原因を音声学の知見に完全に(１対１の対応で)結び付けることができないからである。
【００１０】
入力音声と標準音声との物理的な距離尺度を基準とした認識システムにおいては、上述したような誤認識理由を教示する装置ではないが、標準的な発話を利用者に学習させる音声認識装置が提案されている(例えば、特許文献１参照)。
【００１１】
その他、上記誤認識理由通知を行うものとしては、以下のような音声認識方法及び装置がある(特許文献２参照)。この音声認識方法及び装置においては、音声が入力されると、音声認識タスクによって入力音声を分析し、予め登録されている音声データと比較して一致するものを検出する。その際に、認識結果が「ＮＧ」である場合には、ＮＧであった旨の表示と理由コードとを表示するようにしている。
【００１２】
また、従来の話者適応可能な音声認識システムにおいては、話者の音響特性と発話スタイルの違いが明確化されていないため、発話スタイルや周辺環境も話者の音響特性と同様に学習してしまうことになる。例えば、話者適応技術を用いて信頼性の高いサブワードだけに話者適応を行う音声認識装置及び自動音声認識装置がある(特許文献３参照)。この音声認識装置及び自動音声認識装置では、認識結果の尤度尺度が閾値以上になる信頼性の高いサブワードにのみモデル適応を行うことによって、適応による認識性能劣化を小さくするようにしている。
【００１３】
【特許文献１】
特開平０１‐２８５９９８号公報
【特許文献２】
特開２０００‐１１２４９７号公報
【特許文献３】
特開２０００‐１８１４８２号公報
【００１４】
【発明が解決しようとする課題】
しかしながら、上記従来の音声認識装置や音声認識方法においては、以下のような問題がある。
【００１５】
すなわち、先ず、上記特許文献１に開示された音声認識装置においては、上記のような発話スタイルと話者の音響特性とを区別することはできないし、周辺環境に適応することもできない。さらに、認識を行う認識モードと、指定単語の発話者による音節特徴パターンを作成して登録する登録モードとを有している。そして、上記登録モードでは、発声単語を指示すると共に、正しく認識されるための発声方法(つまり、誤認識され易い理由)を指示するようになっている。ところが、上記登録モードは認識モードと分離しているため、認識モードにおいて誤認識が発生した場合に誤認識の理由を発話者に通知することができず、任意文の音声入力時において誤り原因を知らせることができないという問題がある。
【００１６】
また、上記特許文献２に開示された音声認識方法及び装置においては、入力音声の認識に失敗した場合にその理由情報を通知するのであるが、その通知内容は精々「比較すべき音声登録データなし」や「入力音量過多」等の程度である。また、誤認識理由を取得する手段や方法が開示されておらず、複数の要因が重なり合って発生する誤認識の理由をどのように取得するのかは不明である。したがって、十分な誤認識理由を利用者に通知することができないという問題がある。
【００１７】
また、上記特許文献３に開示された音声認識装置及び自動音声認識装置においては、信頼尺度が閾値以上になるサブワードにモデル適応を行うのであるが、実際に信頼度の定義や信頼度の閾値を決めるのは非常に難しい。例えば、信頼度の閾値を低くし過ぎると適応による認識性能劣化は防げるのではあるが、適応を行う確率が低くなるために適応効果があまり得られない。したがって、そのようなトレードオフの関係を見極めるのは非常に難しいのである。
【００１８】
さらに、誤認識の原因が音響特性と発話スタイルと周辺環境との何れであるかを、区別することはできない。したがって、尤度尺度が閾値以上であって認識の信頼度が高い場合には、発話スタイルおよび周辺環境にも適応しようとすることになる。ところが、上述したように、発話スタイルは、自然音声を用いて学習した場合であっても認識率は劣化するものであるから同様に認識率の劣化を招き、結果的に無駄な計算をすることになる。また、誤認識した理由や信頼度が低い理由等を利用者に通知する理由取得・通知手段が存在しないために、利用者に不快感を与える可能性もある。
【００１９】
そこで、この発明の目的は、誤認識となる要因を判定して利用者に通知することが可能な音声処理装置,音声処理方法,音声処理プログラムおよびプログラム記録媒体を提供することにある。
【００２０】
【課題を解決するための手段】
上記目的を達成するため、この発明の音声処理装置は、入力された音声の特徴量と標準モデルとの比較を行うに際して、上記入力された音声の特徴量に基づいて複数の誤認識の要因に関する特徴量を求め,各要因毎に上記特徴量の上記標準モデルからのずれの度合いを算出する要因別ずれ算出手段と、上記算出されたずれの度合いが許容範囲を表す閾値内にあるか否かを判定すると共に,上記閾値内にある場合には,上記ずれの度合いを上記許容範囲内にあることを表す所定値に変換するずれ度合変換手段と、上記算出されたずれの度合いと上記変換されたずれの度合いとに基づいて最もずれの度合いが大きい要因を検出する要因検出手段と、上記検出された最もずれの大きい要因を誤認識となる要因として出力する誤認識要因出力手段を備えている。
【００２１】
上記構成によれば、入力音声波形の特徴量に基づいて、例えば人間が直感的に理解し易い誤認識の要因に関する特徴量が求められる。そして、上記特徴量と標準モデルとのずれの度合が最も大きな要因が誤認識となる原因として検出され、ユーザに対して出力される。こうして、利用者に、誤認識となる原因を知らせることによって、結果的に誤認識に至った場合における利用者の不快感が減少される。
【００２２】
また、１実施例の音声処理装置では、上記誤認識要因出力手段は、上記検出された最もずれの大きい要因が複数存在する場合には、誤認識要因を出力せずに、音声の入力を再度行うことを促すメッセージを出力するようになっている。
【００２３】
上記最もずれの大きい要因が複数存在する場合には、突発的な雑音が発生した場合に多い。この実施例によれば、このような場合には、再入力を促すことによって、突発的な雑音に対して頑健に上記要因の分析が行われる。
【００２４】
また、１実施例の音声処理装置では、上記誤認識要因出力手段による上記メッセージの出力に従って音声が再度入力された場合には、上記許容範囲を表す閾値を上記許容範囲が狭くなるように変更する閾値変更手段を備えている。
【００２５】
この実施例によれば、上記許容範囲を表す閾値が上記許容範囲を狭くするように変更されるため、ずれの度合いが強調されることになる。したがって、誤認識の要因分析結果がより得易くなり、何度も利用者に音声入力させる手間が不要になる。
【００２６】
また、１実施例の音声処理装置では、上記誤認識要因出力手段は、上記検出された最もずれの大きい要因が前回の音声入力時と同じ要因である場合は、２番目にずれが大きい要因を上記誤認識となる要因として出力するようになっている。
【００２７】
この実施例によれば、利用者に対して何度も同じ指示を出さないようにして、利用者の不快感が減らされる。
【００２８】
また、１実施例の音声処理装置では、上記標準モデルは確率関数で表されており、上記要因別ずれ算出手段は、上記誤認識の要因に関する特徴量としてパワー,話速,話者性および周辺環境雑音の特徴量を求め、各要因毎に、上記標準モデルを表す確率関数における当該要因の特徴量に基づく確率値を用いて、当該標準モデルとのずれの度合いを算出するようになっている。
【００２９】
この実施例によれば、入力音声波形の特徴量に基づいて、人間が直感的に理解し易い誤認識の要因に関する特徴量が求められる。さらに、上記ずれの度合いを累積確率値によって表すことによって、異なる要因間のずれの度合いを確率値で比較することが可能になる。したがって、ずれの度合いの値に特別な正規化を施すことなく、最もずれの大きな要因を検出することが可能になる。
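[0030]
The speech processing method of the present invention, when comparing the features of input speech with a standard model, obtains features relating to a plurality of misrecognition factors based on the features of the input speech; calculates, for each factor, the degree of deviation of that feature from the standard model; determines whether the calculated degree of deviation is within a threshold representing an allowable range and, if it is, converts the degree of deviation into a predetermined value indicating that it is within the allowable range; detects the factor with the largest degree of deviation based on the calculated and converted degrees of deviation; and outputs the detected factor as the factor causing misrecognition.
[0031]
According to the above configuration, the user is informed of the cause of misrecognition in terms of factors that are, for example, easy for a person to understand intuitively, which reduces the user's discomfort when misrecognition eventually occurs.
[0032]
The speech processing program of the present invention causes a computer to function as the factor-specific deviation calculating means, deviation degree converting means, factor detecting means, and misrecognition factor output means of the speech processing apparatus of the present invention.
[0033]
According to the above configuration, the user is informed of the cause of misrecognition in terms of factors that are, for example, easy for a person to understand intuitively, which reduces the user's discomfort when misrecognition eventually occurs.
[0034]
The program recording medium of the present invention has the speech processing program of the present invention recorded on it.
[0035]
According to the above configuration, when the program is read and executed by a computer, the cause of misrecognition is presented to the user in terms of factors that are, for example, easy for a person to understand intuitively. This reduces the user's discomfort when misrecognition eventually occurs.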
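[0036]
[Embodiments of the Invention]
The present invention is described in detail below with reference to the illustrated embodiment. FIG. 1 shows the hardware configuration of the speech processing apparatus of this embodiment.
[0037]
In FIG. 1, reference numeral 1 denotes a central processing unit that performs numerical computation, control, and other processing, following the procedures described in this embodiment. 2 denotes a storage device composed of RAM (random access memory), ROM (read-only memory), and the like, which stores the processing procedures (the speech processing program) executed by the central processing unit 1 and the temporary data they require. 3 denotes an external storage device composed of a hard disk or the like, which stores standard patterns (templates), standard models, and other data for speech processing. 4 denotes an input device composed of a microphone, keyboard, and the like, which accepts speech uttered by the user and character strings entered by key. 5 denotes an output device composed of a display, speaker, and the like, which outputs analysis results or information obtained by processing them. 6 denotes a bus that interconnects the devices from the central processing unit 1 through the output device 5. In addition to the configuration shown in FIG. 1, the hardware may include a communication interface for connecting to a communication network such as the Internet.
[0038]
In this embodiment the speech processing apparatus and the speech processing program are independent, but they may also be incorporated as part of another apparatus or another program. In that case, input is provided indirectly through that other apparatus or program.
[0039]
The processing executed in this embodiment is described below on the basis of the above hardware configuration.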
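[0040]
FIG. 2 is a block diagram showing the functional configuration of the speech processing apparatus of this embodiment. The user's speech and its label (a text transcription of the utterance) are input from the input unit 11. The input speech is digitized by the A/D conversion unit 12. The input text is passed through unchanged.
[0041]
The digitized signal is converted by the feature extraction unit 13 into an MFCC vector for each fixed time interval (frame). For the detailed method of computing MFCCs, see "The HTK Book" by S. Young et al. cited above. MFCC is only one feature analysis method; linear prediction filter coefficients or the like may be used equally well.
[0042]
As noted above, so that the speech processing apparatus and program can easily be incorporated into other apparatuses and programs, the feature extraction unit 13 can also accept already-extracted feature parameters directly from an external device. In that case, the parameters input from the external device and the standard model described later must use the same feature analysis method. For example, if the patterns of the standard model are expressed as MFCCs, the input parameters must also be expressed as MFCCs. Here too, the input text is passed through unchanged.
[0043]
The MFCC vector sequence extracted by the feature extraction unit 13 is divided into per-phoneme segments by the segment division unit 14, using the set of standard models stored in the standard model storage unit 18. This division is performed as follows.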
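[0044]
That is, when the standard models are HMMs, let a_ij be the probability of a transition from HMM state i to state j, and let b_j(O_t) be the probability of observing the feature vector O_t of frame t in HMM state j. The log likelihood L_N(T) of reaching the final HMM state N at the final frame T is then obtained with the Viterbi algorithm according to
L_N(T) = max_i[L_i(T) + log(a_iN)]
L_j(t) = max_i[L_i(t-1) + log(a_ij)] + log(b_j(O_t))
where max_i[f(i)] denotes the maximum of the function f(i) over i.
The state number for every frame on the path by which L_N(T) was obtained (that is, the path that reaches the final state N at the final frame T) is stored, and the stored state numbers are assigned to the feature vectors (MFCC vectors), which divides the feature vector sequence into phoneme units.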
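The recursion above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `viterbi_align` is hypothetical, all states are treated as emitting, the path is assumed to start in state 0 and end in the last state, and log probabilities are passed in directly (with `-inf` for forbidden transitions).

```python
import math

NEG_INF = -math.inf

def viterbi_align(log_a, log_b):
    """Return (best log likelihood, state index per frame).

    log_a[i][j]: log transition probability from state i to state j.
    log_b[t][j]: log probability of observing frame t in state j.
    Assumption: the path starts in state 0 and ends in the last state.
    """
    n_frames, n_states = len(log_b), len(log_a)
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_b[0][0]
    for t in range(1, n_frames):
        for j in range(n_states):
            # L_j(t) = max_i[L_i(t-1) + log a_ij] + log b_j(O_t)
            best_i = max(range(n_states),
                         key=lambda i: score[t - 1][i] + log_a[i][j])
            score[t][j] = score[t - 1][best_i] + log_a[best_i][j] + log_b[t][j]
            back[t][j] = best_i
    # Trace back the stored state numbers from the final state.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[n_frames - 1][n_states - 1], path
```

Runs of identical state numbers in the returned path mark the frame ranges that belong to each state, and grouping the states of one phoneme model gives the per-phoneme segments.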
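[0045]
If the method described above seems difficult, the procedure in "The HTK Book" by S. Young et al. may be followed instead.
[0046]
The feature vector sequence divided into per-phoneme segments in this way is given the text label and input to the factor analysis unit 15, which examines the factors that cause misrecognition. The message creation unit 16 creates the character string of a message to present to the user according to the analysis result of the factor analysis unit 15. Finally, the message presentation unit 17 notifies the user based on the created string, either by displaying the message on the display of the output device 5 or by converting it to synthesized speech with a built-in text-to-speech means and outputting it from the speaker.
[0047]
If the speech processing apparatus and program are incorporated as part of another apparatus or program, the message presentation unit 17 returns the created string to that other apparatus.
[0048]
That is, the A/D conversion unit 12, feature extraction unit 13, segment division unit 14, factor analysis unit 15, message creation unit 16, and part of the message presentation unit 17 are implemented by the central processing unit 1; the input unit 11 by the input device 4; the remainder of the message presentation unit 17 by the output device 5; and the standard model storage unit 18 by the external storage device 3. Besides the processing of units 12 to 17 according to this embodiment, the central processing unit 1 also performs various other operations such as computation and decision processing, timekeeping, and input/output processing.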
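[0049]
The analysis of misrecognition factors by the factor analysis unit 15 and the creation of messages by the message creation unit 16 are described in detail below. FIGS. 3 and 4 are flowcharts of the factor analysis and message creation processing executed by the factor analysis unit 15 and the message creation unit 16. Steps S20 and S22 are performed by the message creation unit 16; the other steps are performed by the factor analysis unit 15.
[0050]
When the segment division unit 14 finishes dividing the input into segments, the factor analysis and message creation processing starts. First, in step S1, it is determined whether there is input from the segment division unit 14. If there is, processing proceeds to step S2. In step S2, the labeled feature vectors, divided into segments, are read from the segment division unit 14. In step S3, it is determined whether this is the first input, based on the value of a counter that counts and stores the number of consecutive inputs from the segment division unit 14. If it is the first input, processing proceeds to step S5; otherwise it proceeds to step S4.
[0051]
In step S4, the threshold used later to determine whether the degree of deviation between the feature vectors and the standard model is within the allowable range is changed from the standard threshold used for the first input to a threshold that depends on the number of inputs. The threshold is set to decrease stepwise from the standard threshold as the number of inputs increases. The threshold is also used to set the degree of deviation to a predetermined value when the deviation between the feature vectors and the standard model is within the allowable range; it is set in advance for each deviation factor and stored in the external storage device 3 or the like. Because the threshold depends on the recognition performance of the speech recognition system, it is determined experimentally in advance from feature vectors obtained from the utterances of speakers with a recognition rate of 95% or higher.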
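As a rough illustration of step S4, the following sketch follows the text literally: the threshold starts at the standard value and decreases stepwise with each repeated input. The function name, the step size, and the lower bound are all hypothetical; the text only says the values are fixed experimentally per factor.

```python
def threshold_for_attempt(standard_threshold, attempt, step=0.02, floor=0.0):
    """Threshold for the given attempt (0 = first input).

    step and floor are illustrative assumptions, not values from the text.
    """
    return max(floor, standard_threshold - step * attempt)
```

A caller would keep one standard threshold per factor (power, speech rate, speaker characteristics, noise) and pass the current input count as `attempt`.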
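[0052]
In step S5, the degree of deviation between the segment of the non-speech interval preceding the utterance and a noise model is calculated. The noise model is obtained by training on noise recorded in advance and is stored in the standard model storage unit 18. The degree of deviation between the non-speech segment (a feature vector sequence) and the noise model is obtained as the cumulative probability value of the log likelihood of observing the feature vector sequence of the non-speech interval given the noise model.
[0053]
Specifically, let M_n be the noise model and X the noise feature vector sequence. Let L(X|M_n) be the log likelihood of observing the input feature vector sequence X given the noise model M_n, and let T be the number of frames (duration) of the noise feature vector sequence. The cumulative probability value S_n of the log likelihood normalized by the duration T, x (= L(X|M_n)/T) (hereinafter, the normalized log likelihood), is then given by
S_n = ∫_a^b N_n(x; μ_n, σ_n) dx
Here N_n(x; μ_n, σ_n) is a single Gaussian distribution over the random variable x with mean μ_n and variance σ_n, estimated in advance from the training data. The integration range is chosen as follows. If the normalized log likelihood of the input noise is less than μ_n, a is the minimum normalized log likelihood in the training data and b is the normalized log likelihood of the input noise. If it is greater than μ_n, a is the normalized log likelihood of the input noise and b is the maximum normalized log likelihood in the training data. The probability density function is expressed as a single Gaussian to reduce computation; a Gaussian mixture or the like may be used instead.
[0054]
The smaller the cumulative probability value S_n, the farther the normalized log likelihood of the input noise lies from the mean μ_n of the single Gaussian distribution of the training data's normalized log likelihood x, which indicates that the features of the input noise deviate greatly from the trained noise model.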
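The cumulative probability value described here is the Gaussian probability mass between the input value and the far end of the training range on the same side of the mean. A minimal sketch, assuming the Gaussian is parameterized by mean and variance as in the text; the function names are hypothetical:

```python
import math

def gaussian_cdf(x, mean, variance):
    """Cumulative distribution function of a single Gaussian."""
    std = math.sqrt(variance)
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def cumulative_probability(value, mean, variance, train_min, train_max):
    """Integrate the Gaussian over [a, b] chosen by the side of the mean.

    value < mean: a = training minimum, b = input value.
    otherwise:    a = input value,      b = training maximum.
    """
    if value < mean:
        a, b = train_min, value
    else:
        a, b = value, train_max
    return max(0.0, gaussian_cdf(b, mean, variance) - gaussian_cdf(a, mean, variance))
```

An input close to the mean gives a value near 0.5, and the value shrinks toward 0 as the input moves into either tail, which matches the reading that a small value means a large deviation. The same function would apply to the normalized log likelihood against the noise model, to the per-state mean power, and to the speaker score.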
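[0055]
In step S6, it is determined whether the calculated degree of deviation between the non-speech segment (input noise) and the noise model (cumulative probability value S_n) is smaller than the threshold set in step S4 or the standard threshold, that is, whether the features of the input noise deviate greatly from the noise model. If they do, the maximum-likelihood state path found by the Viterbi algorithm cannot be trusted, so processing proceeds to step S20. Otherwise, processing proceeds to step S7.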
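[0056]
In step S7, the degree of deviation of the input speech power from the standard distribution is calculated. As in step S5, it is obtained as a cumulative probability value, here of the mean power of the feature vectors.
[0057]
Specifically, the power of the input speech feature vectors is first averaged for each HMM state. Next, the cumulative probability value S_p of the phoneme power is obtained by approximating it with the median of the cumulative probability values of the individual states in:
S_p ≈ median_in[ ∫_a^b N_p_in(p_in; μ_p_in, σ_p_in) dp_in ]
Here N_p_in(p_in; μ_p_in, σ_p_in) is a single Gaussian distribution, with mean μ_p_in and variance σ_p_in, over the random variable p_in, the mean power assigned to HMM state in; it is estimated in advance from the training data. The integration range is chosen as follows. If the mean power assigned to state in in the input speech is less than μ_p_in, a is the minimum power in the training data and b is the mean power of the input speech. If it is greater than μ_p_in, a is the mean power of the input speech and b is the maximum power in the training data. The probability density function is expressed as a single Gaussian to reduce computation; a Gaussian mixture or the like may be used instead.
[0058]
As described above, by treating the stochastic process of each state as independent and approximating the phoneme power cumulative probability value S_p by the median of the per-state cumulative probability values, there is no need for complicated estimation or integration of a joint probability density function Prob(i1, i2, ..., in) whose random variables are the powers of the phoneme's states.
[0059]
The larger the cumulative probability value S_p, the closer the speech is to the standard speaking style. In addition, the integration range makes it possible to determine whether the power is lower (mean input power < μ_p_in) or higher (mean input power > μ_p_in) than in the standard speaking style.
[0060]
In step S8, if the calculated degree of deviation between the input speech power and the standard distribution (cumulative probability value S_p) is larger than the threshold set in step S4 or the standard threshold, the value of S_p is converted to the constant 1 and output. This allows the degree of deviation to be ignored when the deviation between the input speech power and the standard model is small.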
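Steps S7 and S8 can be illustrated with two small helpers. This is a sketch, with hypothetical function names, that assumes the per-state cumulative probability values have already been computed:

```python
import statistics

def phoneme_power_score(state_scores):
    """Approximate S_p by the median of the per-state cumulative probabilities."""
    return statistics.median(state_scores)

def convert_deviation(score, threshold):
    """Replace a score inside the allowable range (score > threshold) by 1."""
    return 1.0 if score > threshold else score
```

For example, per-state values 0.2, 0.9, and 0.4 give S_p = 0.4; with a threshold of 0.3 that score is converted to 1, so power drops out of the later comparison. The same conversion is applied to the speech rate and speaker scores in steps S10 and S12.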
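[0061]
In step S9, the degree of deviation of the input speech rate from the standard distribution is calculated. As in step S5, it is obtained as a cumulative probability value, here of the duration.
[0062]
Specifically, the duration T is first computed from the total number of frames of the feature vectors belonging to the segment of the input phoneme. The duration T is the time taken to utter the phoneme, and its reciprocal represents the speech rate. Next, the cumulative probability value S_T of the duration is obtained by
S_T = ∫_a^b P(x; λ) dx
Here P(x; λ) is a Poisson distribution over the random variable x with mean λ, estimated in advance from the training data. The integration range is chosen as follows. If the duration T of the input phoneme is less than λ, a is the minimum value in the training data and b is T. If T is greater than λ, a is T and b is the maximum value in the training data.
[0063]
The larger the cumulative probability value S_T, the closer the duration T is to the standard distribution. In addition, the integration range makes it possible to determine whether the speech rate is faster (T < λ) or slower (T > λ) than in the standard speaking style.
[0064]
In step S10, if the calculated degree of deviation between the input speech rate and the standard distribution (cumulative probability value S_T) is larger than the threshold set in step S4 or the standard threshold, the value of S_T is converted to the constant 1 and output. This allows the degree of deviation to be ignored when the deviation between the input speech rate and the standard model is small.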
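For the duration score, a sketch along the same lines is shown below. It assumes, as an interpretation not stated in the text, that the integral over the Poisson distribution is evaluated as a sum over integer frame counts; the function names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of k frames given mean lam (computed in log space)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def duration_cumulative_probability(frames, lam, train_min, train_max):
    """Sum the Poisson mass between the input duration and the training bound.

    frames < lam (fast speech): sum from the training minimum up to frames.
    otherwise (slow speech):    sum from frames up to the training maximum.
    """
    if frames < lam:
        low, high = train_min, frames
    else:
        low, high = frames, train_max
    return sum(poisson_pmf(k, lam) for k in range(low, high + 1))
```

Which branch was taken also tells the caller whether the phoneme was shorter or longer than the standard, which is what lets the message say "too fast" or "too slow".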
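[0065]
In step S11, the degree of deviation of the input speaker's acoustic characteristics (speaker characteristics) from the standard distribution is calculated. As in step S5, it is obtained as the cumulative probability value of the log likelihood of observing the input feature vector sequence given the standard model.
[0066]
Specifically, let M_s be the standard model and X the input feature vector sequence. Let L(X|M_s) be the log likelihood of observing X given the standard model M_s, and let T be the number of frames (duration) of X. The cumulative probability value S_s of the normalized log likelihood y (= L(X|M_s)/T), normalized by the duration T, is then given by
S_s = ∫_a^b N_s(y; μ_s, σ_s) dy
Here N_s(y; μ_s, σ_s) is a single Gaussian distribution over the random variable y with mean μ_s and variance σ_s, estimated in advance from the training data. The integration range is chosen as follows. If the normalized log likelihood of the input feature vectors is less than μ_s, a is the minimum normalized log likelihood in the training data and b is the normalized log likelihood of the input feature vectors. If it is greater than μ_s, a is the normalized log likelihood of the input feature vectors and b is the maximum normalized log likelihood in the training data. The probability density function is expressed as a single Gaussian to reduce computation; a Gaussian mixture or the like may be used instead.
[0067]
The larger the cumulative probability value S_s, the closer the acoustic characteristics of the input speaker are to those of the standard speaker. Unlike the power and speech rate of the input speech, which belong to speaking style, however, the integration range carries no meaning here.
[0068]
In step S12, if the calculated degree of deviation between the input speaker's acoustic characteristics and the standard distribution (cumulative probability value S_s) is larger than the threshold set in step S4 or the standard threshold, the value of S_s is converted to the constant 1 and output. This allows the degree of deviation to be ignored when the deviation between the input speaker's acoustic characteristics and the standard model is small.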
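[0069]
In step S13, the cumulative probability values S_p, S_T, and S_s set in steps S8, S10, and S12 are compared directly, and the factor with the smallest value, that is, the factor farthest from the standard model, is determined to be the factor of the recognition error. If the cumulative probability values of all factors were converted to 1 in steps S8, S10, and S12, this step is not performed. In step S14, an analysis message for the segment is created based on the result of step S13 and saved in the RAM of the storage device 2 or the like, associated with the label name of the segment and the cumulative probability values S_p, S_T, and S_s of the factors. The analysis message is created by embedding the result of step S13 into fixed keywords, as shown under <Detailed information> in FIG. 5. If there is no determination result, no analysis message is created.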
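The per-segment decision of step S13 amounts to taking the minimum over the converted scores, skipping the segment when every score has been converted to 1. A minimal sketch with a hypothetical function name and factor labels:

```python
def detect_segment_factor(scores):
    """Return the factor with the smallest score, or None if all are 1.

    scores maps a factor label (an illustrative name) to its converted
    cumulative probability value.
    """
    if all(value == 1.0 for value in scores.values()):
        return None  # every factor is within the allowable range
    return min(scores, key=scores.get)
```

Because all three scores are probabilities, they can be compared directly without any per-factor normalization.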
[0070]
In step S15, it is determined whether all segments have been input. If so, processing proceeds to step S16; otherwise it returns to step S7 and moves on to the next segment.
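[0071]
In step S16, based on the cumulative probability values S_p, S_T, and S_s of all segments saved in the RAM of the storage device 2 or the like in step S14, an utterance-wide score S_i_total (a joint probability) is obtained for each factor i by
S_i_total = S_i(1) × ... × S_i(j) × ... × S_i(SN)
where S_i(j) is the cumulative probability for the j-th segment
and SN is the total number of input segments.
The factor for which the utterance-wide score S_i_total obtained in this way is smallest is saved in a buffer.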
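The utterance-wide score is a plain product over segments, computed separately for each factor. A short sketch, assuming each segment's scores are held in a dictionary keyed by factor label (the names are illustrative) and Python 3.8 or later for `math.prod`:

```python
import math

def utterance_scores(segment_scores):
    """Multiply each factor's per-segment scores into one score per factor."""
    factors = segment_scores[0].keys()
    return {f: math.prod(seg[f] for seg in segment_scores) for f in factors}
```

With two segments scored {power: 1.0, rate: 0.5} and {power: 0.5, rate: 0.5}, the totals are 0.5 for power and 0.25 for rate, so rate is the factor saved in the buffer.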
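[0072]
In step S17, based on the utterance-wide scores calculated in step S16, it is determined whether all factors have the same score. If they do, processing proceeds to step S21; otherwise it proceeds to step S18. In step S18, it is determined whether the factor obtained in step S16 is the same as the misrecognition factor obtained for the previous input. If it is the same, processing proceeds to step S19; if it differs, processing proceeds to step S20. For the first input, all buffers have been initialized, so the result of this step is false (NO). In step S19, the misrecognition factor for the whole utterance is changed to the factor with the next smallest (second smallest) score. This prevents the same factor from being presented to the user again.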
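Steps S17 to S19 can be summarized as a small selection function. This is an illustrative sketch with hypothetical names; returning None stands for the branch to step S21, where re-input is requested.

```python
def choose_reported_factor(totals, previous=None):
    """Pick the factor to report from utterance-wide scores.

    Returns None when all scores are equal (sudden noise is presumed).
    If the smallest-score factor equals the previously reported one,
    the second smallest is reported instead.
    """
    if len(set(totals.values())) == 1:
        return None
    ranked = sorted(totals, key=totals.get)
    if ranked[0] == previous and len(ranked) > 1:
        return ranked[1]
    return ranked[0]
```

On the first input `previous` is None, so the comparison in step S18 is false and the smallest-score factor is reported as is.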
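[0073]
In step S20, the misrecognition factor, that is, the factor with the smallest score, is presented to the user in the form of a message, as shown in the upper half of FIG. 5. Where necessary, the analysis messages created in step S14 are presented as well, as shown under <Detailed information> in FIG. 5. If processing branched here from step S6, the message states that the cause of the error is noise. The input count is then reset to 0, and the factor analysis and message creation processing for the current input speech ends.
[0074]
In step S21, the following situation is handled. When, for example, the cumulative probability values of all factors were converted to the constant 1 in steps S8, S10, and S12, the scores of all factors are identical, meaning that no factor deviates notably from the standard model. This often happens, however, when sudden noise has occurred. In this step, therefore, the misrecognition factor is presumed to be sudden noise. In step S22, a message is presented asking the user to confirm whether there was sudden noise and prompting another input. The input count is then incremented, and processing returns to step S1 to wait for the same speech to be input again.
[0075]
In this way, by counting the number of inputs and narrowing the standard threshold (that is, the standard range) according to that count, the scores of all factors are kept from becoming identical, which prevents the cause of the error from remaining unknown on consecutive attempts. In other words, some result is always produced for the misrecognition factor, which reduces the user's discomfort.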
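[0076]
The speech processing apparatus configured and operating as described above is used, for example, incorporated in a speech recognition system, as follows. The feature vector sequence of the input speech and its label are input to the feature extraction unit 13 from the feature extraction unit on the main body side of the speech recognition system. The segment division unit 14 and the factor analysis unit 15 analyze the misrecognition factors as described above, and the message creation unit 16 creates a message presenting them. The message presentation unit 17 then returns this message to the system main body. When recognition of the input speech fails, the system main body displays the message concerning the misrecognized speech, returned from the speech processing apparatus, on its own output device. Furthermore, when the misrecognition factor is speaking style or ambient noise, wasteful adaptation can be avoided.
[0077]
In this way, the user can learn the cause of misrecognition or reduced confidence more concretely and, if the cause relates to speaking style, respond immediately. It also eliminates the discomfort that comes from not knowing the cause of misrecognition or reduced confidence.
[0078]
The functions of a speech recognition system incorporating this speech processing apparatus can also be achieved by incorporating the speech processing program into the speech recognition program of a speech recognition apparatus. Of course, the speech processing apparatus can also be used independently of the speech recognition apparatus to inform its user, in advance, of factors likely to cause misrecognition during recognition. In that case, knowing beforehand that their speaking style differs from the standard lets users perform subsequent speech recognition efficiently.
[0079]
As described above, in this embodiment the segment division unit 14 divides the feature vector sequence of the input speech into per-phoneme segments by comparison with the standard models. The factor analysis unit 15 obtains features relating to multiple factors from the feature vector sequence of each segment, calculates for each factor the degree of deviation between the feature and the standard model, and determines whether the calculated deviation is within the allowable range based on a threshold that is set narrower as the number of inputs increases. If the deviation is within the allowable range, the degree of deviation is converted to 1. The factor with the largest deviation is then detected from the per-factor results, and the message presentation unit 17 presents that factor based on the detection result.
[0080]
Therefore, by extracting from the feature vectors of the speech waveform misrecognition factors that are, for example, easy for a person to understand intuitively, and detecting the one with the largest deviation, it is possible to estimate what may be causing misrecognition. The user can thus be informed of the cause of misrecognition, which reduces the user's discomfort.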
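[0081]
The following four items are used as the main causes of misrecognition:
(A) deviation of speech power from the standard model
(B) deviation of speech rate from the standard model
(C) acoustic characteristics of the speaker
(D) ambient noise
Of these, factors (A) and (B) belong to speaking style. According to this embodiment, therefore, the cause of misrecognition can be reported to the user distinguished into speaker acoustic characteristics, speaking style, and ambient noise. When the factor is (A), (B), or (D), the user can respond appropriately at recognition time.
[0082]
Factors (A) to (C) and factor (D) are detected in slightly different ways. Noise buried within the user's speech interval is very hard to detect. As shown in FIG. 6, ambient noise is therefore detected from the silent interval before the user speaks. Ambient noise can be regarded as roughly stationary, so this detection method is considered unproblematic.
[0083]
However, sudden noise such as a horn or a station announcement occurring within the user's speech interval does cause misrecognition. Such sudden noise acts similarly on the deviations of all of factors (A) to (C), so it is difficult to identify sudden noise as the factor. In this embodiment, therefore, when the deviations of factors (A) to (C) detected within the user's speech interval are roughly equal, the misrecognition factor is presumed to be sudden noise. In that case, no misrecognition factor is presented; instead, the message presentation unit 17 outputs a message prompting re-input. When the speech is input again, the threshold is set still smaller. This makes the error analysis robust against sudden noise, and emphasizing the degree of deviation makes an analysis result easier to obtain, so the user need not speak repeatedly.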
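[0084]
The functions of the central processing unit 1 in the above embodiment as the factor-specific deviation calculating means, deviation degree converting means, factor detecting means, and misrecognition factor output means are realized by the speech processing program recorded on a program recording medium. The program recording medium in the above embodiment is a program medium consisting of the ROM. Alternatively, it may be a program medium that is loaded into and read by the external auxiliary storage device. In either case, the program reading means that reads the speech processing program from the program medium may access the program medium directly, or may download the program to a program storage area (not shown) provided in the RAM and read it from there. The download program for downloading from the program medium to the program storage area in the RAM is assumed to be stored in the main apparatus in advance.
[0085]
Here, the program medium is a medium that is separable from the main body and carries the program in a fixed manner. It includes tape media such as magnetic tape and cassette tape; disk media such as magnetic disks (floppy disks, hard disks) and optical disks (CD (compact disc)-ROM, MO (magneto-optical) disks, MD (MiniDisc), DVD (digital versatile disc)); card media such as IC (integrated circuit) cards and optical cards; and semiconductor memory such as mask ROM, EPROM (ultraviolet-erasable ROM), EEPROM (electrically erasable ROM), and flash ROM.
[0086]
If the speech processing apparatus of the above embodiment can connect to a communication network such as the Internet via a communication interface, the program medium may be a medium that carries the program fluidly, for example by download from the communication network. In that case, the download program for downloading from the communication network is assumed to be stored in the main apparatus in advance, or to be installed from another recording medium.
[0087]
What is recorded on the recording medium is not limited to programs; data can also be recorded.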
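[0088]
[Effects of the Invention]
As is clear from the above, the present invention obtains features relating to a plurality of misrecognition factors based on the features of the input speech, calculates for each factor the degree of deviation of that feature from the standard model, detects the factor with the largest degree of deviation, and outputs it as the factor causing misrecognition. The user can therefore be informed of the cause of misrecognition in terms of factors that are, for example, easy for a person to understand intuitively. When misrecognition occurs during speech recognition, the user can know clearly why it occurred, which prevents the discomfort of not knowing the cause.
[0089]
Furthermore, if the features of power, speech rate, speaker characteristics, and ambient noise are obtained as the features relating to misrecognition factors, the cause of misrecognition can be reported to the user distinguished into speaker acoustic characteristics, speaking style, and ambient noise. The user can thus respond appropriately at recognition time when the factor is power, speech rate, or ambient noise.
[0090]
In addition, because the apparatus is configured independently of the speech recognition apparatus, it can, depending on the situation, be combined with a speech recognition apparatus to form a speech recognition system, improving the efficiency and rate of recognition.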
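[Brief Description of the Drawings]
[FIG. 1] A diagram showing the hardware configuration of the speech processing apparatus of the present invention.
[FIG. 2] A block diagram showing the functional configuration of the speech processing apparatus shown in FIG. 1.
[FIG. 3] A flowchart of the factor analysis and message creation processing executed by the factor analysis unit and the message creation unit in FIG. 2.
[FIG. 4] A flowchart of the factor analysis and message creation processing, continued from FIG. 3.
[FIG. 5] A diagram showing an example of a message presented by the message presentation unit in FIG. 2.
[FIG. 6] A diagram showing an example of input speech to the segment division unit in FIG. 2.
[Description of Reference Numerals]
1: central processing unit
2: storage device
3: external storage device
4: input device
5: output device
11: input unit
12: A/D conversion unit
13: feature extraction unit
14: segment division unit
15: factor analysis unit
16: message creation unit
17: message presentation unit
18: standard model storage unit

[0001]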
[Technical Field of the Invention]
The present invention relates to a speech processing apparatus, speech processing method, speech processing program, and program recording medium used in speech recognition systems and the like.
[0002]
[Prior Art]
At present, speech recognition systems achieve high word recognition performance, even on speaker-independent tasks, for read speech, that is, speech read aloud from a transcript. This is because multi-speaker databases are available and the acoustic characteristics of most speakers can be learned. Speaker adaptation techniques such as maximum a posteriori estimation (hereinafter abbreviated as MAP) and maximum likelihood linear regression (hereinafter abbreviated as MLLR) also make it possible to learn a speaker's acoustic characteristics from a small number of speech samples.
[0003]
Here, the acoustic characteristics of a speaker are those that arise from differences in the physical characteristics of the vocal organs. For example, differences in vocal tract length cause the speech spectrum to differ from speaker to speaker. MAP and MLLR are described in detail in "The HTK Book" by S. Young et al.
[0004]
However, recognition performance for speech uttered naturally and freely (hereinafter, natural speech) is insufficient (Shinozaki et al., Proc. ASJ Meeting, pp. 17-18, Mar. 2002). Natural speech is said to be hard to recognize largely because of speaking style (Yamamoto et al., IEICE Trans., pp. 2438-2447, Nov. 2000). Even when a model is trained on both natural speech and read speech, the recognition rate for natural speech drops considerably. The likely causes are that it is difficult to build models covering every speech rate, and that in natural speech vowels in particular tend not to be articulated clearly (lazy articulation).
[0005]
To address the former cause, approaches such as the multi-path hidden Markov model (hereinafter abbreviated as HMM), which separates transition paths by speech rate, have been proposed (Li et al., Proc. ASJ Meeting, pp. 89-90, Mar. 2002). However, they have not achieved recognition accuracy commensurate with their computational cost. For the latter problem, one could train the acoustic model on natural speech with a speaker adaptation technique such as MAP or MLLR. Doing so, however, enlarges the feature space of the vowel models, so that even if the recognition rate for natural speech improves, recognition accuracy for read speech may deteriorate.
[0006]
Here, speaking style refers not to differences in the physical characteristics of the vocal organs, as with the speaker's acoustic characteristics above, but to acoustic characteristics that arise from the speaker's environment, culture, and so on. Examples are dialect, speaking fast, speaking slowly, and not articulating clearly.
[0007]
Furthermore, high recognition performance cannot be guaranteed in every noise environment. Good performance is obtained by modeling speech on which pre-recorded noise has been superimposed (a matched model), but recording the noise of every environment is impossible. For noise environments, therefore, adaptation with MAP, MLLR, or the like from a small amount of noise data is used, as in speaker adaptation. Even so, recognition performance is inferior to that of the matched-model method. It is also undesirable for users to perform environment adaptation indiscriminately, because the resulting acoustic model becomes unpredictable.
[0008]
Users can do nothing about the acoustic characteristics of their own voice, but they can easily deal with surrounding noise and speaking style: they can move to a quiet place, or speak in a standard manner. Therefore, if the system can determine whether a misrecognition is due to speaker acoustic characteristics, speaking style, or noise, and tell the user the result, the discomfort caused by misrecognition can be reduced. In addition, by not adapting to speaking style, wasteful adaptation that does not improve recognition performance can be avoided. Likewise, notifying the user of an unsupported environment avoids wasteful environment adaptation.
[0009]
In many speech recognition systems, however, the user is not even told the reason for a misrecognition, because it is difficult to explain the cause in terms a layperson can understand. Specifically, in an HMM-based speech recognition system, the recognition result is determined by the magnitude of a matching score based on a probabilistic distance between the standard model and feature vectors such as mel-frequency cepstral coefficients (hereinafter abbreviated as MFCC), which contain information other than the phonemic content of the input speech. The cause of a misrecognition therefore cannot be tied completely (one to one) to phonetic knowledge.
[0010]
For recognition systems based on a physical distance measure between input speech and standard speech, a speech recognition apparatus has been proposed that, while not a device that reports the reason for misrecognition as described above, has the user learn a standard way of speaking (see, for example, Patent Document 1).
[0011]
As for reporting the reason for misrecognition, there is the following speech recognition method and apparatus (see Patent Document 2). When speech is input, a speech recognition task analyzes it and compares it with pre-registered speech data to find a match. If the recognition result is "NG", an indication to that effect and a reason code are displayed.
[0012]
In conventional speaker-adaptive speech recognition systems, the distinction between speaker acoustic characteristics and speaking style is not made explicit, so speaking style and the surrounding environment end up being learned along with the speaker's acoustic characteristics. For example, there are a speech recognition apparatus and an automatic speech recognition apparatus that apply speaker adaptation only to highly reliable subwords (see Patent Document 3). They perform model adaptation only on reliable subwords for which the likelihood measure of the recognition result is at or above a threshold, which keeps the degradation of recognition performance caused by adaptation small.
[0013]
[Patent Document 1]
JP H01-285998 A
[Patent Document 2]
JP 2000-112497 A
[Patent Document 3]
JP 2000-181482 A
[0014]
[Problems to Be Solved by the Invention]
However, the conventional speech recognition apparatuses and methods described above have the following problems.
[0015]
First, the speech recognition apparatus disclosed in Patent Document 1 cannot distinguish speaking style from speaker acoustic characteristics, nor can it adapt to the surrounding environment. It has a recognition mode, in which recognition is performed, and a registration mode, in which syllable feature patterns of designated words are created from the speaker's utterances and registered. In the registration mode, the apparatus indicates the word to utter and also how to utter it so that it is recognized correctly (that is, why it tends to be misrecognized). However, because the registration mode is separate from the recognition mode, the apparatus cannot tell the speaker the reason when a misrecognition occurs in recognition mode, and so cannot report the cause of an error during speech input of arbitrary sentences.
[0016]
The speech recognition method and apparatus disclosed in Patent Document 2 report reason information when recognition of the input speech fails, but the content is at most on the level of "no registered speech data to compare" or "input volume too high". No means or method of obtaining the reason is disclosed, and it is unclear how the reason for a misrecognition arising from several overlapping factors would be obtained. The user therefore cannot be given an adequate reason for the misrecognition.
[0017]
The speech recognition apparatus and automatic speech recognition apparatus disclosed in Patent Document 3 perform model adaptation on subwords whose confidence measure is at or above a threshold, but in practice it is very difficult to define the confidence measure and to decide its threshold. For example, if the confidence threshold is set too low, degradation of recognition performance due to adaptation can be prevented, but adaptation is performed less often, so little benefit is obtained from it. Judging this trade-off is very difficult.
[0018]
Moreover, these apparatuses cannot distinguish whether the cause of a misrecognition is acoustic characteristics, speaking style, or the surrounding environment. When the likelihood measure is at or above the threshold and recognition confidence is high, they therefore try to adapt to speaking style and the surrounding environment as well. As noted above, however, for speaking style the recognition rate degrades even when training uses natural speech, so adaptation likewise degrades the recognition rate and the computation is wasted. In addition, because there is no means of obtaining and reporting the reason for a misrecognition or for low confidence, the user may be left feeling uncomfortable.
[0019]
An object of the present invention is therefore to provide a speech processing apparatus, speech processing method, speech processing program, and program recording medium that can determine the factor causing misrecognition and notify the user of it.
[0020]
[Means for Solving the Problems]
In order to achieve the above object, the speech processing apparatus according to the present invention relates to a plurality of causes of erroneous recognition based on the input speech feature amount when comparing the input speech feature amount with a standard model. A factor-specific deviation calculation unit that obtains a feature quantity and calculates the degree of deviation of the feature quantity from the standard model for each factor, and whether the calculated deviation degree is within a threshold that represents an allowable range. And when it is within the threshold value, a deviation degree converting means for converting the degree of deviation into a predetermined value representing that the deviation is within the allowable range, and the calculated degree of deviation and the converted value. A factor detecting means for detecting a factor with the greatest degree of deviation based on the degree of deviation, and an error recognition factor output means for outputting the detected factor with the largest deviation as a factor causing misrecognition. That.
[0021]
According to the above configuration, based on the feature amount of the input speech waveform, for example, a feature amount relating to a factor of misrecognition that is easy for a human to understand intuitively is obtained. Then, the factor with the greatest degree of deviation between the feature quantity and the standard model is detected as a cause of erroneous recognition, and is output to the user. Thus, by notifying the user of the cause of misrecognition, the user's discomfort in the event of misrecognition is reduced.
[0022]
Further, in the speech processing apparatus of one embodiment, the misrecognition factor output means outputs the voice again without outputting the misrecognition factor when there are a plurality of factors having the largest deviation detected. A message prompting you to do it is output.
[0023]
When a plurality of factors share the largest deviation, sudden noise has occurred in many cases. According to this embodiment, prompting re-input in such a case makes the factor analysis robust against sudden noise.
[0024]
Further, the speech processing apparatus of one embodiment includes threshold changing means that, when speech is input again in response to the message output by the misrecognition factor output means, changes the threshold representing the allowable range so that the allowable range is narrowed.
[0025]
According to this embodiment, the threshold representing the allowable range is changed so as to narrow the allowable range, which emphasizes the degree of deviation. A factor analysis result for the misrecognition therefore becomes easier to obtain, and the user is spared the trouble of inputting the speech repeatedly.
[0026]
Also, in the speech processing apparatus of one embodiment, when the detected factor with the largest deviation is the same as the factor output for the previous speech input, the misrecognition factor output means outputs the factor with the second largest deviation as the factor causing misrecognition.
[0027]
According to this embodiment, the same instruction is not repeatedly issued to the user, and the user's discomfort is reduced.
[0028]
Further, in the speech processing apparatus of one embodiment, the standard model is represented by a probability function, and the factor-specific deviation calculation means obtains feature quantities of power, speech speed, speaker characteristics, and ambient environmental noise as the feature quantities relating to the factors of misrecognition, and calculates the degree of deviation from the standard model for each factor using the probability value that the probability function representing the standard model gives for the feature quantity of that factor.
[0029]
According to this embodiment, feature quantities relating to causes of misrecognition that are easy for a human to understand intuitively are obtained from the feature quantity of the input speech waveform. Further, because the degree of deviation is expressed as a cumulative probability value, the degrees of deviation of different factors can be compared directly as probability values. The factor with the largest deviation can therefore be detected without any special normalization of the deviation values.
[0030]
Further, in the speech processing method of the present invention, when the feature quantity of input speech is compared with a standard model, feature quantities relating to a plurality of factors of misrecognition are obtained from the feature quantity of the input speech, and the degree of deviation of each feature quantity from the standard model is calculated for each factor. It is then determined whether each calculated degree of deviation is within a threshold representing an allowable range, and a degree of deviation within the threshold is converted into a predetermined value indicating that it is within the allowable range. The factor with the largest degree of deviation is detected on the basis of the calculated degrees of deviation and the converted degrees of deviation, and the detected factor is output as the factor causing misrecognition.
[0031]
According to the above configuration, the cause of misrecognition is notified to the user in terms of, for example, a factor that is easy for a human to understand intuitively, which reduces user discomfort when misrecognition occurs.
[0032]
The voice processing program of the present invention causes a computer to function as a factor-specific deviation calculation means, a deviation degree conversion means, a factor detection means, and a misrecognition factor output means in the voice processing apparatus of the present invention.
[0033]
According to the above configuration, the cause of misrecognition is notified to the user in terms of, for example, a factor that is easy for a human to understand intuitively, which reduces user discomfort when misrecognition occurs.
[0034]
The program recording medium of the present invention records the voice processing program of the present invention.
[0035]
According to the above configuration, when the program is read and executed by a computer, the cause of misrecognition is presented to the user in terms of, for example, a factor that is easy for a human to understand intuitively. User discomfort in the event of misrecognition is thereby reduced.
[0036]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, the present invention will be described in detail with reference to the illustrated embodiments. FIG. 1 is a diagram illustrating a hardware configuration of the speech processing apparatus according to the present embodiment.
[0037]
In FIG. 1, reference numeral 1 denotes a central processing unit that performs numerical computation, control, and similar processing, and carries out computation and processing according to the processing procedure described in the present embodiment. Reference numeral 2 denotes a storage device composed of RAM (Random Access Memory), ROM (Read Only Memory), and the like, which stores the processing procedure (speech processing program) executed by the central processing unit 1 and the temporary data needed for that processing. Reference numeral 3 denotes an external storage device composed of a hard disk or the like, which stores standard patterns (templates) for speech processing, the standard model, and the like. Reference numeral 4 denotes an input device composed of a microphone, a keyboard, and the like, through which speech uttered by the user or a character string typed on the keys is input. Reference numeral 5 denotes an output device composed of a display, a speaker, and the like, which outputs the analysis result or information obtained by processing it. Reference numeral 6 denotes a bus that interconnects the devices from the central processing unit 1 through the output device 5. Note that, in addition to the configuration shown in FIG. 1, the hardware configuration of the speech processing apparatus may include a communication I/F for connecting to a communication network such as the Internet.
[0038]
In this embodiment the speech processing apparatus and the speech processing program are standalone, but they can also be incorporated as part of another apparatus or another program. In that case, input is performed indirectly through that other apparatus or program.
[0039]
Hereinafter, processing executed in the present embodiment will be described based on the above hardware configuration.
[0040]
FIG. 2 is a block diagram showing a functional configuration of the speech processing apparatus according to the present embodiment. From the input unit 11, the user's voice and its label (text notation of the utterance content) are input. The input voice is digitized by the A / D converter 12. At this time, the input text remains unchanged.
[0041]
The digitized signal is converted into an MFCC vector by the feature extraction unit 13 at fixed time intervals (frames). For the detailed method of obtaining the MFCC, refer to S. Young et al., "The HTKBOOK", cited above. MFCC is only one feature analysis method; the same processing applies when, for example, linear prediction coefficients are used.
[0042]
As noted above, so that the speech processing apparatus and the speech processing program can easily be incorporated into other apparatuses and programs, the feature extraction unit 13 can also directly accept parameters extracted by an external apparatus. In that case, the parameters input from the external apparatus and the standard model described later must use the same feature analysis method. For example, when the patterns of the standard model are expressed in MFCC, the feature quantities of the input parameters must also be expressed in MFCC. The input text is again passed through unchanged.
[0043]
The MFCC vector sequence extracted by the feature extraction unit 13 is divided by the segment division unit 14 into segments for each phoneme using a set of standard models stored in the standard model storage unit 18. The division into segments for each phoneme is performed as follows.
[0044]
That is, when the standard model is an HMM, let a_ij be the probability of a transition from state i to state j of the HMM, and let b_j(O_t) be the probability of observing feature vector O_t at frame t in state j. The log likelihood L_N(T) of reaching the final state N of the HMM at frame T is obtained by the Viterbi algorithm as
$$L_N(T) = \max_i \left[ L_i(T) + \log a_{iN} \right]$$
$$L_j(t) = \max_i \left[ L_i(t-1) + \log a_{ij} \right] + \log b_j(O_t)$$
where max_i [f(i)] denotes the maximum value of the function f(i) over i.
The state numbers of all frames on the path by which L_N(T) is obtained (that is, the path that reaches the final state N at the final frame T) are stored, and the feature vector (MFCC vector) sequence is divided into phoneme units according to the stored state numbers.
[0045]
If further detail on the above method is needed, refer to S. Young et al., "The HTKBOOK", cited above.
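The segmentation of paragraph [0044] can be illustrated with a minimal sketch. It assumes the log transition probabilities, log exit probabilities into the final state N, and per-frame log emission probabilities are already available as plain lists; the function name `viterbi_segments`, the start-in-state-0 convention, and the `state_phoneme` lookup are hypothetical and not taken from the patent.

```python
NEG_INF = float("-inf")

def viterbi_segments(log_emit, log_trans, log_exit, state_phoneme):
    """Viterbi alignment followed by grouping of frames into phoneme segments.

    log_emit[t][j]  = log b_j(O_t)
    log_trans[i][j] = log a_ij
    log_exit[i]     = log a_iN (transition into the final state N)
    state_phoneme[j] = phoneme label owning emitting state j (assumed lookup)
    """
    n_frames, n_states = len(log_emit), len(log_trans)
    # Assumption: every path starts in state 0.
    score = [log_emit[0][j] if j == 0 else NEG_INF for j in range(n_states)]
    back = []
    for t in range(1, n_frames):
        new_score, ptr = [], []
        for j in range(n_states):
            # L_j(t) = max_i [L_i(t-1) + log a_ij] + log b_j(O_t)
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit[t][j])
            ptr.append(best_i)
        score = new_score
        back.append(ptr)
    # L_N(T) = max_i [L_i(T) + log a_iN]
    last = max(range(n_states), key=lambda i: score[i] + log_exit[i])
    total = score[last] + log_exit[last]
    # Trace the stored state numbers back from the final frame.
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    # Group consecutive frames with the same phoneme label.
    # Simplification: two adjacent identical phonemes would be merged.
    segments = []
    for t, s in enumerate(path):
        label = state_phoneme[s]
        if segments and segments[-1][0] == label:
            segments[-1][2] = t
        else:
            segments.append([label, t, t])
    return total, path, [tuple(seg) for seg in segments]
```

Each returned segment is a `(label, first_frame, last_frame)` tuple, which is the form the later per-segment factor analysis would consume in this sketch.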
[0046]
The feature vector sequence divided into segments for each phoneme in this way is input to the factor analysis unit 15 together with its text label. The factor analysis unit 15 then examines the factors that cause misrecognition. The message creation unit 16 creates the character string of a message to be presented to the user according to the analysis result of the factor analysis unit 15. Finally, the message presentation unit 17 notifies the user either by displaying the message on the display of the output device 5 based on the created character string, or by converting it into synthesized speech with built-in text-to-speech synthesis means and outputting the speech from the speaker.
[0047]
However, when the speech processing apparatus and the speech processing program are incorporated as part of another apparatus or another program, the message presentation unit 17 returns the created character string to that other apparatus or program.
[0048]
That is, the A/D conversion unit 12, the feature extraction unit 13, the segment division unit 14, the factor analysis unit 15, the message creation unit 16, and part of the message presentation unit 17 are implemented by the central processing unit 1; the input unit 11 is implemented by the input device 4; the remaining part of the message presentation unit 17 is implemented by the output device 5; and the standard model storage unit 18 is implemented by the external storage device 3. In addition to the processing operations of units 12 to 17 according to the present embodiment, the central processing unit 1 also performs various other processing operations such as calculation and judgment processing, timing processing, and input/output processing.
[0049]
Hereinafter, the analysis of the cause of misrecognition by the factor analysis unit 15 and the creation of a message by the message creation unit 16 will be described in detail. FIGS. 3 and 4 are flowcharts of the factor analysis and message creation processing executed by the factor analysis unit 15 and the message creation unit 16. Steps S20 and S22 are processing by the message creation unit 16, and the other steps are processing by the factor analysis unit 15.
[0050]
When the segment division unit 14 completes the segmentation, the factor analysis / message creation processing operation starts. First, in step S1, it is determined whether or not there is an input from the segment dividing unit 14. If there is an input, the process proceeds to step S2. In step S2, the feature vector divided and labeled for each segment from the segment dividing unit 14 is fetched. In step S3, it is determined whether or not the input is the first time based on the value of the counter that counts and stores the number of continuous inputs from the segment dividing unit 14. As a result, if it is the first input, the process proceeds to step S5, and if not, the process proceeds to step S4.
[0051]
In step S4, the threshold used later to determine whether the degree of deviation (separation) between a feature vector and the standard model is within the allowable range is changed from the standard threshold used for the first input to a threshold corresponding to the number of inputs. The threshold is set so as to decrease stepwise from the standard threshold as the number of inputs increases. The threshold also serves to set the degree of deviation to a predetermined value when the degree of deviation between the feature vector and the standard model is within the allowable range, and it is set and stored in advance in the external storage device 3 or the like. Because the threshold depends on the recognition performance of the speech recognition system, it is determined experimentally in advance from feature vectors obtained from utterances of speakers whose recognition rate is 95% or higher.
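The stepwise threshold change of step S4 might look like the following sketch. The concrete threshold values are placeholders chosen for illustration only (the patent states that they are determined experimentally), and `threshold_for_retry` and `THRESHOLD_STEPS` are hypothetical names.

```python
# Hypothetical schedule: entry 0 is the standard threshold for the first input,
# later entries apply to successive re-inputs. Real values would be tuned
# experimentally, as the text describes.
THRESHOLD_STEPS = (0.30, 0.20, 0.10)

def threshold_for_retry(retry_count):
    """Return the allowable-range threshold for the given number of re-inputs.

    retry_count == 0 means the first input (standard threshold). The value
    decreases stepwise and stays at the last entry once the table is exhausted.
    """
    if retry_count < 0:
        raise ValueError("retry_count must be non-negative")
    index = min(retry_count, len(THRESHOLD_STEPS) - 1)
    return THRESHOLD_STEPS[index]
```

A lookup table is used here instead of an arithmetic decrement so that each step can be tuned independently; that is a design choice of this sketch, not something the source specifies.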
[0052]
In step S5, the degree of separation between the non-speech interval preceding the utterance and the noise model is calculated. The noise model is obtained by training on pre-recorded noise and is stored in the standard model storage unit 18. The degree of separation between the non-speech interval (feature vector sequence) and the noise model is obtained as the cumulative probability value of the log likelihood of observing the feature vector sequence of the non-speech interval given the noise model.
[0053]
Specifically, let Mn be the noise model and X the noise feature vector sequence. Let L(X | Mn) be the log likelihood of observing the input feature vector sequence X given the noise model Mn, and let T be the number of frames (duration) of the noise feature vector sequence. The cumulative probability value Sn of the log likelihood normalized by the duration T, x = L(X | Mn) / T (hereinafter referred to as the normalized log likelihood), is then expressed by the following equation.
$$S_n = \int_a^b N_n(x;\, \mu_n, \sigma_n)\, dx$$
Here, Nn(x; μn, σn) is a single Gaussian distribution with mean μn and variance σn for the random variable x, and is estimated in advance from learning data. The integration range depends on the input. When the normalized log likelihood of the input noise is smaller than μn, "a" is the minimum normalized log likelihood in the learning data and "b" is the normalized log likelihood of the input noise. When the normalized log likelihood of the input noise is larger than μn, "a" is the normalized log likelihood of the input noise and "b" is the maximum normalized log likelihood in the learning data. The probability density function is expressed as a single Gaussian distribution to reduce the amount of calculation; a Gaussian mixture distribution or the like may be used instead.
[0054]
The smaller the cumulative probability value Sn, the farther the normalized log likelihood of the input noise lies from the mean μn of the single Gaussian distribution of the normalized log likelihood x of the learning data, which indicates that the characteristics of the input noise deviate greatly from the learned noise model.
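As an illustration of the cumulative probability value described for step S5, the sketch below integrates a single Gaussian over the range [a, b] selected by the rule in paragraph [0053]. It assumes that the second distribution parameter is used as a standard deviation, that an input exactly equal to the mean falls in the upper branch, and that the learning-data minimum and maximum are supplied by the caller; none of these details is fixed by the text, and the function names are hypothetical.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian (sigma = standard deviation)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def cumulative_probability(value, mu, sigma, train_min, train_max):
    """Integrate N(x; mu, sigma) over [a, b] chosen from the input value.

    value < mu : a = minimum in the learning data, b = value
    otherwise  : a = value, b = maximum in the learning data
    A value outside the learning-data range yields 0.0.
    """
    if value < mu:
        a, b = train_min, value
    else:
        a, b = value, train_max
    return max(0.0, normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma))
```

Under this formulation the result shrinks toward 0 as the input moves away from the mean in either direction, which matches the interpretation that a small value means a large deviation.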
[0055]
In step S6, it is determined whether the calculated degree of separation (cumulative probability value Sn) between the non-speech interval (input noise) and the noise model is smaller than the threshold set in step S4 or the standard threshold, that is, whether the characteristics of the input noise deviate greatly from the noise model. If they deviate greatly, the maximum likelihood state path obtained by the Viterbi algorithm is not reliable, and the process proceeds to step S20. Otherwise, the process proceeds to step S7.
[0056]
In step S7, the degree of separation of the input speech power from the standard distribution is calculated. As in step S5, the degree of separation is obtained as a cumulative probability value, in this case of the average power of the feature vectors.
[0057]
Specifically, the power of the feature vectors of the input speech is first averaged for each state of the HMM. Next, the cumulative probability value Sp of the phoneme power is obtained, as an approximation, as the median of the cumulative probability values of the individual states i_n, as expressed by the following equation.
$$S_p \approx \operatorname*{median}_{i_n} \int_a^b N_{p_{i_n}}\!\left(p_{i_n};\, \mu_{p_{i_n}}, \sigma_{p_{i_n}}\right) dp_{i_n}$$
Here, Np_in(p_in; μp_in, σp_in) is a single Gaussian distribution with mean μp_in and variance σp_in for the random variable p_in, the average power allocated to state i_n of the HMM, and is estimated in advance from learning data. As for the integration range, when the average power allocated to state i_n in the input speech is smaller than μp_in, "a" is the minimum power in the learning data and "b" is the average power of the input speech. When the average power allocated to state i_n in the input speech is larger than μp_in, "a" is the average power of the input speech and "b" is the maximum power in the learning data. The probability density function is expressed as a single Gaussian distribution to reduce the amount of calculation; a Gaussian mixture distribution or the like may be used instead.
[0058]
Because the probability process is regarded as independent for each state and the cumulative probability value Sp of the phoneme power is approximated by the median of the per-state cumulative probability values, there is no need for complicated estimation and integration of the joint probability density function Prob(i1, i2, ..., in).
[0059]
The cumulative probability value Sp indicates that the larger the value, the closer to the standard utterance style. Also, it is possible to determine whether the power is smaller than the standard utterance style (average value of input power <μp_in) or larger (average value of input power> μp_in) from the range of integration.
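The per-state averaging and median approximation described for step S7 could be sketched as follows. The per-frame power values, the Viterbi state assignment, and the per-state model tuples `(mu, sigma, train_min, train_max)` are assumed inputs, `sigma` is treated as a standard deviation, and all function names are hypothetical.

```python
import math
from statistics import mean, median

def normal_cdf(x, mu, sigma):
    """Gaussian cumulative distribution function (sigma = standard deviation)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def cumulative_probability(value, mu, sigma, train_min, train_max):
    """Integrate the Gaussian over [min, value] or [value, max] depending on the side of mu."""
    a, b = (train_min, value) if value < mu else (value, train_max)
    return max(0.0, normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma))

def phoneme_power_score(frame_powers, frame_states, state_models):
    """Approximate Sp for one phoneme segment.

    frame_powers[t] : power of frame t in the segment
    frame_states[t] : HMM state assigned to frame t by the Viterbi path
    state_models[s] : (mu, sigma, train_min, train_max) for state s
    """
    # Average the power of the frames allocated to each state.
    powers_by_state = {}
    for power, state in zip(frame_powers, frame_states):
        powers_by_state.setdefault(state, []).append(power)
    # One cumulative probability per state, then the median over states.
    per_state = [
        cumulative_probability(mean(powers), *state_models[state])
        for state, powers in powers_by_state.items()
    ]
    return median(per_state)
```

Taking the median rather than a product keeps a single outlying state from dominating the phoneme-level value, which is consistent with treating the states as independent.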
[0060]
In step S8, if the calculated degree of separation between the power of the input speech and the standard distribution (cumulative probability value Sp) is larger than the threshold set in step S4 or the standard threshold, the cumulative probability value Sp is converted to the constant "1" and output. With this processing, the degree of deviation is ignored when the difference between the power of the input speech and the standard model is small.
[0061]
In step S9, the degree of separation from the standard distribution of the speech speed of the input speech is calculated. The degree of separation in this case is obtained as a cumulative probability value of the continuation length as in step S5.
[0062]
Specifically, the duration T is first calculated from the total number of feature vectors belonging to the segment of the input phoneme. This duration T is the time taken to utter the phoneme, and its reciprocal represents the speech speed. Next, the cumulative probability value ST of the duration is obtained by the following equation.
$$S_T = \int_a^b P(x;\, \lambda)\, dx$$
Here, P(x; λ) is a Poisson distribution with mean λ for the random variable x, and is estimated in advance from learning data. As for the integration range, when the phoneme duration of the input speech satisfies T < λ, "a" is the minimum value in the learning data and "b" is T. When the phoneme duration of the input speech satisfies T > λ, "a" is T and "b" is the maximum value in the learning data.
[0063]
The cumulative probability value ST indicates that the larger the value is, the closer the duration T is to the standard distribution. Also, it is possible to determine whether the speech speed is faster (continuation length T <λ) or slower (continuation length T> λ) than the standard speech style from the range of integration.
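For the duration score of step S9, a minimal sketch is given below. Because a Poisson distribution is defined on non-negative integers, the sketch evaluates the integral as a sum over integer frame counts with both endpoints included; that reading, the treatment of T equal to λ as the upper branch, and the names `poisson_pmf` and `duration_score` are assumptions of this example.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing k events with mean lam (computed in log space)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def duration_score(duration, lam, train_min, train_max):
    """Cumulative probability ST for a phoneme lasting `duration` frames.

    duration < lam : sum from the learning-data minimum up to duration
    otherwise      : sum from duration up to the learning-data maximum
    """
    if duration < lam:
        a, b = train_min, duration
    else:
        a, b = duration, train_max
    return sum(poisson_pmf(k, lam) for k in range(a, b + 1))

def is_faster_than_standard(duration, lam):
    """A shorter-than-average duration means faster speech (T < lambda)."""
    return duration < lam
```

The side of λ on which the duration falls tells the caller whether to advise the user to slow down or speed up, mirroring the remark about the integration range in paragraph [0063].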
[0064]
In step S10, if the calculated degree of separation between the speech speed and the standard distribution (cumulative probability value ST) is larger than the threshold set in step S4 or the standard threshold, the cumulative probability value ST is converted to the constant "1" and output. With this processing, the degree of deviation is ignored when the difference between the speech speed of the input speech and the standard model is small.
[0065]
In step S11, the degree of separation from the standard distribution of the acoustic characteristics (speaker characteristics) of the input speaker is calculated. The degree of separation in this case is obtained as the cumulative probability value of the logarithmic likelihood of observing the input feature vector sequence when the standard model is given, as in step S5.
[0066]
Specifically, let Ms be the standard model and X the input feature vector sequence. Let L(X | Ms) be the log likelihood of observing the input feature vector sequence X given the standard model Ms, and let T be the number of frames (duration) of the input feature vector sequence X. The cumulative probability value Ss of the normalized log likelihood y (= L(X | Ms) / T), normalized by the duration T, is then expressed by the following equation.
$$S_s = \int_a^b N_s(y;\, \mu_s, \sigma_s)\, dy$$
Here, Ns(y; μs, σs) is a single Gaussian distribution with mean μs and variance σs for the random variable y, and is estimated in advance from learning data. As for the integration range, when the normalized log likelihood of the input feature vectors is smaller than μs, "a" is the minimum normalized log likelihood in the learning data and "b" is the normalized log likelihood of the input feature vectors. When the normalized log likelihood of the input feature vectors is larger than μs, "a" is the normalized log likelihood of the input feature vectors and "b" is the maximum normalized log likelihood in the learning data. The probability density function is expressed as a single Gaussian distribution to reduce the amount of calculation; a Gaussian mixture distribution or the like may be used instead.
[0067]
The larger the cumulative probability value Ss, the closer the acoustic characteristics of the input speaker are to those of the standard speaker. Unlike the power and speech speed of the input speech, which belong to the utterance style described above, the integration range in this case carries no particular meaning.
[0068]
In step S12, if the calculated degree of separation between the acoustic characteristics of the input speaker and the standard distribution (cumulative probability value Ss) is larger than the threshold set in step S4 or the standard threshold, the cumulative probability The value Ss is converted to a constant “1” and output. By this process, when the deviation between the acoustic characteristics of the input speaker and the standard model is small, the degree of deviation can be ignored.
[0069]
In step S13, the cumulative probability values Sp, ST, and Ss set in steps S8, S10, and S12 are compared directly, and the factor with the smallest value, that is, the factor farthest from the standard model, is determined to be the cause of the recognition error. If the cumulative probability values of all factors were converted to 1 in steps S8, S10, and S12, the processing of this step is not performed. In step S14, an analysis message for the segment is created based on the determination result of step S13, associated with the label name of the segment and the cumulative probability values Sp, ST, and Ss of the factors, and saved in the RAM or the like of the storage device 2. The analysis message is created by embedding the determination result of step S13 in a fixed phrase, as shown under <detailed information> in FIG. 5. If there is no determination result, no analysis message is created.
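Steps S8, S10, S12, and S13 can be illustrated together in a short sketch: each cumulative probability above the threshold is replaced by the constant 1, and the smallest remaining value identifies the factor for the segment. The dictionary keys and function names are hypothetical labels introduced for this example.

```python
def convert_deviation(score, threshold):
    """Steps S8/S10/S12: a score above the threshold is within the allowable range."""
    return 1.0 if score > threshold else score

def segment_factor(raw_scores, threshold):
    """Step S13 for one segment.

    raw_scores maps a factor name to its cumulative probability value
    (for example Sp, ST, Ss). Returns (factor, converted_scores), where
    factor is None when every score was converted to 1.0.
    """
    converted = {name: convert_deviation(s, threshold) for name, s in raw_scores.items()}
    if all(value == 1.0 for value in converted.values()):
        return None, converted  # nothing deviates: no determination for this segment
    worst = min(converted, key=converted.get)  # smallest value = largest deviation
    return worst, converted
```

For example, `segment_factor({"power": 0.4, "speed": 0.05, "speaker": 0.2}, 0.3)` selects `"speed"`, since the power score is converted to 1 and 0.05 is the smallest remaining value.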
[0070]
In step S15, it is determined whether all segments have been processed. If so, the process proceeds to step S16; otherwise, the process returns to step S7 to process the next segment.
[0071]
In step S16, based on the cumulative probability values Sp, ST, and Ss of all segments stored in the RAM or the like of the storage device 2 in step S14, the score Si_total (joint probability) of the entire utterance for each factor i is obtained by the following equation.
$$S_{i\_total} = S_i(1) \times \cdots \times S_i(j) \times \cdots \times S_i(SN)$$
where S_i(j) is the cumulative probability value of factor i for the j-th segment and SN is the total number of input segments.
The factor that gives the minimum value of the utterance-level score Si_total obtained in this way is then stored in a buffer.
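The utterance-level score of step S16 is a plain product over segments, which the following sketch shows. The per-segment dictionaries are assumed to hold the already converted cumulative probability values, and the function names are hypothetical.

```python
import math

def utterance_scores(segment_scores):
    """Compute Si_total = Si(1) x ... x Si(SN) for every factor i.

    segment_scores is a list with one dict per segment, each mapping a
    factor name to its (converted) cumulative probability value.
    """
    factors = list(segment_scores[0].keys())
    return {f: math.prod(seg[f] for seg in segment_scores) for f in factors}

def worst_factor(total_scores):
    """Return the factor with the minimum utterance-level score."""
    return min(total_scores, key=total_scores.get)
```

For long utterances the product of many small probabilities can become very small; summing logarithms would be one common alternative, although the text itself describes a direct product.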
[0072]
In step S17, it is determined from the utterance-level scores calculated in step S16 whether all factors have the same score. If all factors have the same score, the process proceeds to step S21; otherwise, the process proceeds to step S18. In step S18, it is determined whether the factor determined in step S16 is the same as the misrecognition factor determined for the previous input. If it is the same, the process proceeds to step S19; if it is different, the process proceeds to step S20. For the first input, all buffers are initialized, so the determination in this step is false (NO). In step S19, the misrecognition factor for the entire utterance is changed to the factor with the second smallest score, that is, the second largest deviation. This prevents the same factor from being presented to the user repeatedly.
[0073]
In step S20, the cause of misrecognition, that is, the factor with the minimum score, is presented to the user as a message such as the one shown in the upper half of FIG. 5. The analysis messages created in step S14 are also presented as necessary, as shown under <detailed information> in FIG. 5. When this step is reached by branching from step S6, however, noise is presented as the cause of the error. The number of inputs is then reset to 0, and the factor analysis and message creation processing for the current input speech ends.
[0074]
Step S21 is reached when the cumulative probability values of all factors have been converted to the constant "1" in steps S8, S10, and S12, so that the scores of all factors are identical. In this case no factor deviates notably from the standard model. However, this situation often arises when sudden noise has occurred. In this step, therefore, the cause of misrecognition is estimated to be sudden noise. In step S22, a message asking the user to check whether there was sudden noise and prompting the user to input the speech again is presented. The number of inputs is then incremented, and the process returns to step S1 to wait for re-input of the same speech.
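The branching of steps S17 through S22 might be condensed into a single decision function, sketched below. It uses exact equality to test whether all scores are the same, and the action labels `"retry"` and `"report"` as well as the function name are hypothetical; the caller is assumed to increment the input counter on a retry and reset it after a report.

```python
def decide_output(total_scores, previous_factor=None):
    """Choose what to present to the user from the utterance-level scores.

    total_scores maps each factor name to its score Si_total.
    previous_factor is the factor reported for the previous input, if any.
    Returns ("retry", None) or ("report", factor).
    """
    values = list(total_scores.values())
    # S17 -> S21/S22: identical scores suggest sudden noise; ask for re-input.
    if all(v == values[0] for v in values):
        return "retry", None
    # Rank factors from smallest score (largest deviation) upward.
    ranked = sorted(total_scores, key=total_scores.get)
    factor = ranked[0]
    # S18 -> S19: avoid repeating the previous advice; use the second factor.
    if factor == previous_factor:
        factor = ranked[1]
    # S20: report the selected factor.
    return "report", factor
```

A tolerance-based comparison could replace the exact equality test if scores that are only approximately equal should also trigger a retry; the text leaves that detail open.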
[0075]
By counting the number of inputs and narrowing the threshold (that is, the allowable range) from the standard threshold according to that number, the scores of all factors are kept from becoming identical, which prevents the cause of the error from remaining undetermined. In other words, some result is always obtained regarding the cause of misrecognition, which reduces the user's discomfort.
[0076]
The speech processing apparatus configured and operating as described above is used, for example, by being incorporated into a speech recognition system as follows. The feature vector sequence of the input speech and its label are input to the feature extraction unit 13 from the feature extraction unit on the main body side of the speech recognition system. The segment division unit 14 and the factor analysis unit 15 then analyze the factor causing misrecognition as described above, and the message creation unit 16 creates a message presenting that factor. The message presentation unit 17 returns this message to the system main body. In this way, when the system main body fails to recognize the input speech, the message about the misrecognized speech returned from the speech processing apparatus is displayed on the output device of the system main body. Furthermore, when the factor causing misrecognition is speech style or ambient noise, useless adaptation can be avoided.
[0077]
By doing so, the user can know more specifically the cause of misrecognition and lowering of reliability, and can respond immediately if the cause is related to the speech style. Furthermore, the discomfort caused by not knowing the cause of misrecognition and reliability reduction can be eliminated.
[0078]
The functions of the speech recognition system incorporating the speech processing apparatus described above can also be achieved by incorporating the speech processing program into the speech recognition program of a speech recognition apparatus. The speech processing apparatus can of course also be used independently of the speech recognition apparatus, to inform the user of the speech recognition apparatus in advance of the causes of misrecognition that may arise during speech recognition. In that case, a user who learns beforehand that his or her utterance style differs from the standard can perform subsequent speech recognition efficiently.
[0079]
As described above, in the embodiment, the segment division unit 14 divides the feature vector sequence of the input speech into segments for each phoneme by comparison with the standard model. The factor analysis unit 15 obtains feature quantities relating to a plurality of factors from the feature vector sequence of each segment, calculates the degree of deviation between each feature quantity and the standard model for each factor, and determines whether the calculated degree of deviation is within the allowable range using a threshold that is narrowed according to the number of inputs. If it is within the allowable range, the degree of deviation is converted to "1". The factor with the largest deviation is then detected from the results for the individual factors, and the message presentation unit 17 presents that factor based on the detection result.
[0080]
Thus, by extracting from the feature vectors of the speech waveform factors of misrecognition that are, for example, easy for a human to understand intuitively, and detecting the factor with the largest deviation, it is possible to estimate what may have caused the misrecognition. The user can therefore be informed of the cause of misrecognition, and the user's discomfort can be reduced.
[0081]
In doing so, the following four items are used as the main factors of misrecognition.
(A) Deviation of speech power from the standard model
(B) Deviation of speech speed from the standard model
(C) Acoustic characteristics of the speaker
(D) Ambient noise
Among these, factors (A) and (B) constitute the utterance style described above. According to the present embodiment, therefore, the cause of misrecognition can be notified to the user with a distinction drawn between the acoustic characteristics of the speaker, the utterance style, and the ambient noise. For this reason, when the factor causing misrecognition is factor (A), (B), or (D), the user can take appropriate action.
[0082]
Further, among the above factors, the detection method for factors (A) to (C) differs slightly from that for factor (D). It is very difficult to detect noise buried in the user's utterance interval. Therefore, as shown in FIG. 6, ambient noise is detected in the silent interval before the user's utterance. Ambient noise is considered to be substantially stationary, so this detection method is considered to pose no problem.
[0083]
However, if a sudden noise such as a horn or a station announcement occurs in the user's utterance section, it becomes a cause of misrecognition. Such a sudden noise acts on the deviations of all of the factors (A) to (C) in the same manner, so that it is difficult to identify the sudden noise as a factor. Therefore, in this embodiment, when the deviations of the factors (A) to (C) detected in the user's utterance section are substantially the same, the misrecognition factor is estimated to be sudden noise. In that case, however, the message presentation unit 17 outputs a message prompting re-input without presenting the cause of misrecognition. When the speech is re-input, the threshold value is set smaller. This makes the error analysis robust against sudden noise, makes error analysis results easier to obtain by emphasizing the degree of deviation, and spares the user from having to repeat the utterance many times.
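The sudden-noise handling described above might be sketched as follows. The tolerance used to decide that deviations are "substantially the same", the multiplicative shrinking of the threshold, and the message wording are all assumptions made for illustration; the text does not give concrete values.

```python
def is_sudden_noise(dev_a, dev_b, dev_c, tolerance=0.05):
    """True when the deviations of factors (A) to (C) are roughly equal.

    tolerance is a hypothetical parameter: the text only says
    "substantially the same".
    """
    values = (dev_a, dev_b, dev_c)
    return max(values) - min(values) <= tolerance


def next_threshold(threshold, shrink=0.9, floor=1.0):
    """Return a smaller threshold for the re-input (assumed multiplicative)."""
    return max(floor, threshold * shrink)


def analyze(dev_a, dev_b, dev_c, threshold):
    """Return (message, threshold to use for the next input)."""
    if is_sudden_noise(dev_a, dev_b, dev_c):
        # No cause is presented; ask for re-input with a tighter threshold.
        return "Please speak again.", next_threshold(threshold)
    return None, threshold
```

A smaller threshold narrows the allowable range, so on the re-input fewer deviations are converted to the in-range value and the differences between factors stand out more clearly.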
[0084]
Incidentally, the functions of the central processing unit 1 in the above embodiment as the factor-specific deviation calculating means, the deviation degree converting means, the factor detecting means, and the misrecognition factor outputting means are realized by a speech processing program recorded on a program recording medium. The program recording medium in the above embodiment is a program medium composed of the ROM. Alternatively, it may be a program medium that is loaded into the external auxiliary storage device and read out. In either case, the program reading means for reading the speech processing program from the program medium may be configured to access and read the program medium directly, or to download the program to a program storage area (not shown) and then access and read that program storage area. It is assumed that a download program for downloading from the program medium to the program storage area of the RAM is stored in the main unit in advance.
[0085]
Here, the program medium is a medium that is configured to be separable from the main body and that carries the program in a fixed manner. Examples include tape systems such as magnetic tapes and cassette tapes; disk systems such as magnetic disks (floppy disks, hard disks) and optical discs (CD (compact disc)-ROM, MO (magneto-optical) discs, MD (mini discs), DVDs (digital versatile discs)); card systems such as IC (integrated circuit) cards and optical cards; and semiconductor memory systems such as mask ROM, EPROM (ultraviolet erasable ROM), EEPROM (electrically erasable ROM), and flash ROM.
[0086]
In addition, when the speech processing apparatus in the above embodiment is configured to be connectable to a communication network such as the Internet via a communication I/F, the program medium may be a medium that carries the program in a fluid manner, for example by downloading it from the communication network. In this case, it is assumed that a download program for downloading from the communication network is stored in the main device in advance or is installed from another recording medium.
[0087]
It should be noted that what is recorded on the recording medium is not limited to a program, and data can also be recorded.
[0088]
[Effect of the invention]
As is clear from the above, the present invention obtains feature amounts related to a plurality of factors of misrecognition based on the feature amount of the input speech, calculates the degree of deviation of each feature amount from the standard model for each factor, detects the factor with the largest degree of deviation, and outputs it as the factor causing misrecognition. The user can therefore be informed of the cause of misrecognition in terms of factors that are, for example, easy for humans to understand intuitively. Accordingly, when misrecognition occurs during speech recognition, the user can clearly know why it occurred, and the discomfort of not knowing the cause of misrecognition can be avoided.
[0089]
Furthermore, if feature amounts of power, speech speed, speaker characteristics, and ambient noise are obtained as the feature amounts related to the above-mentioned misrecognition factors, the cause of misrecognition can be notified to the user while distinguishing among the speaker's acoustic characteristics, the utterance style, and the ambient noise. Therefore, when the factor causing misrecognition is power, speech speed, or ambient noise, the user can take appropriate action for speech recognition.
[0090]
In addition, since the apparatus is configured independently of a speech recognition device, it can be combined with a speech recognition device as the situation requires to form a speech recognition system, which can increase speech recognition efficiency and the recognition rate.
[Brief description of the drawings]
FIG. 1 is a diagram showing a hardware configuration of a sound processing apparatus according to the present invention.
FIG. 2 is a block diagram showing a functional configuration of the speech processing apparatus shown in FIG.
FIG. 3 is a flowchart of a factor analysis / message creation processing operation executed by a factor analysis unit and a message creation unit in FIG. 2;
FIG. 4 is a flowchart of the factor analysis / message creation processing operation continued from FIG. 3;
FIG. 5 is a diagram illustrating an example of a message presented by a message presentation unit in FIG.
FIG. 6 is a diagram illustrating an example of an input voice to a segment division unit in FIG. 2.
[Explanation of symbols]
1 ... Central processing unit,
2 ... Storage device,
3 ... External storage device,
4 ... Input device,
5 ... Output device,
11 ... Input unit,
12 ... A/D converter,
13 ... Feature extraction unit,
14 ... Segment division unit,
15 ... Factor analysis unit,
16 ... Message creation unit,
17 ... Message presentation unit,
18 ... Standard model storage unit.

Claims

A speech processing apparatus that compares a feature amount of input speech with a standard model, comprising:
factor-specific deviation calculation means for obtaining feature amounts related to a plurality of factors of misrecognition based on the feature amount of the input speech, and calculating, for each factor, a degree of deviation of the feature amount from the standard model;
deviation degree conversion means for determining whether or not the calculated degree of deviation is within a threshold value representing an allowable range and, if the calculated degree of deviation is within the threshold value, converting the degree of deviation to a predetermined value indicating that the deviation is within the allowable range;
factor detection means for detecting the factor having the largest degree of deviation based on the calculated degree of deviation and the converted degree of deviation; and
misrecognition factor output means for outputting the detected factor having the largest degree of deviation as a factor causing misrecognition.

The speech processing apparatus according to claim 1, wherein
the misrecognition factor output means outputs, when a plurality of factors are detected as having the largest degree of deviation, a message prompting the user to input the speech again without outputting a misrecognition factor.

The speech processing apparatus according to claim 2, further comprising
threshold value changing means for changing the threshold value representing the allowable range so that the allowable range is narrowed when speech is input again in response to the output of the message by the misrecognition factor output means.

The speech processing apparatus according to claim 1, further comprising
factor determination means for determining whether or not the detected factor having the largest degree of deviation is the same factor as at the previous speech input, wherein
the misrecognition factor output means outputs the factor with the second largest degree of deviation as the factor causing misrecognition when the detected factor having the largest degree of deviation is the same factor as at the previous speech input.

The speech processing apparatus according to claim 1, wherein
the standard model is represented by a probability function, and
the factor-specific deviation calculation means obtains feature amounts of power, speech speed, speaker characteristics, and ambient noise as the feature amounts related to the misrecognition factors, and calculates, for each factor, the degree of deviation from the standard model using a probability value that is based on the feature amount of that factor in the probability function representing the standard model.

A speech processing method for comparing a feature amount of input speech with a standard model, comprising:
obtaining feature amounts related to a plurality of factors of misrecognition based on the feature amount of the input speech, and calculating, for each factor, a degree of deviation of the feature amount from the standard model;
determining whether or not the calculated degree of deviation is within a threshold value representing an allowable range and, if the calculated degree of deviation is within the threshold value, converting the degree of deviation to a predetermined value indicating that the deviation is within the allowable range;
detecting the factor having the largest degree of deviation based on the calculated degree of deviation and the converted degree of deviation; and
outputting the detected factor having the largest degree of deviation as a factor causing misrecognition.

A speech processing program that causes a computer to function as the factor-specific deviation calculation means, deviation degree conversion means, factor detection means, and misrecognition factor output means according to claim 1.

A computer-readable program recording medium on which the speech processing program according to claim 7 is recorded.