JP4259100B2

JP4259100B2 - Unknown speech detection device for speech recognition and speech recognition device

Info

Publication number: JP4259100B2
Application number: JP2002342011A
Authority: JP
Inventors: 純幸沖本; 充遠藤; 裕康桑野; 由実脇田
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2002-11-26
Filing date: 2002-11-26
Publication date: 2009-04-30
Anticipated expiration: 2022-11-26
Also published as: JP2004177551A

Description

【０００１】
【発明の属する技術分野】
この発明は、音声認識装置における音声認識方法に関するものである。
【０００２】
【従来の技術】
従来、音声認識装置においては、受理可能な音声認識語彙を規定して、入力音声と最も類似した認識語彙を探索することによって、これを認識結果として出力する。したがって、かりに利用者が音声認識語彙外の発話を行なった場合でも、音声認識語彙から最も類似した語彙を選択するため、認識結果は誤ったものとなる。このため利用者の発話が音声認識語彙に含まれる単語であるのか、それ以外の単語あるいは言い淀み等であるのかを判定し、これら未知発話を棄却する機能が必要となる。
【０００３】
このような未知発話を棄却する方法は、サブワードと呼ばれる単語より短かい単位のHMM(音声パタンを表現するモデルの1つ)を連結して、認識語彙の各単語のモデルを構成し、入力音声に対して最大のゆう度を与える単語の探索を行ない、このゆう度を認識ゆう度とする（例えば非特許文献1参照）。また、任意カナ系列に対応する任意のサブワードHMMの連接によるモデルの中から、入力音声に対する最大のゆう度を求めて、これを参照ゆう度とする。このようにして得られた認識ゆう度と参照ゆう度の比較を行なうことで、未知発話を検出し棄却する。
【０００４】
しかしこのような方法においては、参照ゆう度の算出において、任意のサブワードHMMの連接における制約がなく、非日本語的な系列に対するゆう度が最大ゆう度として選択される場合も多く、結果としてこのような参照ゆう度と認識ゆう度の比較では充分な未知発声の棄却効果が得られなかった。また、あらゆるサブワードHMMの連接を比較するため、処理計算量の面でも大きなリソースを必要とした。このような問題に対して、たとえば、音声認識法では、サブワードHMM間の連接の親和性を遷移確率として導入することによって、未知発声の棄却精度と処理量の両面の向上を図っている（例えば特許文献1参照）。
【０００５】
【特許文献１】
特開平10-171489号公報
【非特許文献１】
渡辺他, "音節認識を用いたゆう度補正による未知発話のリジェクション", 電子情報通信学会論文誌, Vol. J75-D-II, No.12 (1992)
【０００６】
【発明が解決しようとする課題】
しかしながら、以上に述べたような従来法では、次に述べるような問題がある。
【０００７】
すなわち、サブワードHMMの連接によるモデルは、入力音声をかな系列として認識するモデルと見なすことができるが、仮りにサブワードHMM間の連接の制約として遷移確率を導入したとしても、このモデルが生成するかな系列は、依然、入力音声のそれとは充分一致しているとは言い難い。すなわち、このようなモデルによって得られる参照ゆう度は充分な精度とは言えず、未知発話の棄却効果も充分ではない。
【０００８】
【課題を解決するための手段】
上記目的を達成するため、上記第1の発明の音声認識用未知発話検出装置は、入力された音声を分析して特徴パラメータの系列に変換する音声分析手段と、認識対象語彙を規定する認識辞書格納手段と、
音声の標準的パタンをモデル化した音声モデル格納手段と、認識辞書に規定された語彙のモデルを、上記音声モデル格納手段によって格納されたモデルを用いて構築し、入力音声との照合を行なう単語レベルマッチング手段と、サブワード間の遷移確率を規定するサブワード遷移確率格納手段と、上記音声モデル格納手段によって格納された音声モデルを、上記サブワード遷移確率格納手段によって格納されたサブワード遷移確率を勘案して連結し、入力音声との照合を行なうサブワードレベルマッチング手段と、
上記単語レベルマッチング部および上記サブワードレベルマッチング部から、複数個の未知発話尺度を計算する未知発話尺度計算部手段と、上記未知発話尺度計算部で計算された複数の尺度を元に、未知発話の判定を行なう未知発話判定手段とを備えたことを特徴とする。
【０００９】
上記構成によれば、未知発話の判定において、複数の観点から入力音声が未知発話である可能性を判断することが可能となり、高い未知発話の検出性能を示すことが可能となる。
【００１０】
また、上記第1の発明の未知発話検出装置は、上記未知発話尺度計算手段において、上記単語レベルマッチング手段により得られた単語のゆう度と、上記サブワードレベルマッチング手段により得られたサブワード連鎖ゆう度の差に基づいて計算された値を含むことが望ましい。
【００１１】
上記構成によれば、上記サブワード連鎖ゆう度による上記単語ゆう度の補正効果が得られ、高い未知発話検出性能が得られる。
【００１２】
また、上記第1の発明の未知発話検出装置は、上記未知発話尺度計算手段において、上記単語レベルマッチング手段により得られた1位候補の単語モデルの音響的特徴と、上記サブワードレベルマッチング手段により得られたサブワード連鎖モデルの音響的特徴の、両者の類似性に基づいて計算された値を含むことが望ましい。
【００１３】
上記構成によれば、2つのモデルの音響的特徴の類似性に着目した未知発話の判定が可能となり、高い未知発話検出性能が得られる。
【００１４】
また、上記第1の発明の未知発話検出装置は、上記未知発話尺度計算手段において、上記単語レベルマッチング手段により得られた1位候補の単語のゆう度と、下位候補の単語のゆう度の差に基づいて計算された値を含むことが望ましい。
【００１５】
上記構成によれば、未知発話の認識時には単語レベルマッチング部では、誤った候補が類似した、ゆう度で得られるという特徴をモデル化することが可能となり、高い未知発話検出性能が得られる。
【００１６】
また、上記第1の発明の未知発話検出装置は、上記未知発話尺度計算手段において、上記単語レベルマッチング手段により得られた1位候補の単語の音響的特徴と、下位候補の単語の音響的特徴の、両者の類似性に基づいて計算された値を含むことが望ましい。
【００１７】
上記構成によれば、候補単語のモデル間の音響的類似性に着目した未知発話の判定が可能となり、高い未知発話検出性能が得られる。
【００１８】
また、上記第2の発明の音声認識装置は、入力された音声を、認識辞書に登録されている語彙に対応するモデルによって照合を行なって認識する音声認識装置であって、上記未知発話検出装置を塔載し、上記未知発話検出装置の出力結果を勘案して認識結果の出力を行なうことを特徴とする。
【００１９】
上記構成によれば、音声認識装置は、どのような入力音声に対しても常に認識辞書内の語彙のいずれか1つを出力するのではなく、発話内容が認識辞書に含まれないものであれば、これを利用者に伝えることが可能となり、音声認識装置を塔載した様々な音声認識インタフェースにおいて、利用者にとってより判り易いインタフェースを提供することを可能とする。
【００２０】
【発明の実施の形態】
以下、本発明の実施の形態について、図を参照して説明する。
【００２１】
（実施の形態１）
図１は、本実施の形態における未知発話検出装置のブロック図を示したものである。図１において、１は入力音声をA/D変換し特徴パラメータの時系列に変換する音響分析部である。２は入力音声の特徴パラメータとのマッチングに用いられる、標準的な音声の音声片を格納した音声片パタン格納部である。
【００２２】
ここで音声片とは、音声の母音区間の後半部分とこれに後続する子音区間の前半部分を連接したＶＣパタン、および子音区間の後半部分とこれに後続する母音区間の前半部分を連接したＣＶパタンの集合を意味している。ただし音声片は、この他に日本語をローマ字標記した場合のアルファベット1文字1文字にほぼ相当する音素の集合、日本語をひらかな標記した時のひらかな1文字1文字にほぼ相当するモーラの集合、複数のモーラの連鎖を意味するサブワードの集合、さらにこれらの集合の混合集合であってもよい。
【００２３】
図１における３は、上記音声片を連結して音声認識語彙の単語パタンを合成するための規則が格納された、単語辞書格納部である。４は特徴パラメータの時系列で表現された入力音声と、上記合成された単語パタンを比較し、その類似性に対応する、ゆう度を各単語ごとに求める単語マッチング部である。
【００２４】
５は音声片どうしを任意に結合する場合における、結合の自然さを連続値で表現する遷移確率が格納された遷移確率格納部である。本実施の形態では、遷移確率として音素の2gram確率を用いる。音素の2gram確率とは、先行する音素 x の後に、音素 y が接続する確率 P(y|x) を意味するもので、多数の日本語テキストデータなどを用いて事前に求めておく。ただし遷移確率は、これ以外にモーラの2gram確率、サブワードの2gram確率、あるいはこれらの混合の2gram確率であってもよく、また2gram確率以外にも、3gram確率などであってもよい。
【００２５】
図1における６は、上記音声片パタンを任意に結合してできるパタンと、特徴パラメータの時系列として表現された入力音声とのゆう度を、上記遷移確率を考慮して計算し、得られた最大ゆう度を参照ゆう度とする音声系列タマッチング部である。
【００２６】
７は上記単語マッチング部で計算された各単語ごとのゆう度のうち、最も高い値を得た単語(1位候補)と次に高い値を得た単語(2位候補)のゆう度の差を単語の長さで正規化して計算する候補間スコア差計算部である。
【００２７】
８は1位候補と2位候補の音響的な類似性を求めるため、1位候補の音素系列と2位候補の音素系列の系列間の距離を計算する、候補音素系列間類似度計算部である。
【００２８】
９は1位候補のゆう度と、上記音声系列マッチング部で計算された参照尤度との差を単語の長さで正規化して計算する、候補・音声系列スコア差計算部である。
【００２９】
１０は、1位候補と、上記音声系列マッチング部によって最適系列とされた系列の音響的な類似性を、各音素系列間の距離として計算する候補・音声系列・音素系列間類似度計算部である。
【００３０】
１１は、上記、候補間スコア差計算部、候補・音素系列間類似度計算部、候補・音声系列スコア差計算部、候補・音声系列・音素系列間類似度計算部で求められた各値を総合して、入力音声が未知発話であるか否かを判定する未知発話判定部である。
【００３１】
なお、本実施の形態においては、未知発話判定部で用いる尺度として、上記4つの尺度を挙げたが、これ以外にも、各単語候補のゆう度そのものやその分布、また単語区間内での局所スコアの変動量、単語を構成する音素の持続時間情報などの尺度も併用することも可能である。また、複数の尺度を元に未知発話を判定する方法として、本実施の形態では事前に多数の認識結果の事例を用いて求めた線型判別式を利用する。しかしこれ以外にも、ニューラルネットワーク、決定木、SVM(サポート・ベクトル・マシン)などいわゆる学習機械の利用も有効である。
【００３２】
次に、本実施の形態における未知発話検出の処理動作を説明する。入力された音声は、まず音声分析部において、A/D変換された後に分析され、10m秒ごとに LPCベクトルに変換される。LPCベクトルは、音声の短時間スペクトルのスペクトル包絡を意味するパラメータであり、音声の音韻的特徴をよく表わすパラメータとして利用されるものである。通常の音声認識法においては、入力音声から一定時間ごとに得られた LPCベクトルの時系列を入力音声の特徴ベクトルとして、あらかじめ求めておいた単語モデルとマッチングさせて、単語ごとのゆう度と呼ばれるスコアを求める。
【００３３】
本実施の形態においては、単語モデルを音声片パタンと単語辞書を用いて作成する。すなわち、単語辞書格納部に格納された単語パタンを合成するための音声片の連接規則に基づいて、音声片パタン格納部に格納された音声片パタンを連接して単語パタンを構築する。図２には、本実施の形態で用いるＣＶ・ＶＣパタンと呼ばれる音声片パタンを連接して、単語パタン「はちのへ」を合成するイメージを図示する。
【００３４】
なお、音声片パタンには、各音声片のLPCベクトルの標準的な分布(正規分布を仮定)を示すパラメータが時系列で格納されている。また、近年はHMM(隠れマルコフモデル)と呼ばれる遷移ネットワークが、音声認識のためのモデルとしてしばしば用いられている。HMMモデルを用いる場合においても、音声片パタン格納部２には音声片パタンを表現するHMMモデルを格納し、単語辞書格納部３においてHMMモデルどうしの遷移に関する規則を定義することによって、単語のHMMモデルを構築することが可能である。
【００３５】
入力音声の特徴パラメータ時系列は、単語マッチング部４において単語パタンと比較され、単語辞書格納部３に定義された全単語、あるいは一定のゆう度のビームの中に残った上位候補単語に対するゆう度が計算され、ゆう度の高いものから順にソートされる。図３において、ゆう度順でソートされた単語の出力例を示す。
【００３６】
またこれと並行して、音声系列マッチング部６において、音声片の任意系列のマッチングも行なわれる。これは、音声片を一定の制約の下で自由に連接して、最も入力音声に近い音声片系列とそのゆう度を計算する。この時音声片どうしの連接において何らの制約も加えないと、計算結果はおよそ非日本語的な系列となり、そのゆう度も充分意味のある値とは言えなくなる。そこで最適音声片系列の探索過程において、音声片の選択と接続のコストとして遷移確率格納部３に格納された音素2gram確率を用いる。音素2gram確率については、認識タスクと同タスクの大量の日本語テキストを音素系列に変換し、これを元に計算しておいたものを用いる。
【００３７】
図４において、音素2gram確率の一例として、先行音素 /k/ の後に5つの母音 /a/,/i/,/u/,/e/,/o/ がそれぞれ後続する確率を例示する。この例の場合では、子音/k/の後に後続しやすい母音は /a/、次に /i/であることが示されている。音声系列マッチング部６では、連接された音声片パタンによるゆう度と、上記音素2gram確率の対数和によって得られる遷移ゆう度を重み付けで加算した値を求め、これが最も高い値となる系列採用する。
【００３８】
図５に音素2gram確率から系列 /kobajasi/に対する遷移ゆう度を求める例を示す。また図６において、遷移ゆう度を導入することによる効果を示す一例として、「コバヤシ」という入力音声に対する、音声系列マッチンング部の出力する音声系列とゆう度を示す。この図にあるように、遷移ゆう度を用いない場合は /pobaeasii/という「コバヤシ」とは大きくかけ離れた系列の方が、より類似する /obajasi/より高いパタンゆう度を得ているが、遷移ゆう度を考慮した合計ゆう度を用いることにより、「コバヤシ」により近い /obajasi/の方が選択される。
【００３９】
以上により、単語辞書格納部に定義された単語ごとの認識ゆう度と、参照ゆう度およびその時の音声系列が得られるが、次にこれを元に未知発話判定のための種々の尺度の計算を行なう。
【００４０】
まず候補間スコア差計算部７では、単語マッチング部で得られたゆう度のうち、最も高いゆう度(1位候補のゆう度)とその次に高いゆう度(2位候補のゆう度)のゆう度差を単語の時間長で割った値を計算する。例えば図３に示した結果の場合、「コバヤシ」と「ハヤシ」のゆう度の差を単語長で正規化して、6.9を得る。
【００４１】
候補音素系列間類似度計算部８では、1位候補の音素系列と2位候補の音素系列についてその類似度を計算する。ここで系列間の類似度は、編集距離を2つの系列の系列長の和で正規化した値を用いる。編集距離とは、一方の系列を編集して他方の系列に変換する際に、1要素置き換え(置換)、1要素削除(脱落)、1要素追加(挿入)に要するコストをそれぞれ1として、最小のコストで編集した場合のコストの総和を意味する。図７では、2つの音素系列 /uenoeki/と /jenokii/ に対する編集距離の求め方を例示する。このような方法に従って候補音素系列間類似度計算部は、例えば図３のような結果の場合、「コバヤシ(/kobajasi/)」と「ハヤシ(/hajasi/)」の編集距離3を各音素系列長の和14で割った 0.21という値を出力する。なお、系列間類似度として本実施の形態では上記編集距離に基づく値を用いるものとしたが、これ以外にも音素間の音響的類似性を考慮した系列間距離などを利用することも有効である。
【００４２】
候補・音声系列スコア差計算部では、単語マッチング部で得た1位候補のゆう度と、音声系列マッチング部で得た参照ゆう度の差を、単語の時間長で正規化した値を計算する。例えば図３および図６に示した例では、1位候補の認識ゆう度 2055と参照ゆう度 2014の差を、単語時間長で正規化した 0.87を得る。
【００４３】
また候補・音声系列・音素系列類似度計算部では、単語マッチング部で得られた1位候補の音素系列と、音声系列マッチング部で得られた最適な音素系列の、系列間の正規化した編集距離を計算する。例えば、図３および図６に示した例では、音素系列 /kobajasi/と/obajasi/の編集距離を系列長の和で正規化して 0.07を得る。
【００４４】
以上のようにして得られた 4つの尺度に対して、未登録語判定部ではこれらの尺度を適切に重み付けした和を求め、その大小を閾値で判定して未登録語発声か否かの判定を行なう。すなわち、図８に示した式に従って判定を行なう。図８において CM1 〜 CM4 は、それぞれ候補間スコア差、候補音素系列間類似度、候補・音声系列スコア差、候補・音声系列・音素系列間類似度を意味しており、また、w1〜w2は各尺度に対する重み付け、θは閾値を意味している。
【００４５】
また、ここで用いる各尺度に対する重み付けは、統計的手法によって事前に求めておく。すなわち、登録語発声および未登録語発声の多数の事例に対して、上記4つの尺度をそれぞれ求め、4つの尺度と登録語発声、未登録語発声の関係を線型判別法によって分析し、各尺度に対する重みを求めている。
【００４６】
(効果)
次に、本実施の形態に基づく未知発話の検出法の効果を、従来手法と比較して実験的に示す。
【００４７】
一般に、このような検出問題には2種類のエラーが存在する。すなわち、検出漏れエラーと、検出されてはならないものが検出される湧き出しエラーである。この両者のエラーはトレードオフの関係にあり、一方のエラーを減らそうとすれば、他方が増えることが知られている。そのためこのような問題に対しては、図９に示したような２つの尺度による比較を行なう。ここにおいて未知発話再現率が高いということは検出漏れエラーが少ないことを意味し、未知発話適合率が高いということは湧き出しエラーが少ないことを意味する。この両者は共に高いことが望ましい。
【００４８】
以下に示す実験では、100語の未知人名がある場合において判定閾値を増減させた場合に、各手法による未知発話再現率と未知発話適合率の変化を調べる。比較する手法は次の3つである。
【００４９】
(1) 認識ゆう度と参照ゆう度の差のみで未知発話判定する
(音声片の連接について制約なし)
(2) 認識ゆう度と参照ゆう度の差のみで未知発話判定する
(音声片の連接について遷移確率を導入する)
(3) 上記実施の形態に述べた 4つの尺度を併用して未知発話判定する
図１０にこの結果を示す。図１０では、横軸に未知発話適合率を、縦軸に未知発話再現率を取っている。この両者は高いほど良いので、図中の曲線は右上に行くほど良い検出性能であると言うことができる。この結果から、従来技術のように認識ゆう度と参照ゆう度のみを用いて未知発話の判定を行なうより、4つの尺度を併用して未知発話の判定を行なう法が高い検出性能となることが示される。
【００５０】
（実施の形態２）
本実施の形態は、上記第1の実施の形態における未知発話検出部を塔載した音声認識装置に関するものである。本実施の形態では、従来の音声認識結果と共に未知発話の検出結果を同時に用いることで最適な応答結果を返しことにより、利用者にとってより使い易い音声認識インタフェース機能を提供する機能を有するものである。
【００５１】
図１１は、上記図１に示す未知発話検出装置を塔載した音声認識装置のブロック図である。未知発話検出装置を構成する、音声分析部２０、音声片パタン格納部２１、単語辞書格納部２２、単語マッチング部２３、遷移確率格納部２４、音声系列マッチング部２５、候補スコア差計算部２６、候補・音素系列間類似度計算部２７、候補・音声系列スコア差計算部２８、候補・音声系列・音素系列類似度計算部２９、および未知発話判定部３０は、上記第１の実施の形態における音声分析部１、音声片パタン格納部２、単語辞書格納部３、単語マッチング部４、遷移確率格納部５、音声系列マッチング部６、候補スコア差計算部７、候補・音素系列間類似度計算部８、候補・音声系列スコア差計算部９、候補・音声系列・音素系列類似度計算部１０、および未知発話判定部１１と同じ構成をしている。
【００５２】
ただし未知発話検出部３０は、単に未知発話判定結果を正否で出力するのではなく、未知発話らしさを示す連続値を出力する。さらに本実施の形態では、上記単語マッチング部２３と上記未知発話検出部３０の双方の結果を勘案して認識結果を出力する認識結果出力部３１が含まれる。なお、音声分析部２０、音声片パタン格納部２１、単語辞書格納部２２、単語マッチング部２３による認識結果出力部３１による構成は、通常の音声認識装置と同様の構成をなす。
【００５３】
本実施の形態においては、入力音声は上記第1の実施の形態と同様のステップによって、未知発話検出部３０から、発話内容についての既知語らしさあるいは未知発話らしさに関する結果を得る。これと同時に、単語マッチング部２３から得られる単語ごとにゆう度が付与され、さらにゆう度の大きさでソートされた結果から、認識結果候補が得られる。
【００５４】
認識結果出力部３１では、上記認識結果候補と未知発話検出部で得られた未知発話らしさに関する結果とを勘案して、最適な応答を出力する。
【００５５】
すなわち認識結果出力部では、未知発話らしさが高い場合には、認識結果候補として得られた結果を全て棄却し、棄却されたことを意味する結果を出力する。また未知発話らしさが中程度に高い場合には、認識結果候補のうち上位から1つ以上の候補を出力するとともに、その結果が充分信頼できないものであることを意味する信号も付与する。さらに未知発話らしさが充分低い場合には、認識結果候補のうちから上位1個以上の候補を出力する。
【００５６】
以上のような構成により、例えばテレビ受像機において本実施の形態の音声認識装置を塔載して、番組選択を音声入力インタフェースによって行なうようにした場合次のような効果が得られる。すなわち従来であれば、放映されていない番組名や受信不可能な放送局名など、音声認識のための認識辞書に登録されていない単語を利用者が発声した場合、従来であれば単に認識誤りを起こし、利用者に何と発声すればよいか判らないといった不信感を与えていた。
【００５７】
しかし、本実施の形態の音声認識装置により、このような未知の単語を利用者が発声した場合には、そのような番組名あるいは放送局名が存在しないことを利用者に知らせることが可能となる。また、認識結果が曖昧である場合も、従来であれば曖昧なまま処理を続行し、利用者の望まない番組に映像を切り替えるといったことが起こり得たが、本実施の形態により認識結果が曖昧である旨利用者に通達し、確認手段を提示してから、番組を切り替えるといった処理が可能となり、音声認識に起こりがちな認識誤りによる問題を効率的に回避することが可能となる。
【００５８】
同様の効果は、テレビ受像機における音声認識装置のみならず、例えばカーナビゲーションシステムにおける目的地検索機能や、音声による自動電話番号案内システムなどでの応用が可能である。
【００５９】
【発明の効果】
以上のように本発明の第1の発明は、音声認識装置における未知発話の検出手法として、
単一の判定尺度のみではなく、複数の判定尺度を併用することにより、高い確度で未知発話を検出するという効果を有する。
【００６０】
また上記第2の発明は、認識結果の出力を、上記第１の発明による未知発話検出装置による結果を勘案して出力することにより、より利用者に使いよい音声認識インタフェースを提供するという効果を有する。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態における未知発話検出装置のブロック図
【図２】同実施形態における、音声片から単語パタンを構築する例を示す図
【図３】同実施形態における、単語マッチング部の出力する単語とゆう度のリストの出力例を示す図
【図４】同実施形態における、音素２gram確率の例を示す図
【図５】同実施形態における、音素２gram確率を元に計算される単語内の音素遷移ゆう度の例を示す図
【図６】同実施形態における、参照ゆう度の計算において遷移ゆう度を導入する効果を示す図
【図７】同実施形態における、系列間の編集距離を求める方法を示す図
【図８】同実施形態における、４つの未知発話に関する尺度から未知発話の判定を行うルールを示す式の図
【図９】同実施形態における、従来手法と比較した効果を示すための評価尺度を示す式の図
【図１０】同実施形態における、従来手法と比較した効果を示す実験結果を示す図
【図１１】本発明の第２の実施の形態における音声認識装置のブロック図
【符号の説明】
1,20 音声分析部
2,21 音声片パタン格納部
3,22 単語辞書格納部
4,23 単語マッチング部
5,24 遷移確率格納部
6,25 音声系列マッチング部
7,26 候補スコア差計算部
8,27 候補・音素系列類似度計算部
9,28 候補・音声系列スコア差計算部
10,29 候補・音声系列・音素系列類似度計算部
11,30 未知発話判定部
31 認識結果出力部
[0001]
[Technical Field of the Invention]
The present invention relates to a speech recognition method in a speech recognition apparatus.
[0002]
[Prior art]
Conventionally, a speech recognition device defines the vocabulary it can accept, searches for the vocabulary word most similar to the input speech, and outputs that word as the recognition result. Consequently, even when the user utters something outside the recognition vocabulary, the device still selects the most similar word from the vocabulary, and the recognition result is wrong. A function is therefore needed that determines whether the user's utterance is a word in the recognition vocabulary or something else, such as an out-of-vocabulary word or a hesitation, and that rejects such unknown utterances.
[0003]
One method for rejecting such unknown utterances concatenates HMMs (one kind of model representing speech patterns) of units shorter than words, called subwords, to construct a model of each word in the recognition vocabulary, searches for the word that gives the maximum likelihood for the input speech, and takes this likelihood as the recognition likelihood (see, for example, Non-Patent Document 1). In addition, the maximum likelihood for the input speech is obtained from among models formed by concatenating arbitrary subword HMMs, corresponding to arbitrary kana sequences, and is taken as the reference likelihood. Unknown utterances are detected and rejected by comparing the recognition likelihood and the reference likelihood obtained in this way.
[0004]
However, in such a method the concatenation of subword HMMs is unconstrained when the reference likelihood is calculated, so the likelihood of a sequence that is not plausible Japanese is often selected as the maximum likelihood. As a result, comparing such a reference likelihood with the recognition likelihood did not reject unknown utterances sufficiently well. Moreover, because every possible concatenation of subword HMMs is compared, the method also required large computational resources. To address these problems, a speech recognition method has been proposed, for example, that introduces the affinity of connection between subword HMMs as a transition probability, aiming to improve both the rejection accuracy for unknown utterances and the amount of processing (see, for example, Patent Document 1).
[0005]
[Patent Document 1]
Japanese Patent Laid-Open No. 10-171489
[Non-Patent Document 1]
Watanabe et al., "Rejection of unknown utterances by likelihood correction using syllable recognition", IEICE Transactions, Vol. J75-D-II, No.12 (1992)
[0006]
[Problems to be solved by the invention]
However, the conventional method as described above has the following problems.
[0007]
That is, a model formed by concatenating subword HMMs can be regarded as a model that recognizes the input speech as a kana sequence. Even if transition probabilities are introduced as a constraint on the connections between subword HMMs, however, the kana sequence generated by this model still cannot be said to match that of the input speech sufficiently. In other words, the reference likelihood obtained with such a model is not accurate enough, and the effect of rejecting unknown utterances is also insufficient.
[0008]
[Means for Solving the Problems]
In order to achieve the above object, the unknown utterance detection device for speech recognition according to the first invention comprises: speech analysis means for analyzing input speech and converting it into a sequence of feature parameters; recognition dictionary storage means for defining the recognition target vocabulary;
speech model storage means for storing models of standard speech patterns; word level matching means for constructing models of the vocabulary defined in the recognition dictionary from the models stored by the speech model storage means and matching them against the input speech; subword transition probability storage means for defining transition probabilities between subwords; subword level matching means for concatenating the speech models stored by the speech model storage means, taking into account the subword transition probabilities stored by the subword transition probability storage means, and matching them against the input speech;
unknown utterance measure calculation means for calculating a plurality of unknown utterance measures from the outputs of the word level matching means and the subword level matching means; and unknown utterance determination means for determining whether the utterance is unknown based on the plurality of measures calculated by the unknown utterance measure calculation means.
[0009]
According to the above configuration, in the determination of an unknown utterance, it is possible to determine the possibility that the input speech is an unknown utterance from a plurality of viewpoints, and it is possible to show high unknown utterance detection performance.
[0010]
In the unknown utterance detection device of the first invention, it is desirable that the measures calculated by the unknown utterance measure calculation means include a value calculated based on the difference between the word likelihood obtained by the word level matching means and the subword chain likelihood obtained by the subword level matching means.
[0011]
According to the above configuration, the subword chain likelihood has the effect of correcting the word likelihood, and high unknown utterance detection performance is obtained.
[0012]
In the unknown utterance detection device of the first invention, it is also desirable that the measures calculated by the unknown utterance measure calculation means include a value calculated based on the similarity between the acoustic features of the first-candidate word model obtained by the word level matching means and the acoustic features of the subword chain model obtained by the subword level matching means.
[0013]
According to the above configuration, it is possible to determine an unknown utterance focusing on the similarity between the acoustic features of the two models, and high unknown utterance detection performance can be obtained.
[0014]
In the unknown utterance detection device of the first invention, it is also desirable that the measures calculated by the unknown utterance measure calculation means include a value calculated based on the difference between the likelihood of the first-candidate word obtained by the word level matching means and the likelihood of a lower-ranked candidate word.
[0015]
According to the above configuration, it becomes possible to model the characteristic that, when an unknown utterance is recognized, the word level matching unit yields incorrect candidates with similar likelihoods, and high unknown utterance detection performance is obtained.
[0016]
In the unknown utterance detection device of the first invention, it is also desirable that the measures calculated by the unknown utterance measure calculation means include a value calculated based on the similarity between the acoustic features of the first-candidate word obtained by the word level matching means and the acoustic features of a lower-ranked candidate word.
[0017]
According to the above configuration, it is possible to determine unknown utterances focusing on acoustic similarity between models of candidate words, and high unknown utterance detection performance can be obtained.
[0018]
The speech recognition device according to the second invention is a speech recognition device that recognizes input speech by matching it against models corresponding to the vocabulary registered in a recognition dictionary, characterized in that it is equipped with the above unknown utterance detection device and outputs its recognition result in consideration of the output of the unknown utterance detection device.
[0019]
According to the above configuration, rather than always outputting one of the words in the recognition dictionary for any input speech, the speech recognition device can inform the user when the content of the utterance is not in the recognition dictionary. This makes it possible to provide an interface that is easier for users to understand in the various speech recognition interfaces on which the speech recognition device is mounted.
[0020]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
[0021]
(Embodiment 1)
FIG. 1 shows a block diagram of the unknown utterance detection device according to the present embodiment. In FIG. 1, reference numeral 1 denotes an acoustic analysis unit that performs A/D conversion on the input speech and converts it into a time series of feature parameters. Reference numeral 2 denotes a speech segment pattern storage unit that stores standard speech segments used for matching against the feature parameters of the input speech.
[0022]
Here, speech segments mean the set of VC patterns, each formed by joining the latter half of a vowel section of speech to the first half of the consonant section that follows it, and CV patterns, each formed by joining the latter half of a consonant section to the first half of the vowel section that follows it. However, the speech segments may instead be a set of phonemes, each roughly corresponding to one letter of the alphabet when Japanese is written in Roman letters; a set of morae, each roughly corresponding to one character when Japanese is written in hiragana; a set of subwords, each being a chain of several morae; or a mixture of these sets.
[0023]
Reference numeral 3 in FIG. 1 denotes a word dictionary storage unit that stores rules for concatenating the speech segments to synthesize the word patterns of the speech recognition vocabulary. Reference numeral 4 denotes a word matching unit that compares the input speech, expressed as a time series of feature parameters, with the synthesized word patterns and obtains, for each word, a likelihood corresponding to their similarity.
[0024]
Reference numeral 5 denotes a transition probability storage unit that stores transition probabilities, which express as continuous values how natural a connection is when speech segments are combined arbitrarily. In the present embodiment, phoneme 2-gram probabilities are used as the transition probabilities. A phoneme 2-gram probability is the probability P(y|x) that phoneme y follows the preceding phoneme x, and is obtained in advance from a large amount of Japanese text data or the like. However, the transition probabilities may instead be 2-gram probabilities of morae, 2-gram probabilities of subwords, or 2-gram probabilities of a mixture of these, and n-gram probabilities other than 2-gram, such as 3-gram probabilities, may also be used.
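The estimation of P(y|x) from text can be sketched as follows. This is a minimal illustration assuming plain maximum-likelihood counting without smoothing; the function name `phoneme_bigram_probs` and the toy corpus are hypothetical and are not taken from the patent.

```python
from collections import Counter


def phoneme_bigram_probs(sequences):
    """Estimate P(y|x) by relative frequency from phoneme sequences."""
    pair_counts = Counter()     # count of each adjacent pair (x, y)
    context_counts = Counter()  # count of x appearing as the preceding phoneme
    for seq in sequences:
        for x, y in zip(seq, seq[1:]):
            pair_counts[(x, y)] += 1
            context_counts[x] += 1
    return {(x, y): c / context_counts[x] for (x, y), c in pair_counts.items()}


# Toy corpus (hypothetical); each phoneme is a single character here.
corpus = [list("kobajasi"), list("hajasi"), list("kawa")]
probs = phoneme_bigram_probs(corpus)
```

In practice the counts would come from a large phoneme-converted text corpus, and some smoothing for unseen pairs would likely be needed; the patent does not say which, if any, was used.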
[0025]
Reference numeral 6 in FIG. 1 denotes a speech sequence matching unit that calculates, taking the transition probabilities into account, the likelihood between the input speech expressed as a time series of feature parameters and the patterns formed by arbitrarily combining the speech segment patterns, and takes the maximum likelihood obtained as the reference likelihood.
[0026]
Reference numeral 7 denotes an inter-candidate score difference calculation unit that calculates, among the per-word likelihoods computed by the word matching unit, the difference between the likelihood of the word with the highest value (first candidate) and that of the word with the next highest value (second candidate), normalized by the word length.
[0027]
Reference numeral 8 denotes an inter-candidate phoneme sequence similarity calculation unit that calculates the distance between the phoneme sequence of the first candidate and that of the second candidate in order to obtain the acoustic similarity between the two candidates.
[0028]
Reference numeral 9 denotes a candidate/speech sequence score difference calculation unit that calculates the difference between the likelihood of the first candidate and the reference likelihood calculated by the speech sequence matching unit, normalized by the word length.
[0029]
Reference numeral 10 denotes a candidate/speech sequence/phoneme sequence similarity calculation unit that calculates the acoustic similarity between the first candidate and the sequence chosen as optimum by the speech sequence matching unit, as the distance between their phoneme sequences.
[0030]
Reference numeral 11 denotes an unknown utterance determination unit that combines the values obtained by the inter-candidate score difference calculation unit, the inter-candidate phoneme sequence similarity calculation unit, the candidate/speech sequence score difference calculation unit, and the candidate/speech sequence/phoneme sequence similarity calculation unit to determine whether the input speech is an unknown utterance.
[0031]
In the present embodiment, the above four measures are used by the unknown utterance determination unit, but other measures can be used together with them, such as the likelihood of each word candidate itself and the distribution of those likelihoods, the amount of variation of the local score within the word section, and the duration information of the phonemes making up the word. As the method for determining an unknown utterance from multiple measures, the present embodiment uses a linear discriminant function obtained in advance from a large number of recognition result examples. However, so-called learning machines such as neural networks, decision trees, and SVMs (support vector machines) are also effective.
[0032]
Next, the processing operation for unknown utterance detection in the present embodiment will be described. The input speech is first A/D converted and analyzed in the speech analysis unit, and is converted into an LPC vector every 10 milliseconds. An LPC vector is a parameter representing the spectral envelope of the short-time spectrum of speech and is used because it represents the phonological features of speech well. In ordinary speech recognition, the time series of LPC vectors obtained from the input speech at regular intervals is used as the feature vectors of the input speech and is matched against word models prepared in advance to obtain, for each word, a score called the likelihood.
[0033]
In the present embodiment, word models are created from the speech segment patterns and the word dictionary. That is, the speech segment patterns stored in the speech segment pattern storage unit are concatenated, according to the speech segment concatenation rules for synthesizing word patterns stored in the word dictionary storage unit, to construct word patterns. FIG. 2 illustrates how the word pattern “Hachinohe” is synthesized by concatenating the speech segment patterns used in the present embodiment, called CV/VC patterns.
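As a rough illustration of what a concatenation rule might look like, the sketch below derives the overlapping CV and VC unit labels for a word from its phoneme sequence. FIG. 2 is not reproduced in this text, so the unit inventory, the phoneme transcription, and the handling of word boundaries are all assumptions; the `VV` and `CC` labels are added only so that the hypothetical function `to_cv_vc_units` is total.

```python
VOWELS = {"a", "i", "u", "e", "o"}


def to_cv_vc_units(phonemes):
    """Label each adjacent phoneme pair as a CV or VC unit (VV/CC as fallbacks)."""
    units = []
    for left, right in zip(phonemes, phonemes[1:]):
        if left in VOWELS and right in VOWELS:
            kind = "VV"  # not described in the source; fallback label
        elif left in VOWELS:
            kind = "VC"  # latter half of a vowel + first half of a consonant
        elif right in VOWELS:
            kind = "CV"  # latter half of a consonant + first half of a vowel
        else:
            kind = "CC"  # not described in the source; fallback label
        units.append((kind, left + right))
    return units


# Assumed transcription of "Hachinohe"; the patent's own symbols may differ.
hachinohe = ["h", "a", "t", "i", "n", "o", "h", "e"]
units = to_cv_vc_units(hachinohe)
```

Each label would then index a stored speech segment pattern, and the word pattern would be the concatenation of those patterns in order.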
[0034]
Each speech segment pattern stores, in time series, parameters describing the standard distribution (assumed to be normal) of the LPC vectors of that speech segment. In recent years, transition networks called HMMs (hidden Markov models) have often been used as models for speech recognition. When HMMs are used, word HMMs can likewise be constructed by storing HMMs representing the speech segment patterns in the speech segment pattern storage unit 2 and defining rules for the transitions between HMMs in the word dictionary storage unit 3.
[0035]
The feature parameter time series of the input speech is compared with the word patterns in the word matching unit 4. Likelihoods are calculated for all the words defined in the word dictionary storage unit 3, or for the top candidate words that remain within a beam of a certain likelihood width, and the words are sorted in descending order of likelihood. FIG. 3 shows an example of the output words sorted by likelihood.
[0036]
In parallel with this, the speech sequence matching unit 6 matches arbitrary sequences of speech segments. Speech segments are concatenated freely under certain constraints, and the speech segment sequence closest to the input speech and its likelihood are calculated. If no constraint at all were placed on the concatenation of speech segments, the result would be a sequence that hardly resembles Japanese, and its likelihood would not be a sufficiently meaningful value. Therefore, in the search for the optimum speech segment sequence, the phoneme 2-gram probabilities stored in the transition probability storage unit 5 are used as the cost of selecting and connecting speech segments. The phoneme 2-gram probabilities are calculated in advance from a large amount of Japanese text from the same task as the recognition task, converted into phoneme sequences.
[0037]
FIG. 4 shows, as an example of phoneme 2-gram probabilities, the probabilities that each of the five vowels /a/, /i/, /u/, /e/, /o/ follows the preceding phoneme /k/. In this example, the vowel most likely to follow the consonant /k/ is /a/, followed by /i/. The speech sequence matching unit 6 calculates the weighted sum of the likelihood of the concatenated speech segment patterns and the transition likelihood, which is obtained as the sum of the logarithms of the phoneme 2-gram probabilities, and adopts the sequence for which this value is highest.
[0038]
FIG. 5 shows an example of calculating the transition likelihood for the sequence /kobajasi/ from the phoneme 2-gram probabilities. FIG. 6 shows, as an example of the effect of introducing the transition likelihood, the speech sequences and likelihoods output by the speech sequence matching unit for the input speech “Kobayashi”. As the figure shows, when the transition likelihood is not used, the sequence /pobaeasii/, which is far from “Kobayashi”, obtains a higher pattern likelihood than the more similar /obajasi/. When the total likelihood including the transition likelihood is used, /obajasi/, which is closer to “Kobayashi”, is selected.
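The effect described for FIG. 6 can be reproduced in miniature. The sketch below computes a transition likelihood as the sum of log 2-gram probabilities and picks the hypothesis with the highest weighted total. The pattern likelihoods, the 2-gram table, the weight, and the floor probability for unseen pairs are all made-up values for illustration; the figures' actual numbers are not available in this text, and the function names are hypothetical.

```python
import math


def transition_log_likelihood(phonemes, bigram, floor=1e-6):
    """Sum of log P(y|x) over adjacent pairs; unseen pairs use `floor` (assumed)."""
    return sum(math.log(bigram.get((x, y), floor))
               for x, y in zip(phonemes, phonemes[1:]))


def best_sequence(hypotheses, bigram, weight):
    """Return the (phonemes, pattern_ll) pair maximizing the weighted total."""
    def total(hyp):
        phonemes, pattern_ll = hyp
        return pattern_ll + weight * transition_log_likelihood(phonemes, bigram)
    return max(hypotheses, key=total)


# Hypothetical 2-gram table: every pair occurring in "obajasi" gets P = 0.2.
bigram = {pair: 0.2 for pair in zip("obajasi", "obajasi"[1:])}
# Hypothetical pattern likelihoods: the implausible sequence scores higher.
hyps = [("pobaeasii", 2100.0), ("obajasi", 2090.0)]

without_transition = best_sequence(hyps, bigram, weight=0.0)
with_transition = best_sequence(hyps, bigram, weight=1.0)
```

With weight 0 the implausible sequence wins on pattern likelihood alone; with the transition term included, its unseen phoneme pairs are penalized and the more plausible sequence is chosen. A real decoder would apply this cost inside the search rather than rescoring two finished hypotheses.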
[0039]
The above yields the recognition likelihood of each word defined in the word dictionary storage unit, together with the reference likelihood and the corresponding speech sequence. Various measures for unknown utterance determination are then calculated from these.
[0040]
First, the inter-candidate score difference calculation unit 7 takes the highest likelihood obtained by the word matching unit (the likelihood of the first candidate) and the next highest (the likelihood of the second candidate) and calculates their difference divided by the time length of the word. For example, for the result shown in FIG. 3, the likelihood difference between “Kobayashi” and “Hayashi” is normalized by the word length to obtain 6.9.
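A minimal sketch of this measure is given below. The n-best list and the duration are hypothetical values, since FIG. 3 is not reproduced here, and the unit of the duration (for example, frames) is an assumption.

```python
def inter_candidate_score_diff(nbest, duration):
    """(top likelihood - runner-up likelihood) / duration for an n-best list."""
    if len(nbest) < 2:
        raise ValueError("need at least two candidates")
    if duration <= 0:
        raise ValueError("duration must be positive")
    ranked = sorted(nbest, key=lambda item: item[1], reverse=True)
    return (ranked[0][1] - ranked[1][1]) / duration


# Hypothetical word likelihoods and word duration (not the values of FIG. 3).
nbest = [("hayashi", 95.0), ("kobayashi", 120.0), ("kobayakawa", 60.0)]
cm1 = inter_candidate_score_diff(nbest, duration=50)
```

Dividing by the duration keeps the measure comparable between short and long words, since accumulated likelihoods grow with the number of frames.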
[0041]
The inter-candidate phoneme sequence similarity calculation unit 8 calculates the similarity between the phoneme sequence of the first candidate and that of the second candidate. The similarity between sequences is the edit distance normalized by the sum of the lengths of the two sequences. The edit distance is the total cost of converting one sequence into the other at minimum cost, where replacing one element (substitution), deleting one element (deletion), and adding one element (insertion) each cost 1. FIG. 7 illustrates how the edit distance is obtained for the two phoneme sequences /uenoeki/ and /jenokii/. Following this method, for a result such as that in FIG. 3, the unit outputs 0.21, which is the edit distance of 3 between “Kobayashi” (/kobajasi/) and “Hayashi” (/hajasi/) divided by 14, the sum of the two phoneme sequence lengths. Although the present embodiment uses a value based on the edit distance as the inter-sequence similarity, it is also effective to use, for example, an inter-sequence distance that takes the acoustic similarity between phonemes into account.
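The normalized edit distance can be sketched with the standard dynamic-programming recurrence, as below. The function names are hypothetical; the unit costs and the normalization by the sum of the sequence lengths follow the description above, and each phoneme is represented as one character.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit cost for substitution, deletion, insertion."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the sum of the two sequence lengths."""
    total = len(a) + len(b)
    return edit_distance(a, b) / total if total else 0.0


cm2 = normalized_distance("kobajasi", "hajasi")   # 3 / 14
cm4 = normalized_distance("kobajasi", "obajasi")  # 1 / 15
```

Rounded to two decimals these give 0.21 and 0.07, matching the values quoted for the two similarity measures in the description. Note that although the text calls the value a similarity, a larger value means the sequences are less alike.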
[0042]
The candidate/speech sequence score difference calculation unit calculates the difference between the likelihood of the first candidate obtained by the word matching unit and the reference likelihood obtained by the speech sequence matching unit, normalized by the time length of the word. For example, in the example shown in FIGS. 3 and 6, the difference between the recognition likelihood of the first candidate, 2055, and the reference likelihood, 2014, is normalized by the word time length to obtain 0.87.
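Written out as a formula, the worked example above reads as follows. The symbols are introduced here for illustration only: L_rec is the recognition likelihood of the first candidate, L_ref is the reference likelihood, and T is the word time length, whose unit the text does not state.

```latex
\mathrm{CM}_3 = \frac{L_{\mathrm{rec}} - L_{\mathrm{ref}}}{T},
\qquad
\frac{2055 - 2014}{T} = \frac{41}{T} = 0.87
\;\Longrightarrow\;
T \approx 47
```

The implied word length of roughly 47 is an inference from the quoted numbers, not a value given in the source. If T counts 10-millisecond analysis frames, that would correspond to a word of a little under half a second.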
[0043]
In addition, the candidate / speech sequence / phoneme sequence similarity calculation unit 10 calculates the normalized edit distance between the phoneme sequence of the first candidate obtained by the word matching unit and the optimal phoneme sequence obtained by the speech sequence matching unit. For example, in the example shown in FIGS. 3 and 6, the edit distance between the phoneme sequences /kobajasi/ and /obajasi/ is normalized by the sum of the sequence lengths to obtain 0.07.
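The normalized edit distance used for both phoneme-sequence scales can be sketched as below. This is an illustrative sketch: the function names are hypothetical, and each character of the string is assumed to stand for one phoneme, which holds for the romanized examples in the text but is not stated as the apparatus's internal representation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for substitution, deletion and insertion."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the sum of the two sequence lengths."""
    total = len(a) + len(b)
    return edit_distance(a, b) / total if total else 0.0


print(edit_distance("kobajasi", "hajasi"))                   # 3
print(round(normalized_distance("kobajasi", "hajasi"), 2))   # 0.21
print(round(normalized_distance("kobajasi", "obajasi"), 2))  # 0.07
```

Both printed ratios match the values quoted in the description (3/14 and 1/15).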
[0044]
For the four scales obtained as described above, the unregistered word determination unit computes a suitably weighted sum of these scales and determines whether the input is an unregistered word utterance by comparing that sum with a threshold value. That is, the determination follows the equation shown in FIG. 8. In FIG. 8, CM1 to CM4 denote the score difference between candidates, the similarity between candidate phoneme sequences, the candidate / speech sequence score difference, and the candidate / speech sequence / phoneme sequence similarity, respectively; w1 to w4 are the weights for the respective scales; and θ is the threshold value.
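A minimal sketch of this weighted-sum decision is given below. FIG. 8 is not reproduced in the text, so the direction of the comparison (sum greater than θ means unknown) is an assumption, as are the function name and the weight values; only the four scale values 6.9, 0.21, 0.87, and 0.07 come from the worked examples in the description.

```python
from typing import Sequence


def is_unknown_utterance(scales: Sequence[float], weights: Sequence[float], theta: float) -> bool:
    """Compare the weighted sum of the scales CM1..CM4 with the threshold theta.

    ASSUMPTION: a sum above theta is treated as "unknown"; the source does not
    state the direction of the inequality.
    """
    if len(scales) != len(weights):
        raise ValueError("scales and weights must have the same length")
    score = sum(w * cm for w, cm in zip(weights, scales))
    return score > theta


cm = (6.9, 0.21, 0.87, 0.07)              # CM1..CM4 from the examples in the text
hypothetical_w = (-0.1, 1.0, -0.5, 2.0)   # made-up weights, for illustration only
print(is_unknown_utterance(cm, hypothetical_w, theta=0.0))
```

Because the weights here are invented, the printed decision carries no meaning about the actual example utterance; it only shows the mechanics of the rule.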
[0045]
The weights for the scales used here are obtained in advance by a statistical method. That is, the four scales are computed for many cases of registered word utterances and unregistered word utterances, the relationship between the four scales and the two classes of utterance is analyzed by linear discriminant analysis, and the weight for each scale is thereby obtained.
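One way such weights could be estimated is Fisher's linear discriminant, sketched below in plain Python. The description only says "a linear discrimination method", so the choice of Fisher's criterion, the midpoint threshold, the small ridge term, and all function names are assumptions; the sample vectors are synthetic and carry no empirical meaning.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter_matrix(rows, mean):
    d = len(mean)
    s = [[0.0] * d for _ in range(d)]
    for r in rows:
        diff = [x - m for x, m in zip(r, mean)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fisher_weights(known, unknown, ridge=1e-6):
    """Return (weights, theta): w = Sw^-1 (m_unknown - m_known), theta = midpoint."""
    m_k, m_u = mean_vector(known), mean_vector(unknown)
    s_k, s_u = scatter_matrix(known, m_k), scatter_matrix(unknown, m_u)
    d = len(m_k)
    # Within-class scatter, with a small ridge so the system stays solvable.
    sw = [[s_k[i][j] + s_u[i][j] + (ridge if i == j else 0.0) for j in range(d)]
          for i in range(d)]
    w = solve(sw, [u - k for u, k in zip(m_u, m_k)])
    project = lambda v: sum(wi * vi for wi, vi in zip(w, v))
    theta = 0.5 * (project(m_k) + project(m_u))
    return w, theta


# SYNTHETIC (CM1, CM2, CM3, CM4) vectors, invented purely for illustration.
known_samples = [(6.9, 0.21, 0.87, 0.07), (7.5, 0.30, 0.60, 0.05), (5.8, 0.25, 1.10, 0.10),
                 (8.1, 0.18, 0.75, 0.00), (6.2, 0.33, 0.95, 0.12), (7.0, 0.27, 0.70, 0.03)]
unknown_samples = [(1.2, 0.10, 3.50, 0.40), (2.0, 0.15, 4.20, 0.35), (0.8, 0.08, 2.90, 0.50),
                   (1.5, 0.12, 3.80, 0.45), (2.4, 0.20, 3.10, 0.30), (1.0, 0.05, 4.50, 0.55)]
weights, theta = fisher_weights(known_samples, unknown_samples)
print(weights, theta)
```

With this orientation of the weight vector, the projected mean of the unknown class lies above θ and that of the known class below it, which matches a "sum greater than threshold means unknown" decision rule.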
[0046]
(Effect)
Next, the effect of the unknown utterance detection method of this embodiment is shown experimentally in comparison with the conventional method.
[0047]
In general, detection problems of this kind involve two types of error: a missed-detection error, in which something that should be detected is not, and a false-detection error, in which something that should not be detected is detected. The two are in a trade-off relationship, and reducing one is known to increase the other. Such problems are therefore compared using the two measures shown in FIG. 9. Here, a high unknown utterance recall rate means fewer missed detections, and a high unknown utterance precision rate means fewer false detections. Both should ideally be high.
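The two measures can be sketched as follows, assuming FIG. 9 uses the standard definitions (recall = correctly detected unknown utterances / all unknown utterances; precision = correctly detected unknown utterances / all utterances detected as unknown). The function name and the toy labels are hypothetical.

```python
from typing import Sequence, Tuple


def recall_and_precision(truth: Sequence[bool], detected: Sequence[bool]) -> Tuple[float, float]:
    """truth[i]: utterance i really is unknown; detected[i]: it was judged unknown."""
    pairs = list(zip(truth, detected))
    tp = sum(1 for t, d in pairs if t and d)        # unknown, detected
    fn = sum(1 for t, d in pairs if t and not d)    # unknown, missed
    fp = sum(1 for t, d in pairs if not t and d)    # known, falsely detected
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Toy labels, invented for illustration: 3 unknown and 2 known utterances.
truth = [True, True, True, False, False]
detected = [True, True, False, True, False]
recall, precision = recall_and_precision(truth, detected)
print(recall, precision)
```

Evaluating this function at several decision thresholds yields one (precision, recall) point per threshold.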
[0048]
In the experiment below, a set of 100 unknown person-name words is used, and the changes in the unknown utterance recall rate and the unknown utterance precision rate are examined for each method as the determination threshold is raised and lowered. The following three methods are compared.
[0049]
(1) Determination of unknown utterances based only on the difference between the recognition likelihood and the reference likelihood
(no constraint on the concatenation of speech segments)
(2) Determination of unknown utterances based only on the difference between the recognition likelihood and the reference likelihood
(transition probabilities introduced for the concatenation of speech segments)
(3) Determination of unknown utterances using the four scales described in the above embodiment together
FIG. 10 shows the results. In FIG. 10, the horizontal axis is the unknown utterance precision rate and the vertical axis is the unknown utterance recall rate. Since higher values of both are better, a curve lying further toward the upper right of the figure indicates better detection performance. The results show that determining unknown utterances using the four scales together gives higher detection performance than determining them using only the recognition likelihood and the reference likelihood, as in the prior art.
[0050]
(Embodiment 2)
The present embodiment relates to a speech recognition apparatus in which the unknown utterance detection apparatus of the first embodiment is mounted. By using the unknown utterance detection result together with the conventional speech recognition result, the present embodiment returns the most appropriate response and thereby provides a speech recognition interface that is easier for the user to use.
[0051]
FIG. 11 is a block diagram of a speech recognition apparatus in which the unknown utterance detection apparatus shown in FIG. 1 is mounted. The speech analysis unit 20, speech segment pattern storage unit 21, word dictionary storage unit 22, word matching unit 23, transition probability storage unit 24, speech sequence matching unit 25, candidate score difference calculation unit 26, candidate / phoneme sequence similarity calculation unit 27, candidate / speech sequence score difference calculation unit 28, candidate / speech sequence / phoneme sequence similarity calculation unit 29, and unknown utterance determination unit 30, which constitute the unknown utterance detection apparatus, have the same configuration as, respectively, the speech analysis unit 1, speech segment pattern storage unit 2, word dictionary storage unit 3, word matching unit 4, transition probability storage unit 5, speech sequence matching unit 6, candidate score difference calculation unit 7, candidate / phoneme sequence similarity calculation unit 8, candidate / speech sequence score difference calculation unit 9, candidate / speech sequence / phoneme sequence similarity calculation unit 10, and unknown utterance determination unit 11 of the first embodiment.
[0052]
However, the unknown utterance determination unit 30 does not simply output a yes-or-no decision on whether the utterance is unknown, but outputs a continuous value indicating the likelihood of an unknown utterance. The present embodiment further includes a recognition result output unit 31 that outputs a recognition result in consideration of the results of both the word matching unit 23 and the unknown utterance determination unit 30. Note that the configuration consisting of the speech analysis unit 20, the speech segment pattern storage unit 21, the word dictionary storage unit 22, the word matching unit 23, and the recognition result output unit 31 is the same as that of an ordinary speech recognition apparatus.
[0053]
In the present embodiment, by the same steps as in the first embodiment, the unknown utterance determination unit 30 yields a result indicating whether the content of the input speech is a known utterance or an unknown utterance. At the same time, the word matching unit 23 assigns a likelihood to each word, and the recognition result candidates are obtained by sorting the words according to likelihood.
[0054]
The recognition result output unit 31 outputs the most appropriate response in consideration of the recognition result candidates and the likelihood of an unknown utterance obtained by the unknown utterance determination unit 30.
[0055]
That is, when the likelihood of an unknown utterance is high, the recognition result output unit rejects all the results obtained as recognition result candidates and outputs a result indicating that recognition was rejected. When the likelihood of an unknown utterance is moderately high, it outputs one or more of the top recognition result candidates together with a signal indicating that the result is not sufficiently reliable. When the likelihood of an unknown utterance is sufficiently low, it outputs one or more of the top recognition result candidates.
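The three-way behavior of the recognition result output unit can be sketched as below. This is a sketch under stated assumptions: the likelihood of an unknown utterance is assumed to be scaled to the range 0 to 1, the cut-off values 0.7 and 0.3 are made up, and the names `RecognitionOutput` and `select_output` are hypothetical; the description specifies only the three cases, not how they are bounded.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RecognitionOutput:
    status: str            # "rejected", "uncertain", or "accepted"
    candidates: List[str]  # top candidates, empty when rejected


def select_output(ranked_words: Sequence[str], unknown_score: float,
                  high: float = 0.7, low: float = 0.3, n_best: int = 1) -> RecognitionOutput:
    """ranked_words: recognition candidates sorted by descending likelihood.

    ASSUMPTION: unknown_score lies in [0, 1]; high and low are illustrative cut-offs.
    """
    if unknown_score >= high:
        # Likely unknown: reject every candidate and report the rejection.
        return RecognitionOutput("rejected", [])
    top = list(ranked_words[:n_best])
    if unknown_score >= low:
        # Moderately likely unknown: output candidates, flagged as unreliable.
        return RecognitionOutput("uncertain", top)
    # Unlikely to be unknown: output the top candidates as the result.
    return RecognitionOutput("accepted", top)


print(select_output(["kobayashi", "hayashi"], 0.5))
```

An application could use the "uncertain" status to ask the user for confirmation before acting on the candidate.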
[0056]
With the configuration described above, the following effects are obtained when, for example, the speech recognition apparatus according to the present embodiment is mounted on a television receiver and programs are selected through the speech input interface. Conventionally, if a user uttered a word not registered in the recognition dictionary, such as the name of a program that is not being broadcast or the name of a broadcasting station that cannot be received, the result was simply a recognition error, which left users distrustful and unsure of what they could say.
[0057]
In contrast, when the user utters such an unknown word, the speech recognition apparatus according to the present embodiment can notify the user that no such program name or broadcast station name exists. Likewise, when the recognition result is ambiguous, a conventional apparatus might continue processing regardless and switch the video to a program the user did not want. According to the present embodiment, the user can be notified that the recognition result is ambiguous and offered a means of confirmation before the program is switched, so that problems caused by the recognition errors that tend to occur in speech recognition can be avoided efficiently.
[0058]
The same effect can be obtained not only with a speech recognition apparatus in a television receiver but also with a destination search function in a car navigation system, an automatic telephone number guidance system using speech, and the like.
[0059]
[Effects of the Invention]
As described above, the first invention is an unknown utterance detection method in a speech recognition apparatus that uses not just a single determination scale but a plurality of determination scales, and thereby has the effect of detecting unknown utterances with high accuracy.
[0060]
In addition, the second invention outputs the recognition result in consideration of the result of the unknown utterance detection apparatus according to the first invention, and thereby has the effect of providing a speech recognition interface that is easier for the user to use.
[Brief description of the drawings]
FIG. 1 is a block diagram of the unknown utterance detection apparatus according to the first embodiment of the present invention.
FIG. 2 is a diagram showing an example of constructing a word pattern from speech segments in the embodiment.
FIG. 3 is a diagram showing an example of the list of words and likelihoods output by the word matching unit in the embodiment.
FIG. 4 is a diagram showing an example of phoneme 2-gram probabilities in the embodiment.
FIG. 5 is a diagram showing an example of phoneme transition likelihoods within a word calculated in the embodiment.
FIG. 6 is a diagram showing the effect of introducing transition likelihoods in the calculation of the reference likelihood in the embodiment.
FIG. 7 is a diagram showing a method for obtaining the edit distance between sequences.
FIG. 8 is a diagram showing the expression for determining unknown utterances from the four unknown utterance scales in the embodiment.
FIG. 9 is a diagram showing the evaluation measures used to evaluate the effect of the embodiment in comparison with the conventional method.
FIG. 10 is a diagram showing experimental results on the effect of the embodiment in comparison with the conventional method.
FIG. 11 is a block diagram of the speech recognition apparatus according to the second embodiment of the present invention.
[Explanation of symbols]
1, 20 Speech analysis unit
2, 21 Speech segment pattern storage unit
3, 22 Word dictionary storage unit
4, 23 Word matching unit
5, 24 Transition probability storage unit
6, 25 Speech sequence matching unit
7, 26 Candidate score difference calculation unit
8, 27 Candidate / phoneme sequence similarity calculation unit
9, 28 Candidate / speech sequence score difference calculation unit
10, 29 Candidate / speech sequence / phoneme sequence similarity calculation unit
11, 30 Unknown utterance determination unit
31 Recognition result output unit

Claims

An unknown utterance detection apparatus for speech recognition, comprising:
speech analysis means for analyzing input speech and converting it into a sequence of feature parameters;
recognition dictionary storage means for defining a recognition target vocabulary;
speech model storage means for storing models of standard speech patterns;
word level matching means for constructing models of the vocabulary defined in the recognition dictionary using the models stored by the speech model storage means, and matching them against the input speech;
subword transition probability storage means for defining transition probabilities between subwords;
subword level matching means for connecting the speech models stored by the speech model storage means in consideration of the subword transition probabilities stored by the subword transition probability storage means, and matching them against the input speech;
unknown utterance scale calculating means for calculating
(1) a first unknown utterance scale that is a value calculated based on the difference between the likelihood of the word obtained by the word level matching means and the likelihood of the subword chain obtained by the subword level matching means,
(2) a second unknown utterance scale that is a value calculated based on the similarity between the acoustic features of the first candidate word obtained by the word level matching means and the acoustic features of the subword chain obtained by the subword level matching means,
(3) a third unknown utterance scale that is a value calculated based on the likelihood of the first candidate word obtained by the word level matching means and the likelihood of a lower-ranked candidate word, and
(4) a fourth unknown utterance scale that is a value calculated based on the similarity between the acoustic features of the first candidate word obtained by the word level matching means and the acoustic features of a lower-ranked candidate word; and
unknown utterance determination means for determining an unknown utterance based on whether a value, obtained by multiplying each of the four unknown utterance scales calculated by the unknown utterance scale calculating means by a weight obtained by a statistical method and adding the results, satisfies a predetermined threshold value.

An unknown utterance detection apparatus for speech recognition, comprising:
speech analysis means for analyzing input speech and converting it into a sequence of feature parameters;
recognition dictionary storage means for defining a recognition target vocabulary;
speech model storage means for storing models of standard speech patterns;
word level matching means for constructing models of the vocabulary defined in the recognition dictionary using the models stored by the speech model storage means, and matching them against the input speech;
subword transition probability storage means for defining transition probabilities between subwords;
subword level matching means for connecting the speech models stored by the speech model storage means in consideration of the subword transition probabilities stored by the subword transition probability storage means, and matching them against the input speech;
unknown utterance scale calculating means for calculating
(1) a first unknown utterance scale that is a value calculated based on the difference between the likelihood of the word obtained by the word level matching means and the likelihood of the subword chain obtained by the subword level matching means,
(2) a second unknown utterance scale that is a value calculated based on the similarity between the acoustic features of the first candidate word obtained by the word level matching means and the acoustic features of the subword chain obtained by the subword level matching means,
(3) a third unknown utterance scale that is a value calculated based on the likelihood of the first candidate word obtained by the word level matching means and the likelihood of a lower-ranked candidate word, and
(4) a fourth unknown utterance scale that is a value calculated based on the similarity between the acoustic features of the first candidate word obtained by the word level matching means and the acoustic features of a lower-ranked candidate word;
unknown utterance determination means for outputting a continuous value indicating the likelihood of an unknown utterance, based on whether a value, obtained by multiplying each of the four unknown utterance scales calculated by the unknown utterance scale calculating means by a weight obtained by a statistical method and adding the results, satisfies a predetermined threshold value; and
recognition result output means that, based on the output of the unknown utterance determination means,
when the likelihood of an unknown utterance is high, rejects the results obtained as recognition result candidates and outputs a result indicating the rejection,
when the likelihood of an unknown utterance is medium, outputs at least the top one of the recognition result candidates and also outputs a result indicating that the output of the at least top one candidate is not sufficiently reliable, and
when the likelihood of an unknown utterance is low, outputs at least the top one of the recognition result candidates.

An unknown utterance detection apparatus for speech recognition, comprising:
speech analysis means for analyzing input speech and converting it into a sequence of feature parameters;
recognition dictionary storage means for defining a recognition target vocabulary;
speech model storage means for storing models of standard speech patterns;
word level matching means for constructing models of the vocabulary defined in the recognition dictionary using the models stored by the speech model storage means, and matching them against the input speech;
subword transition probability storage means for defining transition probabilities between subwords;
subword level matching means for connecting the speech models stored by the speech model storage means in consideration of the subword transition probabilities stored by the subword transition probability storage means, and matching them against the input speech;
unknown utterance scale calculating means for calculating, among
(1) a first unknown utterance scale that is a value calculated based on the difference between the likelihood of the word obtained by the word level matching means and the likelihood of the subword chain obtained by the subword level matching means,
(2) a second unknown utterance scale that is a value calculated based on the similarity between the acoustic features of the first-rank candidate word obtained by the word level matching means and the acoustic features of the subword chain obtained by the subword level matching means,
(3) a third unknown utterance scale that is a value calculated based on the difference between the likelihood of the first-rank candidate word obtained by the word level matching means and the likelihood of a lower-ranked candidate word, and
(4) a fourth unknown utterance scale that is a value calculated based on the similarity between the acoustic features of the first-rank candidate word obtained by the word level matching means and the acoustic features of a lower-ranked candidate word,
the first unknown utterance scale and at least one of the second to fourth unknown utterance scales; and
unknown utterance determination means for determining an unknown utterance based on whether a value, obtained by multiplying each unknown utterance scale calculated by the unknown utterance scale calculating means by a weight obtained by a statistical method and adding the results, satisfies a predetermined threshold value.