JP2004177551A

JP2004177551A - Unknown speech detecting device for voice recognition and voice recognition device

Info

Publication number: JP2004177551A
Application number: JP2002342011A
Authority: JP
Inventors: Sumiyuki Okimoto; 純幸沖本; Mitsuru Endo; 充遠藤; Hiroyasu Kuwano; 裕康桑野; Yumi Wakita; 由実脇田
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2002-11-26
Filing date: 2002-11-26
Publication date: 2004-06-24
Anticipated expiration: 2022-11-26
Also published as: JP4259100B2

Abstract

PROBLEM TO BE SOLVED: To provide an unknown speech detecting device for voice recognition that can detect unknown speech with high accuracy by using a plurality of decision scales in combination rather than a single decision scale.

SOLUTION: In addition to collation using word patterns of the vocabulary registered in a recognition dictionary, the unknown speech detecting device for voice recognition performs, at the voice recognition stage, collation using subword connection models that take the chain probability of subwords into consideration. It finds a plurality of unknown speech detection scales from the collation results and takes them all into consideration to decide whether the input is unknown speech.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
この発明は、音声認識装置における音声認識方法に関するものである。
【０００２】
【従来の技術】
従来、音声認識装置においては、受理可能な音声認識語彙を規定して、入力音声と最も類似した認識語彙を探索することによって、これを認識結果として出力する。したがって、かりに利用者が音声認識語彙外の発話を行なった場合でも、音声認識語彙から最も類似した語彙を選択するため、認識結果は誤ったものとなる。このため利用者の発話が音声認識語彙に含まれる単語であるのか、それ以外の単語あるいは言い淀み等であるのかを判定し、これら未知発話を棄却する機能が必要となる。
【０００３】
このような未知発話を棄却する方法は、サブワードと呼ばれる単語より短かい単位のＨＭＭ（音声パタンを表現するモデルの１つ）を連結して、認識語彙の各単語のモデルを構成し、入力音声に対して最大のゆう度を与える単語の探索を行ない、このゆう度を認識ゆう度とする（例えば非特許文献１参照）。また、任意カナ系列に対応する任意のサブワードＨＭＭの連接によるモデルの中から、入力音声に対する最大のゆう度を求めて、これを参照ゆう度とする。このようにして得られた認識ゆう度と参照ゆう度の比較を行なうことで、未知発話を検出し棄却する。
【０００４】
しかしこのような方法においては、参照ゆう度の算出において、任意のサブワードＨＭＭの連接における制約がなく、非日本語的な系列に対するゆう度が最大ゆう度として選択される場合も多く、結果としてこのような参照ゆう度と認識ゆう度の比較では充分な未知発声の棄却効果が得られなかった。また、あらゆるサブワードＨＭＭの連接を比較するため、処理計算量の面でも大きなリソースを必要とした。このような問題に対して、たとえば、音声認識法では、サブワードＨＭＭ間の連接の親和性を遷移確率として導入することによって、未知発声の棄却精度と処理量の両面の向上を図っている（例えば特許文献１参照）。
【０００５】
【特許文献１】
特開平１０−１７１４８９号公報
【非特許文献１】
渡辺他， ”音節認識を用いたゆう度補正による未知発話のリジェクション”，電子情報通信学会論文誌，Ｖｏｌ．Ｊ７５−Ｄ−ＩＩ，Ｎｏ．１２（１９９２）
【０００６】
【発明が解決しようとする課題】
しかしながら、以上に述べたような従来法では、次に述べるような問題がある。
【０００７】
すなわち、サブワードＨＭＭの連接によるモデルは、入力音声をかな系列として認識するモデルと見なすことができるが、仮りにサブワードＨＭＭ間の連接の制約として遷移確率を導入したとしても、このモデルが生成するかな系列は、依然、入力音声のそれとは充分一致しているとは言い難い。すなわち、このようなモデルによって得られる参照ゆう度は充分な精度とは言えず、未知発話の棄却効果も充分ではない。
【０００８】
【課題を解決するための手段】
上記目的を達成するため、上記第１の発明の音声認識用未知発話検出装置は、入力された音声を分析して特徴パラメータの系列に変換する音声分析手段と、認識対象語彙を規定する認識辞書格納手段と、
音声の標準的パタンをモデル化した音声モデル格納手段と、認識辞書に規定された語彙のモデルを、上記音声モデル格納手段によって格納されたモデルを用いて構築し、入力音声との照合を行なう単語レベルマッチング手段と、サブワード間の遷移確率を規定するサブワード遷移確率格納手段と、上記音声モデル格納手段によって格納された音声モデルを、上記サブワード遷移確率格納手段によって格納されたサブワード遷移確率を勘案して連結し、入力音声との照合を行なうサブワードレベルマッチング手段と、
上記単語レベルマッチング部および上記サブワードレベルマッチング部から、複数個の未知発話尺度を計算する未知発話尺度計算部手段と、上記未知発話尺度計算部で計算された複数の尺度を元に、未知発話の判定を行なう未知発話判定手段とを備えたことを特徴とする。
【０００９】
上記構成によれば、未知発話の判定において、複数の観点から入力音声が未知発話である可能性を判断することが可能となり、高い未知発話の検出性能を示すことが可能となる。
【００１０】
また、上記第１の発明の未知発話検出装置は、上記未知発話尺度計算手段において、上記単語レベルマッチング手段により得られた単語のゆう度と、上記サブワードレベルマッチング手段により得られたサブワード連鎖ゆう度の差に基づいて計算された値を含むことが望ましい。
【００１１】
上記構成によれば、上記サブワード連鎖ゆう度による上記単語ゆう度の補正効果が得られ、高い未知発話検出性能が得られる。
【００１２】
また、上記第１の発明の未知発話検出装置は、上記未知発話尺度計算手段において、上記単語レベルマッチング手段により得られた１位候補の単語モデルの音響的特徴と、上記サブワードレベルマッチング手段により得られたサブワード連鎖モデルの音響的特徴の、両者の類似性に基づいて計算された値を含むことが望ましい。
【００１３】
上記構成によれば、２つのモデルの音響的特徴の類似性に着目した未知発話の判定が可能となり、高い未知発話検出性能が得られる。
【００１４】
また、上記第１の発明の未知発話検出装置は、上記未知発話尺度計算手段において、上記単語レベルマッチング手段により得られた１位候補の単語のゆう度と、下位候補の単語のゆう度の差に基づいて計算された値を含むことが望ましい。
【００１５】
上記構成によれば、未知発話の認識時には単語レベルマッチング部では、誤った候補が類似した、ゆう度で得られるという特徴をモデル化することが可能となり、高い未知発話検出性能が得られる。
【００１６】
また、上記第１の発明の未知発話検出装置は、上記未知発話尺度計算手段において、上記単語レベルマッチング手段により得られた１位候補の単語の音響的特徴と、下位候補の単語の音響的特徴の、両者の類似性に基づいて計算された値を含むことが望ましい。
【００１７】
上記構成によれば、候補単語のモデル間の音響的類似性に着目した未知発話の判定が可能となり、高い未知発話検出性能が得られる。
【００１８】
また、上記第２の発明の音声認識装置は、入力された音声を、認識辞書に登録されている語彙に対応するモデルによって照合を行なって認識する音声認識装置であって、上記未知発話検出装置を塔載し、上記未知発話検出装置の出力結果を勘案して認識結果の出力を行なうことを特徴とする。
【００１９】
上記構成によれば、音声認識装置は、どのような入力音声に対しても常に認識辞書内の語彙のいずれか１つを出力するのではなく、発話内容が認識辞書に含まれないものであれば、これを利用者に伝えることが可能となり、音声認識装置を塔載した様々な音声認識インタフェースにおいて、利用者にとってより判り易いインタフェースを提供することを可能とする。
【００２０】
【発明の実施の形態】
以下、本発明の実施の形態について、図を参照して説明する。
【００２１】
（実施の形態１）
図１は、本実施の形態における未知発話検出装置のブロック図を示したものである。図１において、１は入力音声をＡ／Ｄ変換し特徴パラメータの時系列に変換する音響分析部である。２は入力音声の特徴パラメータとのマッチングに用いられる、標準的な音声の音声片を格納した音声片パタン格納部である。
【００２２】
ここで音声片とは、音声の母音区間の後半部分とこれに後続する子音区間の前半部分を連接したＶＣパタン、および子音区間の後半部分とこれに後続する母音区間の前半部分を連接したＣＶパタンの集合を意味している。ただし音声片は、この他に日本語をローマ字標記した場合のアルファベット１文字１文字にほぼ相当する音素の集合、日本語をひらかな標記した時のひらかな１文字１文字にほぼ相当するモーラの集合、複数のモーラの連鎖を意味するサブワードの集合、さらにこれらの集合の混合集合であってもよい。
【００２３】
図１における３は、上記音声片を連結して音声認識語彙の単語パタンを合成するための規則が格納された、単語辞書格納部である。４は特徴パラメータの時系列で表現された入力音声と、上記合成された単語パタンを比較し、その類似性に対応する、ゆう度を各単語ごとに求める単語マッチング部である。
【００２４】
５は音声片どうしを任意に結合する場合における、結合の自然さを連続値で表現する遷移確率が格納された遷移確率格納部である。本実施の形態では、遷移確率として音素の２ｇｒａｍ確率を用いる。音素の２ｇｒａｍ確率とは、先行する音素ｘの後に、音素ｙが接続する確率Ｐ（ｙ｜ｘ）を意味するもので、多数の日本語テキストデータなどを用いて事前に求めておく。ただし遷移確率は、これ以外にモーラの２ｇｒａｍ確率、サブワードの２ｇｒａｍ確率、あるいはこれらの混合の２ｇｒａｍ確率であってもよく、また２ｇｒａｍ確率以外にも、３ｇｒａｍ確率などであってもよい。
【００２５】
図１における６は、上記音声片パタンを任意に結合してできるパタンと、特徴パラメータの時系列として表現された入力音声とのゆう度を、上記遷移確率を考慮して計算し、得られた最大ゆう度を参照ゆう度とする音声系列タマッチング部である。
【００２６】
７は上記単語マッチング部で計算された各単語ごとのゆう度のうち、最も高い値を得た単語（１位候補）と次に高い値を得た単語（２位候補）のゆう度の差を単語の長さで正規化して計算する候補間スコア差計算部である。
【００２７】
８は１位候補と２位候補の音響的な類似性を求めるため、１位候補の音素系列と２位候補の音素系列の系列間の距離を計算する、候補音素系列間類似度計算部である。
【００２８】
９は１位候補のゆう度と、上記音声系列マッチング部で計算された参照尤度との差を単語の長さで正規化して計算する、候補・音声系列スコア差計算部である。
【００２９】
１０は、１位候補と、上記音声系列マッチング部によって最適系列とされた系列の音響的な類似性を、各音素系列間の距離として計算する候補・音声系列・音素系列間類似度計算部である。
【００３０】
１１は、上記、候補間スコア差計算部、候補・音素系列間類似度計算部、候補・音声系列スコア差計算部、候補・音声系列・音素系列間類似度計算部で求められた各値を総合して、入力音声が未知発話であるか否かを判定する未知発話判定部である。
【００３１】
なお、本実施の形態においては、未知発話判定部で用いる尺度として、上記４つの尺度を挙げたが、これ以外にも、各単語候補のゆう度そのものやその分布、また単語区間内での局所スコアの変動量、単語を構成する音素の持続時間情報などの尺度も併用することも可能である。また、複数の尺度を元に未知発話を判定する方法として、本実施の形態では事前に多数の認識結果の事例を用いて求めた線型判別式を利用する。しかしこれ以外にも、ニューラルネットワーク、決定木、ＳＶＭ（サポート・ベクトル・マシン）などいわゆる学習機械の利用も有効である。
【００３２】
次に、本実施の形態における未知発話検出の処理動作を説明する。入力された音声は、まず音声分析部において、Ａ／Ｄ変換された後に分析され、１０ｍ秒ごとにＬＰＣベクトルに変換される。ＬＰＣベクトルは、音声の短時間スペクトルのスペクトル包絡を意味するパラメータであり、音声の音韻的特徴をよく表わすパラメータとして利用されるものである。通常の音声認識法においては、入力音声から一定時間ごとに得られたＬＰＣベクトルの時系列を入力音声の特徴ベクトルとして、あらかじめ求めておいた単語モデルとマッチングさせて、単語ごとのゆう度と呼ばれるスコアを求める。
【００３３】
本実施の形態においては、単語モデルを音声片パタンと単語辞書を用いて作成する。すなわち、単語辞書格納部に格納された単語パタンを合成するための音声片の連接規則に基づいて、音声片パタン格納部に格納された音声片パタンを連接して単語パタンを構築する。図２には、本実施の形態で用いるＣＶ・ＶＣパタンと呼ばれる音声片パタンを連接して、単語パタン「はちのへ」を合成するイメージを図示する。
【００３４】
なお、音声片パタンには、各音声片のＬＰＣベクトルの標準的な分布（正規分布を仮定）を示すパラメータが時系列で格納されている。また、近年はＨＭＭ（隠れマルコフモデル）と呼ばれる遷移ネットワークが、音声認識のためのモデルとしてしばしば用いられている。ＨＭＭモデルを用いる場合においても、音声片パタン格納部２には音声片パタンを表現するＨＭＭモデルを格納し、単語辞書格納部３においてＨＭＭモデルどうしの遷移に関する規則を定義することによって、単語のＨＭＭモデルを構築することが可能である。
【００３５】
入力音声の特徴パラメータ時系列は、単語マッチング部４において単語パタンと比較され、単語辞書格納部３に定義された全単語、あるいは一定のゆう度のビームの中に残った上位候補単語に対するゆう度が計算され、ゆう度の高いものから順にソートされる。図３において、ゆう度順でソートされた単語の出力例を示す。
【００３６】
またこれと並行して、音声系列マッチング部６において、音声片の任意系列のマッチングも行なわれる。これは、音声片を一定の制約の下で自由に連接して、最も入力音声に近い音声片系列とそのゆう度を計算する。この時音声片どうしの連接において何らの制約も加えないと、計算結果はおよそ非日本語的な系列となり、そのゆう度も充分意味のある値とは言えなくなる。そこで最適音声片系列の探索過程において、音声片の選択と接続のコストとして遷移確率格納部３に格納された音素２ｇｒａｍ確率を用いる。音素２ｇｒａｍ確率については、認識タスクと同タスクの大量の日本語テキストを音素系列に変換し、これを元に計算しておいたものを用いる。
【００３７】
図４において、音素２ｇｒａｍ確率の一例として、先行音素／ｋ／の後に５つの母音／ａ／，／ｉ／，／ｕ／，／ｅ／，／ｏ／がそれぞれ後続する確率を例示する。この例の場合では、子音／ｋ／の後に後続しやすい母音は／ａ／、次に／ｉ／であることが示されている。音声系列マッチング部６では、連接された音声片パタンによるゆう度と、上記音素２ｇｒａｍ確率の対数和によって得られる遷移ゆう度を重み付けで加算した値を求め、これが最も高い値となる系列採用する。
【００３８】
図５に音素２ｇｒａｍ確率から系列／ｋｏｂａｊａｓｉ／に対する遷移ゆう度を求める例を示す。また図６において、遷移ゆう度を導入することによる効果を示す一例として、「コバヤシ」という入力音声に対する、音声系列マッチンング部の出力する音声系列とゆう度を示す。この図にあるように、遷移ゆう度を用いない場合は／ｐｏｂａｅａｓｉｉ／という「コバヤシ」とは大きくかけ離れた系列の方が、より類似する／ｏｂａｊａｓｉ／より高いパタンゆう度を得ているが、遷移ゆう度を考慮した合計ゆう度を用いることにより、「コバヤシ」により近い／ｏｂａｊａｓｉ／の方が選択される。
【００３９】
以上により、単語辞書格納部に定義された単語ごとの認識ゆう度と、参照ゆう度およびその時の音声系列が得られるが、次にこれを元に未知発話判定のための種々の尺度の計算を行なう。
【００４０】
まず候補間スコア差計算部７では、単語マッチング部で得られたゆう度のうち、最も高いゆう度（１位候補のゆう度）とその次に高いゆう度（２位候補のゆう度）のゆう度差を単語の時間長で割った値を計算する。例えば図３に示した結果の場合、「コバヤシ」と「ハヤシ」のゆう度の差を単語長で正規化して、６．９を得る。
【００４１】
候補音素系列間類似度計算部８では、１位候補の音素系列と２位候補の音素系列についてその類似度を計算する。ここで系列間の類似度は、編集距離を２つの系列の系列長の和で正規化した値を用いる。編集距離とは、一方の系列を編集して他方の系列に変換する際に、１要素置き換え（置換）、１要素削除（脱落）、１要素追加（挿入）に要するコストをそれぞれ１として、最小のコストで編集した場合のコストの総和を意味する。図７では、２つの音素系列／ｕｅｎｏｅｋｉ／と／ｊｅｎｏｋｉｉ／に対する編集距離の求め方を例示する。このような方法に従って候補音素系列間類似度計算部は、例えば図３のような結果の場合、「コバヤシ（／ｋｏｂａｊａｓｉ／）」と「ハヤシ（／ｈａｊａｓｉ／）」の編集距離３を各音素系列長の和１４で割った０．２１という値を出力する。なお、系列間類似度として本実施の形態では上記編集距離に基づく値を用いるものとしたが、これ以外にも音素間の音響的類似性を考慮した系列間距離などを利用することも有効である。
【００４２】
候補・音声系列スコア差計算部では、単語マッチング部で得た１位候補のゆう度と、音声系列マッチング部で得た参照ゆう度の差を、単語の時間長で正規化した値を計算する。例えば図３および図６に示した例では、１位候補の認識ゆう度２０５５と参照ゆう度２０１４の差を、単語時間長で正規化した０．８７を得る。
【００４３】
また候補・音声系列・音素系列類似度計算部では、単語マッチング部で得られた１位候補の音素系列と、音声系列マッチング部で得られた最適な音素系列の、系列間の正規化した編集距離を計算する。例えば、図３および図６に示した例では、音素系列／ｋｏｂａｊａｓｉ／と／ｏｂａｊａｓｉ／の編集距離を系列長の和で正規化して０．０７を得る。
【００４４】
以上のようにして得られた４つの尺度に対して、未登録語判定部ではこれらの尺度を適切に重み付けした和を求め、その大小を閾値で判定して未登録語発声か否かの判定を行なう。すなわち、図８に示した式に従って判定を行なう。図８においてＣＭ１〜ＣＭ４は、それぞれ候補間スコア差、候補音素系列間類似度、候補・音声系列スコア差、候補・音声系列・音素系列間類似度を意味しており、また、ｗ１〜ｗ２は各尺度に対する重み付け、θは閾値を意味している。
【００４５】
また、ここで用いる各尺度に対する重み付けは、統計的手法によって事前に求めておく。すなわち、登録語発声および未登録語発声の多数の事例に対して、上記４つの尺度をそれぞれ求め、４つの尺度と登録語発声、未登録語発声の関係を線型判別法によって分析し、各尺度に対する重みを求めている。
【００４６】
（効果）
次に、本実施の形態に基づく未知発話の検出法の効果を、従来手法と比較して実験的に示す。
【００４７】
一般に、このような検出問題には２種類のエラーが存在する。すなわち、検出漏れエラーと、検出されてはならないものが検出される湧き出しエラーである。この両者のエラーはトレードオフの関係にあり、一方のエラーを減らそうとすれば、他方が増えることが知られている。そのためこのような問題に対しては、図９に示したような２つの尺度による比較を行なう。ここにおいて未知発話再現率が高いということは検出漏れエラーが少ないことを意味し、未知発話適合率が高いということは湧き出しエラーが少ないことを意味する。この両者は共に高いことが望ましい。
【００４８】
以下に示す実験では、１００語の未知人名がある場合において判定閾値を増減させた場合に、各手法による未知発話再現率と未知発話適合率の変化を調べる。比較する手法は次の３つである。
【００４９】
（１）認識ゆう度と参照ゆう度の差のみで未知発話判定する
（音声片の連接について制約なし）
（２）認識ゆう度と参照ゆう度の差のみで未知発話判定する
（音声片の連接について遷移確率を導入する）
（３）上記実施の形態に述べた４つの尺度を併用して未知発話判定する
図１０にこの結果を示す。図１０では、横軸に未知発話適合率を、縦軸に未知発話再現率を取っている。この両者は高いほど良いので、図中の曲線は右上に行くほど良い検出性能であると言うことができる。この結果から、従来技術のように認識ゆう度と参照ゆう度のみを用いて未知発話の判定を行なうより、４つの尺度を併用して未知発話の判定を行なう法が高い検出性能となることが示される。
【００５０】
（実施の形態２）
本実施の形態は、上記第１の実施の形態における未知発話検出部を塔載した音声認識装置に関するものである。本実施の形態では、従来の音声認識結果と共に未知発話の検出結果を同時に用いることで最適な応答結果を返しことにより、利用者にとってより使い易い音声認識インタフェース機能を提供する機能を有するものである。
【００５１】
図１１は、上記図１に示す未知発話検出装置を塔載した音声認識装置のブロック図である。未知発話検出装置を構成する、音声分析部２０、音声片パタン格納部２１、単語辞書格納部２２、単語マッチング部２３、遷移確率格納部２４、音声系列マッチング部２５、候補スコア差計算部２６、候補・音素系列間類似度計算部２７、候補・音声系列スコア差計算部２８、候補・音声系列・音素系列類似度計算部２９、および未知発話判定部３０は、上記第１の実施の形態における音声分析部１、音声片パタン格納部２、単語辞書格納部３、単語マッチング部４、遷移確率格納部５、音声系列マッチング部６、候補スコア差計算部７、候補・音素系列間類似度計算部８、候補・音声系列スコア差計算部９、候補・音声系列・音素系列類似度計算部１０、および未知発話判定部１１と同じ構成をしている。
【００５２】
ただし未知発話検出部３０は、単に未知発話判定結果を正否で出力するのではなく、未知発話らしさを示す連続値を出力する。さらに本実施の形態では、上記単語マッチング部２３と上記未知発話検出部３０の双方の結果を勘案して認識結果を出力する認識結果出力部３１が含まれる。なお、音声分析部２０、音声片パタン格納部２１、単語辞書格納部２２、単語マッチング部２３による認識結果出力部３１による構成は、通常の音声認識装置と同様の構成をなす。
【００５３】
本実施の形態においては、入力音声は上記第１の実施の形態と同様のステップによって、未知発話検出部３０から、発話内容についての既知語らしさあるいは未知発話らしさに関する結果を得る。これと同時に、単語マッチング部２３から得られる単語ごとにゆう度が付与され、さらにゆう度の大きさでソートされた結果から、認識結果候補が得られる。
【００５４】
認識結果出力部３１では、上記認識結果候補と未知発話検出部で得られた未知発話らしさに関する結果とを勘案して、最適な応答を出力する。
【００５５】
すなわち認識結果出力部では、未知発話らしさが高い場合には、認識結果候補として得られた結果を全て棄却し、棄却されたことを意味する結果を出力する。また未知発話らしさが中程度に高い場合には、認識結果候補のうち上位から１つ以上の候補を出力するとともに、その結果が充分信頼できないものであることを意味する信号も付与する。さらに未知発話らしさが充分低い場合には、認識結果候補のうちから上位１個以上の候補を出力する。
【００５６】
以上のような構成により、例えばテレビ受像機において本実施の形態の音声認識装置を塔載して、番組選択を音声入力インタフェースによって行なうようにした場合次のような効果が得られる。すなわち従来であれば、放映されていない番組名や受信不可能な放送局名など、音声認識のための認識辞書に登録されていない単語を利用者が発声した場合、従来であれば単に認識誤りを起こし、利用者に何と発声すればよいか判らないといった不信感を与えていた。
【００５７】
しかし、本実施の形態の音声認識装置により、このような未知の単語を利用者が発声した場合には、そのような番組名あるいは放送局名が存在しないことを利用者に知らせることが可能となる。また、認識結果が曖昧である場合も、従来であれば曖昧なまま処理を続行し、利用者の望まない番組に映像を切り替えるといったことが起こり得たが、本実施の形態により認識結果が曖昧である旨利用者に通達し、確認手段を提示してから、番組を切り替えるといった処理が可能となり、音声認識に起こりがちな認識誤りによる問題を効率的に回避することが可能となる。
【００５８】
同様の効果は、テレビ受像機における音声認識装置のみならず、例えばカーナビゲーションシステムにおける目的地検索機能や、音声による自動電話番号案内システムなどでの応用が可能である。
【００５９】
【発明の効果】
以上のように本発明の第１の発明は、音声認識装置における未知発話の検出手法として、
単一の判定尺度のみではなく、複数の判定尺度を併用することにより、高い確度で未知発話を検出するという効果を有する。
【００６０】
また上記第２の発明は、認識結果の出力を、上記第１の発明による未知発話検出装置による結果を勘案して出力することにより、より利用者に使いよい音声認識インタフェースを提供するという効果を有する。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態における未知発話検出装置のブロック図
【図２】同実施形態における、音声片から単語パタンを構築する例を示す図
【図３】同実施形態における、単語マッチング部の出力する単語とゆう度のリストの出力例を示す図
【図４】同実施形態における、音素２ｇｒａｍ確率の例を示す図
【図５】同実施形態における、音素２ｇｒａｍ確率を元に計算される単語内の音素遷移ゆう度の例を示す図
【図６】同実施形態における、参照ゆう度の計算において遷移ゆう度を導入する効果を示す図
【図７】同実施形態における、系列間の編集距離を求める方法を示す図
【図８】同実施形態における、４つの未知発話に関する尺度から未知発話の判定を行うルールを示す式の図
【図９】同実施形態における、従来手法と比較した効果を示すための評価尺度を示す式の図
【図１０】同実施形態における、従来手法と比較した効果を示す実験結果を示す図
【図１１】本発明の第２の実施の形態における音声認識装置のブロック図
【符号の説明】
１，２０音声分析部
２，２１音声片パタン格納部
３，２２単語辞書格納部
４，２３単語マッチング部
５，２４遷移確率格納部
６，２５音声系列マッチング部
７，２６候補スコア差計算部
８，２７候補・音素系列類似度計算部
９，２８候補・音声系列スコア差計算部
１０，２９候補・音声系列・音素系列類似度計算部
１１，３０未知発話判定部
３１認識結果出力部
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speech recognition method in a speech recognition device.
[0002]
[Prior art]
Conventionally, a speech recognition device defines the recognition vocabulary it can accept, searches for the vocabulary entry most similar to the input speech, and outputs that entry as the recognition result. Consequently, even when the user utters something outside the recognition vocabulary, the most similar entry is still selected from the vocabulary, and the recognition result is wrong. A function is therefore needed that determines whether the user's utterance is a word in the recognition vocabulary or some other word, a hesitation, or the like, and that rejects such unknown utterances.
[0003]
One method of rejecting such unknown utterances builds a model of each word in the recognition vocabulary by concatenating HMMs (one type of model for representing speech patterns) of units shorter than words, called subwords. It searches for the word that gives the maximum likelihood for the input speech and takes this likelihood as the recognition likelihood (see, for example, Non-Patent Document 1). It also obtains the maximum likelihood for the input speech from among models formed by concatenating arbitrary subword HMMs, corresponding to arbitrary kana sequences, and takes this as the reference likelihood. Unknown utterances are detected and rejected by comparing the recognition likelihood and the reference likelihood obtained in this way.
[0004]
In such a method, however, the calculation of the reference likelihood places no constraint on how arbitrary subword HMMs are concatenated, so the likelihood of a sequence that is not Japanese-like is often selected as the maximum likelihood. As a result, comparing such a reference likelihood with the recognition likelihood does not reject unknown utterances well enough. Moreover, because every concatenation of subword HMMs is compared, the method requires large computational resources. To address these problems, one speech recognition method, for example, introduces the affinity of concatenation between subword HMMs as a transition probability, aiming to improve both the rejection accuracy for unknown utterances and the amount of processing (see, for example, Patent Document 1).
[0005]
[Patent Document 1]
JP-A-10-171489
[Non-patent document 1]
Watanabe et al., "Rejection of unknown utterance by likelihood correction using syllable recognition", IEICE Transactions, Vol. J75-D-II, No. 12 (1992)
[0006]
[Problems to be solved by the invention]
However, the conventional method described above has the following problems.
[0007]
That is, a model formed by concatenating subword HMMs can be regarded as a model that recognizes the input speech as a kana sequence. Even if transition probabilities are introduced as a constraint on the concatenation between subword HMMs, the kana sequence generated by this model still cannot be said to match that of the input speech sufficiently. In other words, the reference likelihood obtained from such a model is not sufficiently accurate, and the effect of rejecting unknown utterances is also insufficient.
[0008]
[Means for Solving the Problems]
In order to achieve the above object, the unknown utterance detection device for speech recognition according to the first invention comprises: speech analysis means that analyzes input speech and converts it into a sequence of feature parameters; recognition dictionary storage means that defines the vocabulary to be recognized;
speech model storage means that stores models of standard speech patterns; word level matching means that constructs models of the vocabulary defined in the recognition dictionary from the models stored in the speech model storage means and matches them against the input speech; subword transition probability storage means that defines transition probabilities between subwords; subword level matching means that concatenates the speech models stored in the speech model storage means, taking into account the subword transition probabilities stored in the subword transition probability storage means, and matches the result against the input speech;
unknown utterance scale calculating means that calculates a plurality of unknown utterance scales from the results of the word level matching means and the subword level matching means; and unknown utterance determining means that determines whether the input is an unknown utterance based on the plurality of scales calculated by the unknown utterance scale calculating means.
[0009]
According to the above configuration, in the determination of the unknown utterance, it is possible to determine the possibility that the input voice is an unknown utterance from a plurality of viewpoints, and it is possible to exhibit high unknown utterance detection performance.
[0010]
Further, in the unknown utterance detection device of the first invention, it is desirable that the scales calculated by the unknown utterance scale calculating means include a value calculated based on the difference between the word likelihood obtained by the word level matching means and the subword chain likelihood obtained by the subword level matching means.
[0011]
According to the above configuration, an effect of correcting the likelihood of the word based on the likelihood of the subword chain is obtained, and a high unknown utterance detection performance is obtained.
[0012]
Further, in the unknown utterance detection device of the first invention, it is desirable that the scales calculated by the unknown utterance scale calculating means include a value calculated based on the similarity between the acoustic features of the first-candidate word model obtained by the word level matching means and the acoustic features of the subword chain model obtained by the subword level matching means.
[0013]
According to the above configuration, it is possible to determine an unknown utterance by focusing on the similarity of the acoustic features of the two models, and high unknown utterance detection performance is obtained.
[0014]
Further, in the unknown utterance detection device of the first invention, it is desirable that the scales calculated by the unknown utterance scale calculating means include a value calculated based on the difference between the likelihood of the first-candidate word and the likelihood of a lower-ranked candidate word, both obtained by the word level matching means.
[0015]
According to the above configuration, it becomes possible to model the characteristic that, when an unknown utterance is recognized, the word level matching unit produces erroneous candidates with similar likelihoods, and high unknown utterance detection performance is obtained.
[0016]
Further, in the unknown utterance detection device of the first invention, it is desirable that the scales calculated by the unknown utterance scale calculating means include a value calculated based on the similarity between the acoustic features of the first-candidate word obtained by the word level matching means and the acoustic features of a lower-ranked candidate word.
[0017]
According to the above configuration, it is possible to determine an unknown utterance by focusing on the acoustic similarity between the models of the candidate words, and high unknown utterance detection performance is obtained.
[0018]
Further, the speech recognition device of the second invention is a speech recognition device that recognizes input speech by matching it against models corresponding to the vocabulary registered in a recognition dictionary, characterized in that it is equipped with the above unknown utterance detection device and outputs its recognition result taking into account the output of the unknown utterance detection device.
[0019]
According to the above configuration, the speech recognition device does not always output one of the vocabulary entries in the recognition dictionary regardless of the input speech; when the utterance content is not included in the recognition dictionary, it can tell the user so. This makes it possible to provide an interface that is easier for the user to understand in the various speech recognition interfaces equipped with the speech recognition device.
[0020]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0021]
(Embodiment 1)
FIG. 1 shows a block diagram of an unknown utterance detection device according to the present embodiment. In FIG. 1, reference numeral 1 denotes an acoustic analysis unit that A / D converts an input voice and converts the input voice into a time series of feature parameters. Reference numeral 2 denotes a speech piece pattern storage unit that stores speech pieces of standard speech used for matching with the feature parameters of the input speech.
[0022]
Here, speech pieces mean the set of VC patterns, each formed by joining the second half of a vowel section of speech to the first half of the consonant section that follows it, and CV patterns, each formed by joining the second half of a consonant section to the first half of the vowel section that follows it. However, the speech pieces may instead be a set of phonemes, each roughly corresponding to one letter of the alphabet when Japanese is written in Roman characters; a set of moras, each roughly corresponding to one hiragana character when Japanese is written in hiragana; a set of subwords, each meaning a chain of several moras; or a mixture of these sets.
[0023]
Reference numeral 3 in FIG. 1 denotes a word dictionary storage unit that stores the rules for synthesizing the word patterns of the speech recognition vocabulary by concatenating the speech pieces. Reference numeral 4 denotes a word matching unit that compares the input speech, expressed as a time series of feature parameters, with the synthesized word patterns and obtains, for each word, a likelihood corresponding to their similarity.
[0024]
Reference numeral 5 denotes a transition probability storage unit that stores transition probabilities, which express as continuous values how natural a connection is when speech pieces are joined arbitrarily. In the present embodiment, phoneme 2-gram probabilities are used as the transition probabilities. A phoneme 2-gram probability is the probability P(y|x) that phoneme y follows a preceding phoneme x, and is obtained in advance from a large amount of Japanese text data or the like. However, the transition probabilities may instead be mora 2-gram probabilities, subword 2-gram probabilities, or 2-gram probabilities over a mixture of these units, and may be 3-gram probabilities or the like rather than 2-gram probabilities.
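As an illustration of how phoneme 2-gram probabilities P(y|x) might be obtained from text converted to phoneme sequences, the following is a minimal sketch that uses plain relative frequencies. The function name `estimate_bigram_probs` and the toy corpus are hypothetical, and the source does not say whether any smoothing is applied, so none is used here.

```python
from collections import Counter

def estimate_bigram_probs(sequences):
    """Estimate P(y | x) by relative frequency over adjacent phoneme pairs.

    `sequences` is an iterable of phoneme lists. No smoothing is applied
    (an assumption; the source does not specify the estimation details).
    """
    pair_counts = Counter()      # counts of (x, y) pairs
    context_counts = Counter()   # counts of x appearing as the preceding phoneme
    for seq in sequences:
        for x, y in zip(seq, seq[1:]):
            pair_counts[(x, y)] += 1
            context_counts[x] += 1
    return {(x, y): c / context_counts[x] for (x, y), c in pair_counts.items()}

# Hypothetical toy corpus; each character stands for one phoneme.
corpus = [list("kobajasi"), list("hajasi"), list("kaki")]
probs = estimate_bigram_probs(corpus)
```

In this toy corpus /k/ is followed once each by /o/, /a/, and /i/, so each of those pairs receives probability 1/3.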
[0025]
Reference numeral 6 in FIG. 1 denotes a speech sequence matching unit that calculates, taking the above transition probabilities into account, the likelihood between the input speech expressed as a time series of feature parameters and the patterns formed by arbitrarily combining the above speech piece patterns, and takes the maximum likelihood obtained as the reference likelihood.
[0026]
Reference numeral 7 denotes an inter-candidate score difference calculation unit that, among the per-word likelihoods calculated by the word matching unit, takes the word with the highest value (first candidate) and the word with the next highest value (second candidate) and calculates the difference between their likelihoods normalized by the word length.
[0027]
Reference numeral 8 denotes a candidate phoneme sequence similarity calculation unit that calculates the distance between the phoneme sequence of the first candidate and the phoneme sequence of the second candidate in order to obtain the acoustic similarity between the two candidates.
[0028]
Reference numeral 9 denotes a candidate / speech sequence score difference calculation unit that calculates a difference between the likelihood of the first-place candidate and the reference likelihood calculated by the speech sequence matching unit by normalizing the difference with the word length.
[0029]
Reference numeral 10 denotes a candidate / speech sequence / phoneme sequence similarity calculation unit that calculates the acoustic similarity between the first candidate and the sequence chosen as optimal by the speech sequence matching unit, as the distance between their phoneme sequences.
[0030]
Reference numeral 11 denotes an unknown utterance determination unit that combines the values obtained by the inter-candidate score difference calculation unit, the candidate phoneme sequence similarity calculation unit, the candidate / speech sequence score difference calculation unit, and the candidate / speech sequence / phoneme sequence similarity calculation unit, and determines whether or not the input speech is an unknown utterance.
[0031]
In the present embodiment, the four scales above are used by the unknown utterance determination unit. Other scales can also be used together with them, such as the likelihood of each word candidate itself and the distribution of those likelihoods, the amount of variation of the local score within the word section, and the duration information of the phonemes that make up the word. As the method of determining an unknown utterance from multiple scales, the present embodiment uses a linear discriminant function obtained in advance from a large number of example recognition results. However, so-called learning machines such as neural networks, decision trees, and SVMs (support vector machines) are also effective.
[0032]
Next, the processing operation of unknown utterance detection in the present embodiment will be described. The input speech is first A/D-converted and analyzed in the speech analysis unit, and is converted into an LPC vector every 10 ms. The LPC vector is a parameter that represents the spectral envelope of the short-time spectrum of speech and is used as a parameter that expresses the phonological features of speech well. In an ordinary speech recognition method, the time series of LPC vectors obtained from the input speech at fixed intervals is used as the feature vectors of the input speech and is matched against word models prepared in advance to obtain, for each word, a score called the likelihood.
[0033]
In the present embodiment, word models are created from the speech piece patterns and the word dictionary. That is, the speech piece patterns stored in the speech piece pattern storage unit are concatenated into word patterns according to the speech piece concatenation rules for synthesizing word patterns stored in the word dictionary storage unit. FIG. 2 illustrates how the speech piece patterns used in the present embodiment, called CV/VC patterns, are concatenated to synthesize the word pattern "Hachinohe".
[0034]
Each speech piece pattern stores, in time series, parameters describing the standard distribution (assumed to be a normal distribution) of the LPC vectors of that speech piece. In recent years, a transition network called an HMM (hidden Markov model) has often been used as a model for speech recognition. When HMMs are used, word HMMs can likewise be constructed by storing HMMs that represent the speech piece patterns in the speech piece pattern storage unit 2 and defining rules for the transitions between HMMs in the word dictionary storage unit 3.
[0035]
The feature parameter time series of the input speech is compared with the word patterns in the word matching unit 4. Likelihoods are calculated for all the words defined in the word dictionary storage unit 3, or for the top candidate words that remain within a beam of a certain likelihood width, and the words are sorted in descending order of likelihood. FIG. 3 shows an example of the output of words sorted by likelihood.
[0036]
In parallel with this, the speech sequence matching unit 6 matches arbitrary sequences of speech pieces. Here, speech pieces are concatenated freely under certain constraints, and the speech piece sequence closest to the input speech and its likelihood are calculated. If no constraint at all were placed on the concatenation of speech pieces, the result would be a sequence quite unlike Japanese, and its likelihood could not be called a sufficiently meaningful value. Therefore, in the search for the optimal speech piece sequence, the phoneme 2-gram probabilities stored in the transition probability storage unit 5 are used as the cost of selecting and connecting speech pieces. The phoneme 2-gram probabilities are calculated in advance from a large amount of Japanese text, from the same task as the recognition task, that has been converted into phoneme sequences.
[0037]
FIG. 4 illustrates, as an example of phoneme 2-gram probabilities, the probabilities that each of the five vowels /a/, /i/, /u/, /e/, and /o/ follows the preceding phoneme /k/. In this example, the vowel most likely to follow the consonant /k/ is /a/, followed by /i/. The speech sequence matching unit 6 computes the weighted sum of the likelihood of the concatenated speech piece patterns and the transition likelihood, which is obtained as the sum of the logarithms of the phoneme 2-gram probabilities, and adopts the sequence for which this value is highest.
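The combination of pattern likelihood and transition likelihood described here can be sketched as follows. This is only an illustration under stated assumptions: the function names, the probability floor for unseen phoneme pairs, and the use of a single weight on the transition term are all hypothetical, since the source gives neither the weighting scheme nor the search procedure. The sketch simply scores a given list of candidate sequences rather than searching over all concatenations.

```python
import math

def transition_log_likelihood(phonemes, bigram_probs, floor=1e-6):
    """Sum of log P(y | x) over adjacent phoneme pairs.

    Unseen pairs receive the probability `floor` (an assumption made so
    that the logarithm is always defined).
    """
    return sum(math.log(bigram_probs.get((x, y), floor))
               for x, y in zip(phonemes, phonemes[1:]))

def total_score(pattern_ll, phonemes, bigram_probs, weight=1.0):
    """Pattern likelihood plus weighted transition likelihood (assumed form)."""
    return pattern_ll + weight * transition_log_likelihood(phonemes, bigram_probs)

def best_sequence(candidates, bigram_probs, weight=1.0):
    """Pick the (phonemes, pattern_ll) pair with the highest total score."""
    return max(candidates,
               key=lambda c: total_score(c[1], c[0], bigram_probs, weight))
```

With a weight of zero the choice falls back to the pattern likelihood alone, which corresponds to matching with no concatenation constraint.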
[0038]
FIG. 5 shows an example of calculating the transition likelihood for the sequence /kobajasi/ from the phoneme 2-gram probabilities. FIG. 6 shows, as an example of the effect of introducing the transition likelihood, the speech sequences and likelihoods output by the speech sequence matching unit for the input speech "Kobayashi". As the figure shows, when the transition likelihood is not used, the sequence /pobaeasii/, which is far from "Kobayashi", obtains a higher pattern likelihood than the more similar /obajasi/. When the total likelihood that takes the transition likelihood into account is used, /obajasi/, which is closer to "Kobayashi", is selected.
[0039]
As described above, the recognition likelihood of each word defined in the word dictionary storage unit, the reference likelihood, and the speech sequence that gives it are obtained. Next, the various scales for unknown utterance determination are calculated from these.
[0040]
First, the inter-candidate score difference calculation unit 7 takes the highest likelihood obtained by the word matching unit (the likelihood of the first candidate) and the next highest likelihood (the likelihood of the second candidate), and calculates their difference divided by the time length of the word. For example, for the result shown in FIG. 3, the difference between the likelihoods of "Kobayashi" and "Hayashi" is normalized by the word length to obtain 6.9.
[0041]
The candidate phoneme sequence similarity calculation unit 8 calculates the similarity between the phoneme sequence of the first candidate and that of the second candidate. Here, the similarity between sequences is the edit distance normalized by the sum of the lengths of the two sequences. The edit distance is the total cost of the cheapest way of editing one sequence into the other, where replacing one element (substitution), deleting one element (deletion), and adding one element (insertion) each cost 1. FIG. 7 illustrates how the edit distance is obtained for the two phoneme sequences /uenoeki/ and /jenokii/. Following this method, for a result such as that in FIG. 3, the candidate phoneme sequence similarity calculation unit divides the edit distance of 3 between "Kobayashi" (/kobajasi/) and "Hayashi" (/hajasi/) by 14, the sum of the two phoneme sequence lengths, and outputs 0.21. Although the present embodiment uses a value based on the edit distance as the inter-sequence similarity, it is also effective to use, for example, an inter-sequence distance that takes the acoustic similarity between phonemes into account.
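The normalized edit distance described above can be sketched with the standard dynamic-programming recurrence, using unit costs for substitution, deletion, and insertion as the text specifies. The function names are hypothetical, and each character of a string is treated as one phoneme, which holds for the romanized sequences in this example.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, deletions, and insertions (cost 1 each)
    needed to turn sequence `a` into sequence `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute, or match at cost 0
        prev = curr
    return prev[-1]

def normalized_edit_distance(a, b):
    """Edit distance divided by the sum of the two sequence lengths."""
    total = len(a) + len(b)
    return edit_distance(a, b) / total if total else 0.0
```

For /kobajasi/ and /hajasi/ this gives an edit distance of 3 and a normalized value of 3/14, which rounds to the 0.21 quoted in the text.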
[0042]
The candidate / speech sequence score difference calculation unit calculates the difference between the likelihood of the first candidate obtained by the word matching unit and the reference likelihood obtained by the speech sequence matching unit, normalized by the time length of the word. For example, in the examples shown in FIGS. 3 and 6, the difference between the recognition likelihood 2055 of the first candidate and the reference likelihood 2014 is normalized by the word time length to obtain 0.87.
[0043]
The candidate / speech sequence / phoneme sequence similarity calculation unit calculates the normalized edit distance between the phoneme sequence of the first candidate obtained by the word matching unit and the optimal phoneme sequence obtained by the speech sequence matching unit. For example, in the examples shown in FIGS. 3 and 6, the edit distance between the phoneme sequences /kobajasi/ and /obajasi/ is normalized by the sum of the sequence lengths to obtain 0.07.
[0044]
For the four scales obtained as described above, the unregistered word determination unit computes an appropriately weighted sum of the scales and compares it with a threshold to determine whether or not the utterance is an unregistered word. That is, the determination is performed according to the equation shown in FIG. 8. In FIG. 8, CM1 to CM4 denote the inter-candidate score difference, the candidate phoneme sequence similarity, the candidate / speech sequence score difference, and the candidate / speech sequence / phoneme sequence similarity, respectively; each scale has its own weight, and θ denotes the threshold.
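The weighted-sum rule can be sketched as below. The equation of FIG. 8 is not reproduced in this text, so the direction of the comparison (a sum above θ means "unknown") is an assumption, as are the function name, the weights, and the threshold. The four scale values are the ones quoted in the worked examples of the description; because the weights are invented, the outcome says nothing about how the actual device would judge that utterance.

```python
def is_unknown_utterance(scales, weights, theta):
    """Return True when the weighted sum of the scales exceeds theta.

    Assumption: a larger weighted sum means "more likely unknown";
    the source text does not state the direction of the comparison.
    """
    if len(scales) != len(weights):
        raise ValueError("scales and weights must have the same length")
    score = sum(w * cm for w, cm in zip(weights, scales))
    return score > theta


# CM1..CM4 as quoted in the description's examples.
cm = [6.9, 0.21, 0.87, 0.07]
# Hypothetical weights and threshold, chosen only so the sketch runs.
w = [-0.1, 1.0, 0.5, 2.0]
decision = is_unknown_utterance(cm, w, theta=0.0)
```

Raising or lowering `theta` moves the operating point between rejecting more utterances and accepting more of them.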
[0045]
The weight for each scale used here is obtained in advance by a statistical method. That is, the above four scales are computed for a large number of registered word utterances and unregistered word utterances, and the relationship between the four scales and the two classes of utterance is analyzed by linear discriminant analysis to find the weights.
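One way such weights could be estimated is a two-class Fisher linear discriminant, sketched below in plain Python. The text only says that a linear discriminant method is used, so the specific formulation, the small ridge term, the midpoint threshold, and all sample data are assumptions made for illustration.

```python
def mean(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter(rows, mu):
    """Within-class scatter matrix: sum of outer products of deviations."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for r in rows:
        diff = [x - m for x, m in zip(r, mu)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def project(weights, v):
    """Weighted sum of the scales in v."""
    return sum(w * x for w, x in zip(weights, v))


def fit_lda(registered, unknown, ridge=1e-6):
    """Return (weights, theta) so that unknown utterances project above theta."""
    mu_r, mu_u = mean(registered), mean(unknown)
    s_r, s_u = scatter(registered, mu_r), scatter(unknown, mu_u)
    s_w = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(s_r, s_u)]
    for i in range(len(s_w)):
        s_w[i][i] += ridge              # assumed regularizer to keep s_w invertible
    weights = solve(s_w, [u - r for u, r in zip(mu_u, mu_r)])
    theta = 0.5 * (project(weights, mu_r) + project(weights, mu_u))
    return weights, theta


# Invented toy data: each row is (CM1, CM2, CM3, CM4) for one utterance.
registered = [[7.0, 0.20, 1.0, 0.05], [6.5, 0.25, 0.9, 0.07], [8.0, 0.15, 1.2, 0.04],
              [7.5, 0.22, 0.8, 0.06], [6.8, 0.18, 1.1, 0.08]]
unknown = [[2.0, 0.40, -0.5, 0.30], [1.5, 0.45, -0.2, 0.35], [2.5, 0.38, -0.8, 0.28],
           [1.0, 0.50, -0.4, 0.40], [2.2, 0.42, -0.6, 0.25]]
weights, theta = fit_lda(registered, unknown)
```

The resulting weights and threshold plug directly into a weighted-sum rule in which a projection above `theta` is treated as an unknown utterance.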
[0046]
(Effect)
Next, the effect of the unknown utterance detection method of the present embodiment is shown experimentally in comparison with the conventional method.
[0047]
Generally, a detection problem of this kind admits two types of error: a missed-detection error, in which something that should be detected is not, and a false-detection error, in which something that should not be detected is. These two errors are in a trade-off relationship, and it is known that reducing one increases the other. Methods are therefore compared using the two scales shown in FIG. 9. Here, a high unknown utterance recall rate means that there are few missed-detection errors, and a high unknown utterance matching rate (precision) means that there are few false-detection errors. It is desirable that both be high.
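The two scales can be sketched using the standard definitions of recall and precision. The expressions of FIG. 9 are not reproduced in this text, so their agreement with these definitions is an assumption; the function name and the sample labels are illustrative.

```python
def unknown_utterance_recall_precision(is_unknown, detected):
    """Recall and matching rate (precision) of unknown utterance detection.

    is_unknown: True where the utterance really is unknown.
    detected:   True where the detector flagged the utterance as unknown.
    """
    pairs = list(zip(is_unknown, detected))
    tp = sum(1 for u, d in pairs if u and d)        # correctly detected
    fn = sum(1 for u, d in pairs if u and not d)    # missed detections
    fp = sum(1 for u, d in pairs if not u and d)    # false detections
    recall = tp / (tp + fn) if tp + fn else 0.0     # 0.0 by convention when undefined
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Invented labels: three unknown and three registered utterances.
truth = [True, True, True, False, False, False]
flags = [True, True, False, True, False, False]
recall, precision = unknown_utterance_recall_precision(truth, flags)
```

Here one unknown utterance is missed and one registered utterance is falsely flagged, so both measures come out at two thirds.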
[0048]
In the experiment described below, the changes in the unknown utterance recall rate and the unknown utterance matching rate are examined for each method as the determination threshold is raised and lowered, using 100 unknown names. The following three methods are compared.
[0049]
(1) Unknown utterance judgment based only on the difference between the recognition likelihood and the reference likelihood
(There is no restriction on the connection of voice segments)
(2) Unknown utterance judgment based only on the difference between the recognition likelihood and the reference likelihood
(Introduce transition probability for concatenation of voice segments)
(3) Unknown utterance determination using the four scales described in the above embodiment together
FIG. 10 shows the result. In FIG. 10, the unknown utterance matching rate is plotted on the horizontal axis and the unknown utterance recall rate on the vertical axis. Higher is better for both, so the closer a curve lies to the upper right, the better the detection performance. The result shows that determining unknown utterances using the four scales together gives higher detection performance than the conventional approach of using only the recognition likelihood and the reference likelihood.
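A curve of the kind plotted in FIG. 10 can be traced by sweeping the determination threshold, as sketched below. The scores, labels, and thresholds are invented, and the rule that a score above the threshold means "unknown" is an assumption; nothing here reproduces the experimental data.

```python
def recall_precision_curve(scores, is_unknown, thresholds):
    """Return (threshold, recall, matching rate) for each threshold.

    An utterance is flagged as unknown when its score exceeds the
    threshold (assumed direction).
    """
    n_unknown = sum(is_unknown)
    points = []
    for theta in thresholds:
        flagged = [s > theta for s in scores]
        tp = sum(1 for f, u in zip(flagged, is_unknown) if f and u)
        n_flagged = sum(flagged)
        recall = tp / n_unknown if n_unknown else 0.0
        precision = tp / n_flagged if n_flagged else 0.0  # 0.0 by convention
        points.append((theta, recall, precision))
    return points


# Invented detector scores and ground-truth labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
curve = recall_precision_curve(scores, labels, thresholds=[0.2, 0.5, 0.7])
```

In this toy data, raising the threshold from 0.2 to 0.7 lowers the recall rate from 1.0 to about 0.67 while raising the matching rate from 0.6 to 1.0, which is the trade-off the comparison relies on.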
[0050]
(Embodiment 2)
The present embodiment relates to a speech recognition device that includes the unknown utterance detection unit of the first embodiment. By using the detection result for unknown utterances together with the conventional voice recognition result, it returns an optimal response and thereby provides an easier-to-use voice recognition interface.
[0051]
FIG. 11 is a block diagram of a speech recognition device equipped with the unknown utterance detection device shown in FIG. 1. The components that constitute the unknown utterance detection device, namely the speech analysis unit 20, the speech piece pattern storage unit 21, the word dictionary storage unit 22, the word matching unit 23, the transition probability storage unit 24, the speech sequence matching unit 25, the candidate score difference calculation unit 26, the candidate / phoneme sequence similarity calculator 27, the candidate / speech sequence score difference calculator 28, the candidate / speech sequence / phoneme sequence similarity calculator 29, and the unknown utterance determiner 30, have the same configuration as, respectively, the speech analysis unit 1, the speech piece pattern storage unit 2, the word dictionary storage unit 3, the word matching unit 4, the transition probability storage unit 5, the speech sequence matching unit 6, the candidate score difference calculation unit 7, the candidate / phoneme sequence similarity calculation unit 8, the candidate / speech sequence score difference calculation unit 9, the candidate / speech sequence / phoneme sequence similarity calculation unit 10, and the unknown utterance determination unit 11 of the first embodiment.
[0052]
However, the unknown utterance detection unit 30 outputs a continuous value indicating how likely the utterance is to be unknown, instead of simply outputting a yes-or-no determination. Further, the present embodiment includes a recognition result output unit 31 that outputs a recognition result in consideration of the results of both the word matching unit 23 and the unknown utterance detection unit 30. The configuration comprising the speech analysis unit 20, the speech piece pattern storage unit 21, the word dictionary storage unit 22, the word matching unit 23, and the recognition result output unit 31 is the same as that of an ordinary speech recognition device.
[0053]
In the present embodiment, for the input voice, the unknown utterance likelihood of the utterance content is obtained from the unknown utterance detection unit 30 by the same steps as in the first embodiment. At the same time, the likelihood of each word is obtained from the word matching unit 23, and the recognition result candidates are obtained by sorting the words by likelihood.
[0054]
The recognition result output unit 31 outputs an optimal response in consideration of the recognition result candidate and the result regarding the unknown utterance likelihood obtained by the unknown utterance detection unit.
[0055]
That is, if the unknown utterance likelihood is high, the recognition result output unit rejects all recognition result candidates and outputs a notice that the input was rejected. If it is moderately high, the unit outputs one or more of the top candidates together with a signal indicating that the result is not sufficiently reliable. If it is sufficiently low, the unit outputs one or more of the top candidates.
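The three-way behaviour of the recognition result output unit can be sketched as follows. The text gives no numeric boundaries, so the two thresholds, the assumption that the unknown utterance likelihood lies between 0 and 1, and all names in the sketch are hypothetical.

```python
from enum import Enum


class Response(Enum):
    REJECT = "reject"                  # all candidates discarded
    LOW_CONFIDENCE = "low_confidence"  # candidates returned with a warning
    ACCEPT = "accept"                  # candidates returned as reliable


def choose_response(candidates, unknown_score, high=0.7, low=0.3, n_best=1):
    """Pick a response from the sorted candidates and the unknown-utterance score.

    candidates:    list of (word, likelihood) pairs from word matching.
    unknown_score: continuous unknown-utterance likelihood (assumed in [0, 1]).
    high, low:     hypothetical thresholds separating the three cases.
    """
    if unknown_score >= high:
        return Response.REJECT, []
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    top = [word for word, _ in ranked[:n_best]]
    if unknown_score >= low:
        return Response.LOW_CONFIDENCE, top
    return Response.ACCEPT, top


# Invented candidate list for illustration.
cands = [("hayashi", 1980.0), ("kobayashi", 2055.0)]
status, words = choose_response(cands, unknown_score=0.5)
```

An application such as the program selector described in this embodiment could use the `LOW_CONFIDENCE` status to ask the user for confirmation before acting.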
[0056]
With the above configuration, the following effects can be obtained when, for example, the voice recognition device of the present embodiment is mounted on a television receiver and programs are selected through the voice input interface. Conventionally, when a user uttered a word not registered in the recognition dictionary, such as the name of a program that is not being broadcast or of a broadcast station that cannot be received, the device simply produced a recognition error, leaving the user distrustful and unsure of what to say.
[0057]
In contrast, when the user utters such an unknown word, the voice recognition device of the present embodiment can inform the user that no such program name or broadcast station name exists. Also, when the recognition result is ambiguous, a conventional device may carry on regardless and switch to a program the user did not want. The present device can instead notify the user of the ambiguity, present a means of confirmation, and only then switch the program. In this way, problems caused by the recognition errors that often occur in voice recognition can be avoided efficiently.
[0058]
The same effect can be obtained not only with a voice recognition device in a television receiver but also with a destination search function in a car navigation system, an automatic voice-operated telephone number guidance system, and the like.
[0059]
[Effect of the invention]
As described above, the first invention, in detecting unknown utterances in a speech recognition device, uses not just a single decision scale but a plurality of decision scales together, and thereby has the effect of detecting unknown utterances with high accuracy.
[0060]
Further, the second invention has the effect of providing a user-friendly voice recognition interface by outputting the recognition result in consideration of the result of the unknown utterance detection device according to the first invention.
[Brief description of the drawings]
FIG. 1 is a block diagram of an unknown utterance detection device according to a first embodiment of the present invention.
FIG. 2 is a diagram showing an example of constructing a word pattern from a speech piece in the embodiment.
FIG. 3 is a diagram showing an example of the list of words and likelihoods output by the word matching unit in the embodiment.
FIG. 4 is a diagram showing an example of phoneme 2-gram probabilities in the embodiment.
FIG. 5 is a diagram showing an example of in-word phoneme transition likelihoods calculated from the phoneme 2-gram probabilities in the embodiment.
FIG. 6 is a diagram showing the effect of introducing the transition likelihood into the calculation of the reference likelihood in the embodiment.
FIG. 7 is a diagram showing a method for obtaining the edit distance between sequences in the embodiment.
FIG. 8 is a diagram of the expression giving the rule for determining an unknown utterance from the four unknown utterance scales in the embodiment.
FIG. 9 is a diagram of the expressions giving the evaluation scales used to show the effect in comparison with the conventional method in the embodiment.
FIG. 10 is a diagram showing experimental results that demonstrate the effect of the embodiment in comparison with the conventional method.
FIG. 11 is a block diagram of a speech recognition device according to a second embodiment of the present invention.
[Explanation of symbols]
1, 20 Voice analysis unit
2, 21 Speech piece pattern storage unit
3, 22 Word dictionary storage unit
4, 23 Word matching unit
5, 24 Transition probability storage unit
6, 25 Voice sequence matching unit
7, 26 Candidate score difference calculator
8, 27 Candidate / phoneme sequence similarity calculator
9, 28 Candidate / speech sequence score difference calculator
10, 29 Candidate / speech sequence / phoneme sequence similarity calculator
11, 30 Unknown utterance determination unit
31 Recognition result output unit

Claims

1. An unknown utterance detection device for speech recognition, comprising:
voice analysis means for analyzing input voice and converting it into a sequence of feature parameters;
recognition dictionary storage means for defining a vocabulary to be recognized;
voice model storage means for storing models of standard voice patterns;
word level matching means for constructing models of the vocabulary defined in the recognition dictionary from the models stored in the voice model storage means, and collating them with the input voice;
subword transition probability storage means for defining transition probabilities between subwords;
subword level matching means for connecting the voice models stored in the voice model storage means in consideration of the subword transition probabilities stored in the subword transition probability storage means, and collating the result with the input voice;
unknown utterance scale calculation means for calculating a plurality of unknown utterance scales from the results of the word level matching means and the subword level matching means; and
unknown utterance determination means for determining an unknown utterance based on the plurality of scales calculated by the unknown utterance scale calculation means.

2. The unknown utterance detection device according to claim 1, wherein the scales calculated by the unknown utterance scale calculation means include a value calculated based on the difference between the likelihood of a word obtained by the word level matching means and the subword chain likelihood obtained by the subword level matching means.

3. The unknown utterance detection device according to claim 1, wherein the scales calculated by the unknown utterance scale calculation means include a value calculated based on the similarity between the acoustic features of the first-candidate word obtained by the word level matching means and the acoustic features of the subword chain obtained by the subword level matching means.

4. The unknown utterance detection device according to claim 1, wherein the scales calculated by the unknown utterance scale calculation means include a value calculated based on the difference between the likelihood of the first-candidate word and the likelihood of a lower-ranked candidate word obtained by the word level matching means.

5. The unknown utterance detection device according to claim 1, wherein the scales calculated by the unknown utterance scale calculation means include a value calculated based on the similarity between the acoustic features of the first-candidate word and the acoustic features of a lower-ranked candidate word obtained by the word level matching means.

6. A speech recognition device for recognizing input speech by collating it with models corresponding to the vocabulary registered in a recognition dictionary, wherein the unknown utterance detection device according to any one of claims 1 to 5 is mounted, and a recognition result is output in consideration of the output result of the unknown utterance detection device.