JP3798530B2

JP3798530B2 - Speech recognition apparatus and speech recognition method

Info

Publication number: JP3798530B2
Application number: JP25620197A
Authority: JP
Inventors: 浩志古山; 郁夫井上
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1997-09-05
Filing date: 1997-09-05
Publication date: 2006-07-19
Anticipated expiration: 2017-09-05
Also published as: JPH1185190A

Description

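[0001]
[Technical Field of the Invention]
The present invention relates to a speech recognition apparatus that performs speech recognition using an audio signal together with a video signal that includes the speaker's lips, and to a speech recognition method therefor. In particular, it aims to improve the recognition rate.
[0002]
[Prior Art]
A speech recognition apparatus that uses not only the audio signal but also video including the speaker's lips is described in "An Isolated Word Speech Recognition Using Fusion of Auditory and Visual Information", reported by Sintani et al. (IEICE Trans. Fundamentals, Vol. E79-A, No. 6, pp. 777-783 (1996)). In speech recognition that uses the audio signal alone, recognition accuracy drops sharply when noise is mixed in. When lip video is used in combination, the drop in recognition accuracy can be mitigated.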
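[0003]
FIG. 6 shows the schematic configuration of this conventional speech recognition apparatus. The apparatus comprises: a video input unit 1, such as a video camera, that inputs video including the speaker's lip region; an audio input unit 3, such as a microphone, that inputs the speech uttered by the speaker; a video processing unit 2 that obtains the similarity between the input lip video and video standard data of the lip region uttering various words, and outputs the similarity for each word contained in the video standard data; an audio processing unit 4 that obtains the similarity between the input speech and audio standard data of various words, and outputs the similarity for each word contained in the audio standard data; and a speech recognition unit 5 that determines the word with the highest similarity from the similarities input from the video processing unit 2 and the audio processing unit 4, and outputs it as the recognition result.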
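[0004]
The video processing unit 2 of this apparatus extracts from the input video, as feature quantities, for example the vertical and horizontal lengths of the lip region and the ratio between the vertical and horizontal lengths. Then, among the video standard data of a plurality of words prepared in advance for similarity calculation, it calculates the similarity (R_{i,Image}) between the feature quantities corresponding to the i-th word and the feature quantities extracted from the input video, using a hidden Markov model (hereinafter abbreviated as HMM), a well-known pattern recognition technique, and outputs the result.
[0005]
The audio processing unit 4 extracts feature quantities from the input speech by cepstrum analysis, calculates by HMM the similarity (R_{i,Sound}) between the feature quantities corresponding to the i-th word in the audio standard data of a plurality of words prepared in advance and the feature quantities extracted from the input speech, and outputs the result.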
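[0006]
From the output (R_{i,Image}) of the video processing unit 2 and the output (R_{i,Sound}) of the audio processing unit 4, the speech recognition unit 5 calculates the combined video and audio similarity (R_{i,Total}) for the i-th word by the following Equation (1).
R_{i,Total} = α · R_{i,Image} + (1 − α) · R_{i,Sound} ... (Equation 1)
Here, α (0 ≤ α ≤ 1) is a coefficient set in advance so as to maximize the recognition rate, using video and audio data sampled for coefficient determination (separately from the data used for similarity calculation).
[0007]
The speech recognition unit 5 obtains the similarity R_{i,Total} for every word contained in the video standard data and the audio standard data, and outputs the word for which R_{i,Total} is largest as the recognition result.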
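The conventional fusion of Equation (1) can be sketched in a few lines. This is only an illustration: the function names, the example words, the similarity values, and the choice α = 0.25 are all hypothetical and do not come from the cited paper or from this patent.

```python
def fuse_fixed_weight(r_image, r_sound, alpha):
    """Equation (1): one weight alpha shared by every word."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {w: alpha * r_image[w] + (1.0 - alpha) * r_sound[w] for w in r_image}


def recognize(total):
    """Return the word whose combined similarity is largest."""
    return max(total, key=total.get)


# Hypothetical similarities for two words.
r_image = {"north": 0.30, "south": 0.70}
r_sound = {"north": 0.80, "south": 0.50}

total = fuse_fixed_weight(r_image, r_sound, alpha=0.25)
best = recognize(total)
```

Note that the same α applies to every word, whichever modality happens to distinguish that word better.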
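[0008]
Because this speech recognition apparatus uses a video signal containing lip information together with the audio signal, it avoids a sharp drop in the recognition rate even when noise is present. This extends the field of application of speech recognition apparatuses to equipment used in noisy environments.
[0009]
[Problems to Be Solved by the Invention]
Applying a speech recognition apparatus to a car navigation system, so that commands can be given to the system by voice, is under study. To incorporate the apparatus into equipment used in such a noisy environment, however, its recognition rate under noise must be raised further.
[0010]
The present invention responds to this need. Its object is to provide a speech recognition apparatus capable of achieving a high recognition rate in speech recognition under noisy environments, and to provide a speech recognition method therefor.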
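[0011]
[Means for Solving the Problems]
Accordingly, the speech recognition apparatus of the present invention is provided with: video input means for inputting video data of a speaker including the lips; audio input means for inputting audio data of the speaker; video processing means for calculating the similarity between the input video data and video standard data of lips uttering each monosyllable, and outputting each monosyllable of the video standard data together with its similarity; audio processing means for calculating the similarity between the input audio data and audio standard data of speech uttering each monosyllable, and outputting each monosyllable of the audio standard data together with its similarity; and speech recognition means for identifying the monosyllable with the largest overall similarity using the similarities output from the video processing means and the audio processing means. The apparatus is further provided with video standard data correct answer rate holding means and audio standard data correct answer rate holding means. For the former, video data for various monosyllables is input to the video input means in advance; for each input, the monosyllable of the video standard data with the largest similarity output from the video processing means is tallied as a candidate monosyllable; and, calculated from that tally, the proportion of the total number of outputs of a given candidate monosyllable in which the candidate matches the monosyllable of the input video data is held as the correct answer rate data for that monosyllable of the video standard data. For the latter, audio data for various monosyllables is input to the audio input means in advance; for each input, the monosyllable of the audio standard data with the largest similarity output from the audio processing means is tallied as a candidate monosyllable; and, calculated from that tally, the proportion of the total number of outputs of a given candidate monosyllable in which the candidate matches the monosyllable of the input audio data is held as the correct answer rate data for that monosyllable of the audio standard data. The speech recognition means obtains, as the overall similarity for each monosyllable, the sum of the product of the similarity output from the video processing means and the correct answer rate read from the video standard data correct answer rate holding means, and the product of the similarity output from the audio processing means and the correct answer rate read from the audio standard data correct answer rate holding means.
[0012]
In this apparatus, when the monosyllable to be identified is of a kind that is identified more accurately from lip video, the video-based identification result contributes strongly to the final decision. When the monosyllable is one that is identified more accurately from speech, the audio-based identification result contributes strongly to the final decision. Highly reliable speech recognition therefore becomes possible even in noisy environments.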
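[0015]
[Embodiments of the Invention]
The invention according to claim 1 is a speech recognition apparatus provided with: video input means for inputting video data of a speaker including the lips; audio input means for inputting audio data of the speaker; video processing means for calculating the similarity between the input video data and video standard data of lips uttering each monosyllable, and outputting each monosyllable of the video standard data together with its similarity; audio processing means for calculating the similarity between the input audio data and audio standard data of speech uttering each monosyllable, and outputting each monosyllable of the audio standard data together with its similarity; video standard data correct answer rate holding means for which video data for various monosyllables is input to the video input means in advance, the monosyllable of the video standard data with the largest similarity output from the video processing means for each input is tallied as a candidate monosyllable, and, calculated from that tally, the proportion of the total number of outputs of a given candidate monosyllable in which the candidate matches the monosyllable of the input video data is held as the correct answer rate data for that monosyllable of the video standard data; audio standard data correct answer rate holding means for which audio data for various monosyllables is input to the audio input means in advance, the monosyllable of the audio standard data with the largest similarity output from the audio processing means for each input is tallied as a candidate monosyllable, and, calculated from that tally, the proportion of the total number of outputs of a given candidate monosyllable in which the candidate matches the monosyllable of the input audio data is held as the correct answer rate data for that monosyllable of the audio standard data; and speech recognition means which obtains, as the overall similarity for each monosyllable, the sum of the product of the similarity output from the video processing means and the correct answer rate read from the video standard data correct answer rate holding means, and the product of the similarity output from the audio processing means and the correct answer rate read from the audio standard data correct answer rate holding means, and which identifies the monosyllable with the largest overall similarity. When the monosyllable to be identified is of a kind that is identified more accurately from the shape and movement of the lips, the contribution of video-based identification is increased. When the monosyllable is one that is identified more accurately from speech, the contribution of audio-based identification is increased. This enables highly reliable speech recognition even in noisy environments.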
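[0020]
The invention according to claim 2 is one in which the audio standard data correct answer rate holding means holds, as the correct answer rate data, a plurality of kinds of correct answer rate data corresponding to signal-to-noise ratios, and the speech recognition means, when obtaining the overall similarity for each monosyllable, reads from the audio standard data correct answer rate holding means the correct answer rate data that corresponds to the signal-to-noise ratio of the input audio data. This addresses the fact that the reliability of identification using the audio signal varies with the signal-to-noise ratio of the input audio signal.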
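[0024]
The invention according to claim 3 is a speech recognition apparatus comprising: video input means for inputting video data of a speaker including the lips; audio input means for inputting audio data of the speaker; video processing means in which monosyllables sharing a common articulation mechanism are grouped in advance and which, for each group, calculates the similarity between the input video data and the video standard data of lips uttering each monosyllable of the group, sets the largest of these similarities as the similarity of the group, and outputs the identification information of the group together with the similarity of the group; audio processing means which, for each group, calculates the similarity between the input audio data and the audio standard data of speech uttering each monosyllable of the group, sets the largest of these similarities as the similarity of the group, and outputs the identification information of the group together with the similarity of the group; video standard data correct answer rate holding means for which video data for various monosyllables is input to the video input means in advance, the total number of times each group is output from the video processing means in response to the inputs is tallied, and, calculated from that tally, the proportion of that total accounted for by correct outputs (those for which the input monosyllable belongs to the group) is held as the correct answer rate data for the group; audio standard data correct answer rate holding means for which audio data for various monosyllables is input to the audio input means in advance, the total number of times each group is output from the audio processing means in response to the inputs is tallied, and, calculated from that tally, the proportion of that total accounted for by correct outputs (those for which the input monosyllable belongs to the group) is held as the correct answer rate data for the group; and speech recognition means which obtains, as the overall similarity for each group, the sum of the product of the similarity output from the video processing means and the correct answer rate read from the video standard data correct answer rate holding means, and the product of the similarity output from the audio processing means and the correct answer rate read from the audio standard data correct answer rate holding means, and which identifies the group with the largest overall similarity. The group to which the speech to be recognized belongs can thus be identified with high accuracy, which makes speech identification processing more efficient.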
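[0038]
Embodiments of the present invention are described below with reference to the drawings.
[0039]
(First Embodiment)
As shown in FIG. 1, the speech recognition apparatus of the first embodiment comprises: a video input unit 1, such as a video camera, to which video including the speaker's lip region is input; an audio input unit 3, such as a microphone, to which the speech uttered by the speaker is input; a video processing unit 2 that obtains the similarity between the input lip video and video standard data of the lip region uttering various monosyllables, and outputs the similarity for each monosyllable contained in the video standard data; an audio processing unit 4 that obtains the similarity between the input speech and audio standard data of various monosyllables, and outputs the similarity for each monosyllable contained in the audio standard data; a video-processing correct answer rate data holding unit 6 that holds correct answer rate data (that is, data representing the probability that the monosyllable is correct) for each monosyllable recognized from the lip video; an audio-processing correct answer rate data holding unit 7 that holds correct answer rate data for each monosyllable recognized from the speech; and a speech recognition unit 5 that obtains an overall similarity based on the similarities input from the video processing unit 2 and the audio processing unit 4 and on the correct answer rate data read from the holding units 6 and 7, and outputs the monosyllable with the highest overall similarity as the recognition result.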
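[0040]
FIG. 2 is a diagram for explaining the correct answer rate in the present invention. It shows, for example, which monosyllable is output from the audio processing unit 4 as the "monosyllable with the largest similarity" (called the candidate monosyllable), and in what proportion, when monosyllables are input to the audio processing unit 4. For simplicity, the input monosyllables here are the five monosyllables "a" (あ), "i" (い), "u" (う), "e" (え), and "o" (お).
[0041]
This occurrence frequency data for candidate monosyllables consists of actual measured values. They are obtained by preparing samples of monosyllable speech, separate from the audio standard data used for similarity calculation, and inputting them to the audio processing unit 4.
[0042]
For example, when the monosyllable "a" was input to the audio processing unit 4 100 times, "a" became the candidate monosyllable 97 times, "u" once, and "o" once (in the remaining one case, no candidate monosyllable could be determined).
[0043]
From the occurrence frequency data obtained in this way, the correct answer rate is calculated as the proportion of cases in which a particular output monosyllable is correct, that is, the ratio of the number of outputs of that monosyllable that match the input monosyllable to the total number of outputs of that monosyllable.
[0044]
For example, the audio processing unit output "a" 105 times, and in 97 of those cases the input monosyllable was "a". In the example of FIG. 2, the correct answer rate of the candidate monosyllable "a" is therefore 97/105 ≈ 0.924.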
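The tally described in paragraphs [0042] to [0044] can be sketched as follows. Only the counts for the input "a" (97, 1, 1) and the total of 105 outputs of "a" come from the text; the remaining rows are hypothetical filler chosen so that the "a" column sums to 105, and the function name is likewise an assumption. Trials for which no candidate could be determined are simply left out of the tally.

```python
from collections import defaultdict


def correct_answer_rates(counts):
    """counts[(spoken, candidate)] = number of trials.

    Returns, per candidate monosyllable, the fraction of its outputs
    for which the spoken monosyllable was the same as the candidate.
    """
    output_total = defaultdict(int)
    output_correct = defaultdict(int)
    for (spoken, candidate), n in counts.items():
        output_total[candidate] += n
        if spoken == candidate:
            output_correct[candidate] += n
    return {c: output_correct[c] / output_total[c] for c in output_total}


counts = {
    ("a", "a"): 97, ("a", "u"): 1, ("a", "o"): 1,  # from paragraph [0042]
    ("o", "a"): 5, ("e", "a"): 3,                  # hypothetical: "a" output 105 times in total
    ("o", "o"): 95, ("e", "e"): 97,                # hypothetical
}
rates = correct_answer_rates(counts)
rate_a = round(rates["a"], 3)  # 97 / 105
```

The rate is normalised per output (column of the confusion table), not per input, which is what makes it usable as a reliability weight for a given recognizer output.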
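[0045]
In this way, audio data for all monosyllables to be recognized is input to the audio processing unit 4, the correct answer rates for these monosyllables are calculated, and they are stored in the audio-processing correct answer rate data holding unit 7 as correct answer rate data (S_{i,Sound}). Likewise for video, video data including the lip region of a speaker uttering all monosyllables to be recognized is input to the video processing unit 2, and the correct answer rate data (S_{i,Image}) for these monosyllables is calculated and stored in the video-processing correct answer rate data holding unit 6.
[0046]
It is desirable that the video data and audio data used to calculate the correct answer rate data be provided by the person whose speech is to be recognized, or by a plurality of speakers, and that the number of input samples vary little from one monosyllable to another.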
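[0047]
The video processing unit 2 of this apparatus extracts from the video input from the video input unit 1, as feature quantities, for example the vertical and horizontal lengths of the lip region and the ratio between the vertical and horizontal lengths. Then, among the video standard data of a plurality of monosyllables prepared in advance for similarity calculation, it calculates by HMM the similarity (R_{i,Image}) between the feature quantities corresponding to the i-th monosyllable and the feature quantities extracted from the input video, and outputs the result.
[0048]
The audio processing unit 4 extracts feature quantities by cepstrum analysis from the speech input from the audio input unit 3, calculates by HMM the similarity (R_{i,Sound}) between the feature quantities corresponding to the i-th monosyllable in the audio standard data of a plurality of monosyllables prepared in advance and the feature quantities extracted from the input speech, and outputs the result.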
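[0049]
From the output (R_{i,Image}) of the video processing unit 2 and the output (R_{i,Sound}) of the audio processing unit 4, together with the correct answer rate data (S_{i,Image}) held in the video-processing correct answer rate data holding unit 6 and the correct answer rate data (S_{i,Sound}) held in the audio-processing correct answer rate data holding unit 7, the speech recognition unit 5 calculates the overall video and audio similarity (R_{i,Total}) for the i-th monosyllable by the following Equation (2).
R_{i,Total} = S_{i,Image} · R_{i,Image} + S_{i,Sound} · R_{i,Sound} ... (Equation 2)
The speech recognition unit 5 obtains the similarity R_{i,Total} for every monosyllable contained in the video standard data and the audio standard data, and outputs the monosyllable for which R_{i,Total} is largest as the recognition result.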
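As a minimal sketch of Equation (2), the per-monosyllable weighting might look like this. The function name, the two example monosyllables, and all numeric values are hypothetical; the HMM similarities are assumed to be already available as plain numbers.

```python
def fuse_with_rates(r_image, r_sound, s_image, s_sound):
    """Equation (2): weight each modality's similarity by its own
    correct answer rate for that monosyllable, then add."""
    return {
        k: s_image[k] * r_image[k] + s_sound[k] * r_sound[k]
        for k in r_image
    }


# Hypothetical similarities (R) and correct answer rates (S).
r_image = {"ma": 0.8, "na": 0.3}
r_sound = {"ma": 0.5, "na": 0.6}
s_image = {"ma": 0.9, "na": 0.4}
s_sound = {"ma": 0.6, "na": 0.7}

total = fuse_with_rates(r_image, r_sound, s_image, s_sound)
best = max(total, key=total.get)
```

Unlike a single global coefficient, the weights here differ from monosyllable to monosyllable, so a modality that is reliable for one monosyllable and unreliable for another is treated accordingly.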
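[0050]
Thus, the speech recognition apparatus of this embodiment performs speech recognition by combining the video and audio similarities with the correct answer rates. Combining the correct answer rates amounts to giving more weight, when identifying a monosyllable, to whichever of the audio-based and video-based identification methods identifies that monosyllable more effectively. This makes highly accurate speech recognition possible even in noisy environments.
[0051]
For the similarity calculation in the video processing unit 2 and the audio processing unit 4, techniques other than HMM that are commonly used in speech recognition, such as neural networks, may also be used.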
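[0052]
(Second Embodiment)
The speech recognition apparatus of the second embodiment has the same configuration as the first embodiment (FIG. 1) and differs only in how the speech recognition unit 5 calculates the overall similarity (R_{i,Total}).
[0053]
When S_{i,Image} · R_{i,Image} > S_{i,Sound} · R_{i,Sound}, the speech recognition unit 5 of this apparatus calculates
R_{i,Total} = S_{i,Image} · R_{i,Image} ... (Equation 3)
and when S_{i,Image} · R_{i,Image} ≤ S_{i,Sound} · R_{i,Sound}, it calculates
R_{i,Total} = S_{i,Sound} · R_{i,Sound} ... (Equation 4)
It then outputs the monosyllable for which R_{i,Total} is largest as the recognition result.
[0054]
Thus, the apparatus of this embodiment selects whichever of the video-based and audio-based identification results is more reliable and uses it for identification. This simplifies the computation in speech recognition while maintaining high recognition accuracy.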
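Equations (3) and (4) together take the larger of the two weighted similarities, which can be written as a single `max`. The sketch below assumes hypothetical names and values throughout.

```python
def fuse_by_selection(r_image, r_sound, s_image, s_sound):
    """Equations (3) and (4): keep only the larger weighted similarity
    for each monosyllable (ties fall to the audio term, same value)."""
    return {
        k: max(s_image[k] * r_image[k], s_sound[k] * r_sound[k])
        for k in r_image
    }


# Hypothetical similarities (R) and correct answer rates (S).
r_image = {"ma": 0.8, "na": 0.3}
r_sound = {"ma": 0.5, "na": 0.6}
s_image = {"ma": 0.9, "na": 0.4}
s_sound = {"ma": 0.6, "na": 0.7}

total = fuse_by_selection(r_image, r_sound, s_image, s_sound)
best = max(total, key=total.get)
```

With these values, "ma" is scored by its video term and "na" by its audio term, so each monosyllable is judged by whichever modality is more trustworthy for it.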
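[0055]
(Third Embodiment)
The speech recognition apparatus of the third embodiment can perform highly accurate speech recognition even when the S/N of the input speech varies. The correct answer rate of audio-based speech recognition changes with the S/N of the input speech, and this apparatus is configured to cope with such changes.
[0056]
As shown in FIG. 3, this apparatus comprises: an utterance section detection unit 8 that detects, from the lip video extracted by the video processing unit 2, the sections in which the speaker is uttering (utterance sections) and the sections in which the speaker is not uttering (non-utterance sections); a sound pressure level detection unit 9 that calculates the signal-to-noise ratio (S/N) from the sound pressure level of the utterance sections and the sound pressure level of the non-utterance sections; and an audio-processing correct answer rate data holding unit 7 that holds, as the correct answer rate data (S_{i,Sound}) for monosyllables recognized from speech, a plurality of kinds of data according to the S/N of the input audio signal. Of the correct answer rate data (S_{i,Sound}) held in the holding unit 7, the data corresponding to the S/N detected by the sound pressure level detection unit 9 is output to the speech recognition unit 5. The rest of the configuration is the same as in the first embodiment (FIG. 1).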
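[0057]
In this apparatus, the video processing unit 2 extracts the feature quantities of the lip region from the input video, calculates the similarity (R_{i,Image}) between those feature quantities and the feature quantities in the video standard data of each monosyllable, and outputs it to the speech recognition unit 5.
[0058]
The utterance section detection unit 8 samples, at fixed time intervals, feature quantities extracted by the video processing unit 2, such as the vertical and horizontal lengths of the lips or their ratio. When the change in a feature quantity from one sampling time to the next exceeds a set threshold, the section is identified as an utterance section. When it does not exceed the threshold, the section is identified as a non-utterance section. The identification result is output to the audio processing unit 4.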
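The threshold test in paragraph [0058] might be sketched as below. This is an assumption-laden illustration: it uses a single scalar feature (for example the vertical lip length), measures change as the absolute difference between consecutive samples, and uses made-up sample values and a made-up threshold. The patent does not specify any of these details.

```python
def detect_utterance(features, threshold):
    """Flag each interval between consecutive samples as an utterance
    section (True) when the feature changes by more than the threshold."""
    return [abs(b - a) > threshold for a, b in zip(features, features[1:])]


# Hypothetical vertical lip lengths sampled at fixed intervals.
lip_height = [10, 10, 14, 9, 13, 13, 13]
flags = detect_utterance(lip_height, threshold=2)
```

A practical detector would probably smooth the flags over a few intervals so that a momentary pause between lip movements is not split off as a non-utterance section.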
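[0059]
The audio processing unit 4 divides the audio signal input from the audio input unit 3 into utterance sections and non-utterance sections and outputs it to the sound pressure level detection unit 9. It also extracts feature quantities from the input audio signal, calculates the similarity (R_{i,Sound}) between them and the feature quantities in the audio standard data of each monosyllable, and outputs it to the speech recognition unit 5.
[0060]
The sound pressure level detection unit 9 detects the average sound pressure level in each of the utterance and non-utterance time sections, calculates the signal-to-noise ratio (S/N) by taking the sound pressure level of the utterance sections as the signal level and that of the non-utterance sections as the noise level, and outputs it to the audio-processing correct answer rate data holding unit 7.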
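One way to picture the S/N calculation of paragraph [0060] is the sketch below. It assumes that the "average sound pressure level" is the root-mean-square amplitude of the samples in each kind of section and that the ratio is expressed in decibels; the text specifies neither, and the names and sample values are hypothetical.

```python
import math


def mean_level(samples):
    """Root-mean-square amplitude, used here as the average level."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def snr_db(samples, is_utterance):
    """S/N in dB: utterance sections give the signal level,
    non-utterance sections give the noise level."""
    speech = [x for x, f in zip(samples, is_utterance) if f]
    noise = [x for x, f in zip(samples, is_utterance) if not f]
    if not speech or not noise:
        raise ValueError("need both utterance and non-utterance samples")
    noise_level = mean_level(noise)
    if noise_level == 0.0:
        raise ValueError("noise level is zero; S/N is unbounded")
    return 20.0 * math.log10(mean_level(speech) / noise_level)


# Hypothetical audio samples with per-sample utterance flags.
samples = [0.1, -0.1, 1.0, -1.0, 0.1, -0.1]
flags = [False, False, True, True, False, False]
snr = snr_db(samples, flags)
```

Because the utterance flags come from the lip video rather than from the audio itself, the split into signal and noise sections does not depend on how noisy the audio is.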
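[0061]
The audio-processing correct answer rate data holding unit 7 holds, as the correct answer rate data (S_{i,Sound}) for monosyllables recognized from speech, a plurality of kinds of data corresponding to a plurality of S/N values of the input audio signal. When an S/N value is input from the sound pressure level detection unit 9, the holding unit 7 prepares the kind of correct answer rate data (S_{i,Sound}) that corresponds to that S/N as the data to be output.
[0062]
The speech recognition unit 5 uses the outputs (R_{i,Image}) and (R_{i,Sound}) of the video processing unit 2 and the audio processing unit 4, the correct answer rate data (S_{i,Image}) read from the video-processing correct answer rate data holding unit 6, and the correct answer rate data (S_{i,Sound}) corresponding to the S/N selected from the audio-processing correct answer rate data holding unit 7 to calculate the overall video and audio similarity (R_{i,Total}) for the i-th monosyllable by Equation (2), and outputs the monosyllable for which R_{i,Total} is largest as the recognition result.
[0063]
Thus, because the speech recognition apparatus of this embodiment uses values that depend on the S/N as the correct answer rate data (S_{i,Sound}) for audio-based recognition results, it can perform speech recognition more reliably even for input audio signals with different S/N values.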
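The S/N-dependent lookup could be sketched as a table keyed by S/N level. How the stored S/N levels are spaced and how a measured S/N is matched to them is not stated in the text; the nearest-level rule, the dB levels, and the rate values below are assumptions for illustration only.

```python
def select_sound_rates(rate_tables, snr):
    """Pick the correct answer rate table whose S/N level is closest
    to the measured S/N (nearest-level matching is an assumption)."""
    nearest = min(rate_tables, key=lambda level: abs(level - snr))
    return rate_tables[nearest]


# Hypothetical S_{i,Sound} tables measured at three S/N levels (dB).
rate_tables = {
    0: {"a": 0.55, "i": 0.50},
    10: {"a": 0.80, "i": 0.74},
    20: {"a": 0.92, "i": 0.90},
}

s_sound = select_sound_rates(rate_tables, snr=12.5)
```

The selected table then takes the place of the single S_{i,Sound} table when the overall similarity is computed with Equation (2).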
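[0064]
(Fourth Embodiment)
The speech recognition apparatus of the fourth embodiment uses video of the speaker captured from several directions for speech recognition.
[0065]
As shown in FIG. 4, this apparatus comprises: a video input unit 1 to which video of the speaker captured from the front is input; a video input unit 10 to which video of the speaker captured from the side is input; a video processing unit 2 that extracts the feature quantities of the lip region in the frontal video input from the video input unit 1 and obtains the similarity between those feature quantities and the feature quantities of video standard data consisting of frontal video of the lip region uttering each monosyllable; a video processing unit 11 that extracts the feature quantities of the lip region in the side video input from the video input unit 10 and obtains the similarity between those feature quantities and the feature quantities of video standard data consisting of side video of the lip region uttering each monosyllable; a video-processing correct answer rate data holding unit 6 that holds the correct answer rate data (S_{i,Image1}) for the frontal video; and a video-processing correct answer rate data holding unit 12 that holds the correct answer rate data (S_{i,Image2}) for the side video. The rest of the configuration is the same as in the first embodiment (FIG. 1).
[0066]
In this apparatus, the video signal of the speaker captured from the front is input to the video input unit 1, and the video signal of the speaker captured from the side is input to the video input unit 10.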
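[0067]
The video processing unit 2 extracts the feature quantities of the lip region from the frontal video input from the video input unit 1, calculates the similarity (R_{i,Image1}) between those feature quantities and the feature quantities of the video standard data consisting of frontal video of the lip region uttering each monosyllable, and outputs it to the speech recognition unit 5. The video processing unit 11 extracts the feature quantities of the lip region from the side video input from the video input unit 10, calculates the similarity (R_{i,Image2}) between those feature quantities and the feature quantities of the video standard data consisting of side video of the lip region uttering each monosyllable, and outputs it to the speech recognition unit 5.
[0068]
The video-processing correct answer rate data holding unit 6 holds the correct answer rate data (S_{i,Image1}) for each monosyllable recognized from the frontal video of the lip region, and the video-processing correct answer rate data holding unit 12 holds the correct answer rate data (S_{i,Image2}) for each monosyllable recognized from the side video of the lip region.
[0069]
The speech recognition unit 5 uses the output (R_{i,Image1}) of the video processing unit 2, the output (R_{i,Image2}) of the video processing unit 11, and the output (R_{i,Sound}) of the audio processing unit 4, together with the correct answer rate data (S_{i,Image1}) read from the holding unit 6, the correct answer rate data (S_{i,Image2}) read from the holding unit 12, and the correct answer rate data (S_{i,Sound}) read from the audio-processing correct answer rate data holding unit 7, to calculate the overall video and audio similarity (R_{i,Total}) for the i-th monosyllable by Equation (5).
R_{i,Total} = S_{i,Image1} · R_{i,Image1} + S_{i,Image2} · R_{i,Image2} + S_{i,Sound} · R_{i,Sound} ... (Equation 5)
It then outputs the monosyllable for which R_{i,Total} is largest as the recognition result.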
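Equation (5) is a sum of one weighted term per input stream, so it generalises naturally to any number of streams. The sketch below assumes each stream is supplied as a pair of dictionaries (correct answer rates, similarities); the function name and all values are hypothetical.

```python
def fuse_streams(streams):
    """Equation (5), written for any number of streams.

    streams: list of (rates, similarities) pairs, one pair per stream
    (frontal video, side video, audio, ...), all keyed by monosyllable.
    """
    keys = streams[0][1].keys()
    return {k: sum(s[k] * r[k] for s, r in streams) for k in keys}


# Hypothetical (S, R) pairs for frontal video, side video, and audio.
front = ({"ma": 0.9, "na": 0.4}, {"ma": 0.8, "na": 0.3})
side = ({"ma": 0.5, "na": 0.5}, {"ma": 0.6, "na": 0.4})
sound = ({"ma": 0.6, "na": 0.7}, {"ma": 0.5, "na": 0.6})

total = fuse_streams([front, side, sound])
best = max(total, key=total.get)
```

Adding a further camera angle then only means appending one more (rates, similarities) pair to the list.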
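[0070]
Thus, by using video of the speaker captured from a plurality of directions, the apparatus of this embodiment can perform more reliable speech recognition.
[0071]
Although this embodiment has been described for the case of using frontal and side video of the speaker, using additional video, such as video from oblique directions, enables still more reliable speech recognition.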
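[0072]
(Fifth Embodiment)
The fifth embodiment describes a speech recognition apparatus that identifies the group to which the monosyllable of the uttered speech belongs.
[0073]
For example, suppose that the monosyllables belonging to the "a-row" (あ行) group ("a", "i", "u", "e", "o") share common features, and that the monosyllables belonging to each of the groups containing the same consonant, such as the "ka-row" (か行), the "sa-row" (さ行), and so on, also share common features. Then, by comparing the similarity between the features of the monosyllable in the input speech and the features of each group, the group to which that monosyllable belongs can be identified.
[0074]
One known speech recognition technique works as follows. When, for example, the word "mogura" (モグラ) is uttered, several candidate monosyllables are selected for each of "mo", "gu", and "ra". The combinations of the candidates for "mo", "gu", and "ra" are then tried in turn, and the uttered word is finally identified from criteria such as whether the combination is meaningful as a word.
[0075]
In such a case, if, for example, the group to which the leading monosyllable belongs can be identified, the number of candidate combinations to be examined drops sharply and speech recognition processing becomes more efficient.
[0076]
By using audio data and video data together, the speech recognition apparatus of the fifth embodiment can identify the group to which a monosyllable belongs with high accuracy.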
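[0077]
Like the first embodiment (FIG. 1), this apparatus comprises the video input unit 1, the video processing unit 2, the audio input unit 3, the audio processing unit 4, the video-processing correct answer rate data holding unit 6, the audio-processing correct answer rate data holding unit 7, and the speech recognition unit 5.
[0078]
However, the video processing unit 2 does not compare the features extracted from the input video of the speaker's lip region with the video standard data of the lip region uttering individual monosyllables. Instead, it compares them with the features of each group, which consists of a plurality of monosyllables, and outputs the similarity for each group.
[0079]
Likewise, the audio processing unit 4 does not compare the features extracted from the input speech with the audio standard data of individual monosyllables. Instead, it compares them with the features of each group, which consists of a plurality of monosyllables, and outputs the similarity for each group.
[0080]
To obtain the similarity to each group, the video processing unit 2 and the audio processing unit 4, for example, calculate the similarity between the feature quantities of the input monosyllable and the feature quantities of every monosyllable in the group, and take the largest of these similarities as the similarity of the group. Alternatively, a pattern of feature quantities common to all monosyllables in the group is used as the feature quantities of the group, and the similarity between it and the feature quantities of the input monosyllable is calculated.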
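The first of the two options in paragraph [0080], taking the largest member similarity as the group similarity, can be sketched as follows. The group names, membership lists, and similarity values are hypothetical.

```python
def group_similarity(similarity, groups):
    """Similarity of each group = largest similarity among its members."""
    return {
        name: max(similarity[m] for m in members)
        for name, members in groups.items()
    }


# Hypothetical grouping and per-monosyllable similarities.
groups = {
    "a-row": ["a", "i", "u", "e", "o"],
    "ka-row": ["ka", "ki", "ku", "ke", "ko"],
}
similarity = {
    "a": 0.2, "i": 0.1, "u": 0.3, "e": 0.1, "o": 0.2,
    "ka": 0.7, "ki": 0.4, "ku": 0.5, "ke": 0.3, "ko": 0.6,
}

group_sim = group_similarity(similarity, groups)
```

This option reuses the per-monosyllable standard data unchanged; the alternative in the text needs a separate common feature pattern to be prepared for each group.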
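[0081]
The video-processing correct answer rate data holding unit 6 and the audio-processing correct answer rate data holding unit 7 hold the correct answer rates of the groups output from the video processing unit 2 and the audio processing unit 4, respectively. To obtain these correct answer rates, as illustrated in FIG. 5, samples of monosyllable video or audio ("ka", "ki", "ku") are input from the video input unit 1 or the audio input unit 3, and the group ("a-row", "ka-row", ..., "wa-row") output from the video processing unit 2 or the audio processing unit 4 is measured. Then, for each group, the proportion of its total number of outputs for which the group was correct (that is, outputs for which the input monosyllable belonged to the group) is calculated.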
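The group-level correct answer rate of paragraph [0081] can be sketched by tallying, for each output group, how often the spoken monosyllable really belonged to it. The trial data below is invented for illustration and does not reproduce FIG. 5; the function name and data layout are assumptions.

```python
from collections import Counter


def group_correct_answer_rates(trials, groups):
    """trials: list of (spoken_monosyllable, output_group) pairs.

    Returns, per output group, the fraction of its outputs for which
    the spoken monosyllable is a member of that group.
    """
    member_of = {m: g for g, members in groups.items() for m in members}
    total, correct = Counter(), Counter()
    for spoken, output_group in trials:
        total[output_group] += 1
        if member_of[spoken] == output_group:
            correct[output_group] += 1
    return {g: correct[g] / total[g] for g in total}


groups = {
    "a-row": ["a", "i", "u", "e", "o"],
    "ka-row": ["ka", "ki", "ku", "ke", "ko"],
}
# Hypothetical measurement results.
trials = (
    [("ka", "ka-row")] * 8 + [("ki", "ka-row")] * 9
    + [("ku", "a-row")] * 2
    + [("a", "ka-row")] * 3 + [("a", "a-row")] * 7
)
rates = group_correct_answer_rates(trials, groups)
```

With this data the "ka-row" group is output 20 times and is correct 17 times, giving 0.85, while the "a-row" group is correct in 7 of its 9 outputs.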
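[0082]
From the outputs of the video processing unit 2 and the audio processing unit 4 and the correct answer rate data held in the video-processing correct answer rate data holding unit 6 and the audio-processing correct answer rate data holding unit 7, the speech recognition unit 5 calculates the overall similarity (R_{i,Total}) for the i-th group by Equation (2) above. It then outputs the group for which R_{i,Total} is largest as the recognition result.
[0083]
In this way, the apparatus can identify with high accuracy the group to which the monosyllable of the uttered speech belongs.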
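[0084]
As another example of grouping, the monosyllables can be divided into three groups: monosyllables containing a labial (/b/, /m/, /p/), monosyllables containing a palatalized sound (yōon, /y/), and monosyllables containing neither.
[0085]
In this case, because labials show their characteristics in the shape of the lips, the correct answer rate for the group containing labials tends to be higher for the video processing unit 2 and lower for the audio processing unit 4. Conversely, because palatalized sounds show their characteristics in the uttered sound, the correct answer rate for the group containing palatalized sounds tends to be higher for the audio processing unit 4 and lower for the video processing unit 2. Therefore, when the overall similarity for each group is calculated by Equation (2), the similarity output from the video processing unit 2 contributes more for labials, and the similarity output from the audio processing unit 4 contributes more for palatalized sounds.
[0086]
Accordingly, by using video and audio together for speech identification, it is possible to recognize more reliably whether or not the input monosyllable contains a labial, and whether or not it contains a palatalized sound.
[0087]
Thus, when the monosyllables to be recognized are grouped, selecting both groups for which the output of the video processing unit 2 has a high correct answer rate and groups for which the output of the audio processing unit 4 has a high correct answer rate makes it possible to recognize more finely divided groups than when groups are identified from audio alone or from video alone.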
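[0088]
[Effects of the Invention]
As is clear from the above description, the speech recognition apparatus of the present invention performs speech recognition by combining the similarities between the input audio and video data and the standard data with the corresponding correct answer rates, and can therefore achieve more reliable speech recognition even in environments where noise is present.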
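[Brief Description of the Drawings]
[FIG. 1] Schematic configuration diagram of the speech recognition apparatus according to the first embodiment of the present invention.
[FIG. 2] Diagram explaining the method of calculating the correct answer rate data in the first embodiment.
[FIG. 3] Schematic configuration diagram of the speech recognition apparatus according to the third embodiment of the present invention.
[FIG. 4] Schematic configuration diagram of the speech recognition apparatus according to the fourth embodiment of the present invention.
[FIG. 5] Diagram explaining the classification of monosyllables into groups in the fifth embodiment of the present invention.
[FIG. 6] Schematic configuration diagram of a conventional speech recognition apparatus.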
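[Description of Reference Signs]
1 Video input unit
2 Video processing unit
3 Audio input unit
4 Audio processing unit
5 Speech recognition unit
6 Video-processing correct answer rate data holding unit
7 Audio-processing correct answer rate data holding unit
8 Utterance section detection unit
9 Sound pressure level detection unit
10 Second video input unit
11 Second video processing unit
12 Second video-processing correct answer rate data holding unit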
BACKGROUND OF THE INVENTION
The present invention relates to a voice recognition apparatus that performs voice recognition using a video signal including a lip of a speaker and a voice signal, and a voice recognition method thereof, and particularly, to improve a recognition rate.
[0002]
[Prior art]
When performing speech recognition, a speech recognition device that uses not only speech signals but also images including the lips of the speaker is an “An Isolated Word Speech Recognition Using Fusion of Auditory and Visual Information” (Sintani et al.) IEICE Trans. Fundamentals, Vol. E79-A, No. 6, p777-783 (1996)). In speech recognition using only a speech signal, if noise is mixed, the recognition accuracy is drastically reduced. However, when the image of the lips is used together, the degree of reduction in recognition accuracy can be reduced.
[0003]
FIG. 6 shows a schematic configuration of this conventional speech recognition apparatus. This apparatus includes a video input unit 1 such as a video camera that inputs an image including a lip portion of a speaker, an audio input unit 3 such as a microphone that inputs sound uttered by the speaker, and a lip portion that utters various words. A video processing unit 2 for calculating the similarity between the video standard data and the input lip image, and outputting the similarity to each word included in the video standard data, and the audio standard data of various words and the input voice Obtaining the similarity and calculating the word having the highest similarity from the similarity input from the audio processing unit 4 that outputs the similarity to each word included in the audio standard data and the video processing unit 2 and the audio processing unit 4; And a voice recognition unit 5 for outputting the result as a recognition result.
[0004]
The video processing unit 2 of this apparatus extracts, for example, the vertical and horizontal lengths of the lip portion and the ratio of the vertical and horizontal lengths as feature amounts from the input video. The similarity (R) between the feature quantity corresponding to the i-th word and the feature quantity extracted from the input video among the video standard data of a plurality of words prepared in advance for calculating the similarity._i,_Image) Is calculated and output by a hidden Markov model (hereinafter abbreviated as HMM), which is well known as a pattern recognition technique.
[0005]
Also, the speech processing unit 4 extracts feature quantities from the input speech by cepstrum analysis, and extracts from the feature quantities corresponding to the i-th word and the input speech among the speech standard data of a plurality of words prepared in advance. Similarity with feature value (R_i,_Sound) Is calculated by the HMM and output.
[0006]
In addition, the voice recognition unit 5 outputs (R_i,_Image) And the output of the voice processing unit 4 (R_i,_Sound) From the similarity (R_i,_Total) Is calculated by the following equation (1).
R_i,_Total= Α ・ R_i,_Image+ (1-α) · R_i,_Sound    ......... (Formula 1)
Here, α (0 ≦ α ≦ 1) is a coefficient set in advance so that the recognition rate is maximized by using sampled video and audio data for coefficient determination (apart from similarity calculation). is there.
[0007]
The voice recognition unit 5 performs the similarity R for all words included in the video standard data and the voice standard data._i,_TotalThis similarity R_i,_TotalThe word with the maximum is output as the recognition result.
[0008]
As described above, since the voice recognition device uses the video signal including the lip information together with the voice signal, even when noise is present, the voice recognition device can avoid the sudden decrease in the recognition rate. Can be extended to devices used in noisy environments.
[0009]
[Problems to be solved by the invention]
It has been studied to apply a voice recognition device to a car navigation device and give a command to the device by voice. However, in order to incorporate it in a device used in such a noise environment, the noise of the voice recognition device is considered. It is necessary to further increase the recognition rate below.
[0010]
The present invention responds to such a demand, and it is an object of the present invention to provide a speech recognition apparatus capable of realizing a high recognition rate in speech recognition under a noisy environment and to provide a speech recognition method thereof. .
[0011]
[Means for Solving the Problems]
  Therefore, in the speech recognition apparatus of the present invention, the video data of the speaker including the lipsTheVideo input means to input and speaker's voice dataTheVideo processing that calculates the similarity between the input audio data and the standard video data of the lips that utter each single syllable and the input video data, and outputs each single syllable of the standard video data and the similarity to it Sound processing means for calculating the similarity between the voice standard data of the voice uttering each single syllable and the input voice data, and outputting each single syllable of the voice standard data and the similarity thereto, and video A speech recognition means for identifying a single syllable having the largest overall similarity using the similarity output from the processing means and the speech processing means;The video data relating to various single syllables is input to the video input means in advance, and the single syllable of the video standard data having the maximum similarity output from the video processing means corresponding to each input video data is selected as a candidate single syllable. The ratio of the number of candidate single syllables that match the single syllable of the input video data in the total number of identical candidate single syllables output from the video processing means calculated based on As data of correct answer rate for the single syllable of the dataMeans for holding the correct rate of video standard data to be held;The voice data of various single syllables is input to the voice input means in advance, and the single syllable of the voice standard data having the maximum similarity output from the voice processing means corresponding to each input voice data is set as a candidate single syllable. 
The ratio of the number of candidate single syllables that match the single syllable of the input voice data out of the total number of identical candidate single syllables output from the voice processing means calculated and calculated based on the voice standard data As the correct answer rate data for that single syllableVoice standard data correct answer rate holding means to hold, and voice recognition means is video processing meansFromOutput similarityWhenVideo standard data correct answer rate retention meansFromRead correct answer rate andAnd the sum of products of the similarity output from the voice processing means and the correct answer rate read from the voice standard data correct answer rate holding meansOverall similarity to each syllableAsI want to ask.
[0012]
In this device, when the single syllable to be identified is a type of single syllable that can be identified with high accuracy when identified based on the image of the lips, the identification result by the image greatly contributes to the final determination, When the single syllable to be identified is a single syllable that can be identified with higher accuracy when identified based on speech, the identification result by speech greatly contributes to the final determination. Therefore, highly reliable voice recognition is possible even in a noisy environment.
[0015]
DETAILED DESCRIPTION OF THE INVENTION
  According to the first aspect of the present invention, video data of a speaker including a lip is provided.TheVideo input means to input and speaker's voice dataTheVideo processing that calculates the similarity between the input audio data and the standard video data of the lips that utter each single syllable and the input video data, and outputs each single syllable of the standard video data and the similarity to it Sound processing means for calculating the similarity between the voice standard data of the voice uttering each single syllable and the input voice data, and outputting each single syllable of the voice standard data and the similarity thereto,The video data relating to various single syllables is input to the video input means in advance, and the single syllable of the video standard data having the maximum similarity output from the video processing means corresponding to each input video data is selected as a candidate single syllable. The ratio of the number of candidate single syllables that match the single syllable of the input video data in the total number of identical candidate single syllables output from the video processing means calculated based on As data of correct answer rate for the single syllable of the dataMeans for holding the correct rate of video standard data to be held;The voice data of various single syllables is input to the voice input means in advance, and the single syllable of the voice standard data having the maximum similarity output from the voice processing means corresponding to each input voice data is set as a candidate single syllable. 
The ratio of the number of candidate single syllables that match the single syllable of the input voice data out of the total number of identical candidate single syllables output from the voice processing means calculated and calculated based on the voice standard data As the correct answer rate data for that single syllableAudio standard data correct answer rate holding means to hold, and the video processing meansFromOutput similarityWhenSaid video standard data correct answer rate holding meansFromRead correct answer rate andAnd the sum of products of the similarity output from the speech processing means and the correct answer rate read from the speech standard data correct answer rate holding means.Overall similarity to each syllableAsThe speech recognition device is provided with speech recognition means for identifying the single syllable having the highest overall similarity, and the single syllable to be identified is more accurately identified based on the shape and movement of the lips. In the case of a single syllable that can be identified, the contribution rate of the identification by video is increased, and the single syllable to be identified is a single syllable that can be identified with higher accuracy when identified based on speech. By increasing the contribution rate of discrimination by voice, it is possible to perform voice recognition with high reliability even in a noisy environment.
[0020]
  Claim2The voice standard data correct answer rate holding means holds a plurality of types of correct answer rate data corresponding to the signal-to-noise ratio as correct answer rate data, and the voice recognition means is comprehensive for each single syllable. When obtaining a high degree of similarity, the correct answer rate data corresponding to the signal-to-noise ratio of the input voice data is read from the voice standard data correct answer rate holding means, and the reliability of the identification using the voice signal However, the point which fluctuates by the signal-to-noise ratio of the input voice signal can be improved.
[0024]
  Claim3The video data of the speaker including the lipsTheVideo input means to input and speaker's voice dataTheVoice input means for input;Single syllables with common utterance mechanisms are grouped in advance, and each groupStandard video data for lips that speak single syllablesWhenCalculate the similarity between the input video data andThe maximum similarity among them is set as the similarity of the group, and the group identification information and theVideo processing means for outputting the similarity of the group;
  For each group,Voice standard data for voices uttering each single syllableWhenCalculate the similarity between the input audio data andThe maximum similarity among them is set as the similarity of the group, and the group identification information and theVoice processing means for outputting the similarity of the group;In advance, the video data relating to various single syllables is input to the video input means, and the total number of outputs of each group output from the video processing means corresponding to each input video data is calculated and calculated based thereon. Of the total number of outputs, the ratio of the number of correct outputs in which the input single syllable is included in the group is used as data of the correct answer rate for the group.Means for holding the correct rate of video standard data to be held;Preliminarily input the voice data of various single syllables to the voice input means, totaled the total number of outputs of each group output from the voice processing means corresponding to each input voice data, calculated based on it, Of the total number of outputs, the ratio of the number of correct outputs in which the input single syllable is included in the group is used as the correct answer rate data for the group.Audio standard data correct answer rate holding means to hold, and the video processing meansFromOutput similarityWhenSaid video standard data correct answer rate holding meansFromRead correct answer rate andAnd the sum of products of the similarity output from the speech processing means and the correct answer rate read from the speech standard data correct answer rate holding means.Overall similarity for each groupAsAnd a voice recognition device including a voice recognition unit that identifies a group having the largest overall similarity, and can accurately identify a group to which a voice to be recognized belongs, thereby efficiently performing voice identification processing. Can be
[0038]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0039]
(First embodiment)
As shown in FIG. 1, the speech recognition apparatus according to the first embodiment includes a video input unit 1 such as a video camera that inputs a video including a speaker's lip, a microphone that inputs a voice uttered by the speaker, and the like. The video input unit 3 and the video processing for obtaining the similarity between the video standard data of the lip part that utters various single syllables and the video of the input lip part, and outputting the similarity to each single syllable included in the video standard data Based on the image of the lip portion, the speech processing unit 4 that obtains the similarity between the unit 2, the speech standard data of various single syllables and the input speech, and outputs the similarity to each single syllable included in the speech standard data A video processing unit correct answer rate data holding unit 6 that holds correct answer rate data of each single syllable that has been voice-recognized (that is, data representing a probability that the single syllable is correct), and each single syllable that has been voice-recognized based on the voice. Preserve syllable accuracy data The voice processing unit correct answer rate data holding unit 7, the similarity input from the video processing unit 2 and the audio processing unit 4, and the video processing unit correct answer rate data holding unit 6 and the voice processing unit correct answer rate data holding unit 7 are read out. A speech recognition unit 5 that obtains a total similarity based on the correct answer rate data and outputs a single syllable having the highest similarity as a recognition result.
[0040]
FIG. 2 is a drawing for explaining the correct answer rate in the present invention. This figure shows, for example, when a single syllable is input to the speech processing unit 4 (here, for the sake of simplicity, the input single syllable is set to 5 of “A”, “I”, “U”, “E”, “O”. The single syllable is output as a “single syllable with the highest degree of similarity” (this is called a candidate single syllable) by the speech processing unit 4.
[0041]
This candidate single syllable appearance frequency data is prepared as a sample of single syllable speech separately from the standard speech data for calculating the similarity, and is input to the speech processing unit 4 to obtain the actual value. Seeking.
[0042]
For example, when a single syllable “a” is input 100 times to the speech processing unit 4, “a” is a candidate single syllable 97 times, and “u” is a candidate single syllable once, There was one case where “o” was a candidate single syllable (the candidate single syllable could not be specified for the remaining one).
[0043]
From the frequency of appearance of candidate single syllables obtained in this way, when a specific single syllable is output, the ratio that the single syllable is correct, that is, the number of outputs of the single syllable and the number of outputs The ratio of the number of outputs that match a single syllable is calculated as the correct answer rate.
[0044]
For example, with respect to the number 105 in which the speech processing unit outputs “a”, the number of input single syllables “a” is 97. Therefore, in the example of FIG. .924.
[0045]
In this way, the speech data of all single syllables to be recognized is input to the speech processing unit 4, the correct answer rate for these single syllables is calculated, and the correct answer rate data (S_i,_Sound) Is stored in the voice processing unit correct answer rate data holding unit 7. Similarly, with respect to video, video data including the lip portion of the speaker who utters all single syllables to be recognized is input to the video processing unit 2, and correct answer rate data (S for these single syllables)._i,_Image) Is calculated and stored in the video processing unit correct answer rate data holding unit 6.
[0046]
The video data and audio data used for calculating the correct answer rate data are preferably provided by the person whose speech is to be recognized or by a plurality of speakers. It is also desirable that the numbers of input data items vary little.
[0047]
The video processing unit 2 of this apparatus extracts, as feature amounts, for example the vertical and horizontal lengths of the lip portion and the ratio of the vertical length to the horizontal length from the video input from the video input unit 1. Then, among the video standard data of a plurality of single syllables prepared in advance for calculating the similarity, it calculates by the HMM the similarity (R_{i,Image}) between the feature amount corresponding to the i-th single syllable and the feature amount extracted from the input video, and outputs the result.
[0048]
Further, the speech processing unit 4 extracts a feature amount by cepstrum analysis from the speech input from the speech input unit 3. Then, among the speech standard data of a plurality of single syllables prepared in advance, it calculates by the HMM the similarity (R_{i,Sound}) between the feature amount corresponding to the i-th single syllable and the feature amount extracted from the input speech, and outputs the result.
[0049]
The voice recognition unit 5 calculates the total similarity of video and audio (R_{i,Total}) by the following equation (2) from the output of the video processing unit 2 (R_{i,Image}), the output of the voice processing unit 4 (R_{i,Sound}), the correct answer rate data (S_{i,Image}) held in the video processing unit correct answer rate data holding unit 6, and the correct answer rate data (S_{i,Sound}) held in the voice processing unit correct answer rate data holding unit 7.
R_{i,Total} = S_{i,Image}·R_{i,Image} + S_{i,Sound}·R_{i,Sound} ......... (Formula 2)
The voice recognition unit 5 obtains the similarity R_{i,Total} for all single syllables included in the video standard data and the voice standard data, and outputs the single syllable with the maximum R_{i,Total} as the recognition result.
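As a rough illustration of (Formula 2), the following Python sketch weights each similarity by its correct answer rate, sums the two terms per single syllable, and picks the maximum. The function names and all numeric values are hypothetical; in the apparatus the similarities would come from the HMM-based processing units and the rates from the holding units 6 and 7.

```python
def fuse_weighted_sum(r_image, r_sound, s_image, s_sound):
    """Formula 2: R_total[i] = S_image[i]*R_image[i] + S_sound[i]*R_sound[i]."""
    return {
        syllable: s_image[syllable] * r_image[syllable]
        + s_sound[syllable] * r_sound[syllable]
        for syllable in r_image
    }


def recognize(r_total):
    """Return the single syllable with the maximum total similarity."""
    return max(r_total, key=r_total.get)


# Hypothetical similarities (R) and correct answer rates (S).
r_image = {"a": 0.60, "i": 0.30, "u": 0.10}
r_sound = {"a": 0.40, "i": 0.50, "u": 0.10}
s_image = {"a": 0.90, "i": 0.50, "u": 0.70}
s_sound = {"a": 0.92, "i": 0.80, "u": 0.85}

r_total = fuse_weighted_sum(r_image, r_sound, s_image, s_sound)
result = recognize(r_total)
print(result, r_total)
```

Unlike the single coefficient α of (Formula 1), the weights here differ for every single syllable, so a syllable that one modality identifies reliably draws more on that modality.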
[0050]
As described above, the speech recognition apparatus of this embodiment performs speech recognition by combining the video and audio similarities with the correct answer rates. Combining the correct answer rates means that, when a single syllable is identified, more weight is given to whichever of the audio-based and video-based identification methods identifies that single syllable more reliably. By doing so, highly accurate voice recognition can be realized even in a noisy environment.
[0051]
For calculating the similarity in the video processing unit 2 and the audio processing unit 4, other methods generally used for voice recognition such as a neural network may be used in addition to the HMM.
[0052]
(Second Embodiment)
The speech recognition apparatus of the second embodiment has the same configuration as that of the first embodiment (FIG. 1); only the operation for calculating the overall similarity (R_{i,Total}) is different.
[0053]
When S_{i,Image}·R_{i,Image} > S_{i,Sound}·R_{i,Sound}, the voice recognition unit 5 of this apparatus calculates
R_{i,Total} = S_{i,Image}·R_{i,Image} ......... (Formula 3)
and when S_{i,Image}·R_{i,Image} ≦ S_{i,Sound}·R_{i,Sound}, it calculates
R_{i,Total} = S_{i,Sound}·R_{i,Sound} ......... (Formula 4)
It then outputs the single syllable with the maximum R_{i,Total} as the recognition result.
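The selection rule of (Formula 3) and (Formula 4) amounts to taking the larger of the two weighted terms for each single syllable. A minimal Python sketch follows; the function name and the numeric values are hypothetical and serve only to show the rule.

```python
def fuse_select_max(r_image, r_sound, s_image, s_sound):
    """Per syllable, keep the larger of S_image*R_image and S_sound*R_sound."""
    r_total = {}
    for syllable in r_image:
        image_term = s_image[syllable] * r_image[syllable]
        sound_term = s_sound[syllable] * r_sound[syllable]
        # Formula 3 applies when the image term is strictly larger,
        # Formula 4 otherwise (including ties).
        r_total[syllable] = image_term if image_term > sound_term else sound_term
    return r_total


# Hypothetical similarities (R) and correct answer rates (S).
r_image = {"a": 0.60, "i": 0.30, "u": 0.10}
r_sound = {"a": 0.40, "i": 0.50, "u": 0.10}
s_image = {"a": 0.90, "i": 0.50, "u": 0.70}
s_sound = {"a": 0.92, "i": 0.80, "u": 0.85}

r_total = fuse_select_max(r_image, r_sound, s_image, s_sound)
result = max(r_total, key=r_total.get)
print(result, r_total)
```

With these values the image term is used for “a” and the sound term for “i” and “u”, which shows that the choice of modality is made separately for each candidate.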
[0054]
As described above, in the apparatus of this embodiment, the one having higher reliability among the identification results based on the video data or the audio data is selected and used for identification. By doing so, it is possible to simplify arithmetic processing in speech recognition while maintaining high recognition accuracy.
[0055]
(Third embodiment)
The speech recognition apparatus according to the third embodiment can perform highly accurate speech recognition even when the S/N of the input speech varies. In recognition based on the audio signal, the correct answer rate varies with the S/N of the input speech. This apparatus is configured to cope with such variation.
[0056]
As shown in FIG. 3, this apparatus includes: an utterance section detection unit 8 that detects utterance sections and non-utterance sections from the lip image extracted by the video processing unit 2; a sound pressure level detection unit 9 that calculates a signal-to-noise ratio (S/N) from the sound pressure level in the utterance section and the sound pressure level in the non-utterance section; and a voice processing unit correct answer rate data holding unit 7 that holds, as the correct answer rate data (S_{i,Sound}) of each single syllable, a plurality of types of data corresponding to the S/N of the input voice signal. Of the correct answer rate data (S_{i,Sound}) held in the voice processing unit correct answer rate data holding unit 7, the data corresponding to the S/N detected by the sound pressure level detection unit 9 is output to the speech recognition unit 5. Other configurations are the same as those of the first embodiment (FIG. 1).
[0057]
In this apparatus, the video processing unit 2 extracts the feature amount of the lip portion from the input video, calculates the similarity (R_{i,Image}) between that feature amount and the feature amount in the video standard data of each single syllable, and outputs it to the speech recognition unit 5.
[0058]
The utterance section detection unit 8 samples, at regular intervals, feature amounts extracted by the video processing unit 2, such as the vertical and horizontal lengths of the lips or their ratio. When the amount of change of the feature amount over time exceeds a set threshold, it identifies the section as an utterance section; when the amount of change does not exceed the threshold, it identifies the section as a non-utterance section. It outputs the identification result to the speech processing unit 4.
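The threshold rule described above can be sketched as follows in Python. The use of a single scalar feature (lip height), the frame-to-frame absolute difference as the “amount of change”, the treatment of the first sample, and all values are assumptions made for illustration; the text does not fix these details.

```python
def detect_utterance(features, threshold):
    """Label each sampled frame True (utterance) or False (non-utterance).

    A frame counts as utterance when the feature changed by more than
    `threshold` since the previous sample.
    """
    labels = [False]  # assumption: the first sample has no predecessor
    for previous, current in zip(features, features[1:]):
        labels.append(abs(current - previous) > threshold)
    return labels


# Hypothetical lip heights sampled at regular intervals.
lip_height = [10.0, 10.1, 10.0, 14.0, 18.5, 13.0, 10.2, 10.1]
labels = detect_utterance(lip_height, threshold=1.0)
print(labels)
```

A practical detector would probably smooth the labels over several frames so that a brief pause inside a syllable is not treated as a non-utterance section.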
[0059]
The speech processing unit 4 divides the speech signal input from the speech input unit 3 into utterance sections and non-utterance sections and outputs it to the sound pressure level detection unit 9. It also extracts a feature amount from the input speech signal, calculates the similarity (R_{i,Sound}) between that feature amount and the feature amount in the speech standard data of each single syllable, and outputs it to the speech recognition unit 5.
[0060]
The sound pressure level detection unit 9 detects the average sound pressure level in each of the utterance section and the non-utterance section, calculates the signal-to-noise ratio (S/N) by taking the sound pressure level of the utterance section as the signal level and the sound pressure level of the non-utterance section as the noise level, and outputs it to the speech processing unit correct answer rate data holding unit 7.
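One way to picture this S/N estimate is the Python sketch below. It assumes that the average sound pressure level is the RMS amplitude of the samples in each section and that the ratio is expressed in decibels; neither choice is stated in the text, and the function names and sample values are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude, used here as the average level."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def snr_db(audio, is_utterance):
    """S/N in dB: utterance-section level over non-utterance-section level."""
    signal = [x for x, flag in zip(audio, is_utterance) if flag]
    noise = [x for x, flag in zip(audio, is_utterance) if not flag]
    if not signal or not noise:
        raise ValueError("both utterance and non-utterance samples are needed")
    return 20.0 * math.log10(rms(signal) / rms(noise))


# Hypothetical samples: two noise-only samples, then two speech samples.
audio = [0.01, -0.01, 0.1, -0.1]
is_utterance = [False, False, True, True]
value = snr_db(audio, is_utterance)
print(round(value, 1))
```

Because the section labels come from the lip video rather than from the audio itself, the noise level can be measured even when the noise is loud enough to defeat an audio-only speech detector.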
[0061]
The speech processing unit correct answer rate data holding unit 7 holds, as the correct answer rate data (S_{i,Sound}) of each single syllable, a plurality of types of data corresponding to a plurality of S/N values of the input audio signal. When the S/N is input from the sound pressure level detection unit 9, the speech processing unit correct answer rate data holding unit 7 prepares the correct answer rate data (S_{i,Sound}) of the type corresponding to that S/N as its output data.
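The selection of correct answer rate data by S/N might look like the following Python sketch. The table contents, the S/N grid of 0, 10, and 20 dB, and the nearest-value selection rule are all assumptions; the text only states that the data corresponding to the detected S/N is prepared.

```python
# Hypothetical correct answer rate tables, keyed by the S/N (in dB) at which
# each table was measured.
rate_tables = {
    0.0: {"a": 0.60, "i": 0.55},
    10.0: {"a": 0.80, "i": 0.75},
    20.0: {"a": 0.92, "i": 0.90},
}


def select_rate_table(tables, snr):
    """Return the table whose S/N key is closest to the detected S/N."""
    nearest = min(tables, key=lambda key: abs(key - snr))
    return tables[nearest]


s_sound = select_rate_table(rate_tables, snr=17.0)
print(s_sound)
```

The selected table then takes the place of S_{i,Sound} when the overall similarity is calculated, so the audio term is weighted down automatically as the noise grows.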
[0062]
The voice recognition unit 5 calculates the overall similarity of video and audio (R_{i,Total}) by equation (2) from the outputs (R_{i,Image}) and (R_{i,Sound}) of the video processing unit 2 and the voice processing unit 4, the correct answer rate data (S_{i,Image}) read from the video processing unit correct answer rate data holding unit 6, and the correct answer rate data (S_{i,Sound}) corresponding to the S/N selected from the voice processing unit correct answer rate data holding unit 7. It then outputs the single syllable with the maximum R_{i,Total} as the recognition result.
[0063]
Thus, in the speech recognition apparatus of this embodiment, a value corresponding to the S/N is used as the correct answer rate data (S_{i,Sound}), so voice recognition can be performed more reliably even for input voice signals with different S/N.
[0064]
(Fourth embodiment)
The voice recognition device according to the fourth embodiment uses a video of a speaker taken from various directions for voice recognition processing.
[0065]
As shown in FIG. 4, this apparatus includes: a video input unit 1 for inputting a video of the speaker taken from the front; a video input unit 10 for inputting a video of the speaker taken from the side; a video processing unit 2 that extracts the feature amount of the lip portion from the front video input from the video input unit 1 and obtains the similarity between that feature amount and the feature amount of video standard data consisting of front videos of the lip portion uttering each single syllable; a video processing unit 11 that extracts the feature amount of the lip portion from the side video input from the video input unit 10 and obtains the similarity between that feature amount and the feature amount of video standard data consisting of side videos of the lip portion uttering each single syllable; a video processing unit correct answer rate data holding unit 6 that holds the correct answer rate data (S_{i,Image1}) for the video processing unit 2; and a video processing unit correct answer rate data holding unit 12 that holds the correct answer rate data (S_{i,Image2}) for the video processing unit 11. Other configurations are the same as those of the first embodiment (FIG. 1).
[0066]
In this apparatus, a video signal obtained by photographing a speaker from the front is input to the video input unit 1, and a video signal obtained by photographing the speaker from the side is input to the video input unit 10.
[0067]
The video processing unit 2 extracts the feature amount of the lip portion from the front video input from the video input unit 1, calculates the similarity (R_{i,Image1}) between that feature amount and the feature amount of the video standard data consisting of front videos of the lip portion uttering each single syllable, and outputs it to the voice recognition unit 5. The video processing unit 11 extracts the feature amount of the lip portion from the side video input from the video input unit 10, calculates the similarity (R_{i,Image2}) between that feature amount and the feature amount of the video standard data consisting of side videos of the lip portion uttering each single syllable, and outputs it to the voice recognition unit 5.
[0068]
The video processing unit correct answer rate data holding unit 6 holds the correct answer rate data (S_{i,Image1}) of each single syllable recognized on the basis of the front video of the lip portion, and the video processing unit correct answer rate data holding unit 12 holds the correct answer rate data (S_{i,Image2}) of each single syllable recognized on the basis of the side video of the lip portion.
[0069]
The voice recognition unit 5 calculates the overall similarity of video and audio (R_{i,Total}) by equation (5) from the output of the video processing unit 2 (R_{i,Image1}), the output of the video processing unit 11 (R_{i,Image2}), the output of the voice processing unit 4 (R_{i,Sound}), the correct answer rate data (S_{i,Image1}) read from the video processing unit correct answer rate data holding unit 6, the correct answer rate data (S_{i,Image2}) read from the video processing unit correct answer rate data holding unit 12, and the correct answer rate data (S_{i,Sound}) read from the voice processing unit correct answer rate data holding unit 7.
R_{i,Total} = S_{i,Image1}·R_{i,Image1} + S_{i,Image2}·R_{i,Image2} + S_{i,Sound}·R_{i,Sound} ......... (Formula 5)
It then outputs the single syllable with the maximum R_{i,Total} as the recognition result.
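(Formula 5) extends the weighted sum to three streams, and the same pattern carries over to any number of camera directions. The Python sketch below treats each stream as a pair of correct answer rates and similarities; the function name and all values are hypothetical.

```python
def fuse_streams(streams):
    """Sum S*R over all streams for each single syllable.

    `streams` is a list of (correct_answer_rates, similarities) pairs, each a
    dict keyed by single syllable. With three streams (front video, side
    video, audio) this corresponds to Formula 5.
    """
    syllables = streams[0][1].keys()
    return {
        syllable: sum(rates[syllable] * sims[syllable] for rates, sims in streams)
        for syllable in syllables
    }


# Hypothetical (S, R) pairs for front video, side video, and audio.
front = ({"a": 0.9, "i": 0.5}, {"a": 0.6, "i": 0.4})
side = ({"a": 0.7, "i": 0.6}, {"a": 0.5, "i": 0.5})
sound = ({"a": 0.9, "i": 0.8}, {"a": 0.4, "i": 0.6})

r_total = fuse_streams([front, side, sound])
result = max(r_total, key=r_total.get)
print(result, r_total)
```

Passing only the front-video and audio pairs reproduces (Formula 2), so the first embodiment is the two-stream case of the same calculation.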
[0070]
As described above, in the apparatus of this embodiment, it is possible to perform more reliable voice recognition by using the video of the speaker taken from a plurality of directions.
[0071]
In this embodiment, the case where the front and side videos of the speaker are used has been described. However, using more videos, such as videos taken from oblique directions in addition to the front and the side, makes still more reliable voice recognition possible.
[0072]
(Fifth embodiment)
In the fifth embodiment, a voice recognition device that identifies a group to which a single syllable of spoken voice belongs will be described.
[0073]
For example, suppose that there are features common to the single syllables belonging to the “a row” group (“a”, “i”, “u”, “e”, “o”), and features common to the single syllables belonging to each of the groups that share the same consonant, such as the “ka row”, the “sa row”, and so on. Then, by comparing the similarity between the features of the single syllable of the input speech and the features of each group, it is possible to specify which group the single syllable belongs to.
[0074]
For example, when the word “mogura” (mole) is uttered, a plurality of single syllables are selected as candidate single syllables for each of “mo”, “gu”, and “ra”. A known method combines the candidates for “mo”, “gu”, and “ra” in turn and identifies the word finally uttered on the basis of whether each combination is meaningful as a word.
[0075]
In such a case, for example, if the group to which the leading single syllable belongs can be identified, the number of combinations of candidates to be examined can be greatly reduced, and the speech recognition process can be made more efficient.
[0076]
The voice recognition apparatus according to the fifth embodiment can specify a group to which a single syllable belongs with high accuracy by using voice data and video data together.
[0077]
As in the first embodiment (FIG. 1), this apparatus includes a video input unit 1, a video processing unit 2, an audio input unit 3, an audio processing unit 4, a video processing unit correct answer rate data holding unit 6, an audio processing unit correct answer rate data holding unit 7, and a voice recognition unit 5.
[0078]
However, the video processing unit 2 does not compare the features extracted from the input video of the speaker's lip portion with the video standard data of the lip portion uttering each individual single syllable. Instead, it compares them with the features of each group consisting of a plurality of single syllables and outputs the similarity for each group.
[0079]
Likewise, the speech processing unit 4 does not compare the features extracted from the input speech with the speech standard data of individual single syllables. Instead, it compares them with the features of each group and outputs the similarity for each group.
[0080]
To obtain the similarity for each group, the video processing unit 2 and the audio processing unit 4 calculate, for example, the similarities between the feature amount of the input single syllable and the feature amounts of all the single syllables included in the group, and take the maximum of these similarities as the similarity of the group. Alternatively, a feature amount pattern common to all single syllables included in the group is used as the feature amount of the group, and the similarity between it and the feature amount of the input single syllable is calculated.
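The first of the two options above, taking the maximum member similarity as the group similarity, can be sketched in Python as follows. The group table covers only two rows and the similarity values are invented for illustration; the group names are likewise assumptions.

```python
# Hypothetical grouping by kana row (only two rows shown).
groups = {
    "a row": ["a", "i", "u", "e", "o"],
    "ka row": ["ka", "ki", "ku", "ke", "ko"],
}

# Hypothetical per-syllable similarities for one input single syllable.
syllable_similarity = {
    "a": 0.10, "i": 0.05, "u": 0.20, "e": 0.05, "o": 0.15,
    "ka": 0.30, "ki": 0.10, "ku": 0.70, "ke": 0.20, "ko": 0.40,
}


def group_similarities(syllable_similarity, groups):
    """Group similarity = maximum similarity among the group's syllables."""
    return {
        name: max(syllable_similarity[s] for s in members)
        for name, members in groups.items()
    }


r_group = group_similarities(syllable_similarity, groups)
best_group = max(r_group, key=r_group.get)
print(best_group, r_group)
```

The resulting per-group similarities play the role of R_{i,Image} or R_{i,Sound}, with the index i now running over groups rather than single syllables.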
[0081]
The video processing unit correct answer rate data holding unit 6 and the audio processing unit correct answer rate data holding unit 7 hold the correct answer rates of the groups output from the video processing unit 2 and the audio processing unit 4, respectively. To obtain these correct answer rates, as illustrated in FIG. 5, samples of single-syllable video or audio (“ka”, “ki”, “ku”, ...) are input from the video input unit 1 or the audio input unit 3, and it is actually measured which group (“a row”, “ka row”, ..., “wa row”) the video processing unit 2 or the audio processing unit 4 outputs. Then the ratio of the number of correct outputs (outputs in which the input single syllable was included in the output group) is calculated.
[0082]
The voice recognizing unit 5 calculates the total similarity (R_{i,Total}) for the i-th group by (Formula 2) above, from the outputs of the video processing unit 2 and the voice processing unit 4 and the correct answer rate data held in the video processing unit correct answer rate data holding unit 6 and the voice processing unit correct answer rate data holding unit 7. It then outputs the group with the largest R_{i,Total} as the recognition result.
[0083]
In this way, this apparatus can specify the group to which the single syllable of the uttered voice belongs with high accuracy.
[0084]
As another example of grouping, the single syllables can be divided into three groups: a group of single syllables containing labial sounds (/b/, /m/, /p/), a group of single syllables containing contracted sounds (/y/), and a group of single syllables containing neither labial sounds nor contracted sounds.
[0085]
In this case, since the characteristics of a labial sound appear in the shape of the lips, the correct answer rate of the group containing labial sounds tends to be higher for the video processing unit 2 and lower for the voice processing unit 4. Conversely, since a contracted sound is characterized by the uttered sound, the correct answer rate of the group containing contracted sounds tends to be higher for the voice processing unit 4 and lower for the video processing unit 2. Therefore, when the total similarity for each group is calculated by (Formula 2), the contribution of the similarity output from the video processing unit 2 is large for the labial sound group, and the contribution of the similarity output from the voice processing unit 4 is large for the contracted sound group.
[0086]
Therefore, by performing recognition using both video and audio, it is possible to recognize more reliably whether the input single syllable contains a labial sound, contains a contracted sound, or contains neither.
[0087]
In this way, when the single syllables to be recognized are grouped, selecting both groups for which the output of the video processing unit 2 has a high correct answer rate and groups for which the output of the audio processing unit 4 has a high correct answer rate makes it possible to recognize finer groups than when the group is identified by audio alone or by video alone.
[0088]
[Effect of the invention]
As is clear from the above description, the speech recognition apparatus of the present invention performs speech recognition by combining the similarities between the input audio and video data and the standard data with the correct answer rates, so that more reliable speech recognition can be realized even in an environment where noise is present.
[Brief description of the drawings]
FIG. 1 is a schematic configuration diagram of a speech recognition apparatus according to a first embodiment of the present invention;
FIG. 2 is a diagram for explaining a calculation method of correct answer rate data in the first embodiment;
FIG. 3 is a schematic configuration diagram of a speech recognition apparatus according to a third embodiment of the present invention;
FIG. 4 is a schematic configuration diagram of a speech recognition apparatus according to a fourth embodiment of the present invention;
FIG. 5 is a diagram for explaining classification of groups of single syllables in the fifth embodiment of the present invention;
FIG. 6 is a schematic configuration diagram of a conventional speech recognition apparatus.
[Explanation of symbols]
1 Video input unit
2 Video processing unit
3 Voice input unit
4 Voice processing unit
5 Voice recognition unit
6 Video processing unit correct answer rate data holding unit
7 Voice processing unit correct answer rate data holding unit
8 Utterance section detection unit
9 Sound pressure level detection unit
10 Second video input unit
11 Second video processing unit
12 Second video processing unit correct answer rate data holding unit

Claims

Video input means for inputting the video data of the speaker including the lips;
Voice input means for inputting the voice data of the speaker;
Video processing means for calculating the similarity between the video standard data of the lips that utter each single syllable and the input video data, and outputting each single syllable of the video standard data and the similarity thereto;
Voice processing means for calculating the similarity between the voice standard data of the voice uttering each single syllable and the input voice data, and outputting each single syllable of the voice standard data and the similarity thereto;
Video standard data correct answer rate holding means for holding, as correct answer rate data for each single syllable, a ratio obtained as follows: video data of various single syllables is input to the video input means in advance; for each input video data, the single syllable of the video standard data having the maximum similarity output from the video processing means is taken as a candidate single syllable; and, of the total number of identical candidate single syllables output from the video processing means, the ratio of the number of candidate single syllables that match the single syllable of the input video data is calculated;
Voice standard data correct answer rate holding means for holding, as correct answer rate data for each single syllable, a ratio obtained as follows: voice data of various single syllables is input to the voice input means in advance; for each input voice data, the single syllable of the voice standard data having the maximum similarity output from the voice processing means is taken as a candidate single syllable; and, of the total number of identical candidate single syllables output from the voice processing means, the ratio of the number of candidate single syllables that match the single syllable of the input voice data is calculated;
Speech recognition means for obtaining, as the overall similarity for each single syllable, the sum of the product of the similarity output from the video processing means and the correct answer rate read from the video standard data correct answer rate holding means and the product of the similarity output from the voice processing means and the correct answer rate read from the voice standard data correct answer rate holding means, and for identifying the single syllable having the largest overall similarity; a speech recognition apparatus comprising the above means.

The speech recognition apparatus according to claim 1, wherein the voice standard data correct answer rate holding means holds, as the correct answer rate data, a plurality of types of correct answer rate data corresponding to signal-to-noise ratios, and the speech recognition means, when obtaining the overall similarity for each single syllable, reads from the voice standard data correct answer rate holding means the correct answer rate data corresponding to the signal-to-noise ratio of the input voice data.

Video input means for inputting the video data of the speaker including the lips;
Voice input means for inputting the voice data of the speaker;
Video processing means in which single syllables having a common utterance mechanism are grouped in advance and which, for each group, calculates the similarity between the video standard data of the lips uttering each single syllable of the group and the input video data, sets the maximum of those similarities as the similarity of the group, and outputs the group identification information and the similarity of the group;
Voice processing means for calculating, for each group, the similarity between the voice standard data of the voice uttering each single syllable of the group and the input voice data, setting the maximum of those similarities as the similarity of the group, and outputting the group identification information and the similarity of the group;
Video standard data correct answer rate holding means for holding, as correct answer rate data for each group, a ratio obtained as follows: video data of various single syllables is input to the video input means in advance; the total number of outputs of each group output from the video processing means in response to the input video data is counted; and, of that total number of outputs, the ratio of the number of correct outputs, in which the input single syllable is included in the group, is calculated;
Voice standard data correct answer rate holding means for holding, as correct answer rate data for each group, a ratio obtained as follows: voice data of various single syllables is input to the voice input means in advance; the total number of outputs of each group output from the voice processing means in response to the input voice data is counted; and, of that total number of outputs, the ratio of the number of correct outputs, in which the input single syllable is included in the group, is calculated;
Speech recognition means for obtaining, as the overall similarity for each group, the sum of the product of the similarity output from the video processing means and the correct answer rate read from the video standard data correct answer rate holding means and the product of the similarity output from the voice processing means and the correct answer rate read from the voice standard data correct answer rate holding means, and for identifying the group having the largest overall similarity; a speech recognition apparatus comprising the above means.