JP2004024863A

JP2004024863A - Lips recognition device and occurrence zone recognition device

Info

Publication number: JP2004024863A
Application number: JP2003158723A
Authority: JP
Inventors: Hidetsugu Maekawa; 前川　英嗣; Tatsumi Watanabe; 渡辺　辰巳; Kazuaki Obara; 小原　和昭; Kazuhiro Kayashima; 萱嶋　一弘; Kenji Matsui; 松井　謙二; Yoshihiko Matsukawa; 松川　善彦
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1994-05-13
Filing date: 2003-06-03
Publication date: 2004-01-29

Abstract

PROBLEM TO BE SOLVED: To provide a game device that can be operated by voice, which is natural to human beings.
SOLUTION: A voice recognition unit 2 recognizes voices, and an utterance period detection unit 4 detects the speaker's utterance period from the movement around the lips of the speaker (the operator of the device) captured by an image input unit 3. An integrated judgment unit 5 extracts only the recognition result for the voice uttered by the speaker, based on the voice recognition result and the detected utterance period. The recognition result is sent to a control unit 6 and used to control an airship 7. This structure prevents misrecognition caused by ambient sounds other than the speaker's voice.
COPYRIGHT: (C)2004, JPO

Description

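[0001]
[Technical Field of the Invention]
The present invention relates to a game device operated by voice, an input device for inputting lip images and voice, and a voice reaction device.
[0002]
[Prior Art]
FIG. 34 shows, as an example of a conventional game device, a game device in which an airship equipped with a radio receiver is operated with a remote controller, held by the operator, that is equipped with a radio transmitter. As shown in FIG. 34, in a conventional game device the object is generally operated with a joystick 161 provided on the remote controller. When the operator moves the joystick 161, its angles are detected by angle detection units 162 and 163, converted into electric signals, and output to a control unit 164. Based on these electric signals, the control unit 164 outputs a radio control signal for controlling the movement of the airship 7 in accordance with the angle of the joystick 161.
[0003]
[Problems to Be Solved by the Invention]
However, because the conventional game device is operated with the joystick 161, the operation is not natural for a human being. As a result, it takes time to become proficient, and the operator is slow to react when a quick response is needed. There are also game devices that operate a balloon equipped with a drive unit instead of an airship, but since the balloon's movement is controlled in the same manner as described above, the movement becomes lifeless and the warmth peculiar to a balloon is diminished.
[0004]
Devices that recognize speech from an input image of the operator's lips have also been proposed, but such devices require sophisticated optical lenses, which makes the device itself large and also expensive.
[0005]
The present invention has been made in view of these circumstances. Its objects are: (1) to provide, at low cost and with a simple configuration, a game device that can be operated by natural voice, requires no practice to operate, and can be used under noisy conditions or in situations where it is difficult to speak aloud, as well as by persons with speech impairments; (2) to provide an input device capable of inputting the operator's lip movement and voice with a simple configuration; (3) to provide a voice selection device that, for the same input voice, outputs as voice a word selected at random from a plurality of words; (4) to provide a game device or toy that can be made to behave naturally by voice, and a voice recognition device used therein; and (5) to provide a voice reaction device whose behavior can be changed in accordance with the input voice.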
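[0006]
[Means for Solving the Problems]
A game device of the present invention comprises: voice input means for inputting at least one voice including a voice uttered by an operator, converting the input voice into a first electric signal, and outputting the first electric signal; voice recognition means for recognizing the at least one voice based on the first electric signal output from the voice input means; image input means for optically detecting the movement of the operator's lips, converting the detected lip movement into a second electric signal, and outputting the second electric signal; utterance period detection means for receiving the second electric signal and determining, based on the received second electric signal, the period during which the voice is being uttered by the operator; integrated judgment means for extracting the voice uttered by the operator from the at least one voice, based on the at least one voice recognized by the voice recognition means and the period determined by the utterance period detection means; and control means for controlling an object based on the voice extracted by the integrated judgment means. The above objects are thereby achieved.
[0007]
The utterance period detection means may comprise differentiating means for detecting the degree of change of the second electric signal output from the image input means, and means for judging that the corresponding voice was uttered by the operator when the degree of change detected by the differentiating means exceeds a predetermined value.
[0008]
The integrated judgment means may comprise: means for creating an evaluation period by adding a period of predetermined length to the period determined by the utterance period detection means; means for detecting the recognition result output time at which the at least one voice recognized by the voice recognition means was output from the voice recognition means; and means for comparing the recognition result output time with the evaluation period and judging that, of the at least one voice, a voice whose recognition result output time falls within the evaluation period is the voice uttered by the operator.
[0009]
Another game device of the present invention comprises: image input means for optically inputting the movement of an operator's lips, converting the input lip movement into an electric signal, and outputting the electric signal; lip recognition means for determining the lip movement based on the electric signal, recognizing the word corresponding to the determined lip movement, and outputting a recognition result; and control means for controlling an object in accordance with a control signal based on the recognition result. The above objects are thereby achieved.
[0010]
The lip recognition means may comprise storage means storing a predetermined number of words, and matching means for selecting one of the predetermined number of words in accordance with the determined lip movement and judging the selected word to be the word corresponding to the lip movement.
[0011]
The storage means may store the lip movements corresponding to the predetermined number of words as standard patterns, and the matching means may calculate, for every standard pattern, its distance from the determined lip movement and select the word corresponding to the standard pattern with the smallest distance.
[0012]
The game device may further comprise: voice input means for inputting a voice, converting the voice into another electric signal, and outputting the other electric signal; voice recognition means for recognizing the voice based on the other electric signal output from the voice input means; and integrated judgment means for outputting the control signal to be supplied to the control means, based on both the recognition result of the voice recognition means and the recognition result of the lip recognition means.
[0013]
The game device may have means for determining a voice recognition reliability for the recognition result of the voice recognition means and means for determining a lip recognition reliability for the recognition result of the lip recognition means, and the integrated judgment means may select, based on the voice recognition reliability and the lip recognition reliability, one of the recognition result of the voice recognition means and the recognition result of the lip recognition means, and output it as the control signal.
[0014]
The image input means may have light emitting means for emitting light, and light receiving means for receiving the light reflected by the operator's lips and converting the received light into the second electric signal.
[0015]
The image input means may have light emitting means for emitting light, and light receiving means for receiving the light reflected by the operator's lips and converting the received light into the electric signal.
[0016]
The image input means may have light emitting means for emitting light, and light receiving means for receiving the light reflected by the operator's lips and converting the received light into the electric signal.
[0017]
The light may be directed at the lips from the side.
[0018]
The light may be directed at the lips from the front.
[0019]
The voice input means may have at least one microphone.
The voice input means may have at least one microphone, and the at least one microphone and the light emitting means and light receiving means of the image input means may be provided on a single base.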
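[0020]
An input device of the present invention comprises: a headphone-like headset; a support rod joined at one end to the headset; and a base joined to the other end of the support rod, on which are provided at least one light emitting element that generates light directed at the operator's lips and at least one light receiving element that receives the light reflected by the lips. The above objects are thereby achieved.
[0021]
Voice input means for inputting voice may be provided on the base.
[0022]
A voice selection device of the present invention comprises: first storage means storing a plurality of tables, each of which contains a plurality of words that can be output in response to one input; second storage means storing one of the plurality of tables; selection means for selecting, in response to an external input, one word from the plurality of words contained in the table stored in the second storage means and outputting the selected word as voice; and transition means for updating the table stored in the second storage means to another table, chosen from the plurality of tables stored in the first storage means in accordance with the selected word. The above objects are thereby achieved.
[0023]
The voice selection device may further comprise means for generating a random number, and the selection means may use the random number to select the one word from the plurality of words.
[0024]
Another voice selection device of the present invention comprises: storage means storing a table containing a plurality of words that can be output in response to one input; selection means for receiving an external input, selecting one word from the plurality of words contained in the table stored in the storage means by using a random number, and outputting it as voice; and means for generating the random number. The above objects are thereby achieved.
[0025]
A voice reaction device of the present invention comprises the voice selection device described above and voice recognition means for inputting a voice, recognizing the voice, and supplying the recognition result to the voice selection device. The above objects are thereby achieved.
[0026]
Another game device of the present invention comprises the voice reaction device described above. The above objects are thereby achieved.
[0027]
Another game device of the present invention comprises a plurality of the voice reaction devices described above, so that the voice reaction devices converse with one another. The above objects are thereby achieved.
Another game device of the present invention comprises: a plurality of voice input units for converting input voice into electric signals, each corresponding to a different direction; and direction detection means for determining the energy of the electric signal for each of the plurality of voice input units, identifying the one voice input unit whose energy is largest, and judging the direction corresponding to the identified voice input unit to be the direction from which the voice was uttered. The above objects are thereby achieved.
[0028]
The game device may further comprise operating means for moving an object, and control means for controlling the operating means so as to change the direction in which the object moves to the judged direction.
[0029]
The game device may further comprise: direction selection means having measuring means for measuring the current direction of the object's movement and means for receiving the judged direction, determining a target direction based on the current direction and the judged direction, and storing the target direction; and operating means for moving the object. The direction selection means may use the difference between the target direction and the current direction to control the operating means so that the current direction of the object's movement substantially coincides with the target direction.
[0030]
Another game device of the present invention comprises direction selection means having input means for inputting a relative direction by voice, measuring means for measuring the current direction of an object, and means for determining a target direction based on the current direction and the input relative direction and storing the target direction. The direction selection means uses the difference between the target direction and the current direction to control the object so that the current direction of the object substantially coincides with the target direction. The above objects are thereby achieved.
[0031]
The input means may have an input unit to which the voice is input and a recognition unit that recognizes the relative direction based on the input voice.
[0032]
Another game device of the present invention comprises direction selection means having input means for inputting an absolute direction by voice, means for determining a target direction based on the absolute direction and storing the target direction, and measuring means for measuring the current direction of an object. The direction selection means uses the difference between the target direction and the current direction to control the object so that the current direction of the object substantially coincides with the target direction. The above objects are thereby achieved.
[0033]
The input means may have an input unit to which the voice is input and a recognition unit that recognizes the absolute direction based on the input voice.
[0034]
A voice recognition device of the present invention comprises: first detection means for receiving an electric signal corresponding to a voice and detecting, from the electric signal, the voice end point, which is the time at which input of the voice ended; second detection means for determining, based on the electric signal, the utterance period, which is the portion of the period in which the voice was input during which the voice was actually uttered; feature extraction means for creating a feature vector based on the portion of the electric signal within the utterance period; storage means storing feature vectors of a plurality of candidate voices created in advance; and means for recognizing the input voice by comparing the feature vector from the feature extraction means with each of the feature vectors of the candidate voices stored in the storage means. The above objects are thereby achieved.
[0035]
The first detection means may comprise means for dividing the electric signal into a plurality of frames each having a predetermined length, calculation means for determining the energy of the electric signal for each of the frames, and determination means for determining the voice end point based on the variance of the energy.
[0036]
The determination means may determine the voice end point by comparing the variance of the energy with a predetermined threshold, and the voice end point may be the time at which the variance equals the threshold as the variance changes from a value larger than the threshold to a value smaller than the threshold.
[0037]
The determination means may use the variance of the energies of a predetermined number of frames among the energies of the plurality of frames.
[0038]
The second detection means may have: means for smoothing the energy of the electric signal; first circular storage means for sequentially storing, frame by frame, the energy of the electric signal without smoothing; second circular storage means for sequentially storing the smoothed energy frame by frame; threshold calculation means for calculating an utterance period detection threshold using both the unsmoothed energy stored in the first circular storage means and the smoothed energy stored in the second circular storage means at the time the voice end point is detected; and utterance period determination means for determining the utterance period by comparing the unsmoothed energy with the utterance period detection threshold.
[0039]
The threshold calculation means may calculate the utterance period detection threshold using the maximum value of the unsmoothed energy stored in the first circular storage means and the minimum value of the smoothed energy stored in the second circular storage means at the time the voice end point is detected.
[0040]
The feature extraction means may calculate, from the portion of the electric signal within the utterance period, the number of zero crossings per frame of the electric signal, the number of zero crossings per frame of the signal obtained by differentiating the electric signal, and the energy of the electric signal, and use these as elements of the feature vector.
[0041]
Another voice reaction device of the present invention comprises at least one voice recognition device as described above and at least one control means for controlling an object based on the recognition result of the at least one voice recognition device. The above objects are thereby achieved.
[0042]
The voice reaction device may further comprise transmitting means connected to the at least one voice recognition device for transmitting its recognition result, and receiving means connected to the at least one control means for receiving the transmitted recognition result and supplying it to the at least one control means. The at least one control means and the receiving means may be attached to the object, so that the object can be operated remotely.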
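[0043]
The operation of the invention is described below.
In the game device of the present invention, the voice recognition means recognizes the input voice, and the utterance period detection means detects, from the movement of the lips of the speaker (operator), the utterance period during which the speaker is speaking. Based on the voice recognition result and the detected utterance period, the integrated judgment means recognizes the command that the speaker input by voice, and the control means controls the object in accordance with that command. This makes it possible to operate the game by human voice and prevents erroneous operation caused by misrecognizing the voice of someone other than the speaker. In another game device of the present invention, the command is recognized directly from the operator's lip movement, so the game can be operated by speech even under noisy conditions or in situations where it is difficult to speak aloud. This game device can also be used by persons with speech impairments. In yet another game device of the present invention, the integrated judgment means determines the more probable recognition result from both the recognition result of the voice recognition means and the recognition result based on lip movement. In addition to the advantages above, this further increases the reliability of voice-based game operation.
[0044]
In the input device of the present invention, a support rod is attached to a light headset, and an inexpensive light emitting element (for example, an LED) and an inexpensive light receiving element (for example, a photodiode) are mounted on a base attached to the support rod, so a very light and inexpensive input device can be provided. Furthermore, if the headset is made extendable, its length can be adjusted for each operator so as to adjust the positional relationship between the light emitting and light receiving elements and the area around the operator's lips.
[0045]
In the voice selection device of the present invention, when one external input is received, one of the words contained in the table stored in the second storage means is selected and output as voice. The table stored in the second storage means is then replaced by a table chosen, in accordance with this output, from the plurality of tables stored in the first storage means. When the next external input is received, the same operation is repeated. In this way, the voice selection device of the present invention does not merely return one word for one input in a single exchange, but can keep returning words in response to successive inputs. Combining this voice selection device with a voice recognition device yields a voice reaction device that recognizes the word corresponding to an input voice and, in accordance with the recognition result, outputs a randomly chosen word as voice. Providing at least one such voice reaction device in a game device allows the voice reaction device to converse with the operator, and providing a plurality of them makes it possible to construct a game device in which the devices converse with one another. Moreover, because the word to be output for an input is selected using a random number, the output varies instead of always being the same word for the same input.
[0046]
In another game device of the present invention, the direction from which a voice was input is detected using a plurality of voice input units, each corresponding to a different direction. The direction of movement of the object, or the orientation of the object itself, is then changed to the detected direction. In this way the object can be made to act by voice. In yet another game device of the present invention, the direction of movement or orientation of the object is changed while the difference between the direction input by voice and the object's current direction of movement or orientation is detected with a compass.
[0047]
The voice recognition device of the present invention detects, from the electric signal corresponding to the input voice, the point at which voice input ended. Then, from the portion of the electric signal covering the period in which voice was input, it further extracts the period in which the voice was actually uttered. Because the feature vector that is compared with the feature vectors of the candidate voices is created from the electric signal of this utterance period, the voice recognition device of the present invention can recognize voice accurately with a simple configuration. The threshold used to extract the utterance period is calculated from the energy of the electric signal and a smoothed version of that energy, which allows the utterance period to be detected well. Furthermore, a voice reaction device obtained by combining this voice recognition device with means for controlling the action of an object can make the object perform the action corresponding to the input voice.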
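[0048]
[Embodiments of the Invention]
(First Embodiment)
A first embodiment of the game device of the present invention is described below with reference to the drawings. This embodiment is a game device in which an airship is operated with voice commands corresponding to the airship's movements. The voice commands comprise six commands: "mae" (forward), "ushiro" (backward), "migi" (right), "hidari" (left), "ue" (up), and "shita" (down).
[0049]
In this embodiment, a signal representing the movement of the lips of the speaker (the operator of the game device) is input together with the speaker's voice signal, and whether the speaker is speaking is judged from these signals. This prevents malfunctions caused by ambient noise, in particular by the voices of other people.
[0050]
FIG. 1 shows a simplified configuration of the game device of this embodiment. The game device comprises a voice input unit 1 and a voice recognition unit 2 for processing the input voice, and an image input unit 3 and an utterance period detection unit 4 for inputting the lip movement and processing the signal representing it. The voice recognition unit 2 and the utterance period detection unit 4 are both connected to an integrated judgment unit 5, which judges, based on both the input voice and the lip movement, which command the speaker uttered. The judgment result of the integrated judgment unit 5 is input to a control unit 6, which controls an airship 7 accordingly.
[0051]
First, voice including the command uttered by the speaker is input to the voice input unit 1. An ordinary microphone, for example, can be used for voice input. The voice input unit 1 converts the input voice into an electric signal and outputs it to the voice recognition unit 2 as a voice signal 11. The voice recognition unit 2 analyzes the voice signal 11 and outputs the result as a voice recognition result 12. The voice signal 11 can be analyzed by a conventionally known technique such as DP matching.
[0052]
In parallel with the above processing of the input voice, the electric signal representing the lip movement is processed. When the speaker utters a command, the lip movement at that time is input to the image input unit 3. FIG. 2 shows an example configuration of the image input unit 3. The image input unit 3 of this embodiment directs light emitted from an LED 21 at the speaker's lips and detects the light reflected from the lips with a photodiode 22. It thereby outputs an electric signal 13 corresponding to the lip movement. When the speaker's lips move, the level of the electric signal 13 changes in accordance with the changes in shading around the lips. The light from the LED 21 may be directed at the lips from the front or from the side.
[0053]
The electric signal 13 from the image input unit 3 is input to the utterance period detection unit 4. FIG. 3 shows the configuration of the utterance period detection unit 4 of this embodiment. The utterance period detection unit 4 has a differentiating circuit 31 and a period detection unit 32. The differentiating circuit 31 outputs a differential signal 33 representing the degree of change of the input electric signal 13. FIG. 5 shows an example waveform of the differential signal 33, obtained when the speaker uttered the commands "mae" and "ushiro" with the light from the LED 21 directed at the lips from the side. As FIG. 5 shows, the amplitude of the differential signal 33 increases while the speaker is speaking. Also, because the LED light strikes the lips from the side, the waveform reflects the protrusion of the lips when the "u" of "ushiro" is uttered. When the light from the LED 21 is directed at the lips from the front, the light strikes only the speaker's face, which has the advantage that the electric signal 13 and the differential signal 33 are not affected by noise caused by movement in the background.
[0054]
The period detection unit 32 receives the differential signal 33, evaluates its amplitude, and detects the speaker's utterance period. The detection method is described concretely with reference to FIG. 6.
[0055]
When the level of the differential signal 33 exceeds a predetermined amplitude threshold 51, the period detection unit 32 judges that the differential signal 33 was produced by the speaker uttering a command, and treats the period during which the level exceeds the amplitude threshold 51 as an utterance period. In the example of FIG. 6, period 1 and period 2 are utterance periods. Next, the interval between adjacent utterance periods is compared with a predetermined time threshold 52. The time threshold 52 is used to judge whether multiple utterance periods correspond to the same utterance, that is, whether they are continuous. If the interval between utterance periods is within the time threshold 52, the two utterance periods on either side of the interval are judged to be one continuous utterance period. A signal 14 representing the continuous utterance period thus determined is output from the utterance period detection unit 4. Both the amplitude threshold 51 and the time threshold 52 can be set in advance to suitable values.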
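The two-threshold procedure of the period detection unit 32 can be illustrated with a minimal Python sketch. It is not the circuit of FIG. 3: the function name, the use of sample indices as the time unit, and the use of the absolute value as the signal "level" are assumptions made for illustration only.

```python
def detect_utterance_periods(diff_signal, amp_threshold, gap_threshold):
    """Return merged (start, end) index pairs, end exclusive.

    diff_signal   : sampled differential signal (hypothetical list of floats)
    amp_threshold : plays the role of amplitude threshold 51
    gap_threshold : plays the role of time threshold 52, in samples
    """
    # Step 1: collect the runs where the signal level exceeds the amplitude threshold.
    periods = []
    start = None
    for i, value in enumerate(diff_signal):
        active = abs(value) > amp_threshold  # assumption: level = absolute value
        if active and start is None:
            start = i
        elif not active and start is not None:
            periods.append((start, i))
            start = None
    if start is not None:
        periods.append((start, len(diff_signal)))

    # Step 2: merge neighbouring runs whose interval is within the time threshold.
    merged = []
    for begin, end in periods:
        if merged and begin - merged[-1][1] <= gap_threshold:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((begin, end))
    return merged
```

With a gap threshold of 2 samples, two bursts separated by 2 quiet samples are reported as one continuous utterance period, while a burst 5 samples later is reported separately.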
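[0056]
As described above, the utterance period detection unit 4 uses the differential signal 33 to detect the intensity and duration of the lip movement, and thereby determines the period during which the speaker uttered a command.
[0057]
Next, the operation of the integrated judgment unit 5 is described. As shown in FIG. 4, the integrated judgment unit 5 has a voice recognition time determination unit 41, an output determination unit 42, and an output gate 43. The voice recognition time determination unit 41 receives the voice recognition result 12 and informs the output determination unit 42 of the time at which the recognized voice was input to the voice input unit 1. In addition to the output of the voice recognition time determination unit 41, the output determination unit 42 receives the utterance period detection signal 14 from the utterance period detection unit 4. The operation of the output determination unit 42 is described with reference to FIG. 7.
[0058]
The output determination unit 42 first creates an evaluation utterance period 72 by adding an evaluation time threshold 71 before and after the utterance period, based on the received utterance period detection signal 14. It then determines whether the time at which the voice recognition result 12 was output from the voice recognition unit 2 falls within the evaluation utterance period 72. If it does, the voice that was input to the voice input unit 1 and recognized by the voice recognition unit 2 is judged to have been uttered by the speaker. The judgment result is output to the control unit 6 as a signal 15.
[0059]
The time threshold 71 used to create the evaluation utterance period 72 is set in consideration of the time the voice recognition unit 2 needs for recognition processing. This is because the time at which the voice recognition result 12 is output is used as one of the criteria for judging whether the recognized voice was uttered by the speaker.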
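The check performed by the output determination unit 42 amounts to a padded-interval test. The sketch below is only an illustration; the function name, the use of seconds as the time unit, and the tuple representation of the utterance period are assumptions.

```python
def is_operator_voice(recognition_time, utterance_period, margin):
    """Return True if the recognition result was output inside the
    evaluation utterance period.

    recognition_time : time at which the recognition result was output
    utterance_period : (start, end) detected from lip movement
    margin           : plays the role of time threshold 71, added on both sides
    """
    start, end = utterance_period
    return (start - margin) <= recognition_time <= (end + margin)
```

For example, with an utterance period of 0.5 s to 1.0 s and a margin of 0.4 s, a result output at 1.3 s is accepted and a result output at 2.0 s is discarded as someone else's voice.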
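[0060]
When the signal 15 corresponding to the command input by voice is obtained in this way, the control unit 6 controls the airship 7 by outputting a radio control signal corresponding to the input command.
[0061]
As described above, in the first embodiment the utterance period during which the speaker is speaking is detected from the lip movement when the speaker utters a command, and on this basis it is judged whether the recognized voice is the speaker's. This prevents misrecognition caused by utterances of anyone other than the speaker, and the resulting malfunction of the object.
[0062]
A game device operated by voice, which is natural for humans, can therefore be realized. In this embodiment, moreover, the speaker's lip movement is detected with a simple configuration and method, namely a combination of an LED and a photodiode. The device can therefore be realized far more cheaply than conventional devices that capture an image of the speaker's lips with a video camera or the like. Naturally, a phototransistor may be used in place of the photodiode.
[0063]
The circuit configurations of FIGS. 2 and 3 are only examples, and the invention is not limited to them. They can also be implemented in computer software.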
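[0064]
(Second Embodiment)
In the game device of the second embodiment of the present invention, commands are input not by voice but by lip movement alone, and the airship is controlled in accordance with the input command. This allows use under noisy conditions, in situations where one cannot speak aloud, such as in the middle of the night, and by persons with speech impairments.
[0065]
FIG. 8 shows a simplified configuration of the game device of this embodiment. Like the first embodiment, the game device comprises the image input unit 3, the control unit 6, and the airship 7, and further comprises a lip recognition unit 81 that recognizes the words of the speaker (operator) from the lip movement.
[0066]
FIG. 9 shows an example configuration of the lip recognition unit 81. In this embodiment, the lip recognition unit 81 comprises a differentiating circuit 31, a difference calculation unit 91, a database 92, and a pattern matching unit 93. The differentiating circuit 31 is the same as that used in the utterance period detection unit 4 of the first embodiment. The difference calculation unit 91 samples the differential signal 33 from the differentiating circuit 31 at a predetermined time interval and calculates the differences between sampled data. The results are sent from the difference calculation unit 91 to both the database 92 and the pattern matching unit 93. The database 92 holds the difference calculation results of the standard patterns used for recognition. The pattern matching unit 93 determines the distance between the held difference results of the standard patterns and the difference calculation result of the input pattern to be recognized, and recognizes the word input as lip movement based on this distance. Naturally, the smaller the distance, the more reliable the recognition result.
[0067]
The operation of the game device of this embodiment is described in detail below. In this embodiment, because the lip recognition unit 81 recognizes input words by comparing standard patterns with the input pattern as described above, the standard patterns must be registered in the lip recognition unit 81 before recognition is performed.
[0068]
(Registration operation)
First, the image input unit 3 receives the LED light reflected by the speaker's lips and outputs the electric signal 13 corresponding to the lip movement to the lip recognition unit 81. The electric signal 13 is input to the differentiating circuit 31 of the lip recognition unit 81. The differentiating circuit 31 passes the differential signal 33, which represents the degree of change of the electric signal 13, to the difference calculation unit 91. Up to this point the operation is the same as in the first embodiment.
[0069]
The operation of the difference calculation unit 91 is described with reference to FIG. 10. First, the differential signal 33 is sampled at a time interval (Δt), and the differences between adjacent sampled data are calculated. The calculated differences, that is, a series of difference data, are output to the database 92, which holds this difference data sequence. This operation is repeated for the number of words (categories) to be recognized, so that a difference data sequence is stored for every category. The stored difference data sequences are held as the standard patterns used for recognition. In this embodiment, six commands are used to control the object: "mae" (forward), "ushiro" (backward), "migi" (right), "hidari" (left), "ue" (up), and "shita" (down). The storage of a difference data sequence is therefore repeated six times, and six standard patterns are finally held in the database 92.
[0070]
When all standard patterns have been registered in the database 92 in this way, the database 92 examines each difference data sequence and extracts, for each, the length of the period over which data corresponding to lip movement continues. Specifically, for example, if values close to zero continue in a difference data sequence for longer than a predetermined time, that period is judged to correspond to a time when the lips are not moving. Once the length of the period corresponding to lip movement has been extracted for all standard patterns, the standard pattern with the greatest length is selected, and that length is defined as the difference data sequence length (N) of the standard patterns. This completes the registration operation, leaving the difference data sequences of the standard patterns held in the database 92.
[0071]
(Recognition operation)
The operation from input of the lip movement to obtaining the differential signal 33 is exactly the same as in registration. The operation after the differential signal 33 is input to the difference calculation unit 91 is described here with reference to FIG. 11.
[0072]
The differential signal 33 input to the difference calculation unit 91 is sampled at the time interval (Δt), as in registration. Then, for the sampled data within a period whose length equals the standard pattern difference data sequence length (N), the differences between adjacent sampled data are calculated, and the resulting series of difference data is taken as the difference data sequence of that period. The period over which differences are calculated is shifted later in time by Δt at each step. FIG. 11 shows only the difference data sequence for a period 111, which starts at the first sampled datum and has length N, and that for a period 112, which is shifted later in time by N/2 from the period 111.
[0073]
When the difference data sequences of a plurality of periods of length N (hereinafter, recognition difference data sequences) have been obtained, they are sent to the pattern matching unit 93. The pattern matching unit 93 reads the standard patterns from the database 92 and determines, for each of the recognition difference data sequences, its distance from each standard pattern. Since six standard patterns are registered in the database 92 in this embodiment, the pattern matching unit 93 calculates, for each recognition difference data sequence, one distance for each standard pattern.
[0074]
The distance between a recognition difference data sequence and a standard pattern is calculated with the following formula.
$$d_j = \sum_{i=1}^{N} (r_i - p_{ij})^2$$
Here, $r_i$ is the i-th element of the recognition difference data sequence, $p_{ij}$ is the i-th element of the j-th standard pattern (corresponding to the j-th category), and $d_j$ is the distance between the recognition difference data sequence and the j-th standard pattern. When the distance $d_j$ falls to or below a certain value, the pattern matching unit 93 judges that the recognition difference data sequence matches the j-th standard pattern and outputs a signal 82 corresponding to the j-th category (word) as the judgment result.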
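The difference calculation and the distance-based matching can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the dictionary layout of the standard patterns are hypothetical, and returning the closest match when several windows fall under the limit is a choice made here, not something the description specifies.

```python
def difference_sequence(samples):
    """Differences between adjacent sampled values of the differential signal."""
    return [b - a for a, b in zip(samples, samples[1:])]


def squared_distance(r, p):
    """Sum over i of (r_i - p_ij)^2 for one standard pattern j."""
    return sum((ri - pi) ** 2 for ri, pi in zip(r, p))


def match_word(samples, standard_patterns, max_distance):
    """Slide a window of length N over the difference data and compare it
    with every standard pattern.

    standard_patterns : dict mapping a word to its difference sequence of length N
    max_distance      : the "certain value" below which a match is accepted
    Returns (word, distance) for the closest accepted match, or None.
    """
    n = len(next(iter(standard_patterns.values())))
    diffs = difference_sequence(samples)
    best = None
    for start in range(len(diffs) - n + 1):  # shift by one sample (one Δt) per step
        window = diffs[start:start + n]
        for word, pattern in standard_patterns.items():
            d = squared_distance(window, pattern)
            if d <= max_distance and (best is None or d < best[1]):
                best = (word, d)
    return best
```

A usage example with two made-up patterns: `match_word([0, 0, 2, 4, 0, 0], {"mae": [1, -1, 0], "ushiro": [2, 2, -4]}, 1.0)` finds the window `[2, 2, -4]` and reports `"ushiro"` at distance 0.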
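[0075]
This judgment result is input to the control unit 6, which outputs the radio control signal corresponding to the j-th category and controls the airship 7.
[0076]
As described above, in this embodiment the input word (command) is recognized from lip movement alone, and the airship is controlled in accordance with the recognized word. This allows use under noisy conditions, in situations where it is difficult to speak aloud, and by persons with speech impairments.
[0077]
Furthermore, since the image input unit 3 that inputs the lip movement can be realized by combining the LED 21 and the photodiode 22 as in the first embodiment, a far less expensive game device can be provided than with the conventional method of capturing the lip image itself with a video camera or the like.
[0078]
In this embodiment, the user of the game registers the standard patterns used for command recognition before inputting commands. However, standard patterns that can accommodate the lip movements of unspecified users may instead be registered in the database 92 in advance, for example when the game device is manufactured or shipped, so that registration by the user can be omitted.
[0079]
(Third Embodiment)
Next, the game device of the third embodiment of the present invention is described. In this embodiment, commands are input through both voice and the lip movement of the speaker (operator), and the airship is operated by judging the two recognition results together. The command uttered by the speaker can therefore be recognized reliably even under noisy conditions.
[0080]
FIG. 12 shows a simplified configuration of the game device of this embodiment. The game device comprises the voice input unit 1, the image input unit 3, the control unit 6, and the airship 7, configured as in the first embodiment. It further comprises a voice processing unit 121 and a lip processing unit 122. The voice processing unit 121 recognizes the input voice in the same way as the voice recognition unit 2 of the first embodiment and then calculates the reliability of the recognition result. The lip processing unit 122 recognizes the word (command) input as lip movement in the same way as the lip recognition unit 81 of the second embodiment and also calculates the reliability of the recognition result. The outputs of the voice processing unit 121 and the lip processing unit 122 are both input to an integrated judgment unit 123. The integrated judgment unit 123 judges the command input by the speaker from the recognition results and reliabilities supplied by the processing units 121 and 122, and outputs the judgment result.
[0081]
The operation of the game device of this embodiment is described in detail below.
[0082]
As in the first embodiment, the voice input unit 1 receives the voice uttered by the speaker (the operator of the game device) and passes the electric signal 11 corresponding to the input voice to the voice processing unit 121. The voice processing unit 121 receives the electric signal 11 and recognizes the input voice on that basis. Any conventionally known method may be used for voice recognition. Here, for example, in the same way as the method described for the lip recognition unit of the preceding embodiment, data sequences obtained by processing the electric signal 11 produced when each possible command is uttered are registered in advance as standard patterns. The command (voice) input from the voice input unit is then recognized by calculating the distances between the recognition target data sequence, obtained by processing the electric signal 11 produced when the operator of the game device actually utters a command, and all the pre-registered standard patterns. When the voice has been recognized in this way, the voice processing unit 121 determines a reliability indicating how far the recognition result can be trusted and supplies both the voice recognition result and the reliability to the integrated judgment unit 123 as an output 124. How the reliability is determined is described later.
[0083]
In parallel with the processing of the input voice, the signal representing the lip movement is processed. First, the image input unit 3 inputs the speaker's lip movement as in the first embodiment and outputs the electric signal 13, whose level changes with the lip movement. The lip processing unit 122 receives the electric signal 13 and performs the same processing as in the second embodiment. However, when pattern matching between the recognition difference data sequence and the standard patterns judges that the recognition difference data sequence matches the j-th standard pattern, the lip processing unit 122 of this embodiment calculates the reliability of the recognition result based on the distance $d_j$ between that recognition difference data sequence and the j-th standard pattern. The recognition result and reliability thus obtained are both output to the integrated judgment unit 123.
[0084]
Next, the method of calculating the reliability is briefly described. In this embodiment, the reliability of the voice recognition result and that of the recognition result based on lip movement are determined by the same process. The calculation for the voice recognition result is described below. Consider the case where the reliability of the voice recognition result is rated on three levels: "large," "medium," and "small." Note that the rating follows the size of the distance, so the recognition result is most reliable when the rating is "small" and least reliable when the rating is "large." In this case, a threshold $\alpha_L$ separating "small" from "medium" and a threshold $\alpha_H$ separating "medium" from "large" (where $\alpha_L < \alpha_H$) are used, and the distance d between the recognition target and the standard pattern judged to match it is compared with these thresholds. If $d < \alpha_L$, the rating is "small." Likewise, if $\alpha_L \le d < \alpha_H$ or $d \ge \alpha_H$, the rating is "medium" or "large," respectively. For the recognition result based on lip movement, the rating level is determined in the same way by comparison with thresholds. The thresholds used here can be set to suitable values. The method of calculating the reliability is not limited to the one described here; any known method may be used.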
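The three-level rating is a direct threshold comparison and can be written down as a short Python sketch. The function and parameter names are illustrative; the thresholds correspond to α_L and α_H in the text, and their values are left to the caller.

```python
def rate_reliability(distance, alpha_low, alpha_high):
    """Rate a recognition result from its matching distance d.

    "small"  : d < alpha_low               (most trustworthy)
    "medium" : alpha_low <= d < alpha_high
    "large"  : d >= alpha_high             (least trustworthy)
    """
    if not alpha_low < alpha_high:
        raise ValueError("alpha_low must be smaller than alpha_high")
    if distance < alpha_low:
        return "small"
    if distance < alpha_high:
        return "medium"
    return "large"
```

The same function can be applied to the voice result and to the lip result, each with its own pair of thresholds.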
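[0085]
Next, the operation of the integrated judgment unit 123 is described with reference to FIG. 13.
[0086]
FIG. 13 illustrates the concept of the integrated judgment method. First, the integrated judgment unit 123 detects the time at which the voice recognition result was output from the voice processing unit 121 (that is, the time the output 124 was generated) and the time at which the recognition result based on lip movement was output from the lip processing unit 122 (that is, the time the output 125 was generated), and creates evaluation periods 132a and 132b by adding a period corresponding to a predetermined threshold 131 before and after each detected output time. It then determines whether the evaluation period 132a for the lip recognition result and the evaluation period 132b created for the voice recognition result overlap. If they overlap, the integrated judgment unit 123 judges that the voice that was input and recognized was uttered by the operator whose lip movement was input. If they do not overlap, the recognized voice is judged to be ambient noise or an utterance by someone other than the operator. This prevents misrecognition of voices other than the operator's.
[0087]
Next, the integrated judgment unit 123 determines whether the recognition result based on lip movement and that based on voice agree. If they agree, that recognition result is taken as the integrated judgment result (the integrated judgment result "mae" (forward) in FIG. 13). If they do not agree, the integrated judgment result is decided according to the reliabilities determined for the respective recognition results. FIG. 14 shows an example of the correspondence between combinations of reliabilities and the integrated judgment result decided for each combination. In this example, as described above, the reliability of each recognition result is rated on three levels: "large," the least reliable; "small," the most reliable; and "medium," between the two. Part (a) of FIG. 14 shows the correspondence when the voice recognition result is given priority if the reliabilities are equal, and part (b) shows the correspondence when the lip recognition result is given priority. Which recognition result is adopted is decided according to factors such as the environment in which the game device is operated; it may be registered in the game device in advance, or the game device may be configured so that the operator enters it. For example, the voice recognition result is given priority as in (a) when the speaker has no difficulty speaking and the ambient noise is relatively low, whereas (b) is adopted for a speaker with a speech impairment or when the ambient noise is very loud.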
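The integrated judgment can be sketched as follows. The table of FIG. 14 is not reproduced in this text, so the sketch assumes the simplest rule consistent with the description: the result with the more trustworthy rating wins, and the configured priority breaks ties. The function names, tuple layout, and the choice to return `None` when the evaluation periods do not overlap are likewise assumptions.

```python
# Lower rank = more trustworthy, following the "small / medium / large" rating.
RANK = {"small": 0, "medium": 1, "large": 2}


def periods_overlap(time_a, time_b, margin):
    """Periods [t - margin, t + margin] around two output times overlap
    exactly when the times differ by at most 2 * margin."""
    return abs(time_a - time_b) <= 2 * margin


def integrate(voice, lips, margin, prefer="voice"):
    """voice, lips: (word, rating, output_time) tuples (hypothetical layout).

    margin : plays the role of threshold 131
    prefer : "voice" as in FIG. 14(a) or "lips" as in FIG. 14(b)
    Returns the chosen word, or None if the voice is judged not to be the operator's.
    """
    v_word, v_rating, v_time = voice
    l_word, l_rating, l_time = lips
    if not periods_overlap(v_time, l_time, margin):
        return None  # treated as ambient noise or another person's voice
    if v_word == l_word:
        return v_word
    if RANK[v_rating] != RANK[l_rating]:  # assumed rule: more trustworthy rating wins
        return v_word if RANK[v_rating] < RANK[l_rating] else l_word
    return v_word if prefer == "voice" else l_word
```

Passing `prefer="lips"` corresponds to the situation the description gives for (b), such as very loud surroundings.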
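[0088]
The integrated judgment unit 123 outputs the integrated judgment result decided as described above as the signal 15. Finally, the control unit 6 outputs a radio control signal corresponding to the judgment result and controls the airship 7.
[0089]
As described above, according to this embodiment, lip movement is recognized along with the voice signal, and the two results are used together for recognition, so the word (command) uttered by the speaker can be recognized reliably even under noisy conditions. It also has the effect of enabling persons with speech impairments to use a voice-operated game. In addition, as in the first and second embodiments, lip movement is detected with the combination of the LED 21 and the photodiode 22, so the device can be realized far more cheaply than with a method that captures lip images using a video camera or the like.
[0090]
Although a detailed description has been omitted, in this embodiment, as in the second, the user of the game registers the standard patterns for lip recognition. However, standard patterns that can accommodate unspecified speakers may be prepared in advance so that registration by the user can be omitted.
[0091]
The first to third embodiments have been described using the example of a game device that controls the airship 7 with a radio control signal, but the game devices to which the invention can be applied are of course not limited to this. For example, providing as many of the configurations described in any of the above embodiments as there are operators yields a game device that a plurality of operators can play simultaneously.
[0092]
The input device of the present invention is described below. FIG. 15 shows a simplified configuration of the input device. The input device has a headset 154, a support rod 155 attached to it, and a base 153 on which a photodiode 151 and an LED 152 are provided. The base 153 is joined to the support rod 155 at a predetermined angle (see FIG. 15(a)). Adjusting the angle between the base 153 and the support rod 155 changes the direction in which the light emitted by the LED 152 strikes the operator's lips. This input device inputs lip movement by directing the light emitted by the LED 152 at the operator's lips and detecting the reflected light with the photodiode 151. Such an input device can be used, for example, as the image input unit in the first to third embodiments. If a microphone 156 is added to the base 153 (see FIG. 15(b)), the input device can also be used as a voice input device.
[0093]
An input device without a microphone, as shown in FIG. 15(a), can be used as the image input unit of the second embodiment. An input device with a microphone, as shown in FIG. 15(b), can be used as a device serving as both the voice input unit and the image input unit of the first and third embodiments.
[0094]
Because the input device of the present invention uses the photodiode 151, the LED 152, and the microphone 156, which are very small and can be mounted with very little weight, the size and weight of the input device as a whole are very small. All the components used are inexpensive, so it can be realized at low cost. Furthermore, because the input device is fixed to the operator's head by the headset 154, the positions of the photodiode 151 and the LED 152 relative to the lips can be kept substantially constant, so lip movement can be input stably. Also, because the input device inputs lip movement by means of light and converts it into an electric signal for output, it can have a simpler configuration than conventional input devices that are inevitably large and complex, such as devices that input an image rather than the lip movement, or devices that use ultrasound.
[0095]
Although only one photodiode and one LED are mounted here, a plurality of each may be mounted. For example, preparing two LED-photodiode pairs and arranging them in a cross makes it possible to detect the direction of movement in a plane.
[0096]
As described above, according to the present invention, a game device can be obtained that can be operated by voice, which is natural for humans, and that requires no practice to operate. Moreover, since input words (commands) are not recognized from voice alone but lip movement is also used, stable operation is possible even under noisy conditions. Furthermore, because lip movement is captured with a combination of an LED and a photodiode (or phototransistor), the device can be realized at lower cost than when a video camera, ultrasound, or the like is used.
[0097]
Furthermore, as described in the first embodiment, the speaker's utterance period is detected from lip movement and used as a criterion for judging the voice recognition result, so misrecognition caused by utterances of anyone other than the speaker can be prevented. Also, as described in the second and third embodiments, if the input word (command) is recognized from lip movement to control the airship, the device can be used under noisy conditions, in situations where it is difficult to speak aloud, and by persons with speech impairments.
[0098]
In the input device of the present invention, an inexpensive light emitting element (such as an LED) and an inexpensive light receiving element (such as a photodiode) are attached to a light headset, support rod, and base. A very light and inexpensive input device can therefore be realized.
[0099]
The first to third embodiments described examples in which the movement of the object is controlled in accordance with the recognized voice or lip movement. However, the action of the object controlled based on voice or lip movement is not limited to movement; it may, for example, be an action such as answering back with some word. Described below are various devices for making an object perform some action (including movement) in accordance with the recognized voice.
[0100]
Devices for making an object perform some action in accordance with the recognized voice are described in the following embodiments.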
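[0101]
(Fourth Embodiment)
This embodiment describes a device that, in accordance with the recognized voice, selects one output voice from a set of output voices prepared for that voice and outputs it.
[0102]
FIG. 16 shows a simplified configuration of the voice selection device 100 of this embodiment. The voice selection device 100 has a random number generator 101, a selection unit 102, an input/output state memory 103, a state transition unit 104, and an input/output state database 105. A plurality of input/output state tables are stored in advance in the input/output state database 105. Each input/output state table contains an input x in a state s (x is a non-negative integer) and a set sp(x, i) (0 ≤ i < n(s)) of n(s) output voices for the input x. FIG. 17 shows examples of input/output state tables. Initially, the table 201 of the initial state shown in FIG. 17(a) is stored in the input/output state memory 103. The random number generator 101 determines the index i used to select the one voice to be output from the set of output voices.
[0103]
The operation of the voice selection device 100 is described below. When an input x is supplied to the selection unit 102 from outside, the selection unit 102 refers to the input/output state table stored in the input/output state memory 103 and selects the output voice set sp(x, i) corresponding to the input x. The selection unit 102 then has the random number generator 101 determine a random number r(n(s)) (where 0 ≤ r(n(s)) < n(s)), sets i = r(n(s)), and picks one voice from the output voice set sp(x, i). It outputs this voice to the outside.
[0104]
The output of the selection unit 102 is supplied not only to the outside but also to the state transition unit 104. On receiving the output of the selection unit 102, the state transition unit 104 refers to the input/output state database 105 and rewrites the contents of the input/output state memory 103 with the input/output state table for that output. For example, if "Genki?" ("How are you?") is output in the initial state 201, the state transition unit 104 refers to the input/output state database 105 and retrieves the table of the input/output state 202 for the output "Genki?". It then stores the retrieved table of the state 202 in the input/output state memory 103.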
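The selection and state transition described above behave like a small state machine with a random choice at each step. The Python sketch below illustrates this under stated assumptions: the class name, the dictionary layout of the tables, and the contents of state 202 are hypothetical; only the initial-state entry (input 1 answered by "Ohayo" or "Genki?") and the transition from "Genki?" to state 202 follow the description.

```python
import random


class VoiceSelector:
    """Sketch of the voice selection device: tables per state, random pick,
    then a transition decided by the voice that was output."""

    def __init__(self, tables, transitions, initial_state, rng=None):
        self.tables = tables            # state -> {input x -> list of output voices}
        self.transitions = transitions  # (state, output voice) -> next state
        self.state = initial_state      # plays the role of the input/output state memory
        self.rng = rng or random.Random()

    def respond(self, x):
        candidates = self.tables[self.state].get(x)
        if not candidates:
            return None                              # input not registered in this state
        i = self.rng.randrange(len(candidates))      # 0 <= i < n(s)
        voice = candidates[i]
        # Stay in the current state when no transition is defined (an assumption).
        self.state = self.transitions.get((self.state, voice), self.state)
        return voice


# Hypothetical tables: state 201 is the initial state; state 202 is made up.
tables = {
    201: {1: ["Ohayo", "Genki?"]},
    202: {2: ["Yokatta"]},
}
transitions = {(201, "Genki?"): 202}
selector = VoiceSelector(tables, transitions, initial_state=201)
reply = selector.respond(1)  # either "Ohayo" or "Genki?"
```

Omitting the `transitions` mapping (passing an empty dictionary) gives the single-response behavior of the simplified variant without a state transition unit.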
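[0105]
In this way, the voice selection device 100 of this embodiment outputs, for an input voice, a voice chosen using a random number. A simple dialogue system can therefore be built using the voice selection device 100. Also, as shown in FIG. 18, a voice selection device 100a with a simpler configuration, in which the state transition unit 104 and the input/output state database 105 are omitted, can be used to give only a single response to an input voice.
[0106]
The voice selection devices 100 and 100a can be used, as shown in FIG. 27, as the voice selection device 1202 of a voice reaction device 1203 in combination with a voice recognition device 1201. Specifically, when a voice is recognized by the voice recognition device 1201, the recognition result is input to the voice selection device 1202, for example as an identification number assigned to that voice. The voice selection device 1202 takes the input identification number as the input x, randomly selects one voice from the output voice set, and outputs it. This realizes a voice reaction device 1203 that outputs a voice corresponding to an input voice and, moreover, can give a variety of responses to the same input voice. For example, if the voice recognition device 1201 outputs the voice "Ohayo" ("Good morning") as the recognition result while the voice selection device 1202 is in the initial state, the identification number 1 assigned to the voice "Ohayo" is input to the voice selection device 1202 as the input x (see FIG. 17(a)). In response, the voice selection device 1202 randomly picks one voice from the set sp(1, i), which contains the two output voices "Ohayo" and "Genki?" ("How are you?"), and outputs it.
[0107]
In this voice reaction device 1203, the voices that the voice selection device 1202 can accept as input must be registered before actual operation. When a voice not included in the registered voice set is input to the voice selection device 1202, the device may, for example, output the voice "Nani?" ("What?"). If the device of the third embodiment is used as the voice recognition device 1201, then when the reliability of the recognized voice is low, the voice selection device 1202 can also be made to output a voice asking for the voice input to be repeated.
[0108]
As described, the voice selection device of the present invention prepares a plurality of tables representing input/output states and makes the input/output state transition in accordance with the history of past inputs and outputs. Using the voice selection device of the present invention therefore makes it possible to realize a device that conducts a simple dialogue. This voice selection device also has a plurality of candidate output voices for one input and randomly selects and outputs one of them. A voice reaction device is thus obtained that gives varied responses rather than always the same response to one input.
[0109]
(Fifth Embodiment)
Next, the direction detection device and the direction selection device of the present invention are described.
[0110]
First, the direction detection device 400 is described with reference to FIG. 19. The direction detection device 400 has a direction detection unit 401 and a plurality of microphones 402 connected to it; the microphones 402 are attached to the object to be controlled. Here the operation of the direction detection device 400 is described for the case of four microphones. When voice is input from the four microphones m(i) (i = 0, 1, 2, 3), the direction detection unit 401 divides the input voice sp(m(i), t) into frames f(m(i), j) 501 (0 ≤ j), as shown in FIG. 20. The length of one frame is, for example, 16 ms. Next, the direction detection unit 401 determines the energy e(m(i), j) of the voice within each frame and stores the determined energies e(m(i), j) sequentially in a circular memory (not shown) of length l (for example, 100). Each time the energy of one frame is stored, the direction detection unit 401 determines, for each microphone, the sum of the energies of the past l frames and identifies the microphone for which the energy sum is largest. The direction detection unit 401 then compares the largest energy sum with a threshold Th_e determined experimentally in advance; if the largest energy sum is greater than the threshold Th_e, it judges that the direction from the direction detection unit 401 toward that microphone is the direction from which the voice is coming. The number i of the microphone thus identified is output from the direction detection unit 401 as the direction from which the voice was input.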
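The per-microphone energy accumulation can be sketched with one fixed-length buffer per microphone. This is an illustrative sketch only: the class and method names are hypothetical, and frame energy is assumed to be the sum of squared samples, which the description does not define.

```python
from collections import deque


class DirectionDetector:
    """Sketch of the direction detection unit: one circular buffer of frame
    energies per microphone, compared after every new frame."""

    def __init__(self, num_mics, history_len, energy_threshold):
        # deque(maxlen=...) drops the oldest entry automatically, like a circular memory.
        self.history = [deque(maxlen=history_len) for _ in range(num_mics)]
        self.energy_threshold = energy_threshold  # plays the role of Th_e

    def push_frames(self, frames):
        """frames: one list of samples per microphone for the same frame index.
        Returns the index of the loudest microphone, or None if below threshold."""
        for buf, frame in zip(self.history, frames):
            buf.append(sum(s * s for s in frame))  # assumed energy definition
        sums = [sum(buf) for buf in self.history]
        best = max(range(len(sums)), key=sums.__getitem__)
        return best if sums[best] > self.energy_threshold else None
```

With four microphones and a history of 100 frames of 16 ms each, the decision is based on roughly the last 1.6 s of sound.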
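[0111]
If the direction detection device 400 operating in this way is used in combination with, for example, an operating device 1302 as shown in FIG. 28, a voice reaction device 1303 can be constructed that performs a predetermined action in accordance with the direction from which a voice is heard. Specifically, if an operating device 1302 for moving an object (for example, a balloon or a stuffed toy) and a direction detection device 1301 (400 in FIG. 19) are attached to the object, a device can be made that performs a predetermined action toward the direction from which a voice is heard; for example, the object moves toward a person's voice.
[0112]
One example of the operating device 1302 has three propeller-equipped motors attached to the object and a drive unit for these motors, and controls the three motors so that, when the direction in which to move next is input, the object moves in that direction.
[0113]
Next, the direction selection device is described with reference to FIG. 21. The direction selection device 600 has an offset calculation unit 601, a compass 602, and a target direction memory 603, and can be used as a device for controlling the direction in which an object moves or the orientation of the object. When an input x (x is a non-negative integer) indicating the direction in which the object should move next or should face next is supplied, the offset calculation unit 601 outputs an offset corresponding to the input x, based on a table stored in the offset calculation unit 601 in advance. The output offset is added to the actual direction of the object at that moment, as measured by the compass 602, and the result is sent to the target direction memory 603. The target direction memory 603 stores the actual direction from the compass 602 plus the offset as the direction in which the object should move next or face next.
[0114]
In this way, the direction selection device of FIG. 21 is used to change the direction of the object in accordance with the input x, relative to the direction in which the object is currently moving or facing.
[0115]
If the direction selection device 700 of FIG. 22 is used in place of the direction selection device 600 of FIG. 21, the direction of the object can be changed to an absolute direction rather than to a direction relative to the current direction. In the direction selection device 700 of FIG. 22, when a direction calculation unit 701 receives from outside an input x (x is a non-negative integer) indicating an absolute direction (for example, north), it outputs the value corresponding to the input x. The output value is stored as it is in the target direction memory 603 as the target direction. Like the offset calculation unit 601 described above, the direction calculation unit 701 can be realized by holding the absolute direction values for inputs x in a table. After the target direction has been stored in the memory 603 in this way, the direction selection device 700 successively measures, with the compass 602, the current direction of the object as it moves or turns, and outputs the difference between the measured direction and the direction stored in the target direction memory 603. Performing feedback control of the object based on this output makes it possible to move the object in the target absolute direction or turn it to face that direction.
[0116]
If a direction selection device as described above is combined with a voice recognition device and an operating device, a voice reaction device 1402 can be realized, as shown in FIG. 29, in which the orientation or direction of movement of the object changes in accordance with a direction input by voice. In the voice reaction device 1402, the recognition result of the voice recognition device 1201 is the input to a direction selection device 1401, and the output of the direction selection device 1401 is in turn input to the operating device 1302. This makes it possible to control the action of the object while comparing the object's current orientation or direction of movement with the target direction.
[0117]
For example, consider the case where north is 0 degrees, the clockwise (eastward) direction is positive, and the object is currently facing 0 degrees. Assume that the direction selection device 600 described above (see FIG. 21) is used as the direction selection device 1401. If a table in which the word "migi" (right) is associated with +90 degrees is stored in the offset calculation unit 601 of the direction selection device 600, then when the voice indicating the target direction is recognized by the voice recognition device 1201 as the word "migi," the direction selection device 600 sends the operating device 1302 an output instructing it to change the object's orientation or direction of movement by 90 degrees clockwise from the current orientation. While the orientation or direction of movement of the object is changing, the direction selection device 600 continually compares the current direction with the target direction. The operating device 1302 is controlled by the output of the direction selection device 600 so that the object's orientation or direction of movement changes to the target direction.
Alternatively, when the direction selection device 700 of FIG. 22 is used as the direction selection device 1401, words expressing absolute directions such as "kita" (north) or "nansei" (southwest), rather than "migi" (right) or "hidari" (left), are input as the word expressing the target direction. In this case, the direction selection device 700 stores 0 degrees in the target direction memory as the target absolute direction if the input word is "kita" and -135 degrees if it is "nansei," and operates as described above. The target direction here ranges from -180 degrees to +180 degrees.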
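The relative and absolute target directions, and the heading difference used for feedback, reduce to angle arithmetic with wrap-around. The Python sketch below uses the convention from the text (north = 0 degrees, clockwise positive, range -180 to +180). The function names are hypothetical, and the -90 degree entry for "hidari" is an assumption by symmetry; the description gives only +90 for "migi," 0 for "kita," and -135 for "nansei."

```python
def wrap_degrees(angle):
    """Map any angle to the range [-180, 180); +180 is represented as -180."""
    return (angle + 180.0) % 360.0 - 180.0


# Offset table of the relative device; the "hidari" value is assumed.
RELATIVE_OFFSETS = {"migi": 90.0, "hidari": -90.0}
# Table of the absolute device, using the values given in the text.
ABSOLUTE_DIRECTIONS = {"kita": 0.0, "nansei": -135.0}


def target_from_relative(current_heading, word):
    """Relative device: compass reading plus table offset."""
    return wrap_degrees(current_heading + RELATIVE_OFFSETS[word])


def target_from_absolute(word):
    """Absolute device: the table value is the target itself."""
    return ABSOLUTE_DIRECTIONS[word]


def heading_error(target, current_heading):
    """Signed difference fed back to the operating device; positive = turn clockwise."""
    return wrap_degrees(target - current_heading)
```

For instance, an object facing 170 degrees that is told "nansei" gets a heading error of +55 degrees, meaning the shorter way round is clockwise through south.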
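[0118]
The direction detection device and the direction selection device of this embodiment may also be combined with the operating device. In this case, as shown in FIG. 30, the detection result of the direction detection device 1301 is the input to the direction selection device 1401, and the output of the direction selection device 1401 is the input to the operating device 1302. This realizes a voice reaction device 1501 that changes the object's orientation or direction of movement to the direction from which a voice is heard, while comparing the object's current orientation or direction of movement with the target direction.
[0119]
(Sixth Embodiment)
This embodiment describes a device for voice recognition. As shown in FIG. 26, the device has a voice end point detection device 1101, a voice detection device 1102, a feature extraction device 1103, a distance calculation device 1104, and a dictionary 1105.
[0120]
First, the voice end point detection device 1101, which receives a signal corresponding to the input voice and detects the voice end point based on that signal, is described. In this specification, "voice end point" means the time at which voice input ended.
[0121]
The voice end point detection device 1101 of this embodiment is connected to a voice input device such as a microphone. When a voice s(t) is input from the voice input device, the voice end point detection device 1101 divides the input voice s(t) into frames f(i) (i is a non-negative integer), as shown in FIG. 23, and determines the energy e(i) within each frame. In FIG. 23, the voice s(t) is represented by a curve 801 and the energy e(i) by a curve 802. Each time one frame of voice is input, the voice end point detection device 1101 determines the variance of the energies from that frame back to the frame a predetermined number of frames earlier, and compares it with a threshold Th_v determined experimentally in advance. If the comparison shows that the energy variance has crossed the threshold Th_v from above to below, the time of the crossing is judged to be the voice end point.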
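The end point rule (energy variance over the most recent frames crossing a threshold from above to below) can be sketched as follows. The function names are hypothetical, frame energy is assumed to be the sum of squared samples, and the population variance from Python's `statistics` module stands in for whatever variance estimate the device uses.

```python
from collections import deque
from statistics import pvariance


def frame_energies(samples, frame_len):
    """Split the signal into non-overlapping frames and return one energy per frame.
    Energy is assumed here to be the sum of squared samples."""
    return [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_end_point(energies, window, var_threshold):
    """Return the index of the frame at which the variance of the last `window`
    frame energies crosses var_threshold from above to below, or None."""
    recent = deque(maxlen=window)   # the most recent `window` energies
    previous = None
    for i, energy in enumerate(energies):
        recent.append(energy)
        if len(recent) < window:
            continue                # not enough history yet
        variance = pvariance(recent)
        if previous is not None and previous > var_threshold >= variance:
            return i                # downward crossing: voice end point
        previous = variance
    return None
```

In this sketch the first downward crossing is reported; a loud word followed by silence yields an end point a few frames after the energy settles, once the window contains only quiet frames.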
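[0122]
A method of determining the variance from the per-frame energies over a fixed period is described here. The first method uses a circular memory: the energy determined for each frame is stored sequentially in a circular memory 803 of length l. Each time the energy of one frame is determined, the variance is determined by referring, in the circular memory 803, to the energies of the frames going back a fixed period from that frame.
[0123]
There is also a method of determining the energy variance without using a circular memory. In this method, the voice end point detection device 1101 holds the mean m(i-1) and variance v(i-1) for a predetermined number of past frames, and each time an energy e(i) is determined for a new frame, the weighted sum of the newly determined energy e(i) and the past mean energy m(i-1) is taken as the new mean energy m(i), and likewise the weighted sum of the past variance v(i-1) and |e(i) - m(i)| is taken as the new variance v(i). A pseudo-variance of the energy can be obtained in this way. A damping constant α is used for the weighting, and the new mean and variance are determined with the following equations. A value of 1.02 is used for α.
[0124]
[Equation 1]

[0125]
This eliminates the need for a circular memory, which saves memory, and removes work such as computing the sum of the energies within the fixed period each time a new energy is determined, which also shortens processing time.
[0126]
Next, the voice detection device 1102, which extracts the period in which voice was actually uttered, is described. To extract this period, a circular memory 902 for storing smoothed energy is provided separately from the circular memory 803 for storing energy, and, as shown in FIG. 24, each time the energy of one frame is determined, the energy 802 is stored in the memory 803 and the smoothed energy 901 in the memory 902. At the moment the voice end point 903 is determined as described above, the histories of the energy and the smoothed energy remain in these circular memories 803 and 902, and if the length l of the circular memories is made sufficiently long (for example, a length equivalent to 2 seconds), the energy for a whole word can be retained. The voice detection device 1102 then extracts the period in which the voice was uttered using the energy and smoothed energy stored in these memories.
[0127]
The period is extracted by the following procedure. First, a threshold Th is determined as described later. This threshold Th is compared with the energies stored in the circular memory 803 in order from the oldest, and the point at which the energy first exceeds the threshold is taken as the start point of the period in which the voice was uttered. Conversely, going back in time from the voice end point, the point at which the energy first crosses the threshold is taken as the end point of the period in which the voice was uttered. The period in which the voice was uttered is extracted in this way.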
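The forward and backward scans against the threshold can be sketched as below. The threshold Th is taken as a parameter because the equation that defines it is not reproduced in this text; the function name and the list representation of the circular memory (oldest energy first) are assumptions.

```python
def extract_voiced_period(energies, end_point, threshold):
    """Return (start, end) frame indices of the uttered period, both inclusive,
    or None if no frame exceeds the threshold.

    energies  : frame energies, oldest first (contents of the energy memory)
    end_point : index of the detected voice end point
    threshold : the threshold Th, supplied by the caller
    """
    # Forward scan from the oldest frame: first frame above the threshold.
    start = next(
        (i for i in range(end_point + 1) if energies[i] > threshold), None
    )
    if start is None:
        return None
    # Backward scan from the end point: first frame above the threshold.
    end = next(i for i in range(end_point, -1, -1) if energies[i] > threshold)
    return start, end
```

Because both scans stop at the first crossing, short dips below the threshold inside a word do not split the extracted period.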
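[0128]
How the threshold Th is determined is described here. First, the maximum value max 1001 of the energy in the memory 803 and the minimum value min 1002 of the smoothed energy in the memory 902 at the time the voice end point is detected are determined. Using these values, the threshold Th is calculated from the following equation.
[0129]
[Equation 2]

[0130]
Here, a value of about 0.07 was used for β.
[0131]
The median within a fixed window is used here as the method of smoothing the energy. However, the smoothing method is not limited to this; the mean, for example, may be used instead. The reason the maximum of the energy, rather than the maximum of the smoothed energy, is used in determining the threshold Th is that if the maximum of the smoothed energy were used, the maximum would vary greatly when the length of the word varies, the threshold Th would vary with it, and good voice detection would no longer be possible. Also, because the minimum of the smoothed energy is used in calculating the threshold Th, detection of non-voice noise can be prevented.
[0132]
As described above, the voice detection device 1102 extracts the period in which voice is uttered, that is, detects the portion of the input signal that corresponds to voice.
[0133]
Next, the feature extraction device 1103 extracts features for recognition from the detected voice. Like the energy, the features are determined for each frame and stored in a circular memory. Here the feature is a feature vector with three elements: the number of zero crossings of the original signal, the number of zero crossings of the differentiated original signal, and the inter-frame difference of the logarithm of the energy of the original signal.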
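The three-element feature vector can be sketched per frame as follows. The function names are hypothetical; the "differentiated" signal is approximated by first differences of the samples, energy is assumed to be the sum of squares, and a small constant is added before taking the logarithm to avoid log(0). None of these details are specified in the description.

```python
import math


def zero_crossings(x):
    """Count sign changes between neighbouring samples (zero counts as non-negative)."""
    return sum(1 for a, b in zip(x, x[1:]) if (a < 0) != (b < 0))


def frame_features(frame, prev_log_energy):
    """Return ((zc, zc_of_difference, delta_log_energy), log_energy) for one frame.

    prev_log_energy : log energy of the previous frame; the caller passes the
                      returned log_energy on to the next call.
    """
    diff = [b - a for a, b in zip(frame, frame[1:])]             # discrete derivative
    log_energy = math.log(sum(s * s for s in frame) + 1e-12)     # guard against log(0)
    features = (
        zero_crossings(frame),
        zero_crossings(diff),
        log_energy - prev_log_energy,
    )
    return features, log_energy
```

Chaining the calls frame by frame yields the sequence of feature vectors that is later compared with the dictionary entries.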
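[0134]
The feature vector of the voice obtained through the voice end point detection device 1101, the voice detection device 1102, and the feature extraction device 1103 is input to the distance calculation device 1104. The distance calculation device 1104 compares the input feature vector with each of the feature vectors of the plurality of voices registered in advance in the dictionary 1105 and outputs the one with the best score as the recognition result. The comparison may simply take the Euclidean distance between the vectors, or DP matching may be used.
[0135]
The device of this embodiment performs voice recognition as described above. This voice recognition device can be used in combination with the voice selection device 1202 described in the fourth embodiment, as shown in FIG. 27, or in combination with the direction selection device 1401 described in the fifth embodiment and the operating device 1302, as shown in FIG. 29. It can also simply be combined with the operating device 1302 to form a voice reaction device 1601 in which the result of the voice recognition device 1201 is the input to the operating device 1302 and the whole device moves in the desired direction.
[0136]
Furthermore, in those voice reaction devices described in the fourth to sixth embodiments that include the voice recognition device 1201, if a signal transmitting device 1701 is added on the voice recognition device side and a signal receiving device 1702 is added to the voice selection device 1202, direction selection device 1401, or operating device 1302 that follows the voice recognition device in each configuration, the object can be operated remotely with only the voice recognition device held in the hand as a remote control. Infrared or radio can be used for transmitting and receiving the signal.
[0137]
Also, attaching the voice reaction device described above to a balloon makes it possible to converse with the balloon or control it, and thus to make a toy that takes advantage of the warmth peculiar to a balloon.
[0138]
Also, as shown in FIG. 33, if two balloons 1801 are prepared, each fitted with the voice reaction device 1203 comprising the voice recognition device and voice selection device described above, and they are configured so that the two voice reaction devices converse with each other instead of a person speaking to them, a toy that holds a conversation by itself can be made. A plurality of such balloons 1801 fitted with voice reaction devices can also be prepared and made to converse. If each balloon fitted with a voice reaction device is given a rejection function in the voice recognition process, it can respond only to specific words, so that only one balloon responds to a given utterance. For example, each balloon 1801 can be given a name and made to respond only when its name is called. One rejection method is as follows: voice recognition computes the distance to the entries of the internal dictionary, so a threshold is determined experimentally and inputs whose distance exceeds the threshold are rejected. Furthermore, by building a clock into the voice reaction device and having it randomly select and output one voice from the registered set of output voices when a predetermined time has elapsed, a toy can be constructed in which the voice reaction device starts the conversation.
[0139]
The object is not limited to a balloon; it may be a stuffed toy, a doll, a photograph, or a picture. It may also be a moving image on a display. As the object, an anti-gravity device other than a balloon (for example, one that is lifted by propellers like a helicopter, or one that is levitated by magnetic force like a maglev train) may also be used.
[0140]
[Effects of the Invention]
As described above, according to the present invention, a game device can be obtained that can be operated by voice, which is natural for humans, and that requires no practice to operate. Moreover, since input words (commands) are not recognized from voice alone but lip movement is also used, stable operation is possible even under noisy conditions. Furthermore, because lip movement is captured with a combination of an LED and a photodiode (or phototransistor), the device can be realized at lower cost than when a video camera, ultrasound, or the like is used.
[0141]
Furthermore, in the voice recognition device of the present invention, the speaker's utterance period is detected from lip movement and used as a criterion for judging the voice recognition result, so misrecognition caused by utterances of anyone other than the speaker can be prevented. In another voice recognition device of the present invention, the input word (command) is recognized from lip movement to control the airship, so the device can be used under noisy conditions, in situations where it is difficult to speak aloud, and by persons with speech impairments.
[0142]
In the input device of the present invention, an inexpensive light emitting element (such as an LED) and an inexpensive light receiving element (such as a photodiode) are attached to a light headset, support rod, and base. A very light and inexpensive input device can therefore be realized.
[0143]
As described above, the voice selection device of the present invention prepares a plurality of input/output states and makes the state transition according to the history of past inputs and outputs. Using this voice selection device therefore makes it possible to provide a device that conducts a simple dialogue. The voice selection device of the present invention also prepares a plurality of outputs for one input and outputs one selected at random from among them, so it can give varied responses rather than always the same response to one input.
[0144]
The direction detection device of the present invention inputs voice through a plurality of microphones and detects the microphone with the largest energy. The direction from which the voice was uttered can thereby be detected. Furthermore, with the direction selection device of the present invention, the object can be moved accurately in the input direction, or turned to face the input direction, while its current direction is detected with the compass.
[0145]
In the voice recognition device of the present invention, the voice end point detection device first determines a rough end point of the voice, and the voice detection device then determines a threshold automatically. Because the threshold is determined from the maximum of the energy of the input voice and the minimum of the smoothed energy, good voice period extraction is possible regardless of the length of the utterance period. Once the voice detection device has detected the voice using the threshold, features are determined from this voice and voice recognition is performed on that basis.
[0146]
Various voice reaction devices can be obtained by combining the devices described above as appropriate. For example, combining the voice recognition device and the voice selection device yields a voice reaction device that replies when a person speaks to it, making it possible to build a man-machine interface. Combining the direction detection device and the operating device makes it possible to move an object in response to voice, and combining the voice recognition device, the direction selection device, and the operating device makes it possible to move the object accurately in the direction indicated by the content of the voice, or to turn the object toward that direction. Furthermore, if a signal transmitting device is connected to the voice recognition device of a voice reaction device, and a signal receiving device is connected to the device that follows the voice recognition device and attached to the object, a voice reaction device that can be operated remotely can be realized.
[0147]
Furthermore, if a plurality of voice reaction devices as described above are prepared, a toy in which the voice reaction devices converse automatically with one another can be constructed. If each voice reaction device is attached to a balloon, a toy can be made that has the warmth peculiar to a balloon and can also be spoken to. Also, by building in a clock and outputting a suitable voice when a certain time comes, a voice reaction device can be made that speaks on its own initiative rather than waiting to be spoken to.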
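[Brief description of the drawings]
[FIG. 1] Block diagram showing the configuration of the game device of the first embodiment of the present invention.
[FIG. 2] Diagram showing the detailed configuration of the image input unit in the first to third embodiments of the present invention.
[FIG. 3] Diagram showing the detailed configuration of the utterance section detection unit in the first embodiment of the present invention.
[FIG. 4] Block diagram showing the detailed configuration of the integrated determination unit in the first embodiment of the present invention.
[FIG. 5] Graph showing an output example of the differential signal in the first to third embodiments of the present invention.
[FIG. 6] Diagram for explaining the processing operation of the utterance section detection unit of FIG. 3.
[FIG. 7] Diagram for explaining the processing operation of the integrated determination unit of FIG. 4.
[FIG. 8] Block diagram showing the configuration of the game device of the second embodiment of the present invention.
[FIG. 9] Block diagram showing the detailed configuration of the lip recognition unit in the second and third embodiments of the present invention.
[FIG. 10] Diagram showing the processing operation of the differentiation circuit in the second and third embodiments of the present invention.
[FIG. 11] Diagram showing the processing operation of the pattern matching unit in the second and third embodiments of the present invention.
[FIG. 12] Block diagram showing the configuration of the game device of the third embodiment of the present invention.
[FIG. 13] Diagram showing the processing operation of the integrated determination unit in the third embodiment of the present invention.
[FIG. 14] Diagram showing the processing operation of the integrated determination unit in the third embodiment of the present invention.
[FIG. 15] Diagram showing a specific configuration example of the input device of the present invention.
[FIG. 16] Diagram showing the configuration of the voice selection device of the fourth embodiment of the present invention.
[FIG. 17] Diagram showing the input/output states in the voice selection device of FIG. 16.
[FIG. 18] Diagram showing the configuration of a voice selection device according to a modification of the present invention.
[FIG. 19] Diagram showing the configuration of the direction detection device of the fifth embodiment of the present invention.
[FIG. 20] Diagram explaining the waveform of the input voice and the frames.
[FIG. 21] Diagram showing the configuration of the direction selection device of the fifth embodiment of the present invention.
[FIG. 22] Diagram showing the configuration of another direction selection device of the fifth embodiment of the present invention.
[FIG. 23] Diagram explaining the voice waveform, the energy, and the circular memory.
[FIG. 24] Diagram explaining the method of detecting the voice end point in the sixth embodiment of the present invention.
[FIG. 25] Diagram explaining the voice detection method in the sixth embodiment of the present invention.
[FIG. 26] Block diagram showing the configuration of the speech recognition device of the sixth embodiment of the present invention.
[FIG. 27] Diagram showing the configuration of a voice reaction device using the speech recognition device and the voice selection device of the present invention.
[FIG. 28] Diagram showing the configuration of a voice reaction device using the direction detection device and the operation device of the present invention.
[FIG. 29] Diagram showing the configuration of a voice reaction device using the speech recognition device, the direction selection device, and the operation device of the present invention.
[FIG. 30] Diagram showing the configuration of a voice reaction device using the direction detection device, the direction selection device, and the operation device of the present invention.
[FIG. 31] Diagram showing the configuration of a voice reaction device using the speech recognition device and the operation device of the present invention.
[FIG. 32] Diagram showing the configuration of a remotely operable voice reaction device of the present invention.
[FIG. 33] Diagram showing an example of a toy using the voice reaction device of the present invention.
[FIG. 34] Diagram showing the configuration of a conventional game device.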
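[Explanation of symbols]
1 Voice input unit
3 Image input unit
2 Voice recognition unit
4 Utterance section detection unit
5, 123 Integrated determination unit
6 Control unit
7 Airship
21 LED
22 Photodiode
81 Lip recognition unit
100, 100a Voice selection device
101 Random number generation unit
102 Voice selection unit
103 Input/output state memory
104 State transition unit
105 Input/output state database
400, 1301 Direction detection device
401 Direction detection unit
600, 700, 1401 Direction selection device
601 Offset calculation device
602 Azimuth meter
603 Target direction memory
701 Direction calculation device
1101 Voice end point detection device
1102 Voice detection device
1103 Feature quantity extraction device
1104 Distance calculation device
1105 Dictionary
1201 Speech recognition device
1202 Voice selection device
1302 Operation device
1701 Signal transmission device
1702 Signal reception device
[0001]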
[Technical field to which the invention belongs]
The present invention relates to a game device operated by voice, an input device for inputting a lip image and voice, and a voice reaction device.
[0002]
[Prior art]
As an example of a conventional game device, FIG. 34 shows a game device in which an airship equipped with a wireless receiver is operated by a hand-held remote controller with a wireless receiver. As shown in FIG. 34, in a conventional game device the object is generally operated using a joystick 161 provided on the remote controller. When the operator moves the joystick 161, its angle is detected by the angle detection units 162 and 163, converted into electrical signals, and output to the control unit 164. Based on these electrical signals, the control unit 164 outputs a radio control signal for controlling the movement of the airship 7 according to the angle of the joystick 161.
[0003]
[Problems to be solved by the invention]
However, because the conventional game device is operated with the joystick 161, the operation is not natural for humans. As a result, it takes time to become proficient, and the operator is slow to react in situations that call for a quick response. There are also game devices that operate a balloon fitted with a driving device instead of an airship, but because the balloon's movement is controlled in the same way as described above, its movement becomes lifeless and the warmth peculiar to balloons is diminished.
[0004]
A device that recognizes voice from an input image of the operator's lips has also been proposed. However, such a device requires a sophisticated optical lens system, so the device itself becomes large and, moreover, expensive.
[0005]
The present invention has been made in view of this situation, and its objects are to provide: (1) a game device with a low-cost and simple configuration that can be operated by natural voice, requires no operating skill, and can be used under noise, in situations where it is difficult to speak aloud, and by persons with speech impairments; (2) an input device capable of inputting the operator's lip movement and voice with a simple configuration; (3) a voice selection device that outputs, as voice, a word selected at random from a plurality of words for the same input voice; (4) a game device or toy that can be operated naturally by voice, and a speech recognition device used in them; and (5) a voice reaction device that can change its operation according to the input voice.
[0006]
[Means for Solving the Problems]
The game device of the present invention comprises: voice input means for inputting at least one voice, including a voice uttered by an operator, converting the input voice into a first electrical signal, and outputting the first electrical signal; voice recognition means for recognizing the at least one voice based on the first electrical signal output from the voice input means; image input means for optically detecting the movement of the operator's lips, converting the detected lip movement into a second electrical signal, and outputting the second electrical signal; utterance section detection means for receiving the second electrical signal and obtaining, based on the received second electrical signal, the section in which the voice is uttered by the speaker; integrated determination means for extracting the voice uttered by the operator from the at least one voice, based on the at least one voice recognized by the voice recognition means and the section obtained by the utterance section detection means; and control means for controlling an object based on the voice extracted by the integrated determination means. The above object is thereby achieved.
[0007]
The utterance section detection means may include differentiating means for detecting the degree of change of the second electrical signal output from the image input means, and means for determining, when the degree of change detected by the differentiating means exceeds a predetermined value, that the corresponding voice was uttered by the operator.
[0008]
The integrated determination means may include: means for creating an evaluation section by adding a section of predetermined length to the section obtained by the utterance section detection means; means for detecting the recognition result output time at which the at least one voice recognized by the voice recognition means is output from the voice recognition means; and means for comparing the recognition result output time with the evaluation section and determining that a voice whose recognition result output time falls within the evaluation section is the voice uttered by the operator.
[0009]
Another game device of the present invention comprises: image input means for optically inputting the movement of an operator's lips, converting the input lip movement into an electrical signal, and outputting the electrical signal; lip recognition means for obtaining the lip movement based on the electrical signal, recognizing a word corresponding to the obtained lip movement, and outputting a recognition result; and control means for controlling an object according to a control signal based on the recognition result. The above object is thereby achieved.
[0010]
The lip recognition means may include storage means for storing a predetermined number of words, and matching means for selecting one of the predetermined number of words according to the obtained lip movement and determining that the selected word is the word corresponding to the lip movement.
[0011]
The storage means may store the lip movements corresponding to the predetermined number of words as standard patterns, and the matching means may calculate the distance between the obtained lip movement and every standard pattern and select the word corresponding to the standard pattern with the smallest distance.
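The minimum-distance matching described above can be sketched as follows. This is a minimal illustration, assuming each lip movement has already been reduced to a fixed-length numeric vector and that Euclidean distance is used; the patent text here does not specify the distance measure, and the function name `match_word` and the sample patterns are hypothetical.

```python
import math


def match_word(observed, standard_patterns):
    """Return (word, distance) for the standard pattern closest to `observed`.

    observed: fixed-length sequence of numbers describing the lip movement.
    standard_patterns: mapping from word to a pattern of the same length.
    """
    best_word, best_dist = None, math.inf
    for word, pattern in standard_patterns.items():
        dist = math.dist(observed, pattern)  # Euclidean distance (assumption)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word, best_dist


# Made-up standard patterns for two commands.
patterns = {"front": [0.1, 0.8, 0.3], "back": [0.7, 0.2, 0.9]}
word, dist = match_word([0.6, 0.3, 0.8], patterns)
```

Here the observed vector lies closer to the "back" pattern, so that word is selected.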
[0012]
The game device may further comprise: voice input means for inputting a voice, converting the voice into another electrical signal, and outputting the other electrical signal; voice recognition means for recognizing the voice based on the other electrical signal output from the voice input means; and integrated determination means for outputting the control signal to be given to the control means based on both the recognition result of the voice recognition means and the recognition result of the lip recognition means.
[0013]
The game device may include means for obtaining a voice recognition reliability for the recognition result of the voice recognition means and means for obtaining a lip recognition reliability for the recognition result of the lip recognition means, and the integrated determination means may select either the recognition result of the voice recognition means or that of the lip recognition means based on the voice recognition reliability and the lip recognition reliability, and output it as the control signal.
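The reliability-based selection can be sketched as follows. This is only an illustration, assuming both reliabilities are expressed on a comparable numeric scale where larger means more reliable; the tie-breaking rule in favor of voice and the function name `select_result` are assumptions not stated in the text.

```python
def select_result(voice_result, voice_reliability, lip_result, lip_reliability):
    """Pick whichever recognition result has the higher reliability.

    Ties go to the voice result (an arbitrary choice made for this sketch).
    """
    if voice_reliability >= lip_reliability:
        return voice_result
    return lip_result


# In a noisy room the voice recognizer may be unsure while the lips are clear.
command = select_result("up", 0.4, "down", 0.9)
```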
[0014]
The image input means may include light emitting means for emitting light and light receiving means for receiving the light reflected by the lips of the operator and converting the received light into the second electrical signal.
[0015]
The image input means may include light emitting means for emitting light and light receiving means for receiving the light reflected by the lips of the operator and converting the received light into the electrical signal.
[0016]
The image input means may include light emitting means for emitting light and light receiving means for receiving the light reflected by the lips of the operator and converting the received light into the electrical signal.
[0017]
The light may be applied to the lips from the side.
[0018]
The light may be applied to the lips from the front.
[0019]
The voice input means may have at least one microphone.
The voice input means may include at least one microphone, and the at least one microphone and the light emitting means and light receiving means of the image input means may be provided on a single table.
[0020]
An input device according to the present invention comprises a headphone-shaped headset, a column having one end joined to the headset, and a table joined to the other end of the column, the table being provided with at least one light emitting element that emits light onto the lips and at least one light receiving element that receives the light reflected by the lips. The above object is thereby achieved.
[0021]
Voice input means for inputting voice may be provided on the table.
[0022]
The voice selection device of the present invention comprises: first storage means for storing a plurality of tables, each of which includes a plurality of words that can be output in response to one input; second storage means for storing one of the plurality of tables; selection means for, in response to an external input, selecting one word from the plurality of words included in the one table stored in the second storage means and outputting the selected word as voice; and transition means for updating the one table stored in the second storage means to another of the plurality of tables stored in the first storage means, the other table being determined according to the selected word. The above object is thereby achieved.
[0023]
The voice selection device may further include means for generating a random number, and the selection means may select the one word from the plurality of words using the random number.
[0024]
Another voice selection device according to the present invention comprises: storage means for storing a table that includes a plurality of words that can be output in response to one input; selection means for, on receiving an external input, selecting one word using a random number from the plurality of words included in the table stored in the storage means and outputting it as voice; and means for generating the random number. The above object is thereby achieved.
[0025]
The voice reaction device of the present invention comprises the voice selection device described above and voice recognition means for inputting a voice, recognizing the voice, and giving the recognition result to the voice selection device. The above object is thereby achieved.
[0026]
Another game device of the present invention includes the above-described voice reaction device, thereby achieving the above object.
[0027]
Another game device of the present invention includes a plurality of the above-described voice reaction devices, whereby the voice reaction devices interact with each other, thereby achieving the above-described object.
Another game device according to the present invention comprises: a plurality of voice input units, each corresponding to a different direction, that convert input voice into electrical signals; and direction detection means for obtaining the energy of the electrical signal for each of the plurality of voice input units, determining which one of the voice input units has the maximum energy, and determining that the direction corresponding to that voice input unit is the direction from which the voice was uttered. The above object is thereby achieved.
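The maximum-energy rule for direction detection can be sketched as follows. This is a minimal illustration, assuming one frame of samples per microphone and taking energy as the sum of squared samples; the direction labels, the sample values, and the function name `detect_direction` are hypothetical.

```python
def detect_direction(frames_by_direction):
    """Return (direction, energies) where direction has the largest frame energy.

    frames_by_direction: mapping from a direction label to the list of samples
                         captured by the microphone facing that direction.
    """
    energies = {
        direction: sum(s * s for s in samples)
        for direction, samples in frames_by_direction.items()
    }
    loudest = max(energies, key=energies.get)
    return loudest, energies


# Made-up samples from four microphones facing different directions.
frames = {
    "front": [0.1, -0.2, 0.1],
    "right": [0.6, -0.5, 0.4],
    "back": [0.0, 0.1, 0.0],
    "left": [0.2, 0.2, -0.1],
}
direction, energies = detect_direction(frames)
```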
[0028]
The game device may further include operating means for operating an object, and control means for controlling the operating means so as to change the direction in which the object moves to the determined direction.
[0029]
The game device may further comprise: direction selection means having measurement means for measuring the current direction of the object's movement, and means for receiving the determined direction, obtaining a target direction based on the current direction and the determined direction, and storing the target direction; and operating means for operating the object, wherein the direction selection means uses the difference between the target direction and the current direction to control the operating means so that the current direction of the object's movement substantially coincides with the target direction.
[0030]
Another game device of the present invention comprises: input means for inputting a relative direction by voice; and direction selection means having measurement means for measuring the current direction of an object, and means for obtaining a target direction based on the current direction and the input relative direction and storing the target direction, wherein the direction selection means uses the difference between the target direction and the current direction to control the object so that the current direction of the object substantially coincides with the target direction. The above object is thereby achieved.
[0031]
The input means may include an input unit to which the voice is input and a recognition unit that recognizes the relative direction based on the input voice.
[0032]
Another game device of the present invention comprises: input means for inputting an absolute direction by voice; and direction selection means having means for determining a target direction based on the absolute direction and storing the target direction, and measurement means for measuring the current direction of an object, wherein the direction selection means uses the difference between the target direction and the current direction to control the object so that the current direction of the object substantially coincides with the target direction. The above object is thereby achieved.
[0033]
The input means may include an input unit to which the voice is input and a recognition unit that recognizes the absolute direction based on the input voice.
[0034]
The speech recognition device of the present invention comprises: first detection means for receiving an electrical signal corresponding to a voice and detecting, from the electrical signal, a voice end point, which is the time at which input of the voice is completed; second detection means for determining, based on the electrical signal, the utterance section, which is the section in which the voice is actually uttered within the section in which the voice is input; feature quantity extraction means for creating a feature quantity vector based on the portion of the electrical signal in the utterance section; storage means for storing feature quantity vectors of a plurality of candidate voices created in advance; and means for recognizing the input voice by comparing the feature quantity vector from the feature quantity extraction means with each of the feature quantity vectors of the candidate voices stored in the storage means. The above object is thereby achieved.
[0035]
The first detection means may include means for dividing the electrical signal into a plurality of frames each having a predetermined length, calculation means for obtaining the energy of the electrical signal for each of the plurality of frames, and determining means for determining the voice end point based on the variance of the energy.
[0036]
The determining means may determine the voice end point by comparing the variance of the energy with a predetermined threshold, and the voice end point may be the time at which the variance equals the threshold as the variance changes from a value larger than the threshold to a value smaller than the threshold.
[0037]
The determining means may use a variance for the energy of a predetermined number of frames among the energies of the plurality of frames.
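The end point rule described above (frame energies, a variance taken over a fixed number of recent frames, and a downward crossing of a threshold) can be sketched as follows. This is a sketch under stated assumptions: non-overlapping frames, population variance, and made-up values for the window length and threshold. The function names `frame_energies` and `find_end_point` are hypothetical.

```python
from statistics import pvariance


def frame_energies(signal, frame_len):
    """Split `signal` into non-overlapping frames and return each frame's energy."""
    return [
        sum(s * s for s in signal[i:i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def find_end_point(energies, window, threshold):
    """Return the index of the last frame of the first window whose energy
    variance has dropped from above `threshold` to at or below it, or None
    if no such downward crossing occurs."""
    prev_var = None
    for i in range(window, len(energies) + 1):
        var = pvariance(energies[i - window:i])
        if prev_var is not None and prev_var > threshold >= var:
            return i - 1
        prev_var = var
    return None


# Made-up frame energies: silence, a burst of speech, then silence again.
energies = [0.0, 0.0, 0.0, 9.0, 1.0, 8.0, 2.0, 0.0, 0.0, 0.0, 0.0]
end_frame = find_end_point(energies, window=3, threshold=0.5)
```

The variance stays high while the energy fluctuates during speech and collapses once the last speech frame leaves the window, which is where this sketch reports the end point.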
[0038]
The second detection means may include: means for smoothing the energy of the electrical signal; first circular storage means for sequentially storing the unsmoothed energy of the electrical signal for each frame; second circular storage means for sequentially storing the smoothed energy for each frame; threshold calculation means for calculating an utterance section detection threshold using both the unsmoothed energy stored in the first circular storage means and the smoothed energy stored in the second circular storage means at the time the voice end point is detected; and utterance section determination means for determining the utterance section by comparing the unsmoothed energy with the utterance section detection threshold.
[0039]
The threshold calculation means may calculate the utterance section detection threshold using the maximum value of the unsmoothed energy stored in the first circular storage means at the time the voice end point is detected and the minimum value of the smoothed energy stored in the second circular storage means at that time.
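One way to picture this threshold calculation is the sketch below. The text states only that the maximum unsmoothed energy and the minimum smoothed energy are used; the linear interpolation with the factor `alpha`, the first-order smoothing with `beta`, and all function names are assumptions made for this illustration. The circular memories are modeled with `collections.deque(maxlen=...)`.

```python
from collections import deque


def fill_rings(energies, size, beta=0.5):
    """Store raw and smoothed frame energies in two fixed-size circular buffers.

    Smoothing is a simple first-order recursive filter (an assumption).
    """
    raw_ring, smooth_ring = deque(maxlen=size), deque(maxlen=size)
    smoothed = energies[0]
    for e in energies:
        smoothed = beta * smoothed + (1.0 - beta) * e
        raw_ring.append(e)
        smooth_ring.append(smoothed)
    return raw_ring, smooth_ring


def utterance_threshold(raw_ring, smooth_ring, alpha=0.1):
    """Place the threshold a fraction `alpha` of the way from the minimum
    smoothed energy up to the maximum raw energy (hypothetical formula)."""
    e_max = max(raw_ring)
    e_min = min(smooth_ring)
    return e_min + alpha * (e_max - e_min)


def utterance_section(raw_energies, threshold):
    """Return (first, last) frame indices whose raw energy exceeds the threshold."""
    above = [i for i, e in enumerate(raw_energies) if e > threshold]
    return (above[0], above[-1]) if above else None


energies = [1.0, 1.0, 9.0, 21.0, 11.0, 1.0, 1.0]
raw_ring, smooth_ring = fill_rings(energies, size=8)
threshold = utterance_threshold(raw_ring, smooth_ring)
section = utterance_section(list(raw_ring), threshold)
```

Because the threshold is derived from the recorded energies themselves, it scales with how loud the utterance was and how quiet the background is, which matches the stated aim of working regardless of utterance length.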
[0040]
The feature quantity extraction means may calculate, from the portion of the electrical signal in the utterance section, the number of zero crossings per frame of the electrical signal, the number of zero crossings per frame of the signal obtained by differentiating the electrical signal, and the energy of the electrical signal, and use them as elements of the feature quantity vector.
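The three per-frame features named above can be sketched as follows. This is a minimal illustration, assuming the differentiated signal is approximated by the first difference of the samples and that a zero crossing is counted whenever consecutive samples differ in sign (with zero treated as non-negative); the function names are hypothetical.

```python
def zero_crossings(samples):
    """Count sign changes between consecutive samples (zero counts as non-negative)."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))


def frame_features(frame):
    """Return [zero crossings, zero crossings of the first difference, energy]."""
    diff = [b - a for a, b in zip(frame, frame[1:])]  # discrete derivative
    energy = sum(s * s for s in frame)
    return [zero_crossings(frame), zero_crossings(diff), energy]


features = frame_features([1, -1, 2, -2, 1])
```

Concatenating such triples over the frames of the utterance section would give one possible feature quantity vector for comparison with the dictionary.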
[0041]
Another speech reaction device of the present invention comprises at least one speech recognition device as described above and at least one control means for controlling an object based on a recognition result of the at least one speech recognition device, This achieves the above objective.
[0042]
The voice reaction device may further comprise transmission means, connected to the at least one speech recognition device, for transmitting the recognition result of the at least one speech recognition device, and receiving means, connected to the at least one control means, for receiving the transmitted recognition result and providing it to the at least one control means, the at least one control means and the receiving means being attached to the object so that the object can be remotely controlled.
[0043]
The operation will be described below.
In the game device of the present invention, the voice recognition means recognizes the input voice, and the utterance section detection means detects, from the movement of the lips of the speaker (operator), the utterance section, that is, the section in which the speaker is speaking. Based on the voice recognition result and the detected utterance section, the integrated determination means recognizes the command that the speaker input by voice, and the control means controls the object according to the command. The game can thus be operated by human voice, and erroneous operation caused by misrecognizing the voice of someone other than the speaker can be prevented. In another game device of the present invention, the command is recognized directly from the movement of the operator's lips, so the game can be operated even under noise or in situations where it is difficult to speak aloud. This game device can also be used by persons with speech impairments. In still another game device of the present invention, the integrated determination means determines the more probable recognition result from both the recognition result of the voice recognition means and the recognition result based on the lip movement. In addition to the advantages described above, this further increases the reliability of game operation by voice.
[0044]
In the input device of the present invention, a column is attached to a light headset, and an inexpensive light emitting element (such as an LED) and an inexpensive light receiving element (such as a photodiode) are mounted on a table attached to the column, so a light and inexpensive input device can be provided. Furthermore, if the headset is made extendable, its length can be adjusted for each operator of the input device, so that the positions of the light emitting element and the light receiving element relative to the area around the operator's lips can be adjusted.
[0045]
In the voice selection device of the present invention, when an input is received from the outside, one of the words included in the table stored in the second storage means is selected and output as voice. The table stored in the second storage means is then replaced with a table selected, according to this output, from the plurality of tables stored in the first storage means. When the next input is received from the outside, the same operation is repeated. In this way, the voice selection device of the present invention does not merely return one word for one input; it can also return words in response to inputs given one after another. Combining this voice selection device with a speech recognition device yields a voice reaction device that recognizes the word corresponding to the input voice and outputs, as voice, a word selected at random according to the recognition result. If at least one such voice reaction device is provided in a game device, the voice reaction device can carry on a dialogue with the operator, and if a plurality of voice reaction devices are provided, a game device in which the devices converse with one another can be formed. Furthermore, by using a random number to select the word to be output for a given input, the device can vary its output instead of always returning the same word for the same input.
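The table-switching behavior described above can be sketched as a small state machine. This is an illustration under stated assumptions: each table maps a recognized input to candidate pairs of (output word, name of the next table), which is one possible encoding of "a table selected according to this output". The class name `VoiceSelector`, the table names, and the example phrases are all hypothetical.

```python
import random


class VoiceSelector:
    """Pick a random reply from the current table, then switch tables."""

    def __init__(self, tables, initial_table, rng=None):
        self.tables = tables          # plays the role of the first storage means
        self.current = initial_table  # plays the role of the second storage means
        self.rng = rng or random.Random()

    def respond(self, recognized_input):
        # Candidate replies for this input in the current table.
        candidates = self.tables[self.current][recognized_input]
        word, next_table = self.rng.choice(candidates)
        self.current = next_table     # transition determined by the chosen output
        return word


tables = {
    "greeting": {"hello": [("hi", "chat"), ("good day", "chat")]},
    "chat": {
        "hello": [("we already said hello", "chat")],
        "bye": [("see you", "greeting")],
    },
}
selector = VoiceSelector(tables, "greeting", rng=random.Random(0))
first_reply = selector.respond("hello")
second_reply = selector.respond("bye")
```

An input that the current table does not list raises `KeyError` in this sketch; a real device would need some fallback, which the text here does not describe.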
[0046]
In another game device of the present invention, the direction from which a voice is input is detected using a plurality of voice input units, each corresponding to a different direction. The direction of movement of the object, or the orientation of the object itself, is then changed to the detected direction. In this way, the object can be moved by voice. In still another game device of the present invention, the direction of movement or the orientation of the object is changed while the difference between the direction input by voice and the current direction of movement or orientation of the object is detected using an azimuth meter.
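The use of the difference between a target direction and the current direction can be sketched as follows. This is a minimal illustration, assuming headings are measured in degrees by the azimuth meter and that a relative command such as "right" maps to a fixed turn angle; the function names and the 90-degree figure are hypothetical.

```python
def wrap_degrees(angle):
    """Wrap an angle in degrees to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def target_from_relative(current_heading, relative_turn):
    """Target heading (0 to 360 degrees) for a turn relative to the current heading."""
    return (current_heading + relative_turn) % 360.0


def steering_error(target_heading, current_heading):
    """Signed difference to steer by; positive means turn clockwise."""
    return wrap_degrees(target_heading - current_heading)


# Hypothetical "right" command interpreted as a 90-degree clockwise turn.
target = target_from_relative(350.0, 90.0)
error = steering_error(target, 350.0)
```

The target is computed once and stored; the error is then re-evaluated as the azimuth meter reading changes, and the object is driven until the error is close to zero. Wrapping keeps the turn on the shorter side when the heading passes through north.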
[0047]
The speech recognition device of the present invention first detects, from the electrical signal corresponding to the input voice, the point at which input of the voice is completed. It then extracts, from the electrical signal for the section in which the voice was input, the section in which the voice was actually uttered. Because the feature quantity vector that is compared with the feature quantity vectors of the candidate voices is created from the electrical signal for the section in which the voice was uttered, the speech recognition device of the present invention can recognize voice with a simple configuration and high accuracy. Further, the threshold used for extracting the section in which the voice was uttered is calculated based on the energy of the electrical signal and the smoothed energy, so that this section can be detected satisfactorily. Furthermore, in a voice reaction device obtained by combining this speech recognition device with means for controlling the action of an object, the object can be made to perform an action corresponding to the input voice.
[0048]
DETAILED DESCRIPTION OF THE INVENTION
(First embodiment)
A first embodiment of the game apparatus of the present invention will be described below with reference to the drawings. The present embodiment is a game device that operates an airship with a voice command corresponding to the movement of the airship. The voice command includes six commands of “front”, “back”, “right”, “left”, “up”, and “down”.
[0049]
In the present embodiment, a signal representing the movement of the speaker's lips is input together with the voice signal of the speaker (the operator of the game device), and whether or not the speaker is speaking is determined based on these signals. This makes it possible to prevent malfunctions caused by ambient noise, particularly voices uttered by other people.
[0050]
FIG. 1 schematically shows the configuration of the game device of this embodiment. The game device includes a voice input unit 1 and a voice recognition unit 2 for processing the input voice, and an image input unit 3 and an utterance section detection unit 4 for inputting the movement of the lips and processing the signal representing that movement. Both the voice recognition unit 2 and the utterance section detection unit 4 are connected to the integrated determination unit 5, which determines, based on both the input voice and the movement of the lips, what command the speaker has uttered. The determination result of the integrated determination unit 5 is input to the control unit 6, and the control unit 6 controls the airship 7 based on the determination result.
[0051]
First, a voice including a command uttered by a speaker is input to the voice input unit 1. For example, a normal microphone can be used for inputting voice. The voice input unit 1 converts the inputted voice into an electrical signal, and outputs this as a voice signal 11 to the voice recognition unit 2. The voice recognition unit 2 analyzes the voice signal 11 and outputs the result as a voice recognition result 12. The analysis of the audio signal 11 can be performed by a conventionally known method such as DP matching.
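DP matching is commonly understood as dynamic-programming alignment of two sequences of different lengths, often called dynamic time warping. The sketch below shows the basic recurrence on one-dimensional sequences; it is a generic textbook form, not the specific analysis used by the voice recognition unit 2, and the function name `dp_matching_distance` is hypothetical.

```python
def dp_matching_distance(a, b):
    """Accumulated absolute difference along the best monotonic alignment of a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same contour spoken at a different speed still aligns with zero cost.
same_shape = dp_matching_distance([1, 2, 3], [1, 2, 2, 3])
```

A recognizer built this way would compare the input against one stored template per command and pick the command with the smallest accumulated distance.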
[0052]
In parallel with the processing of the input voice described above, processing of an electric signal representing the movement of the lips is performed. When the speaker utters a command, the movement of the lips at that time is input to the image input unit 3. FIG. 2 shows a configuration example of the image input unit 3. The image input unit 3 of this embodiment irradiates the speaker's lip with light emitted from the LED 21, and detects the light reflected by the lip by the photodiode 22. Thereby, the electric signal 13 according to the movement of the lips is output. When the speaker's lips are moving, the level of the electric signal 13 changes according to the change in the shadow near the speaker's lips. The speaker's lips may be irradiated with light from the LED 21 from the front or from the side.
[0053]
The electrical signal 13 from the image input unit 3 is input to the utterance section detection unit 4. FIG. 3 shows the configuration of the utterance section detection unit 4 of the present embodiment. The utterance section detection unit 4 includes a differentiation circuit 31 and a section detection unit 32. The differentiation circuit 31 outputs a differential signal 33 indicating the degree of change of the input electrical signal 13. An example of the waveform of the differential signal 33 is shown in FIG. 5, which shows the differential signal 33 obtained when the speaker utters the commands "front" and "back" with the light from the LED 21 applied to the speaker's lips from the side. As can be seen from FIG. 5, the amplitude of the differential signal 33 increases while the speaker is speaking. Because the LED light is applied to the speaker's lips from the side, the waveform for the command "back" also reflects the way the lips are pushed forward. When the light from the LED 21 is applied to the speaker's lips from the front, the light strikes only the speaker's face, which has the advantage that the electrical signal 13 and the differential signal 33 are not affected by noise caused by movement in the background.
[0054]
The section detector 32 receives the differential signal 33, determines the magnitude of the amplitude of the differential signal 33, and detects the speaker's utterance section. A specific method for detecting a speech section will be described with reference to FIG.
[0055]
When the level of the differential signal 33 exceeds a predetermined amplitude threshold value 51, the section detection unit 32 determines that the differential signal 33 is generated when the speaker utters a command, and the level of the differential signal 33 is A section that exceeds the amplitude threshold value 51 is defined as a speech section. In the example shown in FIG. 6, section 1 and section 2 are utterance sections. Subsequently, the interval between adjacent utterance sections is compared with a predetermined time threshold value 52. The time threshold value 52 is a value used to determine whether or not a plurality of utterance sections correspond to the same utterance, that is, whether or not a plurality of utterance sections are continuous. If the interval of the utterance section is within the time threshold value 52, it is determined that the two utterance sections sandwiching the interval are continuous utterance sections. A signal 14 representing the continuous utterance intervals determined in this way is output from the utterance interval detector 4. Note that both the amplitude threshold value 51 and the time threshold value 52 can be set to appropriate values in advance.
[0056]
As described above, the utterance section detection unit 4 uses the differential signal 33 to detect the intensity and duration of the lip movement, thereby obtaining the section in which the speaker uttered the command.
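The two-threshold procedure described above can be sketched as follows. This is a minimal illustration, assuming the differential signal is available as a list of samples, that the amplitude test is applied to the absolute value, and that the time threshold is expressed as a number of samples; the function name `detect_utterance_sections` and the sample values are hypothetical.

```python
def detect_utterance_sections(diff_signal, amp_threshold, time_threshold):
    """Return merged (start, end) index pairs where the signal indicates speech.

    A sample belongs to a section when its magnitude exceeds amp_threshold.
    Adjacent sections separated by a gap of at most time_threshold samples
    are treated as one continuous utterance section.
    """
    sections, start = [], None
    for i, value in enumerate(diff_signal):
        if abs(value) > amp_threshold:
            if start is None:
                start = i
        elif start is not None:
            sections.append((start, i - 1))
            start = None
    if start is not None:
        sections.append((start, len(diff_signal) - 1))

    merged = []
    for s, e in sections:
        if merged and s - merged[-1][1] - 1 <= time_threshold:
            merged[-1] = (merged[-1][0], e)  # gap is short: same utterance
        else:
            merged.append((s, e))
    return merged


signal = [0, 5, 6, 0, 0, 7, 0, 0, 0, 0, 0, 8, 0]
sections = detect_utterance_sections(signal, amp_threshold=3, time_threshold=2)
```

In this made-up signal the first two bursts are two samples apart and merge into one section, while the last burst is five samples away and remains a separate utterance.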
[0057]
Next, the operation of the integrated determination unit 5 will be described. As illustrated in FIG. 4, the integrated determination unit 5 includes a voice recognition time determination unit 41, an output determination unit 42, and an output gate 43. The voice recognition time determination unit 41 receives the voice recognition result 12 and informs the output determination unit 42 of the time when the recognized voice is input to the voice input unit 1. In addition to the output from the speech recognition time determination unit 41, the output determination unit 42 receives the utterance interval detection signal 14 from the utterance interval detection unit 4. Here, the operation of the output determination unit 42 will be described with reference to FIG.
[0058]
The output determination unit 42 first creates an evaluation utterance section 72 by adding an evaluation time threshold 71 before and after the utterance section based on the received utterance section detection signal 14. Next, it is determined whether or not the time when the speech recognition result 12 is output from the speech recognition unit 2 is within the evaluation utterance section 72. When it is within the range, it is determined that the voice input to the voice input unit 1 and recognized by the voice recognition unit 2 is uttered by the speaker. The result of the determination is output as a signal 15 to the control unit 6.
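As a rough sketch of this determination, the check below widens each utterance section by the evaluation time threshold and tests whether the recognition output time falls inside it. The function name and the representation of sections as (start, end) pairs in seconds are assumptions for illustration.

```python
def is_speaker_utterance(recognition_time, utterance_sections, eval_threshold):
    """True if the recognition result was output inside an evaluation utterance section.

    recognition_time   -- time at which the recognition result was output
    utterance_sections -- list of (start, end) utterance sections
    eval_threshold     -- margin added before and after each section
                          (plays the role of evaluation time threshold 71)
    """
    return any(
        start - eval_threshold <= recognition_time <= end + eval_threshold
        for start, end in utterance_sections
    )
```

A result that arrives shortly after the lips stop moving is still accepted, which is why the margin would be chosen with the recognition processing time in mind.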
[0059]
The time threshold 71 for creating the evaluation utterance section 72 is set in consideration of the time required for the recognition processing performed by the speech recognition unit 2. This is because the time when the speech recognition result 12 is output is used as one of the materials for determining whether or not the recognized speech is due to the utterance of the speaker.
[0060]
In this way, when the signal 15 corresponding to the command input by voice is obtained, the control unit 6 controls the airship 7 by outputting a radio control signal corresponding to the input command.
[0061]
As described above, in the first embodiment, the section in which the speaker utters a command is detected from the movement of the lips, and based on this, it is determined whether or not the recognized speech was uttered by the speaker. For this reason, it is possible to prevent misrecognition caused by utterances of persons other than the speaker, and the resulting malfunction of the target object.
[0062]
Therefore, it is possible to realize a game apparatus operated in a way that is natural for human beings, namely by voice. In this embodiment, the movement of the speaker's lips is detected by a simple configuration and method, namely a combination of an LED and a photodiode. For this reason, compared with a conventional apparatus that captures an image of the speaker's lips using a video camera or the like, the apparatus can be realized very cheaply. Of course, a phototransistor may be used instead of the photodiode.
[0063]
Note that the circuit configurations of FIGS. 2 and 3 are merely examples, and the present invention is not limited to these configurations. The same functions can also be realized using computer software.
[0064]
(Second embodiment)
In the game apparatus according to the second embodiment of the present invention, the command is input not by voice but only by the movement of the lips, and the airship is controlled according to the input command. This makes it possible to use the apparatus under noisy conditions, in situations where speaking aloud is not appropriate, such as at midnight, or by a person with a speech impairment.
[0065]
FIG. 8 is a diagram schematically illustrating the configuration of the game apparatus according to the present embodiment. As in the first embodiment, the game apparatus according to the present embodiment includes an image input unit 3, a control unit 6, and an airship 7. It further includes a lip recognition unit 81 that recognizes the words of the speaker (operator) from the movement of the lips.
[0066]
A configuration example of the lip recognition unit 81 is shown in FIG. In this embodiment, the lip recognition unit 81 includes a differentiation circuit 31, a difference calculation unit 91, a database 92, and a pattern matching unit 93. The differentiation circuit 31 is the same as that used in the utterance section detection unit 4 of the game apparatus of the first embodiment. The difference calculation unit 91 samples the differential signal 33 from the differentiation circuit 31 with a predetermined time width, and calculates the difference between the sampling data. The difference calculation result is sent from the difference calculation unit 91 to both the database 92 and the pattern matching unit 93. The database 92 holds the difference calculation results of the standard patterns used for recognition. The pattern matching unit 93 obtains the distance between the difference calculation result of each held standard pattern and the difference calculation result of the input pattern to be recognized, and recognizes the word input as the lip movement based on this distance. Of course, the smaller the distance, the higher the reliability of the recognition result.
[0067]
Hereinafter, the operation of the game apparatus according to the present embodiment will be described in detail. In this embodiment, as described above, the lip recognition unit 81 recognizes the input word by comparing the standard patterns with the input pattern. It is therefore necessary to register the standard patterns in the lip recognition unit 81 before performing the recognition operation.
[0068]
(Registration operation)
First, the image input unit 3 receives the LED reflected light reflected by the lip portion of the speaker, and outputs an electric signal 13 corresponding to the movement of the lip to the lip recognition unit 81. The electrical signal 13 is input to the differentiation circuit 31 of the lip recognition unit 81. The differentiation circuit 31 transmits a differential signal 33 indicating the degree of change of the electrical signal 13 to the difference calculation unit 91. Up to this point, the process is the same as in the first embodiment.
[0069]
The operation of the difference calculation unit 91 will be described with reference to FIG. First, the differential signal 33 is sampled with a time width (Δt), and the difference between adjacent sampling data in the obtained sampling data is calculated. The difference between the calculated sampling data, that is, a series of difference data is output to the database 92. The database 92 holds this difference data string. The above operation is repeated for the number of words (categories) to be recognized, and difference data strings are stored for all categories. The stored difference data string is held as a standard pattern used for recognition. In this embodiment, there are six commands used for controlling the object: “front”, “rear”, “right”, “left”, “upper”, and “lower”. Accordingly, the above-described storage of the difference data string is repeated six times, and finally six standard patterns are held in the database 92.
[0070]
When all the standard patterns have been registered in the database 92 in this way, the database 92 examines each difference data string and extracts, for each difference data string, the length of the section in which data corresponding to the moving lips continues. Specifically, for example, if values close to zero continue in the difference data string for longer than a predetermined time, that section is determined to correspond to a period in which the lips are not moving. When the length of the section corresponding to the moving lips has been extracted for all the standard patterns, the standard pattern with the longest such section is selected, and that length is set as the difference data string length (N). The registration operation is thus completed, and the difference data strings of the standard patterns are held in the database 92.
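The registration operation can be sketched as follows. The function names and the dictionary used as a stand-in for the database 92 are hypothetical. As a simplifying assumption, the moving section is taken to be the span from the first to the last difference whose magnitude exceeds a small tolerance `eps`, without modelling the "longer than a predetermined time" rule for pauses.

```python
def difference_data_string(samples):
    """Differences between adjacent samples of the sampled differential signal."""
    return [b - a for a, b in zip(samples, samples[1:])]


def moving_length(diff, eps):
    """Length of the span between the first and last non-negligible difference.

    Simplification (assumption): near-zero values at either end are treated
    as "lips not moving" regardless of how long they last.
    """
    moving = [k for k, v in enumerate(diff) if abs(v) > eps]
    if not moving:
        return 0
    return moving[-1] - moving[0] + 1


def register_patterns(recordings, eps):
    """Build the standard-pattern table and the common data string length N.

    recordings -- {category name: samples of the differential signal}
    Returns (database, N), where database maps each category to its
    difference data string.
    """
    database = {name: difference_data_string(s) for name, s in recordings.items()}
    n = max(moving_length(d, eps) for d in database.values())
    return database, n
```

For the six commands of this embodiment, `recordings` would hold six entries, one per command.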
[0071]
(During recognition operation)
The operation from the input of the movement of the lip portion until the differential signal 33 is obtained is exactly the same as in the registration operation. Here, the operation after the differential signal 33 is input to the difference calculation unit 91 will be described with reference to FIG.
[0072]
The differential signal 33 input to the difference calculation unit 91 is sampled with a time width (Δt) as in the registration operation. Subsequently, for the sampling data in a section whose length equals the difference data string length (N) of the standard pattern, the difference between adjacent sampling data is calculated, and the obtained series of difference data is taken as the difference data string of that section. The section over which the difference is calculated is shifted backward in time by Δt at a time. FIG. 11 shows only the difference data string for the section 111, which has the section length N and begins at the first sampling data, and the difference data string for the section 112, which is shifted backward in time from the section 111 by N/2.
[0073]
When difference data strings of a plurality of sections whose section length is N (hereinafter referred to as recognition difference data strings) are obtained, these recognition difference data strings are sent to the pattern matching unit 93. The pattern matching unit 93 reads the standard patterns from the database 92 and obtains, for each of the plurality of recognition difference data strings, the distance from each standard pattern. In this embodiment, since six standard patterns are registered in the database 92 as described above, the pattern matching unit 93 calculates, for each recognition difference data string, the distance from each of the six standard patterns in turn.
[0074]
The distance between the recognition difference data string and the standard pattern is calculated using the following equation.

$$d_j = \sum_{i=1}^{N} (r_i - p_{ij})^2$$

Here, $r_i$ is the i-th element of the recognition difference data string, $p_{ij}$ is the i-th element of the j-th standard pattern (corresponding to the j-th category), and $d_j$ is the distance between the recognition difference data string and the j-th standard pattern. When the distance $d_j$ falls below a certain value, the pattern matching unit 93 determines that the recognition difference data string matches the j-th standard pattern, and outputs a signal 82 corresponding to the j-th category (word) as the determination result.
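The recognition operation, combining the sliding section of length N with the squared distance above, might look like the sketch below. The names are hypothetical, and two points are assumptions rather than statements of this description: the standard patterns are assumed to have been cut to exactly N elements, and when several patterns fall below the acceptance value the closest one is chosen.

```python
def squared_distance(r, p):
    """Sum of squared element-wise differences between two data strings."""
    return sum((ri - pi) ** 2 for ri, pi in zip(r, p))


def recognize(diff, patterns, n, max_distance):
    """Slide a section of n differences over diff and match it to a pattern.

    diff         -- difference data string of the input lip movement
    patterns     -- {category: standard pattern of length n} (assumed pre-trimmed)
    max_distance -- acceptance value for the distance (the "certain value")
    Returns (category, distance) for the first section that matches, else None.
    """
    for start in range(len(diff) - n + 1):       # shift by one sample each time
        section = diff[start:start + n]
        best = min(patterns, key=lambda c: squared_distance(section, patterns[c]))
        d = squared_distance(section, patterns[best])
        if d < max_distance:
            return best, d
    return None
```

The returned distance can then be reused when a reliability level is needed, as in the third embodiment.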
[0075]
The determination result is input to the control unit 6, and the control unit 6 outputs a radio control signal corresponding to the jth category to control the airship 7.
[0076]
As described above, in the present embodiment, an input word (command) is recognized based only on the movement of the lips, and the airship is controlled according to the recognized word. For this reason, the apparatus can be used under noisy conditions, in situations where it is difficult to speak aloud, or by a person with a speech impairment.
[0077]
As in the first embodiment, the image input unit 3 for inputting the movement of the lips can be realized by a combination of the LED 21 and the photodiode 22. Therefore, compared with the conventional method of capturing the lip image itself using a video camera or the like, a very inexpensive game device can be provided.
[0078]
In this embodiment, a game user registers the standard patterns used for command recognition prior to inputting a command. However, for example, standard patterns that can accommodate the lip movements of unspecified users may be registered in the database 92 in advance when the game device is manufactured or shipped, so that registration by the user can be omitted.
[0079]
(Third embodiment)
Subsequently, a game apparatus according to a third embodiment of the present invention will be described. In this embodiment, the airship is operated by inputting a command by both voice and the movement of the speaker's (operator's) lip and integrating and determining both recognition results. For this reason, it is possible to reliably recognize a command uttered by a speaker even under noise.
[0080]
FIG. 12 schematically shows the configuration of the game apparatus of this embodiment. The game apparatus according to the present embodiment includes a voice input unit 1, an image input unit 3, a control unit 6, and an airship 7 having the same configuration as those of the game apparatus according to the first embodiment, and further includes a voice processing unit 121 and a lip processing unit 122. The voice processing unit 121 recognizes the input voice in the same manner as the voice recognition unit 2 of the first embodiment, and subsequently calculates the reliability of the recognition result. The lip processing unit 122 recognizes words (commands) input as lip movements in the same manner as the lip recognition unit 81 of the second embodiment, and also calculates the reliability of the recognition result. The outputs from the voice processing unit 121 and the lip processing unit 122 are both input to the integrated determination unit 123. The integrated determination unit 123 determines the command input by the speaker in an integrated manner from the recognition results and reliabilities supplied by the processing units 121 and 122, and outputs the determination result.
[0081]
Hereinafter, the operation of the game apparatus according to the present embodiment will be described in detail.
[0082]
As in the first embodiment, the voice input unit 1 receives the voice uttered by the speaker (game device operator) and transmits the electrical signal 11 corresponding to the input voice to the voice processing unit 121. The voice processing unit 121 receives the electrical signal 11 and recognizes the input voice based on it. Any conventionally known method may be used as the speech recognition method. Here, for example, in the same way as described for the lip recognition unit in the above embodiment, for every command that may be input, a data string obtained by processing the electrical signal 11 produced when that command is spoken is registered in advance as a standard pattern. The command (voice) input from the voice input unit is then recognized by calculating the distance between a recognition target data string, obtained by processing the electrical signal 11 produced when the operator of the game apparatus actually utters a command, and each of the previously registered standard patterns. When the voice has been recognized in this way, the voice processing unit 121 subsequently obtains a reliability indicating how reliable the recognition result is, and supplies both the speech recognition result and the reliability to the integrated determination unit 123 as an output 124. The method for obtaining the reliability will be described later.
[0083]
In parallel with the processing of the input voice, processing of a signal representing the movement of the lips is performed. First, the image input unit 3 inputs the movement of the speaker's lips in the same manner as in the first embodiment, and outputs an electrical signal 13 whose level changes according to the movement of the lips. The lip processing unit 122 receives the electrical signal 13 and performs the same processing as in the second embodiment. However, when the lip processing unit 122 of this embodiment determines, as a result of pattern matching between the recognition difference data string and the standard patterns, that the recognition difference data string matches the j-th standard pattern, it calculates the reliability of the recognition result based on the distance $d_j$ between the recognition difference data string and the j-th standard pattern. Both the recognition result and the reliability obtained in this way are output to the integrated determination unit 123.
[0084]
Next, a method for calculating the reliability will be briefly described. In the present embodiment, the reliability of the speech recognition result and the reliability of the recognition result based on the lip movement are obtained by the same process. Hereinafter, calculation of the reliability of the speech recognition result will be described. Consider a case where the reliability of a speech recognition result is evaluated in three levels: “large”, “medium”, and “small”. These labels follow the distance to the matched standard pattern: the recognition result is most reliable when the level is “small” and least reliable when the level is “large”. In this case, using a threshold $\alpha_L$ that separates “small” from “medium” and a threshold $\alpha_H$ that separates “medium” from “large” (where $\alpha_L < \alpha_H$), the distance d between the recognition target and the standard pattern determined to match it is compared with the thresholds. If $d < \alpha_L$, the reliability is determined to be “small”. Similarly, if $\alpha_L \le d < \alpha_H$ or $d \ge \alpha_H$, the reliability is determined to be “medium” or “large”, respectively. For the recognition result based on the movement of the lips, the reliability level is likewise determined by comparison with thresholds. The thresholds used here can be set to appropriate values. Further, the method of calculating the reliability is not limited to the one described here, and any known method may be used.
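The three-level evaluation can be written compactly as below. The function and parameter names are illustrative only; the level names follow the distance, so that “small” denotes the most reliable result.

```python
def reliability_level(d, alpha_l, alpha_h):
    """Map a matching distance d to one of three reliability levels.

    alpha_l, alpha_h -- thresholds with alpha_l < alpha_h
    Returns "small" (most reliable), "medium", or "large" (least reliable).
    """
    if not alpha_l < alpha_h:
        raise ValueError("alpha_l must be smaller than alpha_h")
    if d < alpha_l:
        return "small"
    if d < alpha_h:
        return "medium"
    return "large"
```

The same function would serve for both the speech and the lip recognition results, each with its own pair of thresholds.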
[0085]
Next, the operation of the integration determining unit 123 will be described with reference to FIG.
[0086]
FIG. 13 is a diagram illustrating the concept of the integrated determination method. First, the integrated determination unit 123 detects the time when the speech recognition result is output from the voice processing unit 121 (that is, the time when the output 124 is generated) and the time when the recognition result based on the movement of the lips is output from the lip processing unit 122 (that is, the time when the output 125 is generated), and creates evaluation sections 132a and 132b by adding a section corresponding to a predetermined threshold 131 before and after each detected output time. Subsequently, it is determined whether or not the evaluation section 132a created for the lip recognition result overlaps the evaluation section 132b created for the speech recognition result. If they overlap, the integrated determination unit 123 determines that the voice uttered by the operator who input the movement of the lips has been input and recognized. If they do not overlap, it determines that the recognized voice is due to ambient noise or an utterance by someone other than the operator. Misrecognition of voices other than the operator's can thereby be prevented.
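The overlap test between the two evaluation sections can be sketched as follows. The function names and the use of seconds as the time unit are assumptions for the example.

```python
def evaluation_section(output_time, threshold):
    """Section obtained by adding the threshold before and after an output time."""
    return (output_time - threshold, output_time + threshold)


def sections_overlap(a, b):
    """True if two closed intervals (start, end) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def voice_is_from_operator(t_voice, t_lip, threshold):
    """True if the voice and lip recognition outputs are close enough in time.

    t_voice   -- time at which the speech recognition result was output
    t_lip     -- time at which the lip recognition result was output
    threshold -- margin playing the role of the predetermined threshold 131
    """
    return sections_overlap(
        evaluation_section(t_voice, threshold),
        evaluation_section(t_lip, threshold),
    )
```

Because both sections use the same margin, the test is equivalent to requiring the two output times to lie within twice the threshold of each other.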
[0087]
Next, the integrated determination unit 123 determines whether the recognition result based on the movement of the lips matches the recognition result based on the voice. If they match, that recognition result is used as the integrated determination result (see the integrated determination result “front” in FIG. 13). If they do not match, the integrated determination result is determined according to the reliability obtained for each recognition result. FIG. 14 shows an example of the correspondence between combinations of reliabilities of the recognition results and the integrated determination result determined for each combination. In this example, as described above, the reliability of each recognition result is evaluated in three levels: “large”, the least reliable; “small”, the most reliable; and “medium”, between them. FIG. 14A shows the correspondence when priority is given to the speech recognition result when the reliabilities are equal, and FIG. 14B shows the correspondence when priority is given to the lip recognition result. Which correspondence is adopted is determined according to factors such as the environment in which the game apparatus is operated, and can be registered in the game apparatus in advance. Alternatively, the game device may be configured so that the operator can make the selection. For example, when the operator has no speech impairment and the ambient noise is relatively low, the speech recognition result is given priority as shown in (a). If the noise is very high, (b) is adopted.
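One possible reading of this decision rule is sketched below. The actual table of FIG. 14 is not reproduced in this text, so the rule used here, that the more reliable result wins and that the `prefer` setting only breaks ties, is an assumption, as are all names in the code.

```python
# Lower rank means more reliable; the level names follow the matching distance.
RANK = {"small": 0, "medium": 1, "large": 2}


def integrate(voice, lip, prefer="voice"):
    """Combine two recognition results into one command.

    voice, lip -- (word, reliability level) pairs
    prefer     -- "voice" or "lip"; which result wins when reliabilities tie
                  (corresponding loosely to the two variants of FIG. 14)
    """
    voice_word, voice_level = voice
    lip_word, lip_level = lip
    if voice_word == lip_word:
        return voice_word
    if RANK[voice_level] < RANK[lip_level]:
        return voice_word
    if RANK[lip_level] < RANK[voice_level]:
        return lip_word
    return voice_word if prefer == "voice" else lip_word
```

In a noisy room, for instance, the setting `prefer="lip"` would let the lip result decide whenever the two results disagree with equal reliability.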
[0088]
The integration determination unit 123 outputs the integration determination result determined as described above as a signal 15. Finally, the control unit 6 outputs a radio control signal corresponding to the determination result to control the airship 7.
[0089]
As described above, according to the present embodiment, the movement of the lips is recognized together with the voice signal, and both recognition results are used in an integrated manner, so that the command uttered by the speaker can be recognized reliably even under noise. At the same time, a person with a speech impairment can also use a game operated by voice. Further, as in the first and second embodiments described above, since the movement of the lips is detected by the combination of the LED 21 and the photodiode 22, the apparatus can be realized very inexpensively compared with a method of capturing a lip image using a video camera or the like.
[0090]
Although a detailed description is omitted, in this embodiment, as in the second embodiment, the user of the game registers standard patterns for lip recognition. Alternatively, standard patterns that accommodate unspecified users may be prepared in advance so that registration by the user can be omitted.
[0091]
In the first to third embodiments, a game apparatus that controls the airship 7 using a radio control signal is described as an example. However, the game apparatus to which the present invention can be applied is not limited to this. For example, if a configuration as described in any of the above embodiments is provided for the number of operators, a game device that allows a plurality of operators to play simultaneously can be realized.
[0092]
The input device of the present invention will be described below. FIG. 15 is a diagram schematically showing the configuration of the input device of the present invention. The input device of the present invention includes a headset 154, a support column 155 attached to it, and a base 153 provided with a photodiode 151 and an LED 152. The base 153 is joined to the support column 155 at a predetermined angle (see FIG. 15A). By adjusting the angle between the base 153 and the support column 155, the direction in which the light emitted from the LED 152 strikes the lip portion of the operator can be changed. This input device inputs the movement of the lips by irradiating the lip portion of the operator with the light emitted from the LED 152 and detecting the reflected light with the photodiode 151. Such an input device can be used, for example, as the image input unit in the first to third embodiments. If a microphone 156 is added to the base 153 (see FIG. 15B), this input device can also be used as a voice input device.
[0093]
An input device not provided with a microphone as shown in FIG. 15A can be used as the image input unit of the second embodiment. Further, as shown in FIG. 15B, the input device having a microphone can be used as a device that serves as both the audio input unit and the image input unit of the first and third embodiments.
[0094]
Thus, since the input device of the present invention uses the photodiode 151, the LED 152, and the microphone 156, which are very small and light, the input device as a whole is very small and light. Further, since all the components used are inexpensive, the device can be realized at low cost. Furthermore, since the input device of the present invention is fixed to the operator's head by the headset 154, the positional relationship between the lips, the photodiode 151, and the LED 152 can be kept substantially constant. For this reason, the movement of the lips can be input stably. In addition, because the input device of the present invention captures the movement of the lips with light, converts it into an electrical signal, and outputs it, its configuration can be made simpler than that of conventional input devices that require a large and complicated configuration, for example, a device that inputs an image of the lips rather than their movement, or a device that uses ultrasonic waves.
[0095]
Here, only one photodiode and one LED are mounted, but a plurality of them can be mounted. For example, if two sets of LEDs and photodiodes are prepared and the sets are arranged in a cross shape, the direction of movement within a plane can also be detected.
[0096]
As described above, according to the present invention, it is possible to obtain a game apparatus that can be operated by voice, which is natural for humans, and that does not require practice to operate. Further, since the movement of the lips is used rather than recognizing the input words (commands) from voice alone, stable operation is possible even under noisy conditions. Furthermore, since the movement of the lips is captured by a combination of an LED and a photodiode (phototransistor), the apparatus can be realized at a lower cost than when a video camera or ultrasonic waves are used.
[0097]
Further, as described in the first embodiment, since the speaker's utterance section is detected from the movement of the lips and used as a basis for judging the speech recognition result, misrecognition caused by utterances of persons other than the speaker can be prevented. As described in the second and third embodiments, if the words (commands) are recognized from the movement of the lips and the airship is controlled accordingly, the apparatus can be used under noisy conditions, in situations where it is difficult to speak aloud, or by people with speech impairments.
[0098]
In the input device of the present invention, an inexpensive light-emitting element (LED or the like) and an inexpensive light-receiving element (photodiode or the like) are attached to a lightweight headset, support column, and base. For this reason, a very light and inexpensive input device can be realized.
[0099]
In the first to third embodiments, examples in which the movement of the object is controlled according to the recognized voice or the movement of the lips have been described. However, the operation of the object controlled based on the voice or the movement of the lips is not limited to movement, and may be an operation such as replying with some words. Described below are various devices for causing an object to perform some operation (including movement) in accordance with the recognized voice.
[0100]
Hereinafter, an apparatus for causing an object to perform some operation according to recognized speech will be described in each embodiment.
[0101]
(Fourth embodiment)
In the present embodiment, a description will be given of an apparatus that selects one output sound from a set of output sounds prepared for the recognized sound and outputs the selected sound.
[0102]
FIG. 16 schematically shows the configuration of the voice selection device 100 of this embodiment. The voice selection device 100 includes a random number generation unit 101, a selection unit 102, an input/output state memory 103, a state transition unit 104, and an input/output state database 105. The input/output state database 105 stores a plurality of input/output state tables in advance. Each input/output state table includes an input x (x is a non-negative integer) in the state s and a set sp(x, i) (0 ≦ i < n(s)) of n(s) output voices for the input x. An example of the input/output state table is shown in FIG. Initially, the input/output state memory 103 stores an initial state table 201 shown in FIG. The random number generation unit 101 determines i used to select one voice to be output from the set of output voices.
[0103]
Hereinafter, the operation of the voice selection device 100 will be described. When the selection unit 102 receives an input x from the outside, it refers to the input/output state table stored in the input/output state memory 103 and selects the output voice set sp(x, i) corresponding to the input x. Subsequently, the selection unit 102 causes the random number generation unit 101 to generate a random number r(n(s)) (where 0 ≦ r(n(s)) < n(s)), sets i = r(n(s)), and thereby selects one voice from the output voice set sp(x, i). The selected voice is output to the outside.
[0104]
The output from the selection unit 102 is given not only to the outside but also to the state transition unit 104. When receiving the output from the selection unit 102, the state transition unit 104 refers to the input/output state database 105 and rewrites the contents of the input/output state memory 103 with the input/output state table for that output. For example, when “How are you?” is output in the initial state 201, the state transition unit 104 refers to the input/output state database 105 and extracts the table of the input/output state 202 for the output “How are you?”. The extracted table of the state 202 is then stored in the input/output state memory 103.
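The selection and state transition described above can be sketched as a small class. Everything here is illustrative: the class name, the dictionary layout standing in for the input/output state database 105, and the table contents are hypothetical, and keeping the current table when no table exists for an output is an assumption.

```python
import random


class VoiceSelector:
    """Toy model of the voice selection device with state transitions."""

    def __init__(self, database, initial_state, rng=None):
        # database: {state name: {input x: [candidate output voices]}}
        self.database = database
        self.state = initial_state            # stands in for the state memory
        self.rng = rng or random.Random()     # stands in for the random number unit

    def respond(self, x):
        """Pick one output voice at random for input x, then update the state."""
        candidates = self.database[self.state][x]
        i = self.rng.randrange(len(candidates))   # 0 <= i < number of candidates
        output = candidates[i]
        # State transition: load the table registered for this output, if any.
        if output in self.database:
            self.state = output
        return output


# Hypothetical tables for illustration only.
EXAMPLE_DATABASE = {
    "initial": {1: ["Good morning", "How are you?"]},
    "How are you?": {2: ["Glad to hear it"]},
}
```

Omitting the transition step in `respond` gives the simpler single-response variant, in which the same table is consulted for every input.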
[0105]
Thus, the voice selection device 100 according to the present embodiment outputs a voice selected using a random number in response to the input voice. Therefore, a simple dialogue system can be constructed using this voice selection device 100. Further, as shown in FIG. 18, if a voice selection device 100a with a simpler configuration, in which the state transition unit 104 and the input/output state database 105 are omitted, is used, a device that responds only once to the input voice can also be realized.
[0106]
The voice selection devices 100 and 100a can be combined with a voice recognition device 1201 and used as the voice selection device 1202 of a voice reaction device 1203, as shown in FIG. More specifically, first, when a voice is recognized by the voice recognition device 1201, the recognition result is input to the voice selection device 1202, for example, as an identification number assigned to that voice. The voice selection device 1202 uses the input identification number as the input x, selects one voice at random from the output voice set, and outputs it. In this way, when a certain voice is input, a voice corresponding to it is output, and a voice reaction device 1203 that can respond in various ways to the same input voice can be realized. For example, when the voice recognition device 1201 outputs the voice “Good morning” as a recognition result while the voice selection device 1202 is in the initial state, the identification number 1 assigned to the voice “Good morning” is input to the voice selection device 1202 as the input x (see FIG. 2A). In response, the voice selection device 1202 randomly selects one voice from the set sp(1, i), which contains the two output voices “Good morning” and “How are you?”, and outputs it.
[0107]
In the voice reaction device 1203, the voices that can be accepted as input must be registered in the voice selection device 1202 before actual operation. When a voice that is not included in the registered voice set is input to the voice selection device 1202, the voice selection device 1202 may output, for example, the voice “What?”. Further, when the apparatus of the third embodiment is used as the voice recognition device 1201 and the reliability of the recognized voice is low, the voice selection device 1202 can output a voice prompting the speaker to input the voice again.
[0108]
As described above, in the voice selection device according to the present invention, a plurality of tables representing the input/output states are prepared, and the input/output state is changed according to the past input/output history. Therefore, by using the voice selection device of the present invention, a device that carries out a simple dialogue can be realized. In addition, this voice selection device has a plurality of output voice candidates for one input, and randomly selects and outputs one of them. Therefore, it is possible to obtain a voice reaction device that responds with variety instead of always giving the same response to one input.
[0109]
(Fifth embodiment)
Next, the direction detection device and the direction selection device of the present invention will be described.
[0110]
First, the direction detection device 400 will be described with reference to FIG. The direction detection device 400 includes a direction detection unit 401 and a plurality of microphones 402 connected to it, and the microphones 402 are attached to the object to be controlled. Here, the operation of the direction detection device 400 will be described taking as an example the case where the number of microphones is four. When voices are input from the four microphones m(i) (i = 0, 1, 2, 3), the direction detection unit 401 divides the input voice sp(m(i), t) into frames f(m(i), j) 501 (0 ≦ j), as shown in FIG. The length of one frame is, for example, 16 ms. Next, the direction detection unit 401 obtains the voice energy e(m(i), j) of each frame, and sequentially stores the obtained energy e(m(i), j) in a circular memory (not shown) of length l (for example, l = 100). Each time the energy of a frame is stored, the direction detection unit 401 obtains, for each microphone, the sum of the energy over the past l frames, and determines the microphone for which this sum is largest. Subsequently, the direction detection unit 401 compares the maximum sum of energy with a threshold value Th_e determined experimentally in advance, and if the maximum sum is larger than the threshold value Th_e, it determines that the direction of that microphone is the direction from which the voice is heard. The microphone number i thus determined is output from the direction detection unit 401 as the direction from which the voice was input.
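The energy comparison between microphones can be sketched with one fixed-length buffer per microphone. The class and method names are hypothetical, and computing the frame energy as the sum of squared samples is an assumption, since the energy measure is not defined in this description.

```python
from collections import deque


class DirectionDetector:
    """Pick the microphone whose recent frame energy is largest."""

    def __init__(self, num_mics, window, threshold):
        # One circular buffer of the last `window` frame energies per microphone.
        self.buffers = [deque(maxlen=window) for _ in range(num_mics)]
        self.threshold = threshold

    def push_frame(self, frames):
        """Store one frame per microphone and return the detected direction.

        frames -- list with one sequence of samples per microphone,
                  all belonging to the same frame index
        Returns the microphone index, or None if no sum exceeds the threshold.
        """
        for buffer, frame in zip(self.buffers, frames):
            buffer.append(sum(s * s for s in frame))  # frame energy (assumption)
        sums = [sum(buffer) for buffer in self.buffers]
        best = max(range(len(sums)), key=sums.__getitem__)
        return best if sums[best] > self.threshold else None
```

A `deque` with `maxlen` discards the oldest energy automatically, which mirrors the behaviour of a circular memory of fixed length.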
[0111]
For example, when the direction detection device 400 operating in this way is combined with the operation device 1302 as shown in FIG. 28, a voice reaction device 1303 that performs a predetermined operation according to the direction from which the voice is heard can be configured. Specifically, if an operation device 1302 for moving the object and a direction detection device 1301 (400 in FIG. 19) are attached to an object (for example, a balloon or a stuffed animal), the object moves toward a human voice. In this way, a device that responds to a sound by performing a predetermined operation in the direction from which the sound is heard can be made.
[0112]
An example of the operation device 1302 described above is a device that has three motors with propellers attached to the object and a driving device for these motors, and that, when the direction to move next is input, controls the three motors so that the object moves in that direction.
[0113]
Next, the direction selection device will be described with reference to FIG. 21. The direction selection device 600 includes an offset calculation unit 601, an azimuth meter 602, and a target direction memory 603, and can be used as a device for controlling the direction in which the object moves or the direction in which it faces. When an input x (x is a non-negative integer) indicating the direction in which the object is to move next or the direction it is to face is received, the offset calculation unit 601 outputs an offset corresponding to the input x based on a table stored in advance in the offset calculation unit 601. The output offset is added to the actual direction of the object measured by the azimuth meter 602, and the result is sent to the target direction memory 603. The target direction memory 603 stores this sum of the actual direction and the offset as the direction in which the object should next move or face.
[0114]
As described above, the direction selection device of FIG. 21 is used to change the direction of the object based on the direction in which the object is currently moving or the direction in which the object is facing in accordance with the input x.
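A minimal sketch of this relative selection is given below. The table contents, the function names, and the wrapping of the result into a signed range are assumptions made for illustration; the text only states that an offset is looked up for the input x and added to the measured direction.

```python
# HYPOTHETICAL offset table: input x -> offset in degrees (eastward positive).
OFFSETS = {0: 0.0, 1: 90.0, 2: -90.0, 3: 180.0}


def wrap_degrees(angle):
    """Map any angle in degrees to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def relative_target(x, current_heading):
    """Target direction = measured heading + table offset for input x."""
    return wrap_degrees(current_heading + OFFSETS[x])
```

For instance, with the object facing 135 degrees, input 1 (a 90 degree offset in this assumed table) gives a target of 225 degrees, which wraps to −135 degrees.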
[0115]
If the direction selection device 700 of FIG. 22 is used instead of the direction selection device 600 of FIG. 21, the direction of the object can be changed to an absolute direction rather than to a direction relative to the current one. In the direction selection device 700 of FIG. 22, when the direction calculation unit 701 receives from the outside an input x (x is a non-negative integer) indicating an absolute direction (for example, north), it outputs a value corresponding to the input x. The output value is stored as is in the target direction memory 603 as the target direction. Like the offset calculation unit 601 described above, the direction calculation unit 701 can be realized by holding the absolute direction value for each input x in a table. After the target direction has been stored in the memory 603, the direction selection device 700 repeatedly measures the current direction with the azimuth meter 602 while the object is moving or turning, and outputs the difference between the measured direction and the direction stored in the target direction memory 603. If feedback control is applied to the object based on this output, the object can be moved in the intended absolute direction or turned to face it.
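The absolute variant and its feedback output can be sketched in the same way. The table values, the proportional controller in `steer`, and its gain are assumptions for illustration only; the text specifies just that the difference between the measured and stored directions is output and used for feedback control.

```python
# HYPOTHETICAL table: input x -> absolute direction in degrees
# (north = 0, eastward positive): 0 north, 1 east, 2 south, 3 west, 4 southwest.
ABSOLUTE = {0: 0.0, 1: 90.0, 2: 180.0, 3: -90.0, 4: -135.0}


def heading_error(target, measured):
    """Signed shortest difference target - measured, in [-180, 180)."""
    return (target - measured + 180.0) % 360.0 - 180.0


def steer(target, heading, gain=0.5, steps=20):
    """Toy proportional feedback: each step turns by gain * error.

    A real operation device would drive motors instead of editing the
    heading directly; this loop only shows the error shrinking.
    """
    for _ in range(steps):
        heading += gain * heading_error(target, heading)
    return heading
```

Taking the difference modulo 360 degrees makes the object turn the short way round: from 170 degrees to a target of −135 degrees the error is +55 degrees, not −305 degrees.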
[0116]
If a direction selection device as described above is combined with a voice recognition device and an operation device as shown in FIG. 29, a voice reaction device 1402 can be realized in which the facing direction or movement direction of the object changes according to a direction input by voice. In the voice reaction device 1402, the recognition result of the voice recognition device 1201 is input to the direction selection device 1401, and the output of the direction selection device 1401 is input to the operation device 1302. This makes it possible to control the operation of the object while comparing its current facing or movement direction with the target direction.
[0117]
For example, consider a case where north is 0 degrees, the eastward direction is positive, and the object is currently facing 0 degrees. Assume that the direction selection device 600 described above (see FIG. 21) is used as the direction selection device 1401. When the voice indicating the target direction is recognized by the voice recognition device 1201 as the word “right”, and the table stored in the offset calculation unit 601 of the direction selection device 600 associates +90 degrees with the word “right”, the direction selection device 600 sends an output to the operation device 1302 to turn the facing or movement direction of the object 90 degrees eastward from the current direction. While the direction of the object is changing, the direction selection device 600 continuously compares the current direction with the target direction. The operation device 1302 is controlled by the output of the direction selection device 600 so that the facing or movement direction of the object changes to the target direction.
Alternatively, when the direction selection device 700 shown in FIG. 22 is used as the direction selection device 1401, a word indicating an absolute direction, such as “north” or “southwest”, is input as the target direction instead of “right” or “left”. In this case, the direction selection device 700 stores the absolute direction in the target direction memory as 0 degrees if the input word is “north” and as −135 degrees if it is “southwest”, and then performs the operation described above. Here, the target direction ranges from −180 degrees to +180 degrees.
[0118]
Further, the direction detection device and the direction selection device of the present embodiment may be combined with an operation device. In this case, as shown in FIG. 30, the detection result of the direction detection device 1301 is input to the direction selection device 1401, and the output of the direction selection device 1401 is input to the operation device 1302. This realizes a voice reaction device 1501 that turns the facing or movement direction of the object toward the direction from which the voice is heard, while comparing the current direction with the target direction.
[0119]
(Sixth embodiment)
In this embodiment, an apparatus related to speech recognition will be described. As shown in FIG. 26, this apparatus includes a voice end point detection device 1101, a voice detection device 1102, a feature quantity extraction device 1103, a distance calculation device 1104, and a dictionary 1105.
[0120]
First, the voice end point detection device 1101, which receives a signal corresponding to an input voice and detects the voice end point based on that signal, will be described. In this specification, the “voice end point” means the time at which the voice input ends.
[0121]
The voice end point detection device 1101 of this embodiment is connected to a voice input device such as a microphone. When the voice s(t) is input from the voice input device, the voice end point detection device 1101 divides the input voice s(t) into frames f(i) (i is a non-negative integer) as shown in FIG. 23, and obtains the energy e(i) of each frame. In FIG. 23, the voice s(t) is represented by a curve 801, and the energy e(i) is represented by a curve 802. Then, each time one frame of voice is input, the voice end point detection device 1101 obtains the variance of the energy over a predetermined number of frames up to that frame and compares it with a threshold Thv determined experimentally in advance. If the comparison shows that the energy variance has crossed the threshold Thv from above to below, the point of crossing is determined to be the voice end point.
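The downward threshold crossing of the energy variance can be illustrated with the sketch below. The window length, the threshold value, and the function name `find_end_point` are assumptions chosen for the example; population variance from the standard library is used as one reasonable reading of "variance".

```python
from collections import deque
from statistics import pvariance


def find_end_point(energies, window=5, thv=1.0):
    """Return the first frame index at which the variance of the last
    `window` energies falls from above `thv` to `thv` or below.

    Returns None if no such crossing occurs. `window` and `thv` are
    illustrative values, not values taken from the text.
    """
    recent = deque(maxlen=window)   # plays the role of the circular memory
    prev_var = None
    for i, e in enumerate(energies):
        recent.append(e)
        if len(recent) < window:
            continue                # not enough history yet
        var = pvariance(recent)
        if prev_var is not None and prev_var > thv >= var:
            return i
        prev_var = var
    return None
```

Silence before the utterance has low variance too, so the test on `prev_var` ensures that only a fall from a high-variance (voiced) stretch is reported as the end point.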
[0122]
Here, a method for obtaining the variance from the energy of each frame over a certain period will be described. In the first method, which uses a circular memory, the energy obtained for each frame is stored sequentially in a circular memory 803 of length l. Each time the energy of one frame is obtained, the energies of the frames going back a certain period are read from the circular memory 803 and their variance is obtained.
[0123]
The energy variance can also be obtained without using a circular memory. In this method, the voice end point detection device 1101 holds the mean m(i−1) and variance v(i−1) for a predetermined number of past frames. When the energy e(i) of a new frame is obtained, the weighted sum of the new energy e(i) and the past mean m(i−1) is taken as the new mean m(i), and the weighted sum of the past variance v(i−1) and |e(i) − m(i)| is taken as the new variance v(i). In this way, a pseudo energy variance can be obtained. The attenuation constant α is used for the weighting, and the new mean and variance are obtained using the following equations. A value of 1.02 is used for α.
[0124]
[Expression 1]

[0125]
This eliminates the need for a circular memory, saves memory, saves the trouble of obtaining the total energy within a certain period each time new energy is obtained, and shortens the processing time.
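The recursive update can be sketched as follows. The exact weights of Expression 1 are not reproduced in this text, so the sketch ASSUMES that the past value is weighted by 1/α and the new value by 1 − 1/α; this is one plausible weighting consistent with α = 1.02, not necessarily the one in the original equations. The function name `update` is hypothetical.

```python
ALPHA = 1.02  # attenuation constant given in the text


def update(mean, dev, energy, alpha=ALPHA):
    """One recursive step of the pseudo mean and pseudo variance.

    ASSUMED weighting (not confirmed by the text):
        m(i) = m(i-1)/alpha + (1 - 1/alpha) * e(i)
        v(i) = v(i-1)/alpha + (1 - 1/alpha) * |e(i) - m(i)|
    """
    keep = 1.0 / alpha
    new_mean = keep * mean + (1.0 - keep) * energy
    new_dev = keep * dev + (1.0 - keep) * abs(energy - new_mean)
    return new_mean, new_dev
```

Only two numbers are carried from frame to frame, which is the memory and processing saving described above; under this assumed weighting the pseudo variance decays toward zero when the energy stays constant.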
[0126]
Next, the voice detection device 1102, which extracts the section in which the voice is actually uttered, will be described. To extract this section, a circular memory 902 for storing smoothed energy is prepared separately from the circular memory 803 for storing energy. As shown in the drawings, each time the energy of one frame is obtained, the energy 802 is stored in the memory 803 and the smoothed energy 901 is stored in the memory 902. When the voice end point 903 is obtained as described above, the history of the energy and the smoothed energy remains in the circular memories 803 and 902. If the length l of these circular memories is set sufficiently long (for example, equivalent to 2 seconds), the energy for one word can be retained. The voice detection device 1102 therefore extracts the section in which the voice is uttered using the energy and smoothed energy stored in these memories.
[0127]
The section is extracted by the following procedure. First, the threshold value Th is determined as described later. This threshold value Th and the energy stored in the circulating memory 803 are compared in order from the past, and the point where the energy exceeds the threshold value for the first time is set as the starting point of the section where the sound is produced. Conversely, the point at which energy first intersects the threshold when going back to the past from the voice end point is set as the end point of the section where the voice is pronounced. In this way, the section where the voice is pronounced is extracted.
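The two scans in this procedure can be sketched as follows. The list `energies` stands for the contents of the circular memory 803 read out oldest first, and `end_point` for the index of the detected voice end point; both names, and the function name, are hypothetical.

```python
def extract_section(energies, end_point, th):
    """Return (start, end) frame indices of the voiced section, or None.

    start: first frame, scanning forward from the oldest entry, whose
           energy exceeds th.
    end:   first frame, scanning backward from the voice end point,
           whose energy exceeds th.
    """
    start = next((i for i in range(end_point + 1) if energies[i] > th), None)
    if start is None:
        return None  # energy never exceeded the threshold
    end = next(i for i in range(end_point, -1, -1) if energies[i] > th)
    return start, end
```

Scanning backward from the end point, rather than forward from the start, keeps brief dips below the threshold inside a word from cutting the section short.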
[0128]
Here, the method of determining the threshold Th will be described. First, the maximum value max 1001 of the energy in the memory 803 and the minimum value min 1002 of the smoothed energy in the memory 902, both at the time the voice end point is detected, are obtained. Using these values, the threshold Th is calculated from the following equation.
[0129]
[Expression 2]

[0130]
Here, a value of about 0.07 is used for β.
[0131]
Here, energy is smoothed by taking the median value within a fixed window. However, the smoothing method is not limited to this; for example, the average value may be taken instead. The maximum value of the energy, rather than the maximum value of the smoothed energy, is used to obtain the threshold Th because, if the maximum value of the smoothed energy were used, that maximum would fluctuate greatly, the threshold Th would fluctuate with it, and good voice detection could not be performed. In addition, because the minimum value of the smoothed energy is used in calculating the threshold Th, detection of noise that is not voice can be prevented.
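A sketch of the median smoothing and of a threshold built from the two extremes follows. Expression 2 is not reproduced in this text, so the form Th = min + β(max − min) used in `threshold` is an ASSUMPTION that merely combines the quantities named above; the window length and function names are also illustrative.

```python
from statistics import median

BETA = 0.07  # approximate value given in the text


def median_smooth(energies, window=5):
    """Median of the energies in a window centred on each frame
    (the window is truncated at both ends of the list)."""
    half = window // 2
    return [median(energies[max(0, i - half): i + half + 1])
            for i in range(len(energies))]


def threshold(energies, smoothed, beta=BETA):
    """ASSUMED form: Th = min + beta * (max - min), where max is the maximum
    raw energy and min is the minimum smoothed energy."""
    lo = min(smoothed)
    hi = max(energies)
    return lo + beta * (hi - lo)
```

In this sketch a one-frame spike disappears from the smoothed sequence, while a sustained stretch of high energy survives, which is the behaviour the median window is meant to provide.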
[0132]
As described above, the voice detection device 1102 performs the extraction of the section where the voice is sounded, that is, the detection of the portion corresponding to the voice in the input signal.
[0133]
Next, the feature quantity extraction device 1103 extracts feature quantities for recognition from the detected voice. Like the energy, the feature quantity is obtained for each frame and stored in a circular memory. Here, the feature quantity is a vector of three elements: the number of zero crossings of the original signal, the number of zero crossings of the differentiated original signal, and the inter-frame difference of the logarithm of the energy of the original signal.
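One possible reading of this three-element feature vector is sketched below. The first difference is used as a discrete stand-in for the differentiated signal, energy is assumed to be the sum of squared samples, and a small constant `eps` is added before the logarithm to avoid log(0); all three choices, and the function names, are assumptions.

```python
import math


def zero_crossings(samples):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))


def frame_features(frame, prev_log_energy, eps=1e-10):
    """Return ((zc_signal, zc_difference, delta_log_energy), log_energy).

    The returned log_energy is passed back in as prev_log_energy for the
    next frame so that the inter-frame difference can be formed.
    """
    diff = [b - a for a, b in zip(frame, frame[1:])]  # discrete derivative
    log_e = math.log(sum(s * s for s in frame) + eps)
    features = (zero_crossings(frame), zero_crossings(diff),
                log_e - prev_log_energy)
    return features, log_e
```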
[0134]
The speech feature vector obtained through the speech end point detection device 1101, the speech detection device 1102, and the feature extraction device 1103 is input to the distance calculation device 1104. The distance calculation device 1104 collates each of a plurality of speech feature amount vectors registered in the dictionary 1105 in advance with the input feature amount vector, and outputs the one having the best score as a recognition result. As a matching method, a Euclidean distance between vectors may be simply taken, or a DP matching method may be used.
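The two matching options mentioned above can be sketched as follows. The DP matching shown is the textbook dynamic time warping recurrence with a Euclidean local distance, offered as a generic illustration rather than the specific variant used by the distance calculation device 1104; the dictionary layout (word mapped to a list of feature vectors) and the function names are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dp_distance(seq_a, seq_b):
    """Dynamic-programming (DTW) distance between two vector sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


def recognize(dictionary, features):
    """Return the dictionary word whose template is closest to `features`."""
    return min(dictionary,
               key=lambda word: dp_distance(dictionary[word], features))
```

DP matching tolerates differences in speaking rate, because one template frame may be aligned with several input frames, whereas a plain Euclidean comparison requires sequences of equal length.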
[0135]
As described above, the apparatus of this embodiment performs voice recognition. This voice recognition apparatus can be combined with the voice selection apparatus 1202 described in the fourth embodiment as shown in FIG. 27, or with the direction selection apparatus 1401 described in the fifth embodiment and the operation device 1302 as shown in FIG. 29. It can also be combined with the operation device 1302 alone to configure a voice reaction device 1601 that moves the entire device in a target direction, using the result of the voice recognition device 1201 as the input of the operation device 1302.
[0136]
Further, in any of the voice reaction devices described in the fourth to sixth embodiments that include the voice recognition device 1201, a signal transmission device 1701 may be added on the voice recognition device side, and a signal reception device 1702 may be added to the device that follows the voice recognition device in each configuration (the voice selection device 1202, the direction selection device 1401, or the operation device 1302). The object can then be operated remotely, with only the voice recognition device held in hand as a remote controller. Infrared rays or radio can be used for signal transmission and reception.
[0137]
Further, by attaching the above-described voice reaction device to the balloon, it is possible to interact with the balloon and control the balloon, and it is possible to make a toy that makes use of the warmth unique to the balloon.
[0138]
Also, as shown in FIG. 33, two balloons 1801 can be prepared, each fitted with a voice reaction device 1203 comprising the voice recognition device and the voice selection device described above. If the two voice reaction devices are configured to talk to each other, rather than a person talking to a voice reaction device, toys that hold a dialogue with each other can be made. It is also possible to prepare more than two balloons 1801 with voice reaction devices and have them interact. If each balloon with a voice reaction device is given a reject function in the voice recognition process, it can be made to react only to specific words, so that only one balloon reacts to a given utterance. For example, each balloon 1801 can be given a name and made to react only when its name is called. One reject method is to determine a threshold experimentally and, when the distance from the internal dictionary is calculated during voice recognition, reject any result whose distance exceeds the threshold. In addition, by incorporating a clock into the voice reaction device and, after a predetermined time has elapsed, randomly selecting and outputting one voice from the registered set of output voices, a toy in which the voice reaction device itself starts the dialogue can also be constructed.
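The reject function can be illustrated with a very small sketch. The mapping `distances` (word to distance from the input), the word names, and the threshold value in the usage note are hypothetical.

```python
def recognize_with_reject(distances, reject_threshold):
    """Return the closest word, or None when even the best match is farther
    than the experimentally chosen reject threshold.

    distances: mapping from dictionary word to its distance from the input.
    """
    word = min(distances, key=distances.get)
    return word if distances[word] <= reject_threshold else None
```

With a dictionary that holds only the balloon's own name, an utterance of another name produces a large distance and is rejected, so the balloon stays silent.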
[0139]
The object is not limited to a balloon, and may be a stuffed animal, a doll, a photograph, or a picture. It may also be a moving image on a display. Floating devices other than a balloon (for example, one that floats by means of a propeller like a helicopter, or one that floats by magnetic force like a linear motor car) may also be used as the object.
[0140]
[Effects of the invention]
As described above, according to the present invention, it is possible to obtain a game apparatus that can be operated by a voice that is natural for humans and that does not require operation learning. Further, since the movement of the lips is used instead of recognizing words (commands) input only from voice, stable operation is possible even under noisy conditions. Furthermore, since the movement of the lips is captured by a combination of an LED and a photodiode (phototransistor), it can be realized at a lower cost than when a video camera or an ultrasonic wave is used.
[0141]
Furthermore, in the speech recognition apparatus according to the present invention, the speaker's utterance section is detected from the movement of the lips and used as a basis for judging the speech recognition result, so erroneous recognition caused by voices other than the speaker's can be prevented. Further, in another speech recognition apparatus of the present invention, words (commands) are recognized from the movement of the lips and used to control the airship, so the apparatus can be used even under noisy conditions or in situations where it is difficult to speak aloud, and can also be used by people with speech impairments.
[0142]
In the input device of the present invention, an inexpensive light-emitting element (LED or the like) and an inexpensive light-receiving element (photodiode or the like) are attached to a light headset, column, and stand. For this reason, a very light and inexpensive input device can be realized.
[0143]
As described above, the voice selection device of the present invention prepares a plurality of input / output states and changes the input / output state based on the past input / output history. For this reason, a device for carrying out a simple dialogue can be provided by using this voice selection device. In addition, the voice selection device of the present invention prepares a plurality of outputs for one input and outputs one selected at random from among them, so the response to a given input is not always the same and can vary.
[0144]
In addition, the direction detection device of the present invention inputs sound through a plurality of microphones and detects the microphone with the maximum energy. The direction from which the voice is uttered can thereby be detected. Furthermore, if the direction selection device of the present invention is used, the current direction is detected by an azimuth meter, and the object can be accurately moved in the input direction or turned to face the input direction.
[0145]
Also, the speech recognition apparatus of the present invention first obtains a rough speech end point with the speech end point detection device, and then automatically obtains the threshold with the speech detection device. Because the threshold is determined from the maximum value of the energy of the input speech and the minimum value of the smoothed energy, good speech segment extraction can be performed regardless of the length of the speech segment. When the speech detection device has detected the speech using the threshold, a feature quantity is obtained from the speech, and speech recognition is performed based on the feature quantity.
[0146]
Various voice reaction devices can be obtained by appropriately combining the devices described above. For example, when a voice recognition device and a voice selection device are combined, a voice reaction device that responds by voice when a person speaks to it can be obtained, which makes it possible to construct a man-machine interface. If the direction detection device and the operation device are combined, the object can be operated in response to the voice. If the voice recognition device, the direction selection device, and the operation device are combined, the object can be accurately moved in the direction indicated by the content of the voice, or turned to face that direction. Furthermore, if a signal transmission device is connected to the voice recognition device of the voice reaction device, and a signal reception device is connected to the device following the voice recognition device and attached to the object, a voice reaction device that can be operated remotely can be realized.
[0147]
Furthermore, if a plurality of voice reaction devices as described above are prepared, a toy in which the voice reaction devices automatically talk to each other can be configured. If a voice reaction device is attached to each balloon, a toy that has the warmth unique to a balloon and can be talked to can be made. It is also possible to create a voice reaction device that starts speaking on its own, rather than waiting to be spoken to by a person, by incorporating a clock and outputting a suitable voice at a certain time.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a game device according to a first embodiment of the present invention.
FIG. 2 is a diagram illustrating a detailed configuration of an image input unit according to first to third embodiments of the present invention.
FIG. 3 is a diagram illustrating a detailed configuration of an utterance section detection unit according to the first embodiment of the present invention.
FIG. 4 is a block diagram illustrating a detailed configuration of an integrated determination unit according to the first exemplary embodiment of the present invention.
FIG. 5 is a graph showing an output example of a differential signal in the first to third embodiments of the present invention.
FIG. 6 is a diagram for explaining the processing operation of the utterance section detection unit of FIG. 3.
FIG. 7 is a diagram for explaining the processing operation of the integrated determination unit of FIG. 4.
FIG. 8 is a block diagram illustrating a configuration of a game device according to a second embodiment of the present invention.
FIG. 9 is a block diagram showing a detailed configuration of a lip recognition unit in the second and third embodiments of the present invention.
FIG. 10 is a diagram showing the processing operation of the differentiating circuit in the second and third embodiments of the present invention.
FIG. 11 is a diagram illustrating a processing operation of a pattern matching unit according to second and third embodiments of the present invention.
FIG. 12 is a block diagram showing a configuration of a game apparatus according to a third embodiment of the present invention.
FIG. 13 is a diagram illustrating a processing operation of an integrated determination unit according to the third embodiment of the present invention.
FIG. 14 is a diagram illustrating a processing operation of an integrated determination unit according to the third embodiment of the present invention.
FIG. 15 is a diagram illustrating a specific configuration example of the input device according to the invention.
FIG. 16 is a diagram showing a configuration of a voice selection device according to a fourth exemplary embodiment of the present invention.
FIG. 17 is a diagram showing input / output states in the voice selection device of FIG. 16.
FIG. 18 is a diagram showing a configuration of a voice selection device according to a modification of the present invention.
FIG. 19 is a diagram showing a configuration of a direction detecting device according to a fifth embodiment of the present invention.
FIG. 20 is a diagram for explaining a waveform and a frame of an input voice.
FIG. 21 is a diagram showing a configuration of a direction selection device according to a fifth embodiment of the present invention.
FIG. 22 is a diagram showing the configuration of another direction selecting device according to the fifth embodiment of the present invention.
FIG. 23 is a diagram illustrating speech waveforms, energy, and a cyclic memory.
FIG. 24 is a diagram for explaining a method for detecting a voice end point according to the sixth embodiment of the present invention.
FIG. 25 is a diagram illustrating a voice detection method according to a sixth embodiment of the present invention.
FIG. 26 is a block diagram showing a configuration of a speech recognition apparatus according to a sixth embodiment of the present invention.
FIG. 27 is a diagram showing a configuration of a voice reaction device using a voice recognition device and a voice selection device of the present invention.
FIG. 28 is a diagram showing a configuration of a voice reaction device using the direction detection device and the operation device of the present invention.
FIG. 29 is a diagram showing a configuration of a voice reaction device using a voice recognition device, a direction selection device, and an operation device of the present invention.
FIG. 30 is a diagram showing a configuration of a voice reaction device using the direction detection device, the direction selection device, and the operation device of the present invention.
FIG. 31 is a diagram showing a configuration of a voice reaction device using a voice recognition device and an operation device of the present invention.
FIG. 32 is a diagram showing a configuration of a voice reaction device capable of remote operation according to the present invention.
FIG. 33 is a diagram showing an example of a toy using the voice reaction device of the present invention.
FIG. 34 is a diagram showing a configuration of a conventional game device.
[Explanation of symbols]
1 Voice input section
3 Image input section
2 Voice recognition unit
4 Utterance section detection unit
5, 123 Integrated judgment section
6 Control unit
7 Airship
21 LED
22 Photodiode
81 Lip recognition unit
100, 100a Voice selection device
101 Random number generator
102 Voice selection part
103 I / O status memory
104 State transition part
105 I / O status database
400, 1301 Direction detection device
401 Direction detection unit
600, 700, 1401 Direction selection device
601 Offset calculation unit
602 Azimuth meter
603 Target direction memory
701 Direction calculation unit
1101 Voice end point detection device
1102 Voice detection device
1103 Feature quantity extraction device
1104 Distance calculation device
1105 Dictionary
1201 Voice recognition device
1202 Voice selection device
1302 Operating device
1701 Signal transmission device
1702 Signal receiving device

Claims

An image input means for optically inputting the movement of the operator's lips, converting the inputted movement of the lips into an electrical signal, and outputting the electrical signal;
Lip recognition means for obtaining movement of the lip based on the electrical signal, recognizing a word corresponding to the obtained movement of the lip, and outputting a recognition result;
Control means for controlling an object in accordance with a control signal based on the recognition result;
A game device comprising:

The lip recognition means includes
Storage means for storing a predetermined number of words;
Matching means for selecting one of the predetermined number of words according to the determined movement of the lips and determining that the selected word is the word corresponding to the movement of the lips;
The game device according to claim 1, comprising:

The storage means stores a lip movement corresponding to the predetermined number of words as a standard pattern,
The matching means calculates a distance from the determined lip movement for all of the standard patterns, and selects a word corresponding to one of the standard patterns having the smallest distance. The game device described in 1.

Voice input means for inputting voice, converting the voice to another electrical signal, and outputting the other electrical signal;
Voice recognition means for recognizing the voice based on the other electrical signal output from the voice input means;
Integrated determination means for outputting the control signal to be given to the control means based on both the recognition result by the voice recognition means and the recognition result by the lip recognition means;
The game device according to claim 1, further comprising:

Means for obtaining a speech recognition reliability for the recognition result by the speech recognition means;
Means for obtaining a lip recognition reliability for the recognition result by the lip recognition means;
The game apparatus according to claim 4, further comprising the above means, wherein the integrated determination means selects one of the recognition result by the voice recognition means and the recognition result by the lip recognition means based on the voice recognition reliability and the lip recognition reliability, and outputs the selected recognition result as the control signal.

The game device according to claim 1, wherein the image input means includes light emitting means for emitting light, and light receiving means for receiving the light reflected by the lips of the operator and converting the received light into the electrical signal.

The game device according to claim 4, wherein the image input means includes light emitting means for emitting light, and light receiving means for receiving the light reflected by the lips of the operator and converting the received light into the electrical signal.

The game device according to claim 6, wherein the light is applied to the lips from a side.

The game device according to claim 6, wherein the light is applied to the lips from the front.

The game apparatus according to claim 4, wherein the voice input unit includes at least one microphone.

The game device described above, wherein the voice input means has at least one microphone, and the at least one microphone and the light emitting means and the light receiving means of the image input means are provided on a single stand.