JP4086280B2

JP4086280B2 - Voice input system, voice input method, and voice input program

Info

Publication number: JP4086280B2
Application number: JP2002019457A
Authority: JP
Inventors: 政秀蟻生
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2002-01-29
Filing date: 2002-01-29
Publication date: 2008-05-14
Anticipated expiration: 2022-01-29
Also published as: JP2003223188A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声を扱う装置に関するものであり、特にユーザの発声が複数の音声入力に入りうる場合の音声入力システム、音声入力方法及び音声入力プログラムに関する。
【０００２】
【従来の技術】
これまでは音声によって機器を制御する場合や、音声をある機器に入力する場合にはユーザと音声入力機器は１対１で対応していることを主に想定していた。しかしながら、例えば一つの部屋に複数の音声入力装置がある場合などユーザの発声が複数の音声入力装置に入ってしまうことは十分あり得る。その場合に従来は、ユーザが特に対象機器を指定したり、音声入力しようと思っている機器以外に対しては音声入力を抑制するような操作を行ったりする必要があった。
【０００３】
【発明が解決しようとする課題】
本発明は、ユーザに負担をかけずにユーザの発声を入力したい音声入力装置に入力する音声入力システム、音声入力方法及び音声入力プログラムを提供することを目的とする。
【０００４】
【課題を解決するための手段】
本発明の音声入力システムは複数の音声入力装置がネットワークに接続され、前記音声入力装置は入力される音声を検知し、前記音声入力装置は入力される音声を検知したときに、検知した前記音声に関する判断情報を前記ネットワークを介して他の音声入力装置と授受し、前記音声入力装置は検知した前記音声に関する判断情報と、他の音声入力装置からの判断情報とをもとに検知した前記音声に対する処理の決定及び実行の判断を行うことを特徴とするものである。
【０００５】
また、本発明の音声入力方法はネットワークに接続された複数の音声入力装置において入力される音声をそれぞれ検知するステップと、前記音声入力装置で入力される音声を検知したときに、検知した前記音声に関する判断情報を前記ネットワークを介して他の音声入力装置と授受するステップと、前記音声入力装置は検知した前記音声に関する判断情報と、他の音声入力装置からの判断情報とをもとに検知した前記音声に対する処理の決定及び実行の判断を行うステップとを含むことを特徴とするものである。
【０００６】
また、本発明の音声入力プログラムはネットワークに接続された複数の音声入力装置において入力される音声をそれぞれ検知し、前記音声入力装置で入力される音声を検知したときに、検知した前記音声に関する判断情報を前記ネットワークを介して他の音声入力装置と授受し、前記音声入力装置は検知した前記音声に関する判断情報と、他の音声入力装置からの判断情報とをもとに検知した前記音声に対する処理の決定及び実行の判断を行う機能を実現することを特徴とするものである。
【０００７】
【発明の実施の形態】
以下、図面を参照しながら本発明による音声入力システムについて説明する。はじめに、本発明の全体の概要を図１を用いて説明する。
【０００８】
本発明の音声入力システムでは、ネットワーク104に複数の単体の音声入力装置101や音声入力装置102を有した機器103、例えばビデオテープレコーダが接続され、これらの単体の音声入力装置101や機器103に搭載された音声入力装置102によりユーザの発声する音声命令や伝言のメッセージあるいは会話等を計測し、入力された音声信号を信号処理手段によって適当な信号に変換する。そしてこの変換された信号から音声入力システムは入力された音声に対する処理を単体の音声入力装置101や機器103に搭載された音声入力装置102で行うことができる。
【０００９】
また単体の音声入力装置101や機器103に搭載された音声入力装置102はネットワーク104を介して情報の授受が可能となっており、入力された音声に対する処理として、ネットワーク上の他の単体の音声入力装置や機器に搭載された音声入力装置と情報の送受信ができる。
【００１０】
このとき、ネットワークへの情報の送信については情報が一つ一つの各音声入力装置に移っていくようなリレー方式でも、一つの音声入力装置から同時に複数の音声入力装置に送るようなブロードキャスト方式でも構わないが、音声という実時間処理が重要な用途であるので以降はブロードキャスト方式を念頭に置いて説明する。
【００１１】
ユーザの発声がネットワーク接続された複数の音声入力装置に入力された場合に、各音声入力装置での処理をどうするかという点が本発明によって解決する所である。また、ユーザの発声が単一の音声入力装置にしか入力されなかった場合でも本発明の処理で包含することができる。
【００１２】
また本発明の実施例としては、ユーザの発声という人間の発声を主に例に挙げて説明しているが、本発明は人間の音声に限定されたものではない。目的に応じて機械の動作音や動物の声でも、音声であれば構わないものとする。
【００１３】
次に本発明の実施の形態の音声入力システムを構成する音声入力装置について図2を用いて説明する。音声入力装置（20-1〜20-3）はそれぞれネットワーク21に接続されており、音声入力装置20-1はビデオテープレコーダ（以下ビデオとする）26に搭載され、音声入力装置20-3はエアーコンディショナー（以下エアコンとする）27に搭載され、また、音声入力装置20-2は単体で接続されている。音声入力装置20-1に入力された音声によりビデオ26の操作を行い、音声入力装置20-3に入力された音声によりエアコン27の操作を行う。なお、後述するように自分の音声入力装置への音声でなくとも、各機器は音声入力への処理を行うことができる。
【００１４】
各音声入力装置（20-1〜20-3）はそれぞれマイクロホン201、信号処理部202、中央処理部203、記憶部204、ネットワーク接続部205、情報表出部206から構成される。
【００１５】
ユーザが発声する音声入力はマイクロホン201に入力され、このマイクロホン201でユーザの発声を計測する。これは一般にあるマイクロホンで実用可能である。このマイクロホンは、単一のマイクロホンや複数のマイクロホン（マイクロホンアレイ）、指向性・無指向性マイクロホンなど、マイクロホンとして使えるものから構成できるものとする。
【００１６】
マイクロホンから取り込まれた音声信号は信号処理部202で後段の処理に必要な形式に処理される。この処理は例えば音声信号のMPEGによる圧縮や、音声認識で用いられるケプストラム特徴に変換する処理などが考えられる。なお、この信号処理部202はその他にも音声入力装置の用途に応じて適当な処理を実行できるように構成できるものとする。
【００１７】
また、この信号処理部202では次に説明する中央処理部203からの命令を受けて情報表出部206に伝える形式に変換する機能も含まれている。さらに、この情報表出部206では中央処理部203からのユーザに伝えるメッセージ内容から音声合成を行って合成音の信号に変換している。
【００１８】
なお、その他にも、ディスプレイ表示のための表示内容に変換したりする処理や情報表出部206におけるデバイスや、音声入力装置の用途に応じて処理を実行できるように構成することも可能である。
【００１９】
ただし、このマイクロホンからの音声信号の処理と情報表出部206へ送る情報についての処理は同一処理機構で行うか否かは問わないものとする。すなわち上記の処理を行う機構を総称して信号処理部202とする。
【００２０】
また、信号処理部202の入力としてマイクロホン以外のセンサ・デバイスも考えられる。例えばカメラからの動画像や触覚センサ、スイッチ等が挙げられる。その他のセンサ・デバイスからの入力も、音声入力装置の用途に応じて処理できるような信号処理部を構成できるものとする。これについては後述する。
【００２１】
中央処理部203では音声入力装置全体の処理を制御する。この中央処理部203が音声入力装置の状態を管理し、必要に応じて各処理機構に命令を送る。信号処理部202からの情報やネットワーク接続部205からの情報、そして記憶部204の情報を元に制御内容を決めることができる。また、他の音声入力装置に対して制御情報を送出する。本発明の音声入力システムとして音声をどう処理するかについては後述する。
【００２２】
記憶部204では中央処理部203で行う処理のプログラムやその作業領域、信号処理部202からの情報やネットワーク接続部205からの情報を保持しておく機構である。なお、この記憶部204は信号処理部202における情報記憶用やネットワーク接続部からの情報記憶用といったように回路的には別のものであっても構わないとする。
【００２３】
すなわち、音声入力装置における情報保持機構を総称して記憶部204と呼ぶことにする。この記憶部204は半導体メモリや磁気ディスクなどの機構で実現可能であり、データを保持できる任意の機構で構成可能なものであるが、この実施の形態では半導体メモリが使用されている。
【００２４】
記憶部204の使われ方や記憶される情報については中央処理部203の処理の説明と共に後述する。
【００２５】
ネットワーク接続部205はネットワーク21を通して音声入力装置間の情報の授受を行うための機構であり、LANでのネットワーク接続やブルートゥースといった無線技術といった機器間通信技術によって実現できるものとし、ここではLANでのネットワーク接続を用いている。
【００２６】
また、以上のような音声入力装置の機構のそれぞれ、もしくは全てが、他の機能を持つシステムのものと機構を共有しても構わないとする。例えばビデオ・システムのようなオーディオ・ヴィジュアル機器に音声入力装置が含まれている場合に、共通の信号処理回路を使ってお互いの機能を実現したり、同じ中央処理回路を用いて音声入力装置やビデオ・システムの機能の制御を行ったりすることが考えられる。
【００２７】
他にも共通の機構で音声入力装置と他のシステムの機能を実現する例が考えられるが詳細は省略する。
【００２８】
さらに、回路的な機構として音声入力装置やその他のシステムが別々にあるのでなく、共通の回路でありながら、プログラム的なプロセスとして別のシステムとして制御できる場合も上記に含まれているものとする。
【００２９】
次に中央処理部203が信号処理部202からの音声信号やネットワーク接続部205からの情報、記憶部204で保持されている情報をもとにして音声をどのように処理するかについて図3を用いて説明する。図3では図2のビデオ26に搭載された音声入力装置20-1（以下音声入力装置Ａとする）とエアコン27に搭載された音声入力装置20-3（以下音声入力装置Ｂとする）に対して音声が入力される例を示している。さらに、現在ユーザが音声入力装置Ｂに対して対話処理を行い、音声入力装置Ａは待機中の状態を示している。
【００３０】
まず、ユーザが音声入力装置Ａ及び音声入力装置Ｂに対して発声すると（step301）、各音声入力装置の信号処理部202ではマイクロホン201で取り込まれたユーザからの発声を検知し、信号処理される（step302）。
【００３１】
ここで、音声入力装置Ｂは既にユーザと対話処理を行っているので、音声入力装置Ｂ自身が対話処理中であって他のシステムの状態が対話状態でないとなれば、音声入力装置Ｂがユーザの発声した内容に対する処理を行う選択をする。（step303）
次に、音声入力装置Ｂの中央処理部203は音声入力装置の機能にあわせて取り込まれた音声の処理を行い、音声の内容にしたがって機器を操作し、対話終了後再び待機状態になる（step304）。
【００３２】
逆に音声入力装置Ａでは、音声入力装置Ｂがユーザとの対話状態であるので、信号処理された後（step302）、それ以上の処理を行わない（step305）とし、待機状態になる。
【００３３】
こうすることで、ユーザの発声が複数の音声入力装置で検知されてしまうような場合でも、ユーザが現在発声対象としている音声入力装置に対してのみ、楽にアクセスできるようにすることを可能とする。また上記ではユーザが複数の音声入力装置に対して発声するとした例を挙げたが、ユーザは意図的に複数の音声入力装置に検知されるように音声を発声する必要はなく、このことは以降の実施例でも同様である。
【００３４】
また、他の音声入力装置が対話状態でなければ処理を行うといった条件付けは、上記以外の条件についてユーザが任意に、もしくは音声入力装置が設定として定めることができるものとする。
【００３５】
また、ここでの対話は人間とシステムの一対一による音声のやり取りに限定したものではなく、人間からシステムへの一方的な音声発声やシステム側から視覚的な応答を返す場合、あるいはシステムから任意の人間に応答する場合を含んでも構わないものとし、以降の説明で用いられる対話についても同様である。
【００３６】
また、音声入力装置にはあるルールに基づいた順序関係があり、その順序関係に基づいて取り込まれた音声情報に対する処理を決めることもできる。ルールの具体例としては、音声入力装置の処理能力・ユーザによる設定・使用頻度・音声入力装置の機能に基づく設定値・マイクロホン以外からのセンサの情報や、これらの組み合わせ等が挙げられる。
【００３７】
次に上記の音声入力装置の機能による順位付けの例を図４を用いて説明する。
【００３８】
音声入力装置が搭載している機器としてウェアラブル・コンピューター（以下音声入力装置Ｃとする）と音声入力装置が搭載している機器としてビデオ・システム（以下音声入力装置Ｄとする）があり、前者の方が特定ユーザ向けなので順位が高く、ビデオ・システムは不特定のユーザが使い得るので順位が低いものとする。
【００３９】
このときユーザは音声入力装置Ｃ及び音声入力装置Ｄに対して発声し（step401）、それぞれの音声入力装置は信号処理部202においてマイクロホン201で取り込まれたユーザからの発声を検知した場合に、自音声入力装置の順位を送信しあう（step402）。
【００４０】
次に、他の音声入力装置の順位と比較し、順位の高い音声入力装置Ｃがそのユーザの発声を処理する（step403）。
【００４１】
順位の低い音声入力装置Ｄは処理は行わず（step404）、待機中のままになる。
【００４２】
上記の例では順位情報を送信しているが、送信情報に順位以外の情報があっても構わないし、発声を検知してからでなく前もって情報のやり取りをしておく、あるいはプリセットの順位情報をもとに自音声入力装置で処理するかの判断を行っても構わないとする。
【００４３】
上記のような実施例によって、例えば音声入力装置を搭載する機器として火災報知器や緊急警報器のような非常用機器は他のどんな機器よりも順位が高く、例えば「助けて」という発声に対していかに通常機器で音声命令として登録していてもまずは非常用機器に対する音声入力が優先されるということも可能となる。
【００４４】
また、音声入力装置内に時間を処理する機構を設けて、それによって処理の判断の参考にすることもできる。図5で例を挙げて説明する。
【００４５】
図5ではビデオに搭載された音声入力装置（以下音声入力装置Ｅとする）とエアコンに搭載された音声入力装置（以下音声入力装置Ｆとする）に対して音声が入力される例を示しており、音声入力装置Ｅは音声入力装置Ｆよりユーザに近い位置に設置している。
【００４６】
このときユーザは音声入力装置Ｅ及び音声入力装置Ｆに対して発声し（step501）、それぞれの音声入力装置は信号処理部202においてマイクロホン201で取り込まれたユーザからの発声を検知した場合に、自音声入力装置の発声検知時間を送信しあう（step502）。
【００４７】
次に、音声を検知した他の音声入力装置からの検知時間と自音声入力装置の検知時間を比較し、自音声入力装置が最も早かった場合は音声を処理し（step503）、そうでなければ当該音声を処理しないという判断をする（step504）ことで、ユーザが指定しなくともユーザに最も近い音声入力装置が音声の処理を行えるようになる。
【００４８】
また、音声検知時間がもっとも長かった音声入力装置がユーザの発声を最初から最後まで検知できたとみなして、その音声入力装置が当該音声の処理を行うといったように音声検出の早さ以外の時間情報を判断基準とすることもできる。
【００４９】
また、ユーザの発声の音量をマイクロホンから取り込まれた音声から計測し、処理の判断の参考にすることもできる。音量情報を利用した本発明の例として図6を用いて説明する。
【００５０】
ここでは上述した音声入力装置Ｅと音声入力装置Ｆがある場合に、ユーザは音声入力装置Ｅ及び音声入力装置Ｆに対して発声し（step601）、それぞれの音声入力装置は信号処理部202においてマイクロホン201で取り込まれたユーザからの発声を検知した場合に、音量情報を送信しあう（step602）。すなわち、ユーザの発声の音量をマイクロホンから取り込まれた音声から計測し、ネットワーク上の他の音声入力装置に伝える。
【００５１】
次に、音声を検知した他の音声入力装置からの音量情報と自音声入力装置の音量情報を比較し、自音声入力装置が最も大きかった場合は音声を処理し（step603）、そうでなければ当該音声を処理しないという判断をする（step604）ことで、ユーザが指定しなくともユーザに最も近い音声入力装置が音声の処理を行う、もしくは元の発声を最もよく収録した音声で処理を行えるようになる。この音量情報としては音圧レベルや音響パワーレベル、あるいはphonやsoneなどの単位が挙げられる。
【００５２】
また、周囲の雑音に対するユーザの発声の信号対雑音比をマイクロホンから取り込まれた音声から計算して、処理の判断の参考にすることもできる。信号対雑音比を利用した本発明の例として図7を用いて説明する。
【００５３】
図7ではビデオに搭載された音声入力装置（以下音声入力装置Ｇとする）とエアコンに搭載された音声入力装置（以下音声入力装置Ｈとする）に対して音声が入力される例を示しており、騒音源があり、音声入力装置Ｇは音声入力装置Ｈより騒音源が遠い位置にあるものとする。
【００５４】
始めに、各音声入力装置は常時音声を取り込んで周囲の雑音の情報を計測しておく（step701）。
【００５５】
次に、ユーザは音声入力装置G及び音声入力装置Hに対して発声し（step702）、それぞれの音声入力装置は信号処理部202においてマイクロホン201で取り込まれたユーザからの発声を検知し、ユーザの発声をマイクロホンから取り込んだときに雑音情報をもとに信号対雑音比を計算し、ネットワーク上の他の音声入力装置に伝える（step703）。
【００５６】
次に、音声を検知した他の音声入力装置からの信号対雑音比情報と自音声入力装置の信号対雑音比情報を比較し、自音声入力装置が最も大きかった場合は音声を処理し（step704）、そうでなければ当該音声を処理しないという判断をする（step705）。
【００５７】
これにより、ユーザが指定しなくともユーザに最も近い音声入力装置が音声の処理を行う、もしくは元の発声を最もよく収録した音声で処理を行えるようになる。ここでの例では、無発声中でも常時周囲音を取り込んで雑音を計算する例を挙げたが、他にも例えば発声を検知してから発声中の無音区間をもとに雑音を推定してもよい。
【００５８】
また、記憶部に使用状況に関する過去の履歴を保持しておき、それを処理の判断に利用することもできる。過去の履歴を利用した本発明の例について図8を用いて説明する。
【００５９】
図8ではビデオに搭載された音声入力装置（以下音声入力装置Ｉとする）とエアコンに搭載された音声入力装置（以下音声入力装置Ｊとする）に対して音声が入力される例を示しており、音声入力装置Ｉは音声入力装置Ｊより使用頻度が高いものとする。
【００６０】
始めに、ユーザが両方の音声入力装置に対して発声（step801）し、この発声に対して最近の使用時間・使用回数等をネットワーク経由で他の音声入力装置に伝える（step802）。
【００６１】
一方、音声入力装置Ｉでは音声入力装置Ｊの使用履歴と比較して、音声入力装置Ｉが最もよく使われているなら音声の処理を行うよう判断する（step803）ことでユーザがわざわざ指定しなくてもよく使われている音声入力装置Ｉを利用できるようになる。
【００６２】
また、他方、音声入力装置Ｊでは音声入力装置Ｉの使用履歴と比較して、音声入力装置Ｊがあまり使われていないなら音声の処理は行わず（step804）、待機中のままになる。
【００６３】
また、音声認識をする手段を備えその認識結果を利用して取り込まれた音声の処理を判断することもできる。信号処理部からの情報は音声認識を行う機構で処理されその結果が中央処理部に渡される。このとき行われる音声認識は、演算処理を中央処理部で扱っても構わない。
【００６４】
また音声認識に使われる手法は混合正規分布をモデルに使ったHMMやDPマッチングのような一般に現実化されている手法で構わないとし、このとき使われるHMMや言語モデルは記憶部にあっても構わないとする。音声認識の語彙は音声入力装置毎に異なっていても共通化されていても構わないとする。さらにその語彙に制御命令を対応させることで音声コマンドを可能にすることもできる。この音声認識を利用した本発明の例について図9で説明する。
【００６５】
図9ではビデオに搭載された音声入力装置（以下音声入力装置Ｋとする）とエアコンに搭載された音声入力装置（以下音声入力装置Ｌとする）に対して音声が入力される例を示している。
【００６６】
始めに、各音声入力装置に対してユーザからの音声入力装置Ｋに関連する「再生」という発声があった（step901）場合に、各音声入力装置はその音声の検知と音声認識を行う（step902）。
【００６７】
その音声認識した結果を中央処理部は受け取り、認識結果から自音声入力装置に対する発声か否かを判断し（step903）、その判断結果と認識結果をネットワーク経由で他の音声入力装置に伝える（step904）。
【００６８】
一方、他の音声入力装置の判断結果と認識結果をみて、音声入力装置Ｋでは自音声入力装置への発声と判断（step905）できたら当該音声に対する処理を行うことで、ユーザが特に指定しなくても発声対象の音声入力装置を使うことができるようになる。
【００６９】
他方、音声入力装置Ｌでは自音声入力装置への発声と判断しない（step906）ので、待機中のままである。
【００７０】
また、音源の識別を行う手段を備え、その識別結果を利用して音声の処理を判断することもできる。音源の種類としては人間、機械、動物など使用目的に応じて考えられるが、以降では例として人間の発声を音声とした場合について説明する。信号処理部からのユーザの音声情報に対して話者識別を行い、その結果を中央処理部に伝える。この話者識別を行う方法は話者毎に学習または適応されたHMMに対する尤度から判断するものや、性別や年齢層毎のモデルで最も近いカテゴリーを選ぶものなど、個人あるいは話者の特性（例えば性別や年齢層など）を識別できる手法ならば構わないものとする。
【００７１】
この話者識別を使った本発明の例を次の図10を用いて説明する。
【００７２】
図10ではビデオに搭載された音声入力装置（以下音声入力装置Ｍとする）とエアコンに搭載された音声入力装置（以下音声入力装置Ｎとする）に対して音声が入力され、あるユーザは片方の音声入力装置Mでのみ音声の処理が可能である場合の例を示している。
【００７３】
始めに、各音声入力装置に対してユーザからの発声があった（step1001）場合に、ユーザの発声を検知した音声入力装置は話者識別を行い（step1002）、自音声入力装置で処理すべき発声か否か判断（step1003）をして、その判断結果と話者識別結果をネットワーク経由で他の音声入力装置に伝える（step1004）。
【００７４】
そして自音声入力装置と他の音声入力装置における判断結果と話者認識結果をみて、自音声入力装置への発声と判断（step1005）できたら当該音声に対する処理を行い、逆に他方の音声入力装置Nは自音声入力装置への発声ではないと判断（step1006）できたら処理を行わないとすることで、ある音声入力装置が特定のユーザに利用可能である場合に、ユーザが特に指定しなくても発声対象の音声入力装置を使うことができるようになる。
【００７５】
また、話者識別の信頼性が低い場合や複数話者が候補となった場合に、システム側からさらに暗証番号や定型句あるいは自由発声を促してさらにデータを得ることによって識別精度を上げてから話者識別以降の処理をおこなってもよい。
【００７６】
また、ここでは人物の話者認識について述べているが、前記のように故障者や動物の音に応じて識別とその後の処理を行うことも可能である。
【００７７】
また、音声入力装置やネットワーク上の他の機器と共通の命令を持ち、お互いに許された範囲で制御することもできる。こうすることで、他の音声入力装置の働きを抑制したり、音声入力装置同士の互換性をよくしたりすることができる。
【００７８】
この例を図11で説明する。
【００７９】
例えばネットワーク1102に接続されている全ての音声入力装置1101が「電源ON」「電源OFF」「省電力」といった共通の電源管理命令を持っているときに、ネットワーク1102に繋がっているパーソナルコンピュータ1103から一度に複数システムも含めた任意の音声入力装置1101の電源を操作する命令をネットワーク経由で送信し、各音声入力装置がその命令を実行することが出来る。
【００８０】
また、音声入力装置やネットワーク上の他の機器と共通の音声による制御命令と、入力された音声とその命令をマッチングする手段を備えることで、より平易で確実な音声による制御命令の実行を可能とする。この例について図12のフロー図を用いて説明する。
【００８１】
図12の例では音声入力装置を有しているビデオ（音声入力装置Ｏ）と音声入力装置を有しているエアコン（音声入力装置Ｐ）があったときに、ユーザが「ビデオ」「エアコン」と命令対象の名称を発声した後で、「電源ON」「電源OFF」といったように共通の動作について共通化された命令を発声する。
【００８２】
ここで、ユーザから「ビデオ」「電源ON」という発声があった場合（step1201）、音声入力装置Ｏ及び音声入力装置Ｐは前述の音声認識で使われるマッチング手段で機器名称と機器命令を認識（step1202）し、自身のシステムへの命令か、処理可能かについて判断する（step1203）。
【００８３】
その結果をネットワーク上の他の音声入力装置や制御可能機器にその結果を伝達（step1204）し、その結果と他の音声入力装置からの結果から自音声入力装置が処理すべき発声か判断（step1205）してその制御命令に対応した処理を行うことができる。
【００８４】
共通化された命令に対して複数の音声入力装置から得られた結果を使うことが、これまでの音声によるリモコンや音声によって命令する機器とは異なる点である。
【００８５】
また、ネットワーク上に音声による制御可能機器が複数ある場合に、記憶部でその制御命令の全てまたは一部に関する情報を記憶できるような仕組みと、入力された音声とそれらの命令をマッチングさせる手段を備えることで、より平易で確実な音声による制御命令の実行が可能となる。
【００８６】
この例を次の図13、図14を用いて説明する。ネットワーク上に音声入力装置で制御可能なビデオ（音声入力装置Ｑ）とエアコン（音声入力装置Ｒ）があるとして、音声命令について音声入力装置Ｑが「再生」「停止」、音声入力装置Ｒが「温度あげて」「温度さげて」等であった場合に、ネットワーク上のそれぞれの音声入力装置では認識単語と対象機器を関連付けて記憶できるようになっているとする。
【００８７】
図13はこの認識単語と対象機器、そして処理内容を結びつける概念を表している。この図13のような認識単語と処理内容との結びつけは、単純な表引きやオブジェクト指向や高次の知識処理によって実現できるものとし、ここではその詳細は省略する。
【００８８】
図14のフロー図に示すようにユーザが「ビデオ」「再生」と発声した場合（step1401）、音声入力装置Ｑと音声入力装置Ｒは発声の検知と認識を行う（step1402）。
【００８９】
さらに、図13に示した概念を用いて発声内容を判断し（step1403）、その結果をネットワーク上の他の音声入力装置に伝達し（step1404）、その結果と他の音声入力装置から送られてきた結果をもとに自音声入力装置が処理すべき発声だったかを判断して（step1405）、その制御命令に対応した処理を行う。
【００９０】
上述の「ビデオ」「再生」の場合、図13のような認識単語と対象機器、処理内容の結びつきによってどちらの音声入力装置も発声がビデオに対して再生の命令であったと判断できる。さらにネットワーク経由で送信しあった情報により、発声が一意に解釈でき、音声入力装置は認識結果に対応する処理内容を行うことが出来る。
【００９１】
またこれまでの音声認識を用いた例では基本的に単語認識による例を挙げてきたが、ワードスポッティングや連続音声認識の技術を使っても、各音声入力装置での音声認識のスペックに差があっても、図１３のような認識結果と処理内容の対応づけの概念がされれば構わないとする。
【００９２】
また、上述の図14で示した例については、音声入力装置以外のネットワークに接続された制御対象機器についても処理できるものとする。その例について図15を用いて説明する。
【００９３】
図15に示すように音声入力装置のついたエアコン1501と単体の音声入力装置1502及びビデオ1503がネットワーク1504に接続されており、ここでユーザがビデオ1503を操作する発声をする。
【００９４】
この音声入力装置は図14のフロー図に示す流れで音声の検知及び認識を行い、図13のような概念で認識結果と処理内容を結びつける。そして認識結果と処理内容の判断をしてからネットワーク1504上の他のシステムに送信する。
【００９５】
その結果、ビデオ1503は認識結果に応じた処理内容を受け、発声を実行することができる。よってビデオ1503自体に音声入力装置がなくても自分が制御可能な情報についてネットワークに情報を流し、各音声入力装置に図13のような認識結果と処理内容の概念をつくることで音声による制御が可能となる。
【００９６】
図12から図15までで説明した音声認識を用いた本発明の例については、これまでブロードキャスト方式で音声認識と判断の結果をネットワークの全てのシステムに送信する例を挙げてきたが、認識結果によって直接その対象機器にのみ認識結果と判断の結果を伝えてもよいものとする。
【００９７】
また、音声入力装置において、マイクロホンによる音声入力以外のセンサがある場合に、そのセンサ情報を利用して検知した音声の処理内容を判断することもできる。この例について、図16を用いて説明する。
【００９８】
図16に示すように音声入力装置を有したエアコン1601と単体の音声入力装置1602がネットワーク1603に接続されている。また、この単体の音声入力装置にはカメラを有しておりカメラから周辺の画像情報を取り入れることができる。なお、このカメラの入力は図2の信号処理部202に入力され画像処理される。
【００９９】
この音声入力システムにおいて、ユーザがエアコン1601の音声入力装置に対して発声する。ここで、単体の音声入力装置1602に付いているカメラにより話者がどの方向を向いているかを推定する。なお、この話者がどの音声入力装置を向いているかについては、画像から人間を抽出する技術、顔部分を推定してその向きを推定する技術、口の動きから検知した発声がどの人間からのものか推定する技術等の組み合わせで実現できるものとするが、ここでは詳細は省略する。
【０１００】
推定された話者の顔向きから話者がエアコン1601の方を向いていると判断すると、発声の対象機器をエアコンと判断して、各音声入力装置は結果をネットワーク1603で他の音声入力装置に通知し、これまで述べてきた例のように処理を判断する。
【０１０１】
ここではカメラを使った画像情報を利用した例を挙げたが、スイッチ等の直接的なセンサ・デバイスや音源定位のためのマイクロホンアレイなどが考えられるが、どのような計測技術を使うかは限定しない。
【０１０２】
また、図2の音声入力装置の構成で述べたようにマイクロホン201、情報表出部206、信号処理部202、中央処理部203、記憶部204、ネットワーク接続部205は音声入力装置においてそれぞれその働きをするものの総称であるので、ネットワークを通した形や直接接続された形でそれぞれが物理的に複数に分かれていても構わないとする。この例を図17で説明する。
【０１０３】
図17に示すように音声入力装置は物理的には2つの音声入力装置（1701、1702）に分かれていてもネットワーク1703で接続されており適切な情報のやり取りが出来るものとする。このときユーザの発声に対して、2つの音声入力装置（1701、1702）で一つの音声入力装置として働くことが出来る。
【０１０４】
また、これまで述べたような音声入力装置に対する判断の基準は、他の音声入力装置の情報やユーザの設定によって変えられるものとする。例えば、音声入力装置は音声を検知したときの検知や認識結果等の情報以外に、一定時間ごとに他の音声処理システムの処理状態、処理性能、認識可能語彙やそれに対する処理内容をやり取りして、自音声入力装置の記憶部に蓄えておけるとする。
【０１０５】
そのような情報を利用して、現在はある音声入力装置は処理出来ないから自音声入力装置で処理可能な場合は代わって処理するとか、自音声入力装置より性能のいい音声入力装置の認識結果を自分の結果より重視することで認識誤りを補正するとか、ユーザが自分の好みに合わせて上述のような判断の制御を可能とすることが出来る。
【０１０６】
また、これまで述べたような音声入力装置に対する入力の判断の手段は、上述のものを組み合わせても構わないとする。例えば、検知時間が早い音声入力装置が発声を扱うとするが、ある許容時間内では時間差がないものとし、同じ時間の場合は音量で判断するとか、音声認識の尤度と音声入力装置の順位を重み付けして最もスコアの高い音声入力装置で音声を扱うなどが考えられる。
【０１０７】
また、上述のような判断の手段の組み合わせにより得られた情報を利用して高次のエージェントシステムや知識処理システムで判断する場合も考えられる。
【０１０８】
また、これまで述べたような音声入力装置における処理の判断の手段は、ネットワーク上の音声入力装置間で同一であることを必須とはしないものとする。例えば音声入力装置が2つあり、一つは音声の検知時間のみで、他方は音量情報のみで判断する場合には、音声を検知したあとに相互に授受する情報は必ずしも対応は取れないが、各々の音声入力装置でその場合における処理を装置の目的に応じて設定しておけば、音声入力システムとして処理が破綻せずに各々の音声入力装置で処理の判断が可能である。
また、上述のような音声入力装置の判断の手段が各々の音声入力装置において異なっている場合に、ネットワークを通して授受した情報をもとに音声入力装置より高次のエージェントシステムや知識処理システムで処理を判断する場合も考えられる。
【０１０９】
また、これまで述べたような音声入力装置に対する入力の判断において、音声検知時間や音量といった発声に関する情報や、音声認識結果や識別結果といった情報から、ユーザがどの機器に対してどのような音声入力を行ったのかが一意には判断できなかった場合は、音声入力装置の一つがユーザと対話処理を行って決定したり、マイクロホン以外のセンサ情報といった他の条件を使って決定したりすることもできる。
【０１１０】
次に、これまで述べたような音声入力装置において先に説明した図2の情報表出部206とこれまでの説明の補足となる例を次の図18を用いて説明する。
【０１１１】
図18に示すように音声入力装置を有したエアコン1801、単体の音声入力装置1802及び音声入力装置を有したビデオ1803がネットワーク1804に接続されている。また、これらの音声入力装置は図2の情報表出部206を有している。
【０１１２】
この音声入力システムでは前述したように、待機中に各音声入力装置は自音声入力装置の情報、すなわち認識語彙、処理内容やここでは特に情報表出部の有無と表現可能なメディアの情報をやり取りして記憶部に保存してあるものとする。
【０１１３】
この例での各音声入力装置の情報表出部は全てスピーカを備え、中央処理部と信号処理部によって合成された任意文の音声をユーザに返すことができるとする。そしてその情報表出部への制御命令の一部は音声入力装置で共通化されているとする。つまりある音声入力装置が自分の情報表出部からユーザに応答を返す代わりに、ネットワーク上の別な音声入力装置の情報表出部からユーザへの応答を可能とする。
【０１１４】
ここでユーザから「ビデオ」「再生」という発声があったときに、エアコンに接続された音声入力装置と単体の音声入力装置がその音声を検知したとする。なお、ユーザの位置は単体の音声入力装置に一番近いところにあるとする。
【０１１５】
これまで述べてきたような手順により両音声入力装置は音声の検知、認識、自音声入力装置への命令か判断して、結局「ビデオ」への「再生命令」と判断し、それぞれネットワーク上の他の音声入力装置へ伝える。ビデオに接続された音声入力装置は直接音声を検知しないが、ネットワーク上の別な音声入力装置からの情報を受け、自音声入力装置への命令と解釈して、再生命令がされた場合の処理を実行する。
【０１１６】
またこのとき、単体の音声入力装置の方がユーザに近いため、ネットワーク上に送られた音量や信号対雑音比の情報で判断したときに、ビデオに接続された音声入力装置よりも単体の音声入力装置の方が音声処理に適していることを各音声入力装置は判断できる。
【０１１７】
よって、単体の音声入力装置とビデオの音声入力装置はそれぞれ単体の音声入力装置がユーザとの音声の授受を行う音声入力装置と判断できる。
【０１１８】
再生命令を受けたビデオの音声入力装置は、ビデオに対して再生の制御命令を送る一方、再生を始めたことをユーザに伝えるために、「再生を開始しました」という合成音声を単体の音声入力装置からユーザに返すよう命令を生成して、ネットワークを介して単体の音声入力装置へ伝える。このときビデオの音声入力装置から送信される制御命令はこれまでのネットワークへの情報送信と同様に単体の音声入力装置一つへ直接送信してもよいし、単体の音声入力装置への命令という情報を含んだ形で、ブロードキャスト形式で全ての音声入力装置へ伝えられてもよい。
【０１１９】
このようにしてビデオの音声入力装置から送られたユーザへの応答命令を解釈して、単体の音声入力装置は合成音声で「再生を開始しました」というメッセージをユーザに伝えることができる。
【０１２０】
また、この処理を通して単体の音声入力装置とビデオの音声入力装置は、ユーザと対話処理中であるというフラグを一定時間立てることで、ユーザの次の発声を優先的に処理し、エアコンの音声入力装置で処理しなくてもよいように出来るという例については既に述べてある。
【０１２１】
次に、これまで述べてきたような音声入力装置において音声入力装置が何らかの基準でグループ化されている場合の例について図19を用いて説明する。
【０１２２】
この例では音声入力装置の場所を基準としグループ「キッチン」1901、グループ「ウェアラブル」1902、グループ「リビング」1903のグループは全てネットワーク1904で接続されている。また、それぞれのグループ内に音声入力装置があり、これらの各グループ内におけるそれぞれの音声入力装置は他グループを同定できる情報を持っているものとする。
【０１２３】
ただし、自グループの他の音声入力装置に関して記憶部が持つ情報と、他グループに関してもつ情報の種類は必ずしも同一でなくてよい。具体的にはここでは他グループにおける各々の音声入力装置の認識語彙やそれに対応する対象機器や処理内容の情報までは持たないとする。
【０１２４】
ここでユーザが「リビング」「ビデオ」「再生」と発声し、それがグループ「キッチン」とグループ「ウェアラブル」の音声入力装置で検知されたとする。これまで述べてきた例と同様に、検知した音声入力装置で認識と自音声入力装置で処理すべきか判断した結果、自グループへの発声でなくグループ「リビング」への発声と判断し、その音声情報や判断結果をグループ「リビング」の音声入力装置へ伝える。
【０１２５】
このとき基本的に同定できたグループにのみ情報を送ることで、多くの音声入力装置がネットワークに接続されたときに必要な音声入力装置のみが情報のやり取りをできるようになることがグループ化することの利点である。
【０１２６】
したがって、グループ「リビング」の音声入力装置は自グループ宛の音声に関する情報を受け取ることで自グループ内の「ビデオ」に対する「再生」の命令と判断してそれに対応する処理をすることができる。
なお、本発明は音声入力プログラムに適用することも言うまでもない。
【０１２７】
【発明の効果】
以上説明したように、本発明はユーザの発声に対して他の音声入力装置からの情報を利用することで、ユーザに負担をかけずに音声に対する処理を決定することができる。
【図面の簡単な説明】
【図１】本発明の一実施形態に係る音声入力システムの構成を示す図。
【図２】本発明の一実施形態に係る音声入力システムを構成する音声入力装置を示す図。
【図３】本発明の一実施形態に係る音声入力システムの動作を示すフロー図。
【図４】本発明の一実施形態に係る音声入力システムの他の動作を示すフロー図。
【図５】本発明の一実施形態に係る音声入力システムの他の動作を示すフロー図。
【図６】本発明の一実施形態に係る音声入力システムの他の動作を示すフロー図。
【図７】本発明の一実施形態に係る音声入力システムの他の動作を示すフロー図。
【図８】本発明の一実施形態に係る音声入力システムの他の動作を示すフロー図。
【図９】本発明の一実施形態に係る音声入力システムの他の動作を示すフロー図。
【図１０】本発明の一実施形態に係る音声入力システムの他の動作を示すフロー図。
【図１１】本発明の一実施形態に係る他の音声入力システムの構成を示す図。
【図１２】本発明の一実施形態に係る音声入力システムの他の動作を示すフロー図。
【図１３】本発明の一実施形態に係る音声入力システムに係り、認識単語、対象機器、処理内容を結びつける概念を示す図。
【図１４】本発明の一実施形態に係る音声入力システムの他の動作を示すフロー図。
【図１５】本発明の一実施形態に係る他の音声入力システムの構成を示す図。
【図１６】本発明の一実施形態に係る他の音声入力システムの構成を示す図。
【図１７】本発明の一実施形態に係る他の音声入力システムの構成を示す図。
【図１８】本発明の一実施形態に係る他の音声入力システムの構成を示す図。
【図１９】本発明の一実施形態に係る他の音声入力システムの構成を示す図。
【符号の説明】
101・・・音声入力装置
102・・・音声入力装置
103・・・機器
104・・・ネットワーク
201・・・マイクロホン
202・・・信号処理部
203・・・中央処理部
204・・・記憶部
205・・・ネットワーク接続部
206・・・情報表出部

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to devices that handle voice, and more particularly to a voice input system, a voice input method, and a voice input program for cases in which a user's utterance may enter a plurality of voice inputs.
[0002]
[Prior art]
Until now, when a device is controlled by voice or voice is input to a device, it has mainly been assumed that the user and the voice input device correspond one to one. However, when there are several voice input devices in one room, for example, the user's utterance may well be picked up by more than one of them. In that case, the user has conventionally had to explicitly specify the target device, or perform an operation that suppresses voice input on every device other than the intended one.
[0003]
[Problems to be solved by the invention]
An object of the present invention is to provide a voice input system, a voice input method, and a voice input program that input a user's utterance to the voice input device the user intends, without placing a burden on the user.
[0004]
[Means for Solving the Problems]
In the voice input system of the present invention, a plurality of voice input devices are connected to a network, and each voice input device detects input voice. When a voice input device detects input voice, it exchanges judgment information about the detected voice with the other voice input devices via the network. Based on its own judgment information about the detected voice and the judgment information from the other voice input devices, the voice input device determines the processing for the detected voice and judges whether to execute it.
[0005]
The voice input method of the present invention includes: a step of detecting input voice at each of a plurality of voice input devices connected to a network; a step in which, when a voice input device detects input voice, it exchanges judgment information about the detected voice with the other voice input devices via the network; and a step in which the voice input device determines the processing for the detected voice and judges whether to execute it, based on its own judgment information about the detected voice and the judgment information from the other voice input devices.
[0006]
The voice input program of the present invention realizes functions of: detecting input voice at each of a plurality of voice input devices connected to a network; exchanging, when a voice input device detects input voice, judgment information about the detected voice with the other voice input devices via the network; and determining the processing for the detected voice and judging whether to execute it, based on the device's own judgment information about the detected voice and the judgment information from the other voice input devices.
[0007]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, a voice input system according to the present invention will be described with reference to the drawings. First, an overview of the present invention will be described with reference to FIG. 1.
[0008]
In the voice input system of the present invention, a plurality of standalone voice input devices 101 and devices 103 equipped with a voice input device 102 (for example, a video tape recorder) are connected to the network 104. The standalone voice input devices 101 and the voice input devices 102 mounted on the devices 103 capture voice commands, spoken messages, conversations, and the like uttered by the user, and signal processing means converts the input voice signal into a suitable signal. Based on the converted signal, the voice input system can process the input voice in the standalone voice input devices 101 or in the voice input devices 102 mounted on the devices 103.
[0009]
In addition, the standalone voice input devices 101 and the voice input devices 102 mounted on the devices 103 can exchange information via the network 104. As part of processing the input voice, they can send information to and receive information from the other standalone voice input devices and device-mounted voice input devices on the network.
[0010]
At this time, information may be sent over the network either by a relay method, in which the information is passed from one voice input device to the next, or by a broadcast method, in which one voice input device sends it to a plurality of voice input devices simultaneously. However, since voice is an application in which real-time processing is important, the following description assumes the broadcast method.
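The difference between the two delivery schemes can be illustrated with a small in-memory simulation. This is only a sketch: the `Device` class, the `relay` and `broadcast` functions, and the use of a step count as a stand-in for delay are illustrative assumptions, not anything the specification defines.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical stand-in for one voice input device on the network."""
    name: str
    inbox: list = field(default_factory=list)


def relay(sender, devices, message):
    """Relay scheme: the message is passed from device to device in turn.

    Returns the number of sequential hand-offs needed to reach every
    other device.
    """
    hops = 0
    for device in devices:
        if device is sender:
            continue
        hops += 1  # each receiver waits for the previous hand-off
        device.inbox.append((sender.name, message))
    return hops


def broadcast(sender, devices, message):
    """Broadcast scheme: one transmission reaches all other devices at once."""
    for device in devices:
        if device is not sender:
            device.inbox.append((sender.name, message))
    return 1  # a single transmission step, regardless of network size


devices = [Device("video"), Device("standalone"), Device("air_conditioner")]
relay_hops = relay(devices[0], devices, {"detected": True})
broadcast_hops = broadcast(devices[0], devices, {"detected": True})
```

In this toy model the relay needs one step per receiver while the broadcast needs a single step, which is the real-time argument the text gives for preferring the broadcast method.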
[0011]
What the present invention addresses is how each voice input device should handle the processing when a user's utterance is input to a plurality of voice input devices connected to a network. The processing of the present invention also covers the case in which the user's utterance is input to only a single voice input device.
[0012]
The embodiments of the present invention are described mainly using human speech, that is, the user's utterance, as the example, but the present invention is not limited to the human voice. Depending on the purpose, the sound may be the operating sound of a machine or the cry of an animal, as long as it is a sound.
[0013]
Next, the voice input devices constituting the voice input system according to the embodiment of the present invention will be described with reference to FIG. 2. The voice input devices (20-1 to 20-3) are each connected to the network 21. The voice input device 20-1 is mounted on a video tape recorder (hereinafter referred to as the video) 26, the voice input device 20-3 is mounted on an air conditioner 27, and the voice input device 20-2 is connected as a standalone unit. The video 26 is operated by voice input to the voice input device 20-1, and the air conditioner 27 is operated by voice input to the voice input device 20-3. As will be described later, each device can also process voice input that was not captured by its own voice input device.
[0014]
Each voice input device (20-1 to 20-3) includes a microphone 201, a signal processing unit 202, a central processing unit 203, a storage unit 204, a network connection unit 205, and an information display unit 206.
[0015]
The voice uttered by the user is input to the microphone 201, which measures the user's utterance. Any commonly available microphone can be used for this purpose. The microphone may consist of anything usable as a microphone, such as a single microphone, a plurality of microphones (a microphone array), or a directional or omnidirectional microphone.
[0016]
The audio signal captured by the microphone is processed by the signal processing unit 202 into the format required for subsequent processing. Examples of this processing include MPEG compression of the audio signal and conversion into the cepstrum features used in speech recognition. The signal processing unit 202 can also be configured to execute other appropriate processing according to the use of the voice input device.
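As one illustration of the cepstrum conversion mentioned above, the following sketch computes a plain real cepstrum (the inverse DFT of the log magnitude spectrum) for a short frame using only the standard library. The function names, the naive O(n^2) transform, and the `eps` guard are assumptions made for this example; the specification does not say which cepstral variant is used, and practical recognizers often use mel-frequency variants instead.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform (O(n^2)); adequate for short frames."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, eps=1e-12):
    """Real cepstrum: inverse DFT of the log magnitude spectrum of the frame.

    eps avoids log(0) when a spectral bin is exactly zero.
    """
    n = len(frame)
    log_mag = [math.log(abs(x) + eps) for x in dft(frame)]
    # The log magnitude spectrum of a real frame is symmetric, so the
    # inverse transform is real up to rounding error.
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


# A unit impulse has a flat magnitude spectrum of 1, so log|X| is 0 in
# every bin and the cepstrum is (numerically) all zeros.
impulse = [1.0] + [0.0] * 7
cepstrum = real_cepstrum(impulse)
```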
[0017]
The signal processing unit 202 also includes a function of receiving commands from the central processing unit 203, described next, and converting them into a format to be passed to the information display unit 206. Further, the information display unit 206 performs speech synthesis on the message content that the central processing unit 203 sends for the user, converting it into a synthesized speech signal.
[0018]
In addition, the unit can be configured to perform other processing, such as conversion into display content for a screen, according to the devices in the information display unit 206 and the use of the voice input device.
[0019]
However, it does not matter whether the processing of the audio signal from the microphone and the processing of the information sent to the information display unit 206 are performed by the same processing mechanism. That is, the mechanism that performs the above processing is collectively referred to as a signal processing unit 202.
[0020]
Also, a sensor device other than a microphone can be considered as an input to the signal processing unit 202. For example, a moving image from a camera, a tactile sensor, a switch, and the like can be given. It is assumed that a signal processing unit that can process input from other sensors / devices according to the use of the voice input device can be configured. This will be described later.
[0021]
The central processing unit 203 controls processing of the entire voice input device. The central processing unit 203 manages the state of the voice input device and sends commands to the processing mechanisms as necessary. Control details can be determined based on information from the signal processing unit 202, information from the network connection unit 205, and information in the storage unit 204. In addition, control information is transmitted to other voice input devices. How to process voice as the voice input system of the present invention will be described later.
[0022]
The storage unit 204 is a mechanism for holding a program for the processing performed by the central processing unit 203, its work area, information from the signal processing unit 202, and information from the network connection unit 205. Note that the storage unit 204 may be different in terms of circuitry, such as for information storage in the signal processing unit 202 or information storage from the network connection unit.
[0023]
That is, information holding mechanisms in the voice input device are collectively referred to as a storage unit 204. The storage unit 204 can be realized by a mechanism such as a semiconductor memory or a magnetic disk, and can be configured by any mechanism capable of holding data. In this embodiment, a semiconductor memory is used.
[0024]
The usage of the storage unit 204 and the stored information will be described later together with the description of the processing of the central processing unit 203.
[0025]
The network connection unit 205 is a mechanism for exchanging information between voice input devices through the network 21. It can be realized by inter-device communication technology such as a LAN connection or a wireless technology such as Bluetooth; here, a LAN connection is used.
[0026]
Further, any or all of the mechanisms of the voice input device described above may be shared with a system that has other functions. For example, when a voice input device is included in audio-visual equipment such as a video system, a common signal processing circuit may be used to realize the functions of both, or the same central processing circuit may be used to control the functions of both the voice input device and the video system.
[0027]
Other examples of realizing the functions of the voice input device and other systems with a common mechanism are conceivable, but details are omitted.
[0028]
Furthermore, the above also covers the case in which the voice input device and the other system do not exist as separate circuits but share a common circuit while being controllable as separate systems at the level of program processes.
[0029]
Next, how the central processing unit 203 processes voice based on the audio signal from the signal processing unit 202, the information from the network connection unit 205, and the information held in the storage unit 204 will be described with reference to FIG. 3. FIG. 3 shows an example in which voice is input to the voice input device 20-1 mounted on the video 26 in FIG. 2 (hereinafter referred to as voice input device A) and to the voice input device 20-3 mounted on the air conditioner 27 (hereinafter referred to as voice input device B). In this example, the user is currently in a dialog with the voice input device B, and the voice input device A is on standby.
[0030]
First, when the user speaks to the voice input device A and the voice input device B (step 301), the signal processing unit 202 of each voice input device detects the user's utterance captured by the microphone 201 and performs signal processing on it (step 302).
[0031]
Here, the voice input device B is already in a dialog with the user. If the voice input device B itself is in a dialog and no other system is in a dialog state, the voice input device B chooses to process the content of the user's utterance (step 303).
Next, the central processing unit 203 of the voice input device B processes the captured voice in accordance with the function of the voice input device, operates the device according to the content of the voice, and returns to the standby state after the dialog ends (step 304).
[0032]
Conversely, in voice input device A, because voice input device B is in a dialogue state with the user, no further processing is performed after the signal processing of step 302 (step 305), and voice input device A returns to the standby state.
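As an illustrative sketch only, the determination of steps 303 and 305 can be written as a rule that every device evaluates over the dialogue states exchanged on the network. The names `DeviceState` and `should_process` are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass
class DeviceState:
    """State a device shares over the network (hypothetical structure)."""
    name: str
    in_dialogue: bool  # True while the device is in dialogue with the user


def should_process(own: DeviceState, others: list[DeviceState]) -> bool:
    """Process the voice only if this device is in dialogue and no other device is."""
    return own.in_dialogue and not any(o.in_dialogue for o in others)


device_a = DeviceState("A (video)", in_dialogue=False)
device_b = DeviceState("B (air conditioner)", in_dialogue=True)

print(should_process(device_b, [device_a]))  # decision of device B (step 303)
print(should_process(device_a, [device_b]))  # decision of device A (step 305)
```

Because both devices apply the same rule to the same exchanged states, they reach consistent decisions without a central coordinator.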
[0033]
In this way, even when a user's utterance is detected by a plurality of voice input devices, the utterance can be directed, without extra effort by the user, only to the voice input device with which the user is currently in dialogue. The above describes an example in which the user speaks to a plurality of voice input devices; however, the user need not intentionally speak so that the voice is detected by a plurality of voice input devices. The same applies to the following embodiments.
[0034]
In addition, the condition used above, namely that processing is performed when no other voice input device is in a dialogue state, may be replaced by other conditions, which can be set arbitrarily by the user or as a setting of the voice input device.
[0035]
In addition, the dialogue referred to here is not limited to a one-to-one exchange of voice between a human and the system. It may also include a one-way voice utterance from the human to the system, a visual response returned from the system side, or a response from the system to an arbitrary person. The same applies to the term dialogue in the following description.
[0036]
Further, the voice input devices may have an order relationship (ranking) based on a certain rule, and the processing of the captured voice information can be determined based on that order relationship. Specific examples of the rule include the processing capability of the voice input device, settings made by the user, frequency of use, a setting value based on the function of the voice input device, information from sensors other than the microphone, and combinations thereof.
[0037]
Next, an example of ranking according to the function of the voice input device will be described with reference to FIG.
[0038]
Assume a wearable computer (hereinafter referred to as voice input device C) and a video system (hereinafter referred to as voice input device D), each being a device equipped with a voice input device. The wearable computer is ranked high because it is intended for a specific user, and the video system is ranked low because it can be used by unspecified users.
[0039]
At this time, the user speaks to the voice input device C and the voice input device D (step 401), and each voice input device, when its signal processing unit 202 detects the user's voice captured by the microphone 201, transmits its own rank to the other voice input devices (step 402).
[0040]
Next, each voice input device compares its own rank with the ranks of the other voice input devices, and the voice input device C, which has the higher rank, processes the user's utterance (step 403).
[0041]
The voice input device D having a lower rank does not perform processing (step 404) and remains on standby.
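The rank comparison of steps 402 to 404 can be sketched as follows. This is a minimal illustration, not the disclosed implementation. It assumes that ranks are integers and that a smaller number means a higher rank, and the function name is hypothetical.

```python
def should_process_by_rank(own_rank: int, other_ranks: list[int]) -> bool:
    """Return True when this device outranks every other device that reported.

    Assumption of this sketch: a smaller number means a higher rank.
    """
    return all(own_rank < rank for rank in other_ranks)


# Ranks exchanged over the network when the utterance is detected (step 402).
ranks = {"C (wearable computer)": 1, "D (video system)": 2}

for name, rank in ranks.items():
    others = [r for n, r in ranks.items() if n != name]
    decision = "process" if should_process_by_rank(rank, others) else "stand by"
    print(f"{name}: {decision}")
```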
[0042]
In the above example, the rank information is transmitted. However, information other than the rank may be included in the transmitted information; the information may be exchanged in advance rather than when the utterance is detected; or each voice input device may determine from preset rank information whether or not it should perform the processing itself.
[0043]
According to the embodiment described above, emergency equipment such as a fire alarm or an emergency alarm that is equipped with a voice input device can, for example, be ranked higher than any other device. Then, even if an utterance is also registered as a voice command in an ordinary device, the voice input to the emergency device can be given priority.
[0044]
In addition, the voice input device may be provided with a mechanism for handling time, so that time information can be used as a criterion for the processing determination. An example will be described with reference to FIG. 5.
[0045]
FIG. 5 shows an example in which voice is input to a voice input device mounted on a video (hereinafter referred to as voice input device E) and a voice input device mounted on an air conditioner (hereinafter referred to as voice input device F). The voice input device E is installed at a position closer to the user than the voice input device F.
[0046]
At this time, the user speaks to the voice input device E and the voice input device F (step 501), and each voice input device, when its signal processing unit 202 detects the user's voice captured by the microphone 201, transmits the time at which it detected the voice to the other voice input devices (step 502).
[0047]
Next, each voice input device compares the detection times received from the other voice input devices that detected the voice with its own detection time. If its own detection time is the earliest, it processes the voice (step 503); otherwise, it determines that it will not process the voice (step 504). In this way, the voice input device closest to the user can process the voice without being designated by the user.
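A minimal sketch of the comparison in steps 502 to 504 is given below. It assumes that the clocks of the devices are synchronized and that detection times are exchanged as timestamps in seconds. The function name and the tie-break on the device identifier are assumptions of this sketch.

```python
def earliest_detector(detections: dict[str, float]) -> str:
    """Return the identifier of the device with the earliest detection time.

    `detections` maps a device identifier to its detection timestamp in seconds.
    Ties are broken by identifier so that every device reaches the same answer
    (this tie-break is an assumption of the sketch).
    """
    return min(detections, key=lambda dev: (detections[dev], dev))


# Detection times exchanged over the network (step 502), hypothetical values.
detections = {"E": 10.002, "F": 10.009}
winner = earliest_detector(detections)

for device in detections:
    action = "process (step 503)" if device == winner else "do not process (step 504)"
    print(device, action)
```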
[0048]
Further, time information other than how early the voice was detected can also be used as a criterion. For example, the voice input device with the longest voice detection time can be regarded as having captured the user's utterance from beginning to end, and that voice input device processes the voice.
[0049]
Further, the volume of the user's utterance can be measured from the sound taken in from the microphone, and can be used as a reference for processing determination. An example of the present invention using volume information will be described with reference to FIG.
[0050]
Here, with the voice input device E and the voice input device F described above, the user speaks to the voice input device E and the voice input device F (step 601), and each voice input device transmits volume information when its signal processing unit 202 detects the user's utterance captured by the microphone 201 (step 602). That is, the volume of the user's utterance is measured from the voice captured by the microphone and transmitted to the other voice input devices on the network.
[0051]
Next, each voice input device compares the volume information received from the other voice input devices that detected the voice with its own volume information. If its own volume is the highest, it processes the voice (step 603); otherwise, it determines that it will not process the voice (step 604). In this way, even without designation by the user, the voice input device closest to the user can process the voice, or the voice can be processed by the voice input device that captured the original utterance most faithfully. Examples of the volume information include the sound pressure level, the sound power level, and loudness measures expressed in units such as the phon and the sone.
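The volume-based determination can be sketched as below. As a stand-in for the sound pressure level, this sketch uses the RMS level of normalized samples in decibels relative to full scale. That measure, and the function names, are assumptions made only for illustration.

```python
import math


def rms_level_db(samples: list[float]) -> float:
    """RMS level in dB relative to full scale, for samples in the range [-1, 1]."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12))  # floor avoids log10(0) on silence


def should_process_by_volume(own_db: float, other_dbs: list[float]) -> bool:
    """Process the voice only if this device measured the highest level."""
    return all(own_db > other for other in other_dbs)


# Hypothetical captures of the same utterance: E is closer to the user than F.
level_e = rms_level_db([0.5, -0.5, 0.5, -0.5])
level_f = rms_level_db([0.1, -0.1, 0.1, -0.1])

print(should_process_by_volume(level_e, [level_f]))  # decision of device E (step 603)
print(should_process_by_volume(level_f, [level_e]))  # decision of device F (step 604)
```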
[0052]
In addition, the signal-to-noise ratio of the user's utterance with respect to the ambient noise can be calculated from the voice captured from the microphone, and can be used as a reference for processing determination. An example of the present invention using the signal-to-noise ratio will be described with reference to FIG.
[0053]
FIG. 7 shows an example in which voice is input to a voice input device mounted on a video (hereinafter referred to as voice input device G) and a voice input device mounted on an air conditioner (hereinafter referred to as voice input device H). In addition, it is assumed that there is a noise source and that the voice input device G is located farther from the noise source than the voice input device H.
[0054]
First, each voice input device constantly captures sound and measures ambient noise information (step 701).
[0055]
Next, the user speaks to the voice input device G and the voice input device H (step 702). When the signal processing unit 202 of each voice input device detects the user's utterance captured by the microphone 201, the device calculates the signal-to-noise ratio of the captured utterance based on the noise information and transmits it to the other voice input devices on the network (step 703).
[0056]
Next, the signal-to-noise ratio information from the other voice input device that detected the voice is compared with the signal-to-noise ratio information of the own voice input device. If the own voice input device is the largest, the voice is processed (step 704). Otherwise, it is determined that the voice is not processed (step 705).
[0057]
Accordingly, even without designation by the user, the voice input device closest to the user can process the voice, or the voice can be processed by the voice input device that captured the original utterance most faithfully. In this example, the noise is measured by constantly capturing the ambient sound even when there is no utterance; however, the noise may instead be estimated, for example, from silent sections within the utterance after the utterance is detected.
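For illustration, the signal-to-noise ratio of step 703 can be computed from the mean power of the utterance section and the mean power of the ambient noise measured in step 701. The sketch below uses the common definition of 10·log10 of the power ratio. As a simplification it treats the power of the utterance section as the signal power, and all names are hypothetical.

```python
import math


def mean_power(samples: list[float]) -> float:
    """Mean squared amplitude of a block of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(utterance: list[float], noise: list[float]) -> float:
    """Signal-to-noise ratio in dB: 10 * log10(P_utterance / P_noise)."""
    noise_power = max(mean_power(noise), 1e-12)  # floor avoids division by zero
    return 10.0 * math.log10(max(mean_power(utterance), 1e-12) / noise_power)


# Hypothetical captures: G is far from the noise source, H is close to it.
snr_g = snr_db(utterance=[1.0, -1.0, 1.0, -1.0], noise=[0.1, -0.1, 0.1, -0.1])
snr_h = snr_db(utterance=[1.0, -1.0, 1.0, -1.0], noise=[0.5, -0.5, 0.5, -0.5])

# Each device processes the voice only if its own ratio is the largest (steps 704, 705).
print("G processes" if snr_g > snr_h else "H processes")
```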
[0058]
In addition, a past history related to the usage status can be stored in the storage unit and used for determination of processing. An example of the present invention using a past history will be described with reference to FIG.
[0059]
FIG. 8 shows an example in which voice is input to a voice input device (hereinafter referred to as a voice input device I) mounted on a video and a voice input device (hereinafter referred to as a voice input device J) mounted on an air conditioner. The voice input device I is used more frequently than the voice input device J.
[0060]
First, the user speaks to both voice input devices (step 801), and in response to this utterance each voice input device transmits its usage history, such as the most recent time of use and the number of times of use, to the other voice input devices via the network (step 802).
[0061]
Voice input device I compares its usage history with that of voice input device J and, since voice input device I is the most frequently used, determines that it will process the voice (step 803). In this way, the frequently used voice input device I can be used without the user having to designate it.
[0062]
Voice input device J, on the other hand, compares its usage history with that of voice input device I and, since it is used less frequently, does not process the voice (step 804) and remains in the standby state.
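The history-based determination of steps 802 to 804 might be sketched as follows. The record layout and the rule (number of uses first, most recent use as a tie-break) are assumptions of this sketch. The specification only states that the latest usage time, number of uses, and the like are exchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageHistory:
    """Usage history exchanged over the network (hypothetical layout)."""
    use_count: int     # number of times the device has been used
    last_used: float   # time of the most recent use, in seconds since the epoch


def should_process_by_history(own: UsageHistory, others: list[UsageHistory]) -> bool:
    """Process only if this device is used more than every other reporting device."""
    def key(h: UsageHistory) -> tuple[int, float]:
        return (h.use_count, h.last_used)
    return all(key(own) > key(other) for other in others)


history_i = UsageHistory(use_count=42, last_used=1000.0)
history_j = UsageHistory(use_count=7, last_used=2000.0)

print(should_process_by_history(history_i, [history_j]))  # decision of device I (step 803)
print(should_process_by_history(history_j, [history_i]))  # decision of device J (step 804)
```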
[0063]
In addition, a voice recognition means may be provided, and the processing of the captured voice can be determined using its recognition result. Information from the signal processing unit is processed by the voice recognition mechanism, and the result is passed to the central processing unit. The voice recognition itself may instead be performed by the central processing unit.
[0064]
The method used for voice recognition may be any commonly used method, such as an HMM using Gaussian mixture distributions as its model, or DP matching, and the HMM and language model used for this purpose may be stored in the storage unit. The voice recognition vocabulary may differ for each voice input device or may be shared. Furthermore, control by voice command becomes possible by associating control commands with the vocabulary. An example of the present invention using this voice recognition will be described with reference to FIG. 9.
[0065]
FIG. 9 shows an example in which voice is input to a voice input device mounted on a video (hereinafter referred to as voice input device K) and a voice input device mounted on an air conditioner (hereinafter referred to as voice input device L).
[0066]
First, when the user utters “play”, a word related to voice input device K, toward the voice input devices (step 901), each voice input device performs voice detection and voice recognition (step 902).
[0067]
The central processing unit receives the voice recognition result, determines from it whether or not the utterance was directed to its own voice input device (step 903), and transmits the determination result and the recognition result to the other voice input devices via the network (step 904).
[0068]
Voice input device K, taking into account the determination results and recognition results of the other voice input devices, determines that the voice was directed to itself (step 905) and processes the voice. In this way, the user can use the intended voice input device without any special designation.
[0069]
Voice input device L, on the other hand, does not determine that the voice was directed to itself (step 906) and therefore remains on standby.
[0070]
In addition, a means for identifying the sound source may be provided, and the processing of the voice can be determined using the identification result. Depending on the purpose of use, the sound source may be of various types, such as a human, a machine, or an animal; in the following, the case of a human voice is described as an example. Speaker identification is performed on the user's voice information from the signal processing unit, and the result is passed to the central processing unit. Any speaker identification method may be used as long as it can identify an individual or a speaker characteristic (such as gender or age group); examples include judging from the likelihood against an HMM trained on or adapted to each speaker, and selecting the closest category among models prepared for each gender or age group.
[0071]
An example of the present invention using this speaker identification will be described with reference to FIG.
[0072]
FIG. 10 shows an example in which a voice is input to a voice input device mounted on a video (hereinafter referred to as voice input device M) and a voice input device mounted on an air conditioner (hereinafter referred to as voice input device N). In this example, it is assumed that the user's voice may be processed only by voice input device M.
[0073]
First, when the user speaks to the voice input devices (step 1001), each voice input device that detects the utterance performs speaker identification (step 1002), determines whether or not the utterance is one that should be processed by its own voice input device (step 1003), and transmits the determination result and the speaker identification result to the other voice input devices via the network (step 1004).
[0074]
Then, each voice input device examines the determination results and speaker identification results of itself and of the other voice input devices. If it determines that the utterance was directed to itself (step 1005), it processes the voice. Conversely, voice input device N, which determines that the utterance was not directed to itself (step 1006), does not process it. Thus, when a voice input device is available only to specific users, the user can use the intended voice input device without any special designation.
[0075]
Also, if the reliability of the speaker identification is low, or if several speakers remain as candidates, the system may further improve the identification accuracy by prompting the user for a PIN, a fixed phrase, or a free utterance to obtain additional data, and then perform the processing that follows speaker identification.
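The handling of speaker identification results described above, including the case of low reliability or several remaining candidates, could be sketched as follows. The scores, the thresholds `min_score` and `min_margin`, and the returned labels are hypothetical values chosen only for illustration. At least one candidate score is assumed to be present.

```python
def decide(scores: dict[str, float], allowed: set[str],
           min_score: float = 0.8, min_margin: float = 0.1) -> str:
    """Return 'process', 'ignore', or 'ask_more' for one voice input device.

    `scores` maps each candidate speaker to an identification score in [0, 1];
    `allowed` is the set of speakers permitted to use this device.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_name, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Low reliability, or several candidates too close to call: request more data
    # (for example a PIN, a fixed phrase, or a free utterance).
    if best_score < min_score or best_score - runner_up < min_margin:
        return "ask_more"
    return "process" if best_name in allowed else "ignore"


print(decide({"user_1": 0.95, "user_2": 0.30}, allowed={"user_1"}))
print(decide({"user_1": 0.62, "user_2": 0.58}, allowed={"user_1"}))
```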
[0076]
Further, although speaker identification of humans is described here, identification and the subsequent processing can also be performed on, for example, the sound of a malfunctioning machine or the call of an animal, as noted above.
[0077]
In addition, the voice input devices and other devices on the network may have commands in common and may control one another within a permitted range. This makes it possible to suppress the operation of other voice input devices and improves the compatibility among voice input devices.
[0078]
This example will be described with reference to FIG.
[0079]
For example, when all the voice input devices 1101 connected to the network 1102 have common power management commands such as “power ON”, “power OFF”, and “power saving”, the personal computer 1103 connected to the network 1102 can transmit over the network a command that operates the power supply of any voice input device 1101, or of several of them at once, and each voice input device can execute the command.
[0080]
In addition, by providing voice commands common to the voice input devices and other devices on the network, together with a means for matching the input voice against those commands, control commands can be executed by voice more simply and reliably. This example will be described with reference to the flowchart of FIG. 12.
[0081]
In the example of FIG. 12, there are a video having a voice input device (voice input device O) and an air conditioner having a voice input device (voice input device P). It is assumed that the device names “video” and “air conditioner”, as well as commands for common operations such as “power ON” and “power OFF”, are shared.
[0082]
When the user utters “video” and “power ON” (step 1201), voice input device O and voice input device P recognize the device name and the device command using the matching means used in the voice recognition described above (step 1202), and each determines whether the command is addressed to its own system and whether it can process the command (step 1203).
[0083]
Each device transmits the result to the other voice input devices and controllable devices on the network (step 1204), determines from its own result and the results from the other voice input devices whether it should process the voice (step 1205), and, if so, performs the processing corresponding to the control command.
[0084]
The use of results obtained from a plurality of voice input devices with respect to a common command is different from conventional voice remote controls and devices commanded by voice.
[0085]
In addition, when there are a plurality of voice-controllable devices on the network, control commands can be executed by voice more simply and reliably by providing a mechanism for storing, in the storage unit, information on all or part of their control commands, together with a means for matching the input voice against those commands.
[0086]
This example will be described with reference to FIGS. 13 and 14. Assume that a video (voice input device Q) and an air conditioner (voice input device R), each controllable through a voice input device, are on the network, that voice input device Q handles recognition words such as “play” and “stop”, and that voice input device R handles recognition words such as “raise temperature” and “lower temperature”. It is assumed that every voice input device on the network can store these recognition words in association with their target devices.
[0087]
FIG. 13 shows the concept of linking each recognition word to its target device and processing content. The link between recognition words and processing contents shown in FIG. 13 can be realized by simple table lookup, by an object-oriented approach, or by higher-level knowledge processing; details are omitted here.
[0088]
As shown in the flowchart of FIG. 14, when the user utters “video” and “play” (step 1401), the voice input device Q and the voice input device R detect and recognize the utterance (step 1402).
[0089]
Further, each device interprets the content of the utterance using the concept shown in FIG. 13 (step 1403), transmits the result to the other voice input devices on the network (step 1404), determines from its own result and the results received from the other voice input devices whether or not it should process the utterance (step 1405), and performs the processing corresponding to the control command.
[0090]
In the case of the above-mentioned “video” and “playback”, it is possible to determine that the utterance is an instruction to play back the video from either voice input device by the combination of the recognition word, the target device, and the processing content as shown in FIG. Furthermore, the utterance can be uniquely interpreted based on the information transmitted via the network, and the voice input device can perform processing contents corresponding to the recognition result.
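The association of recognition words with target devices and processing contents can be realized, in its simplest form, by a lookup table. The following sketch shows one hypothetical table and the interpretation of step 1403. The table entries, action labels, and function names are illustrative assumptions and are not taken from FIG. 13.

```python
from typing import NamedTuple, Optional


class Interpretation(NamedTuple):
    target: str    # device the utterance is addressed to
    action: str    # processing content associated with the words
    for_me: bool   # True if the interpreting device is the target


# (device word, command word) -> processing content; entries are illustrative.
COMMAND_TABLE = {
    ("video", "play"): "start playback",
    ("video", "stop"): "stop playback",
    ("air conditioner", "raise temperature"): "raise set temperature",
}


def interpret(device_word: str, command_word: str,
              own_device: str) -> Optional[Interpretation]:
    """Look up the recognized words; return None if the combination is unknown."""
    action = COMMAND_TABLE.get((device_word, command_word))
    if action is None:
        return None
    return Interpretation(device_word, action, device_word == own_device)


# Both devices hold the same table, so both interpret the utterance identically.
print(interpret("video", "play", own_device="video"))
print(interpret("video", "play", own_device="air conditioner"))
```

Because the same table is held by every voice input device, the interpretation does not depend on which device happened to detect the utterance.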
[0091]
Also, the examples using voice recognition so far have basically been based on word recognition. However, even when word spotting or continuous voice recognition is used, and even when the voice recognition specifications differ among the voice input devices, it suffices that each device has a concept that associates recognition results with processing contents as shown in FIG. 13.
[0092]
Further, in the example shown in FIG. 14 described above, it is also possible to process a control target device connected to a network other than the voice input device. An example of this will be described with reference to FIG.
[0093]
As shown in FIG. 15, an air conditioner 1501 with a voice input device, a single voice input device 1502 and a video 1503 are connected to a network 1504, and a user speaks to operate the video 1503 here.
[0094]
This voice input device detects and recognizes voice according to the flow shown in the flowchart of FIG. 14, and connects the recognition result and the processing content with the concept shown in FIG. Then, after the recognition result and the processing content are determined, it is transmitted to another system on the network 1504.
[0095]
As a result, the video 1503 can receive the processing content corresponding to the recognition result and execute what was uttered. Therefore, even though the video 1503 itself has no voice input device, it can be controlled by voice as long as it can be controlled over the network and each voice input device holds the association between recognition results and processing contents shown in FIG. 13.
[0096]
In the examples of the present invention using voice recognition described with reference to FIGS. 12 to 15, the results of voice recognition and determination are transmitted to all systems on the network by broadcast. However, the recognition result and the determination result may instead be transmitted directly to the target device only.
[0097]
Further, in the voice input device, when there is a sensor other than the voice input by the microphone, it is possible to determine the processing content of the detected voice using the sensor information. This example will be described with reference to FIG.
[0098]
As shown in FIG. 16, an air conditioner 1601 having a voice input device and a single voice input device 1602 are connected to a network 1603. In addition, this single audio input device has a camera, and peripheral image information can be taken from the camera. Note that the input of this camera is input to the signal processing unit 202 in FIG. 2 and subjected to image processing.
[0099]
In this voice input system, assume that the user speaks to the voice input device of the air conditioner 1601. Here, the direction in which the speaker is facing is estimated with the camera attached to the single voice input device 1602. Estimating which voice input device the speaker is facing can be realized by combining a technique for extracting persons from the image, a technique for locating the face and estimating its orientation, and a technique for estimating from mouth movement which person produced the detected utterance; details are omitted here.
[0100]
If it is determined from the estimated face direction that the speaker is facing the air conditioner 1601, the target device of the utterance is determined to be the air conditioner. Each voice input device transmits the result to the other voice input devices via the network 1603 and determines the processing as in the examples described above.
[0101]
Although an example using image information from a camera is given here, other means are also conceivable, such as direct sensor devices like switches, or microphone arrays for sound source localization; the measurement technique used is not limited.
[0102]
In addition, as described for the configuration of the voice input device in FIG. 2, the microphone 201, the information display unit 206, the signal processing unit 202, the central processing unit 203, the storage unit 204, and the network connection unit 205 each represent a function within the voice input device. They can therefore be physically separated and connected to one another through a network or directly. This example will be described with reference to FIG. 17.
[0103]
As shown in FIG. 17, even if the voice input device is physically divided into two voice input devices (1701, 1702), they are connected by the network 1703 and can exchange appropriate information. At this time, the two voice input devices (1701, 1702) can function as one voice input device for the user's utterance.
[0104]
In addition, it is assumed that the criteria for the determination by the voice input devices described above can be changed according to information about other voice input devices and according to user settings. For example, in addition to information such as detection and recognition results exchanged when a voice is detected, the voice input devices may exchange, at regular intervals, their processing status, processing performance, recognizable vocabulary, and corresponding processing contents, and store this information about the other voice input devices in their own storage units.
[0105]
Using such information, a voice input device can, for example, process a voice in place of another voice input device that is currently unable to process it, provided that it can do so itself; correct recognition errors by giving more weight to the recognition result of a voice input device with better performance than its own; or allow the user to control the above determination according to his or her preference.
[0106]
Further, the means for determining the input to a voice input device described above may be combined. For example, the voice input device with the earliest detection time handles the utterance, but detection times within a certain tolerance are treated as simultaneous, in which case the determination is made by volume. Alternatively, the voice may be handled by the voice input device with the highest score computed from the likelihood of voice recognition and the rank of the voice input device.
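One way to combine the criteria, sketched here under stated assumptions, is to treat all detection times within a tolerance of the earliest one as simultaneous and to choose the loudest device among them. The `Report` structure, the default tolerance of 50 ms, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Information one device sends after detecting the voice (hypothetical)."""
    device: str
    detected_at: float  # detection time in seconds, clocks assumed synchronized
    volume_db: float    # measured volume of the utterance


def select_device(reports: list[Report], tolerance: float = 0.05) -> str:
    """Earliest detection wins; devices within `tolerance` of it are compared by volume."""
    earliest = min(r.detected_at for r in reports)
    candidates = [r for r in reports if r.detected_at - earliest <= tolerance]
    return max(candidates, key=lambda r: r.volume_db).device


reports = [
    Report("E", detected_at=1.00, volume_db=-20.0),
    Report("F", detected_at=1.02, volume_db=-12.0),  # within tolerance, louder
    Report("G", detected_at=1.20, volume_db=-5.0),   # too late to be considered
]
print(select_device(reports))
print(select_device(reports, tolerance=0.0))
```

With a tolerance of zero the rule reduces to the pure detection-time criterion.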
[0107]
In addition, it is conceivable that a high-order agent system or a knowledge processing system makes a determination using information obtained by a combination of determination means as described above.
[0108]
Further, the means for determining processing need not be the same among the voice input devices on the network. For example, suppose there are two voice input devices, one of which makes its determination only from the voice detection time and the other only from the volume information. The information exchanged after a voice is detected is then not necessarily mutually usable; however, if the processing for that case is set in each voice input device according to the purpose of the device, each voice input device can determine its processing without the voice input system as a whole breaking down.
Further, when the means for determination differs among the voice input devices, the processing may be determined by an agent system or knowledge processing system of a higher order than the voice input devices, based on the information exchanged through the network.
[0109]
In addition, if it cannot be uniquely determined, from utterance information such as the voice detection time and volume and from information such as voice recognition and identification results, to which device and with what content the user made the voice input, one of the voice input devices may make the determination by interacting with the user or by using other conditions, such as information from sensors other than the microphone.
[0110]
Next, an example that supplements the description of the information display unit 206 of FIG. 2 and the description so far will be described with reference to FIG. 18.
[0111]
As shown in FIG. 18, an air conditioner 1801 having a voice input device, a single voice input device 1802, and a video 1803 having a voice input device are connected to a network 1804. Further, these voice input devices have the information display unit 206 of FIG.
[0112]
As described above, in this voice input system each voice input device exchanges information about itself during standby, namely its recognition vocabulary and processing contents and, in this example in particular, the presence or absence of an information display unit and the media it can present, and stores the received information in its storage unit.
[0113]
In this example, it is assumed that the information display unit of every voice input device has a loudspeaker and can return to the user a voice rendering of an arbitrary sentence synthesized by the central processing unit and the signal processing unit. It is also assumed that part of the control commands for the information display unit is shared among the voice input devices. That is, instead of returning a response to the user from its own information display unit, a voice input device can have the response delivered from the information display unit of another voice input device on the network.
[0114]
Here, it is assumed that when the user utters “video” and “playback”, the voice input device connected to the air conditioner and the single voice input device detect the voice. It is assumed that the user's position is closest to the single voice input device.
[0115]
By the procedure described so far, both voice input devices detect and recognize the voice, determine whether the command is addressed to themselves, conclude that it is a “play” command for the “video”, and convey this to the other voice input devices. The voice input device connected to the video does not detect the voice directly, but receives the information from the other voice input devices on the network, interprets it as a command to itself, and executes the processing for a playback command.
[0116]
At this time, because the single voice input device is closer to the user, each voice input device can determine, from the volume or signal-to-noise ratio information sent over the network, that the single voice input device is better suited to voice processing than the voice input device connected to the video.
[0117]
Therefore, the single voice input device and the voice input device of the video can each determine that they are the voice input devices that exchange information with the user.
[0118]
In response to the playback command, the voice input device of the video sends a playback control command to the video. Meanwhile, to inform the user that playback has started, it generates a command instructing the single voice input device to return the synthesized voice “Playback started” to the user, and transmits this command to the single voice input device via the network. As with the information transmitted over the network so far, this control command may be sent directly to the single voice input device only, or it may be broadcast to all voice input devices, marked as a command addressed to the single voice input device.
[0119]
In this way, by interpreting the command for responding to the user sent from the voice input device of the video, the single voice input device can convey the message “Playback started” to the user with synthesized voice.
[0120]
In addition, through this process, the single voice input device and the voice input device of the video set, for a certain period of time, a flag indicating that they are in dialogue with the user. As a result, they process the user's next utterance preferentially, and the voice input device of the air conditioner need not process it, as in the example already described.
[0121]
Next, an example of a case where voice input devices as described above are grouped according to some criteria will be described with reference to FIG.
[0122]
In this example, the voice input devices are grouped by location into a group “kitchen” 1901, a group “wearable” 1902, and a group “living room” 1903, all of which are connected by a network 1904. It is further assumed that each group contains voice input devices and that every voice input device holds information that can identify the other groups.
[0123]
However, the kind of information that the storage unit holds about the other voice input devices in a device's own group is not necessarily the same as the kind it holds about other groups. Specifically, it is assumed here that a device holds no information about the recognition vocabulary, target devices, or processing contents of the voice input devices in other groups.
[0124]
Here, it is assumed that the user utters “living room”, “video”, and “playback”, and that this is detected by the voice input devices of the group “kitchen” and the group “wearable”. As in the examples described so far, each voice input device that detected the voice determines whether it should itself recognize and process the voice, and the information and the judgment result are then conveyed to the voice input devices of the group “living room”.
[0125]
Sending the information only to the group that has been identified has the advantage that, even when many voice input devices are connected to the network, only the voice input devices that need the information exchange it.
[0126]
Accordingly, the voice input devices of the group “living room” receive the information about the voice addressed to their own group, determine that it is a “playback” command for the “video” in their own group, and perform the corresponding processing.
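The group-based forwarding in this example can be sketched as follows. The function names, the word-list representation of the utterance, and the `GROUP_VOCAB` table are hypothetical. The point illustrated is the one stated above: a device in “kitchen” or “wearable” only needs to recognize the name of another group, while the vocabulary and processing contents are held by the destination group.

```python
# Group names are the only thing a device is assumed to know about other groups.
KNOWN_GROUPS = {"kitchen", "wearable", "living room"}


def route_utterance(words, own_group, known_groups=KNOWN_GROUPS):
    """Return (destination_group, remaining_words) for a recognized utterance.

    If the utterance names another group, the remaining words are forwarded
    to that group; otherwise the utterance stays in the device's own group.
    """
    for i, word in enumerate(words):
        if word in known_groups and word != own_group:
            return word, words[:i] + words[i + 1:]
    return own_group, list(words)


# Vocabulary and processing contents, known only inside each group
# (illustrative entry for the example in the text).
GROUP_VOCAB = {
    "living room": {("video", "playback"): "start video playback"},
}


def handle_in_group(group, words):
    """Look up the processing for the forwarded words inside the target group."""
    return GROUP_VOCAB.get(group, {}).get(tuple(words))


destination, rest = route_utterance(
    ["living room", "video", "playback"], own_group="kitchen")
action = handle_in_group(destination, rest)
```

Only the devices in the destination group receive the forwarded words, so the other groups exchange no further traffic for this utterance.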
Needless to say, the present invention can also be applied to a voice input program.
[0127]
[Effect of the invention]
As described above, according to the present invention, the processing for a user's utterance can be determined without burdening the user, by using information from other voice input devices.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a voice input system according to an embodiment of the present invention.
FIG. 2 is a diagram showing a voice input device constituting a voice input system according to an embodiment of the present invention.
FIG. 3 is a flowchart showing the operation of the voice input system according to the embodiment of the present invention.
FIG. 4 is a flowchart showing another operation of the voice input system according to the embodiment of the present invention.
FIG. 5 is a flowchart showing another operation of the voice input system according to the embodiment of the present invention.
FIG. 6 is a flowchart showing another operation of the voice input system according to the embodiment of the present invention.
FIG. 7 is a flowchart showing another operation of the voice input system according to the embodiment of the present invention.
FIG. 8 is a flowchart showing another operation of the voice input system according to the embodiment of the present invention.
FIG. 9 is a flowchart showing another operation of the voice input system according to the embodiment of the present invention.
FIG. 10 is a flowchart showing another operation of the voice input system according to the embodiment of the present invention.
FIG. 11 is a diagram showing a configuration of another voice input system according to an embodiment of the present invention.
FIG. 12 is a flowchart showing another operation of the voice input system according to the embodiment of the present invention.
FIG. 13 is a diagram showing the concept of linking recognized words, target devices, and processing contents in the voice input system according to the embodiment of the present invention.
FIG. 14 is a flowchart showing another operation of the voice input system according to the embodiment of the present invention.
FIG. 15 is a diagram showing a configuration of another voice input system according to an embodiment of the present invention.
FIG. 16 is a diagram showing a configuration of another voice input system according to an embodiment of the present invention.
FIG. 17 is a diagram showing a configuration of another voice input system according to an embodiment of the present invention.
FIG. 18 is a diagram showing a configuration of another voice input system according to an embodiment of the present invention.
FIG. 19 is a diagram showing a configuration of another voice input system according to an embodiment of the present invention.
[Explanation of symbols]
101 ... Voice input device
102 ... Voice input device
103 ... Equipment
104 ... Network
201 ... Microphone
202 ... Signal processing unit
203 ... Central processing unit
204 ... Memory unit
205 ... Network connection
206 ... Information display

Claims

A voice input system in which a plurality of voice input devices are connected to a network, characterized in that:
each voice input device detects input voice;
when a voice input device detects input voice, it exchanges judgment information about the detected voice with the other voice input devices via the network; and
the voice input device determines the processing for the detected voice and judges whether to execute it, based on its own judgment information about the detected voice and the judgment information from the other voice input devices.

The voice input system according to claim 1, wherein the plurality of voice input devices connected to the network form a ranking relationship based on a predetermined rule, and
the rank information of each voice input device as ranked in the ranking relationship is the judgment information.

The voice input system according to claim 1, wherein the plurality of voice input devices connected to the network are divided into a plurality of groups based on a predetermined rule,
a storage area for storing information about the groups is provided,
the storage area for the groups includes a mechanism that operates in association with a storage area for the voice input devices connected to the network, and
the information in the storage area for the groups is the judgment information.

The voice input system according to claim 1, wherein the plurality of voice input devices connected to the network share common time information, and
the detection time at which a voice input device detected the voice is the judgment information.

The voice input system according to claim 1, wherein the plurality of voice input devices connected to the network share a common measure for the volume of detected voice, and
the volume of the voice detected by a voice input device is the judgment information.

The voice input system according to claim 1, wherein the voice input device includes measuring means for measuring ambient noise information, and
computation means for calculating signal-to-noise ratio information of the detected voice based on the noise information measured by the measuring means, and
the signal-to-noise ratio information is the judgment information.

The voice input system according to claim 1, wherein the voice input device includes a storage area for storing history information about past usage, and
the history information is the judgment information.

The voice input system according to claim 1, wherein the voice input device includes voice recognition means for recognizing the detected voice, and
the voice recognition information produced by the voice recognition means is the judgment information.

The voice input system according to claim 1, wherein the voice input device includes an identification unit that identifies the sound source of the detected voice, and
the sound source information identified by the identification unit is the judgment information.

The voice input system according to claim 1, wherein the plurality of voice input devices connected to the network share a common control command system capable of controlling each of the voice input devices, and
a voice input device transmits a control command to another voice input device on the network in response to the detected voice, or receives a control command from another voice input device and executes the content of that control command.

The voice input system according to claim 1, wherein the voice input device includes an area for storing information about controllable devices connected via the network, and
for the detected voice, the stored controllable-device information is used in processing the input voice and in exchanging information with the controllable device.

The voice input system according to claim 1, wherein the voice input device comprises, separately from the means for detecting voice, a sensor device that measures information related to the voice, and
the information related to the voice measured by the sensor device is the judgment information.

The voice input system according to claim 1, wherein the criterion by which the voice input device determines and executes the processing for the detected voice can be changed using the judgment information received from another voice input device, or can be changed by a user setting.

The voice input system according to claim 1, wherein the voice input device includes a display unit for displaying the system status,
has a function for controlling how information about the detected voice or the system that is to be conveyed to the user is expressed, and
expresses that information together with determining and executing the processing.

The voice input system according to claim 1, wherein part or all of the means of the voice input device are shared with means of functions other than voice input.

The voice input system according to claim 1, wherein the voice input device functions via the network even when some of its functions are physically separated.

A voice input method characterized by comprising:
a step of detecting input voice at each of a plurality of voice input devices connected to a network;
a step of exchanging, when a voice input device detects input voice, judgment information about the detected voice with the other voice input devices via the network; and
a step in which the voice input device determines the processing for the detected voice and judges whether to execute it, based on its own judgment information about the detected voice and the judgment information from the other voice input devices.

A voice input program characterized by realizing functions of: detecting input voice at each of a plurality of voice input devices connected to a network; exchanging, when a voice input device detects input voice, judgment information about the detected voice with the other voice input devices via the network; and determining, in the voice input device, the processing for the detected voice and judging whether to execute it, based on its own judgment information about the detected voice and the judgment information from the other voice input devices.