JP4224991B2

JP4224991B2 - Network speech recognition system, network speech recognition method, network speech recognition program

Info

Publication number: JP4224991B2
Application number: JP2002168769A
Authority: JP
Inventors: 茂樹植田
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2002-06-10
Filing date: 2002-06-10
Publication date: 2009-02-18
Anticipated expiration: 2022-06-10
Also published as: JP2004012993A

Description

【０００１】
【発明の属する技術分野】
本発明は、複数の電気機器を音声認識により操作するネットワーク技術に関し、回線に接続された複数の機器に指令された操作者の音声を中央処理手段で集中的に音声認識処理を行い、各々の電気機器に所定の作動出力を送信するネットワーク音声認識システム、ネットワーク音声認識方法、ネットワーク音声認識プログラムに関するものである。
【０００２】
【従来の技術】
音声認識を扱う技術としては、特開昭５６−８８５０３号公報がある。これは加熱装置に音声認識部と音声合成部を設け、２つの音声認識モードと１つの音声合成モードをシーケンシャルに切り換えることで、あたかも操作者と機器が対話をするかのごとく操作が進み、音声の誤認識による誤動作の防止を図ったものである。
【０００３】
入力された音声は特徴を抽出されてメモリに記憶される。認識処理はＲＯＭにあらかじめ記憶されている標準パターンと前記メモリに記憶された上記パターンとを比較し、類似度の高いものをその音声指令として認識する。次いで音声合成モードに切り換えられ、音声合成シンセサイザによって認識された音声に基づきある所定の音声メッセージを操作者に対して発する。操作者はこのメッセージを聞き、次の音声を入力すると、機器は再び第２の音声認識モードへと移行し、ある定められた音声が入力されるかを判断する。
【０００４】
かかる構成により、システムは確実に制御を進めることができる。誤認識によって指示されない加熱を開始したり、まだ終わっていない加熱を勝手に中断したり、といった誤動作を未然に防止できる。
【０００５】
これは本発明の発明者による出願であるが、当時、特定話者音声認識（あらかじめ認識する対象者の言語を登録しておく方式）でもなかなか完璧な認識が難しかった頃の技術背景を偲ばせる発明である。
【０００６】
近年、半導体技術の著しい進展やソフト技術・ネットワーク技術の発展にともない、家庭内にもパソコンや携帯電話などの情報機器が広く浸透してきた。またネット家電と呼ばれる電気機器も徐々に市場に姿を現しつつある。
【０００７】
このような状況のもと、以前より注目を集めてきた音声認識技術は、人と機械の良好なコミュニケーションを図れる技術としていよいよ実用段階に入りつつある。特にメモリが安価になり、ＣＰＵの処理速度が格段に上昇し、音声認識技術が進化することで、十数年前には非常に困難とされた不特定話者音声認識（あらかじめ言語登録の不要な不特定多数を対象とする方式）を高い認識率で認識することが可能となってきた。キーボードの代わりにマイクを備えたパソコンは、もはや実用レベルに入ったと言えよう。工場出荷時にこのような音声認識ソフトがプリインストールされるパソコンも珍しくなくなってきた。
【０００８】
が、いまだ特開昭５６−８８５０３号公報に示されるような加熱装置は、商品化されていない。パソコンの世界では音声認識技術の普及が急速に進みつつあるにもかかわらず、である。これはひとえに音声認識システムの経済性に由来する。パソコンでは処理速度の速いＣＰＵも大きなメモリも、それ自体がパソコン本体の価値を高めてくれる。パソコンではそれほど価格を押し上げることなく、音声認識による操作を導入できる。また、よしんばその一部を誤認識しても再入力すればよい。単にキーボードをミスタッチしたのと同等である。だが、加熱装置では誤認識は過大な加熱や庫内での燃焼に繋がりかねない。加熱装置に音声認識システムを組み込んだ場合、認識精度は一層重要であり、これを上げようとすればシステムコストにさらに重くのしかかる。
【０００９】
本発明の関連技術として、特開２０００−１１１０５７号公報を示しておく。ここには台所電気製品のためのデータ処理装置が開示されている。台所器具、例えば電子レンジ、には操作者の音声に応じてデジタル・データを発生するマイクロホンと、このマイクロホンからデジタル信号を取り出すための音声認識手段が設けられる。このマイクロホンに入力されるのは庫内で消費される食品の品目であり、処理した品目のリストを発生させるためにデジタル信号を記憶する記憶手段が本体に備えられる。またこのデジタル・データは遠隔管理システムにも転送される。
【００１０】
かかる構成により庫内で消費される食品品目が音声でマイクロホンに入力されると、器具本体にロードされた音声認識ソフトウェアがこれを識別し、インターネットに転送される電子メール・メッセージを発生するために使用される。インターネットに転送されたメッセージは品目の補充、配達のために利用される。
【００１１】
この構成においても、台所器具は音声認識手段を備えており、台所器具のコストを押し上げていることは否めない。また、音声認識によって扱われるのは単なる食品品目名という情報でしかなく、誤認識しても補充品目の発注が正確に行われないだけである。前記特開２０００−１１１０５７公報が開示する機器本体を作動・停止させるような機能を含まない。
【００１２】
【発明が解決しようとする課題】
本発明ではネットワークに接続された複数の電気機器を、音声認識という人にやさしい操作系で操作できるよう、常に最新の音声認識技術を利用して、しかも高い認識率を維持しながら、経済的に利用できるシステムを実現することをめざす。
【００１３】
【課題を解決するための手段】
前記従来の課題を解決するために、本発明のネットワーク音声認識システムは、音声を電気信号に変換するマイクを備えると共に個別のＩＤを持った電気機器と、この電気機器と回線で接続され音声認識手段を内蔵した中央処理手段とを有し、前記電気機器の操作者は、前記マイクから操作指令を入力し、前記電気機器が音声信号と前記個別のＩＤとを前記中央処理手段へ送り、中央処理手段は送信された音声信号の音声認識処理を行うと共に前記音声信号を送ってきた前記電気機器を特定し、前記音声信号から前記電気機器ごとに登録された音声リストを検索して前記電気機器が受け付けられる操作指令かどうかを判断し、受け付けられる操作指令である場合、前記電気機器が確認され今受け付けられる状態であるか調べられ受け付けられるなら処理結果に基づき予め前記電気機器と前記中央処理手段とで定められた所定の作動出力を前記電気機器に送信するよう構成したものである。これによって、コストがかかる高い認識率を実現する音声認識手段は中央処理手段に置くことができ、回線で接続された電気機器から送信されてくる音声データを次々と認識し、認識結果に基づく作動信号を返信することができる。
【００１４】
【発明の実施の形態】
本発明は、音声を電気信号に変換するマイクを備えると共に個別のＩＤを持った電気機器と、この機器と回線で接続され音声認識手段を内蔵した中央処理手段とを有し、前記電気機器の操作者は、前記マイクから操作指令を入力し、前記電気機器が音声信号と前記個別のＩＤとを前記中央処理手段へ送り、中央処理手段は送信された音声信号の音声認識処理を行うと共に前記音声信号を送ってきた前記電気機器を特定し、前記音声信号から前記電気機器ごとに登録された音声リストを検索して前記電気機器が受け付けられる操作指令かどうかを判断し、受け付けられる操作指令である場合、前記電気機器が確認され今受け付けられる状態であるか調べられ受け付けられるなら処理結果に基づき予め前記電気機器と前記中央処理手段とで定められた所定の作動出力を前記電気機器に送信するよう構成したシステムである。これによって、顧客は音声認識によるやさしい操作を経済的に実現できる。
【００１５】
また、複数の電気機器を備えたもので、各々の機器が音声認識手段を有する必要がなく、一層経済的に音声認識によるやさしい操作を利用できる。また、すべての機器を同一の操作系で操作できる。
【００１６】
また、電気機器がさらにＡ／Ｄコンバータを有し、音声をデジタル信号に変換して回線で中央処理手段に送信するよう構成したもので、ノイズに強く多重通信しやすく、また圧縮などにより通信時間を短縮することが容易になる。
【００１７】
また、中央処理手段が行う音声認識処理は、線形予測分析（ＬＰＣ）とベクトル量子化（ＶＱ）と隠れマルコフモデル（ＨＭＭ）に基づく音声認識処理としたもので、線形予測分析（ＬＰＣ）により特徴抽出を効率良く行い、得られた特徴ベクトルをベクトル量子化（ＶＱ）により有限個のシンボルに変換できる。これを利用して隠れマルコフモデル（ＨＭＭ）、すなわち音韻の確率モデル（確率オートマトン）を作り、音韻単位の認識が行える。
【００１８】
また、中央処理手段は音声認識処理結果に基づき電気機器にメッセージを伝える必要が生じた場合、メモリ内より所定の音声を合成し、回線を介して送信する構成であり、対話をしながら機器の制御を進めることができる。
【００１９】
また、中央処理手段は音声認識を行うハードウエアとソフト処理を行うプログラムを有する構成であり、すべてをソフト処理するのではなくバランス良くハードウェアとソフトウェアに音声認識処理を分担させることで、経済的なシステムを実現できる。特に中央処理装置の負担を軽減することができる。
【００２０】
また、電気機器が音声を電気信号に変換する段階と、前記電気機器が音声信号と個別のＩＤとを送信する段階と、中央処理手段が送信された個別のＩＤにより前記電気機器を特定し音声信号の音声認識処理を行う段階と、前記音声信号から前記電気機器ごとに登録された音声リストを検索して前記電気機器が受け付けられる操作指令かどうかを判断し、受け付けられる操作指令である場合、前記電気機器が確認され今受け付けられる状態であるか調べられ受け付けられるなら処理結果に基づき予め前記電気機器と前記中央処理手段とで定められた所定の作動出力を前記電気機器に送信する段階とより構成したネットワーク認識方法である。これによって、顧客は音声認識によるやさしい操作を経済的に実現できる。
【００２１】
また、電気機器が音声を電気信号に変換するステップと、該電気機器が音声信号と個別のＩＤとを送信するステップと、中央処理手段が送信された個別のＩＤにより前記電気機器を特定し音声信号の音声認識処理を行うステップと、前記音声信号から前記電気機器ごとに登録された音声リストを検索して前記電気機器が受け付けられる操作指令かどうかを判断し、受け付けられる操作指令である場合、前記電気機器が確認され今受け付けられる状態であるか調べられ受け付けられるなら処理結果に基づき予め前記電気機器と前記中央処理手段とで定められた所定の作動出力を前記電気機器に送信するステップとより構成したネットワーク音声認識プログラムである。これによって、顧客は音声認識によるやさしい操作を経済的に実現できる。
【００２２】
【実施例】
以下本発明の実施例について、図面を参照しながら説明する。
【００２３】
（実施例１）
図２は本発明の第１の実施例におけるネットワーク音声認識システムの構成を示す接続図である。ある家庭内に設置される電気機器群には、いずれも音声を電気信号に変換する手段たるマイク１と、合成音声信号を音声に復元するスピーカ２と、ネットワークに接続する接続手段３が備えられている。
【００２４】
従って、例えばランドリー４の操作部にはマイク１とスピーカ２と他には表示窓が存在するだけで、従来のようにタイマーやキーボードはまったくない。なお、接続手段３は機器本体に内蔵されている。
【００２５】
電子レンジ５、炊飯器６、冷蔵庫７もまったく同様であり、操作部にはマイク１とスピーカ２と他には必要に応じて表示窓が存在するだけである。テレビ８も同様だが、テレビはもともと本来の機能としてスピーカ２を備えている。
【００２６】
さて、これらの機器と接続手段３を介して回線で接続された中央処理手段９がある。これはある家庭内のネットワークだけではなく、多数の家庭の電気機器群を回線を介して制御する。音声認識手段１０を内蔵している。
【００２７】
図１はかかる本発明の第１の実施例におけるネットワーク音声認識システムの構成を示すブロック図である。ランドリー４には前述のマイク１とスピーカ２、接続手段３以外に制御手段１１が設けられ、音声指令に基づきモータ及びヒータ１２を制御する。すなわち、マイク１から入力された操作指令を制御手段１１が接続手段３と回線を介して中央処理手段９へ送信する。かかる音声データは中央処理手段９に設けた接続手段１４を経由し、ＩＤ認識手段１５によりどのロケーション（家庭）のどのアドレス（電気機器）かを識別され、Ａという家庭のランドリーである旨を認識した後、音声認識手段１０により指令内容が分析される。音声認識は記憶手段１６に記録された音声パターンとの類似度を比較したり、前後の単語あるいは音韻から文章の推定が行われたりする。その認識処理の結果、機器がどのような動作を起こすべきかが再びランドリー４に送信される。この出力を受けて制御手段１１はモータ及びヒータ１２への通電を開始したり変更したり停止したりする。制御手段１１は必要に応じてスピーカ２より送信されてきた合成音声メッセージを出力する。
【００２８】
電子レンジ５以下も同様である。電子レンジ５では制御手段１１により制御されるのは、熱源たるマグネトロン１３である。もちろん、ランドリー４にしろ電子レンジ５にしろ、制御される対象はこれだけではない。これらは被制御主要ブロックである。中央処理手段９から出力されるのは、あらかじめ定められた被制御ブロックをダイレクトに制御するデータであってもいいし、ある動作モードをコード化したデータでも構わない。これらはいったん制御手段１１で解読され、当該の被制御ブロックをコントロールする。
【００２９】
かかる構成によりネットワーク音声認識システムは、特開昭５６−８８５０３号公報で示したような音声認識部と音声合成部を有する加熱機器のように、ネットワークを介しながらあたかも操作者と機器が対話をするかのごとく操作を進め、確実に機器の動作を実行することができる。もちろん、音声合成は本発明にとって必須事項ではない。
【００３０】
（実施例２）
図３は本発明の第２の実施例におけるネットワーク音声認識システムの構成を示す接続図である。ある家庭Ａ１７、ある家庭Ｂ１８など、それぞれの家庭がインターネットで中央処理手段９に接続されている。ある家庭Ａ１７内に設置される電気機器群には、図２と同様いずれも音声を電気信号に変換する手段たるマイクと、合成音声信号を音声に復元するスピーカを備えている。しかしながら、ネットワークに接続する接続手段は個々の電気機器は有しない。これらの機器はハブスイッチ１９に接続され、ハブスイッチ１９はルーターに繋がっている。すなわち中央処理手段９との接続手段を単一のルーターを共有している。家庭Ｂ１８内も同様の構成である。
【００３１】
かかる構成により現状ではまだ高価なモデムを共有することにより経済的に本発明を活用することができる。が、将来のネット家電の実現に備え、通信プロトコルをシンプル化し、８ビット程度のマイコンに搭載可能なインターネットへの接続を実行する安価なＯＳの開発が進められており、接続する電気機器より高価なモデムも数年後には笑い話になるであろう。要するに当面は図３に示す構成が現実的であるが、将来的には図１の構成が一般的になるであろう。
【００３２】
この選択はネットワークへの接続コスト（ハード及び利用料金などのソフト）と、音声認識手段のハードコストとの比較になる。前者が後者を上回るなら、この発明は現実には利用者がいなくなる。が、後述するが音声認識処理は高速の演算処理と大量の記憶手段、日進月歩の技術革新があり、本発明によれば、これらを中央処理手段で共用でき、最新技術への更新を図りながら、各電気機器内には熱源やモータなどの作動を制御するごく簡単な制御手段を設ければよい。場合によっては制御をすべて中央処理手段が行うことも可能である。ネットワークコストはどんどん下がり、音声処理技術は進化する、ということが本発明の前提である。
【００３３】
（実施例３）
図４はかかる本発明の第３の実施例におけるネットワーク音声認識システムの構成を示すブロック図である。図１に示す実施例１に対して、マイク１にはＡ／Ｄコンバータ２１が接続され、音声をデジタル信号に変換して接続手段３を介して制御手段１１は、回線経由で中央処理手段９に送信する。かかる構成により送信される音声信号はノイズに対して強くなる。すなわち、送信データをいくつかのフレームに分け、そのフレームごとに偶数あるいは奇数のパリティチェックを行えば、ノイズを除外することが容易となる。デジタル信号にすれば圧縮などの処理も行い易い。Ａ／Ｄコンバータは安価であり、信頼性を上げる上でも各電気機器にこれを個別に搭載する構成は実用面で有益である。
【００３４】
また本実施例では、中央処理手段９が音声認識処理結果に基づき当該電気機器にメッセージを伝える必要が生じた場合、記憶手段１６内より所定の音声を合成し、回線を介して当該電気機器に送信する構成である。さらに各電気機器はＤ／Ａコンバータ２２を備えている。中央処理部９で合成された音声は、アナログデータとして回線上に送信されて構わないが、デジタルデータで扱えばノイズに対して強く、また圧縮などの処理も行い易い。すなわち音声認識において音声指令をデジタル化するのと同じ効果で得られる。
【００３５】
（実施例４）
図５はかかる本発明の第４の実施例におけるネットワーク音声認識システムの音声認識処理の構成を示すブロック図である。
【００３６】
人が発する音声はきわめて曖昧であり、特に日本語においては単語間の独立性が乏しく、個人差や地域性（方言）など正確な認識を妨げる要因がすこぶる多い。そんな中、音声認識処理はパターン認識技術を駆使して実現される。あらかじめ認識する言葉を登録しておく特定話者認識と、不特定多数の人の音声を認識する不特定話者認識があり、当然、後者が技術的にはハードルは高いが実用性が高い。
【００３７】
また、音声認識には単語単位の認識、音韻単位の認識がある。単語単位の認識では音声をコンピュータが分析し、特徴抽出し、特徴量の時系列を作る。そして処理手段内の特徴時系列単語辞書と類似度を比較計算し、認識結果として出力する。
【００３８】
音韻単位の認識は入力音声を音素記号列に変換し、単語列に置き換える。これを構文解析し、文字列に変換する。さらに論理解析や意味解析し、文章を生成する。その音声が発せられた前後の状況や中央処理手段に接続された個別の機器を認識することから、ありうる指令、ありえない指令など簡単な言語理解にまで踏み込むこともできる。極めて難度が高く、従って高速大容量のコンピュータと専用のハードウエアのハイブリッドシステムが効率良く機能する。
【００３９】
本実施例では不特定話者のゆらぎに強いＨＭＭ（隠れマルコフモデル、ＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌ）による方法を採用する。ＨＭＭは近年広く用いられる手法であり、これに基づく音韻単位の音声認識を実現する。かかる技術は、例えば２０００年第２回ＩＰアワード、ＩＰ賞を受賞した「隠れマルコフモデルに基づく不特定話者音韻レベル音声認識・学習回路（中村一博他、奈良先端科学技術大学院大学）」に示される。
【００４０】
図５において、音声の電気信号（アナログ）はＡ／Ｄコンバータ２１によってデジタル信号に置換される。Ａ／Ｄコンバータ２１は各電気機器内においても中央処理手段内においてもどちらでも構わない。機器内に置いた時の効果は図４・実施例３ですでに説明した。
【００４１】
デジタル信号に変換された音声データは、コントローラ２３へ送られ、順次ＬＰＣメモリ２４に記憶される。と同時にＬＰＣ分析部２５に１フレーム分ずつ送り出され、線形予測分析（ＬＰＣ）手法を用いて音声の特徴が抽出される。すなわち特徴ベクトルが算出される。特徴ベクトルは音声の時系列データと自己相関関数に基づき、線形予測係数を算出することで得られる。求められた特徴ベクトルはＬＰＣメモリに記録される。
【００４２】
次いでＶＱ部２６において、ＬＰＣ分析により得られた特徴ベクトル系列が有限個のシンボルに変換される。ベクトル量子化（ＶＱ）である。各特徴ベクトルについて、符号帳（シンボル番号と代表ベクトルのペアを要素とする表）を探索し、距離が最小となる代表ベクトルのシンボルに写像する。符号帳は多くの学習用データにクラスタリング手法を適用して生成され、ＬＰＣメモリ２４に記憶される。
【００４３】
以上の処理に続いてＨＭＭ部２７で音声認識が実行される。ＨＭＭは音韻の確率モデル（確率オートマトン）であり、音韻ごとに１つのＨＭＭが構成され、ＨＭＭメモリ２８に確率パラメータが記憶されている。シンボル系列と最も高い確率を示す音韻が選択される。そしてこの認識処理の結果が出力される。
【００４４】
なお、本実施例では音声認識をハードウエアで行う構成としたが、その処理の大半は延々と繰り返される行列演算、確率演算処理である。中央処理手段に十分な処理能力があれば、ＬＰＣ分析部やＶＱ部、ＨＭＭ部などはプログラムによるソフト処理に置換できる。また、ハードとソフトのハイブリッドシステムとし、処理速度と経済性のバランスをとることもできる。
【００４５】
（実施例５）
図６は本発明の第５の実施例における中央処理手段をプログラム処理とした構成のプログラムのフローチャートである。
【００４６】
まず、デジタル化された音声データが入力される（ステップ１００）。この音声データの前あるいは後には機器のＩＤデータが付加されており、このＩＤデータを認識し、音声データが送られて来た機器を特定する（ステップ１０１）。ＩＤは当該電気機器が工場から出荷される際に唯一無二のものとして与えられており、ある家庭にその当該電気機器が設置された折、その家庭の個別情報が自動的にあるいは登録の手順を経て中央処理手段に記憶されている。かかるデータベースをアクセスすることで中央処理手段はアクセスした家庭と、その家庭におけるネット機器の構成を特定することができる（ステップ１０２）。
【００４７】
続いて音声認識処理が始まる。まず、音声データが１フレームごとに区切られ特徴抽出される（ステップ１０３）。そして特徴ベクトルの量子化が行われ（ステップ１０４）、そのデータが何に近いかの確率を音韻ごとに計算される（ステップ１０５）。そして確率が最大となるＨＭＭが求められる（ステップ１０６）。入力された音声を確定して音声認識処理は終了する（ステップ１０７）。この間の音声認識処理は図５の例でハード的に実行したのとまったく同等である。
【００４８】
さて、確定された音声を機器ごとに登録された音声リスト（テキストのテーブル）を検索して付き合わせが行われる（ステップ１０８）。これはその機器が受け付けられる指令かどうかを判断するものである。例えば、ＩＤからどのメーカのどの機種の電子レンジから音声データが入力されたかが特定されているから、この電子レンジが持っている機能、仕様に照らして検査が行われるのである。オーブンレンジなら「オーブンで１８０℃、３５分焼き上げ」という指令は受付可能であるが、ヒータ機能を有しない電子レンジであればこの指令は実行できない。また、電子レンジに「全自動で選択をすること」という指令が来ても対応できない。この音声リスト（テキストのテーブル）は新製品が発売されるごとに中央処理手段のメモリに記憶されてもいいし、当該のメーカが有するサーバにネットを介して中央処理手段がデータを検索にいってもよい。まずテーブルにある登録された音声かどうかが調べられる（ステップ１０９）。この検索で当該音声が見つからなければ、エラーメッセージが音声出力される（ステップ１１０）。エラーの内容に応じてきめこまかにメッセージを発することもできる。
【００４９】
登録された音声である場合には、次いで当該機器のステージ（動作状態）が確認される（ステップ１１１）。一切の指令は音声で中央処理手段に対して命じられるので、中央処理手段はその機器が今どんな状態かを特定できる。そして音声データは間違いがなくとも、今それを受け付けられるかが調べられる（ステップ１１２）。電子レンジが作動していない状態で「停止せよ」と命じられても、応じることはできない。また、直前に受けた命令を実行中も応じられない。
【００５０】
さらに、中央処理手段は当該家庭のネット機器の構成を把握しているので、ネットに接続された他の機器の動作もチェックできる（ステップ１１３）。例えば炊飯器が動作を始めたばかりで、炊きあがるのが４０分後なのに電子レンジに温め直しの指示が出た時、受け付けていいのかを音声出力して操作者に問いかけ、再度の指示を呼びかけることも可能である。
【００５１】
かかる処理の後、受付が完了した旨を知らせる音声が出力され（ステップ１１４）、当該機器を所定の動作をさせるための作動出力がテーブルから検索される（ステップ１１５）。例えば電子レンジは熱源としてのマグネトロンやモータ、ランプなどの負荷を有するが、これらの負荷のどれをオンしどれをオフするかの作動状態チャートをあらかじめテーブルとして記憶させておく。これらの状態を４桁の１６進データで扱う場合、０１Ａ６をマグネトロンをフル出力で作動させ、ターンテーブルモータは回転、ランプは消灯する、というようにあらかじめ機器側と中央処理手段で定めておけば、作動指令として出力される（ステップ１１６）。
【００５２】
以上のプログラムにより、音声認識によりネットに接続された電気機器を中央処理手段が制御することができる。
【００５３】
【発明の効果】
以上のように、本発明によれば、ネットワークに接続された複数の電気機器を、音声認識という人にやさしい操作系で操作でき、常に最新の音声認識技術を利用して高い認識率を維持しながら、経済的に利用できるシステムを実現することができる。
【図面の簡単な説明】
【図１】本発明の第１の実施例におけるネットワーク音声認識システムの構成を示すブロック図
【図２】同ネットワーク音声認識システムの構成を示す接続図
【図３】本発明の第２の実施例におけるネットワーク音声認識システムの構成を示す接続図
【図４】本発明の第３の実施例におけるネットワーク音声認識システムの構成を示すブロック図
【図５】本発明の第４の実施例におけるネットワーク音声認識システムの音声認識処理の構成を示すブロック図
【図６】本発明の第５の実施例における中央処理手段をプログラム処理とした構成のプログラムのフローチャート
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to network technology for operating a plurality of electric devices by speech recognition. Specifically, it relates to a network speech recognition system, a network speech recognition method, and a network speech recognition program in which a central processing means centrally performs speech recognition on the operator's voice commands issued to a plurality of devices connected by a line, and transmits a predetermined operation output to each electric device.
[0002]
[Prior art]
Japanese Unexamined Patent Publication No. 56-88503 is one example of a technology that handles speech recognition. It provides a heating device with a speech recognition unit and a speech synthesis unit and switches sequentially between two speech recognition modes and one speech synthesis mode, so that operation proceeds as if the operator and the device were holding a dialogue. The aim is to prevent malfunctions caused by misrecognition of speech.
[0003]
Features are extracted from the input speech and stored in memory. In the recognition process, standard patterns stored in advance in ROM are compared with the pattern stored in the memory, and the one with the highest similarity is recognized as the voice command. The device then switches to the speech synthesis mode, and the speech synthesizer issues a predetermined voice message to the operator based on the recognized speech. When the operator hears this message and inputs the next utterance, the device shifts to the second speech recognition mode and determines whether a specific predetermined utterance has been input.
[0004]
With this configuration, the system can proceed with control reliably. Malfunctions caused by misrecognition, such as starting heating that was never instructed or arbitrarily interrupting heating that has not yet finished, can be prevented.
[0005]
That application was filed by the inventor of the present invention. It is an invention that recalls the technical background of a time when near-perfect recognition was hard to achieve even with speaker-dependent speech recognition (a method in which the words of the target speaker are registered in advance).
[0006]
In recent years, with the remarkable progress of semiconductor technology and the development of software technology and network technology, information devices such as personal computers and mobile phones have spread widely in the home. In addition, electric appliances called internet appliances are gradually appearing on the market.
[0007]
Under these circumstances, speech recognition, which has long attracted attention, is at last entering the practical stage as a technology for good communication between people and machines. In particular, as memory has become cheaper, CPU processing speed has risen dramatically, and speech recognition technology has advanced, speaker-independent speech recognition (a method aimed at an unspecified number of speakers that needs no advance registration of words), which was considered extremely difficult a dozen or so years ago, has become possible at high recognition rates. A personal computer equipped with a microphone in place of a keyboard can be said to have reached a practical level. Personal computers shipped from the factory with such speech recognition software preinstalled are no longer unusual.
[0008]
However, a heating device such as the one disclosed in JP-A-56-88503 has still not been commercialized, even though speech recognition is spreading rapidly in the world of personal computers. This is due entirely to the economics of speech recognition systems. In a personal computer, a fast CPU and a large memory themselves raise the value of the machine, so operation by speech recognition can be introduced without raising the price very much. Moreover, if part of the input is misrecognized, it can simply be entered again; this is no different from a mistyped key. In a heating device, however, misrecognition can lead to excessive heating or to a fire inside the cavity. When a speech recognition system is built into a heating device, recognition accuracy matters even more, and raising it weighs still more heavily on the system cost.
[0009]
As a technology related to the present invention, Japanese Patent Laid-Open No. 2000-111057 is noted here. It discloses a data processing device for kitchen appliances. A kitchen appliance, for example a microwave oven, is provided with a microphone that generates digital data in response to the operator's voice and with speech recognition means for extracting a digital signal from the microphone. What is input to the microphone is the food items consumed from the appliance, and the main body is provided with storage means that stores the digital signals in order to generate a list of the processed items. This digital data is also transferred to a remote management system.
[0010]
With this configuration, when a consumed food item is spoken into the microphone, the speech recognition software loaded in the appliance identifies it, and the result is used to generate an e-mail message that is forwarded over the Internet. The forwarded message is used for replenishing and delivering the item.
[0011]
In this configuration too, the kitchen appliance contains the speech recognition means, and it cannot be denied that this pushes up the cost of the appliance. Moreover, what is handled by speech recognition is merely information in the form of food item names; the only consequence of misrecognition is that the replenishment order is placed inaccurately. The disclosure of Japanese Patent Laid-Open No. 2000-111057 does not include any function for starting or stopping the appliance itself.
[0012]
[Problems to be solved by the invention]
The present invention aims to realize a system in which a plurality of electric devices connected to a network can be operated through speech recognition, a user-friendly means of operation, and which can be used economically while always drawing on the latest speech recognition technology and maintaining a high recognition rate.
[0013]
[Means for Solving the Problems]
To solve the above conventional problem, the network speech recognition system of the present invention comprises an electric device that has a microphone for converting speech into an electrical signal and has an individual ID, and a central processing means that is connected to the electric device by a line and incorporates speech recognition means. The operator of the electric device inputs an operation command through the microphone, and the electric device sends the voice signal and the individual ID to the central processing means. The central processing means performs speech recognition on the transmitted voice signal and identifies the electric device that sent it. Using the voice signal, it searches the voice list registered for each electric device to judge whether the operation command is one that the electric device can accept. If it is, the state of the electric device is checked to see whether the command can be accepted at that moment, and if so, a predetermined operation output, agreed in advance between the electric device and the central processing means, is transmitted to the electric device based on the processing result. As a result, the costly speech recognition means needed for a high recognition rate can be placed in the central processing means, which recognizes the voice data sent from the line-connected electric devices one after another and returns operation signals based on the recognition results.
[0014]
DETAILED DESCRIPTION OF THE INVENTION
The present invention is a system comprising an electric device that has a microphone for converting speech into an electrical signal and has an individual ID, and a central processing means that is connected to the device by a line and incorporates speech recognition means. The operator of the electric device inputs an operation command through the microphone, and the electric device sends the voice signal and the individual ID to the central processing means. The central processing means performs speech recognition on the transmitted voice signal and identifies the electric device that sent it. Using the voice signal, it searches the voice list registered for each electric device to judge whether the operation command is one that the electric device can accept. If it is, the state of the electric device is checked to see whether the command can be accepted at that moment, and if so, a predetermined operation output, agreed in advance between the electric device and the central processing means, is transmitted to the electric device based on the processing result. As a result, the customer can economically obtain easy operation by speech recognition.
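The acceptance logic described above (voice list lookup, then a state check, then the pre-agreed operation output) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Device` class, the `handle_command` function, the device ID string, the command phrases, and the code `0x0000` are all hypothetical. The code `0x01A6` and the two rejection rules (a stop command to an idle device, and a new command while another is executing) are taken from Example 5 of the specification.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical registry entry kept by the central processing means."""
    commands: dict         # recognized phrase -> pre-agreed operation output
    running: bool = False  # current stage of the device, tracked centrally


def handle_command(registry, device_id, recognized_text):
    """Return (accepted, payload) for a phrase recognized from device_id."""
    device = registry.get(device_id)
    if device is None:
        return False, "unknown device"
    # 1. Is the phrase in the voice list registered for this device?
    if recognized_text not in device.commands:
        return False, "command not supported by this device"
    # 2. Can the device accept the command in its current state?
    if recognized_text == "stop" and not device.running:
        return False, "device is not running"
    if recognized_text != "stop" and device.running:
        return False, "device is busy"
    # 3. Update the tracked state and return the operation output.
    device.running = recognized_text != "stop"
    return True, device.commands[recognized_text]


registry = {"oven-A-001": Device(commands={"warm": 0x01A6, "stop": 0x0000})}
print(handle_command(registry, "oven-A-001", "stop"))   # rejected: not running
print(handle_command(registry, "oven-A-001", "grill"))  # rejected: not in list
print(handle_command(registry, "oven-A-001", "warm"))   # accepted
```

Keeping the device state on the central side mirrors the specification's point that every command passes through the central processing means, so it can know what each device is currently doing.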
[0015]
In addition, where a plurality of electric devices are provided, the individual devices do not each need speech recognition means, so easy operation by speech recognition can be used even more economically. All devices can also be operated in the same way.
[0016]
In addition, the electric device further has an A/D converter and is configured to convert speech into a digital signal and transmit it over the line to the central processing means. This makes the transmission robust against noise and easy to multiplex, and makes it easy to shorten communication time by means such as compression.
[0017]
In addition, the speech recognition performed by the central processing means is based on linear predictive analysis (LPC), vector quantization (VQ), and hidden Markov models (HMM). LPC extracts features efficiently, and the resulting feature vectors can be converted into a finite set of symbols by VQ. Using these, hidden Markov models, that is, probabilistic models of phonemes (probabilistic automata), are built, which makes recognition in units of phonemes possible.
[0018]
In addition, when the central processing means needs to convey a message to the electric device based on the speech recognition result, it synthesizes a predetermined voice from memory and transmits it over the line, so that control of the device can proceed as a dialogue.
[0019]
In addition, the central processing means has both hardware that performs speech recognition and a program that performs software processing. Rather than doing everything in software, the speech recognition work is divided between hardware and software in a balanced way, which makes an economical system possible. In particular, the load on the central processing unit can be reduced.
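The vector quantization step mentioned here maps each feature vector to the nearest entry of a codebook. A minimal sketch follows, assuming squared Euclidean distance and a tiny made-up two-dimensional codebook; the function name `quantize` and all numeric values are illustrative and not from the specification.

```python
def quantize(vectors, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    symbols = []
    for v in vectors:
        # Squared Euclidean distance to every representative vector.
        distances = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        symbols.append(distances.index(min(distances)))
    return symbols


# Hypothetical codebook of three representative vectors (symbols 0, 1, 2).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
features = [(0.1, -0.1), (0.9, 1.2), (0.2, 1.8)]
print(quantize(features, codebook))  # [0, 1, 2]
```

The resulting symbol sequence is what a discrete HMM would then score.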
[0020]
Also provided is a network speech recognition method comprising: a step in which the electric device converts speech into an electrical signal; a step in which the electric device transmits the voice signal and an individual ID; a step in which the central processing means identifies the electric device from the transmitted individual ID and performs speech recognition on the voice signal; and a step in which the voice list registered for each electric device is searched using the voice signal to judge whether the operation command is one that the electric device can accept and, if it is, the state of the electric device is checked to see whether the command can be accepted at that moment and, if so, a predetermined operation output agreed in advance between the electric device and the central processing means is transmitted to the electric device based on the processing result. As a result, the customer can economically obtain easy operation by speech recognition.
[0021]
Also provided is a network speech recognition program comprising: a step in which the electric device converts speech into an electrical signal; a step in which the electric device transmits the voice signal and an individual ID; a step in which the central processing means identifies the electric device from the transmitted individual ID and performs speech recognition on the voice signal; and a step in which the voice list registered for each electric device is searched using the voice signal to judge whether the operation command is one that the electric device can accept and, if it is, the state of the electric device is checked to see whether the command can be accepted at that moment and, if so, a predetermined operation output agreed in advance between the electric device and the central processing means is transmitted to the electric device based on the processing result. As a result, the customer can economically obtain easy operation by speech recognition.
[0022]
[Examples]
Embodiments of the present invention will be described below with reference to the drawings.
[0023]
(Example 1)
FIG. 2 is a connection diagram showing the configuration of the network speech recognition system in the first embodiment of the present invention. Each of the electric devices installed in a given home is equipped with a microphone 1, which converts speech into an electrical signal, a speaker 2, which restores a synthesized voice signal to sound, and a connection means 3 for connecting to the network.
[0024]
Therefore, for example, the operation unit of the laundry 4 has only the microphone 1, the speaker 2, and the display window, and there is no timer or keyboard as in the conventional case. The connection means 3 is built in the device body.
[0025]
The microwave oven 5, the rice cooker 6, and the refrigerator 7 are exactly the same: the operation unit has only the microphone 1, the speaker 2, and, where needed, a display window. The same applies to the TV 8, although a TV already has a speaker 2 as part of its normal function.
[0026]
A central processing means 9 is connected to these devices by a line through the connection means 3. It controls not just the network within one home but the electric devices of many homes over lines, and it incorporates the speech recognition means 10.
[0027]
FIG. 1 is a block diagram showing the configuration of the network speech recognition system according to the first embodiment of the present invention. In addition to the microphone 1, speaker 2, and connection means 3 described above, the laundry 4 has a control means 11, which controls the motor and heater 12 based on voice commands. That is, the control means 11 transmits the operation command input through the microphone 1 to the central processing means 9 via the connection means 3 and the line. The voice data passes through the connection means 14 provided in the central processing means 9, and the ID recognition means 15 identifies which address (electric device) at which location (home) it came from. Once the source has been recognized as the laundry of home A, the speech recognition means 10 analyzes the content of the command. Speech recognition compares similarity with the speech patterns recorded in the storage means 16, or estimates the sentence from the preceding and following words or phonemes. As a result of the recognition process, the operation the device should perform is transmitted back to the laundry 4. On receiving this output, the control means 11 starts, changes, or stops energization of the motor and heater 12. As necessary, the control means 11 outputs the transmitted synthesized voice message through the speaker 2.
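The ID recognition step can be pictured as reading a small header in front of the voice data and looking it up in a registry. The sketch below is only an illustration under assumed conventions: the 6-byte header layout (32-bit home ID, 16-bit device address), the function names, and the registry contents are hypothetical. Example 5 of the specification says only that the ID data is attached before or after the voice data; this sketch places it before.

```python
import struct

# Hypothetical header: big-endian 32-bit home ID and 16-bit device address.
HEADER = struct.Struct(">IH")


def pack_message(home_id, device_addr, audio):
    """Device side: prepend the ID header to the raw voice data."""
    return HEADER.pack(home_id, device_addr) + audio


def identify(message, registry):
    """Central side: split off the header and look up the sending device."""
    home_id, device_addr = HEADER.unpack_from(message)
    audio = message[HEADER.size:]
    return registry.get((home_id, device_addr), "unknown device"), audio


registry = {(1, 4): "home A, laundry", (1, 5): "home A, microwave oven"}
message = pack_message(1, 4, b"\x10\x20\x30")
print(identify(message, registry))
```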
[0028]
The same applies to the microwave oven 5 and the other appliances. In the microwave oven 5, what the control means 11 controls is the magnetron 13, which is the heat source. Of course, for both the laundry 4 and the microwave oven 5, these are not the only things controlled; they are simply the main controlled blocks. The output from the central processing means 9 may be data that directly controls a predetermined controlled block, or data that encodes a particular operation mode. In either case it is first decoded by the control means 11, which then controls the relevant block.
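On the device side, decoding a coded operation output can be as simple as a table lookup. The following sketch assumes a table agreed in advance between the device and the central processing means. The entry for `01A6` (magnetron at full power, turntable motor rotating, lamp off) follows the example given in Example 5 of the specification; the `0000` entry, the key names, and the function name are hypothetical.

```python
# Operating-state table held by the device's control means.
OPERATION_TABLE = {
    0x01A6: {"magnetron": "full", "turntable_motor": True, "lamp": False},
    0x0000: {"magnetron": "off", "turntable_motor": False, "lamp": False},  # assumed
}


def decode_operation(hex_text):
    """Translate a 4-digit hexadecimal operation output into load states."""
    code = int(hex_text, 16)
    if code not in OPERATION_TABLE:
        raise ValueError(f"unknown operation output: {hex_text}")
    return OPERATION_TABLE[code]


print(decode_operation("01A6"))
```

A lookup table keeps the device-side logic trivial, which fits the specification's aim of leaving only a very simple control means in each appliance.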
[0029]
With this configuration, the network speech recognition system allows the operator and the device to interact with each other through the network like a heating device having a speech recognition unit and a speech synthesis unit as disclosed in JP-A-56-88503. The operation can be performed as if it were, and the operation of the device can be executed reliably. Of course, speech synthesis is not essential for the present invention.
[0030]
(Example 2)
FIG. 3 is a connection diagram showing the configuration of the network speech recognition system in the second embodiment of the present invention. Individual homes, such as home A17 and home B18, are each connected to the central processing means 9 via the Internet. As in FIG. 2, each of the electric devices installed in home A17 has a microphone, which converts speech into an electrical signal, and a speaker, which restores a synthesized voice signal to sound. However, the individual electric devices do not have their own means of connecting to the network. Instead, the devices are connected to a hub switch 19, and the hub switch 19 is connected to a router. In other words, they share a single router as the means of connection to the central processing means 9. Home B18 has the same configuration.
[0031]
With this configuration, the invention can be used economically by sharing a modem, which is still expensive at present. However, in preparation for future networked home appliances, work is under way to simplify communication protocols and to develop inexpensive operating systems that handle Internet connection and can run on microcontrollers of around 8 bits. A modem that costs more than the appliance it connects will be a laughing matter in a few years. In short, the configuration of FIG. 3 is realistic for the time being, but the configuration of FIG. 1 will become common in the future.
[0032]
This choice comes down to comparing the cost of connecting to the network (hardware, plus soft costs such as usage fees) with the hardware cost of speech recognition means. If the former exceeds the latter, the invention will in practice have no users. However, as described later, speech recognition requires high-speed computation and large amounts of storage, and the technology advances day by day. According to the present invention, these resources can be shared in the central processing means and kept up to date with the latest technology, while each electric device needs only a very simple control means for controlling the operation of its heat source, motor, and so on. In some cases all control can be performed by the central processing means. The premise of the invention is that network costs keep falling and speech processing technology keeps evolving.
[0033]
(Example 3)
FIG. 4 is a block diagram showing the configuration of the network speech recognition system according to the third embodiment of the present invention. In contrast to Example 1 shown in FIG. 1, an A/D converter 21 is connected to the microphone 1; speech is converted into a digital signal, and the control means 11 transmits it through the connection means 3 and the line to the central processing means 9. A voice signal transmitted in this way is robust against noise. That is, if the transmission data is divided into frames and an even or odd parity check is performed on each frame, it becomes easy to detect and exclude frames corrupted by noise. A digital signal also lends itself to processing such as compression. A/D converters are inexpensive, and mounting one in each electric device is of practical benefit in raising reliability.
[0034]
Further, in this embodiment, when the central processing means 9 needs to convey a message to the electric device based on the speech recognition result, it synthesizes a predetermined voice from the storage means 16 and transmits it over the line to that device. Each electric device also has a D/A converter 22. The voice synthesized by the central processing means 9 may be sent over the line as analog data, but handling it as digital data makes it robust against noise and easy to compress. That is, the same benefits are obtained as from digitizing the voice command for speech recognition.
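A per-frame even parity check of the kind mentioned above might look like the following sketch. The frame size, the one-byte parity field, and the function names are assumptions made for illustration. Note that a single parity bit only detects an odd number of flipped bits in a frame; it cannot correct them, so a failed frame would have to be discarded or resent.

```python
def split_frames(data, size):
    """Divide the transmission data into fixed-size frames."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def count_ones(data):
    """Count the 1 bits in a bytes object."""
    return sum(bin(byte).count("1") for byte in data)


def add_parity(frame):
    """Append a parity byte (0 or 1) so the total number of 1 bits is even."""
    return frame + bytes([count_ones(frame) % 2])


def check_parity(framed):
    """Return True if the frame, including its parity byte, has even parity."""
    return count_ones(framed) % 2 == 0


audio = bytes(range(10))
sent = [add_parity(f) for f in split_frames(audio, 4)]

# Simulate noise: flip one bit in the first byte of the second frame.
damaged = bytes([sent[1][0] ^ 0x04]) + sent[1][1:]
received = [sent[0], damaged, sent[2]]
print([check_parity(f) for f in received])  # [True, False, True]
```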
[0035]
(Example 4)
FIG. 5 is a block diagram showing the configuration of the speech recognition processing of the network speech recognition system in the fourth embodiment of the present invention.
[0036]
Human voices are very vague, especially in Japanese, where independence between words is poor, and there are many factors that hinder accurate recognition, such as individual differences and regionality (dialect). Meanwhile, voice recognition processing is realized by making full use of pattern recognition technology. Specific speaker recognition that registers words to be recognized in advance and unspecified speaker recognition that recognizes the speech of an unspecified number of people. Of course, the latter is technically difficult but practical. Sex high.
[0037]
Speech recognition can be performed in word units or in phoneme units. In word-unit recognition, a computer analyzes the speech, extracts features, and builds a time series of features. This series is then compared for similarity against a word dictionary of feature time series held in the processing means, and the best match is output as the recognition result.
[0038]
In phoneme-unit recognition, the input speech is converted into a phoneme symbol string, which is then replaced with a word string. This is parsed and converted into a character string, and logical and semantic analysis are then performed to generate a sentence. Because the central processing means knows the situation before and after the utterance and the individual devices connected to it, it can also perform simple language understanding, such as distinguishing commands that are possible from those that are not. This processing is extremely demanding, so it runs efficiently on a hybrid system combining a high-speed, large-capacity computer with dedicated hardware.
[0039]
In the present embodiment, a method based on the HMM (Hidden Markov Model), which is robust to variation among unspecified speakers, is employed. The HMM has been widely used in recent years, and phoneme-unit speech recognition is realized on this basis. This technology is shown, for example, in the 2000 Second IP Award (IP Award) entry "Unspecified Speaker Phonological Level Speech Recognition / Learning Circuit Based on Hidden Markov Model" (Kazuhiro Nakamura et al., Nara Institute of Science and Technology).
[0040]
In FIG. 5, the analog electrical signal is converted into a digital signal by the A/D converter 21. The A/D converter 21 may be located either in each electric device or in the central processing means. The effect of placing it in the device has already been described above.
[0041]
The audio data converted into a digital signal is sent to the controller 23 and sequentially stored in the LPC memory 24. At the same time, it is sent to the LPC analysis unit 25 one frame at a time, where speech features are extracted using linear prediction analysis (LPC); that is, a feature vector is calculated. The feature vector is obtained by calculating linear prediction coefficients from the speech time-series data and its autocorrelation function. The obtained feature vector is recorded in the LPC memory 24.
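The calculation described above can be sketched in Python as follows, assuming the commonly used autocorrelation method with the Levinson-Durbin recursion. The specification does not say which solver the LPC analysis unit 25 uses, so the recursion, the function names, and the analysis order are assumptions for illustration.

```python
def autocorrelation(frame, order):
    """Autocorrelation r[0..order] of one frame of speech samples."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(order + 1)
    ]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] with x[n] ~ sum_k a[k] * x[n-k].

    Returns (coefficients, residual prediction error).
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err == 0:
            break  # silent frame: nothing left to predict
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err
```

For a frame that decays as 0.5 to the power n, the first coefficient comes out close to 0.5 and the second close to 0, as expected for a first-order signal.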
[0042]
Next, the VQ unit 26 converts the feature vector series obtained by the LPC analysis into a finite number of symbols; this is vector quantization (VQ). For each feature vector, a codebook (a table whose elements are pairs of a symbol number and a representative vector) is searched, and the vector is mapped to the symbol of the representative vector at minimum distance. The codebook is generated by applying a clustering method to a large amount of training data and is stored in the LPC memory 24.
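The codebook search can be illustrated with a minimal Python sketch. It assumes squared Euclidean distance and a codebook held as a dictionary from symbol number to representative vector; the specification names neither the distance measure nor the data layout, and the example vectors are invented.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(vectors, codebook):
    """Map each feature vector to the symbol of its nearest representative vector.

    codebook: dict mapping symbol number -> representative vector.
    """
    return [
        min(codebook, key=lambda symbol: squared_distance(v, codebook[symbol]))
        for v in vectors
    ]


# Hypothetical three-entry codebook over two-dimensional feature vectors.
EXAMPLE_CODEBOOK = {0: (0.0, 0.0), 1: (1.0, 1.0), 2: (0.0, 1.0)}
```

With this codebook, `quantize([(0.1, 0.2), (0.9, 0.8), (0.1, 0.9)], EXAMPLE_CODEBOOK)` yields the symbol sequence `[0, 1, 2]`.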
[0043]
Following the above processing, the HMM unit 27 performs voice recognition. The HMM is a probabilistic model of a phoneme (a probabilistic automaton); one HMM is constructed for each phoneme, and its probability parameters are stored in the HMM memory 28. The phoneme whose HMM gives the highest probability for the symbol sequence is selected, and the result of this recognition process is output.
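The selection of the most probable phoneme can be sketched in Python as below, assuming discrete-output HMMs scored with the forward algorithm. The two toy models, their parameters, and the phoneme labels are invented for illustration; they do not represent the contents of the HMM memory 28.

```python
def forward_likelihood(symbols, start, trans, emit):
    """P(symbols | model) for a discrete HMM, by the forward algorithm.

    start[s]: initial probability of state s
    trans[p][s]: probability of moving from state p to state s
    emit[s][k]: probability that state s outputs symbol k
    symbols must be non-empty.
    """
    states = range(len(start))
    alpha = [start[s] * emit[s][symbols[0]] for s in states]
    for sym in symbols[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in states) * emit[s][sym]
            for s in states
        ]
    return sum(alpha)


def recognize(symbols, models):
    """Return the label of the model that gives the highest likelihood."""
    return max(models, key=lambda label: forward_likelihood(symbols, *models[label]))


# Hypothetical two-state left-to-right models over a two-symbol alphabet.
EXAMPLE_MODELS = {
    "a": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.9, 0.1], [0.8, 0.2]]),
    "i": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.1, 0.9], [0.2, 0.8]]),
}
```

A practical recognizer would work with log probabilities to avoid underflow on long symbol sequences; plain products are kept here for readability.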
[0044]
In the present embodiment, the speech recognition is performed by hardware, but most of the processing consists of matrix and probability calculations repeated many times. If the central processing means has sufficient processing capability, the LPC analysis unit, VQ unit, HMM unit, and so on can be replaced by software processing in a program. A hybrid of hardware and software can also be used to balance processing speed and cost.
[0045]
(Example 5)
FIG. 6 is a flowchart of the program in the fifth embodiment of the present invention, in which the central processing means is implemented as program processing.
[0046]
First, digitized audio data is input (step 100). The device's ID data is attached before or after the voice data; the ID data is recognized and the device that sent the voice data is identified (step 101). A unique ID is assigned to each electric device when it is shipped from the factory. When the electric device is installed in a household, the household's individual information is registered, either automatically or by manual registration, and stored in the central processing means. By accessing this database, the central processing means can identify the household that is accessing it and the configuration of the network devices in that household (step 102).
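Steps 100 to 102 can be sketched in Python as follows, under the assumption that a fixed-length ID is placed before the voice data. The ID length, the registry contents, and all names are hypothetical; the specification only states that the ID is attached before or after the voice data.

```python
ID_LENGTH = 4  # bytes of ID data placed before the audio (hypothetical)

# Hypothetical database held by the central processing means: device ID -> registration.
DEVICE_REGISTRY = {
    b"MW01": {"household": "house-001", "model": "microwave-basic"},
    b"RC01": {"household": "house-001", "model": "rice-cooker"},
}


def split_packet(packet):
    """Separate the leading device ID from the voice data that follows it."""
    if len(packet) < ID_LENGTH:
        raise ValueError("packet is shorter than the ID field")
    return packet[:ID_LENGTH], packet[ID_LENGTH:]


def identify(packet):
    """Return (registry entry or None, voice data) for an incoming packet."""
    device_id, audio = split_packet(packet)
    return DEVICE_REGISTRY.get(device_id), audio


def household_devices(household):
    """List the models registered for one household, in sorted order."""
    return sorted(
        entry["model"]
        for entry in DEVICE_REGISTRY.values()
        if entry["household"] == household
    )
```

Looking up the household from the sending device's entry is what lets the later steps consider the other devices in the same home.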
[0047]
Subsequently, the voice recognition process starts. First, the audio data is segmented into frames and features are extracted (step 103). The feature vectors are then quantized (step 104), and for each phoneme the probability that the data matches it is calculated (step 105). The HMM with the maximum probability is then found (step 106). The input speech is thereby determined and the speech recognition process ends (step 107). The voice recognition processing in these steps is exactly the same as that executed in hardware in the example described above.
[0048]
Next, the determined speech is looked up in a voice list (text table) registered for each device (step 108), to determine whether the command is one the device accepts. For example, because the ID identifies which manufacturer's model the voice data came from, the check is made against the functions and specifications of that model. A microwave oven with an oven function can accept the command "bake in the oven at 180°C for 35 minutes," but a microwave oven without a heater function cannot execute it. Likewise, a microwave oven cannot handle a command such as "select automatically." The voice list (text table) may be stored in the memory of the central processing means each time a new product is released, or the central processing means may retrieve the data from the manufacturer's server via the network. First, it is checked whether the speech is registered in the table (step 109). If it is not found, an error message is output by voice (step 110). The message can be tailored in detail to the nature of the error.
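The table search of steps 108 to 110 might look like the following Python sketch. The model names, the command strings, and the wording of the error messages are assumptions for illustration and are not taken from any actual voice list.

```python
# Hypothetical voice lists (text tables), one per device model.
VOICE_LISTS = {
    "oven-range": {"bake in the oven at 180 degrees for 35 minutes", "reheat", "stop"},
    "microwave-basic": {"reheat", "stop"},
}


def check_command(model, text):
    """Return (accepted, error message) for recognized text sent from a given model."""
    accepted = VOICE_LISTS.get(model)
    if accepted is None:
        return False, "no voice list is registered for this device"
    if text not in accepted:
        return False, "this device cannot perform: " + text
    return True, ""
```

For example, `check_command("microwave-basic", "bake in the oven at 180 degrees for 35 minutes")` is rejected, while the same text is accepted for `"oven-range"`.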
[0049]
If the speech is registered, the stage (operating state) of the device is then confirmed (step 111). Since every command is issued by voice through the central processing means, the central processing means can tell what state the device is currently in. Even if the voice data itself is valid, it is checked whether the command can be accepted at this moment (step 112). For example, a "stop" command cannot be acted on if the microwave oven is not operating. Nor can a command be accepted while the instruction received immediately before is still being executed.
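One way to picture the stage check of steps 111 and 112 is a small state table kept by the central processing means for each device. The following Python sketch assumes two stages and two commands; the stage names, the transition table, and the class are hypothetical.

```python
# Hypothetical transitions: (current stage, command) -> next stage.
TRANSITIONS = {
    ("idle", "reheat"): "heating",
    ("heating", "stop"): "idle",
}


class DeviceStage:
    """Tracks one device's stage as seen by the central processing means."""

    def __init__(self):
        self.stage = "idle"

    def try_accept(self, command):
        """Accept the command only if it is valid in the current stage."""
        next_stage = TRANSITIONS.get((self.stage, command))
        if next_stage is None:
            return False  # e.g. "stop" while idle, or "reheat" while already heating
        self.stage = next_stage
        return True
```

A real system would also need the device to report when an operation finishes, so that the tracked stage returns to idle without a voice command.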
[0050]
Furthermore, since the central processing means knows the configuration of the household's network devices, it can also check the operation of other devices connected to the network (step 113). For example, if the rice cooker has only just started and there are still 40 minutes until the rice is cooked, and a reheat instruction is then given to the microwave oven, the central processing means can ask the operator by voice output whether to go ahead, and request the instruction again.
[0051]
After this processing, a voice notifying the operator that the command has been accepted is output (step 114), and the operation output that causes the device to perform the predetermined operation is retrieved from a table (step 115). For example, a microwave oven has loads such as a magnetron serving as the heat source, a motor, and a lamp, and an operating state chart indicating which of these loads are on and which are off is stored in advance as a table. If these states are handled as 4-digit hexadecimal data, the device and the central processing means must agree in advance that, for instance, 01A6 means the magnetron operates at full output, the turntable motor rotates, and the lamp is off. The operation command is then output (step 116).
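The shared table of steps 115 and 116 can be sketched in Python as a pair of lookups, one on each side of the line. Only the code 01A6 and its meaning come from the description; the code 0000, the model name, and the dictionary layout are assumptions for illustration.

```python
import string

# Central processing means side: (model, command) -> 4-digit hexadecimal code.
OPERATION_TABLE = {
    ("microwave-basic", "reheat"): "01A6",
    ("microwave-basic", "stop"): "0000",  # hypothetical "all loads off" code
}

# Device side: code -> load states, agreed in advance with the central side.
LOAD_STATES = {
    "01A6": {"magnetron": "full", "turntable_motor": True, "lamp": False},
    "0000": {"magnetron": "off", "turntable_motor": False, "lamp": False},
}


def is_valid_code(code):
    """True if code consists of exactly four hexadecimal digits."""
    return len(code) == 4 and all(c in string.hexdigits for c in code)


def operation_output(model, command):
    """Retrieve the operation output for an accepted command."""
    return OPERATION_TABLE[(model, command)]


def apply_output(code):
    """Device side: translate a received code into load states."""
    if not is_valid_code(code):
        raise ValueError("operation output must be 4 hexadecimal digits")
    return LOAD_STATES[code]
```

Keeping both tables in step is the advance agreement the description calls for; a code unknown to the device would raise a `KeyError` here.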
[0052]
With the above program, the central processing means can control the electric devices connected to the network by voice recognition.
[0053]
[Effect of the invention]
As described above, according to the present invention, a plurality of electric devices connected to a network can be operated through voice recognition, a human-friendly means of operation, and a system can be realized that remains economical while always maintaining a high recognition rate using the latest voice recognition technology.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of a network voice recognition system in a first embodiment of the present invention.
FIG. 2 is a connection diagram showing the configuration of the network voice recognition system.
FIG. 3 is a connection diagram showing a configuration of a network voice recognition system according to a second embodiment of the present invention.
FIG. 4 is a block diagram showing the configuration of a network speech recognition system in a third embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of speech recognition processing of the network speech recognition system in the fourth embodiment of the present invention.
FIG. 6 is a flowchart of the program in the fifth embodiment of the present invention, in which the central processing means is implemented as program processing.

Claims

A network speech recognition system comprising: an electric device having an individual ID and provided with a microphone for converting sound into an electric signal; and a central processing unit incorporating voice recognition means and connected to the electric device by a line, wherein an operator of the electric device inputs an operation command through the microphone, the electric device sends the voice signal and the individual ID to the central processing unit, and the central processing unit performs voice recognition processing on the transmitted voice signal, identifies the electric device that sent the voice signal, searches a voice list registered for each electric device on the basis of the voice signal to determine whether the operation command is one accepted by that electric device, confirms, if it is an accepted operation command, the state of the electric device and checks whether the command can be accepted now, and, if it can be accepted, transmits to the electric device, on the basis of the processing result, a predetermined operation output determined in advance between the electric device and the central processing unit.

The network voice recognition system according to claim 1, comprising a plurality of the electric devices.

The network voice recognition system, wherein the electric device comprising means for converting sound into an electric signal further comprises an A/D converter and is configured to convert the sound into a digital signal and transmit it to the central processing unit via a line.

The network speech recognition system according to claim 1, wherein the speech recognition processing performed by the central processing unit is speech recognition processing based on linear prediction analysis (LPC), vector quantization (VQ), and hidden Markov model (HMM).

The network speech recognition system according to claim 1, wherein the central processing unit is configured to synthesize a predetermined voice from the memory and transmit it via a line when a message needs to be transmitted to the electric device based on a voice recognition processing result.

The network voice recognition system according to claim 1, wherein the central processing unit includes hardware for performing voice recognition and a program for performing software processing.

A network speech recognition method comprising the steps of: the electric device converting sound into an electric signal; the electric device transmitting the voice signal and its individual ID; the central processing unit identifying the electric device from the transmitted individual ID and performing voice recognition processing; searching a voice list registered for each electric device on the basis of the voice signal to determine whether the command is an operation command accepted by the electric device; if it is an accepted operation command, confirming the state of the electric device and checking whether the command can be accepted now; and, if it can be accepted, transmitting to the electric device, on the basis of the processing result, a predetermined operation output determined in advance between the electric device and the central processing unit.

A network speech recognition program comprising the steps of: the electric device converting sound into an electric signal; the electric device transmitting the voice signal and its individual ID; the central processing unit identifying the electric device from the transmitted individual ID and performing voice recognition processing; searching a voice list registered for each electric device on the basis of the voice signal to determine whether the command is an operation command accepted by the electric device; if it is an accepted operation command, confirming the state of the electric device and checking whether the command can be accepted now; and, if it can be accepted, transmitting to the electric device, on the basis of the processing result, a predetermined operation output determined in advance between the electric device and the central processing unit.