JP2004012993A - Network voice recognition system, method of recognizing network voice, and network voice recognition program

Info

Publication number: JP2004012993A
Application number: JP2002168769A
Authority: JP
Inventors: Shigeki Ueda; 植田　茂樹
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2002-06-10
Filing date: 2002-06-10
Publication date: 2004-01-15
Anticipated expiration: 2022-06-10
Also published as: JP4224991B2

Abstract

PROBLEM TO BE SOLVED: To realize an economically usable system in which a plurality of electric apparatuses connected to a network can be operated through voice recognition, a human-friendly operation system, while always using the latest voice recognition technique and maintaining a high recognition rate.
SOLUTION: The network voice recognition system has a plurality of electric apparatuses 4, 5, each having a means 1 for converting voice into an electrical signal, and a central processing means 9 connected to the electric apparatuses through a line. The central processing means 9 performs voice recognition processing on a transmitted voice signal and transmits a specified operation output to the electric apparatuses based on the processing result.
COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、複数の電気機器を音声認識により操作するネットワーク技術に関し、回線に接続された複数の機器に指令された操作者の音声を中央処理手段で集中的に音声認識処理を行い、各々の電気機器に所定の作動出力を送信するネットワーク音声認識システム、ネットワーク音声認識方法、ネットワーク音声認識プログラムに関するものである。
【０００２】
【従来の技術】
音声認識を扱う技術としては、特開昭５６−８８５０３号公報がある。これは加熱装置に音声認識部と音声合成部を設け、２つの音声認識モードと１つの音声合成モードをシーケンシャルに切り換えることで、あたかも操作者と機器が対話をするかのごとく操作が進み、音声の誤認識による誤動作の防止を図ったものである。
【０００３】
入力された音声は特徴を抽出されてメモリに記憶される。認識処理はＲＯＭにあらかじめ記憶されている標準パターンと前記メモリに記憶された上記パターンとを比較し、類似度の高いものをその音声指令として認識する。次いで音声合成モードに切り換えられ、音声合成シンセサイザによって認識された音声に基づきある所定の音声メッセージを操作者に対して発する。操作者はこのメッセージを聞き、次の音声を入力すると、機器は再び第２の音声認識モードへと移行し、ある定められた音声が入力されるかを判断する。
【０００４】
かかる構成により、システムは確実に制御を進めることができる。誤認識によって指示されない加熱を開始したり、まだ終わっていない加熱を勝手に中断したり、といった誤動作を未然に防止できる。
【０００５】
これは本発明の発明者による出願であるが、当時、特定話者音声認識（あらかじめ認識する対象者の言語を登録しておく方式）でもなかなか完璧な認識が難しかった頃の技術背景を偲ばせる発明である。
【０００６】
近年、半導体技術の著しい進展やソフト技術・ネットワーク技術の発展にともない、家庭内にもパソコンや携帯電話などの情報機器が広く浸透してきた。またネット家電と呼ばれる電気機器も徐々に市場に姿を現しつつある。
【０００７】
このような状況のもと、以前より注目を集めてきた音声認識技術は、人と機械の良好なコミュニケーションを図れる技術としていよいよ実用段階に入りつつある。特にメモリが安価になり、ＣＰＵの処理速度が格段に上昇し、音声認識技術が進化することで、十数年前には非常に困難とされた不特定話者音声認識（あらかじめ言語登録の不要な不特定多数を対象とする方式）を高い認識率で認識することが可能となってきた。キーボードの代わりにマイクを備えたパソコンは、もはや実用レベルに入ったと言えよう。工場出荷時にこのような音声認識ソフトがプリインストールされるパソコンも珍しくなくなってきた。
【０００８】
が、いまだ特開昭５６−８８５０３号公報に示されるような加熱装置は、商品化されていない。パソコンの世界では音声認識技術の普及が急速に進みつつあるにもかかわらず、である。これはひとえに音声認識システムの経済性に由来する。パソコンでは処理速度の速いＣＰＵも大きなメモリも、それ自体がパソコン本体の価値を高めてくれる。パソコンではそれほど価格を押し上げることなく、音声認識による操作を導入できる。また、よしんばその一部を誤認識しても再入力すればよい。単にキーボードをミスタッチしたのと同等である。だが、加熱装置では誤認識は過大な加熱や庫内での燃焼に繋がりかねない。加熱装置に音声認識システムを組み込んだ場合、認識精度は一層重要であり、これを上げようとすればシステムコストにさらに重くのしかかる。
【０００９】
本発明の関連技術として、特開２０００−１１１０５７公報を示しておく。ここには台所電気製品のためのデータ処理装置が開示されている。台所器具、例えば電子レンジ、には操作者の音声に応じてデジタル・データを発生するマイクロホンと、このマイクロホンからデジタル信号を取り出すための音声認識手段が設けられる。このマイクロホンに入力されるのは庫内で消費される食品の品目であり、処理した品目のリストを発生させるためにデジタル信号を記憶する記憶手段が本体に備えられる。またこのデジタル・データは遠隔管理システムにも転送される。
【００１０】
かかる構成により庫内で消費される食品品目が音声でマイクロホンに入力されると、器具本体にロードされた音声認識ソフトウェアがこれを識別し、インターネットに転送される電子メール・メッセージを発生するために使用される。インターネットに転送されたメッセージは品目の補充、配達のために利用される。
【００１１】
この構成においても、台所器具は音声認識手段を備えており、台所器具のコストを押し上げていることは否めない。また、音声認識によって扱われるのは単なる食品品目名という情報でしかなく、誤認識しても補充品目の発注が正確に行われないだけである。前記特開２０００−１１１０５７公報が開示する機器本体を作動・停止させるような機能を含まない。
【００１２】
【発明が解決しようとする課題】
本発明ではネットワークに接続された複数の電気機器を、音声認識という人にやさしい操作系で操作できるよう、常に最新の音声認識技術を利用して、しかも高い認識率を維持しながら、経済的に利用できるシステムを実現することをめざす。
【００１３】
【課題を解決するための手段】
前記従来の課題を解決するために、本発明のネットワーク音声認識システムは、音声を電気信号に変換する手段を備えた個別のＩＤを持つ複数の電気機器と、この機器群と回線で接続された中央処理手段とを有し、中央処理手段は送信された音声信号の音声認識処理を行い、処理結果に基づき当該の電気機器に所定の作動出力を送信するよう構成したものである。これによって、コストがかかる高い認識率を実現する音声認識手段は中央処理手段に置くことができ、回線で接続された電気機器から送信されてくる音声データを次々と認識し、認識結果に基づく作動信号を返信することができる。
【００１４】
【発明の実施の形態】
請求項１に記載の発明は、音声を電気信号に変換する手段を備えた電気機器と、この機器と回線で接続された中央処理手段とを有し、該中央処理手段は送信された音声信号の音声認識処理を行い、処理結果に基づき該電気機器に所定の作動出力を送信するよう構成したシステムである。これによって、顧客は音声認識によるやさしい操作を経済的に実現できる。
【００１５】
請求項２に記載の発明は、特に請求項１に記載のネットワーク音声認識システムにおいて、複数の電気機器を備えたもので、各々の機器が音声認識手段を有する必要がなく、一層経済的に音声認識によるやさしい操作を利用できる。また、すべての機器を同一の操作系で操作できる。
【００１６】
請求項３に記載の発明は、特に請求項１に記載のネットワーク音声認識システムにおいて、電気機器が個別のＩＤを持つよう構成したもので、中央処理手段にいかに多くの電気機器が接続されようとも、中央処理手段は着実に個別の機器を識別し、適切な認識を行うことができる。また、このＩＤに基づき機器名やメーカ・機種名などを識別できるよう構成すれば、中央処理手段はある家庭内のネットワークに繋がれた機器群の把握をすることができるようになり、ある音声入力に対して複数の機器に適切な作動出力を送信できる。
【００１７】
請求項４に記載の発明は、特に請求項１に記載のネットワーク音声認識システムにおいて、電気機器がさらにＡ／Ｄコンバータを有し、音声をデジタル信号に変換して回線で中央処理手段に送信するよう構成したもので、ノイズに強く多重通信しやすく、また圧縮などにより通信時間を短縮することが容易になる。
【００１８】
請求項５に記載の発明は、特に請求項１に記載のネットワーク音声認識システムにおいて、中央処理手段が行う音声認識処理は、線形予測分析（ＬＰＣ）とベクトル量子化（ＶＱ）と隠れマルコフモデル（ＨＭＭ）に基づく音声認識処理としたもので、線形予測分析（ＬＰＣ）により特徴抽出を効率良く行い、得られた特徴ベクトルをベクトル量子化（ＶＱ）により有限個のシンボルに変換できる。これを利用して隠れマルコフモデル（ＨＭＭ）、すなわち音韻の確率モデル（確率オートマトン）を作り、音韻単位の認識が行える。
【００１９】
請求項６に記載の発明は、特に請求項１に記載のネットワーク音声認識システムにおいて、中央処理手段は音声認識処理結果に基づき電気機器にメッセージを伝える必要が生じた場合、メモリ内より所定の音声を合成し、回線を介して送信する構成であり、対話をしながら機器の制御を進めることができる。
【００２０】
請求項７に記載の発明は、特に請求項１に記載のネットワーク音声認識システムにおいて、中央処理手段は音声認識を行うハードウエアとソフト処理を行うプログラムを有する構成であり、すべてをソフト処理するのではなくバランス良くハードウェアとソフトウェアに音声認識処理を分担させることで、経済的なシステムを実現できる。特に中央処理装置の負担を軽減することができる。
【００２１】
請求項８に記載の発明は、音声電気機器が音声を電気信号に変換する段階と、該電気機器が音声信号を送信する段階と、中央処理手段が送信された音声信号の音声認識処理を行う段階と、処理結果に基づき該電気機器に所定の作動出力を送信する段階とより構成したネットワーク認識方法である。これによって、顧客は音声認識によるやさしい操作を経済的に実現できる。
【００２２】
請求項９に記載の発明は、電気機器が音声を電気信号に変換するステップと、該電気機器が音声信号を送信するステップと、中央処理手段が送信された音声信号の音声認識処理を行うステップと、処理結果に基づき該電気機器に所定の作動出力を送信するステップとより構成したネットワーク音声認識プログラムである。これによって、顧客は音声認識によるやさしい操作を経済的に実現できる。
【００２３】
【実施例】
以下本発明の実施例について、図面を参照しながら説明する。
【００２４】
（実施例１）
図２は本発明の第１の実施例におけるネットワーク音声認識システムの構成を示す接続図である。ある家庭内に設置される電気機器群には、いずれも音声を電気信号に変換する手段たるマイク１と、合成音声信号を音声に復元するスピーカ２と、ネットワークに接続する接続手段３が備えられている。
【００２５】
従って、例えばランドリー４の操作部にはマイク１とスピーカ２と他には表示窓が存在するだけで、従来のようにタイマーやキーボードはまったくない。なお、接続手段３は機器本体に内蔵されている。
【００２６】
電子レンジ５、炊飯器６、冷蔵庫７もまったく同様であり、操作部にはマイク１とスピーカ２と他には必要に応じて表示窓が存在するだけである。テレビ８も同様だが、テレビはもともと本来の機能としてスピーカ２を備えている。
【００２７】
さて、これらの機器と接続手段３を介して回線で接続された中央処理手段９がある。これはある家庭内のネットワークだけではなく、多数の家庭の電気機器群を回線を介して制御する。音声認識手段１０を内蔵している。
【００２８】
図１はかかる本発明の第１の実施例におけるネットワーク音声認識システムの構成を示すブロック図である。ランドリー４には前述のマイク１とスピーカ２、接続手段３以外に制御手段１１が設けられ、音声指令に基づきモータ及びヒータ１２を制御する。すなわち、マイク１から入力された操作指令を制御手段１１が接続手段３と回線を介して中央処理手段９へ送信する。かかる音声データは中央処理手段９に設けた接続手段１４を経由し、ＩＤ認識手段１５によりどのロケーション（家庭）のどのアドレス（電気機器）かを識別され、Ａという家庭のランドリーである旨を認識した後、音声認識手段１０により指令内容が分析される。音声認識は記憶手段１６に記録された音声パターンとの類似度を比較したり、前後の単語あるいは音韻から文章の推定が行われたりする。その認識処理の結果、機器がどのような動作を起こすべきかが再びランドリー４に送信される。この出力を受けて制御手段１１はモータ及びヒータ１２への通電を開始したり変更したり停止したりする。制御手段１１は必要に応じてスピーカ２より送信されてきた合成音声メッセージを出力する。
【００２９】
電子レンジ５以下も同様である。電子レンジ５では制御手段１１により制御されるのは、熱源たるマグネトロン１３である。もちろん、ランドリー４にしろ電子レンジ５にしろ、制御される対象はこれだけではない。これらは被制御主要ブロックである。中央処理手段９から出力されるのは、あらかじめ定められた被制御ブロックをダイレクトに制御するデータであってもいいし、ある動作モードをコード化したデータでも構わない。これらはいったん制御手段１１で解読され、当該の被制御ブロックをコントロールする。
【００３０】
かかる構成によりネットワーク音声認識システムは、特開昭５６−８８５０３号公報で示したような音声認識部と音声合成部を有する加熱機器のように、ネットワークを介しながらあたかも操作者と機器が対話をするかのごとく操作を進め、確実に機器の動作を実行することができる。もちろん、音声合成は本発明にとって必須事項ではない。
【００３１】
（実施例２）
図３は本発明の第２の実施例におけるネットワーク音声認識システムの構成を示す接続図である。ある家庭Ａ１７、ある家庭Ｂ１８など、それぞれの家庭がインターネットで中央処理手段９に接続されている。ある家庭Ａ１７内に設置される電気機器群には、図２と同様いずれも音声を電気信号に変換する手段たるマイクと、合成音声信号を音声に復元するスピーカを備えている。しかしながら、ネットワークに接続する接続手段は個々の電気機器は有しない。これらの機器はハブスイッチ１９に接続され、ハブスイッチ１９はルーターに繋がっている。すなわち中央処理手段９との接続手段を単一のルーターを共有している。家庭Ｂ１８内も同様の構成である。
【００３２】
かかる構成により現状ではまだ高価なモデムを共有することにより経済的に本発明を活用することができる。が、将来のネット家電の実現に備え、通信プロトコルをシンプル化し、８ビット程度のマイコンに搭載可能なインターネットへの接続を実行する安価なＯＳの開発が進められており、接続する電気機器より高価なモデムも数年後には笑い話になるであろう。要するに当面は図３に示す構成が現実的であるが、将来的には図１の構成が一般的になるであろう。
【００３３】
この選択はネットワークへの接続コスト（ハード及び利用料金などのソフト）と、音声認識手段のハードコストとの比較になる。前者が後者を上回るなら、この発明は現実には利用者がいなくなる。が、後述するが音声認識処理は高速の演算処理と大量の記憶手段、日進月歩の技術革新があり、本発明によれば、これらを中央処理手段で共用でき、最新技術への更新を図りながら、各電気機器内には熱源やモータなどの作動を制御するごく簡単な制御手段を設ければよい。場合によっては制御をすべて中央処理手段が行うことも可能である。ネットワークコストはどんどん下がり、音声処理技術は進化する、ということが本発明の前提である。
【００３４】
（実施例３）
図４はかかる本発明の第３の実施例におけるネットワーク音声認識システムの構成を示すブロック図である。図１に示す実施例１に対して、マイク１にはＡ／Ｄコンバータ２１が接続され、音声をデジタル信号に変換して接続手段３を介して制御手段１１は、回線経由で中央処理手段９に送信する。かかる構成により送信される音声信号はノイズに対して強くなる。すなわち、送信データをいくつかのフレームに分け、そのフレームごとに偶数あるいは奇数のパリティチェックを行えば、ノイズを除外することが容易となる。デジタル信号にすれば圧縮などの処理も行い易い。Ａ／Ｄコンバータは安価であり、信頼性を上げる上でも各電気機器にこれを個別に搭載する構成は実用面で有益である。
【００３５】
また本実施例では、中央処理手段９が音声認識処理結果に基づき当該電気機器にメッセージを伝える必要が生じた場合、記憶手段１６内より所定の音声を合成し、回線を介して当該電気機器に送信する構成である。さらに各電気機器はＤ／Ａコンバータ２２を備えている。中央処理部９で合成された音声は、アナログデータとして回線上に送信されて構わないが、デジタルデータで扱えばノイズに対して強く、また圧縮などの処理も行い易い。すなわち音声認識において音声指令をデジタル化するのと同じ効果で得られる。
【００３６】
（実施例４）
図５はかかる本発明の第４の実施例におけるネットワーク音声認識システムの音声認識処理の構成を示すブロック図である。
【００３７】
人が発する音声はきわめて曖昧であり、特に日本語においては単語間の独立性が乏しく、個人差や地域性（方言）など正確な認識を妨げる要因がすこぶる多い。そんな中、音声認識処理はパターン認識技術を駆使して実現される。あらかじめ認識する言葉を登録しておく特定話者認識と、不特定多数の人の音声を認識する不特定話者認識があり、当然、後者が技術的にはハードルは高いが実用性は高い。
【００３８】
また、音声認識には単語単位の認識、音韻単位の認識がある。単語単位の認識では音声をコンピュータが分析し、特徴抽出し、特徴量の時系列を作る。そして処理手段内の特徴時系列単語辞書と類似度を比較計算し、認識結果として出力する。
【００３９】
音韻単位の認識は入力音声を音素記号列に変換し、単語列に置き換える。これを構文解析し、文字列に変換する。さらに論理解析や意味解析し、文章を生成する。その音声が発せられた前後の状況や中央処理手段に接続された個別の機器を認識することから、ありうる指令、ありえない指令など簡単な言語理解にまで踏み込むこともできる。極めて難度が高く、従って高速大容量のコンピュータと専用のハードウエアのハイブリッドシステムが効率良く機能する。
【００４０】
本実施例では不特定話者のゆらぎに強いＨＭＭ　（隠れマルコフモデル、Ｈｉｄｄｅｎ　Ｍａｒｋｏｖ　Ｍｏｄｅｌ）による方法を採用する。ＨＭＭは近年広く用いられる手法であり、これに基づく音韻単位の音声認識を実現する。かかる技術は、例えば２０００年第２回ＩＰアワード、ＩＰ賞を受賞した「隠れマルコフモデルに基づく不特定話者音韻レベル音声認識・学習回路（中村一博他、奈良先端科学技術大学院大学）」に示される。
【００４１】
図５において、音声の電気信号（アナログ）はＡ／Ｄコンバータ２１によってデジタル信号に置換される。Ａ／Ｄコンバータ２１は各電気機器内においても中央処理手段内においてもどちらでも構わない。機器内に置いた時の効果は図４・実施例３ですでに説明した。
【００４２】
デジタル信号に変換された音声データは、コントローラ２３へ送られ、順次ＬＰＣメモリ２４に記憶される。と同時にＬＰＣ分析部２５に１フレーム分ずつ送り出され、線形予測分析（ＬＰＣ）手法を用いて音声の特徴が抽出される。すなわち特徴ベクトルが算出される。特徴ベクトルは音声の時系列データと自己相関関数に基づき、線形予測係数を算出することで得られる。求められた特徴ベクトルはＬＰＣメモリに記録される。
【００４３】
次いでＶＱ部２６において、ＬＰＣ分析により得られた特徴ベクトル系列が有限個のシンボルに変換される。ベクトル量子化（ＶＱ）である。各特徴ベクトルについて、符号帳（シンボル番号と代表ベクトルのペアを要素とする表）を探索し、距離が最小となる代表ベクトルのシンボルに写像する。符号帳は多くの学習用データにクラスタリング手法を適用して生成され、ＬＰＣメモリ２４に記憶される。
【００４４】
以上の処理に続いてＨＭＭ部２７で音声認識が実行される。ＨＭＭは音韻の確率モデル（確率オートマトン）であり、音韻ごとに１つのＨＭＭが構成され、ＨＭＭメモリ２８に確率パラメータが記憶されている。シンボル系列と最も高い確率を示す音韻が選択される。そしてこの認識処理の結果が出力される。
【００４５】
なお、本実施例では音声認識をハードウエアで行う構成としたが、その処理の大半は延々と繰り返される行列演算、確率演算処理である。中央処理手段に十分な処理能力があれば、ＬＰＣ分析部やＶＱ部、ＨＭＭ部などはプログラムによるソフト処理に置換できる。また、ハードとソフトのハイブリッドシステムとし、処理速度と経済性のバランスをとることもできる。
【００４６】
（実施例５）
図６は本発明の第５の実施例における中央処理手段をプログラム処理とした構成のプログラムのフローチャートである。
【００４７】
まず、デジタル化された音声データが入力される（ステップ１００）。この音声データの前あるいは後には機器のＩＤデータが付加されており、このＩＤデータを認識し、音声データが送られて来た機器を特定する（ステップ１０１）。ＩＤは当該電気機器が工場から出荷される際に唯一無二のものとして与えられており、ある家庭にその当該電気機器が設置された折、その家庭の個別情報が自動的にあるいは登録の手順を経て中央処理手段に記憶されている。かかるデータベースをアクセスすることで中央処理手段はアクセスした家庭と、その家庭におけるネット機器の構成を特定することができる（ステップ１０２）。
【００４８】
続いて音声認識処理が始まる。まず、音声データが１フレームごとに区切られ特徴抽出される（ステップ１０３）。そして特徴ベクトルの量子化が行われ（ステップ１０４）、そのデータが何に近いかの確率を音韻ごとに計算される（ステップ１０５）。そして確率が最大となるＨＭＭが求められる（ステップ１０６）。入力された音声を確定して音声認識処理は終了する（ステップ１０７）。この間の音声認識処理は図５の例でハード的に実行したのとまったく同等である。
【００４９】
さて、確定された音声を機器ごとに登録された音声リスト（テキストのテーブル）を検索して付き合わせが行われる（ステップ１０８）。これはその機器が受け付けられる指令かどうかを判断するものである。例えば、ＩＤからどのメーカのどの機種の電子レンジから音声データが入力されたかが特定されているから、この電子レンジが持っている機能、仕様に照らして検査が行われるのである。オーブンレンジなら「オーブンで１８０℃、３５分焼き上げ」という指令は受付可能であるが、ヒータ機能を有しない電子レンジであればこの指令は実行できない。また、電子レンジに「全自動で選択をすること」という指令が来ても対応できない。この音声リスト（テキストのテーブル）は新製品が発売されるごとに中央処理手段のメモリに記憶されてもいいし、当該のメーカが有するサーバにネットを介して中央処理手段がデータを検索にいってもよい。まずテーブルにある登録された音声かどうかが調べられる（ステップ１０９）。この検索で当該音声が見つからなければ、エラーメッセージが音声出力される（ステップ１１０）。エラーの内容に応じてきめこまかにメッセージを発することもできる。
【００５０】
登録された音声である場合には、次いで当該機器のステージ（動作状態）が確認される（ステップ１１１）。一切の指令は音声で中央処理手段に対して命じられるので、中央処理手段はその機器が今どんな状態かを特定できる。そして音声データは間違いがなくとも、今それを受け付けられるかが調べられる（ステップ１１２）。電子レンジが作動していない状態で「停止せよ」と命じられても、応じることはできない。また、直前に受けた命令を実行中も応じられない。
【００５１】
さらに、中央処理手段は当該家庭のネット機器の構成を把握しているので、ネットに接続された他の機器の動作もチェックできる（ステップ１１３）。例えば炊飯器が動作を始めたばかりで、炊きあがるのが４０分後なのに電子レンジに温め直しの指示が出た時、受け付けていいのかを音声出力して操作者に問いかけ、再度の指示を呼びかけることも可能である。
【００５２】
かかる処理の後、受付が完了した旨を知らせる音声が出力され（ステップ１１４）、当該機器を所定の動作をさせるための作動出力がテーブルから検索される（ステップ１１５）。例えば電子レンジは熱源としてのマグネトロンやモータ、ランプなどの負荷を有するが、これらの負荷のどれをオンしどれをオフするかの作動状態チャートをあらかじめテーブルとして記憶させておく。これらの状態を４桁の１６進データで扱う場合、０１Ａ６をマグネトロンをフル出力で作動させ、ターンテーブルモータは回転、ランプは消灯する、というようにあらかじめ機器側と中央処理手段で定めておけば、作動指令として出力される（ステップ１１６）。
【００５３】
以上のプログラムにより、音声認識によりネットに接続された電気機器を中央処理手段が制御することができる。
【００５４】
【発明の効果】
以上のように、本発明によれば、ネットワークに接続された複数の電気機器を、音声認識という人にやさしい操作系で操作でき、常に最新の音声認識技術を利用して高い認識率を維持しながら、経済的に利用できるシステムを実現することができる。
【図面の簡単な説明】
【図１】本発明の第１の実施例におけるネットワーク音声認識システムの構成を示すブロック図
【図２】同ネットワーク音声認識システムの構成を示す接続図
【図３】本発明の第２の実施例におけるネットワーク音声認識システムの構成を示す接続図
【図４】本発明の第３の実施例におけるネットワーク音声認識システムの構成を示すブロック図
【図５】本発明の第４の実施例におけるネットワーク音声認識システムの音声認識処理の構成を示すブロック図
【図６】本発明の第５の実施例における中央処理手段をプログラム処理とした構成のプログラムのフローチャート
[0001]
[Technical field of the invention]
The present invention relates to network technology for operating a plurality of electric devices by voice recognition. More specifically, it relates to a network voice recognition system, a network voice recognition method, and a network voice recognition program in which a central processing means performs centralized voice recognition on the operator's voice commands given to a plurality of devices connected to a line, and transmits a predetermined operation output to each electric device.
[0002]
[Prior art]
Japanese Patent Laid-Open Publication No. Sho 56-88503 discloses a technique involving voice recognition. In it, a heating device is provided with a voice recognition unit and a voice synthesis unit, and by switching sequentially between two voice recognition modes and one voice synthesis mode, operation proceeds as if the operator and the device were holding a dialogue. The aim is to prevent malfunctions caused by misrecognized speech.
[0003]
Features are extracted from the input voice and stored in memory. The recognition process compares the standard patterns stored in advance in ROM with the pattern stored in memory, and recognizes the one with the highest similarity as the voice command. The device then switches to the voice synthesis mode, and a voice synthesizer issues a predetermined voice message to the operator based on the recognized voice. When the operator hears this message and speaks the next input, the device shifts to the second voice recognition mode and determines whether a prescribed utterance has been input.
[0004]
With such a configuration, the system can advance control reliably. Malfunctions caused by misrecognition, such as starting heating that was not requested or arbitrarily interrupting heating that has not yet finished, can be prevented in advance.
[0005]
That application was filed by the inventor of the present invention. It recalls the technical background of a time when even speaker-dependent voice recognition (a method in which the words of the target speaker are registered in advance) could hardly achieve perfect recognition.
[0006]
In recent years, with the remarkable progress of semiconductor technology and the development of software technology and network technology, information devices such as personal computers and mobile phones have become widespread in homes. Also, electrical appliances called Internet home appliances are gradually appearing on the market.
[0007]
Under such circumstances, voice recognition technology, which has long attracted attention, is finally entering the practical stage as a technology for good communication between people and machines. In particular, as memory has become inexpensive, CPU processing speeds have risen dramatically, and voice recognition technology has evolved, it has become possible to perform speaker-independent voice recognition (a method that targets an unspecified large number of speakers and needs no word registration in advance) with a high recognition rate, something considered extremely difficult a decade or so ago. Personal computers equipped with a microphone in place of a keyboard can now be said to have reached a practical level. PCs shipped from the factory with such voice recognition software pre-installed are no longer rare.
[0008]
However, a heating device such as that disclosed in JP-A-56-88503 has still not been commercialized, even though voice recognition technology is spreading rapidly in the PC world. This is due entirely to the economics of voice recognition systems. In a PC, a fast CPU and a large memory themselves raise the value of the machine, so voice operation can be introduced without pushing the price up much. Moreover, even if part of the input is misrecognized, it can simply be entered again; this is no worse than a mistyped key. In a heating device, however, misrecognition can lead to excessive heating or to combustion inside the cooking chamber. When a voice recognition system is built into a heating device, recognition accuracy matters even more, and raising it weighs still more heavily on system cost.
[0009]
Japanese Patent Application Laid-Open No. 2000-111057 is cited as related art. It discloses a data processing device for kitchen appliances. A kitchen appliance, for example a microwave oven, is provided with a microphone that generates digital data in response to the operator's voice, and with voice recognition means for extracting a digital signal from the microphone. What is spoken into the microphone is the food items consumed in the appliance, and the main body has storage means that stores the digital signals in order to generate a list of processed items. The digital data is also transferred to a remote management system.
[0010]
With this arrangement, when the food items consumed in the appliance are spoken into the microphone, voice recognition software loaded in the appliance body identifies them, and the result is used to generate an e-mail message that is forwarded over the Internet. The forwarded message is used for replenishing and delivering the items.
[0011]
In this configuration too, the kitchen appliance contains the voice recognition means, which undeniably pushes up the cost of the appliance. Moreover, what voice recognition handles here is merely information, namely food item names; a misrecognition only means that the replenishment order is placed incorrectly. The disclosure of Japanese Patent Application Laid-Open No. 2000-111057 does not include any function for starting or stopping the appliance itself.
[0012]
[Problems to be solved by the invention]
The present invention aims to realize a system that lets a plurality of electric devices connected to a network be operated through voice recognition, a human-friendly operation system, and that can be used economically while always using the latest voice recognition technology and maintaining a high recognition rate.
[0013]
[Means for Solving the Problems]
In order to solve the above conventional problems, the network voice recognition system of the present invention has a plurality of electric devices, each with an individual ID and provided with means for converting voice into an electrical signal, and a central processing means connected to this group of devices by a line. The central processing means performs voice recognition processing on the transmitted voice signal and, based on the result, transmits a predetermined operation output to the electric device concerned. As a result, the costly voice recognition means needed for a high recognition rate can be placed in the central processing means, which recognizes the voice data sent one after another from the electric devices connected by the line and returns operation signals based on the recognition results.
[0014]
[Embodiments of the invention]
The invention according to claim 1 is a system having an electric device provided with means for converting voice into an electrical signal, and a central processing means connected to the device by a line, in which the central processing means performs voice recognition processing on the transmitted voice signal and transmits a predetermined operation output to the electric device based on the processing result. This allows the customer to obtain easy voice-recognition operation economically.
[0015]
The invention according to claim 2 is the network voice recognition system of claim 1 provided with a plurality of electric devices. Because the individual devices do not each need their own voice recognition means, easy voice-recognition operation becomes even more economical. In addition, all the devices can be operated through the same operation system.
[0016]
The invention according to claim 3 is the network voice recognition system of claim 1 in which each electric device has an individual ID. However many electric devices are connected to the central processing means, it can reliably identify each device and perform appropriate recognition. Furthermore, if the device name, manufacturer, model name, and the like can be identified from the ID, the central processing means can keep track of the group of devices connected to the network in a given home, and can send appropriate operation outputs to multiple devices in response to a single voice input.
[0017]
The invention according to claim 4 is the network voice recognition system of claim 1 in which the electric device further has an A/D converter, converts the voice into a digital signal, and transmits it to the central processing means via the line. With this configuration, the signal is resistant to noise, multiplex communication is easy, and communication time is easily shortened by compression or the like.
[0018]
The invention according to claim 5 is the network voice recognition system of claim 1 in which the voice recognition performed by the central processing means is based on linear prediction analysis (LPC), vector quantization (VQ), and the hidden Markov model (HMM). LPC extracts features efficiently, and VQ converts the resulting feature vectors into a finite set of symbols. These symbols are used to build hidden Markov models, that is, probabilistic models of phonemes (probabilistic automata), so that recognition can be performed phoneme by phoneme.
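As a rough illustration of the first two stages named above, the following Python sketch computes linear prediction coefficients for one frame and maps a feature vector to a codebook symbol. The autocorrelation method with the Levinson-Durbin recursion, the squared Euclidean distance, the function names, and the toy codebook are all assumptions made for illustration; this paragraph of the source does not specify them.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], where r[k] = sum over n of frame[n] * frame[n + k]."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion (assumed method, not stated in the source).

    Returns a[1..order] such that frame[n] is approximated by
    the sum over k of a[k] * frame[n - k].
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # silent or degenerate frame: remaining coefficients stay zero
        reflection = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = reflection
        for j in range(1, i):
            updated[j] = a[j] - reflection * a[i - j]
        a = updated
        error *= 1.0 - reflection * reflection
    return a[1:]


def quantize(vector, codebook):
    """Map a feature vector to the symbol of the nearest representative vector."""
    def squared_distance(symbol):
        return sum((v - c) ** 2 for v, c in zip(vector, codebook[symbol]))
    return min(codebook, key=squared_distance)


if __name__ == "__main__":
    frame = [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
    features = lpc_coefficients(frame, order=2)
    toy_codebook = {0: [0.0, 0.0], 1: [0.5, 0.0], 2: [1.0, -0.5]}  # hypothetical
    print(features, quantize(features, toy_codebook))
```

In a real recognizer the codebook would be trained from a large amount of speech data, and the resulting symbol sequence would be passed to the phoneme models.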
[0019]
The invention according to claim 6 is the network voice recognition system of claim 1 in which, when the central processing means needs to convey a message to an electric device based on the voice recognition result, it synthesizes a predetermined voice from memory and transmits it via the line. Device control can thus proceed in the manner of a dialogue.
[0020]
The invention according to claim 7 is the network voice recognition system of claim 1 in which the central processing means has hardware that performs voice recognition and a program that performs software processing. Rather than handling everything in software, the voice recognition processing is divided between hardware and software in a balanced way, which yields an economical system. In particular, the load on the central processing unit can be reduced.
[0021]
The invention according to claim 8 is a network recognition method comprising a step in which an electric device converts voice into an electrical signal, a step in which the electric device transmits the voice signal, a step in which a central processing means performs voice recognition processing on the transmitted voice signal, and a step of transmitting a predetermined operation output to the electric device based on the processing result. This allows the customer to obtain easy voice-recognition operation economically.
[0022]
The invention according to claim 9 is a network voice recognition program comprising a step in which an electric device converts voice into an electrical signal, a step in which the electric device transmits the voice signal, a step in which a central processing means performs voice recognition processing on the transmitted voice signal, and a step of transmitting a predetermined operation output to the electric device based on the processing result. This allows the customer to obtain easy voice-recognition operation economically.
[0023]
[Examples]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0024]
(Example 1)
FIG. 2 is a connection diagram showing the configuration of the network voice recognition system according to the first embodiment of the present invention. Each of the electric devices installed in a home has a microphone 1 as the means for converting voice into an electrical signal, a speaker 2 for turning a synthesized voice signal back into sound, and a connection means 3 for connecting to the network.
[0025]
Therefore, for example, the operation panel of the laundry 4 has only the microphone 1, the speaker 2, and a display window; unlike conventional machines, it has no timer or keyboard at all. The connection means 3 is built into the device body.
[0026]
The microwave oven 5, the rice cooker 6, and the refrigerator 7 are exactly the same: their operation panels have only the microphone 1, the speaker 2, and, where needed, a display window. The same applies to the television 8, although a television already has the speaker 2 as part of its original function.
[0027]
These devices are connected by a line, via the connection means 3, to a central processing means 9. It controls not just the network in one home but the electric devices of many homes via the line, and it contains a voice recognition means 10.
[0028]
FIG. 1 is a block diagram showing the configuration of the network voice recognition system according to the first embodiment of the present invention. In addition to the microphone 1, the speaker 2, and the connection means 3 described above, the laundry 4 has a control means 11 that controls the motor and heater 12 based on voice commands. Specifically, the control means 11 transmits the operation command input from the microphone 1 to the central processing means 9 via the connection means 3 and the line. The voice data passes through a connection means 14 provided in the central processing means 9, and an ID recognition means 15 identifies which location (home) and which address (electric device) it came from. Once it has been recognized as coming from, say, the laundry of home A, the voice recognition means 10 analyzes the content of the command. Voice recognition compares the input against the voice patterns recorded in a storage means 16 for similarity, and estimates the sentence from the preceding and following words or phonemes. As a result of this recognition, the operation the device should perform is transmitted back to the laundry 4. On receiving this output, the control means 11 starts, changes, or stops the supply of power to the motor and heater 12. Where necessary, the control means 11 also outputs through the speaker 2 a synthesized voice message that has been sent to it.
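The round trip described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the registry contents, the ID strings, the operation names, and the `recognize` callback that stands in for the voice recognition means are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceRecord:
    home: str  # location, e.g. home "A"
    kind: str  # appliance type, e.g. "laundry"


# Hypothetical registry held by the central processing means (ID -> home and device).
REGISTRY = {
    "ID-0001": DeviceRecord(home="A", kind="laundry"),
    "ID-0002": DeviceRecord(home="A", kind="microwave"),
}

# Hypothetical table from (device kind, recognized command) to an operation output.
OPERATIONS = {
    ("laundry", "start"): "MOTOR_ON",
    ("laundry", "stop"): "MOTOR_OFF",
    ("microwave", "start"): "MAGNETRON_ON",
    ("microwave", "stop"): "MAGNETRON_OFF",
}


def handle_voice_data(device_id, audio, recognize):
    """Identify the sender, recognize the command, and build the reply."""
    record = REGISTRY.get(device_id)
    if record is None:
        return {"status": "unknown_device"}
    command = recognize(audio)  # stand-in for the voice recognition means
    operation = OPERATIONS.get((record.kind, command))
    if operation is None:
        # A reply like this could be turned into a synthesized voice message.
        return {"status": "rejected", "home": record.home, "kind": record.kind,
                "message": "Command not recognized. Please repeat."}
    return {"status": "ok", "home": record.home, "kind": record.kind,
            "operation": operation}


if __name__ == "__main__":
    fake_recognizer = lambda audio: "start"  # placeholder, performs no real recognition
    print(handle_voice_data("ID-0001", b"\x00\x01", fake_recognizer))
```

Keeping the recognizer behind a callback mirrors the idea that the recognition technique can be updated centrally without touching the devices.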
[0029]
The same applies to the microwave oven 5 and the other devices. In the microwave oven 5, what the control means 11 controls is the magnetron 13, the heat source. Of course, in both the laundry 4 and the microwave oven 5 these are not the only things controlled; they are the principal controlled blocks. The output from the central processing means 9 may be data that directly controls a predetermined controlled block, or data that encodes a certain operation mode. Either is first decoded by the control means 11, which then controls the relevant controlled block.
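The two kinds of output mentioned above, direct block control and a coded operation mode, might be decoded on the device side along the following lines. The message layout, the mode names, and the block names are hypothetical; the source only says that both kinds of data are possible and that the control means decodes them.

```python
# Hypothetical controlled blocks of a microwave oven.
KNOWN_BLOCKS = {"magnetron", "lamp"}

# Hypothetical mapping from a coded operation mode to block states.
MODE_TABLE = {
    "HEAT": {"magnetron": True, "lamp": True},
    "IDLE": {"magnetron": False, "lamp": False},
}


def apply_output(state, message):
    """Return the new block states after decoding one output message."""
    if message["type"] == "direct":
        changes = message["blocks"]            # data that controls blocks directly
    elif message["type"] == "mode":
        changes = MODE_TABLE[message["mode"]]  # coded operation mode
    else:
        raise ValueError("unknown message type: %r" % message["type"])
    unknown = set(changes) - KNOWN_BLOCKS
    if unknown:
        raise ValueError("unknown blocks: %s" % sorted(unknown))
    new_state = dict(state)
    new_state.update(changes)
    return new_state


if __name__ == "__main__":
    state = {"magnetron": False, "lamp": False}
    state = apply_output(state, {"type": "mode", "mode": "HEAT"})
    state = apply_output(state, {"type": "direct", "blocks": {"lamp": False}})
    print(state)
```

Rejecting unknown blocks is a design choice of this sketch, so that a malformed message cannot switch a load the device does not have.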
[0030]
With this configuration, the network voice recognition system can, like the heating device with a voice recognition unit and a voice synthesis unit disclosed in Japanese Patent Application Laid-Open No. Sho 56-88503, proceed with operation over the network as if the operator and the device were holding a dialogue, and can carry out device operations reliably. Of course, voice synthesis is not essential to the present invention.
[0031]
(Example 2)
FIG. 3 is a connection diagram showing the configuration of the network voice recognition system according to the second embodiment of the present invention. Each home, such as home A 17 and home B 18, is connected to the central processing means 9 via the Internet. As in FIG. 2, each of the electric devices installed in home A 17 has a microphone as the means for converting voice into an electrical signal and a speaker for turning a synthesized voice signal back into sound. However, the individual electric devices do not have connection means for connecting to the network. Instead, they are connected to a hub switch 19, which is connected to a router. In other words, they share a single router as the connection means to the central processing means 9. Home B 18 has the same configuration.
[0032]
With this configuration, the present invention can be used economically by sharing a modem, which is still expensive at present. However, in preparation for networked home appliances, work is under way to simplify communication protocols and to develop inexpensive operating systems that handle Internet connection and fit on microcontrollers of about 8 bits. Within a few years, a modem that costs more than the appliance it connects will be a laughing matter. In short, the configuration of FIG. 3 is the realistic one for the time being, but the configuration of FIG. 1 will become common in the future.
[0033]
The choice comes down to comparing the cost of connecting to the network (hardware, plus soft costs such as usage fees) with the hardware cost of voice recognition means. If the former exceeds the latter, this invention will in practice have no users. However, as described later, voice recognition requires high-speed arithmetic and large amounts of storage, and it is subject to rapid technical progress. According to the present invention, these resources can be shared in the central processing means and kept up to date with the latest technology, while each electric device needs only a very simple control means for operating its heat source, motor, and so on. In some cases, the central processing means can perform all of the control. The present invention is premised on network costs continuing to fall and voice processing technology continuing to evolve.
[0034]
(Example 3)
FIG. 4 is a block diagram showing the configuration of the network voice recognition system according to the third embodiment of the present invention. In contrast to the first embodiment shown in FIG. 1, an A/D converter 21 is connected to the microphone 1; the voice is converted to a digital signal, and the control means 11 transmits it via the connection means 3 and the line to the central processing means 9. A voice signal transmitted in this way is resistant to noise. That is, if the transmission data is divided into frames and an even or odd parity check is performed on each frame, it becomes easy to reject data corrupted by noise. A digital signal is also easy to compress. A/D converters are inexpensive, so mounting one in each electric device is practical and also helps to raise reliability.
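The per-frame parity check mentioned above can be sketched as follows. The frame size, the choice to append the parity bit as a whole trailing byte, and the function names are assumptions made to keep the example short; the source only says that the data is split into frames and checked with even or odd parity.

```python
def count_ones(data):
    """Number of 1 bits in a bytes object."""
    return sum(bin(byte).count("1") for byte in data)


def add_parity(frame, even=True):
    """Append one parity byte (0 or 1) so the total number of 1 bits is even (or odd)."""
    ones = count_ones(frame)
    parity = ones % 2 if even else (ones + 1) % 2
    return frame + bytes([parity])


def check_parity(framed, even=True):
    """Return True if the frame, including its parity byte, has the expected parity."""
    return count_ones(framed) % 2 == (0 if even else 1)


def split_frames(data, size):
    """Split the digitized voice data into frames of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


if __name__ == "__main__":
    samples = bytes([0x12, 0x34, 0x56, 0x78, 0x9A])  # stand-in for A/D output
    sent = [add_parity(frame) for frame in split_frames(samples, size=2)]
    corrupted = bytes([sent[0][0] ^ 0x01]) + sent[0][1:]  # flip one bit in transit
    print(check_parity(sent[0]), check_parity(corrupted))
```

A single parity bit only detects an odd number of flipped bits in a frame and cannot correct them, so a receiver built this way would discard or re-request a failing frame.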
[0035]
Also in this embodiment, when the central processing means 9 needs to convey a message to the electric device concerned based on the voice recognition result, it synthesizes a predetermined voice from the storage means 16 and transmits it to that device via the line. Each electric device also has a D/A converter 22. The voice synthesized by the central processing means 9 may be sent over the line as analog data, but handling it as digital data makes it resistant to noise and easy to compress. In other words, the same benefits are obtained as from digitizing voice commands for voice recognition.
[0036]
(Example 4)
FIG. 5 is a block diagram showing a configuration of the voice recognition processing of the network voice recognition system according to the fourth embodiment of the present invention.
[0037]
Speech uttered by humans is extremely ambiguous. In Japanese in particular, words are poorly separated from one another, and many factors hinder accurate recognition, such as individual differences and regional characteristics (dialects). Voice recognition processing is realized by making full use of pattern recognition technology. There are two kinds: specific-speaker recognition, in which the words to be recognized are registered in advance, and unspecified-speaker recognition, in which the voices of an unspecified number of people are recognized. Naturally, the latter is technically more demanding, but it is the more practical.
[0038]
Speech recognition includes recognition in word units and recognition in phoneme units. In word-unit recognition, a computer analyzes the speech, extracts its features, and creates a time series of feature quantities. It then compares this with a word dictionary of feature time series held in the processing means, calculates the similarity, and outputs the best match as the recognition result.
[0039]
Recognition in phoneme units converts the input speech into a phoneme symbol string and replaces it with a word string. This is parsed and converted into a character string. Sentence analysis and semantic analysis are then performed to generate sentences. By taking into account the situation before and after the utterance and the individual devices connected to the central processing means, the system can reach a simple level of language understanding, such as distinguishing commands that are possible from commands that are impossible. This processing is extremely demanding, so a hybrid system combining a high-speed, large-capacity computer with dedicated hardware works efficiently.
[0040]
In this embodiment, a method based on the HMM (Hidden Markov Model), which is robust against the variation among unspecified speakers, is adopted. The HMM has been widely used in recent years, and speech recognition in phoneme units is realized on the basis of this technique. Such a technology is described in, for example, "The Unspecified Speaker Phoneme Level Speech Recognition / Learning Circuit Based on Hidden Markov Model" (Kazuhiro Nakamura et al., Nara Institute of Science and Technology), which won the IP Award of the 2nd IP Award in 2000.
[0041]
In FIG. 5, the voice electrical signal (analog) is converted into a digital signal by the A/D converter 21. The A/D converter 21 may be provided in each electric device or in the central processing means. The effect of placing it in the device has already been described for the third embodiment.
[0042]
The voice data converted into a digital signal is sent to the controller 23 and stored sequentially in the LPC memory 24. At the same time, the data is sent to the LPC analysis unit 25 one frame at a time, and the speech features are extracted by linear prediction analysis (LPC); that is, a feature vector is calculated. The feature vector is obtained by calculating linear prediction coefficients from the time-series data of the voice and its autocorrelation function. The obtained feature vector is recorded in the LPC memory 24.
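For illustration only, the calculation of linear prediction coefficients from the autocorrelation function of one frame might be sketched as follows. The specification does not say how the coefficients are solved for; the Levinson-Durbin recursion used here is a common choice and is an assumption of this example, as are the function names, the prediction order, and the synthetic test frame.

```python
def autocorrelation(frame, order):
    """Autocorrelation r[0..order] of one frame of samples."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(order + 1)
    ]


def levinson_durbin(r, order):
    """Solve the LPC normal equations by the Levinson-Durbin recursion.

    Returns (a, err), where a[k-1] is the coefficient of x[n-k] in the
    prediction x[n] ~ sum_k a[k-1] * x[n-k], and err is the residual energy.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            # Silent or perfectly predicted frame: stop to avoid dividing by zero.
            break
        # Reflection coefficient for this order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


# Synthetic frame in which each sample is half the previous one, so the
# first coefficient should come out close to 0.5 and the second close to 0.
frame = [0.5 ** n for n in range(64)]
coefficients, residual = levinson_durbin(autocorrelation(frame, 2), 2)
```

In a practical system the frame would normally be windowed before analysis and a higher order would be used; those details are outside what the specification states.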
[0043]
Next, in the VQ unit 26, the feature vector sequence obtained by the LPC analysis is converted into a sequence drawn from a finite number of symbols. This is vector quantization (VQ). For each feature vector, a codebook (a table whose elements are pairs of a symbol number and a representative vector) is searched, and the vector is mapped to the symbol of the representative vector at the shortest distance. The codebook is generated by applying a clustering method to a large amount of learning data, and is stored in the LPC memory 24.
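The codebook search described above can be illustrated by the following minimal sketch. The three-entry codebook, the two-dimensional feature vectors, and the use of squared Euclidean distance are assumptions chosen to keep the example small; an actual codebook would be produced by clustering learning data as stated above.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(vectors, codebook):
    """Map each feature vector to the symbol number of its nearest representative vector."""
    symbols = []
    for v in vectors:
        symbol, _ = min(
            codebook.items(),
            key=lambda item: squared_distance(v, item[1]),
        )
        symbols.append(symbol)
    return symbols


# Hypothetical codebook: symbol number -> representative vector.
codebook = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0)}

# Hypothetical feature vectors, one per frame.
features = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (1.1, 0.1)]

symbols = quantize(features, codebook)  # [0, 1, 2, 1]
```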
[0044]
Following the above processing, speech recognition is performed by the HMM unit 27. The HMM is a probability model (probabilistic automaton) of a phoneme. One HMM is configured for each phoneme, and its probability parameters are stored in the HMM memory 28. The phoneme whose HMM gives the highest probability for the symbol sequence is selected, and the result of this recognition processing is output.
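As a hedged illustration of selecting the phoneme whose HMM gives the highest probability for a symbol sequence, the sketch below scores a sequence against two toy discrete HMMs using the forward algorithm. The phoneme names, the two-state left-to-right topology, and every probability value are invented for this example; the specification does not give model parameters or name the scoring algorithm.

```python
def forward_probability(symbols, start, trans, emit):
    """P(symbols | model) for a discrete HMM, computed by the forward algorithm."""
    n_states = len(start)
    alpha = [start[s] * emit[s][symbols[0]] for s in range(n_states)]
    for sym in symbols[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][sym]
            for s in range(n_states)
        ]
    return sum(alpha)


def recognize(symbols, models):
    """Return the phoneme whose model gives the highest probability, plus all scores."""
    scores = {
        name: forward_probability(symbols, *params)
        for name, params in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical models: (start probabilities, transition matrix, emission matrix).
# Rows of the emission matrix are states; columns are VQ symbols 0, 1, 2.
LEFT_TO_RIGHT = [[0.6, 0.4], [0.0, 1.0]]
models = {
    "a": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]),
    "i": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]),
}

best, scores = recognize([0, 0, 1], models)  # best is "a"
```

For long symbol sequences the products become very small, so implementations usually work with logarithms or scaled values; that refinement is omitted here.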
[0045]
In this embodiment, the voice recognition is performed by hardware, but most of the processing consists of matrix calculations and probability calculations repeated over and over. If the central processing means has sufficient processing capability, the LPC analysis unit, the VQ unit, the HMM unit, and so on can be replaced by software processing by a program. A hybrid of hardware and software can also be used to balance processing speed and economy.
[0046]
(Example 5)
FIG. 6 is a flowchart of a program having a configuration in which the central processing unit according to the fifth embodiment of the present invention performs program processing.
[0047]
First, digitized voice data is input (step 100). The ID data of the device is attached before or after the voice data. The ID data is recognized, and the device from which the voice data was sent is identified (step 101). The ID is assigned as a unique ID when the electric device is shipped from the factory. When the electric device is installed in a home, the individual information of that home is registered, automatically or manually, and stored in the central processing means. By accessing this database, the central processing means can identify the home from which the access came and the configuration of the network devices in that home (step 102).
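Steps 100 to 102 might be sketched as follows, purely as an illustration. The 4-byte ID placed in front of the voice data, the ID values, and the layout of the registry are all hypothetical; the specification only states that a unique ID accompanies the voice data and that home information is stored in the central processing means.

```python
ID_LENGTH = 4  # assumed length of the device ID placed before the voice data

# Hypothetical registry held by the central processing means:
# device ID -> home and device type.
REGISTRY = {
    b"\x00\x00\x10\x01": {"home": "home-A", "type": "microwave oven"},
    b"\x00\x00\x10\x02": {"home": "home-A", "type": "rice cooker"},
    b"\x00\x00\x20\x01": {"home": "home-B", "type": "microwave oven"},
}


def split_packet(packet: bytes):
    """Separate the device ID from the voice data (step 101)."""
    if len(packet) < ID_LENGTH:
        raise ValueError("packet is too short to contain a device ID")
    return packet[:ID_LENGTH], packet[ID_LENGTH:]


def identify(packet: bytes):
    """Return the sending device, the device types in its home, and the voice data (step 102)."""
    device_id, voice = split_packet(packet)
    device = REGISTRY.get(device_id)
    if device is None:
        return None, [], voice
    home_devices = sorted(
        entry["type"]
        for entry in REGISTRY.values()
        if entry["home"] == device["home"]
    )
    return device, home_devices, voice


device, home_devices, voice = identify(b"\x00\x00\x10\x01" + b"voice-bytes")
```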
[0048]
Subsequently, the voice recognition process starts. First, the voice data is divided into frames and features are extracted (step 103). The feature vectors are then quantized (step 104), and the probability of a match is calculated for each phoneme (step 105). The HMM with the highest probability is then found (step 106). The input voice is thereby determined, and the voice recognition process ends (step 107). The voice recognition process in these steps is exactly the same as that executed in hardware in the fourth embodiment.
[0049]
The determined voice is looked up in a voice list (text table) registered for each device, and matching is performed (step 108). This is to determine whether or not the command is one that the device accepts. For example, since the ID identifies which manufacturer's model of microwave oven the voice data came from, the check is made against the functions and specifications of that microwave oven. A microwave oven with an oven function can accept a command such as "bake in the oven at 180 °C for 35 minutes", but a microwave oven without a heater function cannot execute this command. Likewise, even if the microwave oven receives a command to "select automatically", it cannot act on it. This voice list (text table) may be stored in the memory of the central processing unit every time a new product is released, or the central processing unit may retrieve the data over the network from a server owned by the manufacturer. First, it is checked whether the voice is registered in the table (step 109). If the voice is not found in this search, an error message is output as voice (step 110). A detailed message can be issued according to the nature of the error.
[0050]
If the voice is a registered voice, the stage (operating state) of the device is checked (step 111). Because every command is given by voice through the central processing means, the central processing means can determine what state the device is in at present. Then, even if the voice data itself contains no mistake, it is checked whether the command can be accepted now (step 112). If "stop" is commanded while the microwave oven is not operating, the command cannot be acted on. Nor can a command be acted on while the instruction received immediately before is still being executed.
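The two checks in steps 109 and 112 could be expressed, for illustration, as two table lookups. The model names, command words, device states, and error texts below are hypothetical; only the order of the checks (registered command first, then current operating state) follows the description above.

```python
# Hypothetical voice lists (text tables) registered per device model.
VOICE_LIST = {
    "oven_range": {"bake", "reheat", "stop"},
    "plain_microwave": {"reheat", "stop"},
}

# Hypothetical table of which commands can be accepted in which operating state.
ACCEPTED_IN_STATE = {
    "idle": {"bake", "reheat"},
    "running": {"stop"},
}


def validate(model: str, state: str, command: str) -> str:
    """Check that a command is registered (step 109) and acceptable now (step 112)."""
    if command not in VOICE_LIST.get(model, set()):
        return "error: command is not registered for this device"
    if command not in ACCEPTED_IN_STATE.get(state, set()):
        return "error: command cannot be accepted in the current state"
    return "accepted"


# A microwave oven without a heater cannot accept "bake".
no_heater = validate("plain_microwave", "idle", "bake")

# "stop" cannot be accepted while the device is not operating.
not_running = validate("oven_range", "idle", "stop")

ok = validate("oven_range", "idle", "bake")
```

The returned error text stands in for the voice message of step 110; how the message is worded or synthesized is not specified.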
[0051]
Further, since the central processing unit knows the configuration of the home's network devices, the operation of the other devices connected to the network can also be checked (step 113). For example, when the rice cooker has just started and the rice will be done in 40 minutes, and the microwave oven is instructed to reheat rice, it is also possible to ask the operator by voice whether the command should really be accepted and to prompt for the instruction again.
[0052]
After such processing, a voice is output to notify that the command has been accepted (step 114), and the operation output that causes the device to perform the predetermined operation is retrieved from a table (step 115). For example, a microwave oven has loads such as a magnetron serving as the heat source, a motor, and a lamp. An operation state chart showing which of these loads is turned on and which is turned off is stored in advance as a table. When these states are handled as four-digit hexadecimal data, the device side and the central processing means must agree in advance that, for example, 01A6 means the magnetron operates at full output, the turntable motor rotates, and the lamp is off. This data is output as the operation command (step 116).
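A minimal sketch of the table lookup in steps 115 and 116 follows. Only the code 01A6 and its meaning come from the description above; the code 0000, the mapping from command words to codes, and all names are hypothetical. The specification does not state how individual digits or bits of the code map to loads, so the chart is treated here as an opaque lookup table shared by both sides.

```python
import string

# Operation state chart agreed in advance by the device side and the
# central processing means: four-digit hexadecimal code -> load states.
OPERATION_CHART = {
    "01A6": {"magnetron": "full", "turntable_motor": "on", "lamp": "off"},  # from the text
    "0000": {"magnetron": "off", "turntable_motor": "off", "lamp": "off"},  # hypothetical
}

# Hypothetical mapping from an accepted command to its operation code.
COMMAND_TO_CODE = {"reheat": "01A6", "stop": "0000"}


def operation_output(command: str) -> str:
    """Central side: retrieve the operation code for an accepted command (step 115)."""
    return COMMAND_TO_CODE[command]


def apply_output(code: str) -> dict:
    """Device side: check the format of the code and look up the load states."""
    if len(code) != 4 or any(c not in string.hexdigits for c in code):
        raise ValueError("operation output must be four hexadecimal digits")
    return OPERATION_CHART[code.upper()]


code = operation_output("reheat")  # "01A6"
loads = apply_output(code)
```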
[0053]
With the above program, the central processing means can control the electric devices connected to the network by voice recognition.
[0054]
[Effect of the invention]
As described above, according to the present invention, a plurality of electric devices connected to a network can be operated through voice recognition, a human-friendly method of operation, and a high recognition rate is always maintained by using the latest voice recognition technology, while a system that can be used economically is realized.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a network speech recognition system according to a first embodiment of the present invention.
FIG. 2 is a connection diagram showing a configuration of the network speech recognition system.
FIG. 3 is a connection diagram showing a configuration of a network speech recognition system according to a second embodiment of the present invention.
FIG. 4 is a block diagram showing a configuration of a network speech recognition system according to a third embodiment of the present invention.
FIG. 5 is a block diagram showing a configuration of a voice recognition process of a network voice recognition system according to a fourth embodiment of the present invention.
FIG. 6 is a flowchart of a program having a configuration in which a central processing unit according to a fifth embodiment of the present invention performs program processing.

Claims

A network voice recognition system comprising an electric device having means for converting voice into an electrical signal, and a central processing means connected to the electric device by a line, wherein the central processing means performs voice recognition processing on the transmitted voice signal and transmits a predetermined operation output to the electric device based on the processing result.

The network voice recognition system according to claim 1, comprising a plurality of the electric devices.

The network voice recognition system according to claim 1, wherein the electric device has an individual ID.

The network voice recognition system according to claim 1, wherein the electric device having the means for converting voice into an electrical signal further includes an A/D converter, converts the voice into a digital signal, and transmits the digital signal to the central processing means via the line.

The network speech recognition system according to claim 1, wherein the speech recognition processing performed by the central processing means is a speech recognition processing based on linear prediction analysis (LPC), vector quantization (VQ), and a Hidden Markov Model (HMM).

The network voice recognition system according to claim 1, wherein, when a message needs to be transmitted to the electric device based on the voice recognition processing result, the central processing means synthesizes a predetermined voice from a memory and transmits the synthesized voice via the line.

The network speech recognition system according to claim 1, wherein the central processing means has hardware for performing speech recognition and a program for performing software processing.

A network voice recognition method comprising: a step in which an electric device converts voice into an electrical signal; a step in which the electric device transmits the voice signal; a step in which a central processing means performs voice recognition processing on the transmitted voice signal; and a step of transmitting a predetermined operation output to the electric device.

A network voice recognition program comprising: a step in which an electric device converts voice into an electrical signal; a step in which the electric device transmits the voice signal; a step in which a central processing means performs voice recognition processing on the transmitted voice signal; and a step of transmitting a predetermined operation output to the electric device.