JP3862169B2

JP3862169B2 - Speech recognition service mediation system and speech recognition master reference method used therefor

Info

Publication number: JP3862169B2
Application number: JP2002353950A
Authority: JP
Inventors: 丹一安藤; 雄治草野; 孝信清水
Original assignee: Omron Corp
Current assignee: Omron Corp
Priority date: 2002-12-05
Filing date: 2002-12-05
Publication date: 2006-12-27
Anticipated expiration: 2022-12-05
Also published as: JP2004184858A

Description

【０００１】
【発明の属する技術分野】
本発明は音声認識サービス仲介システムと、それに用いる音声認識マスター参照方法、プログラム、およびそのプログラムを内蔵する記憶媒体に関する。すなわち本発明に係る音声認識サービス仲介システムにおいては、音声認識機能付きの複数のローカル端末と音声認識センターがネットワークで接続されており、ローカル端末で処理できない音声入力がされるとその入力音声は、より処理能力の高い音声認識センターで音声認識処理が行なわれる分散型音声認識システムに関する。
【０００２】
【従来の技術】
従来、音声認識技術の向上と共に多くの端末ではキーボード等の入力装置を用いずに音声による入力手段が広く用いられている。この音声入力手段を用いるとキーボードやマウスによる操作と異なり、いわゆるハンズフリーの状態で操作が出来るため、他の手操作を要する動作中でも情報機器に対して容易に音声で意志伝達が可能となる。そのため音声入力手段は、例えば自動車運転等の両手を必要とする動作中に情報機器を操作する場合には特に有効となる。
【０００３】
音声認識装置は一般に、音声認識エンジンおよび音声認識用の語彙データベースの２つのモジュールから構成されている。前者の音声認識エンジンとは、入力音声を一定のルールに従い分解してシステムが理解できる単位に分解する。後者の語彙データベースは、語彙の蓄積であり辞書として機能する。これら２つのモジュールは、音声認識装置により夫々異なり、既に多くの既存技術が存在する。すなわちある一つの入力音声に対して、ある音声認識装置では認識できても、他の音声認識装置では認識不可の場合もあり得ることになる。
【０００４】
上述の音声認識エンジンは日進月歩の状態であり、ある端末装置に内蔵されて製品化された音声認識エンジンもしばらく使用すれば陳腐化し、より高度の音声認識エンジンが開発されるのは、他のパソコン技術と同様である。音声認識用の語彙データベースについては、音声認識エンジンのように急速な技術進歩により陳腐化するのではなく、むしろ蓄積されるデータの更新あるいは蓄積量の問題として理解される。すなわち語彙データベースが常にアップデートされており、常に最新のデータが蓄積されている限り陳腐化することはない。端末装置における語彙データベースは、新旧の語彙データを取捨選択するのではなく、古いデータはそのまま残置して新しいデータを書き加えることによりデータ更新が行なわれる。このため記憶容量に限界がある端末装置では常にデータ更新を行なうことは、この記憶容量から限界があるのが一般的である。また記憶容量を最初から必要以上に大きくするのはコストの点で現実的ではない。
【０００５】
こうした音声認識機能付き情報端末での構造的限界を解決する手段として、従来技術では、端末に入力された音声を文字化する場合には、その機能を端末に設けずに、ネットワーク経由で入力音声データをサーバーに送り、サーバーで処理後、再び端末に送り返す方式が取られている。（例えば特許文献１参照。）この方式の利点は、端末側に過剰な機能を搭載せずにコンパクトに構成でき、音声を文字化する場合にはサーバーで処理する点である。ただしこの方式では、文字化すべき音声情報は全て常にサーバーで処理がされるために、その間の通信時間、通信コストがかかる構造的な欠点を有することである。また本発明では、端末側の音声認識エンジン機能は不変であり、もし搭載されている音声認識エンジンで認識不可の入力音声があれば、端末でのローカル処理が出来ない欠点を有する。
【０００６】
【特許文献１】
特開平７−２２２２４８号公報
【０００７】
【発明が解決しようとする課題】
上述のように従来の音声認識付き端末装置は、音声認識エンジン部の性能が固定的であり、また音声認識用の語彙データベースが記憶容量の点およびコストの点から限定的である。本発明はこのような従来技術の限界を克服するために開発されたものであり、その第１の目的は、情報端末の構成を必要以上に大きくすることなく、情報端末の音声認識率を向上させることである。本発明の第２の目的は、情報端末で出来る音声認識処理は情報端末で自己完結させることで処理速度の遅延のない音声認識システムを提供することである。第３の目的は、音声入力することで、その音声情報に関連する関連情報も得ることが出来るシステムを提供することである。さらに本発明の第４の目的は、高度な音声認識エンジンと、最新の音声認識用の語彙データベースを提供することにより、それに対するサービス料を徴収する新たなビジネスチャンスを創成するシステムを提供することである。
【０００８】
【課題を解決するための手段】
上記問題点を解消するために本発明では、音声認識サービスを行なう音声認識センターと、音声認識機能付の複数のローカル端末がネットワークを介して相互に機能的に接続されている音声認識サービス仲介システムにおいて、所定の入力音声に対する前記音声認識センターの音声認識能力を、前記ローカル端末の音声認識能力と比較して認識率が高度となるように構成することで、前記音声認識センターが前記ローカル端末の音声認識機能を補完するように構成したことを特徴とする音声認識サービス仲介システムを提案する。このように構成することで、ローカル端末に無用の負荷とコストをかけることなくコストパーフォーマンスの高い音声認識機能をローカル端末に持たせることが可能となる。
【０００９】
さらに前記音声認識センターは、マスター音声認識エンジン及びマスター・語彙データベースを有し、当該マスター音声認識エンジン及びマスター・語彙データベースの両者が機能的に前記ローカル端末の音声認識機能を補完するように構成し、前記音声認識センターの前記マスター音声認識エンジンは、前記ローカル端末の音声認識エンジンと比較して多種多様の既存音声認識技術が用いられている複数の音声認識エンジンを択一的に用いることにより、所定の入力音声に対する前記音声認識センターの音声認識能力が前記ローカル端末の音声認識エンジンよりも高度となるように構成する。すなわちローカル端末の音声認識機能は、その端末を購入した時点からはレベルアップしなくとも、それを補完する音声認識センターの音声認識機能を常にアップデートさせておけば、システム全体としての音声認識機能は常に最新のレベルにアップデートすることが出来る。特に膨大な数のローカル端末側を変更せずとも、音声認識センターの音声認識能力をアップデートするだけで済む為、効率のよい、かつ社会的ニーズの高い音声認識システムを構築することが可能となる。
【００１０】
また前記音声認識センターの前記マスター・語彙データベースは、前記ローカル端末の語彙データベースと比較して多種多数の語彙の蓄積量を有することにより、所定の入力音声に対する前記音声認識センターの音声認識能力が前記ローカル端末の音声認識エンジンよりも高度となるように構成し、前記マスター・語彙データベースの前記多種多数の語彙は、複数の異なるジャンルから選択するように構成する。これにより幅広いジャンルの語彙、特に専門性の高い語彙についても容易に音声識別が可能となる。
【００１１】
システムをより実効性あるものとするために、前記音声認識センターはさらに出力フォーマットを変換する出力フォーマット変換部を有し、音声認識サービスを行なう前記音声認識センターから前記ローカル端末へ送られる音声認識処理後の出力データフォーマットを、前記ローカル端末のローカル音声認識部からの出力データフォーマットと同一フォーマットに変換するように構成する。従って、ローカル端末の目的処理部にとっては、音声認識がローカル端末で行なわれても、またセンター側で行なわれても等価状態となり、システムとしてフレクシブルな対応が可能となる。
【００１２】
また前記音声認識センターは、また音声認識サービスに対する課金処理の課金データを蓄積する課金データベースを有し、さらに前記音声認識センターは入力音声データに関連する付加情報を、前記ローカル端末へ送られる音声認識処理後の出力データに付加し、前記ローカル端末の利用者または前記付加情報の発信者に対して前記関連付加情報提供の課金処理を行なうように構成する。このように構成することで、単に音声認識をセンター側で実行するのではなく、関連する情報提供という新たなビジネスチャンスを創成することが可能となる。
【００１３】
本発明では、上記に加えて同様な技術思想に基づいた音声認識マスター参照方法、音声認識マスター参照プログラムおよびそれを内蔵する記録媒体を提案する。
【００１４】
【発明の実施の形態】
以下、本発明を図に示した実施例を用いて詳細に説明する。但し、この実施例に記載されている構成部品の寸法、材質、形状、その相対配置などは特に特定的な記載がない限り、この発明の範囲をそれのみに限定する趣旨ではなく、単なる説明例にすぎない。
【００１５】
図１は本発明に係る音声認識サービス仲介システムの概念図である。すなわち音声認識センター１００と音声認識部を有する複数の音声認識機能付きローカル端末２００が、ネットワーク３００で接続されている。
【００１６】
音声認識センター１００は、入力部１０、出力部２０、マスター音声認識エンジン３０、複数のマスター・語彙データベース（ＤＢ）４０、出力フォーマット変換部５０、課金データベース（ＤＢ）６０で構成されている。この音声認識センター１００は、音声認識サービスを行なうサービス業者のもとに設置され、複数のローカル端末２００で音声認識できなかった音声データをネットワーク３００経由で受けて、音声認識処理を行ない、そのローカル端末２００が処理できるフォーマットに出力データを変換した後に、ローカル端末に送り返す音声認識サービスを行なう。なお後述するように、ローカル端末とは例えば車両内に設置された情報端末であり、その数は数千、数万台以上といった単位の端末に対して、この音声認識センター１００は、ネットワーク３００で接続可能な範囲に設置される限り単一のセンターを想定して本発明に係る音声認識サービスシステムは構築されている。
【００１７】
この音声認識センター１００での音声認識サービスは、課金処理がされる有料サービスとすることが出来る。音声認識サービスは、各ローカル端末２００と個別にサービス提供契約を結び、ローカル端末で処理できなかった音声認識サービスを提供してもよい。音声認識センター１００は、音声認識サービス単独で設置してもよいし、あるいは他の各種の情報仲介サービスを行なう情報センターの一つの機能として設けてもよい。また音声認識サービスを要求してきた各ローカル端末２００へは、付加価値情報追加機能７０によって、音声認識サービスと共にさらにその音声認識された情報に関連する付加価値情報を合わせて提供することも出来る。例えば音楽のジャンルに関する音声認識サービスを要求してきたローカル端末２００に対して音声認識処理後のデータを返送する際に、その音楽に関するコンサート情報を有料又は無料で付加すると共に、そのコンサート主催者からは別途コンサートに関する広告宣伝費として情報配信料を徴収するように構成してもよい。以下に上記音声認識センター１００の各ブロックについて詳説する。
【００１８】
入力部１０および出力部２０は、ネットワーク３００とのインターフェース機能を有する。この点は多くの既存技術で既に開示されているので詳細は省略する。
【００１９】
マスター音声認識エンジン３０は、複数の音声認識部Ａ〜Ｎを有している。この音声認識部は夫々独立した音声認識機能を有している点では共通するが、その方式を異にする。すなわち出来るだけ多種類の既存の音声認識技術を用いて、入力音声を認識するために、ある入力音声を例えば音声認識部Ａで認識できなければ、音声認識部Ｂ、さらにはＣと言った具合に、認識が可能となるまで他の種類の認識技術を用いた音声認識部へ処理を移行させる。従って音声認識部の数は限定しないが、既存技術を用いた、出来るだけ多種多様な音声認識部を設置するのが音声認識率向上のためにも望ましい。もちろん単一の音声認識部でもよい。
【００２０】
マスター・語彙データベース４０には、色々なジャンルの最新のデータが蓄積されている。音声認識エンジン３０で音声処理された後は、このマスター・語彙データベース４０の語彙データを用いて具体的な音声認識出力が生成される。このマスター・語彙データベース４０の語彙データは、例えば音楽ジャンル、不動産情報ジャンル、ニュースジャンル、その他多種多様なジャンルに関する情報が、常にアップデートされた最新の情報レベルで蓄積されている。すなわち本発明に係る音声認識サービス仲介システムの中核的な機能としてのマスター・語彙データベースとして機能する。もちろん語彙データベースは、図示するように夫々ジャンル毎に独立した語彙データベースとする必要は必ずしも無く、単一の語彙データベースに蓄積されていてもよい。
【００２１】
出力フォーマット変換部５０では、上述のマスター音声認識エンジン３０およびマスター・語彙データベース４０で音声認識された音声出力データを、各ローカル端末２００内で用いられている音声認識部ａ〜ｎの出力フォーマット等価になるように変換処理を行なう。これは、各ローカル端末２００の音声認識部ａ〜ｎに対応する出力フォーマットに変換して夫々のローカル端末へ音声出力データを送れば、受取ったローカル端末側では、その後の目的処理を行なう際に都合がよいからである。なお例えば各ローカル端末２００から音声認識センター１００側へ音声データが送られる際に、その音声データに各ローカル端末のＩＤを付加しておけば、どのようなフォーマットに変換すべきかは容易に判別することが可能である。なおこのローカル端末のＩＤは次に述べる課金処理でも用いられる。
【００２２】
本発明に係る音声認識サービス仲介システムの目的の１つは、ローカル端末での音声認識機能を過大なものとせず、一定以上の困難な音声認識は音声認識センター１００に依存させ、ローカル端末としてのコストパーフォーマンスを向上させることである。従って、音声認識センター１００での音声認識サービスは、そのローカル端末でのコスト低減の範囲内、又は後述する関連情報の広告宣伝費の範囲内で、有料化するビジネスモデルを創成することが可能である。このため本発明では、課金データベース６０を設けて、音声認識を要求したローカル端末２００ごと、または関連情報の広告主体ごとに所定のサービス料を請求するように構成することも出来る。関連情報とは、例えばあるローカル端末から音楽ジャンルで、あるシンガーの曲に関する音声認識要求があった場合、その音声認識処理された出力データと共に、例えばそのシンガーの全ての曲目案内をローカル端末２００へ送り、その曲目のレコード会社へは、広告宣伝費として所定の料金を課金する例が上げられる。さらに別の例として、不動産ジャンルで音声認識要求があれば、それに関連する不動産情報をローカル端末２００に送る等の関連情報提供サービスが考えられる。
【００２３】
次に音声認識機能付きローカル端末２００について詳説する。このローカル端末２００は、例えば図２に示すように、車両内部に設けられたＣＤプレーヤがある。このプレーヤーでは、運転者が例えば特定のシンガーＡの曲目Ｂをマイク２４０によって音声指令することが出来る。この曲目Ｂの音楽データ自体は、既にこのＣＤプレーヤーに装備されているか、または別途図示しない音楽配信センターからダウンロードされるものとする。すなわち音楽データとしては、この車両内のＣＤプレーヤ内に存在する場合を想定する。ただしその曲目Ｂは音声認識部ａ２１０の曲目データベース２２０にはまだ登録されておらず、従って運転手の曲目Ｂを選択する音声情報を判別することが出来ない場合を想定する。本発明に係る音声認識サービス仲介システムでは、この場合、認識できなかった入力音声は、ネットワーク３００経由で音声認識センター１００へ上げられ、センターでより高度な音声認識処理がされることになる。なお前述のように、音声認識センター１００では、曲目Ｂを含む最新の音声認識データがマスター・語彙データベース４０に蓄積されているために、入力音声から曲目Ｂを識別することが可能である。そして音声認識センター１００で曲目Ｂが識別されると、この曲目Ｂの音声識別データは個々のローカル端末の目的処理部２３０へ入力されるフォーマットに出力フォーマット変換部５０で変換される。そしてこの変換後の音声識別データは、ローカル端末であるＣＤプレーヤに返送されて、その後に目的処理部２３０で目的処理、この場合には曲目ＢがＣＤプレーヤ内の音楽データから選択されてプレーされることになる。
【００２４】
上述の音声認識センター１００内の音声認識部Ａ〜Ｎは、ローカル端末２００との契約によっては、例えば音楽ジャンルだけ、不動産ジャンルだけ、ニュースジャンルだけ、あるいはそれらの全て、等に利用者により音声変換サービスを選択することが出来る。もちろんジャンルが多ければその分のサービス料の提供を求めてもよい。従ってより多くの音声認識サービスに対してはより多くの課金を行なうというビジネスモデルも可能となる。さらに図示しないが、各ジャンルに関連する付加的な情報もジャンルを多く選択することで得ることが出来る。例えばレストランに関するジャンルでは、もし車両内の語彙データベースに適当なレストラン情報が無ければ、利用者は“レストラン”を音声で入力すれば、その音声データがローカル端末２００では認識できず、音声認識センター１００へ送られるため、音声認識センター１００からは予め情報提供を契約した複数のレストランについてのレストラン情報をローカル端末２００へ提供すると共に、そのレストランへは広告宣伝費として課金するというビジネスモデルも可能となる。なお図２には音声認識機能付きローカル端末２００は車両内に搭載されているが、本発明はこの例には限定されず、オフィス内や、工場内、その他多くの場所に設置された音声認識を行なう端末なら、どのようなアプリケーションにも応用することが可能である。
【００２５】
図３は本発明に係る音声認識サービス仲介システムにおける、音声認識処理のフローチャートである。ローカル端末２００から利用者の音声入力がされる（ＳＴ１）。上述の例では“シンガーＡの曲目Ｂ”という音声入力である。この音声入力に対してローカル端末２００内の音声認識部ａで音声認識できれば（ＳＴ２）、次にその“シンガーＡの曲目Ｂ”をローカル端末２００内の音楽曲目ＤＢａ２２０を検索し（ＳＴ３）、一致する音楽データがあればローカル端末内での目的処理が実行される。この例で云えば、シンガーＡの曲目ＢがＣＤプレーヤーで演奏される。この限りでは、本発明に係る音声認識サービス仲介システムは利用されないことになる。
【００２６】
しかしながら実際には、たとえば利用者の声質は個々人で異なり、従って利用者Ｘの声質は音声認識出来ても利用者Ｙの声質は認識できない場合があり得る。このような音声認識不可のケースは、たとえローカル端末に自己学習機能が付加されていても、実際には起こり得るし、また利用者が多数である場合には、夫々の声質が異なり、自己学習自体が困難なケースも起こり得る。このような場合には、ＳＴ３のステップで、音声認識不可となり、その入力された音声データはそのまま音声認識センター１００へ送られ、そこで改めて音声認識が実行される（ＳＴ５）。上述のように音声認識センターでは、ローカル端末２００に搭載された音声認識部ａ以外の多くの、そして最新の音声認識技術を用いた音声認識部Ａ〜Ｎが準備されているために、より高い確率で音声認識を行なうことが出来る。換言すれば音声認識センター１００は、ローカル端末２００からマスター参照される。
【００２７】
そしてＳＴ５で音声認識処理がされたら、ＳＴ６ではマスター・語彙データベース４０から“シンガーＡの曲目Ｂ”に対する語彙を検索後、対応する変換処理データが生成される。なおローカル端末のＳＴ３でも、ローカルの語彙データベースに対応する語彙が登録されていない場合にはＳＴ６へ移行し、マスター・語彙データベースから“シンガーＡの曲目Ｂ”に対する語彙を検索後、対応する変換処理データが同様に生成される。この変換処理データはＳＴ７でローカル端末２００用の出力フォーマットに変換され、ＳＴ８で課金処理後、再びローカル端末へ“シンガーＡの曲目Ｂ”という変換処理データとして返送され、ＳＴ４での目的処理が実行される。すなわちシンガーＡの曲目ＢがＣＤプレーヤーで演奏される。なお図示はしないが、ＳＴ７では、前述の関連情報を合わせて付加することが出来る。
【００２８】
【発明の効果】
上記のように本発明ではローカル端末で、音声認識処理出来ない場合、および音声認識のための語彙データベースが不足している場合にも、音声認識センターへ音声データを送ることにより簡便に音声認識を行なうことが可能となる。
【００２９】
従って、ローカル端末内の音声認識機能を過大なものとする必要が無く、適切なコストパーフォーマンスを持った音声認識機能付の端末を構成することが出来る。
【００３０】
さらにこのような音声認識サービスに対して課金をすることによりビジネスモデルを創成することが可能となる。また関連情報をセンター側から端末側へ送ることで広告宣伝機能を付加することでもビジネスモデルを創成することが出来る。
【図面の簡単な説明】
【図１】本発明に係る音声認識サービス仲介システムの概念図である。
【図２】音声認識機能付きローカル端末が車両に搭載された場合の概念図である。
【図３】本発明に係る音声認識サービス仲介システムにおける、音声認識処理のフローチャートである。
【符号の説明】
１００音声認識センター
１０入力部
２０出力部
３０マスター音声認識エンジン
４０マスター・語彙データベース
５０出力フォーマット変換部
６０課金データベース
２００音声認識機能付きローカル端末
３００ネットワーク
[0001]
[Technical Field of the Invention]
The present invention relates to a speech recognition service mediation system, a speech recognition master reference method used therefor, a program, and a storage medium containing the program. More specifically, it relates to a distributed speech recognition system in which a plurality of local terminals with a speech recognition function and a speech recognition center are connected via a network, and in which speech input that a local terminal cannot process is recognized at the speech recognition center, which has higher processing capability.
[0002]
[Prior art]
Conventionally, with the improvement of voice recognition technology, voice input means are widely used in many terminals without using an input device such as a keyboard. When this voice input means is used, unlike a keyboard or mouse, it can be operated in a so-called hands-free state. Therefore, it is possible to easily communicate with the information equipment by voice even during operations requiring other manual operations. For this reason, the voice input means is particularly effective when the information device is operated during an operation that requires both hands, such as driving a car.
[0003]
A speech recognition apparatus generally consists of two modules: a speech recognition engine and a vocabulary database for speech recognition. The speech recognition engine decomposes the input speech, according to fixed rules, into units that the system can interpret. The vocabulary database is an accumulation of vocabulary and functions as a dictionary. Both modules differ from one speech recognition apparatus to another, and many existing technologies are available for each. Consequently, a given input speech may be recognized by one speech recognition apparatus but not by another.
[0004]
Speech recognition engines are advancing rapidly. A speech recognition engine built into a commercial terminal device becomes obsolete after some period of use as more advanced engines are developed, just as with other personal computer technology. The vocabulary database for speech recognition, by contrast, is not made obsolete by rapid technological progress in the way an engine is; its problem is better understood as one of updating the stored data and of storage volume. That is, as long as the vocabulary database is kept updated so that it always holds the latest data, it does not become obsolete. In a terminal device, the vocabulary database is updated not by selecting between old and new vocabulary data but by leaving old data in place and appending new data. For this reason, a terminal device with limited storage capacity generally cannot keep updating its data indefinitely. Making the storage capacity larger than necessary from the outset is also unrealistic in terms of cost.
[0005]
As a means of addressing these structural limitations of information terminals with a speech recognition function, the prior art adopts a scheme in which the function of converting input speech to text is not provided in the terminal; instead, the input speech data is sent over a network to a server, processed there, and the result is sent back to the terminal (see, for example, Patent Document 1). The advantage of this scheme is that the terminal can be kept compact, without excessive functions, because speech is converted to text on the server. However, because all speech to be converted to text is always processed by the server, the scheme has the structural drawback of incurring communication time and communication cost for every input. In addition, the speech recognition engine function on the terminal side is fixed, so that if there is input speech that the installed engine cannot recognize, it cannot be processed locally at the terminal.
[0006]
[Patent Document 1]
Japanese Patent Laid-Open No. 7-222248
[0007]
[Problems to be solved by the invention]
As described above, in conventional terminal devices with speech recognition, the performance of the speech recognition engine is fixed, and the vocabulary database for speech recognition is limited by storage capacity and cost. The present invention was developed to overcome these limitations of the prior art. Its first object is to improve the speech recognition rate of an information terminal without making the terminal's configuration larger than necessary. Its second object is to provide a speech recognition system free of processing delay by having the information terminal itself complete whatever speech recognition processing it is capable of. Its third object is to provide a system in which speech input also yields information related to that speech. Its fourth object is to provide a system that creates a new business opportunity: supplying an advanced speech recognition engine and an up-to-date vocabulary database for speech recognition, and collecting a service fee for them.
[0008]
[Means for Solving the Problems]
To solve the above problems, the present invention proposes a speech recognition service mediation system in which a speech recognition center that provides a speech recognition service and a plurality of local terminals with a speech recognition function are functionally connected to one another via a network, characterized in that the speech recognition capability of the speech recognition center for a given input speech is configured to achieve a higher recognition rate than that of the local terminals, so that the speech recognition center complements the speech recognition function of the local terminals. With this configuration, a local terminal can be given a highly cost-effective speech recognition function without imposing unnecessary load or cost on the terminal.
[0009]
Further, the speech recognition center has a master speech recognition engine and a master vocabulary database, both of which are configured to functionally complement the speech recognition function of the local terminal. The master speech recognition engine selectively uses one of a plurality of speech recognition engines that employ a wider variety of existing speech recognition technologies than the local terminal's engine, so that the center's speech recognition capability for a given input speech exceeds that of the local terminal's engine. In other words, even if the speech recognition function of a local terminal is never upgraded after purchase, keeping the complementing speech recognition function of the center up to date keeps the speech recognition function of the system as a whole at the latest level. In particular, because only the center's speech recognition capability needs to be updated, with no change to the very large number of local terminals, an efficient speech recognition system that meets strong social needs can be constructed.
[0010]
In addition, the master vocabulary database of the speech recognition center holds a larger and more varied stock of vocabulary than the vocabulary database of the local terminal, so that the center's speech recognition capability for a given input speech exceeds that of the local terminal's speech recognition engine. This vocabulary is drawn from a plurality of different genres. As a result, speech can readily be recognized for vocabulary from a wide range of genres, including highly specialized vocabulary.
[0011]
To make the system more effective, the speech recognition center further includes an output format conversion unit. This unit converts the format of the recognition output data sent from the speech recognition center to a local terminal into the same format as the output data of that terminal's local speech recognition unit. Consequently, from the standpoint of the local terminal's target processing unit, the result is equivalent whether recognition was performed at the local terminal or at the center, which allows the system to respond flexibly.
[0012]
The speech recognition center also has a billing database that accumulates billing data for charging for the speech recognition service. Furthermore, the speech recognition center attaches additional information related to the input speech data to the recognition output data sent to the local terminal, and charges either the user of the local terminal or the sender of the additional information for providing that related information. With this configuration, the center does not merely perform speech recognition; it also creates a new business opportunity in providing related information.
[0013]
In addition to the above, the present invention proposes a voice recognition master reference method, a voice recognition master reference program, and a recording medium incorporating the same, based on the same technical idea.
[0014]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, the present invention will be described in detail with reference to the embodiments shown in the drawings. Unless specifically stated otherwise, the dimensions, materials, shapes, relative arrangements, and the like of the components described in these embodiments are not intended to limit the scope of the invention to them; they are merely illustrative examples.
[0015]
FIG. 1 is a conceptual diagram of a speech recognition service mediation system according to the present invention. A speech recognition center 100 and a plurality of local terminals 200 with a speech recognition function, each having a speech recognition unit, are connected by a network 300.
[0016]
The speech recognition center 100 comprises an input unit 10, an output unit 20, a master speech recognition engine 30, a plurality of master vocabulary databases (DB) 40, an output format conversion unit 50, and a billing database (DB) 60. The center is installed at the premises of a service provider offering a speech recognition service. It receives, via the network 300, speech data that the local terminals 200 could not recognize, performs speech recognition processing, converts the output data into a format that the requesting local terminal 200 can process, and sends it back to that terminal. As described later, a local terminal is, for example, an information terminal installed in a vehicle, and there may be thousands or tens of thousands of such terminals or more. The speech recognition service system according to the present invention is constructed on the assumption of a single center serving these terminals, provided that the center is located where it can be reached over the network 300.
[0017]
The speech recognition service at the speech recognition center 100 can be a paid service subject to billing. The service provider may conclude an individual service contract with each local terminal 200 and provide speech recognition for input that the local terminal could not process. The speech recognition center 100 may be set up solely for the speech recognition service, or may be provided as one function of an information center that performs various other information mediation services. In addition, a value-added information addition function 70 can supply each local terminal 200 that requests the service with value-added information related to the recognized content, together with the recognition result. For example, when returning recognition output to a local terminal 200 that requested recognition in the music genre, the center may attach concert information related to that music, for a fee or free of charge, and separately collect an information distribution fee from the concert organizer as advertising expenses for the concert. Each block of the speech recognition center 100 is described in detail below.
[0018]
The input unit 10 and the output unit 20 have an interface function with the network 300. Since this point has already been disclosed in many existing technologies, details are omitted.
[0019]
The master speech recognition engine 30 has a plurality of speech recognition units A to N. These units are alike in that each has an independent speech recognition function, but they differ in method. The aim is to recognize input speech using as many kinds of existing speech recognition technology as possible: if a given input cannot be recognized by speech recognition unit A, for example, processing moves to unit B, then to unit C, and so on, passing to units based on other recognition technologies until recognition succeeds. The number of speech recognition units is therefore not limited, but installing as wide a variety of units based on existing technologies as possible is desirable for improving the recognition rate. A single speech recognition unit may of course also be used.
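As a non-limiting illustration, the sequential hand-off among speech recognition units A to N might be sketched as follows. The `Recognizer` type, the function name `recognize_with_fallback`, and the toy units are assumptions introduced only for this sketch; the specification does not define any programming interface.

```python
from typing import Callable, Optional, Sequence

# Hypothetical recognizer type: returns recognized text, or None on failure.
Recognizer = Callable[[bytes], Optional[str]]


def recognize_with_fallback(audio: bytes, units: Sequence[Recognizer]) -> Optional[str]:
    """Try each recognition unit in order and return the first successful result."""
    for unit in units:
        text = unit(audio)
        if text is not None:
            return text
    return None  # no unit could recognize the input


# Toy stand-ins for units A, B and C (not real recognizers).
def unit_a(audio: bytes) -> Optional[str]:
    return None  # always fails in this sketch


def unit_b(audio: bytes) -> Optional[str]:
    return "song b" if audio == b"song-b-audio" else None


def unit_c(audio: bytes) -> Optional[str]:
    return "fallback text"  # always succeeds in this sketch


result = recognize_with_fallback(b"song-b-audio", [unit_a, unit_b, unit_c])
```

Here unit A fails, so the input is passed to unit B, which succeeds; unit C is never consulted. With a single-element list the same function covers the single-unit case mentioned above.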
[0020]
The master vocabulary database 40 stores the latest data for various genres. After the speech has been processed by the master speech recognition engine 30, the concrete recognition output is generated using the vocabulary data in the master vocabulary database 40. This vocabulary data covers, for example, the music genre, the real estate information genre, the news genre, and a wide variety of other genres, and is kept constantly updated to the latest level. In this sense the master vocabulary database serves as a core function of the speech recognition service mediation system according to the present invention. The vocabulary need not be held in a separate database for each genre as shown in the figure; it may be stored in a single vocabulary database.
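A minimal sketch of a genre-separated vocabulary lookup is given below, assuming the vocabulary is held as one set of phrases per genre. The dictionary `MASTER_VOCABULARY`, its entries, and the function `lookup` are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical genre-separated vocabulary: genre name -> known phrases.
MASTER_VOCABULARY: Dict[str, Set[str]] = {
    "music": {"singer a song b"},
    "real_estate": {"two bedroom apartment"},
    "news": {"evening headlines"},
}


def lookup(phrase: str, vocabulary: Dict[str, Set[str]]) -> Optional[Tuple[str, str]]:
    """Return (genre, normalized phrase) for the first genre containing the phrase."""
    normalized = phrase.strip().lower()
    for genre, entries in vocabulary.items():
        if normalized in entries:
            return genre, normalized
    return None  # phrase is not registered in any genre


hit = lookup("Singer A Song B", MASTER_VOCABULARY)
miss = lookup("unknown phrase", MASTER_VOCABULARY)
```

Holding everything in a single database, as the text also permits, would correspond to a dictionary with only one key.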
[0021]
The output format conversion unit 50 converts the recognition output data produced by the master speech recognition engine 30 and the master vocabulary database 40 so that it is equivalent to the output format of the speech recognition units a to n used in the respective local terminals 200. Sending each local terminal output data in the format of its own speech recognition unit is convenient for the receiving terminal when it performs the subsequent target processing. If, for example, each local terminal 200 attaches its terminal ID to the speech data it sends to the speech recognition center 100, the center can easily determine which format to convert to. This terminal ID is also used in the billing process described next.
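The selection of an output format from the terminal ID could be sketched as below. The two formats (a JSON object and a comma-separated line), the terminal IDs, and all function names are assumptions made for illustration; the specification does not state what formats the local speech recognition units actually emit.

```python
import json
from typing import Callable, Dict


def to_json(text: str) -> str:
    """Hypothetical format used by one kind of local terminal."""
    return json.dumps({"text": text})


def to_delimited(text: str) -> str:
    """Hypothetical format used by another kind of local terminal."""
    return f"TEXT,{text}"


# Hypothetical registry: terminal ID -> formatter matching that terminal's local unit.
TERMINAL_FORMATS: Dict[str, Callable[[str], str]] = {
    "terminal-001": to_json,
    "terminal-002": to_delimited,
}


def convert_for_terminal(terminal_id: str, text: str) -> str:
    """Convert recognized text into the format expected by the given terminal."""
    try:
        formatter = TERMINAL_FORMATS[terminal_id]
    except KeyError:
        raise ValueError(f"unknown terminal ID: {terminal_id}") from None
    return formatter(text)


json_output = convert_for_terminal("terminal-001", "song b")
delimited_output = convert_for_terminal("terminal-002", "song b")
```

Because the conversion is keyed on the terminal ID alone, the same ID can be reused as the key for billing.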
[0022]
One object of the speech recognition service mediation system according to the present invention is to improve the cost performance of the local terminal by keeping its speech recognition function from becoming excessive and relying on the speech recognition center 100 for recognition above a certain level of difficulty. A business model can therefore be created in which the speech recognition service at the center 100 is charged for, within the range of the cost saving achieved at the local terminal or within the range of the advertising expenses for related information described below. To this end, the present invention may provide a billing database 60 so that a predetermined service fee is charged to each local terminal 200 that requested recognition, or to each advertiser of related information. As an example of related information, when a local terminal issues a recognition request in the music genre concerning a song by a certain singer, the center may send the local terminal 200 a guide to all of that singer's songs together with the recognition output, and charge the record company of those songs a predetermined fee as advertising expenses. As another example, when a recognition request is made in the real estate genre, related real estate information may be sent to the local terminal 200.
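By way of illustration only, accumulating charges per payer (a local terminal or an advertiser) might look like the following sketch. The class name `BillingDatabase`, the payer identifiers, and the fee amounts are hypothetical; the specification does not fix any fee schedule or data layout.

```python
from collections import defaultdict
from typing import DefaultDict


class BillingDatabase:
    """Toy in-memory ledger keyed by payer ID (terminal ID or advertiser ID)."""

    def __init__(self) -> None:
        self._charges: DefaultDict[str, int] = defaultdict(int)

    def charge(self, payer_id: str, amount: int) -> None:
        """Add a charge for the payer; amounts are in arbitrary currency units."""
        if amount < 0:
            raise ValueError("amount must be non-negative")
        self._charges[payer_id] += amount

    def total(self, payer_id: str) -> int:
        """Return the accumulated charges for the payer (0 if never charged)."""
        return self._charges.get(payer_id, 0)


billing = BillingDatabase()
billing.charge("terminal-001", 10)               # fee for one recognition request
billing.charge("advertiser:record-company", 50)  # fee for attached related information
billing.charge("terminal-001", 10)               # a second request from the same terminal
```

Using one ledger for both kinds of payer reflects the either-or wording above, under which the fee may fall on the requesting terminal or on the advertiser.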
[0023]
Next, the local terminal 200 with a speech recognition function is described in detail. An example of the local terminal 200 is a CD player installed in a vehicle, as shown in FIG. 2. With this player, the driver can, for example, request song B by a particular singer A by voice through the microphone 240. The music data for song B is assumed either to be already present in the CD player or to be downloaded from a music distribution center (not shown); that is, the music data is assumed to exist in the in-vehicle CD player. However, song B is assumed not yet to be registered in the song database 220 of speech recognition unit a 210, so that the driver's spoken selection of song B cannot be recognized. In this case, in the speech recognition service mediation system according to the present invention, the unrecognized input speech is passed up to the speech recognition center 100 via the network 300, where more advanced speech recognition processing is performed. As described above, the master vocabulary database 40 at the center holds the latest speech recognition data, including song B, so song B can be identified from the input speech. Once the center has identified song B, the output format conversion unit 50 converts the recognition data for song B into the format accepted by the target processing unit 230 of the individual local terminal. The converted data is returned to the CD player, the local terminal, and the target processing unit 230 then performs the target processing: in this case, song B is selected from the music data in the CD player and played.
[0024]
Depending on the contract with the local terminal 200, the user can choose which speech recognition service to receive from the speech recognition units A to N in the center 100: for example, the music genre only, the real estate genre only, the news genre only, or all of them. A correspondingly higher service fee may of course be requested when more genres are selected, which makes possible a business model in which more recognition services incur higher charges. Although not shown, selecting more genres also gives access to additional information related to each genre. For example, in a restaurant genre, if the in-vehicle vocabulary database contains no suitable restaurant information and the user says “restaurant”, the speech data cannot be recognized at the local terminal 200 and is sent to the speech recognition center 100. The center can then supply the local terminal 200 with information on restaurants that have contracted in advance to provide it, and charge those restaurants advertising expenses. Although FIG. 2 shows the local terminal 200 mounted in a vehicle, the present invention is not limited to this example; it can be applied to any terminal that performs speech recognition, whether installed in an office, a factory, or elsewhere.
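The contract-dependent choice of genres could be sketched as a simple entitlement table. The table `SUBSCRIPTIONS`, the flat `FEE_PER_GENRE`, and both function names are assumptions for illustration; the text only says that more genres may carry a higher fee, not how the fee is computed.

```python
from typing import Dict, Set

# Hypothetical contracts: terminal ID -> genres the user has subscribed to.
SUBSCRIPTIONS: Dict[str, Set[str]] = {
    "terminal-001": {"music"},
    "terminal-002": {"music", "real_estate", "news"},
}

FEE_PER_GENRE = 100  # hypothetical flat fee per subscribed genre


def is_entitled(terminal_id: str, genre: str) -> bool:
    """Return True if the terminal's contract covers the requested genre."""
    return genre in SUBSCRIPTIONS.get(terminal_id, set())


def service_fee(terminal_id: str) -> int:
    """Fee grows with the number of subscribed genres (one possible pricing rule)."""
    return FEE_PER_GENRE * len(SUBSCRIPTIONS.get(terminal_id, set()))
```

Under this rule a terminal subscribed to three genres pays three times the fee of a terminal subscribed to one, which is one simple way to realize charging more for more recognition services.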
[0025]
FIG. 3 is a flowchart of speech recognition processing in the speech recognition service mediation system according to the present invention. The user's speech is input at the local terminal 200 (ST1). In the example above, the input is “singer A's song B”. If speech recognition unit a in the local terminal 200 can recognize this input (ST2), the music song database a 220 in the local terminal 200 is then searched for “singer A's song B” (ST3), and if matching music data is found, the target processing is executed within the local terminal. In this example, song B by singer A is played on the CD player. In this case the speech recognition service mediation system according to the present invention is not used.
[0026]
However, in practice, voice quality differs from person to person, so that user X's voice may be recognized while user Y's is not. Such recognition failures can occur even when the local terminal has a self-learning function, and when there are many users with differing voice qualities, self-learning itself may be difficult. In such cases, recognition fails at step ST3, and the input speech data is sent as it is to the speech recognition center 100, where speech recognition is performed anew (ST5). As described above, the center is equipped with speech recognition units A to N, which are more numerous than the speech recognition unit a installed in the local terminal 200 and use the latest speech recognition technologies, so recognition succeeds with higher probability. In other words, the local terminal 200 refers to the speech recognition center 100 as its master.
[0027]
When the speech recognition process has been performed in ST5, the master / vocabulary database 40 is searched in ST6 for the vocabulary "Singer A's song B", and the corresponding conversion processing data is generated. Likewise, if the corresponding vocabulary is not registered in the local vocabulary database in ST3 at the local terminal, the process proceeds to ST6, where the master / vocabulary database is searched for "Singer A's song B" and the corresponding conversion processing data is generated. This conversion processing data is converted into the output format for the local terminal 200 in ST7 and, after billing in ST8, is returned to the local terminal as the conversion processing data "Singer A's song B", whereupon the target process in ST4 is executed. That is, song B of singer A is played on the CD player. Although not shown, the related information described above can also be added in ST7.
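The flow ST1 to ST8 can be sketched end to end as follows. This is a simplified illustration, not the implementation in the specification: voice input is represented as a plain string, the local engine is modeled as an exact-match set, the center's engine as simple text normalization, and the class names, the `{"command": ...}` output format, and the flat per-request fee are all hypothetical.

```python
from typing import Optional


class Center:
    """Stand-in for the voice recognition center 100 (ST5 to ST8)."""

    def __init__(self, master_vocab: dict, fee: int = 1):
        self.master_vocab = master_vocab  # master / vocabulary database 40
        self.fee = fee                    # hypothetical flat fee per request
        self.billing = {}                 # billing database 60: terminal id -> total

    def recognize(self, voice: str) -> Optional[str]:
        # ST5: stand-in for the master engine (normalizes the input text).
        return voice.strip().lower() or None

    def handle(self, terminal_id: str, voice: Optional[str] = None,
               text: Optional[str] = None) -> Optional[dict]:
        if text is None and voice is not None:
            text = self.recognize(voice)                      # ST5
        if text is None or text not in self.master_vocab:
            return None
        data = self.master_vocab[text]                        # ST6
        out = {"command": data}                               # ST7: terminal's format
        self.billing[terminal_id] = self.billing.get(terminal_id, 0) + self.fee  # ST8
        return out


class LocalTerminal:
    """Stand-in for the local terminal 200 (ST1 to ST4)."""

    def __init__(self, terminal_id: str, recognizable: set,
                 local_vocab: dict, center: Center):
        self.terminal_id = terminal_id
        self.recognizable = recognizable  # inputs the local engine can recognize
        self.local_vocab = local_vocab    # local vocabulary database
        self.center = center

    def process(self, voice: str) -> Optional[str]:           # ST1: voice input
        text = voice if voice in self.recognizable else None  # ST2
        if text is None:
            result = self.center.handle(self.terminal_id, voice=voice)
        elif text in self.local_vocab:                        # ST3: local hit
            result = {"command": self.local_vocab[text]}
        else:                                                 # ST3 miss -> ST6
            result = self.center.handle(self.terminal_id, text=text)
        return result["command"] if result else None          # ST4


center = Center({"singer a song b": "play:song-b", "news": "show:news"})
terminal = LocalTerminal("car-1", {"radio on", "news"},
                         {"radio on": "radio:on"}, center)
print(terminal.process("radio on"))         # handled locally, no billing
print(terminal.process("Singer A Song B"))  # local recognition fails, center used
print(center.billing)
```

In this sketch a request handled entirely at the terminal never reaches the billing database, while both fallback paths (recognition failure in ST2 and vocabulary miss in ST3) pass through ST7 and ST8 at the center.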
[0028]
[Effect of the invention]
As described above, according to the present invention, even when the local terminal cannot perform voice recognition processing, or when its vocabulary database for voice recognition is insufficient, voice recognition can easily be performed by sending the voice data to the voice recognition center.
[0029]
Therefore, it is not necessary to make the voice recognition function in the local terminal excessive, and a terminal with a voice recognition function having appropriate cost performance can be configured.
[0030]
Furthermore, it is possible to create a business model by charging for such a speech recognition service. A business model can also be created by adding an advertisement function by sending related information from the center side to the terminal side.
[Brief description of the drawings]
FIG. 1 is a conceptual diagram of a speech recognition service mediation system according to the present invention.
FIG. 2 is a conceptual diagram when a local terminal with a voice recognition function is mounted on a vehicle.
FIG. 3 is a flowchart of voice recognition processing in the voice recognition service mediation system according to the present invention.
[Explanation of symbols]
100 speech recognition center
10 input unit
20 output unit
30 master speech recognition engine
40 master / vocabulary database
50 output format conversion unit
60 billing database
200 local terminal with speech recognition function
300 network

Claims

1. A speech recognition service mediation system in which a speech recognition center that performs a speech recognition service and a plurality of local terminals with a speech recognition function are functionally connected to each other via a network, wherein the speech recognition center is configured to complement the speech recognition function of the local terminal by configuring the speech recognition capability of the speech recognition center so that its recognition rate for a predetermined input speech is higher than that of the speech recognition capability of the local terminal;
the speech recognition center includes a master speech recognition engine and a master / vocabulary database, and both the master speech recognition engine and the master / vocabulary database are configured to functionally complement the speech recognition function of the local terminal; and
the master speech recognition engine of the speech recognition center can alternatively use a plurality of speech recognition engines employing a wider variety of existing speech recognition technologies than the speech recognition engine of the local terminal, whereby the speech recognition capability of the speech recognition center for the predetermined input speech is made higher than that of the speech recognition engine of the local terminal.

2. The speech recognition service mediation system according to claim 1, wherein the master / vocabulary database of the speech recognition center holds a larger accumulated vocabulary than the vocabulary database of the local terminal, so that the speech recognition capability of the speech recognition center for a predetermined input speech is more advanced than that of the speech recognition engine of the local terminal.

3. The voice recognition service mediation system according to claim 2, wherein the various vocabularies in the master / vocabulary database are selected from a plurality of different genres.

4. The speech recognition service mediation system according to claim 1, wherein the speech recognition center further has an output format conversion unit for converting an output format, and the format of the output data after speech recognition processing, sent from the speech recognition center performing the speech recognition service to the local terminal, is converted into the same data format as the output data format of the speech recognition unit of the local terminal.

5. The speech recognition service mediation system according to claim 1, wherein the speech recognition center further has a billing database that stores billing data for billing processing of the speech recognition service.

6. The voice recognition service mediation system according to claim 5, wherein the voice recognition center further adds additional information related to the input voice data to the output data after voice recognition processing sent to the local terminal, and performs billing processing for the provision of the related additional information on the user of the local terminal or the sender of the additional information.

7. A speech recognition master reference method for transferring input speech that could not be recognized by a local terminal to a speech recognition center connected via a network and performing more advanced speech recognition processing at the speech recognition center, the method comprising: a step of determining, for input speech, whether speech recognition processing at the local terminal is possible or impossible; a step of sending the input speech to the speech recognition center via the network when it is determined to be impossible in the preceding step; a step of making the speech recognition capability at the speech recognition center for the predetermined input speech more advanced than the speech recognition capability of the local terminal by selectively using speech recognition processing employing a wider variety of existing speech recognition technologies than the speech recognition processing at the local terminal; a speech processing step of converting the output data of the speech recognition processing at the speech recognition center into the data format of the speech recognition processing of the local terminal; and a billing processing step of performing billing processing related to the speech recognition service at the speech recognition center.

8. The speech recognition master reference method according to claim 7, further comprising: a step of adding, to the output data subjected to the speech recognition processing, additional information related to a business field specified by the input speech; and an additional information billing step of performing billing processing for the additional information.