JP2004012653A

JP2004012653A - Voice recognition system, voice recognition client, voice recognition server, voice recognition client program, and voice recognition server program

Info

Publication number: JP2004012653A
Application number: JP2002163931A
Authority: JP
Inventors: Takashi Akiyama; 秋山　貴; Norihiko Kumon; 久門　紀彦
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2002-06-05
Filing date: 2002-06-05
Publication date: 2004-01-15

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition system that addresses two problems of a voice recognition system composed of a server and a client: the increased load on the server-side CPU and the increased load on the network bandwidth between the client and the server.

SOLUTION: The client attempts primary voice recognition on the input voice data. When the voice can be recognized by primary voice recognition, the client transmits the recognition result to the server. When it cannot be recognized, the client transmits the voice data to the server, and the server performs secondary voice recognition on the voice data.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、サーバとクライアントから構成される音声認識システム、音声認識クライアント、音声認識サーバ、音声認識クライアントプログラムおよび音声認識サーバプログラムに関する。
【０００２】
【従来の技術】
従来、サーバとクライアントとにより構成される音声認識システムに関しては特開２００１−１４２４８８号公報に記載されたものが知られている。音声入力をクライアントで行い、クライアントで得た音声データをサーバに送り、音声認識をサーバで行う音声認識システムである。また、クライアントで音声データの符号化を行うことにより、クライアントとサーバとの間のネットワーク帯域の負荷を抑えることを解決するサーバとクライアントとにより構成される音声認識システムに関しては特開２００１−３３７６９５号公報に記載されたものが知られているが、何れも音声認識をサーバのみで行うことによりサーバ側ＣＰＵの負荷が上昇してしまうもので、サーバ側ＣＰＵの負荷と、クライアントとサーバとの間のネットワーク帯域の負荷とを抑えつつ音声認識を行うものではない。
【０００３】
【発明が解決しようとする課題】
このサーバとクライアントにより構成される音声認識システムにおいては、サーバ側ＣＰＵの負荷と、クライアントとサーバとの間のネットワーク帯域の負荷とを抑えつつ音声認識を行うことができることが要求されている。本発明はサーバ側ＣＰＵの負荷と、クライアントとサーバとの間のネットワーク帯域の負荷とを抑えつつ音声認識を行うことを可能にすることを目的とするサーバとクライアントで音声認識を行うシステムを提供することである。
【０００４】
【課題を解決するための手段】
本発明の第１は、クライアントは、音声が入力されるとこの入力音声に対する音声データを生成し、前記音声データと１次音声認識のための辞書１に格納された複数の辞書データの夫々とを比較することにより１次音声認識を行い１次認識結果データを生成し、ここで、前記クライアントは、前記音声データと一致する辞書データが辞書１に格納されている場合、つまり、１次音声認識が可能な場合、前記１次認識結果データを前記サーバへ伝送し、前記サーバは、前記１次認識結果データを受信し、受信した前記１次音声結果データを前記音声認識システムの認識結果データとして得、一方、前記クライアントは、前記音声データと一致する辞書データが辞書１に格納されていない場合、つまり、１次音声認識が不可能な場合、前記音声データを前記サーバへ伝送し、前記サーバは、前記音声データを受信し、受信した前記音声データと２次音声認識のための辞書２に格納された複数の辞書データの夫々とを比較することにより２次音声認識を行い、２次認識結果データを生成し、前記２次認識結果データを前記音声認識システムの認識結果データとして得ることを備えたものである。
【０００５】
この構成により、前記クライアントで１次音声認識が可能な場合、前記クライアントが前記サーバへ前記音声データを伝送する必要がなく前記１次認識結果データのみ伝送することにより前記クライアントと前記サーバとの間のネットワーク帯域の負荷を減少させるという効果と、前記サーバが２次音声認識を行う必要がないため前記サーバ側ＣＰＵの負荷を前記クライアント側ＣＰＵに分散させることにより前記サーバ側ＣＰＵの負荷を減少させる効果とをもたらすものである。
【０００６】
本発明の第２は、前記サーバは、前記本発明の第１において前記クライアントで１次音声認識が不可能な場合、前記サーバで生成した２次認識結果データを前記クライアントへ伝送し、前記クライアントは、前記２次認識結果データを受信し、受信した前記２次認識結果データを１次音声認識のための前記辞書１に登録することを備えたものである。
【０００７】
この構成により、前記本発明の第１において前記クライアントの１次音声認識で不可能とされていた音声認識に対応する辞書データを前記辞書１に格納することにより、前記本発明の第１と比較して前記クライアントで１次音声認識が可能であるケースが増大するため、前記本発明の第１と比較して、前記クライアントと前記サーバとの間のネットワーク帯域の負荷を更に減少させる効果と、前記サーバ側ＣＰＵの負荷を更に減少させる効果とをもたらすものである。
【０００８】
本発明の第３は、前記クライアントは、前記本発明の第２において前記２次認識結果データを１次音声認識のための前記辞書１に登録する際に、前記辞書１に辞書データを格納するためのスペースが有る場合、受信した前記２次認識結果データを前記辞書１に登録し、一方、前記辞書１に辞書データを格納するためのスペースが無い場合、前記辞書１に格納される複数のデータの夫々について１次音声認識において前記音声データとの比較に用いられた回数に対する前記音声データと一致した回数の割合、つまり、１次音声認識可能確率を格納する１次音声認識可能確率テーブルを参照し、前記１次音声認識可能確率が最も低いものに対応する辞書データを前記辞書１から削除した後に前記辞書１に受信した前記２次認識結果データを登録することを備えたものである。
【０００９】
この構成により、前記本発明の第２において前記２次認識結果データを前記辞書１に登録する際に前記辞書１に辞書データを格納するためのスペースが無い場合には前記辞書１に格納された複数の辞書データのうち１次音声認識を可能とする確率が最も低い辞書データを削除することにより、前記辞書１に格納された複数の辞書データの何れかを無作為に削除する場合と比較して前記クライアントで１次音声認識が不可能であるケースが減少するため、前記本発明の第２のように１次音声認識を可能とする確率に応じて前記辞書１の辞書データを削除することを備えないものと比較して、前記クライアントと前記サーバとの間のネットワーク帯域の負荷を更に減少させる効果と、前記サーバ側ＣＰＵの負荷を更に減少させる効果とをもたらすものである。
【００１０】
本発明の第４は、前記クライアントは前記１次音声認識のための複数の辞書データの夫々をその辞書データを必要とする話者と関連付けて前記辞書１に登録し、前記サーバは前記２次音声認識のため複数の辞書データの夫々をその辞書データを必要とする話者と関連付けて前記辞書２に登録し、前記クライアントは音声が入力されるとこの入力音声に対する音声データを生成し、前記音声データを用いて音声識別を行うことにより話者の特定をし、話者が誰であるかを示す話者データを生成し、前記辞書１から前記話者データに対応する話者以外の話者と関連付けられた複数の辞書データ、つまり、前記話者に対応しない複数の辞書データを削除し、前記話者データを前記サーバへ伝送し、前記サーバは、前記話者データを受信し、前記辞書２に格納された複数のデータのうち受信した前記話者データに対応する話者と関連付けられた複数の辞書データ、つまり、前記話者に対応する複数の辞書データを前記クライアントに伝送し、前記クライアントは、前記話者に対応する複数の辞書データを受信し、受信した前記話者に対応する複数の辞書データを前記話者と関連付けて前記辞書１に登録することを備えたものである。
【００１１】
この構成により、前記本発明の第１と比較して前記クライアントの前記辞書１に格納され複数の辞書データのうち音声を入力した話者の音声認識に適した辞書データが増大することにより、前記本発明の第１と比較して前記クライアントで１次音声認識が可能であるケースが増大するため、前記本発明の第１と比較して、前記クライアントと前記サーバとの間のネットワーク帯域の負荷を更に減少させる効果と、前記サーバ側ＣＰＵの負荷を更に減少させる効果とをもたらすものである。
【００１２】
本発明の第５は、前記クライアントは、前記１次音声認識のための複数の辞書データを、前記辞書１を構成する辞書領域１と辞書領域２に分けて登録し、前記クライアントと前記サーバとの間のデータ伝送量を監視し、前記クライアントは、音声が入力されるとこの入力音声に対する音声データを生成し、監視した前記クライアントと前記サーバとの間のデータ伝送量の値が或る閾値以上である場合、前記辞書１の前記辞書領域１と前記辞書領域２の何れかに格納された辞書データと前記音声データとを比較することにより１次音声認識を行い、一方、監視した前記クライアントと前記サーバとの間のデータ伝送量の値が或る閾値未満である場合、前記辞書１の前記辞書領域１に格納された辞書データと前記音声データとを比較することにより１次音声認識を行うことを備えたものである。
【００１３】
この構成により、前記クライアントと前記サーバとの間のデータ伝送量に応じて１次音声認識で適用する前記辞書１の領域を制御することにより、前記クライアントと前記サーバとの間のデータ伝送量が多い場合は前記クライアントと前記サーバとの間のデータ伝送量が少ない場合と比較して１次音声認識が可能であるケースが増大するため、前記クライアントと前記サーバとの間のデータ伝送量が多い場合は前記クライアントと前記サーバとの間のデータ伝送量が少ない場合と比較して前記クライアントと前記サーバとの間のネットワーク帯域の負荷を更に減少させる効果をもたらすものである。
【００１４】
本発明の第６は、前記クライアントは、前記１次音声認識のための複数の辞書データを前記辞書１を構成する辞書領域１と辞書領域２に分けて登録し、前記サーバ側ＣＰＵの使用率を監視し、前記クライアントは、音声が入力されるとこの入力音声に対する音声データを生成し、監視した前記サーバ側ＣＰＵの使用率の値が或る閾値以上である場合、前記辞書１の前記辞書領域１と前記辞書領域２の何れかに格納された辞書データと前記音声データとを比較することにより１次音声認識を行い、一方、監視した前記サーバ側ＣＰＵの使用率の値が或る閾値未満である場合、前記辞書１の前記辞書領域１に格納された辞書データと前記音声データとを比較することにより１次音声認識を行うことを備えたものである。
【００１５】
この構成により、前記サーバ側ＣＰＵの使用率に応じて１次音声認識で適用する前記辞書１の領域を制御することにより、前記サーバ側ＣＰＵの使用率が高い場合は前記サーバ側ＣＰＵの使用率が低い場合と比較して１次音声認識が可能であるケースが増大するため、前記サーバ側ＣＰＵの使用率が高い場合は前記サーバ側ＣＰＵの使用率が低い場合と比較して前記サーバ側ＣＰＵの負荷を更に減少させる効果をもたらすものである。
【００１６】
【発明の実施の形態】
以下、本発明の実施の形態について、図１から図１２を用いて説明する。
【００１７】
（実施の形態１）
図１は実施の形態１における音声認識システムの構成図である。図１において、１０はクライアント、２０はサーバである。
【００１８】
次に、クライアント１０の構成について説明する。１１はマイクロフオン、１２は音声分析部、１３は辞書１記憶部、１４は１次音声認識部、１５は選択部、１６は送信部、１７は制御部である。
【００１９】
マイクロフオン１１は、音声を入力する。音声分析部１２は、マイクロフオン１１に入力された音声を分析し、音声データを生成する。辞書１記憶部１３は、１次音声認識を行う際に用いる複数の辞書データより構成される辞書１を記憶する。１次音声認識部１４は、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの夫々とを比較することにより１次音声認識を行う。
【００２０】
１次音声認識部１４は更に、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの何れかとが一致した場合、つまり、１次音声認識ができた場合、１次音声認識ができたことを示すフラグと１次音声認識結果である１次認識結果データとを生成し、一方、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの何れもが一致しなかった場合、つまり、１次音声認識ができなかった場合、１次音声認識ができなかったことを示すフラグを生成する。
【００２１】
選択部１５は、クライアント１０で生成された複数のデータからクライアント１０からサーバ２０へ送信すべきデータを選択する。選択部１５は更に、１次音声認識部１４で１次音声認識ができたことを示すフラグが生成されたことを確認した場合、１次音声認識部１４で生成された１次認識結果データを選択し、１次音声認識部１４で１次音声認識ができなかったことを示すフラグが生成されたことを確認した場合、音声分析部１２で生成された音声データを選択する。送信部１６は、選択部１５で選択されたデータをサーバ２０へ送信する。制御部１７は、クライアント１０側のＣＰＵを備え、１１〜１６の夫々の動作を制御する。
【００２２】
次に、サーバ２０の構成について説明する。２１は受信部、２２は辞書２記憶部、２３は２次音声認識部、２４は制御部である。
【００２３】
受信部２１は、クライアント１０から送信されたデータを受信する。受信部２１は更に、受信したデータが音声データである場合、音声データを受信したことを示すフラグを生成し、受信したデータが１次認識結果データである場合、１次認識結果データを受信したことを示すフラグを生成する。
【００２４】
辞書２記憶部２２は、２次音声認識を行う際に用いる複数の辞書データより構成される辞書２を記憶する。２次音声認識部２３は、受信部２１で受信されたデータと辞書２記憶部２２の辞書２に記憶された複数のデータの夫々とを比較することにより２次音声認識を行う。
【００２５】
２次音声認識部２３は更に、受信部２１で音声データを受信したことを示すフラグが生成されたことを確認した場合、受信部２１で受信されたデータと辞書２記憶部２２の辞書２に記憶された複数のデータの夫々とを比較することにより２次音声認識を行い、２次音声認識結果データである２次認識結果データを生成し、この２次認識結果データを音声認識システムの認識結果データとして扱い、受信部２１で１次認識結果データを受信したことを示すフラグが生成されたことを確認した場合、受信部２１で受信されたデータを音声認識システムの認識結果データとして扱う。制御部２４は、サーバ２０側のＣＰＵを備え、２１〜２３の夫々の動作を制御する。
【００２６】
図２に実施の形態１における音声認識システムで行われる処理手順のフローチャートを示す。図２に示すように、Ｓ２００においてクライアント１０に音声が入力されると処理を開始しＳ２０１へ移行する。Ｓ２０１において、クライアント１０で音声データを生成しＳ２０２へ移行する。Ｓ２０２において、クライアント１０で辞書１を用いて１次音声認識を行いＳ２０３へ移行する。Ｓ２０３において、クライアント１０で１次音声認識が可能であるかを確認し、１次音声認識が可能である場合、Ｓ２０４へ移行し、１次音声認識が不可能である場合、Ｓ２０８へ移行する。Ｓ２０４において、クライアント１０でサーバ２０へ１次認識結果データを送信しＳ２０５へ移行する。Ｓ２０５において、サーバ２０で１次認識結果データを受信しＳ２０６へ移行する。Ｓ２０６において、サーバ２０で１次認識結果データを音声認識システムの認識結果データとして得てＳ２０７へ移行する。
【００２７】
Ｓ２０８において、クライアント１０でサーバ２０へ音声データを送信しＳ２０９へ移行する。Ｓ２０９において、サーバ２０で音声データを受信しＳ２１０へ移行する。Ｓ２１０において、サーバ２０で辞書２を用いて２次音声認識を行い２次認識結果データを音声認識システムの認識結果データとして得てＳ２０７へ移行する。Ｓ２０７において、処理を終了する。
【００２８】
（実施の形態２）
図３は実施の形態２における音声認識システムの構成図である。図３において、１０はクライアント、２０はサーバである。
【００２９】
次に、クライアント１０の構成について説明する。１１はマイクロフオン、１２は音声分析部、１３は辞書１記憶部、１４は１次音声認識部、１５は選択部、１６は送信部、１７は制御部、１８は受信部である。
【００３０】
マイクロフオン１１は、音声を入力する。音声分析部１２は、マイクロフオン１１に入力された音声を分析し、音声データを生成する。辞書１記憶部１３は、１次音声認識を行う際に用いる複数の辞書データより構成される辞書１を記憶する。
【００３１】
辞書１記憶部１３は更に、受信部１８で後述する２次認識結果データを受信したことを示すフラグが生成されたことを確認した場合、受信部１８で受信されたデータを辞書１に記憶する。１次音声認識部１４は、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの夫々とを比較することにより１次音声認識を行う。
【００３２】
１次音声認識部１４は更に、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの何れかとが一致した場合、つまり、１次音声認識ができた場合、１次音声認識ができたことを示すフラグと１次音声認識結果である１次認識結果データとを生成し、一方、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの何れもが一致しなかった場合、つまり、１次音声認識ができなかった場合、１次音声認識ができなかったことを示すフラグを生成する。
【００３３】
選択部１５は、クライアント１０で生成された複数のデータからクライアント１０からサーバ２０へ送信すべきデータを選択する。選択部１５は更に、１次音声認識部１４で１次音声認識ができたことを示すフラグが生成されたことを確認した場合、１次音声認識部１４で生成された１次認識結果データを選択し、１次音声認識部１４で１次音声認識ができなかったことを示すフラグが生成されたことを確認した場合、音声分析部１２で生成された音声データを選択する。送信部１６は、選択部１５で選択されたデータをサーバ２０へ送信する。制御部１７は、クライアント１０側のＣＰＵを備え、１１〜１６，１８の夫々の動作を制御する。
【００３４】
受信部１８は、サーバ２０から送信されたデータを受信する。受信部１８は更に、受信したデータが２次認識結果データである場合、２次認識結果データを受信したことを示すフラグを生成する。
【００３５】
次に、サーバ２０の構成について説明する。２１は受信部、２２は辞書２記憶部、２３は２次音声認識部、２４は制御部、２５は選択部、２６は送信部である。
【００３６】
受信部２１は、クライアント１０から送信されたデータを受信する。受信部２１は更に、受信したデータが音声データである場合、音声データを受信したことを示すフラグを生成し、受信したデータが１次認識結果データである場合、１次認識結果データを受信したことを示すフラグを生成する。辞書２記憶部２２は、２次音声認識を行う際に用いる複数の辞書データより構成される辞書２を記憶する。２次音声認識部２３は、受信部２１で受信されたデータと辞書２記憶部２２の辞書２に記憶された複数のデータの夫々とを比較することにより２次音声認識を行う。
【００３７】
２次音声認識部２３は更に、受信部２１で音声データを受信したことを示すフラグが生成されたことを確認した場合、受信部２１で受信されたデータと辞書２記憶部２２の辞書２に記憶された複数のデータの夫々とを比較することにより２次音声認識を行い、２次音声認識を終了したことを示すフラグと２次音声認識結果データである２次認識結果データとを生成し、この２次認識結果データを音声認識システムの認識結果データとして扱い、受信部２１で１次認識結果データを受信したことを示すフラグが生成されたことを確認した場合、受信部２１で受信されたデータを音声認識システムの認識結果データとして扱う。
【００３８】
制御部２４は、サーバ２０側のＣＰＵを備え、２１〜２３，２５，２６の夫々の動作を制御する。選択部２５は、サーバ２０で生成された複数のデータからサーバ２０からクライアント１０へ送信すべきデータを選択する。選択部２５は更に、２次音声認識部２３で２次音声認識を終了したことを示すフラグが生成されたことを確認した場合、２次音声認識部２３で生成された２次認識結果データを選択する。送信部２６は、選択部２５で選択されたデータをクライアント１０へ送信する。
【００３９】
図４に実施の形態２における音声認識システムで行われる処理手順のフローチャートを示す。図４に示すように、Ｓ２００においてクライアント１０に音声が入力されると処理を開始しＳ２０１へ移行する。Ｓ２０１において、クライアント１０で音声データを生成しＳ２０２へ移行する。Ｓ２０２において、クライアント１０で辞書１を用いて１次音声認識を行いＳ２０３へ移行する。Ｓ２０３において、クライアント１０で１次音声認識が可能であるかを確認し、１次音声認識が可能である場合、Ｓ２０４へ移行し、１次音声認識が不可能である場合、Ｓ２０８へ移行する。Ｓ２０４において、クライアント１０でサーバ２０へ１次認識結果データを送信しＳ２０５へ移行する。Ｓ２０５において、サーバ２０で１次認識結果データを受信しＳ２０６へ移行する。Ｓ２０６において、サーバ２０で１次認識結果データを音声認識システムの認識結果データとして得てＳ２０７へ移行する。
【００４０】
Ｓ２０８において、クライアント１０でサーバ２０へ音声データを送信しＳ２０９へ移行する。Ｓ２０９において、サーバ２０で音声データを受信しＳ２１０へ移行する。Ｓ２１０において、サーバ２０で辞書２を用いて２次音声認識を行い２次認識結果データを音声認識システムの認識結果データとして得てＳ２１１へ移行する。Ｓ２１１において、サーバ２０でクライアント１０へ２次認識結果データを送信しＳ２１２へ移行する。Ｓ２１２において、クライアント１０で２次認識結果データを受信しＳ２１３へ移行する。Ｓ２１３においてクライアント１０で２次認識結果データを辞書１に記憶しＳ２０７へ移行する。Ｓ２０７において、処理を終了する。
【００４１】
（実施の形態３）
図５は実施の形態３における音声認識システムの構成図である。図５において、１０はクライアント、２０はサーバである。
【００４２】
次に、クライアント１０の構成について説明する。１１はマイクロフオン、１２は音声分析部、１３は辞書１記憶部、１４は１次音声認識部、１５は選択部、１６は送信部、１７は制御部、１８は受信部、１９は辞書１管理部である。
【００４３】
マイクロフオン１１は、音声を入力する。音声分析部１２は、マイクロフオン１１に入力された音声を分析し、音声データを生成する。辞書１記憶部１３は、１次音声認識を行う際に用いる複数の辞書データより構成される辞書１を記憶する。
【００４４】
辞書１記憶部１３は更に、受信部１８で後述する２次認識結果データを受信したことを示すフラグが生成されたことを確認した場合、辞書１における辞書データを記憶するためのスペースの有無を確認し、辞書１に辞書データを記憶するためのスペースが有る場合、受信部１８で受信されたデータを辞書１に記憶し、一方、辞書１に辞書データを記憶するためのスペースが無い場合、辞書１管理部１９で生成された削除アドレスデータに対応する辞書１の辞書データを削除し、受信部１８で受信されたデータを辞書１に記憶する。
【００４５】
１次音声認識部１４は、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの夫々とを比較することにより１次音声認識を行う。１次音声認識部１４は更に、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの何れかとが一致した場合、つまり、１次音声認識ができた場合、１次音声認識ができたことを示すフラグと１次音声認識結果である１次認識結果データとを生成し、一方、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの何れもが一致しなかった場合、つまり、１次音声認識ができなかった場合、１次音声認識ができなかったことを示すフラグを生成する。
【００４６】
選択部１５は、クライアント１０で生成された複数のデータからクライアント１０からサーバ２０へ送信すべきデータを選択する。選択部１５は更に、１次音声認識部１４で１次音声認識ができたことを示すフラグが生成されたことを確認した場合、１次音声認識部１４で生成された１次認識結果データを選択し、１次音声認識部１４で１次音声認識ができなかったことを示すフラグが生成されたことを確認した場合、音声分析部１２で生成された音声データを選択する。送信部１６は、選択部１５で選択されたデータをサーバ２０へ送信する。
【００４７】
制御部１７は、クライアント１０側のＣＰＵを備え、１１〜１６，１８，１９の夫々の動作を制御する。受信部１８は、サーバ２０から送信されたデータを受信する。受信部１８は更に、受信したデータが２次認識結果データである場合、２次認識結果データを受信したことを示すフラグを生成する。
【００４８】
辞書１管理部１９は、１次音声認識部１４における１次音声認識が行われる毎に、辞書１記憶部１３の辞書１に記憶された複数の辞書データの夫々について、１次音声認識部１４における１次音声認識を可能とした確率を記憶し、辞書１の複数の辞書データを管理する。辞書１管理部１９は更に、１次音声認識における音声データとの比較に用いられた回数に対する音声データと一致した回数の割合、つまり、１次音声認識可能確率を算出し、辞書データの１次音声認識可能確率と辞書データの格納場所、つまり、辞書データのアドレスとを関連付けて１次音声認識可能確率テーブルに記憶する。辞書１管理部１９は更に、１次音声認識可能確率テーブルを参照し、１次音声認識可能確率が最も低い辞書データのアドレスを示す削除アドレスデータを生成する。
【００４９】
次に、サーバ２０の構成について説明する。２１は受信部、２２は辞書２記憶部、２３は２次音声認識部、２４は制御部、２５は選択部、２６は送信部である。
【００５０】
受信部２１は、クライアント１０から送信されたデータを受信する。受信部２１は更に、受信したデータが音声データである場合、音声データを受信したことを示すフラグを生成し、受信したデータが１次認識結果データである場合、１次認識結果データを受信したことを示すフラグを生成する。辞書２記憶部２２は、２次音声認識を行う際に用いる複数の辞書データより構成される辞書２を記憶する。２次音声認識部２３は、受信部２１で受信されたデータと辞書２記憶部２２の辞書２に記憶された複数のデータの夫々とを比較することにより２次音声認識を行う。
【００５１】
２次音声認識部２３は更に、受信部２１で音声データを受信したことを示すフラグが生成されたことを確認した場合、受信部２１で受信されたデータと辞書２記憶部２２の辞書２に記憶された複数のデータの夫々とを比較することにより２次音声認識を行い、２次音声認識を終了したことを示すフラグと２次音声認識結果データである２次認識結果データとを生成し、この２次認識結果データを音声認識システムの認識結果データとして扱い、受信部２１で１次認識結果データを受信したことを示すフラグが生成されたことを確認した場合、受信部２１で受信されたデータを音声認識システムの認識結果データとして扱う。
【００５２】
制御部２４は、サーバ２０側のＣＰＵを備え、２１〜２３，２５，２６の夫々の動作を制御する。選択部２５は、サーバ２０で生成された複数のデータからサーバ２０からクライアント１０へ送信すべきデータを選択する。選択部２５は更に、２次音声認識部２３でフラグが生成されたことを確認した場合、２次音声認識部２３で生成された２次認識結果データを選択する。送信部２６は、選択部２５で選択されたデータをクライアント１０へ送信する。
【００５３】
図６に実施の形態３における音声認識システムで行われる処理手順のフローチャートを示す。図６に示すように、Ｓ２００においてクライアント１０に音声が入力されると処理を開始しＳ２０１へ移行する。Ｓ２０１において、クライアント１０で音声データを生成しＳ２０２へ移行する。Ｓ２０２において、クライアント１０で辞書１を用いて１次音声認識を行いＳ２０３へ移行する。Ｓ２０３において、クライアント１０で１次音声認識が可能であるかを確認し、１次音声認識が可能である場合、Ｓ２０４へ移行し、１次音声認識が不可能である場合、Ｓ２０８へ移行する。Ｓ２０４において、クライアント１０でサーバ２０へ１次認識結果データを送信しＳ２０５へ移行する。Ｓ２０５において、サーバ２０で１次認識結果データを受信しＳ２０６へ移行する。Ｓ２０６において、サーバ２０で１次認識結果データを音声認識システムの認識結果データとして得てＳ２０７へ移行する。
【００５４】
Ｓ２０８において、クライアント１０でサーバ２０へ音声データを送信しＳ２０９へ移行する。Ｓ２０９において、サーバ２０で音声データを受信しＳ２１０へ移行する。Ｓ２１０において、サーバ２０で辞書２を用いて２次音声認識を行い２次認識結果データを音声認識システムの認識結果データとして得てＳ２１１へ移行する。Ｓ２１１において、サーバ２０でクライアント１０へ２次認識結果データを送信しＳ２１２へ移行する。Ｓ２１２において、クライアント１０で２次認識結果データを受信しＳ２１４へ移行する。Ｓ２１４において、クライアント１０で辞書データを記憶するスペースが辞書１に有るかを確認し、辞書データを記憶するスペースが辞書１に有る場合、Ｓ２１３へ移行し、辞書データを記憶するスペースが辞書１に無い場合、Ｓ２１５へ移行する。Ｓ２１５において、クライアント１０で辞書１から１次音声認識可能確率が最も低い辞書データを削除しＳ２１３へ移行する。Ｓ２１３においてクライアント１０で２次認識結果データを辞書１に記憶しＳ２０７へ移行する。Ｓ２０７において、処理を終了する。
【００５５】
（実施の形態４）
図７は実施の形態４における音声認識システムの構成図である。図７において、１０はクライアント、２０はサーバである。
【００５６】
次に、クライアント１０の構成について説明する。１１はマイクロフオン、１２は音声分析部、１３は辞書１記憶部、１４は１次音声認識部、１５は選択部、１６は送信部、１７は制御部、１８は受信部、３０は音声識別部である。
【００５７】
マイクロフオン１１は、音声を入力する。音声分析部１２は、マイクロフオン１１に入力された音声を分析し、音声データを生成する。音声識別部３０は、音声分析部１２で生成された音声データを用いて音声を入力した話者の特定を行う。音声識別部３０は更に、話者を特定した場合、話者を特定したことを示すフラグと話者が誰であるかを示す話者データとを生成する。辞書１記憶部１３は、辞書１に１次音声認識を行う際に用いる複数の辞書データの夫々をその辞書データを必要とする話者と関連付けて記憶する。
【００５８】
辞書１記憶部１３は更に、音声識別部３０で話者を特定したことを示すフラグが生成されたことを確認した場合、音声識別部３０で生成された話者データに対応する話者以外の話者に関連付けられた辞書データを辞書１から削除し、受信部１８で後述する話者データに対応する話者と関連付けられた辞書データを受信したことを示すフラグが生成されたことを確認した場合、受信部１８で受信されたデータをその辞書データを必要とする話者と関連付けて辞書１に記憶する。１次音声認識部１４は、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの夫々とを比較することにより１次音声認識を行う。
【００５９】
１次音声認識部１４は更に、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの何れかとが一致した場合、つまり、１次音声認識ができた場合、１次音声認識ができたことを示すフラグと１次音声認識結果である１次認識結果データとを生成し、一方、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの何れもが一致しなかった場合、つまり、１次音声認識ができなかった場合、１次音声認識ができなかったことを示すフラグを生成する。
【００６０】
選択部１５は、クライアント１０で生成された複数のデータからクライアント１０からサーバ２０へ送信すべきデータを選択する。選択部１５は更に、１次音声認識部１４で１次音声認識ができたことを示すフラグが生成されたことを確認した場合、１次音声認識部１４で生成された１次認識結果データを選択し、１次音声認識部１４で１次音声認識ができなかったことを示すフラグが生成されたことを確認した場合、音声分析部１２で生成された音声データを選択し、音声識別部３０で話者を特定したことを示すフラグが生成されたことを確認した場合、音声識別部３０で生成された話者データを選択する。
【００６１】
送信部１６は、選択部１５で選択されたデータをサーバ２０へ送信する。制御部１７は、クライアント１０側のＣＰＵを備え、１１〜１６，１８，３０の夫々の動作を制御する。受信部１８は、サーバ２０から送信されたデータを受信する。受信部１８は更に、受信したデータが２次認識結果データである場合、２次認識結果データを受信したことを示すフラグを生成し、受信したデータが話者データに対応する話者と関連付けられた辞書データである場合、話者データに対応する話者と関連付けられた辞書データを受信したことを示すフラグを生成する。
【００６２】
次に、サーバ２０の構成について説明する。２１は受信部、２２は辞書２記憶部、２３は２次音声認識部、２４は制御部、２５は選択部、２６は送信部である。
【００６３】
受信部２１は、クライアント１０から送信されたデータを受信する。受信部２１は更に、受信したデータが音声データである場合、音声データを受信したことを示すフラグを生成し、受信したデータが１次認識結果データである場合、１次認識結果データを受信したことを示すフラグを生成し、受信したデータが話者データである場合、話者データを受信したことを示すフラグを生成する。辞書２記憶部２２は、辞書２に２次音声認識を行う際に用いる複数の辞書データの夫々をその辞書データを必要とする話者と関連付けて記憶する。２次音声認識部２３は、受信部２１で受信されたデータと辞書２記憶部２２の辞書２に記憶された複数のデータの夫々とを比較することにより２次音声認識を行う。
【００６４】
２次音声認識部２３は更に、受信部２１で音声データを受信したことを示すフラグが生成されたことを確認した場合、受信部２１で受信されたデータと辞書２記憶部２２の辞書２に記憶された複数のデータの夫々とを比較することにより２次音声認識を行い、２次音声認識を終了したことを示すフラグと２次音声認識結果データである２次認識結果データとを生成し、この２次認識結果データを音声認識システムの認識結果データとして扱い、受信部２１で１次認識結果データを受信したことを示すフラグが生成されたことを確認した場合、受信部２１で受信されたデータを音声認識システムの認識結果データとして扱う。
【００６５】
制御部２４は、サーバ２０側のＣＰＵを備え、２１〜２３，２５，２６の夫々の動作を制御する。選択部２５は、サーバ２０で生成された複数のデータからサーバ２０からクライアント１０へ送信すべきデータを選択する。選択部２５は更に、２次音声認識部２３で２次音声認識を終了したことを示すフラグが生成されたことを確認した場合、２次音声認識部２３で生成された２次認識結果データを選択し、受信部２１で話者データを受信したことを示すフラグが生成されたことを確認した場合、受信部２１で受信された話者データに対応する話者に関連付けられた辞書データを辞書２から選択する。送信部２６は、選択部２５で選択されたデータをクライアント１０へ送信する。
【００６６】
図８に実施の形態４における音声認識システムで行われる処理手順のフローチャートを示す。図８に示すように、Ｓ２００においてクライアント１０に音声が入力されると処理を開始しＳ２０１へ移行する。Ｓ２０１において、クライアント１０で音声データを生成しＳ２１６へ移行する。Ｓ２１６において、クライアント１０で音声識別を行い話者データを生成しＳ２１７へ移行する。Ｓ２１７において、クライアント１０で辞書１から話者データに対応する話者以外の話者に関連付けられた辞書データを削除しＳ２１８へ移行する。Ｓ２１８において、クライアント１０でサーバ２０へ話者データを送信しＳ２１９へ移行する。Ｓ２１９において、サーバ２０で話者データを受信しＳ２２０へ移行する。Ｓ２２０において、サーバ２０でクライアント１０へ辞書２の話者データに対応する話者に関連付けられた辞書データを送信しＳ２２１へ移行する。Ｓ２２１において、クライアント１０で話者データに対応する話者に関連付けられた辞書データを受信しＳ２２２へ移行する。Ｓ２２２において、クライアント１０で話者データに対応する話者に関連付けられた辞書データを辞書１に記憶しＳ２０２へ移行する。Ｓ２０２において、クライアント１０で辞書１を用いて１次音声認識を行いＳ２０３へ移行する。Ｓ２０３において、クライアント１０で１次音声認識が可能であるかを確認し、１次音声認識が可能である場合、Ｓ２０４へ移行し、１次音声認識が不可能である場合、Ｓ２０８へ移行する。Ｓ２０４において、クライアント１０でサーバ２０へ１次認識結果データを送信しＳ２０５へ移行する。Ｓ２０５において、サーバ２０で１次認識結果データを受信しＳ２０６へ移行する。Ｓ２０６において、サーバ２０で１次認識結果データを音声認識システムの認識結果データとして得てＳ２０７へ移行する。
【００６７】
Ｓ２０８において、クライアント１０でサーバ２０へ音声データを送信しＳ２０９へ移行する。Ｓ２０９において、サーバ２０で音声データを受信しＳ２１０へ移行する。Ｓ２１０において、サーバ２０で辞書２を用いて２次音声認識を行い２次認識結果データを音声認識システムの認識結果データとして得てＳ２０７へ移行する。Ｓ２０７において、処理を終了する。
【００６８】
（実施の形態５）
図９は実施の形態５における音声認識システムの構成図である。図９において、１０はクライアント、２０はサーバである。
【００６９】
次に、クライアント１０の構成について説明する。１１はマイクロフオン、１２は音声分析部、１３は辞書１記憶部、１４は１次音声認識部、１５は選択部、１６は送信部、１７は制御部、１８は受信部、３１は伝送量監視部である。
【００７０】
マイクロフオン１１は、音声を入力する。音声分析部１２は、マイクロフオン１１に入力された音声を分析し、音声データを生成する。辞書１記憶部１３は、１次音声認識を行う際に用いる複数の辞書データを、辞書１を構成する辞書領域１と辞書領域２に分けて記憶する。１次音声認識部１４は、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの夫々とを比較することにより１次音声認識を行う。
【００７１】
１次音声認識部１４は更に、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの何れかとが一致した場合、つまり、１次音声認識ができた場合、１次音声認識ができたことを示すフラグと１次音声認識結果である１次認識結果データとを生成し、一方、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの何れもが一致しなかった場合、つまり、１次音声認識ができなかった場合、１次音声認識ができなかったことを示すフラグを生成する。
【００７２】
１次音声認識部１４は更に、伝送量監視部３１で後述するデータ伝送量が或る閾値以上であることを示すフラグが生成されたことを確認した場合、辞書１の辞書領域１と辞書領域２の何れかに記憶された辞書データと前記音声データとを比較することにより１次音声認識を行い、伝送量監視部３１で後述するデータ伝送量が或る閾値未満であることを示すフラグが生成されたことを確認した場合辞書１の辞書領域１に記憶された辞書データと前記音声データとを比較することにより１次音声認識を行う。選択部１５は、クライアント１０で生成された複数のデータからクライアント１０からサーバ２０へ送信すべきデータを選択する。
【００７３】
選択部１５は更に、１次音声認識部１４で１次音声認識ができたことを示すフラグが生成されたことを確認した場合、１次音声認識部１４で生成された１次認識結果データを選択し、１次音声認識部１４で１次音声認識ができなかったことを示すフラグが生成されたことを確認した場合、音声分析部１２で生成された音声データを選択する。送信部１６は、選択部１５で選択されたデータをサーバ２０へ送信する。制御部１７は、クライアント１０側のＣＰＵを備え、１１〜１６，１８，３１の夫々の動作を制御する。受信部１８は、サーバ２０から送信されたデータを受信する。
【００７４】
伝送量監視部３１は、クライアント１０とサーバ２０との間のデータ伝送量を監視する。伝送量監視部３１は更に、送信部１６で送信されたデータ量と受信部１８で受信されたデータ量との和、つまり、クライアント１０とサーバ２０との間のデータ伝送量を算出し、クライアント１０とサーバ２０との間のデータ伝送量が或る閾値以上である場合、クライアント１０とサーバ２０との間のデータ伝送量が或る閾値以上であることを示すフラグを生成し、クライアント１０とサーバ２０との間のデータ伝送量が或る閾値未満である場合、クライアント１０とサーバ２０との間のデータ伝送量が或る閾値未満であることを示すフラグを生成する。
【００７５】
次に、サーバ２０の構成について説明する。２１は受信部、２２は辞書２記憶部、２３は２次音声認識部、２４は制御部、２５は選択部、２６は送信部である。
【００７６】
受信部２１は、クライアント１０から送信されたデータを受信する。受信部２１は更に、受信したデータが音声データである場合、音声データを受信したことを示すフラグを生成し、受信したデータが１次認識結果データである場合、１次認識結果データを受信したことを示すフラグを生成する。辞書２記憶部２２は、２次音声認識を行う際に用いる複数の辞書データより構成される辞書２を記憶する。２次音声認識部２３は、受信部２１で受信されたデータと辞書２記憶部２２の辞書２に記憶された複数のデータの夫々とを比較することにより２次音声認識を行う。
【００７７】
２次音声認識部２３は更に、受信部２１で音声データを受信したことを示すフラグが生成されたことを確認した場合、受信部２１で受信されたデータと辞書２記憶部２２の辞書２に記憶された複数のデータの夫々とを比較することにより２次音声認識を行い、２次音声認識結果データである２次認識結果データを生成し、この２次認識結果データを音声認識システムの認識結果データとして扱い、受信部２１で１次認識結果データを受信したことを示すフラグが生成されたことを確認した場合、受信部２１で受信されたデータを音声認識システムの認識結果データとして扱う。
【００７８】
制御部２４は、サーバ２０側のＣＰＵを備え、２１〜２３，２５，２６の夫々の動作を制御する。選択部２５は、サーバ２０で生成された複数のデータからサーバ２０からクライアント１０へ送信すべきデータを選択する。送信部２６は、選択部２５で選択されたデータをクライアント１０へ送信する。
【００７９】
図１０に実施の形態５における音声認識システムで行われる処理手順のフローチャートを示す。図１０に示すように、Ｓ２００においてクライアント１０に音声が入力されると処理を開始しＳ２０１へ移行する。Ｓ２０１において、クライアント１０で音声データを生成しＳ２２３へ移行する。Ｓ２２３において、クライアント１０とサーバ２０との間のデータ伝送量は閾値以上であるかを確認し、クライアント１０とサーバ２０との間のデータ伝送量が閾値以上である場合、Ｓ２２４へ移行し、クライアント１０とサーバ２０との間のデータ伝送量が閾値未満である場合、Ｓ２２５へ移行する。Ｓ２２４において、クライアント１０で辞書１の辞書領域１と辞書領域２とを用いて１次音声認識を行いＳ２０３へ移行する。Ｓ２２５において、クライアント１０で辞書１の辞書領域１を用いて１次音声認識を行いＳ２０３へ移行する。
【００８０】
Ｓ２０３において、クライアント１０で１次音声認識が可能であるかを確認し、１次音声認識が可能である場合、Ｓ２０４へ移行し、１次音声認識が不可能である場合、Ｓ２０８へ移行する。Ｓ２０４において、クライアント１０でサーバ２０へ１次認識結果データを送信しＳ２０５へ移行する。Ｓ２０５において、サーバ２０で１次認識結果データを受信しＳ２０６へ移行する。Ｓ２０６において、サーバ２０で１次認識結果データを音声認識システムの認識結果データとして得てＳ２０７へ移行する。Ｓ２０８において、クライアント１０でサーバ２０へ音声データを送信しＳ２０９へ移行する。Ｓ２０９において、サーバ２０で音声データを受信しＳ２１０へ移行する。Ｓ２１０において、サーバ２０で辞書２を用いて２次音声認識を行い２次認識結果データを音声認識システムの認識結果データとして得てＳ２０７へ移行する。Ｓ２０７において、処理を終了する。
【００８１】
（実施の形態６）
図１１は実施の形態６における音声認識システムの構成図である。図１１において、１０はクライアント、２０はサーバである。
【００８２】
次に、クライアント１０の構成について説明する。１１はマイクロフオン、１２は音声分析部、１３は辞書１記憶部、１４は１次音声認識部、１５は選択部、１６は送信部、１７は制御部、１８は受信部、３２はサーバ監視部である。
【００８３】
マイクロフオン１１は、音声を入力する。音声分析部１２は、マイクロフオン１１に入力された音声を分析し、音声データを生成する。辞書１記憶部１３は、１次音声認識を行う際に用いる複数の辞書データを、辞書１を構成する辞書領域１と辞書領域２に分けて記憶する。１次音声認識部１４は、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの夫々とを比較することにより１次音声認識を行う。
【００８４】
１次音声認識部１４は更に、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの何れかとが一致した場合、つまり、１次音声認識ができた場合、１次音声認識ができたことを示すフラグと１次音声認識結果である１次認識結果データとを生成し、一方、音声分析部１２で生成された音声データと辞書１記憶部１３の辞書１に記憶された複数のデータの何れもが一致しなかった場合、つまり、１次音声認識ができなかった場合、１次音声認識ができなかったことを示すフラグを生成する。
【００８５】
１次音声認識部１４は更に、サーバ監視部３２で後述するサーバ２０側ＣＰＵ使用率が或る閾値以上であることを示すフラグが生成されたことを確認した場合、辞書１の辞書領域１と辞書領域２の何れかに記憶された辞書データと前記音声データとを比較することにより１次音声認識を行い、サーバ監視部３２で後述するサーバ２０側ＣＰＵ使用率が或る閾値未満であることを示すフラグが生成されたことを確認した場合辞書１の辞書領域１に記憶された辞書データと前記音声データとを比較することにより１次音声認識を行う。選択部１５は、クライアント１０で生成された複数のデータからクライアント１０からサーバ２０へ送信すべきデータを選択する。
【００８６】
選択部１５は更に、１次音声認識部１４で１次音声認識ができたことを示すフラグが生成されたことを確認した場合、１次音声認識部１４で生成された１次認識結果データを選択し、１次音声認識部１４で１次音声認識ができなかったことを示すフラグが生成されたことを確認した場合、音声分析部１２で生成された音声データを選択する。送信部１６は、選択部１５で選択されたデータをサーバ２０へ送信する。制御部１７は、クライアント１０側のＣＰＵを備え、１１〜１６，１８，３２の夫々の動作を制御する。受信部１８は、サーバ２０から送信されたデータを受信する。受信部１８は更に、受信したデータがサーバ２０側ＣＰＵ使用率データである場合、サーバ２０側ＣＰＵ使用率データを受信したことを示すフラグを生成する。
【００８７】
サーバ監視部３２は、サーバ２０側のＣＰＵの使用率を監視する。サーバ監視部３２は更に、受信部１８でサーバ２０側ＣＰＵ使用率データを受信したことを示すフラグが生成されたことを確認した場合、受信部１８で受信されたデータを用いてサーバ２０側ＣＰＵ使用率を算出し、サーバ２０側ＣＰＵ使用率が或る閾値以上である場合、サーバ２０側ＣＰＵ使用率が或る閾値以上であることを示すフラグを生成し、サーバ２０側ＣＰＵ使用率が或る閾値未満である場合、サーバ２０側ＣＰＵ使用率が或る閾値未満であることを示すフラグを生成する。
【００８８】
次に、サーバ２０の構成について説明する。２１は受信部、２２は辞書２記憶部、２３は２次音声認識部、２４は制御部、２５は選択部、２６は送信部である。
【００８９】
受信部２１は、クライアント１０から送信されたデータを受信する。受信部２１は更に、受信したデータが音声データである場合、音声データを受信したことを示すフラグを生成し、受信したデータが１次認識結果データである場合、１次認識結果データを受信したことを示すフラグを生成する。辞書２記憶部２２は、２次音声認識を行う際に用いる複数の辞書データより構成される辞書２を記憶する。２次音声認識部２３は、受信部２１で受信されたデータと辞書２記憶部２２の辞書２に記憶された複数のデータの夫々とを比較することにより２次音声認識を行う。
【００９０】
２次音声認識部２３は更に、受信部２１で音声データを受信したことを示すフラグが生成されたことを確認した場合、受信部２１で受信されたデータと辞書２記憶部２２の辞書２に記憶された複数のデータの夫々とを比較することにより２次音声認識を行い、２次音声認識結果データである２次認識結果データを生成し、この２次認識結果データを音声認識システムの認識結果データとして扱い、受信部２１で１次認識結果データを受信したことを示すフラグが生成されたことを確認した場合、受信部２１で受信されたデータを音声認識システムの認識結果データとして扱う。制御部２４は、サーバ２０側のＣＰＵを備え、２１〜２３，２５，２６の夫々の動作を制御する。
【００９１】
制御部２４は更に、サーバ２０側のＣＰＵの使用率を算出し、サーバ２０側のＣＰＵの使用率を算出したことを示すフラグとサーバ２０側のＣＰＵの使用率を示すサーバ２０側ＣＰＵ使用率データとを生成する。選択部２５は、サーバ２０で生成された複数のデータからサーバ２０からクライアント１０へ送信すべきデータを選択する。選択部２５は更に、制御部２４でサーバ２０側のＣＰＵの使用率を算出したことを示すフラグが生成されたことを確認した場合、制御部２４で生成されたサーバ２０側ＣＰＵ使用率データを選択する。送信部２６は、選択部２５で選択されたデータをクライアント１０へ送信する。
【００９２】
図１２に実施の形態６における音声認識システムで行われる処理手順のフローチャートを示す。図１２に示すように、Ｓ２００においてクライアント１０に音声が入力されると処理を開始しＳ２０１へ移行する。Ｓ２０１において、クライアント１０で音声データを生成しＳ２２６へ移行する。Ｓ２２６において、サーバ２０側のＣＰＵの使用率は閾値以上であるかを確認し、サーバ２０側のＣＰＵの使用率が閾値以上である場合、Ｓ２２４へ移行し、サーバ２０側のＣＰＵの使用率が閾値未満である場合、Ｓ２２５へ移行する。Ｓ２２４において、クライアント１０で辞書１の辞書領域１と辞書領域２とを用いて１次音声認識を行いＳ２０３へ移行する。Ｓ２２５において、クライアント１０で辞書１の辞書領域１を用いて１次音声認識を行いＳ２０３へ移行する。
【００９３】
Ｓ２０３において、クライアント１０で１次音声認識が可能であるかを確認し、１次音声認識が可能である場合、Ｓ２０４へ移行し、１次音声認識が不可能である場合、Ｓ２０８へ移行する。Ｓ２０４において、クライアント１０でサーバ２０へ１次認識結果データを送信しＳ２０５へ移行する。Ｓ２０５において、サーバ２０で１次認識結果データを受信しＳ２０６へ移行する。Ｓ２０６において、サーバ２０で１次認識結果データを音声認識システムの認識結果データとして得てＳ２０７へ移行する。Ｓ２０８において、クライアント１０でサーバ２０へ音声データを送信しＳ２０９へ移行する。Ｓ２０９において、サーバ２０で音声データを受信しＳ２１０へ移行する。Ｓ２１０において、サーバ２０で辞書２を用いて２次音声認識を行い２次認識結果データを音声認識システムの認識結果データとして得てＳ２０７へ移行する。Ｓ２０７において、処理を終了する。
【００９４】
【発明の効果】
以上のように本発明によれば、サーバ側ＣＰＵの負荷と、クライアントとサーバとの間のネットワーク帯域の負荷とを抑えつつ音声認識を行うことを可能とすることができる。
【図面の簡単な説明】
【図１】実施の形態１における音声認識システムの構成を示す図
【図２】実施の形態１における音声認識システムで行われる処理手順のフローチャート
【図３】実施の形態２における音声認識システムの構成を示す図
【図４】実施の形態２における音声認識システムで行われる処理手順のフローチャート
【図５】実施の形態３における音声認識システムの構成を示す図
【図６】実施の形態３における音声認識システムで行われる処理手順のフローチャート
【図７】実施の形態４における音声認識システムの構成を示す図
【図８】実施の形態４における音声認識システムで行われる処理手順のフローチャート
【図９】実施の形態５における音声認識システムの構成を示す図
【図１０】実施の形態５における音声認識システムで行われる処理手順のフローチャート
【図１１】実施の形態６における音声認識システムの構成を示す図
【図１２】実施の形態６における音声認識システムで行われる処理手順のフローチャート
【符号の説明】
１０　クライアント
１１　マイクロフオン
１２　音声分析部
１３　辞書１記憶部
１４　１次音声認識部
１５　選択部
１６　送信部
１７　制御部
１８　受信部
１９　辞書１管理部
２０　サーバ
２１　受信部
２２　辞書２記憶部
２３　２次音声認識部
２４　制御部
２５　選択部
２６　送信部
３０　音声識別部
３１　伝送量監視部
３２　サーバ監視部
Ｓ２００　クライアント１０に音声が入力されると処理を開始するステップ
Ｓ２０１　クライアント１０で音声データを生成するステップ
Ｓ２０２　クライアント１０で辞書１を用いて１次音声認識を行うステップ
Ｓ２０３　クライアント１０で１次音声認識が可能であるかを確認するステップ
Ｓ２０４　クライアント１０でサーバ２０へ１次認識結果データを送信するステップ
Ｓ２０５　サーバ２０で１次認識結果データを受信するステップ
Ｓ２０６　サーバ２０で１次認識結果データを音声認識システムの認識結果データとして得るステップ
Ｓ２０７　処理を終了するステップ
Ｓ２０８　クライアント１０でサーバ２０へ音声データを送信するステップ
Ｓ２０９　サーバ２０で音声データを受信するステップ
Ｓ２１０　サーバ２０で辞書２を用いて２次音声認識を行い２次認識結果データを音声認識システムの認識結果データとして得るステップ
Ｓ２１１　サーバ２０でクライアント１０へ２次認識結果データを送信するステップ
Ｓ２１２　クライアント１０で２次認識結果データを受信するステップ
Ｓ２１３　クライアント１０で２次認識結果データを辞書１に記憶するステップ
Ｓ２１４　クライアント１０で辞書データを記憶するスペースが辞書１に有るかを確認するステップ
Ｓ２１５　クライアント１０で辞書１から１次音声認識可能確率が最も低い辞書データを削除するステップ
Ｓ２１６　クライアント１０で音声識別を行い話者データを生成するステップ
Ｓ２１７　クライアント１０で辞書１から話者データに対応する話者以外の話者に関連付けられた辞書データを削除するステップ
Ｓ２１８　クライアント１０でサーバ２０へ話者データを送信するステップ
Ｓ２１９　サーバ２０で話者データを受信するステップ
Ｓ２２０　サーバ２０でクライアント１０へ辞書２の話者データに対応する話者に関連付けられた辞書データを送信するステップ
Ｓ２２１　クライアント１０で話者データに対応する話者に関連付けられた辞書データを受信するステップ
Ｓ２２２　クライアント１０で話者データに対応する話者に関連付けられた辞書データを辞書１に記憶するステップ
Ｓ２２３　クライアント１０とサーバ２０との間のデータ伝送量は閾値以上であるかを確認するステップ
Ｓ２２４　クライアント１０で辞書１の辞書領域１と辞書領域２とを用いて１次音声認識を行うステップ
Ｓ２２５　クライアント１０で辞書１の辞書領域１を用いて１次音声認識を行うステップ
Ｓ２２６　サーバ２０側のＣＰＵの使用率は閾値以上であるかを確認するステップ

[0001]
[Technical Field of the Invention]
The present invention relates to a speech recognition system including a server and a client, a speech recognition client, a speech recognition server, a speech recognition client program, and a speech recognition server program.
[0002]
[Prior art]
Conventionally, a speech recognition system composed of a server and a client is known from Japanese Patent Application Laid-Open No. 2001-142488. In that system, voice input is performed at the client, the voice data obtained by the client is sent to the server, and speech recognition is performed at the server. Japanese Patent Application Laid-Open No. 2001-337695 describes another server-client speech recognition system, in which the client encodes the voice data in order to suppress the load on the network bandwidth between the client and the server. In both systems, however, speech recognition is performed only at the server, so the load on the server-side CPU rises. Neither system performs speech recognition while suppressing both the load on the server-side CPU and the load on the network bandwidth between the client and the server.
[0003]
[Problems to be solved by the invention]
A speech recognition system composed of a server and a client is required to perform speech recognition while suppressing both the load on the server-side CPU and the load on the network bandwidth between the client and the server. An object of the present invention is to provide a system in which a server and a client perform speech recognition while suppressing both of these loads.
[0004]
[Means for Solving the Problems]
According to a first aspect of the present invention, when voice is input, the client generates voice data for the input voice and performs primary speech recognition by comparing the voice data with each of a plurality of dictionary data stored in dictionary 1, which is used for primary speech recognition, thereby generating primary recognition result data. If dictionary data matching the voice data is stored in dictionary 1, that is, if primary speech recognition is possible, the client transmits the primary recognition result data to the server; the server receives the primary recognition result data and takes it as the recognition result data of the speech recognition system. If, on the other hand, no dictionary data matching the voice data is stored in dictionary 1, that is, if primary speech recognition is not possible, the client transmits the voice data to the server; the server receives the voice data, performs secondary speech recognition by comparing it with each of a plurality of dictionary data stored in dictionary 2, which is used for secondary speech recognition, generates secondary recognition result data, and takes the secondary recognition result data as the recognition result data of the speech recognition system.
[0005]
With this configuration, when primary speech recognition is possible at the client, the client does not need to transmit the voice data to the server and transmits only the primary recognition result data, which reduces the load on the network bandwidth between the client and the server. In addition, the server does not need to perform secondary speech recognition in that case, so part of the server-side CPU load is distributed to the client-side CPU and the load on the server-side CPU is reduced.
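The routing described in the first aspect can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Message`, `client_recognize`, and `server_handle` are hypothetical, voice data is modeled as a plain string, and an exact dictionary lookup stands in for the comparison between voice data and dictionary data.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Message:
    """What the client sends to the server (hypothetical wire format)."""
    kind: str      # "result": primary recognition result, "voice": raw voice data
    payload: str


def client_recognize(voice_data: str, dictionary1: Dict[str, str]) -> Message:
    """Attempt primary recognition; fall back to sending the voice data."""
    # Assumption: an exact key match stands in for the dictionary comparison.
    result = dictionary1.get(voice_data)
    if result is not None:
        return Message("result", result)   # small payload, no server-side recognition
    return Message("voice", voice_data)    # server must run secondary recognition


def server_handle(msg: Message, dictionary2: Dict[str, str]) -> Optional[str]:
    """Return the recognition result of the overall system."""
    if msg.kind == "result":
        return msg.payload                  # adopt the primary result as-is
    return dictionary2.get(msg.payload)     # secondary recognition with dictionary 2


dictionary1 = {"v-hello": "hello"}
dictionary2 = {"v-hello": "hello", "v-weather": "weather"}

local = client_recognize("v-hello", dictionary1)
remote = client_recognize("v-weather", dictionary1)
```

Here `local` carries only the recognition result, while `remote` carries the voice data and triggers secondary recognition on the server.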
[0006]
According to a second aspect of the present invention, when the client cannot perform primary speech recognition in the first aspect, the server transmits the secondary recognition result data that it generated to the client, and the client receives the secondary recognition result data and registers it in dictionary 1 for primary speech recognition.
[0007]
With this configuration, dictionary data corresponding to speech that the client's primary speech recognition could not recognize in the first aspect is stored in dictionary 1. Compared with the first aspect, primary speech recognition therefore succeeds at the client in more cases, which further reduces the load on the network bandwidth between the client and the server and further reduces the load on the server-side CPU.
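A minimal sketch of the second aspect, under the same simplifying assumptions (string voice data, exact-match lookup). The classes `Client` and `Server` and the counter `voice_sent` are hypothetical names introduced for illustration. The text does not say what happens when secondary recognition also fails, so this sketch simply registers nothing in that case.

```python
from typing import Dict, List, Optional


class Server:
    def __init__(self, dictionary2: Dict[str, str]) -> None:
        self.dictionary2 = dictionary2
        self.results: List[str] = []        # recognition results of the system

    def accept_result(self, result: str) -> None:
        """Adopt a primary recognition result sent by the client."""
        self.results.append(result)

    def recognize(self, voice_data: str) -> Optional[str]:
        """Secondary recognition; the result is also returned to the client."""
        result = self.dictionary2.get(voice_data)
        if result is not None:
            self.results.append(result)
        return result


class Client:
    def __init__(self, dictionary1: Dict[str, str]) -> None:
        self.dictionary1 = dict(dictionary1)
        self.voice_sent = 0                 # how often raw voice data was transmitted

    def recognize(self, voice_data: str, server: Server) -> Optional[str]:
        result = self.dictionary1.get(voice_data)
        if result is not None:
            server.accept_result(result)
            return result
        self.voice_sent += 1
        result = server.recognize(voice_data)
        if result is not None:
            # Register the secondary result so the next attempt succeeds locally.
            self.dictionary1[voice_data] = result
        return result


server = Server({"v-hello": "hello", "v-weather": "weather"})
client = Client({"v-hello": "hello"})
first = client.recognize("v-weather", server)   # goes to the server
second = client.recognize("v-weather", server)  # now recognized at the client
```

After the first miss, the same utterance is handled by primary recognition, so the voice data is transmitted only once.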
[0008]
According to a third aspect of the present invention, when the client registers the secondary recognition result data in dictionary 1 as in the second aspect, the client first checks whether dictionary 1 has space to store dictionary data. If it does, the client registers the received secondary recognition result data in dictionary 1. If it does not, the client refers to a primary speech recognizable probability table, which stores, for each of the plurality of data stored in dictionary 1, the primary speech recognizable probability, that is, the ratio of the number of times the data matched the voice data to the number of times it was used for comparison with voice data in primary speech recognition. The client deletes from dictionary 1 the dictionary data with the lowest primary speech recognizable probability and then registers the received secondary recognition result data in dictionary 1.
[0009]
With this configuration, when the secondary recognition result data is to be registered in dictionary 1 as in the second aspect and dictionary 1 has no space to store dictionary data, the dictionary data with the lowest probability of enabling primary speech recognition is deleted from among the dictionary data stored in dictionary 1. Compared with deleting one of the stored dictionary data at random, this reduces the number of cases in which primary speech recognition is not possible at the client. Consequently, compared with the second aspect, which does not delete dictionary data from dictionary 1 according to that probability, the load on the network bandwidth between the client and the server and the load on the server-side CPU are both further reduced.
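The replacement policy of the third aspect can be sketched as below. The class `Dictionary1`, the `Stats` record, and the fixed `capacity` are hypothetical; the probability follows the definition in the text (matches divided by comparisons). Treating an entry that has never been compared as having probability 0.0 is an assumption of this sketch, not something the text specifies.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Stats:
    compared: int = 0   # times the entry was compared with voice data
    matched: int = 0    # times the entry matched the voice data

    @property
    def probability(self) -> float:
        """Primary speech recognizable probability = matched / compared."""
        # Assumption: entries never compared yet count as 0.0.
        return self.matched / self.compared if self.compared else 0.0


class Dictionary1:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: Dict[str, str] = {}
        self.table: Dict[str, Stats] = {}   # the recognizable probability table

    def recognize(self, voice_data: str) -> Optional[str]:
        # Primary recognition compares the voice data with every stored entry.
        for key, stats in self.table.items():
            stats.compared += 1
            if key == voice_data:
                stats.matched += 1
        return self.entries.get(voice_data)

    def register(self, voice_data: str, result: str) -> None:
        if voice_data not in self.entries and len(self.entries) >= self.capacity:
            # No space: delete the entry with the lowest probability first.
            victim = min(self.table, key=lambda k: self.table[k].probability)
            del self.entries[victim]
            del self.table[victim]
        self.entries[voice_data] = result
        self.table.setdefault(voice_data, Stats())


d1 = Dictionary1(capacity=2)
d1.register("a", "A")
d1.register("b", "B")
for utterance in ["a", "a", "a", "b"]:
    d1.recognize(utterance)
d1.register("c", "C")   # dictionary is full, so the weakest entry is replaced
```

In this run entry "a" matched 3 of 4 comparisons and entry "b" matched 1 of 4, so "b" is the one deleted to make room for "c".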
[0010]
In a fourth aspect of the present invention, the client registers each of the plurality of dictionary data for primary speech recognition in the dictionary 1 in association with a speaker who needs the dictionary data, and the server registers each of the plurality of dictionary data for secondary speech recognition in the dictionary 2 in association with a speaker who needs the dictionary data. When a voice is input, the client generates voice data for the input voice, identifies the speaker by performing voice identification using the voice data, and generates speaker data indicating who the speaker is. The client deletes from the dictionary 1 the plurality of dictionary data associated with speakers other than the speaker corresponding to the speaker data, that is, the plurality of dictionary data that does not correspond to the speaker, and transmits the speaker data to the server. The server receives the speaker data and transmits to the client, from among the plurality of dictionary data stored in the dictionary 2, the plurality of dictionary data associated with the speaker corresponding to the received speaker data, that is, the plurality of dictionary data corresponding to the speaker. The client receives the plurality of dictionary data corresponding to the speaker and registers it in the dictionary 1 in association with the speaker.
[0011]
With this configuration, among the plurality of dictionary data stored in the dictionary 1 of the client, the number of dictionary data suitable for speech recognition of the speaker who input the voice increases as compared with the first aspect of the present invention. The number of cases in which primary speech recognition is possible in the client therefore also increases as compared with the first aspect, which has the effect of further reducing the load on the network band between the client and the server and the effect of further reducing the load on the server-side CPU.
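The dictionary replacement of the fourth aspect can be sketched as follows. This is only an illustration under stated assumptions, not the method of the specification: the function name `switch_speaker`, the modelling of each dictionary entry as a `(speaker, text)` pair keyed by a feature string, and the direct function call standing in for the client-server transmission are all hypothetical.

```python
def switch_speaker(dictionary_1, dictionary_2, speaker):
    """Keep only `speaker`'s entries in dictionary 1, then refill from dictionary 2.

    Assumption: each dictionary maps a feature key to a (speaker, text) pair.
    """
    # Client side: delete dictionary data associated with other speakers.
    for key in [k for k, (owner, _) in dictionary_1.items() if owner != speaker]:
        del dictionary_1[key]
    # Server side: pick the dictionary data associated with this speaker and
    # "transmit" it; the client registers it in association with the speaker.
    for key, (owner, text) in dictionary_2.items():
        if owner == speaker:
            dictionary_1[key] = (owner, text)
    return dictionary_1
```

After the call, dictionary 1 holds only entries for the identified speaker, which is the condition the stated effect relies on.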
[0012]
According to a fifth aspect of the present invention, the client registers the plurality of dictionary data for primary speech recognition separately in a dictionary area 1 and a dictionary area 2 that constitute the dictionary 1, and monitors the data transmission amount between the client and the server. When a voice is input, the client generates voice data for the input voice. When the monitored data transmission amount between the client and the server is equal to or greater than a certain threshold, the client performs primary speech recognition by comparing the voice data with the dictionary data stored in either of the dictionary area 1 and the dictionary area 2 of the dictionary 1. When the monitored data transmission amount is less than the threshold, the client performs primary speech recognition by comparing the voice data with the dictionary data stored in the dictionary area 1 of the dictionary 1.
[0013]
According to this configuration, the area of the dictionary 1 applied in primary speech recognition is controlled according to the data transmission amount between the client and the server. When the data transmission amount is large, the number of cases in which primary speech recognition is possible increases as compared with when it is small, so the load on the network band between the client and the server is reduced further when the data transmission amount is large than when it is small.
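The threshold rule of the fifth aspect can be sketched as below. The function name `select_entries`, the representation of each dictionary area as a mapping from a feature key to a recognition result, and the unit of the transmission amount are assumptions made for illustration; the specification does not define them.

```python
def select_entries(area_1, area_2, transmission_amount, threshold):
    """Return the dictionary data to compare against in primary recognition.

    Assumption: each area maps a feature key to a recognition result.
    """
    if transmission_amount >= threshold:
        # Heavy traffic: use both areas so that more utterances are
        # recognized on the client and fewer voice data are sent.
        return {**area_1, **area_2}
    # Light traffic: use dictionary area 1 only.
    return dict(area_1)
```

The same shape would apply if the monitored quantity were a different load measure, since only the comparison against the threshold decides which areas are searched.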
[0014]
In a sixth aspect of the present invention, the client registers the plurality of dictionary data for primary speech recognition separately in a dictionary area 1 and a dictionary area 2 that constitute the dictionary 1, and monitors the usage rate of the server-side CPU. When a voice is input, the client generates voice data for the input voice. When the monitored usage rate of the server-side CPU is equal to or greater than a certain threshold, the client performs primary speech recognition by comparing the voice data with the dictionary data stored in either of the dictionary area 1 and the dictionary area 2 of the dictionary 1. When the monitored usage rate of the server-side CPU is less than the threshold, the client performs primary speech recognition by comparing the voice data with the dictionary data stored in the dictionary area 1 of the dictionary 1.
[0015]
According to this configuration, the area of the dictionary 1 applied in primary speech recognition is controlled according to the usage rate of the server-side CPU. When the usage rate of the server-side CPU is high, the number of cases in which primary speech recognition is possible increases as compared with when the usage rate is low, so the load on the server-side CPU is reduced further when the usage rate is high than when it is low.
[0016]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, an embodiment of the present invention will be described with reference to FIGS.
[0017]
(Embodiment 1)
FIG. 1 is a configuration diagram of the speech recognition system according to the first embodiment. In FIG. 1, reference numeral 10 denotes a client, and 20 denotes a server.
[0018]
Next, the configuration of the client 10 will be described. 11 is a microphone, 12 is a voice analysis unit, 13 is a dictionary 1 storage unit, 14 is a primary voice recognition unit, 15 is a selection unit, 16 is a transmission unit, and 17 is a control unit.
[0019]
The microphone 11 inputs voice. The voice analysis unit 12 analyzes the voice input to the microphone 11 and generates voice data. The dictionary 1 storage unit 13 stores the dictionary 1 including a plurality of dictionary data used for performing primary speech recognition. The primary voice recognition unit 14 performs primary voice recognition by comparing the voice data generated by the voice analysis unit 12 with each of a plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13.
[0020]
Further, when the voice data generated by the voice analysis unit 12 matches one of the plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13, that is, when primary speech recognition succeeds, the primary voice recognition unit 14 generates a flag indicating that primary speech recognition has been performed and primary recognition result data, which is the result of the primary speech recognition. On the other hand, when the voice data generated by the voice analysis unit 12 matches none of the plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13, that is, when primary speech recognition cannot be performed, it generates a flag indicating that primary speech recognition could not be performed.
[0021]
The selection unit 15 selects the data to be transmitted from the client 10 to the server 20 from among the plurality of data generated by the client 10. Specifically, when the selection unit 15 confirms that the primary voice recognition unit 14 has generated the flag indicating that primary speech recognition has been performed, it selects the primary recognition result data generated by the primary voice recognition unit 14. When it confirms that the primary voice recognition unit 14 has generated the flag indicating that primary speech recognition could not be performed, it selects the voice data generated by the voice analysis unit 12. The transmission unit 16 transmits the data selected by the selection unit 15 to the server 20. The control unit 17 includes a CPU on the client 10 side and controls the operation of each of the units 11 to 16.
[0022]
Next, the configuration of the server 20 will be described. 21 is a receiving unit, 22 is a dictionary 2 storage unit, 23 is a secondary speech recognition unit, and 24 is a control unit.
[0023]
The receiving unit 21 receives the data transmitted from the client 10. Further, the receiving unit 21 generates a flag indicating that voice data has been received when the received data is voice data, and generates a flag indicating that primary recognition result data has been received when the received data is primary recognition result data.
[0024]
The dictionary 2 storage unit 22 stores the dictionary 2 including a plurality of dictionary data used when performing secondary speech recognition. The secondary voice recognition unit 23 performs secondary voice recognition by comparing the data received by the reception unit 21 with each of the plurality of data stored in the dictionary 2 of the dictionary 2 storage unit 22.
[0025]
Further, when the secondary voice recognition unit 23 confirms that the receiving unit 21 has generated the flag indicating that voice data has been received, it performs secondary speech recognition by comparing the data received by the receiving unit 21 with each of the plurality of data stored in the dictionary 2 of the dictionary 2 storage unit 22, generates secondary recognition result data, which is the result of the secondary speech recognition, and treats the secondary recognition result data as the recognition result data of the speech recognition system. When it confirms that the receiving unit 21 has generated the flag indicating that primary recognition result data has been received, it treats the data received by the receiving unit 21 as the recognition result data of the speech recognition system. The control unit 24 includes a CPU on the server 20 side and controls the operation of each of the units 21 to 23.
[0026]
FIG. 2 shows a flowchart of a processing procedure performed in the speech recognition system according to the first embodiment. As shown in FIG. 2, when a voice is input to the client 10 in S200, the process starts, and the process proceeds to S201. In S201, the client 10 generates audio data, and the process proceeds to S202. In S202, the client 10 performs primary speech recognition using the dictionary 1, and the process proceeds to S203. In S203, it is confirmed whether or not the primary voice recognition is possible in the client 10, and if the primary voice recognition is possible, the process proceeds to S204, and if the primary voice recognition is not possible, the process proceeds to S208. In S204, the client 10 transmits the primary recognition result data to the server 20, and the process proceeds to S205. In S205, the primary recognition result data is received by the server 20, and the process proceeds to S206. In S206, the server 20 obtains the primary recognition result data as the recognition result data of the speech recognition system, and the process proceeds to S207.
[0027]
In S208, the client 10 transmits the audio data to the server 20, and the process proceeds to S209. In S209, the server 20 receives the audio data, and the process proceeds to S210. In S210, the server 20 performs secondary speech recognition using the dictionary 2, obtains secondary recognition result data as recognition result data of the speech recognition system, and shifts to S207. In S207, the process ends.
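The flow of FIG. 2 can be sketched roughly as follows. The sketch assumes that voice data is a hashable feature string and that "matching" is an exact lookup; the specification fixes neither. The names `Message`, `client_process`, `server_process`, and the sample dictionaries are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: dictionary data maps a feature string to a recognition result.
DICTIONARY_1 = {"feat:hello": "hello"}                               # client side
DICTIONARY_2 = {"feat:hello": "hello", "feat:weather": "weather"}    # server side


@dataclass
class Message:
    kind: str      # "primary_result" or "voice_data"
    payload: str


def client_process(voice_data: str) -> Message:
    """S202 to S204 and S208: try primary recognition, then choose what to send."""
    result = DICTIONARY_1.get(voice_data)          # S202, S203
    if result is not None:
        return Message("primary_result", result)   # S204
    return Message("voice_data", voice_data)       # S208


def server_process(msg: Message) -> Optional[str]:
    """S205, S206 and S209, S210: accept the result or run secondary recognition."""
    if msg.kind == "primary_result":
        return msg.payload                         # S206
    return DICTIONARY_2.get(msg.payload)           # S210
```

Only the `voice_data` branch costs the server a dictionary search, which is where the reduction in server-side CPU load described for the first embodiment comes from.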
[0028]
(Embodiment 2)
FIG. 3 is a configuration diagram of the speech recognition system according to the second embodiment. In FIG. 3, reference numeral 10 denotes a client, and 20 denotes a server.
[0029]
Next, the configuration of the client 10 will be described. 11 is a microphone, 12 is a voice analysis unit, 13 is a dictionary 1 storage unit, 14 is a primary voice recognition unit, 15 is a selection unit, 16 is a transmission unit, 17 is a control unit, and 18 is a reception unit.
[0030]
The microphone 11 inputs voice. The voice analysis unit 12 analyzes the voice input to the microphone 11 and generates voice data. The dictionary 1 storage unit 13 stores the dictionary 1 including a plurality of dictionary data used for performing primary speech recognition.
[0031]
Further, when the dictionary 1 storage unit 13 confirms that the receiving unit 18 has generated a flag, described later, indicating that secondary recognition result data has been received, it stores the data received by the receiving unit 18 in the dictionary 1. The primary voice recognition unit 14 performs primary voice recognition by comparing the voice data generated by the voice analysis unit 12 with each of a plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13.
[0032]
Further, when the voice data generated by the voice analysis unit 12 matches one of the plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13, that is, when primary speech recognition succeeds, the primary voice recognition unit 14 generates a flag indicating that primary speech recognition has been performed and primary recognition result data, which is the result of the primary speech recognition. On the other hand, when the voice data generated by the voice analysis unit 12 matches none of the plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13, that is, when primary speech recognition cannot be performed, it generates a flag indicating that primary speech recognition could not be performed.
[0033]
The selection unit 15 selects the data to be transmitted from the client 10 to the server 20 from among the plurality of data generated by the client 10. Specifically, when the selection unit 15 confirms that the primary voice recognition unit 14 has generated the flag indicating that primary speech recognition has been performed, it selects the primary recognition result data generated by the primary voice recognition unit 14. When it confirms that the primary voice recognition unit 14 has generated the flag indicating that primary speech recognition could not be performed, it selects the voice data generated by the voice analysis unit 12. The transmission unit 16 transmits the data selected by the selection unit 15 to the server 20. The control unit 17 includes a CPU on the client 10 side and controls the operation of each of the units 11 to 16 and 18.
[0034]
The receiving unit 18 receives the data transmitted from the server 20. When the received data is the secondary recognition result data, the receiving unit 18 further generates a flag indicating that the secondary recognition result data has been received.
[0035]
Next, the configuration of the server 20 will be described. 21 is a receiving unit, 22 is a dictionary 2 storage unit, 23 is a secondary voice recognition unit, 24 is a control unit, 25 is a selection unit, and 26 is a transmission unit.
[0036]
The receiving unit 21 receives the data transmitted from the client 10. Further, the receiving unit 21 generates a flag indicating that voice data has been received when the received data is voice data, and generates a flag indicating that primary recognition result data has been received when the received data is primary recognition result data. The dictionary 2 storage unit 22 stores the dictionary 2 including a plurality of dictionary data used when performing secondary speech recognition. The secondary voice recognition unit 23 performs secondary voice recognition by comparing the data received by the receiving unit 21 with each of the plurality of data stored in the dictionary 2 of the dictionary 2 storage unit 22.
[0037]
Further, when the secondary voice recognition unit 23 confirms that the receiving unit 21 has generated the flag indicating that voice data has been received, it performs secondary speech recognition by comparing the data received by the receiving unit 21 with each of the plurality of data stored in the dictionary 2 of the dictionary 2 storage unit 22, generates a flag indicating that secondary speech recognition has been completed and secondary recognition result data, which is the result of the secondary speech recognition, and treats the secondary recognition result data as the recognition result data of the speech recognition system. When it confirms that the receiving unit 21 has generated the flag indicating that primary recognition result data has been received, it treats the data received by the receiving unit 21 as the recognition result data of the speech recognition system.
[0038]
The control unit 24 includes a CPU on the server 20 side and controls the operation of each of the units 21 to 23, 25, and 26. The selection unit 25 selects the data to be transmitted from the server 20 to the client 10 from among the plurality of data generated by the server 20. Specifically, when the selection unit 25 confirms that the secondary voice recognition unit 23 has generated the flag indicating that secondary speech recognition has been completed, it selects the secondary recognition result data generated by the secondary voice recognition unit 23. The transmission unit 26 transmits the data selected by the selection unit 25 to the client 10.
[0039]
FIG. 4 shows a flowchart of a processing procedure performed in the speech recognition system according to the second embodiment. As shown in FIG. 4, when a voice is input to the client 10 in S200, the process starts and the process proceeds to S201. In S201, the client 10 generates audio data, and the process proceeds to S202. In S202, the client 10 performs primary speech recognition using the dictionary 1, and the process proceeds to S203. In S203, it is confirmed whether or not the primary voice recognition is possible in the client 10, and if the primary voice recognition is possible, the process proceeds to S204, and if the primary voice recognition is not possible, the process proceeds to S208. In S204, the client 10 transmits the primary recognition result data to the server 20, and the process proceeds to S205. In S205, the primary recognition result data is received by the server 20, and the process proceeds to S206. In S206, the server 20 obtains the primary recognition result data as the recognition result data of the speech recognition system, and the process proceeds to S207.
[0040]
In S208, the client 10 transmits the audio data to the server 20, and the process proceeds to S209. In S209, the server 20 receives the audio data, and the process proceeds to S210. In S210, the server 20 performs secondary speech recognition using the dictionary 2, obtains secondary recognition result data as recognition result data of the speech recognition system, and shifts to S211. In S211, the server 20 transmits the secondary recognition result data to the client 10, and the process proceeds to S212. In S212, the client 10 receives the secondary recognition result data, and proceeds to S213. In step S213, the client 10 stores the secondary recognition result data in the dictionary 1, and the process proceeds to step S207. In S207, the process ends.
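The learning behaviour added in the second embodiment (S211 to S213) can be sketched as follows. The classes `Client` and `Server`, the direct method calls standing in for network transmission, and the choice to key the newly registered entry by the voice data that failed primary recognition are assumptions for illustration only.

```python
class Server:
    def __init__(self, dictionary_2):
        self.dictionary_2 = dict(dictionary_2)
        self.secondary_runs = 0        # counts how often S210 is executed

    def accept_primary(self, result):
        return result                  # S205, S206

    def secondary(self, voice_data):
        self.secondary_runs += 1       # S209, S210
        return self.dictionary_2.get(voice_data)


class Client:
    def __init__(self, dictionary_1):
        self.dictionary_1 = dict(dictionary_1)

    def recognize(self, voice_data, server):
        result = self.dictionary_1.get(voice_data)    # S202, S203
        if result is not None:
            server.accept_primary(result)             # S204
            return result, "client"
        result = server.secondary(voice_data)         # S208 to S211
        if result is not None:
            self.dictionary_1[voice_data] = result    # S212, S213
        return result, "server"
```

Once an utterance has been resolved by the server, a repeat of the same utterance is resolved on the client, so the server's secondary recognition is not run a second time.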
[0041]
(Embodiment 3)
FIG. 5 is a configuration diagram of the speech recognition system according to the third embodiment. In FIG. 5, reference numeral 10 denotes a client, and 20 denotes a server.
[0042]
Next, the configuration of the client 10 will be described. 11 is a microphone, 12 is a voice analysis unit, 13 is a dictionary 1 storage unit, 14 is a primary voice recognition unit, 15 is a selection unit, 16 is a transmission unit, 17 is a control unit, 18 is a reception unit, and 19 is a dictionary 1 management unit.
[0043]
The microphone 11 inputs voice. The voice analysis unit 12 analyzes the voice input to the microphone 11 and generates voice data. The dictionary 1 storage unit 13 stores the dictionary 1 including a plurality of dictionary data used for performing primary speech recognition.
[0044]
Further, when the dictionary 1 storage unit 13 confirms that the receiving unit 18 has generated a flag, described later, indicating that secondary recognition result data has been received, it checks whether the dictionary 1 has space for storing dictionary data. When the dictionary 1 has space for storing dictionary data, the dictionary 1 storage unit 13 stores the data received by the receiving unit 18 in the dictionary 1. When the dictionary 1 has no space for storing dictionary data, it deletes the dictionary data of the dictionary 1 corresponding to the deletion address data generated by the dictionary 1 management unit 19, and then stores the data received by the receiving unit 18 in the dictionary 1.
[0045]
The primary voice recognition unit 14 performs primary voice recognition by comparing the voice data generated by the voice analysis unit 12 with each of a plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13. Further, when the voice data generated by the voice analysis unit 12 matches one of the plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13, that is, when primary speech recognition succeeds, the primary voice recognition unit 14 generates a flag indicating that primary speech recognition has been performed and primary recognition result data, which is the result of the primary speech recognition. On the other hand, when the voice data generated by the voice analysis unit 12 matches none of the plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13, that is, when primary speech recognition cannot be performed, it generates a flag indicating that primary speech recognition could not be performed.
[0046]
The selection unit 15 selects the data to be transmitted from the client 10 to the server 20 from among the plurality of data generated by the client 10. Specifically, when the selection unit 15 confirms that the primary voice recognition unit 14 has generated the flag indicating that primary speech recognition has been performed, it selects the primary recognition result data generated by the primary voice recognition unit 14. When it confirms that the primary voice recognition unit 14 has generated the flag indicating that primary speech recognition could not be performed, it selects the voice data generated by the voice analysis unit 12. The transmission unit 16 transmits the data selected by the selection unit 15 to the server 20.
[0047]
The control unit 17 includes a CPU on the client 10 side, and controls each operation of 11 to 16, 18, and 19. The receiving unit 18 receives the data transmitted from the server 20. When the received data is the secondary recognition result data, the receiving unit 18 further generates a flag indicating that the secondary recognition result data has been received.
[0048]
Each time the primary voice recognition unit 14 performs primary speech recognition, the dictionary 1 management unit 19 records the result obtained by the primary voice recognition unit 14 for each of the plurality of dictionary data stored in the dictionary 1 of the dictionary 1 storage unit 13, and thereby manages the plurality of dictionary data of the dictionary 1. Further, for each dictionary data, the dictionary 1 management unit 19 calculates the primary speech recognizable probability, that is, the ratio of the number of times the dictionary data matched the voice data to the number of times it was used for comparison with the voice data in primary speech recognition. It stores the primary speech recognizable probability in the primary speech recognizable probability table in association with the storage location of the dictionary data, that is, the address of the dictionary data. The dictionary 1 management unit 19 also refers to the primary speech recognizable probability table and generates deletion address data indicating the address of the dictionary data having the lowest primary speech recognizable probability.
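The bookkeeping described for the dictionary 1 management unit 19 can be sketched as below. The class name `ProbabilityTable`, the use of integer addresses as keys, and the treatment of an entry that has never been compared as having probability 0.0 are assumptions; the specification only defines the ratio itself.

```python
class ProbabilityTable:
    """Per-address match statistics for the dictionary data of dictionary 1."""

    def __init__(self):
        self.compared = {}   # address -> times used for comparison
        self.matched = {}    # address -> times it matched the voice data

    def record(self, address, matched):
        """Record one comparison of the entry at `address` with voice data."""
        self.compared[address] = self.compared.get(address, 0) + 1
        if matched:
            self.matched[address] = self.matched.get(address, 0) + 1

    def probability(self, address):
        """Primary speech recognizable probability: matches / comparisons."""
        n = self.compared.get(address, 0)
        return self.matched.get(address, 0) / n if n else 0.0

    def deletion_address(self):
        """Address with the lowest probability, or None if nothing is recorded."""
        if not self.compared:
            return None
        return min(self.compared, key=self.probability)
```

When several addresses share the lowest probability, this sketch returns the one recorded first; the specification does not say how ties are broken.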
[0049]
Next, the configuration of the server 20 will be described. 21 is a receiving unit, 22 is a dictionary 2 storage unit, 23 is a secondary voice recognition unit, 24 is a control unit, 25 is a selection unit, and 26 is a transmission unit.
[0050]
The receiving unit 21 receives the data transmitted from the client 10. Further, the receiving unit 21 generates a flag indicating that voice data has been received when the received data is voice data, and generates a flag indicating that primary recognition result data has been received when the received data is primary recognition result data. The dictionary 2 storage unit 22 stores the dictionary 2 including a plurality of dictionary data used when performing secondary speech recognition. The secondary voice recognition unit 23 performs secondary voice recognition by comparing the data received by the receiving unit 21 with each of the plurality of data stored in the dictionary 2 of the dictionary 2 storage unit 22.
[0051]
Further, when the secondary voice recognition unit 23 confirms that the receiving unit 21 has generated the flag indicating that voice data has been received, it performs secondary speech recognition by comparing the data received by the receiving unit 21 with each of the plurality of data stored in the dictionary 2 of the dictionary 2 storage unit 22, generates a flag indicating that secondary speech recognition has been completed and secondary recognition result data, which is the result of the secondary speech recognition, and treats the secondary recognition result data as the recognition result data of the speech recognition system. When it confirms that the receiving unit 21 has generated the flag indicating that primary recognition result data has been received, it treats the data received by the receiving unit 21 as the recognition result data of the speech recognition system.
[0052]
The control unit 24 includes a CPU on the server 20 side and controls the operation of each of the units 21 to 23, 25, and 26. The selection unit 25 selects the data to be transmitted from the server 20 to the client 10 from among the plurality of data generated by the server 20. Specifically, when the selection unit 25 confirms that the secondary voice recognition unit 23 has generated the flag indicating that secondary speech recognition has been completed, it selects the secondary recognition result data generated by the secondary voice recognition unit 23. The transmission unit 26 transmits the data selected by the selection unit 25 to the client 10.
[0053]
FIG. 6 shows a flowchart of a processing procedure performed in the speech recognition system according to the third embodiment. As shown in FIG. 6, when a voice is input to the client 10 in S200, the process starts and the process proceeds to S201. In S201, the client 10 generates audio data, and the process proceeds to S202. In S202, the client 10 performs primary speech recognition using the dictionary 1, and the process proceeds to S203. In S203, it is confirmed whether or not the primary voice recognition is possible in the client 10, and if the primary voice recognition is possible, the process proceeds to S204, and if the primary voice recognition is not possible, the process proceeds to S208. In S204, the client 10 transmits the primary recognition result data to the server 20, and the process proceeds to S205. In S205, the primary recognition result data is received by the server 20, and the process proceeds to S206. In S206, the server 20 obtains the primary recognition result data as the recognition result data of the speech recognition system, and the process proceeds to S207.
[0054]
In S208, the client 10 transmits the audio data to the server 20, and the process proceeds to S209. In S209, the server 20 receives the audio data, and the process proceeds to S210. In S210, the server 20 performs secondary speech recognition using the dictionary 2, obtains the secondary recognition result data as the recognition result data of the speech recognition system, and the process proceeds to S211. In S211, the server 20 transmits the secondary recognition result data to the client 10, and the process proceeds to S212. In S212, the client 10 receives the secondary recognition result data, and the process proceeds to S214. In S214, the client 10 checks whether the dictionary 1 has space for storing dictionary data. If it does, the process proceeds to S213; if it does not, the process proceeds to S215. In S215, the client 10 deletes from the dictionary 1 the dictionary data having the lowest primary speech recognizable probability, and the process proceeds to S213. In S213, the client 10 stores the secondary recognition result data in the dictionary 1, and the process proceeds to S207. In S207, the process ends.
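Steps S214, S215, and S213 can be sketched as a single registration routine. The function name `register`, the fixed `capacity` standing in for "space for storing dictionary data", and the `probabilities` mapping passed in as a stand-in for the primary speech recognizable probability table are hypothetical.

```python
def register(dictionary_1, probabilities, capacity, key, entry):
    """Store `entry` in dictionary 1, evicting the weakest entry when full.

    Assumption: `probabilities` maps each key to its primary speech
    recognizable probability; a key without a value is treated as 0.0.
    """
    if key not in dictionary_1 and len(dictionary_1) >= capacity:    # S214
        # S215: delete the entry with the lowest probability.
        victim = min(dictionary_1, key=lambda k: probabilities.get(k, 0.0))
        del dictionary_1[victim]
        probabilities.pop(victim, None)
    dictionary_1[key] = entry                                        # S213
    return dictionary_1
```

In this sketch a freshly registered entry has no recorded probability and would itself be the next eviction candidate; how a new entry's probability is initialized is not stated in the specification and would need to be decided in a real implementation.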
[0055]
(Embodiment 4)
FIG. 7 is a configuration diagram of the speech recognition system according to the fourth embodiment. In FIG. 7, reference numeral 10 denotes a client, and 20 denotes a server.
[0056]
Next, the configuration of the client 10 will be described. 11 is a microphone, 12 is a voice analysis unit, 13 is a dictionary 1 storage unit, 14 is a primary voice recognition unit, 15 is a selection unit, 16 is a transmission unit, 17 is a control unit, 18 is a reception unit, and 30 is a voice identification unit.
[0057]
The microphone 11 inputs voice. The voice analysis unit 12 analyzes the voice input to the microphone 11 and generates voice data. The voice identification unit 30 specifies the speaker who has input the voice using the voice data generated by the voice analysis unit 12. When the speaker is specified, the voice identification unit 30 further generates a flag indicating that the speaker has been specified and speaker data indicating who the speaker is. The dictionary 1 storage unit 13 stores a plurality of dictionary data used when performing primary speech recognition on the dictionary 1 in association with speakers who need the dictionary data.
[0058]
Further, when the dictionary 1 storage unit 13 confirms that the voice identification unit 30 has generated the flag indicating that the speaker has been specified, it deletes from the dictionary 1 the dictionary data associated with speakers other than the speaker corresponding to the speaker data. When it confirms that the receiving unit 18 has generated a flag, described later, indicating that dictionary data associated with the speaker corresponding to the speaker data has been received, it stores the data received by the receiving unit 18 in the dictionary 1 in association with the speaker who needs the dictionary data. The primary voice recognition unit 14 performs primary voice recognition by comparing the voice data generated by the voice analysis unit 12 with each of a plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13.
[0059]
Further, when the voice data generated by the voice analysis unit 12 matches one of the plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13, that is, when primary speech recognition succeeds, the primary voice recognition unit 14 generates a flag indicating that primary speech recognition has been performed and primary recognition result data, which is the result of the primary speech recognition. On the other hand, when the voice data generated by the voice analysis unit 12 matches none of the plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13, that is, when primary speech recognition cannot be performed, it generates a flag indicating that primary speech recognition could not be performed.
[0060]
The selection unit 15 selects the data to be transmitted from the client 10 to the server 20 from among the plurality of data generated by the client 10. Specifically, when the selection unit 15 confirms that the primary voice recognition unit 14 has generated the flag indicating that primary speech recognition has been performed, it selects the primary recognition result data generated by the primary voice recognition unit 14. When it confirms that the primary voice recognition unit 14 has generated the flag indicating that primary speech recognition could not be performed, it selects the voice data generated by the voice analysis unit 12. When it confirms that the voice identification unit 30 has generated the flag indicating that the speaker has been specified, it selects the speaker data generated by the voice identification unit 30.
[0061]
The transmission unit 16 transmits the data selected by the selection unit 15 to the server 20. The control unit 17 includes a CPU on the client 10 side and controls the operations of units 11 to 16, 18, and 30. The receiving unit 18 receives the data transmitted from the server 20. Further, the receiving unit 18 generates a flag indicating that secondary recognition result data has been received when the received data is secondary recognition result data, and generates a flag indicating that dictionary data associated with the speaker corresponding to the speaker data has been received when the received data is such dictionary data.
[0062]
Next, the configuration of the server 20 will be described. 21 is a receiving unit, 22 is a dictionary 2 storage unit, 23 is a secondary voice recognition unit, 24 is a control unit, 25 is a selection unit, and 26 is a transmission unit.
[0063]
The receiving unit 21 receives the data transmitted from the client 10. Further, the receiving unit 21 generates a flag indicating that voice data has been received when the received data is voice data, a flag indicating that primary recognition result data has been received when the received data is primary recognition result data, and a flag indicating that speaker data has been received when the received data is speaker data. The dictionary 2 storage unit 22 stores, in the dictionary 2, each of a plurality of dictionary data used for secondary speech recognition in association with the speaker who needs that dictionary data. The secondary voice recognition unit 23 performs secondary voice recognition by comparing the data received by the receiving unit 21 with each of the plurality of data stored in the dictionary 2 of the dictionary 2 storage unit 22.
[0064]
Further, when the secondary voice recognition unit 23 confirms that the receiving unit 21 has generated a flag indicating that voice data has been received, it performs secondary speech recognition by comparing the data received by the receiving unit 21 with each of the plurality of data stored in the dictionary 2 of the dictionary 2 storage unit 22, generates a flag indicating that secondary speech recognition has been completed together with secondary recognition result data, which is the secondary speech recognition result, and treats the secondary recognition result data as the recognition result data of the speech recognition system. When it confirms that the receiving unit 21 has generated a flag indicating that primary recognition result data has been received, it treats the data received by the receiving unit 21 as the recognition result data of the speech recognition system.
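The server-side handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the source: the function name `handle_received`, the string tags `"voice"` and `"primary_result"` standing in for the flags of the receiving unit 21, the exact-match lookup, and returning `None` when the dictionary 2 has no match are all hypothetical.

```python
from typing import Dict, Optional


def handle_received(kind: str, payload, dictionary2: Dict[bytes, str]) -> Optional[str]:
    """Return the recognition result data of the system for one received message."""
    if kind == "primary_result":
        # Primary recognition succeeded on the client: adopt its result as-is,
        # without running secondary recognition on the server.
        return payload
    if kind == "voice":
        # Secondary recognition: compare the voice data with dictionary 2
        # (assumed exact match; None if nothing matches).
        return dictionary2.get(payload)
    raise ValueError(f"unexpected data kind: {kind}")
```

Only the `"voice"` branch costs the server any recognition work, which is the load reduction the system aims for.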
[0065]
The control unit 24 includes a CPU on the server 20 side and controls the operations of units 21 to 23, 25, and 26. The selection unit 25 selects, from the plurality of data generated by the server 20, the data to be transmitted from the server 20 to the client 10. Specifically, when the selection unit 25 confirms that the secondary speech recognition unit 23 has generated a flag indicating that secondary speech recognition has been completed, it selects the secondary recognition result data generated by the secondary speech recognition unit 23. When it confirms that the receiving unit 21 has generated a flag indicating that speaker data has been received, it selects from the dictionary 2 the dictionary data associated with the speaker corresponding to the speaker data received by the receiving unit 21. The transmission unit 26 transmits the data selected by the selection unit 25 to the client 10.
[0066]
FIG. 8 shows a flowchart of a processing procedure performed in the speech recognition system according to the fourth embodiment. As shown in FIG. 8, when a voice is input to the client 10 in S200, the process starts and proceeds to S201. In S201, the client 10 generates voice data, and the process proceeds to S216. In S216, the client 10 performs voice identification to generate speaker data, and the process proceeds to S217. In S217, the client 10 deletes from the dictionary 1 the dictionary data associated with speakers other than the speaker corresponding to the speaker data, and the process proceeds to S218. In S218, the client 10 transmits the speaker data to the server 20, and the process proceeds to S219. In S219, the server 20 receives the speaker data, and the process proceeds to S220. In S220, the server 20 transmits to the client 10 the dictionary data in the dictionary 2 associated with the speaker corresponding to the speaker data, and the process proceeds to S221. In S221, the client 10 receives the dictionary data associated with the speaker corresponding to the speaker data, and the process proceeds to S222. In S222, the client 10 stores that dictionary data in the dictionary 1, and the process proceeds to S202. In S202, the client 10 performs primary speech recognition using the dictionary 1, and the process proceeds to S203. In S203, the client 10 checks whether primary speech recognition is possible; if it is, the process proceeds to S204, and if it is not, the process proceeds to S208. In S204, the client 10 transmits the primary recognition result data to the server 20, and the process proceeds to S205. In S205, the server 20 receives the primary recognition result data, and the process proceeds to S206. In S206, the server 20 obtains the primary recognition result data as the recognition result data of the speech recognition system, and the process proceeds to S207.
[0067]
In S208, the client 10 transmits the voice data to the server 20, and the process proceeds to S209. In S209, the server 20 receives the voice data, and the process proceeds to S210. In S210, the server 20 performs secondary speech recognition using the dictionary 2, obtains the secondary recognition result data as the recognition result data of the speech recognition system, and the process proceeds to S207. In S207, the process ends.
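The flow of FIG. 8 can be condensed into a short sketch. It is an illustration under several assumptions rather than the disclosed implementation: both dictionaries are modeled as nested mappings from speaker to entries, matching is exact, speaker identification (S216) is taken as an already computed `speaker` argument, the network exchange is replaced by direct access to `dictionary2`, and secondary recognition is assumed to search every speaker's entries. The name `recognize_with_speaker` is hypothetical.

```python
from typing import Dict, Optional, Tuple

SpeakerDict = Dict[str, Dict[bytes, str]]  # speaker -> {voice data: result}


def recognize_with_speaker(voice_data: bytes, speaker: str,
                           dictionary1: SpeakerDict,
                           dictionary2: SpeakerDict) -> Tuple[Optional[str], str]:
    # S217: drop client-side entries that belong to other speakers.
    for other in [s for s in dictionary1 if s != speaker]:
        del dictionary1[other]
    # S218-S222: fetch this speaker's entries from dictionary 2 and store them.
    dictionary1.setdefault(speaker, {}).update(dictionary2.get(speaker, {}))
    # S202-S203: primary recognition on the client.
    result = dictionary1[speaker].get(voice_data)
    if result is not None:
        return result, "primary"        # S204-S206
    # S208-S210: secondary recognition on the server (assumed: all of dictionary 2).
    for entries in dictionary2.values():
        if voice_data in entries:
            return entries[voice_data], "secondary"
    return None, "secondary"
```

The second element of the returned tuple only records which side produced the result; it is not part of the source.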
[0068]
(Embodiment 5)
FIG. 9 is a configuration diagram of the speech recognition system according to the fifth embodiment. In FIG. 9, reference numeral 10 denotes a client, and reference numeral 20 denotes a server.
[0069]
Next, the configuration of the client 10 will be described. 11 is a microphone, 12 is a voice analysis unit, 13 is a dictionary 1 storage unit, 14 is a primary voice recognition unit, 15 is a selection unit, 16 is a transmission unit, 17 is a control unit, 18 is a receiving unit, and 31 is a transmission amount monitoring unit.
[0070]
The microphone 11 inputs voice. The voice analysis unit 12 analyzes the voice input to the microphone 11 and generates voice data. The dictionary 1 storage unit 13 stores a plurality of dictionary data used for performing primary speech recognition separately in a dictionary area 1 and a dictionary area 2 constituting the dictionary 1. The primary voice recognition unit 14 performs primary voice recognition by comparing the voice data generated by the voice analysis unit 12 with each of a plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13.
[0071]
Further, when the voice data generated by the voice analysis unit 12 matches one of the plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13, that is, when primary speech recognition is possible, the primary voice recognition unit 14 generates a flag indicating that primary speech recognition has been performed and primary recognition result data, which is the primary speech recognition result. Conversely, when the voice data generated by the voice analysis unit 12 matches none of the plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13, that is, when primary speech recognition cannot be performed, it generates a flag indicating that primary speech recognition could not be performed.
[0072]
Further, when the primary voice recognition unit 14 confirms that the transmission amount monitoring unit 31 has generated a flag indicating that the data transmission amount (described later) is equal to or greater than a certain threshold, it performs primary voice recognition by comparing the voice data with the dictionary data stored in either the dictionary area 1 or the dictionary area 2 of the dictionary 1. When it confirms that the transmission amount monitoring unit 31 has generated a flag indicating that the data transmission amount is less than the threshold, it performs primary voice recognition by comparing the voice data with the dictionary data stored in the dictionary area 1 of the dictionary 1. The selection unit 15 selects, from the plurality of data generated by the client 10, the data to be transmitted from the client 10 to the server 20.
[0073]
Specifically, when the selection unit 15 confirms that the primary speech recognition unit 14 has generated a flag indicating that primary speech recognition has been performed, it selects the primary recognition result data generated by the primary speech recognition unit 14. When it confirms that the primary speech recognition unit 14 has generated a flag indicating that primary speech recognition could not be performed, it selects the voice data generated by the voice analysis unit 12. The transmission unit 16 transmits the data selected by the selection unit 15 to the server 20. The control unit 17 includes a CPU on the client 10 side and controls the operations of units 11 to 16, 18, and 31. The receiving unit 18 receives the data transmitted from the server 20.
[0074]
The transmission amount monitoring unit 31 monitors the data transmission amount between the client 10 and the server 20. Specifically, it calculates the sum of the data amount transmitted by the transmission unit 16 and the data amount received by the receiving unit 18, that is, the data transmission amount between the client 10 and the server 20. When this data transmission amount is equal to or greater than a certain threshold, it generates a flag indicating that the data transmission amount between the client 10 and the server 20 is equal to or greater than the threshold. When the data transmission amount is less than the threshold, it generates a flag indicating that the data transmission amount between the client 10 and the server 20 is less than the threshold.
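A minimal sketch of such a monitor follows. The class name `TransmissionMonitor`, its method names, and the byte unit are hypothetical, and the counters are assumed to be cumulative; the source does not state the unit or the period over which the transmission amount is measured.

```python
class TransmissionMonitor:
    """Tracks sent plus received data and compares the sum with a threshold."""

    def __init__(self, threshold_bytes: int) -> None:
        self.threshold_bytes = threshold_bytes
        self.sent_bytes = 0       # amount sent by the transmission unit
        self.received_bytes = 0   # amount received by the receiving unit

    def record_sent(self, n: int) -> None:
        self.sent_bytes += n

    def record_received(self, n: int) -> None:
        self.received_bytes += n

    @property
    def total_bytes(self) -> int:
        # Data transmission amount = sent amount + received amount.
        return self.sent_bytes + self.received_bytes

    def at_or_above_threshold(self) -> bool:
        # True corresponds to the "equal to or greater than threshold" flag,
        # False to the "less than threshold" flag.
        return self.total_bytes >= self.threshold_bytes
```

A practical monitor would likely measure the amount per time window and reset the counters periodically; that refinement is left out because the source does not describe it.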
[0075]
Next, the configuration of the server 20 will be described. 21 is a receiving unit, 22 is a dictionary 2 storage unit, 23 is a secondary voice recognition unit, 24 is a control unit, 25 is a selection unit, and 26 is a transmission unit.
[0076]
The receiving unit 21 receives the data transmitted from the client 10. Further, the receiving unit 21 generates a flag indicating that voice data has been received when the received data is voice data, and generates a flag indicating that primary recognition result data has been received when the received data is primary recognition result data. The dictionary 2 storage unit 22 stores the dictionary 2 including a plurality of dictionary data used when performing secondary speech recognition. The secondary voice recognition unit 23 performs secondary voice recognition by comparing the data received by the receiving unit 21 with each of the plurality of data stored in the dictionary 2 of the dictionary 2 storage unit 22.
[0077]
Further, when the secondary voice recognition unit 23 confirms that the receiving unit 21 has generated a flag indicating that voice data has been received, it performs secondary speech recognition by comparing the data received by the receiving unit 21 with each of the plurality of data stored in the dictionary 2 of the dictionary 2 storage unit 22, generates secondary recognition result data, which is the secondary speech recognition result, and treats the secondary recognition result data as the recognition result data of the speech recognition system. When it confirms that the receiving unit 21 has generated a flag indicating that primary recognition result data has been received, it treats the data received by the receiving unit 21 as the recognition result data of the speech recognition system.
[0078]
The control unit 24 includes a CPU on the server 20 side, and controls each operation of 21 to 23, 25, and 26. The selection unit 25 selects data to be transmitted from the server 20 to the client 10 from a plurality of data generated by the server 20. The transmission unit 26 transmits the data selected by the selection unit 25 to the client 10.
[0079]
FIG. 10 shows a flowchart of a processing procedure performed in the speech recognition system according to the fifth embodiment. As shown in FIG. 10, when a voice is input to the client 10 in S200, the process starts and proceeds to S201. In S201, the client 10 generates voice data, and the process proceeds to S223. In S223, it is checked whether the data transmission amount between the client 10 and the server 20 is equal to or greater than the threshold. If it is equal to or greater than the threshold, the process proceeds to S224; if it is less than the threshold, the process proceeds to S225. In S224, the client 10 performs primary speech recognition using the dictionary area 1 and the dictionary area 2 of the dictionary 1, and the process proceeds to S203. In S225, the client 10 performs primary speech recognition using the dictionary area 1 of the dictionary 1, and the process proceeds to S203.
[0080]
In S203, the client 10 checks whether primary speech recognition is possible; if it is, the process proceeds to S204, and if it is not, the process proceeds to S208. In S204, the client 10 transmits the primary recognition result data to the server 20, and the process proceeds to S205. In S205, the server 20 receives the primary recognition result data, and the process proceeds to S206. In S206, the server 20 obtains the primary recognition result data as the recognition result data of the speech recognition system, and the process proceeds to S207. In S208, the client 10 transmits the voice data to the server 20, and the process proceeds to S209. In S209, the server 20 receives the voice data, and the process proceeds to S210. In S210, the server 20 performs secondary speech recognition using the dictionary 2, obtains the secondary recognition result data as the recognition result data of the speech recognition system, and the process proceeds to S207. In S207, the process ends.
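The client-side branch of FIG. 10 (S223 to S225, followed by the check in S203) can be sketched as below. The function name `primary_recognize_by_traffic`, the mapping-based dictionary areas, and exact-match lookup are assumptions for illustration only.

```python
from typing import Dict, Optional


def primary_recognize_by_traffic(voice_data: bytes,
                                 area1: Dict[bytes, str],
                                 area2: Dict[bytes, str],
                                 traffic_at_or_above_threshold: bool) -> Optional[str]:
    # S223: choose which dictionary areas to search.
    # S224 searches areas 1 and 2; S225 searches area 1 only.
    areas = (area1, area2) if traffic_at_or_above_threshold else (area1,)
    for area in areas:
        if voice_data in area:
            return area[voice_data]   # S203 "possible": send this result (S204)
    return None                       # S203 "not possible": send voice data (S208)
```

Searching the larger dictionary only when traffic is high presumably trades extra client-side work for fewer voice data transmissions, which is consistent with the stated aim of limiting network load.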
[0081]
(Embodiment 6)
FIG. 11 is a configuration diagram of the speech recognition system according to the sixth embodiment. In FIG. 11, reference numeral 10 denotes a client, and reference numeral 20 denotes a server.
[0082]
Next, the configuration of the client 10 will be described. 11 is a microphone, 12 is a voice analysis unit, 13 is a dictionary 1 storage unit, 14 is a primary voice recognition unit, 15 is a selection unit, 16 is a transmission unit, 17 is a control unit, 18 is a receiving unit, and 32 is a server monitoring unit.
[0083]
The microphone 11 inputs voice. The voice analysis unit 12 analyzes the voice input to the microphone 11 and generates voice data. The dictionary 1 storage unit 13 stores a plurality of dictionary data used for performing primary speech recognition separately in a dictionary area 1 and a dictionary area 2 constituting the dictionary 1. The primary voice recognition unit 14 performs primary voice recognition by comparing the voice data generated by the voice analysis unit 12 with each of a plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13.
[0084]
Further, when the voice data generated by the voice analysis unit 12 matches one of the plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13, that is, when primary speech recognition is possible, the primary voice recognition unit 14 generates a flag indicating that primary speech recognition has been performed and primary recognition result data, which is the primary speech recognition result. Conversely, when the voice data generated by the voice analysis unit 12 matches none of the plurality of data stored in the dictionary 1 of the dictionary 1 storage unit 13, that is, when primary speech recognition cannot be performed, it generates a flag indicating that primary speech recognition could not be performed.
[0085]
Further, when the primary voice recognition unit 14 confirms that the server monitoring unit 32 has generated a flag indicating that the server 20 side CPU usage rate (described later) is equal to or greater than a certain threshold, it performs primary voice recognition by comparing the voice data with the dictionary data stored in either the dictionary area 1 or the dictionary area 2 of the dictionary 1. When it confirms that the server monitoring unit 32 has generated a flag indicating that the server 20 side CPU usage rate is less than the threshold, it performs primary voice recognition by comparing the voice data with the dictionary data stored in the dictionary area 1 of the dictionary 1. The selection unit 15 selects, from the plurality of data generated by the client 10, the data to be transmitted from the client 10 to the server 20.
[0086]
Specifically, when the selection unit 15 confirms that the primary speech recognition unit 14 has generated a flag indicating that primary speech recognition has been performed, it selects the primary recognition result data generated by the primary speech recognition unit 14. When it confirms that the primary speech recognition unit 14 has generated a flag indicating that primary speech recognition could not be performed, it selects the voice data generated by the voice analysis unit 12. The transmission unit 16 transmits the data selected by the selection unit 15 to the server 20. The control unit 17 includes a CPU on the client 10 side and controls the operations of units 11 to 16, 18, and 32. The receiving unit 18 receives the data transmitted from the server 20. Further, when the received data is server 20 side CPU usage rate data, the receiving unit 18 generates a flag indicating that server 20 side CPU usage rate data has been received.
[0087]
The server monitoring unit 32 monitors the usage rate of the CPU on the server 20 side. Specifically, when the server monitoring unit 32 confirms that the receiving unit 18 has generated a flag indicating that server 20 side CPU usage rate data has been received, it calculates the server 20 side CPU usage rate from the data received by the receiving unit 18. If the server 20 side CPU usage rate is equal to or greater than a certain threshold, it generates a flag indicating that the server 20 side CPU usage rate is equal to or greater than the threshold. If the usage rate is less than the threshold, it generates a flag indicating that the server 20 side CPU usage rate is less than the threshold.
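The monitoring step can be illustrated with a small sketch. The source does not describe the format of the CPU usage rate data, so the sketch assumes, purely for illustration, that the server reports a pair of busy and total CPU tick counts; the class name `ServerMonitor` and the method `on_usage_data` are likewise hypothetical.

```python
from typing import Optional


class ServerMonitor:
    """Derives the server CPU usage rate from received data and flags high load."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold              # e.g. 0.8 for 80 percent
        self.usage_rate: Optional[float] = None

    def on_usage_data(self, busy_ticks: int, total_ticks: int) -> bool:
        """Return True when the usage rate is at or above the threshold."""
        if total_ticks <= 0:
            raise ValueError("total_ticks must be positive")
        # Assumed data format: usage rate = busy ticks / total ticks.
        self.usage_rate = busy_ticks / total_ticks
        return self.usage_rate >= self.threshold
```

The boolean result stands in for the two flags in the text; the primary voice recognition unit would search both dictionary areas when it is true and only the dictionary area 1 when it is false.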
[0088]
Next, the configuration of the server 20 will be described. 21 is a receiving unit, 22 is a dictionary 2 storage unit, 23 is a secondary voice recognition unit, 24 is a control unit, 25 is a selection unit, and 26 is a transmission unit.
[0089]
The receiving unit 21 receives the data transmitted from the client 10. Further, the receiving unit 21 generates a flag indicating that voice data has been received when the received data is voice data, and generates a flag indicating that primary recognition result data has been received when the received data is primary recognition result data. The dictionary 2 storage unit 22 stores the dictionary 2 including a plurality of dictionary data used when performing secondary speech recognition. The secondary voice recognition unit 23 performs secondary voice recognition by comparing the data received by the receiving unit 21 with each of the plurality of data stored in the dictionary 2 of the dictionary 2 storage unit 22.
[0090]
Further, when the secondary voice recognition unit 23 confirms that the receiving unit 21 has generated a flag indicating that voice data has been received, it performs secondary speech recognition by comparing the data received by the receiving unit 21 with each of the plurality of data stored in the dictionary 2 of the dictionary 2 storage unit 22, generates secondary recognition result data, which is the secondary speech recognition result, and treats the secondary recognition result data as the recognition result data of the speech recognition system. When it confirms that the receiving unit 21 has generated a flag indicating that primary recognition result data has been received, it treats the data received by the receiving unit 21 as the recognition result data of the speech recognition system. The control unit 24 includes a CPU on the server 20 side and controls the operations of units 21 to 23, 25, and 26.
[0091]
Further, the control unit 24 calculates the usage rate of the CPU on the server 20 side and generates a flag indicating that the usage rate has been calculated, together with server 20 side CPU usage rate data indicating that usage rate. The selection unit 25 selects, from the plurality of data generated by the server 20, the data to be transmitted from the server 20 to the client 10. Specifically, when the selection unit 25 confirms that the control unit 24 has generated a flag indicating that the server 20 side CPU usage rate has been calculated, it selects the server 20 side CPU usage rate data generated by the control unit 24. The transmission unit 26 transmits the data selected by the selection unit 25 to the client 10.
[0092]
FIG. 12 shows a flowchart of a processing procedure performed in the speech recognition system according to the sixth embodiment. As shown in FIG. 12, when a voice is input to the client 10 in S200, the process starts and proceeds to S201. In S201, the client 10 generates voice data, and the process proceeds to S226. In S226, it is checked whether the usage rate of the CPU on the server 20 side is equal to or greater than the threshold. If it is equal to or greater than the threshold, the process proceeds to S224; if it is less than the threshold, the process proceeds to S225. In S224, the client 10 performs primary speech recognition using the dictionary area 1 and the dictionary area 2 of the dictionary 1, and the process proceeds to S203. In S225, the client 10 performs primary speech recognition using the dictionary area 1 of the dictionary 1, and the process proceeds to S203.
[0093]
In S203, the client 10 checks whether primary speech recognition is possible; if it is, the process proceeds to S204, and if it is not, the process proceeds to S208. In S204, the client 10 transmits the primary recognition result data to the server 20, and the process proceeds to S205. In S205, the server 20 receives the primary recognition result data, and the process proceeds to S206. In S206, the server 20 obtains the primary recognition result data as the recognition result data of the speech recognition system, and the process proceeds to S207. In S208, the client 10 transmits the voice data to the server 20, and the process proceeds to S209. In S209, the server 20 receives the voice data, and the process proceeds to S210. In S210, the server 20 performs secondary speech recognition using the dictionary 2, obtains the secondary recognition result data as the recognition result data of the speech recognition system, and the process proceeds to S207. In S207, the process ends.
[0094]
[Effects of the invention]
As described above, according to the present invention, it is possible to perform voice recognition while suppressing the load on the server-side CPU and the load on the network band between the client and the server.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a speech recognition system according to a first embodiment.
FIG. 2 is a flowchart of a processing procedure performed in the speech recognition system according to the first embodiment.
FIG. 3 is a diagram showing a configuration of a speech recognition system according to a second embodiment.
FIG. 4 is a flowchart of a processing procedure performed in the speech recognition system according to the second embodiment.
FIG. 5 is a diagram showing a configuration of a speech recognition system according to a third embodiment.
FIG. 6 is a flowchart of a processing procedure performed in the speech recognition system according to the third embodiment.
FIG. 7 is a diagram showing a configuration of a speech recognition system according to a fourth embodiment.
FIG. 8 is a flowchart of a processing procedure performed in the speech recognition system according to the fourth embodiment.
FIG. 9 is a diagram showing a configuration of a speech recognition system according to a fifth embodiment.
FIG. 10 is a flowchart of a processing procedure performed in the speech recognition system according to the fifth embodiment.
FIG. 11 is a diagram showing a configuration of a speech recognition system according to a sixth embodiment.
FIG. 12 is a flowchart of a processing procedure performed in the speech recognition system according to the sixth embodiment.
[Explanation of symbols]
10 Client
11 Microphone
12 Voice analysis unit
13 Dictionary 1 storage unit
14 Primary speech recognition unit
15 Selection unit
16 Transmission unit
17 Control unit
18 Receiving unit
19 Dictionary 1 management unit
20 Server
21 Receiving unit
22 Dictionary 2 storage unit
23 Secondary speech recognition unit
24 Control unit
25 Selection unit
26 Transmission unit
30 Voice identification unit
31 Transmission amount monitoring unit
32 Server monitoring unit
S200 Step of starting processing when voice is input to client 10
S201 Step of generating voice data in client 10
S202 Step of performing primary speech recognition using the dictionary 1 in the client 10
S203 Step of checking whether primary speech recognition is possible in client 10
S204 Step of transmitting primary recognition result data to server 20 by client 10
S205 Step of receiving primary recognition result data in server 20
S206 Step of obtaining primary recognition result data as recognition result data of speech recognition system in server 20
S207 Step of ending the process
S208 Step of transmitting voice data to server 20 by client 10
S209 Step of receiving voice data in server 20
S210 Step of performing secondary speech recognition using the dictionary 2 in the server 20 and obtaining secondary recognition result data as recognition result data of the speech recognition system
S211 Step of transmitting secondary recognition result data to client 10 in server 20
S212 Step of receiving secondary recognition result data in client 10
S213 Step of storing secondary recognition result data in dictionary 1 in client 10
S214 Step of checking whether dictionary 1 has a space for storing dictionary data in client 10
S215 Step of deleting, in client 10, the dictionary data having the lowest primary voice recognizable probability from dictionary 1
S216 Step of performing speaker identification and generating speaker data in client 10
S217 Step of deleting, in client 10, the dictionary data associated with speakers other than the speaker corresponding to the speaker data from dictionary 1
S218 Step of transmitting speaker data to server 20 in client 10
S219 Step of receiving speaker data at server 20
S220 Step of transmitting dictionary data associated with the speaker corresponding to the speaker data of dictionary 2 to client 10 at server 20
S221 Step of receiving dictionary data associated with a speaker corresponding to the speaker data in the client 10
S222 Step of storing, in client 10, the dictionary data associated with the speaker corresponding to the speaker data in dictionary 1
S223 Step of checking whether the data transmission amount between client 10 and server 20 is equal to or larger than a threshold value
S224 Step of performing primary speech recognition on the client 10 using the dictionary areas 1 and 2 of the dictionary 1
S225 Step of performing primary speech recognition using dictionary area 1 of dictionary 1 in client 10
S226 Step of checking whether the usage rate of the CPU on the server 20 side is equal to or higher than the threshold

Claims

A speech recognition system comprising a server and a client,
wherein the client comprises: voice analysis means for analyzing input voice to generate voice data; first storage means for storing a first dictionary including a plurality of dictionary data for performing primary voice recognition; first voice recognition means for performing primary voice recognition using the voice data and the dictionary data of the first dictionary to generate primary recognition result data; first selection means for selecting, from the voice data and the primary recognition result data, data to be transmitted to the server; and first transmission means for transmitting the data selected by the first selection means to the server, and
wherein the server comprises: second receiving means for receiving the data transmitted by the client; second storage means for storing a second dictionary including a plurality of dictionary data for performing secondary voice recognition; and second voice recognition means for performing secondary voice recognition using the data received by the second receiving means and the dictionary data of the second dictionary.

The speech recognition system according to claim 1, wherein, when the first voice recognition means can perform primary voice recognition, the client selects, by the first selection means, the primary recognition result data generated by the first voice recognition means, and transmits the primary recognition result data to the server by the first transmission means.

The speech recognition system according to claim 1, wherein, when the data received by the second receiving means is primary recognition result data, the server does not perform secondary voice recognition by the second voice recognition means and obtains the primary recognition result data as the recognition result data of the speech recognition system.

The speech recognition system according to claim 1, wherein the server further comprises second selection means for selecting, from the plurality of data generated by the server, data to be transmitted to the client, and second transmission means for transmitting the data selected by the second selection means to the client, and
wherein the client further comprises first receiving means for receiving the data transmitted by the server.

The speech recognition system according to claim 4, wherein, when the first voice recognition means cannot perform primary voice recognition, the client selects, by the first selection means, the voice data generated by the voice analysis means and transmits the voice data to the server by the first transmission means,
when the data received by the second receiving means is voice data, the server performs secondary voice recognition by the second voice recognition means to generate secondary recognition result data, obtains the secondary recognition result data as the recognition result data of the speech recognition system, selects the recognition result data by the second selection means, and transmits the recognition result data to the client by the second transmission means, and
when the data received by the first receiving means is the recognition result data, the client registers the recognition result data in the first dictionary in the first storage means.

The speech recognition system according to claim 1, wherein the client further includes first dictionary management means for calculating, for each of the plurality of dictionary data stored in the first dictionary, a primary speech recognizable probability, which is the probability that the dictionary data matches the voice data of the voice analysis means, and for storing the primary speech recognizable probability in a first probability table.

The speech recognition system according to claim 6, wherein, when the data received by the first receiving means is the recognition result data, the client registers the data received by the first receiving means in the first dictionary if the first dictionary in the first storage means has space for storing dictionary data, and, if the first dictionary has no space for storing dictionary data, deletes from the first dictionary the dictionary data whose primary speech recognition adoption probability is the lowest and then registers the data received by the first receiving means in the first dictionary.
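
The registration rule in the claim above resembles a bounded cache that evicts its least likely entry. A minimal sketch follows, assuming the first dictionary and the first probability table are both plain mappings keyed by a voice-data key. The function name `register_result`, the `capacity` parameter, and the `initial_probability` given to a new entry are assumptions made for illustration; the claims do not specify them.

```python
from typing import Dict


def register_result(
    first_dictionary: Dict[str, str],
    probability_table: Dict[str, float],
    capacity: int,
    voice_key: str,
    result: str,
    initial_probability: float = 0.0,  # hypothetical starting value for a new entry
) -> None:
    """Register a result received from the server, evicting the least probable entry if full."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    if voice_key not in first_dictionary and len(first_dictionary) >= capacity:
        # No free space: delete the dictionary data with the lowest probability.
        lowest = min(first_dictionary, key=lambda k: probability_table.get(k, 0.0))
        del first_dictionary[lowest]
        probability_table.pop(lowest, None)
    first_dictionary[voice_key] = result
    probability_table.setdefault(voice_key, initial_probability)
```

How the probabilities are updated after each recognition attempt is left out here, since the claim only states that they are calculated and stored.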

The speech recognition system according to claim 1, wherein the client further includes voice identification means for performing voice identification using the voice data to identify a speaker and generate speaker data indicating who the speaker is, the first storage means stores each of the plurality of dictionary data in the first dictionary in association with a speaker, and the second storage means stores each of the plurality of dictionary data in the second dictionary in association with a speaker.

The speech recognition system according to claim 8, wherein the client deletes, from the first dictionary in the first storage means, the plurality of dictionary data associated with speakers other than the speaker identified by the voice identification means, selects, by the first selection means, the speaker data generated by the voice identification means, and transmits the speaker data to the server by the first transmission means,
when the data received by the second receiving means is speaker data, the server selects, by the second selecting means, each of the plurality of dictionary data in the second dictionary that is associated with the speaker indicated by the received speaker data, and transmits each of the plurality of dictionary data associated with the speaker to the client by the second transmission means, and
when the data received by the first receiving means is dictionary data associated with a speaker, the client stores the dictionary data in the first dictionary in the first storage means in association with the speaker.
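
The speaker-dependent dictionary exchange in the claim above might be sketched as follows. This is an assumption-laden illustration: each piece of dictionary data is modeled as a `(speaker, voice_key, result)` tuple, the network exchange is collapsed into direct function calls, and the names `prune_to_speaker`, `server_entries_for`, and `switch_speaker` are hypothetical.

```python
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (speaker, voice_key, result); layout is an assumption


def prune_to_speaker(first_dictionary: List[Entry], speaker: str) -> List[Entry]:
    """Client side: drop dictionary data associated with any other speaker."""
    return [entry for entry in first_dictionary if entry[0] == speaker]


def server_entries_for(second_dictionary: List[Entry], speaker: str) -> List[Entry]:
    """Server side: select the dictionary data associated with the given speaker."""
    return [entry for entry in second_dictionary if entry[0] == speaker]


def switch_speaker(
    first_dictionary: List[Entry], second_dictionary: List[Entry], speaker: str
) -> List[Entry]:
    """Return the client's first dictionary after switching to the identified speaker."""
    kept = prune_to_speaker(first_dictionary, speaker)
    for entry in server_entries_for(second_dictionary, speaker):
        if entry not in kept:  # avoid storing the same dictionary data twice
            kept.append(entry)
    return kept
```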

A speech recognition system comprising a server and a client,
The client includes: voice analysis means for analyzing input voice to generate voice data; first storage means for storing a first dictionary that has a first dictionary area and a second dictionary area, each storing a plurality of dictionary data for performing primary voice recognition; first voice recognition means for performing primary voice recognition using the voice data and the dictionary data stored in one of the first dictionary area and the second dictionary area of the first dictionary to generate primary recognition result data; first selection means for selecting, from the voice data or the primary recognition result data, data to be transmitted to the server; first transmission means for transmitting the data selected by the first selection means to the server; first receiving means for receiving data transmitted by the server; and transmission amount monitoring means for monitoring a data transmission amount between the client and the server, and
the server includes: second receiving means for receiving the data transmitted by the client; second storage means for storing a second dictionary including a plurality of dictionary data for performing secondary voice recognition; second speech recognition means for performing secondary speech recognition using the data received by the second receiving means and the dictionary data of the second dictionary; second selection means for selecting, from a plurality of data generated by the server, data to be transmitted to the client; and second transmission means for transmitting the data selected by the second selection means to the client.

The speech recognition system according to claim 10, wherein, when the value of the data transmission amount between the client and the server obtained by the transmission amount monitoring means is equal to or greater than a certain threshold, the first speech recognition means of the client performs primary speech recognition using the dictionary data stored in the first dictionary area and the second dictionary area of the first dictionary, and, when the value of the data transmission amount is less than the threshold, the first speech recognition means performs primary speech recognition using the dictionary data stored in the first dictionary area of the first dictionary.
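
The threshold rule above can be illustrated with a short sketch. It assumes the two dictionary areas are mappings from a voice-data key to a recognition result and that the monitored transmission amount is a single number; the function name `active_entries` and the units of the threshold are hypothetical.

```python
from typing import Dict


def active_entries(
    area1: Dict[str, str],
    area2: Dict[str, str],
    transmission_amount: float,
    threshold: float,
) -> Dict[str, str]:
    """Return the dictionary data the client uses for primary recognition."""
    if transmission_amount >= threshold:
        # Heavy traffic: search both areas so that more input is resolved locally.
        merged = dict(area2)
        merged.update(area1)  # on a key clash, the first area takes precedence (assumption)
        return merged
    # Light traffic: search only the first dictionary area.
    return dict(area1)
```

Searching the larger dictionary costs more on the client, so in this reading the second area is consulted only when sending voice data to the server would add to an already high transmission amount.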

The speech recognition system according to claim 10, wherein, when the first speech recognition means can perform primary speech recognition, the client selects, by the first selection means, the primary recognition result data generated by the first speech recognition means, and transmits the primary recognition result data to the server by the first transmission means.

The speech recognition system according to claim 10, wherein, when the data received by the second receiving means is primary recognition result data, the server does not perform secondary voice recognition by the second voice recognition means, and obtains the primary recognition result data as the recognition result data of the speech recognition system.

A speech recognition system comprising a server and a client,
The client includes: voice analysis means for analyzing input voice to generate voice data; first storage means for storing a first dictionary that has a first dictionary area and a second dictionary area, each storing a plurality of dictionary data for performing primary voice recognition; first voice recognition means for performing primary voice recognition using the voice data and the dictionary data stored in one of the first dictionary area and the second dictionary area of the first dictionary to generate primary recognition result data; first selection means for selecting, from the voice data or the primary recognition result data, data to be transmitted to the server; first transmission means for transmitting the data selected by the first selection means to the server; first receiving means for receiving data transmitted by the server; and CPU monitoring means for monitoring the server-side CPU usage rate, and
the server includes: second receiving means for receiving the data transmitted by the client; second storage means for storing a second dictionary including a plurality of dictionary data for performing secondary voice recognition; second speech recognition means for performing secondary speech recognition using the data received by the second receiving means and the dictionary data of the second dictionary; second selection means for selecting, from a plurality of data generated by the server, data to be transmitted to the client; second transmission means for transmitting the data selected by the second selection means to the client; and CPU utilization calculating means for calculating the usage rate of the server-side CPU and generating server-side CPU utilization data.

The speech recognition system according to claim 14, wherein, when the server-side CPU utilization data is generated by the CPU utilization calculating means, the server selects the server-side CPU utilization data by the second selection means and transmits the server-side CPU utilization data to the client by the second transmission means, and
when the data received by the first receiving means is server-side CPU utilization data, the client calculates the server-side CPU usage rate from the server-side CPU utilization data by the CPU monitoring means; when the value of the server-side CPU usage rate is equal to or greater than a certain threshold, the first speech recognition means performs primary speech recognition using the dictionary data stored in the first dictionary area and the second dictionary area of the first dictionary; and when the value of the server-side CPU usage rate is less than the threshold, the first speech recognition means performs primary speech recognition using the dictionary data stored in the first dictionary area of the first dictionary.
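
A rough sketch of the client-side CPU monitoring in the claim above follows. It assumes the server reports its utilization as a percentage in a `("cpu_utilization", value)` message and that the below-threshold case mirrors the transmission-amount claim, with only the first dictionary area searched. The class `CpuMonitor`, the message tag, and the area labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CpuMonitor:
    """Tracks the most recent server-side CPU usage rate reported to the client."""

    threshold: float   # percentage at or above which both areas are searched (assumed unit)
    usage: float = 0.0

    def on_message(self, message: Tuple[str, float]) -> None:
        """Update the stored usage rate when the message carries CPU utilization data."""
        kind, value = message
        if kind == "cpu_utilization":
            self.usage = value

    def areas_to_search(self) -> Tuple[str, ...]:
        """Return the labels of the dictionary areas used for primary recognition."""
        if self.usage >= self.threshold:
            # Busy server: resolve as much as possible on the client.
            return ("area1", "area2")
        return ("area1",)
```

A caller would pass every message received from the server to `on_message` and consult `areas_to_search` before each primary recognition attempt.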

The speech recognition system according to claim 14, wherein, when the first speech recognition means can perform primary speech recognition, the client selects, by the first selection means, the primary recognition result data generated by the first speech recognition means, and transmits the primary recognition result data to the server by the first transmission means.

The speech recognition system according to claim 14, wherein, when the data received by the second receiving means is primary recognition result data, the server does not perform secondary voice recognition by the second voice recognition means, and obtains the primary recognition result data as the recognition result data of the speech recognition system.

A speech recognition client used in a speech recognition system including a server and a client,
the speech recognition client comprising: voice analysis means for analyzing input voice to generate voice data; first storage means for storing a first dictionary composed of a plurality of dictionary data for performing primary voice recognition; first voice recognition means for performing primary voice recognition using the dictionary data of the first dictionary to generate primary recognition result data; first selection means for selecting, from a plurality of data generated by the client, namely the voice data or the primary recognition result data, data to be transmitted to the server; and first transmission means for transmitting the data selected by the first selection means to the server.

A speech recognition server used in a speech recognition system including a server and a client,
the speech recognition server comprising: second receiving means for receiving the data transmitted by the client; second storage means for storing a second dictionary including a plurality of dictionary data for performing secondary voice recognition; and second speech recognition means for performing secondary speech recognition using the data received by the second receiving means and the dictionary data of the second dictionary.

A speech recognition client program used in a speech recognition system including a server and a client,
the speech recognition client program comprising: a voice analysis step of analyzing input voice to generate voice data; a first storage step of storing a first dictionary composed of a plurality of dictionary data for performing primary voice recognition; a first speech recognition step of performing primary speech recognition using the dictionary data of the first dictionary to generate primary recognition result data; a first selection step of selecting, from a plurality of data generated by the client, namely the voice data or the primary recognition result data, data to be transmitted to the server; and a first transmission step of transmitting the data selected in the first selection step to the server.

A speech recognition server program used in a speech recognition system including a server and a client,
the speech recognition server program comprising: a second receiving step of receiving data transmitted by the client; a second storing step of storing a second dictionary composed of a plurality of dictionary data for performing secondary voice recognition; and a second voice recognition step of performing secondary voice recognition using the received data and the dictionary data of the second dictionary.