JP4392581B2

JP4392581B2 - Language processing apparatus, language processing method, program, and recording medium

Info

Publication number: JP4392581B2
Application number: JP2003042019A
Authority: JP
Inventors: 厚夫廣江
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2003-02-20
Filing date: 2003-02-20
Publication date: 2010-01-06
Anticipated expiration: 2023-02-20
Also published as: JP2004252121A

Description

【０００１】
【発明の属する技術分野】
本発明は、言語処理装置および言語処理方法、並びにプログラムおよび記録媒体に関し、特に、例えば、登録した単語を複数のアプリケーションで共通に認識できるようにした言語処理装置および言語処理方法、並びにプログラムに関する。
【０００２】
【従来の技術】
音声認識には、単独の単語を認識する孤立単語認識と複数の単語からなる単語列を認識する連続単語認識がある。従来の連続単語認識では、言語モデルという「単語間のつながりやすさについてのデータベース」を持つことで、「音は似ているが滅茶苦茶な単語列」が認識結果として生成されることを防いでいる。
【０００３】
しかしながら、言語モデルには、最初から認識できる単語（以下、適宜、既知語と称する）についての情報のみ記述されるため、後で登録された単語（以下、適宜、登録単語と称する）を正しく認識することが困難であった。すなわち、孤立単語認識では、認識辞書に単語を登録すれば、以降その単語は認識されるようになるが、連続単語認識では辞書への登録だけでは不十分であり、登録単語を言語モデルにも反映させる必要があるが、言語モデルへの反映は一般的には困難であった。
【０００４】
そこで、登録単語を「人名」、「地名」等のカテゴリに分類し、そのカテゴリに対応した認識文法を用意して、音声を認識することが提案されている（例えば、特許文献１参照）。
【０００５】
また、音声認識を使用するアプリケーションが複数、しかも可変個存在するシステムにおいて、１つのアプリケーションで登録された単語を、他のアプリケーションに反映させる場合、アプリケーションが１つの場合とは違った問題が発生する。例えば、既に起動しているアプリケーションに対してのみ単語登録を行うようにすると、アプリケーションが１つの場合と異なり、登録後に起動、またはインストールされたアプリケーションに、登録単語を反映させることが困難であるという課題があった。
【０００６】
さらに、アプリケーションが複数ある場合、複数のアプリケーションで、何度も同一の登録単語を削除することは面倒である。また、アプリケーションが複数である場合、登録単語を全て削除することは容易であるが、その一部だけを削除したり発音を変更することは困難であるという課題があった。
【０００７】
即ち、アプリケーションが１つである場合、例えば、削除または変更する登録単語を「ｎ回目に登録した単語」や「認識辞書中のｎ番目のエントリ」といった情報で特定できるが、アプリケーションが複数である場合、各アプリケーションによって、「ｎ回目に登録した単語」や「辞書エントリの何番目に追加したか」が異なるため特定することが困難であった。
【０００８】
また、アプリケーションが複数である場合、発音で、登録単語を特定することができるが、発音で登録単語を特定した場合、同音異義語が削除または変更されてしまうおそれがあった。
【０００９】
そこで、各アプリケーションが個別に音声認識を行う代わりに、「音声コマンダ」というモジュールが、全てのアプリケーションに対する音声認識を行い、その認識結果を各アプリケーションに転送することが提案されている（例えば、特許文献１参照）。
【００１０】
【特許文献１】
特開２００１−２１６１２８号公報
【００１１】
【発明が解決しようとする課題】
しかしながら、特許文献１に記載の発明では、各アプリケーションに対応した認識辞書と言語モデルとを、「音声コマンダ」が所持している必要がある。即ち、「音声コマンダ」を開発する際に、どのようなアプリケーションが同時に使用されるかを想定して、それに適した認識辞書、言語モデルを用意しておく必要があるため、想定外のアプリケーションに対しては、登録単語を反映させることが困難であるという課題があった。
【００１２】
本発明はこのような状況に鑑みてなされたものであり、登録した単語を複数のアプリケーションで共通に使用することができるようにするものである。
【００１３】
【課題を解決するための手段】
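The language processing apparatus of the present invention comprises: registered dictionary storage means for storing a registered dictionary in which words are registered; construction means for constructing, for each application and on the basis of the registered dictionary, a dedicated dictionary specific to that application, in which the words subject to the language processing used by that application are registered; processing means for adding, deleting, or changing words in the registered dictionary; and deletion means for deleting words from the dedicated dictionary. After all the words registered in the dedicated dictionary have been deleted, the construction means reconstructs the dedicated dictionary on the basis of the registered dictionary in which words have been added, deleted, or changed.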
【００１５】
専用辞書は、所定の単語が予め登録されている固定辞書と、登録される単語が可変の可変辞書とを、少なくとも含み、構築手段は、専用辞書のうちの可変辞書を構築するようにすることができる。
【００１６】
専用辞書は、単語のカテゴリが登録されたカテゴリテーブルをさらに含み、構築手段は、登録辞書の単語のうち、カテゴリテーブルに登録されたカテゴリの単語を、可変辞書に登録することにより、可変辞書を構築するようにすることができる。
【００１７】
カテゴリの単語がどのように連鎖するかを示す連鎖情報を記述する言語モデルを記憶する言語モデル記憶手段と、専用辞書と言語モデルに基づいて音声認識を行う認識処理手段とをさらに設けることができる。
【００１８】
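The language processing method of the present invention comprises: a registered dictionary storing step of storing a registered dictionary in which words are registered; a construction step of constructing, for each application and on the basis of the registered dictionary, a dedicated dictionary specific to that application, in which the words subject to the language processing used by that application are registered; a processing step of adding, deleting, or changing words in the registered dictionary; a deletion step of deleting words from the dedicated dictionary; and a reconstruction step of reconstructing the dedicated dictionary, after all the words registered in it have been deleted, on the basis of the registered dictionary in which words have been added, deleted, or changed.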
【００１９】
本発明の記録媒体に記録されているプログラムは、アプリケーションで利用される言語処理の対象となる単語が登録される、そのアプリケーション専用の専用辞書を、アプリケーションごとに、単語が登録される登録辞書に基づいて構築する構築ステップと、登録辞書に対して、単語を追加、削除、または変更する処理を行なう処理ステップと、専用辞書の単語を削除する削除ステップと、専用辞書に登録されたすべての単語が削除された後、単語が追加、削除、または変更された登録辞書に基づいて、専用辞書を再構築する再構築ステップとをコンピュータに実行させることを特徴とする。
【００２０】
本発明のプログラムは、アプリケーションで利用される言語処理の対象となる単語が登録される、そのアプリケーション専用の専用辞書を、アプリケーションごとに、単語が登録される登録辞書に基づいて構築する構築ステップと、登録辞書に対して、単語を追加、削除、または変更する処理を行なう処理ステップと、専用辞書の単語を削除する削除ステップと、専用辞書に登録されたすべての単語が削除された後、単語が追加、削除、または変更された登録辞書に基づいて、専用辞書を再構築する再構築ステップとをコンピュータに実行させることを特徴とする。
【００２１】
本発明においては、アプリケーションで利用される言語処理の対象となる単語が登録される、そのアプリケーション専用の専用辞書が、アプリケーションごとに、単語が登録される登録辞書に基づいて構築され、登録辞書に対して、単語を追加、削除、または変更する処理が行なわれ、専用辞書の単語が削除され、専用辞書に登録されたすべての単語が削除された後、単語が追加、削除、または変更された登録辞書に基づいて、専用辞書が再構築される。
【００２２】
【発明の実施の形態】
以下、本発明の実施の形態について、図面を参照して説明する。図１は、本発明を適用したロボット制御システム１の構成例を表わしている。
【００２３】
このロボット制御システム１において、音声認識エンジン部１１は、入力された音声データを認識し、認識結果として、音声データに対応する単語列を生成する。音声認識エンジン部１１は、その認識結果を、名前登録用アプリケーション部２１₁、雑談用アプリケーション部２１₂、音声コマンダ用アプリケーション部２１₃、・・・、その他のアプリケーション部２１_M、並びに、アプリケーション管理部３１に供給する。
【００２４】
名前登録用アプリケーション部２１₁、雑談用アプリケーション部２１₂、音声コマンダ用アプリケーション部２１₃、・・・、その他のアプリケーション部２１_Mは、音声認識エンジン部１１から供給された認識結果に基づいて、各種の処理を行う。
【００２５】
名前登録用アプリケーション部２１₁は、音声認識エンジン部１１から供給された認識結果に基づいて、ロボット名、ユーザ名等を音声で登録し、それ以外のアプリケーション部は、名前登録用アプリケーション部２１₁が登録した名前を用い、ユーザからの発話に対応してロボットの動作を制御する。
【００２６】
したがって、雑談用アプリケーション部２１₂、音声コマンダ用アプリケーション部２１₃、・・・・、およびその他のアプリケーション部２１_Mで行われる音声認識は、名前登録用アプリケーション部２１₁で登録されたロボット名、ユーザ名等に対応する必要がある。
【００２７】
雑談用アプリケーション部２１₂は、ロボットに、ユーザと音声で雑談させ、音声コマンダ用アプリケーション部２１₃は、ロボットに、ユーザからの発話に対応する動作を行わせる。例えば、音声コマンダ用アプリケーション部２１₃は、「エスディーアール（ロボット名）、前に進め！」といったユーザからの発話に対応して、ロボットを前に進める。
【００２８】
なお、アプリケーション部は、任意の個数用意することができる。以下、名前登録用アプリケーション部２１₁、雑談用アプリケーション部２１₂、音声コマンダ用アプリケーション部２１₃、・・・、およびその他のアプリケーション部２１_Mのそれぞれを個々に区別する必要がない場合、適宜、まとめて、アプリケーション部２１と称する。
【００２９】
アプリケーション管理部３１は、音声認識エンジン部１１から供給された認識結果に基づいて、アプリケーション部２１に対して、起動、終了の指令を行う。例えば、アプリケーション管理部３１は、音声認識エンジン部１１から「音声コマンダを起動」という認識結果が供給された場合、音声コマンダ用アプリケーション部２１₃を起動する。このとき、複数のアプリケーション部を同時に起動させてもよい。
【００３０】
また、アプリケーション部２１、およびアプリケーション管理部３１は、音声認識エンジン部１１に対して、タスク切替コマンドを発行し、それぞれに対応したタスク（図２で後述する）が、音声認識エンジン部１１の内部で有効（アクティブ）、または無効（ディアクティブ）になるように制御する。
【００３１】
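FIG. 2 shows the configuration of the speech recognition engine section 11. A user's utterance is input to the microphone 51, which converts it into an audio signal in the form of an electrical signal and supplies that signal to the AD (analog-to-digital) conversion section 52. The AD conversion section 52 samples and quantizes the analog audio signal from the microphone 51 and converts it into audio data, which is a digital signal. This audio data is supplied to the feature extraction section 53.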
【００３２】
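For each suitable frame of the audio data from the AD conversion section 52, the feature extraction section 53 extracts feature parameters such as the spectrum, power, linear prediction coefficients, cepstrum coefficients, and line spectrum pairs, and supplies them to the matching section 54.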
【００３３】
マッチング部５４は、特徴量抽出部５３からの特徴パラメータに基づき、音韻タイプライタ用タスク７１₁、アプリケーション切替用タスク７１₂、名前登録用タスク７１₃、雑談用タスク７１₄、音声コマンダ用タスク７１₅、・・・、およびその他のタスク７１_Nのうち、その時点で有効にされているタスク毎に、タスク内部のデータベースを必要に応じて参照しながら、マイクロホン５１に入力された音声（入力音声）に最も近い単語列を、認識結果として求める。マッチング部５４は、その認識結果を、それぞれのタスクに対応するアプリケーション部２１、およびアプリケーション管理部３１に供給する。
【００３４】
なお、タスクとは、音声認識を行うのに必要なデータのセットのことである。即ち、音声認識エンジン部１１を、マッチング等を行うプログラム部分と、音響モデル、言語モデル、認識辞書等のデータ部分とに分類した場合のデータ部分、およびデータにアクセスするためのプログラムのことである。
【００３５】
したがって、複数のアプリケーションが異なる音響モデル、言語モデル、辞書を用いて音声認識を行う場合であっても、タスクを複数用意することによって、音声認識エンジン部は１つにすることができる。タスクの内部の詳細については、図６で後述する。
【００３６】
音韻タイプライタ用タスク７１₁は、音韻タイプライタとして働くタスクであり、音声認識エンジン部１１の指令により、有効にされる。この音韻タイプライタによって、マッチング部５４は、入力された任意の音声に対して、音韻系列を取得する他、カナ表記の発音も取得する。例えば、「君の名前はエスディーアールだよ」という音声から、“k/i/m/i/n/o/n/a/m/a/e/w/a/e/s/u/d/i:/a:/r/u/d/a/y/o”（“i:”、“a:”は、それぞれ“i”、“a”の長音）という音韻系列と、「キミノナマエワエスディーアールダヨ」というカナ表記を取得する。この音韻系列とカナ表記は、未知語獲得部５６で用いられる。
【００３７】
アプリケーション切替用タスク７１₂は、アプリケーション管理部３１に対応したタスクであり、アプリケーション管理部３１が起動した後、アプリケーション管理部３１からタスク切替コマンドが供給されると、有効にされる。アプリケーション切替用タスク７１₂によって、マッチング部５４は、例えば、「雑談アプリを起動」、「音声コマンダを起動」、「名前登録を起動して」等のアプリケーション部の起動、または終了命令に対応する音声を認識する。
【００３８】
名前登録用タスク７１₃は、名前登録用アプリケーション部２１₁に対応したタスクであり、アプリケーション管理部３１からの指令により、名前登録用アプリケーション部２１₁が起動された後、名前登録用アプリケーション部２１₁からタスク切替コマンドが供給されると、有効にされる。名前登録用タスク７１₃によって、マッチング部５４は、例えば、「君の名前は、＜ロボット名を表す未知語＞だよ。」、「私の名前は、＜人名を表す未知語＞です。」といった名前に対応する音声を認識する。
【００３９】
雑談用タスク７１₄、音声コマンダ用タスク７１₅、・・・、およびその他のタスク７１_Nは、それぞれ雑談用アプリケーション部２１₂、音声コマンダ用アプリケーション部２１₃、・・・、その他のアプリケーション部２１_Mに対応したタスクであり、アプリケーション管理部３１からの指令により、対応するアプリケーション部が起動された後、対応するアプリケーション部からそれぞれタスク切替コマンドが供給されると、有効にされる。
【００４０】
マッチング部５４は、雑談用タスク７１₄によって、例えば、「エスディーアール（ロボット名）、何時に起きたの？」というユーザからの雑談としての発話を認識することができる。また、マッチング部５４は、音声コマンダ用タスク７１₅によって、例えば、「エスディーアール（ロボット名）、前に１歩進め」というユーザからの指令としての発話を認識することができる。
【００４１】
また、マッチング部５４は、後述する共通辞書部５５に登録された単語を、各タスクに反映させる。
【００４２】
なお、以下、音韻タイプライタ用タスク７１₁、アプリケーション切替用タスク７１₂、名前登録用タスク７１₃、雑談用タスク７１₄、音声コマンダ用タスク７１₅、・・・、およびその他のタスク７１_Nのそれぞれを個々に区別する必要がない場合、適宜、まとめて、タスク７１と称する。
【００４３】
共通辞書部５５は、タスク７１で共通に用いる単語の辞書としての共通辞書を記憶している。共通辞書部５５に記憶されている共通辞書には、そこに登録された全単語について、発音情報とカテゴリ情報が記述される。例えば、固有名詞である「エスディーアール（ロボット名）」が共通辞書に登録される場合、「エスディーアール」という発音（音韻情報）と“＿ロボット名＿”というカテゴリが共通辞書に記述される。詳細は、図２４で後述する。
【００４４】
未知語獲得部５６は、認識用の辞書（図６で後述する固定単語辞書１３１）に登録されていない名前等の単語（未知語）について、音韻タイプライタ用タスク７１₁によって認識され、マッチング部５４から供給された音韻系列およびカナ表記を記憶し、それ以降、その単語の音声を認識できる（他の音声と識別できる）ようにする。
【００４５】
即ち、未知語獲得部５６は、音韻タイプライタ用タスク７１₁によって認識された未知語の音韻系列およびカナ表記を、いくつかのクラスタに分類する。各クラスタはＩＤ、代表音韻系列、および代表カナ表記を持ち、ＩＤで管理される。
【００４６】
図３は、未知語獲得部５６のクラスタの状態を示している。
【００４７】
「あか」、「あお」、「みどり」の３回の音声が入力されたとき、未知語獲得部５６は、３回の入力音声を、それぞれに対応した「あか」クラスタ９１、「あお」クラスタ９２、および「みどり」クラスタ９３の３つのクラスタに分類し、各クラスタに、代表となる音韻系列（図３の例の場合、“a/k/a”、“a/o”、“m/i/d/o/r/i”）、代表的なカナ表記（図３の例の場合、「アカ」、「アオ」、「ミドリ」）、およびＩＤ（図３の例の場合、「１」、「２」、「３」）を付加する。
【００４８】
ここで、再び「あか」という音声が入力されると、対応するクラスタが既に存在するので、未知語獲得部５６は、入力音声を「あか」クラスタ９１に分類し、新しいクラスタは生成しない。これに対して、「くろ」という音声が入力された場合、対応するクラスタが存在しないので、未知語獲得部５６は、「くろ」に対応した「くろ」クラスタ９４を新たに生成し、そのクラスタに、代表的な音韻系列（図３の例の場合、“k/u/r/o”）、代表的なカナ表記（図３の例の場合、「クロ」）、およびＩＤ（図３の例の場合、「４」）を付加する。
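The assign-or-create behaviour described for the unknown word acquisition section 56 can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how similarity is measured, so a Levenshtein distance over phoneme sequences with a hypothetical `threshold` stands in for it, and the names `UnknownWordStore` and `Cluster` are invented for the example.

```python
from dataclasses import dataclass, field


def edit_distance(a, b):
    """Levenshtein distance between two phoneme lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


@dataclass
class Cluster:
    cluster_id: int
    phonemes: list  # representative phoneme sequence
    kana: str       # representative kana notation


@dataclass
class UnknownWordStore:
    threshold: int = 1  # hypothetical similarity threshold (assumption)
    clusters: list = field(default_factory=list)
    next_id: int = 1

    def add(self, phonemes, kana):
        """Return the ID of the cluster the utterance was assigned to."""
        for c in self.clusters:
            if edit_distance(c.phonemes, phonemes) <= self.threshold:
                return c.cluster_id  # a matching cluster exists: no new one is made
        c = Cluster(self.next_id, list(phonemes), kana)
        self.clusters.append(c)
        self.next_id += 1
        return c.cluster_id


store = UnknownWordStore()
ids = [
    store.add("a/k/a".split("/"), "アカ"),
    store.add("a/o".split("/"), "アオ"),
    store.add("m/i/d/o/r/i".split("/"), "ミドリ"),
    store.add("a/k/a".split("/"), "アカ"),    # falls into the existing cluster
    store.add("k/u/r/o".split("/"), "クロ"),  # creates a fourth cluster
]
```

A real implementation would presumably also update the representative phoneme sequence and kana of a cluster as more utterances arrive; that refinement is left out of this sketch.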
【００４９】
この方法を用いると、ユーザが同じ音声を何度も入力することによって、各クラスタの代表音韻系列と代表カナ発音の精度をあげることができる。例えば、「みどり」を１度入力した時点では、音韻タイプライタが誤認識して、“m/e/r/a/a”という音韻系列と、「メラア」というカナ発音とを出力したとする。その後、「みどり」という発話を何回もすることにより、音韻系列とカナ発音とが正しい値（“m/i/d/o/r/i”と「ミドリ」）に収束していく可能性がある。このような単語獲得処理の詳細は、本出願人が先に提案した特願2001-097843号、および特願2001-382579号に開示されている。
【００５０】
次に、図４と図５を参照して、図１のロボット制御システム１におけるロボット制御処理を説明する。なお、この処理は、ユーザによりロボット制御システム１が起動されたとき、開始される。
【００５１】
ステップＳ１において、音声認識エンジン部１１が起動し、ステップＳ２に進む。ステップＳ２において、音声認識エンジン部１１は、前回のロボット制御システム１の終了時に、不図示の記憶部に記憶しておいた（後述するステップＳ１７の処理）共通辞書部５５の内容（共通辞書）と未知語獲得部５６のクラスタの状態をロードする。共通辞書とクラスタの状態が記憶部に記憶されていない場合は、共通辞書部５５と未知語獲得部５６のクラスタのエントリが何もない状態のままにする。記憶部にクラスタの状態は記憶されているが共通辞書の状態は記憶されていないという場合は、共通辞書のみ初期化（エントリが何もない状態に）する。逆に、共通辞書の状態は記憶されているがクラスタの状態は記憶されていない場合は、クラスタ由来のエントリ（図２４でクラスタＩＤが記述されているエントリ）は共通辞書から削除し、カナ発音由来のエントリ（図２４でカナ発音が記述されているエントリ）は残す。
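The four cases handled in step S2 can be summarised in a short sketch. The data layout is an assumption made for illustration: each common dictionary entry is a dict holding a `category` plus either a `kana` key or a `cluster_id` key, and `None` means that nothing was stored at the previous shutdown. The function name `load_state` is hypothetical.

```python
def load_state(saved_dictionary, saved_clusters):
    """Combine what was saved at the previous shutdown (None = nothing saved)."""
    if saved_dictionary is None:
        # Covers "nothing saved" and "clusters only": the dictionary starts empty.
        return [], dict(saved_clusters or {})
    if saved_clusters is None:
        # Dictionary only: drop cluster-derived entries, keep kana-derived ones.
        kana_only = [e for e in saved_dictionary if "kana" in e]
        return kana_only, {}
    return list(saved_dictionary), dict(saved_clusters)


saved = [
    {"kana": "エスディーアール", "category": "＿ロボット名＿"},
    {"cluster_id": 5, "category": "＿ユーザ名＿"},
]
dictionary, clusters = load_state(saved, None)
```

In this example only the kana-derived entry survives, because the cluster that the second entry refers to can no longer be resolved.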
【００５２】
ステップＳ２の処理後は、ステップＳ３に進み、音声認識エンジン部１１は、音韻タイプライタ用タスク７１₁を有効にし、音韻タイプライタ用タスク７１₁が音声認識に使用できる状態にして、ステップＳ４に進む。ステップＳ４において、アプリケーション管理部３１が起動し、ステップＳ５に進む。
【００５３】
ステップＳ５において、アプリケーション管理部３１は、対応するタスクであるアプリケーション切替用タスク７１₂を有効にし、ステップＳ６に進む。ステップＳ６において、音声認識エンジン部１１は、マイクロホン５１に音声で入力された、アプリケーション部２１の起動命令を認識し、認識結果をアプリケーション管理部３１に供給する。この音声認識処理の詳細は、図３５のフローチャートで後述する。
【００５４】
ステップＳ６の処理後は、図５のステップＳ７に進み、アプリケーション管理部３１は、音声認識エンジン部１１から供給された認識結果から、名前登録用アプリケーション部２１₁を起動するか否かを判定し、名前登録用アプリケーション部２１₁を起動すると判定した場合（例えば、認識結果が「名前登録を起動」である場合）、ステップＳ８に進む。
【００５５】
ステップＳ８において、アプリケーション管理部３１は、名前登録用アプリケーション部２１₁を起動させる。ステップＳ８の処理後は、ステップＳ９に進み、名前登録用アプリケーション部２１₁は、名前登録処理を行なう。この名前登録処理の詳細は、図２１のフローチャートで後述する。
【００５６】
ステップＳ７において、アプリケーション管理部３１は、名前登録用アプリケーション部２１₁を起動しないと判定した場合、ステップＳ１０に進み、音声認識エンジン部１１による認識結果から、雑談用アプリケーション部２１₂を起動するか否かを判定する。ステップＳ１０において、アプリケーション管理部３１は、雑談用アプリケーション部２１₂を起動すると判定した場合（例えば、認識結果が「雑談を起動して」である場合）、ステップＳ１１に進み、雑談用アプリケーション部２１₂を起動させる。
【００５７】
ステップＳ１１の処理後は、ステップＳ１２に進み、雑談用アプリケーション部２１₂は、雑談処理を行なう。この雑談処理の詳細は、図３２のフローチャートで後述する。
【００５８】
ステップＳ１０において、アプリケーション管理部３１は、雑談用アプリケーション部２１₂を起動しないと判定した場合、ステップＳ１３に進み、音声認識エンジン部１１による認識結果から、音声コマンダ用アプリケーション部２１₃を起動するか否かを判定する。ステップＳ１３の処理において、アプリケーション管理部３１は、音声コマンダ用アプリケーション部２１₃を起動すると判定した場合（例えば、認識結果が「音声コマンダ起動」である場合）、ステップＳ１４に進み、音声コマンダ用アプリケーション部２１₃を起動させる。
【００５９】
ステップＳ１４の処理後は、ステップＳ１５に進み、音声コマンダ用アプリケーション部２１₃は、音声コマンダ処理を行なう。この音声コマンダ処理の詳細は、図３２のフローチャートで後述する。
【００６０】
ステップＳ１３において、アプリケーション管理部３１は、音声コマンダ用アプリケーション部２１₃を起動しないと判定した場合、音声認識エンジン部１１による認識結果が誤っているため（アプリケーション切り替え以外の発話の場合もある）、図４のステップＳ６に戻り、音声認識エンジン部１１は、新たに入力された音声を認識する処理を行う。
【００６１】
このように、アプリケーション管理部３１は、音声認識エンジン部１１による認識結果に応じて、アプリケーション部２１を起動させる。
【００６２】
ステップＳ９，Ｓ１２，Ｓ１５の処理の後は、ステップＳ１６に進み、アプリケーション管理部３１は、ロボット制御処理を終了するか否かを判定する。例えば、アプリケーション管理部３１は、ユーザにより不図示の終了ボタンが押圧されたか否かを判定し、終了ボタンが押圧された場合、ロボット制御処理を終了すると判定する。
【００６３】
ステップＳ１６において、ロボット制御処理を終了しないと判定された場合、処理は図４のステップＳ６に戻り、入力された音声を認識する処理を繰り返す。ステップＳ１６において、アプリケーション管理部３１は、ロボット制御処理を終了する（終了ボタンが押圧された）と判定した場合、ステップＳ１７に進み、共通辞書部５５の共通辞書および未知語獲得部５６のクラスタの状態を、不図示の記憶部に記憶させる。
【００６４】
そして、アプリケーション管理部３１は、起動しているアプリケーション部２１がある場合、そのアプリケーション部を終了する。このとき、アプリケーション部２１は、対応するタスク７１を無効にする。また、アプリケーション管理部３１は、アプリケーション切替用タスク７１₂を無効にし、音声認識エンジン部１１は、音韻タイプライタ用タスク７１₁を無効にして、アプリケーション管理部３１および音声認識エンジン部１１は、処理を終了する。
【００６５】
なお、上述の処理では、アプリケーション部が、名前登録用アプリケーション部２１₁、雑談用アプリケーション部２１₂、音声コマンダ用アプリケーション部２１₃の３個のときを説明したが、さらにその他のアプリケーション部がある場合は、ステップＳ１３で、音声コマンダ用アプリケーションを起動しないと判定した場合、ステップＳ６に戻らず、ステップＳ７，Ｓ１０，Ｓ１３と同様に、他のアプリケーションを起動するか否かが判定され、その判定結果に応じて他のアプリケーションが起動される。
【００６６】
また、上述の処理では、音声認識の終了は、ユーザによって指令されたが、例えば、所定時間音声が入力されない場合に終了する等、ロボット制御システム１が自動的に判断してもよい。
【００６７】
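With the processing described above, the application switching task 71₂ remains active while each application section 21 is running, so an utterance such as 「○○を起動して」 ("start ○○") made while another application section is running can also be recognized and the corresponding application started. For example, if the user says 「雑談を起動して」 ("start chat") while the voice commander application section 21₃ is running, the chat application section 21₂ can be started.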
【００６８】
この場合、起動中のアプリケーション部を終了させてから新しいアプリケーション部を起動させるか、起動中のアプリケーション部は一時停止状態にしてから新しいアプリケーション部を起動し、新しいアプリケーション部が終了してから元のアプリケーション部を再開するか、あるいは両方を並列に起動させるかは、アプリケーション部同士の組み合わせによって予め設定されている（メモリ等のリソース制約などから動的に判断することもある）。
【００６９】
図６は、タスク７１の構成を示している。タスク７１は、音響モデル１１１、言語モデル１１２、辞書１１３、音韻リスト１１４、カナ音韻変換規則１１５、およびサーチパラメータ１１６から構成されている。
【００７０】
音響モデル１１１は、音声認識する音声の個々の音韻、音節等の音響的な特徴を表すモデルを記憶している。音響モデルとしては、例えば、HMM（Hidden Markov Model）を用いることができる。
【００７１】
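The language model 112 describes information (hereinafter referred to as chain information where appropriate) indicating how the words registered in the word dictionaries of the dictionary 113 chain (connect) to one another. Possible description methods include statistical word chain probabilities (n-grams), generative grammars, and finite-state automata.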
【００７２】
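In addition to chain information about words, the language model 112 also contains chain information about categories, which classify words from a particular point of view. For example, if the category consisting of words that represent user names is denoted by the symbol “＿ユーザ名＿” (user name) and the category consisting of words that represent robot names by the symbol “＿ロボット名＿” (robot name), the language model 112 also describes chain information for “＿ユーザ名＿” and “＿ロボット名＿” (chains between categories, chains between a category and words stored in the dictionary in advance, and so on).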
【００７３】
したがって、言語モデル１１２に含まれない単語についても連鎖情報を取得することができる。例えば、「エスディーアール」と「は（助詞）」の連鎖情報を取得する場合、言語モデル１１２に「エスディーアール」についての連鎖情報が記述されていなくても、「エスディーアール」が“＿ロボット名＿”というシンボルで表されるカテゴリに属していることがわかれば、代わりに“＿ロボット名＿”と「は」との連鎖情報を取得することによって、「エスディーアール」と「は」の連鎖情報を取得することができる。
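The category fallback described above can be sketched as a lookup that tries the word itself first and then its category. The table layout, the function name `chain_info`, and the value 0.3 are assumptions made for the example and do not come from the source.

```python
def chain_info(model, word_categories, left, right):
    """Look up chain information for (left, right), falling back to categories."""

    def candidates(symbol):
        yield symbol                       # the word itself, if the model knows it
        if symbol in word_categories:
            yield word_categories[symbol]  # otherwise the category it belongs to

    for l in candidates(left):
        for r in candidates(right):
            if (l, r) in model:
                return model[(l, r)]
    return None


# The model only knows the category, not the registered word (0.3 is a made-up value).
model = {("＿ロボット名＿", "は"): 0.3}
word_categories = {"エスディーアール": "＿ロボット名＿"}
value = chain_info(model, word_categories, "エスディーアール", "は")
```

Because the lookup goes through the category, a word registered after the language model was built still receives chain information without the model itself being modified.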
【００７４】
なお、カテゴリは、意味属性に基づく分類（“＿ロボット名＿”、“＿ユーザ名＿”、“＿地名＿”、“＿店名＿”等）ではなく、品詞に基づく分類（“＿名詞＿”、“＿動詞＿”、“＿助詞＿”等）にしてもよい。以下、“＿・・・＿”という表記は、カテゴリ名を表すものとする。
【００７５】
辞書１１３は、固定単語辞書１３１、可変単語辞書１３２、およびカテゴリテーブル１３３から構成されている。
【００７６】
固定単語辞書１３１には、単語登録および削除の対象外の単語、すなわち、予めロボット制御システム１に設定されている単語（以下、適宜、固定単語と称する）についての発音（音韻系列）、音韻および音節の連鎖関係を記述したモデル等、各種の情報が記述されている。
【００７７】
なお、固定単語辞書１３１には、タスク７１毎に、そのタスク７１に対応するアプリケーション部２１で用いられる専用の単語についての情報が記述されている。上述の音響モデル１１１および言語モデル１１２において、並びに後述するカテゴリテーブル１３３、音韻リスト１１４、カナ音韻変換規則１１５、およびサーチパラメータ１１６においても同様である。
【００７８】
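The variable word dictionary 132 describes various information, such as the pronunciation and a model describing the chain relations of phonemes and syllables, for the words that are subject to registration and deletion, that is, the registered words. When a new registered word is registered in the common dictionary section 55, that word is reflected in the variable word dictionary 132. This reflection process is described later with reference to FIG. 25. Words can be deleted and pronunciations changed only for entries of the variable word dictionary 132. The variable word dictionary 132 may also be empty.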
【００７９】
カテゴリテーブル１３３は、言語モデル１１２に含まれているカテゴリとそのカテゴリに含まれている単語の情報との対応を示すテーブルを記憶している。また、タスク７１がカテゴリ独自のＩＤ（カテゴリＩＤ）を付与している場合には、カテゴリテーブル１３３は、カテゴリのシンボルとそのＩＤの対応関係も記憶する。例えば、“＿ロボット名＿”のカテゴリに、カテゴリＩＤ「４」が付与されている場合、“＿ロボット名＿”に対応して、カテゴリＩＤ＝４も記憶する。なお、カテゴリテーブル１３３は、言語モデル１１２がカテゴリを含まない場合、何も記憶しない。
【００８０】
音韻リスト１１４は、タスク７１で使用する音韻記号の一覧である。カナ音韻変換規則１１５は、カナ文字列を音韻系列に変換するための規則である。このように、カナ音韻変換規則１１５をタスク毎に記憶することによって、共通辞書部５５は、発音情報として、音韻系列とは独立であるカナ文字列を保持することができる。
【００８１】
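The search parameters 116 hold the parameters that the matching section 54 uses when it performs matching (search). Some parameters depend on the acoustic model 111, some on the vocabulary size, and some on the type of the language model 112, so they need to be held per task. Parameters that do not depend on the task, however, may be held in common by the speech recognition engine section 11.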
【００８２】
なお、上述の説明では、全てのデータをタスク毎に記憶するようにしたが、複数のタスクで共通に用いるデータは、タスク間で共有することでメモリ使用量を減らすことができる。例えば、音韻リスト１１４が全てのタスクで共通である場合、音韻リスト１１４を音声認識エンジン部１１で１つだけ用意し、各タスクはそれを参照するようにすればよい。この場合、カナ音韻変換規則１１５も１つだけ用意すれば十分である。
【００８３】
また、音響モデル１１１は、静かな環境用（静かな環境で高い認識率が出る音響モデル）と雑音環境用（騒がしい環境でもそれなりの認識率が出る音響モデル）との２種類を用意し、タスク毎にどちらかを参照するようにしてもよい。
【００８４】
例えば、名前登録用タスク７１₃と雑談用タスク７１₄は、静かな環境で使用することを想定しているので、静かな環境用の音響モデル１１１を参照し、音声コマンダ用タスク７１₅は、騒がしい環境（ロボットの動作音が大きい環境）で使うことを想定しているので、雑音環境用の音響モデルを参照するようにすることができる。
【００８５】
図７は、図６の音韻リスト１１４の例を示している。図７において、１つの記号は１つの音韻（に相当するもの）を表す。なお、図７の音韻リスト１１４において、母音＋コロン（例えば、“ａ：”）は、長音を表し、“Ｎ”は、撥音（「ん」）を表す。また、“sp”、“silB”、“silE”、“ｑ”は、全て無音を表すが、それぞれ「発話の中の無音」、「発話前の無音」、「発話後の無音」、「促音（「っ」）」を表す。
【００８６】
図８は、図６のカナ音韻変換規則１１５の例を示している。図８のカナ音韻変換規則１１５によれば、例えば、「エスディーアール」というカナ文字列は、“e/s/u/d/i:/a:/r/u”という音韻系列に変換される。
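A kana-to-phoneme conversion of this kind can be sketched as a longest-match table lookup. The rule table below is a hypothetical subset that covers only the example word; the actual contents of FIG. 8 are not reproduced in this text, and the handling of the long-vowel mark 「ー」 as "lengthen the previous phoneme" is an assumption.

```python
# Hypothetical subset of conversion rules (the real table is in FIG. 8).
KANA_RULES = {
    "ディ": ["d", "i"],
    "エ": ["e"],
    "ス": ["s", "u"],
    "ア": ["a"],
    "ル": ["r", "u"],
}
LONG_MARK = "ー"


def kana_to_phonemes(kana):
    """Convert a kana string to a phoneme list using longest-match rules."""
    phonemes = []
    i = 0
    while i < len(kana):
        if kana[i] == LONG_MARK:
            if not phonemes:
                raise ValueError("long-vowel mark without a preceding phoneme")
            phonemes[-1] += ":"  # lengthen the preceding vowel
            i += 1
            continue
        for size in (2, 1):      # try two-character rules before single characters
            chunk = kana[i:i + size]
            if chunk in KANA_RULES:
                phonemes.extend(KANA_RULES[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no rule for {kana[i]!r}")
    return phonemes


sequence = "/".join(kana_to_phonemes("エスディーアール"))
```

With this table, `sequence` is the phoneme sequence given in the text for 「エスディーアール」, namely `e/s/u/d/i:/a:/r/u`.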
【００８７】
次に、各タスクの言語モデル１１２と辞書１１３（図６）の例を示す。
【００８８】
図９は、音韻タイプライタ用タスク７１₁の言語モデル１１２（図６）の例を示している。図９において、第１行目の変数“$SYLLABLE”は、全てのカナ表記が「または」を意味する“|”で繋がれているので、そのカナ表記の内の任意の１つを意味する。
【００８９】
即ち、ここでは、音韻タイプライタ用タスク７１₁は、音節（シラブル）を単位とする音声認識用のタスクであるとして、図９の言語モデル１１２は、任意のシラブルが、任意に接続できるという連鎖規則を、BNF（Backus-Naur-Form）形式の文法で表している。なお、言語モデル１１２は、後述する統計言語モデルを用いてもよい。
【００９０】
図１０は、音韻タイプライタ用タスク７１₁の固定単語辞書１３１（図６）の例を示している。「シンボル」は単語を識別するための文字列であり、例えば、カナ表記などを用いることができる。シンボルが同じエントリは、同じ単語のエントリであるとみなされる。また、言語モデル１１２は、このシンボルを用いて表されている。なお、「<先頭>」と「<終端>」は特殊なシンボルであり、それぞれ「発話前の無音」と「発話後の無音」を表す（後述する図１１等においても同様）。
【００９１】
また、「トランスクリプション」は、単語の表記を表し、認識結果として出力される文字列はこのトランスクリプションである。「音韻系列」は、単語の発音を音韻系列で表したものである。
【００９２】
音韻タイプライタ用タスク７１₁の可変単語辞書１３２には、音韻タイプライタ用タスク７１₁に単語を追加することは想定していないので、何も記憶されない。また、音韻タイプライタ用タスク７１₁の言語モデル１１２は、図９に示すように、カテゴリを含まないので、カテゴリテーブル１３３にも何も記憶されない。
【００９３】
図１１は、アプリケーション切替用タスク７１₂の言語モデル１１２（図６）の例を示している。図１１の言語モデル１１２は、BNF形式の文法で記述されている。第１行目の変数“$APPLICATIONS”は、全てのアプリケーション名（「雑談」、「音声コマンダ」、「名前登録」等）が「または」を意味する“|”で繋がれているので、アプリケーション名の内のどれか１つを意味する。
【００９４】
また、第２行目の変数“$UTTERANCE”は、“＿ロボット名＿”と「を」のそれぞれに、「省略可能」を意味する“［］”が付加されているので、「（ロボット名）アプリケーション名（を）起動して」を意味する。ここで、「ロボット名」とは、“＿ロボット名＿”のカテゴリに登録された単語を示している。
【００９５】
例えば、“＿ロボット名＿”に「エスディーアール」が登録されていた場合、「エスディーアール、音声コマンダ（を）起動して」、「音声コマンダ（を）起動して」等の発話が、図１１の言語モデル１１２を用いて認識される。
【００９６】
このように言語モデル１１２を、カテゴリ名を用いて記述することによって、新たに登録された単語であっても、その単語が、言語モデル１１２に記述されているカテゴリに含まれるものである場合には、その新たに登録された単語を含む発話を、言語モデル１１２を用いて認識することができる。
【００９７】
図１２は、アプリケーション切替用タスク７１₂の固定単語辞書１３１（図６）の例を示している。図１２の固定単語辞書１３１には、図１１の言語モデル１１２の文法中に記述されるシンボル（図１１における「雑談」や「音声コマンダ」等）について、トランスクリプションと音韻系列が記述されている。
【００９８】
図１３は、アプリケーション切替用タスク７１₂のカテゴリテーブル１３３（図６）の例を示している。カテゴリテーブル１３３は、言語モデル１１２に使用されているカテゴリの種類と、カテゴリに属する単語の情報を記憶する。言語モデル１１２が図１１に示すような場合、アプリケーション切替用タスク７１₂の言語モデル１１２には、“＿ロボット名＿”のカテゴリが使用されているため、カテゴリテーブル１３３には、図１３に示すように、“＿ロボット名＿”がエントリされている。図１３においては“＿ロボット名＿”のカテゴリに属する単語の集合は、空集合であり、まだ“＿ロボット名＿”に属する単語は何もないことを表している。
【００９９】
図１３に示したように、カテゴリテーブル１３３に、カテゴリがエントリされている場合であっても、そのエントリに属する単語がない場合（空集合の場合）、可変単語辞書１３２には、そのカテゴリに属する単語の情報は記憶されない。
【０１００】
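FIG. 14 shows an example of the language model 112 (FIG. 6) of the name registration task 71₃. The language model 112 of FIG. 14 is written as a BNF grammar. In the variable “$UTTERANCE”, 「私［の名前］は＜OOV＞［です］［といいます］」 and 「君［の名前］は＜OOV＞［というん］だよ」 are joined by “|”, meaning "or", and each of 「の名前」, 「です」, 「といいます」, and 「というん」 is enclosed in “［］”, meaning "optional".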
【０１０１】
したがって、図１４の言語モデル１１２を用いて、「私（の名前）は＜OOV＞（です）（といいます）」または「君（の名前）は＜OOV＞（というん）だよ」が認識される。なお、＜OOV＞は、「Out Of Vocabulary」を意味するシンボルであり、任意の発音の語句（固定単語辞書１３１に記述されていない単語）を意味する。
【０１０２】
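By using the symbol ＜OOV＞, utterances such as 「私の名前は太郎です」 ("My name is Taro") and 「君の名前はエスディーアールだよ」 ("Your name is SDR"), in which 「太郎」 and 「エスディーアール」 are not described in the fixed word dictionary 131, are matched by 「＜先頭＞私の名前は＜OOV＞です＜終端＞」 and 「＜先頭＞君の名前は＜OOV＞だよ＜終端＞」 of the language model 112 of FIG. 14, respectively, yielding the speech recognition results 「私の名前はタロウです」 and 「君の名前はエスディーアールだよ」.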
【０１０３】
図１５は、名前登録用タスク７１₃の固定単語辞書１３１（図６）の例を示している。固定単語辞書１３１には、図１４に示されるような言語モデル１１２の文法中に記述されるシンボルについて、トランスクリプションと音韻系列が記述されている。
【０１０４】
名前登録用タスク７１₃の可変単語辞書１３２には、ここでは、名前登録用タスク７１₃に単語を追加することは想定していないので、何も記憶されない。また、名前登録用タスク７１₃の言語モデル１１２は、図１４に示すように、カテゴリを含まないので、カテゴリテーブル１３３にも何も記憶されない。
【０１０５】
図１６は、雑談用タスク７１₄の言語モデル１１２（図６）の例を示している。雑談は、語彙も発話のバリエーションも多いため、言語モデル１１２として、統計言語モデルが用いられている。統計言語モデルは、単語の連鎖情報を条件付確率で記述したモデルであり、図１６の言語モデル１１２では、３つの単語１，２，３の並び、すなわち単語の３連鎖の確率を表すtri-gramが用いられている。
【０１０６】
図１６において、「Ｐ（単語３|単語１単語２）」は、単語列中に「単語１」、「単語２」という並びがあった場合に、その次に「単語３」が出現する確率を表す。例えば、「＜先頭＞“＿ロボット名＿”」という並びがあった場合に、その次に「は」が出現する確率は、「0.012」である。なお、この確率は、大量の雑談を記述したテキストを解析することにより、予め求められる。また、言語モデル１１２としては、tri-gramの他に、bi-gram（２連鎖の確率）やuni-gram（単語の出現確率）等も、必要に応じて用いることが可能である。
【０１０７】
図１６の言語モデル１１２においても、図１１における場合と同様に、単語の他、カテゴリを用いて文法が記述されている。即ち、図１６において、「＿ロボット名＿」、「＿地名＿」は、カテゴリ“＿ロボット名＿”、“＿地名＿”を意味するが、これらのカテゴリを用いてtri-gramを記述することによって、ロボット名や地名を表す単語が可変単語辞書１３２に登録された場合に、その単語を雑談用タスク７１₄で認識することができる。
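A tri-gram table of this kind can be used to score a symbol sequence by multiplying (or, in log space, summing) the conditional probabilities. The sketch below assumes the sequence has already been mapped to category symbols where applicable. The value 0.012 is the one quoted in the text; the `floor` used for unseen tri-grams and the function name are assumptions.

```python
import math


def sentence_log_prob(trigrams, symbols, floor=1e-6):
    """Sum of log P(w3 | w1 w2) over every consecutive triple in `symbols`."""
    total = 0.0
    for w1, w2, w3 in zip(symbols, symbols[1:], symbols[2:]):
        # `floor` is a hypothetical back-off value for tri-grams not in the table.
        total += math.log(trigrams.get((w1, w2, w3), floor))
    return total


trigrams = {("＜先頭＞", "＿ロボット名＿", "は"): 0.012}
score = sentence_log_prob(trigrams, ["＜先頭＞", "＿ロボット名＿", "は"])
```

Sequences shorter than three symbols contain no triple and therefore score 0.0 in this sketch; a fuller model would back off to bi-grams or uni-grams, which the text mentions as options.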
【０１０８】
図１７は、雑談用タスク７１₄の固定単語辞書１３１の例を示している。固定単語辞書１３１には、図１６に示されるような言語モデル１１２の文法中に記述されるシンボルについて、トランスクリプションと音韻系列が記述されている。
【０１０９】
図１８は、雑談用タスク７１₄のカテゴリテーブル１３３の例を示している。カテゴリテーブル１３３は、言語モデル１１２に使用されているカテゴリの種類と、そのカテゴリに属する単語の情報を記憶する。言語モデル１１２が図１６に示すような場合、雑談用タスク７１₄の言語モデル１１２には、“＿ロボット名＿”と“＿地名＿”の２個のカテゴリが使用されているため、カテゴリテーブル１３３には、図１８に示すように、“＿ロボット名＿”と“＿地名＿”の２つのカテゴリがエントリされている。図１８では、カテゴリ“＿ロボット名＿”と“＿地名＿”に属する単語は、まだ何もないことを表している。
【０１１０】
図１９は、音声コマンダ用タスク７１₅の言語モデル１１２（図６）の例を示している。図１９の言語モデル１１２は、BNF形式の文法で記述されている。第１行目の変数“$NUMBER”は、数字（「１」、「２」、「３」等）が「または」を意味する“|”で繋がっているので、数字の内のどれか１つを意味する。
【０１１１】
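The variable “$DIRECTION” on the second line joins directions (「前」, 「後」, 「右」, 「左」, and so on) with “|”, meaning "or", so it denotes any one of the directions. The variable “$UTTERANCE” on the third line consists of “＿ロボット名＿”, 「$DIRECTION に」, and 「$NUMBER 歩」 followed by 「進め」, where each of “＿ロボット名＿”, 「$DIRECTION に」, and 「$NUMBER 歩」 is enclosed in “［］”, meaning "optional".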
【０１１２】
したがって、図１９の言語モデル１１２において、例えば、「（ロボット名）前に３歩進め」といった音声が認識される。
【０１１３】
図２０は、音声コマンダ用タスク７１₅の固定単語辞書１３１の例を示している。固定単語辞書１３１には、図１９に示されるような言語モデル１１２の文法中に記述するシンボルについて、トランスクリプションと音韻系列が記述されている。
【０１１４】
なお、「１」と「歩」については、シンボルが重複しているが、これは「１」と「歩」が、それぞれ２つの発音（「イチ」と「イッ」、「ホ」と「ポ」）を持つことを表している。これによって、例えば、「イチホ」、「イッポ」という異なる発音がされた発話を、同じ「１歩」として認識することができる。
【０１１５】
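When the language model 112 is as shown in FIG. 19, the language model 112 of the voice commander task 71₅ uses only the “＿ロボット名＿” category, so the category table 133 of the voice commander task 71₅ is the same as the category table 133 of the application switching task 71₂ shown in FIG. 13. While no word belonging to “＿ロボット名＿” has been registered yet, nothing is stored in the variable word dictionary 132 of the voice commander task 71₅.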
【０１１６】
次に、名前登録用アプリケーション部２１₁が、図５のステップＳ９で行う名前登録処理を、図２１のフローチャートを用いて、詳細に説明する。なお、この処理は、ユーザの発話によって名前登録用アプリケーション部２１₁が起動されたときに開始される。この処理が開始される前に、ユーザは、例えば、不図示のモード切替ボタンによって、名前を登録する名前登録モードとして、音声により名前を入力する音声入力モード、またはキーボード等によるカナ入力により名前を入力するカナ入力モードのうちのいずれか一方を選択しておく。
【０１１７】
ステップＳ４１において、名前登録用アプリケーション部２１₁は、音声認識エンジン部１１の名前登録用タスク７１₃を有効にし、この名前登録用タスク７１₃で音声を認識できるようにする。
【０１１８】
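After step S41, the process proceeds to step S42, where the name registration application section 21₁ determines whether the name registration mode is the voice input mode. If it is, the process proceeds to step S43, where the matching section 54 performs the name recognition process, and then to step S44. The name recognition process is described in detail later with reference to FIG. 22. (Alternatively, in step S42, if the user speaks, it may be determined that the name was input by voice and the process proceeds to step S43, and if a kana input button (not shown) is pressed, it may be determined that the name was input in kana characters and the process proceeds to step S46.)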
【０１１９】
ステップＳ４４において、名前登録用アプリケーション部２１₁は、マッチング部５４でステップＳ４３の名前認識処理が行われることにより得られる名前の音声認識結果（認識された名前）が正しいか否かを判定する。この判定は、例えば、認識結果をユーザに向かって発話し、ユーザから不図示のＯＫボタンが操作されたか否かによって行われる。
【０１２０】
ステップＳ４４において、名前の音声認識結果が正しくないと判定された場合、ユーザに再度発話するよう促し、ステップＳ４３に戻り、再び名前認識処理を行う。ステップＳ４４において、認識結果が正しいと判定された場合、ステップＳ４７に進む。
【０１２１】
一方、ステップＳ４２において、名前登録アプリケーション部２１₁は、名前登録モードが音声入力モードではないと判定した場合、ステップＳ４５に進み、名前登録モードがカナ入力モードであるか否かを判定する。
【０１２２】
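If it is determined in step S45 that the name registration mode is not the kana input mode, the user has not yet selected a name registration mode, so the process waits until the user selects one and then returns to step S42.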
【０１２３】
ステップＳ４５において、名前登録モードがカナ入力モードであると判定された場合、ステップＳ４６に進み、名前登録用アプリケーション部２１₁は、ユーザによって入力された名前のカナ列と、その名前のカテゴリを取得する。
【０１２４】
カナ列を入力する方法としては、例えば、ユーザが一時的にキーボードを接続してカナ文字を入力する方法、ロボットの各種スイッチを使用して入力する方法、文字を書いた紙等をロボットに見せて文字認識する方法（例えば、特願2001-135423参照）、無線LAN（Local Area Network）等でロボットとパーソナルコンピュータを接続し、そのパーソナルコンピュータからロボットに転送する方法、インターネット等を経由して、ロボットにダウンロードする方法等がある。また、文字を書いた紙等をロボットに見せて文字認識する方法において、カナ文字を入力するのではなく、カナ漢字交じりの文字列を入力し、名前登録用アプリケーション部２１₁が、カナ列に変換してもよい（特願2001-135423参照）。
【０１２５】
さらに、ユーザが名前のカナ列を入力するのではなく、予め共通辞書部５５の共通辞書に、名前のカナ文字を付加したエントリを与えておき、名前登録用アプリケーション部２１₁は、共通辞書部５５を参照することによって、名前のカナ列を取得してもよい。
【０１２６】
ステップＳ４４またはＳ４６の処理後は、ステップＳ４７に進み、名前登録用アプリケーション部２１₁は、登録する名前のカテゴリを決定する。名前登録モードがカナ入力モードである場合、名前登録用アプリケーション部２１₁は、ステップＳ４６で取得した（ユーザによって入力された）カテゴリを、登録する名前のカテゴリに決定する。
【０１２７】
即ち、カナ入力モードにおいては、ステップＳ４６において、ユーザに、名前の他、その名前のカテゴリも入力してもらい、ユーザが入力した名前のカテゴリを、登録する名前のカテゴリに決定する。一方、名前登録モードが音声入力モードである場合、名前登録用アプリケーション部２１₁は、ステップＳ４３の名前認識処理で得られた名前のカテゴリを推測して決定する。
【０１２８】
例えば、音声認識エンジン部１１から供給された認識結果が「君」で始まる場合は、登録する名前の属するカテゴリは、“＿ロボット名＿”であると推測し、「私」で始まる場合は、登録する名前の属するカテゴリは、“＿ユーザ名＿”であると推測する。また、本出願人が先に提案した特願2001-382579に開示されている、各種のカテゴリ推定方法も用いることができる。
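The simple prefix rule for the voice input mode can be sketched as below. The function name and the stripping of a leading ＜先頭＞ marker are assumptions made for illustration; the other estimation methods referred to in the text are not covered.

```python
def guess_category(recognized_text):
    """Guess the category of the name from how the recognized sentence begins."""
    text = recognized_text.removeprefix("＜先頭＞")  # tolerate a start marker (assumption)
    if text.startswith("君"):
        return "＿ロボット名＿"  # 「君...」: the speaker is naming the robot
    if text.startswith("私"):
        return "＿ユーザ名＿"    # 「私...」: the speaker is giving their own name
    return None                  # no rule applies; another method is needed


robot = guess_category("君の名前はエスディーアールだよ")
user = guess_category("＜先頭＞私の名前はタロウです")
```

`str.removeprefix` requires Python 3.9 or later.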
【０１２９】
ステップＳ４７の処理後は、ステップＳ４８に進み、名前登録用アプリケーション部２１₁は、マッチング部５４を制御して、登録する名前の発音情報とカテゴリを共通辞書部５５の共通辞書にエントリし、ステップＳ４９に進む。ステップＳ４９において、名前登録用アプリケーション部２１₁は、マッチング部５４を制御して、共通辞書の内容を、雑談用タスク７１₄、音声コマンダ用タスク７１₅、・・・、その他のタスク７１_Nに反映させる。この反映の詳細は、図２５のフローチャートを参照して後述する。
【０１３０】
このように、共通辞書に登録した名前を他のタスクに反映させることにより、他のタスクでも、この登録した名前を認識することができる。
【０１３１】
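After step S49, the process proceeds to step S50, where the name registration application section 21₁ determines whether to end the name registration process. This determination is made, for example, by speaking a question to the user asking whether to finish and checking whether the user has operated (pressed) an OK button (not shown). If it is determined in step S50 that the name registration process is not to be ended (for example, the OK button has not been pressed), the process returns to step S42 to register another name.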
【０１３２】
また、ステップＳ５０において、名前登録処理を終了する（例えば、ＯＫボタンが押された）と判定された場合、ステップＳ５１に進み、名前登録用アプリケーション部２１₁は、名前登録用タスク７１₃を無効にし、ステップＳ５２に進む。ステップＳ５２において、名前登録用アプリケーション部２１₁は処理を終了する。
【０１３３】
図２２は、図２１のステップＳ４３で、図２のマッチング部５４が行う名前認識処理を説明するフローチャートである。
【０１３４】
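In step S61, the matching section 54 determines whether speech has been input to the microphone 51 and, if not, waits until it is. When it is determined in step S61 that speech has been input, the process proceeds to step S62. The speech input here may be ordinary conversation such as 「私の名前は太郎です」 or 「君の名前はエスディーアールだよ」; the user does not need to be conscious of name registration and input only the name 「太郎」 or 「エスディーアール」 by itself.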
【０１３５】
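In step S62, the matching section 54 recognizes the speech and extracts the name. For example, when 「君の名前はエスディーアールだよ」 is uttered, the matching section 54 refers to the name registration task 71₃, which has a language model 112 as shown in FIG. 14 and a fixed word dictionary 131 as shown in FIG. 15, and generates a recognition result such as 「＜先頭＞君の名前は＜OOV＞だよ＜終端＞」. The matching section 54 also obtains information on which interval of the utterance (from how many seconds to how many seconds after the start of the utterance) corresponds to ＜OOV＞.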
【０１３６】
さらに、マッチング部５４は、同じ発話に対して、図９に示すような言語モデル１１２と図１０に示すような固定単語辞書１３１を有する音韻タイプライタ用タスク７１₁を参照して、例えば、“k/i/m/i/n/o/n/a/m/a/e/w/a/e/s/u/d/i:/a:/r/u/d/a/y/o”という音韻系列と、「キミノナマエワエスディーアールダヨ」というカナ列を得る。
【０１３７】
そして、マッチング部５４は、＜OOV＞が発話のどの区間であるかという情報に基づき、得られた音韻系列およびカナ列から、＜OOV＞に相当する区間、すなわち、名前の区間の音韻系列とカナ列とを切り出し、“e/s/u/d/i:/a:/r/u”という音韻系列と「エスディーアール」というカナ列とを得る。また、マッチング部５４は、同区間の音声データも得る。この名前を抽出する処理の詳細は、本出願人が先に提案した特願2001-382579号に開示されている。
【０１３８】
ステップＳ６２の処理後は、ステップＳ６３に進み、マッチング部５４は、ステップＳ６２の処理で抽出した名前の音韻系列、カナ列、および音声データを、未知語獲得部５６に供給し、クラスタリングを行う。クラスタリングの詳細は、本出願人が先に提案した特願2001-097843号に開示されている。このクラスタリングの結果、未知語獲得部５６の各クラスタは、代表の音韻系列とカナ列とを有する。
【０１３９】
ステップＳ６３の処理後は、ステップＳ６４に進み、ステップＳ６２で認識された音声の認識結果（例えば、「キミノナマエワエスディーアールダヨ」というカナ列）を、名前登録用アプリケーション部２１₁に供給する。
【０１４０】
図２３は、図２２のステップＳ６３の処理で、未知語獲得部５６においてクラスタリングされた、特徴空間の例を示している。なお、図２３においては、図が煩雑になるのを避けるため、２つの特徴量（特徴パラメータ）１と２で定義される特徴空間を示してある（上述の図３においても同様）。図２３では、特徴空間において、「あらら」、「さにー」、「とーきょー」、「たろう」という４個の名前がクラスタリングされている。
【０１４１】
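That is, in FIG. 23, four clusters are formed in the feature space: the 「あらら」 cluster 151, the 「さにー」 cluster 152, the 「とーきょー」 cluster 153, and the 「たろう」 cluster 154. Each cluster is given a representative phoneme sequence (“a/r/a/r/a”, “s/a/n/i:”, “t/o:/ky/o:”, and “t/a/r/o/u” in the example of FIG. 23), a representative kana notation (「アララ」, 「サニー」, 「トーキョー」, and 「タロウ」 in the example of FIG. 23), and an ID (「１」, 「２」, 「３」, and 「５」 in the example of FIG. 23).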
【０１４２】
図２４は、図２１のステップＳ４８で単語の情報がエントリされた共通辞書部５５の共通辞書の例を示している。図２４において、第１行目のエントリは、発音がカナ列で入力され、その発音が「エスディーアール」という文字列であり、カテゴリが“＿ロボット名＿”と入力されたことを表している。
【０１４３】
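The entry on the second line indicates that the pronunciation was input by voice and that its kana notation and phoneme sequence are the representative kana notation (「タロウ」 in the example of FIG. 23) and phoneme sequence (“t/a/r/o:” in the example of FIG. 23) attached to the cluster with ID 「５」 in the unknown word acquisition section 56. The category of the second-line entry was determined by the name registration application section 21₁ in step S47 of FIG. 21 and is “＿ユーザ名＿”. For example, when the user utters 「私の名前は太郎です」, an entry like the one on the second line is created in the common dictionary section 55.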
【０１４４】
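Similarly, the entries on the third and fourth lines indicate that the pronunciations were input as kana strings, 「サニータロウ」 and 「キタシナガワ」 respectively, with the categories “＿ユーザ名＿” and “＿地名＿” (place name) respectively. The entry on the fifth line indicates that the pronunciation was input by voice and that its kana notation and phoneme sequence are the representative kana notation (「トーキョー」 in the example of FIG. 23) and phoneme sequence (“t/o:/ky/o:” in the example of FIG. 23) attached to the cluster with ID 「３」 in the unknown word acquisition section 56. The category of the fifth-line entry was determined by the name registration application section 21₁ to be “＿地名＿”.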
【０１４５】
なお、共通辞書においては、発音がカナ列で入力された単語については、その単語の発音を表すカナ列とカテゴリとの組が１つのエントリに登録され、発音が音声入力された単語については、その単語のクラスタを表すＩＤとカテゴリとの組が１つのエントリに登録される。
【０１４６】
図２５は、図２１のステップＳ４９の処理で、マッチング部５４が共通辞書部５５の内容をタスクに反映させる処理を説明するフローチャートである。なお、この処理は、有効にされているタスク毎に行なわれる。
【０１４７】
ステップＳ８１において、マッチング部５４は、タスク７１（図６）における可変単語辞書１３２とカテゴリテーブル１３３を初期化する。即ち、可変単語辞書１３２は、エントリが１つもない状態にされ、カテゴリテーブル１３３は、各カテゴリに単語が何も属していない状態にされる。
【０１４８】
ステップＳ８１の処理後は、ステップＳ８２に進み、マッチング部５４は、共通辞書部５５の内容を可変単語辞書１３２とカテゴリテーブル１３３に反映させる。
【０１４９】
即ち、マッチング部５４は、共通辞書部５５の共通辞書の中から、カテゴリテーブル１３３にエントリされているカテゴリと共通する（同一の）カテゴリを選択し、そのカテゴリと、そのカテゴリに対応するクラスタＩＤまたはカナ発音（カナ列）を取得する。さらに、マッチング部５４は、共通辞書からクラスタＩＤを取得した場合、未知語獲得部５６からクラスタＩＤに対応するカナ列を取得する。
【０１５０】
マッチング部５４は、以上のようにして、共通辞書部５５の共通辞書から選択したカテゴリに属する単語のカナ列を取得すると、そのカナ列を、可変単語辞書１３２にエントリする。また、マッチング部５４は、共通辞書から取得したカナ列で表される単語の情報を、カテゴリテーブル１３３の対応するカテゴリにエントリする。
【０１５１】
上述の処理によれば、各タスクにおいて、可変単語辞書１３２は、初期化されてから、共通辞書の内容が反映される。即ち、可変単語辞書１３２は、共通辞書の内容に基づいて、構築または再構築される。このため、辞書中の特定のエントリに対して削除や変更を行う方法に比べて、容易に、各タスクで整合を保つことができる。
【０１５２】
また、上述の処理によれば、音声で登録した単語については、各タスクに反映させるたびに、そのときの最新の発音情報を未知語獲得部５６から取得するので、可変単語辞書１３２に登録した後も、未知語獲得部５６に音声データを供給するだけで発音情報が更新され、マッチング部５４は、そのときの最新の発音情報を参照して、音声を認識することができる。
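The reflection process (initialise in step S81, then rebuild in step S82) can be sketched as follows, using the common dictionary contents described for FIG. 24 and the two categories of the chat task. The data layout, the function name `reflect`, and the `PHONEMES` lookup that stands in for the kana-phoneme conversion rules 115 are assumptions made for illustration.

```python
def reflect(common_dictionary, clusters, task_categories, kana_to_phonemes):
    """Rebuild a task's variable word dictionary and category table from scratch."""
    variable_dictionary = []                           # step S81: no entries
    category_table = {c: [] for c in task_categories}  # step S81: empty categories
    for entry in common_dictionary:                    # step S82: reflect
        category = entry["category"]
        if category not in category_table:
            continue                                   # category unused by this task
        if "cluster_id" in entry:
            # Voice-registered word: take the latest representative values.
            kana, phonemes = clusters[entry["cluster_id"]]
        else:
            kana = entry["kana"]
            phonemes = kana_to_phonemes(kana)
        symbol = f"OOV{len(variable_dictionary) + 1:05d}"
        variable_dictionary.append(
            {"symbol": symbol, "transcription": kana, "phonemes": phonemes}
        )
        category_table[category].append(symbol)
    return variable_dictionary, category_table


PHONEMES = {  # hypothetical stand-in for the kana-phoneme conversion rules 115
    "エスディーアール": "e/s/u/d/i:/a:/r/u",
    "キタシナガワ": "k/i/t/a/sh/i/n/a/g/a/w/a",
}
common_dictionary = [
    {"kana": "エスディーアール", "category": "＿ロボット名＿"},
    {"cluster_id": 5, "category": "＿ユーザ名＿"},
    {"kana": "サニータロウ", "category": "＿ユーザ名＿"},
    {"kana": "キタシナガワ", "category": "＿地名＿"},
    {"cluster_id": 3, "category": "＿地名＿"},
]
clusters = {3: ("トーキョー", "t/o:/ky/o:")}  # only the cluster this task needs
variable_dictionary, category_table = reflect(
    common_dictionary, clusters, ["＿ロボット名＿", "＿地名＿"], PHONEMES.__getitem__
)
```

Rebuilding from an empty state means a deletion or change in the common dictionary never has to be located inside each task's dictionary, which is the consistency advantage the text points out.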
【０１５３】
図２６は、図２５の反映処理を説明するブロック図である。共通辞書部５５の共通辞書に、カテゴリに対応してカナ列が記述されている場合、そのカナ列が可変単語辞書１３２に登録され、カテゴリテーブル１３３の、共通辞書のカテゴリと同一のカテゴリに、共通辞書のカナ列で表される単語の情報が登録される。
【０１５４】
一方、共通辞書に、カテゴリに対応してクラスタＩＤが記述されている場合、未知語獲得部５６が参照され、そのクラスタＩＤに対応する代表カナ列と代表音韻系列が可変単語辞書１３２に登録されて、カテゴリテーブル１３３の、共通辞書のカテゴリと同一のカテゴリに、共通辞書のクラスタＩＤで表される単語の情報が登録される。なお、後述する音声認識処理では、固定単語辞書１３１と可変単語辞書１３２の両方が使用される。
【０１５５】
図２７は、図２４に示す共通辞書部５５の内容が反映された、アプリケーション切替用タスク７１₂の可変単語辞書１３２の例である。アプリケーション切替用タスク７１₂のカテゴリテーブル１３３が図１３に示すような場合、図２４の共通辞書と共通しているカテゴリは、“＿ロボット名＿”であるので、マッチング部５４は、図２４の共通辞書から“＿ロボット名＿”に対応する「エスディーアール」というカナ発音を取得する。
【０１５６】
そして、マッチング部５４は、図２７に示すように、可変単語辞書１３２のトランスクリプションに、図２４の共通辞書から取得したカナ発音「エスディーアール」をエントリする。さらに、マッチング部５４は、トランスクリプション「エスディーアール」に対応する音韻系列に、カナ音韻変換規則１１５（図８）に基づいて、カナ発音「エスディーアール」に対応する“e/s/u/d/i:/a:/r/u”を記述する。
【０１５７】
また、マッチング部５４は、トランスクリプション「エスディーアール」で表される単語のシンボルとして、「OOV00001」を登録する。ここでは、シンボルを、「“OOV”+通し番号」を意味する「OOV00001」としたが、シンボルは、その単語を一意に識別できる文字列であればよい。即ち、シンボルとしては、例えば、カテゴリ名を先頭に付加して、「＿ロボット名＿：：OOV00001」などを用いることも可能である。
【０１５８】
図２８は、図２４の共通辞書の内容が反映されたアプリケーション切替用タスク７１₂のカテゴリテーブル１３３の例を示している。図２７に示したように、可変単語辞書１３２に、図２４の共通辞書の内容が反映された場合、カテゴリテーブル１３３の内容は、図１３に示した、“＿ロボット名＿”のカテゴリに単語が登録されていない状態から、図２７の可変単語辞書１３２に登録されたカテゴリ“＿ロボット名＿”に属する単語のシンボル「OOV00001」がエントリされた状態となる。
【０１５９】
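Next, the process by which the matching section 54 deletes or changes a word registered in the common dictionary of the common dictionary section 55 in step S48 of FIG. 21 is described with reference to the flowchart of FIG. 29. This process is started, for example, when instructed by the name registration application section 21₁, or when registered words that are no longer needed must be deleted because of memory constraints.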
【０１６０】
また、共通辞書の単語を削除または変更する処理は、その他、例えば、未知語獲得部５６においてクラスタが削除され、あるいはクラスタが分割、併合されることによって、クラスタに付されるＩＤが変更され、未知語獲得部５６のクラスタに付されているＩＤと共通辞書に記述されているＩＤ（図２４で説明したクラスタＩＤ）との整合をとる必要がある場合に、共通辞書に記述されたＩＤを書き替えるために行われる。
【０１６１】
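Furthermore, the process of deleting or changing the common dictionary is performed when a category will no longer be used by any of the tasks whose language model 112 describes it; in that case the information for that category is deleted from the common dictionary in order to slim it down.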
【０１６２】
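Note that when the representative phoneme sequence and kana string of a cluster are changed in the unknown word acquisition section 56, the change is picked up by the reflection process of FIG. 25, which reads the latest pronunciation information from the unknown word acquisition section 56 each time it runs, so there is no need to perform the process of deleting or changing a word (hereinafter referred to as the change/deletion process where appropriate).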
【０１６３】
ステップＳ１０１において、マッチング部５４は、変更削除処理の対象となる単語を共通辞書の中から決定し、ステップＳ１０２に進む。対象となる単語は、ユーザが不図示のボタンによって決定してもよいし、マッチング部５４が推定して決定してもよい。
【０１６４】
ステップＳ１０２において、マッチング部５４は、変更削除処理の対象となる単語を削除するか否かを判定し、削除すると判定した場合、ステップＳ１０３に進む。ステップＳ１０３において、マッチング部５４は、変更削除処理の対象となる単語のエントリを、共通辞書から削除する。削除とは、カテゴリと発音情報とで特定されるエントリを削除すること、特定のカテゴリのエントリをまとめて削除すること、または特定の発音情報（カナ列またはクラスタＩＤ）を有するエントリをまとめて削除することを意味する。
【０１６５】
一方、ステップＳ１０２において、マッチング部５４は、変更削除処理の対象となる単語を削除しないと判定した場合、ステップＳ１０４に進み、単語を変更するか否かを判定し、単語を変更しないと判定した場合、ステップＳ１０２に戻り、変更または削除のどちらかに判定されるまで待機する。
【０１６６】
また、ステップＳ１０４において、マッチング部５４は、変更削除処理の対象とする単語を変更すると判定した場合、ステップＳ１０５に進み、共通辞書において、変更削除処理の対象となる単語のエントリを変更する。
【０１６７】
例えば、マッチング部５４は、未知語獲得部５６のクラスタに分割または併合が発生してクラスタのＩＤ番号に変化が生じた場合、未知語獲得部５６と整合をとるように、共通辞書のクラスタＩＤを変更する。また、例えば、ユーザが登録時に入力したカナ列を後で修正したくなった場合、マッチング部５４は、名前登録用アプリケーション部２１₁の指令により、共通辞書の対象となる単語（図２１のステップＳ４８で共通辞書にエントリされた単語）のカナ発音を、ユーザが、共通辞書の対象となる単語を決定した後入力したカナ列に変更する。
【０１６８】
ステップＳ１０３の処理、またはステップＳ１０５の処理の後は、ステップＳ１０６に進み、マッチング部５４は、図２５の反映処理を行ない、共通辞書の内容を各タスクに反映させる。
【０１６９】
このように、共通辞書の単語を削除または変更した場合、その変更後の内容を各タスクに反映させるので、各アプリケーション部での登録単語の整合性を保つことができる。
【０１７０】
図３０と図３１は、図２９のステップＳ１０５の処理で、マッチング部５４が共通辞書の単語のエントリを変更する例を示している。例えば、未知語獲得部５６のＩＤが「５」のクラスタが、ＩＤが「８」のクラスタとＩＤが「９」のクラスタに分割された場合、マッチング部５４は、共通辞書を図３０Ａに示すような状態から図３０Ｂに示すような状態に変更する。
【０１７１】
即ち、マッチング部５４は、共通辞書部５５のクラスタＩＤが「５」のエントリ（図３０Ａの第１行目のエントリ）を削除し、その削除したエントリに登録されていた“＿ユーザ名＿”のカテゴリの２つのエントリを登録する。さらに、マッチング部５４は、新たな２つのエントリに、クラスタＩＤ番号「８」と「９」をそれぞれ記述する（図３０Ｂの第１行目と第２行目のエントリ）。
【０１７２】
また、例えば、未知語獲得部５６のクラスタＩＤが「５」のクラスタとＩＤが「３」のクラスタが併合されて、ＩＤが「１０」のクラスタが新たに生成された場合、マッチング部５４は、共通辞書を図３１Ａに示すような状態から図３１Ｂに示すような状態に変更する。
【０１７３】
即ち、マッチング部５４は、共通辞書のクラスタＩＤが「５」と「３」のエントリ（図３１Ａの全てのエントリ）のクラスタＩＤを「１０」に変更し、その結果、重複する“＿ユーザ名＿”というカテゴリとそれに対応するクラスタＩＤ番号「１０」の２つのエントリを１つにする（例えば、一方を削除する）（図３１Ｂ）。
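The two ID-rewriting cases (split and merge) can be sketched as plain list transformations over the common dictionary. The entry layout and the function names are assumptions made for illustration; the cluster IDs follow the examples of FIG. 30 and FIG. 31 as described in the text.

```python
def split_cluster(entries, old_id, new_ids):
    """Replace each entry that refers to old_id with one entry per new ID."""
    result = []
    for e in entries:
        if e.get("cluster_id") == old_id:
            result.extend({**e, "cluster_id": n} for n in new_ids)
        else:
            result.append(e)
    return result


def merge_clusters(entries, old_ids, new_id):
    """Point entries for old_ids at new_id and drop duplicates that result."""
    result, seen = [], set()
    for e in entries:
        if e.get("cluster_id") in old_ids:
            e = {**e, "cluster_id": new_id}
        key = (e["category"], e.get("cluster_id"), e.get("kana"))
        if key not in seen:  # keep only one of any identical entries
            seen.add(key)
            result.append(e)
    return result


after_split = split_cluster([{"cluster_id": 5, "category": "＿ユーザ名＿"}], 5, [8, 9])
after_merge = merge_clusters(
    [
        {"cluster_id": 5, "category": "＿ユーザ名＿"},
        {"cluster_id": 3, "category": "＿ユーザ名＿"},
    ],
    {5, 3},
    10,
)
```

After either change, running the reflection process again for each task brings the variable word dictionaries back in line with the common dictionary.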
【０１７４】
次に、図５のステップＳ１２の雑談処理を、図３２のフローチャートを参照して詳細に説明する。
【０１７５】
ステップＳ１２１において、雑談用アプリケーション部２１₂は、雑談用タスク７１₄を有効にし、ステップＳ１２２に進む。ステップＳ１２２において、雑談用アプリケーション部２１₂は、マッチング部５４を制御して、図２５に示すような反映処理を行ない、共通辞書部５５の共通辞書の内容を、雑談用タスク７１₄（可変単語辞書１３２とカテゴリテーブル１３３）に反映させる。したがって、雑談用タスク７１₄は、無効である間に共通辞書に登録、変更、および削除された単語を獲得することができる。
【０１７６】
ステップＳ１２２の処理後は、ステップＳ１２３に進み、雑談用アプリケーション部２１₂は、音声認識エンジン部１１を制御して音声認識処理を行い、ステップＳ１２４に進む。この音声認識処理の詳細は、図３５で後述する。
【０１７７】
ステップＳ１２４において、雑談用アプリケーション部２１₂は、音声認識エンジン部１１から認識結果を取得し、その認識結果に対する応答を生成する。即ち、ロボットは、ユーザからの発話に対して応答する。例えば、ユーザからの発話が「エスディーアール（ロボット名）は、何時に起きたの？」である場合、雑談用アプリケーション部２１₂は、ロボットが起きた（起動された）時間（例えば、「７時」）の応答を生成し、ロボットに発話させる。
【０１７８】
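After step S124, the process proceeds to step S125, where the chat application section 21₂ determines whether to end processing. This determination is made, for example, by having the robot say 「終了する？」 ("Shall we finish?") to the user and checking whether the user has operated (pressed) an OK button (not shown).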
【０１７９】
ステップＳ１２５において、処理を終了しないと判定された場合、処理はステップＳ１２３に戻り、以下同様の処理を繰り返す。即ち、ロボットはユーザとの雑談を続行する。
【０１８０】
ステップＳ１２５において、処理を終了すると判定された場合、処理はステップＳ１２６に進み、雑談用アプリケーション部２１₂は、雑談用タスク７１₄を無効にし、ステップＳ１２７に進む。ステップＳ１２７において、雑談用アプリケーション部２１₂は処理を終了する。
【０１８１】
上述の処理では、ユーザが１回発話する毎に雑談用アプリケーション部２１₂が応答を生成したが、ロボットが自発的に発話することで、ユーザの発話を促してもよい。
【０１８２】
また、図３２の処理では、雑談用アプリケーション部２１₂の雑談処理について説明したが、音声コマンダ用アプリケーション部２１₃の音声コマンダ処理、・・・・、その他のアプリケーション部２１_Mの処理も同様に行われる。但し、ステップＳ１２４では、アプリケーション部２１に応じて、音声認識エンジン部１１による音声認識結果に基づく処理が行なわれる。
【０１８３】
図３３は、図３２のステップＳ１２２の処理で、図２４に示す共通辞書部５５の共通辞書の内容が、雑談用タスク７１₄の可変単語辞書１３２に反映された状態を示している。
【０１８４】
雑談用タスク７１₄のカテゴリテーブル１３３が図１８に示すような場合、図２４の共通辞書と共通しているカテゴリは、“＿ロボット名＿”と“＿地名＿”であるので、マッチング部５４は、“＿ロボット名＿”に対応する共通辞書エントリとして図２４の１番目のエントリ、“＿地名＿”に対応するエントリとして図２４の４番目と５番目のエントリを取得する。さらに、１番目のエントリからはカナ発音「エスディーアール」を、４番目のエントリからはカナ発音「キタシナガワ」を、５番目のエントリからはクラスタＩＤ番号「３」をそれぞれ取得する。
【０１８５】
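The matching section 54 then enters 「エスディーアール」 and 「キタシナガワ」 as transcriptions in the variable word dictionary 132, as shown in FIG. 33. Based on the kana-phoneme conversion rules 115 (FIG. 8), it also writes “e/s/u/d/i:/a:/r/u” as the phoneme sequence for the transcription 「エスディーアール」 and “k/i/t/a/sh/i/n/a/g/a/w/a” as the phoneme sequence for the transcription 「キタシナガワ」.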
【０１８６】
また、マッチング部５４は、未知語獲得部５６からクラスタＩＤが「３」のクラスタを抽出し、その代表的な音韻系列と、カナ列を取得する。例えば、未知語獲得部５６が図２３に示すような状態の場合、マッチング部５４は、クラスタＩＤが「３」のクラスタ１５３から、“t/o:/ky/o:”という音韻系列と「トーキョー」というカナ列を取得する。そして、マッチング部は、図３３に示すように、取得した音韻系列“t/o:/ky/o:”とカナ列「トーキョー」を、可変単語辞書１３２の音韻系列とトランスクリプションにそれぞれエントリする。
【０１８７】
さらに、マッチング部５４は、トランスクリプション「エスディーアール」で表される単語のシンボルとして、「OOV00001」を、トランスクリプション「キタシナガワ」で表される単語のシンボルとして、「OOV00002」を、トランスクリプション「トーキョー」で表される単語のシンボルとして、「OOV00003」を登録する。
【０１８８】
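Here it is assumed that the kana-phoneme conversion rules 115 of the phoneme typewriter task 71₁ and of the chat task 71₄ are the same, so the representative phoneme sequence of a cluster obtained with the phoneme typewriter task 71₁ is registered as is in the variable word dictionary 132 of the chat task 71₄. If the kana-phoneme conversion rules 115 of the two tasks differ, the matching unit 54 obtains the representative kana string of the cluster from the unknown word acquisition unit 56 and writes the phoneme sequence of the variable word dictionary 132 based on the kana-phoneme conversion rules 115 of the chat task 71₄.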
【０１８９】
図３４は、図２４の共通辞書の内容が、図１８の雑談用タスク７１₄のカテゴリテーブル１３３に反映された状態を示している。カテゴリテーブル１３３においては、“＿ロボット名＿”のカテゴリに対し、そのカテゴリ“＿ロボット名＿”に属する単語（トランスクリプションが「エスディーアール」の単語（図３３））について可変単語辞書１３２のシンボル「OOV00001」がエントリされる。さらに、カテゴリテーブル１３３の“＿地名＿”のカテゴリに対し、そのカテゴリ“＿地名＿”に属する単語（トランスクリプション「キタシナガワ」と「トーキョー」の単語（図３３））について可変単語辞書１３２に登録されたシンボル「OOV00002」、「OOV00003」がエントリされる。
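The reflection described for FIGS. 33 and 34 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class names, the `reflect` function, and the lookup table standing in for the kana-phoneme conversion rules are all assumptions, and the common dictionary contents only loosely follow what the text says about FIG. 24.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class CommonEntry:
    """One common-dictionary entry: a category plus either a fixed kana
    pronunciation or a reference to an unknown-word cluster (hypothetical layout)."""
    category: str
    kana: Optional[str] = None
    cluster_id: Optional[int] = None


@dataclass
class Cluster:
    phonemes: str  # representative phoneme sequence
    kana: str      # representative kana string


@dataclass
class Task:
    categories: List[str]  # categories listed in this task's category table
    variable_dict: Dict[str, Tuple[str, str]] = field(default_factory=dict)
    category_table: Dict[str, List[str]] = field(default_factory=dict)


def reflect(common: List[CommonEntry], clusters: Dict[int, Cluster],
            task: Task, kana_to_phonemes: Callable[[str], str]) -> None:
    """Rebuild the task's variable word dictionary and category table."""
    task.variable_dict = {}
    task.category_table = {c: [] for c in task.categories}
    for entry in common:
        if entry.category not in task.category_table:
            continue  # this task does not use the entry's category
        if entry.kana is not None:
            kana, phonemes = entry.kana, kana_to_phonemes(entry.kana)
        else:
            cluster = clusters[entry.cluster_id]  # latest cluster state
            kana, phonemes = cluster.kana, cluster.phonemes
        symbol = f"OOV{len(task.variable_dict) + 1:05d}"
        task.variable_dict[symbol] = (kana, phonemes)
        task.category_table[entry.category].append(symbol)


# Hypothetical stand-in for the kana-phoneme conversion rules 115 (FIG. 8).
RULES = {
    "エスディーアール": "e/s/u/d/i:/a:/r/u",
    "キタシナガワ": "k/i/t/a/sh/i/n/a/g/a/w/a",
}
common = [
    CommonEntry("＿ロボット名＿", kana="エスディーアール"),
    CommonEntry("＿ユーザ名＿", cluster_id=5),
    CommonEntry("＿地名＿", kana="キタシナガワ"),
    CommonEntry("＿地名＿", cluster_id=3),
]
clusters = {3: Cluster("t/o:/ky/o:", "トーキョー")}
task = Task(categories=["＿ロボット名＿", "＿地名＿"])
reflect(common, clusters, task, RULES.__getitem__)
```

Because the dictionary is rebuilt from scratch on every call, a task that was disabled while the common dictionary changed picks up additions, changes, and deletions in one pass.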
【０１９０】
次に、図３２のステップＳ１２３の処理で、図２の音声認識エンジン部１１が行う音声認識処理を、図３５のフローチャートを参照して詳細に説明する。この処理は、ユーザからマイクロホン５１に音声が入力されたとき、開始され、アプリケーション切替用タスク７１₂、雑談用タスク７１₄、音声コマンダ用タスク７１₅、・・・、その他のタスク７１_Nのうち、有効になっているタスク毎に行われる。
【０１９１】
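The audio signal generated by the microphone 51 is converted in step S141 by the AD conversion unit 52 into audio data, which is a digital signal, and supplied to the feature amount extraction unit 53. After step S141, the process proceeds to step S142, where the feature amount extraction unit 53 extracts feature amounts such as the mel cepstrum from the supplied audio data, and the process proceeds to step S143.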
【０１９２】
ステップＳ１４３において、マッチング部５４は、固定単語辞書１３１と可変単語辞書１３２のシンボルで表される単語のいくつかを連結して、単語列を生成し、音響スコアを計算する。音響スコアは、音声認識結果の候補である単語列と入力音声とが音として（音響的に）どれだけ近いかを表す。
【０１９３】
ステップＳ１４３の処理後は、ステップＳ１４４に進み、マッチング部５４は、ステップＳ１４３で計算された音響スコアに基づいて、音響スコアの高い単語列を所定の個数選択し、ステップＳ１４５に進む。
【０１９４】
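In step S145, the matching unit 54 calculates the language score of each word string selected in step S144 using the language model 112, and the process proceeds to step S146. For example, when a grammar or a finite-state automaton is used as the language model 112, the language score is “1” if the word string is accepted by the language model 112 and “0” if it is not.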
【０１９５】
なお、マッチング部５４は、受理することができるとき、ステップＳ１４４で選択した単語列を残し、受理することができないとき、ステップＳ１４４で選択した単語列を削除してもよい。
【０１９６】
また、言語モデル１１２として、統計言語モデルを使用している場合、その単語列の生成確率を言語スコアとする。この言語スコアを求める方法の詳細は、本出願人が先に提案した特願2001-382579号に開示されている。
【０１９７】
例えば、音声コマンダ用アプリケーション部２１₃の音声コマンダ処理において音声認識処理を行う場合、マッチング部５４がステップＳ１４４の処理で「＜先頭＞OOV00001 前に進め＜終端＞」という単語列を選択したとき、その言語スコアは、単語列「＜先頭＞OOV00001 前に進め＜終端＞」が、図１９に示す文法の言語モデル１１２で受理することができるので「１」となる。
【０１９８】
即ち、マッチング部５４は、カテゴリテーブル１３３（図２８）を参照して、シンボル“OOV00001”のカテゴリが“＿ロボット名＿”であることを認識し、ステップＳ１４４で得られた単語列「＜先頭＞OOV00001 前に進め＜終端＞」を、カテゴリ名を使用した単語列「＜先頭＞＿ロボット名＿前に進め＜終端＞」に変換して、図１９に示す言語モデル１１２で受理することができると判定する。
【０１９９】
一方、例えば、ステップＳ１４４で単語列「＜先頭＞OOV00001 に進め前＜終端＞」が選択された場合、マッチング部５４は、カテゴリテーブル１３３（図２８）を参照して、シンボル“OOV00001”のカテゴリが“＿ロボット名＿”であることを認識し、ステップＳ１４４で得られた単語列「＜先頭＞OOV00001 に進め前＜終端＞」を、カテゴリ名を使用した単語列「＜先頭＞＿ロボット名＿に進め前＜終端＞」に変換して、図１９に示す言語モデル１１２で受理することができないと判定し、この単語列の言語スコアを「０」とする。
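The accept/reject decision in the two examples above can be sketched as below. The grammar of FIG. 19 is not reproduced in the text, so `ACCEPTED` is a hypothetical one-sentence stand-in, and the function names and the word segmentation are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple


def to_category_sequence(words: Sequence[str],
                         category_table: Dict[str, List[str]]) -> List[str]:
    """Replace each variable-word symbol with the name of its category."""
    symbol_to_category = {
        symbol: category
        for category, symbols in category_table.items()
        for symbol in symbols
    }
    return [symbol_to_category.get(word, word) for word in words]


def grammar_score(words: Sequence[str],
                  category_table: Dict[str, List[str]],
                  accepted: Set[Tuple[str, ...]]) -> int:
    """Return 1 if the category-level word string is accepted, otherwise 0."""
    return 1 if tuple(to_category_sequence(words, category_table)) in accepted else 0


# Hypothetical grammar: the set of accepted category-level word strings.
ACCEPTED = {("＜先頭＞", "＿ロボット名＿", "前", "に", "進め", "＜終端＞")}
category_table = {"＿ロボット名＿": ["OOV00001"]}

good = ["＜先頭＞", "OOV00001", "前", "に", "進め", "＜終端＞"]
bad = ["＜先頭＞", "OOV00001", "に", "進め", "前", "＜終端＞"]
good_score = grammar_score(good, category_table, ACCEPTED)
bad_score = grammar_score(bad, category_table, ACCEPTED)
```

The point of the substitution is that the grammar never needs to mention `OOV00001`; it only needs the category name.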
【０２００】
ステップＳ１４６において、マッチング部５４は、ステップＳ１４３で計算された音響スコアと、ステップＳ１４５で計算された言語スコアを統合して、各単語列をソートし、例えば、統合したスコアの一番大きい単語列を認識結果として決定する。
【０２０１】
これにより、音響的にも言語的にも最もふさわしい単語列が認識結果として決定される。
【０２０２】
ステップＳ１４６の処理後は、ステップＳ１４７に進み、マッチング部５４は、認識結果に音声で登録された単語（未知語獲得部５６にクラスタリングされている単語）が含まれているか否かを判定する。
【０２０３】
ステップＳ１４７において、音声で登録された単語が認識結果に含まれていると判定された場合、ステップＳ１４８に進み、マッチング部５４は、未知語獲得部５６にその単語を供給し、未知語獲得部５６は、再クラスタリングを行う。そして、処理はステップＳ１４９に進む。
【０２０４】
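For example, when the word string “＜先頭＞今日はトーキョーに行ったんだよ＜終端＞” (“<start> I went to Tokyo today <end>”), which contains the place name (unknown word) “トーキョー”, is obtained in step S144, the matching unit 54 supplies the unknown word acquisition unit 56 with the audio data of the unknown word “トーキョー”, together with the phoneme sequence (for example, “t/o:/ky/o:”) and the kana string (for example, “トーキョー”) recognized with reference to the phoneme typewriter task 71₁. The unknown word acquisition unit 56 then performs re-clustering.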
【０２０５】
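This increases the amount of audio data supplied to the unknown word acquisition unit 56, so the representative phoneme sequence and representative kana string of each cluster may be updated to the correct values. As a side effect, however, even after the correct phoneme sequence and kana string have been obtained, re-clustering may change them to incorrect values. To prevent this, when the user so instructs, the pronunciation can be fixed by writing the kana string at that point into the common dictionary entry. For example, in FIG. 23, once the kana string of the cluster with ID = 3 has become the pronunciation “トーキョー”, the place in the common dictionary of FIG. 24 where “cluster ID = 3” is written is rewritten as “kana pronunciation: トーキョー” (the fifth entry is the one rewritten). Then, even if the kana string of the cluster with ID = 3 later changes to something other than “トーキョー”, the pronunciation of the fifth entry in the common dictionary remains fixed at “トーキョー”.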
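Fixing a pronunciation in this way amounts to replacing a cluster reference with a copy of the cluster's current kana string. A minimal sketch, assuming a dict-based entry layout and a function name (`pin_pronunciation`) that are not in the original:

```python
from typing import Dict


def pin_pronunciation(entry: Dict, clusters: Dict[int, Dict[str, str]]) -> Dict:
    """Replace the entry's cluster reference with the cluster's current kana string."""
    if "cluster_id" in entry:
        cluster = clusters[entry.pop("cluster_id")]
        entry["kana"] = cluster["kana"]
    return entry


clusters = {3: {"kana": "トーキョー", "phonemes": "t/o:/ky/o:"}}
fifth_entry = {"category": "＿地名＿", "cluster_id": 3}

pin_pronunciation(fifth_entry, clusters)

# A later re-clustering drifts to a made-up wrong value; the entry is unaffected.
clusters[3]["kana"] = "トーキョ"
```

After pinning, the entry no longer follows the cluster, which is the trade-off the paragraph describes: the pronunciation stops improving, but it can no longer degrade either.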
【０２０６】
一方、ステップＳ１４７において、マッチング部５４は、音声認識結果に音声で登録された単語が含まれていないと判定した場合、ステップＳ１４８をスキップして、ステップＳ１４９に進む。
【０２０７】
ステップＳ１４９において、マッチング部５４は、タスクに対応するアプリケーション部２１に、ステップＳ１４６の処理で決定された認識結果を供給する。
【０２０８】
ここで、雑談用アプリケーション部２１₂の雑談処理において、マッチング部５４が、図３５のステップＳ１４４で、例えば、単語列「＜先頭＞OOV00001 は何時に起きたの＜終端＞」を選択した場合の言語スコアを求める式を図３６に示す。
【０２０９】
言語スコア「Score（＜先頭＞OOV00001 は何時に起きたの＜終端＞）」は、式（１）に示すように、単語列「＜先頭＞OOV00001 は何時に起きたの＜終端＞」の生成確率である。
【０２１０】
Strictly speaking, the value of the language score “Score（＜先頭＞OOV00001 は何時に起きたの＜終端＞）” is obtained, as shown in equation (2), as “P（＜先頭＞）P（OOV00001｜＜先頭＞）P（は｜＜先頭＞OOV00001）P（何時｜＜先頭＞OOV00001 は）P（に｜＜先頭＞OOV00001 は何時）P（起きた｜＜先頭＞OOV00001 は何時に）P（の｜＜先頭＞OOV00001 は何時に起きた）P（＜終端＞｜＜先頭＞OOV00001 は何時に起きたの）”. However, as shown in FIG. 16, the language model 112 uses tri-grams, so the conditions “＜先頭＞OOV00001 は”, “＜先頭＞OOV00001 は何時”, “＜先頭＞OOV00001 は何時に”, “＜先頭＞OOV00001 は何時に起きた”, and “＜先頭＞OOV00001 は何時に起きたの” are approximated by conditional probabilities limited to at most the two immediately preceding words, namely “OOV00001 は”, “は何時”, “何時に”, “に起きた”, and “起きたの”, respectively (equation (3)).
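Equations (1) to (3) appear only in FIG. 36, which is not reproduced here. The following is a reconstruction from the description above, not the figure itself. The indexing is an assumption introduced for compactness: w_0 = ＜先頭＞, w_1 = OOV00001, w_2 = は, w_3 = 何時, w_4 = に, w_5 = 起きた, w_6 = の, w_7 = ＜終端＞.

```latex
\begin{align}
\mathrm{Score}(w_0 w_1 \cdots w_7)
  &= P(w_0 w_1 \cdots w_7) \tag{1} \\
  &= P(w_0) \prod_{i=1}^{7} P(w_i \mid w_0 \cdots w_{i-1}) \tag{2} \\
  &\approx P(w_0)\, P(w_1 \mid w_0) \prod_{i=2}^{7} P(w_i \mid w_{i-2}\, w_{i-1}) \tag{3}
\end{align}
```

For i = 2 the condition w_0 w_1 is already two words long, so that factor is unchanged between (2) and (3); the approximation only shortens the conditions for i ≥ 3.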
【０２１１】
この条件付確率は、言語モデル１１２（図１６）を参照することによって求められるが、言語モデル１１２は、シンボル「OOV00001」を含んでいないので、マッチング部５４は、図３４のカテゴリテーブル１３３を参照して、シンボル「OOV00001」で表される単語のカテゴリが、“＿ロボット名＿”であることを認識し、「OOV00001」を“＿ロボット名＿”に変換する。
【０２１２】
That is, as shown in equation (4), “P（OOV00001｜＜先頭＞）” is replaced with “P（＿ロボット名＿｜＜先頭＞）P（OOV00001｜＿ロボット名＿）” and approximated by “P（＿ロボット名＿｜＜先頭＞）/N”. Here, N denotes the number of words belonging to the category “＿ロボット名＿” in the category table 133.
【０２１３】
即ち、確率をＰ（Ｘ｜Ｙ）という形式で記述した場合、単語ＸがカテゴリＣに属する単語である場合、言語モデル１１２からＰ（Ｃ｜Ｙ）を求め、その値に、Ｐ（Ｘ｜Ｃ）（カテゴリＣから単語Ｘが生成される確率）を掛ける。カテゴリＣに属する単語が全て等確率で生成されると仮定すれば、カテゴリＣに属する単語がＮ個ある場合、Ｐ（Ｘ｜Ｃ）は、１／Ｎと近似できる。
【０２１４】
In FIG. 34, only the word represented by the symbol “OOV00001” belongs to the category “＿ロボット名＿”, so N is “1”. Therefore, as shown in equation (5), “P（は｜＜先頭＞OOV00001）” becomes “P（は｜＜先頭＞＿ロボット名＿）”. Likewise, as shown in equation (6), “P（何時｜OOV00001 は）” becomes “P（何時｜＿ロボット名＿は）”.
【０２１５】
これにより、可変単語を含む単語列に対しても、言語スコアを計算することができ、可変単語を認識結果に出現させることが可能となる。
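The category-based tri-gram score of equations (3) to (6) can be sketched as follows. The function name `language_score` and the toy model `toy_lm`, which returns a constant probability and records the contexts it is asked about, are assumptions for illustration; a real language model 112 would supply the tri-gram probabilities.

```python
from typing import Callable, Dict, List, Sequence, Tuple

seen: List[Tuple[str, Tuple[str, ...]]] = []


def language_score(words: Sequence[str],
                   category_table: Dict[str, List[str]],
                   trigram_prob: Callable[[str, Tuple[str, ...]], float]) -> float:
    """Multiply tri-gram probabilities, scoring variable words through their category."""
    symbol_to_category = {
        symbol: category
        for category, symbols in category_table.items()
        for symbol in symbols
    }
    score = 1.0
    history: List[str] = []
    for word in words:
        category = symbol_to_category.get(word)
        token = category if category is not None else word
        # Condition on at most the two preceding tokens (tri-gram approximation).
        p = trigram_prob(token, tuple(history[-2:]))
        if category is not None:
            # P(X|C) is approximated as 1/N for N equally likely category members.
            p /= len(category_table[category])
        score *= p
        history.append(token)
    return score


def toy_lm(token: str, context: Tuple[str, ...]) -> float:
    """Hypothetical language model: constant probability, records each query."""
    seen.append((token, context))
    return 0.5


words = ["＜先頭＞", "OOV00001", "は", "何時", "に", "起きた", "の", "＜終端＞"]
score_one = language_score(words, {"＿ロボット名＿": ["OOV00001"]}, toy_lm)
score_two = language_score(words, {"＿ロボット名＿": ["OOV00001", "OOV00002"]}, toy_lm)
```

With a single robot name the 1/N factor is 1, matching the N = 1 case in the text; registering a second name halves the score of any word string that uses either one.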
【０２１６】
上述の例では、アプリケーション部２１の起動とタスク７１の有効、アプリケーション部２１の終了とタスク７１の無効が連動するようにしたが、これを別のタイミングに行って、例えば、アプリケーション部２１の起動中にタスクの有効や無効を何度も切り替えたり、１つのアプリケーションで複数のタスクを制御したりすることも可能である。
【０２１７】
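In this case, for a task that is switched between enabled and disabled frequently, allocating and freeing memory each time is inefficient, so after the task is disabled it is also possible simply to set a flag (indicating that the task is disabled) and keep the memory allocated.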
【０２１８】
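In the above example, nothing is stored in the common dictionary of the common dictionary unit 55 when the robot system starts, but some words may be stored in the common dictionary in advance. For example, the product name of the robot is often registered as the robot's name, so the product name may be registered in advance in the “＿ロボット名＿” category of the common dictionary.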
【０２１９】
図３７は、ロボットシステムの起動時に、ロボットの商品名「エスディーアール」がカテゴリ“＿ロボット名＿”にエントリされている場合の共通辞書の例を示している。図３７において、ロボットシステムの起動時には、カテゴリ“＿ロボット名＿”に、カナ発音「エスディーアール」がエントリされているので、ユーザは、名前登録を行わなくても、カナ発音「エスディーアール」で表される単語を用いて、ロボットを制御することができる。
【０２２０】
また、上述の例では、初期段階（出荷時）には未知語獲得部のクラスタは何も生成されていないことを想定していた。しかし、主要な名称についてはクラスタを最初から用意しておくと、図２１の名前登録処理において名前を音声で入力する場合に、クラスタが用意されている名前については認識されやすくなる。例えば、図３のようなクラスタを出荷時に用意しておくと、「アカ」、「アオ」、「ミドリ」、「クロ」という音声については発音を正しい音韻系列で認識（取得）できる。さらに、クラスタが用意されている名前については、共通辞書に登録した後で発音が変化することは望ましくない。そこで、共通辞書に発音情報を登録する際には、クラスタＩＤを記述する代わりに、「アカ」、「アオ」、「ミドリ」、「クロ」といったカナ発音（クラスタの代表カナ列）で記述する。
【０２２１】
さらに、上述の例では、マッチング部５４は、共通辞書の内容を全てのタスクに反映させるとしたが、反映させたいタスクにのみ反映させてもよい。例えば、予めタスクに番号（タスクＩＤ）を付加しておき、図２４の共通辞書を拡張して、「このエントリが有効（または無効）なタスクのリスト」を表す欄を設け、図２５の反映処理において、マッチング部５４は、「このエントリが有効なタスクのリスト」を表す欄に記述されたタスクＩＤが付加されたタスクにのみ、共通辞書の内容を反映させればよい。
【０２２２】
図３８は、反映させたいタスクのＩＤが、共通辞書の「有効なタスク」を表す欄に記述された例を示している。図３８において、カテゴリ“＿ロボット名＿”に属するカナ発音が「エスディーアール」で表される単語は、有効なタスクのＩＤが「１」、「２」、「４」であるので、タスクＩＤが「１」、「２」、「４」のタスクの可変単語辞書１３２とカテゴリテーブル１３３にのみ、カナ発音「エスディーアール」で表される単語の共通辞書の内容が反映される。
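The per-entry task list of FIG. 38 reduces to a filter applied before reflection. A minimal sketch: the field name `valid_tasks` is hypothetical, and treating an entry without a list as valid for every task is an assumption the text does not state.

```python
from typing import Dict, List


def entries_for_task(common: List[Dict], task_id: int) -> List[Dict]:
    """Keep only the common-dictionary entries that apply to the given task ID."""
    return [
        entry for entry in common
        if "valid_tasks" not in entry or task_id in entry["valid_tasks"]
    ]


common = [
    {"category": "＿ロボット名＿", "kana": "エスディーアール", "valid_tasks": {1, 2, 4}},
    {"category": "＿地名＿", "kana": "キタシナガワ"},  # hypothetical entry with no task list
]

for_task_1 = entries_for_task(common, 1)
for_task_3 = entries_for_task(common, 3)
```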
【０２２３】
また、上述の例では、固定単語辞書１３１に記憶されている単語は、言語モデル１１２に記述されている単語であり、可変単語辞書１３２に記憶される単語は、カテゴリに属する単語であるとしたが、カテゴリに属する単語の一部を、固定単語辞書１３１に記憶してもよい。
【０２２４】
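FIG. 39 shows an example of the fixed word dictionary 131 of the application switching task 71₂, and FIG. 40 shows an example of the category table 133 at startup. That is, in the category table 133 of FIG. 40, the category “＿ロボット名＿” and the symbol “OOV00001” of the word belonging to that category are registered in advance. In the fixed word dictionary 131 of FIG. 39, the symbol “OOV00001”, the transcription “エスディーアール” of the word represented by that symbol, and the phoneme sequence “e/s/u/d/i:/a:/r/u” are registered in advance.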
【０２２５】
この場合、単語「エスディーアール」は、カテゴリ“＿ロボット名＿”に属するものとして音声認識処理が行われる。即ち、単語「エスディーアール」は、最初からロボットの名前として扱われることになる。但し、単語「エスディーアール」は固定単語辞書１３１に記憶されているため、削除したり、変更することはできない。
【０２２６】
このように、例えば、ロボットの商品名等、名前に設定されると想定される単語を予め固定単語辞書１３１に記憶しておくことによって、ユーザは名前登録を行わずに、ロボットを制御することができる。
【０２２７】
また、上述の例では、カテゴリのシンボルは全タスクで共通にしていたが、共通でなくてもよい。この場合、図４１乃至図４４に示すような変換テーブルをタスク内に用意すればよい。
【０２２８】
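That is, for example, when a task T describes the categories “_ROBOT_NAME_” and “_USER_NAME_”, the conversion table of FIG. 41 causes, in task T, the common dictionary contents of words belonging to the category “＿ロボット名＿” in the common dictionary unit 55 to be reflected in the category “_ROBOT_NAME_”. Likewise, in task T, the common dictionary contents of words belonging to the category “＿ユーザ名＿” are reflected in the category “_USER_NAME_”.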
【０２２９】
また、例えば、あるタスクＴで、カテゴリ“＿固有名詞＿”が記述されている場合、図４２の変換テーブルによれば、タスクＴにおいて、共通辞書部５５で、カテゴリ“＿ロボット名＿”に属する単語の共通辞書の内容も、カテゴリ“＿ユーザ名＿”に属する単語の共通辞書の内容も、“＿固有名詞＿”というカテゴリに反映される。
【０２３０】
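Further, for example, when a task T describes the categories “＿姓＿” (family name) and “＿名＿” (given name), the conversion table of FIG. 43 causes, in task T, the common dictionary contents of words belonging to the category “＿ユーザ名＿” in the common dictionary unit 55 to be converted and duplicated into the categories “＿姓＿” and “＿名＿”. In the step of reflecting the common dictionary contents in this task (FIG. 25), for example, the second entry of FIG. 24 is converted and duplicated, according to the conversion table, from “＿ユーザ名＿, cluster ID = 5” into the two entries “＿姓＿, cluster ID = 5” and “＿名＿, cluster ID = 5”, which are then reflected in the variable word dictionary and the category table of this task.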
【０２３１】
また、例えば、あるタスクＴで、カテゴリが記述されていない場合、図４４の変換テーブルによれば、タスクＴにおいて、共通辞書部５５のカテゴリ“＿ロボット名＿”、カテゴリ“＿ユーザ名＿”、カテゴリ“＿地名＿”に属する単語が、シンボル「UNK」で表される。なお、「UNK」は、「Unknown word」を意味する。
【０２３２】
これにより、カテゴリが記述されていないタスクにおいても、言語モデル１１２に、シンボル「UNK」を記述しておくだけで、マッチング部５４は、カテゴリ“＿ロボット名＿”、カテゴリ“＿ユーザ名＿”、カテゴリ“＿地名＿”に属する単語を認識することができる。
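The four conversion tables of FIGS. 41 to 44 can all be modeled as a mapping from a common-dictionary category to a list of task-side categories. The actual table layout in the figures is not given in the text, so the dict representation and the function name `convert_entry` below are assumptions.

```python
from typing import Dict, List

# Hypothetical encodings of the conversion tables described for FIGS. 41 to 44.
FIG41 = {"＿ロボット名＿": ["_ROBOT_NAME_"], "＿ユーザ名＿": ["_USER_NAME_"]}
FIG42 = {"＿ロボット名＿": ["＿固有名詞＿"], "＿ユーザ名＿": ["＿固有名詞＿"]}
FIG43 = {"＿ユーザ名＿": ["＿姓＿", "＿名＿"]}
FIG44 = {"＿ロボット名＿": ["UNK"], "＿ユーザ名＿": ["UNK"], "＿地名＿": ["UNK"]}


def convert_entry(entry: Dict, table: Dict[str, List[str]]) -> List[Dict]:
    """Return one copy of the entry per target category (none if unmapped)."""
    return [
        {**entry, "category": target}
        for target in table.get(entry["category"], [])
    ]


second_entry = {"category": "＿ユーザ名＿", "cluster_id": 5}
renamed = convert_entry(second_entry, FIG41)
duplicated = convert_entry(second_entry, FIG43)
unknown = convert_entry({"category": "＿地名＿", "kana": "キタシナガワ"}, FIG44)
```

Returning a list covers renaming (one target), merging (several sources sharing a target), and duplication (one source with several targets) with the same code path.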
【０２３３】
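FIG. 45 shows an example of the external configuration of a bipedal walking robot equipped with the robot control system 1 to which the present invention is applied. The robot 201 is configured with a head unit 211 disposed on top of a body unit 213, arm units 212A and 212B of identical configuration attached at predetermined positions on the upper left and right of the body unit 213, and leg units 214A and 214B of identical configuration attached at predetermined positions on the lower left and right of the body unit 213.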
【０２３４】
また、頭部ユニット２１１には、このロボット２０１の「目」として機能するCCD（Charge Coupled Device）カメラ２２１Ａ，２２１Ｂ、「耳」として機能するマイクロホン２２２Ａ，２２２Ｂ、および「口」として機能するスピーカ２２３がそれぞれ所定位置に配置されている。
【０２３５】
図４６は、ロボットの電気的構成例を示している。ロボット制御システム１の指令により、ユニット制御システム２３１および対話制御システム２３２は、ロボット２０１の動作を制御する。即ち、ユニット制御システム２３１は、ロボット２０１の頭部ユニット２１１、腕部ユニット２１２Ａ，２１２Ｂ、および脚部ユニット２１４Ａ，２１４Ｂのそれぞれを必要に応じて制御し、ロボット２０１に所定の動作をさせる。また、対話制御システム２３２は、ロボット２０１の発話を制御し、必要に応じて、スピーカ２２３から、所定の発話をさせる。
【０２３６】
なお、上述の説明において、単語とは、音声を認識する処理において、１つのまとまりとして扱った方がよい単位のことであり、言語学的な単語とは必ずしも一致しない。例えば、「タロウ君」は、それ全体を１単語として扱ってもよいし、「タロウ」、「君」という２単語として扱ってもよい。さらに、もっと大きな単位である「こんにちはタロウ君」等を１単語として扱ってもよい。
【０２３７】
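Similarly, a phoneme here is something that is convenient for processing to treat as a single acoustic unit, and does not necessarily coincide with a phoneme in the phonetic or phonological sense. For example, the “tou” part of “東京” (Tokyo) may be represented by the three phoneme symbols “t/o/u”, or a symbol “o:” representing the long vowel of “o” may be provided. It may also be represented as “t/o/o”. In addition, symbols representing silence may be provided, and these may be further subdivided into, for example, “silence before an utterance”, “a short silent interval between utterances”, and “the silence of a geminate consonant (っ)”.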
【０２３８】
また、以上においては、ロボット装置について説明したが、本発明は、音声認識や音声合成、翻訳、その他の言語処理を利用したアプリケーションを有する装置に適用することができる。
【０２３９】
さらに、本発明は、例えば、広辞苑に登録された辞書の中から、所定の用語だけを抜き出して、その用語辞書をつくる装置に適用することができる。
【０２４０】
また、上述の説明において、アプリケーション部が複数ある場合について説明したが、アプリケーション部は１つでもよい。
【０２４１】
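The series of processes described above can be executed by hardware or by software. When they are executed by software, the processes are executed by a personal computer 600 such as that shown in FIG. 47.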
【０２４２】
図４７において、CPU（Central Processing Unit）６０１は、ROM(Read Only Memory)６０２に記憶されているプログラム、または、記憶部６０８からRAM(Random Access Memory)６０３にロードされたプログラムに従って各種の処理を実行する。RAM６０３にはまた、CPU６０１が各種の処理を実行する上において必要なデータなどが適宜記憶される。
【０２４３】
CPU６０１、ROM６０２、およびRAM６０３は、内部バス６０４を介して相互に接続されている。この内部バス６０４にはまた、入出力インターフェース６０５も接続されている。
【０２４４】
入出力インターフェース６０５には、キーボード、マウスなどよりなる入力部６０６、CRT，LCD（Liquid Crystal Display）などよりなるディスプレイ、並びにスピーカなどよりなる出力部６０７、ハードディスクなどより構成される記憶部６０８、モデム、ターミナルアダプタなどより構成される通信部６０９が接続されている。通信部６０９は、電話回線やCATVを含む各種のネットワークを介しての通信処理を行なう。
【０２４５】
入出力インターフェース６０５にはまた、必要に応じてドライブ６１０が接続され、磁気ディスク、光ディスク、光磁気ディスク、あるいは半導体メモリなどによりなるリムーバブルメディア６２１が適宜装着され、それから読み出されたコンピュータプログラムが、必要に応じて記憶部６０８にインストールされる。
【０２４６】
一連の処理をソフトウエアにより実行させる場合には、そのソフトウエアを構成するプログラムが、専用のハードウエアに組み込まれているコンピュータ、または、各種のプログラムをインストールすることで、各種の機能を実行することが可能な、例えば、汎用のパーソナルコンピュータなどに、ネットワークや記録媒体からインストールされる。
【０２４７】
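As shown in FIG. 47, this recording medium is constituted not only by package media consisting of the removable medium 621 on which the program is recorded and which is distributed separately from the computer to provide the program to users, but also by the ROM 602 on which the program is recorded, the hard disk included in the storage unit 608, and the like, which are provided to users pre-installed in the apparatus body.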
【０２４８】
なお、本明細書において、コンピュータプログラムを記述するステップは、記載された順序に従って時系列的に行われる処理はもちろん、必ずしも時系列的に処理されなくとも、並列的あるいは個別に実行される処理をも含むものである。
【０２４９】
また、本明細書において、システムとは、複数の装置により構成される装置全体を表わすものである。
【０２５０】
ところで、本出願人が先に提案した特願2001-382579号には、未知語獲得機構で獲得した単語を言語モデルに反映させるまでの一連の処理と、言語モデルに反映させた単語を以降の認識結果に出現させるための処理についての発明が開示されている。
【０２５１】
しかしながら、特願2001-382579号の発明は、１つの単語登録用のアプリケーションと、１つの登録した単語を使用するアプリケーションから構成されており、音声認識を行うアプリケーションが複数になった場合については想定していないため、アプリケーションが複数、しかも可変個存在するシステムにおいて、登録した単語をそのアプリケーションに反映させるときの上述した課題、および複数のアプリケーションに対して登録した単語を削除または変更するときの上述した課題を解決することが困難であった。
【０２５２】
また、本出願人が先に提案した特願2002-072718号には、「私の名前はタロウ（未知語）です。」という発話から、未知語である「タロウ」という単語を抽出して、名前として獲得するという発明が開示されている。
【０２５３】
しかしながら、特願2002-072718号の発明は、未知語を言語モデルに反映させるという上述した課題、登録した単語をそのアプリケーションに反映させるときの上述した課題、および複数のアプリケーションに対して登録した単語を削除または変更するときの上述した課題を解決することが困難であった。
【０２５４】
したがって、特願2001-382579号の発明、および特願2002-072718号の発明では、未知語を言語モデルに反映させるという上述した課題、登録した単語をそのアプリケーションに反映させるときの上述した課題、および複数のアプリケーションに対して登録した単語を削除または変更するときの上述した課題すべてを解決することが困難であった。
【０２５５】
しかしながら、図１のロボット制御システム１においては、言語モデル１１２にカテゴリを記述しているため、未知語をカテゴリに属させることによって、未知語を言語モデルに反映させるという上述した課題を解決することができる。
【０２５６】
また、図１のロボット制御システム１においては、共通辞書に基づき、アプリケーションで利用される音声認識の対象となる単語が登録される可変単語辞書１３２を構築、または再構築するようにしたので、登録した単語をそのアプリケーションに反映させるときの上述した課題、複数のアプリケーションに対して登録した単語を削除または変更するときの上述した課題を解決することができる。
【０２５７】
さらに、キーボードを持たないシステム（例えば、ロボット等）の場合、単語登録時に、発音情報を入力することが困難であるという課題があるが、その課題を解決する手段として、例えば、音韻タイプライタを用いて、音声で発音情報を入力する方法が提案されている。
【０２５８】
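However, a phoneme typewriter can misrecognize speech, and if it is used as is, a word may be registered with the wrong pronunciation. For example, suppose the phoneme typewriter is made to recognize the pronunciation of “エスディーアール” (SDR), misrecognizes it, and outputs “イルニヤル”. If “イルニヤル” is adopted as the pronunciation information, the word is registered with the wrong pronunciation, so that from then on an utterance such as “エスディーアール, hello” is hard to recognize while an utterance such as “イルニヤル, hello” is easily recognized.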
【０２５９】
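Therefore, in the robot control system 1 of FIG. 1, for a word registered by voice, the latest pronunciation information at that time is obtained from the unknown word acquisition unit 56 each time the word is reflected in a task. Consequently, even after the phoneme typewriter has misrecognized a word and the misrecognized word has been registered in the variable word dictionary 132, simply supplying audio data to the unknown word acquisition unit 56 updates the pronunciation information, so the latest pronunciation information can be obtained and a correct recognition result may be obtained.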
【０２６０】
即ち、図１のロボット制御システム１においては、未知語を言語モデルに反映させるという上述した課題、登録した単語をそのアプリケーションに反映させるときの上述した課題、複数のアプリケーションに対して登録した単語を削除または変更するときの上述した課題、およびキーボードを持たないシステムにおいて、登録したい単語の発音情報を入力するときの上述した課題、すなわち上述した全ての課題を解決することができる。
【０２６１】
【発明の効果】
以上の如く、本願発明によれば、単語を登録することができる。特に、複数のアプリケーションに対応する単語を登録する場合においても、登録した単語を、各アプリケーションにおいて共通に使用することができる。また、アプリケーションの起動前に登録した単語も、そのアプリケーションで使用することができる。さらに、登録単語を変更した場合においても、各アプリケーションで整合性を保つことができる。
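[Brief description of the drawings]
[FIG. 1] A block diagram showing a configuration example of a robot control system to which the present invention is applied.
[FIG. 2] A block diagram showing a configuration example of the speech recognition engine unit of FIG. 1.
[FIG. 3] A diagram showing an example of clusters in the unknown word acquisition unit of FIG. 2.
[FIG. 4] A flowchart explaining robot control processing in the robot control system of FIG. 1.
[FIG. 5] A flowchart explaining robot control processing in the robot control system of FIG. 1.
[FIG. 6] A diagram showing a configuration example of a task of FIG. 2.
[FIG. 7] A diagram showing an example of the phoneme list of FIG. 6.
[FIG. 8] A diagram showing an example of the kana-phoneme conversion rules of FIG. 6.
[FIG. 9] A diagram showing an example of the language model of the phoneme typewriter task of FIG. 2.
[FIG. 10] A diagram showing an example of the fixed word dictionary of the phoneme typewriter task of FIG. 2.
[FIG. 11] A diagram showing an example of the language model of the application switching task of FIG. 2.
[FIG. 12] A diagram showing an example of the fixed word dictionary of the application switching task of FIG. 2.
[FIG. 13] A diagram showing an example of the category table of the application switching task of FIG. 2.
[FIG. 14] A diagram showing an example of the language model of the name registration task of FIG. 2.
[FIG. 15] A diagram showing an example of the fixed word dictionary of the name registration task of FIG. 2.
[FIG. 16] A diagram showing an example of the language model of the chat task of FIG. 2.
[FIG. 17] A diagram showing an example of the fixed word dictionary of the chat task of FIG. 2.
[FIG. 18] A diagram showing an example of the category table of the chat task of FIG. 2.
[FIG. 19] A diagram showing an example of the language model of the voice commander task of FIG. 2.
[FIG. 20] A diagram showing an example of the fixed word dictionary of the voice commander task of FIG. 2.
[FIG. 21] A flowchart explaining the name registration processing in step S9 of FIG. 5.
[FIG. 22] A flowchart explaining the name recognition processing in step S43 of FIG. 21.
[FIG. 23] A diagram showing an example of clusters in the unknown word acquisition unit of FIG. 2.
[FIG. 24] A diagram showing an example of the common dictionary unit of FIG. 2.
[FIG. 25] A flowchart explaining the reflection processing in step S49 of FIG. 21.
[FIG. 26] A block diagram explaining the reflection processing of FIG. 25.
[FIG. 27] A diagram showing an example of the variable word dictionary of the name registration task of FIG. 2.
[FIG. 28] A diagram showing an example of the category table of the name registration task of FIG. 2.
[FIG. 29] A flowchart explaining word deletion or change processing in the matching unit of FIG. 2.
[FIG. 30] A diagram showing an example of a change to the common dictionary unit of FIG. 2.
[FIG. 31] A diagram showing an example of a change to the common dictionary unit of FIG. 2.
[FIG. 32] A flowchart explaining the chat processing in step S12 of FIG. 5.
[FIG. 33] An example of the variable word dictionary of the chat task of FIG. 2.
[FIG. 34] An example of the category table of the chat task of FIG. 2.
[FIG. 35] A flowchart explaining the speech recognition processing in step S123 of FIG. 32.
[FIG. 36] A diagram showing an example of the formulas for calculating the language score.
[FIG. 37] A diagram showing a modification of the common dictionary unit of FIG. 2.
[FIG. 38] A diagram showing a modification of the common dictionary unit of FIG. 2.
[FIG. 39] A diagram showing a modification of the fixed word dictionary of FIG. 6.
[FIG. 40] A diagram showing an example of the category table of FIG. 6.
[FIG. 41] A diagram showing an example of a category conversion table.
[FIG. 42] A diagram showing an example of a category conversion table.
[FIG. 43] A diagram showing an example of a category conversion table.
[FIG. 44] A diagram showing an example of a category conversion table.
[FIG. 45] A perspective view showing the external configuration of the robot.
[FIG. 46] A block diagram showing the electrical configuration of the robot.
[FIG. 47] A diagram showing an example of a personal computer.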
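[Explanation of symbols]
11 speech recognition engine unit, 21 application unit, 31 application management unit, 51 microphone, 52 AD conversion unit, 53 feature amount extraction unit, 54 matching unit, 55 common dictionary unit, 56 unknown word acquisition unit, 71 task, 111 acoustic model, 112 language model, 113 dictionary, 114 phoneme list, 115 kana-phoneme conversion rules, 116 search parameters, 131 fixed word dictionary, 132 variable word dictionary, 133 category table
[0001]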
BACKGROUND OF THE INVENTION
The present invention relates to a language processing device, a language processing method, a program, and a recording medium, and more particularly, to a language processing device, a language processing method, and a program that enable a registered word to be commonly recognized by a plurality of applications.
[0002]
[Prior art]
Speech recognition includes isolated word recognition, which recognizes a single word, and continuous word recognition, which recognizes a word string made up of multiple words. Conventional continuous word recognition holds a language model, that is, a “database of how readily words follow one another,” which prevents “word strings that sound similar but are nonsensical” from being produced as recognition results.
[0003]
However, a language model describes information only about words that can be recognized from the outset (hereinafter referred to as known words where appropriate), so it has been difficult to correctly recognize words registered later (hereinafter referred to as registered words where appropriate). That is, in isolated word recognition, once a word is registered in the recognition dictionary it is recognized from then on. In continuous word recognition, however, registration in the dictionary alone is not sufficient, and the registered word must also be reflected in the language model, which has generally been difficult.
[0004]
Therefore, it has been proposed to classify registered words into categories such as “person name” and “place name” and prepare recognition grammars corresponding to the categories to recognize speech (for example, see Patent Document 1).
[0005]
Further, in a system with multiple speech-recognition applications whose number can vary, reflecting a word registered in one application in the other applications raises problems that do not arise with a single application. For example, if word registration is performed only for applications that are already running, then, unlike the single-application case, it is difficult to reflect the registered word in applications that are started or installed after the registration.
[0006]
Furthermore, when there are a plurality of applications, it is troublesome to delete the same registered word many times with a plurality of applications. Further, when there are a plurality of applications, it is easy to delete all registered words, but there is a problem that it is difficult to delete only a part of them or change pronunciation.
[0007]
That is, with a single application, the registered word to be deleted or changed can be identified by information such as “the word registered nth” or “the nth entry in the recognition dictionary.” With multiple applications, however, “the word registered nth” and “the position at which the word was added to the dictionary” differ from application to application, so the word is difficult to identify.
[0008]
In addition, when there are a plurality of applications, a registered word can be specified by pronunciation. However, if a registered word is specified by pronunciation, there is a possibility that the homonym is deleted or changed.
[0009]
Therefore, it has been proposed that, instead of each application performing speech recognition individually, a module called a “voice commander” performs speech recognition for all applications and forwards the recognition results to each application (see, for example, Patent Document 1).
[0010]
[Patent Document 1]
JP 2001-216128 A
[0011]
[Problems to be solved by the invention]
However, in the invention described in Patent Document 1, the “voice commander” must hold a recognition dictionary and a language model for each application. That is, when the “voice commander” is developed, it is necessary to anticipate which applications will be used together and to prepare recognition dictionaries and language models suited to them, so it is difficult to reflect registered words in applications that were not anticipated.
[0012]
The present invention has been made in view of such a situation, and enables registered words to be used in common by a plurality of applications.
[0013]
[Means for Solving the Problems]
The language processing apparatus of the present invention includes: registered dictionary storage means for storing a registered dictionary in which words are registered; construction means for constructing, for each application and based on the registered dictionary, a dedicated dictionary for that application in which the words subject to the language processing used by the application are registered; processing means for adding, deleting, or changing words in the registered dictionary; and deletion means for deleting words in the dedicated dictionary. After all words registered in the dedicated dictionary have been deleted, the construction means rebuilds the dedicated dictionary based on the registered dictionary in which words have been added, deleted, or changed.
[0015]
The dedicated dictionary may include at least a fixed dictionary in which predetermined words are registered in advance and a variable dictionary whose registered words can vary, and the construction means may construct the variable dictionary of the dedicated dictionary.
[0016]
The dedicated dictionary may further include a category table in which word categories are registered, and the construction means may construct the variable dictionary by registering in it those words in the registered dictionary that belong to a category registered in the category table.
[0017]
The apparatus may further include language model storage means for storing a language model that describes chain information indicating how the words of a category are linked, and recognition processing means for performing speech recognition based on the dedicated dictionary and the language model.
[0018]
The language processing method of the present invention includes: a registered dictionary storage step of storing a registered dictionary in which words are registered; a construction step of constructing, for each application and based on the registered dictionary, a dedicated dictionary for that application in which the words subject to the language processing used by the application are registered; a processing step of adding, deleting, or changing words in the registered dictionary; a deletion step of deleting words in the dedicated dictionary; and a reconstruction step of rebuilding, after all words registered in the dedicated dictionary have been deleted, the dedicated dictionary based on the registered dictionary in which words have been added, deleted, or changed.
[0019]
The program recorded on the recording medium of the present invention causes a computer to execute: a construction step of constructing, for each application and based on a registered dictionary in which words are registered, a dedicated dictionary for that application in which the words subject to the language processing used by the application are registered; a processing step of adding, deleting, or changing words in the registered dictionary; a deletion step of deleting words in the dedicated dictionary; and a reconstruction step of rebuilding, after all words registered in the dedicated dictionary have been deleted, the dedicated dictionary based on the registered dictionary in which words have been added, deleted, or changed.
[0020]
The program of the present invention causes a computer to execute: a construction step of constructing, for each application and based on a registered dictionary in which words are registered, a dedicated dictionary for that application in which the words subject to the language processing used by the application are registered; a processing step of adding, deleting, or changing words in the registered dictionary; a deletion step of deleting words in the dedicated dictionary; and a reconstruction step of rebuilding, after all words registered in the dedicated dictionary have been deleted, the dedicated dictionary based on the registered dictionary in which words have been added, deleted, or changed.
[0021]
In the present invention, a dedicated dictionary for each application, in which the words subject to the language processing used by the application are registered, is constructed based on a registered dictionary in which words are registered; words in the registered dictionary are added, deleted, or changed; words in the dedicated dictionary are deleted; and, after all words registered in the dedicated dictionary have been deleted, the dedicated dictionary is rebuilt based on the registered dictionary in which words have been added, deleted, or changed.
[0022]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 shows a configuration example of a robot control system 1 to which the present invention is applied.
[0023]
In this robot control system 1, the speech recognition engine unit 11 recognizes input speech data and generates, as the recognition result, a word string corresponding to the speech data. The speech recognition engine unit 11 supplies the recognition result to the name registration application unit 21₁, the chat application unit 21₂, the voice commander application unit 21₃, ..., and the other application units 21_M, as well as to the application management unit 31.
[0024]
The name registration application unit 21₁, the chat application unit 21₂, the voice commander application unit 21₃, ..., and the other application units 21_M perform various kinds of processing based on the recognition result supplied from the speech recognition engine unit 11.
[0025]
The name registration application unit 21₁ registers, by voice, names such as the robot's name and the user's name based on the recognition result supplied from the speech recognition engine unit 11. The other application units use the names registered by the name registration application unit 21₁ to control the operation of the robot in response to utterances from the user.
[0026]
Therefore, the speech recognition performed for the chat application unit 21₂, the voice commander application unit 21₃, ..., and the other application units 21_M must handle the robot name, user name, and so on registered by the name registration application unit 21₁.
[0027]
The chat application unit 21₂ makes the robot chat with the user by voice, and the voice commander application unit 21₃ makes the robot perform actions corresponding to utterances from the user. For example, the voice commander application unit 21₃ moves the robot forward in response to a user utterance such as “SDR (robot name), move forward!”
[0028]
Any number of application units can be provided. Hereinafter, when the name registration application unit 21₁, the chat application unit 21₂, the voice commander application unit 21₃, ..., and the other application units 21_M need not be distinguished individually, they are collectively referred to as the application unit 21 where appropriate.
[0029]
The application management unit 31 instructs the application unit 21 to start and end based on the recognition result supplied from the speech recognition engine unit 11. For example, when the recognition result “start voice commander” is supplied from the speech recognition engine unit 11, the application management unit 31 starts the voice commander application unit 21₃. Multiple application units may be running at the same time.
[0030]
Further, the application unit 21 and the application management unit 31 issue task switching commands to the speech recognition engine unit 11 to enable (activate) or disable (deactivate) the corresponding tasks held in the speech recognition engine unit 11 (described later with reference to FIG. 2).
[0031]
FIG. 2 shows the configuration of the speech recognition engine unit 11. The user's utterance is input to the microphone 51, which converts it into an audio signal in the form of an electric signal and supplies this audio signal to an AD (Analog Digital) conversion unit 52. The AD conversion unit 52 samples and quantizes the analog audio signal from the microphone 51 and converts it into audio data, which is a digital signal. This audio data is supplied to the feature amount extraction unit 53.
[0032]
The feature amount extraction unit 53 extracts feature parameters, such as the spectrum, power, linear prediction coefficients, cepstrum coefficients, and line spectrum pairs, from the audio data supplied by the AD conversion unit 52 for each appropriate frame, and supplies them to the matching unit 54.
[0033]
Based on the feature parameters from the feature amount extraction unit 53, the matching unit 54 obtains, for each task that is enabled at that time among the phoneme typewriter task 71₁, the application switching task 71₂, the name registration task 71₃, the chat task 71₄, the voice commander task 71₅, ..., and the other tasks 71_N, the word string closest to the speech input to the microphone 51 (the input speech) as the recognition result, referring to the databases inside the task as necessary. The matching unit 54 supplies the recognition result to the application unit 21 or the application management unit 31 corresponding to each task.
[0034]
A task is a set of data necessary for speech recognition. That is, when the speech recognition engine unit 11 is divided into a program part that performs matching and the like and a data part, such as the acoustic model, language model, and recognition dictionary, that the program accesses, a task is the data part.
[0035]
Therefore, even when multiple applications perform speech recognition using different acoustic models, language models, and dictionaries, a single speech recognition engine unit suffices as long as multiple tasks are prepared. The internal details of a task will be described later with reference to FIG. 6.
[0036]
The phoneme typewriter task 71₁ is a task that works as a phoneme typewriter and is enabled by a command from the speech recognition engine unit 11. Using this phoneme typewriter, the matching unit 54 obtains a phoneme sequence for arbitrary input speech and also obtains its pronunciation in kana notation. For example, from the utterance “Kimi no namae wa SDR da yo” (“Your name is SDR”), it obtains the phoneme sequence “k/i/m/i/n/o/n/a/m/a/e/w/a/e/s/u/d/i:/a:/r/u/d/a/y/o” (where “i:” and “a:” are the long vowels of “i” and “a”, respectively) and the kana notation “キミノナマエワエスディーアールダヨ”. The phoneme sequence and kana notation are used by the unknown word acquisition unit 56.
[0037]
The application switching task 71₂ corresponds to the application management unit 31 and is enabled when a task switching command is supplied from the application management unit 31 after the application management unit 31 has started. With the application switching task 71₂, the matching unit 54 recognizes, for example, speech that starts an application unit, such as “launch the chat app,” “launch the voice commander,” or “launch name registration,” as well as speech that ends one.
[0038]
The name registration task 71₃ corresponds to the name registration application unit 21₁. It is enabled when, after the name registration application unit 21₁ has been started by a command from the application management unit 31, a task switching command is supplied from the name registration application unit 21₁. With the name registration task 71₃, the matching unit 54 recognizes utterances such as “Your name is <unknown word representing a robot name>” and “My name is <unknown word representing a person's name>”.
[0039]
The chat task 71₄, the voice commander task 71₅, ..., and the other tasks 71_N correspond to the chat application unit 21₂, the voice commander application unit 21₃, ..., and the other application units 21_M, respectively. Each is enabled when, after the corresponding application unit has been started by a command from the application management unit 31, a task switching command is supplied from that application unit.
[0040]
With the chat task 71₄, the matching unit 54 can recognize chat utterances from the user such as “SDR (robot name), what time did you wake up?” With the voice commander task 71₅, it can recognize command utterances from the user such as “SDR (robot name), move one step forward.”
[0041]
The matching unit 54 also reflects the words registered in the common dictionary unit 55, described later, in each task.
[0042]
Hereinafter, when the phoneme typewriter task 71₁, the application switching task 71₂, the name registration task 71₃, the chat task 71₄, the voice commander task 71₅, ..., and the other tasks 71_N need not be distinguished individually, they are collectively referred to as the task 71 where appropriate.
[0043]
The common dictionary unit 55 stores a common dictionary as a word dictionary commonly used in the task 71. In the common dictionary stored in the common dictionary section 55, pronunciation information and category information are described for all the words registered there. For example, when “SDR (robot name)” which is a proper noun is registered in the common dictionary, the pronunciation (phonological information) “SDR” and the category “_robot name_” are described in the common dictionary. Details will be described later with reference to FIG.
[0044]
For words (unknown words) such as names that are not registered in the recognition dictionary (the fixed word dictionary 131 described later with reference to FIG. 6), the unknown word acquisition unit 56 stores the phoneme sequence and kana notation that are recognized by the phoneme typewriter task 71₁ and supplied from the matching unit 54, so that the speech of such a word can thereafter be recognized (distinguished from other speech).
[0045]
That is, the unknown word acquisition unit 56 classifies the phoneme sequences and kana notations of the unknown words recognized by the phoneme typewriter task 71₁ into several clusters. Each cluster has an ID, a representative phoneme sequence, and a representative kana notation, and is managed by its ID.
[0046]
FIG. 3 shows the state of the cluster of the unknown word acquisition unit 56.
[0047]
When the three utterances “Aka”, “Ao”, and “Midori” are input, the unknown word acquisition unit 56 classifies them into the corresponding “Aka” cluster 91, “Ao” cluster 92, and “Midori” cluster 93. Each cluster is given a representative phoneme sequence (in the case of FIG. 3, “a / k / a”, “a / o”, and “m / i / d / o / r / i”), a representative kana notation (in the case of FIG. 3, “Aka”, “Ao”, and “Midori”), and an ID (in the case of FIG. 3, “1”, “2”, and “3”).
[0048]
Here, when the utterance “Aka” is input again, the corresponding cluster already exists, so the unknown word acquisition unit 56 classifies the input into the “Aka” cluster 91 and does not generate a new cluster. On the other hand, when the utterance “Kuro” is input, no corresponding cluster exists, so the unknown word acquisition unit 56 newly generates the “Kuro” cluster 94 and gives it a representative phoneme sequence (“k / u / r / o” in the example of FIG. 3), a representative kana notation (“Kuro” in the example of FIG. 3), and an ID (“4” in the example of FIG. 3).
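The cluster bookkeeping described above can be illustrated with a minimal Python sketch. The class names `Cluster` and `UnknownWordStore` are hypothetical, and membership is decided here by exact equality of phoneme sequences, which is an assumption made only to keep the example short; the actual classification method is the one disclosed in the applications cited in this description.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    cluster_id: int
    phonemes: str      # representative phoneme sequence, e.g. "a/k/a"
    kana: str          # representative kana notation, e.g. "Aka"
    members: list = field(default_factory=list)


class UnknownWordStore:
    """Toy stand-in for the unknown word acquisition unit 56."""

    def __init__(self):
        self.clusters = {}      # cluster ID -> Cluster
        self._next_id = 1

    def add(self, phonemes, kana):
        # Assumption: an input joins a cluster only when its phoneme
        # sequence equals that cluster's representative sequence.
        for cluster in self.clusters.values():
            if cluster.phonemes == phonemes:
                cluster.members.append((phonemes, kana))
                return cluster.cluster_id
        # No matching cluster: create one with a fresh ID.
        cluster = Cluster(self._next_id, phonemes, kana, [(phonemes, kana)])
        self.clusters[cluster.cluster_id] = cluster
        self._next_id += 1
        return cluster.cluster_id


store = UnknownWordStore()
ids = [store.add(p, k) for p, k in [
    ("a/k/a", "Aka"),
    ("a/o", "Ao"),
    ("m/i/d/o/r/i", "Midori"),
    ("a/k/a", "Aka"),       # existing cluster, no new ID
    ("k/u/r/o", "Kuro"),    # new cluster
]]
print(ids)  # [1, 2, 3, 1, 4]
```

Managing clusters by ID rather than by pronunciation is what later allows a dictionary entry to point at a cluster whose representative values may still change.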
[0049]
With this method, the accuracy of the representative phoneme sequence and the representative kana notation of each cluster can be improved as the user inputs the same utterance repeatedly. For example, suppose that when “Midori” is input for the first time, the phoneme typewriter misrecognizes it and outputs the phoneme sequence “m / e / r / a / a” and the kana notation “Mera”. Even then, as the utterance “Midori” is repeated many times, the phoneme sequence and kana notation may converge to the correct values (“m / i / d / o / r / i” and “Midori”). Details of such word acquisition processing are disclosed in Japanese Patent Application No. 2001-097843 and Japanese Patent Application No. 2001-382579 previously proposed by the present applicant.
[0050]
Next, robot control processing in the robot control system 1 of FIG. 1 will be described with reference to FIGS. This process is started when the robot control system 1 is activated by the user.
[0051]
In step S1, the speech recognition engine unit 11 is activated, and the process proceeds to step S2. In step S2, the speech recognition engine unit 11 loads the contents of the common dictionary unit 55 (the common dictionary) and the cluster state of the unknown word acquisition unit 56, which were stored in a storage unit (not shown) when the robot control system 1 was last terminated (the processing of step S17 described later). When neither the common dictionary nor the cluster state is stored in the storage unit, the common dictionary unit 55 is initialized with no entries and the unknown word acquisition unit 56 with no clusters. If the cluster state is stored but the common dictionary is not, only the common dictionary is initialized (to a state with no entries). Conversely, if the common dictionary is stored but the cluster state is not, the cluster-derived entries (the entries for which a cluster ID is described in FIG. 24) are deleted from the common dictionary, and the kana-derived entries (the entries for which a kana pronunciation is described in FIG. 24) are kept.
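The four cases handled in step S2 can be summarized in a short Python sketch. The function name `restore_state` and the entry layout (a `cluster_id` key for cluster-derived entries, a `kana` key for kana-derived entries) are assumptions made for illustration and are not taken from FIG. 24.

```python
def restore_state(saved_dictionary, saved_clusters):
    """Return (common_dictionary, clusters) for the four cases of step S2.

    saved_dictionary: list of entry dicts, or None if nothing was stored.
    saved_clusters:   dict of cluster ID -> cluster, or None if nothing was stored.
    """
    if saved_dictionary is None:
        # No stored dictionary: start with no entries and keep any
        # stored clusters (covers "neither stored" and "clusters only").
        return [], dict(saved_clusters or {})
    if saved_clusters is None:
        # Dictionary without clusters: drop the entries that refer to a
        # cluster ID and keep the kana-derived ones.
        kept = [e for e in saved_dictionary if "cluster_id" not in e]
        return kept, {}
    # Both stored: load them as they are.
    return list(saved_dictionary), dict(saved_clusters)


saved = [
    {"cluster_id": 3, "category": "_user name_"},
    {"kana": "SDR", "category": "_robot name_"},
]
dictionary, clusters = restore_state(saved, None)
print(dictionary)  # only the kana-derived entry remains
```

Dropping cluster-derived entries when the clusters are missing avoids dictionary entries that point at cluster IDs that no longer exist.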
[0052]
After step S2, the process proceeds to step S3, where the speech recognition engine unit 11 validates the phoneme typewriter task 71₁ so that it can be used for speech recognition, and the process proceeds to step S4. In step S4, the application management unit 31 is activated, and the process proceeds to step S5.
[0053]
In step S5, the application management unit 31 validates its corresponding task for application switching, the application switching task 71₂, and the process proceeds to step S6. In step S6, the speech recognition engine unit 11 recognizes an activation command for an application unit 21 input by voice to the microphone 51 and supplies the recognition result to the application management unit 31. Details of the voice recognition processing will be described later with reference to the flowchart of FIG.
[0054]
After the processing in step S6, the process proceeds to step S7 in FIG. There, the application management unit 31 determines whether to start the name registration application unit 21₁. If it determines to start the name registration application unit 21₁ (for example, when the recognition result is “activate name registration”), the process proceeds to step S8.
[0055]
In step S8, the application management unit 31 starts the name registration application unit 21₁. After step S8, the process proceeds to step S9, where the name registration application unit 21₁ performs name registration processing. Details of the name registration processing will be described later with reference to the flowchart of FIG.
[0056]
If the application management unit 31 determines in step S7 not to start the name registration application unit 21₁, the process proceeds to step S10, where it determines from the recognition result of the speech recognition engine unit 11 whether to start the chat application unit 21₂. If the application management unit 31 determines in step S10 to start the chat application unit 21₂ (for example, when the recognition result is “activate chat”), the process proceeds to step S11, where it starts the chat application unit 21₂.
[0057]
After step S11, the process proceeds to step S12, where the chat application unit 21₂ performs chat processing. Details of the chat processing will be described later with reference to the flowchart of FIG.
[0058]
If the application management unit 31 determines in step S10 not to start the chat application unit 21₂, the process proceeds to step S13, where it determines from the recognition result of the speech recognition engine unit 11 whether to start the voice commander application unit 21₃. If the application management unit 31 determines in step S13 to start the voice commander application unit 21₃ (for example, when the recognition result is “voice commander activation”), the process proceeds to step S14, where it starts the voice commander application unit 21₃.
[0059]
After step S14, the process proceeds to step S15, where the voice commander application unit 21₃ performs voice commander processing. Details of the voice commander processing will be described later with reference to the flowchart of FIG.
[0060]
If the application management unit 31 determines in step S13 not to start the voice commander application unit 21₃, the recognition result of the speech recognition engine unit 11 is regarded as erroneous (an utterance other than application switching may have been made), so the process returns to step S6 in FIG. 4 and the input voice is recognized again.
[0061]
Thus, the application management unit 31 activates the application unit 21 according to the recognition result by the speech recognition engine unit 11.
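The chain of determinations in steps S7, S10, and S13 amounts to a lookup from the recognition result to an application unit, with a return to step S6 when nothing matches. The following Python sketch shows that control flow only; the table `ACTIVATION_PHRASES`, the application identifiers, and the function names are hypothetical, and the phrases are the example recognition results quoted above.

```python
# Hypothetical mapping from recognition results to application units.
ACTIVATION_PHRASES = {
    "activate name registration": "name_registration",   # steps S7 to S9
    "activate chat": "chat",                              # steps S10 to S12
    "voice commander activation": "voice_commander",     # steps S13 to S15
}


def select_application(recognition_result):
    """Return the application to start, or None to return to step S6."""
    return ACTIVATION_PHRASES.get(recognition_result)


def run(recognition_results):
    """Process a sequence of recognition results, one per pass of step S6."""
    started = []
    for result in recognition_results:
        app = select_application(result)
        if app is None:
            # Treated as a misrecognition: listen again (back to step S6).
            continue
        started.append(app)
    return started


print(run(["activate chat", "hello", "voice commander activation"]))
```

Adding another application unit under this sketch means adding one more table entry, which corresponds to adding one more determination after step S13.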
[0062]
After the processes of steps S9, S12, and S15, the process proceeds to step S16, and the application management unit 31 determines whether or not to end the robot control process. For example, the application management unit 31 determines whether or not an end button (not illustrated) is pressed by the user, and determines that the robot control process is ended when the end button is pressed.
[0063]
If it is determined in step S16 that the robot control process is not to be ended, the process returns to step S6 in FIG. 4 and the processing of recognizing the input voice is repeated. If the application management unit 31 determines in step S16 that the robot control process is to be ended (the end button has been pressed), the process proceeds to step S17, where the common dictionary of the common dictionary unit 55 and the cluster state of the unknown word acquisition unit 56 are stored in a storage unit (not shown).
[0064]
Then, if an application unit 21 is running, the application management unit 31 terminates it. At this time, the application unit 21 invalidates its corresponding task 71. In addition, the application management unit 31 invalidates the application switching task 71₂, the speech recognition engine unit 11 invalidates the phoneme typewriter task 71₁, and the application management unit 31 and the speech recognition engine unit 11 then end their processing.
[0065]
In the processing described above, the application units are the name registration application unit 21₁, the chat application unit 21₂, and the voice commander application unit 21₃. If there are other application units, then when it is determined in step S13 that the voice commander application unit 21₃ is not to be started, the process does not return to step S6. Instead, whether to start each of the other application units is determined in the same manner as in steps S7, S10, and S13, and the other application unit is started according to the determination result.
[0066]
In the above-described processing, the end of the voice recognition is instructed by the user, but the robot control system 1 may automatically determine, for example, that the voice recognition ends when no voice is input for a predetermined time.
[0067]
According to the processing described above, the application switching task 71₂ remains valid even while an application unit 21 is running. Therefore, even if the utterance “Launch XX” is made while another application unit is running, the utterance can be recognized and the corresponding application unit can be started. For example, if the user utters “activate chat” while the voice commander application unit 21₃ is running, the chat application unit 21₂ can be started.
[0068]
In this case, one of the following is set in advance for each combination of application units (it may also be determined dynamically from resource constraints such as memory): starting the new application unit after terminating the running one; pausing the running application unit, starting the new one, and later resuming the paused one; or running both in parallel.
[0069]
FIG. 6 shows the configuration of the task 71. The task 71 includes an acoustic model 111, a language model 112, a dictionary 113, a phoneme list 114, a kana phoneme conversion rule 115, and a search parameter 116.
[0070]
The acoustic model 111 stores a model representing acoustic features such as individual phonemes and syllables of speech to be recognized. As the acoustic model, for example, an HMM (Hidden Markov Model) can be used.
[0071]
The language model 112 describes information (hereinafter referred to as chain information as appropriate) indicating how the words registered in the word dictionary of the dictionary 113 are chained (connected). Description methods include statistical word chain probabilities (n-gram), generative grammars, finite-state automata, and the like.
[0072]
In addition to the chain information about words, the language model 112 includes chain information about categories into which words are classified from a specific viewpoint. For example, when the “category consisting of words representing a user name” is represented by the symbol “_user name_” and the “category consisting of words representing a robot name” is represented by the symbol “_robot name_”, the language model 112 also describes chain information for “_user name_” and “_robot name_” (chains between categories, and chains between a category and a word stored in the dictionary in advance).
[0073]
Accordingly, chain information can be obtained even for words that are not described in the language model 112. For example, suppose the chain information of “SDR” and “ha” (a particle) is needed. Even if no chain information about “SDR” is described in the language model 112, as long as “SDR” is known to belong to the category represented by the symbol “_robot name_”, the chain information of “SDR” and “ha” can be obtained by acquiring the chain information of “_robot name_” and “ha” instead.
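The substitution of a category symbol for an undescribed word can be sketched as a two-stage lookup. In the Python example below, chain information is simplified to a table of word pairs with a numeric value; the function name `chain_information`, the table layout, and the value `0.3` are all made up for illustration.

```python
def chain_information(prev, word, chains, word_to_category):
    """Look up the chain information for (prev, word).

    The word pair itself is tried first; if it is not described, each
    word is replaced by its category symbol (when it has one) and the
    lookup is repeated.
    """
    candidates = [
        (prev, word),
        (word_to_category.get(prev, prev), word_to_category.get(word, word)),
    ]
    for key in candidates:
        if key in chains:
            return chains[key]
    return None


chains = {("_robot name_", "ha"): 0.3}          # made-up value
word_to_category = {"SDR": "_robot name_"}      # from the category table

print(chain_information("SDR", "ha", chains, word_to_category))  # 0.3
```

Because the language model only ever mentions the category symbol, registering a new robot name requires a change to the word-to-category mapping and not to the language model itself.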
[0074]
The categories need not be a classification based on semantic attributes (“_robot name_”, “_user name_”, “_place name_”, “_store name_”, etc.); they may instead be a classification based on part of speech (“_noun_”, “_verb_”, “_particle_”, etc.). Hereinafter, the notation “_..._” represents a category name.
[0075]
The dictionary 113 includes a fixed word dictionary 131, a variable word dictionary 132, and a category table 133.
[0076]
The fixed word dictionary 131 describes various types of information, such as pronunciations (phoneme sequences) and models describing phoneme and syllable chain relationships, for words that are not subject to word registration or deletion, that is, words set in advance in the robot control system 1 (hereinafter referred to as “fixed words” as appropriate).
[0077]
In the fixed word dictionary 131, for each task 71, information about a dedicated word used in the application unit 21 corresponding to the task 71 is described. The same applies to the acoustic model 111 and the language model 112 described above, and to the later-described category table 133, phoneme list 114, kana phoneme conversion rule 115, and search parameter 116.
[0078]
The variable word dictionary 132 describes various types of information, such as pronunciations and models describing phoneme and syllable chain relationships, for words that are subject to registration and deletion, that is, registered words. When a new word is registered, it is reflected in the variable word dictionary 132. This reflection process will be described later with reference to FIG. Words can be deleted and pronunciations changed only for entries of the variable word dictionary 132. The variable word dictionary 132 may also contain nothing.
[0079]
The category table 133 stores a table indicating the correspondence between categories included in the language model 112 and word information included in the categories. Further, when the task 71 assigns a category-specific ID (category ID), the category table 133 also stores the correspondence between the category symbol and the ID. For example, when category ID “4” is assigned to the category “_robot name_”, category ID = 4 is also stored corresponding to “_robot name_”. The category table 133 stores nothing when the language model 112 does not include a category.
[0080]
The phoneme list 114 is a list of phoneme symbols used in the task 71. The kana phoneme conversion rule 115 is a rule for converting a kana character string into a phoneme sequence. Thus, by storing the kana phoneme conversion rule 115 for each task, the common dictionary unit 55 can hold a kana character string that is independent of the phoneme sequence as pronunciation information.
[0081]
The search parameter 116 holds the parameters used when the matching unit 54 performs matching (search). The parameters include values that depend on the acoustic model 111, values that depend on the vocabulary size, values that depend on the type of the language model 112, and the like, and therefore must be held for each task. However, task-independent parameters may be held in common in the speech recognition engine unit 11.
[0082]
In the above description, all data is stored for each task. However, the data used in common for a plurality of tasks can be shared between the tasks to reduce the memory usage. For example, if the phoneme list 114 is common to all tasks, only one phoneme list 114 may be prepared in the speech recognition engine unit 11, and each task may refer to it. In this case, it is sufficient to prepare only one kana phoneme conversion rule 115.
[0083]
Two types of acoustic model 111 may be prepared, one for quiet environments (an acoustic model with a high recognition rate in quiet environments) and one for noisy environments (an acoustic model with a reasonable recognition rate even in noisy environments), and each task may refer to one of them.
[0084]
For example, the name registration task 71₃ and the chat task 71₄ are assumed to be used in a quiet environment and can therefore refer to the acoustic model 111 for quiet environments, whereas the voice commander task 71₅ is assumed to be used in a noisy environment (one in which the robot's operating sound is loud) and can therefore refer to the acoustic model for noisy environments.
[0085]
FIG. 7 shows an example of the phoneme list 114 of FIG. In FIG. 7, one symbol corresponds to one phoneme. In the phoneme list 114 of FIG. 7, a vowel followed by a colon (for example, “a:”) represents a long vowel, and “N” represents the moraic nasal (“n”). “sp”, “silB”, “silE”, and “q” all represent silence; they stand for “silence within an utterance”, “silence before an utterance”, “silence after an utterance”, and the geminate consonant (sokuon), respectively.
[0086]
FIG. 8 shows an example of the kana phoneme conversion rule 115 of FIG. According to the kana phoneme conversion rule 115 in FIG. 8, for example, the kana character string for “SDR” is converted into the phoneme sequence “e / s / u / d / i: / a: / r / u”.
[0087]
Next, examples of the language model 112 and the dictionary 113 (FIG. 6) for each task are shown.
[0088]
FIG. 9 shows an example of the language model 112 (FIG. 6) of the phoneme typewriter task 71₁. In FIG. 9, the variable “$ SYLLABLE” on the first line joins all kana notations with “|”, meaning “or”, and therefore means any one of the kana notations.
[0089]
That is, since the phoneme typewriter task 71₁ is a task for speech recognition in units of syllables, the language model 112 in FIG. 9 expresses, as a grammar in BNF (Backus-Naur Form) format, the chain rule that any syllable can follow any other. The language model 112 may instead be a statistical language model, described later.
[0090]
FIG. 10 shows an example of the fixed word dictionary 131 (FIG. 6) of the phoneme typewriter task 71₁. The “symbol” is a character string that identifies a word; for example, kana notation can be used. Entries with the same symbol are regarded as entries for the same word. The language model 112 is expressed using these symbols. “<start>” and “<end>” are special symbols that represent “silence before an utterance” and “silence after an utterance”, respectively (the same applies to FIG. 11 described later).
[0091]
The “transcription” is the written form of a word, and the character string output as a recognition result is this transcription. The “phoneme sequence” represents the pronunciation of a word as a sequence of phonemes.
[0092]
Nothing is stored in the variable word dictionary 132 of the phoneme typewriter task 71₁, because adding words to the phoneme typewriter task 71₁ is not assumed. In addition, since the language model 112 of the phoneme typewriter task 71₁ does not include a category, as shown in FIG. 9, nothing is stored in the category table 133 either.
[0093]
FIG. 11 shows an example of the language model 112 (FIG. 6) of the application switching task 71₂. The language model 112 in FIG. 11 is described in a BNF format grammar. The variable “$ APPLICATIONS” on the first line joins all application names (“chat”, “voice commander”, “name registration”, etc.) with “|”, meaning “or”, and therefore means any one of the application names.
[0094]
The variable “$ UTTERANCE” on the second line represents an utterance of the form “(robot name) application name (wo) start”, in which “[]”, meaning “can be omitted”, is attached to “_robot name_” and to “wo”. Here, “robot name” denotes a word registered in the category “_robot name_”.
[0095]
For example, if “SDR” is registered in “_robot name_”, utterances such as “SDR, voice commander (wo) start” and “voice commander (wo) start” can be recognized with the language model 112 of FIG. 11.
[0096]
Thus, by describing the language model 112 using category names, an utterance containing a newly registered word can be recognized with the language model 112 as long as that word belongs to a category described in the language model 112.
[0097]
FIG. 12 shows an example of the fixed word dictionary 131 (FIG. 6) of the application switching task 71₂. In the fixed word dictionary 131 in FIG. 12, a transcription and a phoneme sequence are described for each symbol (such as “chat” and “voice commander”) described in the grammar of the language model 112 in FIG. 11.
[0098]
FIG. 13 shows an example of the category table 133 (FIG. 6) of the application switching task 71₂. The category table 133 stores the types of categories used in the language model 112 and the words belonging to each category. When the language model 112 is as shown in FIG., the language model 112 of the application switching task 71₂ uses the category “_robot name_”, so “_robot name_” is entered in the category table 133 as shown in FIG. In FIG. 13, the set of words belonging to the category “_robot name_” is an empty set, indicating that no word belongs to “_robot name_” yet.
[0099]
As shown in FIG. 13, even when a category is entered in the category table 133, if there is no word belonging to the entry (in the case of an empty set), the variable word dictionary 132 contains the category. Information on the belonging word is not stored.
[0100]
FIG. 14 shows an example of the language model 112 (FIG. 6) of the name registration task 71₃. The language model 112 in FIG. 14 is described in a BNF format grammar. The variable “$ UTTERANCE” joins “I [name] is <OOV> [is] [named]” and “You [name] is <OOV> [name]]” with “|”, meaning “or”, and “[]”, meaning “can be omitted”, is attached to “name”, “is”, “say”, and “to say”.
[0101]
Therefore, with the language model 112 in FIG. 14, utterances such as “I (name) is <OOV> (is)” and “You (name) is <OOV>” are recognized. Note that <OOV> is a symbol meaning “Out Of Vocabulary” and stands for a phrase of arbitrary pronunciation (a word not described in the fixed word dictionary 131).
[0102]
By using the symbol <OOV>, utterances such as “My name is Taro” and “Your name is SDR” (where “Taro” and “SDR” are not described in the fixed word dictionary 131) are matched against “<start> My name is <OOV> <end>” and “<start> Your name is <OOV>” in the language model 112 of FIG. 14, respectively, so that speech recognition results such as “My name is Taro” and “Your name is SDR” can be obtained.
[0103]
FIG. 15 shows an example of the fixed word dictionary 131 (FIG. 6) of the name registration task 71₃. In the fixed word dictionary 131, a transcription and a phoneme sequence are described for each symbol described in the grammar of the language model 112 as shown in FIG.
[0104]
Nothing is stored in the variable word dictionary 132 of the name registration task 71₃, because adding words to the name registration task 71₃ is not assumed. In addition, since the language model 112 of the name registration task 71₃ does not include a category, as shown in FIG. 14, nothing is stored in the category table 133 either.
[0105]
FIG. 16 shows an example of the language model 112 (FIG. 6) of the chat task 71₄. Because chat involves a large vocabulary and many utterance variations, a statistical language model is used as the language model 112. A statistical language model describes word chain information with conditional probabilities. The language model 112 in FIG. 16 uses a tri-gram, which represents the chain probability of a sequence of three words 1, 2, and 3.
[0106]
In FIG. 16, “P (word 3 | word 1 word 2)” represents the probability that “word 3” appears next when “word 1” and “word 2” appear consecutively in a word string. For example, when the sequence “<start> _robot name_” occurs, the probability that “ha” appears next is “0.012”. This probability is obtained in advance by analyzing a large amount of text describing chat. In addition to the tri-gram, a bi-gram (the chain probability of two words), a uni-gram (the appearance probability of a word), and the like can be used as the language model 112 as necessary.
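As a rough illustration of how such conditional probabilities can be derived from text, the Python sketch below uses plain relative frequencies, that is, the count of a three-word sequence divided by the count of its two-word history. This estimator is an assumption; the description does not state how the values in FIG. 16 were computed, and practical systems usually add smoothing. The tiny corpus and its tokens are invented for the example.

```python
from collections import Counter


def trigram_probabilities(sentences):
    """Estimate P(w3 | w1 w2) by relative frequency over tokenized sentences."""
    trigram_counts = Counter()
    history_counts = Counter()
    for words in sentences:
        for i in range(len(words) - 2):
            trigram_counts[tuple(words[i:i + 3])] += 1
            # Count the two-word history once per word that follows it.
            history_counts[tuple(words[i:i + 2])] += 1
    return {
        trigram: count / history_counts[trigram[:2]]
        for trigram, count in trigram_counts.items()
    }


# Invented corpus in which robot names are already replaced by the
# category symbol "_robot name_".
corpus = [
    ["<start>", "_robot name_", "ha", "genki", "<end>"],
    ["<start>", "_robot name_", "mae", "<end>"],
]
probabilities = trigram_probabilities(corpus)
print(probabilities[("<start>", "_robot name_", "ha")])  # 0.5
```

Replacing names by their category symbol before counting is what lets one estimated value serve every word later registered in that category.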
[0107]
In the language model 112 of FIG. 16 as well, categories are used in addition to words, as in the case of FIG. That is, in FIG. 16, “_robot name_” and “_place name_” denote the categories “_robot name_” and “_place name_”, and the tri-gram is described using these categories. Thus, when a word representing a robot name or a place name is registered in the variable word dictionary 132, the chat task 71₄ can recognize that word.
[0108]
FIG. 17 shows an example of the fixed word dictionary 131 of the chat task 71₄. In the fixed word dictionary 131, a transcription and a phoneme sequence are described for each symbol described in the grammar of the language model 112 as shown in FIG.
[0109]
FIG. 18 shows an example of the category table 133 of the chat task 71₄. The category table 133 stores the types of categories used in the language model 112 and the words belonging to each category. When the language model 112 is as shown in FIG., the language model 112 of the chat task 71₄ uses the two categories “_robot name_” and “_place name_”, so these two categories are entered in the category table 133 as shown in FIG. FIG. 18 indicates that no word belongs to the categories “_robot name_” and “_place name_” yet.
[0110]
FIG. 19 shows an example of the language model 112 (FIG. 6) of the voice commander task 71₅. The language model 112 in FIG. 19 is described in a BNF format grammar. The variable “$ NUMBER” on the first line joins the numbers (“1”, “2”, “3”, etc.) with “|”, meaning “or”, and therefore means any one of the numbers.
[0111]
The variable “$ DIRECTION” on the second line joins the directions (“front”, “back”, “right”, “left”, etc.) with “|”, meaning “or”, and therefore means any one of the directions. The variable “$ UTTERANCE” on the third line consists of “_robot name_”, “to $ DIRECTION”, and “$ NUMBER steps” followed by “advance”, where “[]”, meaning “can be omitted”, is attached to “_robot name_”, “to $ DIRECTION”, and “$ NUMBER steps”.
[0112]
Accordingly, with the language model 112 of FIG. 19, an utterance such as “(robot name), advance 3 steps to the front” is recognized.
[0113]
FIG. 20 shows an example of the fixed word dictionary 131 of the voice commander task 71₅. In the fixed word dictionary 131, a transcription and a phoneme sequence are described for each symbol described in the grammar of the language model 112 as shown in FIG.
[0114]
Note that the entries for the symbols “1” and “step” are duplicated. This means that “1” and “step” each have two pronunciations (“Ichi” and “It”, and “Ho” and “Po”, respectively). As a result, utterances with different pronunciations, such as “Ichiho” and “Ippo”, can be recognized as the same “1 step”.
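The effect of duplicated entries can be sketched as a dictionary in which several phoneme sequences share one symbol. In the Python example below, the entry layout, the symbol spelling “step” for the counter word, and the phoneme strings are illustrative guesses and not the contents of FIG. 20.

```python
# (symbol, transcription, phoneme sequence). The phoneme strings are
# illustrative guesses, not the contents of FIG. 20.
FIXED_WORDS = [
    ("1", "1", "i/ch/i"),
    ("1", "1", "i/q"),
    ("step", "step", "h/o"),
    ("step", "step", "p/o"),
]

# Index from pronunciation to symbol, so that different pronunciations
# of the same word resolve to one symbol.
PRONUNCIATION_TO_SYMBOL = {
    phonemes: symbol for symbol, _transcription, phonemes in FIXED_WORDS
}


def to_symbols(pronunciations):
    """Map a list of recognized pronunciations to their word symbols."""
    return [PRONUNCIATION_TO_SYMBOL[p] for p in pronunciations]


print(to_symbols(["i/ch/i", "h/o"]))  # ['1', 'step']
print(to_symbols(["i/q", "p/o"]))     # ['1', 'step']
```

Since the language model refers only to symbols, adding a pronunciation variant never requires a change to the grammar.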
[0115]
When the language model 112 is as shown in FIG., only the category “_robot name_” is used in the language model 112 of the voice commander task 71₅, so the category table 133 of the voice commander task 71₅ is the same as the category table 133 of the application switching task 71₂ shown in FIG. When no word belonging to “_robot name_” exists yet, nothing is stored in the variable word dictionary 132 of the voice commander task 71₅.
[0116]
Next, the name registration process performed by the name registration application unit 21₁ in step S9 of FIG. 5 will be described in detail with reference to the flowchart of FIG. This process is started when the name registration application unit 21₁ is started in response to the user's utterance. Before this process is started, the user selects, with a mode switching button (not shown), the name registration mode for registering a name: either a voice input mode, in which the name is input by voice, or a kana input mode, in which the name is input in kana with a keyboard or the like.
[0117]
In step S41, the name registration application unit 21₁ validates the name registration task 71₃ of the speech recognition engine unit 11 so that speech can be recognized with the name registration task 71₃.
[0118]
After step S41, the process proceeds to step S42, where the name registration application unit 21₁ determines whether the name registration mode is the voice input mode. If it is determined that the name registration mode is the voice input mode, the process proceeds to step S43, the matching unit 54 performs name recognition processing, and the process proceeds to step S44. (Alternatively, in step S42, if the user speaks, it may be determined that the name is being input by voice and the process may proceed to step S43, and if a kana input button (not shown) is pressed, it may be determined that the name is being input in kana characters and the process may proceed to step S46.) Details of this name recognition processing will be described later with reference to FIG.
[0119]
In step S44, the name registration application unit 21₁ determines whether the speech recognition result of the name (the recognized name), obtained by the name recognition processing that the matching unit 54 performed in step S43, is correct. This determination is made, for example, by speaking the recognition result to the user and checking whether the user has operated an OK button (not shown).
[0120]
If it is determined in step S44 that the name speech recognition result is not correct, the user is prompted to speak again, and the process returns to step S43 to perform the name recognition process again. If it is determined in step S44 that the recognition result is correct, the process proceeds to step S47.
[0121]
On the other hand, if the name registration application unit 21₁ determines in step S42 that the name registration mode is not the voice input mode, the process proceeds to step S45, where it determines whether the name registration mode is the kana input mode.
[0122]
If it is determined in step S45 that the name registration mode is not the kana input mode, the user has not yet selected a name registration mode. The process therefore waits until the user selects a name registration mode and then returns to step S42.
[0123]
If it is determined in step S45 that the name registration mode is the kana input mode, the process proceeds to step S46, where the name registration application unit 21₁ obtains the kana string of the name input by the user and the category of the name.
[0124]
Methods of inputting a kana string include, for example: temporarily connecting a keyboard and inputting kana characters; inputting with the various switches of the robot; showing the robot a sheet of paper or the like on which characters are written and having it recognize the characters (see, for example, Japanese Patent Application No. 2001-135423); connecting the robot and a personal computer via a wireless LAN (Local Area Network) or the like and transferring the string from the personal computer to the robot; and downloading the string to the robot via the Internet or the like. In the character recognition method, a character string of mixed kana and kanji may be input instead of kana characters, and the name registration application unit 21₁ may convert it into a kana string (see Japanese Patent Application No. 2001-135423).
[0125]
Furthermore, instead of the user inputting the kana string of the name, an entry containing the kana characters of the name may be added to the common dictionary of the common dictionary unit 55 in advance, and the name registration application unit 21₁ may acquire the kana string of the name by referring to the common dictionary unit 55.
[0126]
After the process of step S44 or S46, the process proceeds to step S47, and the name registration application unit 21₁ determines the category of the name to be registered. When the name registration mode is the kana input mode, the name registration application unit 21₁ adopts the category acquired in step S46 (input by the user) as the category of the name to be registered.
[0127]
In other words, in the kana input mode, the user inputs not only the name but also its category in step S46, and the category input by the user is adopted as the category of the name to be registered. On the other hand, when the name registration mode is the voice input mode, the name registration application unit 21₁ determines the category of the name from the result of the name recognition process in step S43.
[0128]
For example, when the recognition result supplied from the speech recognition engine unit 11 starts with “Kimi”, the category to which the registered name belongs is estimated to be “_robot name_”, and when it starts with “I”, the category is estimated to be “_user name_”. The various category estimation methods disclosed in Japanese Patent Application No. 2001-382579, previously proposed by the present applicant, can also be used.
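The prefix-based estimation described above can be sketched as follows. This is only an illustration, not the method of the cited application: the English prefixes “Your name is” and “My name is” are assumptions standing in for the utterance openings, and `PREFIX_TO_CATEGORY` and `estimate_category` are hypothetical names.

```python
from typing import Optional

# Hypothetical prefix rules (assumption): the English prefixes stand in
# for the utterance openings mentioned in the text.
PREFIX_TO_CATEGORY = {
    "Your name is": "_robot name_",
    "My name is": "_user name_",
}

def estimate_category(recognition_result: str) -> Optional[str]:
    """Return the category implied by how the recognition result starts."""
    for prefix, category in PREFIX_TO_CATEGORY.items():
        if recognition_result.startswith(prefix):
            return category
    return None  # no rule matched; the caller has to decide what to do
```

A result that matches no prefix returns `None`, so the caller could fall back to another estimation method.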
[0129]
After the process of step S47, the process proceeds to step S48, and the name registration application unit 21₁ controls the matching unit 54 to enter the pronunciation information and category of the name to be registered in the common dictionary of the common dictionary unit 55. The process then proceeds to step S49. In step S49, the name registration application unit 21₁ controls the matching unit 54 to reflect the contents of the common dictionary in the chat task 71₄, the voice commander task 71₅, ..., and the other tasks 71_N. Details of this reflection will be described later with reference to the flowchart of FIG. 25.
[0130]
In this way, by reflecting the name registered in the common dictionary on other tasks, the registered name can be recognized also by other tasks.
[0131]
After the processing of step S49, the process proceeds to step S50, and the name registration application unit 21₁ determines whether to end the name registration process. This determination is made, for example, by asking the user by voice whether to end and checking whether an OK button (not shown) has been operated (pressed). If it is determined in step S50 that the name registration process is not to be ended (for example, the OK button has not been pressed), the process returns to step S42 to register another name.
[0132]
If it is determined in step S50 that the name registration process is to be ended (for example, the OK button has been pressed), the process proceeds to step S51, and the name registration application unit 21₁ disables the name registration task 71₃. The process then proceeds to step S52, in which the name registration application unit 21₁ ends the process.
[0133]
FIG. 22 is a flowchart illustrating the name recognition process performed by the matching unit 54 of FIG. 2 in step S43 of FIG. 21.
[0134]
In step S61, the matching unit 54 determines whether sound has been input to the microphone 51. If it determines that no sound has been input, it waits until sound is input. If it is determined in step S61 that sound has been input, the process proceeds to step S62. The voice input here may be ordinary conversation such as “My name is Taro” or “Your name is SDR”; the user does not need to be conscious of name registration and utter only the name, such as “Taro” or “SDR”, by itself.
[0135]
In step S62, the matching unit 54 recognizes the voice and extracts the name. For example, when the utterance “Your name is SDR” is made, the matching unit 54 refers to the name registration task 71₃, which has a language model 112 as shown in FIG. 14 and a fixed word dictionary 131 as shown in FIG. 15, and generates, for example, the recognition result “<start> Your name is <OOV> <end>”. In addition, the matching unit 54 obtains information on which section of the utterance <OOV> corresponds to (from what second to what second of the utterance).
[0136]
Further, for the same utterance, the matching unit 54 refers to the phoneme typewriter task 71₁, which has a language model 112 as shown in FIG. 9 and a fixed word dictionary 131 as shown in FIG. 10, and obtains, for example, the phoneme sequence “k / i / m / i / n / o / n / a / m / a / e / w / a / e / s / u / d / i: / a: / r / u / d / a / y / o” and the kana string “Kiminonamaewa SDR dayo”.
[0137]
Then, based on the information on which section of the utterance <OOV> corresponds to, the matching unit 54 cuts out, from the obtained phoneme sequence and kana string, the section corresponding to <OOV>, that is, the phoneme sequence and kana string of the name section, and obtains the phoneme sequence “e / s / u / d / i: / a: / r / u” and the kana string “SDR”. The matching unit 54 also obtains the audio data for the same section. Details of the process for extracting the name are disclosed in Japanese Patent Application No. 2001-382579, previously proposed by the present applicant.
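The cutting-out step can be illustrated with a short sketch. It assumes that the phoneme typewriter output carries a start and end time for each phoneme; the text does not specify this representation, and the timing values and the name `cut_out_section` are made up for illustration.

```python
from typing import List, Tuple

# (phoneme, start_sec, end_sec); the timings are invented for illustration.
TimedPhoneme = Tuple[str, float, float]

def cut_out_section(phonemes: List[TimedPhoneme],
                    start: float, end: float) -> List[str]:
    """Keep the phonemes that lie entirely inside the <OOV> section."""
    return [p for (p, s, e) in phonemes if s >= start and e <= end]

# Tail of the utterance: "... w a | e s u d i: a: r u | d a ..."
timed = [("w", 0.9, 1.0), ("a", 1.0, 1.1),
         ("e", 1.1, 1.2), ("s", 1.2, 1.3), ("u", 1.3, 1.4),
         ("d", 1.4, 1.5), ("i:", 1.5, 1.7), ("a:", 1.7, 1.9),
         ("r", 1.9, 2.0), ("u", 2.0, 2.1),
         ("d", 2.1, 2.2), ("a", 2.2, 2.3)]

# Assume the name registration task reported <OOV> as spanning 1.1 s to 2.1 s.
name_phonemes = cut_out_section(timed, 1.1, 2.1)
```

The same interval could be used to slice the kana string and the audio data, provided they are time-aligned in the same way.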
[0138]
After the process in step S62, the process proceeds to step S63, and the matching unit 54 supplies the phoneme sequence, kana string, and speech data of the name extracted in the process of step S62 to the unknown word acquisition unit 56 to perform clustering. Details of clustering are disclosed in Japanese Patent Application No. 2001-097843 previously proposed by the present applicant. As a result of this clustering, each cluster of the unknown word acquisition unit 56 has a representative phoneme sequence and a kana string.
[0139]
After the process of step S63, the process proceeds to step S64, and the matching unit 54 supplies the recognition result of the voice recognized in step S62 (for example, the kana string “Kiminonamaewa SDR dayo”) to the name registration application unit 21₁.
[0140]
FIG. 23 shows an example of the feature space clustered by the unknown word acquisition unit 56 in the process of step S63 of FIG. 22. To keep the diagram simple, FIG. 23 shows a feature space defined by two feature amounts (feature parameters) 1 and 2 (the same applies to FIG. 3 described above). In FIG. 23, four names, “Arara”, “Sunny”, “Tokyo”, and “Taro”, are clustered in the feature space.
[0141]
That is, in FIG. 23, four clusters are formed in the feature space: the “Arara” cluster 151, the “Sunny” cluster 152, the “Tokyo” cluster 153, and the “Taro” cluster 154. Each cluster is given a representative phoneme sequence (in the example of FIG. 23, “a / r / a / r / a”, “s / a / n / i:”, “t / o: / ky / o:”, and “t / a / r / o / u”), a representative kana notation (in the example of FIG. 23, “Arara”, “Sunny”, “Tokyo”, and “Taro”), and an ID (in the example of FIG. 23, “1”, “2”, “3”, and “5”).
[0142]
FIG. 24 shows an example of the common dictionary of the common dictionary unit 55 in which word information has been entered in step S48 of FIG. 21. In FIG. 24, the entry in the first row indicates that the pronunciation was input as a kana string, that the pronunciation is the character string “SDR”, and that the category “_robot name_” was input.
[0143]
The entry in the second row indicates that the pronunciation was input by voice, and that the kana notation and phoneme sequence of the pronunciation are the representative kana notation (“Taro” in the example of FIG. 23) and the representative phoneme sequence (“t / a / r / o:” in the example of FIG. 23) given to the cluster whose ID in the unknown word acquisition unit 56 is “5”. The category of the entry in the second row is the one determined by the name registration application unit 21₁ in step S47 of FIG. 21, namely “_user name_”. For example, when the user utters “My name is Taro”, an entry like the one in the second row is created in the common dictionary unit 55.
[0144]
Similarly, the entries in the third and fourth rows indicate that the pronunciation was input as a kana string, that the pronunciations are the character strings “Sunny Taro” and “Kitashinagawa”, respectively, and that the categories “_user name_” and “_place name_” were input, respectively. The entry in the fifth row indicates that the pronunciation was input by voice, and that the kana notation and phoneme sequence of the pronunciation are the representative kana notation (“Tokyo” in the example of FIG. 23) and the representative phoneme sequence (“t / o: / ky / o:” in the example of FIG. 23) given to the cluster whose ID in the unknown word acquisition unit 56 is “3”. The category of the entry in the fifth row is determined to be “_place name_” by the name registration application unit 21₁.
[0145]
In the common dictionary, for a word whose pronunciation was input as a kana string, a pair consisting of the kana string representing the pronunciation of the word and its category is registered in one entry. For a word whose pronunciation was input by voice, a pair consisting of the ID of the cluster of the word and its category is registered in one entry.
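As an illustration of this entry layout, the five entries described for FIG. 24 could be modeled as below. The class name `CommonEntry` and its field names are hypothetical; the description only states that each entry pairs a category with either a kana string or a cluster ID.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class CommonEntry:
    category: str
    kana: Optional[str] = None        # set when the pronunciation was input as kana
    cluster_id: Optional[int] = None  # set when the pronunciation was input by voice

    def __post_init__(self) -> None:
        # Exactly one kind of pronunciation information per entry.
        if (self.kana is None) == (self.cluster_id is None):
            raise ValueError("entry needs either a kana string or a cluster ID")

# The five entries described for FIG. 24.
common_dictionary: List[CommonEntry] = [
    CommonEntry("_robot name_", kana="SDR"),
    CommonEntry("_user name_", cluster_id=5),
    CommonEntry("_user name_", kana="Sunny Taro"),
    CommonEntry("_place name_", kana="Kitashinagawa"),
    CommonEntry("_place name_", cluster_id=3),
]
```

Storing only the cluster ID for voice-registered words means the actual kana string and phoneme sequence are looked up in the unknown word acquisition unit 56 when they are needed.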
[0146]
FIG. 25 is a flowchart for explaining the process in which the matching unit 54 reflects the contents of the common dictionary unit 55 in a task in step S49 of FIG. 21. This process is performed for each activated task.
[0147]
In step S81, the matching unit 54 initializes the variable word dictionary 132 and the category table 133 in the task 71 (FIG. 6). That is, the variable word dictionary 132 is in a state where there is no entry, and the category table 133 is in a state where no word belongs to each category.
[0148]
After the process of step S81, the process proceeds to step S82, and the matching unit 54 reflects the contents of the common dictionary unit 55 in the variable word dictionary 132 and the category table 133.
[0149]
That is, the matching unit 54 selects, from the common dictionary of the common dictionary unit 55, the entries whose category is common to (identical with) a category entered in the category table 133, and acquires from each such entry the category and the corresponding cluster ID or kana pronunciation (kana string). When it acquires a cluster ID from the common dictionary, the matching unit 54 further acquires the kana string corresponding to that cluster ID from the unknown word acquisition unit 56.
[0150]
When the matching unit 54 acquires a kana string of words belonging to the category selected from the common dictionary of the common dictionary unit 55 as described above, the matching unit 54 enters the kana string in the variable word dictionary 132. In addition, the matching unit 54 enters the word information represented by the kana string acquired from the common dictionary into the corresponding category of the category table 133.
[0151]
According to the above processing, in each task, after the variable word dictionary 132 is initialized, the contents of the common dictionary are reflected. That is, the variable word dictionary 132 is constructed or reconstructed based on the contents of the common dictionary. Therefore, it is possible to easily maintain consistency in each task as compared with a method of deleting or changing a specific entry in the dictionary.
[0152]
Further, according to the above-described processing, each time a word registered by voice is reflected in a task, the latest pronunciation information at that time is acquired from the unknown word acquisition unit 56. Consequently, even after a word has been registered in the variable word dictionary 132, its pronunciation information is updated simply by supplying voice data to the unknown word acquisition unit 56, and the matching unit 54 can recognize speech by referring to the latest pronunciation information.
[0153]
FIG. 26 is a block diagram illustrating the reflection process of FIG. 25. When a kana string is described for a category in the common dictionary of the common dictionary unit 55, the kana string is registered in the variable word dictionary 132, and the information of the word represented by that kana string is registered in the category of the category table 133 that is the same as the category in the common dictionary.
[0154]
On the other hand, when a cluster ID is described for a category in the common dictionary, the unknown word acquisition unit 56 is referred to, and the representative kana string and the representative phoneme sequence corresponding to the cluster ID are registered in the variable word dictionary 132. The information of the word represented by that cluster ID is then registered in the category of the category table 133 that is the same as the category in the common dictionary. Note that both the fixed word dictionary 131 and the variable word dictionary 132 are used in the speech recognition process described later.
[0155]
FIG. 27 shows an example of the variable word dictionary 132 of the application switching task 71₂ in which the contents of the common dictionary of the common dictionary unit 55 shown in FIG. 24 have been reflected. When the category table 133 of the application switching task 71₂ is as shown in FIG. 13, the only category common to the common dictionary of FIG. 24 is “_robot name_”, so the matching unit 54 acquires from the common dictionary of FIG. 24 the kana pronunciation “SDR” corresponding to “_robot name_”.
[0156]
Then, as shown in FIG. 27, the matching unit 54 enters the kana pronunciation “SDR” acquired from the common dictionary of FIG. 24 in the transcription field of the variable word dictionary 132. Further, based on the kana phoneme conversion rule 115 (FIG. 8), the matching unit 54 describes “e / s / u / d / i: / a: / r / u”, which corresponds to the kana pronunciation “SDR”, as the phoneme sequence for the transcription “SDR”.
[0157]
Further, the matching unit 54 registers “OOV00001” as the symbol of the word represented by the transcription “SDR”. Here, the symbol is “OOV00001”, meaning “OOV” plus a serial number, but any character string that uniquely identifies the word may be used as the symbol. For example, a symbol with the category name prefixed, such as “_robot name_::OOV00001”, can be used.
[0158]
FIG. 28 shows an example of the category table 133 of the application switching task 71₂ in which the contents of the common dictionary of FIG. 24 have been reflected. When the contents of the common dictionary of FIG. 24 are reflected in the variable word dictionary 132 as shown in FIG. 27, the symbol “OOV00001” of the word registered in the variable word dictionary 132 of FIG. 27 is entered in the category “_robot name_” of the category table 133 as a word belonging to that category.
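The reflection process of steps S81 and S82 can be sketched as follows, under stated assumptions. The tuple layout, the `reflect` function, and the `rules` lookup table (a stand-in for the kana phoneme conversion rule 115) are hypothetical; the entries and cluster values are those described for FIG. 24 and FIG. 23.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (category, kana or None, cluster ID or None), following FIG. 24.
Entry = Tuple[str, Optional[str], Optional[int]]

def reflect(common: List[Entry],
            task_categories: List[str],
            clusters: Dict[int, Tuple[str, str]],
            kana_to_phonemes: Callable[[str], str]):
    """Rebuild a task's variable word dictionary and category table from scratch."""
    variable_words: List[Dict[str, str]] = []          # step S81: no entries
    category_table: Dict[str, List[str]] = {c: [] for c in task_categories}
    for category, kana, cluster_id in common:          # step S82
        if category not in category_table:
            continue                                   # category not used by this task
        if cluster_id is not None:
            kana, phonemes = clusters[cluster_id]      # latest representative values
        else:
            phonemes = kana_to_phonemes(kana)
        symbol = f"OOV{len(variable_words) + 1:05d}"   # "OOV" + serial number
        variable_words.append(
            {"symbol": symbol, "transcription": kana, "phonemes": phonemes})
        category_table[category].append(symbol)
    return variable_words, category_table

common = [("_robot name_", "SDR", None), ("_user name_", None, 5),
          ("_user name_", "Sunny Taro", None),
          ("_place name_", "Kitashinagawa", None), ("_place name_", None, 3)]
clusters = {3: ("Tokyo", "t / o: / ky / o:"), 5: ("Taro", "t / a / r / o:")}
# Lookup table standing in for the kana phoneme conversion rule (assumption).
rules = {"SDR": "e / s / u / d / i: / a: / r / u",
         "Kitashinagawa": "k / i / t / a / sh / i / n / a / g / a / w / a"}

words, table = reflect(common, ["_robot name_"], clusters, rules.__getitem__)
```

For a task whose only shared category is “_robot name_”, this yields a single word with the symbol `OOV00001`. Because the dictionary is always rebuilt from empty, deletions and changes in the common dictionary need no special handling on the task side.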
[0159]
Next, the process in which the matching unit 54 deletes or changes a word registered in the common dictionary of the common dictionary unit 55 in step S48 of FIG. 21 will be described with reference to the flowchart of FIG. 29. The process of deleting or changing a word in the common dictionary is started, for example, when there is a command from the name registration application unit 21₁ or when unnecessary registered words must be deleted because of memory restrictions.
[0160]
The process of deleting or changing a word in the common dictionary is also performed, for example, to rewrite the ID described in the common dictionary. This becomes necessary when a cluster in the unknown word acquisition unit 56 is deleted, or when the ID assigned to a cluster changes because clusters are divided or merged, so that the ID described in the common dictionary (the cluster ID shown in FIG. 24) has to be made to match the ID assigned to the cluster of the unknown word acquisition unit 56.
[0161]
Further, the process of deleting or changing the common dictionary is performed to make the common dictionary smaller by deleting the information on a category from the common dictionary when none of the tasks whose language model 112 describes that category is used any longer.
[0162]
When the unknown word acquisition unit 56 changes the representative phoneme sequence or kana string of a cluster, the change is picked up by the reflection process of FIG. 25, so there is no need to perform the process of changing or deleting the common dictionary (hereinafter referred to as the change deletion process as appropriate).
[0163]
In step S101, the matching unit 54 determines the word in the common dictionary that is to be the target of the change deletion process, and the process proceeds to step S102. The target word may be specified by the user with a button (not shown), or may be estimated and determined by the matching unit 54.
[0164]
In step S102, the matching unit 54 determines whether to delete the word that is the target of the change deletion process. If it determines that the word is to be deleted, the process proceeds to step S103. In step S103, the matching unit 54 deletes the entry of the target word from the common dictionary. Deletion here means deleting an entry specified by a category and pronunciation information, deleting all entries of a specific category at once, or deleting all entries having specific pronunciation information (a kana string or a cluster ID) at once.
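The three kinds of deletion can be expressed with one filter, as in the sketch below. The `delete_entries` function and the `(category, pronunciation)` tuple layout are hypothetical and chosen only to illustrate the selection criteria.

```python
from typing import List, Optional, Tuple, Union

Pron = Union[str, int]        # kana string or cluster ID
Entry = Tuple[str, Pron]      # (category, pronunciation information)

def delete_entries(common: List[Entry],
                   category: Optional[str] = None,
                   pron: Optional[Pron] = None) -> List[Entry]:
    """Drop entries matching every criterion given (at least one is required)."""
    if category is None and pron is None:
        raise ValueError("specify a category, pronunciation information, or both")

    def matches(entry: Entry) -> bool:
        cat, p = entry
        return ((category is None or cat == category)
                and (pron is None or p == pron))

    return [e for e in common if not matches(e)]

common = [("_robot name_", "SDR"), ("_user name_", 5),
          ("_user name_", "Sunny Taro"),
          ("_place name_", "Kitashinagawa"), ("_place name_", 3)]

one_entry = delete_entries(common, category="_place name_", pron=3)
whole_category = delete_entries(common, category="_user name_")
by_pronunciation = delete_entries(common, pron="SDR")
```

Identifying an entry by category together with pronunciation information avoids touching a homophone registered under a different category.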
[0165]
On the other hand, if the matching unit 54 determines in step S102 not to delete the word that is the target of the change deletion process, the process proceeds to step S104, where the matching unit 54 determines whether to change the word. If it determines not to change the word, the process returns to step S102 and waits until either change or deletion is determined.
[0166]
If the matching unit 54 determines in step S104 to change the word that is the target of the change deletion process, the process proceeds to step S105, where the matching unit 54 changes the entry of that word in the common dictionary.
[0167]
For example, when clusters of the unknown word acquisition unit 56 are divided or merged and the ID number of a cluster changes, the matching unit 54 changes the cluster ID in the common dictionary so that it matches the unknown word acquisition unit 56. Also, for example, when the user later wants to correct the kana string input at the time of registration, the matching unit 54, under the control of the name registration application unit 21₁, determines the target word in the common dictionary (a word entered in the common dictionary in step S48 of FIG. 21) and then changes the kana pronunciation of that word to the kana string newly input by the user.
[0168]
After the process of step S103 or the process of step S105, the process proceeds to step S106, and the matching unit 54 performs the reflection process of FIG. 25 to reflect the contents of the common dictionary in each task.
[0169]
As described above, when the words in the common dictionary are deleted or changed, the contents after the change are reflected in each task, so that the consistency of the registered words in each application unit can be maintained.
[0170]
FIGS. 30 and 31 show examples in which the matching unit 54 changes the entry of a word in the common dictionary in the process of step S105 of FIG. 29. For example, when the cluster of the unknown word acquisition unit 56 whose ID is “5” is divided into a cluster whose ID is “8” and a cluster whose ID is “9”, the matching unit 54 changes the common dictionary from the state shown in FIG. 30A to the state shown in FIG. 30B.
[0171]
That is, the matching unit 54 deletes the entry whose cluster ID is “5” in the common dictionary unit 55 (the entry in the first row of FIG. 30A) and registers two new entries with the category “_user name_” that was registered in the deleted entry. The matching unit 54 then describes the cluster ID numbers “8” and “9” in the two new entries, respectively (the entries in the first and second rows of FIG. 30B).
[0172]
Also, for example, when the cluster with the ID “5” and the cluster with the ID “3” of the unknown word acquisition unit 56 are merged and a cluster with the ID “10” is newly generated, the matching unit 54 changes the common dictionary from the state shown in FIG. 31A to the state shown in FIG. 31B.
[0173]
That is, the matching unit 54 changes the cluster IDs of the entries whose cluster IDs are “5” and “3” in the common dictionary (all the entries in FIG. 31A) to “10”. The two resulting entries, which have the same category and the cluster ID number “10”, are then made into one (for example, one of them is deleted) (FIG. 31B).
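The two ID rewrites can be sketched as follows. The function names and the `(category, cluster ID)` tuple layout are hypothetical, and kana entries are omitted for brevity; the merge example assumes both entries share the category “_user name_”, since the actual contents of FIG. 31A are not reproduced here.

```python
from typing import List, Tuple

Entry = Tuple[str, int]   # (category, cluster ID); kana entries are left out here

def split_cluster(common: List[Entry], old_id: int,
                  new_ids: List[int]) -> List[Entry]:
    """Replace each entry that refers to old_id with one entry per new cluster ID."""
    result: List[Entry] = []
    for category, cid in common:
        if cid == old_id:
            result.extend((category, new_id) for new_id in new_ids)
        else:
            result.append((category, cid))
    return result

def merge_clusters(common: List[Entry], old_ids: List[int],
                   new_id: int) -> List[Entry]:
    """Point entries for any of old_ids at new_id, then drop exact duplicates."""
    result: List[Entry] = []
    for category, cid in common:
        entry = (category, new_id if cid in old_ids else cid)
        if entry not in result:
            result.append(entry)
    return result

after_split = split_cluster([("_user name_", 5)], 5, [8, 9])
after_merge = merge_clusters([("_user name_", 5), ("_user name_", 3)], [5, 3], 10)
```

After either rewrite, running the reflection process again brings every task back in line with the common dictionary.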
[0174]
Next, the chat process in step S12 of FIG. 5 will be described in detail with reference to the flowchart of FIG. 32.
[0175]
In step S121, the chat application unit 21₂ enables the chat task 71₄, and the process proceeds to step S122. In step S122, the chat application unit 21₂ controls the matching unit 54 to perform the reflection process shown in FIG. 25, so that the contents of the common dictionary of the common dictionary unit 55 are reflected in the chat task 71₄ (its variable word dictionary 132 and category table 133). The chat task 71₄ can thereby pick up the words that were registered, changed, or deleted in the common dictionary while the task was disabled.
[0176]
After the processing of step S122, the process proceeds to step S123, and the chat application unit 21₂ controls the speech recognition engine unit 11 to perform speech recognition processing. The process then proceeds to step S124. Details of the speech recognition processing will be described later with reference to FIG. 35.
[0177]
In step S124, the chat application unit 21₂ acquires the recognition result from the speech recognition engine unit 11 and generates a response to it. That is, the robot responds to the utterance from the user. For example, when the utterance from the user is “What time did SDR (the robot name) wake up?”, the chat application unit 21₂ generates a response giving the time at which the robot woke up (was activated) (for example, “7 o'clock”) and causes the robot to speak it.
[0178]
After the processing of step S124, the process proceeds to step S125, and the chat application unit 21₂ determines whether to end the process. This determination is made, for example, depending on whether the user has operated (pressed) an OK button (not shown).
[0179]
If it is determined in step S125 that the process is not terminated, the process returns to step S123, and the same process is repeated thereafter. That is, the robot continues chatting with the user.
[0180]
If it is determined in step S125 that the process is to be ended, the process proceeds to step S126, and the chat application unit 21₂ disables the chat task 71₄. The process then proceeds to step S127, in which the chat application unit 21₂ ends the process.
[0181]
In the above-described processing, the chat application unit 21₂ generates a response each time the user speaks once, but the robot may also prompt the user to speak by speaking spontaneously.
[0182]
The process of FIG. 32 has been described as the chat process of the chat application unit 21₂, but the voice commander process of the voice commander application unit 21₃, ..., and the processes of the other application units 21_M are performed in the same manner. In step S124, however, the processing based on the speech recognition result from the speech recognition engine unit 11 is the one appropriate to each application unit 21.
[0183]
FIG. 33 shows a state in which the contents of the common dictionary of the common dictionary unit 55 shown in FIG. 24 have been reflected in the variable word dictionary 132 of the chat task 71₄.
[0184]
When the category table 133 of the chat task 71₄ is as shown in FIG. 18, the categories common to the common dictionary of FIG. 24 are “_robot name_” and “_place name_”. The first entry in FIG. 24 is therefore acquired as the common dictionary entry corresponding to “_robot name_”, and the fourth and fifth entries in FIG. 24 are acquired as the entries corresponding to “_place name_”. Further, the kana pronunciation “SDR” is acquired from the first entry, the kana pronunciation “Kitashinagawa” is acquired from the fourth entry, and the cluster ID number “3” is acquired from the fifth entry.
[0185]
Then, as shown in FIG. 33, the matching unit 54 enters “SDR” and “Kitashinagawa” in the transcription field of the variable word dictionary 132. Further, based on the kana phoneme conversion rule 115 (FIG. 8), the matching unit 54 describes, in the phoneme sequence field of the variable word dictionary 132, “e / s / u / d / i: / a: / r / u” for the transcription “SDR” and “k / i / t / a / sh / i / n / a / g / a / w / a” for the transcription “Kitashinagawa”.
[0186]
Further, the matching unit 54 looks up the cluster having the cluster ID “3” in the unknown word acquisition unit 56 and acquires its representative phoneme sequence and kana string. For example, when the unknown word acquisition unit 56 is in the state shown in FIG. 23, the matching unit 54 acquires the phoneme sequence “t / o: / ky / o:” and the kana string “Tokyo” from the cluster 153 with the cluster ID “3”. Then, as shown in FIG. 33, the matching unit 54 enters the acquired phoneme sequence “t / o: / ky / o:” and kana string “Tokyo” in the phoneme sequence field and the transcription field of the variable word dictionary 132, respectively.
[0187]
Further, the matching unit 54 registers “OOV00001” as the symbol of the word represented by the transcription “SDR”, “OOV00002” as the symbol of the word represented by the transcription “Kitashinagawa”, and “OOV00003” as the symbol of the word represented by the transcription “Tokyo”.
[0188]
In this case, it is assumed that the kana phoneme conversion rules 115 of the phoneme typewriter task 71₁ and the chat task 71₄ are the same, and the representative phoneme sequence of the cluster obtained using the phoneme typewriter task 71₁ is registered as it is in the variable word dictionary 132 of the chat task 71₄. When the kana phoneme conversion rules 115 of the phoneme typewriter task 71₁ and the chat task 71₄ differ, the matching unit 54 acquires the representative kana string of the cluster from the unknown word acquisition unit 56 and describes the phoneme sequence of the variable word dictionary 132 based on the kana phoneme conversion rule 115 of the chat task 71₄.
[0189]
FIG. 34 shows a state in which the contents of the common dictionary of FIG. 24 have been reflected in the category table 133 of the chat task 71₄. In the category table 133, the category “_robot name_” contains the symbol “OOV00001”, under which the word belonging to that category (the word whose transcription is “SDR” (FIG. 33)) is registered in the variable word dictionary 132. Likewise, the category “_place name_” contains the symbols “OOV00002” and “OOV00003”, under which the words belonging to that category (the words whose transcriptions are “Kitashinagawa” and “Tokyo” (FIG. 33)) are registered in the variable word dictionary 132.
[0190]
Next, the speech recognition process performed by the speech recognition engine unit 11 of FIG. 2 in step S123 of FIG. 32 will be described in detail with reference to the flowchart of FIG. 35. This process is started when the user inputs voice to the microphone 51, and is performed for each enabled task among the application switching task 71₂, the chat task 71₄, the voice commander task 71₅, ..., and the other tasks 71_N.
[0191]
In step S141, the audio signal generated by the microphone 51 is converted by the AD conversion unit 52 into audio data, which is a digital signal, and supplied to the feature amount extraction unit 53. After the process of step S141, the process proceeds to step S142, and the feature amount extraction unit 53 extracts feature amounts such as the mel cepstrum from the supplied audio data. The process then proceeds to step S143.
[0192]
In step S143, the matching unit 54 connects some of the words represented by the symbols of the fixed word dictionary 131 and the variable word dictionary 132, generates a word string, and calculates an acoustic score. The acoustic score represents how close (acoustically) a word string that is a candidate for the speech recognition result and the input speech are as sounds.
[0193]
After the processing of step S143, the process proceeds to step S144, and the matching unit 54 selects a predetermined number of word strings having a high acoustic score based on the acoustic score calculated in step S143, and proceeds to step S145.
[0194]
In step S145, the matching unit 54 calculates the language score of each word string selected in step S144 using the language model 112, and the process proceeds to step S146. For example, when a grammar or a finite state automaton is used as the language model 112, the language score is “1” if the word string can be accepted by the language model 112 and “0” if it cannot.
[0195]
The matching unit 54 may leave the word string selected in step S144 when it can be accepted, and delete the word string selected in step S144 when it cannot be accepted.
[0196]
When a statistical language model is used as the language model 112, the generation probability of the word string is used as a language score. Details of the method for obtaining the language score are disclosed in Japanese Patent Application No. 2001-382579 previously proposed by the present applicant.
[0197]
For example, suppose that the speech recognition process is performed in the voice commander process of the voice commander application unit 21₃ and the matching unit 54 selects the word string “<start> OOV00001 forward <end>” in step S144. The language score is then “1”, because this word string can be accepted by the grammar-based language model 112 shown in FIG. 19.
[0198]
That is, the matching unit 54 refers to the category table 133 (FIG. 28), recognizes that the category of the symbol “OOV00001” is “_robot name_”, converts the word string “<start> OOV00001 forward <end>” obtained in step S144 into the word string “<start> _robot name_ forward <end>” using the category name, and determines that it can be accepted by the language model 112 shown in FIG. 19.
[0199]
On the other hand, for example, when the word string “<start> Go to OOV00001 before <end>” is selected in step S144, the matching unit 54 refers to the category table 133 (FIG. 28), recognizes that the category of the symbol “OOV00001” is “_robot name_”, and converts the word string into “<start> Go to _robot name_ before <end>” using the category name. It then determines that this word string cannot be accepted by the language model 112 shown in FIG. 19 and sets its language score to “0”.
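The accept/reject scoring with category substitution can be illustrated as below. The grammar of FIG. 19 is not reproduced in this section, so the sketch models it as a plain set of accepted word strings; that set, the `language_score` function, and the rejected example are assumptions.

```python
from typing import Dict, List, Set, Tuple

def language_score(words: List[str],
                   symbol_to_category: Dict[str, str],
                   accepted: Set[Tuple[str, ...]]) -> int:
    """Return 1 if the category-substituted word string is accepted, else 0."""
    # Replace each registered-word symbol by its category name.
    substituted = tuple(symbol_to_category.get(w, w) for w in words)
    return 1 if substituted in accepted else 0

# Hypothetical stand-in for the grammar: a set of accepted word strings.
grammar = {("<start>", "_robot name_", "forward", "<end>")}
# Symbol-to-category mapping derived from the category table (FIG. 28).
symbol_to_category = {"OOV00001": "_robot name_"}

ok = language_score(["<start>", "OOV00001", "forward", "<end>"],
                    symbol_to_category, grammar)
rejected = language_score(["<start>", "forward", "OOV00001", "<end>"],
                          symbol_to_category, grammar)
```

Because the grammar refers only to category names, a newly registered word becomes usable in a task as soon as its symbol is entered in the category table, without editing the language model.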
[0200]
In step S146, the matching unit 54 integrates, for each word string, the acoustic score calculated in step S143 and the language score calculated in step S145, sorts the word strings by the integrated score, and determines, for example, the word string having the largest integrated score as the recognition result.
[0201]
As a result, the most appropriate word string acoustically and linguistically is determined as the recognition result.
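The description does not state how the two scores are integrated. One common choice, shown here purely as an assumption, is to add them in the log domain with a weight on the language score, so that a language score of 0 rules a candidate out. The names `integrated_score` and `best_candidate` and all score values are hypothetical.

```python
import math
from typing import List, Tuple

# (word string, acoustic log score, language score in [0, 1])
Candidate = Tuple[str, float, float]

def integrated_score(acoustic_log: float, language: float,
                     weight: float = 1.0) -> float:
    """Log-domain sum; a language score of 0 rules the candidate out."""
    if language <= 0.0:
        return -math.inf
    return acoustic_log + weight * math.log(language)

def best_candidate(candidates: List[Candidate]) -> str:
    """Sort by integrated score and return the top word string."""
    ranked = sorted(candidates,
                    key=lambda c: integrated_score(c[1], c[2]),
                    reverse=True)
    return ranked[0][0]

candidates = [
    ("<start> forward OOV00001 <end>", -10.0, 0.0),  # best sound, rejected by grammar
    ("<start> OOV00001 forward <end>", -12.0, 1.0),
    ("<start> OOV00001 back <end>", -15.0, 1.0),
]
result = best_candidate(candidates)
```

In this example the acoustically closest candidate is discarded because the grammar rejects it, and the best accepted candidate is chosen instead.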
[0202]
After the processing of step S146, the process proceeds to step S147, and the matching unit 54 determines whether the recognition result includes a word registered by voice (a word clustered in the unknown word acquisition unit 56).
[0203]
If it is determined in step S147 that the word registered by speech is included in the recognition result, the process proceeds to step S148, where the matching unit 54 supplies the word to the unknown word acquisition unit 56, and the unknown word acquisition unit 56 performs re-clustering. Then, the process proceeds to step S149.
[0204]
For example, when the word string “<start> I went to Tokyo today <end>” selected in step S144, which includes the place name (unknown word) “Tokyo”, is the recognition result, the matching unit 54 supplies the unknown word acquisition unit 56 with the speech data of the unknown word “Tokyo”, together with the phoneme sequence (for example, “t / o: / ky / o:”) and the kana string (for example, “Tokyo”) recognized with reference to the phoneme typewriter task 71₁. The unknown word acquisition unit 56 then performs re-clustering.
[0205]
As a result, the amount of speech data supplied to the unknown word acquisition unit 56 increases, and the representative phoneme sequence and the representative kana string of each cluster may be updated to correct values. As a side effect, however, even after a correct phoneme sequence and kana string have been acquired, re-clustering may change them to incorrect values. To prevent this side effect, when instructed by the user, the pronunciation can be fixed by writing the kana string at that time into the entry of the common dictionary. For example, when the kana string of the cluster with ID = 3 in FIG. 23 is “Tokyo”, the part of the common dictionary in FIG. 24 described as “cluster ID = 3” is rewritten to “Kana pronunciation: Tokyo” (the fifth entry is the target of the rewriting). By doing this, even if the kana string of the cluster with ID = 3 later changes to something other than “Tokyo”, the pronunciation of the fifth entry in the common dictionary remains fixed to “Tokyo”.
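The effect of fixing a pronunciation can be sketched as follows. This is an illustrative model only: the dictionary layout, the category given to the entry, and the function names `fix_pronunciation` and `resolve` are assumptions, not structures taken from the figures.

```python
# Hypothetical cluster store: cluster ID -> current representative kana string.
clusters = {3: "Tokyo"}

# Hypothetical common-dictionary entry that refers to a cluster by its ID.
entry = {"category": "_place name_", "pronunciation": {"cluster_id": 3}}


def fix_pronunciation(entry, clusters):
    """Replace a cluster reference with the kana string the cluster holds right now."""
    pron = entry["pronunciation"]
    if "cluster_id" in pron:
        entry["pronunciation"] = {"kana": clusters[pron["cluster_id"]]}
    return entry


def resolve(entry, clusters):
    """Return the kana string to use for this entry at reflection time."""
    pron = entry["pronunciation"]
    if "kana" in pron:
        return pron["kana"]            # fixed: later re-clustering has no effect
    return clusters[pron["cluster_id"]]  # not fixed: follows the cluster
```

An entry that still holds a cluster ID follows whatever the cluster currently contains, whereas an entry rewritten by `fix_pronunciation` keeps its kana string even if re-clustering later changes the cluster.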
[0206]
On the other hand, in step S147, when the matching unit 54 determines that the speech recognition result does not include a word registered by speech, the matching unit 54 skips step S148 and proceeds to step S149.
[0207]
In step S149, the matching unit 54 supplies the recognition result determined in step S146 to the application unit 21 corresponding to the task.
[0208]
Here, FIG. 36 shows the expressions for obtaining the language score when, for example, in the chat process of the chat application unit 21₂, the matching unit 54 selects the word string “<start> OOV00001 ha what time at occurred of <end>” in step S144.
[0209]
The language score “Score(<start> OOV00001 ha what time at occurred of <end>)” is the probability that this word string is generated, as shown in Equation (1).
[0210]
Strictly speaking, as shown in Equation (2), the value of this language score is “P(<start>) P(OOV00001 | <start>) P(ha | <start> OOV00001) P(what time | <start> OOV00001 ha) P(at | <start> OOV00001 ha what time) P(occurred | <start> OOV00001 ha what time at) P(of | <start> OOV00001 ha what time at occurred) P(<end> | <start> OOV00001 ha what time at occurred of)”. However, since the language model 112 uses a tri-gram as shown in FIG. 16, the condition parts “<start> OOV00001 ha”, “<start> OOV00001 ha what time”, “<start> OOV00001 ha what time at”, “<start> OOV00001 ha what time at occurred”, and “<start> OOV00001 ha what time at occurred of” are approximated by conditional probabilities limited to at most the two immediately preceding words, namely “OOV00001 ha”, “ha what time”, “what time at”, “at occurred”, and “occurred of” (Equation (3)). Here, “what time” is treated as a single word.
[0211]
These conditional probabilities are obtained by referring to the language model 112 (FIG. 16). However, since the language model 112 does not include the symbol “OOV00001”, the matching unit 54 refers to the category table 133, recognizes that the category of the word represented by the symbol “OOV00001” is “_robot name_”, and converts “OOV00001” to “_robot name_”.
[0212]
That is, as shown in Expression (4), “P(OOV00001 | <start>)” is rewritten as “P(_robot name_ | <start>) P(OOV00001 | _robot name_)”, which is approximated by “P(_robot name_ | <start>) / N”. Here, N represents the number of words belonging to the category “_robot name_” in the category table 133.
[0213]
That is, when a probability is described in the form P(X | Y) and the word X belongs to the category C, P(C | Y) is obtained from the language model 112 and multiplied by P(X | C), the probability that the word X is generated from the category C. Assuming that all the words belonging to the category C are generated with equal probability, if N words belong to the category C, P(X | C) can be approximated by 1 / N.
[0214]
In FIG. 34, only the word represented by the symbol “OOV00001” belongs to the category “_robot name_”, so N is “1”. Therefore, as shown in Expression (5), “P(ha | <start> OOV00001)” becomes “P(ha | <start> _robot name_)”. Likewise, as shown in Expression (6), “P(what time | OOV00001 ha)” becomes “P(what time | _robot name_ ha)”.
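The category-based tri-gram calculation described above can be sketched as follows. This is a toy illustration under stated assumptions: the language model is a plain dictionary keyed by (context, word), the probability values in `LM` are invented for the example, and the names `class_trigram_prob` and `language_score` are hypothetical.

```python
# Hypothetical category table: symbol -> category name.
TABLE = {"OOV00001": "_robot name_"}

# Hypothetical tri-gram model over words and category names: (context, word) -> probability.
# The values are illustrative only.
LM = {
    ((), "<start>"): 1.0,
    (("<start>",), "_robot name_"): 0.5,
    (("<start>", "_robot name_"), "ha"): 0.4,
    (("_robot name_", "ha"), "<end>"): 0.25,
}


def class_trigram_prob(word, context, lm, category_table):
    """P(word | context), with symbols mapped to their categories.

    For a word X in category C this is P(C | context) * P(X | C),
    where P(X | C) is approximated by 1 / N (N = number of words in C).
    """
    cat_context = tuple(category_table.get(w, w) for w in context)
    if word in category_table:
        category = category_table[word]
        n = sum(1 for c in category_table.values() if c == category)
        return lm[(cat_context, category)] / n
    return lm[(cat_context, word)]


def language_score(words, lm, category_table):
    """Product of tri-gram probabilities over the word string."""
    score = 1.0
    for i, word in enumerate(words):
        context = words[max(0, i - 2):i]  # at most the two preceding words
        score *= class_trigram_prob(word, context, lm, category_table)
    return score
```

With a single word in “_robot name_”, N is 1 and the symbol is scored exactly like its category; registering a second robot name halves the probability assigned to each of them.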
[0215]
Thereby, the language score can be calculated even for a word string including a variable word, and the variable word can appear in the recognition result.
[0216]
In the above example, the activation of the application unit 21 is linked with the validation of the task 71, and the termination of the application unit 21 is linked with the invalidation of the task 71. However, these may be performed at other timings. For example, a task may be switched between valid and invalid many times while the application unit 21 is active, and one application may control a plurality of tasks.
[0217]
In this case, for a task that is frequently switched between valid and invalid, repeatedly allocating and releasing memory is inefficient. For such a task, it is therefore sufficient, when the task is invalidated, to set a flag indicating that the task is invalid and to leave its memory allocated.
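The flag-based approach can be sketched as follows. The `Task` class, its `resources` field, and `active_tasks` are hypothetical names used only for illustration; the dictionary merely stands in for the memory a task would hold.

```python
class Task:
    """Hypothetical task whose resources stay allocated while it is invalid."""

    def __init__(self, name):
        self.name = name
        # Stand-in for the memory the task holds (dictionaries, language model, ...).
        self.resources = {"dictionary": [], "language_model": {}}
        self.valid = False

    def enable(self):
        self.valid = True

    def disable(self):
        # Only the flag changes; the resources are deliberately kept.
        self.valid = False


def active_tasks(tasks):
    """Names of the tasks that should take part in recognition."""
    return [t.name for t in tasks if t.valid]
```

Switching a task on and off any number of times then touches only the flag, and the recognition loop simply skips tasks whose flag is off.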
[0218]
Further, in the above example, it is assumed that nothing is stored in the common dictionary of the common dictionary unit 55 when the robot system is activated. However, several words may be stored in the common dictionary in advance. For example, since the product name of a robot is often registered as the name of the robot, the product name may be registered in advance in the category “_robot name_” of the common dictionary.
[0219]
FIG. 37 shows an example of a common dictionary in which the robot product name “SDR” is entered in the category “_robot name_” when the robot system is activated. In FIG. 37, because the kana pronunciation “SDR” is already entered in the category “_robot name_” at activation, the user can control the robot using the word represented by the kana pronunciation “SDR” without registering a name.
[0220]
In the above example, it is assumed that no cluster exists in the unknown word acquisition unit in the initial state (at the time of shipment). However, if clusters are prepared in advance for common names, a name for which a cluster has been prepared is recognized more easily when it is input by voice in the name registration process of FIG. 21. For example, if clusters such as those shown in FIG. 3 are prepared at the time of shipment, the pronunciations of the utterances “aka” (red), “ao” (blue), “midori” (green), and “kuro” (black) can be recognized (acquired) as correct phoneme sequences. Furthermore, it is not desirable for the pronunciation of a name for which a cluster has been prepared to change after it is registered in the common dictionary. Therefore, when pronunciation information for such a name is registered in the common dictionary, it is described not by a cluster ID but by a kana pronunciation (the representative kana string of the cluster) such as “aka”, “ao”, “midori”, or “kuro”.
[0221]
Furthermore, in the above-described example, the matching unit 54 reflects the contents of the common dictionary in all tasks. However, the matching unit 54 may reflect them only in selected tasks. For example, a number (task ID) is assigned to each task in advance, and the common dictionary in FIG. 24 is extended with a column indicating a “list of tasks in which this entry is valid (or invalid)”. In the reflection processing, the matching unit 54 then only needs to reflect the contents of the common dictionary in the tasks whose task IDs are described in the column indicating the “list of tasks for which this entry is valid”.
[0222]
FIG. 38 shows an example in which the IDs of the tasks to be reflected are described in a column representing “valid task” in the common dictionary. In FIG. 38, the word belonging to the category “_robot name_” whose kana pronunciation is “SDR” has the task IDs “1”, “2”, and “4” as its valid tasks. Accordingly, the common dictionary contents for the word represented by the kana pronunciation “SDR” are reflected only in the variable word dictionary 132 and the category table 133 of the tasks “1”, “2”, and “4”.
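Filtering by valid task can be sketched as follows. The entry layout and the function `entries_for_task` are hypothetical; the second entry and the convention that `None` means “valid for all tasks” are assumptions added for the illustration.

```python
# Hypothetical common dictionary with a "valid task" column.
COMMON_DICTIONARY = [
    {"category": "_robot name_", "kana": "SDR", "valid_tasks": {1, 2, 4}},
    # Assumption: None means the entry is valid for every task.
    {"category": "_user name_", "kana": "Taro", "valid_tasks": None},
]


def entries_for_task(task_id, common_dictionary):
    """Return the entries that should be reflected in the given task."""
    return [
        e for e in common_dictionary
        if e["valid_tasks"] is None or task_id in e["valid_tasks"]
    ]
```

Under these assumptions, task 3 receives only the unrestricted entry, while tasks 1, 2, and 4 also receive “SDR”.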
[0223]
In the above example, the word stored in the fixed word dictionary 131 is a word described in the language model 112, and the word stored in the variable word dictionary 132 is a word belonging to a category. However, a part of the words belonging to the category may be stored in the fixed word dictionary 131.
[0224]
FIG. 39 shows an example of the fixed word dictionary 131 of the application switching task 71₂ at the time of activation, and FIG. 40 shows an example of the category table 133 at that time. That is, in the category table 133 of FIG. 40, the category “_robot name_” and the word symbol “OOV00001” belonging to the category “_robot name_” are registered in advance. In addition, in the fixed word dictionary 131, the symbol “OOV00001”, the transcription “SDR” of the word represented by the symbol “OOV00001”, and the phoneme sequence “e / s / u / d / i: / a: / r / u” are registered in advance.
[0225]
In this case, the speech recognition process is performed on the assumption that the word “SDR” belongs to the category “_robot name_”. That is, the word “SDR” is treated as a robot name from the beginning. However, since the word “SDR” is stored in the fixed word dictionary 131, it cannot be deleted or changed.
[0226]
In this way, by storing in the fixed word dictionary 131 in advance a word that is expected to be set as a name, such as the product name of the robot, the user can control the robot without registering a name.
[0227]
In the above example, the category symbols are common to all tasks, but may not be common. In this case, a conversion table as shown in FIGS. 41 to 44 may be prepared in the task.
[0228]
That is, for example, when a category “_ROBOT_NAME_” and a category “_USER_NAME_” are described in a certain task T, according to the conversion table of FIG. 41, the common dictionary contents for words belonging to the category “_robot name_” are reflected in the category “_ROBOT_NAME_” in the task T. Likewise, in the task T, the common dictionary contents for words belonging to the category “_user name_” are reflected in the category “_USER_NAME_”.
[0229]
Further, for example, when a category “_proper noun_” is described in a certain task T, according to the conversion table of FIG. 42, both the common dictionary contents for words belonging to the category “_robot name_” and the common dictionary contents for words belonging to the category “_user name_” in the common dictionary unit 55 are reflected in the category “_proper noun_” in the task T.
[0230]
Further, for example, when a category “_last name_” and a category “_first name_” are described in a certain task T, according to the conversion table of FIG. 43, the common dictionary contents for words belonging to the category “_user name_” are converted and duplicated into the category “_last name_” and the category “_first name_”. In the step of reflecting the contents of the common dictionary in this task (FIG. 25), for example, the second entry in FIG. 24, “_user name_ cluster ID = 5”, is converted and duplicated into “_last name_ cluster ID = 5” and “_first name_ cluster ID = 5” according to the conversion table, and these are then reflected in the fixed word dictionary and the category table of this task.
[0231]
Further, for example, when no category is described in a certain task T, according to the conversion table of FIG. 44, the words belonging to the category “_robot name_”, the category “_user name_”, and the category “_place name_” in the common dictionary unit 55 are all represented by the symbol “UNK” in the task T. “UNK” means “unknown word”.
[0232]
As a result, even in a task for which no category is described, the matching unit 54 can recognize words belonging to the category “_robot name_”, the category “_user name_”, and the category “_place name_” simply by having the symbol “UNK” described in the language model 112.
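The four kinds of conversion table described above (renaming, merging, duplicating, and mapping to “UNK”) can be sketched with one small function. The table contents below are illustrative stand-ins for FIGS. 41 to 44, and `convert_entries` is a hypothetical name. Dropping entries whose category is absent from the table is an assumption of this sketch.

```python
# Hypothetical conversion tables: common-dictionary category -> task-side categories.
CONVERSION_TABLES = {
    "rename": {"_robot name_": ["_ROBOT_NAME_"], "_user name_": ["_USER_NAME_"]},
    "merge": {"_robot name_": ["_proper noun_"], "_user name_": ["_proper noun_"]},
    "duplicate": {"_user name_": ["_last name_", "_first name_"]},
    "unk": {"_robot name_": ["UNK"], "_user name_": ["UNK"], "_place name_": ["UNK"]},
}


def convert_entries(entries, table):
    """Convert (category, pronunciation) pairs into task-side pairs.

    An entry is emitted once per target category, so one source entry
    may be renamed, merged with others, or duplicated.
    Entries whose category is not in the table are skipped (assumption).
    """
    converted = []
    for category, pronunciation in entries:
        for target in table.get(category, []):
            converted.append((target, pronunciation))
    return converted
```

For example, applying the "duplicate" table to the single entry `("_user name_", "cluster ID = 5")` produces one entry under “_last name_” and one under “_first name_”, both pointing at the same cluster.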
[0233]
FIG. 45 shows an example of the external configuration of a biped robot equipped with the robot control system 1 to which the present invention is applied. In the robot 201, a head unit 211 is disposed on the upper portion of the body unit 213, arm units 212A and 212B having the same configuration are disposed on the upper left and right of the body unit 213, respectively, and leg units 214A and 214B having the same configuration are attached to predetermined positions on the lower left and right of the body unit 213, respectively.
[0234]
In the head unit 211, CCD (Charge Coupled Device) cameras 221A and 221B that function as the “eyes” of the robot 201, microphones 222A and 222B that function as its “ears”, and a speaker 223 that functions as its “mouth” are arranged at predetermined positions.
[0235]
FIG. 46 shows an example of the electrical configuration of the robot. The unit control system 231 and the dialogue control system 232 control the operation of the robot 201 in accordance with commands from the robot control system 1. That is, the unit control system 231 controls the head unit 211, the arm units 212A and 212B, and the leg units 214A and 214B of the robot 201 as necessary, and causes the robot 201 to perform a predetermined operation. Further, the dialogue control system 232 controls the utterance of the robot 201 and causes the speaker 223 to make a predetermined utterance as necessary.
[0236]
In the above description, a word is a unit that is conveniently handled as a single unit in the process of recognizing speech, and does not necessarily match a linguistic word. For example, “Taro-kun” may be treated as a single word, or may be treated as the two words “Taro” and “kun”. Furthermore, a larger unit such as “Hello Taro” may be treated as one word.
[0237]
A phoneme is a unit that is convenient to treat acoustically as a single unit in processing, and does not necessarily match a phoneme in the phonetic sense. For example, the “To” part of “Tokyo” may be represented by the three phonetic symbols “t / o / u”, or a symbol “o:” representing the long sound of “o” may be prepared. It may also be expressed as “t / o / o”. In addition, a symbol representing silence may be prepared, and it may be further subdivided into “silence before speech”, “short silence between speeches”, and “silence of the part”.
[0238]
Although the robot apparatus has been described above, the present invention can be applied to an apparatus having an application using speech recognition, speech synthesis, translation, and other language processing.
[0239]
Furthermore, the present invention can be applied to, for example, an apparatus that extracts predetermined terms from the entries registered in the Kojien dictionary and creates a dictionary of those terms.
[0240]
In the above description, the case where there are a plurality of application units has been described. However, the number of application units may be one.
[0241]
The series of processes described above can be executed by hardware or by software. When it is executed by software, the processing described above is executed by, for example, a personal computer 600 as shown in FIG. 47.
[0242]
In FIG. 47, a CPU (Central Processing Unit) 601 executes various processes according to a program stored in a ROM (Read Only Memory) 602 or a program loaded from a storage unit 608 into a RAM (Random Access Memory) 603. The RAM 603 also stores, as appropriate, data necessary for the CPU 601 to execute the various processes.
[0243]
The CPU 601, ROM 602, and RAM 603 are connected to each other via an internal bus 604. An input / output interface 605 is also connected to the internal bus 604.
[0244]
Connected to the input / output interface 605 are an input unit 606 including a keyboard and a mouse, an output unit 607 including a display such as a CRT or an LCD (Liquid Crystal Display) and a speaker, a storage unit 608 including a hard disk, and a communication unit 609 including a modem and a terminal adapter. The communication unit 609 performs communication processing via various networks including telephone lines and CATV.
[0245]
A drive 610 is also connected to the input / output interface 605 as necessary. A removable medium 621 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive as appropriate, and a computer program read from it is installed in the storage unit 608 as necessary.
[0246]
When the series of processes is executed by software, the program constituting the software is installed from a network or a recording medium into a computer incorporated in dedicated hardware, or into, for example, a general-purpose personal computer that can execute various functions by installing various programs.
[0247]
As shown in FIG. 47, this recording medium is constituted not only by a package medium consisting of the removable medium 621 on which the program is recorded and which is distributed separately from the computer in order to provide the program to the user, but also by the ROM 602 storing the program or the hard disk included in the storage unit 608, which are provided to the user in a state of being pre-installed in the apparatus main body.
[0248]
In the present specification, the steps describing a computer program include not only processes performed in time series in the described order, but also processes that are executed in parallel or individually without necessarily being processed in time series.
[0249]
Further, in this specification, the system represents the entire apparatus composed of a plurality of apparatuses.
[0250]
Incidentally, Japanese Patent Application No. 2001-382579, previously proposed by the present applicant, discloses an invention concerning a series of processes from reflecting a word acquired by an unknown word acquisition mechanism in the language model to making the word reflected in the language model appear in the recognition result.
[0251]
However, the invention of Japanese Patent Application No. 2001-382579 consists of one application that registers words and one application that uses the registered words, and does not assume that a plurality of applications perform speech recognition. Therefore, in a system in which there are a plurality of applications whose number is variable, it has been difficult to solve the above-described problem of reflecting registered words in the applications and the above-described problem of deleting or changing words registered for a plurality of applications.
[0252]
In addition, Japanese Patent Application No. 2002-072718, previously proposed by the present applicant, discloses an invention in which the unknown word “Taro” is extracted from the utterance “My name is Taro (unknown word).” and acquired as a name.
[0253]
However, with the invention of Japanese Patent Application No. 2002-072718, it has been difficult to solve the above-described problem of reflecting unknown words in the language model, the above-described problem of reflecting registered words in the applications, and the above-described problem of deleting or changing words registered for a plurality of applications.
[0254]
Therefore, in the invention of Japanese Patent Application No. 2001-382579 and the invention of Japanese Patent Application No. 2002-072718, the above-mentioned problem of reflecting an unknown word in a language model, the above-mentioned problem of reflecting a registered word in its application, In addition, it has been difficult to solve all the problems described above when deleting or changing words registered for a plurality of applications.
[0255]
In the robot control system 1 of FIG. 1, however, categories are described in the language model 112, so the above-described problem of reflecting an unknown word in the language model can be solved by making the unknown word belong to a category.
[0256]
Further, in the robot control system 1 of FIG. 1, the variable word dictionary 132, in which the words to be subjected to the speech recognition used in an application are registered, is constructed or reconstructed based on the common dictionary. This makes it possible to solve the above-described problem of reflecting registered words in the applications and the above-described problem of deleting or changing words registered for a plurality of applications.
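Reconstruction from the common dictionary can be sketched as follows. This is a simplified model: the entry layout, the function name `rebuild_variable_dictionary`, and the sequential numbering of symbols in the style of “OOV00001” are assumptions made for the illustration.

```python
def rebuild_variable_dictionary(common_dictionary, task_categories):
    """Discard the old contents and build the task's variable word dictionary
    and category table afresh from the common dictionary.

    Only entries whose category is used by the task are registered.
    Symbol numbering (OOV00001, OOV00002, ...) is an assumption of this sketch.
    """
    variable_dictionary = {}  # symbol -> pronunciation
    category_table = {}       # symbol -> category
    next_id = 1
    for entry in common_dictionary:
        if entry["category"] not in task_categories:
            continue
        symbol = f"OOV{next_id:05d}"
        next_id += 1
        variable_dictionary[symbol] = entry["kana"]
        category_table[symbol] = entry["category"]
    return variable_dictionary, category_table
```

Because every task rebuilds from the same common dictionary, a word deleted or changed there disappears or changes consistently in each task at its next rebuild, and a task started later sees the words registered before it started.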
[0257]
Furthermore, in a system that does not have a keyboard (for example, a robot), it is difficult to input pronunciation information at the time of word registration. As a means of solving this problem, a method of inputting pronunciation information by voice using, for example, a phoneme typewriter has been proposed.
[0258]
However, the phoneme typewriter sometimes misrecognizes speech, and if its output is used as is, a word may be registered with a wrong pronunciation. For example, suppose that the phoneme typewriter is made to recognize the pronunciation of “SDR”, misrecognizes it, and outputs “Ilnyal”. If “Ilnyal” is adopted as the pronunciation information, the word is registered with that pronunciation, so that later, for example, the utterance “SDR, hello” is difficult to recognize, while the utterance “Ilnyal, hello” is easily recognized.
[0259]
Therefore, in the robot control system 1 of FIG. 1, each time a word registered by voice is reflected in a task, the latest pronunciation information at that time is acquired from the unknown word acquisition unit 56. Consequently, even after a word misrecognized by the phoneme typewriter has been registered in the variable word dictionary 132, the pronunciation information is updated simply by supplying speech data to the unknown word acquisition unit 56, and a correct recognition result may be obtained using the latest pronunciation information at that time.
[0260]
That is, in the robot control system 1 of FIG. 1, the above-described problem of reflecting an unknown word in a language model, the above-described problem of reflecting a registered word in the application, and the words registered for a plurality of applications The above-described problems when deleting or changing, and the above-described problems when inputting pronunciation information of a word to be registered in a system without a keyboard, that is, all the problems described above can be solved.
[0261]
[Effect of the Invention]
As described above, according to the present invention, words can be registered. In particular, even when registering words corresponding to a plurality of applications, the registered words can be commonly used in each application. In addition, words registered before the application is started can also be used in the application. Furthermore, even when the registered word is changed, consistency can be maintained in each application.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration example of a robot control system to which the present invention is applied.
FIG. 2 is a block diagram illustrating a configuration example of a speech recognition engine unit in FIG. 1.
FIG. 3 is a diagram illustrating an example of a cluster of an unknown word acquisition unit in FIG. 2.
FIG. 4 is a flowchart for explaining robot control processing in the robot control system of FIG. 1.
FIG. 5 is a flowchart for explaining robot control processing in the robot control system of FIG. 1.
FIG. 6 is a diagram illustrating a configuration example of the task in FIG. 2.
FIG. 7 is a diagram showing an example of a phoneme list in FIG. 6.
FIG. 8 is a diagram illustrating an example of a kana phoneme conversion rule in FIG. 6.
FIG. 9 is a diagram illustrating an example of a language model of the phoneme typewriter task of FIG. 2.
FIG. 10 is a diagram showing an example of a fixed word dictionary of the phoneme typewriter task of FIG. 2.
FIG. 11 is a diagram illustrating an example of a language model of the application switching task in FIG. 2.
FIG. 12 is a diagram showing an example of a fixed word dictionary of the application switching task of FIG. 2.
FIG. 13 is a diagram showing an example of a category table for the application switching task in FIG. 2.
FIG. 14 is a diagram illustrating an example of a language model of the name registration task in FIG. 2.
FIG. 15 is a diagram showing an example of a fixed word dictionary of the name registration task of FIG. 2.
FIG. 16 is a diagram illustrating an example of a language model of the chat task in FIG. 2.
FIG. 17 is a diagram showing an example of a fixed word dictionary of the chat task in FIG. 2.
FIG. 18 is a diagram showing an example of a category table for the chat task in FIG. 2.
FIG. 19 is a diagram illustrating an example of a language model of the voice commander task in FIG. 2.
FIG. 20 is a diagram showing an example of a fixed word dictionary of the voice commander task of FIG. 2.
FIG. 21 is a flowchart for describing name registration processing in step S9 of FIG. 5.
FIG. 22 is a flowchart illustrating name recognition processing in step S43 of FIG.
FIG. 23 is a diagram illustrating an example of a cluster of the unknown word acquisition unit in FIG. 2.
FIG. 24 is a diagram illustrating an example of the common dictionary unit in FIG. 2.
FIG. 25 is a flowchart illustrating the reflection process in step S49 of FIG.
FIG. 26 is a block diagram illustrating the reflection process in FIG. 25.
FIG. 27 is a diagram showing an example of a variable word dictionary of the name registration task of FIG. 2.
FIG. 28 is a diagram showing an example of a category table for the name registration task in FIG. 2.
FIG. 29 is a flowchart illustrating word deletion or change processing in the matching unit of FIG. 2.
FIG. 30 is a diagram illustrating an example of changing the common dictionary unit in FIG. 2.
FIG. 31 is a diagram showing an example of changing the common dictionary unit in FIG. 2.
FIG. 32 is a flowchart for explaining the chat process in step S12 of FIG.
FIG. 33 is a diagram showing an example of a variable word dictionary of the chat task in FIG. 2.
FIG. 34 is a diagram showing an example of a category table for the chat task in FIG. 2.
FIG. 35 is a flowchart for describing the speech recognition processing in step S123 of FIG.
FIG. 36 is a diagram showing an example of a language score calculation formula.
FIG. 37 is a diagram showing a modification of the common dictionary unit of FIG.
FIG. 38 is a diagram showing a modification of the common dictionary unit in FIG. 2.
FIG. 39 is a diagram showing a modification of the fixed word dictionary of FIG.
FIG. 40 is a diagram showing an example of the category table in FIG. 6.
FIG. 41 is a diagram illustrating an example of a category conversion table.
FIG. 42 is a diagram illustrating an example of a category conversion table.
FIG. 43 is a diagram illustrating an example of a category conversion table.
FIG. 44 is a diagram illustrating an example of a category conversion table.
FIG. 45 is a perspective view showing an external configuration of a robot.
FIG. 46 is a block diagram showing an electrical configuration of the robot.
FIG. 47 is a diagram illustrating an example of a personal computer.
[Explanation of symbols]
11 speech recognition engine unit, 21 application unit, 31 application management unit, 51 microphone, 52 AD conversion unit, 53 feature extraction unit, 54 matching unit, 55 common dictionary unit, 56 unknown word acquisition unit, 71 task, 111 acoustic model, 112 language model, 113 dictionary, 114 phoneme list, 115 kana phoneme conversion rule, 116 search parameter, 131 fixed word dictionary, 132 variable word dictionary, 133 category table

Claims

A language processing apparatus having a plurality of applications that use language processing, comprising:
registered dictionary storage means for storing a registered dictionary in which words are registered;
construction means for constructing, for each application and based on the registered dictionary, a dedicated dictionary dedicated to that application, in which words to be subjected to the language processing used in the application are registered;
processing means for performing processing for adding, deleting, or changing a word in the registered dictionary; and
deleting means for deleting words in the dedicated dictionary,
wherein, after all the words registered in the dedicated dictionary are deleted,
the construction means reconstructs the dedicated dictionary based on the registered dictionary in which a word has been added, deleted, or changed.

The dedicated dictionary includes at least a fixed dictionary in which predetermined words are registered in advance, and a variable dictionary in which registered words are variable,
The language processing apparatus according to claim 1, wherein the construction means constructs the variable dictionary of the dedicated dictionary.

The dedicated dictionary further includes a category table in which word categories are registered,
The language processing apparatus according to claim 2, wherein the construction means constructs the variable dictionary by registering, in the variable dictionary, those words of the registered dictionary whose category is registered in the category table.

The language processing apparatus according to claim 3, further comprising:
language model storage means for storing a language model that describes chain information indicating how words of the category are chained; and
recognition processing means for performing speech recognition based on the dedicated dictionary and the language model.

A language processing method of a language processing apparatus having a plurality of applications that use language processing, the method comprising:
a registered dictionary storage step of storing a registered dictionary in which words are registered;
a construction step of constructing, for each application and based on the registered dictionary, a dedicated dictionary dedicated to that application, in which words to be subjected to the language processing used in the application are registered;
a processing step of performing processing for adding, deleting, or changing a word in the registered dictionary;
a deletion step of deleting words in the dedicated dictionary; and
a reconstruction step of reconstructing the dedicated dictionary based on the registered dictionary in which a word has been added, deleted, or changed, after all the words registered in the dedicated dictionary are deleted.

A recording medium on which is recorded a program that performs language processing of a plurality of applications, the program causing a computer to execute:
a construction step of constructing, for each application and based on a registered dictionary in which words are registered, a dedicated dictionary dedicated to that application, in which words to be subjected to the language processing used in the application are registered;
a processing step of performing processing for adding, deleting, or changing a word in the registered dictionary;
a deletion step of deleting words in the dedicated dictionary; and
a reconstruction step of reconstructing the dedicated dictionary based on the registered dictionary in which a word has been added, deleted, or changed, after all the words registered in the dedicated dictionary are deleted.

A program that performs language processing of a plurality of applications, the program causing a computer to execute:
a construction step of constructing, for each application and based on a registered dictionary in which words are registered, a dedicated dictionary dedicated to that application, in which words to be subjected to the language processing used in the application are registered;
a processing step of performing processing for adding, deleting, or changing a word in the registered dictionary;
a deletion step of deleting words in the dedicated dictionary; and
a reconstruction step of reconstructing the dedicated dictionary based on the registered dictionary in which a word has been added, deleted, or changed, after all the words registered in the dedicated dictionary are deleted.