JP2004252121A - Language processing apparatus and language processing method, and program and recording medium

Info

Publication number: JP2004252121A
Application number: JP2003042019A
Authority: JP
Inventors: Atsuo Hiroe; 厚夫廣江
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2003-02-20
Filing date: 2003-02-20
Publication date: 2004-09-09
Anticipated expiration: 2023-02-20
Also published as: JP4392581B2

Abstract

PROBLEM TO BE SOLVED: To make a registered word usable in common by a plurality of applications.
SOLUTION: When speech is input to a microphone 51, an AD conversion unit 52 converts the analog speech signal into digital speech data. A feature extraction unit 53 extracts features from the speech data and supplies them to a matching unit 54. The matching unit 54 refers to the task corresponding to each application, recognizes the speech on the basis of the features, and registers words in a common dictionary unit 55. The matching unit 54 also changes words registered in the common dictionary unit 55 and reflects the contents of the common dictionary unit 55 in the respective tasks. The invention is applicable to a robot control system.
COPYRIGHT: (C)2004, JPO&NCIPI

Description

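[0001]
[Technical Field of the Invention]
The present invention relates to a language processing apparatus, a language processing method, a program, and a recording medium, and in particular to a language processing apparatus, language processing method, and program that allow, for example, a registered word to be recognized in common by a plurality of applications.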
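[0002]
[Prior Art]
Speech recognition includes isolated word recognition, which recognizes a single word, and continuous word recognition, which recognizes a word sequence made up of multiple words. Conventional continuous word recognition uses a language model, a "database of how readily words connect to one another," to prevent "word sequences that sound similar but are nonsensical" from being produced as recognition results.
[0003]
However, a language model describes only information about the words that can be recognized from the start (hereinafter, known words), so it has been difficult to correctly recognize words registered later (hereinafter, registered words). In isolated word recognition, once a word is registered in the recognition dictionary it is recognized from then on. In continuous word recognition, registration in the dictionary alone is insufficient: the registered word must also be reflected in the language model, which is generally difficult.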
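[0004]
It has therefore been proposed to classify registered words into categories such as "personal name" and "place name," prepare a recognition grammar corresponding to those categories, and recognize speech with it (see, for example, Patent Document 1).
[0005]
In a system in which multiple applications use speech recognition, and in which the number of applications varies, reflecting a word registered in one application in the other applications raises problems that do not arise with a single application. For example, if words are registered only in the applications that are already running, then, unlike the single-application case, it is difficult to reflect the registered word in applications that are started or installed after the registration.
[0006]
Furthermore, with multiple applications, it is tedious to delete the same registered word repeatedly in each of them. Also, while deleting all registered words is easy, deleting only some of them or changing their pronunciation is difficult.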
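[0007]
That is, with a single application, the registered word to delete or change can be identified by information such as "the n-th registered word" or "the n-th entry in the recognition dictionary." With multiple applications, "the n-th registered word" and the position at which the word was added to the dictionary differ from application to application, so identification is difficult.
[0008]
With multiple applications, a registered word can be identified by its pronunciation, but doing so risks deleting or changing homophones as well.
[0009]
It has therefore been proposed that, instead of each application performing speech recognition individually, a module called a "voice commander" performs speech recognition for all applications and forwards the recognition result to each application (see, for example, Patent Document 1).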
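[0010]
[Patent Document 1]
Japanese Unexamined Patent Application Publication No. 2001-216128
[0011]
[Problems to Be Solved by the Invention]
However, in the invention described in Patent Document 1, the "voice commander" must hold the recognition dictionary and language model for each application. That is, when the "voice commander" is developed, it is necessary to anticipate which applications will be used together and to prepare suitable recognition dictionaries and language models, so it is difficult to reflect registered words in applications that were not anticipated.
[0012]
The present invention has been made in view of such circumstances, and allows a registered word to be used in common by a plurality of applications.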
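[0013]
[Means for Solving the Problems]
A language processing apparatus of the present invention comprises registration dictionary storage means for storing a registration dictionary in which words are registered, and construction means for constructing, on the basis of the registration dictionary, a dedicated dictionary for an application, in which the words subject to the language processing used by that application are registered.
[0014]
The apparatus may further comprise processing means for adding, deleting, or changing words in the registration dictionary, and deletion means for deleting words from the dedicated dictionary. After all words registered in the dedicated dictionary have been deleted, the construction means may reconstruct the dedicated dictionary on the basis of the registration dictionary in which words have been added, deleted, or changed.
[0015]
The dedicated dictionary may include at least a fixed dictionary in which predetermined words are registered in advance and a variable dictionary whose registered words can vary, and the construction means may construct the variable dictionary of the dedicated dictionary.
[0016]
The dedicated dictionary may further include a category table in which word categories are registered, and the construction means may construct the variable dictionary by registering in it those words of the registration dictionary whose categories are registered in the category table.
[0017]
There may be a plurality of applications, and the construction means may construct a dedicated dictionary for each of them.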
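[0018]
A language processing method of the present invention includes a registration dictionary storage step of storing a registration dictionary in which words are registered, and a construction step of constructing, on the basis of the registration dictionary, a dedicated dictionary for an application, in which the words subject to the language processing used by that application are registered.
[0019]
A program recorded on a recording medium of the present invention includes a construction step of constructing, on the basis of a registration dictionary in which words are registered, a dedicated dictionary for an application, in which the words subject to the language processing used by that application are registered.
[0020]
A program of the present invention causes a computer to execute a construction step of constructing, on the basis of a registration dictionary in which words are registered, a dedicated dictionary for an application, in which the words subject to the language processing used by that application are registered.
[0021]
In the present invention, a registration dictionary in which words are registered is stored, and a dedicated dictionary for an application, in which the words subject to the language processing used by that application are registered, is constructed on the basis of the registration dictionary.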
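[0022]
[Embodiments of the Invention]
Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows a configuration example of a robot control system 1 to which the present invention is applied.
[0023]
In this robot control system 1, a speech recognition engine unit 11 recognizes input speech data and generates, as the recognition result, the word sequence corresponding to the speech data. The speech recognition engine unit 11 supplies the recognition result to a name registration application unit 21_1, a chat application unit 21_2, a voice commander application unit 21_3, ..., and another application unit 21_M, as well as to an application management unit 31.
[0024]
The name registration application unit 21_1, chat application unit 21_2, voice commander application unit 21_3, ..., and other application unit 21_M perform various kinds of processing based on the recognition result supplied from the speech recognition engine unit 11.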
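[0025]
The name registration application unit 21_1 registers robot names, user names, and the like by voice, based on the recognition result supplied from the speech recognition engine unit 11. The other application units use the names registered by the name registration application unit 21_1 and control the robot's actions in response to the user's utterances.
[0026]
Accordingly, the speech recognition performed for the chat application unit 21_2, voice commander application unit 21_3, ..., and other application unit 21_M must handle the robot names, user names, and the like registered by the name registration application unit 21_1.
[0027]
The chat application unit 21_2 makes the robot chat with the user by voice, and the voice commander application unit 21_3 makes the robot act in response to the user's utterances. For example, in response to an utterance such as 「エスディーアール（ロボット名）、前に進め！」 ("SDR [robot name], move forward!"), the voice commander application unit 21_3 moves the robot forward.
[0028]
Any number of application units can be provided. Hereinafter, when there is no need to distinguish the name registration application unit 21_1, chat application unit 21_2, voice commander application unit 21_3, ..., and other application unit 21_M individually, they are collectively referred to as application units 21.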
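[0029]
The application management unit 31 issues start and stop commands to the application units 21 based on the recognition result supplied from the speech recognition engine unit 11. For example, when the recognition result 「音声コマンダを起動」 ("start the voice commander") is supplied, the application management unit 31 starts the voice commander application unit 21_3. Multiple application units may be started at the same time.
[0030]
The application units 21 and the application management unit 31 also issue task switching commands to the speech recognition engine unit 11, thereby controlling whether their corresponding tasks (described later with reference to FIG. 2) are enabled (active) or disabled (inactive) inside the speech recognition engine unit 11.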
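[0031]
FIG. 2 shows the configuration of the speech recognition engine unit 11. The user's utterance is input to a microphone 51, which converts it into a speech signal as an electrical signal and supplies that signal to an AD (Analog-to-Digital) conversion unit 52. The AD conversion unit 52 samples and quantizes the analog speech signal from the microphone 51 and converts it into speech data, which is a digital signal. The speech data is supplied to a feature extraction unit 53.
[0032]
For each appropriate frame of the speech data from the AD conversion unit 52, the feature extraction unit 53 extracts feature parameters such as the spectrum, power, linear prediction coefficients, cepstrum coefficients, and line spectrum pairs, and supplies them to a matching unit 54.
[0033]
Based on the feature parameters from the feature extraction unit 53, the matching unit 54 finds, for each task that is enabled at that time among a phoneme typewriter task 71_1, an application switching task 71_2, a name registration task 71_3, a chat task 71_4, a voice commander task 71_5, ..., and another task 71_N, the word sequence closest to the speech input to the microphone 51 (the input speech) as the recognition result, referring to the databases inside the task as needed. The matching unit 54 supplies the recognition results to the application units 21 corresponding to the respective tasks and to the application management unit 31.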
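[0034]
A task is the set of data needed to perform speech recognition. That is, if the speech recognition engine unit 11 is divided into a program part that performs matching and the like and a data part such as the acoustic model, language model, and recognition dictionary, a task is the data part together with the program for accessing that data.
[0035]
Therefore, even when multiple applications perform speech recognition with different acoustic models, language models, and dictionaries, a single speech recognition engine unit suffices if multiple tasks are prepared. The internals of a task are described later with reference to FIG. 6.
[0036]
The phoneme typewriter task 71_1 is a task that works as a phoneme typewriter and is enabled by a command from the speech recognition engine unit 11. With this phoneme typewriter, the matching unit 54 obtains, for arbitrary input speech, a phoneme sequence as well as a pronunciation in kana notation. For example, from the speech 「君の名前はエスディーアールだよ」 ("your name is SDR"), it obtains the phoneme sequence "k/i/m/i/n/o/n/a/m/a/e/w/a/e/s/u/d/i:/a:/r/u/d/a/y/o" ("i:" and "a:" are the long vowels of "i" and "a") and the kana notation 「キミノナマエワエスディーアールダヨ」. The phoneme sequence and kana notation are used by an unknown word acquisition unit 56.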
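[0037]
The application switching task 71_2 corresponds to the application management unit 31. It is enabled when, after the application management unit 31 has started, that unit supplies a task switching command. With the application switching task 71_2, the matching unit 54 recognizes speech corresponding to commands that start or stop application units, such as 「雑談アプリを起動」 ("start the chat app"), 「音声コマンダを起動」 ("start the voice commander"), and 「名前登録を起動して」 ("start name registration").
[0038]
The name registration task 71_3 corresponds to the name registration application unit 21_1. It is enabled when, after the name registration application unit 21_1 has been started by a command from the application management unit 31, that application unit supplies a task switching command. With the name registration task 71_3, the matching unit 54 recognizes speech that states a name, such as 「君の名前は、＜ロボット名を表す未知語＞だよ。」 ("your name is <unknown word denoting a robot name>") and 「私の名前は、＜人名を表す未知語＞です。」 ("my name is <unknown word denoting a person's name>").
[0039]
The chat task 71_4, voice commander task 71_5, ..., and other task 71_N correspond to the chat application unit 21_2, voice commander application unit 21_3, ..., and other application unit 21_M, respectively. Each is enabled when, after the corresponding application unit has been started by a command from the application management unit 31, that application unit supplies a task switching command.
[0040]
With the chat task 71_4, the matching unit 54 can recognize a chat utterance from the user such as 「エスディーアール（ロボット名）、何時に起きたの？」 ("SDR [robot name], what time did you get up?"). With the voice commander task 71_5, it can recognize a command utterance such as 「エスディーアール（ロボット名）、前に１歩進め」 ("SDR [robot name], take one step forward").
[0041]
The matching unit 54 also reflects the words registered in a common dictionary unit 55, described later, in each task.
[0042]
Hereinafter, when there is no need to distinguish the phoneme typewriter task 71_1, application switching task 71_2, name registration task 71_3, chat task 71_4, voice commander task 71_5, ..., and other task 71_N individually, they are collectively referred to as tasks 71.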
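[0043]
The common dictionary unit 55 stores a common dictionary, which is a dictionary of words used in common by the tasks 71. For every word registered in it, the common dictionary describes pronunciation information and category information. For example, when the proper noun 「エスディーアール」 (SDR, a robot name) is registered, the pronunciation (phoneme information) 「エスディーアール」 and the category "_ロボット名_" (robot name) are described in the common dictionary. Details are given later with reference to FIG. 24.
[0044]
For words such as names that are not registered in the recognition dictionary (the fixed word dictionary 131 described later with reference to FIG. 6), that is, unknown words, the unknown word acquisition unit 56 stores the phoneme sequence and kana notation that were recognized with the phoneme typewriter task 71_1 and supplied from the matching unit 54, so that the speech of that word can thereafter be recognized (distinguished from other speech).
[0045]
That is, the unknown word acquisition unit 56 classifies the phoneme sequences and kana notations of unknown words recognized with the phoneme typewriter task 71_1 into clusters. Each cluster has an ID, a representative phoneme sequence, and a representative kana notation, and is managed by its ID.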
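[0046]
FIG. 3 shows the state of the clusters of the unknown word acquisition unit 56.
[0047]
When three utterances, 「あか」 (red), 「あお」 (blue), and 「みどり」 (green), are input, the unknown word acquisition unit 56 classifies them into three corresponding clusters, the 「あか」 cluster 91, the 「あお」 cluster 92, and the 「みどり」 cluster 93. To each cluster it attaches a representative phoneme sequence ("a/k/a", "a/o", and "m/i/d/o/r/i" in the example of FIG. 3), a representative kana notation (「アカ」, 「アオ」, and 「ミドリ」), and an ID ("1", "2", and "3").
[0048]
If 「あか」 is input again, the corresponding cluster already exists, so the unknown word acquisition unit 56 classifies the input speech into the 「あか」 cluster 91 and creates no new cluster. If 「くろ」 (black) is input, no corresponding cluster exists, so the unknown word acquisition unit 56 newly creates a 「くろ」 cluster 94 and attaches to it a representative phoneme sequence ("k/u/r/o" in the example of FIG. 3), a representative kana notation (「クロ」), and an ID ("4").
[0049]
With this method, the user can improve the accuracy of each cluster's representative phoneme sequence and representative kana pronunciation by inputting the same speech repeatedly. For example, suppose that when 「みどり」 is first input, the phoneme typewriter misrecognizes it and outputs the phoneme sequence "m/e/r/a/a" and the kana pronunciation 「メラア」. As 「みどり」 is uttered again and again, the phoneme sequence and kana pronunciation may converge to the correct values ("m/i/d/o/r/i" and 「ミドリ」). Details of this word acquisition processing are disclosed in Japanese Patent Application Nos. 2001-097843 and 2001-382579, previously proposed by the present applicant.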
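[0050]
Next, the robot control processing in the robot control system 1 of FIG. 1 is described with reference to FIGS. 4 and 5. This processing starts when the user starts the robot control system 1.
[0051]
In step S1, the speech recognition engine unit 11 starts, and the process proceeds to step S2. In step S2, the speech recognition engine unit 11 loads the contents of the common dictionary unit 55 (the common dictionary) and the cluster state of the unknown word acquisition unit 56 that were saved in a storage unit (not shown) when the robot control system 1 was last shut down (step S17, described later). If neither the common dictionary nor the cluster state is stored, the common dictionary unit 55 and the clusters of the unknown word acquisition unit 56 are left with no entries. If the cluster state is stored but the common dictionary is not, only the common dictionary is initialized (left with no entries). Conversely, if the common dictionary is stored but the cluster state is not, the cluster-derived entries (the entries for which a cluster ID is described in FIG. 24) are deleted from the common dictionary, and the kana-derived entries (the entries for which a kana pronunciation is described in FIG. 24) are kept.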
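The four cases of step S2 can be sketched as follows. This is only an illustrative sketch: the `Entry` layout, the function name `restore_state`, and the use of `None` for "nothing was saved" are assumptions, not part of the original description.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    """One common-dictionary entry (hypothetical layout)."""
    category: str
    kana: Optional[str] = None        # set when the word was entered as kana
    cluster_id: Optional[int] = None  # set when the word was entered by voice


def restore_state(
    saved_dictionary: Optional[List[Entry]],
    saved_clusters: Optional[Dict[int, str]],
) -> Tuple[List[Entry], Dict[int, str]]:
    """Return (common dictionary, clusters) for the four cases of step S2.

    None stands for "nothing was saved in the storage unit".
    """
    if saved_clusters is None:
        # Clusters are missing: start with no clusters and keep only the
        # kana-derived entries, since cluster IDs would point nowhere.
        kept = [e for e in (saved_dictionary or []) if e.cluster_id is None]
        return kept, {}
    if saved_dictionary is None:
        # Clusters were saved but the dictionary was not: start it empty.
        return [], dict(saved_clusters)
    # Both were saved: load them as they are.
    return list(saved_dictionary), dict(saved_clusters)
```

Dropping the cluster-derived entries when the clusters are missing keeps the common dictionary from referring to cluster IDs that no longer exist.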
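[0052]
After step S2, the process proceeds to step S3, where the speech recognition engine unit 11 enables the phoneme typewriter task 71_1 so that it can be used for speech recognition, and then to step S4. In step S4, the application management unit 31 starts, and the process proceeds to step S5.
[0053]
In step S5, the application management unit 31 enables its corresponding task, the application switching task 71_2, and the process proceeds to step S6. In step S6, the speech recognition engine unit 11 recognizes a start command for an application unit 21 spoken into the microphone 51 and supplies the recognition result to the application management unit 31. Details of this speech recognition processing are described later with the flowchart of FIG. 35.
[0054]
After step S6, the process proceeds to step S7 of FIG. 5, where the application management unit 31 determines, from the recognition result supplied by the speech recognition engine unit 11, whether to start the name registration application unit 21_1. If it determines that the unit should be started (for example, when the recognition result is 「名前登録を起動」, "start name registration"), the process proceeds to step S8.
[0055]
In step S8, the application management unit 31 starts the name registration application unit 21_1. The process then proceeds to step S9, where the name registration application unit 21_1 performs name registration processing. Details of the name registration processing are described later with the flowchart of FIG. 21.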
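[0056]
If the application management unit 31 determines in step S7 not to start the name registration application unit 21_1, the process proceeds to step S10, where it determines from the recognition result whether to start the chat application unit 21_2. If it determines in step S10 that the chat application unit 21_2 should be started (for example, when the recognition result is 「雑談を起動して」, "start chat"), the process proceeds to step S11, where the chat application unit 21_2 is started.
[0057]
After step S11, the process proceeds to step S12, where the chat application unit 21_2 performs chat processing. Details of the chat processing are described later with the flowchart of FIG. 32.
[0058]
If the application management unit 31 determines in step S10 not to start the chat application unit 21_2, the process proceeds to step S13, where it determines from the recognition result whether to start the voice commander application unit 21_3. If it determines in step S13 that the voice commander application unit 21_3 should be started (for example, when the recognition result is 「音声コマンダ起動」, "start voice commander"), the process proceeds to step S14, where the voice commander application unit 21_3 is started.
[0059]
After step S14, the process proceeds to step S15, where the voice commander application unit 21_3 performs voice commander processing. Details of the voice commander processing are described later with the flowchart of FIG. 32.
[0060]
If the application management unit 31 determines in step S13 not to start the voice commander application unit 21_3, the recognition result is wrong (or the utterance was something other than an application switch), so the process returns to step S6 of FIG. 4, and the speech recognition engine unit 11 recognizes newly input speech.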
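[0061]
In this way, the application management unit 31 starts the application units 21 according to the recognition result from the speech recognition engine unit 11.
[0062]
After step S9, S12, or S15, the process proceeds to step S16, where the application management unit 31 determines whether to end the robot control processing. For example, the application management unit 31 determines whether the user has pressed an end button (not shown) and, if so, determines that the robot control processing is to end.
[0063]
If it is determined in step S16 that the robot control processing is not to end, the process returns to step S6 of FIG. 4, and the recognition of input speech is repeated. If the application management unit 31 determines in step S16 that the robot control processing is to end (the end button has been pressed), the process proceeds to step S17, where the common dictionary of the common dictionary unit 55 and the cluster state of the unknown word acquisition unit 56 are saved in the storage unit (not shown).
[0064]
The application management unit 31 then terminates any application units 21 that are running. At this time, each application unit 21 disables its corresponding task 71. The application management unit 31 also disables the application switching task 71_2, the speech recognition engine unit 11 disables the phoneme typewriter task 71_1, and the application management unit 31 and speech recognition engine unit 11 end their processing.
[0065]
The processing above was described for three application units: the name registration application unit 21_1, the chat application unit 21_2, and the voice commander application unit 21_3. If there are further application units, then when it is determined in step S13 not to start the voice commander application, the process does not return to step S6. Instead, as in steps S7, S10, and S13, it is determined whether to start each of the other applications, and they are started according to the result.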
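The chain of determinations in steps S7, S10, and S13, extended to any number of applications, can be sketched as a single lookup loop. The phrase table, the application keys, and the substring matching are all assumptions made for illustration; the source only gives a few example recognition results.

```python
from typing import Dict, List, Optional

# Hypothetical launch phrases, taken from the example recognition results.
LAUNCH_PHRASES: Dict[str, List[str]] = {
    "name_registration": ["名前登録を起動"],
    "chat": ["雑談を起動"],
    "voice_commander": ["音声コマンダ起動", "音声コマンダを起動"],
}


def select_application(
    result: str, phrases: Dict[str, List[str]] = LAUNCH_PHRASES
) -> Optional[str]:
    """Return the application to start, or None to go back to step S6."""
    for app, patterns in phrases.items():
        # One iteration corresponds to one determination step (S7, S10, S13, ...).
        if any(pattern in result for pattern in patterns):
            return app
    return None
```

Adding a further application then only requires another table entry, which mirrors the remark that the same determination is repeated for each additional application.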
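[0066]
In the processing above, the end of speech recognition is commanded by the user, but the robot control system 1 may decide it automatically, for example by ending when no speech has been input for a predetermined time.
[0067]
According to the processing above, the application switching task 71_2 remains enabled while the application units 21 are running, so an utterance of the form 「○○を起動して」 ("start XX") made while another application unit is running can also be recognized, and the corresponding application can be started. For example, if the user says 「雑談を起動して」 ("start chat") while the voice commander application unit 21_3 is running, the chat application unit 21_2 can be started.
[0068]
In this case, one of three behaviors applies: the running application unit is terminated before the new one is started; the running application unit is suspended, the new one is started, and the original is resumed after the new one ends; or both run in parallel. Which one applies is preset for each combination of application units (it may also be decided dynamically from resource constraints such as memory).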
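[0069]
FIG. 6 shows the configuration of a task 71. The task 71 consists of an acoustic model 111, a language model 112, a dictionary 113, a phoneme list 114, kana-phoneme conversion rules 115, and search parameters 116.
[0070]
The acoustic model 111 stores models representing the acoustic features of the individual phonemes, syllables, and so on of the speech to be recognized. An HMM (Hidden Markov Model), for example, can be used as the acoustic model.
[0071]
The language model 112 describes information on how the words registered in the word dictionaries of the dictionary 113 chain (connect) with one another (hereinafter, chain information). Description methods include statistical word chain probabilities (n-grams), generative grammars, and finite state automata.
[0072]
Besides chain information about words, the language model 112 also contains chain information about categories, which classify words from a particular viewpoint. For example, if the "category consisting of words that denote user names" is represented by the symbol "_ユーザ名_" (user name) and the "category consisting of words that denote robot names" by the symbol "_ロボット名_" (robot name), the language model 112 also describes chain information for "_ユーザ名_" and "_ロボット名_" (chains between categories, chains between a category and the words stored in the dictionary in advance, and so on).
[0073]
Therefore, chain information can be obtained even for words that are not included in the language model 112. For example, to obtain the chain information between 「エスディーアール」 (SDR) and 「は」 (a particle), even if the language model 112 describes no chain information for 「エスディーアール」, once it is known that 「エスディーアール」 belongs to the category represented by the symbol "_ロボット名_", the chain information between "_ロボット名_" and 「は」 can be obtained in its place.
[0074]
Categories may be based on part of speech ("_名詞_" (noun), "_動詞_" (verb), "_助詞_" (particle), and so on) rather than on semantic attributes ("_ロボット名_", "_ユーザ名_", "_地名_" (place name), "_店名_" (shop name), and so on). Hereinafter, the notation "_..._" denotes a category name.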
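[0075]
The dictionary 113 consists of a fixed word dictionary 131, a variable word dictionary 132, and a category table 133.
[0076]
The fixed word dictionary 131 describes various kinds of information, such as the pronunciation (phoneme sequence) and a model describing the chain relations of phonemes and syllables, for words that are not subject to word registration or deletion, that is, words preset in the robot control system 1 (hereinafter, fixed words).
[0077]
The fixed word dictionary 131 describes, for each task 71, information on the words dedicated to the application unit 21 corresponding to that task 71. The same applies to the acoustic model 111 and language model 112 described above, and to the category table 133, phoneme list 114, kana-phoneme conversion rules 115, and search parameters 116 described later.
[0078]
The variable word dictionary 132 describes various kinds of information, such as the pronunciation and a model describing the chain relations of phonemes and syllables, for words that are subject to word registration and deletion, that is, registered words. When a new registered word is registered in the common dictionary unit 55, it is reflected in the variable word dictionary 132. This reflection processing is described later with reference to FIG. 25. Words can be deleted and pronunciations changed only for entries of the variable word dictionary 132. The variable word dictionary 132 may be empty.
[0079]
The category table 133 stores a table showing the correspondence between the categories contained in the language model 112 and information on the words belonging to those categories. If the task 71 assigns its own IDs (category IDs) to categories, the category table 133 also stores the correspondence between each category symbol and its ID. For example, if the category "_ロボット名_" (robot name) is assigned the category ID "4", category ID = 4 is also stored for "_ロボット名_". If the language model 112 contains no categories, the category table 133 stores nothing.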
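[0080]
The phoneme list 114 is the list of phoneme symbols used in the task 71. The kana-phoneme conversion rules 115 are rules for converting a kana string into a phoneme sequence. Because the kana-phoneme conversion rules 115 are stored for each task in this way, the common dictionary unit 55 can hold, as pronunciation information, kana strings, which are independent of the phoneme sequences.
[0081]
The search parameters 116 hold the parameters that the matching unit 54 uses when matching (searching). Some parameters depend on the acoustic model 111, some on the vocabulary size, and some on the type of the language model 112, so they must be held for each task. Parameters that do not depend on the task, however, may be held in common in the speech recognition engine unit 11.
[0082]
In the description above, all data is stored for each task, but data used in common by multiple tasks can be shared among the tasks to reduce memory usage. For example, if the phoneme list 114 is common to all tasks, a single phoneme list 114 can be prepared in the speech recognition engine unit 11 and referenced by each task. In that case, a single set of kana-phoneme conversion rules 115 also suffices.
[0083]
Two kinds of acoustic model 111 may also be prepared, one for quiet environments (an acoustic model that gives a high recognition rate in a quiet environment) and one for noisy environments (an acoustic model that gives a reasonable recognition rate even in a noisy environment), with each task referring to one of them.
[0084]
For example, the name registration task 71_3 and chat task 71_4 are assumed to be used in a quiet environment, so they can refer to the acoustic model 111 for quiet environments, while the voice commander task 71_5 is assumed to be used in a noisy environment (one in which the robot's operating noise is loud), so it can refer to the acoustic model for noisy environments.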
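[0085]
FIG. 7 shows an example of the phoneme list 114 of FIG. 6. In FIG. 7, one symbol represents one phoneme (or its equivalent). In the phoneme list 114 of FIG. 7, a vowel followed by a colon (for example, "a:") represents a long vowel, and "N" represents the moraic nasal (「ん」). "sp", "silB", "silE", and "q" all represent silence: respectively "silence within an utterance", "silence before an utterance", "silence after an utterance", and "the geminate consonant (「っ」)".
[0086]
FIG. 8 shows an example of the kana-phoneme conversion rules 115 of FIG. 6. According to the kana-phoneme conversion rules 115 of FIG. 8, the kana string 「エスディーアール」, for example, is converted into the phoneme sequence "e/s/u/d/i:/a:/r/u".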
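A kana-to-phoneme conversion of this kind can be sketched as a longest-match table lookup. The rule table below is a small hypothetical subset chosen to cover the examples in this document (the actual rules of FIG. 8 are not reproduced here), and the handling of the long-vowel mark 「ー」 as "append a colon to the previous phoneme" is an assumption that happens to reproduce the phoneme sequences quoted in the text.

```python
from typing import Dict, List

# Hypothetical subset of kana-phoneme rules; not the actual table of FIG. 8.
RULES: Dict[str, List[str]] = {
    "ディ": ["d", "i"], "キョ": ["ky", "o"],
    "エ": ["e"], "ス": ["s", "u"], "ア": ["a"], "ル": ["r", "u"],
    "キ": ["k", "i"], "タ": ["t", "a"], "シ": ["sh", "i"], "ナ": ["n", "a"],
    "ガ": ["g", "a"], "ワ": ["w", "a"], "ト": ["t", "o"],
}


def kana_to_phonemes(kana: str) -> List[str]:
    """Convert a kana string to a phoneme list by longest-match lookup."""
    phonemes: List[str] = []
    i = 0
    while i < len(kana):
        if kana[i] == "ー":
            # Long-vowel mark: lengthen the preceding vowel ("i" -> "i:").
            if not phonemes:
                raise ValueError("long-vowel mark at start of string")
            phonemes[-1] += ":"
            i += 1
            continue
        for size in (2, 1):  # try two-character kana (e.g. ディ) first
            chunk = kana[i:i + size]
            if chunk in RULES:
                phonemes.extend(RULES[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no rule for {kana[i]!r}")
    return phonemes


print("/".join(kana_to_phonemes("エスディーアール")))  # e/s/u/d/i:/a:/r/u
```

Because the conversion lives with each task, the same kana string could yield a different phoneme sequence under a task that uses a different phoneme list.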
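[0087]
Next, examples of the language model 112 and dictionary 113 (FIG. 6) of each task are shown.
[0088]
FIG. 9 shows an example of the language model 112 (FIG. 6) of the phoneme typewriter task 71_1. In FIG. 9, the variable "$SYLLABLE" on the first line joins all kana notations with "|", which means "or", and therefore stands for any one of those kana notations.
[0089]
That is, the phoneme typewriter task 71_1 is here taken to be a task for speech recognition in units of syllables, and the language model 112 of FIG. 9 expresses, as a grammar in BNF (Backus-Naur Form), the chain rule that any syllable can connect to any other syllable. A statistical language model, described later, may be used as the language model 112 instead.
[0090]
FIG. 10 shows an example of the fixed word dictionary 131 (FIG. 6) of the phoneme typewriter task 71_1. A "symbol" is a character string that identifies a word; a kana notation, for example, can be used. Entries with the same symbol are regarded as entries of the same word. The language model 112 is expressed using these symbols. 「<先頭>」 (start) and 「<終端>」 (end) are special symbols that represent "silence before the utterance" and "silence after the utterance", respectively (the same applies to FIG. 11 and later figures).
[0091]
The "transcription" is the written form of the word, and the character string output as the recognition result is this transcription. The "phoneme sequence" is the pronunciation of the word expressed as a phoneme sequence.
[0092]
Nothing is stored in the variable word dictionary 132 of the phoneme typewriter task 71_1, because words are not expected to be added to this task. Also, as FIG. 9 shows, the language model 112 of the phoneme typewriter task 71_1 contains no categories, so nothing is stored in the category table 133 either.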
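[0093]
FIG. 11 shows an example of the language model 112 (FIG. 6) of the application switching task 71_2. The language model 112 of FIG. 11 is written as a BNF grammar. The variable "$APPLICATIONS" on the first line joins all application names (「雑談」 (chat), 「音声コマンダ」 (voice commander), 「名前登録」 (name registration), and so on) with "|", which means "or", and therefore stands for any one of the application names.
[0094]
In the variable "$UTTERANCE" on the second line, "_ロボット名_" (robot name) and 「を」 are each enclosed in "[]", which means "optional", so the variable stands for 「（ロボット名）アプリケーション名（を）起動して」 ("(robot name,) start (the) application name"). Here, "robot name" denotes a word registered in the category "_ロボット名_".
[0095]
For example, if 「エスディーアール」 (SDR) is registered in "_ロボット名_", utterances such as 「エスディーアール、音声コマンダ（を）起動して」 and 「音声コマンダ（を）起動して」 are recognized using the language model 112 of FIG. 11.
[0096]
By describing the language model 112 with category names in this way, even a newly registered word can be handled: if the word belongs to a category described in the language model 112, an utterance containing that word can be recognized using the language model 112.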
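The effect of writing the grammar with a category name can be illustrated by expanding a pattern into the word sequences it accepts. The sketch below is an assumption-laden illustration: the pattern is reconstructed from the description of "$UTTERANCE" rather than copied from FIG. 11, and the `(token, optional)` representation and the function name `expand` are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def expand(
    pattern: List[Tuple[str, bool]],
    variables: Dict[str, List[str]],
    categories: Dict[str, List[str]],
) -> List[List[str]]:
    """List every word sequence accepted by a flat pattern.

    pattern is a list of (token, optional) pairs; a token may be a
    variable name, a category name, or a literal word.
    """
    choices = []
    for token, optional in pattern:
        if token in variables:
            alternatives = [[w] for w in variables[token]]
        elif token in categories:
            alternatives = [[w] for w in categories[token]]
        else:
            alternatives = [[token]]
        if optional:
            alternatives = alternatives + [[]]  # the token may be skipped
        choices.append(alternatives)
    return [sum(combo, []) for combo in product(*choices)]


VARIABLES = {"$APPLICATIONS": ["雑談", "音声コマンダ", "名前登録"]}
UTTERANCE = [("_ロボット名_", True), ("$APPLICATIONS", False),
             ("を", True), ("起動して", False)]

before = expand(UTTERANCE, VARIABLES, {"_ロボット名_": []})
after = expand(UTTERANCE, VARIABLES, {"_ロボット名_": ["エスディーアール"]})
```

Registering a word in the category enlarges the accepted set without touching the grammar itself: `after` contains `["エスディーアール", "音声コマンダ", "を", "起動して"]`, which `before` does not.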
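[0097]
FIG. 12 shows an example of the fixed word dictionary 131 (FIG. 6) of the application switching task 71_2. For the symbols that appear in the grammar of the language model 112 of FIG. 11 (「雑談」, 「音声コマンダ」, and so on), the fixed word dictionary 131 of FIG. 12 describes the transcription and the phoneme sequence.
[0098]
FIG. 13 shows an example of the category table 133 (FIG. 6) of the application switching task 71_2. The category table 133 stores the kinds of categories used in the language model 112 and information on the words belonging to each category. When the language model 112 is as shown in FIG. 11, the language model 112 of the application switching task 71_2 uses the category "_ロボット名_" (robot name), so "_ロボット名_" is entered in the category table 133 as shown in FIG. 13. In FIG. 13, the set of words belonging to the category "_ロボット名_" is the empty set, which indicates that no word belongs to "_ロボット名_" yet.
[0099]
As shown in FIG. 13, even when a category is entered in the category table 133, if no word belongs to that entry (the set is empty), the variable word dictionary 132 stores no information on words belonging to that category.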
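[0100]
FIG. 14 shows an example of the language model 112 (FIG. 6) of the name registration task 71_3. The language model 112 of FIG. 14 is written as a BNF grammar. The variable "$UTTERANCE" joins 「私［の名前］は<OOV>［です］［といいます］」 and 「君［の名前］は<OOV>［というん］だよ」 with "|", which means "or", and 「の名前」, 「です」, 「といいます」, and 「というん」 are each enclosed in "[]", which means "optional".
[0101]
Accordingly, 「私（の名前）は<OOV>（です）（といいます）」 ("I am / my name is <OOV>") or 「君（の名前）は<OOV>（というん）だよ」 ("you are / your name is <OOV>") is recognized using the language model 112 of FIG. 14. <OOV> is a symbol meaning "Out Of Vocabulary" and stands for a phrase of arbitrary pronunciation (a word not described in the fixed word dictionary 131).
[0102]
With the symbol <OOV>, utterances such as 「私の名前は太郎です」 ("my name is Taro") and 「君の名前はエスディーアールだよ」 ("your name is SDR"), in which 「太郎」 and 「エスディーアール」 are not described in the fixed word dictionary 131, are matched by 「<先頭>私の名前は<OOV>です<終端>」 and 「<先頭>君の名前は<OOV>だよ<終端>」 of the language model 112 of FIG. 14, respectively, so that the speech recognition results 「私の名前はタロウです」 and 「君の名前はエスディーアールだよ」 can be obtained.
[0103]
FIG. 15 shows an example of the fixed word dictionary 131 (FIG. 6) of the name registration task 71_3. For the symbols that appear in the grammar of a language model 112 such as that of FIG. 14, the fixed word dictionary 131 describes the transcription and the phoneme sequence.
[0104]
Nothing is stored in the variable word dictionary 132 of the name registration task 71_3, because words are not expected here to be added to this task. Also, as FIG. 14 shows, the language model 112 of the name registration task 71_3 contains no categories, so nothing is stored in the category table 133 either.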
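[0105]
FIG. 16 shows an example of the language model 112 (FIG. 6) of the chat task 71_4. Chat involves a large vocabulary and many variations of utterance, so a statistical language model is used as the language model 112. A statistical language model describes the chain information of words as conditional probabilities. The language model 112 of FIG. 16 uses tri-grams, which represent the probability of a sequence of three words 1, 2, and 3, that is, of a three-word chain.
[0106]
In FIG. 16, "P(word 3 | word 1 word 2)" is the probability that "word 3" appears next when the word sequence contains "word 1" followed by "word 2". For example, when the sequence 「<先頭> "_ロボット名_"」 (start of utterance followed by a robot name) occurs, the probability that 「は」 appears next is 0.012. These probabilities are obtained in advance by analyzing a large amount of text that transcribes chat. Besides tri-grams, bi-grams (two-word chain probabilities), uni-grams (word occurrence probabilities), and the like can also be used as the language model 112 as needed.
[0107]
In the language model 112 of FIG. 16, as in FIG. 11, the grammar is described with categories as well as words. That is, "_ロボット名_" and "_地名_" in FIG. 16 denote the categories "_ロボット名_" (robot name) and "_地名_" (place name). Because the tri-grams are described with these categories, when a word denoting a robot name or a place name is registered in the variable word dictionary 132, that word can be recognized with the chat task 71_4.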
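The substitution of a category symbol for a registered word when looking up tri-gram chain information can be sketched as follows. The table holds only the single probability quoted above (0.012); the function names, the dictionary-based layout, and the fallback value of 0.0 for unseen tri-grams are assumptions made for this sketch (a real statistical model would normally back off to bi-grams or uni-grams).

```python
from typing import Dict, Set, Tuple

Trigram = Tuple[str, str, str]


def to_symbol(word: str, fixed_words: Set[str], word_category: Dict[str, str]) -> str:
    """Map a registered word to its category symbol; leave fixed words as is."""
    if word in fixed_words:
        return word
    return word_category.get(word, word)


def chain_probability(
    trigrams: Dict[Trigram, float],
    fixed_words: Set[str],
    word_category: Dict[str, str],
    w1: str, w2: str, w3: str,
) -> float:
    """Look up P(w3 | w1 w2), using category symbols for registered words."""
    key = tuple(to_symbol(w, fixed_words, word_category) for w in (w1, w2, w3))
    return trigrams.get(key, 0.0)  # 0.0 for unseen tri-grams is a simplification


TRIGRAMS = {("<先頭>", "_ロボット名_", "は"): 0.012}
FIXED_WORDS = {"<先頭>", "は"}
WORD_CATEGORY = {"エスディーアール": "_ロボット名_"}  # registered word -> category

p = chain_probability(TRIGRAMS, FIXED_WORDS, WORD_CATEGORY,
                      "<先頭>", "エスディーアール", "は")
```

Here `p` is the probability stored for the category, even though 「エスディーアール」 itself never appears in the tri-gram table.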
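[0108]
FIG. 17 shows an example of the fixed word dictionary 131 of the chat task 71_4. For the symbols that appear in the grammar of a language model 112 such as that of FIG. 16, the fixed word dictionary 131 describes the transcription and the phoneme sequence.
[0109]
FIG. 18 shows an example of the category table 133 of the chat task 71_4. The category table 133 stores the kinds of categories used in the language model 112 and information on the words belonging to each category. When the language model 112 is as shown in FIG. 16, the language model 112 of the chat task 71_4 uses the two categories "_ロボット名_" (robot name) and "_地名_" (place name), so these two categories are entered in the category table 133 as shown in FIG. 18. FIG. 18 shows that no words yet belong to the categories "_ロボット名_" and "_地名_".
[0110]
FIG. 19 shows an example of the language model 112 (FIG. 6) of the voice commander task 71_5. The language model 112 of FIG. 19 is written as a BNF grammar. The variable "$NUMBER" on the first line joins numerals (「１」, 「２」, 「３」, and so on) with "|", which means "or", and therefore stands for any one of the numerals.
[0111]
The variable "$DIRECTION" on the second line joins directions (「前」 (forward), 「後」 (back), 「右」 (right), 「左」 (left), and so on) with "|", which means "or", and therefore stands for any one of the directions. The variable "$UTTERANCE" on the third line consists of "_ロボット名_", 「$DIRECTIONに」, and 「$NUMBER歩」 followed by 「進め」 ("advance"), and "_ロボット名_", 「$DIRECTIONに」, and 「$NUMBER歩」 are each enclosed in "[]", which means "optional".
[0112]
Accordingly, with the language model 112 of FIG. 19, speech such as 「（ロボット名）前に３歩進め」 ("(robot name,) advance three steps forward") is recognized.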
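[0113]
FIG. 20 shows an example of the fixed word dictionary 131 of the voice commander task 71_5. For the symbols that appear in the grammar of a language model 112 such as that of FIG. 19, the fixed word dictionary 131 describes the transcription and the phoneme sequence.
[0114]
The symbols 「１」 and 「歩」 each appear twice, which indicates that 「１」 and 「歩」 each have two pronunciations (「イチ」 and 「イッ」, and 「ホ」 and 「ポ」). As a result, utterances with the different pronunciations 「イチホ」 and 「イッポ」, for example, can both be recognized as the same 「１歩」 (one step).
[0115]
When the language model 112 is as shown in FIG. 19, the language model 112 of the voice commander task 71_5 uses only the category "_ロボット名_" (robot name), so the category table 133 of the voice commander task 71_5 is the same as the category table 133 of the application switching task 71_2 shown in FIG. 13. Also, as long as no word belonging to "_ロボット名_" has been uttered, nothing is stored in the variable word dictionary 132 of the voice commander task 71_5.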
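[0116]
Next, the name registration processing that the name registration application unit 21_1 performs in step S9 of FIG. 5 is described in detail with the flowchart of FIG. 21. This processing starts when the name registration application unit 21_1 is started by the user's utterance. Before it starts, the user selects a name registration mode in advance, for example with a mode switch button (not shown): either a voice input mode, in which a name is input by voice, or a kana input mode, in which a name is input as kana from a keyboard or the like.
[0117]
In step S41, the name registration application unit 21_1 enables the name registration task 71_3 of the speech recognition engine unit 11 so that speech can be recognized with that task.
[0118]
After step S41, the process proceeds to step S42, where the name registration application unit 21_1 determines whether the name registration mode is the voice input mode. If so, the process proceeds to step S43, where the matching unit 54 is made to perform name recognition processing, and then to step S44. (Alternatively, in step S42, if the user speaks, it may be determined that "the name was input by voice" and the process may proceed to step S43, and if a kana input button (not shown) is pressed, it may be determined that "the name was input in kana" and the process may proceed to step S46.) Details of the name recognition processing are described later with reference to FIG. 22.
[0119]
In step S44, the name registration application unit 21_1 determines whether the speech recognition result for the name (the recognized name), obtained when the matching unit 54 performs the name recognition processing of step S43, is correct. This determination is made, for example, by speaking the recognition result to the user and checking whether the user operates an OK button (not shown).
[0120]
If it is determined in step S44 that the recognition result for the name is not correct, the user is prompted to speak again, the process returns to step S43, and the name recognition processing is performed again. If it is determined in step S44 that the recognition result is correct, the process proceeds to step S47.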
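[0121]
If, on the other hand, the name registration application unit 21_1 determines in step S42 that the name registration mode is not the voice input mode, the process proceeds to step S45, where it determines whether the name registration mode is the kana input mode.
[0122]
If it is determined in step S45 that the name registration mode is not the kana input mode, the user has not selected a name registration mode, so the process waits until the user selects one and then returns to step S42.
[0123]
If it is determined in step S45 that the name registration mode is the kana input mode, the process proceeds to step S46, where the name registration application unit 21_1 obtains the kana string of the name input by the user and the category of that name.
[0124]
Methods of inputting a kana string include, for example: the user temporarily connecting a keyboard and typing kana characters; input with the robot's various switches; showing the robot a sheet of paper or the like on which characters are written, for character recognition (see, for example, Japanese Patent Application No. 2001-135423); connecting the robot and a personal computer by wireless LAN (Local Area Network) or the like and transferring the string from the personal computer to the robot; and downloading to the robot via the Internet or the like. In the character recognition method, a character string that mixes kana and kanji may be input instead of kana characters, with the name registration application unit 21_1 converting it into a kana string (see Japanese Patent Application No. 2001-135423).
[0125]
Furthermore, instead of the user inputting the kana string of the name, entries to which the kana characters of names are attached may be given to the common dictionary of the common dictionary unit 55 in advance, and the name registration application unit 21_1 may obtain the kana string of the name by referring to the common dictionary unit 55.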
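[0126]
After step S44 or S46, the process proceeds to step S47, where the name registration application unit 21_1 determines the category of the name to be registered. When the name registration mode is the kana input mode, the name registration application unit 21_1 takes the category obtained in step S46 (input by the user) as the category of the name to be registered.
[0127]
That is, in the kana input mode, the user is asked in step S46 to input the category of the name along with the name, and the category input by the user is taken as the category of the name to be registered. When the name registration mode is the voice input mode, on the other hand, the name registration application unit 21_1 infers the category of the name obtained by the name recognition processing of step S43.
[0128]
For example, if the recognition result supplied from the speech recognition engine unit 11 begins with 「君」 ("you"), the category of the name to be registered is inferred to be "_ロボット名_" (robot name), and if it begins with 「私」 ("I"), the category is inferred to be "_ユーザ名_" (user name). The various category estimation methods disclosed in Japanese Patent Application No. 2001-382579, previously proposed by the present applicant, can also be used.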
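The simple inference rule of this example can be sketched in a few lines. The function name and the `None` return for "cannot infer" are assumptions; the other estimation methods mentioned in the text are not covered.

```python
from typing import Optional


def guess_category(recognition_result: str) -> Optional[str]:
    """Infer the category of a voice-registered name from the first word."""
    if recognition_result.startswith("君"):
        return "_ロボット名_"   # "you ..." : the speaker is naming the robot
    if recognition_result.startswith("私"):
        return "_ユーザ名_"     # "I ..." : the speaker is giving their own name
    return None                 # no inference possible with this rule alone
```

For instance, `guess_category("君の名前はエスディーアールだよ")` yields "_ロボット名_", so the name cut out of that utterance would be registered as a robot name.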
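[0129]
After step S47, the process proceeds to step S48, where the name registration application unit 21_1 controls the matching unit 54 to enter the pronunciation information and category of the name to be registered in the common dictionary of the common dictionary unit 55, and then to step S49. In step S49, the name registration application unit 21_1 controls the matching unit 54 to reflect the contents of the common dictionary in the chat task 71_4, voice commander task 71_5, ..., and other task 71_N. Details of this reflection are described later with the flowchart of FIG. 25.
[0130]
By reflecting a name registered in the common dictionary in the other tasks in this way, the registered name can be recognized with the other tasks as well.
[0131]
After step S49, the process proceeds to step S50, where the name registration application unit 21_1 determines whether to end the name registration processing. This determination is made, for example, by asking the user by speech whether to end and checking whether the user operates (presses) an OK button (not shown). If it is determined in step S50 that the name registration application is not to end (for example, the OK button has not been pressed), the process returns to step S42, and another name is registered.
[0132]
If it is determined in step S50 that the name registration processing is to end (for example, the OK button has been pressed), the process proceeds to step S51, where the name registration application unit 21_1 disables the name registration task 71_3, and then to step S52. In step S52, the name registration application unit 21_1 ends its processing.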
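[0133]
FIG. 22 is a flowchart of the name recognition processing that the matching unit 54 of FIG. 2 performs in step S43 of FIG. 21.
[0134]
In step S61, the matching unit 54 determines whether speech has been input to the microphone 51 and, if not, waits until speech is input. If it is determined in step S61 that speech has been input, the process proceeds to step S62. The speech input here may be ordinary conversation such as 「私の名前は太郎です」 ("my name is Taro") or 「君の名前はエスディーアールだよ」 ("your name is SDR"); the user does not need to be conscious of name registration and input only the name 「太郎」 or 「エスディーアール」 in isolation.
[0135]
In step S62, the matching unit 54 recognizes the speech and extracts the name. For example, when 「君の名前はエスディーアールだよ」 is uttered, the matching unit 54 refers to the name registration task 71_3, which has a language model 112 such as that of FIG. 14 and a fixed word dictionary 131 such as that of FIG. 15, and generates a recognition result such as 「<先頭>君の名前は<OOV>だよ<終端>」. The matching unit 54 also obtains information on which interval of the utterance <OOV> occupies (from what second to what second, counted from the start of the utterance).
[0136]
Further, for the same utterance, the matching unit 54 refers to the phoneme typewriter task 71_1, which has a language model 112 such as that of FIG. 9 and a fixed word dictionary 131 such as that of FIG. 10, and obtains, for example, the phoneme sequence "k/i/m/i/n/o/n/a/m/a/e/w/a/e/s/u/d/i:/a:/r/u/d/a/y/o" and the kana string 「キミノナマエワエスディーアールダヨ」.
[0137]
Then, based on the information on which interval of the utterance <OOV> occupies, the matching unit 54 cuts out of the obtained phoneme sequence and kana string the interval corresponding to <OOV>, that is, the interval of the name, and obtains the phoneme sequence "e/s/u/d/i:/a:/r/u" and the kana string 「エスディーアール」. The matching unit 54 also obtains the speech data of that interval. Details of this name extraction processing are disclosed in Japanese Patent Application No. 2001-382579, previously proposed by the present applicant.
[0138]
After step S62, the process proceeds to step S63, where the matching unit 54 supplies the phoneme sequence, kana string, and speech data of the name extracted in step S62 to the unknown word acquisition unit 56, and clustering is performed. Details of the clustering are disclosed in Japanese Patent Application No. 2001-097843, previously proposed by the present applicant. As a result of the clustering, each cluster of the unknown word acquisition unit 56 has a representative phoneme sequence and kana string.
[0139]
After step S63, the process proceeds to step S64, where the recognition result of the speech recognized in step S62 (for example, the kana string 「キミノナマエワエスディーアールダヨ」) is supplied to the name registration application unit 21_1.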
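[0140]
FIG. 23 shows an example of the feature space clustered in the unknown word acquisition unit 56 by the processing of step S63 of FIG. 22. To keep the figure simple, FIG. 23 shows a feature space defined by two features (feature parameters) 1 and 2 (the same applies to FIG. 3 above). In FIG. 23, four names, 「あらら」, 「さにー」, 「とーきょー」, and 「たろう」, are clustered in the feature space.
[0141]
That is, in FIG. 23, four clusters are formed in the feature space: the 「あらら」 cluster 151, the 「さにー」 cluster 152, the 「とーきょー」 cluster 153, and the 「たろう」 cluster 154. Each cluster has a representative phoneme sequence ("a/r/a/r/a", "s/a/n/i:", "t/o:/ky/o:", and "t/a/r/o/u" in the example of FIG. 23), a representative kana notation (「アララ」, 「サニー」, 「トーキョー」, and 「タロウ」), and an ID ("1", "2", "3", and "5").
[0142]
FIG. 24 shows an example of the common dictionary of the common dictionary unit 55 in which word information has been entered in step S48 of FIG. 21. In FIG. 24, the entry on the first row indicates that the pronunciation was input as a kana string, that the pronunciation is the character string 「エスディーアール」, and that the category was input as "_ロボット名_" (robot name).
[0143]
The entry on the second row indicates that the pronunciation was input by voice and that its kana notation and phoneme sequence are the representative kana notation (「タロウ」 in the example of FIG. 23) and phoneme sequence ("t/a/r/o:" in the example of FIG. 23) attached to the cluster whose ID is "5" in the unknown word acquisition unit 56. The category of the entry on the second row was determined by the name registration application unit 21_1 in step S47 of FIG. 21 and is "_ユーザ名_" (user name). For example, when the user utters 「私の名前は太郎です」 ("my name is Taro"), an entry like the one on the second row is created in the common dictionary unit 55.
[0144]
Similarly, the entries on the third and fourth rows indicate that the pronunciation was input as a kana string, that the pronunciations are the character strings 「サニータロウ」 and 「キタシナガワ」, respectively, and that the categories were input as "_ユーザ名_" and "_地名_" (place name), respectively. The entry on the fifth row indicates that the pronunciation was input by voice and that its kana notation and phoneme sequence are the representative kana notation (「トーキョー」 in the example of FIG. 23) and phoneme sequence ("t/o:/ky/o:" in the example of FIG. 23) attached to the cluster whose ID is "3" in the unknown word acquisition unit 56. The category of the entry on the fifth row was determined to be "_地名_" by the name registration application unit 21_1.
[0145]
In the common dictionary, for a word whose pronunciation was input as a kana string, the pair of the kana string representing the pronunciation and the category is registered as one entry, and for a word whose pronunciation was input by voice, the pair of the ID representing the word's cluster and the category is registered as one entry.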
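[0146]
FIG. 25 is a flowchart of the processing by which the matching unit 54 reflects the contents of the common dictionary unit 55 in a task in step S49 of FIG. 21. This processing is performed for each task that is enabled.
[0147]
In step S81, the matching unit 54 initializes the variable word dictionary 132 and category table 133 of the task 71 (FIG. 6). That is, the variable word dictionary 132 is left with no entries, and the category table 133 is left with no words belonging to any category.
[0148]
After step S81, the process proceeds to step S82, where the matching unit 54 reflects the contents of the common dictionary unit 55 in the variable word dictionary 132 and category table 133.
[0149]
That is, the matching unit 54 selects, from the common dictionary of the common dictionary unit 55, the categories that match (are identical to) the categories entered in the category table 133, and obtains each such category together with the cluster ID or kana pronunciation (kana string) that corresponds to it. When the matching unit 54 obtains a cluster ID from the common dictionary, it further obtains the kana string corresponding to that cluster ID from the unknown word acquisition unit 56.
[0150]
Having thus obtained the kana strings of the words that belong to the categories selected from the common dictionary of the common dictionary unit 55, the matching unit 54 enters those kana strings in the variable word dictionary 132. The matching unit 54 also enters information on the words represented by the kana strings obtained from the common dictionary in the corresponding categories of the category table 133.
[0151]
According to the processing above, in each task the variable word dictionary 132 is first initialized and then given the contents of the common dictionary. That is, the variable word dictionary 132 is constructed or reconstructed on the basis of the contents of the common dictionary. Consistency among the tasks is therefore easier to maintain than with a method that deletes or changes specific entries in each dictionary.
[0152]
Also, according to the processing above, for a word registered by voice, the latest pronunciation information at that time is obtained from the unknown word acquisition unit 56 each time the word is reflected in a task. Even after the word has been registered in the variable word dictionary 132, its pronunciation information is therefore updated simply by supplying speech data to the unknown word acquisition unit 56, and the matching unit 54 can recognize speech by referring to the latest pronunciation information.
[0153]
FIG. 26 is a block diagram illustrating the reflection processing of FIG. 25. When the common dictionary of the common dictionary unit 55 describes a kana string for a category, that kana string is registered in the variable word dictionary 132, and information on the word represented by the kana string is registered in the category of the category table 133 that is identical to the category in the common dictionary.
[0154]
When, on the other hand, the common dictionary describes a cluster ID for a category, the unknown word acquisition unit 56 is referenced, the representative kana string and representative phoneme sequence corresponding to that cluster ID are registered in the variable word dictionary 132, and information on the word represented by the cluster ID is registered in the category of the category table 133 that is identical to the category in the common dictionary. In the speech recognition processing described later, both the fixed word dictionary 131 and the variable word dictionary 132 are used.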
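[0155]
FIG. 27 is an example of the variable word dictionary 132 of the application switching task 71_2 in which the contents of the common dictionary unit 55 shown in FIG. 24 have been reflected. When the category table 133 of the application switching task 71_2 is as shown in FIG. 13, the category it shares with the common dictionary of FIG. 24 is "_ロボット名_" (robot name), so the matching unit 54 obtains from the common dictionary of FIG. 24 the kana pronunciation 「エスディーアール」 corresponding to "_ロボット名_".
[0156]
Then, as shown in FIG. 27, the matching unit 54 enters the kana pronunciation 「エスディーアール」 obtained from the common dictionary of FIG. 24 as the transcription in the variable word dictionary 132. Further, as the phoneme sequence corresponding to the transcription 「エスディーアール」, the matching unit 54 writes "e/s/u/d/i:/a:/r/u", which corresponds to the kana pronunciation 「エスディーアール」 under the kana-phoneme conversion rules 115 (FIG. 8).
[0157]
The matching unit 54 also registers "OOV00001" as the symbol of the word represented by the transcription 「エスディーアール」. Here the symbol is "OOV00001", meaning "'OOV' + serial number", but the symbol may be any character string that uniquely identifies the word. For example, the category name may be prefixed to give a symbol such as "_ロボット名_::OOV00001".
[0158]
FIG. 28 shows an example of the category table 133 of the application switching task 71_2 in which the contents of the common dictionary of FIG. 24 have been reflected. When the contents of the common dictionary of FIG. 24 have been reflected in the variable word dictionary 132 as shown in FIG. 27, the category table 133 changes from the state shown in FIG. 13, in which no word is registered in the category "_ロボット名_", to a state in which the symbol "OOV00001" of the word belonging to the category "_ロボット名_", registered in the variable word dictionary 132 of FIG. 27, is entered.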
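[0159]
Next, the processing by which the matching unit 54 deletes or changes a word registered in the common dictionary of the common dictionary unit 55 in step S48 of FIG. 21 is described with reference to the flowchart of FIG. 29. The processing of deleting or changing a word in the common dictionary starts, for example, when the name registration application unit 21_1 issues a command, or when registered words that are no longer needed must be deleted because of memory constraints.
[0160]
The processing of deleting or changing a word in the common dictionary is also performed, for example, to rewrite the IDs described in the common dictionary when a cluster is deleted in the unknown word acquisition unit 56, or when clusters are split or merged so that the IDs attached to clusters change, and the IDs attached to the clusters of the unknown word acquisition unit 56 must be reconciled with the IDs described in the common dictionary (the cluster IDs described with reference to FIG. 24).
[0161]
Furthermore, the processing of deleting or changing the common dictionary is performed to slim down the common dictionary: when none of the tasks whose language model 112 describes a certain category will use that category any longer, the information on that category is deleted from the common dictionary.
[0162]
When the representative phoneme sequence and kana string of a cluster are changed in the unknown word acquisition unit 56, the change is picked up through the common dictionary by the reflection processing of FIG. 25, so the processing of deleting or changing a word (hereinafter, change/deletion processing) need not be performed.
[0163]
In step S101, the matching unit 54 determines, from the common dictionary, the word that is to be the target of the change/deletion processing, and the process proceeds to step S102. The target word may be determined by the user with a button (not shown) or may be inferred by the matching unit 54.
[0164]
In step S102, the matching unit 54 determines whether to delete the target word of the change/deletion processing and, if so, the process proceeds to step S103. In step S103, the matching unit 54 deletes the entry of the target word from the common dictionary. Deletion here means deleting the entry identified by a category and pronunciation information, deleting all entries of a particular category at once, or deleting all entries that have particular pronunciation information (a kana string or a cluster ID) at once.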
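The three kinds of deletion in step S103 can be sketched with one filter function. The tuple layout `(category, pronunciation)`, with a kana string or an integer cluster ID as the pronunciation, and the function name are assumptions; the sample data follows the description of FIG. 24.

```python
from typing import List, Optional, Tuple, Union

Pronunciation = Union[str, int]           # kana string or cluster ID
Entry = Tuple[str, Pronunciation]         # (category, pronunciation)


def delete_entries(
    dictionary: List[Entry],
    category: Optional[str] = None,
    pronunciation: Optional[Pronunciation] = None,
) -> List[Entry]:
    """Delete by category and pronunciation, by category only, or by
    pronunciation only, depending on which arguments are given."""
    if category is None and pronunciation is None:
        raise ValueError("give a category, a pronunciation, or both")

    def matches(entry: Entry) -> bool:
        return ((category is None or entry[0] == category)
                and (pronunciation is None or entry[1] == pronunciation))

    return [entry for entry in dictionary if not matches(entry)]


# Common dictionary as described for FIG. 24.
FIG24: List[Entry] = [
    ("_ロボット名_", "エスディーアール"),
    ("_ユーザ名_", 5),
    ("_ユーザ名_", "サニータロウ"),
    ("_地名_", "キタシナガワ"),
    ("_地名_", 3),
]
```

Identifying an entry by the pair of category and pronunciation, rather than by pronunciation alone, is what keeps homophones in other categories from being deleted together.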
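[0165]
If, on the other hand, the matching unit 54 determines in step S102 not to delete the target word of the change/deletion processing, the process proceeds to step S104, where it determines whether to change the word. If it determines not to change the word, the process returns to step S102 and waits until either change or deletion is determined.
[0166]
If the matching unit 54 determines in step S104 to change the target word of the change/deletion processing, the process proceeds to step S105, where it changes the entry of the target word in the common dictionary.
[0167]
For example, when clusters of the unknown word acquisition unit 56 are split or merged and cluster ID numbers change, the matching unit 54 changes the cluster IDs in the common dictionary so that they are consistent with the unknown word acquisition unit 56. Also, for example, when the user later wants to correct a kana string that was input at registration, the matching unit 54, on a command from the name registration application unit 21_1, changes the kana pronunciation of the target word in the common dictionary (a word entered in the common dictionary in step S48 of FIG. 21) to the kana string that the user inputs after selecting the target word.
[0168]
After step S103 or step S105, the process proceeds to step S106, where the matching unit 54 performs the reflection processing of FIG. 25 and reflects the contents of the common dictionary in each task.
[0169]
In this way, when a word in the common dictionary is deleted or changed, the changed contents are reflected in each task, so the registered words remain consistent across the application units.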
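[0170]
FIGS. 30 and 31 show examples in which the matching unit 54 changes word entries of the common dictionary in step S105 of FIG. 29. For example, when the cluster with ID "5" in the unknown word acquisition unit 56 is split into a cluster with ID "8" and a cluster with ID "9", the matching unit 54 changes the common dictionary from the state shown in FIG. 30A to the state shown in FIG. 30B.
[0171]
That is, the matching unit 54 deletes the entry whose cluster ID is "5" in the common dictionary unit 55 (the entry on the first row of FIG. 30A) and registers two entries with the category "_ユーザ名_" (user name) that was registered in the deleted entry. The matching unit 54 then writes the cluster ID numbers "8" and "9" in the two new entries, respectively (the entries on the first and second rows of FIG. 30B).
[0172]
Also, for example, when the cluster with ID "5" and the cluster with ID "3" in the unknown word acquisition unit 56 are merged and a cluster with ID "10" is newly created, the matching unit 54 changes the common dictionary from the state shown in FIG. 31A to the state shown in FIG. 31B.
[0173]
That is, the matching unit 54 changes the cluster IDs of the entries whose cluster IDs are "5" and "3" in the common dictionary (all the entries of FIG. 31A) to "10", and reduces the resulting two duplicate entries, each with the category "_ユーザ名_" and the cluster ID number "10", to one (for example, by deleting one of them) (FIG. 31B).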
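The two updates of FIGS. 30 and 31 can be sketched as follows. The entry layout `(category, ("cluster", id))` or `(category, ("kana", text))` and both function names are assumptions for illustration; the sample IDs are the ones used in the text.

```python
from typing import List, Tuple, Union

Ref = Tuple[str, Union[int, str]]   # ("cluster", 5) or ("kana", "サニータロウ")
Entry = Tuple[str, Ref]             # (category, pronunciation reference)


def split_cluster(dictionary: List[Entry], old_id: int,
                  new_ids: List[int]) -> List[Entry]:
    """Replace each entry for old_id with one entry per new cluster ID."""
    result: List[Entry] = []
    for category, ref in dictionary:
        if ref == ("cluster", old_id):
            result.extend((category, ("cluster", n)) for n in new_ids)
        else:
            result.append((category, ref))
    return result


def merge_clusters(dictionary: List[Entry], old_ids: List[int],
                   new_id: int) -> List[Entry]:
    """Rewrite the old cluster IDs to new_id and drop duplicate entries."""
    result: List[Entry] = []
    for category, ref in dictionary:
        if ref[0] == "cluster" and ref[1] in old_ids:
            ref = ("cluster", new_id)
        if (category, ref) not in result:   # keep only one of each duplicate
            result.append((category, ref))
    return result


fig30a = [("_ユーザ名_", ("cluster", 5))]
fig30b = split_cluster(fig30a, 5, [8, 9])

fig31a = [("_ユーザ名_", ("cluster", 5)), ("_ユーザ名_", ("cluster", 3))]
fig31b = merge_clusters(fig31a, [5, 3], 10)
```

After either update, running the reflection processing again would rebuild every task's variable word dictionary from the revised cluster IDs.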
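[0174]
Next, the chat processing of step S12 of FIG. 5 is described in detail with reference to the flowchart of FIG. 32.
[0175]
In step S121, the chat application unit 21_2 enables the chat task 71_4, and the process proceeds to step S122. In step S122, the chat application unit 21_2 controls the matching unit 54 to perform reflection processing such as that shown in FIG. 25, reflecting the contents of the common dictionary of the common dictionary unit 55 in the chat task 71_4 (its variable word dictionary 132 and category table 133). The chat task 71_4 can therefore pick up the words that were registered in, changed in, or deleted from the common dictionary while the task was disabled.
[0176]
After step S122, the process proceeds to step S123, where the chat application unit 21_2 controls the speech recognition engine unit 11 to perform speech recognition processing, and then to step S124. Details of the speech recognition processing are described later with reference to FIG. 35.
[0177]
In step S124, the chat application unit 21_2 obtains the recognition result from the speech recognition engine unit 11 and generates a response to it. That is, the robot responds to the user's utterance. For example, when the user's utterance is 「エスディーアール（ロボット名）は、何時に起きたの？」 ("SDR [robot name], what time did you get up?"), the chat application unit 21_2 generates a response giving the time at which the robot got up (was started), for example 「７時」 ("seven o'clock"), and makes the robot speak it.
[0178]
After step S124, the process proceeds to step S125, where the chat application unit 21_2 determines whether to end the processing. This determination is made, for example, by the chat application unit 21_2 having the robot say 「終了する？」 ("shall we finish?") to the user and checking whether the user operates (presses) an OK button (not shown).
[0179]
If it is determined in step S125 that the processing is not to end, the process returns to step S123, and the same processing is repeated. That is, the robot continues chatting with the user.
[0180]
If it is determined in step S125 that the processing is to end, the process proceeds to step S126, where the chat application unit 21_2 disables the chat task 71_4, and then to step S127. In step S127, the chat application unit 21_2 ends its processing.
[0181]
In the processing above, the chat application unit 21_2 generates a response each time the user speaks once, but the robot may also speak spontaneously to encourage the user to speak.
[0182]
The processing of FIG. 32 was described for the chat processing of the chat application unit 21_2, but the voice commander processing of the voice commander application unit 21_3, ..., and the processing of the other application unit 21_M are performed in the same way. In step S124, however, each application unit 21 performs its own processing based on the speech recognition result from the speech recognition engine unit 11.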
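[0183]
FIG. 33 shows the state in which, through the processing of step S122 in FIG. 32, the contents of the common dictionary of the common dictionary unit 55 shown in FIG. 24 have been reflected in the variable word dictionary 132 of the chat task 71_4.
[0184]
When the category table 133 of the chat task 71_4 is as shown in FIG. 18, the categories it shares with the common dictionary of FIG. 24 are "_robot name_" and "_place name_". The matching unit 54 therefore acquires the first entry of FIG. 24 as the common dictionary entry corresponding to "_robot name_", and the fourth and fifth entries of FIG. 24 as the entries corresponding to "_place name_". It then acquires the kana pronunciation "エスディーアール" (SDR) from the first entry, the kana pronunciation "キタシナガワ" (Kitashinagawa) from the fourth entry, and the cluster ID "3" from the fifth entry.
[0185]
Then, as shown in FIG. 33, the matching unit 54 enters "エスディーアール" and "キタシナガワ" as transcriptions in the variable word dictionary 132. Furthermore, based on the kana-phoneme conversion rules 115 (FIG. 8), the matching unit 54 writes, as phoneme sequences in the variable word dictionary 132, "e/s/u/d/i:/a:/r/u" for the transcription "エスディーアール" and "k/i/t/a/sh/i/n/a/g/a/w/a" for the transcription "キタシナガワ".
[0186]
The matching unit 54 also extracts the cluster whose cluster ID is "3" from the unknown word acquisition unit 56 and acquires its representative phoneme sequence and kana string. For example, when the unknown word acquisition unit 56 is in the state shown in FIG. 23, the matching unit 54 acquires the phoneme sequence "t/o:/ky/o:" and the kana string "トーキョー" (Tokyo) from cluster 153, whose cluster ID is "3". Then, as shown in FIG. 33, the matching unit 54 enters the acquired phoneme sequence "t/o:/ky/o:" and kana string "トーキョー" as a phoneme sequence and a transcription, respectively, in the variable word dictionary 132.
[0187]
Furthermore, the matching unit 54 registers "OOV00001" as the symbol of the word represented by the transcription "エスディーアール", "OOV00002" as the symbol of the word represented by the transcription "キタシナガワ", and "OOV00003" as the symbol of the word represented by the transcription "トーキョー".
[0188]
In this case, the kana-phoneme conversion rules 115 of the phoneme typewriter task 71_1 and of the chat task 71_4 are assumed to be the same, so the representative phoneme sequence of a cluster obtained using the phoneme typewriter task 71_1 is registered as it is in the variable word dictionary 132 of the chat task 71_4. When the kana-phoneme conversion rules 115 of the two tasks differ, the matching unit 54 acquires the representative kana string of the cluster from the unknown word acquisition unit 56 and writes the phoneme sequence in the variable word dictionary 132 based on the kana-phoneme conversion rules 115 of the chat task 71_4.
[0189]
FIG. 34 shows the state in which the contents of the common dictionary of FIG. 24 have been reflected in the category table 133 of the chat task 71_4 of FIG. 18. In the category table 133, for the category "_robot name_", the symbol "OOV00001" of the variable word dictionary 132 is entered for the word belonging to that category (the word whose transcription is "エスディーアール" (FIG. 33)). Likewise, for the category "_place name_", the symbols "OOV00002" and "OOV00003" registered in the variable word dictionary 132 are entered for the words belonging to that category (the words whose transcriptions are "キタシナガワ" and "トーキョー" (FIG. 33)).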
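As an illustration of this reflection step, the sketch below rebuilds a variable word dictionary and a category table from a common dictionary. It is a simplification under stated assumptions: the lookup table `KANA_TO_PHONEMES` stands in for the kana-phoneme conversion rules 115, `CLUSTERS` stands in for the unknown word acquisition unit 56, and the entry layout and the function name `reflect` are invented for the example. The sample entries follow the text only loosely and are not a reproduction of FIG. 24.

```python
# Hypothetical stand-ins; the values are the ones quoted in the text.
KANA_TO_PHONEMES = {
    "エスディーアール": "e/s/u/d/i:/a:/r/u",
    "キタシナガワ": "k/i/t/a/sh/i/n/a/g/a/w/a",
}
CLUSTERS = {3: ("t/o:/ky/o:", "トーキョー")}  # cluster ID -> (phonemes, kana)

def reflect(common_dict, task_categories):
    """Build (variable word dictionary, category table) for one task."""
    variable_dict = []  # rows: (symbol, transcription, phoneme sequence)
    category_table = {c: [] for c in task_categories}
    for category, kind, value in common_dict:
        if category not in category_table:
            continue  # the task does not describe this category
        if kind == "kana":
            kana, phonemes = value, KANA_TO_PHONEMES[value]
        else:  # kind == "cluster": take the cluster's current representatives
            phonemes, kana = CLUSTERS[value]
        symbol = f"OOV{len(variable_dict) + 1:05d}"
        variable_dict.append((symbol, kana, phonemes))
        category_table[category].append(symbol)
    return variable_dict, category_table

common = [
    ("_robot name_", "kana", "エスディーアール"),
    ("_user name_", "cluster", 5),
    ("_place name_", "kana", "キタシナガワ"),
    ("_place name_", "cluster", 3),
]
variable_dict, category_table = reflect(common, ["_robot name_", "_place name_"])
```

Resolving cluster IDs at reflection time, instead of copying a pronunciation into the common dictionary, is what lets a task pick up whatever pronunciation the cluster currently holds.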
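[0190]
Next, the speech recognition process performed by the speech recognition engine unit 11 of FIG. 2 in step S123 of FIG. 32 will be described in detail with reference to the flowchart of FIG. 35. This process starts when the user's speech is input to the microphone 51, and is performed for each enabled task among the application switching task 71_2, the chat task 71_4, the voice commander task 71_5, ..., and the other tasks 71_N.
[0191]
In step S141, the audio signal generated by the microphone 51 is converted by the AD converter 52 into audio data, which is a digital signal, and supplied to the feature amount extraction unit 53. After step S141, the process proceeds to step S142, where the feature amount extraction unit 53 extracts feature amounts such as the mel-cepstrum from the supplied audio data, and then to step S143.
[0192]
In step S143, the matching unit 54 concatenates some of the words represented by the symbols in the fixed word dictionary 131 and the variable word dictionary 132 to generate word strings, and calculates their acoustic scores. The acoustic score represents how close a word string that is a candidate recognition result is to the input speech as sound (acoustically).
[0193]
After step S143, the process proceeds to step S144, where the matching unit 54 selects a predetermined number of word strings with high acoustic scores based on the acoustic scores calculated in step S143, and then to step S145.
[0194]
In step S145, the matching unit 54 calculates the language score of each word string selected in step S144 using the language model 112, and proceeds to step S146. For example, when a grammar or a finite-state automaton is used as the language model 112, the language score is "1" if the word string can be accepted by the language model 112 and "0" if it cannot.
[0195]
Alternatively, the matching unit 54 may keep a word string selected in step S144 when it can be accepted and delete it when it cannot.
[0196]
When a statistical language model is used as the language model 112, the generation probability of the word string is used as the language score. Details of how this language score is obtained are disclosed in Japanese Patent Application No. 2001-382579 previously proposed by the present applicant.
[0197]
For example, suppose that speech recognition is performed in the voice commander process of the voice commander application unit 21_3 and that the matching unit 54 has selected the word string "<start> OOV00001 前に進め <end>" in step S144 (前に進め means "move forward"). Its language score is "1", because this word string can be accepted by the grammar language model 112 shown in FIG. 19.
[0198]
That is, the matching unit 54 refers to the category table 133 (FIG. 28), recognizes that the category of the symbol "OOV00001" is "_robot name_", converts the word string "<start> OOV00001 前に進め <end>" obtained in step S144 into the word string "<start> _robot name_ 前に進め <end>", which uses the category name, and determines that it can be accepted by the language model 112 shown in FIG. 19.
[0199]
On the other hand, when the word string "<start> OOV00001 に進め前 <end>" (the same words in a scrambled order) is selected in step S144, the matching unit 54 refers to the category table 133 (FIG. 28), recognizes that the category of the symbol "OOV00001" is "_robot name_", converts the word string obtained in step S144 into the word string "<start> _robot name_ に進め前 <end>", which uses the category name, determines that it cannot be accepted by the language model 112 shown in FIG. 19, and sets the language score of this word string to "0".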
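The accept/reject scoring with category substitution can be sketched as below. The one-rule grammar in `ACCEPTED` is a hypothetical stand-in for the language model of FIG. 19, whose actual contents are not given here, and the word segmentation of the two example strings is likewise an assumption.

```python
# Hypothetical grammar: the set of category-level word strings it accepts.
ACCEPTED = {("<start>", "_robot name_", "前に", "進め", "<end>")}
CATEGORY_TABLE = {"_robot name_": ["OOV00001"]}

def to_categories(words, table):
    """Replace each variable-word symbol by the name of its category."""
    symbol_to_category = {s: c for c, symbols in table.items() for s in symbols}
    return tuple(symbol_to_category.get(w, w) for w in words)

def grammar_score(words):
    """Return 1 if the category-level word string is accepted, else 0."""
    return 1 if to_categories(words, CATEGORY_TABLE) in ACCEPTED else 0

good = grammar_score(["<start>", "OOV00001", "前に", "進め", "<end>"])
bad = grammar_score(["<start>", "OOV00001", "に", "進め", "前", "<end>"])
```

Since the grammar only ever sees category names, adding a new robot name changes the category table but not the grammar.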
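[0200]
In step S146, the matching unit 54 integrates the acoustic score calculated in step S143 with the language score calculated in step S145, sorts the word strings, and determines, for example, the word string with the highest integrated score as the recognition result.
[0201]
As a result, the word string that is most appropriate both acoustically and linguistically is determined as the recognition result.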
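Steps S144 to S146 can be sketched as a small selection routine. The text does not say how the two scores are integrated, so the product used here is purely an assumption for illustration, as are the function name `recognize` and the toy candidates.

```python
def recognize(candidates, language_score, n_best=2):
    """candidates: list of (word_string, acoustic_score) pairs."""
    # S144: keep a fixed number of acoustically best word strings.
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:n_best]
    # S145: language score; integration by product is an assumed choice.
    scored = [(words, acoustic * language_score(words)) for words, acoustic in shortlist]
    # S146: the word string with the highest integrated score wins.
    return max(scored, key=lambda s: s[1])[0]

toy_language_scores = {"A": 0, "B": 1, "C": 1}
result = recognize([("A", 0.9), ("B", 0.8), ("C", 0.1)], toy_language_scores.get)
```

Here "A" sounds closest but is rejected by the language model, so "B" is chosen; "C" never reaches the language model because it is cut in the acoustic shortlist.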
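[0202]
After step S146, the process proceeds to step S147, where the matching unit 54 determines whether the recognition result contains a word registered by voice (a word clustered in the unknown word acquisition unit 56).
[0203]
If it is determined in step S147 that the recognition result contains a word registered by voice, the process proceeds to step S148, where the matching unit 54 supplies that word to the unknown word acquisition unit 56, and the unknown word acquisition unit 56 performs re-clustering. The process then proceeds to step S149.
[0204]
For example, when the word string "<start> 今日はトーキョーに行ったんだよ <end>" ("I went to Tokyo today"), which contains the place name (unknown word) "トーキョー", is obtained in step S144, the matching unit 54 supplies to the unknown word acquisition unit 56 the audio data of the unknown word "トーキョー", together with the phoneme sequence (for example, "t/o:/ky/o:") and the kana string (for example, "トーキョー") recognized with reference to the phoneme typewriter task 71_1. The unknown word acquisition unit 56 then performs re-clustering.
[0205]
This increases the amount of audio data supplied to the unknown word acquisition unit 56, so the representative phoneme sequence and representative kana string of each cluster may be updated to the correct values. As a side effect, however, even after the correct phoneme sequence and kana string have been acquired, re-clustering may change them to incorrect values. To prevent this side effect, when the user so instructs, the kana string at that time can be written into the common dictionary entry, which fixes the pronunciation. For example, once the kana string of the cluster with ID = 3 in FIG. 23 has become the pronunciation "トーキョー", the part of the common dictionary of FIG. 24 that reads "cluster ID = 3" is rewritten to "kana pronunciation: トーキョー" (the fifth entry is the one rewritten). Thereafter, even if the kana string of the cluster with ID = 3 changes to something other than "トーキョー", the pronunciation of the fifth entry of the common dictionary remains fixed as "トーキョー".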
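Fixing a pronunciation amounts to replacing a cluster reference with a literal kana string. A minimal sketch, assuming a hypothetical entry layout of (category, kind, value) and a hypothetical function name `fix_pronunciation`:

```python
def fix_pronunciation(common_dict, index, clusters):
    """Replace a cluster reference by the cluster's current kana string."""
    category, kind, value = common_dict[index]
    if kind == "cluster":
        _phonemes, kana = clusters[value]
        common_dict[index] = (category, "kana", kana)

clusters = {3: ("t/o:/ky/o:", "トーキョー")}  # cluster ID -> (phonemes, kana)
common_dict = [("_place name_", "cluster", 3)]
fix_pronunciation(common_dict, 0, clusters)

# A later re-clustering that changes cluster 3 no longer affects the entry.
clusters[3] = ("t/o/k/o", "トコ")  # invented "drifted" value for illustration
```

The trade-off is that a fixed entry also stops benefiting from later corrections, which is why the text ties the operation to an explicit user instruction.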
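[0206]
On the other hand, if the matching unit 54 determines in step S147 that the speech recognition result does not contain a word registered by voice, it skips step S148 and proceeds to step S149.
[0207]
In step S149, the matching unit 54 supplies the recognition result determined in step S146 to the application unit 21 corresponding to the task.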
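[0208]
FIG. 36 shows the formulas for obtaining the language score when, in the chat process of the chat application unit 21_2, the matching unit 54 has selected, for example, the word string "<start> OOV00001 は何時に起きたの <end>" in step S144 of FIG. 35 (は何時に起きたの means roughly "what time did ... wake up").
[0209]
As shown in equation (1), the language score Score(<start> OOV00001 は何時に起きたの <end>) is the generation probability of the word string "<start> OOV00001 は何時に起きたの <end>".
[0210]
Strictly speaking, as shown in equation (2), the value of this language score is obtained as P(<start>) P(OOV00001 | <start>) P(は | <start> OOV00001) P(何時 | <start> OOV00001 は) P(に | <start> OOV00001 は 何時) P(起きた | <start> OOV00001 は 何時 に) P(の | <start> OOV00001 は 何時 に 起きた) P(<end> | <start> OOV00001 は 何時 に 起きた の). However, as shown in FIG. 16, the language model 112 uses tri-grams, so the conditions "<start> OOV00001 は", "<start> OOV00001 は 何時", "<start> OOV00001 は 何時 に", "<start> OOV00001 は 何時 に 起きた", and "<start> OOV00001 は 何時 に 起きた の" are approximated by conditional probabilities limited to at most the two immediately preceding words, namely "OOV00001 は", "は 何時", "何時 に", "に 起きた", and "起きた の", respectively (equation (3)).
[0211]
These conditional probabilities are obtained by referring to the language model 112 (FIG. 16). Because the language model 112 does not contain the symbol "OOV00001", the matching unit 54 refers to the category table 133 of FIG. 34, recognizes that the category of the word represented by the symbol "OOV00001" is "_robot name_", and converts "OOV00001" into "_robot name_".
[0212]
That is, as shown in equation (4), P(OOV00001 | <start>) is changed to P(_robot name_ | <start>) P(OOV00001 | _robot name_), which is approximated by P(_robot name_ | <start>) / N. Here, N is the number of words belonging to the category "_robot name_" in the category table 133.
[0213]
In other words, when a probability is written in the form P(X | Y) and the word X belongs to the category C, P(C | Y) is obtained from the language model 112 and multiplied by P(X | C) (the probability that the word X is generated from the category C). Assuming that all words belonging to the category C are generated with equal probability, P(X | C) can be approximated as 1/N when N words belong to the category C.
[0214]
In FIG. 34, only the word represented by the symbol "OOV00001" belongs to the category "_robot name_", so N is "1". Therefore, as shown in equation (5), P(は | <start> OOV00001) becomes P(は | <start> _robot name_). Similarly, as shown in equation (6), P(何時 | OOV00001 は) becomes P(何時 | _robot name_ は).
[0215]
In this way, a language score can be calculated even for a word string containing a variable word, which makes it possible for variable words to appear in recognition results.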
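The category-based tri-gram score described above can be sketched as follows. The dictionary `trigram`, keyed by (history, token), is a hypothetical stand-in for the language model 112, and its probabilities are invented; only the substitution of the category for the symbol and the 1/N factor come from the text.

```python
def language_score(words, trigram, category_table):
    """Product of P(token | up to two preceding tokens), with P(X | C) ~ 1/N."""
    symbol_to_category = {s: c for c, symbols in category_table.items() for s in symbols}
    score, history = 1.0, ()
    for word in words:
        token = symbol_to_category.get(word, word)  # e.g. OOV00001 -> _robot name_
        p = trigram[(history[-2:], token)]          # P(C | Y) from the model
        if word in symbol_to_category:
            p /= len(category_table[token])         # times P(X | C), assumed 1/N
        score *= p
        history += (token,)                         # history is kept at category level
    return score

# Invented probabilities for a shortened word string.
trigram = {
    ((), "<start>"): 1.0,
    (("<start>",), "_robot name_"): 0.5,
    (("<start>", "_robot name_"), "は"): 0.4,
    (("_robot name_", "は"), "<end>"): 0.25,
}
table = {"_robot name_": ["OOV00001", "OOV00002"]}  # N = 2 in this toy case
score = language_score(["<start>", "OOV00001", "は", "<end>"], trigram, table)
```

With two robot names registered, the score is 1.0 × (0.5 / 2) × 0.4 × 0.25; with a single name, as in FIG. 34, the division by N has no effect.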
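[0216]
In the example above, starting the application unit 21 is linked to enabling the task 71, and ending the application unit 21 is linked to disabling the task 71. These can instead be done at different times. For example, a task can be enabled and disabled many times while the application unit 21 is running, or a single application can control a plurality of tasks.
[0217]
In this case, for a task that is frequently switched between enabled and disabled, allocating and releasing memory each time is inefficient. Such a task can therefore simply be flagged when it is disabled (with a flag indicating that the task is disabled) while its memory is kept allocated.
[0218]
Also, in the example above, nothing is stored in the common dictionary of the common dictionary unit 55 when the robot system starts, but some words may be stored in the common dictionary in advance. For example, since the product name of a robot is often registered as that robot's name, the product name may be registered in advance in the "_robot name_" category of the common dictionary.
[0219]
FIG. 37 shows an example of the common dictionary in which the robot's product name "エスディーアール" (SDR) is already entered in the category "_robot name_" when the robot system starts. In FIG. 37, the kana pronunciation "エスディーアール" is entered in the category "_robot name_" at startup, so the user can control the robot using the word represented by that kana pronunciation without performing name registration.
[0220]
Also, the example above assumes that no clusters have been generated in the unknown word acquisition unit at the initial stage (at shipment). However, if clusters for major names are prepared from the start, names for which clusters exist are recognized more easily when a name is input by voice in the name registration process of FIG. 21. For example, if clusters like those of FIG. 3 are prepared at shipment, the pronunciations of the utterances "アカ" (aka), "アオ" (ao), "ミドリ" (midori), and "クロ" (kuro) can be recognized (acquired) as the correct phoneme sequences. Moreover, for names with prepared clusters, it is undesirable for the pronunciation to change after registration in the common dictionary. Therefore, when pronunciation information for such a name is registered in the common dictionary, it is described by the kana pronunciation (the representative kana string of the cluster), such as "アカ", "アオ", "ミドリ", or "クロ", instead of by the cluster ID.
[0221]
Furthermore, in the example above, the matching unit 54 reflects the contents of the common dictionary in all tasks, but it may reflect them only in the desired tasks. For example, a number (task ID) is assigned to each task in advance, and the common dictionary of FIG. 24 is extended with a column representing a "list of tasks for which this entry is valid (or invalid)". In the reflection process of FIG. 25, the matching unit 54 then reflects the contents of the common dictionary only in the tasks whose task IDs are listed in the "list of tasks for which this entry is valid" column.
[0222]
FIG. 38 shows an example in which the IDs of the tasks to be reflected are written in the "valid tasks" column of the common dictionary. In FIG. 38, the valid task IDs for the word in the category "_robot name_" whose kana pronunciation is "エスディーアール" are "1", "2", and "4", so the common dictionary contents for that word are reflected only in the variable word dictionary 132 and category table 133 of the tasks whose task IDs are "1", "2", and "4".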
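The per-task filtering can be sketched as a single selection over the common dictionary. The entry layout (category, kana pronunciation, set of valid task IDs) and the function name `entries_for_task` are assumptions; the first entry mirrors the FIG. 38 example in the text, while the second is invented to show a contrast.

```python
def entries_for_task(common_dict, task_id):
    """Return the (category, pronunciation) pairs that are valid for task_id."""
    return [(category, kana) for category, kana, valid in common_dict if task_id in valid]

common_dict = [
    ("_robot name_", "エスディーアール", {1, 2, 4}),  # as in the FIG. 38 example
    ("_place name_", "キタシナガワ", {2}),            # invented for illustration
]
for_task_4 = entries_for_task(common_dict, 4)
for_task_3 = entries_for_task(common_dict, 3)
```

A list of invalid tasks, which the text also allows, would only flip the membership test.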
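[0223]
Also, in the example above, the words stored in the fixed word dictionary 131 are the words described in the language model 112, and the words stored in the variable word dictionary 132 are words belonging to categories, but some of the words belonging to categories may be stored in the fixed word dictionary 131.
[0224]
FIG. 39 shows an example of the fixed word dictionary 131 of the application switching task 71_2, and FIG. 40 shows an example of the category table 133 at startup. That is, the category "_robot name_" and the symbol "OOV00001" of a word belonging to that category are registered in advance in the category table 133 of FIG. 40. In the fixed word dictionary 131 of FIG. 39, the symbol "OOV00001", the transcription "エスディーアール" (SDR) of the word it represents, and the phoneme sequence "e/s/u/d/i:/a:/r/u" are registered in advance.
[0225]
In this case, speech recognition is performed with the word "エスディーアール" treated as belonging to the category "_robot name_". That is, the word is treated as the robot's name from the start. However, because the word is stored in the fixed word dictionary 131, it cannot be deleted or changed.
[0226]
By storing in advance in the fixed word dictionary 131 a word that is expected to be set as a name, such as the robot's product name, the user can control the robot without performing name registration.
[0227]
Also, in the example above, the category symbols are common to all tasks, but they need not be. In that case, a conversion table such as those shown in FIGS. 41 to 44 is prepared in the task.
[0228]
For example, when a task T describes the categories "_ROBOT_NAME_" and "_USER_NAME_", then according to the conversion table of FIG. 41, the common dictionary contents of words belonging to the category "_robot name_" in the common dictionary unit 55 are reflected in task T under the category "_ROBOT_NAME_". Likewise, the common dictionary contents of words belonging to the category "_user name_" are reflected in task T under the category "_USER_NAME_".
[0229]
When a task T describes the category "_proper noun_", then according to the conversion table of FIG. 42, the common dictionary contents of words belonging to the category "_robot name_" and of words belonging to the category "_user name_" in the common dictionary unit 55 are both reflected in task T under the category "_proper noun_".
[0230]
Furthermore, when a task T describes the categories "_family name_" and "_given name_", then according to the conversion table of FIG. 43, the common dictionary contents of words belonging to the category "_user name_" in the common dictionary unit 55 are converted and duplicated into the categories "_family name_" and "_given name_". In the step of reflecting the common dictionary in this task (FIG. 25), for example, the second entry of FIG. 24 is converted and duplicated, according to the conversion table, from "_user name_, cluster ID = 5" into the two entries "_family name_, cluster ID = 5" and "_given name_, cluster ID = 5", which are then reflected in the task's variable word dictionary and category table.
[0231]
When a task T describes no categories, then according to the conversion table of FIG. 44, the words belonging to the categories "_robot name_", "_user name_", and "_place name_" in the common dictionary unit 55 are represented in task T by the symbol "UNK". "UNK" stands for "Unknown word".
[0232]
Thus, even in a task that describes no categories, simply describing the symbol "UNK" in the language model 112 allows the matching unit 54 to recognize words belonging to the categories "_robot name_", "_user name_", and "_place name_".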
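The four conversion tables can be modeled as mappings from a common dictionary category to a list of task categories, which covers renaming, merging, duplication, and the catch-all "UNK" case alike. The dictionaries below are reconstructed from the prose only (the figures themselves are not reproduced here), and the function name `convert` and the entry layout are assumptions.

```python
FIG_41 = {"_robot name_": ["_ROBOT_NAME_"], "_user name_": ["_USER_NAME_"]}
FIG_42 = {"_robot name_": ["_proper noun_"], "_user name_": ["_proper noun_"]}
FIG_43 = {"_user name_": ["_family name_", "_given name_"]}
FIG_44 = {"_robot name_": ["UNK"], "_user name_": ["UNK"], "_place name_": ["UNK"]}

def convert(entries, table):
    """Map each (category, pronunciation) entry to the task's own categories."""
    converted = []
    for category, pronunciation in entries:
        for target in table.get(category, []):  # unmapped categories are dropped
            converted.append((target, pronunciation))
    return converted

second_entry = [("_user name_", "cluster ID = 5")]
duplicated = convert(second_entry, FIG_43)
```

Applying the conversion before the reflection step means the reflection logic itself does not need to know that a task uses different category names.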
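[0233]
FIG. 45 shows an example of the external configuration of a bipedal walking robot equipped with the robot control system 1 to which the present invention is applied. The robot 201 is configured with a head unit 211 arranged on top of a torso unit 213, arm units 212A and 212B of identical configuration attached to the upper left and right of the torso unit 213, and leg units 214A and 214B of identical configuration attached at predetermined positions on the lower left and right of the torso unit 213.
[0234]
The head unit 211 is provided, at predetermined positions, with CCD (Charge Coupled Device) cameras 221A and 221B that function as the "eyes" of the robot 201, microphones 222A and 222B that function as its "ears", and a speaker 223 that functions as its "mouth".
[0235]
FIG. 46 shows an example of the electrical configuration of the robot. In accordance with commands from the robot control system 1, the unit control system 231 and the dialogue control system 232 control the operation of the robot 201. That is, the unit control system 231 controls the head unit 211, the arm units 212A and 212B, and the leg units 214A and 214B of the robot 201 as necessary to make the robot 201 perform predetermined actions. The dialogue control system 232 controls the speech of the robot 201 and, as necessary, causes predetermined utterances to be output from the speaker 223.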
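[0236]
In the above description, a word is a unit that is better treated as a single unit in the process of recognizing speech, and does not necessarily coincide with a word in the linguistic sense. For example, "タロウ君" (Taro-kun) may be treated as one word as a whole, or as the two words "タロウ" and "君". A still larger unit such as "こんにちはタロウ君" ("Hello, Taro-kun") may also be treated as one word.
[0237]
Similarly, a phoneme here is something that is convenient for processing to treat acoustically as a single unit, and does not necessarily coincide with a phonological unit or phoneme in the phonetic sense. For example, the "とう" part of "東京" (Tokyo) can be represented by the three phoneme symbols "t/o/u", or a symbol "o:" representing the long sound of "o" may be provided. It may also be represented as "t/o/o". In addition, symbols representing silence may be provided, and these may be further subdivided into, for example, "silence before an utterance", "a short silent interval between utterances", and "the silence at a 'っ'".
[0238]
Although a robot apparatus has been described above, the present invention can be applied to any apparatus having applications that use speech recognition, speech synthesis, translation, or other language processing.
[0239]
Furthermore, the present invention can be applied, for example, to an apparatus that extracts only predetermined terms from a dictionary such as Kojien and creates a dictionary of those terms.
[0240]
Also, although the case of a plurality of application units has been described above, there may be only one application unit.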
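[0241]
The series of processes described above can be executed by hardware or by software. In the latter case, the processes are executed by a personal computer 600 such as that shown in FIG. 47.
[0242]
In FIG. 47, a CPU (Central Processing Unit) 601 executes various processes according to a program stored in a ROM (Read Only Memory) 602 or a program loaded from a storage unit 608 into a RAM (Random Access Memory) 603. The RAM 603 also stores, as appropriate, data that the CPU 601 needs in order to execute the various processes.
[0243]
The CPU 601, the ROM 602, and the RAM 603 are interconnected via an internal bus 604. An input/output interface 605 is also connected to the internal bus 604.
[0244]
Connected to the input/output interface 605 are an input unit 606 including a keyboard and a mouse, an output unit 607 including a display such as a CRT or LCD (Liquid Crystal Display) and a speaker, a storage unit 608 including a hard disk, and a communication unit 609 including a modem and a terminal adapter. The communication unit 609 performs communication via various networks, including telephone lines and CATV.
[0245]
A drive 610 is also connected to the input/output interface 605 as necessary. A removable medium 621 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on it as appropriate, and a computer program read from the medium is installed in the storage unit 608 as necessary.
[0246]
When the series of processes is executed by software, the program constituting the software is installed from a network or a recording medium onto a computer built into dedicated hardware, or onto, for example, a general-purpose personal computer that can execute various functions when various programs are installed.
[0247]
As shown in FIG. 47, this recording medium is constituted not only by package media consisting of the removable medium 621 on which the program is recorded and which is distributed separately from the computer to provide the program to users, but also by the ROM 602 on which the program is recorded, the hard disk included in the storage unit 608, and the like, which are provided to users already built into the apparatus body.
[0248]
In this specification, the steps describing a computer program include not only processes performed in time series in the described order, but also processes executed in parallel or individually without necessarily being processed in time series.
[0249]
Also, in this specification, a system refers to an entire apparatus composed of a plurality of devices.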
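[0250]
Incidentally, Japanese Patent Application No. 2001-382579, previously proposed by the present applicant, discloses an invention concerning the series of processes up to reflecting words acquired by an unknown word acquisition mechanism in a language model, and the processing for making the words reflected in the language model appear in subsequent recognition results.
[0251]
However, the invention of Japanese Patent Application No. 2001-382579 consists of one application for word registration and one application that uses the registered words, and does not envisage the case of a plurality of applications performing speech recognition. It was therefore difficult for it to solve the above-described problem of reflecting registered words in applications in a system with a plurality of applications, and moreover a variable number of them, and the above-described problem of deleting or changing registered words for a plurality of applications.
[0252]
Japanese Patent Application No. 2002-072718, also previously proposed by the present applicant, discloses an invention that extracts the unknown word "タロウ" (Taro) from the utterance "私の名前はタロウ（未知語）です。" ("My name is Taro (unknown word).") and acquires it as a name.
[0253]
However, it was difficult for the invention of Japanese Patent Application No. 2002-072718 to solve the above-described problem of reflecting unknown words in a language model, the above-described problem of reflecting registered words in applications, and the above-described problem of deleting or changing registered words for a plurality of applications.
[0254]
Therefore, with the inventions of Japanese Patent Application Nos. 2001-382579 and 2002-072718, it was difficult to solve all of the above-described problems: reflecting unknown words in a language model, reflecting registered words in applications, and deleting or changing registered words for a plurality of applications.
[0255]
In the robot control system 1 of FIG. 1, by contrast, categories are described in the language model 112, so the above-described problem of reflecting unknown words in a language model can be solved by making unknown words belong to categories.
[0256]
Also, in the robot control system 1 of FIG. 1, the variable word dictionary 132, in which the words subject to the speech recognition used by an application are registered, is constructed or reconstructed based on the common dictionary. This solves the above-described problems of reflecting registered words in applications and of deleting or changing registered words for a plurality of applications.
[0257]
Furthermore, a system without a keyboard (for example, a robot) has the problem that it is difficult to input pronunciation information when registering a word. As a means of solving this problem, a method of inputting pronunciation information by voice using, for example, a phoneme typewriter has been proposed.
[0258]
However, a phoneme typewriter can misrecognize, and if it is used as is, a word may be registered with the wrong pronunciation. For example, suppose the pronunciation of "エスディーアール" (SDR) is recognized by the phoneme typewriter, which misrecognizes it and outputs "イルニヤル". If "イルニヤル" is adopted as the pronunciation information, the word is registered with the wrong pronunciation. Thereafter, an utterance such as "エスディーアール、こんにちは" ("Hello, SDR") is hard to recognize, whereas an utterance such as "イルニヤル、こんにちは" is easily recognized.
[0259]
Therefore, in the robot control system 1 of FIG. 1, for a word registered by voice, the latest pronunciation information at that time is acquired from the unknown word acquisition unit 56 each time the word is reflected in a task. Even after the phoneme typewriter has misrecognized a word and the misrecognized word has been registered in the variable word dictionary 132, the pronunciation information is updated simply by supplying audio data to the unknown word acquisition unit 56, so the latest pronunciation information can be obtained and a correct recognition result may be obtained.
[0260]
That is, the robot control system 1 of FIG. 1 can solve all of the above-described problems: reflecting unknown words in a language model, reflecting registered words in applications, deleting or changing registered words for a plurality of applications, and inputting the pronunciation information of a word to be registered in a system without a keyboard.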
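[0261]
[Effects of the Invention]
As described above, according to the present invention, words can be registered. In particular, even when words for a plurality of applications are registered, the registered words can be used in common by each application. Words registered before an application is started can also be used in that application. Furthermore, even when a registered word is changed, consistency can be maintained across the applications.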
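[Brief Description of the Drawings]
[FIG. 1] A block diagram showing a configuration example of a robot control system to which the present invention is applied.
[FIG. 2] A block diagram showing a configuration example of the speech recognition engine unit of FIG. 1.
[FIG. 3] A diagram showing an example of clusters of the unknown word acquisition unit of FIG. 2.
[FIG. 4] A flowchart explaining the robot control process in the robot control system of FIG. 1.
[FIG. 5] A flowchart explaining the robot control process in the robot control system of FIG. 1.
[FIG. 6] A diagram showing a configuration example of a task of FIG. 2.
[FIG. 7] A diagram showing an example of the phoneme list of FIG. 6.
[FIG. 8] A diagram showing an example of the kana-phoneme conversion rules of FIG. 6.
[FIG. 9] A diagram showing an example of the language model of the phoneme typewriter task of FIG. 2.
[FIG. 10] A diagram showing an example of the fixed word dictionary of the phoneme typewriter task of FIG. 2.
[FIG. 11] A diagram showing an example of the language model of the application switching task of FIG. 2.
[FIG. 12] A diagram showing an example of the fixed word dictionary of the application switching task of FIG. 2.
[FIG. 13] A diagram showing an example of the category table of the application switching task of FIG. 2.
[FIG. 14] A diagram showing an example of the language model of the name registration task of FIG. 2.
[FIG. 15] A diagram showing an example of the fixed word dictionary of the name registration task of FIG. 2.
[FIG. 16] A diagram showing an example of the language model of the chat task of FIG. 2.
[FIG. 17] A diagram showing an example of the fixed word dictionary of the chat task of FIG. 2.
[FIG. 18] A diagram showing an example of the category table of the chat task of FIG. 2.
[FIG. 19] A diagram showing an example of the language model of the voice commander task of FIG. 2.
[FIG. 20] A diagram showing an example of the fixed word dictionary of the voice commander task of FIG. 2.
[FIG. 21] A flowchart explaining the name registration process of step S9 in FIG. 5.
[FIG. 22] A flowchart explaining the name recognition process of step S43 in FIG. 21.
[FIG. 23] A diagram showing an example of clusters of the unknown word acquisition unit of FIG. 2.
[FIG. 24] A diagram showing an example of the common dictionary unit of FIG. 2.
[FIG. 25] A flowchart explaining the reflection process of step S49 in FIG. 21.
[FIG. 26] A block diagram explaining the reflection process of FIG. 25.
[FIG. 27] A diagram showing an example of the variable word dictionary of the name registration task of FIG. 2.
[FIG. 28] A diagram showing an example of the category table of the name registration task of FIG. 2.
[FIG. 29] A flowchart explaining the word deletion or change process in the matching unit of FIG. 2.
[FIG. 30] A diagram showing an example of a change to the common dictionary unit of FIG. 2.
[FIG. 31] A diagram showing an example of a change to the common dictionary unit of FIG. 2.
[FIG. 32] A flowchart explaining the chat process of step S12 in FIG. 5.
[FIG. 33] An example of the variable word dictionary of the chat task of FIG. 2.
[FIG. 34] An example of the category table of the chat task of FIG. 2.
[FIG. 35] A flowchart explaining the speech recognition process of step S123 in FIG. 32.
[FIG. 36] A diagram showing an example of the formulas for calculating the language score.
[FIG. 37] A diagram showing a modification of the common dictionary unit of FIG. 2.
[FIG. 38] A diagram showing a modification of the common dictionary unit of FIG. 2.
[FIG. 39] A diagram showing a modification of the fixed word dictionary of FIG. 6.
[FIG. 40] A diagram showing an example of the category table of FIG. 6.
[FIG. 41] A diagram showing an example of a category conversion table.
[FIG. 42] A diagram showing an example of a category conversion table.
[FIG. 43] A diagram showing an example of a category conversion table.
[FIG. 44] A diagram showing an example of a category conversion table.
[FIG. 45] A perspective view showing the external configuration of the robot.
[FIG. 46] A block diagram showing the electrical configuration of the robot.
[FIG. 47] A diagram showing an example of a personal computer.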
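[Explanation of Reference Signs]
11 speech recognition engine unit, 21 application unit, 31 application management unit, 51 microphone, 52 AD converter, 53 feature amount extraction unit, 54 matching unit, 55 common dictionary unit, 56 unknown word acquisition unit, 71 task, 111 acoustic model, 112 language model, 113 dictionary, 114 phoneme list, 115 kana-phoneme conversion rules, 116 search parameters, 131 fixed word dictionary, 132 variable word dictionary, 133 category table
[0001]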
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a language processing apparatus, a language processing method, a program, and a recording medium, and more particularly to, for example, a language processing apparatus, a language processing method, and a program that allow a registered word to be commonly recognized by a plurality of applications.
[0002]
[Prior art]
Speech recognition includes isolated word recognition, which recognizes a single word, and continuous word recognition, which recognizes a word string composed of a plurality of words. Conventional continuous word recognition uses a language model, that is, a "database of how readily words connect to one another", to prevent "word strings that sound similar but are nonsensical" from being generated as recognition results.
[0003]
However, since the language model describes information only on words that can be recognized from the beginning (hereinafter referred to as known words as appropriate), it has been difficult to correctly recognize words registered later (hereinafter referred to as registered words as appropriate). That is, in isolated word recognition, once a word is registered in the recognition dictionary, it is recognized from then on. In continuous word recognition, however, registration in the dictionary alone is not sufficient: the registered word must also be reflected in the language model, which has generally been difficult.
[0004]
Therefore, it has been proposed to classify the registered words into categories such as "person name" and "place name", prepare a recognition grammar corresponding to the category, and recognize speech (for example, see Patent Document 1).
[0005]
Further, in a system in which there are a plurality of applications that use speech recognition, and moreover a variable number of them, reflecting a word registered by one application in the other applications raises problems that do not arise with a single application. For example, if word registration is performed only for applications that are already running, then, unlike the single-application case, it is difficult to reflect the registered word in applications that are started or installed after the registration.
[0006]
Further, when there are a plurality of applications, it is troublesome to delete the same registered word many times in a plurality of applications. Further, when there are a plurality of applications, it is easy to delete all the registered words, but it is difficult to delete only a part of the words or change the pronunciation.
[0007]
That is, when there is one application, the registered word to be deleted or changed can be specified by information such as "the word registered the nth time" or "the nth entry in the recognition dictionary". When there are a plurality of applications, however, it is difficult to specify the word in this way, because "the word registered the nth time" and "the number of the added dictionary entry" depend on each application.
[0008]
When there are a plurality of applications, the registered word can be specified by pronunciation. However, when the registered word is specified by pronunciation, the homonym may be deleted or changed.
[0009]
Therefore, it has been proposed that, instead of each application performing speech recognition individually, a module called a "voice commander" performs speech recognition for all applications and passes the recognition result to each application (for example, see Patent Document 1).
[0010]
[Patent Document 1]
JP 2001-216128 A
[0011]
[Problems to be solved by the invention]
However, in the invention described in Patent Literature 1, it is necessary for the “voice commander” to possess a recognition dictionary and a language model corresponding to each application. In other words, when developing a "voice commander", it is necessary to suppose what applications will be used at the same time, and to prepare a recognition dictionary and language model suitable for those applications. On the other hand, there is a problem that it is difficult to reflect registered words.
[0012]
The present invention has been made in view of such a situation, and aims to allow a registered word to be commonly used by a plurality of applications.
[0013]
[Means for Solving the Problems]
The language processing apparatus of the present invention includes registered dictionary storage means for storing a registered dictionary in which words are registered, and constructing means for constructing, based on the registered dictionary, a dedicated dictionary that is dedicated to an application and in which the words subject to the language processing used by that application are registered.
[0014]
The apparatus may further include processing means for adding, deleting, or changing words in the registered dictionary, and deleting means for deleting words in the dedicated dictionary. After all words registered in the dedicated dictionary have been deleted, the constructing means may reconstruct the dedicated dictionary based on the registered dictionary in which words have been added, deleted, or changed.
[0015]
The dedicated dictionary may include at least a fixed dictionary in which predetermined words are registered in advance and a variable dictionary in which the registered words are variable, and the constructing means may construct the variable dictionary of the dedicated dictionary.
[0016]
The dedicated dictionary may further include a category table in which categories of words are registered, and the constructing means may construct the variable dictionary by registering in it those words of the registered dictionary that belong to the categories registered in the category table.
[0017]
There may be a plurality of applications, and the constructing means may construct a dedicated dictionary for each of the plurality of applications.
[0018]
The language processing method of the present invention includes a registered dictionary storing step of storing a registered dictionary in which words are registered, and a constructing step of constructing, based on the registered dictionary, a dedicated dictionary that is dedicated to an application and in which the words subject to the language processing used by that application are registered.
[0019]
The program recorded on the recording medium of the present invention includes a constructing step of constructing, based on a registered dictionary in which words are registered, a dedicated dictionary that is dedicated to an application and in which the words subject to the language processing used by that application are registered.
[0020]
The program of the present invention causes a computer to execute a constructing step of constructing, based on a registered dictionary in which words are registered, a dedicated dictionary that is dedicated to an application and in which the words subject to the language processing used by that application are registered.
[0021]
In the present invention, a registered dictionary in which words are registered is stored, and a dedicated dictionary dedicated to the application in which words to be subjected to language processing used in the application are registered is constructed based on the registered dictionary.
[0022]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 shows a configuration example of a robot control system 1 to which the present invention is applied.
[0023]
In the robot control system 1, the speech recognition engine unit 11 recognizes the input audio data and generates a word string corresponding to the audio data as a recognition result. The speech recognition engine unit 11 sends the recognition result to the name registration application unit 21_1, the chat application unit 21_2, the voice commander application unit 21_3, ..., and the other application units 21_M, and to the application management unit 31.
[0024]
The name registration application unit 21_1, the chat application unit 21_2, the voice commander application unit 21_3, ..., and the other application units 21_M perform various processes based on the recognition result supplied from the speech recognition engine unit 11.
[0025]
The name registration application unit 21_1 registers the robot name, the user name, and the like by voice based on the recognition result supplied from the speech recognition engine unit 11. The other application units control the operation of the robot in response to the user's utterances, using the names registered by the name registration application unit 21_1.
[0026]
Therefore, the chat application unit 21_2, the voice commander application unit 21_3, ..., and the other application units 21_M need to handle the robot name, user name, and the like registered by the name registration application unit 21_1.
[0027]
The chat application unit 21_2 makes the robot chat with the user by voice, and the voice commander application unit 21_3 causes the robot to perform an operation corresponding to the user's utterance. For example, the voice commander application unit 21_3 moves the robot forward in response to a user utterance such as "SDR (robot name), move forward!"
[0028]
Note that an arbitrary number of application units can be prepared. Hereinafter, the name registration application unit 21_1, the chat application unit 21_2, the voice commander application unit 21_3, ..., and the other application units 21_M are collectively referred to as the application unit 21 when they need not be distinguished individually.
[0029]
The application management unit 31 instructs the application unit 21 to start and end based on the recognition result supplied from the speech recognition engine unit 11. For example, when the recognition result "activate voice commander" is supplied from the speech recognition engine unit 11, the application management unit 31 starts the voice commander application unit 21_3. A plurality of application units may be running at the same time.
[0030]
Further, the application unit 21 and the application management unit 31 issue task switching commands to the speech recognition engine unit 11, thereby controlling whether the corresponding task stored in the speech recognition engine unit 11 (described later with reference to FIG. 2) is enabled (active) or disabled (inactive).
[0031]
FIG. 2 shows the configuration of the speech recognition engine unit 11. The user's utterance is input to the microphone 51, where it is converted into an audio signal as an electric signal. The microphone 51 supplies this audio signal to an AD (Analog Digital) converter 52. The AD converter 52 samples and quantizes the audio signal, which is an analog signal from the microphone 51, and converts it into audio data, which is a digital signal. This audio data is supplied to the feature amount extraction unit 53.
[0032]
The feature amount extraction unit 53 extracts feature parameters such as the spectrum, power, linear prediction coefficients, cepstrum coefficients, and line spectrum pairs from the audio data from the AD converter 52 for each appropriate frame, and supplies them to the matching unit 54.
[0033]
Based on the feature parameters from the feature amount extraction unit 53, the matching unit 54 obtains, for each task that is enabled at that time among the phoneme typewriter task 71_1, the application switching task 71_2, the name registration task 71_3, the chat task 71_4, the voice commander task 71_5, ..., and the other tasks 71_N, the word string closest to the speech input to the microphone 51 (input speech) as the recognition result, referring to the databases inside the task as necessary. The matching unit 54 supplies the recognition result to the application unit 21 and the application management unit 31 corresponding to each task.
[0034]
Note that a task is a set of data necessary for performing speech recognition. That is, when the speech recognition engine unit 11 is divided into a program part that performs matching and the like and a data part such as the acoustic model, language model, and recognition dictionary, a task is a program for accessing the data part.
[0035]
Therefore, even when a plurality of applications perform speech recognition using different acoustic models, language models, and dictionaries, a single speech recognition engine unit suffices if a plurality of tasks are prepared. Details of the inside of the task will be described later with reference to FIG.
[0036]
The phoneme typewriter task 71_1 is a task that works as a phoneme typewriter, and is enabled by a command from the speech recognition engine unit 11. With this phoneme typewriter, the matching unit 54 obtains a phoneme sequence, and also a pronunciation in kana notation, for any input speech. For example, from the speech "Your name is SDR" it obtains the phoneme sequence "k/i/m/i/n/o/n/a/m/a/e/w/a/e/s/u/d/i:/a:/r/u/d/a/y/o" ("i:" and "a:" are the long sounds of "i" and "a", respectively) and the kana notation "キミノナマエワエスディーアールダヨ". The phoneme sequence and the kana notation are used in the unknown word acquisition unit 56.
[0037]
The application switching task 71_2 is a task corresponding to the application management unit 31, and is enabled when a task switching command is supplied from the application management unit 31 after the application management unit 31 is activated. With the application switching task 71_2, the matching unit 54 recognizes speech corresponding to commands for starting or ending an application unit, such as "launch the chat application", "launch the voice commander", and "launch name registration".
[0038]
The name registration task 71_3 is a task corresponding to the name registration application unit 21_1, and is enabled when, after the name registration application unit 21_1 is activated by a command from the application management unit 31, a task switching command is supplied from the name registration application unit 21_1. With the name registration task 71_3, the matching unit 54 recognizes speech for registering names, such as "Your name is <unknown word representing a robot name>." and "My name is <unknown word representing a person name>."
[0039]
The chat task 71_4, the voice commander task 71_5, ..., and the other tasks 71_N are tasks corresponding to the chat application unit 21_2, the voice commander application unit 21_3, ..., and the other application units 21_M, respectively. Each is enabled when, after the corresponding application unit is activated by a command from the application management unit 31, a task switching command is supplied from that application unit.
[0040]
With the chat task 71_4, the matching unit 54 can recognize, for example, a chat utterance from the user such as "SDR (robot name), what time did you wake up?" With the voice commander task 71_5, the matching unit 54 can recognize, for example, a command utterance from the user such as "SDR (robot name), move forward one step".
[0041]
Further, the matching unit 54 reflects the words registered in the common dictionary unit 55, described later, in each task.
[0042]
In the following, the phoneme typewriter task 71_1, the application switching task 71_2, the name registration task 71_3, the chat task 71_4, the voice commander task 71_5, ..., and the other tasks 71_N are collectively referred to as the task 71 when they need not be distinguished individually.
[0043]
The common dictionary unit 55 stores a common dictionary as a dictionary of words commonly used in the task 71. In the common dictionary stored in the common dictionary unit 55, pronunciation information and category information are described for all words registered therein. For example, when the proper noun “SDR (robot name)” is registered in the common dictionary, the pronunciation (phonetic information) of “SDR” and the category “_robot name_” are described in the common dictionary. Details will be described later with reference to FIG.
[0044]
For a word (unknown word), such as a name, that is not registered in the recognition dictionary (the fixed word dictionary 131 described later with reference to FIG. 6), the unknown word acquisition unit 56 stores the phoneme sequence and kana notation that are recognized with reference to the phoneme typewriter task 71_1 and supplied from the matching unit 54, so that the speech of that word can thereafter be recognized (distinguished from other speech).
[0045]
That is, the unknown word acquisition unit 56 classifies the speech recognized with reference to the phoneme typewriter task 71_1 into several clusters. Each cluster has an ID, a representative phoneme sequence, and a representative kana notation, and is managed by the ID.
[0046]
FIG. 3 shows the state of the cluster of the unknown word acquisition unit 56.
[0047]
When three voices “Aka”, “Ao”, and “Midori” are input, the unknown word acquisition unit 56 converts the three input voices into “Aka” cluster 91 and “Ao” cluster corresponding to the three voices, respectively. 92 and a “green” cluster 93, and each cluster has a representative phoneme sequence (“a / k / a”, “a / o”, “m / i / d / o / r / i ″), representative kana notation (“red”, “blue”, “midori” in the example of FIG. 3), and ID (“1” in the example of FIG. 3). , "2", "3").
[0048]
Here, when the voice "Aka" is input again, the corresponding cluster already exists, so the unknown word acquisition unit 56 classifies the input voice into the "Aka" cluster 91 and does not generate a new cluster. On the other hand, when the voice "Kuro" is input, there is no corresponding cluster, so the unknown word acquisition unit 56 newly generates a "Kuro" cluster 94 and gives it a representative phoneme sequence ("k/u/r/o" in the example of FIG. 3), a representative kana notation ("Kuro" in the example of FIG. 3), and an ID ("4" in the example of FIG. 3).
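The cluster bookkeeping described above can be sketched as follows. This is only an illustration, not the method of the applications cited in this description: the names `Cluster`, `UnknownWordStore`, and `classify` are hypothetical, and matching an input to a cluster by exact equality of the phoneme sequence is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    cluster_id: int
    phonemes: str  # representative phoneme sequence, e.g. "a/k/a"
    kana: str      # representative kana notation


@dataclass
class UnknownWordStore:
    clusters: list = field(default_factory=list)

    def classify(self, phonemes: str, kana: str) -> Cluster:
        # Assumption: an input belongs to a cluster only when its phoneme
        # sequence equals the cluster's representative sequence.
        for cluster in self.clusters:
            if cluster.phonemes == phonemes:
                return cluster
        # No corresponding cluster: create one with the next ID.
        new_cluster = Cluster(len(self.clusters) + 1, phonemes, kana)
        self.clusters.append(new_cluster)
        return new_cluster


store = UnknownWordStore()
for phonemes, kana in [("a/k/a", "Aka"), ("a/o", "Ao"), ("m/i/d/o/r/i", "Midori")]:
    store.classify(phonemes, kana)
```

Calling `store.classify("a/k/a", "Aka")` again returns the existing cluster with ID 1, while `store.classify("k/u/r/o", "Kuro")` creates a fourth cluster.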
[0049]
By using this method, the accuracy of the representative phoneme sequence and the representative kana pronunciation of each cluster can be improved as the user inputs the same voice repeatedly. For example, suppose that when "Midori" is input for the first time, the phoneme typewriter misrecognizes it and outputs the phoneme sequence "m/e/r/a/a" and the kana pronunciation "Meraa". By repeating the utterance "Midori" many times after that, the phoneme sequence and kana pronunciation may converge to the correct values ("m/i/d/o/r/i" and "Midori"). Details of such a word acquisition process are disclosed in Japanese Patent Application Nos. 2001-097842 and 2001-382579 previously proposed by the present applicant.
[0050]
Next, a robot control process in the robot control system 1 of FIG. 1 will be described with reference to FIGS. This process is started when the robot control system 1 is started by a user.
[0051]
In step S1, the speech recognition engine unit 11 starts up, and the process proceeds to step S2. In step S2, the speech recognition engine unit 11 loads the contents of the common dictionary unit 55 (the common dictionary) and the cluster state of the unknown word acquisition unit 56 that were saved to a storage unit (not shown) when the robot control system 1 last ended (the processing in step S17 described later). When neither the common dictionary nor the cluster state is stored in the storage unit, the common dictionary unit 55 and the unknown word acquisition unit 56 are both left with no entries. When the cluster state is stored but the common dictionary is not, only the common dictionary is initialized (to a state with no entries). Conversely, when the common dictionary is stored but the cluster state is not, the entries derived from clusters (the entries in which a cluster ID is described in FIG. 24) are deleted from the common dictionary, and the entries derived from kana pronunciations (the entries in which a kana pronunciation is described in FIG. 24) are kept.
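The four cases of step S2 can be summarized in a short sketch. The function name `restore_state` and the entry layout (a dict whose `cluster_id` is `None` for kana-derived entries) are assumptions made for illustration; the actual entry format is the one shown in FIG. 24.

```python
def restore_state(saved_dictionary, saved_clusters):
    """Return (common_dictionary, clusters) as they stand after step S2.

    Each argument is None when nothing was saved for it.
    """
    if saved_dictionary is None:
        # Covers "nothing saved" and "only the cluster state saved":
        # the common dictionary starts with no entries.
        clusters = [] if saved_clusters is None else saved_clusters
        return [], clusters
    if saved_clusters is None:
        # Cluster state lost: drop cluster-derived entries,
        # keep kana-derived entries.
        kept = [e for e in saved_dictionary if e.get("cluster_id") is None]
        return kept, []
    return saved_dictionary, saved_clusters
```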
[0052]
After the processing in step S2, the process proceeds to step S3, where the speech recognition engine unit 11 enables the phoneme typewriter task 71₁ so that it can be used for speech recognition, and the process proceeds to step S4. In step S4, the application management unit 31 starts up, and the process proceeds to step S5.
[0053]
In step S5, the application management unit 31 enables its corresponding task, the application switching task 71₂, and the process proceeds to step S6. In step S6, the speech recognition engine unit 11 recognizes a start command for the application unit 21 input by voice to the microphone 51 and supplies the recognition result to the application management unit 31. The details of the voice recognition processing will be described later with reference to the flowchart in FIG.
[0054]
After the process in step S6, the process proceeds to step S7 in FIG. 5, where the application management unit 31 determines, from the recognition result supplied from the speech recognition engine unit 11, whether or not to start the name registration application unit 21₁. If it determines that the name registration application unit 21₁ is to be started (for example, when the recognition result is "start name registration"), the process proceeds to step S8.
[0055]
In step S8, the application management unit 31 starts the name registration application unit 21₁. After the processing in step S8, the process proceeds to step S9, where the name registration application unit 21₁ performs a name registration process. The details of the name registration process will be described later with reference to the flowchart of FIG.
[0056]
If the application management unit 31 determines in step S7 that the name registration application unit 21₁ is not to be started, the process proceeds to step S10, where it determines, based on the recognition result from the speech recognition engine unit 11, whether or not to start the chat application unit 21₂. If the application management unit 31 determines in step S10 that the chat application unit 21₂ is to be started (for example, when the recognition result is "start chat"), the process proceeds to step S11, where it starts the chat application unit 21₂.
[0057]
After the processing in step S11, the process proceeds to step S12, where the chat application unit 21₂ performs chat processing. Details of the chat processing will be described later with reference to a flowchart of FIG.
[0058]
If the application management unit 31 determines in step S10 that the chat application unit 21₂ is not to be started, the process proceeds to step S13, where it determines, based on the recognition result from the speech recognition engine unit 11, whether or not to start the voice commander application unit 21₃. If the application management unit 31 determines in step S13 that the voice commander application unit 21₃ is to be started (for example, when the recognition result is "voice commander activation"), the process proceeds to step S14, where it starts the voice commander application unit 21₃.
[0059]
After the processing of step S14, the process proceeds to step S15, where the voice commander application unit 21₃ performs voice commander processing. Details of the voice commander processing will be described later with reference to the flowchart of FIG.
[0060]
If the application management unit 31 determines in step S13 that the voice commander application unit 21₃ is not to be started, it treats the recognition result from the speech recognition engine unit 11 as erroneous (the utterance may have been something other than an application switching command), so the process returns to step S6 in FIG. 4, and the speech recognition engine unit 11 again performs the process of recognizing input voice.
[0061]
Thus, the application management unit 31 activates the application unit 21 according to the recognition result by the speech recognition engine unit 11.
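The decisions in steps S7, S10, and S13 amount to a lookup from the recognition result to an application unit. A minimal sketch follows; the table `START_COMMANDS` and the function `select_application` are hypothetical names, and only the three example phrases quoted above are covered.

```python
# Example recognition results quoted in the description, mapped to the
# application unit each one starts.
START_COMMANDS = {
    "start name registration": "name registration application unit 21-1",
    "start chat": "chat application unit 21-2",
    "voice commander activation": "voice commander application unit 21-3",
}


def select_application(recognition_result):
    """Return the application unit to start, or None.

    None corresponds to the branch that treats the result as erroneous
    and returns to step S6 to recognize the next input voice.
    """
    return START_COMMANDS.get(recognition_result)
```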
[0062]
After the processes in steps S9, S12, and S15, the process proceeds to step S16, and the application management unit 31 determines whether to end the robot control process. For example, the application management unit 31 determines whether or not the end button (not shown) is pressed by the user, and determines that the robot control process is to be ended when the end button is pressed.
[0063]
If it is determined in step S16 that the robot control process is not to be ended, the process returns to step S6 in FIG. 4, and the process of recognizing input voice is repeated. If the application management unit 31 determines in step S16 that the robot control process is to be ended (the end button has been pressed), the process proceeds to step S17, where the common dictionary of the common dictionary unit 55 and the cluster state of the unknown word acquisition unit 56 are saved to a storage unit (not shown).
[0064]
Then, when there is an activated application unit 21, the application management unit 31 ends that application unit. At this time, the application unit 21 invalidates its corresponding task 71. Further, the application management unit 31 invalidates the application switching task 71₂, the speech recognition engine unit 11 invalidates the phoneme typewriter task 71₁, and the application management unit 31 and the speech recognition engine unit 11 end their processing.
[0065]
Note that the above-described processing covers the case where there are three application units: the name registration application unit 21₁, the chat application unit 21₂, and the voice commander application unit 21₃. If there is another application unit and it is determined in step S13 that the voice commander application unit 21₃ is not to be started, the process does not return to step S6. Instead, as in steps S7, S10, and S13, it is determined whether or not to start the other application unit, and that application unit is started according to the determination result.
[0066]
Further, in the above-described processing, the end of the voice recognition is instructed by the user. However, the robot control system 1 may decide to end automatically, for example when no voice is input for a predetermined time.
[0067]
According to the above processing, the application switching task 71₂ remains enabled even while an application unit 21 is running. Therefore, even when the utterance "Start OO" is made while another application unit is running, the utterance is recognized and the corresponding application can be started. For example, when the user utters "Start chat" while the voice commander application unit 21₃ is running, the chat application unit 21₂ can be started.
[0068]
In this case, whether to terminate the running application unit and then start the new one, to place the running application unit in a paused state, start the new one, and restart the paused one afterwards, or to run both in parallel is set in advance for each combination of application units (it may also be determined dynamically based on resource constraints such as memory).
[0069]
FIG. 6 shows the configuration of the task 71. The task 71 includes an acoustic model 111, a language model 112, a dictionary 113, a phoneme list 114, kana phoneme conversion rules 115, and search parameters 116.
[0070]
The acoustic model 111 stores a model representing acoustic features such as individual phonemes and syllables of the speech to be recognized. As the acoustic model, for example, HMM (Hidden Markov Model) can be used.
[0071]
The language model 112 describes information indicating how the words registered in the word dictionary of the dictionary 113 chain (connect) together (hereinafter referred to as chain information, as appropriate). Description methods include statistical word chain probabilities (n-grams), generative grammars, and finite state automata.
[0072]
In addition to chain information about words, the language model 112 includes chain information about categories into which words are classified from a particular viewpoint. For example, when the category consisting of words representing user names is represented by the symbol "_user name_" and the category consisting of words representing robot names is represented by the symbol "_robot name_", the language model 112 also describes chain information for "_user name_" and "_robot name_" (chains between categories, chains between a category and a word stored in the dictionary in advance, and so on).
[0073]
Therefore, chain information can be obtained even for words not included in the language model 112. For example, when the chain information of "SDR" and "wa" (a particle) is needed, even if the language model 112 does not describe chain information for "SDR", as long as "SDR" is known to belong to the category represented by the symbol "_robot name_", the chain information of "_robot name_" and "wa" can be obtained instead and used as the chain information of "SDR" and "wa".
[0074]
The categories need not be a classification by semantic attribute ("_robot name_", "_user name_", "_place name_", "_store name_", etc.); a classification by part of speech ("_noun_", "_verb_", "_particle_", etc.) may be used instead. Hereinafter, the notation "_..._" represents a category name.
[0075]
The dictionary 113 includes a fixed word dictionary 131, a variable word dictionary 132, and a category table 133.
[0076]
The fixed word dictionary 131 describes various information, such as pronunciations (phoneme sequences) and models describing the chain relationships of phonemes and syllables, for words that are not subject to registration or deletion, that is, words set in advance in the robot control system 1 (hereinafter referred to as fixed words, as appropriate).
[0077]
The fixed word dictionary 131 describes, for each task 71, information about a dedicated word used by the application unit 21 corresponding to the task 71. The same applies to the above-described acoustic model 111 and language model 112, and also to a category table 133, phoneme list 114, kana phoneme conversion rules 115, and search parameters 116 described later.
[0078]
The variable word dictionary 132 describes various information, such as pronunciations and models describing the chain relationships of phonemes and syllables, for words that are subject to registration and deletion, that is, registered words. When a word is registered in the common dictionary, the registered word is reflected in the variable word dictionary 132. This reflection processing will be described later with reference to FIG. Deletion of words and change of pronunciation can be performed only for entries in the variable word dictionary 132. Note that the variable word dictionary 132 may be empty.
[0079]
The category table 133 stores a table indicating a correspondence between a category included in the language model 112 and information on a word included in the category. When the task 71 assigns an ID unique to the category (category ID), the category table 133 also stores the correspondence between the symbol of the category and the ID. For example, when a category ID “4” is assigned to the category “_robot name_”, the category ID = 4 is also stored corresponding to “_robot name_”. Note that the category table 133 does not store anything when the language model 112 does not include a category.
[0080]
The phoneme list 114 is a list of phoneme symbols used in the task 71. The kana phoneme conversion rule 115 is a rule for converting a kana character string into a phoneme sequence. In this way, by storing the kana phoneme conversion rules 115 for each task, the common dictionary unit 55 can hold, as pronunciation information, a kana character string that is independent of the phoneme sequence.
[0081]
The search parameter 116 holds the parameters used when the matching unit 54 performs matching (search). The parameters include values that depend on the acoustic model 111, values that depend on the vocabulary size, values that depend on the type of the language model 112, and the like. However, parameters that do not depend on the task may be held in common in the speech recognition engine unit 11.
[0082]
In the above description, all data is stored for each task. However, data used in common by a plurality of tasks can be shared between tasks to reduce the memory usage. For example, when the phoneme list 114 is common to all tasks, only one phoneme list 114 is prepared by the speech recognition engine unit 11, and each task may refer to it. In this case, it is sufficient to prepare only one kana phoneme conversion rule 115.
[0083]
In addition, two types of acoustic model 111 may be prepared, one for a quiet environment (an acoustic model that gives a high recognition rate in a quiet environment) and one for a noisy environment (an acoustic model that gives a reasonable recognition rate even in a noisy environment), and each task may refer to one of them.
[0084]
For example, the name registration task 71₃ and the chat task 71₄ are assumed to be used in a quiet environment, so they refer to the acoustic model 111 for the quiet environment, while the voice commander task 71₅ is assumed to be used in a noisy environment (an environment in which the operating sound of the robot is loud), so it can refer to the acoustic model for the noisy environment.
[0085]
FIG. 7 shows an example of the phoneme list 114 of FIG. 6. In FIG. 7, one symbol represents (corresponds to) one phoneme. In the phoneme list 114 of FIG. 7, a vowel followed by a colon (for example, "a:") represents a long vowel, and "N" represents the moraic nasal ("n"). "sp", "silB", "silE", and "q" all represent silence: "silence within an utterance", "silence before an utterance", "silence after an utterance", and the "geminate consonant" (small "tsu"), respectively.
[0086]
FIG. 8 shows an example of the kana phoneme conversion rule 115 of FIG. 6. According to the kana phoneme conversion rule 115 of FIG. 8, for example, the kana character string for "SDR" ("エスディーアール") is converted into the phoneme sequence "e/s/u/d/i:/a:/r/u".
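As an illustration of how such a rule table could be applied, the sketch below converts a kana string by greedy longest match. The table `RULES` is a hypothetical excerpt covering only the kana needed for the "SDR" example (the contents of FIG. 8 are not reproduced here), and treating the long-vowel mark "ー" as lengthening the preceding phoneme is an assumption consistent with the "a:" notation of the phoneme list.

```python
# Hypothetical excerpt of a kana-to-phoneme table; not the table of FIG. 8.
RULES = {
    "エ": ["e"],
    "ス": ["s", "u"],
    "ディ": ["d", "i"],
    "ア": ["a"],
    "ル": ["r", "u"],
}


def kana_to_phonemes(kana: str) -> str:
    phonemes = []
    i = 0
    while i < len(kana):
        if kana[i] == "ー" and phonemes:
            # Long-vowel mark: lengthen the preceding vowel (e.g. "i" -> "i:").
            phonemes[-1] += ":"
            i += 1
            continue
        # Try a two-character kana first (e.g. "ディ"), then a single one.
        for size in (2, 1):
            chunk = kana[i:i + size]
            if chunk in RULES:
                phonemes.extend(RULES[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no conversion rule for {kana[i]!r}")
    return "/".join(phonemes)
```

With this table, `kana_to_phonemes("エスディーアール")` yields `"e/s/u/d/i:/a:/r/u"`, the sequence given in the text.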
[0087]
Next, examples of the language model 112 and the dictionary 113 (FIG. 6) of each task are shown.
[0088]
FIG. 9 shows an example of the language model 112 (FIG. 6) of the phoneme typewriter task 71₁. In FIG. 9, the variable "$SYLLABLE" on the first line joins all kana notations with "|", which means "or", and therefore stands for any one of the kana notations.
[0089]
That is, the phoneme typewriter task 71₁ is a task that performs speech recognition in units of syllables, and the language model 112 in FIG. 9 describes, as a grammar in BNF (Backus-Naur Form), the chain rule that any syllable can follow any other syllable. A statistical language model, described later, may also be used as the language model 112.
[0090]
FIG. 10 shows an example of the fixed word dictionary 131 (FIG. 6) of the phoneme typewriter task 71₁. The "symbol" is a character string that identifies a word; kana notation can be used, for example. Entries with the same symbol are regarded as entries for the same word. The language model 112 is expressed using these symbols. Note that "<head>" and "<end>" are special symbols and represent "silence before utterance" and "silence after utterance", respectively (the same applies to FIG. 11 and the like described later).
[0091]
Further, “transcription” indicates a notation of a word, and the character string output as a recognition result is this transcription. The “phonological sequence” is a representation of the pronunciation of a word as a phonological sequence.
[0092]
Nothing is stored in the variable word dictionary 132 of the phoneme typewriter task 71₁, because words are not expected to be added to the phoneme typewriter task 71₁. Also, since the language model 112 of the phoneme typewriter task 71₁ does not include a category, as shown in FIG. 9, nothing is stored in the category table 133.
[0093]
FIG. 11 shows an example of the language model 112 (FIG. 6) of the application switching task 71₂. The language model 112 in FIG. 11 is described as a BNF grammar. The variable "$APPLICATIONS" on the first line joins all application names ("chat", "voice commander", "name registration", etc.) with "|", which means "or", and therefore stands for any one of the application names.
[0094]
Also, in the variable "$UTTERANCE" on the second line, "[]", which means "can be omitted", is attached to each of "_robot name_" and "wo", so the variable represents an utterance that starts an application by name, in which the robot name and "wo" may each be omitted. Here, the "robot name" indicates a word registered in the category "_robot name_".
[0095]
For example, when "SDR" is registered in "_robot name_", utterances such as "SDR, voice commander (activate)" and "voice commander (activate)" are recognized by using the language model 112 of FIG. 11.
[0096]
By describing the language model 112 using category names in this way, an utterance containing a newly registered word can be recognized with the language model 112, provided that the word belongs to a category described in the language model 112.
[0097]
FIG. 12 shows an example of the fixed word dictionary 131 (FIG. 6) of the application switching task 71₂. In the fixed word dictionary 131 of FIG. 12, transcriptions and phoneme sequences are described for the symbols ("chat" and "voice commander" in FIG. 11) that appear in the grammar of the language model 112 of FIG. 11.
[0098]
FIG. 13 shows an example of the category table 133 (FIG. 6) of the application switching task 71₂. The category table 133 stores the types of categories used in the language model 112 and information on the words belonging to each category. Because the language model 112 of the application switching task 71₂ uses the category "_robot name_", "_robot name_" is entered in the category table 133 as shown in FIG. 13. In FIG. 13, the set of words belonging to the category "_robot name_" is an empty set, indicating that no word yet belongs to "_robot name_".
[0099]
As shown in FIG. 13, even when a category is entered in the category table 133, if no word belongs to that entry (the set is empty), no word information is stored in the variable word dictionary 132.
[0100]
FIG. 14 shows an example of the language model 112 (FIG. 6) of the name registration task 71₃. The language model 112 in FIG. 14 is described as a BNF grammar. The variable "$UTTERANCE" joins "I [name] is <OOV> [is] [it is called]" and "You [name] is <OOV> [it is]" with "|", which means "or", and "[]", which means "can be omitted", is attached to "name", "is", "it is called", and "it is".
[0101]
Therefore, with the language model 112 of FIG. 14, utterances of the form "I (name) is <OOV> (is) (it is called)" or "You (name) is <OOV> (it is)" are recognized. Note that <OOV> is a symbol meaning "Out Of Vocabulary" and stands for a phrase with an arbitrary pronunciation (a word not described in the fixed word dictionary 131).
[0102]
By using the symbol <OOV>, when utterances such as "My name is Taro" and "Your name is SDR" ("Taro" and "SDR" are not described in the fixed word dictionary 131) are applied to the language model 112 in FIG. 14, they match "<head> My name is <OOV> <end>" and "<head> Your name is <OOV> <end>". As a result, the speech recognition results "My name is Taro" and "Your name is SDR" are obtained.
[0103]
FIG. 15 shows an example of the fixed word dictionary 131 (FIG. 6) of the name registration task 71₃. In the fixed word dictionary 131, transcriptions and phoneme sequences are described for the symbols described in the grammar of the language model 112 as shown in FIG.
[0104]
Nothing is stored in the variable word dictionary 132 of the name registration task 71₃, because words are not expected to be added to the name registration task 71₃. Also, since the language model 112 of the name registration task 71₃ does not include a category, as shown in FIG. 14, nothing is stored in the category table 133.
[0105]
FIG. 16 shows an example of the language model 112 (FIG. 6) of the chat task 71₄. Since chat involves a large vocabulary and many utterance variations, a statistical language model is used as the language model 112. A statistical language model describes the chain information of words as conditional probabilities. The language model 112 of FIG. 16 uses a tri-gram, which represents the probability of a sequence of three words 1, 2, and 3, that is, the probability of a chain of three words.
[0106]
In FIG. 16, "P(word 3 | word 1 word 2)" represents the probability that "word 3" appears next when "word 1" and "word 2" appear in sequence in the word string. For example, when the sequence "<head>" "_robot name_" occurs, the probability that "wa" (a particle) appears next is "0.012". These probabilities are obtained in advance by analyzing a large amount of text transcribing chat. As the language model 112, a bi-gram (the probability of a chain of two words), a uni-gram (the probability of appearance of a word), and the like can be used in addition to the tri-gram as necessary.
[0107]
In the language model 112 of FIG. 16 as well, the model is described using categories in addition to words, as in the case of FIG. That is, in FIG. 16, "_robot name_" and "_place name_" denote the categories "_robot name_" and "_place name_", and the tri-gram is described using these categories. Thus, when a word representing a robot name or a place name is registered in the variable word dictionary 132, the word can be recognized by the chat task 71₄.
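A category-based tri-gram lookup of this kind can be sketched as follows. The value 0.012 is the example quoted above; the names `TRIGRAM`, `CATEGORY_OF`, and `trigram_probability` are hypothetical, and returning 0.0 for an unseen sequence is a simplification (a practical model would typically back off to bi-gram or uni-gram probabilities).

```python
# P(word 3 | word 1 word 2), keyed by (word 1, word 2, word 3).
# "wa" stands for the particle in the example of FIG. 16.
TRIGRAM = {("<head>", "_robot name_", "wa"): 0.012}

# Registered words and the categories they belong to.
CATEGORY_OF = {"SDR": "_robot name_"}


def trigram_probability(word1, word2, word3):
    # Replace each registered word by its category symbol, then look up
    # the tri-gram described for those categories.
    key = tuple(CATEGORY_OF.get(w, w) for w in (word1, word2, word3))
    return TRIGRAM.get(key, 0.0)
```

Here `trigram_probability("<head>", "SDR", "wa")` returns 0.012 even though "SDR" itself never appears in the model, because the lookup is made through "_robot name_".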
[0108]
FIG. 17 shows an example of the fixed word dictionary 131 of the chat task 71₄. In the fixed word dictionary 131, transcriptions and phoneme sequences are described for the symbols described in the grammar of the language model 112 as shown in FIG.
[0109]
FIG. 18 shows an example of the category table 133 of the chat task 71₄. The category table 133 stores the types of categories used in the language model 112 and information on the words belonging to each category. Because the language model 112 of the chat task 71₄ uses the two categories "_robot name_" and "_place name_", both "_robot name_" and "_place name_" are entered in the category table 133 as shown in FIG. 18. FIG. 18 shows that no words yet belong to the categories "_robot name_" and "_place name_".
[0110]
FIG. 19 shows an example of the language model 112 (FIG. 6) of the voice commander task 71₅. The language model 112 in FIG. 19 is described as a BNF grammar. The variable "$NUMBER" on the first line joins numbers ("1", "2", "3", etc.) with "|", which means "or", and therefore stands for any one of the numbers.
[0111]
The variable "$DIRECTION" on the second line joins directions ("front", "back", "right", "left", etc.) with "|", which means "or", and therefore stands for any one of the directions. The variable "$UTTERANCE" on the third line consists of "_robot name_", "$DIRECTION", and "$NUMBER step" followed by "advance". "[]", which means "can be omitted", is attached to each of "_robot name_", "$DIRECTION", and "$NUMBER step".
[0112]
Therefore, with the language model 112 of FIG. 19, an utterance such as "(robot name), advance three steps forward" is recognized, for example.
[0113]
FIG. 20 shows an example of the fixed word dictionary 131 of the voice commander task 71₅. In the fixed word dictionary 131, transcriptions and phoneme sequences are described for symbols described in the grammar of the language model 112 as shown in FIG.
[0114]
Note that the symbols "1" and "step" each appear in more than one entry, which means that "1" and "step" each have two pronunciations ("ichi" and "it", and "ho" and "po", respectively). As a result, utterances with different pronunciations, such as "Ichiho" and "Ippo", can be recognized as the same "one step".
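The effect of listing one symbol in several entries can be sketched as a lookup from phoneme sequence to symbol. The phoneme sequences below are assumed for illustration (using "q" for the geminate consonant, as in the phoneme list) and are not taken from FIG. 20; `FIXED_WORDS`, `symbol_for`, and `recognize` are hypothetical names.

```python
# (symbol, transcription, phoneme sequence); sequences are assumptions.
FIXED_WORDS = [
    ("1", "1", "i/ch/i"),
    ("1", "1", "i/q"),
    ("step", "step", "h/o"),
    ("step", "step", "p/o"),
]


def symbol_for(phonemes):
    """Return the symbol of the entry with this phoneme sequence, or None."""
    for symbol, _transcription, sequence in FIXED_WORDS:
        if sequence == phonemes:
            return symbol
    return None


def recognize(phoneme_sequences):
    # Entries sharing a symbol are the same word, so different
    # pronunciations map to the same result.
    return [symbol_for(p) for p in phoneme_sequences]
```

Both `recognize(["i/ch/i", "h/o"])` and `recognize(["i/q", "p/o"])` give `["1", "step"]`.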
[0115]
Because the language model 112 of the voice commander task 71₅ uses only the category "_robot name_", the category table 133 of the voice commander task 71₅ is the same as the category table 133 of the application switching task 71₂. In the state where no word belonging to "_robot name_" has been registered yet, nothing is stored in the variable word dictionary 132 of the voice commander task 71₅.
[0116]
Next, the name registration process performed by the name registration application unit 21₁ in step S9 in FIG. 5 will be described in detail with reference to the flowchart in FIG. This process starts when the name registration application unit 21₁ is started in response to the user's utterance. Before this process starts, the user selects in advance, as the name registration mode, either a voice input mode in which a name is input by voice or a kana input mode in which a name is input in kana from a keyboard or the like, for example with a mode switching button (not shown).
[0117]
In step S41, the name registration application unit 21₁ enables the name registration task 71₃ of the speech recognition engine unit 11 so that the name registration task 71₃ can be used for speech recognition.
[0118]
After the process in step S41, the process proceeds to step S42, where the name registration application unit 21₁ determines whether or not the name registration mode is the voice input mode. If it determines that the name registration mode is the voice input mode, the process proceeds to step S43, where the matching unit 54 performs a name recognition process, and then proceeds to step S44. (Alternatively, in step S42, if the user speaks, it may be determined that the name has been input by voice and the process may proceed to step S43, and if a kana input button (not shown) has been pressed, it may be determined that the name is being input in kana and the process may proceed to step S46.) Details of this name recognition processing will be described later with reference to FIG.
[0119]
In step S44, the name registration application unit 21₁ determines whether the speech recognition result of the name (the recognized name) obtained by the name recognition process of the matching unit 54 in step S43 is correct. This determination is made, for example, by speaking the recognition result to the user and checking whether the user operates an OK button (not shown).
[0120]
If it is determined in step S44 that the name speech recognition result is incorrect, the user is prompted to speak again, and the process returns to step S43 to perform name recognition processing again. If it is determined in step S44 that the recognition result is correct, the process proceeds to step S47.
[0121]
On the other hand, if the name registration application unit 21₁ determines in step S42 that the name registration mode is not the voice input mode, the process proceeds to step S45, where it determines whether the name registration mode is the kana input mode.
[0122]
If it is determined in step S45 that the name registration mode is not the kana input mode, the user has not selected the name registration mode, so the user waits until the name registration mode is selected. After waiting, the process returns to step S42.
[0123]
If it is determined in step S45 that the name registration mode is the kana input mode, the process proceeds to step S46, where the name registration application unit 21₁ acquires the kana sequence of the name input by the user and the category of the name.
[0124]
Methods of inputting the kana sequence include, for example, temporarily connecting a keyboard so that the user can type kana characters, inputting with the various switches of the robot, showing the robot a sheet of paper or the like on which characters are written so that it recognizes them (see, for example, Japanese Patent Application No. 2001-135423), connecting the robot to a personal computer by a wireless LAN (Local Area Network) or the like and transferring the kana sequence from the personal computer to the robot, and downloading it to the robot. In the method of recognizing characters on a sheet of paper or the like shown to the robot, a character string containing both kana and kanji may be input instead of kana characters, and the name registration application unit 21₁ may convert it into a kana sequence (see Japanese Patent Application No. 2001-135423).
[0125]
Further, instead of the user inputting the kana sequence of the name, entries containing the kana of names may be placed in the common dictionary of the common dictionary unit 55 in advance, and the name registration application unit 21₁ may refer to the common dictionary unit 55 to obtain the kana sequence of a name.
[0126]
After the processing in step S44 or S46, the process proceeds to step S47, where the name registration application unit 21₁ determines the category of the name to be registered. When the name registration mode is the kana input mode, the name registration application unit 21₁ adopts the category acquired in step S46 (input by the user) as the category of the name to be registered.
[0127]
That is, in the kana input mode, the user is asked in step S46 to input the category of the name in addition to the name itself, and the category input by the user is adopted as the category of the name to be registered. On the other hand, when the name registration mode is the voice input mode, the name registration application unit 21₁ determines the category by estimating it from the result of the name recognition process in step S43.
[0128]
For example, if the recognition result supplied from the speech recognition engine unit 11 starts with "You" ("Kimi"), the category to which the registered name belongs is estimated to be "_robot name_", and if it starts with "I", the category is estimated to be "_user name_". The various category estimation methods disclosed in Japanese Patent Application No. 2001-382579 previously proposed by the present applicant can also be used.
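The prefix-based estimation can be sketched as below. The word lists are assumptions chosen to fit the English renderings of the example utterances in this description ("My name is Taro", "Your name is SDR"); the function name `estimate_category` is hypothetical, and returning `None` for other openings is a choice made for the sketch.

```python
# Assumed opening words for each category (English renderings plus the
# romanized "Kimi" that appears in the text).
ROBOT_OPENERS = {"You", "Your", "Kimi"}
USER_OPENERS = {"I", "My"}


def estimate_category(recognition_result):
    """Estimate the category of the registered name from the first word."""
    words = recognition_result.split()
    if not words:
        return None
    if words[0] in ROBOT_OPENERS:
        return "_robot name_"
    if words[0] in USER_OPENERS:
        return "_user name_"
    return None  # no estimate; another estimation method would be needed
```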
[0129]
After the process of step S47, the process proceeds to step S48, where the name registration application unit 21₁ controls the matching unit 54 to enter the pronunciation information and category of the name to be registered in the common dictionary of the common dictionary unit 55, and proceeds to step S49. In step S49, the name registration application unit 21₁ controls the matching unit 54 to reflect the contents of the common dictionary in the chat task 71₄, the voice commander task 71₅, ..., and the other tasks 71_N. The details of this reflection will be described later with reference to a flowchart.
[0130]
In this way, by reflecting the name registered in the common dictionary in another task, the registered name can be recognized in another task.
[0131]
After the process of step S49, the process proceeds to step S50, where the name registration application unit 21₁ determines whether to end the name registration process. This determination is made, for example, by having the robot ask the user by voice whether to end and checking whether the user has operated (pressed) an OK button (not shown). If it is determined in step S50 that the name registration process is not to be ended (for example, the OK button has not been pressed), the process returns to step S42 to register another name.
[0132]
If it is determined in step S50 that the name registration process is to be ended (for example, the OK button has been pressed), the process proceeds to step S51, where the name registration application unit 21₁ invalidates the name registration task 71₃, and the process proceeds to step S52. In step S52, the name registration application unit 21₁ ends the processing.
[0133]
FIG. 22 is a flowchart illustrating a name recognition process performed by the matching unit 54 of FIG. 2 in step S43 of FIG.
[0134]
In step S61, the matching unit 54 determines whether or not a voice has been input to the microphone 51. If it is determined that no voice has been input, the matching unit 54 waits until a voice is input. If it is determined in step S61 that a voice has been input, the process proceeds to step S62. The voice input here may be ordinary conversation such as "My name is Taro" or "Your name is SDR"; the user does not need to be conscious of name registration and utter the name "Taro" or "SDR" on its own.
[0135]
In step S62, the matching unit 54 recognizes the voice and extracts the name. For example, if the utterance "Your name is SDR" is made, the matching unit 54 refers to the name registration task 71₃, which has the language model 112 shown in FIG. 14 and its fixed word dictionary 131, and generates a recognition result such as "Your name is <OOV> <end>". The matching unit 54 also obtains information indicating which section of the utterance corresponds to <OOV> (from how many seconds to how many seconds after the start of the utterance).
[0136]
Further, the matching unit 54 refers to the phoneme typewriter task 71₁, which has the language model 112 shown in FIG. 9 and its fixed word dictionary 131, and obtains, for example, the phoneme sequence "k/i/m/i/n/o/n/a/m/a/e/w/a/e/s/u/d/i:/a:/r/u/d/a/y/o" and the kana sequence "Kimi no namae wa SDR da yo".
[0137]
Then, based on the information indicating which section of the utterance corresponds to <OOV>, the matching unit 54 cuts out of the obtained phoneme sequence and kana sequence the part corresponding to <OOV>, that is, the name section, and obtains the phoneme sequence "e/s/u/d/i:/a:/r/u" and the kana sequence "SDR". The matching unit 54 also obtains the audio data of the same section. The details of the process of extracting the name are disclosed in Japanese Patent Application No. 2001-382579 previously proposed by the present applicant.
[0138]
After the process in step S62, the process proceeds to step S63, in which the matching unit 54 supplies the phoneme sequence, kana sequence, and voice data of the name extracted in step S62 to the unknown word acquisition unit 56 and causes it to perform clustering. The details of the clustering are disclosed in Japanese Patent Application No. 2001-097845 previously proposed by the present applicant. As a result of this clustering, each cluster of the unknown word acquisition unit 56 holds a representative phoneme sequence and a representative kana sequence.
[0139]
After the processing in step S63, the process proceeds to step S64, in which the matching unit 54 supplies the recognition result of the voice recognized in step S62 (for example, the kana sequence "Kimi no namae wa SDR da yo") to the name registration application unit 21₁.
[0140]
FIG. 23 shows an example of the feature space clustered by the unknown word acquisition unit 56 in the process of step S63. To avoid complicating the drawing, FIG. 23 shows a feature space defined by two feature amounts (feature parameters) 1 and 2 (the same applies to FIG. 3 described above). In FIG. 23, four names, "Arara", "Sanii", "Tokyo", and "Tarou", are clustered in the feature space.
[0141]
That is, in FIG. 23, four clusters are formed in the feature space: an "Arara" cluster 151, a "Sanii" cluster 152, a "Tokyo" cluster 153, and a "Tarou" cluster 154. Each cluster is given a representative phoneme sequence ("a/r/a/r/a", "s/a/n/i:", "t/o:/ky/o:", and "t/a/r/o/u" in the example of FIG. 23), a representative kana notation ("Arara", "Sanii", "Tokyo", and "Tarou" in the example of FIG. 23), and an ID ("1", "2", "3", and "5" in the example of FIG. 23).
[0142]
FIG. 24 shows an example of the common dictionary of the common dictionary unit 55 in which word information has been entered in step S48. In FIG. 24, the entry on the first line indicates that the pronunciation was input as a kana sequence, that the pronunciation is the character string "SDR", and that "_robot name_" was input as the category.
[0143]
The entry on the second line indicates that the pronunciation was input by voice, and that the kana notation and phoneme sequence of the pronunciation are the representative kana notation ("Tarou" in the example of FIG. 23) and the phoneme sequence ("t/a/r/o:" in the example of FIG. 23) attached to the cluster whose ID in the unknown word acquisition unit 56 is "5" (FIG. 23). The category of the entry on the second line is the one determined by the name registration application unit 21₁ in step S47 in FIG. 21, namely "_user name_". For example, when the user utters "My name is Taro", an entry like the second line is created in the common dictionary unit 55.
[0144]
Similarly, the entries on the third and fourth lines indicate that the pronunciations were input as kana sequences, that the pronunciations are the character strings "Sunitarou" and "Kitashinagawa", respectively, and that "_user name_" and "_place name_", respectively, were input as the categories. The entry on the fifth line indicates that the pronunciation was input by voice, and that the kana notation and phoneme sequence of the pronunciation are the representative kana notation ("Tokyo" in the example of FIG. 23) and the phoneme sequence ("t/o:/ky/o:" in the example of FIG. 23) attached to the cluster whose ID in the unknown word acquisition unit 56 is "3". The category of the entry on the fifth line was determined to be "_place name_" by the name registration application unit 21₁.
[0145]
In the common dictionary, for a word whose pronunciation was input as a kana sequence, a pair consisting of the kana sequence representing the pronunciation and the category is registered in one entry; for a word whose pronunciation was input by voice, a pair consisting of the ID of the word's cluster and the category is registered in one entry.
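One way to picture this entry layout is the sketch below. The class name `CommonDictEntry` and its field names are hypothetical; the five entries simply mirror the FIG. 24 example as described in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CommonDictEntry:
    """One common dictionary entry: a category plus one pronunciation source."""
    category: str
    kana: Optional[str] = None        # set when the pronunciation was input as kana
    cluster_id: Optional[int] = None  # set when the pronunciation was input by voice

    def __post_init__(self):
        # Exactly one of the two pronunciation sources must be present.
        if (self.kana is None) == (self.cluster_id is None):
            raise ValueError("an entry needs either kana or cluster_id, but not both")


# Entries modelled on the FIG. 24 description.
common_dictionary = [
    CommonDictEntry("_robot name_", kana="SDR"),
    CommonDictEntry("_user name_", cluster_id=5),
    CommonDictEntry("_user name_", kana="Sunitarou"),
    CommonDictEntry("_place name_", kana="Kitashinagawa"),
    CommonDictEntry("_place name_", cluster_id=3),
]
```

Storing only the cluster ID for voice-registered words means the pronunciation itself stays in the unknown word acquisition unit 56 and can keep improving there.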
[0146]
FIG. 25 is a flowchart illustrating a process in which the matching unit 54 reflects the contents of the common dictionary unit 55 to the task in the process of step S49 in FIG. This process is performed for each enabled task.
[0147]
In step S81, the matching unit 54 initializes the variable word dictionary 132 and the category table 133 in the task 71 (FIG. 6). That is, the variable word dictionary 132 is left with no entries, and the category table 133 is put in a state in which no word belongs to any category.
[0148]
After the process in step S81, the process proceeds to step S82, where the matching unit 54 reflects the contents of the common dictionary unit 55 in the variable word dictionary 132 and the category table 133.
[0149]
That is, the matching unit 54 selects, from the common dictionary of the common dictionary unit 55, the entries whose category is the same as a category entered in the category table 133, and acquires from each of them the category and the corresponding cluster ID or kana pronunciation (kana sequence). When it acquires a cluster ID from the common dictionary, the matching unit 54 further acquires the kana sequence corresponding to that cluster ID from the unknown word acquisition unit 56.
[0150]
When the matching unit 54 obtains the kana sequence of the word belonging to the selected category from the common dictionary of the common dictionary unit 55 as described above, the matching unit enters the kana sequence into the variable word dictionary 132. Further, the matching unit 54 enters the information of the word represented by the kana sequence acquired from the common dictionary into the corresponding category of the category table 133.
[0151]
According to the above-described processing, in each task, the contents of the common dictionary are reflected after the variable word dictionary 132 is initialized. That is, the variable word dictionary 132 is constructed or reconstructed based on the contents of the common dictionary. Therefore, it is possible to easily maintain consistency in each task as compared with a method of deleting or changing a specific entry in the dictionary.
[0152]
Further, according to the above-described processing, for a word registered by voice, the latest pronunciation information at that time is acquired from the unknown word acquisition unit 56 and entered in the variable word dictionary 132 every time the common dictionary is reflected in a task. Consequently, once a word has been registered by voice, its pronunciation information is updated simply by supplying speech data to the unknown word acquisition unit 56, and the matching unit 54 can recognize speech by referring to the latest pronunciation information at that time.
[0153]
FIG. 26 is a block diagram illustrating the reflection processing. When the common dictionary of the common dictionary unit 55 describes a kana sequence for a category, the kana sequence is registered in the variable word dictionary 132, and the information of the word represented by that kana sequence is registered in the category of the category table 133 that is the same as the category in the common dictionary.
[0154]
On the other hand, if the common dictionary describes a cluster ID corresponding to the category, the unknown word acquisition unit 56 is referred to, and the representative kana sequence and the representative phoneme sequence corresponding to the cluster ID are registered in the variable word dictionary 132. Then, the information of the word represented by the cluster ID of the common dictionary is registered in the same category as the category of the common dictionary in the category table 133. In the speech recognition processing described later, both the fixed word dictionary 131 and the variable word dictionary 132 are used.
[0155]
FIG. 27 shows an example of the variable word dictionary 132 of the application switching task 71₂ in which the contents of the common dictionary unit 55 are reflected. When the category table 133 of the application switching task 71₂ is as shown in FIG. 13, the category it has in common with the common dictionary in FIG. 24 is "_robot name_", so the matching unit 54 acquires from the common dictionary the kana pronunciation "SDR" corresponding to "_robot name_".
[0156]
Then, as shown in FIG. 27, the matching unit 54 enters the kana pronunciation "SDR" acquired from the common dictionary in FIG. 24 into the transcription field of the variable word dictionary 132. Further, based on the kana phoneme conversion rule 115 (FIG. 8), the matching unit 54 enters "e/s/u/d/i:/a:/r/u", which corresponds to the kana pronunciation "SDR", as the phoneme sequence for the transcription "SDR".
[0157]
In addition, the matching unit 54 registers "OOV00001" as the symbol of the word represented by the transcription "SDR". Here the symbol is "OOV00001", that is, "OOV" followed by a serial number, but the symbol may be any character string that uniquely identifies the word. For example, a symbol with the category name added at the head, such as "_robot name_::OOV00001", can also be used.
[0158]
FIG. 28 shows an example of the category table 133 of the application switching task 71₂ in which the contents of the common dictionary of FIG. 24 are reflected. When the contents of the common dictionary of FIG. 24 are reflected in the variable word dictionary 132 as shown in FIG. 27, the symbol "OOV00001" of the word belonging to the category "_robot name_" registered in the variable word dictionary 132 of FIG. 27 is entered under the "_robot name_" category registered in the category table 133.
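The reflection processing of FIG. 25 (initialize, then rebuild from the common dictionary) can be sketched as below. This is a minimal illustration under stated assumptions: the function `reflect`, the dict-based entry layout, and the `PHONEMES` lookup that stands in for the kana phoneme conversion rule 115 are all hypothetical, and the symbol format follows the "OOV" plus serial number convention described above.

```python
def reflect(common_dictionary, task_categories, clusters, kana_to_phonemes):
    """Rebuild a task's variable word dictionary and category table from scratch.

    common_dictionary: list of dicts with "category" and either "kana" or "cluster_id".
    task_categories:   category names used by the task.
    clusters:          cluster_id -> (representative kana, representative phonemes).
    kana_to_phonemes:  callable standing in for the task's kana phoneme conversion rule.
    """
    variable_word_dictionary = []                      # step S81: no entries
    category_table = {c: [] for c in task_categories}  # step S81: no words in any category
    for entry in common_dictionary:                    # step S82: reflect matching entries
        category = entry["category"]
        if category not in category_table:
            continue  # this task does not use the category
        if "cluster_id" in entry:
            # Voice-registered word: take the latest representative pronunciation.
            kana, phonemes = clusters[entry["cluster_id"]]
        else:
            kana = entry["kana"]
            phonemes = kana_to_phonemes(kana)
        symbol = "OOV%05d" % (len(variable_word_dictionary) + 1)
        variable_word_dictionary.append(
            {"symbol": symbol, "transcription": kana, "phonemes": phonemes}
        )
        category_table[category].append(symbol)
    return variable_word_dictionary, category_table


# Example data modelled on FIG. 23 and FIG. 24.
common_dictionary = [
    {"category": "_robot name_", "kana": "SDR"},
    {"category": "_user name_", "cluster_id": 5},
    {"category": "_user name_", "kana": "Sunitarou"},
    {"category": "_place name_", "kana": "Kitashinagawa"},
    {"category": "_place name_", "cluster_id": 3},
]
clusters = {3: ("Tokyo", "t/o:/ky/o:"), 5: ("Tarou", "t/a/r/o/u")}
PHONEMES = {  # toy stand-in for a real kana phoneme conversion
    "SDR": "e/s/u/d/i:/a:/r/u",
    "Kitashinagawa": "k/i/t/a/sh/i/n/a/g/a/w/a",
}

dictionary, table = reflect(
    common_dictionary, ["_robot name_"], clusters, lambda kana: PHONEMES[kana]
)
print(dictionary)
print(table)
```

Because the dictionary and table are rebuilt rather than patched, the same call serves for first-time construction and for picking up later changes.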
[0159]
Next, the process in which the matching unit 54 deletes or changes a word registered in the common dictionary of the common dictionary unit 55 in step S48 of FIG. 21 will be described with reference to a flowchart. The process of deleting or changing words in the common dictionary is started, for example, when a command is issued from the name registration application unit 21₁, or when unnecessary words must be deleted from the registered words because of memory restrictions.
[0160]
In addition, the process of deleting or changing words in the common dictionary is performed, for example, when the IDs assigned to clusters change because a cluster in the unknown word acquisition unit 56 is deleted, divided, or merged, and the IDs described in the common dictionary (the cluster IDs shown in FIG. 24) must be rewritten so that they match the IDs assigned to the clusters of the unknown word acquisition unit 56.
[0161]
Further, the process of deleting or changing words in the common dictionary is performed to slim down the common dictionary: when a certain category is no longer used in the language model 112 of any task, the information of that category is deleted from the common dictionary.
[0162]
If the unknown word acquisition unit 56 changes the representative phoneme sequence or kana sequence of a cluster, the change is picked up by the reflection processing in FIG. 25, so the process of deleting or changing words (hereinafter referred to as change deletion processing as appropriate) does not need to be performed.
[0163]
In step S101, the matching unit 54 determines the word in the common dictionary to be subjected to the change deletion processing, and proceeds to step S102. The target word may be specified by the user using a button (not shown), or may be determined by the matching unit 54 itself.
[0164]
In step S102, the matching unit 54 determines whether or not to delete the word targeted by the change deletion processing. If it is determined that the word is to be deleted, the process proceeds to step S103, where the matching unit 54 deletes the entry of the target word from the common dictionary. Deletion here means deleting an entry specified by a category and pronunciation information, deleting all entries of a specific category at once, or deleting all entries having specific pronunciation information (a kana sequence or a cluster ID) at once.
[0165]
On the other hand, if the matching unit 54 determines in step S102 that the target word is not to be deleted, the process proceeds to step S104, where it determines whether the word is to be changed. If it is determined that the word is not to be changed, the process returns to step S102 and waits until either a change or a deletion is determined.
[0166]
When the matching unit 54 determines in step S104 to change the word to be subjected to the change deletion process, the process proceeds to step S105, and changes the entry of the word to be subjected to the change deletion process in the common dictionary.
[0167]
For example, when clusters of the unknown word acquisition unit 56 are divided or merged and the ID numbers of the clusters change, the matching unit 54 changes the cluster IDs in the common dictionary so that they match those of the unknown word acquisition unit 56. Further, for example, when the user later wants to correct a kana sequence input at the time of registration, the matching unit 54, under the control of the name registration application unit 21₁, determines the target word in the common dictionary (a word entered in the common dictionary in step S48 of FIG. 21) and then changes its kana pronunciation to the kana sequence input by the user.
[0168]
After the processing of step S103 or the processing of step S105, the process proceeds to step S106, where the matching unit 54 performs the reflection processing of FIG. 25, and reflects the contents of the common dictionary to each task.
[0169]
As described above, when a word in the common dictionary is deleted or changed, the changed content is reflected in each task, so that the consistency of the registered words in each application unit can be maintained.
[0170]
FIGS. 30 and 31 show examples in which the matching unit 54 changes the entries of words in the common dictionary in the process of step S105. For example, when the cluster with the ID "5" in the unknown word acquisition unit 56 is divided into a cluster with the ID "8" and a cluster with the ID "9", the matching unit 54 changes the common dictionary from the state shown in FIG. 30A to the state shown in FIG. 30B.
[0171]
That is, the matching unit 54 deletes the entry with the cluster ID "5" from the common dictionary of the common dictionary unit 55 (the entry on the first line in FIG. 30A), and registers two new entries of the category "_user name_" that was registered in the deleted entry. The matching unit 54 then describes the cluster ID numbers "8" and "9" in the two new entries, respectively (the entries on the first and second lines in FIG. 30B).
[0172]
Further, for example, when the cluster with the ID "5" and the cluster with the ID "3" in the unknown word acquisition unit 56 are merged to generate a new cluster with the ID "10", the matching unit 54 changes the common dictionary from the state shown in FIG. 31A to the state shown in FIG. 31B.
[0173]
That is, the matching unit 54 changes the cluster ID of the entries having the cluster IDs "5" and "3" in the common dictionary (all the entries in FIG. 31A) to "10". The two resulting entries, which have the same category and the same cluster ID number "10", are then consolidated into one (for example, by deleting one of them) (FIG. 31B).
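The split and merge updates of FIGS. 30 and 31 might be expressed as follows. The function names and the dict-based entry layout are hypothetical, and the category used for the merge example is an assumption made only so that the duplicate removal is visible.

```python
def split_cluster(entries, old_id, new_ids):
    """Replace each entry that refers to old_id with one entry per new cluster ID."""
    result = []
    for entry in entries:
        if entry.get("cluster_id") == old_id:
            result.extend(
                {"category": entry["category"], "cluster_id": new_id}
                for new_id in new_ids
            )
        else:
            result.append(entry)
    return result


def merge_clusters(entries, old_ids, new_id):
    """Point entries at the merged cluster, then drop the duplicates that result."""
    result, seen = [], set()
    for entry in entries:
        if entry.get("cluster_id") in old_ids:
            entry = {"category": entry["category"], "cluster_id": new_id}
        key = (entry["category"], entry.get("kana"), entry.get("cluster_id"))
        if key not in seen:  # keep only the first of any identical entries
            seen.add(key)
            result.append(entry)
    return result


# FIG. 30: cluster 5 is divided into clusters 8 and 9.
print(split_cluster([{"category": "_user name_", "cluster_id": 5}], 5, [8, 9]))

# FIG. 31: clusters 5 and 3 are merged into cluster 10 (same category assumed).
print(merge_clusters(
    [{"category": "_place name_", "cluster_id": 5},
     {"category": "_place name_", "cluster_id": 3}],
    {5, 3},
    10,
))
```

After either update, the reflection processing would be run again so that every enabled task sees the new cluster IDs.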
[0174]
Next, the chat processing in step S12 in FIG. 5 will be described in detail with reference to the flowchart in FIG.
[0175]
In step S121, the chat application unit 21₂ enables the chat task 71₄, and the process proceeds to step S122. In step S122, the chat application unit 21₂ controls the matching unit 54 to perform the reflection processing shown in FIG. 25, so that the contents of the common dictionary of the common dictionary unit 55 are reflected in the chat task 71₄ (its variable word dictionary 132 and category table 133). The chat task 71₄ can therefore pick up words that were registered, changed, or deleted in the common dictionary while the task was disabled.
[0176]
After the process of step S122, the process proceeds to step S123, where the chat application unit 21₂ controls the voice recognition engine unit 11 to perform voice recognition processing, and proceeds to step S124. Details of the voice recognition processing will be described later.
[0177]
In step S124, the chat application unit 21₂ acquires the recognition result from the speech recognition engine unit 11 and generates a response to it. That is, the robot responds to the user's utterance. For example, if the user's utterance is "When did SDR (robot name) wake up?", the chat application unit 21₂ generates a response giving the time at which the robot woke up (was activated) (for example, "7:00") and causes the robot to speak it.
[0178]
After the process of step S124, the process proceeds to step S125, where the chat application unit 21₂ determines whether to end the processing. This determination is made, for example, by having the chat application unit 21₂ ask the user "End?" and checking whether the user has operated (pressed) the OK button (not shown).
[0179]
If it is determined in step S125 that the process is not to be ended, the process returns to step S123, and the same process is repeated. That is, the robot continues the chat with the user.
[0180]
If it is determined in step S125 that the process is to be ended, the process proceeds to step S126, where the chat application unit 21₂ invalidates the chat task 71₄, and the process proceeds to step S127. In step S127, the chat application unit 21₂ ends the processing.
[0181]
In the above-described processing, the chat application unit 21₂ generates a response each time the user speaks, but the robot may also prompt the user to speak by speaking spontaneously.
[0182]
Although the chat processing of the chat application unit 21₂ has been described with reference to FIG. 32, the voice commander processing of the voice commander application unit 21₃, ..., and the processing of the other application units 21_M are performed in the same manner. In step S124, however, each application unit 21 performs its own processing based on the speech recognition result from the speech recognition engine unit 11.
[0183]
FIG. 33 shows an example of the variable word dictionary 132 of the chat task 71₄ in which the contents of the common dictionary of the common dictionary unit 55 are reflected.
[0184]
When the category table 133 of the chat task 71₄ is as shown in FIG. 18, the categories it has in common with the common dictionary in FIG. 24 are "_robot name_" and "_place name_". The matching unit 54 therefore acquires the first entry in FIG. 24 as the common dictionary entry corresponding to "_robot name_", and the fourth and fifth entries in FIG. 24 as the entries corresponding to "_place name_". From the first entry it obtains the kana pronunciation "SDR", from the fourth entry the kana pronunciation "Kitashinagawa", and from the fifth entry the cluster ID number "3".
[0185]
Then, as shown in FIG. 33, the matching unit 54 enters "SDR" and "Kitashinagawa" into the transcription field of the variable word dictionary 132. Further, based on the kana phoneme conversion rule 115 (FIG. 8), the matching unit 54 enters into the phoneme sequence field of the variable word dictionary 132 "e/s/u/d/i:/a:/r/u" for the transcription "SDR" and "k/i/t/a/sh/i/n/a/g/a/w/a" for the transcription "Kitashinagawa".
[0186]
Further, the matching unit 54 looks up the cluster with the cluster ID "3" in the unknown word acquisition unit 56 and acquires its representative phoneme sequence and kana sequence. For example, when the unknown word acquisition unit 56 is in the state shown in FIG. 23, the matching unit 54 acquires the phoneme sequence "t/o:/ky/o:" and the kana sequence "Tokyo" from the cluster 153 with the cluster ID "3". Then, as shown in FIG. 33, the matching unit 54 enters the acquired phoneme sequence "t/o:/ky/o:" and kana sequence "Tokyo" into the phoneme sequence field and the transcription field of the variable word dictionary 132, respectively.
[0187]
Further, the matching unit 54 registers "OOV00001" as the symbol of the word represented by the transcription "SDR", "OOV00002" as the symbol of the word represented by the transcription "Kitashinagawa", and "OOV00003" as the symbol of the word represented by the transcription "Tokyo".
[0188]
In this case, the kana phoneme conversion rules 115 of the phoneme typewriter task 71₁ and the chat task 71₄ are assumed to be the same, so the representative phoneme sequence of the cluster obtained using the phoneme typewriter task 71₁ is registered as it is in the variable word dictionary 132 of the chat task 71₄. If, however, the kana phoneme conversion rules 115 of the phoneme typewriter task 71₁ and the chat task 71₄ differ, the matching unit 54 acquires the representative kana sequence of the cluster from the unknown word acquisition unit 56 and describes the phoneme sequence of the variable word dictionary 132 based on the kana phoneme conversion rule 115 of the chat task 71₄.
[0189]
FIG. 34 shows an example of the category table 133 of the chat task 71₄ in which the contents of the common dictionary are reflected. For the category "_robot name_" in the category table 133, the symbol "OOV00001", registered in the variable word dictionary 132 for the word belonging to that category (the word whose transcription is "SDR" (FIG. 33)), is entered. For the category "_place name_" in the category table 133, the symbols "OOV00002" and "OOV00003", registered in the variable word dictionary 132 for the words belonging to that category (the words whose transcriptions are "Kitashinagawa" and "Tokyo" (FIG. 33)), are entered.
[0190]
Next, the speech recognition processing performed by the speech recognition engine unit 11 in FIG. 2 in step S123 in FIG. 32 will be described in detail with reference to a flowchart. This processing is started when the user's voice is input to the microphone 51, and is performed for each enabled task among the application switching task 71₂, the chat task 71₄, the voice commander task 71₅, ..., and the other tasks 71_N.
[0191]
In step S141, the audio signal generated by the microphone 51 is converted by the AD conversion unit 52 into audio data, which is a digital signal, and is supplied to the feature amount extraction unit 53. After the process in step S141, the process proceeds to step S142, in which the feature amount extraction unit 53 extracts a feature amount such as a mel cepstrum from the supplied audio data, and proceeds to step S143.
[0192]
In step S143, the matching unit 54 connects some of the words represented by the symbols of the fixed word dictionary 131 and the variable word dictionary 132 to generate word strings, and calculates their acoustic scores. The acoustic score indicates how close, as sound (acoustically), a word string that is a candidate for the speech recognition result is to the input speech.
[0193]
After the process of step S143, the process proceeds to step S144, where the matching unit 54 selects a predetermined number of word strings having a high acoustic score based on the acoustic score calculated in step S143, and proceeds to step S145.
[0194]
In step S145, the matching unit 54 calculates the language score of each word string selected in step S144 using the language model 112, and proceeds to step S146. For example, when a grammar or a finite-state automaton is used as the language model 112, the language score is "1" if the word string can be accepted by the language model 112, and "0" if it cannot.
[0195]
Alternatively, the matching unit 54 may keep a word string selected in step S144 when the language model 112 can accept it, and discard the word string when the language model 112 cannot accept it.
[0196]
When a statistical language model is used as the language model 112, the generation probability of the word string is used as a language score. The details of the method of obtaining the language score are disclosed in Japanese Patent Application No. 2001-382579 previously proposed by the present applicant.
[0197]
For example, suppose that speech recognition processing is performed in the voice commander processing of the voice commander application unit 21₃ and that the matching unit 54 selects the word string "<head> OOV00001 move forward <end>" in step S144. The language score is then "1", because this word string can be accepted by the grammar used as the language model 112.
[0198]
That is, the matching unit 54 refers to the category table 133 (FIG. 28), recognizes that the category of the symbol "OOV00001" is "_robot name_", converts the word string "<head> OOV00001 move forward <end>" obtained in step S144 into the word string "<head> _robot name_ move forward <end>" by substituting the category name, and determines that the converted word string can be accepted by the language model 112 shown in FIG. 19.
[0199]
On the other hand, when, for example, the word string "<head> OOV00001 and before <end>" is selected in step S144, the matching unit 54 refers to the category table 133 (FIG. 28), recognizes that the category of the symbol "OOV00001" is "_robot name_", and converts the word string into "<head> _robot name_ and before <end>" by substituting the category name. The matching unit 54 then determines that the language model 112 shown in FIG. 19 cannot accept this word string, and sets the language score of the word string to "0".
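The accept-or-reject scoring with category substitution can be illustrated by the sketch below. The grammar of FIG. 19 is not reproduced in the text, so the `accepted` set is a toy stand-in that simply lists one acceptable category-level word string; the function names and the English word tokens are likewise assumptions.

```python
def to_category_string(word_string, category_table):
    """Replace each registered-word symbol with the name of its category."""
    symbol_to_category = {
        symbol: category
        for category, symbols in category_table.items()
        for symbol in symbols
    }
    return tuple(symbol_to_category.get(word, word) for word in word_string)


def grammar_language_score(word_string, category_table, accepted):
    """Return 1 if the category-level word string is accepted, otherwise 0."""
    return 1 if to_category_string(word_string, category_table) in accepted else 0


category_table = {"_robot name_": ["OOV00001"]}
# Toy grammar: an explicit set of accepted word strings (assumption).
accepted = {("<head>", "_robot name_", "move", "forward", "<end>")}

print(grammar_language_score(
    ("<head>", "OOV00001", "move", "forward", "<end>"), category_table, accepted))
print(grammar_language_score(
    ("<head>", "OOV00001", "and", "before", "<end>"), category_table, accepted))
```

Because the grammar only ever sees the category name, a newly registered robot name is accepted without any change to the grammar itself.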
[0200]
In step S146, the matching unit 54 integrates the acoustic score calculated in step S143 with the language score calculated in step S145, sorts the word strings by the integrated score, and determines, for example, the word string with the largest integrated score as the recognition result.
[0201]
As a result, a word string most suitable acoustically and linguistically is determined as a recognition result.
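A minimal sketch of this integration and sorting step follows. The text does not state how the two scores are combined, so the weighted sum used here, the `language_weight` parameter, and the numeric scores in the example are all assumptions chosen for illustration.

```python
def pick_recognition_result(candidates, language_weight=1.0):
    """Rank candidates by an integrated score and return the best word string.

    candidates: list of (word_string, acoustic_score, language_score) tuples.
    The integrated score is assumed to be acoustic + language_weight * language.
    """
    def integrated(candidate):
        _, acoustic, language = candidate
        return acoustic + language_weight * language

    ranked = sorted(candidates, key=integrated, reverse=True)
    return ranked[0][0], ranked


# Illustrative scores only: the second candidate sounds slightly less similar
# but is the only one the language model accepts.
candidates = [
    ("<head> OOV00001 and before <end>", 0.6, 0.0),
    ("<head> OOV00001 move forward <end>", 0.5, 1.0),
]
best, ranked = pick_recognition_result(candidates)
print(best)
```

With a larger `language_weight`, linguistically implausible candidates are penalized more heavily relative to their acoustic match.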
[0202]
After the process in step S146, the process proceeds to step S147, where the matching unit 54 determines whether the recognition result includes a word registered by voice (a word clustered in the unknown word acquisition unit 56).
[0203]
If it is determined in step S147 that the recognition result includes a word registered by voice, the process proceeds to step S148, where the matching unit 54 supplies the word to the unknown word acquisition unit 56 and causes the unknown word acquisition unit 56 to perform re-clustering. The process then proceeds to step S149.
[0204]
For example, if the word string "<head> today went to Tokyo <end>", which includes the place name (unknown word) "Tokyo", is obtained in step S144, the matching unit 54 supplies to the unknown word acquisition unit 56 the voice data of the unknown word "Tokyo" together with the phoneme sequence (for example, "t/o:/ky/o:") and the kana sequence (for example, "Tokyo") recognized with reference to the phoneme typewriter task 71₁. The unknown word acquisition unit 56 then performs re-clustering.
[0205]
As a result, the amount of voice data supplied to the unknown word acquisition unit 56 increases, and the representative phoneme sequence and representative kana sequence of each cluster may be updated to correct values. As a side effect, however, even after a correct phoneme sequence and kana sequence have been obtained, re-clustering may change them to incorrect values. To prevent this side effect, when the user so instructs, the pronunciation can be fixed by writing the kana sequence at that time into the entry of the common dictionary. For example, when the kana sequence of the cluster with ID = 3 in FIG. 23 is "Tokyo", the part of the common dictionary in FIG. 24 described as "Cluster ID = 3" is rewritten to "Kana pronunciation: Tokyo" (that is, the fifth entry is rewritten). Then, even if the kana sequence of the cluster with ID = 3 later changes to a value other than "Tokyo", the pronunciation of the fifth entry in the common dictionary remains fixed at "Tokyo".
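Fixing a pronunciation in this way amounts to replacing a cluster reference with the cluster's current kana sequence, as in the sketch below. The function name `fix_pronunciation` and the dict-based entry layout are hypothetical.

```python
def fix_pronunciation(common_dictionary, cluster_id, clusters):
    """Return a copy of the dictionary in which entries that refer to
    cluster_id carry the cluster's current kana sequence instead."""
    kana, _phonemes = clusters[cluster_id]
    return [
        {"category": entry["category"], "kana": kana}
        if entry.get("cluster_id") == cluster_id
        else entry
        for entry in common_dictionary
    ]


common_dictionary = [
    {"category": "_robot name_", "kana": "SDR"},
    {"category": "_place name_", "cluster_id": 3},
]
clusters = {3: ("Tokyo", "t/o:/ky/o:")}

fixed = fix_pronunciation(common_dictionary, 3, clusters)
print(fixed)
```

Once the entry holds a kana sequence, later re-clustering of cluster 3 no longer affects the pronunciation that the reflection processing gives to each task.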
[0206]
On the other hand, if the matching unit 54 determines in step S147 that no word registered by voice is included in the recognition result, the process skips step S148 and proceeds to step S149.
[0207]
In step S149, the matching unit 54 supplies the recognition result determined in the process of step S146 to the application unit 21 corresponding to the task.
[0208]
Here, FIG. 36 shows the expressions used to calculate the language score when, in the chat processing of the chat application unit 21₂, the matching unit 54 selects the word string “<head> OOV00001 occurred at <end>” in step S144.
[0209]
As shown in equation (1), the language score “Score (<head> OOV00001 occurred at <end>)” is the probability of the word string “<head> OOV00001 occurred at <end>”.
[0210]
As shown in equation (2), the value of the language score is, strictly speaking, the product of the conditional probabilities of each word given all the words that precede it. Writing the words of the word string between “OOV00001” and “<end>” as w1 to w5, the score is

Score = P (<head>) P (OOV00001 | <head>) P (w1 | <head> OOV00001) P (w2 | <head> OOV00001 w1) P (w3 | <head> OOV00001 w1 w2) P (w4 | <head> OOV00001 w1 w2 w3) P (w5 | <head> OOV00001 w1 w2 w3 w4) P (<end> | <head> OOV00001 w1 w2 w3 w4 w5).

As shown in FIG. 16, the language model 112 uses tri-grams. Therefore, the condition parts “<head> OOV00001 w1”, “<head> OOV00001 w1 w2”, “<head> OOV00001 w1 w2 w3”, “<head> OOV00001 w1 w2 w3 w4”, and “<head> OOV00001 w1 w2 w3 w4 w5” are each approximated by conditional probabilities limited to the two immediately preceding words, namely “OOV00001 w1”, “w1 w2”, “w2 w3”, “w3 w4”, and “w4 w5” (equation (3)).
[0211]
The conditional probabilities are obtained by referring to the language model 112 (FIG. 16). However, since the language model 112 does not include the symbol “OOV00001”, the matching unit 54 refers to the category table 133, recognizes that the category of the word represented by the symbol “OOV00001” is “_robot name_”, and converts “OOV00001” to “_robot name_”.
[0212]
That is, as shown in equation (4), “P (OOV00001 | <head>)” is rewritten as “P (_robot name_ | <head>) P (OOV00001 | _robot name_)”, which is approximated by “P (_robot name_ | <head>) / N”. N represents the number of words belonging to the category “_robot name_” in the category table 133.
[0213]
That is, for a probability written in the form P (X | Y), when the word X belongs to the category C, P (C | Y) is obtained from the language model 112, and that value is multiplied by P (X | C) (the probability that the word X is generated from the category C). Assuming that all words belonging to the category C are generated with equal probability, if N words belong to the category C, P (X | C) can be approximated as 1 / N.
[0214]
In FIG. 34, since only the word represented by the symbol “OOV00001” belongs to the category “_robot name_”, N is 1. Therefore, as shown in equation (5), “P (w1 | <head> OOV00001)” becomes “P (w1 | <head> _robot name_)”, where w1 is the word that follows “OOV00001” in the word string. Likewise, as shown in equation (6), “P (w2 | OOV00001 w1)” becomes “P (w2 | _robot name_ w1)”, where w2 is the word that follows w1.
[0215]
Thus, a language score can be calculated for a word string including a variable word, and the variable word can appear in the recognition result.
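The category-based score calculation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `language_score`, the n-gram table keyed by tuples, the placeholder word `w1`, and all probability values are hypothetical and do not come from FIG. 16 or FIG. 36.

```python
import math
from typing import Dict, List, Tuple


def language_score(words: List[str],
                   ngram_prob: Dict[Tuple[str, ...], float],
                   category_table: Dict[str, str]) -> float:
    """Log language score with tri-gram contexts and category substitution."""
    # N for each category: number of symbols listed under it in the category table.
    members: Dict[str, int] = {}
    for category in category_table.values():
        members[category] = members.get(category, 0) + 1

    log_score = 0.0
    history: List[str] = []  # preceding words, already mapped to categories
    for word in words:
        token = category_table.get(word, word)    # e.g. OOV00001 -> _robot name_
        context = tuple(history[-2:])              # at most two preceding words
        prob = ngram_prob[context + (token,)]      # P(C | Y) or P(X | Y)
        if token != word:
            prob /= members[token]                 # P(X | C) approximated as 1 / N
        log_score += math.log(prob)
        history.append(token)
    return log_score


# Toy probabilities (assumed values).
probs = {
    ("<head>",): 1.0,
    ("<head>", "_robot name_"): 0.5,
    ("<head>", "_robot name_", "w1"): 0.4,
    ("_robot name_", "w1", "<end>"): 0.25,
}
sentence = ["<head>", "OOV00001", "w1", "<end>"]

score_one = language_score(sentence, probs, {"OOV00001": "_robot name_"})
score_two = language_score(sentence, probs,
                           {"OOV00001": "_robot name_", "OOV00002": "_robot name_"})
```

With one registered robot name (N = 1) the category substitution leaves the product unchanged; with two names the factor for “OOV00001” is halved.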
[0216]
In the above example, activation of the application unit 21 is linked to validation of the task 71, and termination of the application unit 21 is linked to invalidation of the task 71. However, these may be performed at other timings. For example, a task may be switched between valid and invalid many times while the application unit 21 is running, and one application may control a plurality of tasks.
[0217]
In this case, for a task that is frequently switched between valid and invalid, reserving and releasing memory each time is inefficient. Therefore, the memory may be kept reserved even after invalidation, with the invalid state indicated merely by setting a flag (a flag indicating that the task is invalid).
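The flag-based invalidation can be sketched as follows. The class `SwitchableTask`, its methods, and the `load_count` counter are hypothetical and exist only to show that resources are loaded once even when the task is toggled repeatedly.

```python
from typing import Dict, Optional


class SwitchableTask:
    """A task whose resources stay in memory while it is flagged invalid."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.valid = False                       # the invalid/valid flag
        self.resources: Optional[Dict[str, object]] = None
        self.load_count = 0                      # how often memory was reserved

    def _load(self) -> Dict[str, object]:
        # Stand-in for reserving memory for models and dictionaries.
        self.load_count += 1
        return {"language_model": {}, "dictionary": {}}

    def enable(self) -> None:
        if self.resources is None:
            self.resources = self._load()
        self.valid = True

    def disable(self, keep_memory: bool = True) -> None:
        self.valid = False
        if not keep_memory:
            self.resources = None                # release memory


task = SwitchableTask("chat")
for _ in range(3):          # switch between valid and invalid several times
    task.enable()
    task.disable()
```

Passing `keep_memory=False` would model the simpler scheme in which memory is released on every invalidation.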
[0218]
Further, in the above-described example, it is assumed that nothing is stored in the common dictionary of the common dictionary unit 55 when the robot system is activated, but some words may be stored in the common dictionary in advance. For example, since the product name of the robot is often registered as the robot name, the product name of the robot may be registered in advance in the category “_robot name_” of the common dictionary.
[0219]
FIG. 37 shows an example of the common dictionary in which the robot product name “SDR” is entered in the category “_robot name_” when the robot system is activated. In FIG. 37, the kana pronunciation “SDR” is already entered in the category “_robot name_” at activation. Therefore, the user can control the robot using the word represented by the kana pronunciation “SDR” without registering a name.
[0220]
Also, in the above example, it was assumed that no clusters had been generated in the unknown word acquisition unit at the initial stage (at the time of shipment). However, if clusters are prepared for common names from the beginning, a name for which a cluster is prepared is easily recognized when it is input by voice in the name registration process of FIG. 21. For example, if clusters as shown in FIG. 3 are prepared at the time of shipment, the pronunciations “aka”, “ao”, “midori”, and “kuro” can be recognized (acquired) with correct phoneme sequences. Further, it is not desirable for the pronunciation of a name for which a cluster is prepared to change after the name has been registered in the common dictionary. Therefore, when the pronunciation information is registered in the common dictionary, the pronunciation is described not by a cluster ID but by a kana pronunciation (the representative kana sequence of the cluster), such as “aka”, “ao”, “midori”, or “kuro”.
[0221]
Further, in the above-described example, the matching unit 54 reflects the contents of the common dictionary on all tasks. However, the matching unit 54 may reflect the contents only on selected tasks. For example, a number (task ID) is assigned to each task in advance, and the common dictionary of FIG. 24 is extended with a column indicating “a list of tasks for which this entry is valid (or invalid)”. In the reflection processing, the matching unit 54 then reflects the contents of the common dictionary only on the tasks whose task IDs are described in the column indicating “a list of tasks for which this entry is valid”.
[0222]
FIG. 38 shows an example in which the IDs of the tasks to be reflected are described in a column indicating “valid tasks” in the common dictionary. In FIG. 38, for the word belonging to the category “_robot name_” whose kana pronunciation is “SDR”, the valid task IDs are “1”, “2”, and “4”. Accordingly, the common dictionary entry for this word is reflected only in the variable word dictionary 132 and the category table 133 of the tasks with IDs “1”, “2”, and “4”.
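Selective reflection by task ID can be sketched as below. The class `CommonEntry`, the function `entries_for_task`, and the second entry (“Taro” in `_user name_` with no task restriction) are hypothetical; only the “SDR” entry with valid task IDs 1, 2, and 4 follows the example above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class CommonEntry:
    """Common-dictionary entry extended with a "valid tasks" column."""
    category: str
    pronunciation: str
    valid_tasks: Optional[Set[int]] = None  # None: valid for every task (assumed convention)


def entries_for_task(common_dictionary: List[CommonEntry],
                     task_id: int) -> List[CommonEntry]:
    """Select the entries that should be reflected on the given task."""
    return [entry for entry in common_dictionary
            if entry.valid_tasks is None or task_id in entry.valid_tasks]


common = [
    CommonEntry("_robot name_", "SDR", valid_tasks={1, 2, 4}),
    CommonEntry("_user name_", "Taro"),
]

# Pronunciations reflected on each of four tasks.
reflected: Dict[int, List[str]] = {
    task_id: [entry.pronunciation for entry in entries_for_task(common, task_id)]
    for task_id in range(1, 5)
}
```

An "invalid tasks" column, which the text also allows, would simply invert the membership test.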
[0223]
Further, in the above-described example, the words stored in the fixed word dictionary 131 are words described in the language model 112, and the words stored in the variable word dictionary 132 are words belonging to the category. However, some of the words belonging to the category may be stored in the fixed word dictionary 131.
[0224]
FIG. 39 shows an example of the fixed word dictionary 131 of the application switching task 71₂ at the time of startup, and FIG. 40 shows an example of the category table 133 at the time of startup. That is, in the category table 133 of FIG. 40, the category “_robot name_” and the symbol “OOV00001” of the word belonging to the category “_robot name_” are registered in advance. Also, in the fixed word dictionary 131 of FIG. 39, the symbol “OOV00001”, the transcription “SDR” of the word represented by the symbol “OOV00001”, and the phoneme sequence “e / s / u / d / i: / a: / r / u” are registered in advance.
[0225]
In this case, the word “SDR” is subjected to the speech recognition process as a word belonging to the category “_robot name_”. That is, the word “SDR” is treated as a robot name from the beginning. However, since the word “SDR” is stored in the fixed word dictionary 131, it cannot be deleted or changed.
[0226]
In this way, by storing in the fixed word dictionary 131 in advance a word that is likely to be set as a name, such as the product name of the robot, the user can control the robot without performing name registration.
[0227]
In the above example, the category symbols are common to all tasks, but may not be common. In this case, a conversion table as shown in FIGS. 41 to 44 may be prepared in the task.
[0228]
That is, for example, when a certain task T describes a category “_ROBOT_NAME_” and a category “_USER_NAME_”, then according to the corresponding conversion table, in the task T, the contents of the common dictionary for words belonging to the category “_robot name_” are reflected in the category “_ROBOT_NAME_”, and the contents of the common dictionary for words belonging to the category “_user name_” are reflected in the category “_USER_NAME_”.
[0229]
For example, when a certain task T describes a category “_proper noun_”, then according to the conversion table in FIG. 42, in the task T, both the contents of the common dictionary for words belonging to the category “_robot name_” and the contents of the common dictionary for words belonging to the category “_user name_” are reflected in the category “_proper noun_”.
[0230]
Further, for example, when a certain task T describes a category “_last name_” and a category “_first name_”, then according to the corresponding conversion table, the contents of the common dictionary for words belonging to the category “_user name_” are converted and duplicated into the category “_last name_” and the category “_first name_”. In the step of reflecting the contents of the common dictionary on this task (FIG. 25), for example, the second entry in FIG. 24, “_user name_ cluster ID = 5”, is converted and duplicated into the two entries “_last name_ cluster ID = 5” and “_first name_ cluster ID = 5”, which are then reflected in the variable word dictionary and the category table of this task.
[0231]
Further, for example, when no category is described in a certain task T, then according to the conversion table in FIG. 44, in the task T, words belonging to the category “_robot name_”, the category “_user name_”, and the category “_place name_” are represented by the symbol “UNK”. Note that “UNK” stands for “unknown word”.
[0232]
Thus, even in a task in which no category is described, the matching unit 54 can recognize words belonging to the category “_robot name_”, the category “_user name_”, and the category “_place name_” merely by having the symbol “UNK” described in the language model 112.
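The four kinds of conversion table described above (renaming, merging, splitting, and mapping to “UNK”) can be sketched with a single mapping type. The names `RENAME`, `MERGE`, `SPLIT`, `UNKNOWN`, and `convert_entries` are hypothetical, as is the choice to drop entries whose category has no mapping in a task's table.

```python
from typing import Dict, List, Tuple

# A conversion table maps a common-dictionary category to the task's categories.
ConversionTable = Dict[str, List[str]]

RENAME: ConversionTable = {"_robot name_": ["_ROBOT_NAME_"],
                           "_user name_": ["_USER_NAME_"]}
MERGE: ConversionTable = {"_robot name_": ["_proper noun_"],
                          "_user name_": ["_proper noun_"]}
SPLIT: ConversionTable = {"_user name_": ["_last name_", "_first name_"]}
UNKNOWN: ConversionTable = {"_robot name_": ["UNK"],
                            "_user name_": ["UNK"],
                            "_place name_": ["UNK"]}


def convert_entries(entries: List[Tuple[str, str]],
                    table: ConversionTable) -> List[Tuple[str, str]]:
    """Convert (category, pronunciation info) entries for one task.

    An entry is duplicated once per target category. Entries whose category
    is absent from the table are dropped (an assumption of this sketch).
    """
    converted: List[Tuple[str, str]] = []
    for category, pronunciation in entries:
        for target in table.get(category, []):
            converted.append((target, pronunciation))
    return converted


entries = [("_robot name_", "cluster ID = 1"), ("_user name_", "cluster ID = 5")]
split_result = convert_entries(entries, SPLIT)
merge_result = convert_entries(entries, MERGE)
```

Here the `_user name_` entry with “cluster ID = 5” is duplicated into `_last name_` and `_first name_`, mirroring the splitting example in the text.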
[0233]
FIG. 45 shows an example of the external configuration of a bipedal walking robot provided with the robot control system 1 to which the present invention is applied. In the robot 201, a head unit 211 is provided above a body unit 213, arm units 212A and 212B having the same configuration are provided on the left and right of the body unit 213, respectively, and leg units 214A and 214B having the same configuration are attached to predetermined positions on the lower left and right of the body unit 213, respectively.
[0234]
In the head unit 211, CCD (Charge Coupled Device) cameras 221A and 221B functioning as the “eyes” of the robot 201, microphones 222A and 222B functioning as “ears”, and a speaker 223 functioning as a “mouth” are arranged at predetermined positions.
[0235]
FIG. 46 shows an example of the electrical configuration of the robot. The unit control system 231 and the dialogue control system 232 control the operation of the robot 201 according to a command from the robot control system 1. That is, the unit control system 231 controls the head unit 211, the arm units 212A and 212B, and the leg units 214A and 214B of the robot 201 as necessary, and causes the robot 201 to perform a predetermined operation. Further, the dialogue control system 232 controls the utterance of the robot 201 and causes the speaker 223 to make a predetermined utterance as necessary.
[0236]
In the above description, a word is a unit that is best treated as one unit in the process of recognizing speech, and does not always match a word in the linguistic sense. For example, “Taro-kun” may be treated as one word, or as the two words “Taro” and “kun”. Furthermore, a larger unit such as “Hello Taro” may be treated as one word.
[0237]
A phoneme is a unit that is convenient to process acoustically as one unit, and does not always match a phoneme in the phonetic or phonological sense. For example, the “to” portion of “Tokyo” may be represented by the three phoneme symbols “t / o / u”, or a symbol “o:” representing the long sound of “o” may be prepared. It may also be represented as “t / o / o”. In addition, symbols representing silence may be prepared, and these may be further classified into “silence before an utterance”, “a short silent section between utterances”, and “silence in the ‘tsu’ part”.
[0238]
In the above description, the robot device has been described. However, the present invention can be applied to a device having an application using speech recognition, speech synthesis, translation, and other language processing.
[0239]
Further, the present invention can be applied to, for example, a device that extracts only predetermined terms from a dictionary such as Kojien and creates a dictionary of those terms.
[0240]
Further, in the above description, the case where there are a plurality of application units has been described, but one application unit may be provided.
[0241]
The above-described series of processes can be executed by hardware or by software. When executed by software, the above-described processing is executed by, for example, a personal computer 600.
[0242]
In FIG. 47, a CPU (Central Processing Unit) 601 executes various processes according to a program stored in a ROM (Read Only Memory) 602 or a program loaded from a storage unit 608 into a RAM (Random Access Memory) 603. The RAM 603 also stores, as appropriate, data necessary for the CPU 601 to execute the various processes.
[0243]
The CPU 601, the ROM 602, and the RAM 603 are mutually connected via an internal bus 604. The internal bus 604 is also connected to an input / output interface 605.
[0244]
Connected to the input / output interface 605 are an input unit 606 including a keyboard and a mouse, an output unit 607 including a display such as a CRT or an LCD (Liquid Crystal Display) and a speaker, a storage unit 608 including a hard disk, and a communication unit 609 including a modem, a terminal adapter, and the like. The communication unit 609 performs communication processing via various networks including a telephone line and CATV.
[0245]
A drive 610 is also connected to the input / output interface 605 as necessary, and a removable medium 621 composed of a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on it as appropriate. A computer program read from the removable medium 621 is installed in the storage unit 608 as needed.
[0246]
When the series of processes is executed by software, the program constituting the software is installed, from a network or a recording medium, on a computer built into dedicated hardware or on, for example, a general-purpose personal computer that can execute various functions when various programs are installed.
[0247]
As shown in FIG. 47, this recording medium is constituted not only by a package medium consisting of the removable medium 621 on which the program is recorded and which is distributed separately from the computer in order to provide the program to the user, but also by the ROM 602 or the hard disk included in the storage unit 608, on which the program is recorded and which is provided to the user already incorporated in the apparatus main body.
[0248]
In this specification, the steps describing a computer program include not only processes performed in chronological order according to the described order, but also processes executed in parallel or individually without necessarily being performed in chronological order.
[0249]
Also, in this specification, a system represents an entire apparatus composed of a plurality of devices.
[0250]
Incidentally, Japanese Patent Application No. 2001-382579, previously proposed by the present applicant, discloses an invention covering the series of processes by which a word acquired by an unknown word acquisition mechanism is reflected in a language model and by which the word reflected in the language model then appears in subsequent recognition results.
[0251]
However, the invention of Japanese Patent Application No. 2001-382579 consists of one application that registers words and one application that uses the registered words, and does not assume that there are a plurality of applications performing speech recognition. It was therefore difficult for it to solve, in a system in which there are a plurality of applications whose number is variable, the above-described problem of reflecting registered words in the applications and the above-described problem of deleting or changing words registered for a plurality of applications.
[0252]
Also, in Japanese Patent Application No. 2002-072718 previously proposed by the present applicant, the word “Taro”, which is an unknown word, is extracted from the utterance “My name is Taro (unknown word).” The invention of acquiring as a name is disclosed.
[0253]
However, with the invention of Japanese Patent Application No. 2002-072718, it was difficult to solve the above-described problem of reflecting an unknown word in a language model, the above-described problem of reflecting a registered word in an application, and the above-described problem of deleting or changing a word registered for a plurality of applications.
[0254]
Therefore, with the invention of Japanese Patent Application No. 2001-382579 and the invention of Japanese Patent Application No. 2002-072718, it is difficult to solve all of the above-described problems: reflecting an unknown word in a language model, reflecting a registered word in an application, and deleting or changing words registered for a plurality of applications.
[0255]
In the robot control system 1 of FIG. 1, however, categories are described in the language model 112, so the above-described problem of reflecting an unknown word in the language model can be solved by making the unknown word belong to a category.
[0256]
Further, in the robot control system 1 of FIG. 1, the variable word dictionary 132, in which the words to be subjected to speech recognition in each application are registered, is constructed or reconstructed based on the common dictionary. This makes it possible to solve the above-described problem of reflecting registered words in the applications and the above-described problem of deleting or changing words registered for a plurality of applications.
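Rebuilding each task's variable word dictionary from the common dictionary can be sketched as follows. The class `Task`, the function `reflect`, the symbol numbering by entry position, and the example words other than “SDR” are hypothetical; the sketch only shows that clearing and reconstructing from one shared source keeps all tasks consistent, including a task created after registration.

```python
from typing import Dict, List, Set, Tuple

# Common dictionary as (category, pronunciation) pairs (assumed layout).
CommonDictionary = List[Tuple[str, str]]


class Task:
    def __init__(self, categories: Set[str]) -> None:
        self.categories = categories                  # categories the task describes
        self.variable_words: Dict[str, str] = {}      # symbol -> pronunciation
        self.category_table: Dict[str, str] = {}      # symbol -> category


def reflect(task: Task, common: CommonDictionary) -> None:
    """Delete every variable word, then rebuild from the common dictionary."""
    task.variable_words.clear()
    task.category_table.clear()
    for index, (category, pronunciation) in enumerate(common, start=1):
        if category not in task.categories:
            continue                                  # task does not use this category
        symbol = f"OOV{index:05d}"                    # numbering scheme is an assumption
        task.variable_words[symbol] = pronunciation
        task.category_table[symbol] = category


common: CommonDictionary = [("_robot name_", "SDR"), ("_user name_", "Taro")]
chat = Task({"_robot name_", "_user name_"})
commander = Task({"_robot name_"})
for t in (chat, commander):
    reflect(t, common)

common[0] = ("_robot name_", "Robo")   # the registered robot name is changed
for t in (chat, commander):
    reflect(t, common)                 # every task is rebuilt from the same source

late = Task({"_user name_"})           # a task started after registration
reflect(late, common)
```

Because no task keeps its own copy of the registration history, a change or deletion in the common dictionary cannot leave a stale word behind in any task.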
[0257]
Further, in a system without a keyboard (for example, a robot), it is difficult to input pronunciation information when registering a word. As a means of solving this problem, a method of inputting pronunciation information by voice using, for example, a phoneme typewriter has been proposed.
[0258]
However, a phoneme typewriter may misrecognize its input, and if its output is used as is, a word may be registered with an incorrect pronunciation. For example, if the pronunciation of “SDR” is recognized by the phoneme typewriter and misrecognized as “Isuniyaru”, and “Isuniyaru” is adopted as the pronunciation information, the word is registered with an incorrect pronunciation. As a result, a later utterance such as “SDR, hello” is difficult to recognize, while an utterance such as “Isuniyaru, hello” is easily recognized.
[0259]
Therefore, in the robot control system 1 of FIG. 1, the latest pronunciation information is acquired from the unknown word acquisition unit 56 every time a word registered by voice is reflected on each task. Consequently, even after a word misrecognized by the phoneme typewriter has been registered in the variable word dictionary 132, the pronunciation information is updated simply by supplying further speech data to the unknown word acquisition unit 56, and because the latest pronunciation information is obtained at each reflection, a correct recognition result may eventually be obtained.
[0260]
That is, the robot control system 1 of FIG. 1 can solve all of the above-described problems: reflecting an unknown word in a language model, reflecting a registered word in an application, deleting or changing a word registered for a plurality of applications, and inputting the pronunciation information of a word to be registered in a system without a keyboard.
[0261]
[Effects of the invention]
As described above, according to the present invention, words can be registered. In particular, even when words corresponding to a plurality of applications are registered, the registered words can be commonly used in each application. Also, words registered before the application is started can be used in the application. Further, even when the registered word is changed, consistency can be maintained in each application.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration example of a robot control system to which the present invention has been applied.
FIG. 2 is a block diagram illustrating a configuration example of a speech recognition engine unit in FIG. 1;
FIG. 3 is a diagram illustrating an example of a cluster of an unknown word acquisition unit in FIG. 2;
FIG. 4 is a flowchart illustrating a robot control process in the robot control system of FIG. 1;
FIG. 5 is a flowchart illustrating a robot control process in the robot control system of FIG. 1;
FIG. 6 is a diagram illustrating a configuration example of a task in FIG. 2;
FIG. 7 is a diagram showing an example of a phoneme list in FIG. 6;
FIG. 8 is a diagram showing an example of kana phoneme conversion rules of FIG. 6;
FIG. 9 is a diagram showing an example of a language model of the phoneme typewriter task of FIG. 2;
FIG. 10 is a diagram showing an example of a fixed word dictionary of the phoneme typewriter task of FIG. 2;
FIG. 11 is a diagram illustrating an example of a language model of the application switching task in FIG. 2;
FIG. 12 is a diagram illustrating an example of a fixed word dictionary of the application switching task in FIG. 2;
FIG. 13 is a diagram illustrating an example of a category table of the application switching task in FIG. 2;
FIG. 14 is a diagram illustrating an example of a language model of the name registration task in FIG. 2;
FIG. 15 is a diagram illustrating an example of a fixed word dictionary of the name registration task in FIG. 2;
FIG. 16 is a diagram illustrating an example of a language model of the chat task of FIG. 2;
FIG. 17 is a diagram illustrating an example of a fixed word dictionary of the chat task of FIG. 2;
FIG. 18 is a diagram illustrating an example of a category table of the chat task of FIG. 2;
FIG. 19 is a diagram illustrating an example of a language model of the voice commander task in FIG. 2;
FIG. 20 is a diagram showing an example of a fixed word dictionary of the voice commander task of FIG. 2;
FIG. 21 is a flowchart illustrating a name registration process in step S9 of FIG. 5;
FIG. 22 is a flowchart illustrating a name recognition process in step S43 of FIG. 21;
FIG. 23 is a diagram illustrating an example of a cluster of an unknown word acquisition unit in FIG. 2;
FIG. 24 is a diagram illustrating an example of a common dictionary unit in FIG. 2;
FIG. 25 is a flowchart illustrating a reflection process in step S49 of FIG. 21;
FIG. 26 is a block diagram illustrating a reflection process of FIG. 25;
FIG. 27 is a diagram showing an example of a variable word dictionary of the name registration task of FIG. 2;
FIG. 28 is a diagram illustrating an example of a category table of the name registration task in FIG. 2;
FIG. 29 is a flowchart illustrating a process of deleting or changing a word in the matching unit of FIG. 2;
FIG. 30 is a diagram showing an example of changing the common dictionary unit of FIG. 2;
FIG. 31 is a diagram illustrating an example of changing the common dictionary unit in FIG. 2;
FIG. 32 is a flowchart illustrating the chat processing of step S12 in FIG. 5;
FIG. 33 is an example of a variable word dictionary of the chat task of FIG. 2;
FIG. 34 is an example of a category table of the chat task in FIG. 2;
FIG. 35 is a flowchart illustrating a speech recognition process in step S123 of FIG. 32;
FIG. 36 is a diagram illustrating an example of a formula for calculating a language score.
FIG. 37 is a diagram illustrating a modification of the common dictionary unit in FIG. 2;
FIG. 38 is a diagram illustrating a modification of the common dictionary unit in FIG. 2;
FIG. 39 is a diagram showing a modification of the fixed word dictionary of FIG. 6;
FIG. 40 is a diagram showing an example of the category table of FIG. 6;
FIG. 41 is a diagram showing an example of a category conversion table.
FIG. 42 is a diagram illustrating an example of a category conversion table.
FIG. 43 is a diagram showing an example of a category conversion table.
FIG. 44 is a diagram showing an example of a category conversion table.
FIG. 45 is a perspective view showing an external configuration of the robot.
FIG. 46 is a block diagram illustrating an electrical configuration of the robot.
FIG. 47 is a diagram illustrating an example of a personal computer.
[Explanation of symbols]
11 speech recognition engine unit, 21 application unit, 31 application management unit, 51 microphone, 52 AD conversion unit, 53 feature extraction unit, 54 matching unit, 55 common dictionary unit, 56 unknown word acquisition unit, 71 task, 111 acoustic model, 112 language model, 113 dictionary, 114 phoneme list, 115 kana phoneme conversion rules, 116 search parameters, 131 fixed word dictionary, 132 variable word dictionary, 133 category table

Claims

A language processing apparatus having an application that uses language processing, comprising:
registered dictionary storage means for storing a registered dictionary in which words are registered; and
a construction unit configured to construct, based on the registered dictionary, a dedicated dictionary that is dedicated to the application and in which words to be subjected to the language processing used in the application are registered.

The language processing apparatus according to claim 1, further comprising:
processing means for adding a word to, deleting a word from, or changing a word in the registered dictionary; and
a deleting unit for deleting words in the dedicated dictionary,
wherein, after all words registered in the dedicated dictionary have been deleted, the construction unit reconstructs the dedicated dictionary based on the registered dictionary in which a word has been added, deleted, or changed.

The language processing apparatus according to claim 1, wherein the dedicated dictionary includes at least a fixed dictionary in which predetermined words are registered in advance and a variable dictionary in which the registered words are variable, and
the construction unit constructs the variable dictionary of the dedicated dictionary.

The language processing apparatus according to claim 3, wherein the dedicated dictionary further includes a category table in which categories of words are registered, and
the construction unit constructs the variable dictionary by registering in it, from among the words of the registered dictionary, the words of the categories registered in the category table.

There are a plurality of the applications,
The language processing apparatus according to claim 1, wherein the construction unit constructs the dedicated dictionary for each of the plurality of applications.

A language processing method for a language processing apparatus having an application that uses language processing, comprising:
a registered dictionary storage step of storing a registered dictionary in which words are registered; and
a construction step of constructing, based on the registered dictionary, a dedicated dictionary that is dedicated to the application and in which words to be subjected to the language processing used in the application are registered.

A program storage medium storing a computer-readable language processing program for an application,
the program comprising a construction step of constructing, based on a registered dictionary in which words are registered, a dedicated dictionary that is dedicated to the application and in which words to be subjected to the language processing used in the application are registered.

A language processing program for an application,
the program causing a computer to execute a construction step of constructing, based on a registered dictionary in which words are registered, a dedicated dictionary that is dedicated to the application and in which words to be subjected to the language processing used in the application are registered.