JP4048473B2

JP4048473B2 - Audio processing apparatus, audio processing method, program, and recording medium

Info

Publication number: JP4048473B2
Application number: JP2002072718A
Authority: JP
Inventors: 厚夫廣江
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2002-03-15
Filing date: 2002-03-15
Publication date: 2008-02-20
Anticipated expiration: 2022-03-15
Also published as: JP2003271180A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声処理装置および音声処理方法、並びにプログラムおよび記録媒体に関し、特に、連続して入力される音声信号を音声認識している最中に、その入力音声信号に含まれる未知語を抽出し、簡単に登録することができるようにした音声処理装置および音声処理方法、並びにプログラムおよび記録媒体に関する。
【０００２】
【従来の技術】
対話システムにおいて、何かの名前を音声で登録するという場面は、多く発生する。例えば、ユーザが自分の名前を登録したり、対話システムに名前をつけたり、地名や店名を入力したりするという場面である。
【０００３】
従来、このような音声登録を簡単に実現する方法としては、何かのコマンドによって登録モードに移行して、登録が終了したら通常の対話モードに戻るというものがある。この場合、例えば、「ユーザ名登録」という音声コマンドによって登録モードに移行して、その後でユーザが名前を発声したらそれが登録され、その後、通常モードに戻る処理が行われる。
【０００４】
【発明が解決しようとする課題】
しかしながら、このような音声登録の方法では、コマンドによるモード切換えをしなければならず、対話としては不自然であり、ユーザにとっては煩わしいという課題がある。また、名付ける対象が複数存在する場合、コマンドの数が増えるため、いっそう煩わしくなる。
【０００５】
さらに、登録モード中に、ユーザが名前以外の単語（例えば、「こんにちは」）を話してしまった場合、それも名前として登録されてしまう。また、例えば、「太郎」という名前だけではなく、「私の名前は太郎です。」といったように、ユーザが名前以外の言葉を付加して話した場合、全体（「私の名前は太郎です。」）が名前として登録されてしまう。
【０００６】
本発明はこのような状況に鑑みてなされたものであり、通常の対話の中で、ユーザに登録モードを意識させることなく、単語を登録できるようにすることを目的とする。
【０００７】
【課題を解決するための手段】
本発明の音声処理装置は、連続する入力音声を認識する認識手段と、認識手段により認識された認識結果に、未知語が含まれているか否かを判定する未知語判定手段と、未知語判定手段により、未知語が含まれていると判定された場合、その未知語を獲得する獲得手段と、未知語判定手段により認識結果に未知語が含まれていると判定された場合、その認識結果が、未知語を含む単語列であるパターンにマッチするか否かを判定するパターン判定手段と、パターン判定手段により、認識結果がパターンにマッチしていると判定された場合、そのパターンにおいて未知語に対応付けられているカテゴリを、獲得手段により獲得された未知語に関連付けて登録する登録手段とを備え、認識手段は、入力音声の所定の区間について、既知語でマッチングさせた場合と音韻タイプライタで認識させた場合の、認識結果の候補と入力音声の音の近さを表す音響スコアを比較する比較手段を備え、比較手段は、音韻タイプライタで認識させた場合の音響スコアの方が優れている場合、その区間を未知語であると推定し、優れていない場合、その区間を既知語であると推定することを特徴とする。
【０００９】
未知語判定手段により、未知語が含まれていないと判定された場合、または、パターン判定手段により、認識結果がパターンにマッチしていないと判定された場合、入力音声に対応する応答を生成する応答生成手段をさらに備えることができる。
【００１２】
獲得手段は、未知語のクラスタを生成することで、その未知語を獲得することができる。
【００１４】
比較手段は、既知語でマッチングさせた場合の音響スコアに対して、音韻タイプライタで認識させた場合の音響スコアに補正をかけた上で比較を行うことができる。
認識手段は、認識結果の候補としての、推定された未知語または既知語を含む単語列を生成する単語列生成手段と、単語列生成手段により生成された単語列と入力音声の音の近さを表す音響スコアを計算する音響計算手段と、単語列生成手段により生成された単語列のふさわしさを表す言語スコアを計算する言語計算手段と、音響スコアと言語スコアに基づいて、単語列生成手段により生成された単語列から認識結果を選択する選択手段とをさらに備えることができる。
【００１５】
本発明の音声処理方法は、連続する入力音声を認識する認識ステップと、認識ステップの処理により認識された認識結果に、未知語が含まれているか否かを判定する未知語判定ステップと、未知語判定ステップの処理により、未知語が含まれていると判定された場合、その未知語を獲得する獲得ステップと、未知語判定ステップの処理により認識結果に未知語が含まれていると判定された場合、その認識結果が、未知語を含む単語列であるパターンにマッチするか否かを判定するパターン判定ステップと、パターン判定ステップの処理により、認識結果がパターンにマッチしていると判定された場合、そのパターンにおいて未知語に対応付けられているカテゴリを、獲得ステップの処理により獲得された未知語に関連付けて登録する登録ステップとを含み、認識ステップは、入力音声の所定の区間について、既知語でマッチングさせた場合と音韻タイプライタで認識させた場合の、認識結果の候補と入力音声の音の近さを表す音響スコアを比較する比較ステップを含み、比較ステップの処理は、音韻タイプライタで認識させた場合の音響スコアの方が優れている場合、その区間を未知語であると推定し、優れていない場合、その区間を既知語であると推定することを特徴とする。
【００１６】
本発明の記録媒体のプログラムは、連続する入力音声を認識する認識ステップと、認識ステップの処理により認識された認識結果に、未知語が含まれているか否かを判定する未知語判定ステップと、未知語判定ステップの処理により、未知語が含まれていると判定された場合、その未知語を獲得する獲得ステップと、未知語判定ステップの処理により認識結果に未知語が含まれていると判定された場合、その認識結果が、未知語を含む単語列であるパターンにマッチするか否かを判定するパターン判定ステップと、パターン判定ステップの処理により、認識結果がパターンにマッチしていると判定された場合、そのパターンにおいて未知語に対応付けられているカテゴリを、獲得ステップの処理により獲得された未知語に関連付けて登録する登録ステップとを含み、認識ステップは、入力音声の所定の区間について、既知語でマッチングさせた場合と音韻タイプライタで認識させた場合の、認識結果の候補と入力音声の音の近さを表す音響スコアを比較する比較ステップを含み、比較ステップの処理は、音韻タイプライタで認識させた場合の音響スコアの方が優れている場合、その区間を未知語であると推定し、優れていない場合、その区間を既知語であると推定することを特徴とする。
【００１７】
本発明のプログラムは、連続する入力音声を認識する認識ステップと、認識ステップの処理により認識された認識結果に、未知語が含まれているか否かを判定する未知語判定ステップと、未知語判定ステップの処理により、未知語が含まれていると判定された場合、その未知語を獲得する獲得ステップと、未知語判定ステップの処理により認識結果に未知語が含まれていると判定された場合、その認識結果が、未知語を含む単語列であるパターンにマッチするか否かを判定するパターン判定ステップと、パターン判定ステップの処理により、認識結果がパターンにマッチしていると判定された場合、そのパターンにおいて未知語に対応付けられているカテゴリを、獲得ステップの処理により獲得された未知語に関連付けて登録する登録ステップとを含み、認識ステップは、入力音声の所定の区間について、既知語でマッチングさせた場合と音韻タイプライタで認識させた場合の、認識結果の候補と入力音声の音の近さを表す音響スコアを比較する比較ステップを含み、比較ステップの処理は、音韻タイプライタで認識させた場合の音響スコアの方が優れている場合、その区間を未知語であると推定し、優れていない場合、その区間を既知語であると推定することを特徴とする。
【００１８】
本発明においては、連続する入力音声の所定の区間について、既知語でマッチングさせた場合と音韻タイプライタで認識させた場合の、認識結果の候補と入力音声の音の近さを表す音響スコアが比較され、音韻タイプライタで認識させた場合の音響スコアの方が優れている場合、その区間が未知語であると推定され、優れていない場合、その区間が既知語であると推定される。そして、認識結果に未知語が含まれている場合、その未知語が獲得され、その認識結果が、未知語を含む単語列であるパターンにマッチするか否かが判定され、認識結果がパターンにマッチしていると判定された場合、そのパターンにおいて未知語に対応付けられているカテゴリが、未知語に関連付けて登録される。
【００１９】
【発明の実施の形態】
以下、本発明の実施の形態について、図面を参照して説明する。図１は、本発明を適用した対話システムの一実施形態の構成例を示している。
【００２０】
この対話システムは、ユーザ（人間）と音声により対話を行うシステムであり、例えば、音声が入力されると、その音声から名前が取り出され、登録されるようになっている。
【００２１】
即ち、音声認識部１には、ユーザからの発話に基づく音声信号が入力されるようになっており、音声認識部１は、入力された音声信号を認識し、その音声認識の結果としてのテキスト、その他付随する情報を、対話制御部３と単語獲得部４に必要に応じて出力する。
【００２２】
単語獲得部４は、音声認識部１が有する認識用辞書に登録されていない単語について、音響的特徴を自動的に記憶し、それ以降、その単語の音声を認識できるようにする。
【００２３】
即ち、単語獲得部４は、入力音声に対応する発音を音韻タイプライタによって求め、それをいくつかのクラスタに分類する。各クラスタはＩＤと代表音韻系列を持ち、ＩＤで管理される。このときのクラスタの状態を、図２を参照して説明する。
【００２４】
例えば、「あか」、「あお」、「みどり」という３回の入力音声があったとする。この場合、単語獲得部４は、３回の音声を、それぞれに対応した「あか」クラスタ２１、「あお」クラスタ２２、「みどり」クラスタ２３の、３つのクラスタに分類し、各クラスタには、代表となる音韻系列（図２の例の場合、"a/k/a"、"a/o"、"m/i/d/o/r/i"）とＩＤ（図２の例の場合、「１」、「２」、「３」）を付加する。
【００２５】
ここで再び、「あか」という音声が入力されると、対応するクラスタが既に存在するので、単語獲得部４は、入力音声を「あか」クラスタ２１に分類し、新しいクラスタは生成しない。これに対して、「くろ」という音声が入力された場合、対応するクラスタが存在しないので、単語獲得部４は、「くろ」に対応したクラスタ２４を新たに生成し、そのクラスタには、代表的な音韻系列（図２の例の場合、"k/u/r/o"）とＩＤ（図２の例の場合、「４」）を付加する。
【００２６】
したがって、入力音声が未獲得の語であるか否かは、新たなクラスタが生成されたかどうかによって判定できる。なお、このような単語獲得処理の詳細は、本出願人が先に提案した特願２００１−９７８４３号に開示されている。
【００２７】
連想記憶部２は、登録した名前（未知語）がユーザ名であるか、キャラクタ名であるかといったカテゴリ等の情報を記憶する。例えば、図３の例では、クラスタＩＤとカテゴリ名とが対応して記憶されている。図３の例の場合、例えば、クラスタＩＤ「１」、「３」、「４」は「ユーザ名」のカテゴリに対応され、クラスタＩＤ「２」は、「キャラクタ名」のカテゴリに対応されている。
【００２８】
対話制御部３は、音声認識部１の出力からユーザの発話の内容を理解し、その理解の結果に基づいて、名前（未知語）の登録を制御する。また、対話制御部３は、連想記憶部２に記憶されている登録済みの名前の情報に基づいて、登録済みの名前を認識できるように、それ以降の対話を制御する。
【００２９】
図４は、音声認識部１の構成例を示している。
【００３０】
ユーザの発話は、マイクロホン４１に入力され、マイクロホン４１では、その発話が、電気信号としての音声信号に変換される。この音声信号は、ＡＤ（Analog Digital）変換部４２に供給される。ＡＤ変換部４２は、マイクロホン４１からのアナログ信号である音声信号をサンプリングし、量子化し、ディジタル信号である音声データに変換する。この音声データは、特徴量抽出部４３に供給される。
【００３１】
特徴量抽出部４３は、ＡＤ変換部４２からの音声データについて、適当なフレームごとに、例えば、スペクトル、パワー、線形予測係数、ケプストラム係数、線スペクトル対等の特徴パラメータを抽出し、マッチング部４４および音韻タイプライタ部４５に供給する。
【００３２】
マッチング部４４は、特徴量抽出部４３からの特徴パラメータに基づき、音響モデルデータベース５１、辞書データベース５２、および言語モデルデータベース５３を必要に応じて参照しながら、マイクロホン４１に入力された音声（入力音声）に最も近い単語列を求める。
【００３３】
音響モデルデータベース５１は、音声認識する音声の言語における個々の音韻や音節などの音響的な特徴を表す音響モデルを記憶している。音響モデルとしては、例えば、HMM（Hidden Markov Model）などを用いることができる。辞書データベース５２は、認識対象の各単語（語句）について、その発音に関する情報が記述された単語辞書や、音韻や音節の連鎖関係を記述したモデルを記憶している。
【００３４】
なお、ここにおける単語とは、認識処理において１つのまとまりとして扱ったほうが都合の良い単位のことであり、言語学的な単語とは必ずしも一致しない。例えば、「タロウ君」は、それ全体を１単語として扱ってもよいし、「タロウ」、「君」という２単語として扱ってもよい。さらに、もっと大きな単位である「こんにちはタロウ君」等を１単語として扱ってもよい。
【００３５】
また、音韻とは、音響的に１つの単位として扱った方が処理上都合のよいもののことであり、音声学的な音韻や音素とは必ずしも一致しない。例えば、「東京」の「とう」の部分を"t/o/u"という３個の音韻記号で表してもよいし、"o"の長音である"o:"という記号を用いて"t/o:"と表してもよい。または、"t/o/o"と表すことも可能である。他にも、無音を表す記号を用意したり、さらにそれを「発話前の無音」、「発話に挟まれた短い無音区間」、「発話後の無音」、「「っ」の部分の無音」のように細かく分類してそれぞれに記号を用意してもよい。
【００３６】
言語モデルデータベース５３は、辞書データベース５２の単語辞書に登録されている各単語がどのように連鎖する（接続する）かに関する情報を記述している。
【００３７】
音韻タイプライタ部４５は、特徴量抽出部４３から供給された特徴パラメータに基づいて、入力された音声に対応する音韻系列を取得する。例えば、「私の名前は太郎です。」という音声から"w/a/t/a/sh/i/n/o/n/a/m/a/e/w/a/t/a/r/o:/d/e/s/u"という音韻系列を取得する。この音韻タイプライタには、既存のものを用いることができる。
【００３８】
なお、音韻タイプライタ以外でも、任意の音声に対して音韻系列を取得できるものであれば代わりに用いることができる。例えば、日本語の音節（あ・い・う…か・き…・ん）を単位とする音声認識や、音韻よりも大きく、単語よりは小さな単位であるサブワードを単位とする音声認識等を用いることも可能である。
【００３９】
制御部４６は、ＡＤ変換部４２、特徴量抽出部４３、マッチング部４４、音韻タイプライタ部４５の動作を制御する。
【００４０】
次に、図５のフローチャートを参照して、本発明の対話システムの処理について説明する。
【００４１】
ステップＳ２１において、ユーザがマイクロホン４１に音声を入力すると、マイクロホン４１は、その発話を、電気信号としての音声信号に変換する。そして、ステップＳ２２において、音声認識部１は、音声認識処理を実行する。
【００４２】
音声認識処理の詳細について、図６を参照して説明する。マイクロホン４１で生成された音声信号は、ステップＳ４１において、ＡＤ変換部４２により、ディジタル信号である音声データに変換され、特徴量抽出部４３に供給される。
【００４３】
ステップＳ４２において、特徴量抽出部４３は、ＡＤ変換部４２からの音声データを受信する。そして、特徴量抽出部４３は、ステップＳ４３に進み、適当なフレームごとに、例えば、スペクトル、パワー、それらの時間変化量等の特徴パラメータを抽出し、マッチング部４４に供給する。
【００４４】
ステップＳ４４において、マッチング部４４は、辞書データベース５２に格納されている単語モデルのうちのいくつかを連結して、単語列を生成する。なお、この単語列を構成する単語には、辞書データベース５２に登録されている既知語だけでなく、登録されていない未知語を表わすシンボルである“<OOV>”も含まれている。この単語列生成処理について、図７を参照して詳細に説明する。
【００４５】
ステップＳ６１において、マッチング部４４は、入力音声の或る区間について、両方の場合の音響スコアを計算する。即ち、辞書データベース５２に登録されている既知語とマッチングさせた結果の音響スコアと、音韻タイプライタ部４５により得られた結果（今の場合、"w/a/t/a/sh/i/n/o/n/a/m/a/e/w/a/t/a/r/o:/d/e/s/u"の中の一部区間）の音響スコアが、それぞれ計算される。音響スコアは、音声認識結果の候補である単語列と入力音声とが音としてどれだけ近いかを表す。
【００４６】
次に、入力音声の一部区間と辞書データベース５２に登録されている既知語とをマッチングさせた結果の音響スコアと、音韻タイプライタ部４５による結果の音響スコアを比較させるのであるが、既知語とのマッチングは単語単位で行われ、音韻タイプライタ部４５でのマッチングは音韻単位で行われ、尺度が異なっているので、そのままでは比較することが困難である（一般的には、音韻単位の音響スコアの方が大きな値となる）。そこで、尺度を合わせて比較できるようにするために、マッチング部４４は、ステップＳ６２において、音韻タイプライタ部４５により得られた結果の音響スコアに補正をかける。
【００４７】
例えば、音韻タイプライタ部４５からの音響スコアに係数を掛けたり、一定の値やフレーム長に比例した値などを減じたりする処理が行われる。勿論、この処理は相対的なものなので、既知語とマッチングさせた結果の音響スコアに対して行うこともできる。なお、この処理の詳細は、例えば、文献「"EUROSPEECH99 Volume 1, Page 49-52"」に「OOV-Detection in Large Vocabulary System Using Automatically Defined Word-Fragments as Fillers」として開示されている。
【００４８】
マッチング部４４は、ステップＳ６３において、この２つの音響スコアを比較する（音韻タイプライタ部４５で認識させた結果の音響スコアの方が高い（優れている）か否かを判定する）。音韻タイプライタ部４５で認識させた結果の音響スコアの方が高い場合、ステップＳ６４に進み、マッチング部４４は、その区間を<OOV>（Out Of Vocabulary）（未知語）であると推定する。
【００４９】
ステップＳ６３において、既知語とマッチングさせた結果の音響スコアに対して、音韻タイプライタ部４５で認識された結果の音響スコアの方が低いと判定された場合、ステップＳ６６に進み、マッチング部４４は、その区間を既知語であると推定する。
【００５０】
即ち、例えば、「たろう」に相当する区間について、音韻タイプライタ部４５の出力した"t/a/r/o:"の音響スコアと、既知語でマッチングさせた場合の音響スコアを比較して、"t/a/r/o："の音響スコアの方が高い場合は、その音声区間に相当する単語として「<OOV>（t/a/r/o:）」が出力され、既知語の音響スコアの方が高い場合は、その既知語が音声区間に相当する単語として出力される。
【００５１】
ステップＳ６５において、音響スコアが高くなると推測される単語列（いくつかの単語モデルを連結したもの）を優先的にｎ個を生成する。
【００５２】
図６に戻って、ステップＳ４５において、音韻タイプライタ部４５はステップＳ４４の処理とは独立して、ステップＳ４３の処理で抽出された特徴パラメータに対して音韻を単位とする認識を行ない、音韻系列を出力する。例えば、「私の名前は太郎（未知語）です。」という音声が入力されると、音韻タイプライタ部４５は、"w/a/t/a/sh/i/n/o/n/a/m/a/e/w/a/t/a/r/o:/d/e/s/u"という音韻系列を出力する。
【００５３】
ステップＳ４６において、マッチング部４４は、ステップＳ４４で生成された単語列ごとに音響スコアを計算する。<OOV>（未知語）を含まない単語列に対しては既存の方法、すなわち各単語列（単語モデルを連結したもの）に対して音声の特徴パラメータを入力することで尤度を計算するという方法を用いる。一方、<OOV>を含む単語列については、既存の方法では<OOV>に相当する音声区間の音響スコアを求めることができない（<OOV>に対応する単語モデルは事前には存在しないため）。そこで、その音声区間については、音韻タイプライタの認識結果の中から同区間の音響スコアを取り出し、その値に補正をかけたものを<OOV>の音響スコアとして採用する。さらに、他の既知語部分の音響スコアと統合し、それをその単語列の音響スコアとする。
【００５４】
ステップＳ４７において、マッチング部４４は、音響スコアの高い単語列を上位ｍ個（ｍ≦ｎ）残し、候補単語列とする。ステップＳ４８において、マッチング部４４は、言語モデルデータベース５３を参照して、候補単語列毎に、言語スコアを計算する。言語スコアは、認識結果の候補である単語列が言葉としてどれだけふさわしいかを表す。ここで、この言語スコアを計算する方法を詳細に説明する。
【００５５】
本発明の音声認識部１は未知語も認識するため、言語モデルは未知語に対応している必要がある。例として、未知語に対応した文法または有限状態オートマトン（FSA:Finite State Automaton）を用いた場合と、同じく未知語に対応したtri-gram（統計言語モデルの1つである）を用いた場合とについて説明する。
【００５６】
文法の例を図８を参照して説明する。この文法６１はBNF(Backus Naur Form)で記述されている。図８において、"$Ａ"は「変数」を表し、"Ａ｜Ｂ"は「ＡまたはＢ」という意味を表す。また、"［Ａ］"は「Ａは省略可能」という意味を表し、｛Ａ｝は「Ａを０回以上繰り返す」という意味を表す。
【００５７】
<OOV>は未知語を表すシンボルであり、文法中に<OOV>を記述しておくことで、未知語を含む単語列に対しても対処することができる。"$ACTION"は図８では定義されていないが、実際には、例えば、「起立」、「着席」、「お辞儀」、「挨拶」等の動作の名前が定義されている。
【００５８】
この文法６１では、「＜先頭＞/こんにちは/＜終端＞」（“/”は単語間の区切り）、「＜先頭＞/さようなら/＜終端＞」、「＜先頭＞/私/の/名前/は/<OOV>/です/＜終端＞」のように、データベースに記憶されている文法に当てはまる単語列は受理される（この文法で解析される）が、「＜先頭＞/君/の/<OOV>/名前/＜終端＞」といった、データベースに記憶されている文法に当てはまらない単語列は受理されない（この文法で解析されない）。なお、「＜先頭＞」と「＜終端＞」はそれぞれ発話前と後の無音を表す特殊なシンボルである。
【００５９】
この文法を用いて言語スコアを計算するために、パーザ（解析機）が用いられる。パーザは、単語列を、文法を受理できる単語列と、受理できない単語列に分ける。即ち、例えば、受理できる単語列には言語スコア１が与えられて、受理できない単語列には言語スコア０が与えられる。
【００６０】
したがって、例えば、「＜先頭＞/私/の/名前/は/<OOV>（t/a/r/o：）/です/＜終端＞」と、「＜先頭＞/私/の/名前/は/<OOV>（j/i/r/o：）/です/＜終端＞」という２つの単語列があった場合、いずれも「＜先頭＞/私/の/名前/は/<OOV>/です/＜終端＞」に置き換えられた上で言語スコアが計算されて、ともに言語スコア１（受理）が出力される。
【００６１】
また、単語列の文法が受理できるか否かの判定は、事前に文法を等価（近似でも良い）な有限状態オートマトン（以下、FSAと称する）に変換しておき、各単語列がそのFSAで受理できるか否かを判定することによっても実現できる。
【００６２】
図８の文法を等価なFSAに変換した例が、図９に示されている。FSAは状態（ノード）とパス（アーク）とからなる有向グラフである。図９に示されるように、Ｓ１は開始状態、Ｓ１６は終了状態である。また、"$ACTION"には、図８と同様に、実際には動作の名前が登録されている。
【００６３】
パスには単語が付与されていて、所定の状態から次の状態に遷移する場合、パスはこの単語を消費する。ただし、"ε"が付与されているパスは、単語を消費しない特別な遷移（以下、ε遷移と称する）である。即ち、例えば、「＜先頭＞/私/は/<OOV>/です/＜終端＞」においては、初期状態Ｓ１から状態Ｓ２に遷移して、＜先頭＞が消費され、状態Ｓ２から状態Ｓ３へ遷移して、「私」が消費されるが、状態Ｓ３から状態Ｓ５への遷移は、ε遷移なので、単語は消費されない。即ち、状態Ｓ３から状態Ｓ５へスキップして、次の状態Ｓ６へ遷移することができる。
【００６４】
所定の単語列がこのFSAで受理できるか否かは、初期状態Ｓ１から出発して、終了状態Ｓ１６まで到達できるか否かで判定される。
【００６５】
即ち、例えば、「＜先頭＞/私/の/名前/は/<OOV>/です/＜終端＞」においては、初期状態Ｓ１から状態Ｓ２へ遷移して、単語「＜先頭＞」が消費される。次に、状態Ｓ２から状態Ｓ３へ遷移して、単語「私」が消費される。以下、同様に、状態Ｓ３から状態Ｓ４へ、状態Ｓ４から状態Ｓ５へ、状態Ｓ５から状態Ｓ６へ、状態Ｓ６から状態Ｓ７へ順次遷移して、「の」、「名前」、「は」、「<OOV>」が次々に消費される。さらに、状態Ｓ７から状態Ｓ１５へ遷移して、「です」が消費され、状態Ｓ１５から状態Ｓ１６に遷移して、「＜終端＞」が消費され、結局、終了状態Ｓ１６へ到達する。したがって、「＜先頭＞/私/の/名前/は/<OOV>/です/＜終端＞」はFSAで受理される。
【００６６】
しかしながら、「＜先頭＞/君/の/<OOV>/名前/＜終端＞」は、状態Ｓ１から状態Ｓ２へ、状態Ｓ２から状態Ｓ８へ、状態Ｓ８から状態Ｓ９までは遷移して、「＜先頭＞」、「君」、「の」までは消費されるが、その先には遷移できないので、終了状態Ｓ１６へ到達することはできない。したがって、「＜先頭＞/君/の/<OOV>/名前/＜終端＞」は、FSAで受理されない（不受理）。
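The acceptance test walked through in paragraphs 【００６３】 to 【００６６】 can be sketched as a small non-deterministic automaton with ε transitions. This is a minimal sketch under stated assumptions: the arc table contains only the transitions that the description names explicitly (FIG. 9 itself is not reproduced here, so the remaining arcs, including those for "$ACTION", are omitted), and the function names are hypothetical.

```python
EPS = "ε"  # label of a transition that consumes no word

# (state, label) -> set of next states.
# Only the arcs named in the description; the rest of FIG. 9 is not modelled.
ARCS = {
    ("S1", "＜先頭＞"): {"S2"},
    ("S2", "私"): {"S3"},
    ("S3", "の"): {"S4"},
    ("S4", "名前"): {"S5"},
    ("S3", EPS): {"S5"},          # ε transition: skip "の/名前"
    ("S5", "は"): {"S6"},
    ("S6", "<OOV>"): {"S7"},
    ("S7", "です"): {"S15"},
    ("S15", "＜終端＞"): {"S16"},
    ("S2", "君"): {"S8"},
    ("S8", "の"): {"S9"},
}


def eps_closure(states):
    """Return the given states plus every state reachable via ε transitions."""
    stack, seen = list(states), set(states)
    while stack:
        state = stack.pop()
        for nxt in ARCS.get((state, EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def accepts(words, start="S1", final="S16"):
    """True if the word list can be consumed from start to the final state."""
    current = eps_closure({start})
    for word in words:
        reached = set()
        for state in current:
            reached |= ARCS.get((state, word), set())
        current = eps_closure(reached)
        if not current:
            return False  # no arc consumes this word
    return final in current


def language_score(words):
    """Language score 1 for an accepted word string, 0 otherwise."""
    return 1 if accepts(words) else 0
```

Each `<OOV>(...)` entry is assumed to have been replaced by the bare symbol `<OOV>` before scoring, as paragraph 【００６０】 describes.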
【００６７】
さらに、言語モデルとして、統計言語モデルの１つであるtri-gramを用いた場合の言語スコアを計算する例を、図１０を参照して説明する。統計言語モデルとは、その単語列の生成確率を求めて、それを言語スコアとする言語モデルである。即ち、例えば、図１０の言語モデル７１の「＜先頭＞/私/の/名前/は/<OOV>/です/＜終端＞」の言語スコアは、第２行に示されるように、その単語列の生成確率で表される。これはさらに、第３行乃至第６行で示されるように、条件付き確率の積として表される。なお、例えば、「Ｐ（の｜＜先頭＞私）」は、「の」の直前の単語が「私」で、「私」の直前の単語が「＜先頭＞」であるという条件の下で、「の」が出現する確率を表す。
【００６８】
さらに、tri-gramでは、図１０の第３行乃至第６行で示される式を、第７行乃至第９行で示されるように、連続する３単語の条件付き確率で近似させる。これらの確率値は、図１１に示されるようなtri-gramデータベース８１を参照して求められる。このtri-gramデータベース８１は、予め大量のテキストを分析して求められたものである。
【００６９】
図１１の例では、３つの連続する単語ｗ１，ｗ２，ｗ３の確率Ｐ（ｗ３｜ｗ１ｗ２）が表されている。例えば、３つの単語ｗ１，ｗ２，ｗ３が、それぞれ、「＜先頭＞」、「私」、「の」である場合、確率値は０．１２とされ、「私」、「の」、「名前」である場合、確率値は０．０１とされ、「<OOV>」、「です」、「＜終端＞」である場合、確率値は、０．８７とされている。
【００７０】
勿論、「Ｐ（Ｗ）」及び「Ｐ（ｗ２｜ｗ１）」についても、同様に、予め求めておく。
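The tri-gram approximation of paragraphs 【００６７】 to 【００７０】 can be illustrated as follows. This is only a sketch: the three probabilities marked "FIG. 11" are the ones quoted in the description, while every other value, the table names, and the function name are hypothetical placeholders. No smoothing or back-off is modelled, so an n-gram missing from the tables raises `KeyError`.

```python
import math

# P(w3 | w1 w2)
TRIGRAM = {
    ("＜先頭＞", "私", "の"): 0.12,       # quoted from the description of FIG. 11
    ("私", "の", "名前"): 0.01,           # quoted from the description of FIG. 11
    ("<OOV>", "です", "＜終端＞"): 0.87,  # quoted from the description of FIG. 11
    ("の", "名前", "は"): 0.5,            # hypothetical value
    ("名前", "は", "<OOV>"): 0.2,         # hypothetical value
    ("は", "<OOV>", "です"): 0.6,         # hypothetical value
}
BIGRAM = {("＜先頭＞", "私"): 0.3}        # P(w2 | w1), hypothetical value
UNIGRAM = {"＜先頭＞": 1.0}               # P(w1), hypothetical value


def log_language_score(words):
    """Log of P(w1) * P(w2|w1) * product of P(w_i | w_{i-2} w_{i-1})."""
    logp = math.log(UNIGRAM[words[0]])
    if len(words) > 1:
        logp += math.log(BIGRAM[(words[0], words[1])])
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        logp += math.log(TRIGRAM[(w1, w2, w3)])
    return logp
```

Working in log probabilities avoids underflow when many small factors are multiplied; the ranking of candidates is the same as with the plain product.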
【００７１】
このようにして、言語モデル中に<OOV>について、エントリ処理をしておくことで、<OOV>を含む単語列に対して、言語スコアを計算することができる。したがって、認識結果に<OOV>というシンボルを出力することができる。
【００７２】
また、他の種類の言語モデルを用いる場合も、<OOV>についてのエントリ処理をすることによって、同様に<OOV>を含む単語列に対して、言語スコアを計算することができる。
【００７３】
さらに、<OOV>のエントリが存在しない言語モデルを用いた場合でも、<OOV>を言語モデル中の適切な単語にマッピングする機構を用いることで、言語スコアの計算ができる。例えば、「Ｐ（<OOV>｜私は）」が存在しないtri-gramデータベースを用いた場合でも、「Ｐ（太郎｜私は）」でデータベースをアクセスして、そこに記述されている確率を「Ｐ（<OOV>｜私は）」の値とみなすことで、言語スコアの計算ができる。
【００７４】
図６に戻って、マッチング部４４は、ステップＳ４９において、音響スコアと言語スコアを統合する。ステップＳ５０において、マッチング部４４は、ステップＳ４９において求められた音響スコアと言語スコアの両スコアを統合したスコアに基づいて、最もよいスコアをもつ候補単語列を選択して、認識結果として出力する。
【００７５】
なお、言語モデルとして、有限状態オートマトンを使用している場合は、ステップＳ４９の統合処理を、言語スコアが０の場合は単語列を消去し、言語スコアが０以外の場合はそのまま残すという処理にしてもよい。
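Steps S49 and S50 can be sketched as below. The description does not state how the two scores are combined, so the weighted sum of log-domain scores used here is an assumption, as are the function name and the `lm_weight` parameter. The `fsa_mode` branch follows paragraph 【００７５】: candidates with language score 0 are discarded and the others are kept as they are.

```python
from typing import Optional


def select_result(candidates, lm_weight: float = 1.0,
                  fsa_mode: bool = False) -> Optional[str]:
    """Pick the best candidate word string.

    candidates: iterable of (word_string, acoustic_score, language_score),
    where higher scores are better.
    """
    best_words, best_total = None, None
    for words, acoustic, language in candidates:
        if fsa_mode:
            if language == 0:
                continue              # rejected by the grammar: drop it
            total = acoustic          # accepted: keep the score unchanged
        else:
            # Assumed integration: weighted sum of the two scores.
            total = acoustic + lm_weight * language
        if best_total is None or total > best_total:
            best_words, best_total = words, total
    return best_words
```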
【００７６】
図５に戻って、以上のようにしてステップＳ２２で音声認識処理が実行された後、ステップＳ２３において、音声認識部１の制御部４６は認識された単語列に未知語が含まれているか否かを判定する。未知語が含まれていると判定された場合、制御部４６は、単語獲得部４を制御し、ステップＳ２４において、単語獲得処理を実行させ、その未知語を獲得させる。
【００７７】
単語獲得処理の詳細について、図１２を参照して説明する。ステップＳ９１において、単語獲得部４は、音声認識部１から未知語（<OOV>）の特徴パラメータを抽出する。ステップＳ９２において、単語獲得部４は、未知語が既獲得のクラスタに属するか否かを判定する。既獲得のクラスタに属さないと判定された場合、単語獲得部４は、ステップＳ９３において、その未知語に対応する、新しいクラスタを生成する。そして、ステップＳ９４において、単語獲得部４は、未知語の属するクラスタのＩＤを音声認識部１のマッチング部４４に出力する。
【００７８】
ステップＳ９２において、未知語が既獲得のクラスタに属すると判定された場合、新しいクラスタを生成する必要がないので、単語獲得部４はステップＳ９３の処理をスキップして、ステップＳ９４に進み、未知語の属する既獲得のクラスタのＩＤをマッチング部４４に出力する。
【００７９】
なお、図１２の処理は各未知語毎に行われる。
【００８０】
図５に戻って、ステップＳ２４の単語獲得処理終了後、ステップＳ２５において、対話制御部３は、ステップＳ２４の処理で獲得された単語列が、テンプレートにマッチしているかどうかを判定する。即ち、認識結果の単語列が何かの名前の登録を意味するものかどうかの判定がここで行われる。そして、ステップＳ２５において、認識結果の単語列がテンプレートにマッチしていると判定された場合、ステップＳ２６において、対話制御部３は、連想記憶部２に、名前のクラスタＩＤとカテゴリを対応させて記憶させる。
【００８１】
対話制御部３がマッチングさせるテンプレートの例を図１３を参照して説明する。なお、図１３において、"/Ａ/"は「文字列Ａが含まれていたら」という意味を表し、"Ａ｜Ｂ"は「ＡまたはＢ」という意味を表す。また、"."は「任意の文字」を表し、"Ａ＋"は「Ａの１回以上の繰り返し」という意味を表し、"(.)＋"は「任意の文字列」を表す。
【００８２】
このテンプレート９１は、認識結果の単語列が図の左側の正規表現にマッチした場合、図の右側の動作を実行させることを表している。例えば、認識結果が「＜先頭＞/私/の/名前/は/<OOV>（t/a/r/o:）/です/＜終端＞」という単語列である場合、この認識結果から生成された文字列「私の名前は<OOV>です」は、図１３の第２番目の正規表現にマッチする。したがって、対応する動作である「＜OOV>に対応するクラスタＩＤをユーザ名として登録する」処理が実行される。即ち、「<OOV>(t/a/r/o：)」のクラスタＩＤが「１」である場合、図３に示されるように、クラスタＩＤ「１」のカテゴリ名が「ユーザ名」として登録される。
【００８３】
また、例えば、認識結果が、「＜先頭＞/君/の/名前/は/<OOV>（a/i/b/o）/だよ/＜終端＞」である場合、そこから生成される文字列「君の名前は<OOV>だよ」は図１３の第１番目の正規表現にマッチするので、「<OOV>(a/i/b/o)」がクラスタＩＤ「２」であれば、クラスタＩＤ「２」のカテゴリは、「キャラクタ名」として登録される。
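The template matching and registration of steps S25 and S26 can be sketched with regular expressions. The actual expressions of FIG. 13 are not reproduced in the text, so the two patterns below are hypothetical stand-ins that merely match the two example sentences in the order the description gives; the dictionary standing in for the associative storage unit 2 and all function names are likewise assumptions.

```python
import re

# (pattern, category) pairs in the order the description gives them:
# first the character-name template, then the user-name template.
TEMPLATES = [
    (re.compile(r"君の名前は<OOV>"), "キャラクタ名"),
    (re.compile(r"私の名前は<OOV>"), "ユーザ名"),
]

# Strips the pronunciation from entries such as "<OOV>(t/a/r/o:)".
OOV_DETAIL = re.compile(r"<OOV>\([^)]*\)")


def to_text(words):
    """Join a recognized word list into the string matched by the templates."""
    body = [w for w in words if w not in ("＜先頭＞", "＜終端＞")]
    return "".join(OOV_DETAIL.sub("<OOV>", w) for w in body)


def register(words, cluster_id, memory):
    """Store cluster_id -> category if a template matches; report success."""
    text = to_text(words)
    for pattern, category in TEMPLATES:
        if pattern.search(text):
            memory[cluster_id] = category
            return True
    return False  # no template matched: fall through to a normal response
```

A `False` return corresponds to the branch to step S27, where an ordinary response to the input speech is generated instead.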
【００８４】
なお、対話システムによっては、登録する単語が1種類しかない（例えば、「ユーザ名」のみ）場合もあり、その場合は、テンプレート９１と連想記憶部２は簡略化することができる。例えば、テンプレート９１の内容を「認識結果に<OOV>が含まれていたら、そのＩＤを記憶する」として、連想記憶部２にそのクラスタＩＤのみを記憶させることができる。
【００８５】
対話制御部３は、このようにして連想記憶部２に登録された情報を、以後の対話の判断処理に反映させる。例えば、対話システムの側で、「ユーザの発話の中に、対話キャラクタの名前が含まれているかどうかを判定する。含まれている場合は『呼びかけられた』と判断して、それに応じた返事をする」という処理や、「対話キャラクタがユーザの名前をしゃべる」という処理が必要になった場合に、対話制御部３は連想記憶部２の情報を参照することで、対話キャラクタに相当する単語（カテゴリ名が「キャラクタ名」であるエントリ）やユーザ名に相当する単語（カテゴリ名が「ユーザ名」であるエントリ）を得ることができる。
【００８６】
一方、ステップＳ２３において、認識結果に未知語が含まれていないと判定された場合、またはステップＳ２５において、認識結果がテンプレートにマッチしていないと判定された場合、ステップＳ２７において、対話制御部３は、入力音声に対応する応答を生成する。すなわち、この場合には、名前（未知語）の登録処理は行われず、ユーザからの入力音声に対応する所定の処理が実行される。
【００８７】
ところで、言語モデルとして文法を用いる場合、文法の中に音韻タイプライタ相当の記述も組み込むことができる。この場合の文法の例が図１４に示されている。この文法１０１において、第１行目の変数"$PHONEME"は、全ての音韻が「または」を意味する"|"で繋がれているので、音韻記号の内のどれか１つを意味する。変数"$OOV"は"$PHONEME"を０回以上繰り返すことを表している。即ち、「任意の音韻記号を０回以上接続したもの」を意味し、音韻タイプライタに相当する。したがって、第３行目の「は」と「です」の間の"$OOV"は、任意の発音を受け付けることができる。
【００８８】
この文法１０１を用いた場合の認識結果では、"$OOV"に相当する部分が複数のシンボルで出力される。例えば、「私の名前は太郎です」の認識結果が「＜先頭＞/私/の/名前/は/t/a/r/o:/です/＜終端＞」となる。この結果を「＜先頭＞/私/の/名前/は/<OOV>（t/a/r/o:）/です」に変換すると、図５のステップＳ２３以降の処理は、音韻タイプライタを用いた場合と同様に実行することができる。
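The conversion described in paragraph 【００８８】, in which a run of phoneme symbols in the recognition result is folded into a single `<OOV>(...)` entry, can be sketched as follows. The phoneme inventory is a hypothetical placeholder (the "$PHONEME" list of FIG. 14 is not given in the text), and the function name is an assumption.

```python
# Hypothetical phoneme inventory; FIG. 14's "$PHONEME" list is not reproduced.
PHONEMES = {
    "a", "i", "u", "e", "o", "a:", "i:", "u:", "e:", "o:",
    "k", "s", "sh", "t", "ch", "ts", "n", "h", "f", "m",
    "y", "r", "w", "g", "z", "j", "d", "b", "p", "N", "q",
}


def collapse_oov(words):
    """Replace each maximal run of phoneme symbols with one <OOV>(...) word."""
    out, run = [], []
    for word in words:
        if word in PHONEMES:
            run.append(word)          # still inside a phoneme run
            continue
        if run:                       # a run just ended: emit it as one word
            out.append("<OOV>(" + "/".join(run) + ")")
            run = []
        out.append(word)
    if run:                           # run that reaches the end of the input
        out.append("<OOV>(" + "/".join(run) + ")")
    return out
```

This sketch assumes that no dictionary word is spelled identically to a phoneme symbol; a real system would need the recognizer to mark which symbols came from "$OOV".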
【００８９】
以上においては、未知語に関連する情報として、カテゴリを登録するようにしたが、その他の情報を登録するようにしてもよい。
【００９０】
図１５は、上述の処理を実行するパーソナルコンピュータ１１０の構成例を示している。このパーソナルコンピュータ１１０は、CPU（Central Processing Unit）１１１を内蔵している。CPU１１１にはバス１１４を介して、入出力インタフェース１１５が接続されている。バス１１４には、ROM(Read Only Memory)１１２およびRAM(Random Access Memory)１１３が接続されている。
【００９１】
入出力インターフェース１１５には、ユーザが操作するマウス、キーボード、マイクロホン、ＡＤ変換器等の入力デバイスで構成される入力部１１７、およびディスプレイ、スピーカ、ＤＡ変換器等の出力デバイスで構成される出力部１１６が接続されている。さらに、入出力インターフェース１１５には、プログラムや各種データを格納するハードディスクドライブなどよりなる記憶部１１８、並びにインタネットに代表されるネットワークを介してデータを通信する通信部１１９が接続されている。
【００９２】
入出力インターフェース１１５には、磁気ディスク１３１、光ディスク１３２、光磁気ディスク１３３、半導体メモリ１３４などの記録媒体に対してデータを読み書きするドライブ１２０が必要に応じて接続される。
【００９３】
このパーソナルコンピュータ１１０に本発明を適用した音声処理装置としての動作を実行させる音声処理プログラムは、磁気ディスク１３１（フロッピディスクを含む）、光ディスク１３２(CD-ROM(Compact Disc-Read Only Memory)、DVD(Digital Versatile Disc)を含む)、光磁気ディスク１３３（MD(Mini Disc)を含む）、もしくは半導体メモリ１３４に格納された状態でパーソナルコンピュータ１１０に供給され、ドライブ１２０によって読み出されて、記憶部１１８に内蔵されるハードディスクドライブにインストールされる。記憶部１１８にインストールされた音声処理プログラムは、入力部１１７に入力されるユーザからのコマンドに対応するCPU１１１の指令によって、記憶部１１８からRAM１１３にロードされて実行される。
【００９４】
上述した一連の処理は、ハードウエアにより実行させることもできるが、ソフトウエアにより実行させることもできる。一連の処理をソフトウエアにより実行させる場合には、そのソフトウエアを構成するプログラムが、専用のハードウエアに組み込まれているコンピュータ、または、各種のプログラムをインストールすることで、各種の機能を実行することが可能な、例えば汎用のパーソナルコンピュータなどに、ネットワークや記録媒体からインストールされる。
【００９５】
この記録媒体は、図１５に示されるように、装置本体とは別に、ユーザにプログラムを提供するために配布される、プログラムが記録されている磁気ディスク１３１、光ディスク１３２、光磁気ディスク１３３、もしくは半導体メモリ１３４などよりなるパッケージメディアにより構成されるだけでなく、装置本体に予め組み込まれた状態でユーザに提供される、プログラムが記録されているROM１１２や、記憶部１１８に含まれるハードディスクなどで構成される。
【００９６】
なお、本明細書において、記録媒体に記録されるプログラムを記述するステップは、記載された順序に沿って時系列的に行われる処理はもちろん、必ずしも時系列的に処理されなくとも、並列的あるいは個別に実行される処理をも含むものである。
【００９７】
また、本明細書において、システムとは、複数の装置が論理的に集合したものをいい、各構成の装置が同一筐体中にあるか否かは問わない。
【００９８】
【発明の効果】
以上のように、本発明によれば、単語を音声で登録することができる。またその登録を、ユーザに登録モードを意識させることなく実行できる。さらに、既知語と未知語を含む連続する入力音声の中から未知語を容易に登録することが可能となる。さらに、登録した単語を、以降の対話で反映させることが可能となる。
【図面の簡単な説明】
【図１】本発明を適用した対話システムの一実施の形態の構成例を示すブロック図である。
【図２】クラスタの状態の例を示す図である。
【図３】単語の登録を示す図である。
【図４】図１の音声認識部の構成例を示すブロック図である。
【図５】図１の対話システムの動作を説明するためのフローチャートである。
【図６】図５のステップＳ２２の音声認識処理の詳細を説明するためのフローチャートである。
【図７】図６のステップＳ４４の単語列を生成する動作の詳細を説明するためのフローチャートである。
【図８】言語モデルデータベースで用いられる文法の例を示す図である。
【図９】有限状態オートマトンによる言語モデルの例を示す図である。
【図１０】 tri-gramを用いた言語スコアの計算の例を示す図である。
【図１１】 tri-gramデータベースの例を示す図である。
【図１２】図５のステップＳ２４の単語獲得処理の詳細を説明するためのフローチャートである。
【図１３】テンプレートの例を示す図である。
【図１４】音韻タイプライタを組み込んだ文法の例を示す図である。
【図１５】本発明を適用したコンピュータの一実施の形態の構成例を示すブロック図である。
【符号の説明】
１音声認識部，２連想記憶部，３対話制御部，４単語獲得部，４１マイクロホン，４２ＡＤ変換部，４３特徴量抽出部，４４マッチング部，４５音韻タイプライタ部，４６制御部，５１音響モデルデータベース，５２辞書データベース，５３言語モデルデータベース
[0001]
[Technical field of the invention]
The present invention relates to a speech processing apparatus, a speech processing method, a program, and a recording medium, and in particular to a speech processing apparatus, a speech processing method, a program, and a recording medium that can extract an unknown word contained in a continuously input speech signal while that signal is being recognized, and can register the word easily.
[0002]
[Prior art]
In a dialogue system, there are many situations in which the name of something is registered by voice: for example, the user registers his or her own name, gives the dialogue system a name, or enters a place name or a store name.
[0003]
A conventional way to realize such voice registration simply is to switch to a registration mode with some command and to return to the normal dialogue mode once registration is complete. In this case, for example, the voice command “user name registration” switches the system to the registration mode, the name the user then utters is registered, and the system returns to the normal mode.
[0004]
[Problems to be solved by the invention]
However, in such a voice registration method, the mode must be switched by a command, which is unnatural as a dialog and is troublesome for the user. Further, when there are a plurality of objects to be named, the number of commands increases, which makes it more troublesome.
[0005]
Furthermore, if the user utters a word other than a name (for example, “Hello”) during the registration mode, that word is also registered as a name. Likewise, if the user adds other words to the name, saying for example “My name is Taro.” instead of just “Taro”, the entire utterance (“My name is Taro.”) is registered as the name.
[0006]
The present invention has been made in view of such a situation, and an object of the present invention is to allow a user to register a word without making the user aware of the registration mode during a normal dialogue.
[0007]
[Means for Solving the Problems]
  The speech processing apparatus of the present invention includes: recognition means for recognizing continuous input speech; unknown word determination means for determining whether the recognition result produced by the recognition means contains an unknown word; acquisition means for acquiring the unknown word when the unknown word determination means determines that one is contained; pattern determination means for determining, when the unknown word determination means determines that the recognition result contains an unknown word, whether the recognition result matches a pattern that is a word string containing an unknown word; and registration means for registering, when the pattern determination means determines that the recognition result matches the pattern, the category associated with the unknown word in that pattern in association with the unknown word acquired by the acquisition means. The recognition means includes comparison means for comparing, for a predetermined section of the input speech, the acoustic score obtained when the section is matched against known words with the acoustic score obtained when it is recognized by a phoneme typewriter, an acoustic score representing how close a recognition result candidate is in sound to the input speech. The comparison means estimates that the section is an unknown word when the acoustic score obtained with the phoneme typewriter is better, and that the section is a known word otherwise.
[0009]
  When the unknown word determining means determines that the unknown word is not included, or when the pattern determining means determines that the recognition result does not match the pattern, a response corresponding to the input speech is generated. Response generation means may be further provided.
[0012]
  The acquisition means can acquire the unknown word by generating a cluster for that unknown word.
[0014]
  The comparison means can apply a correction to the acoustic score obtained with the phoneme typewriter before comparing it with the acoustic score obtained by matching against known words.
  The recognition means can further include: word string generation means for generating, as recognition result candidates, word strings containing the estimated unknown word or known word; acoustic calculation means for calculating an acoustic score representing how close each word string generated by the word string generation means is in sound to the input speech; language calculation means for calculating a language score representing how appropriate each generated word string is; and selection means for selecting the recognition result from the generated word strings on the basis of the acoustic score and the language score.
[0015]
  The speech processing method of the present invention includes: a recognition step of recognizing continuous input speech; an unknown word determination step of determining whether the recognition result produced in the recognition step contains an unknown word; an acquisition step of acquiring the unknown word when the unknown word determination step determines that one is contained; a pattern determination step of determining, when the unknown word determination step determines that the recognition result contains an unknown word, whether the recognition result matches a pattern that is a word string containing an unknown word; and a registration step of registering, when the pattern determination step determines that the recognition result matches the pattern, the category associated with the unknown word in that pattern in association with the unknown word acquired in the acquisition step. The recognition step includes a comparison step of comparing, for a predetermined section of the input speech, the acoustic score obtained when the section is matched against known words with the acoustic score obtained when it is recognized by a phoneme typewriter, an acoustic score representing how close a recognition result candidate is in sound to the input speech. The comparison step estimates that the section is an unknown word when the acoustic score obtained with the phoneme typewriter is better, and that the section is a known word otherwise.
[0016]
  The program recorded on the recording medium of the present invention includes: a recognition step of recognizing continuous input speech; an unknown word determination step of determining whether the recognition result produced in the recognition step contains an unknown word; an acquisition step of acquiring the unknown word when the unknown word determination step determines that one is contained; a pattern determination step of determining, when the unknown word determination step determines that the recognition result contains an unknown word, whether the recognition result matches a pattern that is a word string containing an unknown word; and a registration step of registering, when the pattern determination step determines that the recognition result matches the pattern, the category associated with the unknown word in that pattern in association with the unknown word acquired in the acquisition step. The recognition step includes a comparison step of comparing, for a predetermined section of the input speech, the acoustic score obtained when the section is matched against known words with the acoustic score obtained when it is recognized by a phoneme typewriter, an acoustic score representing how close a recognition result candidate is in sound to the input speech. The comparison step estimates that the section is an unknown word when the acoustic score obtained with the phoneme typewriter is better, and that the section is a known word otherwise.
[0017]
  The program of the present invention includes: a recognition step of recognizing continuous input speech; an unknown word determination step of determining whether the recognition result produced in the recognition step contains an unknown word; an acquisition step of acquiring the unknown word when the unknown word determination step determines that one is contained; a pattern determination step of determining, when the unknown word determination step determines that the recognition result contains an unknown word, whether the recognition result matches a pattern that is a word string containing an unknown word; and a registration step of registering, when the pattern determination step determines that the recognition result matches the pattern, the category associated with the unknown word in that pattern in association with the unknown word acquired in the acquisition step. The recognition step includes a comparison step of comparing, for a predetermined section of the input speech, the acoustic score obtained when the section is matched against known words with the acoustic score obtained when it is recognized by a phoneme typewriter, an acoustic score representing how close a recognition result candidate is in sound to the input speech. The comparison step estimates that the section is an unknown word when the acoustic score obtained with the phoneme typewriter is better, and that the section is a known word otherwise.
[0018]
  In the present invention, for a predetermined section of continuous input speech, the acoustic score obtained when the section is matched against known words is compared with the acoustic score obtained when it is recognized by a phoneme typewriter, an acoustic score representing how close a recognition result candidate is in sound to the input speech. The section is estimated to be an unknown word when the acoustic score obtained with the phoneme typewriter is better, and a known word otherwise. If the recognition result contains an unknown word, the unknown word is acquired and it is determined whether the recognition result matches a pattern that is a word string containing an unknown word. If the recognition result is determined to match the pattern, the category associated with the unknown word in that pattern is registered in association with the unknown word.
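The comparison at the core of this summary can be sketched as follows. It is a minimal illustration, not the patented implementation: scores are assumed to be log-likelihoods (higher is better), the correction is assumed to take the form "scale, then subtract a value proportional to the frame count" (one of the options the description mentions for the phoneme typewriter score), and the class, function, and parameter names as well as the default constants are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SegmentScores:
    known_word: str          # best-matching dictionary word for the section
    known_score: float       # acoustic score of that known-word match
    phonemes: str            # phoneme typewriter output, e.g. "t/a/r/o:"
    typewriter_score: float  # acoustic score of the phoneme typewriter
    num_frames: int          # length of the section in frames


def corrected(score: float, num_frames: int,
              scale: float = 1.0, per_frame_penalty: float = 0.5) -> float:
    """Bring the phoneme-level score onto the word-level scale (assumed form)."""
    return scale * score - per_frame_penalty * num_frames


def estimate(seg: SegmentScores) -> str:
    """Return "<OOV>(phonemes)" if the corrected typewriter score is better,
    otherwise the known word."""
    if corrected(seg.typewriter_score, seg.num_frames) > seg.known_score:
        return f"<OOV>({seg.phonemes})"
    return seg.known_word
```

The correction matters because phoneme-level scores tend to be higher than word-level scores; without it almost every section would be labelled unknown.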
[0019]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 shows a configuration example of an embodiment of a dialogue system to which the present invention is applied.
[0020]
This dialogue system is a system that performs dialogue with a user (human) by voice. For example, when voice is input, a name is extracted from the voice and registered.
[0021]
That is, the speech recognition unit 1 receives a speech signal based on an utterance from the user, recognizes the input speech signal, and outputs the text resulting from the recognition, together with other accompanying information, to the dialogue control unit 3 and the word acquisition unit 4 as necessary.
[0022]
The word acquisition unit 4 automatically stores acoustic features for words that are not registered in the recognition dictionary of the speech recognition unit 1 so that the speech of the words can be recognized thereafter.
[0023]
That is, the word acquisition unit 4 obtains pronunciations corresponding to the input speech by the phoneme typewriter and classifies them into several clusters. Each cluster has an ID and a representative phoneme sequence, and is managed by the ID. The state of the cluster at this time will be described with reference to FIG.
[0024]
For example, assume that three utterances, “Aka”, “Ao”, and “Midori”, are input. In this case, the word acquisition unit 4 classifies the three utterances into three corresponding clusters, the “Aka” cluster 21, the “Ao” cluster 22, and the “Midori” cluster 23, and adds to each cluster a representative phoneme sequence (“a/k/a”, “a/o”, and “m/i/d/o/r/i” in the example of FIG. 2) and an ID (“1”, “2”, and “3” in the example of FIG. 2).
[0025]
If the utterance “Aka” is then input again, the corresponding cluster already exists, so the word acquisition unit 4 classifies the input speech into the “Aka” cluster 21 and does not generate a new cluster. If, on the other hand, the utterance “Kuro” is input, no corresponding cluster exists, so the word acquisition unit 4 newly generates a cluster 24 corresponding to “Kuro” and adds to it a representative phoneme sequence (“k/u/r/o” in the example of FIG. 2) and an ID (“4” in the example of FIG. 2).
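The cluster bookkeeping in the “Aka”, “Ao”, “Midori”, “Kuro” example can be sketched as follows. The text does not say how an utterance is judged to belong to an existing cluster (it defers to Japanese Patent Application No. 2001-97843), so the edit distance between phoneme sequences and the threshold used here are assumptions made purely for illustration, as are the class and method names.

```python
from dataclasses import dataclass, field


def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


@dataclass
class ClusterStore:
    max_distance: int = 1  # hypothetical membership threshold
    # cluster ID -> representative phoneme sequence
    representatives: dict[int, list[str]] = field(default_factory=dict)

    def acquire(self, phonemes: str) -> tuple[int, bool]:
        """Return (cluster ID, whether a new cluster was generated)."""
        seq = phonemes.split("/")
        for cluster_id, rep in self.representatives.items():
            if edit_distance(seq, rep) <= self.max_distance:
                return cluster_id, False   # already acquired
        cluster_id = len(self.representatives) + 1
        self.representatives[cluster_id] = seq
        return cluster_id, True            # newly acquired word
```

The boolean in the return value mirrors the observation in the next paragraph: whether the input is a not-yet-acquired word is known from whether a new cluster was generated.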
[0026]
Therefore, whether or not the input speech is an unacquired word can be determined based on whether or not a new cluster has been generated. Details of such word acquisition processing are disclosed in Japanese Patent Application No. 2001-97843 previously proposed by the present applicant.
[0027]
The associative storage unit 2 stores information such as the category of a registered name (unknown word), for example whether it is a user name or a character name. In the example of FIG. 3, cluster IDs and category names are stored in association with each other: cluster IDs “1”, “3”, and “4” are associated with the “user name” category, and cluster ID “2” is associated with the “character name” category.
[0028]
The dialogue control unit 3 understands the content of the user's utterance from the output of the speech recognition unit 1 and controls the registration of the name (unknown word) based on the result of that understanding. The dialogue control unit 3 also controls subsequent dialogue, based on the registered-name information stored in the associative storage unit 2, so that registered names can be recognized.
[0029]
FIG. 4 shows a configuration example of the speech recognition unit 1.
[0030]
The user's utterance is input to the microphone 41, and the microphone 41 converts the utterance into an audio signal as an electric signal. This audio signal is supplied to an AD (Analog Digital) converter 42. The AD converter 42 samples the audio signal that is an analog signal from the microphone 41, quantizes it, and converts it into audio data that is a digital signal. This audio data is supplied to the feature amount extraction unit 43.
[0031]
The feature amount extraction unit 43 extracts feature parameters such as the spectrum, power, linear prediction coefficients, cepstrum coefficients, and line spectrum pairs from the voice data supplied by the AD conversion unit 42 for each appropriate frame, and supplies them to the matching unit 44 and the phoneme typewriter unit 45.
[0032]
Based on the feature parameters from the feature amount extraction unit 43, the matching unit 44 refers to the acoustic model database 51, the dictionary database 52, and the language model database 53 as necessary, and finds the word string closest to the voice (input voice) input to the microphone 41.
[0033]
The acoustic model database 51 stores an acoustic model representing acoustic features such as individual phonemes and syllables in a speech language for speech recognition. For example, an HMM (Hidden Markov Model) can be used as the acoustic model. The dictionary database 52 stores a word dictionary in which information about pronunciation of each word (phrase) to be recognized is described, and a model in which phonological and syllable chain relationships are described.
[0034]
Note that a word here is a unit that is convenient to treat as one unit in the recognition process, and it does not necessarily match a linguistic word. For example, “Taro-kun” may be treated as a single word, or as the two words “Taro” and “kun”. A larger unit such as “Hello Taro-kun” may also be treated as one word.
[0035]
A phoneme here is a unit that is convenient for processing when treated acoustically as one unit, and it does not necessarily match a phoneme in the phonological or phonetic sense. For example, the “To” part of “Tokyo” may be represented by the three phoneme symbols “t / o / u”, or by “t / o:”, using “o:”, the long sound of “o”. Alternatively, it can be expressed as “t / o / o”. In addition, a symbol representing silence can be prepared, and silence can be further subdivided, with a symbol prepared for each kind, such as “silence before utterance”, “short silence between utterances”, “silence of utterances”, and “silence of part”.
[0036]
The language model database 53 describes information relating to how the words registered in the word dictionary of the dictionary database 52 are linked (connected).
[0037]
The phoneme typewriter unit 45 acquires a phoneme sequence corresponding to the input speech based on the feature parameters supplied from the feature amount extraction unit 43. For example, from the utterance “My name is Taro”, the phoneme sequence “w / a / t / a / sh / i / n / o / n / a / m / a / e / w / a / t / a / r / o: / d / e / s / u” is acquired. An existing phoneme typewriter can be used.
[0038]
Instead of the phoneme typewriter, anything that can acquire a phoneme sequence for arbitrary speech may be used. For example, speech recognition in units of Japanese syllables (a, i, u, ..., ka, ki, ..., n) or speech recognition in units of subwords, which are larger than phonemes and smaller than words, can also be used.
[0039]
The control unit 46 controls operations of the AD conversion unit 42, the feature amount extraction unit 43, the matching unit 44, and the phoneme typewriter unit 45.
[0040]
Next, processing of the dialogue system of the present invention will be described with reference to the flowchart of FIG. 5.
[0041]
In step S21, when the user inputs voice to the microphone 41, the microphone 41 converts the utterance into a voice signal as an electric signal. In step S22, the speech recognition unit 1 executes speech recognition processing.
[0042]
Details of the speech recognition processing will be described with reference to FIG. 6. In step S41, the audio signal generated by the microphone 41 is converted by the AD conversion unit 42 into audio data, which is a digital signal, and is supplied to the feature amount extraction unit 43.
[0043]
In step S42, the feature amount extraction unit 43 receives audio data from the AD conversion unit 42. Then, the feature amount extraction unit 43 proceeds to step S43, extracts feature parameters such as spectrum, power, and their temporal change amount for each appropriate frame, and supplies them to the matching unit 44.
[0044]
In step S44, the matching unit 44 connects some of the word models stored in the dictionary database 52 to generate a word string. The words constituting this word string include not only known words registered in the dictionary database 52 but also “<OOV>”, a symbol representing an unknown word that is not registered. This word string generation process will be described in detail with reference to FIG. 7.
[0045]
In step S61, the matching unit 44 calculates acoustic scores for a certain section of the input speech in two ways. That is, it calculates the acoustic score obtained by matching the section with known words registered in the dictionary database 52, and the acoustic score obtained by the phoneme typewriter unit 45 (in this case, for a partial section of “w / a / t / a / sh / i / n / o / n / a / m / a / e / w / a / t / a / r / o: / d / e / s / u”). The acoustic score represents how close, as sounds, a word string that is a candidate for the speech recognition result is to the input speech.
[0046]
Next, the acoustic score obtained by matching a partial section of the input speech with the known words registered in the dictionary database 52 is compared with the acoustic score obtained by the phoneme typewriter unit 45. However, matching with known words is performed in units of words, whereas matching in the phoneme typewriter unit 45 is performed in units of phonemes. Because the scales differ, it is difficult to compare the two scores as they are (in general, the acoustic score in units of phonemes is higher). Therefore, in step S62, the matching unit 44 corrects the acoustic score obtained by the phoneme typewriter unit 45 so that the two scores can be compared on the same scale.
[0047]
For example, the acoustic score from the phoneme typewriter unit 45 is multiplied by a coefficient, or a constant value or a value proportional to the frame length is subtracted from it. Of course, since this correction is relative, it can instead be applied to the acoustic score obtained by matching with known words. Details of this processing are disclosed, for example, in “OOV-Detection in Large Vocabulary System Using Automatically Defined Word-Fragments as Fillers”, EUROSPEECH99, Volume 1, pages 49-52.
[0048]
In step S63, the matching unit 44 compares the two acoustic scores, that is, it determines whether the acoustic score of the result recognized by the phoneme typewriter unit 45 is higher (better). If the acoustic score of the result recognized by the phoneme typewriter unit 45 is higher, the process proceeds to step S64, and the matching unit 44 estimates that the section is <OOV> (Out Of Vocabulary), that is, an unknown word.
[0049]
If it is determined in step S63 that the acoustic score of the result recognized by the phoneme typewriter unit 45 is lower than the acoustic score of the result matched with known words, the process proceeds to step S66, where the matching unit 44 estimates that the section is a known word.
[0050]
That is, for example, for the section corresponding to “Taro”, the acoustic score of “t / a / r / o:” output from the phoneme typewriter unit 45 is compared with the acoustic score obtained by matching with known words. If the acoustic score of “t / a / r / o:” is higher, “<OOV> (t / a / r / o:)” is output as the word corresponding to that speech section; if the acoustic score of the known word is higher, the known word is output as the word corresponding to that speech section.
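The comparison in steps S62 to S66 can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the embodiment itself: scores are assumed to be log-likelihood-like values (higher is better), and the correction subtracts a value proportional to the frame length, which is one of the options mentioned in the text. The function names, the parameters `scale` and `per_frame_penalty`, the known words “tara” and “hello”, and all numeric values are hypothetical.

```python
def corrected_phoneme_score(score, n_frames, scale=1.0, per_frame_penalty=0.0):
    """Put the phoneme typewriter score on the word-matching scale (S62).

    scale and per_frame_penalty are hypothetical tuning values; the text only
    says a coefficient and/or a frame-length-proportional value may be used.
    """
    return scale * score - per_frame_penalty * n_frames


def estimate_section(known_word, known_score, phonemes, phoneme_score,
                     n_frames, scale=1.0, per_frame_penalty=0.5):
    """Return the word label for one speech section (steps S63, S64, S66)."""
    corrected = corrected_phoneme_score(
        phoneme_score, n_frames, scale, per_frame_penalty)
    if corrected > known_score:      # phoneme typewriter wins: unknown word
        return "<OOV>(" + "/".join(phonemes) + ")"
    return known_word                # otherwise the section is a known word


# Illustrative numbers only.
taro = estimate_section("tara", -120.0, ["t", "a", "r", "o:"], -80.0, n_frames=40)
hello = estimate_section("hello", -60.0, ["h", "e", "r", "o"], -55.0, n_frames=40)
```

Without the correction, the second section would also be labelled as an unknown word (-55 is higher than -60), which illustrates why the raw phoneme-level score cannot be compared directly.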
[0051]
In step S65, the matching unit 44 preferentially generates n word strings (each a concatenation of several word models) that are estimated to have high acoustic scores.
[0052]
Returning to FIG. 6, in step S45, the phoneme typewriter unit 45 recognizes the feature parameters extracted in step S43 in units of phonemes, independently of the process of step S44, and outputs a phoneme sequence. For example, when the voice “My name is Taro (unknown word)” is input, the phoneme typewriter unit 45 outputs “w / a / t / a / sh / i / n / o / n / a / m / a / e / w / a / t / a / r / o: / d / e / s / u”.
[0053]
In step S46, the matching unit 44 calculates an acoustic score for each word string generated in step S44. For word strings that do not contain <OOV> (an unknown word), the existing method is used, that is, the likelihood is calculated by inputting the speech feature parameters to each word string (a concatenation of word models). On the other hand, for a word string including <OOV>, the existing method cannot obtain the acoustic score of the speech section corresponding to <OOV>, because no word model corresponding to <OOV> exists in advance. Therefore, for that speech section, the acoustic score of the same section is extracted from the recognition result of the phoneme typewriter, and the corrected value is used as the acoustic score of <OOV>. This is then integrated with the acoustic scores of the known-word parts to give the acoustic score of the word string.
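The scoring of a candidate word string in step S46 can be sketched as follows. The sketch assumes that the per-section scores are log-likelihoods, so that “integrating” them means adding them; the text does not specify the integration method. The function name, the tuple layout, and all numbers are hypothetical.

```python
def word_string_acoustic_score(segments, correct):
    """Sum per-segment acoustic scores for one candidate word string.

    segments: list of (label, known_score, typewriter_score) tuples, where
    known_score is None for <OOV> segments (no word model exists for them).
    correct: function applied to the phoneme typewriter score of an <OOV>
    segment. Assumption: scores are log-likelihoods, so integration is a sum.
    """
    total = 0.0
    for label, known_score, typewriter_score in segments:
        if label.startswith("<OOV>"):
            total += correct(typewriter_score)   # borrowed and corrected
        else:
            total += known_score                 # ordinary word-model score
    return total


# "wa" and "desu" stand for the Japanese topic particle and copula.
candidate = [
    ("I", -30.0, -25.0),
    ("no", -10.0, -8.0),
    ("name", -40.0, -33.0),
    ("wa", -10.0, -9.0),
    ("<OOV>(t/a/r/o:)", None, -80.0),
    ("desu", -20.0, -18.0),
]
score = word_string_acoustic_score(candidate, correct=lambda s: s - 20.0)
```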
[0054]
In step S47, the matching unit 44 keeps the top m (m ≦ n) word strings with the highest acoustic scores as candidate word strings. In step S48, the matching unit 44 refers to the language model database 53 and calculates a language score for each candidate word string. The language score represents how appropriate, as language, a word string that is a candidate for the recognition result is. The method for calculating the language score is described in detail below.
[0055]
Since the speech recognition unit 1 of the present invention also recognizes unknown words, the language model needs to support unknown words. As examples, the case of using a grammar or a finite state automaton (FSA) that supports unknown words and the case of using a tri-gram (one of the statistical language models) that supports unknown words will be described.
[0056]
An example of the grammar will be described with reference to FIG. 8. This grammar 61 is described in BNF (Backus Naur Form). In FIG. 8, “$A” represents a variable, and “A | B” represents “A or B”. “[A]” means “A can be omitted”, and “{A}” means “repeat A zero or more times”.
[0057]
<OOV> is a symbol representing an unknown word, and by describing <OOV> in the grammar, word strings including unknown words can be handled. “$ACTION” is not defined in FIG. 8; in practice, however, names of actions such as “standing up”, “sitting down”, “bowing”, and “greeting” are defined.
[0058]
In this grammar 61, word strings that conform to the grammar stored in the database, such as “<start> / hello / <end>” (“/” is a separator between words), “<start> / goodbye / <end>”, and “<start> / I / no / name / wa / <OOV> / desu / <end>”, are accepted (can be parsed by this grammar), whereas word strings that do not conform to it, such as “<start> / you / no / <OOV> / name / <end>”, are not accepted (cannot be parsed by this grammar). Here “no”, “wa”, and “desu” are the Japanese possessive particle, topic particle, and copula, so the third string means “My name is <OOV>”. “<start>” and “<end>” are special symbols representing the silence before and after the utterance, respectively.
[0059]
A parser (analyzer) is used to calculate a language score using this grammar. The parser classifies word strings into those that the grammar accepts and those that it does not. For example, a language score of 1 is given to accepted word strings, and a language score of 0 is given to word strings that are not accepted.
[0060]
So, for example, when there are two word strings “<start> / I / no / name / wa / <OOV> (t / a / r / o:) / desu / <end>” and “<start> / I / no / name / wa / <OOV> (j / i / r / o:) / desu / <end>”, both are replaced with “<start> / I / no / name / wa / <OOV> / desu / <end>” before the language score is calculated, and a language score of 1 (accepted) is output for both.
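The replacement of “<OOV> (…)” by the bare symbol before scoring can be sketched as follows. The parser itself is not reproduced: the sketch stands in for it with a hypothetical set of accepted word strings, and the tokens “wa” and “desu” denote the Japanese topic particle and copula. All names are illustrative.

```python
import re

# Strips the pronunciation attached to <OOV>, e.g. "<OOV>(t/a/r/o:)" -> "<OOV>".
_OOV_WITH_PRONUNCIATION = re.compile(r"<OOV>\([^)]*\)")


def normalize(word_string):
    """Replace every "<OOV>(...)" token by the bare symbol "<OOV>"."""
    return [_OOV_WITH_PRONUNCIATION.sub("<OOV>", w) for w in word_string]


def grammar_score(word_string, accepts):
    """Language score 1 if the normalized word string is accepted, else 0.

    accepts is any callable standing in for the parser; here it is a
    hypothetical set-membership check, not a real BNF parser.
    """
    return 1 if accepts(tuple(normalize(word_string))) else 0


ACCEPTED = {("<start>", "I", "no", "name", "wa", "<OOV>", "desu", "<end>")}
taro = ["<start>", "I", "no", "name", "wa", "<OOV>(t/a/r/o:)", "desu", "<end>"]
jiro = ["<start>", "I", "no", "name", "wa", "<OOV>(j/i/r/o:)", "desu", "<end>"]
bad = ["<start>", "you", "no", "<OOV>(t/a/r/o:)", "name", "<end>"]
scores = [grammar_score(w, ACCEPTED.__contains__) for w in (taro, jiro, bad)]
```

Because the pronunciation is stripped first, the two strings that differ only in the unknown word receive the same language score.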
[0061]
Whether or not a word string is accepted by the grammar can also be determined by converting the grammar in advance into an equivalent (or approximately equivalent) finite state automaton (hereinafter referred to as FSA) and determining whether each word string is accepted by that FSA.
[0062]
FIG. 9 shows an example in which the grammar of FIG. 8 is converted into an equivalent FSA. An FSA is a directed graph composed of states (nodes) and paths (arcs). As shown in FIG. 9, S1 is the start state and S16 is the end state. As in FIG. 8, names of actions are actually registered for “$ACTION”.
[0063]
A word is assigned to each path, and when a transition is made from a given state to the next state, the path consumes that word. However, a path to which “ε” is assigned is a special transition that consumes no word (hereinafter referred to as an ε transition). For example, for “<start> / I / wa / <OOV> / <end>”, the transition from the initial state S1 to the state S2 consumes “<start>” and the transition from the state S2 to the state S3 consumes “I”, but the transition from the state S3 to the state S5 is an ε transition and consumes no word. That is, it is possible to skip from the state S3 to the state S5 and then make a transition to the next state S6.
[0064]
Whether or not a predetermined word string can be accepted by this FSA is determined by whether or not it is possible to start from the initial state S1 and reach the end state S16.
[0065]
That is, for example, for “<start> / I / no / name / wa / <OOV> / desu / <end>”, the transition from the initial state S1 to the state S2 consumes the word “<start>”. Next, the transition from the state S2 to the state S3 consumes the word “I”. In the same manner, the transitions from the state S3 to the state S4, from the state S4 to the state S5, from the state S5 to the state S6, and from the state S6 to the state S7 consume “no”, “name”, “wa”, and “<OOV>” in turn. Further, the transition from the state S7 to the state S15 consumes “desu”, and the transition from the state S15 to the state S16 consumes “<end>”, so the end state S16 is finally reached. Therefore, “<start> / I / no / name / wa / <OOV> / desu / <end>” is accepted by the FSA.
[0066]
On the other hand, for “<start> / you / no / <OOV> / name / <end>”, the transitions from the state S1 to the state S2, from the state S2 to the state S8, and from the state S8 to the state S9 consume “<start>”, “you”, and “no”, but no further transition is possible, so the end state S16 cannot be reached. Therefore, “<start> / you / no / <OOV> / name / <end>” is not accepted by the FSA.
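The acceptance test can be sketched as a small nondeterministic automaton with ε transitions. The table below contains only the transitions that are named in the text; FIG. 9 contains more states and paths that are not reproduced here. The tokens “wa” and “desu” denote the Japanese topic particle and copula, and the names `TRANSITIONS`, `eps_closure`, and `accepts` are hypothetical.

```python
EPS = None  # label of an ε transition (consumes no word)

# Only the transitions mentioned in the text; FIG. 9 contains more.
TRANSITIONS = {
    ("S1", "<start>"): {"S2"},
    ("S2", "I"): {"S3"},
    ("S3", "no"): {"S4"},
    ("S4", "name"): {"S5"},
    ("S3", EPS): {"S5"},
    ("S5", "wa"): {"S6"},
    ("S6", "<OOV>"): {"S7"},
    ("S7", "desu"): {"S15"},
    ("S15", "<end>"): {"S16"},
    ("S2", "you"): {"S8"},
    ("S8", "no"): {"S9"},
}


def eps_closure(states):
    """All states reachable from `states` through ε transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in TRANSITIONS.get((stack.pop(), EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def accepts(words, start="S1", end="S16"):
    """True if the word string can be consumed from `start` to `end`."""
    current = eps_closure({start})
    for word in words:
        moved = set()
        for state in current:
            moved |= TRANSITIONS.get((state, word), set())
        current = eps_closure(moved)
        if not current:          # no transition possible: reject early
            return False
    return end in current


ok = accepts("<start> I no name wa <OOV> desu <end>".split())
rejected = accepts("<start> you no <OOV> name <end>".split())
```

Tracking a set of current states is one simple way to handle the ε transition from S3 to S5 without backtracking.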
[0067]
Next, an example of calculating the language score when a tri-gram, one of the statistical language models, is used as the language model will be described with reference to FIG. 10. A statistical language model obtains the generation probability of a word string and uses it as the language score. For example, the language score of “<start> / I / no / name / wa / <OOV> / desu / <end>” in the language model 71 of FIG. 10 is expressed as the generation probability of the word string, as shown in the second line. This is further expressed as a product of conditional probabilities, as shown in the third to sixth lines. For example, “P(no | <start> I)” represents the probability that “no” appears under the condition that the word immediately before “no” is “I” and the word immediately before “I” is “<start>”.
[0068]
Furthermore, in the tri-gram, the expressions shown in the third to sixth lines of FIG. 10 are approximated by conditional probabilities of three consecutive words, as shown in the seventh to ninth lines. These probability values are obtained by referring to a tri-gram database 81 such as that shown in FIG. 11. The tri-gram database 81 is obtained by analyzing a large amount of text in advance.
[0069]
The example of FIG. 11 lists the probability P(w3 | w1 w2) of three consecutive words w1, w2, and w3. For example, when the three words w1, w2, and w3 are “<start>”, “I”, and “no”, the probability value is 0.12; when they are “I”, “no”, and “name”, it is 0.01; and when they are “<OOV>”, “desu”, and “<end>”, it is 0.87.
[0070]
Of course, “P (W)” and “P (w2 | w1)” are similarly obtained in advance.
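The tri-gram approximation can be sketched as follows. Only the three probability values quoted from FIG. 11 (0.12, 0.01, and 0.87) come from the text; every other value in the tables, and the names `TRIGRAM`, `BIGRAM`, `UNIGRAM`, and `trigram_score`, are made-up placeholders so that the example runs. The tokens “wa” and “desu” denote the Japanese topic particle and copula.

```python
from math import prod

# The three values quoted from FIG. 11; every other entry is a made-up
# placeholder so that the example runs.
TRIGRAM = {
    ("<start>", "I", "no"): 0.12,
    ("I", "no", "name"): 0.01,
    ("<OOV>", "desu", "<end>"): 0.87,
    ("no", "name", "wa"): 0.5,       # placeholder
    ("name", "wa", "<OOV>"): 0.5,    # placeholder
    ("wa", "<OOV>", "desu"): 0.5,    # placeholder
}
UNIGRAM = {"<start>": 1.0}            # placeholder for P(w)
BIGRAM = {("<start>", "I"): 0.5}      # placeholder for P(w2 | w1)


def trigram_score(words):
    """P(w1) * P(w2 | w1) * product of P(w3 | w1 w2) over the word string."""
    factors = [UNIGRAM[words[0]], BIGRAM[(words[0], words[1])]]
    factors += [TRIGRAM[tuple(words[i:i + 3])] for i in range(len(words) - 2)]
    return prod(factors)


sentence = ["<start>", "I", "no", "name", "wa", "<OOV>", "desu", "<end>"]
score = trigram_score(sentence)
```

In practice such products are usually accumulated as sums of logarithms to avoid underflow; the plain product is kept here to mirror the form of the expressions in FIG. 10.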
[0071]
In this way, by providing entries for <OOV> in the language model, a language score can be calculated for a word string including <OOV>. Therefore, the symbol <OOV> can be output as part of the recognition result.
[0072]
Similarly, when other types of language models are used, a language score can be calculated for a word string including <OOV> by providing entries for <OOV>.
[0073]
Furthermore, even when a language model that has no entry for <OOV> is used, a language score can be calculated by using a mechanism that maps <OOV> to an appropriate word in the language model. For example, even when using a tri-gram database that has no entry for “P(<OOV> | I wa)”, the language score can be calculated by accessing the database with “P(Taro | I wa)” and regarding the probability recorded there as the value of “P(<OOV> | I wa)”.
[0074]
Returning to FIG. 6, the matching unit 44 integrates the acoustic score and the language score in step S49. In step S50, the matching unit 44 selects a candidate word string having the best score based on the score obtained by integrating both the acoustic score and the language score obtained in step S49, and outputs it as a recognition result.
[0075]
When a finite state automaton is used as the language model, the integration process in step S49 is performed so that the word string is deleted when the language score is 0, and is left as it is when the language score is other than 0. May be.
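Steps S49 and S50, including the FSA special case, can be sketched as follows. The text does not state how the two scores are integrated, so the sketch assumes log-domain scores combined by a weighted sum; the function name `select_best`, the parameter `language_weight`, and the numbers used with it are hypothetical.

```python
def select_best(candidates, language_weight=1.0, fsa_mode=False):
    """Pick the recognition result from (words, acoustic, language) tuples.

    Assumptions: acoustic scores are log-likelihood-like (higher is better);
    for a statistical model the language score is a log-probability and is
    combined by a weighted sum; in FSA mode the language score is 0 or 1 and
    only filters the candidates.
    """
    scored = []
    for words, acoustic, language in candidates:
        if fsa_mode:
            if language == 0:
                continue                      # rejected by the FSA: deleted
            scored.append((acoustic, words))  # accepted: score left as is
        else:
            scored.append((acoustic + language_weight * language, words))
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


fsa_pick = select_best(
    [(["a"], -100.0, 0), (["b"], -120.0, 1)], fsa_mode=True)
stat_pick = select_best(
    [(["a"], -100.0, -30.0), (["b"], -120.0, -5.0)])
```

In the first call the acoustically better candidate is discarded because the automaton rejects it; in the second the language score outweighs the acoustic difference.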
[0076]
Returning to FIG. 5, after the speech recognition process is executed in step S22 as described above, in step S23 the control unit 46 of the speech recognition unit 1 determines whether or not an unknown word is included in the recognized word string. When it is determined that an unknown word is included, the control unit 46 controls the word acquisition unit 4 to execute a word acquisition process in step S24 and acquire the unknown word.
[0077]
Details of the word acquisition process will be described with reference to FIG. 12. In step S91, the word acquisition unit 4 extracts the feature parameters of the unknown word (<OOV>) from the speech recognition unit 1. In step S92, the word acquisition unit 4 determines whether the unknown word belongs to an already acquired cluster. When it is determined that it does not, the word acquisition unit 4 generates a new cluster corresponding to the unknown word in step S93. In step S94, the word acquisition unit 4 outputs the ID of the cluster to which the unknown word belongs to the matching unit 44 of the speech recognition unit 1.
[0078]
If it is determined in step S92 that the unknown word belongs to an already acquired cluster, there is no need to generate a new cluster, so the word acquisition unit 4 skips step S93, proceeds to step S94, and outputs the ID of the acquired cluster to which the unknown word belongs to the matching unit 44.
[0079]
The processing of FIG. 12 is performed for each unknown word.
[0080]
Returning to FIG. 5, after completion of the word acquisition process in step S24, in step S25 the dialogue control unit 3 determines whether the word string of the recognition result matches a template. That is, it is determined here whether the word string of the recognition result means registration of some name. If it is determined in step S25 that the word string of the recognition result matches a template, in step S26 the dialogue control unit 3 stores the cluster ID of the name in association with its category in the associative storage unit 2.
[0081]
An example of the template used for matching by the dialogue control unit 3 will be described with reference to FIG. 13. In FIG. 13, “/A/” means “if the character string A is included”, and “A | B” means “A or B”. “.” represents “any character”, “A+” represents “one or more repetitions of A”, and “(.)+” represents “any character string”.
[0082]
This template 91 indicates that, when the word string of the recognition result matches the regular expression on the left side of the figure, the operation on the right side of the figure is executed. For example, if the recognition result is the word string “<start> / I / no / name / wa / <OOV> (t / a / r / o:) / desu / <end>”, the character string “My name is <OOV>” generated from this recognition result matches the second regular expression in FIG. 13. Therefore, the corresponding operation, “register the cluster ID corresponding to <OOV> as a user name”, is executed. That is, when the cluster ID of “<OOV> (t / a / r / o:)” is “1”, the category name of the cluster ID “1” is registered as “user name”, as shown in FIG. 3.
[0083]
Similarly, if the recognition result is “<start> / you / no / name / wa / <OOV> (a / i / b / o) / dayo / <end>”, the character string “Your name is <OOV>” generated from it matches a regular expression in FIG. 13, so if “<OOV> (a / i / b / o)” has the cluster ID “2”, the category of the cluster ID “2” is registered as “character name”.
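The template matching and registration of steps S25 and S26 can be sketched as follows. The actual regular expressions of FIG. 13 are not reproduced in this text, so the two patterns below are hypothetical stand-ins written against the English glosses used in this translation; the dictionary `store` plays the role of the associative storage unit 2, and the function name `register` is illustrative.

```python
import re

# Hypothetical stand-ins for the regular expressions of FIG. 13, written
# against the English glosses used in this translation.
TEMPLATES = [
    (re.compile(r"Your name is <OOV>"), "character name"),
    (re.compile(r"My name is <OOV>"), "user name"),
]


def register(recognized_text, cluster_id, associative_store):
    """Store cluster_id -> category if a template matches (steps S25, S26).

    Returns the category, or None when no template matches, in which case
    step S27 would generate an ordinary response instead.
    """
    for pattern, category in TEMPLATES:
        if pattern.search(recognized_text):
            associative_store[cluster_id] = category
            return category
    return None


store = {}
register("My name is <OOV>", 1, store)
register("Your name is <OOV>", 2, store)
unmatched = register("Hello", 3, store)
```

After these calls the store holds the same associations as the first two entries described for FIG. 3.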
[0084]
Depending on the dialogue system, there may be only one type of word to be registered (for example, “user name” only). In this case, the template 91 and the associative memory unit 2 can be simplified. For example, it is possible to store only the cluster ID in the associative storage unit 2 as “the ID is stored if the recognition result includes <OOV>” as the content of the template 91.
[0085]
The dialogue control unit 3 reflects the information registered in the associative memory unit 2 in subsequent dialogue decisions. For example, when the dialogue system needs processing such as “determine whether or not the name of the dialogue character is included in the user's utterance” or “the dialogue character speaks the user's name”, the dialogue control unit 3 can refer to the information in the associative memory unit 2 to obtain the word corresponding to the dialogue character (the entry whose category name is “character name”) and the word corresponding to the user name (the entry whose category name is “user name”).
[0086]
On the other hand, if it is determined in step S23 that no unknown word is included in the recognition result, or if it is determined in step S25 that the recognition result does not match a template, in step S27 the dialogue control unit 3 generates a response corresponding to the input speech. That is, in this case, the name (unknown word) registration process is not performed, and predetermined processing corresponding to the input speech from the user is performed.
[0087]
Incidentally, when a grammar is used as the language model, a description equivalent to a phoneme typewriter can be incorporated into the grammar. FIG. 14 shows an example of the grammar in this case. In this grammar 101, the variable “$PHONEME” on the first line means any one of the phoneme symbols, because all phonemes are connected by “|”, which means “or”. The variable “$OOV” represents zero or more repetitions of “$PHONEME”. That is, it means “arbitrary phoneme symbols connected zero or more times”, which corresponds to a phoneme typewriter. Therefore, “$OOV” between “wa” and “desu” on the third line can accept any pronunciation.
[0088]
In the recognition result when this grammar 101 is used, the portion corresponding to “$OOV” is output as a plurality of symbols. For example, the recognition result of “My name is Taro” is “<start> / I / no / name / wa / t / a / r / o: / desu / <end>”. If this result is converted to “<start> / I / no / name / wa / <OOV> (t / a / r / o:) / desu / <end>”, the processing from step S23 onward in FIG. 5 can be executed in the same manner as when the phoneme typewriter is used.
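The conversion described here can be sketched as follows. The sketch assumes that phoneme symbols can be told apart from words by membership in a phoneme inventory; a real system would more likely use the position of “$OOV” in the parse. The set `PHONEMES` is a hypothetical subset, not the list of FIG. 14, and the function name `fold_phonemes` is illustrative. The tokens “wa” and “desu” denote the Japanese topic particle and copula.

```python
from itertools import groupby

# Hypothetical phoneme inventory; the real one is the $PHONEME list of FIG. 14.
PHONEMES = {"a", "i", "u", "e", "o", "o:", "t", "r", "k", "s", "d", "m", "n"}


def fold_phonemes(tokens, phonemes=PHONEMES):
    """Replace each run of phoneme symbols with one "<OOV>(p1/p2/...)" token."""
    folded = []
    for is_phoneme, run in groupby(tokens, key=lambda t: t in phonemes):
        run = list(run)
        if is_phoneme:
            folded.append("<OOV>(" + "/".join(run) + ")")
        else:
            folded.extend(run)
    return folded


raw = ["<start>", "I", "no", "name", "wa", "t", "a", "r", "o:", "desu", "<end>"]
result = fold_phonemes(raw)
```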
[0089]
In the above, a category is registered as information related to an unknown word, but other information may be registered.
[0090]
FIG. 15 shows a configuration example of a personal computer 110 that executes the above-described processing. The personal computer 110 includes a CPU (Central Processing Unit) 111. An input / output interface 115 is connected to the CPU 111 via a bus 114. A ROM (Read Only Memory) 112 and a RAM (Random Access Memory) 113 are connected to the bus 114.
[0091]
Connected to the input / output interface 115 are an input unit 117 including input devices operated by the user, such as a mouse, a keyboard, a microphone, and an AD converter, and an output unit 116 including output devices such as a display, a speaker, and a DA converter. Also connected to the input / output interface 115 are a storage unit 118 including a hard disk drive for storing programs and various data, and a communication unit 119 for communicating data via a network such as the Internet.
[0092]
The input / output interface 115 is connected to a drive 120 for reading / writing data from / to a recording medium such as the magnetic disk 131, the optical disk 132, the magneto-optical disk 133, and the semiconductor memory 134 as necessary.
[0093]
A speech processing program for causing the personal computer 110 to operate as a speech processing apparatus to which the present invention is applied is supplied to the personal computer 110 in a state of being stored on the magnetic disk 131 (including a floppy disk), the optical disk 132 (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), the magneto-optical disk 133 (including an MD (Mini Disc)), or the semiconductor memory 134, is read by the drive 120, and is installed on the hard disk drive built into the storage unit 118. The speech processing program installed in the storage unit 118 is loaded from the storage unit 118 into the RAM 113 and executed in response to a command from the CPU 111 corresponding to a command from the user input to the input unit 117.
[0094]
The series of processes described above can be executed by hardware, but can also be executed by software. When the series of processes is executed by software, the program constituting the software is installed from a network or a recording medium into a computer incorporated in dedicated hardware or into, for example, a general-purpose personal computer that can execute various functions by installing various programs.
[0095]
As shown in FIG. 15, this recording medium is constituted not only by package media distributed separately from the apparatus main body to provide the program to the user, such as the magnetic disk 131, the optical disk 132, the magneto-optical disk 133, or the semiconductor memory 134 on which the program is recorded, but also by the ROM 112 on which the program is recorded and the hard disk included in the storage unit 118, which are provided to the user in a state of being incorporated in the apparatus main body in advance.
[0096]
In the present specification, the steps describing the program recorded on the recording medium include not only processing performed in time series in the described order, but also processing that is executed in parallel or individually and is not necessarily performed in time series.
[0097]
Further, in this specification, the system means a system in which a plurality of devices are logically assembled, and it does not matter whether the devices of each configuration are in the same housing.
[0098]
[Effect of the invention]
As described above, according to the present invention, words can be registered by voice. The registration can be executed without making the user aware of the registration mode. Furthermore, it is possible to easily register unknown words from continuous input speech including known words and unknown words. Furthermore, the registered word can be reflected in subsequent dialogues.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration example of an embodiment of an interactive system to which the present invention is applied.
FIG. 2 is a diagram illustrating an example of a state of a cluster.
FIG. 3 is a diagram illustrating word registration.
FIG. 4 is a block diagram illustrating a configuration example of the voice recognition unit in FIG. 1.
FIG. 5 is a flowchart for explaining the operation of the dialogue system of FIG. 1;
FIG. 6 is a flowchart for explaining details of the voice recognition processing in step S22 of FIG. 5;
FIG. 7 is a flowchart for explaining details of an operation for generating a word string in step S44 of FIG. 6;
FIG. 8 is a diagram illustrating an example of grammar used in a language model database.
FIG. 9 is a diagram illustrating an example of a language model based on a finite state automaton.
FIG. 10 is a diagram illustrating an example of language score calculation using a tri-gram.
FIG. 11 is a diagram illustrating an example of a tri-gram database.
FIG. 12 is a flowchart for explaining details of a word acquisition process in step S24 of FIG. 5;
FIG. 13 is a diagram illustrating an example of a template.
FIG. 14 is a diagram showing an example of a grammar incorporating a phoneme typewriter.
FIG. 15 is a block diagram illustrating a configuration example of an embodiment of a computer to which the present invention has been applied.
[Explanation of symbols]
1 Speech recognition unit, 2 Associative storage unit, 3 Dialogue control unit, 4 Word acquisition unit, 41 Microphone, 42 AD conversion unit, 43 Feature amount extraction unit, 44 Matching unit, 45 Phoneme typewriter unit, 46 Control unit, 51 Acoustic model database, 52 Dictionary database, 53 Language model database

Claims

A speech processing apparatus that processes input speech and registers words included in the input speech based on the processing result, comprising:
recognition means for recognizing the continuous input speech;
unknown word determination means for determining whether or not an unknown word is included in the recognition result recognized by the recognition means;
acquisition means for acquiring the unknown word when the unknown word determination means determines that the unknown word is included;
pattern determination means for determining, when the unknown word determination means determines that the unknown word is included in the recognition result, whether or not the recognition result matches a pattern that is a word string including the unknown word; and
registration means for registering, when the pattern determination means determines that the recognition result matches the pattern, the category associated with the unknown word in the pattern in association with the unknown word acquired by the acquisition means,
wherein the recognition means includes
comparison means for comparing, for a predetermined section of the input speech, the acoustic scores, each representing the closeness in sound between a candidate of the recognition result and the input speech, obtained when the section is matched with known words and when it is recognized by a phoneme typewriter, and
the comparison means estimates that the section is the unknown word when the acoustic score obtained by recognition with the phoneme typewriter is superior, and estimates that the section is a known word when it is not superior.

The speech processing apparatus according to claim 1, further comprising response generation means for generating a response corresponding to the input speech when the unknown word determination means determines that no unknown word is included, or when the pattern determination means determines that the recognition result does not match the pattern.

The speech processing apparatus according to claim 1, wherein the acquisition means acquires the unknown word by generating a cluster of the unknown word.

The speech processing apparatus according to claim 1, wherein the comparison means performs the comparison after applying a correction to the acoustic score obtained with the phoneme typewriter, relative to the acoustic score obtained when the section is matched against known words.
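
The claim does not state the form of the correction. A minimal sketch, assuming log-likelihood scores (higher is better) and an additive penalty on the phoneme-typewriter score, could look like this; the function name and the `correction` parameter are hypothetical.

```python
def is_unknown_section(known_score: float,
                       typewriter_score: float,
                       correction: float = 0.0) -> bool:
    """Compare the scores after correcting the phoneme-typewriter score.

    Assumption: `correction` is subtracted from the typewriter score before
    the comparison. The actual form of the correction is not given in the claim.
    """
    corrected = typewriter_score - correction
    return corrected > known_score
```

One plausible motivation, stated here as an assumption, is that a phoneme typewriter is less constrained than dictionary matching and therefore tends to score well even on known words, so an uncorrected comparison could over-detect unknown words.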

The speech processing apparatus according to claim 1, wherein the recognition means further comprises:
word string generation means for generating, as candidates for the recognition result, word strings that include the section estimated to be an unknown word or a known word;
acoustic calculation means for calculating an acoustic score representing how close each word string generated by the word string generation means is to the sound of the input speech;
language calculation means for calculating a language score representing the plausibility of each word string generated by the word string generation means; and
selection means for selecting the recognition result from the word strings generated by the word string generation means based on the acoustic score and the language score.
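
The claim says only that the selection is based on both scores. The sketch below assumes one common way of combining them, a weighted sum of log-domain scores; the `Candidate` type, the `language_weight` parameter, and the sample values are hypothetical.

```python
from typing import NamedTuple, Sequence, Tuple


class Candidate(NamedTuple):
    words: Tuple[str, ...]  # candidate word string, possibly containing an unknown-word token
    acoustic_score: float   # closeness to the sound of the input speech (higher is better)
    language_score: float   # plausibility of the word string (higher is better)


def select_result(candidates: Sequence[Candidate],
                  language_weight: float = 1.0) -> Candidate:
    """Pick the candidate with the best combined score (assumed: weighted sum)."""
    if not candidates:
        raise ValueError("no candidate word strings to select from")
    return max(
        candidates,
        key=lambda c: c.acoustic_score + language_weight * c.language_score,
    )
```

For example, with `a = Candidate(("my", "name", "is", "<OOV>"), -90.0, -12.0)` and `b = Candidate(("my", "name", "is", "tomorrow"), -110.0, -8.0)`, `select_result([a, b])` returns `a`, because -102.0 exceeds -118.0.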

A speech processing method for a speech processing apparatus that processes input speech and registers a word included in the input speech based on the processing result, the method comprising:
a recognition step of recognizing continuous input speech;
an unknown word determination step of determining whether an unknown word is included in the recognition result produced by the recognition step;
an acquisition step of acquiring the unknown word when the unknown word determination step determines that an unknown word is included;
a pattern determination step of determining, when the unknown word determination step determines that an unknown word is included in the recognition result, whether the recognition result matches a pattern that is a word string including an unknown word; and
a registration step of registering, when the pattern determination step determines that the recognition result matches the pattern, the category associated with the unknown word in the pattern in association with the unknown word acquired in the acquisition step,
wherein the recognition step includes
a comparison step of comparing, for a predetermined section of the input speech, the acoustic score obtained when the section is matched against known words with the acoustic score obtained when the section is recognized by a phoneme typewriter, each acoustic score representing how close a recognition result candidate is to the sound of the input speech,
and wherein
in the comparison step, the section is estimated to be an unknown word when the acoustic score obtained with the phoneme typewriter is superior, and is estimated to be a known word when it is not superior.

A recording medium on which a computer-readable program is recorded, the program being for a speech processing apparatus that processes input speech and registers a word included in the input speech based on the processing result, the program comprising:
a recognition step of recognizing continuous input speech;
an unknown word determination step of determining whether an unknown word is included in the recognition result produced by the recognition step;
an acquisition step of acquiring the unknown word when the unknown word determination step determines that an unknown word is included;
a pattern determination step of determining, when the unknown word determination step determines that an unknown word is included in the recognition result, whether the recognition result matches a pattern that is a word string including an unknown word; and
a registration step of registering, when the pattern determination step determines that the recognition result matches the pattern, the category associated with the unknown word in the pattern in association with the unknown word acquired in the acquisition step,
wherein the recognition step includes
a comparison step of comparing, for a predetermined section of the input speech, the acoustic score obtained when the section is matched against known words with the acoustic score obtained when the section is recognized by a phoneme typewriter, each acoustic score representing how close a recognition result candidate is to the sound of the input speech,
and wherein
in the comparison step, the section is estimated to be an unknown word when the acoustic score obtained with the phoneme typewriter is superior, and is estimated to be a known word when it is not superior.

A program to be executed by a computer that controls a speech processing apparatus that processes input speech and registers a word included in the input speech based on the processing result, the program comprising:
a recognition step of recognizing continuous input speech;
an unknown word determination step of determining whether an unknown word is included in the recognition result produced by the recognition step;
an acquisition step of acquiring the unknown word when the unknown word determination step determines that an unknown word is included;
a pattern determination step of determining, when the unknown word determination step determines that an unknown word is included in the recognition result, whether the recognition result matches a pattern that is a word string including an unknown word; and
a registration step of registering, when the pattern determination step determines that the recognition result matches the pattern, the category associated with the unknown word in the pattern in association with the unknown word acquired in the acquisition step,
wherein the recognition step includes
a comparison step of comparing, for a predetermined section of the input speech, the acoustic score obtained when the section is matched against known words with the acoustic score obtained when the section is recognized by a phoneme typewriter, each acoustic score representing how close a recognition result candidate is to the sound of the input speech,
and wherein
in the comparison step, the section is estimated to be an unknown word when the acoustic score obtained with the phoneme typewriter is superior, and is estimated to be a known word when it is not superior.