JP3912089B2

JP3912089B2 - Speech recognition method and speech recognition apparatus

Info

Publication number: JP3912089B2
Application number: JP2001369173A
Authority: JP
Inventors: 康永宮澤; 浩長谷川
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2001-12-03
Filing date: 2001-12-03
Publication date: 2007-05-09
Anticipated expiration: 2021-12-03
Also published as: JP2003167599A

Description

【０００１】
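[Technical Field of the Invention]
The present invention relates to a speech recognition method and a speech recognition apparatus that mean-normalize feature vectors and perform speech recognition processing using the mean-normalized feature vectors.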
【０００２】
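[Prior Art]
Feature vector mean normalization, which mean-normalizes the feature vector obtained for each frame at predetermined intervals from speech data (for example, 10-dimensional cepstral coefficients), is widely used as an important technique in speech recognition.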
【０００３】
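Simply put, this method computes the mean of the feature vectors obtained at each time instant within the speech data segment to be normalized, and subtracts that mean from the feature vector at each time instant to obtain the normalized feature vectors.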
【０００４】
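For example, suppose that feature vectors C1, C2, ..., CT (each consisting of, say, 10-dimensional cepstral coefficients) are obtained from the frames corresponding to times t = 1, 2, ..., T. Denoting the sum of these feature vectors by Cs, Cs is given by
【０００５】
[Equation 1]
$$ C_s = \sum_{t=1}^{T} C_t \qquad (1) $$
【０００６】
From the sum Cs of the feature vectors C1, C2, ..., CT obtained in this way, the mean Cm is obtained by equation (2) below. In equation (2), T is the total number of frames in the speech data to be mean-normalized.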
【０００７】
Cm = Cs / T   (2)
Subtracting the mean Cm obtained in this way from each of the frame feature vectors C1, C2, ..., CT yields the mean-normalized (hereinafter simply "normalized") feature vector for each frame.
【０００８】
That is, denoting the normalized feature vectors of the frames by C1', C2', ..., CT', they are given by C1' = C1 - Cm, C2' = C2 - Cm, ..., CT' = CT - Cm.
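The conventional procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `mean_normalize` and the use of plain Python lists of floats as feature vectors are hypothetical choices, not part of the original description.

```python
def mean_normalize(frames):
    """Subtract the per-dimension mean Cm = Cs / T from every frame."""
    T = len(frames)            # total number of frames
    dims = len(frames[0])      # e.g. 10 cepstral coefficients
    # Cs: per-dimension sum over all T frames
    Cs = [sum(f[d] for f in frames) for d in range(dims)]
    # Cm: per-dimension mean, equation (2)
    Cm = [s / T for s in Cs]
    # Ct' = Ct - Cm for every frame
    return [[f[d] - Cm[d] for d in range(dims)] for f in frames]


# Two 2-dimensional frames as a toy input
normalized = mean_normalize([[1.0, 4.0], [3.0, 8.0]])
```

Note that every frame must be available before `Cm` can be computed, which is exactly the real-time limitation discussed in the next section.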
【０００９】
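Using the normalized feature vectors (normalized cepstral coefficients) obtained in this way, the computations required for speech recognition are performed, for example computing output probabilities after preprocessing such as vector quantization.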
【００１０】
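Needless to say, the computation of the sum of feature vectors, the subtraction between feature vectors, the computation of the mean feature vector, and the computation of the normalized feature vectors are all performed separately for each dimension of the feature vector. If the cepstral coefficients representing a feature vector have, say, 10 dimensions, each computation is carried out for each of dimensions 0 through 9.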
【００１１】
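[Problems to Be Solved by the Invention]
Computing the mean Cm of equation (2) requires the feature vectors of all frames in the speech segment to be normalized. As a result, the computation cannot be performed until all frames of the input speech have been received, so real-time processing is not possible.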
【００１２】
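One way to solve this is to take the mean obtained from the feature vectors of speech data one utterance earlier than the current utterance, and to subtract it sequentially from the feature vector of each frame of the current utterance, thereby obtaining the normalized feature vector of each frame of the current utterance. This is described, for example, in Japanese Unexamined Patent Application Publication No. H9-90990, "Acoustic analysis method and apparatus for speech recognition" (hereinafter, the first prior art).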
【００１３】
This prior art does make real-time speech recognition processing possible.
【００１４】
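However, when a speaker change occurs between the current utterance and an earlier one, the mean is likely to differ greatly before and after the change, so the normalized feature vectors computed with it do not properly reflect the current utterance. The mean obtained from a speaker's speech data (speech data that evenly contains the phonemes) itself represents that speaker's characteristics. If the speaker at the time the mean was computed differs from the speaker at the time the normalized feature vectors are computed, the normalized feature vectors end up being computed with another speaker's mean.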
【００１５】
To obtain proper normalized feature vectors, it is important that the noise varies little and that the speaker's speech is evenly varied.
【００１６】
Speech that is evenly varied can be regarded as speech data containing almost all phonemes, so the normalized feature vectors need to be obtained from past utterances of a certain duration.
【００１７】
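However, when speech recognition uses normalized feature vectors obtained from past utterances and the speaker changes partway through, the speakers' characteristics differ and recognition performance degrades.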
【００１８】
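One conceivable technique for handling such speaker changes is to compute the normalized feature vectors of several specific speakers in advance, identify the current speaker, and use, from among the precomputed normalized feature vectors, the one corresponding to the identified speaker. This is described, for example, in the invention "Speech recognition apparatus and method" of Japanese Unexamined Patent Application Publication No. H10-254494 (hereinafter, the second prior art).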
【００１９】
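However, the second prior art requires the normalized feature vectors of several specific speakers to be computed in advance, requires processing to identify the specific speaker, and does not readily handle a change to a newly appearing speaker.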
【００２０】
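An object of the present invention is therefore to provide a speech recognition method and a speech recognition apparatus that use both the feature data of the current utterance and the feature data of past utterances, that can handle speaker changes without complicated speaker identification or speaker group identification processing, and that achieve high recognition performance on inexpensive hardware.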
【００２１】
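[Means for Solving the Problems]
To achieve the above object, the speech recognition method of the present invention performs feature analysis on speech data frame by frame at predetermined intervals, mean-normalizes the feature vectors obtained by the feature analysis, and performs speech recognition processing using the mean-normalized feature vectors. In this method, for the input speech data, a certain number of frames along the time axis is taken as one unit and defined as a short-time block. A predetermined number of short-time blocks belonging to the same speaker or speaker group are collected along the time axis and defined as a long-time block corresponding to that speaker or speaker group. A single short-time block, or a set of multiple short-time blocks belonging to the same speaker group, at a stage before the long-time block has been generated is defined as a long-time block candidate. For each short-time block obtained sequentially from the speech data, it is determined whether the feature vector of that short-time block belongs to at least one of the speakers or speaker groups already generated. If it is determined that the short-time block belongs to at least one of the already generated speakers or speaker groups, the feature vector of the short-time block is used to update the feature vector of the long-time block or long-time block candidate corresponding to the speaker or speaker group to which the short-time block belongs, the feature vectors of the short-time block are mean-normalized using the feature vector of that long-time block or long-time block candidate before or after the update, and the mean-normalized feature vectors are passed to the speech recognition processing unit. If it is determined that the short-time block does not belong to any of the already generated speakers or speaker groups, the short-time block is made a long-time block candidate, generation of a long-time block corresponding to a new speaker or speaker group is started, the short-time block is mean-normalized using its own feature vectors, and the mean-normalized feature vectors are passed to the speech recognition processing unit.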
【００２２】
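The speech recognition apparatus of the present invention performs feature analysis on speech data frame by frame at predetermined intervals, mean-normalizes the feature vectors obtained by the feature analysis, and performs speech recognition processing using the mean-normalized feature vectors. The apparatus has speech analysis means that performs feature analysis on the input speech and outputs feature vectors, feature vector mean normalization means that mean-normalizes the feature vectors output from the speech analysis means, and speech recognition processing means that receives the feature vectors mean-normalized by the feature vector mean normalization means and performs speech recognition processing using an acoustic model prepared in advance. For the input speech data, the feature vector mean normalization means takes a certain number of frames along the time axis as one unit and defines it as a short-time block. It collects a predetermined number of short-time blocks belonging to the same speaker or speaker group along the time axis and defines them as a long-time block corresponding to that speaker or speaker group. It defines as a long-time block candidate a single short-time block, or a set of multiple short-time blocks belonging to the same speaker group, at a stage before the long-time block has been generated. For each short-time block obtained sequentially from the speech data, it determines whether the feature vector of that short-time block belongs to at least one of the speakers or speaker groups already generated. If it is determined that the short-time block belongs to at least one of the already generated speakers or speaker groups, it uses the feature vector of the short-time block to update the feature vector of the long-time block or long-time block candidate corresponding to the speaker or speaker group to which the short-time block belongs, mean-normalizes the feature vectors of the short-time block using the feature vector of that long-time block or long-time block candidate before or after the update, and passes the mean-normalized feature vectors to the speech recognition processing unit. If it is determined that the short-time block does not belong to any of the already generated speakers or speaker groups, it makes the short-time block a long-time block candidate, starts generation of a long-time block corresponding to a new speaker or speaker group, mean-normalizes the short-time block using its own feature vectors, and passes the mean-normalized feature vectors to the speech recognition processing unit.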
【００２３】
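In each of these inventions, the short-time block consists of roughly the number of frames obtained from the speech data of one word of standard length, and the long-time block consists of roughly the number of frames obtained from speech data that evenly contains all phonemes.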
【００２４】
The number of frames constituting the short-time block and the number of frames constituting the long-time block are each set to a power of two.
【００２５】
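The speaker-characteristic judgment, which decides whether the speech features of the short-time block belong to an already generated speaker or speaker group, is made by computing the distance between the mean feature vector of the short-time block and the mean feature vector of the long-time block or long-time block candidate corresponding to the already generated speaker or speaker group, and comparing that distance with a preset threshold.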
【００２６】
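The threshold takes different values depending on whether the short-time block is compared against a long-time block or against a long-time block candidate. When it is compared against a long-time block candidate, the threshold distance is set larger than for a long-time block.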
【００２７】
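The maximum number of speakers or speaker groups to be generated can be set in advance. Once that maximum number of speakers or speaker groups has been generated, a subsequently obtained short-time block that is judged not to belong to any of the already generated speaker groups is judged to belong to the speaker or speaker group whose speech features are nearest.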
【００２８】
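In the speech recognition apparatus of the present invention, the acoustic model used by the speech recognition processing unit may be a standard acoustic model that can handle the mean-normalized feature vectors of each speaker or speaker group.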
【００２９】
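Alternatively, the acoustic model used by the speech recognition processing unit may be a set of acoustic models prepared in advance for multiple speakers or speaker groups. In that case, information indicating the speaker or speaker group is received from the feature vector mean normalization means, and speech recognition processing is performed using the acoustic model corresponding to that speaker or speaker group.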
【００３０】
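Thus, for each short-time block obtained sequentially from the speech data, the present invention determines whether the feature vector of that short-time block belongs to at least one of the already generated speakers or speaker groups. If it does, the feature vector of the short-time block is used to update the feature vector of the long-time block or long-time block candidate corresponding to the speaker or speaker group to which the short-time block belongs, the feature vectors of the short-time block are normalized using the feature vector of that long-time block or long-time block candidate before or after the update, and speech recognition processing is performed using the normalized feature vectors.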
【００３１】
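If, on the other hand, the short-time block is determined not to belong to any already generated speaker or speaker group, the short-time block is made a long-time block candidate and generation of a long-time block corresponding to a new speaker or speaker group is started. The short-time block is then normalized using its own feature vectors, and speech recognition processing is performed using the normalized feature vectors.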
【００３２】
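With this processing, speakers or speaker groups are generated naturally without any special speaker identification or speaker group identification means, and normalized feature vectors reflecting each speaker's characteristics are obtained. Because speech recognition processing uses these normalized feature vectors, a high recognition rate can be obtained.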
【００３３】
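The short-time block consists of roughly the number of frames obtained from the speech data of one word of standard length, and the long-time block consists of roughly the number of frames obtained from speech data that evenly contains all phonemes. As one concrete example, the short-time block is 64 frames and the long-time block is 64 short-time blocks (4096 frames). Choosing block sizes of this order makes efficient processing possible while still properly reflecting the features of the speech.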
【００３４】
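Because the number of frames constituting the short-time block and the number of frames constituting the long-time block are each set to a power of two, the division needed to obtain a mean feature vector can be carried out on a computer with a right shift alone, which simplifies the computation. This reduces the load on the CPU, and the benefit is especially large for inexpensive devices that have to use a CPU with low processing power.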
【００３５】
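The speaker-characteristic judgment is made by computing the distance between the mean feature vector of the short-time block and the mean feature vector of the long-time block or long-time block candidate corresponding to an already generated speaker or speaker group, and comparing that distance with a preset threshold, so the speaker group judgment requires only simple computation.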
【００３６】
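The threshold used for the speaker-characteristic judgment differs depending on whether the short-time block is compared against a long-time block (one that has reached the preset number of frames) or against a long-time block candidate, and the threshold distance is set larger for a long-time block candidate than for a long-time block. Consequently, while a new speaker group is being built up and its long-time block has not yet been generated, the several short-time blocks obtained from, for example, a single utterance are more readily judged to belong to the same speaker or speaker group, which prevents speakers or speaker groups from being generated needlessly.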
【００３７】
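The maximum number of speakers or speaker groups to be generated can be set in advance. Once that maximum number has been generated, a subsequently obtained short-time block that is judged not to belong to any already generated speaker group is judged to belong to the speaker or speaker group whose speech features are nearest. This allows the invention to be applied to speech recognition apparatuses in which several speakers or speaker groups are set in advance.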
【００３８】
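A speech recognition apparatus to which this speech recognition method is applied generates speaker groups naturally, with little computation and little memory and without any special speaker identification or speaker group identification means, and obtains normalized feature vectors reflecting each speaker's characteristics. It can therefore be built on inexpensive hardware while still offering high recognition performance.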
【００３９】
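As the acoustic model used for speech recognition processing, a normalized standard speaker model may be used, or multiple acoustic models corresponding to the respective speakers or speaker groups may be used.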
【００４０】
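When a standard acoustic model is used, a single acoustic model suffices, so component costs are kept low and the apparatus can be inexpensive. In addition, no processing is needed to decide which acoustic model to use each time, so the amount of computation is reduced.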
【００４１】
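When acoustic models corresponding to the respective speakers or speaker groups are used, multiple acoustic models must be prepared, so the amount of computation increases somewhat and component costs rise, but higher recognition performance is obtained.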
【００４２】
[Embodiments of the Invention]
Embodiments of the present invention are described below.
【００４３】
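FIG. 1(a) shows a speech waveform. For such a waveform, as shown in FIG. 1(b), for example, 20 msec is taken as one frame, the frame is shifted in steps of 10 msec, speech analysis is performed on each frame, and the feature vector corresponding to each frame (here consisting of 10-dimensional cepstral coefficients) is obtained.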
【００４４】
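In this embodiment, as shown in FIG. 1(c), 64 frames form one short-time block, and the individual short-time blocks are denoted A1, A2, .... As shown in FIG. 1(d), 64 short-time blocks A1, A2, ... belonging to the same speaker or speaker group are collected into one long-time block, denoted B1. The long-time block B1 thus consists of the 64 short-time blocks A1, A2, ..., A64, and since each short-time block has 64 frames, its total number of frames is 64 × 64 = 4096.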
【００４５】
Each time one of the short-time blocks A1, A2, ..., A64 is acquired, the sum of the feature vectors over all frames (64 frames) of that short-time block is computed and stored.
【００４６】
For example, for short-time block A1 the sum of the feature vectors of its 64 frames (denoted CsA1) is computed and stored, for short-time block A2 the sum of the feature vectors of its 64 frames (denoted CsA2) is computed and stored, and so on. In this way the sums CsA1, CsA2, ..., CsA64 over all frames of the respective short-time blocks A1, A2, ..., A64 are computed and stored.
【００４７】
A mean feature vector is also computed for each of the short-time blocks A1, A2, ..., A64. For example, for short-time block A1 the sum CsA1 is divided by the number of frames (64) to obtain the mean feature vector (denoted CmA1), for short-time block A2 the sum CsA2 is divided by the number of frames (64) to obtain the mean feature vector (denoted CmA2), and so on, giving the mean feature vectors CmA1, CmA2, ..., CmA64 of the respective short-time blocks.
【００４８】
The mean feature vectors CmA1, CmA2, ..., CmA64 obtained for the respective short-time blocks are used when computing the normalized feature vectors of those short-time blocks.
【００４９】
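The division used to obtain a mean feature vector divides the sum of the feature vectors of a short-time block by its number of frames, as in equation (2). If the number of frames is a power of two, this division can be carried out on a computer with a shift operation alone, so in the present invention the divisor is chosen to be a power of two. For example, obtaining the mean feature vector of one short-time block means dividing its sum of feature vectors by its number of frames. Here the number of frames is 64, which satisfies this condition.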
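The shift-based division can be illustrated as follows. This is a sketch under the assumption that the feature sums are held as integers (for example, fixed-point values), which the description does not state explicitly; the names `SHIFT` and `block_mean_fixed_point` are hypothetical.

```python
FRAMES_PER_SHORT_BLOCK = 64                        # 2**6, as in the embodiment
SHIFT = FRAMES_PER_SHORT_BLOCK.bit_length() - 1    # log2(64) = 6


def block_mean_fixed_point(frame_sum):
    """Divide each integer per-dimension sum by 64 using a right shift only."""
    return [s >> SHIFT for s in frame_sum]


# Per-dimension sums over one 64-frame short-time block (toy values)
mean = block_mean_fixed_point([640, -128, 63])
```

In Python, `>>` on integers rounds toward negative infinity, so the result matches floor division by 64. On a small CPU the same operation would typically be a single arithmetic shift instruction.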
【００５０】
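The long-time block B1, on the other hand, consists in this case of the 4096 frames of short-time blocks A1 through A64, so its sum of feature vectors is the sum over 4096 frames. Hence, if the sums of feature vectors up to short-time block A64 have been obtained at the current point, the sum over the 4096 frames of A1 through A64 is the sum CsB1 of the feature vectors of long-time block B1.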
【００５１】
The mean feature vector CmB1 of long-time block B1 is then obtained by dividing the sum CsB1 of the feature vectors of B1 by its number of frames (4096).
【００５２】
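Because the number of frames is again set to a power of two (4096 in this example), the division used to obtain the mean feature vector CmB1 of long-time block B1 can also be carried out on a computer with a shift operation alone.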
【００５３】
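In this embodiment, the computation of the sums of feature vectors, the subtraction between feature vectors, the computation of mean feature vectors, the computation of normalized feature vectors, and the distance computation between feature vectors described later are all performed for each dimension of the cepstral coefficients representing the feature vector. If the cepstral coefficients have, say, 10 dimensions numbered 0 through 9, each computation is carried out for each of those dimensions. Since this goes without saying, it is not stated explicitly below.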
【００５４】
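Referring to FIG. 2, assume that normalization processing for the 64 short-time blocks A1 through A64 is already complete and that one long-time block B1 (assumed to correspond to a speaker group G1) has thereby been generated. The processing applied to the speech data of an utterance V1 spoken after this long-time block B1 is now described. From the speech data of utterance V1, three short-time blocks A65, A66, and A67 are obtained, after which speech data shorter than 64 frames is left over as a remainder e.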
【００５５】
For the 64 frames of short-time block A65, the sum CsA65 of the feature vectors and the mean feature vector CmA65 are computed in the same way as above.
【００５６】
Once the sum CsA65 and the mean feature vector CmA65 of short-time block A65 have been obtained, the variation of CmA65 is determined, that is, how much the speech features of short-time block A65 differ from the earlier speech features.
【００５７】
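This variation can be judged by computing the distance from the mean feature vector CmA65 of short-time block A65 to the mean feature vector of the long-time block of every speaker group generated so far, comparing that distance with a predetermined threshold Dth1, and checking whether it falls within the threshold.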
【００５８】
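In the example of FIG. 2, speaker group G1 is the only speaker group generated so far. The distance DA65 from the mean feature vector CmA65 of short-time block A65 to the mean feature vector CmB1 of the long-time block B1 corresponding to G1 is therefore computed, and the judgment is made by comparing DA65 with the predetermined threshold Dth1.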
【００５９】
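If the comparison gives DA65 < Dth1, that is, if the distance DA65 between CmA65 and CmB1 is within a certain range (the threshold Dth1), it is judged that the current speech data has not changed greatly from the immediately preceding speech data (no speaker change has occurred at that point), and the existing long-time block B1 is updated. The updated long-time block is denoted B11.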
【００６０】
The updated long-time block B11 consists of the 64 short-time blocks A2 through A65, and its sum of feature vectors CsB11 is given by
CsB11 = CsB1 + CsA65 - CsA1   (3)
That is, CsB11 is obtained by adding the newly computed sum CsA65 of short-time block A65 to the sum CsB1 of long-time block B1, which consists of the 64 short-time blocks A1 through A64, and subtracting the sum CsA1 of short-time block A1.
【００６１】
Once the sum CsB11 of long-time block B11 has been obtained, its mean feature vector CmB11 is given by
CmB11 = CsB11 / 4096   (4)
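The sliding update of equations (3) and (4) can be sketched as below. The class `LongBlock` and its layout (a queue of per-block sums plus a running total) are assumptions made for illustration; the description only specifies the arithmetic.

```python
from collections import deque

FRAMES_PER_SHORT_BLOCK = 64


class LongBlock:
    """Keeps the sums of the member short-time blocks and their running total."""

    def __init__(self, short_block_sums):
        self.sums = deque(short_block_sums)          # CsA1 ... CsA64, oldest first
        dims = len(short_block_sums[0])
        self.total = [sum(s[d] for s in self.sums) for d in range(dims)]  # CsB1

    def update(self, new_sum):
        """Equation (3): add the newest block sum, subtract the oldest."""
        oldest = self.sums.popleft()
        self.sums.append(new_sum)
        self.total = [t + n - o for t, n, o in zip(self.total, new_sum, oldest)]

    def mean(self):
        """Equation (4): total divided by the number of frames (4096 for 64 blocks)."""
        frames = len(self.sums) * FRAMES_PER_SHORT_BLOCK
        return [t / frames for t in self.total]


# One-dimensional toy data: block k has sum 64 * k, i.e. block mean k
long_block = LongBlock([[64.0 * k] for k in range(1, 65)])
long_block.update([64.0 * 65])
```

Only the 64 per-block sums need to be stored, not the 4096 individual frame vectors, which is in keeping with the small memory footprint the description aims for.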
【００６２】
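Normalized feature vectors are also computed for each of the 64 frames of short-time block A65 (denoted F1 to F64). Denoting the feature vectors of the frames F1 to F64 by CF1 to CF64 and their normalized feature vectors by C'F1 to C'F64, the normalized feature vectors are given by
【００６３】
[Equation 2]
$$ C'_{Fi} = C_{Fi} - C_{mB11} \quad (i = 1, 2, \ldots, 64) \qquad (5) $$
【００６４】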
【００６５】
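Equation (5) uses, as the long-time block mean, the mean feature vector CmB11 of the updated long-time block B11 computed from short-time blocks A2 to A65. The mean obtained immediately before, that is, the mean feature vector CmB1 of the pre-update long-time block B1 computed from short-time blocks A1 to A64, may be used instead. In this embodiment the post-update mean feature vector is used.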
【００６６】
If, on the other hand, the distance DA65 between CmA65 and CmB1 satisfies DA65 ≥ Dth1, that is, if the distance is at or beyond the range given by the threshold Dth1, it is judged that the current speech data has changed greatly from the immediately preceding speech data (a speaker change has occurred at that point), and a new speaker group G2 is created.
【００６７】
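In this way, short-time block A65, which follows the completion of one long-time block (here B1), is processed, and when a speaker change is judged to have occurred from the distance between its mean feature vector CmA65 and the mean feature vector CmB1 of long-time block B1, a new speaker group G2 is created. In this example it is assumed that short-time block A65 gave rise to the new speaker group G2.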
【００６８】
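When a new speaker group is created, generation of a new 4096-frame long-time block is started in that group (here G2), and the accumulation of short-time blocks belonging to G2 and the normalization processing described above proceed in turn. As short-time blocks belonging to the new speaker group G2 accumulate, G2 too eventually comes to consist of 64 short-time blocks.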
【００６９】
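A single short-time block, or a set of multiple short-time blocks belonging to the same speaker group, at a stage before a long-time block of 64 short-time blocks has been generated is called a long-time block candidate. Thus, when short-time block A65 gives rise to the new speaker group G2 as in this example, A65 becomes a long-time block candidate.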
【００７０】
Short-time block A65, for which a speaker change was judged to have occurred, is at this point the first short-time block after the change, so it is normalized using its own data. That is, using the mean feature vector CmA65 of A65, a computation corresponding to equation (5) is performed: CmA65 is subtracted from the feature vector of each frame of A65.
【００７１】
The next 64 frames then become short-time block A66, and for these 64 frames the sum CsA66 of the feature vectors and the mean feature vector CmA66 are computed in the same way as above.
【００７２】
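Once the sum CsA66 and the mean feature vector CmA66 of short-time block A66 have been obtained, a speaker-characteristic judgment, that is, a speaker group judgment, is made to decide which speaker group the speech features of A66 belong to.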
【００７３】
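In this case two speaker groups already exist: speaker group G1 corresponding to long-time block B1, and speaker group G2 newly created by short-time block A65. The speaker group judgment is therefore made against each of G1 and G2.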
【００７４】
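In this embodiment, the speaker group judgment computes the distance between the mean feature vector of the current short-time block and the mean feature vector of the long-time block (speaker group) under comparison, and checks whether that distance exceeds a predetermined threshold. Strictly speaking, specific symbols should be attached to each mean feature vector and each distance (for example, CmA66 for the mean feature vector of short-time block A66), but this would clutter the description. Below, such symbols are omitted unless specifically needed, and the description simply states whether the short-time block belongs to a speaker group existing at that point.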
【００７５】
Since short-time block A66 is under discussion, it is first judged whether A66 belongs to speaker group G1 and then whether it belongs to speaker group G2. Assume that A66 is judged to belong to G1 and not to G2.
【００７６】
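FIG. 3 shows the speaker-characteristic judgment result for each short-time block, with ○ indicating that the block belongs to a speaker group and × indicating that it does not. At this stage it shows an example in which short-time block A65 did not belong to speaker group G1, so the new speaker group G2 was generated, and short-time block A66 was judged to belong to G1 rather than G2.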
【００７７】
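In this speaker group judgment, the distance between the mean feature vector of short-time block A66 and the mean feature vector of each speaker group G1, G2 under comparison is computed and compared with a predetermined threshold: if the distance is at or above the threshold, the block does not belong to that speaker group, and if it is below the threshold, the block belongs to it. The thresholds used here (Dth1 for the comparison with G1 and Dth2 for the comparison with G2) are given different magnitudes, with Dth1 < Dth2.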
【００７８】
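The reason is as follows. The mean feature vector of speaker group G1 was obtained from the 4096 frames of long-time block B1, and a mean obtained over that length has little variation and can be considered to represent the features of the speech reasonably well. Speaker group G2, in contrast, has a mean feature vector obtained from only the 64 frames of the single short-time block A65, and a mean obtained from so few frames cannot be said to represent the features of the speech properly. It is therefore desirable to set the threshold more loosely than in the 4096-frame case.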
【００７９】
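Thus, in the present invention the threshold is set according to the number of short-time blocks making up the speaker group. One conceivable way of doing this is to define multiple thresholds in stages, specifying which threshold applies up to which number of short-time blocks.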
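A staged threshold of this kind might look like the following sketch. The breakpoints (16 and 64 blocks), the threshold values, and the use of Euclidean distance are all assumptions chosen for illustration; the description specifies neither concrete values nor a particular distance measure.

```python
import math


def threshold_for(num_short_blocks):
    """Hypothetical staged schedule: fewer accumulated blocks, looser threshold."""
    if num_short_blocks >= 64:      # a full long-time block (plays the role of Dth1)
        return 1.0
    if num_short_blocks >= 16:      # a well-filled long-time block candidate
        return 1.5
    return 2.0                      # a young candidate (plays the role of Dth2)


def belongs_to(block_mean, group_mean, num_short_blocks):
    """A block belongs to the group if the distance is below the group's threshold."""
    distance = math.dist(block_mean, group_mean)   # Euclidean distance (assumed)
    return distance < threshold_for(num_short_blocks)


# The same distance is rejected by a full long-time block but accepted by a candidate
against_long_block = belongs_to([0.0, 0.0], [1.2, 0.0], num_short_blocks=64)
against_candidate = belongs_to([0.0, 0.0], [1.2, 0.0], num_short_blocks=1)
```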
【００８０】
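Since this example concerns a single utterance, short-time block A66 would normally be very likely to fall in the same speaker group as short-time block A65. Moreover, to avoid needlessly increasing the number of speaker groups before a 4096-frame long-time block has been generated, the threshold is loosened as described above, so that while the number of short-time blocks is small a block tends to fall into the speaker group generated immediately before.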
【００８１】
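Short-time block A66 is therefore likely to belong to the same speaker group G2 as short-time block A65, but for the purposes of explanation this example deliberately assumes that A66 does not belong to G2, the speaker group of A65. Although not illustrated here, A66 may also belong to neither G1 nor G2, in which case a new speaker group G3 is generated.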
【００８２】
When short-time block A66 is determined to belong to speaker group G1 corresponding to long-time block B1 in this way, long-time block B1 is updated as described earlier.
【００８３】
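That is, the sum of feature vectors CsB11 of the updated long-time block (denoted B11) is obtained with equation (3): the newly computed sum CsA66 of short-time block A66 is added to the sum CsB1 of long-time block B1, which consists of the 64 short-time blocks A1 through A64, and the sum CsA1 of short-time block A1 is subtracted, giving CsB11 = CsB1 + CsA66 - CsA1. Its mean feature vector CmB11 is obtained as in equation (4), CmB11 = CsB11 / 4096.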
【００８４】
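Short-time block A66 is normalized by performing a computation corresponding to equation (5) with the mean feature vector CmB11 of the updated long-time block B11, which yields the normalized feature vector of each frame of A66.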
【００８５】
Next, the 64 frames following short-time block A66 become short-time block A67, and for these 64 frames the sum of the feature vectors and the mean feature vector are computed in the same way as above.
【００８６】
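Once the sum of the feature vectors and the mean feature vector of short-time block A67 have been obtained, it is determined how much the speech features of A67 differ from the already generated speaker groups.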
【００８７】
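Here too, two speaker groups already exist: speaker group G1 corresponding to long-time block B1, and speaker group G2 newly created by short-time block A65. The speaker group judgment is therefore made against each of G1 and G2.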
【００８８】
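As described above, it is first judged whether short-time block A67 belongs to speaker group G1 and then whether it belongs to speaker group G2. Assume that A67 is judged to belong to G2 and not to G1 (see FIG. 3). If it belonged to neither of these speaker groups, a new speaker group G3 would be generated.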
【００８９】
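When short-time block A67 is determined to belong to speaker group G2, which was newly generated by short-time block A65, the long-time block candidate of G2 (so far consisting of A65) is updated. The update computes the sum of the feature vectors of A65 and A67 together and then their mean feature vector. At this stage, short-time blocks A65 and A67 together form the long-time block candidate of speaker group G2.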
【００９０】
As short-time blocks judged to belong to the same speaker group G2 continue to accumulate in this way, a new long-time block of 64 short-time blocks (4096 frames) is generated for G2.
【００９１】
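Short-time block A67 is normalized by computing the sum of the feature vectors of the two short-time blocks A65 and A67, computing their mean feature vector, and performing a computation corresponding to equation (5) with that mean, which yields the normalized feature vector of each frame of A67.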
【００９２】
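This completes the 64-frame processing (the processing of short-time blocks A65, A66, and A67) for utterance V1 shown in FIG. 2, but the remainder e shown in FIG. 2 is still left over from the speech data of V1.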
【００９３】
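The remainder e is a portion with too few frames to form one short-time block (64 frames). The normalized feature vectors of the speech segment of remainder e are therefore computed using the mean feature vector that was used to normalize the immediately preceding short-time block (A67 in the example of FIG. 2). The remainder e may either be left out of the computation of the long-time block mean feature vector, or be combined with the next input speech data to form a short-time block that is used in the long-time block computation.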
【００９４】
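This is the processing performed when the speech data of an utterance V1 is input after a long-time block B1 has been generated. In this example, as shown in FIG. 3, short-time block A65 gave rise to the new speaker group G2, the next short-time block A66 was determined to belong to speaker group G1 corresponding to long-time block B1, so B1 was updated, and the following short-time block A67 was determined to belong to speaker group G2 generated by A65.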
【００９５】
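When the speech data of the next utterance V2 is input, the same processing as for utterance V1 is performed. If an individual short-time block obtained from V2 is determined to belong to one of the already generated speaker groups, the long-time block (or long-time block candidate) currently held for that speaker group is updated. If it is determined to belong to none of the speaker groups, a new speaker group is generated.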
【００９６】
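As for generating the normalized feature vectors of each short-time block: if the short-time block is determined to belong to an already generated speaker group, it is normalized using the post-update mean feature vector of the long-time block or long-time block candidate currently held for that speaker group. If it is determined to belong to none of the already generated speaker groups, it is normalized using its own mean feature vector.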
【００９７】
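By applying this processing to the input speech data in sequence, short-time blocks accumulate in the newly generated speaker group (G2 in the example above), and when their number reaches 64 a new long-time block is generated.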
【００９８】
The long-time block B1 corresponding to the already generated speaker group G1 is likewise updated with each short-time block that is found to belong to G1.
【００９９】
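The normalized feature vectors described above are computed for each short-time block and passed to the speech recognition unit, where speech recognition processing is performed.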
【０１００】
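FIG. 4 is a flowchart of the processing described above, that is, the procedure in which the input speech data is processed sequentially one short-time block at a time to obtain normalized feature vectors and pass them to the speech recognition processing unit. Since the details have already been explained, only the overall procedure is described here.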
【０１０１】
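First, when 64 frames of the input speech data have been acquired as a short-time block, the short-time block is compared with all speaker groups generated so far (step s1). This step computes the sum of the feature vectors and the mean feature vector of the short-time block, computes the distance between that mean feature vector and the mean feature vector of every already generated speaker group, and compares the distance with a preset threshold. From the result of the comparison, it is judged whether the short-time block belongs to any of the groups (step s2).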
【０１０２】
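If this speaker group judgment determines that the short-time block belongs to none of the already generated speaker groups, a new speaker group is created (step s3), the short-time block is made a long-time block candidate, and the sum of the feature vectors and the mean feature vector of that candidate are obtained. The short-time block is then normalized (step s4) to obtain the normalized feature vector of each frame, and the normalized feature vectors are passed to the speech recognition processing unit (step s5).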
【０１０３】
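If, on the other hand, the judgment in step s2 determines that the short-time block belongs to one or more of the already generated speaker groups, it is decided which of the currently existing speaker groups the short-time block belongs to, that is, a speaker group is selected (step s6).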
【０１０４】
When, for example, multiple speaker groups exist at that point, this speaker group selection picks one of them.
【０１０５】
Two cases are possible: the short-time block belongs to exactly one of the speaker groups, or it belongs to several of them (possibly all).
【０１０６】
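If the short-time block is judged to belong to exactly one speaker group, that speaker group is selected. If it is judged to belong to multiple speaker groups, one of them is selected according to a selection criterion set in advance: (a) the speaker group nearest in time, (b) the speaker group for which the longest long-time block has been generated, or (c) the speaker group whose mean feature vector is nearest in distance. It is also decided in advance that when (b) or (c) results in a tie in length or distance, criterion (a) is applied.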
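The three selection criteria and the tie-break rule can be sketched as follows. The dictionary fields (`last_used`, `num_blocks`, `distance`) and the function name `select_group` are hypothetical; they stand in for whatever bookkeeping an implementation keeps per speaker group.

```python
def select_group(candidates, criterion):
    """Pick one group among those the short-time block was judged to belong to.

    candidates: list of dicts with keys
      'name'       - group label
      'last_used'  - index of the most recent short-time block assigned to the group
      'num_blocks' - number of accumulated short-time blocks
      'distance'   - distance between the block mean and the group mean
    """
    if criterion == "a":        # nearest in time
        key = lambda g: g["last_used"]
    elif criterion == "b":      # longest long-time block, ties broken by (a)
        key = lambda g: (g["num_blocks"], g["last_used"])
    elif criterion == "c":      # smallest distance, ties broken by (a)
        key = lambda g: (-g["distance"], g["last_used"])
    else:
        raise ValueError("criterion must be 'a', 'b' or 'c'")
    return max(candidates, key=key)


groups = [
    {"name": "G1", "last_used": 1, "num_blocks": 64, "distance": 0.8},
    {"name": "G2", "last_used": 2, "num_blocks": 1, "distance": 0.5},
]
chosen_a = select_group(groups, "a")["name"]
chosen_b = select_group(groups, "b")["name"]
chosen_c = select_group(groups, "c")["name"]
```

Encoding the tie-break as the second element of the sort key keeps the fallback to criterion (a) in one place.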
【０１０７】
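For example, if criterion (a) has been set for the case of multiple matching speaker groups, the speaker group nearest in time to the short-time block is selected, and the long-time block or long-time block candidate corresponding to the selected speaker group is updated, that is, its sum of feature vectors and mean feature vector are updated (step s7). This update has already been described and is not repeated here.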
【０１０８】
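When this update is finished, step s4 is performed: the short-time block just added is normalized, and the normalized feature vector of each frame is obtained. The normalization uses the mean feature vector of the selected speaker group (the post-update mean feature vector in this embodiment). It has already been described and is not repeated here.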
【０１０９】
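This processing is performed for each short-time block. In the example of FIG. 3, short-time block A65 does not belong to the already generated speaker group G1, so a new speaker group G2 is generated (steps s2, s3), normalized feature vectors are generated using its own data, and the normalized feature vectors are passed to the speech recognition processing unit (steps s4, s5).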
【０１１０】
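Short-time block A66 belongs to the already generated speaker group G1, so speaker group selection is performed (here G1 is selected because A66 belongs only to G1) (step s6), and speaker group G1 is updated (step s7). Normalization is then performed using the mean feature vector of G1 (the post-update mean), and the normalized feature vectors are passed to the speech recognition processing unit (steps s4, s5).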
【０１１１】
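Short-time block A67 belongs to speaker group G2 generated by short-time block A65 (at this point G2 counts as an already generated speaker group), so speaker group selection is performed (here G2 is selected because A67 belongs only to G2) (step s6), and speaker group G2 is updated (step s7). Because 64 short-time blocks have not yet accumulated at this point, updating G2 means computing the sum of the feature vectors of the long-time block candidate consisting of A65 and A67 and then its mean feature vector.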
【０１１２】
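Normalization is then performed using the mean feature vector (the post-update mean) of speaker group G2 (the long-time block candidate consisting of short-time blocks A65 and A67), and the normalized feature vectors are passed to the speech recognition processing unit (steps s4, s5).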
【０１１３】
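The sequential processing described with the flowchart of FIG. 4, in which short-time blocks are acquired one after another from the input speech data and each is processed in turn to obtain its normalized feature vectors, does not need to be aware of how speech segments connect. The input speech data is simply processed in order, so the processing is simple, the amount of data to be held is small, and faster processing is possible.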
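The flow of steps s1 to s7 can be sketched end to end as follows. This is a simplified illustration, not the exact procedure of the embodiment: it assumes Euclidean distance, a single fixed threshold instead of a staged one, selection criterion (c), and plain floating-point division rather than power-of-two shifts. The names `Group` and `process_block` are hypothetical.

```python
import math

FRAMES = 64        # frames per short-time block
MAX_BLOCKS = 64    # short-time blocks per long-time block


class Group:
    """One speaker group: the sums of its member short-time blocks."""

    def __init__(self, block_sum):
        self.block_sums = [block_sum]

    def mean(self):
        dims = len(self.block_sums[0])
        n = len(self.block_sums) * FRAMES
        return [sum(s[d] for s in self.block_sums) / n for d in range(dims)]

    def update(self, block_sum):                      # step s7
        self.block_sums.append(block_sum)
        if len(self.block_sums) > MAX_BLOCKS:
            self.block_sums.pop(0)                    # drop the oldest block


def process_block(frames, groups, threshold=1.0):
    """Run steps s1 to s5 for one short-time block of FRAMES frames."""
    dims = len(frames[0])
    block_sum = [sum(f[d] for f in frames) for d in range(dims)]
    block_mean = [s / len(frames) for s in block_sum]
    # s1, s2: compare against every existing group
    matches = [i for i, g in enumerate(groups)
               if math.dist(block_mean, g.mean()) < threshold]
    if not matches:                                   # s3: new group from this block
        groups.append(Group(block_sum))
        index = len(groups) - 1
    else:                                             # s6: nearest mean (criterion c)
        index = min(matches,
                    key=lambda i: math.dist(block_mean, groups[i].mean()))
        groups[index].update(block_sum)               # s7
    mean = groups[index].mean()                       # post-update mean
    normalized = [[f[d] - mean[d] for d in range(dims)] for f in frames]  # s4
    return normalized, index                          # s5: pass to the recognizer


groups = []
_, first = process_block([[0.0]] * FRAMES, groups)    # no groups yet: creates group 0
_, second = process_block([[5.0]] * FRAMES, groups)   # far from group 0: creates group 1
out, third = process_block([[0.2]] * FRAMES, groups)  # close to group 0: joins it
```

Only per-block sums are retained per group, so the state grows with the number of groups rather than with the length of the input.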
【０１１４】
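The description so far has dealt with processing individual short-time blocks sequentially: a short-time block is acquired from the speech data of an input utterance, a speaker group judgment is made for it, and on the basis of the result a new speaker group is generated or an existing speaker group is updated. Alternatively, after the speaker group judgments have been made for the individual short-time blocks of an utterance, if those judgments are not uniform and are therefore contradictory for a single utterance, the mean feature vector of the speech data of the whole utterance may be used to decide which speaker group the utterance belongs to.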
【０１１５】
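If an utterance is taken to be spoken by a single speaker, the short-time blocks A65, A66, and A67 of utterance V1 in the example of FIG. 2 ought to belong to the same speaker group. However, as described above, the speaker group judgments for the individual short-time blocks can produce various contradictions that should not normally arise within one utterance: only short-time block A66 may fall in a different speaker group (see FIG. 3), or, although not mentioned in the example above, A65 may fall in speaker group G2, A66 in speaker group G1, and A67 in neither.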
【０１１６】
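In such a case, the end of the utterance is awaited, the mean feature vector of the whole utterance is computed, and it is compared with the mean feature vector of the long-time block or long-time block candidate corresponding to each already generated speaker group.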
【０１１７】
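Taking the example of FIG. 2, the end of utterance V1 is awaited, the mean feature vector of the whole of V1 is computed, and it is compared with the mean feature vector of long-time block B1 corresponding to speaker group G1. If the speech data of V1 is judged to belong to G1, long-time block B1 of G1 is updated. If it is judged not to belong to G1, a new speaker group G2 is generated from the speech data of the whole of V1.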
【０１１８】
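When the speech data of utterance V1 is judged to belong to speaker group G1 and long-time block B1 of G1 is updated, the short-time blocks among A65, A66, and A67 that were individually judged not to belong to G1 (A65 and A67 in the example above) may be excluded from the update of B1, or they may deliberately be used in the update of B1.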
【０１１９】
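The advantage of not using the short-time blocks that do not belong to speaker group G1 in the update of long-time block B1 is as follows. The speech data of such a short-time block may have been affected by a sudden disturbance (for example, speech uttered in an abruptly different manner, or speech recorded while the background noise changed abruptly), and excluding it keeps that sudden effect from reaching the data (the mean feature vector) of long-time block B1 corresponding to G1. In other words, the influence of sudden noise and the like can be eliminated.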
【０１２０】
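The advantage of deliberately using the short-time blocks A65 and A67 that do not belong to speaker group G1 in the update, on the other hand, is that the system can adapt to changes in the situation, such as the same speaker's manner of speaking changing gradually, or the speaker suddenly changing within the same speaker group.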
【０１２１】
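If the short-time blocks making up utterance V1 are, as a whole, determined not to belong to speaker group G1, the utterance is treated as not belonging to G1 even if some of its short-time blocks (A66 in the example of FIG. 3) were individually determined to belong to G1. Such a short-time block (A66) is therefore not used to update speaker group G1.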
【０１２２】
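FIG. 5 is a flowchart of the processing described above, that is, the procedure in which the speaker group judgment is made after waiting for a whole utterance of the input speech data. Since the details have already been explained, only the overall procedure is described here.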
【０１２３】
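For the input speech data, it is judged whether the speaker group judgment, which computes the sum of feature vectors and the mean feature vector of each short-time block of the utterance and decides which speaker group each short-time block belongs to, has finished for the utterance (step s11). If it has not finished, a short-time block is acquired from the speech data, its sum of feature vectors and mean feature vector are computed, and the speaker group judgment is made for it (step s12). Processing then returns to step s11, and this is repeated until the utterance is finished.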
【０１２４】
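When the speaker group judgment is judged to have finished for the speech data of the utterance, it is judged whether the speaker group judgment results of the individual short-time blocks obtained from the utterance are uniform for a single utterance (step s13).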
【０１２５】
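If the speaker group judgments are judged to be uniform for the utterance, then on the basis of those judgments, if the speech data of the utterance belongs to one of the already generated speaker groups, the long-time block or long-time block candidate of that speaker group is updated, and if it belongs to none of the speaker groups, a new speaker group is generated (step s14). Each short-time block of the utterance is then normalized (step s15) to obtain the normalized feature vector of each frame, and the normalized feature vectors are passed to the speech recognition processing unit (step s16). This normalization has already been described and is not repeated here.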
【０１２６】
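If, on the other hand, step s13 judges that the results are not uniform for the utterance, the mean feature vector of all the short-time blocks obtained from the utterance is computed (step s17), this utterance-level mean feature vector is compared with all already generated speaker groups (step s18), and from the result of the comparison it is judged whether the speech data of the utterance belongs to any of the already generated speaker groups (step s19).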
【０１２７】
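If this judgment determines that the speech data of the utterance belongs to none of the groups, a new speaker group is created, it is made a long-time block candidate, and the sum of feature vectors and the mean feature vector of that candidate (the individual short-time blocks of the utterance) are obtained (step s20). Each short-time block of the utterance is then normalized (step s15) to obtain the normalized feature vector of each frame, and the normalized feature vectors are passed to the speech recognition processing unit (step s16).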
【０１２８】
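If, on the other hand, step s19 determines that the utterance belongs to one or more of the already generated speaker groups, it is decided which of the currently existing speaker groups the short-time blocks of the utterance belong to, that is, a speaker group is selected (step s21). This speaker group selection can be done in the same way as step s6 of FIG. 4 and is not described again here.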
【０１２９】
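For example, if the utterance belongs to multiple speaker groups and the selection criterion (a) described earlier has been set, the speaker group nearest in time is selected, and the long-time block corresponding to that speaker group is updated, that is, its sum of feature vectors and mean feature vector are updated (step s22).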
【０１３０】
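The updates in step s22 and step s14 have already been described and are not repeated here. Step s15 is then performed: each short-time block of the utterance is normalized and the normalized feature vector of each frame is obtained. In this embodiment the normalization uses the post-update mean feature vector of the selected speaker group.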
【０１３１】
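The method of FIG. 5 judges, utterance by utterance, which speaker group the utterance belongs to and computes the normalized feature vectors on the basis of that judgment. Compared with the sequential processing of FIG. 4, it holds somewhat more data and takes somewhat longer, but it makes more appropriate speaker group judgments and achieves a higher recognition rate.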
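The utterance-level decision (steps s13 and s17 to s21) can be sketched as follows. As before, Euclidean distance, a single fixed threshold, nearest-group selection, and equal-length short-time blocks are assumptions, and the function name `decide_utterance_group` is hypothetical.

```python
import math


def decide_utterance_group(block_means, block_decisions, group_means, threshold=1.0):
    """Return the group index for the whole utterance, or None for a new group.

    block_means:     mean feature vector of each short-time block in the utterance
    block_decisions: group index chosen for each block individually
    group_means:     mean feature vector of each existing speaker group
    """
    # s13: the per-block results are uniform if they all name the same group
    if len(set(block_decisions)) == 1:
        return block_decisions[0]
    # s17: mean over the whole utterance (equal-length blocks assumed)
    dims = len(block_means[0])
    utterance_mean = [sum(m[d] for m in block_means) / len(block_means)
                      for d in range(dims)]
    # s18, s19: compare with every existing group
    distances = [math.dist(utterance_mean, g) for g in group_means]
    matches = [i for i, value in enumerate(distances) if value < threshold]
    if not matches:
        return None                                   # s20: caller creates a new group
    return min(matches, key=lambda i: distances[i])   # s21: nearest group (assumed)


existing = [[0.0], [5.0]]
uniform = decide_utterance_group([[0.2], [0.3]], [0, 0], existing)
resolved = decide_utterance_group([[1.2], [0.3], [1.2]], [1, 0, 1], existing)
new_group = decide_utterance_group([[3.0], [0.3], [3.0]], [1, 0, 1], existing)
```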
【０１３２】
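As described above, in this embodiment the mean feature vector of a short-time block obtained from the speech data is compared with the mean feature vector of the long-time block or long-time block candidate corresponding to every already generated speaker group. If the block belongs to none of the speaker groups, a new speaker group is created. If it belongs to at least one, one of the speaker groups is selected and the long-time block or long-time block candidate corresponding to the selected group is updated. In this way speaker groups reflecting each speaker's characteristics are generated automatically and normalized feature vectors are generated for each speaker group, so speech recognition reflects each speaker's characteristics.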
【０１３３】
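Because the comparison between the mean feature vector of a short-time block and the already generated speaker groups is a simple distance computation, speaker changes are handled easily without complicated speaker identification or speaker group identification processing.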
【０１３４】
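When performing the processing described above, the sums of feature vectors and mean feature vectors obtained from the long-time blocks of several speaker groups (for example, a male speaker group, a female speaker group, and a child speaker group) may be held in advance as initial values.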
【０１３５】
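The description so far has assumed that the processing of short-time blocks A1 through A64 (including normalization of each short-time block) is already complete and that one speaker group G1 (long-time block B1) has been generated from those 64 short-time blocks, and has dealt with the speech data of an utterance spoken after short-time block A64. The case in which speech data is input for the very first time, that is, in which no speaker group exists at all, is described below with reference to FIG. 6. The sequential method shown in the flowchart of FIG. 4 is used for the explanation.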
【０１３６】
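FIG. 6(a) shows the input speech waveform. From this waveform, 64-frame short-time blocks A1, A2, A3, ... are acquired in sequence along the time axis t as shown in FIG. 6(b), and each acquired short-time block is processed according to the procedure described with the flowchart of FIG. 4.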
【０１３７】
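First, short-time block A1 is acquired, its sum of feature vectors and mean feature vector are computed, and it is determined which of the already generated speaker groups it belongs to. Since no speaker group has been generated yet, short-time block A1 is made a new speaker group G1 as shown in FIG. 6(c), and its sum of feature vectors and, if necessary, its mean feature vector are stored. Short-time block A1 is normalized using its own sum of feature vectors and mean feature vector, and the result is passed to the speech recognition unit. At this point short-time block A1 is the long-time block candidate of speaker group G1.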
【０１３８】
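Next, short-time block A2 is acquired, its sum of feature vectors and mean feature vector are computed, and it is determined which of the already generated speaker groups it belongs to. The only speaker group generated so far is G1, generated from short-time block A1, so it is judged whether A2 belongs to G1.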
【０１３９】
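If A2 is judged to belong to speaker group G1, G1 is updated (the sum of feature vectors of A2 is added to that of A1 and the mean feature vector is computed), the result becomes the long-time block candidate, and short-time block A2 is normalized using the mean feature vector of that candidate.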
【０１４０】
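If A2 is judged not to belong to speaker group G1, short-time block A2 is made a new speaker group G2, and its sum of feature vectors and, if necessary, its mean feature vector are stored. In this case short-time block A2 is normalized using its own sum of feature vectors and mean feature vector, and the result is passed to the speech recognition unit. Here it is assumed that short-time block A2 gave rise to the new speaker group G2. At this point short-time block A2 is the long-time block candidate of speaker group G2.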
【０１４１】
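Next, short-time block A3 is acquired, its sum of feature vectors and mean feature vector are computed, and it is determined which of the already generated speaker groups it belongs to. The speaker groups generated so far are G1, generated from short-time block A1, and G2, generated from short-time block A2, so the speaker group judgment is made against both G1 and G2.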
【０１４２】
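If A3 is judged to belong to both, and the selection criterion has been set to (a) described earlier, that is, to select the speaker group nearer in time, then in this case it is judged to belong to speaker group G1 as the one nearer in time. If it belongs to neither, a new speaker group is generated. Here it is assumed that A3 is determined to belong to G1. Speaker group G1 is thereby updated, that is, the candidate so far consisting of short-time block A1 is updated, and at this point short-time blocks A1 and A3 form the long-time block candidate of G1. Short-time block A3 is then normalized using the mean feature vector of the long-time block candidate consisting of A1 and A3, and the normalized feature vectors are passed to the speech recognition processing unit.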
【０１４３】
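Next, short-time block A4 is acquired, its sum of feature vectors and mean feature vector are computed, and it is determined which of the already generated speaker groups it belongs to. The speaker groups generated so far are G1, consisting of short-time blocks A1 and A3, and G2, consisting of short-time block A2, so the speaker group judgment is made against both G1 and G2. Here it is assumed that short-time block A4 is determined to belong to G1.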
【０１４４】
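As a result, three short-time blocks A1, A3, and A4 have accumulated in speaker group G1, and at this point they form the long-time block candidate of G1. The number of frames in this candidate is 64 × 3 = 192, which is not a power of two. Short-time block A4 is therefore normalized using the mean obtained from the sum over a total of 128 frames: its own feature vectors (64 frames) and those of A3 (64 frames), the preceding short-time block in the same speaker group. The sum of feature vectors of the long-time block candidate consisting of A1, A3, and A4 is nevertheless stored and used in later processing.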
【０１４５】
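The processing used when, as with short-time block A4, the number of frames in the long-time block candidate including the current short-time block is not a power of two is explained further with another example, referring to FIG. 7.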
【０１４６】
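As shown in FIG. 7, suppose that at a stage before a 4096-frame long-time block has been generated, a long-time block candidate (denoted b1) consisting of four short-time blocks Aa through Ad along the time axis t exists, the next short-time block Ae is acquired, and a long-time block candidate b2 consisting of five short-time blocks including Ae is generated. The number of frames in candidate b2 is then 64 × 5 = 320, which is not a power of two.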
【０１４７】
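In such a case, the mean feature vector used to normalize short-time block Ae is obtained as follows. Among the short-time blocks Aa through Ae up to and including Ae, the largest power-of-two number of short-time blocks available in candidate b2 is taken (here the four short-time blocks Ab through Ae), the sum of their feature vectors is computed, and that sum is divided by the total number of frames of these four short-time blocks Ab to Ae (64 × 4 = 256). Short-time block Ae is normalized using the mean feature vector obtained in this way.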
【０１４８】
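FIG. 7 illustrated the case in which the number of frames in the long-time block candidate including the current short-time block, that is, the number of short-time blocks in the candidate, is not a power of two, using a candidate whose fifth short-time block has just arrived. For any other count that is not a power of two, such as the third, sixth, seventh, or ninth short-time block, processing corresponding to that described with FIG. 7 is performed.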
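The rule of using the largest power-of-two number of most recent short-time blocks can be sketched as follows, again assuming integer (fixed-point) sums so that the division reduces to a right shift. The function name `normalization_mean` is hypothetical.

```python
def normalization_mean(block_sums, frames_per_block=64):
    """Mean used to normalize the newest block of a long-time block candidate.

    block_sums: integer per-dimension sums of the candidate's short-time blocks,
                oldest first, newest last (at least one block).
    frames_per_block must itself be a power of two.
    """
    count = len(block_sums)
    usable = 1 << (count.bit_length() - 1)        # largest power of two <= count
    recent = block_sums[-usable:]                 # the most recent 'usable' blocks
    shift = (usable * frames_per_block).bit_length() - 1   # log2 of the frame count
    dims = len(recent[0])
    return [sum(s[d] for s in recent) >> shift for d in range(dims)]


# Five blocks (Aa..Ae): only the last four are used, 256 frames, shift by 8
five_blocks = [[6400], [640], [1280], [1920], [2560]]
mean_five = normalization_mean(five_blocks)
# Three blocks (as with A1, A3, A4): only the last two are used, 128 frames
mean_three = normalization_mean([[64], [128], [384]])
```

The full list of sums is still kept by the caller, matching the note that the sum over the whole candidate is stored for later processing.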
【０１４９】
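Returning to FIG. 6, suppose that short-time block A5, following short-time block A4, is acquired. Its sum of feature vectors and mean feature vector are computed, and it is determined which of the already generated speaker groups it belongs to.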
【０１５０】
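The speaker groups generated so far are G1, consisting of short-time blocks A1, A3, and A4, and G2, consisting of short-time block A2, so here too the speaker group judgment is made against both G1 and G2. Assume that A5 is determined to belong to G1. Four short-time blocks A1, A3, A4, and A5 have then accumulated in G1, and at this point they form the long-time block candidate of G1.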
【０１５１】
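In this case the number of frames in the long-time block candidate consisting of the four short-time blocks A1, A3, A4, and A5 is 64 × 4 = 256, which is a power of two. Short-time block A5 is therefore normalized using the mean feature vector of A1, A3, A4, and A5, and the normalized feature vectors are passed to the speech recognition processing unit.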
【０１５２】
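When judging which of several speaker groups a block belongs to, the judgment is based, as described earlier, on whether the distance is at or above a preset threshold, and the magnitude of the threshold is varied according to the length of the speaker group (the number of accumulated short-time blocks). For example, when the speaker group judgment for short-time block A5 is made against speaker groups G1 and G2, G1 is longer than G2 in this case, so a smaller threshold distance is set for G1 and a stricter distance comparison is made.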
【０１５３】
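By repeating this processing, several speaker groups are generated (only two, G1 and G2, are used here to keep the explanation simple), and short-time blocks accumulate in each speaker group. When, for example, 4096 frames (64 short-time blocks) have accumulated in speaker group G1, they become the long-time block corresponding to G1 as described earlier (long-time block B1 in FIG. 2). Likewise, when 4096 frames (64 short-time blocks) have accumulated in speaker group G2, they become the long-time block corresponding to G2.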
【０１５４】
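The description with FIG. 6 was given as the procedure for the case in which no speaker group exists at all, but the processing after a speaker group has been generated, as described with FIG. 2, is of course the same as that described with FIG. 6.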
【０１５５】
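When generating new speaker groups, a limit may be placed on the number of speaker groups, in which case the maximum number of groups is set in advance. For example, in a system whose speech recognition apparatus has acoustic models for an adult male speaker group, an adult female speaker group, and a child speaker group, the maximum number of speaker groups is three.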
【０１５６】
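When the maximum number of speaker groups is fixed in advance in this way, and a short-time block obtained from the input speech is determined to belong to none of the speaker groups after that maximum number has been generated, the short-time block is judged to belong to the nearest of the already generated speaker groups.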
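The fallback to the nearest group once the maximum has been reached can be sketched as follows. Euclidean distance and the function name `assign_group` are assumptions made for illustration.

```python
import math


def assign_group(block_mean, group_means, threshold, max_groups):
    """Return the index of the group to use, or None if a new group may be created."""
    distances = [math.dist(block_mean, g) for g in group_means]
    matches = [i for i, value in enumerate(distances) if value < threshold]
    if matches:                                    # belongs to an existing group
        return min(matches, key=lambda i: distances[i])
    if len(group_means) < max_groups:              # room left: caller creates a group
        return None
    # maximum reached: fall back to the nearest existing group
    return min(range(len(group_means)), key=lambda i: distances[i])


three_groups = [[0.0], [5.0], [10.0]]
capped = assign_group([7.0], three_groups, threshold=1.0, max_groups=3)
uncapped = assign_group([7.0], three_groups, threshold=1.0, max_groups=4)
```

Whether a block assigned by this fallback should also update the group's long-time block is a separate design choice, discussed in the following paragraphs.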
【０１５７】
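When a short-time block that belongs to no speaker group is assigned to the nearest of the groups in this way, the short-time block may either be excluded from or used in the update of the long-time block of that nearest speaker group, and each choice has its own advantage.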
【０１５８】
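The advantage of not using it in the update is that sudden effects such as noise can be eliminated. The advantage of using it in the update is that the system can adapt to changes in the situation, such as the same speaker's manner of speaking changing gradually, or the speaker changing within the same speaker group.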
【０１５９】
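FIG. 8 schematically shows the configuration of the speech recognition apparatus of the present invention. It has a microphone 1 as speech input means for inputting speech; a speech signal processing unit 2 that amplifies and A/D-converts the input speech signal; a speech analysis unit 3 that analyzes the speech signal processed by the speech signal processing unit 2 at predetermined intervals and outputs a sequence of feature vectors, for example 10-dimensional cepstral coefficients; a feature vector mean normalization processing unit 4 that normalizes the feature vectors obtained by the speech analysis unit 3 using the processing method described above; and a speech recognition processing unit 6 that receives the normalized feature vectors (normalized cepstral coefficients) from the feature vector mean normalization processing unit 4 and performs speech recognition processing using a normalized standard acoustic model 5.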
【０１６０】
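The feature vector mean normalization processing unit 4 normalizes the feature vectors. Its processing has already been described in detail with reference to FIGS. 1 to 7 and is omitted here.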
【０１６１】
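The normalized feature vectors obtained by the feature vector mean normalization processing unit 4 are normalized per speaker group. They are sent to the speech recognition processing unit 6, speech recognition processing is performed using the acoustic model 5, and the recognition result is output.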
【０１６２】
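Thus the speech recognition apparatus of the present invention obtains normalized feature vectors reflecting each speaker's characteristics by a simple technique, without any special speaker identification or speaker group identification means, and performs speech recognition processing using those normalized feature vectors.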
【０１６３】
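In the example of FIG. 8 a normalized standard acoustic model 5 is used for speech recognition. When several speaker groups are set in advance, however, the apparatus may hold an acoustic model for each speaker group, and the feature vector mean normalization processing unit 4 may pass speaker group information corresponding to these speaker groups to the speech recognition processing unit 6 together with the normalized feature vectors. The speech recognition processing unit 6 can then perform speech recognition processing using the acoustic model corresponding to the speaker group information, which further raises the recognition rate.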
【０１６４】
たとえば、話者グループを男性、女性、子供の３つのグループとし、図９に示すように、それに対応した音響モデル（男性用音響モデル５ａ、女性用音響モデル５ｂ、子供用音響モデル５ｃ）を用意し、特徴ベクトル平均正規化処理部４で生成される話者グループ数もこの３つを最大数、つまり、生成可能な話者グループ数は、男性話者グループ、女性話者グループ、子供話者グループの３つだけとしてもよい。
【０１６５】
その場合、音声データから取得される短時間ブロックが、前述したように、この３つの話者グループのどれかに属するかを判定し、どれにも属さない場合には、最も近い話者グループに属するようにして、その時点における正規化特徴ベクトルを求め、その正規化特徴ベクトルとどの話者グループに属するかを示す話者グループ情報を音声認識処理部６に渡すようにする。
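[0166]
The speech recognition apparatus described with FIGS. 8 and 9 can determine which speaker group a block belongs to by a simple distance calculation, without special speaker identification or speaker-group identification, and can obtain normalized feature vectors reflecting the speaker's characteristics and use them for speech recognition processing, so recognition performance can be improved. In addition, the amount of computation and memory needed to generate the normalized feature vectors can be greatly reduced, so the apparatus can run on inexpensive hardware while achieving high recognition performance.
[0167]
When a standard acoustic model is used as shown in FIG. 8, only one acoustic model is needed, so component cost can be kept low and the apparatus can be made inexpensive. Moreover, no processing is needed each time to decide which acoustic model to use, so the amount of computation can be reduced.
[0168]
On the other hand, when acoustic models corresponding to individual speakers or speaker groups are used as shown in FIG. 9, a plurality of acoustic models must be prepared, so the amount of computation increases somewhat and component cost also rises; the advantage is that higher recognition performance is obtained.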
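[0169]
The present invention is not limited to the embodiment described above and can be modified in various ways without departing from the gist of the invention.
[0170]
For example, in the embodiment described above a short-time block consists of 64 frames and a long-time block consists of 64 short-time blocks (4096 frames), but the invention is not limited to these values, which can be set as appropriate for the system in use. It is desirable, however, that the number of frames in a short-time block and the number of short-time blocks (number of frames) in a long-time block be powers of 2.
[0171]
Also, in the embodiment described above the speaker determination was made according to whether a block belongs to a certain speaker group, but it may instead be determined whether the block belongs to a specific speaker. Accordingly, although the embodiment is described in terms of speaker groups, this may be read as "speaker or speaker group".
[0172]
Furthermore, a processing program describing the processing procedure for realizing the present invention described above can be created and recorded on a recording medium such as a floppy disk, an optical disc, or a hard disk, and the present invention also includes a recording medium on which that processing program is recorded. The processing program may also be obtained from a network.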
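[0173]
[Effects of the Invention]
As described above, according to the present invention, speaker groups can be generated naturally without using special speaker identification or speaker-group identification means, normalized feature vectors reflecting the characteristics of each speaker can be obtained, and speech recognition processing can be performed using those normalized feature vectors, so a high recognition rate can be obtained.
[0174]
Furthermore, in the present invention the number of frames constituting a short-time block and the number of frames constituting a long-time block are each set to a power of 2, so the division needed to obtain an average feature vector or the like can be performed by a right-shift operation alone in computer arithmetic, which simplifies the computation. This reduces the computational load on the CPU, and the effect is particularly large when the invention is applied to inexpensive devices that have to use a CPU with low processing capability.
[0175]
In addition, the speaker determination is made by calculating the distance between the average feature vector of the short-time block and the average feature vector of the long-time block or long-time block candidate corresponding to an already generated speaker or speaker group, and comparing that distance with a preset threshold, so the speaker-group determination can be made with a simple computation.
[0176]
A speech recognition apparatus to which this speech recognition method is applied can generate speakers or speaker groups naturally, with a small amount of computation and a small amount of memory and without special speaker identification or speaker-group identification means, and can obtain normalized feature vectors reflecting the characteristics of each speaker, so the apparatus can combine inexpensive hardware with high recognition performance.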
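[Brief Description of the Drawings]
[FIG. 1] A diagram explaining the basic processing of the embodiment of the present invention.
[FIG. 2] A diagram explaining the processing performed after a long-time block corresponding to an already generated speaker group has been generated.
[FIG. 3] A diagram explaining the processing that determines, in FIG. 2, which speaker group the several short-time blocks acquired from one utterance belong to.
[FIG. 4] A flowchart explaining the procedure in which input speech data is processed sequentially, one short-time block at a time, to obtain normalized feature vectors that are passed to the speech recognition processing unit.
[FIG. 5] A flowchart explaining the procedure in which the speaker-group determination for input speech data is made after waiting for one complete utterance.
[FIG. 6] A diagram explaining the processing of the present invention when speech data is newly input, that is, when no group exists at all.
[FIG. 7] A diagram explaining the normalization processing when the number of frames of a long-time block candidate, including the acquired short-time block, is not a power of 2.
[FIG. 8] A block diagram explaining an embodiment of the speech recognition apparatus of the present invention.
[FIG. 9] A block diagram of the speech recognition apparatus when acoustic models are held for several preset speaker groups.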
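[Explanation of Symbols]
1: microphone
2: speech signal processing unit
3: speech feature analysis unit
4: feature vector normalization processing unit
5: acoustic model
5a: male acoustic model
5b: female acoustic model
5c: child acoustic model
6: speech recognition processing unit
A1, A2, ..., A64, A65, ...: short-time blocks
B1: long-time block
G1, G2: speaker groups
[0001]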
BACKGROUND OF THE INVENTION
The present invention relates to a speech recognition method and a speech recognition apparatus that average-normalize feature vectors and perform speech recognition processing using the average-normalized feature vectors.
[0002]
[Prior art]
The feature vector average normalization method, which average-normalizes the feature vector (for example, 10-dimensional cepstrum coefficients) corresponding to each frame acquired from speech data at predetermined time intervals, is widely used as an important technique in speech recognition.
[0003]
In short, this feature vector average normalization method obtains the average of the feature vectors at each time obtained from the speech data section to be average-normalized, and subtracts that average from the feature vector at each time to obtain normalized feature vectors.
[0004]
For example, suppose that feature vectors C1, C2, ..., CT (each consisting of, for example, 10-dimensional cepstrum coefficients) are obtained from the frames corresponding to times t = 1, 2, ..., T. If the sum of these feature vectors C1, C2, ..., CT is denoted Cs, then Cs is obtained by
[0005]
[Expression 1]
Cs = C1 + C2 + ... + CT (1)
[0006]
From the sum Cs of the feature vectors C1, C2, ..., CT thus obtained, the average value Cm is obtained by equation (2) below. Note that T in equation (2) is the total number of frames in the speech data to be average-normalized.
[0007]
Cm = Cs / T (2)
Then, by subtracting the average value Cm obtained in this way from each of the feature vectors C1, C2, ..., CT corresponding to the respective frames, the average-normalized (hereinafter simply "normalized") feature vector corresponding to each frame can be obtained.
[0008]
That is, if the normalized feature vectors corresponding to the frames are denoted C1′, C2′, ..., CT′, the normalized feature vector for each frame is given by C1′ = C1 − Cm, C2′ = C2 − Cm, ..., CT′ = CT − Cm.
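The conventional batch normalization described above can be sketched as follows. This is only an illustrative sketch: the function name `mean_normalize` and the toy two-dimensional frames are assumptions for the example, not part of the patent text.

```python
def mean_normalize(frames):
    """Subtract the per-dimension average Cm = Cs / T from every frame."""
    T = len(frames)
    dims = len(frames[0])
    # Cs: per-dimension sum over all T frames.
    cs = [sum(frame[d] for frame in frames) for d in range(dims)]
    # Cm: per-dimension average, as in equation (2).
    cm = [s / T for s in cs]
    # Ct' = Ct - Cm for every frame t (a new list is returned).
    return [[frame[d] - cm[d] for d in range(dims)] for frame in frames]


# Toy example with T = 2 frames of 2 dimensions (hypothetical values).
frames = [[1.0, 4.0], [3.0, 8.0]]
normalized = mean_normalize(frames)
```

Note that the whole utterance must be available before `cm` can be computed, which is the real-time limitation discussed in the text.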
[0009]
The normalized feature vectors (normalized cepstrum coefficients) obtained in this way are used to perform the computations required for speech recognition, for example obtaining output probabilities after preprocessing such as vector quantization.
[0010]
Needless to say, the calculation of the sum of feature vectors, the subtraction of feature vectors, the calculation of the average feature vector, the calculation of the normalized feature vectors, and so on are all performed separately for each dimension of the feature vector; that is, if the cepstrum coefficients representing the feature vector have, for example, 10 dimensions, each calculation is performed for each of dimensions 0 to 9.
[0011]
[Problems to be solved by the invention]
In order to obtain the average value Cm expressed by the above-described equation (2), feature vectors for all frames in the speech section to be normalized are required. As a result, the calculation cannot be performed until all frames of the input speech to be processed are input, and there is a problem that real-time processing cannot be performed.
[0012]
To solve this, there is a method in which the average value obtained from the feature vectors of the speech data one utterance before the current utterance data to be processed is subtracted sequentially from the feature vector corresponding to each frame of the current utterance data, thereby obtaining the normalized feature vector for each frame of the current utterance. This is described, for example, in Japanese Patent Laid-Open No. 9-90990, "Acoustic analysis method and apparatus for speech recognition" (hereinafter referred to as the first prior art).
[0013]
If this prior art is used, it is possible to perform voice recognition processing in real time.
[0014]
However, with this prior art, when there is a change of speaker between the current utterance and the preceding utterance, the average values before and after the change are likely to differ greatly, so the normalized feature vectors obtained with that average do not properly reflect the current utterance. This is because the average value obtained from the speech data uttered by a speaker (speech data in which phonemes are evenly represented) itself expresses speaker characteristics; if the speaker for whom the average was obtained differs from the speaker for whom the normalized feature vectors are being obtained, the normalized feature vectors are calculated with another speaker's average.
[0015]
By the way, when obtaining this normalized feature vector, in order to obtain an appropriate normalized feature vector, it is important that there is little fluctuation in noise and that the speech spoken by the speaker is evenly distributed.
[0016]
Here, speech that is evenly distributed means speech data that contains most phonemes, so the normalized feature vectors need to be obtained from past utterances of a certain duration.
[0017]
However, when speech recognition is performed using normalized feature vectors obtained from past utterances, there is a problem in that when the speaker is changed in the middle, the utterance characteristics of the speaker are different and the recognition performance is deteriorated.
[0018]
As a technique for dealing with such speaker changes, one conceivable method obtains normalized feature vectors for a plurality of specific speakers in advance, identifies the current speaker, and uses, from among the normalized feature vectors obtained in advance, the one corresponding to the identified speaker. This is described, for example, in the invention "Speech recognition apparatus and method" of Japanese Patent Laid-Open No. 10-254494 (hereinafter referred to as the second prior art).
[0019]
However, this second prior art requires normalized feature vectors of a plurality of specific speakers to be obtained in advance, requires processing to identify the specific speaker, and has difficulty handling a change to a new, previously unseen speaker.
[0020]
Therefore, an object of the present invention is to provide a speech recognition method and a speech recognition apparatus that use both the feature data of the current utterance and the feature data of past utterances, can cope with speaker changes without complicated speaker identification or speaker-group identification processing, and can obtain high recognition performance with inexpensive hardware.
[0021]
[Means for Solving the Problems]
In order to achieve the above object, the speech recognition method of the present invention performs feature analysis on the speech data of each frame at predetermined time intervals, average-normalizes the feature vectors obtained by the feature analysis, and performs speech recognition processing using the average-normalized feature vectors. In this method, a certain number of frames of the input speech data along the time axis is treated as one unit called a short-time block, and a predetermined number of short-time blocks belonging to the same speaker or speaker group, collected along the time axis, is treated as a long-time block corresponding to that speaker or speaker group. One short-time block, or a set of several short-time blocks belonging to the same speaker group, at the stage before this long-time block has been generated is treated as a long-time block candidate. For each short-time block obtained sequentially from the speech data, it is determined whether the feature vector of the short-time block belongs to at least one of the speakers or speaker groups that have already been generated. If the short-time block is determined to belong to at least one of them, the feature vector of the short-time block is used to update the feature vector of the long-time block or long-time block candidate corresponding to the speaker or speaker group to which it belongs, the feature vectors of the short-time block are average-normalized using the feature vector of the long-time block or long-time block candidate before or after the update, and the average-normalized feature vectors are passed to the speech recognition processing unit. If the short-time block is determined not to belong to any already generated speaker or speaker group, the short-time block is made a long-time block candidate and generation of a long-time block corresponding to a new speaker or speaker group is started; the short-time block is average-normalized using its own feature vector, and the average-normalized feature vectors are passed to the speech recognition processing unit.
[0022]
The speech recognition apparatus of the present invention likewise performs feature analysis on the speech data of each frame at predetermined time intervals, average-normalizes the feature vectors obtained by the feature analysis, and performs speech recognition processing using the average-normalized feature vectors. It comprises speech analysis means that performs feature analysis on the input speech and outputs feature vectors, feature vector average normalization means that average-normalizes the feature vectors output from the speech analysis means, and speech recognition processing means that receives the feature vectors average-normalized by the feature vector average normalization means and performs speech recognition processing using an acoustic model prepared in advance. The feature vector average normalization means treats a certain number of frames of the input speech data along the time axis as one unit called a short-time block, and treats a predetermined number of short-time blocks belonging to the same speaker or speaker group, collected along the time axis, as a long-time block corresponding to that speaker or speaker group. One short-time block, or a set of several short-time blocks belonging to the same speaker group, at the stage before this long-time block has been generated is treated as a long-time block candidate. For each short-time block obtained sequentially from the speech data, the means determines whether the feature vector of the short-time block belongs to at least one of the speakers or speaker groups that have already been generated. If the short-time block is determined to belong to at least one of them, the feature vector of the short-time block is used to update the feature vector of the long-time block or long-time block candidate corresponding to the speaker or speaker group to which it belongs, the feature vectors of the short-time block are average-normalized using the feature vector of the long-time block or long-time block candidate before or after the update, and the average-normalized feature vectors are passed to the speech recognition processing unit. If the short-time block is determined not to belong to any already generated speaker or speaker group, the short-time block is made a long-time block candidate and generation of a long-time block corresponding to a new speaker or speaker group is started; the short-time block is average-normalized using its own feature vector, and the average-normalized feature vectors are passed to the speech recognition processing unit.
[0023]
In each of these inventions, the short-time block consists of roughly the number of frames obtained from the speech data of one word of standard length, and the long-time block consists of roughly the number of frames obtained from speech data that contains all phonemes.
[0024]
The number of frames constituting the short-time block and the number of frames constituting the long-time block are each set to a power of 2.
[0025]
In addition, the speaker determination process, which determines whether the speech features of the short-time block belong to an already generated speaker or speaker group, is performed by calculating the distance between the average feature vector of the short-time block and the average feature vector of the long-time block or long-time block candidate corresponding to the already generated speaker or speaker group, and comparing that distance with a preset threshold.
[0026]
The threshold value is different depending on whether the comparison target of the short-time block is the long-time block or the long-time block candidate. In the case of a long-time block candidate, the distance as the threshold value is set larger than that in the case of a long-time block.
[0027]
The maximum number of speakers or speaker groups to be generated can be set in advance. After speakers or speaker groups have been generated up to that maximum, a subsequently acquired short-time block that is determined not to belong to any already generated speaker group is judged to belong to the speaker or speaker group whose speech features are nearest.
[0028]
In the speech recognition apparatus of the present invention, the acoustic model used by the speech recognition processing unit may be a standard acoustic model that can correspond to the average normalized feature vector of each speaker or speaker group.
[0029]
In the speech recognition apparatus of the present invention, the acoustic model used by the speech recognition processing unit may be a plurality of acoustic models prepared in advance for a plurality of speakers or speaker groups, and in that case, the feature vector average Information indicating the speaker or speaker group is input from the normalizing means, and speech recognition processing using an acoustic model corresponding to the speaker or speaker group is performed.
[0030]
As described above, for each short-time block obtained sequentially from the speech data, the present invention determines whether the feature vector of the short-time block belongs to at least one of the already generated speakers or speaker groups. If the short-time block is determined to belong to at least one of them, the feature vector of the short-time block is used to update the feature vector of the long-time block or long-time block candidate corresponding to the speaker or speaker group to which it belongs. The feature vectors of the short-time block are then normalized using the feature vector of the long-time block or long-time block candidate before or after the update, and speech recognition processing is performed using the normalized feature vectors.
[0031]
On the other hand, if it is determined that the short-time block does not belong to any already generated speaker or speaker group, the short-time block is made a long-time block candidate and generation of a long-time block corresponding to a new speaker or speaker group is started. The short-time block is then normalized using its own feature vector, and speech recognition processing is performed using the normalized feature vectors.
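The two branches described above can be sketched for a single short-time block as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the distance is assumed to be Euclidean (the text does not name a metric), normalization is assumed to subtract the reference average, a single threshold is used, the 64-block limit on a long-time block is ignored, and the names `process_block`, `groups`, `"sum"`, and `"frames"` are hypothetical.

```python
import math


def process_block(block, groups, threshold):
    """Assign one short-time block to a group and normalize its frames.

    block:  list of frames, each a list of floats
    groups: list of dicts {"sum": [...], "frames": int}, modified in place
    Returns (group index, normalized frames).
    """
    dims = len(block[0])
    n = len(block)
    block_sum = [sum(f[d] for f in block) for d in range(dims)]
    block_mean = [s / n for s in block_sum]

    # Find the closest existing group whose distance is below the threshold.
    best, best_dist = None, None
    for idx, g in enumerate(groups):
        g_mean = [s / g["frames"] for s in g["sum"]]
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(block_mean, g_mean)))
        if d < threshold and (best is None or d < best_dist):
            best, best_dist = idx, d

    if best is None:
        # No matching group: start a new candidate and normalize with the
        # block's own average.
        groups.append({"sum": block_sum, "frames": n})
        best = len(groups) - 1
        ref_mean = block_mean
    else:
        # Matching group: update it, then normalize with the updated average.
        g = groups[best]
        g["sum"] = [s + b for s, b in zip(g["sum"], block_sum)]
        g["frames"] += n
        ref_mean = [s / g["frames"] for s in g["sum"]]

    normalized = [[f[d] - ref_mean[d] for d in range(dims)] for f in block]
    return best, normalized


# Toy one-dimensional blocks of two frames each (hypothetical values).
groups = []
g_a, norm_a = process_block([[1.0], [3.0]], groups, threshold=1.5)
g_b, norm_b = process_block([[2.0], [4.0]], groups, threshold=1.5)
g_c, norm_c = process_block([[10.0], [12.0]], groups, threshold=1.5)
```

In this toy run the first two blocks fall into the same group and the third, whose average is far away, opens a second group.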
[0032]
By performing such processing, a speaker or a speaker group can be generated naturally without using special speaker identification / speaker group identification means, reflecting the characteristics of each speaker. Normalized feature vectors can be obtained, and speech recognition processing using normalized feature vectors reflecting the characteristics of each speaker can be performed, so that a high recognition rate can be obtained.
[0033]
The short-time block consists of roughly the number of frames that can be obtained from the speech data of one word of standard length, and the long-time block consists of roughly the number of frames obtained from speech data that contains all phonemes. As a specific example, the short-time block is 64 frames and the long-time block is 64 short-time blocks (4096 frames). Setting the short-time block and the long-time block to such numbers of frames allows efficient processing and lets the speech features be reflected appropriately.
[0034]
Further, since the number of frames constituting the short-time block and the number of frames constituting the long-time block are set to values that are powers of 2, respectively, division when obtaining an average feature vector or the like is performed by a computer. The calculation can be performed only by the right shift calculation, and the calculation can be simplified. As a result, the calculation burden on the CPU can be reduced, and a particularly great effect can be obtained when the present invention is applied to an inexpensive apparatus that must use a CPU with low processing capability.
[0035]
Further, the speaker determination is made by calculating the distance between the average feature vector of the short-time block and the average feature vector of the long-time block or long-time block candidate corresponding to an already generated speaker or speaker group, and comparing that distance with a preset threshold, so the speaker-group determination can be made with a simple calculation.
[0036]
In addition, the threshold used for the speaker determination differs depending on whether the short-time block is compared against a long-time block (a long-time block having the preset number of frames) or against a long-time block candidate; when the comparison target is a long-time block candidate, the distance used as the threshold is set larger than for a long-time block. As a result, when a new speaker group is being generated, the several short-time blocks acquired from, for example, one utterance are more readily judged to belong to the same speaker or speaker group until the long-time block is complete, which prevents an excessive number of speakers or speaker groups from being generated.
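The choice between the two thresholds can be illustrated with a small helper. The text only states that the candidate threshold is larger than the long-time block threshold; the concrete values `1.0` and `2.0` and the name `select_threshold` below are placeholders assumed for the example.

```python
LONG_BLOCK_SHORT_BLOCKS = 64  # short-time blocks in a complete long-time block


def select_threshold(num_short_blocks, dth_long=1.0, dth_candidate=2.0):
    """Return the distance threshold for a group of the given length.

    A complete long-time block gets the stricter (smaller) threshold;
    a long-time block candidate gets the looser (larger) one.
    """
    if num_short_blocks >= LONG_BLOCK_SHORT_BLOCKS:
        return dth_long
    return dth_candidate


threshold_for_long_block = select_threshold(64)
threshold_for_candidate = select_threshold(3)
```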
[0037]
In addition, the maximum number of speakers or speaker groups to be generated can be set in advance, and after the speaker or speaker group is generated up to the maximum number, short-time blocks acquired thereafter are already When it is determined that it does not belong to any of the generated speaker groups, it is determined that the speaker belongs to the speaker or speaker group having the closest voice feature. Thus, the present invention can be applied to a speech recognition apparatus in which several speakers or speaker groups are set in advance.
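A minimal sketch of the group-count limit follows, assuming a Euclidean distance and that a block within the threshold of several groups goes to the nearest one; neither detail is specified in the text, `max_groups` is assumed to be at least 1, and the names `assign_group` and `euclidean` are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_group(block_mean, group_means, threshold, max_groups):
    """Return the index of the group for a block, creating one if allowed.

    group_means is a list of average feature vectors, modified in place
    when a new group is created.
    """
    distances = [euclidean(block_mean, m) for m in group_means]
    if distances:
        nearest = min(range(len(distances)), key=distances.__getitem__)
        if distances[nearest] < threshold:
            return nearest  # belongs to an existing group
        if len(group_means) >= max_groups:
            return nearest  # limit reached: forced into the nearest group
    group_means.append(list(block_mean))
    return len(group_means) - 1


group_means = []
first = assign_group([0.0, 0.0], group_means, threshold=1.0, max_groups=2)
second = assign_group([10.0, 0.0], group_means, threshold=1.0, max_groups=2)
forced = assign_group([4.0, 0.0], group_means, threshold=1.0, max_groups=2)
```

Whether a forced block should also update the group's long-time block is a separate design choice and is left out of this sketch.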
[0038]
In addition, a speech recognition apparatus to which this speech recognition method is applied can generate speakers or speaker groups naturally, with a small amount of computation and a small amount of memory and without special speaker identification or speaker-group identification means, and can obtain normalized feature vectors reflecting the characteristics of each speaker, so the apparatus can combine inexpensive hardware with high recognition performance.
[0039]
A normalized standard acoustic model can be used as the acoustic model for the speech recognition processing, but a plurality of acoustic models corresponding to individual speakers or speaker groups can also be used.
[0040]
When a standard acoustic model is used, only one acoustic model is required, so component cost can be kept low and the apparatus can be made inexpensive. Moreover, no processing is needed each time to decide which acoustic model to use, so the amount of computation can be reduced.
[0041]
On the other hand, when acoustic models corresponding to individual speakers or speaker groups are used, a plurality of acoustic models must be prepared, but there is the advantage that higher recognition performance can be obtained.
[0042]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below.
[0043]
FIG. 1A shows a certain voice waveform. In such a voice waveform, for example, as shown in FIG. 1B, 20 msec is set as one frame, and each frame is shifted by 10 msec. Speech analysis is performed for each frame, and a feature vector (here, composed of 10-dimensional cepstrum coefficients) corresponding to each frame is obtained.
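The framing described above (20 msec frames shifted by 10 msec) can be sketched as follows. The 16 kHz sampling rate and the name `frame_starts` are assumptions for illustration; the text does not state a sampling rate.

```python
def frame_starts(num_samples, sample_rate=16000, frame_ms=20, shift_ms=10):
    """Return the start sample index of every full frame."""
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    shift = sample_rate * shift_ms // 1000      # samples per frame shift
    return list(range(0, num_samples - frame_len + 1, shift))


# One second of audio at the assumed 16 kHz rate.
starts = frame_starts(16000)
```

With these assumed values, one second of audio yields 99 overlapping frames, so a 64-frame short-time block spans roughly 0.65 seconds.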
[0044]
In this embodiment, as shown in FIG. 1(c), 64 frames are defined as one short-time block, and the short-time blocks are denoted A1, A2, .... Then, as shown in FIG. 1(d), a collection of 64 short-time blocks A1, A2, ... belonging to the same speaker or speaker group is treated as one long-time block, denoted B1. The long-time block B1 therefore consists of the 64 short-time blocks A1, A2, ..., A64, and because each short-time block has 64 frames, its total number of frames is 64 × 64 = 4096.
[0045]
For each of the short-time blocks A1, A2, ..., A64, the sum of the feature vectors of the frames (64 frames) belonging to that short-time block is obtained and stored each time the block is acquired.
[0046]
For example, for the short-time block A1, the sum of the feature vectors of the 64 frames belonging to A1 (denoted CsA1) is obtained and stored; for the short-time block A2, the sum of the feature vectors of the 64 frames belonging to A2 (denoted CsA2) is obtained and stored. In this way, the sums CsA1, CsA2, ..., CsA64 of the feature vectors of all frames in each of the short-time blocks A1, A2, ..., A64 are obtained and stored.
[0047]
An average feature vector is also obtained for each of the short-time blocks A1, A2, ..., A64. For example, for the short-time block A1, the sum CsA1 of its feature vectors is divided by the number of frames (64) to obtain the average feature vector (denoted CmA1); for the short-time block A2, the sum CsA2 is divided by the number of frames (64) to obtain the average feature vector (denoted CmA2). In this way, the average feature vectors CmA1, CmA2, ..., CmA64 are obtained for the respective short-time blocks.
[0048]
The average feature vectors CmA1, CmA2, ..., CmA64 obtained for the short-time blocks A1, A2, ..., A64 are used when obtaining the normalized feature vectors of the short-time blocks.
[0049]
The division used to obtain the average feature vector divides the sum of the feature vectors by the number of frames in the short-time block, as in equation (2). If the divisor is a power of 2, the division can be performed by a shift operation alone in computer arithmetic, so in the present invention the divisor is set to a power of 2. For example, when obtaining the average feature vector of one short-time block, the sum of the feature vectors is divided by the number of frames; since the number of frames here is 64, this condition is satisfied.
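The per-block sums and averages can be sketched as below. The function name `block_sums_and_means` and the toy frames are assumptions for illustration; frames left over after the last full block are simply ignored in this sketch.

```python
BLOCK_FRAMES = 64  # frames per short-time block, as in the embodiment


def block_sums_and_means(frames, block_frames=BLOCK_FRAMES):
    """Split frames into full short-time blocks; return their sums and means."""
    sums, means = [], []
    dims = len(frames[0])
    for start in range(0, len(frames) - block_frames + 1, block_frames):
        block = frames[start:start + block_frames]
        cs = [sum(f[d] for f in block) for d in range(dims)]  # e.g. CsA1
        sums.append(cs)
        means.append([s / block_frames for s in cs])          # e.g. CmA1
    return sums, means


# 130 toy frames of 2 dimensions: two full blocks plus 2 leftover frames.
frames = [[float(t), 1.0] for t in range(130)]
sums, means = block_sums_and_means(frames)
```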
[0050]
On the other hand, since the long time block B1 is composed of 4096 frames from the short time blocks A1 to A64 in this case, the sum of the feature vectors is the sum of the feature vectors of 4096 frames. Therefore, if the sum of the feature vectors up to the short time block A64 is currently obtained, the sum of the feature vectors of 4096 frames from the short time blocks A1 to A64 is the sum of the feature vectors of the long time block B1. CsB1.
[0051]
Then, an average feature vector CmB1 of the long-time block B1 is obtained. The average feature vector CmB1 of the long-time block B1 is obtained by dividing the sum CsB1 of the feature vectors obtained for the long-time block B1 by the number of frames (4096).
[0052]
Because the number of frames is again set to a power of 2 (4096 in this example), the division for obtaining the average feature vector CmB1 of the long-time block B1 can also be performed by a shift operation alone in computer arithmetic.
[0053]
In this embodiment as well, needless to say, the calculation of the sums of the feature vectors, the subtraction of feature vectors, the calculation of the average feature vectors, the calculation of the normalized feature vectors, the calculation of the distance between feature vectors described later, and so on are all performed for each dimension of the cepstrum coefficients representing the feature vector; that is, if the cepstrum coefficients have, for example, 10 dimensions numbered 0 to 9, each calculation is performed for each of dimensions 0 to 9. This point is not repeated in the description below.
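The shift-based division can be illustrated as follows, assuming the cepstrum coefficients are held as integers (for example in a fixed-point format); the text does not specify the number format, and the name `average_by_shift` is hypothetical.

```python
SHORT_SHIFT = 6   # 2**6 = 64 frames per short-time block
LONG_SHIFT = 12   # 2**12 = 4096 frames per long-time block


def average_by_shift(vector_sum, shift):
    """Divide each integer component by 2**shift using a right shift."""
    return [s >> shift for s in vector_sum]


# Hypothetical integer sums for a 3-dimensional feature vector.
cs = [6400, -1280, 100]
cm = average_by_shift(cs, SHORT_SHIFT)
```

In Python the right shift of a negative integer rounds toward negative infinity, the same as floor division; on other platforms the rounding of negative values may need separate attention.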
[0054]
Here, as shown in FIG. 2, assume that the normalization processing for the 64 short-time blocks A1 to A64 has already been completed and that one long-time block B1 (corresponding to speaker group G1) has thereby been generated. The processing performed on the speech data of one utterance V1 uttered after the long-time block B1 will now be described. It is assumed that three short-time blocks A65, A66, and A67 are acquired from the speech data of this utterance V1, and that speech data of fewer than 64 frames remains after them as a remainder e.
[0055]
For the 64 frames of the short time block A65, the sum CsA65 of the feature vectors of the short time block A65 is obtained as well as the average feature vector CmA65.
[0056]
When the sum CsA65 of the feature vectors and the average feature vector CmA65 have been obtained for the short-time block A65, the variation of the average feature vector CmA65, that is, how much the speech features of the short-time block A65 differ from the preceding speech features, is examined.
[0057]
The variation of the average feature vector CmA65 of the short-time block A65 is judged by obtaining the distance of CmA65 from the average feature vector of the long-time block corresponding to each speaker group generated so far, comparing that distance with a predetermined threshold Dth1, and checking whether the distance falls within the threshold.
[0058]
In the example of FIG. 2, since the speaker group already generated is only the speaker group G1, the average of the short-time block A65 with respect to the average feature vector CmB1 of the long-time block B1 corresponding to this speaker group G1. The distance DA65 of the feature vector CmA65 is obtained, and the determination is made by comparing the distance DA65 with a predetermined threshold value Dth1.
[0059]
If the comparison with the threshold Dth1 gives DA65 < Dth1, that is, if the distance DA65 of the average feature vector CmA65 of the short-time block A65 from the average feature vector CmB1 of the long-time block B1 is within the fixed range (threshold Dth1), it is determined that the current speech data has not changed greatly from the immediately preceding speech data (no speaker change is judged to have occurred at that point), and the existing long-time block B1 is updated. The updated long-time block is denoted B11.
[0060]
The updated long-time block B11 consists of the 64 short-time blocks from the short-time block A2 to the short-time block A65, and the sum CsB11 of the feature vectors of the long-time block B11 can be obtained by
CsB11 = CsB1 + CsA65 − CsA1 (3)
That is, the sum CsB11 of the feature vectors of the long-time block B11 is obtained by adding the sum CsA65 of the feature vectors of the newly obtained short-time block A65 to the sum CsB1 of the feature vectors of the long-time block B1, which consists of the 64 short-time blocks A1 to A64, and subtracting the sum CsA1 of the feature vectors of the short-time block A1.
[0061]
Once the sum CsB11 of the feature vectors of the long-time block B11 has been obtained in this way, the average feature vector CmB11 of the long-time block B11 can be obtained by
CmB11 = CsB11 / 4096 (4)
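Equations (3) and (4) amount to a sliding-window update, which might be sketched as follows. The function names and the toy two-dimensional sums are assumptions for illustration.

```python
def update_long_sum(cs_long, cs_new, cs_oldest):
    """Equation (3): add the newest block sum and drop the oldest one."""
    return [l + n - o for l, n, o in zip(cs_long, cs_new, cs_oldest)]


def long_mean(cs_long, n_frames=4096):
    """Equation (4): average over the frames of the long-time block."""
    return [s / n_frames for s in cs_long]


# Hypothetical 2-dimensional sums.
cs_b1 = [4096.0, 0.0]   # sum over A1..A64
cs_a1 = [64.0, 0.0]     # oldest block, leaves the window
cs_a65 = [128.0, 64.0]  # newest block, enters the window

cs_b11 = update_long_sum(cs_b1, cs_a65, cs_a1)
cm_b11 = long_mean(cs_b11)
```

Because only the stored per-block sums are needed, the update costs one addition and one subtraction per dimension instead of re-summing 4096 frames.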
[0062]
Normalized feature vectors are also obtained for each of the 64 frames of the short-time block A65 (denoted F1 to F64). If the feature vectors of the 64 frames F1 to F64 constituting the short-time block A65 are denoted CF1 to CF64 and their normalized feature vectors are denoted C′F1 to C′F64, these normalized feature vectors C′F1 to C′F64 can be obtained by equation (5):
[0063]
[Expression 2]
(5)
[0064]
[0065]
When obtaining this normalized feature vector, in this equation (5), the average feature vector CmB11 of the updated long time block B11 calculated by the short time blocks A2 to A65 is used as the average feature vector of the long time block. Although used, the average feature vector of the long-time block obtained immediately before that, that is, the average feature vector CmB1 of the long-time block B1 before update calculated by the short-time blocks A1 to A64 may be used. However, in this embodiment, the updated average feature vector is used.
[0066]
On the other hand, if the distance DA65 between the average feature vector CmA65 of the short-time block A65 and the average feature vector CmB1 of the long-time block B1 satisfies DA65 ≧ Dth1 with respect to the predetermined threshold value Dth1, that is, if the distance DA65 is at or beyond the certain range (the threshold value Dth1), it is determined that the current voice data has changed greatly from the immediately preceding voice data (that is, a speaker change is judged to have occurred at that point), and a new speaker group G2 is created.
[0067]
In this way, the short-time block A65 is processed after one long-time block (in this case, the long-time block B1) has been completed, and if a speaker change is determined from the distance between the average feature vector CmA65 of the short-time block A65 and the average feature vector CmB1 of the long-time block B1, a new speaker group G2 is created. In this example, it is assumed that a new speaker group G2 is generated by the short-time block A65.
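The threshold test can be sketched as follows. This passage does not restate how the distance is computed, so the Euclidean distance is assumed here purely for illustration; the function names are likewise hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two average feature vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def belongs_to_group(short_mean, group_mean, threshold):
    """True if the distance is below the threshold (no speaker change).

    A distance greater than or equal to the threshold is treated as a
    speaker change, matching the DA65 >= Dth1 case in the text.
    """
    return distance(short_mean, group_mean) < threshold
```

When belongs_to_group returns False for every existing speaker group, the short-time block would start a new speaker group.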
[0068]
When such a new speaker group is created, generation of a new long-time block of 4096 frames is started in the new speaker group (in this case, the speaker group G2), and the accumulation of short-time blocks belonging to the speaker group G2 and the normalization described above are performed sequentially. As short-time blocks belonging to the new speaker group G2 are accumulated, the speaker group G2 also eventually comes to consist of 64 short-time blocks.
[0069]
Note that, until a long-time block composed of 64 short-time blocks has been generated, a single short-time block or a set of short-time blocks belonging to the same speaker group is referred to as a long-time block candidate. Therefore, when a new speaker group G2 is generated by the short-time block A65 as in this example, the short-time block A65 becomes a long-time block candidate.
[0070]
The short-time block A65, for which a speaker change has been determined, is at this point the first short-time block after the speaker change. The short-time block A65 is therefore normalized using its own data. That is, normalization is performed by subtracting the average feature vector CmA65 of the short-time block A65 from the feature vector of each frame constituting the short-time block A65, using the same form of calculation as equation (5).
[0071]
Then, the next 64 frames are taken as the short-time block A66, and the sum CsA66 of the feature vectors and the average feature vector CmA66 are obtained for the 64 frames of the short-time block A66 in the same manner as above.
[0072]
Once the sum CsA66 of the feature vectors and the average feature vector CmA66 have been obtained for the short-time block A66, it is next determined which speaker group the voice features of the short-time block A66 belong to. That is, speaker group determination is performed.
[0073]
In this case, there are two speaker groups that have already been generated, that is, the speaker group G1 corresponding to the long-time block B1 and the speaker group G2 newly created by the short-time block A65. The speaker group is determined for each of the speaker groups G1 and G2.
[0074]
In this embodiment, the speaker group determination is performed by obtaining the distance between the average feature vector of the short-time block at that time and the average feature vector of the long-time block (speaker group) being compared, and determining whether that distance is larger than a predetermined threshold value. Strictly, each average feature vector and each obtained distance should therefore carry a specific symbol (for example, CmA66 for the average feature vector of the short-time block A66). In the following, however, such symbols are omitted unless particularly needed, and the description simply states whether the short-time block at that time belongs to a speaker group that has been created by that time.
[0075]
In this case, since it is an explanation of the short time block A66, it is determined whether or not this short time block A66 belongs to the speaker group G1, and subsequently whether or not the short time block A66 belongs to the speaker group G2. As a result of the determination, it is assumed that the short time block A66 is determined to belong to the speaker group G1 and not to the speaker group G2.
[0076]
FIG. 3 shows the speaker group determination result for each short-time block. In FIG. 3, ○ indicates that the block belongs to the speaker group and × indicates that it does not. At this stage, FIG. 3 shows that the short-time block A65 does not belong to the speaker group G1, so that a new speaker group G2 is generated, and that the short-time block A66 does not belong to the speaker group G2 but is determined to belong to the speaker group G1.
[0077]
In making this speaker group determination, the distance between the average feature vector of the short-time block A66 and the average feature vector of each of the speaker groups G1 and G2 is obtained and compared with a predetermined threshold value. If the distance is greater than or equal to the threshold value, the block is determined not to belong to that speaker group; if the distance is less than the threshold value, it is determined to belong. The threshold values used here (Dth1 for the comparison with the speaker group G1 and Dth2 for the comparison with the speaker group G2) differ in magnitude, with Dth1 < Dth2.
[0078]
This is because the average feature vector of the speaker group G1 is obtained from the 4096 frames of the long-time block B1; an average feature vector obtained from that many frames varies little and can be regarded as properly representing the features of the speech. In contrast, the average feature vector of the speaker group G2 is, in this case, obtained from only the 64 frames of the single short-time block A65. An average feature vector obtained from so few frames cannot be said to represent the features of the speech properly, so it is desirable to set its threshold value more loosely than in the 4096-frame case.
[0079]
Thus, in the present invention, the threshold value is set according to the number of short-time blocks constituting the speaker group. One conceivable way of doing this is to set a plurality of threshold values in steps, specifying which threshold value is used up to which number of short-time blocks.
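A stepwise threshold of this kind can be sketched with a lookup table. The step boundaries and threshold values below are entirely hypothetical (the text gives no numbers); only the ordering, looser for groups with few short-time blocks and tighter for a full long-time block, follows the description.

```python
import bisect

# Hypothetical steps: up to 1 block, up to 8 blocks, up to 32 blocks, more.
STEP_LIMITS = [1, 8, 32]
# Hypothetical thresholds, loosest first, consistent with Dth1 < Dth2.
STEP_THRESHOLDS = [4.0, 3.0, 2.0, 1.0]


def threshold_for(n_short_blocks):
    """Return the threshold for a speaker group holding n_short_blocks blocks."""
    step = bisect.bisect_left(STEP_LIMITS, n_short_blocks)
    return STEP_THRESHOLDS[step]
```

Under these assumed values, a speaker group made of a single short-time block is compared against 4.0, while a full 64-block long-time block is compared against 1.0.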
[0080]
In this case, because the example concerns a single utterance, the probability that the short-time block A66 belongs to the same speaker group as the short-time block A65 is extremely high. In addition, so as not to increase the number of speaker groups unnecessarily before a long-time block of 4096 frames has been generated, the setting is such that, when a speaker group contains only a small number of short-time blocks, a short-time block readily joins the same speaker group as the one generated immediately before.
[0081]
Accordingly, it is highly likely that the short-time block A66 also belongs to the same speaker group G2 as the short-time block A65. Here, however, for convenience of explanation, an example is deliberately given in which the short-time block A66 does not belong to the speaker group G2 of the short-time block A65. Although not taken up as an example here, the short-time block A66 may also belong to neither of the speaker groups G1 and G2, in which case a new speaker group G3 is generated.
[0082]
As described above, when it is determined that the short-time block A66 belongs to the speaker group G1 corresponding to the long-time block B1, in this case, the process of updating the long-time block B1 is performed as described above.
[0083]
That is, using equation (3) above, the sum CsB11 of the feature vectors of the updated long-time block (denoted B11) is obtained by adding the newly obtained feature vector sum CsA66 of the short-time block A66 to the feature vector sum CsB1 of the long-time block B1, which consists of the 64 short-time blocks from A1 to A64, and subtracting the feature vector sum CsA1 of the short-time block A1; that is, CsB11 = CsB1 + CsA66 − CsA1. The average feature vector CmB11 is then obtained by CmB11 = CsB11 / 4096, as in equation (4).
[0084]
The normalization of the short-time block A66 is performed according to equation (5) using the average feature vector CmB11 of the updated long-time block B11, which gives the normalized feature vector of each frame of the short-time block A66.
[0085]
Next, the 64 frames following the short-time block A66 are taken as the short-time block A67, and the sum of the feature vectors and the average feature vector are obtained for the 64 frames of the short-time block A67 in the same manner as above.
[0086]
Once the sum of the feature vectors and the average feature vector have been obtained for the short-time block A67, it is next determined how far the voice features of the short-time block A67 differ from those of the already generated speaker groups.
[0087]
Also in this case, there are two speaker groups that have already been generated, namely, a speaker group G1 corresponding to the long-time block B1 and a speaker group G2 newly created by the short-time block A65. The speaker group is determined for each of the speaker groups G1 and G2.
[0088]
In this case as well, it is determined whether the short-time block A67 belongs to the speaker group G1 and then whether it belongs to the speaker group G2. Assume that the short-time block A67 is determined to belong to the speaker group G2 and not to the speaker group G1 (see FIG. 3). If it belonged to neither speaker group, a new speaker group G3 would be generated.
[0089]
As described above, when the short-time block A67 is determined to belong to the speaker group G2 newly generated by the short-time block A65, the long-time block candidate in the speaker group G2 (at this point, the short-time block A65) is updated. In this update, the sum of the feature vectors of the short-time blocks A65 and A67 is obtained, and the average feature vector is obtained from it. At this stage, the short-time blocks A65 and A67 together form the long-time block candidate of the speaker group G2.
[0090]
In this way, the short-time blocks determined as the same speaker group G2 are sequentially accumulated, so that the long-time block of the speaker group G2 composed of 64 new short-time blocks (4096 frames) is obtained. Generated.
[0091]
The normalization of the short-time block A67 is performed by obtaining the sum of the feature vectors of the two short-time blocks A65 and A67 that make up the long-time block candidate of the speaker group G2, obtaining their average feature vector, and applying that average feature vector in the calculation of equation (5). The normalized feature vector of each frame of the short-time block A67 is thereby obtained.
[0092]
This completes the processing of the one utterance V1 shown in FIG. 2 in units of 64 frames (the processing of the short-time blocks A65, A66, and A67). The remainder e shown in FIG. 2 is left over.
[0093]
This remainder e is a portion with too few frames to form one short-time block (64 frames). For the remainder e, therefore, the average feature vector that was used to compute the normalized feature vectors of the immediately preceding short-time block (the short-time block A67 in the example of FIG. 2) is used to compute the normalized feature vectors of the speech section of the remainder e. The remainder e can be handled in either of two ways: it is not used in calculating the average feature vector of the long-time block, or it is combined with the next input voice data to form one short-time block, which is then used in the calculation for the long-time block.
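Splitting one utterance into 64-frame short-time blocks plus a remainder can be sketched as below. The function name and the representation of frames as list items are assumptions made for illustration; what is done with the remainder afterwards (discarding it from the long-time statistics or prepending it to the next input) is left to the caller.

```python
def split_into_short_blocks(frames, block_len=64):
    """Split an utterance into full short-time blocks and a remainder.

    Returns (blocks, remainder), where remainder holds the trailing
    frames that are too few to form one short-time block.
    """
    n_full = len(frames) // block_len
    blocks = [frames[i * block_len:(i + 1) * block_len] for i in range(n_full)]
    remainder = frames[n_full * block_len:]
    return blocks, remainder
```

An utterance of 200 frames, for instance, yields three short-time blocks and a remainder of 8 frames, which corresponds to the A65, A66, A67 plus remainder e layout of the example.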
[0094]
As described above, after a certain long-time block B1 is generated, processing is performed when speech data for one utterance V1 is input. In this example, as shown in FIG. 3, a new speaker group G2 is generated by the short time block A65, and the next short time block A66 is determined to belong to the speaker group G1 corresponding to the long time block B1. Thus, the long time block B1 is updated, and the next short time block A67 is determined to belong to the speaker group G2 generated by the short time block A65.
[0095]
When the voice data for the next utterance V2 is input, the same processing as for the previous utterance V1 is performed. That is, if an individual short-time block acquired from the utterance V2 is determined to belong to one of the already generated speaker groups, the long-time block (or long-time block candidate) generated by that time for that speaker group is updated; if it is determined not to belong to any speaker group, a new speaker group is generated.
[0096]
As for the normalized feature vectors of each short-time block: when the short-time block is determined to belong to an already generated speaker group, normalization uses the average feature vector obtained after updating the long-time block or long-time block candidate of that speaker group; when the short-time block is determined not to belong to any already generated speaker group, normalization uses the average feature vector of the short-time block itself.
[0097]
By performing the above processing sequentially on the input voice data, a newly generated speaker group (the speaker group G2 in the above example) accumulates short-time blocks one after another, and a new long-time block is generated when their number reaches 64.
[0098]
The long-time block B1 corresponding to the already generated speaker group G1 is updated with a short-time block each time a short-time block belonging to the speaker group G1 appears.
[0099]
Then, a normalized feature vector as described above is calculated for each short-time block, and the normalized feature vector is passed to the speech recognition unit for speech recognition processing.
[0100]
FIG. 4 is a flowchart explaining the processing procedure described above, that is, the processing in which the input voice data is processed sequentially for each short-time block until the normalized feature vectors are obtained and passed to the speech recognition processing unit. Since the details have already been described, only the overall procedure is described here.
[0101]
First, when 64 frames of input voice data are acquired as short-time blocks, the short-time blocks are compared with all the speaker groups that have already been generated (step s1). This process calculates the sum of the feature vectors of the short-time block and the average feature vector, calculates the distance between the average feature vector and the average feature vector of all the speaker groups already generated, and calculates the distance. This is a process of comparing with a preset threshold value, and as a result of the comparison, it is determined whether or not the short-time block belongs to any group (step s2).
[0102]
If it is determined by the speaker group determination that the short-time block does not belong to any of the already generated speaker groups, a new speaker group is created (step s3). The short time block is taken as a long time block candidate, and the sum of the short time block feature vectors as the long time block candidate and the average feature vector are obtained. Then, normalization processing is performed on the short-time block (step s4), a normalized feature vector for each frame is obtained, and the normalized feature vector is passed to the speech recognition processing unit (step s5).
[0103]
On the other hand, if it is determined in step s2 that the short-time block belongs to one of the already generated speaker groups, it is then determined which speaker group the short-time block belongs to, that is, speaker group selection is performed (step s6).
[0104]
This speaker group selection is, for example, processing for selecting any one speaker group from the plurality of speaker groups when a plurality of speaker groups are generated at that time.
[0105]
Two cases are possible here: the short-time block belongs to only one of the plurality of speaker groups, or it belongs to several speaker groups (possibly all of them).
[0106]
If the short-time block is determined to belong to only one speaker group, that speaker group is selected. If it is determined to belong to a plurality of speaker groups, a selection criterion is set in advance, such as (a) selecting the speaker group closest in time, (b) selecting the speaker group in which the longest long-time block has been generated, or (c) selecting the speaker group whose average feature vector is closest in distance, and one speaker group is selected according to that criterion. For (b) and (c), if the lengths or the distances are equal, criterion (a) is used as the tie-breaker.
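The three selection criteria and the tie-breaking rule can be sketched as a single key function. The dictionary keys used here ('last_used', 'n_blocks', 'distance') are hypothetical bookkeeping fields, and "closest in time" is interpreted as the most recently updated speaker group, which is an assumption.

```python
def select_group(candidates, criterion="a"):
    """Pick one speaker group among those the short-time block belongs to.

    candidates: list of dicts with hypothetical keys
      'name', 'last_used' (larger = more recent), 'n_blocks', 'distance'.
    criterion: 'a' most recent, 'b' longest block, 'c' smallest distance.
    Ties under 'b' and 'c' fall back to criterion 'a'.
    """
    if criterion == "a":
        key = lambda g: g["last_used"]
    elif criterion == "b":
        key = lambda g: (g["n_blocks"], g["last_used"])
    elif criterion == "c":
        key = lambda g: (-g["distance"], g["last_used"])
    else:
        raise ValueError("unknown criterion: %r" % criterion)
    return max(candidates, key=key)
```

Encoding the fallback as the second element of the sort key keeps the tie-break to criterion (a) in one place rather than in separate branches.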
[0107]
For example, if selection criterion (a) has been set for the case of belonging to a plurality of speaker groups, the speaker group closest in time to the short-time block is selected. The long-time block or long-time block candidate corresponding to the selected speaker group is then updated, that is, its sum of feature vectors and its average feature vector are updated (step s7). Since this update process has already been described, its description is omitted here.
[0108]
When this update of the long-time block or long-time block candidate is completed, the processing of step s4 is performed, that is, normalization is applied to the short-time block and a normalized feature vector is obtained for each frame. This normalization uses the average feature vector of the selected speaker group (in this embodiment, the updated average feature vector). Since the normalization processing has already been described, its description is omitted here.
[0109]
The above processing is performed for each short time block. In the example shown in FIG. 3, the short-time block A65 does not belong to the already generated speaker group G1, so a new speaker group G2 is generated (steps s2 and s3), and its own data is used. After generating the normalized feature vector, the normalized feature vector is passed to the speech recognition processing unit (steps s4 and s5).
[0110]
Since the short-time block A66 belongs to the already generated speaker group G1, speaker group selection is performed to decide which speaker group it belongs to (in this case, the speaker group G1 is selected because the block belongs only to the speaker group G1) (step s6), and the speaker group G1 is updated (step s7). Normalization is then performed using the average feature vector (the updated average feature vector) of the speaker group G1, and the normalized feature vectors are passed to the speech recognition processing unit (steps s4 and s5).
[0111]
Further, since the short-time block A67 belongs to the speaker group G2 generated by the short-time block A65 (at this point the speaker group G2 is an already generated speaker group), speaker group selection is performed to decide which speaker group it belongs to (in this case, the speaker group G2 is selected because the block belongs only to the speaker group G2) (step s6), and the speaker group G2 is updated (step s7). Because 64 short-time blocks have not yet been accumulated at this point, the update of the speaker group G2 consists of obtaining the sum of the feature vectors of the long-time block candidate made up of the short-time blocks A65 and A67 and obtaining its average feature vector.
[0112]
Then, normalization is performed using the average feature vector (the updated average feature vector) of the speaker group G2 (the long-time block candidate consisting of the short-time blocks A65 and A67), and the normalized feature vectors are passed to the speech recognition processing unit (steps s4 and s5).
[0113]
The sequential processing described in the flowchart of FIG. 4, in which short-time blocks are acquired one after another from the input voice data and each is processed in turn to obtain its normalized feature vectors, is simple, because the input voice data need only be processed in order without regard to how speech segments connect. It also has the advantages that the amount of data to be held can be reduced and that faster processing is possible.
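The sequential procedure of FIG. 4 can be sketched end to end as follows. This is an illustrative sketch under several assumptions, not the patent's implementation: the Euclidean distance, the Group class, the threshold_for callable (number of short-time blocks to threshold), and the use of selection criterion (c) in step s6 are all choices made for the example. For brevity the group sum is recomputed on each call instead of being updated incrementally as in equation (3).

```python
import math

BLOCKS_PER_LONG = 64  # short-time blocks per long-time block


def vec_sum(vectors):
    """Per-dimension sum of a list of equal-length vectors."""
    return [sum(col) for col in zip(*vectors)]


def dist(u, v):
    """Euclidean distance (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


class Group:
    """Hypothetical speaker group: long-time block or long-time block candidate."""

    def __init__(self, block_sum, n_frames):
        self.sums = [(block_sum, n_frames)]  # per-block (sum, frame count)

    def add(self, block_sum, n_frames):
        self.sums.append((block_sum, n_frames))
        if len(self.sums) > BLOCKS_PER_LONG:
            self.sums.pop(0)  # drop the oldest short-time block

    def mean(self):
        total = vec_sum([s for s, _ in self.sums])
        n = sum(k for _, k in self.sums)
        return [t / n for t in total]


def process_block(frames, groups, threshold_for):
    """Process one short-time block; returns (group, normalized frames)."""
    block_sum = vec_sum(frames)
    block_mean = [s / len(frames) for s in block_sum]
    # s1, s2: compare with every already generated speaker group
    hits = [g for g in groups
            if dist(block_mean, g.mean()) < threshold_for(len(g.sums))]
    if not hits:
        group = Group(block_sum, len(frames))  # s3: new speaker group
        groups.append(group)
    else:
        # s6: selection, here by smallest distance (criterion (c), assumed)
        group = min(hits, key=lambda g: dist(block_mean, g.mean()))
        group.add(block_sum, len(frames))      # s7: update
    mean = group.mean()                        # s4: updated mean
    normalized = [[c - m for c, m in zip(f, mean)] for f in frames]
    return group, normalized                   # s5: hand to the recognizer
```

Starting from an empty groups list, the first short-time block always creates a speaker group and is normalized with its own mean, which matches the behavior described for a block that belongs to no existing group.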
[0114]
By the way, the explanation so far has been an example in which short-time blocks are acquired from the input voice data of one utterance, a speaker group determination is made for each acquired short-time block, and, based on the result, each short-time block is processed sequentially so as to generate a new speaker group or update an already generated one. Alternatively, after the speaker group determination has been made for each short-time block, if the results are inconsistent within the one utterance (that is, the short-time blocks are not all assigned to the same speaker group), the average feature vector of the voice data of the entire utterance can be used to determine which speaker group the voice data of the one utterance belongs to.
[0115]
If one utterance is regarded as inherently the utterance of one speaker, it is reasonable for the short-time blocks A65, A66, and A67 in the one utterance V1 in the example of FIG. 2 to belong to the same speaker group. As described above, however, the per-block speaker group determination may assign only the short-time block A66 to a different speaker group (see FIG. 3). Although not shown in the above example, various other inconsistencies that are implausible within one utterance can also occur, such as the short-time block A65 belonging to the speaker group G2, the short-time block A66 belonging to the speaker group G1, and the short-time block A67 belonging to neither.
[0116]
In such a case, after waiting for the end of one utterance, an average feature vector for the one utterance is obtained, and the average feature vector for the one utterance and a long time block or a long time corresponding to the already generated speaker group are obtained. The average feature vector of the block candidate is compared.
[0117]
For example, referring to the example of FIG. 2, after waiting for the end of one utterance V1, the average feature vector of the entire one utterance V1 is obtained, and the average feature vector of the entire one utterance V1 and the long time corresponding to the speaker group G1 are obtained. The average feature vector of the block B1 is compared. As a result, if it is determined that the voice data of the one utterance V1 belongs to the speaker group G1, the long-time block B1 of the speaker group G1 is updated. If it is determined that the voice data of the utterance V1 does not belong to the speaker group G1, a new speaker group G2 is generated from the voice data of the entire one utterance V1.
[0118]
When the voice data of the one utterance V1 is determined to belong to the speaker group G1 and the long-time block B1 of the speaker group G1 is updated, the short-time blocks among A65, A66, and A67 that were individually determined not to belong to the speaker group G1 (the short-time blocks A65 and A67 in the above example) may be excluded from the update of the long-time block B1. Alternatively, the short-time blocks A65 and A67 may be actively used for the update of the long-time block B1.
[0119]
The advantage of not using short-time blocks that do not belong to the speaker group G1 for updating the long-time block B1 of the speaker group G1 is as follows. The voice data of such a short-time block may have been affected by a sudden disturbance such as noise (for example, voice data spoken in a markedly different manner than before, or voice data captured when the background noise changed abruptly). In that case, the sudden disturbance is prevented from affecting the data (average feature vector) of the long-time block B1 corresponding to the speaker group G1. In other words, the influence of sudden noise and the like can be eliminated.
[0120]
On the other hand, the advantage of actively using the short-time blocks A65 and A67, which were judged not to belong to the speaker group G1, for the update is that the system can adapt to changes in the situation at hand, for example when the same speaker gradually changes the way of speaking, or when the speaker changes abruptly within the same speaker group.
[0121]
If the one utterance V1 as a whole is determined not to belong to the speaker group G1, then even if the short-time blocks constituting the utterance V1 include one that was provisionally determined to belong to the speaker group G1 (the short-time block A66 in the example of FIG. 3), that short-time block A66 is not used for updating the speaker group G1, because the utterance itself does not belong to the speaker group G1.
[0122]
FIG. 5 is a flowchart for explaining the above-described processing, that is, the processing procedure when the speaker group determination is performed after waiting for one utterance for the input voice data. Since the specific description has already been made, only the overall processing procedure will be described here.
[0123]
First, it is determined whether the speaker group determination, which obtains the sum of the feature vectors and the average feature vector of each short-time block and determines which speaker group each short-time block belongs to, has been completed for one utterance of the input voice data (step s11). If it has not, a short-time block is acquired from the voice data, the sum of the feature vectors and the average feature vector are obtained for the acquired short-time block, and the speaker group determination is performed (step s12); the process then returns to step s11 to determine whether the processing for the one utterance has been completed.
[0124]
When it is determined that the speaker group determination has been completed for the voice data of the one utterance, it is determined whether the speaker group determination results of the individual short-time blocks acquired from that utterance are consistent within the utterance (step s13).
[0125]
If this consistency determination finds that the speaker group determination is consistent within the one utterance, processing follows the determination result: if the voice data of the utterance belongs to one of the already generated speaker groups, the long-time block or long-time block candidate of that speaker group is updated, and if it belongs to none of them, a new speaker group is generated (step s14). Normalization is then performed for each short-time block belonging to the utterance (step s15), a normalized feature vector is obtained for each frame, and the normalized feature vectors are passed to the speech recognition processing unit (step s16). Since the normalization process has already been described, its description is omitted here.
[0126]
On the other hand, when the determination in step s13 is determined to be inconsistent as one utterance, an average feature vector of the entire plurality of short-time blocks acquired from the one utterance is calculated (step s17). Using the average feature vector for one utterance, comparison is made with all the already generated speaker groups (step s18), and as a result of the comparison, voice data for the one utterance has already been generated. It is determined whether it belongs to any speaker group (step s19).
[0127]
If it is determined that the voice data of the one utterance does not belong to any group, a new speaker group is created and taken as a long-time block candidate, and the sum of the feature vectors and the average feature vector of the individual short-time blocks belonging to the utterance are obtained (step s20). Normalization is then performed for each short-time block of the utterance (step s15), a normalized feature vector is obtained for each frame, and the normalized feature vectors are passed to the speech recognition processing unit (step s16).
[0128]
On the other hand, if the determination in step s19 finds that the voice data belongs to one of the already generated speaker groups, it is determined which of the speaker groups generated by that time the short-time blocks of the one utterance belong to, that is, speaker group selection is performed (step s21). Since the speaker group selection can be performed in the same manner as in step s6 of FIG. 4, its description is omitted here.
[0129]
And, for example, when belonging to a plurality of speaker groups, if the selection criterion (a) described above is set, the speaker group closest in time is selected for the short-time block, The long time block corresponding to the speaker group is updated, that is, the sum of the feature vectors and the average feature vector of the long time block corresponding to the speaker group are updated (step s22).
[0130]
Since the update process in step s22 and the update process in step s14 have already been described, the description thereof is omitted here. Then, the process of step s15, that is, the normalization process is performed for each short-time block for one utterance to obtain a normalized feature vector for each frame. In this embodiment, the normalization process is performed using the updated average feature vector of the selected speaker group.
[0131]
The method shown in FIG. 5 determines for each utterance which speaker group the utterance belongs to, and obtains a normalized feature vector based on the determination result. Compared to sequential processing, the amount of data to be held is slightly increased and the processing time is also slightly increased, but more appropriate speaker group determination can be performed and a higher recognition rate can be obtained.
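The per-utterance decision of FIG. 5 can be sketched as follows. This is a simplified illustration: the function name, the belongs callable, and the representation of speaker groups as a name-to-mean dictionary are assumptions, and the selection step s21 is reduced to taking the first matching group rather than applying criteria (a) to (c).

```python
def decide_utterance_group(block_decisions, utterance_mean, groups, belongs):
    """Decide the speaker group of a whole utterance.

    block_decisions: per-block result, a group name or None (no group).
    utterance_mean: average feature vector of the entire utterance.
    groups: dict mapping group name to its average feature vector.
    belongs: hypothetical callable (mean_a, mean_b) -> bool.
    Returns a group name, or None when a new group should be created.
    """
    if len(set(block_decisions)) == 1:
        # s13, s14: per-block results agree, so follow them directly
        return block_decisions[0]
    # s17 to s19: results disagree, so compare the whole-utterance mean
    hits = [name for name, mean in groups.items()
            if belongs(utterance_mean, mean)]
    # s21 (simplified to the first hit) or s20 (None means new group)
    return hits[0] if hits else None
```

Returning None for both the consistent "no group" case and the fallback "no group" case lets the caller create a new speaker group through a single code path.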
[0132]
As described above, in this embodiment the average feature vector of a short-time block acquired from the voice data is compared with the average feature vectors of the long-time blocks or long-time block candidates corresponding to all the already generated speaker groups. If the short-time block belongs to none of the speaker groups, a new speaker group is created; if it belongs to at least one of them, one speaker group is selected and the long-time block or long-time block candidate corresponding to it is updated. Speaker groups reflecting the characteristics of each speaker are thus generated automatically, and normalized feature vectors are generated for each speaker group, so that speech recognition reflecting the characteristics of each speaker can be performed.
[0133]
Moreover, since the comparison between the average feature vector of the short-time block and the already generated speaker groups is performed by a simple distance calculation, speaker changes can be handled easily without performing complicated speaker identification processing or speaker group identification processing.
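The distance comparison described above can be illustrated with a minimal Python sketch. The Euclidean distance, the function name `matching_groups`, the two-dimensional vectors, and the threshold value are all assumptions made for illustration; the description only states that a simple distance is compared with a threshold.

```python
import math

def matching_groups(block_mean, group_means, threshold):
    """Return the indices of the speaker groups whose average feature
    vector lies closer than `threshold` to the block's average vector.
    An empty list means that a new speaker group would be created."""
    return [i for i, g in enumerate(group_means)
            if math.dist(block_mean, g) < threshold]

# Hypothetical 2-dimensional average feature vectors for groups G1 and G2.
group_means = [[0.0, 0.0], [5.0, 5.0]]

near_g1 = matching_groups([0.5, 0.2], group_means, threshold=1.0)
far_away = matching_groups([9.0, 9.0], group_means, threshold=1.0)
```

In this sketch `near_g1` contains only the index of G1, while `far_away` is empty, which corresponds to the case in which a new speaker group is generated.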
[0134]
When performing the above-described processing, the sums of the feature vectors and the average feature vectors obtained in advance from long-time blocks of a plurality of speaker groups (for example, a male speaker group, a female speaker group, a child speaker group, etc.) may be held as initial values.
[0135]
In the description so far, it has been assumed that the processing from the short-time block A1 to the short-time block A64 (including the normalization of each short-time block) has already been completed and that one speaker group G1 (long-time block B1) has been generated from these 64 short-time blocks, and the processing for the speech data of one utterance uttered after the short-time block A64 has been described. The processing that starts from the state in which speech data is newly input, that is, the case where no speaker group exists, is described below with reference to FIG. 6. Here, the sequential processing method shown in the flowchart of FIG. 4 will be described.
[0136]
FIG. 6(a) shows an input speech waveform. From this speech waveform, short-time blocks A1, A2, A3, ... of 64 frames each are sequentially acquired along the time axis t, as shown in FIG. 6(b), and each acquired short-time block is processed in the procedure described in the flowchart of the sequential processing method.
[0137]
First, the short-time block A1 is acquired, the sum of the feature vectors and the average feature vector of the short-time block A1 are obtained, and it is determined to which of the already generated speaker groups the short-time block A1 belongs. At this time, since no speaker group has been generated yet, the short-time block A1 is set as a new speaker group G1, as shown in FIG. 6(c), and its sum of feature vectors and average feature vector are stored as necessary. Further, the short-time block A1 is normalized using its own sum of feature vectors and average feature vector, and the result is passed to the speech recognition processing unit. At this point, the short-time block A1 is a long-time block candidate in the speaker group G1.
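As a minimal sketch of this first step, the following Python fragment computes the sum and the average feature vector of one short-time block and subtracts the average from every frame. The function names and the toy two-frame, two-dimensional block are assumptions for illustration; an actual short-time block in this embodiment holds 64 frames of, for example, 10-dimensional vectors.

```python
def block_statistics(frames):
    """Return (sum vector, average vector) over the frames of one block."""
    total = [sum(dim) for dim in zip(*frames)]
    mean = [t / len(frames) for t in total]
    return total, mean

def normalize(frames, mean):
    """Subtract the average feature vector from every frame, per dimension."""
    return [[x - m for x, m in zip(frame, mean)] for frame in frames]

frames = [[1.0, 4.0], [3.0, 8.0]]   # toy block: two 2-dimensional frames
total, mean = block_statistics(frames)
normalized = normalize(frames, mean)
```

Here the block is normalized with its own average, which is the situation of the short-time block A1 when no speaker group exists yet.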
[0138]
Next, the short-time block A2 is acquired, the sum of the feature vectors and the average feature vector of the short-time block A2 are obtained, and it is determined to which of the already generated speaker groups the short-time block A2 belongs. At this time, the only speaker group already generated is the speaker group G1 generated from the short-time block A1, so it is determined whether or not the short-time block A2 belongs to this speaker group G1.
[0139]
If it is determined that the short-time block A2 belongs to the speaker group G1, the speaker group G1 is updated (in this case, the sum of the feature vectors of the short-time block A2 is added to the sum of the feature vectors of the short-time block A1, and the average feature vector is obtained from the result). The updated data is used as the long-time block candidate, and the short-time block A2 is normalized using the average feature vector of this long-time block candidate.
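The update of a speaker group can be sketched as a running sum, as below. The class `SpeakerGroup` and its attribute names are hypothetical; the sketch only assumes, as in the description, that each short-time block contributes the sum of its feature vectors and that every block has the same number of frames.

```python
class SpeakerGroup:
    """Running sum of feature vectors for one speaker group (hypothetical)."""

    def __init__(self, dim, frames_per_block=64):
        self.total = [0.0] * dim          # sum of all feature vectors so far
        self.num_blocks = 0               # number of accumulated short-time blocks
        self.frames_per_block = frames_per_block

    def add_block(self, block_sum):
        """Add the feature-vector sum of one short-time block to the group."""
        self.total = [t + s for t, s in zip(self.total, block_sum)]
        self.num_blocks += 1

    def mean(self):
        """Average feature vector over every frame accumulated so far."""
        n = self.num_blocks * self.frames_per_block
        return [t / n for t in self.total]

g1 = SpeakerGroup(dim=2)
g1.add_block([64.0, 128.0])    # stands in for the sum over short-time block A1
g1.add_block([192.0, 0.0])     # stands in for the sum over short-time block A2
```

After the two calls, `g1.mean()` is the average over 128 frames, which is the value that would be used to normalize the second block in this example.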
[0140]
On the other hand, if it is determined that the short-time block A2 does not belong to the speaker group G1, the short-time block A2 is set as a new speaker group G2, and its sum of feature vectors and average feature vector are stored as necessary. In this case, the short-time block A2 is normalized using its own sum of feature vectors and average feature vector, and the result is passed to the speech recognition processing unit. Here, it is assumed that a new speaker group G2 is generated from the short-time block A2. At this point, the short-time block A2 is a long-time block candidate in the speaker group G2.
[0141]
Then, the short-time block A3 is acquired, the sum of the feature vectors and the average feature vector of the short-time block A3 are obtained, and it is determined to which of the already generated speaker groups the short-time block A3 belongs. At this time, the already generated speaker groups are the speaker group G1 generated from the short-time block A1 and the speaker group G2 generated from the short-time block A2. Therefore, speaker group determination is performed against both of these speaker groups G1 and G2.
[0142]
Here, if it is determined that the block belongs to both, and the selection criterion is set to (a) described above, that is, to select the speaker group that is closer in time, it is determined in this case that the block belongs to the speaker group G1 as the group closer in time. If the block belongs to neither, a new speaker group is generated, but here it is assumed that the block is determined to belong to the speaker group G1. As a result, the speaker group G1 is updated; that is, the long-time block candidate that so far consisted of the short-time block A1 is updated. At this point, the short-time blocks A1 and A3 form the long-time block candidate of the speaker group G1. The short-time block A3 is then normalized using the average feature vector of the long-time block candidate composed of the short-time blocks A1 and A3, and the normalized feature vector is passed to the speech recognition processing unit.
[0143]
Then, the short-time block A4 is acquired, the sum of the feature vectors and the average feature vector of the short-time block A4 are obtained, and it is determined to which of the already generated speaker groups the short-time block A4 belongs. At this time, the already generated speaker groups are the speaker group G1 consisting of the short-time blocks A1 and A3 and the speaker group G2 consisting of the short-time block A2. Therefore, speaker group determination is performed against both of these speaker groups G1 and G2. Here, it is assumed that the short-time block A4 is determined to belong to the speaker group G1.
[0144]
As a result, three short-time blocks A1, A3, and A4 are accumulated in the speaker group G1. At this point, these short-time blocks A1, A3, and A4 form the long-time block candidate of the speaker group G1. Note that the number of frames of the long-time block candidate formed by the three short-time blocks A1, A3, and A4 is 64 × 3 = 192, which is not a power of 2. The short-time block A4 is therefore normalized as follows: the sum of the feature vectors is obtained over a total of 128 frames, consisting of its own feature vectors (64 frames) and the feature vectors (64 frames) of the preceding short-time block A3 belonging to the same speaker group, the average feature vector is obtained from this sum, and the obtained average feature vector is used. Even in this case, the sum of the feature vectors of the long-time block candidate composed of the short-time blocks A1, A3, and A4 is stored and used for the subsequent processing.
[0145]
The processing performed when the number of frames of the long-time block candidate including the block's own short-time block is not a power of 2, as with this short-time block A4, is described next.
[0146]
As shown in FIG. 7, consider a stage before a long-time block of 4096 frames has been generated in which, for example, a long-time block candidate (referred to as b1) has been formed by four short-time blocks Aa to Ad along the time axis t. Suppose the next short-time block Ae is acquired and a long-time block candidate b2 is formed by the five short-time blocks including the short-time block Ae. In this case, the number of frames of the long-time block candidate b2 is 64 × 5 = 320, which is not a power of 2.
[0147]
In such a case, the average feature vector used to normalize the short-time block Ae is obtained as follows. Among the short-time blocks Aa to Ae in the long-time block candidate b2, the most recent short-time blocks up to and including the short-time block Ae are taken in the largest number that is a power of 2 (in this case, the four short-time blocks Ab to Ae), and the sum of their feature vectors is obtained. This sum is divided by the total number of frames of the four short-time blocks Ab to Ae (64 × 4 = 256), and the short-time block Ae is normalized using the average feature vector thus obtained.
[0148]
In FIG. 7, the fifth short-time block has been described as an example of the case where the number of frames of the long-time block candidate including the block's own short-time block, that is, the number of short-time blocks in the long-time block candidate, is not a power of 2. For any other number of short-time blocks that is not a power of 2, for example the third, sixth, seventh, or ninth short-time block, processing is performed in the same manner as described with reference to FIG. 7.
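The choice of the averaging window can be sketched in Python as follows. The function names and the one-dimensional toy sums are assumptions for illustration; the rule itself (use the most recent blocks, in the largest power-of-2 count that fits) follows the description of FIG. 7.

```python
def normalization_window(num_blocks):
    """Largest power of two that does not exceed num_blocks (num_blocks >= 1)."""
    return 1 << (num_blocks.bit_length() - 1)

def window_mean(block_sums, frames_per_block=64):
    """Average feature vector over the most recent power-of-two blocks.

    block_sums holds one feature-vector sum per short-time block, oldest first.
    """
    k = normalization_window(len(block_sums))
    recent = block_sums[-k:]                      # e.g. Ab..Ae out of Aa..Ae
    total = [sum(dim) for dim in zip(*recent)]
    return [t / (k * frames_per_block) for t in total]

# Five toy blocks (Aa..Ae) with 1-dimensional sums 64, 128, 192, 256, 320.
sums = [[64.0 * i] for i in range(1, 6)]
mean_for_ae = window_mean(sums)
```

With five blocks the window is four blocks (256 frames), and with three blocks it is two blocks (128 frames), matching the cases of the short-time blocks Ae and A4 in the text.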
[0149]
Returning to FIG. 6, suppose that the short-time block A5 following the short-time block A4 is acquired. The sum of the feature vectors and the average feature vector of the short-time block A5 are then obtained, and it is determined to which of the already generated speaker groups the short-time block A5 belongs.
[0150]
At this time, the already generated speaker groups are the speaker group G1 consisting of the short-time blocks A1, A3, and A4 and the speaker group G2 consisting of the short-time block A2, so speaker group determination is performed against both of these speaker groups G1 and G2. In this case, it is assumed that the block is determined to belong to the speaker group G1. As a result, four short-time blocks A1, A3, A4, and A5 are accumulated in the speaker group G1, and at this point these short-time blocks A1, A3, A4, and A5 form the long-time block candidate of the speaker group G1.
[0151]
In this case, the number of frames of the long-time block candidate formed by these four short-time blocks A1, A3, A4, and A5 is 64 × 4 = 256, which is a power of 2. The short-time block A5 is therefore normalized by obtaining the average feature vector of the short-time blocks A1, A3, A4, and A5 and using that average feature vector, and the normalized feature vector is passed to the speech recognition processing unit.
[0152]
As described above, when determining to which of a plurality of speaker groups a short-time block belongs, it is determined whether or not the distance is equal to or greater than a preset threshold value. This threshold value varies depending on the length of the speaker group (the number of accumulated short-time blocks). For example, when speaker group determination is performed for the short-time block A5 against the speaker group G1 and the speaker group G2, the speaker group G1 is longer than the speaker group G2, so the distance used as the threshold value for the speaker group G1 is set smaller and a stricter distance comparison is performed.
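One way to make the threshold depend on the group length is sketched below. The linear schedule, the function name `threshold_for`, and the numeric values `loose` and `strict` are purely illustrative assumptions; the description only states that a longer speaker group is compared with a smaller, stricter threshold.

```python
def threshold_for(num_blocks, loose=2.0, strict=1.0, full_blocks=64):
    """Distance threshold that shrinks linearly as a group accumulates blocks.

    `loose` applies to an almost empty long-time block candidate and
    `strict` to a completed long-time block of `full_blocks` blocks.
    """
    filled = min(num_blocks, full_blocks) / full_blocks
    return loose - (loose - strict) * filled

t_g1 = threshold_for(3)   # G1 holds A1, A3, A4 when A5 is examined
t_g2 = threshold_for(1)   # G2 holds only A2
```

In this sketch `t_g1` is smaller than `t_g2`, so the comparison against the longer group G1 is the stricter one.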
[0153]
By repeating the above processing, several speaker groups are generated (here, for simplicity, only the two speaker groups G1 and G2), and short-time blocks are accumulated for each speaker group. For example, when 4096 frames (64 short-time blocks) have been accumulated in the speaker group G1, they become the long-time block corresponding to the speaker group G1 described above (the long-time block B1 in FIG. 2). Similarly, when 4096 frames (64 short-time blocks) have been accumulated for the speaker group G2, they become the long-time block corresponding to the speaker group G2.
[0154]
The description using FIG. 6 has covered the processing procedure for the case where no speaker group exists. Of course, the same processing as that described with reference to FIG. 6 is also performed in the case described with reference to the earlier figure.
[0155]
By the way, when a new speaker group is generated, it is possible to set a certain limit on the number of speaker groups. In that case, the maximum number of groups is set in advance. For example, in a system in which an adult male speaker group, an adult female speaker group, and a child speaker group are prepared as acoustic models in the speech recognition apparatus, the maximum number of speaker groups is three.
[0156]
When the maximum number of speaker groups is determined in advance in this way, and a short-time block acquired from the input speech is determined not to belong to any speaker group after speaker groups have been generated up to the maximum number, the short-time block is treated as belonging to the nearest of the speaker groups that have already been generated.
[0157]
When a short-time block that does not belong to any speaker group is made to belong to the nearest speaker group in this way, two cases can be considered: the short-time block is not used for updating the long-time block of that nearest speaker group, or it is used for updating.
[0158]
The advantage of not using it for updating is that sudden effects such as noise can be eliminated. The advantage of using it for updating is that it is possible to adapt to changes in the situation, such as when the same speaker gradually changes the way of speaking, or when the speaker changes to another speaker within the same speaker group.
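The handling of the maximum number of speaker groups can be sketched as follows. The function `assign_group`, the Euclidean distance, and the returned labels are assumptions for illustration. When several groups match, this sketch simply takes the nearest one, which is only a stand-in for the selection criteria described earlier; whether a forced assignment is also used for updating is left to the caller.

```python
import math

def assign_group(block_mean, group_means, threshold, max_groups):
    """Return (group index, kind) for one short-time block.

    kind is "matched" when a group lies within the threshold, "new" when a
    new group may still be created, and "forced" when the group limit has
    been reached and the nearest existing group is used instead.
    """
    dists = [math.dist(block_mean, g) for g in group_means]
    inside = [i for i, d in enumerate(dists) if d < threshold]
    if inside:
        return min(inside, key=lambda i: dists[i]), "matched"
    if len(group_means) < max_groups:
        return len(group_means), "new"
    return min(range(len(dists)), key=lambda i: dists[i]), "forced"

# Three hypothetical groups, e.g. male, female, and child speaker groups.
groups = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
forced = assign_group([7.0, 1.0], groups, threshold=1.0, max_groups=3)
```

A caller could skip the update when the kind is "forced" to suppress sudden noise, or apply it to follow gradual changes, corresponding to the two options described above.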
[0159]
FIG. 8 is a diagram schematically showing the configuration of the speech recognition apparatus according to the present invention. The apparatus includes a microphone 1 serving as speech input means for inputting speech; a speech signal processing unit 2 that performs processing such as amplification and A/D conversion on the input speech signal; a speech analysis unit 3 that analyzes the speech signal processed by the speech signal processing unit 2 every predetermined time and outputs, for example, a 10-dimensional feature vector sequence; a feature vector average normalization processing unit 4 that normalizes the feature vectors obtained by the speech analysis unit 3 by the above-described processing method; and a speech recognition processing unit 6 that receives the normalized feature vectors (normalized cepstrum coefficients) from the feature vector average normalization processing unit 4 and performs speech recognition processing using a normalized standard acoustic model 5.
[0160]
The feature vector average normalization processing unit 4 performs the process of normalizing the feature vectors. This process has already been described in detail above.
[0161]
The normalized feature vectors obtained by the feature vector average normalization processing unit 4 are normalized for each speaker group. These normalized feature vectors are sent to the speech recognition processing unit 6, speech recognition is performed using the acoustic model 5, and the recognition result is output.
[0162]
As described above, the speech recognition apparatus of the present invention obtains a normalized feature vector reflecting the features of each speaker by a simple method without having a special speaker identification / speaker group identification means, Speech recognition processing using the normalized feature vector can be performed.
[0163]
In the example of FIG. 8, the normalized standard acoustic model 5 is used as the acoustic model for speech recognition. However, when several speaker groups are set in advance, speaker group information corresponding to these speaker groups can be passed to the speech recognition processing unit 6 together with the normalized feature vectors from the feature vector average normalization processing unit 4. The speech recognition processing unit 6 can then perform speech recognition processing using an acoustic model corresponding to the speaker group information, thereby further increasing the recognition rate.
[0164]
For example, suppose there are three speaker groups, namely men, women, and children, and corresponding acoustic models (male acoustic model 5a, female acoustic model 5b, and child acoustic model 5c) are prepared as shown in the drawing. In this case, the maximum number of speaker groups generated by the feature vector average normalization processing unit 4 may be set to three; that is, only the male speaker group, the female speaker group, and the child speaker group can be generated.
[0165]
In that case, as described above, it is determined to which of the three speaker groups the short-time block acquired from the speech data belongs, the normalized feature vector is obtained accordingly, and the normalized feature vector and speaker group information indicating the speaker group to which it belongs are passed to the speech recognition processing unit 6.
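The hand-over of speaker group information can be pictured with the small Python sketch below. The group labels, the dictionary `ACOUSTIC_MODELS`, and the fallback to the standard model 5 are hypothetical; only the reference numerals 5, 5a, 5b, and 5c come from the description.

```python
# Hypothetical mapping from speaker group information to acoustic models.
ACOUSTIC_MODELS = {
    "male": "5a",     # male acoustic model
    "female": "5b",   # female acoustic model
    "child": "5c",    # child acoustic model
}

def select_model(speaker_group, default="5"):
    """Pick the acoustic model for a speaker group, falling back to the
    standard acoustic model when no group-specific model is prepared."""
    return ACOUSTIC_MODELS.get(speaker_group, default)

def recognition_input(normalized_block, speaker_group):
    """Bundle what the normalization stage passes to the recognition stage."""
    return {"features": normalized_block,
            "speaker_group": speaker_group,
            "model": select_model(speaker_group)}

packet = recognition_input([[0.1, -0.2]], "female")
```

The design choice illustrated here is that the recognition stage never identifies the speaker itself; it only looks up the model named by the information it receives.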
[0166]
The speech recognition apparatus described with reference to FIGS. 8 and 9 can determine the speaker group to which the input belongs by a simple distance calculation, without performing special speaker identification or speaker group identification. It can therefore obtain normalized feature vectors reflecting each speaker's features and perform speech recognition processing using those normalized feature vectors, so that the recognition performance can be improved. In addition, the amount of computation for generating the normalized feature vectors and the amount of memory used can be greatly reduced, so that the apparatus can operate with inexpensive hardware while obtaining high recognition performance.
[0167]
In addition, when a standard acoustic model is used as shown in FIG. 8, one acoustic model is sufficient, so that the component cost can be kept low and an inexpensive device can be obtained. Since no processing for deciding which acoustic model to use is necessary, the amount of calculation can also be reduced.
[0168]
On the other hand, when an acoustic model corresponding to each speaker or speaker group is used as shown in FIG. 9, there is the drawback that a plurality of acoustic models must be prepared, but there is the advantage that higher recognition performance can be obtained.
[0169]
The present invention is not limited to the embodiment described above, and various modifications can be made without departing from the gist of the present invention.
[0170]
For example, in the above embodiment, the short-time block is composed of 64 frames and the long-time block is composed of 64 short-time blocks (4096 frames). However, the present invention is not limited to this, and these values can be set as appropriate according to the system to which the invention is applied. It is desirable, however, that the number of frames of the short-time block and the number of short-time blocks (number of frames) of the long-time block each be a power of 2.
[0171]
Further, in the above-described embodiment, the determination of the speaker property is performed based on whether or not the speaker belongs to a certain speaker group, but it is determined whether or not the speaker belongs to a specific speaker instead of the speaker group. You can also. Accordingly, in the above-described embodiment, the speaker group is described, but this can be replaced with a speaker or a speaker group.
[0172]
In addition, the present invention can create a processing program in which the processing procedure for realizing the present invention described above is described, and the processing program can be recorded on a recording medium such as a floppy disk, an optical disk, or a hard disk. The present invention also includes a recording medium on which the processing program is recorded. Further, the processing program may be obtained from a network.
[0173]
[Effects of the invention]
As described above, according to the present invention, speaker groups can be generated naturally without using any special speaker identification or speaker group identification means, and normalized feature vectors reflecting the characteristics of each speaker can be obtained. Since speech recognition processing can be performed using these normalized feature vectors, a high recognition rate can be obtained.
[0174]
Furthermore, in the present invention, the number of frames constituting the short-time block and the number of frames constituting the long-time block are each set to a power of 2. The division for obtaining the average feature vector can therefore be performed on a computer by a right shift operation alone, and the operation can be simplified. As a result, the calculation burden on the CPU can be reduced, and a particularly great effect can be obtained when the present invention is applied to an inexpensive apparatus that must use a CPU with low processing capability.
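The replacement of division by a right shift can be checked with a short Python sketch. It assumes, for illustration only, that the feature vector sums are held as integers (for example in a fixed-point representation); the function name and the sample values are hypothetical.

```python
SHORT_BLOCK_SHIFT = 6    # 64 frames   = 2 ** 6
LONG_BLOCK_SHIFT = 12    # 4096 frames = 2 ** 12

def mean_by_shift(int_sums, shift):
    """Divide each integer sum by 2 ** shift using only a right shift.

    In Python, >> on negative integers rounds toward minus infinity,
    which is the same result as floor division by 2 ** shift.
    """
    return [s >> shift for s in int_sums]

sums = [640, -128, 4095]                      # toy per-dimension integer sums
short_mean = mean_by_shift(sums, SHORT_BLOCK_SHIFT)
same_as_division = [s // 64 for s in sums]    # reference using floor division
```

The two lists agree element by element, which is why a frame count of 64 or 4096 lets the averaging step avoid a general division.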
[0175]
In addition, the speaker characteristic determination calculates the distance between the average feature vector of the short-time block and the average feature vector of the long-time block or long-time block candidate corresponding to an already generated speaker or speaker group, and compares that distance with a preset threshold value, so the speaker group can be determined by a simple calculation.
[0176]
In addition, a speech recognition apparatus to which such a speech recognition method is applied can naturally generate speakers or speaker groups with a small amount of computation and a small amount of memory, without using any special speaker identification or speaker group identification means, and can obtain normalized feature vectors reflecting the characteristics of each speaker. It is therefore possible to provide an apparatus with inexpensive hardware and high recognition performance.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating basic processing contents according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating processing after a long-time block corresponding to a speaker group that has already been generated is generated.
FIG. 3 is a diagram illustrating a process of determining to which speaker groups some short-time blocks acquired from one utterance in FIG. 2 belong.
FIG. 4 is a flowchart for explaining a processing procedure until input speech data is subjected to sequential processing for each short-time block, a normalized feature vector is obtained and passed to a speech recognition processing unit.
FIG. 5 is a flowchart for explaining a processing procedure in a case where speaker group determination is performed after waiting for one utterance for input voice data.
FIG. 6 is a diagram for explaining processing of the present invention when voice data is newly input, that is, when there is no group at all.
FIG. 7 is a diagram for explaining the normalization process when the number of frames of a long-time block candidate including an acquired short-time block is not a power of 2.
FIG. 8 is a configuration diagram illustrating an embodiment of a speech recognition apparatus according to the present invention.
FIG. 9 is a configuration diagram of a speech recognition apparatus when an acoustic model is provided for a plurality of preset speaker groups.
[Explanation of symbols]
1 Microphone
2 Speech signal processing unit
3 Speech analysis unit
4 Feature vector average normalization processing unit
5 Acoustic model
5a Male acoustic model
5b Female acoustic model
5c Children's acoustic model
6 Speech recognition processing unit
A1, A2, ..., A64, A65, ... Short-time block
B1 Long-time block
G1, G2 speaker group

Claims

A speech recognition method in which feature analysis is performed on speech data frame by frame at predetermined time intervals, the feature vectors obtained by the feature analysis are average-normalized, and speech recognition processing is performed using the average-normalized feature vectors, wherein:
for the input speech data, a unit consisting of a certain number of frames along the time axis is defined as a short-time block; a predetermined number of short-time blocks belonging to the same speaker or speaker group, collected along the time axis, is defined as a long-time block corresponding to that speaker or speaker group; and, at a stage before this long-time block has been generated, one short-time block or a set of a plurality of short-time blocks belonging to the same speaker group is defined as a long-time block candidate;
for each short-time block sequentially obtained from the speech data, it is determined from the feature vector of the short-time block whether the short-time block belongs to at least one of the speakers or speaker groups that have already been generated;
if it is determined that the short-time block belongs to at least one of the speakers or speaker groups that have already been generated, the feature vector of the long-time block or long-time block candidate corresponding to that speaker or speaker group is updated using the feature vector of the short-time block, the feature vector of the short-time block is average-normalized using the feature vector of the long-time block or long-time block candidate before or after the update, and the average-normalized feature vector is passed to a speech recognition processing unit; and
if it is determined that the short-time block does not belong to any speaker or speaker group that has already been generated, generation of a long-time block corresponding to a new speaker or speaker group is started with the short-time block as a long-time block candidate, the short-time block is average-normalized using the feature vector of the short-time block itself, and the average-normalized feature vector is passed to the speech recognition processing unit.

The speech recognition method according to claim 1, wherein the short-time block is composed of a number of frames that can be obtained from speech data of one word having a standard length, and the long-time block is obtained from speech data that includes all phonemes.

The speech recognition method according to claim 1, wherein the number of frames constituting the short-time block and the number of frames constituting the long-time block are each set to a power of two.

The speech recognition method according to any one of claims 1 to 3, wherein the speaker characteristic determination processing, which determines whether the speech feature of the short-time block belongs to a speaker or speaker group that has already been generated, calculates the distance between the average feature vector of the short-time block and the average feature vector of the long-time block or long-time block candidate corresponding to the already generated speaker or speaker group, and compares the distance with a preset threshold value.

The speech recognition method according to claim 4, wherein the threshold value used for the speaker characteristic determination differs depending on whether the comparison target of the short-time block is the long-time block or the long-time block candidate, and when the comparison target is the long-time block candidate, the distance used as the threshold value is set larger than in the case of the long-time block.

The speech recognition method according to claim 1, wherein the maximum number of speakers or speaker groups to be generated can be set in advance, and, after the maximum number of speakers or speaker groups has been generated, a short-time block acquired thereafter that is determined not to belong to any of the already generated speakers or speaker groups is determined to belong to the speaker or speaker group having the closest speech features.

A speech recognition apparatus in which feature analysis is performed on speech data frame by frame at predetermined time intervals, the feature vectors obtained by the feature analysis are average-normalized, and speech recognition processing is performed using the average-normalized feature vectors, the apparatus comprising:
speech analysis means for performing speech feature analysis on input speech and outputting feature vectors; feature vector average normalization means for average-normalizing the feature vectors output from the speech analysis means; and speech recognition processing means for receiving the feature vectors average-normalized by the feature vector average normalization means and performing speech recognition processing using an acoustic model prepared in advance,
wherein the feature vector average normalization means defines, for the input speech data, a unit consisting of a certain number of frames along the time axis as a short-time block; defines a predetermined number of short-time blocks belonging to the same speaker or speaker group, collected along the time axis, as a long-time block corresponding to that speaker or speaker group; and, at a stage before this long-time block has been generated, defines one short-time block or a set of a plurality of short-time blocks belonging to the same speaker group as a long-time block candidate;
determines, for each short-time block sequentially obtained from the speech data, from the feature vector of the short-time block whether the short-time block belongs to at least one of the speakers or speaker groups that have already been generated;
if it is determined that the short-time block belongs to at least one of the speakers or speaker groups that have already been generated, updates the feature vector of the long-time block or long-time block candidate corresponding to that speaker or speaker group using the feature vector of the short-time block, average-normalizes the feature vector of the short-time block using the feature vector of the long-time block or long-time block candidate before or after the update, and passes the average-normalized feature vector to the speech recognition processing means; and
if it is determined that the short-time block does not belong to any speaker or speaker group that has already been generated, starts generation of a long-time block corresponding to a new speaker or speaker group with the short-time block as a long-time block candidate, average-normalizes the short-time block using the feature vector of the short-time block itself, and passes the average-normalized feature vector to the speech recognition processing means.

The speech recognition apparatus according to claim 7, wherein the short-time block is composed of a number of frames that can be obtained from speech data of one word having a standard length, and the long-time block is obtained from speech data that includes all phonemes.

The speech recognition apparatus according to claim 7, wherein the number of frames constituting the short-time block and the number of frames constituting the long-time block are each set to a power of two.

The speech recognition apparatus according to any one of claims 7 to 9, wherein the speaker characteristic determination processing, which determines whether the speech feature of the short-time block belongs to a speaker or speaker group that has already been generated, calculates the distance between the average feature vector of the short-time block and the average feature vector of the long-time block or long-time block candidate corresponding to the already generated speaker or speaker group, and compares the distance with a preset threshold value.

The speech recognition apparatus according to claim 10, wherein the threshold value used for the speaker characteristic determination differs depending on whether the comparison target of the short-time block is the long-time block or the long-time block candidate, and when the comparison target is the long-time block candidate, the distance used as the threshold value is set larger than in the case of the long-time block.

The speech recognition apparatus according to any one of claims 7 to 11, wherein the maximum number of speakers or speaker groups to be generated can be set in advance, and, after the maximum number of speakers or speaker groups has been generated, a short-time block acquired thereafter that is determined not to belong to any of the already generated speakers or speaker groups is determined to belong to the speaker or speaker group having the closest speech features.

The speech recognition apparatus according to claim 1, wherein the acoustic model used by the speech recognition processing means is a standard acoustic model that can correspond to the average-normalized feature vectors of each speaker or speaker group.

The speech recognition apparatus according to claim 7, wherein the acoustic models used by the speech recognition processing means are a plurality of acoustic models prepared in advance in correspondence with a plurality of speakers or speaker groups, and in that case, information on the speaker or speaker group is received from the feature vector average normalization means and speech recognition processing is performed using the acoustic model corresponding to that speaker or speaker group.