JP4098015B2

JP4098015B2 - Speaker identification method and system, and program

Info

Publication number: JP4098015B2
Application number: JP2002209662A
Authority: JP
Inventors: 克彦白井; 伊徳木草達瓦
Original assignee: Waseda University
Current assignee: Waseda University
Priority date: 2002-07-18
Filing date: 2002-07-18
Publication date: 2008-06-11
Anticipated expiration: 2022-07-18
Also published as: JP2004053821A

Description

【０００１】
【発明の属する技術分野】
本発明は、入力された被識別音声が、予め登録された複数の話者の中の誰の音声であるかを判定する話者識別方法およびそのシステム、並びにプログラムに係り、例えば、複数の話者による連続音声（例えば放送局の長時間のデータ等）の中から特定の話者（例えばアナウンサー等）の音声を検索したり、あるいは連続音声を各話者の音声に分類する場合、空港や港等で出入国管理を行う場合、警察や自衛隊等で犯罪者や要注意人物等の管理（登録、検索、捜査等）を行う場合、役所で住民管理を行う場合、商店等で顧客管理を行う場合、会社で社員管理（出社や退社の把握等）を行う場合、インターネットを利用した情報資源へのアクセスに対するセキュリティ管理を行う場合、言語や方言の自動分類を行う場合、ロボット通信（ロボットによる対話相手の把握等）に応用する場合、おもちゃによる対話相手の把握および応答の選択に応用する場合、音声により各種の警備、防犯、監督を行う場合などに利用できる。
【０００２】
【背景技術】
一般に、話者認識には、入力音声が予め登録されている音声の中の誰の音声であるかを判定する話者識別ＡＳＩ（Automatic Speaker Identification）と、入力音声が本人の音声であるか否かを判定する話者照合ＡＳＶ（Automatic Speaker Verification）とがある。
【０００３】
話者照合ＡＳＶの場合には、単にある人の身分だけを判定するという目的で、予め登録されている本人のサンプルデータと被識別者の入力データとのマッチングをとり、システムは単に「Ｙｅｓ」あるいは「Ｎｏ」の答えを出力するだけである。従って、これまでに数多くの研究や応用システムが報告されている。
【０００４】
一方、話者識別ＡＳＩの場合には、システムによる識別対象となる発話者が、予め登録されたＮ人の話者の中の誰であるかを、Ｎ回の比較処理を行いながら判断し、さらに識別対象となる発話者が、予め登録されたＮ人の話者以外の者であれば、最終的には棄却の判断を行わなければならないので、話者照合ＡＳＶの場合に比べ、システム処理に煩雑性・困難性を伴う。従って、これまでの研究では、効率的に発話者を推定できるような特徴パラメータは発見されておらず、研究報告は、あまりにも基礎理論的な研究に集中しているといえる。
【０００５】
【発明が解決しようとする課題】
ところで、実時間応答性が良く、かつ、高精度な話者認識システムを実現するにあたっては、発話内容、発話時期および発声時間の相違による音響特徴量の変動、実時間応答性の向上を図ることと識別精度の向上を図ることとの矛盾等の問題を解決しなければならない。このことは、話者照合ＡＳＶおよび話者識別ＡＳＩのいずれの場合にもいえることであるが、話者識別ＡＳＩの場合には、前述したように話者照合ＡＳＶの場合に比べてシステム処理に煩雑性・困難性を伴うので、解決すべき問題が、より一層大きくなることから、システム開発は遅れているのが現状である。
【０００６】
本発明の目的は、実時間応答性が良好で、かつ、高精度な話者識別を行うことができる話者識別方法およびそのシステム、並びにプログラムを提供するところにある。
【０００７】
【課題を解決するための手段】
本発明は、入力された被識別音声が、予め登録された複数の話者の中の誰の音声であるかを判定する話者識別方法であって、第一段階の判定処理用のサンプルデータとして複数の各話者本人から取得した音声データを用いて音声の特徴パラメータを各話者毎に作成した後、これらの特徴パラメータを用いてクラスタリングを行うことにより各話者毎に第一のコードブックを作成し、これらの第一のコードブックを第一コードブック記憶手段に記憶しておくとともに、複数候補者に絞込後の第二段階の判定処理用の候補データとしてサンプルデータとは異なる環境で複数の各話者本人から取得した音声データを用いて音声の特徴パラメータを各話者毎に作成した後、これらの特徴パラメータを用いてクラスタリングを行うことにより各話者毎に第二のコードブックを作成し、これらの第二のコードブックを第二コードブック記憶手段に記憶しておき、被識別音声が複数の話者の中の誰の音声であるかを判定する際には、被識別音声の入力データを用いて音声の特徴パラメータを作成した後、話者性尤度算出手段により、この特徴パラメータと第一コードブック記憶手段に記憶された第一のコードブックとを用いて複数の各話者についての話者性尤度を算出し、第一段階判定手段により、これらの話者性尤度のうち最小の話者性尤度とそれ以外の話者性尤度との各差が、予め設定された第一の閾値ηよりも大きいか否かまたは第一の閾値η以上か否かを判定し、各差の全てが第一の閾値ηよりも大きいかまたは第一の閾値η以上と判定された場合には、最小の話者性尤度の話者を唯一の候補者とし、第二段階唯一候補者判定手段により、最小の話者性尤度が、予め各話者毎に設定された第二の閾値θminと第三の閾値θmaxとの間の範囲に入るか否かを判定し、これらの閾値θmin，θmaxの間の範囲に入ると判定したときには候補者を受理し、入らないと判定したときには棄却し、一方、各差のうちの少なくとも一つの差が第一の閾値η以下または第一の閾値ηよりも小さいと判定された場合には、以下または小さいと判定された差となっている話者性尤度の話者および最小の話者性尤度の話者を複数の候補者とし、絶対値平均ベクトル誤差算出手段により、特徴パラメータと第二コードブック記憶手段に記憶された第二のコードブックとを用いて複数の各候補者についての絶対値平均ベクトル誤差を算出した後、第二段階複数候補者判定手段により、これらの絶対値平均ベクトル誤差のうち最小の絶対値平均ベクトル誤差が、第二の閾値θminと第三の閾値θmaxとの間の範囲に入るか否かを判定し、これらの閾値θmin，θmaxの間の範囲に入ると判定したときには最小の絶対値平均ベクトル誤差の候補者を受理し、入らないと判定したときには棄却することを特徴とするものである。
【０００８】
ここで、「各差の全てが第一の閾値ηよりも大きいかまたは第一の閾値η以上と判定された場合」とは、最小の話者性尤度に対し、それ以外の全ての話者性尤度が、第一の閾値ηよりも大きな値だけ離れているか、あるいは第一の閾値η以上離れていると判定された場合であり、要するに、最小の話者性尤度が、その他の話者性尤度に比べて突出して小さな値であると判定された場合である。
【０００９】
一方、「各差のうちの少なくとも一つの差が第一の閾値η以下または第一の閾値ηよりも小さいと判定された場合」とは、最小の話者性尤度以外の話者性尤度の中に、最小の話者性尤度との差が、第一の閾値η以下または第一の閾値ηよりも小さいものが存在すると判定された場合であり、要するに、最小の話者性尤度に近い値の話者性尤度が少なくとも一つ存在すると判定された場合である。
【００１０】
このような本発明の話者識別方法においては、先ず、話者性尤度を算出して第一の閾値ηを用いて第一段階の判定を行い、次に、その判定結果に応じ、第二段階の判定の処理内容を変える。すなわち、第一段階の判定において、最小の話者性尤度が、その他の話者性尤度に比べて突出して小さな値であると判定された場合には、最小の話者性尤度の話者をそのまま唯一の候補者とし、その最小の話者性尤度が第二の閾値θminと第三の閾値θmaxとの間の範囲に入るか否かの第二段階の判定処理を行う。一方、第一段階の判定において、最小の話者性尤度に近い値の話者性尤度が少なくとも一つ存在すると判定された場合には、最小の話者性尤度の話者だけではなく、最小の話者性尤度に近い値の話者性尤度の話者も含めて複数の候補者とし、絶対値平均ベクトル誤差を算出して第二の閾値θminおよび第三の閾値θmaxを用いて第二段階の判定処理を行う。
【００１１】
このため、話者識別を行うにあたって、判定を二つの段階で行うようにし、第一段階の判定結果に応じて第二段階の判定が簡易な処理になる場合とそれよりも複雑な処理になる場合とに分かれるようにしたので、全てについて一律な内容の判定処理を行う場合に比べ、話者識別に要する処理時間を短縮することが可能となり、実時間応答性の向上が図られる。
【００１２】
また、第一段階の判定処理において、最小の話者性尤度に近い値の話者性尤度が少なくとも一つ存在すると判定された場合には、最小の話者性尤度に近い値の話者性尤度の話者が、被識別音声を入力した者と一致する可能性もあるため、その場合には、絶対値平均ベクトル誤差を算出する処理を行ってから第二段階の判定を行うので、上述した如く実時間応答性の向上を図りつつ、同時に識別精度の向上を図ることが可能となり、これらにより前記目的が達成される。
【００１３】
また、前述した話者識別方法において、第一の閾値ηを設定する際には、第一閾値設定用テストデータとして複数の各話者本人のうちの任意の一人から取得した音声データを用いて音声の特徴パラメータを作成した後、話者性尤度算出手段により、この特徴パラメータと第一コードブック記憶手段に記憶された第一のコードブックとを用いて複数の各話者についての話者性尤度を算出し、前記任意の一人についての話者性尤度とこの話者性尤度に最も近い値の話者性尤度との差を第一の閾値ηとすることが望ましい。
【００１４】
このようにして第一の閾値ηを設定するようにした場合には、第一閾値設定用テストデータとしての音声データを提供した任意の一人の話者と、この話者に最も似た音声特徴を有する話者とについての話者性尤度の差に基づき、第一の閾値ηが設定されるので、第一段階の判定処理に用いる閾値として適切な値を設定することが可能となる。
【００１５】
さらに、前述した話者識別方法において、第二の閾値θminおよび第三の閾値θmaxを設定する際には、第二・第三閾値設定用テストデータとして複数の各話者本人からそれぞれ複数ずつ取得した音声データを用いて音声の特徴パラメータを各話者毎に複数ずつ作成した後、話者性尤度算出手段により、これらの特徴パラメータと第一コードブック記憶手段に記憶された第一のコードブックとを用いて複数の各話者についての話者性尤度を各話者毎に複数ずつ算出し、各話者毎に算出した複数の話者性尤度のうちの最小値を各話者についての第二の閾値θminとし、各話者毎に算出した複数の話者性尤度のうちの最大値を各話者についての第三の閾値θmaxとすることが望ましい。
【００１６】
このようにして第二の閾値θminおよび第三の閾値θmaxを設定するようにした場合には、第二段階の判定処理に用いる閾値として適切な値を設定することが可能となる。
【００１７】
そして、上記のようにして第二の閾値θminおよび第三の閾値θmaxを設定するようにした場合において、被識別音声が複数の話者の中の誰の音声であるかを判定する処理を行った結果、被識別音声を入力した者が複数の話者の中のいずれかの者であるとして受理された場合には、被識別音声の入力データとサンプルデータとを組み合わせることにより、複数の話者のうち被識別音声を入力した者と一致すると判定された話者についての第一のコードブックを作成し直し、この作成し直した第一のコードブックと、複数の話者のうち被識別音声を入力した者と一致すると判定された話者についての複数の第二・第三閾値設定用テストデータと、被識別音声の入力データとを用いて、複数の話者のうち被識別音声を入力した者と一致すると判定された話者についての第二の閾値θminおよび第三の閾値θmaxの更新設定を行うことが望ましい。
【００１８】
このように本人の音声であるとして受理された被識別音声の入力データを用いて、第二の閾値θminおよび前記第三の閾値θmaxの更新設定を行うようにした場合には、体調や健康等の変化に起因して、登録された複数の各話者の音声の特徴が変化したときには、それに追従させて第二の閾値θminおよび第三の閾値θmaxの設定を徐々に変化させていくことが可能となる。
【００１９】
そして、以上に述べた話者識別方法において、被識別音声の入力データは、第一段階の判定処理用のサンプルデータよりも短時間のデータであることが望ましい。
【００２０】
例えば、被識別音声の入力データを、５〜１０秒間のデータとし、第一段階の判定処理用のサンプルデータを、２０〜３０秒間のデータとすることが望ましい。
【００２１】
このように被識別音声の入力データを短くした場合には、話者識別に要する時間を短縮することが可能となり、実時間応答性の向上が図られる。
【００２２】
また、以上に述べた話者識別方法において、第二のコードブックは、各話者毎に複数ずつ作成しておき、第二のコードブックを用いて複数の各候補者についての絶対値平均ベクトル誤差を算出する際には、各候補者毎に絶対値平均ベクトル誤差を複数ずつ算出し、これらの絶対値平均ベクトル誤差の平均値を算出することが望ましい。
【００２３】
このように各話者毎に第二のコードブックを複数ずつ用意し、各候補者毎に絶対値平均ベクトル誤差の平均値を算出するようにした場合には、話者識別の精度を、より一層向上させることが可能となる。
【００２４】
さらに、以上に述べた話者識別方法において、特徴パラメータとしては、メルケプストラム等を好適に用いることができる。そして、メルケプストラムとする場合には、処理時間短縮（計算量削減）と識別精度の確保との兼ね合い等から、１６次メルケプストラムとすることが好適であるが、これに限定されるものではない。
【００２５】
また、以上に述べた本発明の話者識別方法を実現するシステムとして、次のような本発明の話者識別システムを挙げることができる。
【００２６】
すなわち、本発明は、入力された被識別音声が、予め登録された複数の話者の中の誰の音声であるかを判定する話者識別システムであって、第一段階の判定処理用のサンプルデータとして複数の各話者本人から取得した音声データを用いて音声の特徴パラメータを各話者毎に作成するサンプルデータ用特徴パラメータ作成手段と、このサンプルデータ用特徴パラメータ作成手段により作成された特徴パラメータを用いてクラスタリングを行うことにより各話者毎に第一のコードブックを作成する第一コードブック作成手段と、この第一コードブック作成手段により各話者毎に作成された第一のコードブックを記憶する第一コードブック記憶手段と、複数候補者に絞込後の第二段階の判定処理用の候補データとしてサンプルデータとは異なる環境で複数の各話者本人から取得した音声データを用いて音声の特徴パラメータを各話者毎に作成する候補データ用特徴パラメータ作成手段と、この候補データ用特徴パラメータ作成手段により作成された特徴パラメータを用いてクラスタリングを行うことにより各話者毎に第二のコードブックを作成する第二コードブック作成手段と、この第二コードブック作成手段により各話者毎に作成された第二のコードブックを記憶する第二コードブック記憶手段と、被識別音声が複数の話者の中の誰の音声であるかを判定する際に被識別音声の入力データを用いて音声の特徴パラメータを作成する入力データ用特徴パラメータ作成手段と、この入力データ用特徴パラメータ作成手段により作成された特徴パラメータと第一コードブック記憶手段に記憶された第一のコードブックとを用いて複数の各話者についての話者性尤度を算出する話者性尤度算出手段と、この話者性尤度算出手段により算出された話者性尤度のうち最小の話者性尤度とそれ以外の話者性尤度との各差が、予め設定された第一の閾値ηよりも大きいか否かまたは第一の閾値η以上か否かを判定する第一段階判定手段と、この第一段階判定手段により各差の全てが第一の閾値ηよりも大きいかまたは第一の閾値η以上と判定された場合に、最小の話者性尤度の話者を唯一の候補者とし、最小の話者性尤度が、予め各話者毎に設定された第二の閾値θminと第三の閾値θmaxとの間の範囲に入るか否かを判定し、これらの閾値θmin，θmaxの間の範囲に入ると判定したときには候補者を受理し、入らないと判定したときには棄却する第二段階唯一候補者判定手段と、第一段階判定手段により各差のうちの少なくとも一つの差が第一の閾値η以下または第一の閾値ηよりも小さいと判定された場合に、以下または小さいと判定された差となっている話者性尤度の話者および最小の話者性尤度の話者を複数の候補者とし、入力データ用特徴パラメータ作成手段により作成された特徴パラメータと第二コードブック記憶手段に記憶された第二のコードブックとを用いて複数の各候補者についての絶対値平均ベクトル誤差を算出する絶対値平均ベクトル誤差算出手段と、この絶対値平均ベクトル誤差算出手段により算出された絶対値平均ベクトル誤差のうち最小の絶対値平均ベクトル誤差が、第二の閾値θminと第三の閾値θmaxとの間の範囲に入るか否かを判定し、これらの閾値θmin，θmaxの間の範囲に入ると判定したときには最小の絶対値平均ベクトル誤差の候補者を受理し、入らないと判定したときには棄却する第二段階複数候補者判定手段とを備えたことを特徴とするものである。
【００２７】
このような本発明の話者識別システムにおいては、前述した本発明の話者識別方法で得られる作用・効果をそのまま得ることができ、これにより前記目的が達成される。
【００２８】
また、前述した話者識別システムにおいて、第二の閾値θminおよび第三の閾値θmaxの更新設定を自動的に行う第二・第三閾値自動更新手段を備え、この第二・第三閾値自動更新手段は、被識別音声が複数の話者の中の誰の音声であるかを判定する処理を行った結果、被識別音声を入力した者が複数の話者の中のいずれかの者であるとして受理された場合に、被識別音声の入力データとサンプルデータとを組み合わせることにより、複数の話者のうち被識別音声を入力した者と一致すると判定された話者についての第一のコードブックを作成し直し、この作成し直した第一のコードブックと、複数の話者のうち被識別音声を入力した者と一致すると判定された話者についての複数の第二・第三閾値設定用テストデータと、被識別音声の入力データとを用いて、複数の話者のうち被識別音声を入力した者と一致すると判定された話者についての第二の閾値θminおよび第三の閾値θmaxの更新設定を行う構成とされていることが望ましい。
【００２９】
このような第二・第三閾値自動更新手段を設けた場合には、体調や健康等の変化に起因して、登録された複数の各話者の音声の特徴が変化したときには、それに追従させて第二の閾値θminおよび第三の閾値θmaxの設定を徐々に自動的に変化させていくことが可能となる。
【００３０】
さらに、本発明は、入力された被識別音声が、予め登録された複数の話者の中の誰の音声であるかを判定する話者識別システムとして、コンピュータを機能させるためのプログラムであって、第一段階の判定処理用のサンプルデータとして複数の各話者本人から取得した音声データを用いて音声の特徴パラメータを各話者毎に作成するサンプルデータ用特徴パラメータ作成手段と、このサンプルデータ用特徴パラメータ作成手段により作成された特徴パラメータを用いてクラスタリングを行うことにより各話者毎に第一のコードブックを作成する第一コードブック作成手段と、この第一コードブック作成手段により各話者毎に作成された第一のコードブックを記憶する第一コードブック記憶手段と、複数候補者に絞込後の第二段階の判定処理用の候補データとしてサンプルデータとは異なる環境で複数の各話者本人から取得した音声データを用いて音声の特徴パラメータを各話者毎に作成する候補データ用特徴パラメータ作成手段と、この候補データ用特徴パラメータ作成手段により作成された特徴パラメータを用いてクラスタリングを行うことにより各話者毎に第二のコードブックを作成する第二コードブック作成手段と、この第二コードブック作成手段により各話者毎に作成された第二のコードブックを記憶する第二コードブック記憶手段と、被識別音声が複数の話者の中の誰の音声であるかを判定する際に被識別音声の入力データを用いて音声の特徴パラメータを作成する入力データ用特徴パラメータ作成手段と、この入力データ用特徴パラメータ作成手段により作成された特徴パラメータと第一コードブック記憶手段に記憶された第一のコードブックとを用いて複数の各話者についての話者性尤度を算出する話者性尤度算出手段と、この話者性尤度算出手段により算出された話者性尤度のうち最小の話者性尤度とそれ以外の話者性尤度との各差が、予め設定された第一の閾値ηよりも大きいか否かまたは第一の閾値η以上か否かを判定する第一段階判定手段と、この第一段階判定手段により各差の全てが第一の閾値ηよりも大きいかまたは第一の閾値η以上と判定された場合に、最小の話者性尤度の話者を唯一の候補者とし、最小の話者性尤度が、予め各話者毎に設定された第二の閾値θminと第三の閾値θmaxとの間の範囲に入るか否かを判定し、これらの閾値θmin，θmaxの間の範囲に入ると判定したときには候補者を受理し、入らないと判定したときには棄却する第二段階唯一候補者判定手段と、第一段階判定手段により各差のうちの少なくとも一つの差が第一の閾値η以下または第一の閾値ηよりも小さいと判定された場合に、以下または小さいと判定された差となっている話者性尤度の話者および最小の話者性尤度の話者を複数の候補者とし、入力データ用特徴パラメータ作成手段により作成された特徴パラメータと第二コードブック記憶手段に記憶された第二のコードブックとを用いて複数の各候補者についての絶対値平均ベクトル誤差を算出する絶対値平均ベクトル誤差算出手段と、この絶対値平均ベクトル誤差算出手段により算出された絶対値平均ベクトル誤差のうち最小の絶対値平均ベクトル誤差が、第二の閾値θminと第三の閾値θmaxとの間の範囲に入るか否かを判定し、これらの閾値θmin，θmaxの間の範囲に入ると判定したときには最小の絶対値平均ベクトル誤差の候補者を受理し、入らないと判定したときには棄却する第二段階複数候補者判定手段とを備えたことを特徴とする話者識別システムとして、コンピュータを機能させるためのものである。
【００３１】
なお、以上に述べたプログラムまたはその一部は、例えば、光磁気ディスク（ＭＯ）、コンパクトディスク（ＣＤ）を利用した読出し専用メモリ（ＣＤ−ＲＯＭ）、ＣＤレコーダブル（ＣＤ−Ｒ）、ＣＤリライタブル（ＣＤ−ＲＷ）、デジタル・バーサタイル・ディスク（ＤＶＤ）を利用した読出し専用メモリ（ＤＶＤ−ＲＯＭ）、ＤＶＤを利用したランダム・アクセス・メモリ（ＤＶＤ−ＲＡＭ）、フレキシブルディスク（ＦＤ）、磁気テープ、ハードディスク、読出し専用メモリ（ＲＯＭ）、電気的消去および書換可能な読出し専用メモリ（ＥＥＰＲＯＭ）、フラッシュ・メモリ、ランダム・アクセス・メモリ（ＲＡＭ）等の記録媒体に記録して保存や流通等させることが可能であるとともに、例えば、ローカル・エリア・ネットワーク（ＬＡＮ）、メトロポリタン・エリア・ネットワーク（ＭＡＮ）、ワイド・エリア・ネットワーク（ＷＡＮ）、インターネット、イントラネット、エクストラネット等の有線ネットワーク、あるいは無線通信ネットワーク、さらにはこれらの組合せ等の伝送媒体を用いて伝送することが可能であり、また、搬送波に載せて搬送することも可能である。さらに、以上に述べたプログラムは、他のプログラムの一部分であってもよく、あるいは別個のプログラムと共に記録媒体に記録されていてもよい。
【００３２】
【発明の実施の形態】
以下に本発明の一実施形態を図面に基づいて説明する。図１には、本実施形態の話者識別システム１０の全体構成が示されている。図２には、話者識別システム１０により、被識別音声が複数の話者の中の誰の音声であるかを判定する際の処理の流れの説明図が示されている。また、図３には、第一の閾値ηの初期設定を行う際の処理の流れの説明図が示され、図４には、第二の閾値θminおよび第三の閾値θmaxの初期設定を行う際の処理の流れの説明図が示されている。さらに、図５には、第二の閾値θminおよび第三の閾値θmaxの更新を行う際の処理の流れの説明図が示されている。
【００３３】
図１において、話者識別システム１０は、入力された被識別音声（識別対象となる音声）が、予め登録された複数の話者の中の誰の音声であるかを判定するシステムであり、話者識別に関する各種処理を行う処理手段２０と、話者識別に必要な各種データを記憶する記憶手段４０と、話者の音声を入力する音声入力手段６０と、話者識別の結果を表示する表示手段７０と、話者識別の結果を出力する出力手段８０とを備えて構成されている。登録される話者の総数Ｎは、例えば、Ｎ＝１００人等であり、各話者には、話者番号ｎ（ｎ＝１，２，３，…，Ｎ）を付すものとする。
【００３４】
また、音声入力手段６０と処理手段２０との間、あるいは表示手段７０や出力手段８０と処理手段２０との間には、ネットワークが介在していてもよい。すなわち、音声入力手段６０は、遠隔地にいる話者の音声を入力するものであってもよく、表示手段７０や出力手段８０は、遠隔地にいる閲覧者や利用者に識別結果を提供するものであってもよい。
【００３５】
ここで、ネットワークには、例えば、ＬＡＮ、ＭＡＮ、ＷＡＮ、インターネット、イントラネット、エクストラネット、あるいはこれらの組合せ等、様々な形態のものが含まれ、有線であるか無線であるか、さらには有線および無線の混在型であるかは問わず、要するに、複数地点（距離の長短は問わない。）間で、ある程度の速度をもって情報を伝送することができるものであればよい。
【００３６】
処理手段２０は、サンプルデータ用特徴パラメータ作成手段２１と、候補データ用特徴パラメータ作成手段２２と、入力データ用特徴パラメータ作成手段２３と、第一コードブック作成手段２４と、第二コードブック作成手段２５と、話者性尤度算出手段２６と、絶対値平均ベクトル誤差算出手段２７と、第一段階判定手段２８と、第二段階唯一候補者判定手段２９と、第二段階複数候補者判定手段３０と、第一閾値初期設定手段３１と、第二・第三閾値初期設定手段３２と、第二・第三閾値自動更新手段３３とを含んで構成されている。
【００３７】
記憶手段４０は、第一コードブック記憶手段４１と、第二コードブック記憶手段４２と、第一閾値記憶手段４３と、第二閾値記憶手段４４と、第三閾値記憶手段４５と、サンプルデータ記憶手段４６と、候補データ記憶手段４７と、被識別音声入力データ記憶手段４８と、第一閾値設定用テストデータ記憶手段４９と、第二・第三閾値設定用テストデータ記憶手段５０とを含んで構成されている。
【００３８】
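The sample-data feature parameter creation means 21 extracts speech features from the speech data acquired from each of the plural (N) speakers as sample data for the first-stage determination process, and creates speech feature parameters for each speaker. Because 16th-order mel-cepstra are well suited as feature parameters, this embodiment is described using 16th-order mel-cepstra, but the invention is not limited to them. The speech data acquired as sample data is, for example, 20 to 30 seconds long per speaker. With a frame of 10 milliseconds, a 16th-order mel-cepstrum is therefore created from each of, for example, 2000 to 3000 frames.
【００３９】
The candidate-data feature parameter creation means 22 extracts speech features from speech data acquired from each of the plural (N) speakers in an environment different from that of the sample data, as candidate data for the second-stage determination process performed after narrowing down to plural candidates, and creates speech feature parameters for each speaker. As with the sample data, the feature parameters are 16th-order mel-cepstra. The speech data acquired as candidate data is, for example, 5 to 10 seconds long per speaker, so with a frame of 10 milliseconds a 16th-order mel-cepstrum is created from each of, for example, 500 to 1000 frames. For each speaker, t items (for example, t = 3 to 5) of this 5-to-10-second candidate data are prepared. The candidate data is preferably shorter than the sample data.
【００４０】
The input-data feature parameter creation means 23 extracts speech features from the input data of the speech to be identified and creates the speech feature parameters when the system determines which of the plural speakers produced that speech. As with the sample data and candidate data, the feature parameters are 16th-order mel-cepstra. The speech data input for speaker identification is, for example, 5 to 10 seconds long per speaker, so with a frame of 10 milliseconds a 16th-order mel-cepstrum is created from each of, for example, 500 to 1000 frames. Each mel-cepstrum is a vector of 16 values, giving the vectors X_1, X_2, …, X_k, …, X_M, where M is the number of frames and X_k = (X_k(1), X_k(2), …, X_k(16)) (k = 1, 2, …, M). The input data of the speech to be identified is preferably shorter than the sample data.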
【００４１】
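The first codebook creation means 24 creates a first codebook for each speaker by clustering the feature parameters created by the sample-data feature parameter creation means 21. The first codebook is a set of centroid vectors C_i^n, where n = 1, 2, …, N and i = 1, 2, …, L. Here N is the total number of registered speakers, for example N = 100, and L is the codebook size (number of quantization points), for example L = 128 or L = 256. First codebooks of size L are thus prepared for N speakers. Each centroid vector C_i^n consists of 16 values: C_i^n = (C_i^n(1), C_i^n(2), …, C_i^n(16)).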
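The source does not name the clustering algorithm used to build a codebook. As a minimal sketch, assuming plain k-means (Lloyd's algorithm) with a deterministic initialisation, a codebook of `size` centroids could be built as follows. The function names `build_codebook` and `squared_distance` and the toy 2-dimensional data are illustrative assumptions, not part of the patent; real input would be 16-dimensional mel-cepstral vectors.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def squared_distance(x: Sequence[float], c: Sequence[float]) -> float:
    """Sum of squared component-wise differences between two vectors."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def build_codebook(frames: Sequence[Vector], size: int, iterations: int = 20) -> List[Vector]:
    """Cluster frames into `size` centroids with k-means (an assumed choice)."""
    if not 0 < size <= len(frames):
        raise ValueError("size must be between 1 and the number of frames")
    # Deterministic initialisation: pick evenly spaced frames as starting centroids.
    step = len(frames) / size
    centroids = [tuple(frames[int(i * step)]) for i in range(size)]
    for _ in range(iterations):
        # Assignment step: attach each frame to its nearest centroid.
        clusters: List[List[Vector]] = [[] for _ in range(size)]
        for x in frames:
            nearest = min(range(size), key=lambda i: squared_distance(x, centroids[i]))
            clusters[nearest].append(x)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                dim = len(old)
                updated.append(tuple(sum(m[j] for m in members) / len(members) for j in range(dim)))
            else:
                updated.append(old)  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids


# Toy data with two well-separated groups.
toy_frames = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
codebook = build_codebook(toy_frames, size=2)
```

In practice L = 128 or 256 centroids would be trained per speaker; any vector-quantization training method that yields L centroid vectors would fit the description equally well.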
【００４２】
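The second codebook creation means 25 creates second codebooks for each speaker by clustering the feature parameters created by the candidate-data feature parameter creation means 22. A second codebook is a set of centroid vectors C_i^{n,v}, where n = 1, 2, …, N, i = 1, 2, …, L, and v = 1, 2, …, t. Here t is the number of second codebooks prepared per candidate (speaker), for example t = 3 to 5. Second codebooks of size L are thus prepared, t for each of the N speakers, that is, N × t in total. Each centroid vector C_i^{n,v} consists of 16 values: C_i^{n,v} = (C_i^{n,v}(1), C_i^{n,v}(2), …, C_i^{n,v}(16)).
【００４３】
The speaker likelihood calculation means 26 performs vector quantization (VQ: Vector Quantization) using the feature parameters X_k created by the input-data feature parameter creation means 23 and the first codebooks C_i^n stored in the first codebook storage means 41, and calculates the speaker likelihood for each of the plural speakers. Specifically, it performs processing based on the following equation (1).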
【００４４】
【数１】
L^{n} = \frac{1}{M} \sum_{k=1}^{M} \min_{1 \le i \le L} d(X_k, C_i^{n}) \qquad (1)
【００４５】
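In equation (1), L^n is the speaker likelihood; the superscript n is the speaker number (n = 1, 2, …, N), and N is the total number of registered speakers, for example N = 100. M is the number of frames. d(X_k, C_i^n) denotes the sum of the squared differences between corresponding components of the vectors X_k and C_i^n. Since X_k = (X_k(1), X_k(2), …, X_k(16)) and C_i^n = (C_i^n(1), C_i^n(2), …, C_i^n(16)), it is given as follows.
【００４６】
d(X_k, C_i^n) = (X_k(1) − C_i^n(1))² + (X_k(2) − C_i^n(2))² + … + (X_k(16) − C_i^n(16))²
【００４７】
min(1≤i≤L)[d(X_k, C_i^n)] means the following: because the codebook size is L (for example, L = 256), the first codebook of the n-th speaker contains L centroid vectors, so d(X_k, C_i^n) is computed for each of i = 1, 2, …, L and the minimum of these values is taken.
【００４８】
Computing min(1≤i≤L)[d(X_k, C_i^n)] for all M frames yields M values (k = 1, 2, …, M). These M values are summed and divided by M to obtain their mean. This mean is the speaker likelihood L^n for the n-th speaker. Because L^n is a mean quantization distortion, a smaller value indicates a closer match to that speaker. The calculation is performed for all N speakers to obtain L^1, L^2, …, L^N.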
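The calculation described for equation (1) can be sketched in a few lines of Python. This is an illustrative sketch only: the names `squared_distance` and `speaker_likelihood`, the dictionary layout, and the toy 2-dimensional vectors are assumptions made here (the patent uses 16-dimensional mel-cepstra and does not prescribe any implementation).

```python
from typing import Dict, Sequence

Vector = Sequence[float]


def squared_distance(x: Vector, c: Vector) -> float:
    """d(X_k, C_i^n): sum of squared component-wise differences."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def speaker_likelihood(frames: Sequence[Vector], codebook: Sequence[Vector]) -> float:
    """L^n: mean over all frames of the distance to the nearest centroid."""
    if not frames or not codebook:
        raise ValueError("frames and codebook must be non-empty")
    total = 0.0
    for x in frames:
        total += min(squared_distance(x, c) for c in codebook)
    return total / len(frames)


# Toy input: M = 2 frames, two registered speakers (hypothetical names).
frames = [(0.0, 0.0), (1.0, 1.0)]
first_codebooks: Dict[str, Sequence[Vector]] = {
    "speaker_1": [(0.0, 0.0), (1.0, 0.0)],
    "speaker_2": [(3.0, 3.0)],
}
likelihoods = {n: speaker_likelihood(frames, cb) for n, cb in first_codebooks.items()}
```

Here `speaker_1` obtains the smaller value, so it would be the minimum-likelihood speaker L^x in the first-stage determination.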
【００４９】
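The absolute-value mean vector error calculation means 27 calculates the absolute-value mean vector error ε_{q,v}, and from it ε_q and ε_qmin, for each of the plural candidates (speakers) selected in the first-stage determination process, using the feature parameters X_k created by the input-data feature parameter creation means 23 and the second codebooks C_i^{n,v} stored in the second codebook storage means 42. Specifically, it performs processing based on the following equation (2).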
【００５０】
【数２】
\varepsilon_{q,v} = \frac{1}{M} \sum_{k=1}^{M} \min_{1 \le i \le L} d'(X_k, C_i^{n,v}) \qquad (2)
【００５１】
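In equation (2), ε_{q,v} is the absolute-value mean vector error for each of the plural candidates (speakers) selected in the first-stage determination process. The subscript q is the candidate number. The subscript v is the number (v = 1, 2, …, t) of a second codebook: plural (t, for example t = 3) items of candidate data are prepared for each candidate (speaker), and one second codebook is created from each item. M is the number of frames. d'(X_k, C_i^{n,v}) differs from d(X_k, C_i^n) in equation (1) for the speaker likelihood L^n: it denotes the sum of the absolute values of the differences between corresponding components of the vectors X_k and C_i^{n,v}. Since X_k = (X_k(1), X_k(2), …, X_k(16)) and C_i^{n,v} = (C_i^{n,v}(1), C_i^{n,v}(2), …, C_i^{n,v}(16)), it is given as follows. The n in d'(X_k, C_i^{n,v}) is the speaker number corresponding to candidate number q (the original speaker number of the speaker who was assigned candidate number q in the first-stage determination process).
【００５２】
d'(X_k, C_i^{n,v}) = |X_k(1) − C_i^{n,v}(1)| + |X_k(2) − C_i^{n,v}(2)| + … + |X_k(16) − C_i^{n,v}(16)|
【００５３】
min(1≤i≤L)[d'(X_k, C_i^{n,v})] means the following: because the codebook size is L (for example, L = 256), the second codebook created from the v-th candidate data item of the n-th speaker contains L centroid vectors, so d'(X_k, C_i^{n,v}) is computed for each of i = 1, 2, …, L and the minimum of these values is taken.
【００５４】
Computing min(1≤i≤L)[d'(X_k, C_i^{n,v})] for all M frames yields M values (k = 1, 2, …, M). These M values are summed and divided by M to obtain their mean. This mean is the absolute-value mean vector error ε_{q,v} based on the v-th candidate data item of the n-th speaker (the speaker corresponding to candidate number q).
【００５５】
Furthermore, for the n-th speaker (the speaker corresponding to candidate number q), the absolute-value mean vector errors ε_{q,v} based on the candidate data items (second codebooks) from v = 1 to v = t are averaged to give ε_q. That is, the absolute-value mean vector error ε_q for the n-th speaker (the speaker corresponding to candidate number q) is obtained by the following equation.
【００５６】
ε_q = (ε_{q,1} + ε_{q,2} + … + ε_{q,t}) / t
【００５７】
The above calculation is performed for all of the plural candidates (speakers) selected in the first-stage determination process to obtain the absolute-value mean vector error ε_q of every candidate. The minimum of these ε_q values is then found and taken as ε_qmin.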
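The chain ε_{q,v} → ε_q → ε_qmin can be sketched as follows. This is a hedged illustration: the function names, the dictionary that maps each candidate to its t second codebooks, and the toy 2-dimensional data are assumptions introduced here, not taken from the patent.

```python
from typing import Dict, Sequence, Tuple

Vector = Sequence[float]
Codebook = Sequence[Vector]


def abs_distance(x: Vector, c: Vector) -> float:
    """d'(X_k, C_i^{n,v}): sum of absolute component-wise differences."""
    return sum(abs(xj - cj) for xj, cj in zip(x, c))


def mean_abs_error(frames: Sequence[Vector], codebook: Codebook) -> float:
    """epsilon_{q,v}: mean over frames of the nearest-centroid d' distance."""
    return sum(min(abs_distance(x, c) for c in codebook) for x in frames) / len(frames)


def candidate_error(frames: Sequence[Vector], codebooks: Sequence[Codebook]) -> float:
    """epsilon_q: average of epsilon_{q,v} over the candidate's t second codebooks."""
    return sum(mean_abs_error(frames, cb) for cb in codebooks) / len(codebooks)


def best_candidate(frames: Sequence[Vector],
                   second_codebooks: Dict[str, Sequence[Codebook]]) -> Tuple[str, float]:
    """Return the candidate with the smallest epsilon_q together with epsilon_qmin."""
    errors = {spk: candidate_error(frames, cbs) for spk, cbs in second_codebooks.items()}
    best = min(errors, key=errors.get)
    return best, errors[best]


# Toy input: two candidates, each with t = 2 second codebooks.
frames = [(0.0, 0.0), (2.0, 2.0)]
second_codebooks = {
    "a": [[(0.0, 0.0), (2.0, 2.0)], [(1.0, 1.0)]],
    "b": [[(5.0, 5.0)], [(4.0, 4.0)]],
}
winner, eps_qmin = best_candidate(frames, second_codebooks)
```

Only the candidates that survive the first stage need to be passed in, which is what keeps this more expensive step off the common path.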
【００５８】
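The first-stage determination means 28 determines whether each difference min_q between the minimum speaker likelihood L^x, among the N speaker likelihoods L^n (n = 1, 2, …, N) calculated by the speaker likelihood calculation means 26, and each of the other speaker likelihoods is greater than the preset first threshold η.
【００５９】
Here, min_q = L_q − L_1 (q = 2, 3, 4, …). L_1 is the smallest (L^x) of the N calculated speaker likelihoods L^n (n = 1, 2, …, N), relabeled with the subscript 1, and L_2, L_3, L_4, … are the second, third, fourth, … smallest, relabeled with the subscripts 2, 3, 4, …. This relabeling is only for convenience of explanation; the actual computation need not relabel in this way, as long as the determination described above is carried out in effect.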
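As a rough sketch of the first-stage determination, the following Python function sorts the speaker likelihoods, forms the differences min_q = L_q − L_1, and returns the candidate list. The function name `first_stage` and the sample values are assumptions for illustration; the comparison follows the embodiment (min_q > η means the runner-up is excluded, min_q ≤ η means it becomes a candidate).

```python
from typing import Dict, List


def first_stage(likelihoods: Dict[str, float], eta: float) -> List[str]:
    """Return the candidates: the minimum-likelihood speaker first, followed by
    every speaker whose difference min_q = L_q - L_1 is at most eta."""
    ranked = sorted(likelihoods.items(), key=lambda item: item[1])
    best_speaker, best_value = ranked[0]
    candidates = [best_speaker]
    for speaker, value in ranked[1:]:
        if value - best_value <= eta:  # min_q <= eta: too close to call
            candidates.append(speaker)
    return candidates


# Hypothetical likelihood values for four registered speakers.
likelihoods = {"s1": 1.0, "s2": 1.2, "s3": 1.4, "s4": 3.0}
several = first_stage(likelihoods, eta=0.5)   # s2 and s3 are close to s1
sole = first_stage(likelihoods, eta=0.1)      # s1 stands out on its own
```

A single-element result corresponds to the sole-candidate branch; a longer list corresponds to the plural-candidate branch that goes on to the absolute-value mean vector error.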
【００６０】
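When the first-stage determination means 28 determines that all of the differences min_q are greater than the first threshold η, the second-stage sole-candidate determination means 29 takes the speaker with the minimum speaker likelihood L^x (the speaker with speaker number x) as the sole candidate and determines whether L^x falls within the range between the second threshold θmin^x and the third threshold θmax^x set for that speaker. If L^x falls within the range between θmin^x and θmax^x, the sole candidate, speaker x, is accepted as matching the person who input the speech to be identified; if it does not, the candidate is rejected.
【００６１】
The second-stage plural-candidate determination means 30 determines whether the minimum absolute-value mean vector error ε_qmin, among the errors ε_q calculated by the absolute-value mean vector error calculation means 27, falls within the range between the second threshold θmin^y and the third threshold θmax^y set for the candidate of ε_qmin (let y be that candidate's speaker number). If ε_qmin falls within the range between θmin^y and θmax^y, the candidate of ε_qmin (speaker y) is accepted as matching the person who input the speech to be identified; if it does not, the candidate is rejected.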
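The two second-stage branches can be sketched together as one function. This is an illustrative sketch under stated assumptions: the name `second_stage`, the dictionary-based inputs, and the sample numbers are hypothetical, and the range test treats both ends as inclusive because the source only says "within the range between" the two thresholds.

```python
from typing import Dict, List, Optional, Tuple

Bounds = Tuple[float, float]  # (theta_min, theta_max) for one speaker


def in_range(value: float, bounds: Bounds) -> bool:
    # Assumption: both ends of the range are inclusive.
    theta_min, theta_max = bounds
    return theta_min <= value <= theta_max


def second_stage(candidates: List[str],
                 likelihoods: Dict[str, float],
                 errors: Dict[str, float],
                 thresholds: Dict[str, Bounds]) -> Optional[str]:
    """Return the accepted speaker, or None when the input is rejected."""
    if len(candidates) == 1:
        # Sole candidate: test the minimum speaker likelihood L^x.
        x = candidates[0]
        return x if in_range(likelihoods[x], thresholds[x]) else None
    # Plural candidates: test the smallest absolute-value mean vector error.
    y = min(candidates, key=lambda speaker: errors[speaker])
    return y if in_range(errors[y], thresholds[y]) else None


thresholds = {"s1": (0.5, 1.5), "s2": (0.4, 1.2)}
accepted_sole = second_stage(["s1"], {"s1": 1.0}, {}, thresholds)
rejected_sole = second_stage(["s1"], {"s1": 2.0}, {}, thresholds)
accepted_plural = second_stage(["s1", "s2"], {"s1": 1.0, "s2": 1.1},
                               {"s1": 1.4, "s2": 0.9}, thresholds)
```

Returning `None` models rejection, which covers the case where the speaker is not one of the N registered speakers.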
【００６２】
第一閾値初期設定手段３１は、第一の閾値ηの初期設定の処理を行うものである（図３参照）。
【００６３】
第二・第三閾値初期設定手段３２は、第二の閾値θminⁿおよび第三の閾値θmaxⁿ（ｎ＝１，２，…，Ｎ）の初期設定の処理を行うものである（図４参照）。
【００６４】
第二・第三閾値自動更新手段３３は、第二の閾値θminⁿおよび第三の閾値θmaxⁿ（ｎ＝１，２，…，Ｎ）の自動更新の処理を行うものである（図５参照）。
【００６５】
第一コードブック記憶手段４１は、第一のコードブックＣ_i ⁿを記憶するものである。サイズＬ（例えばＬ＝２５６等）の第一のコードブックが、Ｎ人の登録話者全員について一人一つずつ用意される。
【００６６】
第二コードブック記憶手段４２は、第二のコードブックＣ_i ^n,vを記憶するものである。サイズＬ（例えばＬ＝２５６等）の第二のコードブックが、Ｎ人の登録話者全員について一人複数個（ｔ個、例えばｔ＝３個等）ずつ用意される。
【００６７】
第一閾値記憶手段４３は、第一の閾値ηを記憶するものである。第一の閾値ηは、システム１０に一つだけ用意されるものであり、各登録話者毎に個別に用意されるものではない。
【００６８】
第二閾値記憶手段４４は、第二の閾値θminⁿを記憶するものである。第二の閾値θminⁿは、各登録話者（話者番号ｎ＝１，２，…，Ｎ）毎に個別に用意されるものである。
【００６９】
第三閾値記憶手段４５は、第三の閾値θmaxⁿを記憶するものである。第三の閾値θmaxⁿは、各登録話者（話者番号ｎ＝１，２，…，Ｎ）毎に個別に用意されるものである。
【００７０】
サンプルデータ記憶手段４６は、サンプルデータ（例えば２０〜３０秒／話者）としての音声データを記憶するものであり、例えば、ＷＡＶフォーマット等のファイル形式での保存を行う。サンプルデータは、Ｎ人の登録話者全員について一人一つずつ用意される。
【００７１】
候補データ記憶手段４７は、候補データ（例えば５〜１０秒／話者）としての音声データを記憶するものであり、例えば、ＷＡＶフォーマット等のファイル形式での保存を行う。候補データは、Ｎ人の登録話者全員について一人複数個（ｔ個、例えばｔ＝３個等）ずつ用意される。
【００７２】
被識別音声入力データ記憶手段４８は、被識別音声の入力データ（例えば５〜１０秒）を記憶するものであり、例えば、ＷＡＶフォーマット等のファイル形式での保存を行う。記憶された入力データは、システム１０による学習機能に用いられ、次回以降の話者識別処理に活かされる（図５参照）。
【００７３】
第一閾値設定用テストデータ記憶手段４９は、第一の閾値ηの初期設定に用いられる第一閾値設定用テストデータ（例えば５〜１０秒）としての音声データを記憶するものであり、例えば、ＷＡＶフォーマット等のファイル形式での保存を行う。第一閾値設定用テストデータは、Ｎ人の登録話者の中の任意の一人による音声データである。
【００７４】
第二・第三閾値設定用テストデータ記憶手段５０は、第二の閾値θminⁿおよび第三の閾値θmaxⁿの初期設定に用いられる第二・第三閾値設定用テストデータ（例えば５〜１０秒）としての音声データを記憶するものであり、例えば、ＷＡＶフォーマット等のファイル形式での保存を行う。第二・第三閾値設定用テストデータは、Ｎ人の登録話者全員について一人複数回（Ｈ回、例えばＨ＝１０回等）ずつ入力して取得した音声データである。
【００７５】
そして、処理手段２０を構成する各手段２１〜３３は、コンピュータ本体（パーソナル・コンピュータのみならず、その上位機種のものも含む。）の内部に設けられた中央演算処理装置（ＣＰＵ）、およびこのＣＰＵの動作手順を規定する一つまたは複数のプログラムにより実現される。
【００７６】
また、処理手段２０は、一台のコンピュータあるいは一つのＣＰＵにより実現されるものに限定されず、複数のコンピュータ等で分散処理を行うことにより実現されるものであってもよい。
【００７７】
さらに、記憶手段４０を構成する各手段４１〜５０は、例えばハードディスク等により好適に実現されるが、記憶容量やアクセス速度等に問題が生じない範囲であれば、ＲＯＭ、ＥＥＰＲＯＭ、フラッシュ・メモリ、ＲＡＭ、ＭＯ、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、ＣＤ−ＲＷ、ＤＶＤ−ＲＯＭ、ＤＶＤ−ＲＡＭ、ＦＤ、磁気テープ、あるいはこれらの組合せ等を採用してもよい。
【００７８】
音声入力手段６０としては、各種の音声収録マイク等を採用することができる。
【００７９】
表示手段７０としては、例えば、液晶ディスプレイ、ＣＲＴディスプレイ、有機ＥＬ（エレクトロルミネッセンス）ディスプレイ、ＥＣＬ（エレクトロケミルミネッセンス）ディスプレイ、プロジェクタおよびスクリーン、あるいはこれらの組合せ等を採用することができる。
【００８０】
出力手段８０としては、プリンタ、プロッタ、あるいはこれらの組合せ等を採用することができる。
【００８１】
このような本実施形態においては、以下のようにして話者識別システム１０により被識別音声についての話者識別処理を行う。
【００８２】
図２において、先ず、音声入力手段６０を用いて、Ｎ人（例えばＮ＝１００人等）の登録話者全員から、サンプルデータとして、一人の話者につき例えば２０〜３０秒の音声データを取得し、これらの取得したサンプルデータを、サンプルデータ記憶手段４６に記憶保存する（ステップＳ１）。
【００８３】
続いて、各話者のサンプルデータを用いて、サンプルデータ用特徴パラメータ作成手段２１により音響特徴量を抽出し、各話者の音声についての特徴パラメータを作成する（ステップＳ２）。特徴パラメータは、１６次メルケプストラムである。
【００８４】
さらに、得られた各話者の特徴パラメータを用いて、第一コードブック作成手段２４によりクラスタリング処理を行い、Ｎ人の各話者（ｎ＝１，２，…，Ｎ）についての第一のコードブックＣ_i ⁿを作成し（ステップＳ３）、これらの作成した第一のコードブックＣ_i ⁿを第一コードブック記憶手段４１に記憶保存する（ステップＳ４）。話者一人（ｎ番目の話者）についての第一のコードブックＣ_i ⁿは、コードブックサイズがＬであるから、Ｌ個（例えば、Ｌ＝２５６個等）のコードブックベクトルにより構成されている。
【００８５】
次に、音声入力手段６０を用いて、Ｎ人（例えばＮ＝１００人等）の登録話者全員から、候補データとして、一人の話者につき例えば５〜１０秒の音声データを複数個（ｔ個、例えばｔ＝３個等）ずつ取得し、これらの取得した候補データを、候補データ記憶手段４７に記憶保存する（ステップＳ１１）。
【００８６】
続いて、各話者の候補データを用いて、候補データ用特徴パラメータ作成手段２２により音響特徴量を抽出し、各話者の音声についての特徴パラメータを作成する（ステップＳ１２）。特徴パラメータは、１６次メルケプストラムである。
【００８７】
さらに、得られた各話者の特徴パラメータを用いて、第二コードブック作成手段２５によりクラスタリング処理を行い、Ｎ人の各話者（ｎ＝１，２，…，Ｎ）について複数個（ｔ個）ずつの第二のコードブックＣ_i ^n,vを作成し（ステップＳ１３）、これらの作成した第二のコードブックＣ_i ^n,vを第二コードブック記憶手段４２に記憶保存する（ステップＳ１４）。話者一人（ｎ番目の話者）についての第二のコードブックＣ_i ^n,vは、複数個（ｔ個）用意され、これらの複数個の第二のコードブックＣ_i ^n,v（ｖ＝１，２，…，ｔ）の各々は、コードブックサイズがＬであるから、Ｌ個（例えば、Ｌ＝２５６個等）のコードブックベクトルにより構成されている。
【００８８】
以上の前処理を行った後に、第一の閾値η、第二の閾値θminⁿ、および第三の閾値θmaxⁿの初期設定を行うが、詳細は後述する（図３のステップＳ４１〜Ｓ４４、図４のステップＳ５１〜Ｓ５４）。
【００８９】
そして、前処理および各閾値の初期設定を行った後に、話者識別システム１０を稼働させ、実際に話者識別を行う際には、以下のような処理を行う。
【００９０】
図２において、先ず、音声入力手段６０を用いて、識別の対象となる被識別音声を入力して話者識別システム１０に取り込む（ステップＳ２１）。この際、取り込む入力データは、例えば５〜１０秒の音声データである。
【００９１】
続いて、被識別音声の入力データを用いて、入力データ用特徴パラメータ作成手段２３により音響特徴量を抽出し、被識別音声についての特徴パラメータを作成する（ステップＳ２２）。特徴パラメータは、１６次メルケプストラムであり、フレーム数がＭであるから、Ｍ個のベクトルＸ₁，Ｘ₂，…，Ｘ_Mが作成される。
【００９２】
さらに、得られた特徴パラメータであるＭ個のベクトルＸ₁，Ｘ₂，…，Ｘ_Mと、第一コードブック記憶手段４１に記憶保存されている第一のコードブックＣ_i ⁿとを用いて、話者性尤度算出手段２６により、ベクトル量子化（ＶＱ）を行い、各話者（ｎ＝１，２，…，Ｎ）についての話者性尤度Ｌⁿを算出する（ステップＳ２３）。
【００９３】
次に、算出した話者性尤度Ｌⁿ（上付の添字ｎは、話者番号であり、ｎ＝１，２，…，Ｎ）のうちの最小値Ｌ^x（話者番号ｘの話者についての話者性尤度）を下付の添字１を使ってＬ₁と置き換えるものとする。また、算出した話者性尤度Ｌⁿのうち二番目に小さな値を下付の添字２を使ってＬ₂と置き換えるものとする。同様にして、算出した話者性尤度Ｌⁿのうち三番目、四番目、…に小さな値をＬ₃，Ｌ₄，…と置き換えるものとする。
【００９４】
そして、第一段階判定手段２８により、最小の話者性尤度Ｌ^x（＝Ｌ₁）と、その他の話者性尤度Ｌⁿ（＝Ｌ₂，Ｌ₃，Ｌ₄，…）との差ｍｉｎ_qが、予め設定されている第一の閾値ηよりも大きいか否かを判定する（ステップＳ２４）。すなわち、ｍｉｎ_q＝Ｌ_q−Ｌ₁（ｑ＝２，３，４，…）であり、このｍｉｎ_qについて次式に基づく判定を行う。
【００９５】
ｍｉｎ_q＞η
【００９６】
ここで、全てのｍｉｎ_qが第一の閾値ηよりも大きいと判断された場合には、Ｌ₂−Ｌ₁、Ｌ₃−Ｌ₁、Ｌ₄−Ｌ₁、…の全てが第一の閾値ηよりも大きいということであるから、Ｌ₁の値に対してＬ₂，Ｌ₃，Ｌ₄，…の各値が離れているということである。従って、最小の話者性尤度Ｌ^x（＝Ｌ₁）が、その他の話者性尤度Ｌⁿ（＝Ｌ₂，Ｌ₃，Ｌ₄，…）と比べ、突出して小さな値であることを意味するので、この場合には、話者性尤度Ｌ^x（＝Ｌ₁）の話者（話者番号ｘの話者）を唯一の候補者として次の第二段階の判定処理に進む。
【００９７】
続いて、話者性尤度Ｌ^xの話者を唯一の候補者とした後に、第二段階唯一候補者判定手段２９により、この唯一の候補者が、識別音声を入力した者と一致するか否かを判定する（ステップＳ２５）。この判定の際には、話者性尤度Ｌ^xが、話者番号ｘの話者について予め設定された第二の閾値θmin^xと第三の閾値θmax^xとの間の範囲に入るか否かを判断する。
【００９８】
そして、話者性尤度Ｌ^xが、第二の閾値θmin^xと第三の閾値θmax^xとの間の範囲に入ると判断された場合には、話者性尤度Ｌ^xの話者が、識別音声を入力した者と一致するとして受理し（ステップＳ２６）、話者性尤度Ｌ^xが、第二の閾値θmin^xと第三の閾値θmax^xとの間の範囲に入らないと判断された場合には、話者性尤度Ｌ^xの話者は、識別音声を入力した者と一致しないとして棄却する（ステップＳ２７）。
【００９９】
一方、ステップＳ２４で、各差ｍｉｎ_qのうちの少なくとも一つが、第一の閾値η以下であると判断された場合には、Ｌ₂−Ｌ₁、Ｌ₃−Ｌ₁、Ｌ₄−Ｌ₁、…の中に第一の閾値η以下の差が存在するということであるから、Ｌ₁に近い値となる話者性尤度が、Ｌ₂，Ｌ₃，Ｌ₄，…の中に存在することを意味する。この場合には、最小の話者性尤度Ｌ^x（＝Ｌ₁）に近いと判断された話者性尤度の話者（すなわち、最小の話者性尤度Ｌ^xとの差が第一の閾値η以下の話者性尤度の話者）および最小の話者性尤度Ｌ^xの話者を、複数の候補者として次の第二段階の判定処理に進む。例えば、Ｌ₂−Ｌ₁、Ｌ₃−Ｌ₁が第一の閾値η以下であると判断され、Ｌ₄−Ｌ₁、Ｌ₅−Ｌ₁、Ｌ₆−Ｌ₁、…が第一の閾値ηよりも大きいと判断された場合には、Ｌ₂およびＬ₃がＬ₁に近い値なので、Ｌ₁を含めて話者性尤度Ｌ₁，Ｌ₂，Ｌ₃の３人の話者が、複数の候補者となる。従って、複数の候補者が何人になるかは、ｍｉｎ_q＞ηの判定結果により変化し、Ｎ人の登録話者全員になることもあり得る。なお、以下においては、ステップＳ２４の判定処理で、ｍｉｎ_q≦ηと判断された話者性尤度Ｌ_q（ｑ＝２，３，…）を、Ｌ₁を含めて候補者番号ｑを示す下付の添字を付してＬ_q（ｑ＝１，２，…）と表現する。上記の例では、Ｌ₁，Ｌ₂，Ｌ₃がＬ_qとなる。
【０１００】
続いて、Ｌ₁に近い値の話者性尤度の話者を、話者性尤度Ｌ₁の話者を含めて複数の候補者とした後に、これらの複数の候補者についての絶対値平均ベクトル誤差を、絶対値平均ベクトル誤差算出手段２７により算出する（ステップＳ２８）。
【０１０１】
絶対値平均ベクトル誤差算出手段２７による処理は、先ず、入力データ用特徴パラメータ作成手段２３により作成された特徴パラメータＸ_kと、第二コードブック記憶手段４２に記憶された第二のコードブックＣ_i ^n,vとを用いて、ステップＳ２４の第一段階の判定処理で選出された複数の各候補者（話者）についての絶対値平均ベクトル誤差ε_q,vを算出する。この際、絶対値平均ベクトル誤差は、一人の候補者につき、複数個（ｔ個、例えばｔ＝３個等）算出する。従って、候補者番号ｑの候補者については、ε_q,1，ε_q,2，…，ε_q,tが算出されるので、これらのｔ個の絶対値平均ベクトル誤差の平均値を求めてε_qとする。さらに、複数の候補者全員について、ε_qを求め、これらのε_qの中の最小値を求めてε_qminとする。
【０１０２】
その後、第二段階複数候補者判定手段３０により、最小の絶対値平均ベクトル誤差ε_qminが、ε_qminの候補者（この候補者の話者番号をｙとする。）について予め設定された第二の閾値θmin^yと第三の閾値θmax^yとの間の範囲に入るか否かを判断する（ステップＳ２９）。
【０１０３】
そして、最小の絶対値平均ベクトル誤差ε_qminが、第二の閾値θmin^yと第三の閾値θmax^yとの間の範囲に入ると判断された場合には、ε_qminの候補者である話者性尤度Ｌ^y（＝Ｌ_q）の話者が、識別音声を入力した者と一致するとして受理し（ステップＳ３０）、最小の絶対値平均ベクトル誤差ε_qminが、第二の閾値θmin^yと第三の閾値θmax^yとの間の範囲に入らないと判断された場合には、ε_qminの候補者である話者性尤度Ｌ^y（＝Ｌ_q）の話者は、識別音声を入力した者と一致しないとして棄却する（ステップＳ３１）。
【０１０４】
なお、以上の処理による話者識別の結果は、適宜、表示手段７０により画面表示したり、出力手段８０により印刷してもよく、あるいは各種システムの個人認証処理に利用してもよい。
【０１０５】
次に、第一閾値初期設定手段３１により、第一の閾値ηの初期設定を行う際の処理を詳述する。
【０１０６】
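In FIG. 3, the speech input means 60 is used to capture, as first-threshold-setting test data, speech data from any one of the N registered speakers (call this the speaker with speaker number P), and the captured speech data is stored in the first-threshold-setting test data storage means 49 (step S41). The captured speech data is, for example, 5 to 10 seconds long.
【０１０７】
Next, the input-data feature parameter creation means 23 extracts acoustic features from the first-threshold-setting test data and creates the speech feature parameters X_k (k = 1, 2, …, M) (step S42). The feature parameters X_k are 16th-order mel-cepstra.
【０１０８】
Then the speaker likelihood calculation means 26 performs vector quantization (VQ) using the obtained feature parameters X_k and the first codebooks C_i^n stored in the first codebook storage means 41, and calculates the speaker likelihood L^n (n = 1, 2, …, N) for each of the N registered speakers (step S43). The difference between the speaker likelihood L^P of that one speaker (speaker number P) and the speaker likelihood closest in value to L^P among those of the other speakers is determined as the first threshold η, and is stored and set in the first threshold storage means 43 (step S44).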
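Step S44 reduces to a one-line computation once the N speaker likelihoods are available. The sketch below assumes the likelihoods have already been calculated and are held in a dictionary; the name `initial_eta`, the key `"P"`, and the sample values are hypothetical. Taking the absolute value of the difference is also an assumption, since the source only speaks of "the difference".

```python
from typing import Dict


def initial_eta(likelihoods: Dict[str, float], p: str) -> float:
    """First threshold: gap between speaker P's own likelihood L^P and the
    closest likelihood obtained for any other registered speaker."""
    own = likelihoods[p]
    others = [value for speaker, value in likelihoods.items() if speaker != p]
    if not others:
        raise ValueError("at least two registered speakers are required")
    return min(abs(value - own) for value in others)


# Likelihoods computed from speaker P's test utterance (made-up values).
likelihoods = {"P": 1.0, "a": 1.5, "b": 3.0}
eta = initial_eta(likelihoods, "P")
```

Intuitively, η captures how far the most similar-sounding registered speaker sits from the true speaker for this test utterance.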
【０１０９】
次に、第二・第三閾値初期設定手段３２により、第二の閾値θminⁿおよび第三の閾値θmaxⁿの初期設定を行う際の処理を詳述する。
【０１１０】
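In FIG. 4, the speech input means 60 is used to capture, as second/third-threshold-setting test data, plural (H, for example H = 10) items of speech data from each of the N registered speakers, and the captured speech data is stored in the second/third-threshold-setting test data storage means 50 (step S51). The captured speech data is, for example, 5 to 10 seconds of data per speaker.
【０１１１】
Next, the input-data feature parameter creation means 23 extracts acoustic features from the second/third-threshold-setting test data and creates plural (H) sets of speech feature parameters X_k (k = 1, 2, …, M) for each of the N speakers (step S52). The feature parameters X_k are 16th-order mel-cepstra.
【０１１２】
Then the speaker likelihood calculation means 26 performs vector quantization (VQ) using the obtained feature parameters X_k and the first codebooks C_i^n stored in the first codebook storage means 41, and calculates plural (H) speaker likelihoods L^{n,h} (h = 1, 2, …, H) for each of the N registered speakers (n = 1, 2, …, N) (step S53).
【０１１３】
Then, for the speaker with registration number n, the minimum of the H speaker likelihoods L^{n,h} (h = 1, 2, …, H) is taken as that speaker's second threshold θmin^n and the maximum as that speaker's third threshold θmax^n; θmin^n and θmax^n are determined in this way for all N registered speakers (step S54). The determined θmin^n and θmax^n are stored and set in the second threshold storage means 44 and the third threshold storage means 45, respectively.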
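Steps S53 and S54 can be sketched as follows, assuming each speaker's H test utterances are scored against that speaker's own first codebook. The function names, the dictionary layout, and the 1-dimensional toy vectors are illustrative assumptions only.

```python
from typing import Dict, Sequence, Tuple

Vector = Sequence[float]


def speaker_likelihood(frames: Sequence[Vector], codebook: Sequence[Vector]) -> float:
    """Mean nearest-centroid squared distance, as in equation (1)."""
    return sum(
        min(sum((xj - cj) ** 2 for xj, cj in zip(x, c)) for c in codebook)
        for x in frames
    ) / len(frames)


def initial_thresholds(
    test_utterances: Dict[str, Sequence[Sequence[Vector]]],
    first_codebooks: Dict[str, Sequence[Vector]],
) -> Dict[str, Tuple[float, float]]:
    """For each speaker n, return (theta_min^n, theta_max^n): the smallest and
    largest of the H likelihoods L^{n,h} of that speaker's own test utterances."""
    thresholds = {}
    for speaker, utterances in test_utterances.items():
        values = [speaker_likelihood(u, first_codebooks[speaker]) for u in utterances]
        thresholds[speaker] = (min(values), max(values))
    return thresholds


# Toy setup: one speaker, H = 3 test utterances, 1-dimensional features.
first_codebooks = {"a": [(0.0,), (2.0,)]}
test_utterances = {
    "a": [
        [(0.0,), (2.0,)],   # likelihood 0.0
        [(1.0,), (1.0,)],   # likelihood 1.0
        [(0.5,), (2.0,)],   # likelihood 0.125
    ]
}
thresholds = initial_thresholds(test_utterances, first_codebooks)
```

With H = 10 real utterances per speaker, the resulting interval describes the spread of likelihood values that the genuine speaker typically produces.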
【０１１４】
次に、第二・第三閾値自動更新手段３３により、第二の閾値θminⁿおよび第三の閾値θmaxⁿの自動更新を行う際の処理を詳述する。
【０１１５】
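In FIG. 5, when the speaker of L^x (the speaker with speaker number x) is accepted in step S26 of FIG. 2, or the speaker of L^y (the speaker with speaker number y) is accepted in step S30 of FIG. 2, the sample data (for example, 20 to 30 seconds) for the accepted speaker is read from the sample data storage means 46 (step S61), and the input data of the speech to be identified (for example, 5 to 10 seconds) that was input during that identification process is read from the identified-speech input data storage means 48 (step S62).
【０１１６】
Next, the sample data and the input data of the speech to be identified are combined, and the sample-data feature parameter creation means 21 extracts acoustic features from the combined data and creates the speech feature parameters (step S63). The feature parameters are 16th-order mel-cepstra.
【０１１７】
Then the first codebook creation means 24 clusters the obtained feature parameters to re-create the first codebook C_i^n (n = x or y) for the accepted speaker (the speaker with speaker number x or y) (step S64), and the re-created first codebook C_i^n is stored in the first codebook storage means 41 to update the data (step S65).
【０１１８】
After that, the plural (H, for example H = 10) items of second/third-threshold-setting test data (for example, 5 to 10 seconds) prepared for the accepted speaker are read from the second/third-threshold-setting test data storage means 50 (step S66).
【０１１９】
Next, the input-data feature parameter creation means 23 extracts acoustic features from the read second/third-threshold-setting test data and from the input data of the speech to be identified, and creates (H + 1) sets of speech feature parameters X_k (k = 1, 2, …, M) for the accepted speaker (step S67). The feature parameters X_k are 16th-order mel-cepstra. For example, if H = 10 items of second/third-threshold-setting test data were originally prepared for the accepted speaker, the newly acquired input data of the speech to be identified is added as learning data, giving (H + 1) = 11 sets.
【０１２０】
Then the speaker likelihood calculation means 26 performs vector quantization (VQ) using the obtained feature parameters X_k and the updated first codebook C_i^n (n = x or y) stored in the first codebook storage means 41, and calculates (H + 1) speaker likelihoods L^{n,h} (h = 1, 2, …, H, H + 1) for the accepted speaker (the speaker with speaker number x or y) (step S68).
【０１２１】
Finally, for the accepted speaker (speaker number n = x or y), the minimum of the (H + 1) speaker likelihoods L^{n,h} (h = 1, 2, …, H, H + 1) is taken as that speaker's updated second threshold θmin^n and the maximum as the updated third threshold θmax^n, and these updated values are stored in the second threshold storage means 44 and the third threshold storage means 45, respectively, to complete the update (step S69).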
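The update in steps S61 to S69 can be sketched as below. Several simplifications are assumptions made here and are not from the patent: the sample data and input data are combined at the feature-frame level rather than as audio, the clustering step is passed in as a callable (`build_codebook`), and the toy builder `single_centroid` produces a codebook of size 1 purely to keep the example small.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]
Codebook = List[Vector]


def speaker_likelihood(frames: Sequence[Vector], codebook: Sequence[Vector]) -> float:
    """Mean nearest-centroid squared distance, as in equation (1)."""
    return sum(
        min(sum((xj - cj) ** 2 for xj, cj in zip(x, c)) for c in codebook)
        for x in frames
    ) / len(frames)


def update_thresholds(
    sample_frames: Sequence[Vector],
    input_frames: Sequence[Vector],
    test_utterances: Sequence[Sequence[Vector]],
    build_codebook: Callable[[Sequence[Vector]], Codebook],
) -> Tuple[Codebook, float, float]:
    """Re-create the accepted speaker's first codebook from sample + input data,
    then recompute theta_min / theta_max over the H test utterances plus the input."""
    new_codebook = build_codebook(list(sample_frames) + list(input_frames))
    utterances = list(test_utterances) + [input_frames]  # H + 1 utterances
    values = [speaker_likelihood(u, new_codebook) for u in utterances]
    return new_codebook, min(values), max(values)


def single_centroid(frames: Sequence[Vector]) -> Codebook:
    """Stand-in for real clustering: a codebook holding only the mean vector."""
    dim = len(frames[0])
    return [tuple(sum(f[j] for f in frames) / len(frames) for j in range(dim))]


# Toy run with 1-dimensional features and H = 2 stored test utterances.
new_codebook, theta_min, theta_max = update_thresholds(
    sample_frames=[(0.0,), (2.0,)],
    input_frames=[(4.0,)],
    test_utterances=[[(2.0,)], [(3.0,)]],
    build_codebook=single_centroid,
)
```

Because the update runs only after an acceptance, the thresholds drift gradually with the speaker's voice, which matches the gradual-adaptation behaviour the description attributes to the automatic update means.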
【０１２２】
このような本実施形態によれば、次のような効果がある。すなわち、話者識別を行うにあたって、判定を二つの段階で行うようにし、第一段階判定手段２８による判定（図２のステップＳ２４）の結果に応じて、第二段階の判定が簡易な処理になる場合（ステップＳ２５）と、それよりも複雑な処理になる場合（ステップＳ２８，Ｓ２９）とに分かれるようにしたので、全てについて一律な内容の判定処理を行う場合に比べ、話者識別に要する処理時間を短縮することができ、実時間応答性の向上を図ることができる。
【０１２３】
また、第一段階判定手段２８による判定処理において、最小の話者性尤度Ｌ^xに近い値の話者性尤度が少なくとも一つ存在すると判定された場合には、最小の話者性尤度Ｌ^xに近い値の話者性尤度の話者が、被識別音声を入力した者と一致する可能性もあるため、その場合には、絶対値平均ベクトル誤差ε_q,vあるいはε_q，ε_qminを算出する処理（図２のステップＳ２８）を行ってから第二段階の判定処理（ステップＳ２９）を行うので、上述した如く実時間応答性の向上を図りつつ、同時に識別精度の向上を図ることができる。
【０１２４】
さらに、第一閾値初期設定手段３１が設けられ、図３で述べたように、この第一閾値初期設定手段３１により、第一閾値設定用テストデータの提供者自身の話者性尤度Ｌ^Pと、話者性尤度Ｌ^Pに最も近い値の話者性尤度との差が、第一の閾値ηとして設定されるので、第一段階の判定処理に用いる閾値として適切な値を設定することができる。
【０１２５】
そして、第二・第三閾値初期設定手段３２が設けられ、図４で述べたように、この第二・第三閾値初期設定手段３２により、第二・第三閾値設定用テストデータとしてＮ人の各話者本人からそれぞれ複数個（Ｈ個）ずつ取得した音声データを用いて第二の閾値θminおよび第三の閾値θmaxが設定されるので、第二段階の判定処理に用いる閾値として適切な値を設定することができる。
【０１２６】
また、第二・第三閾値自動更新手段３３が設けられているので、システム１０に学習機能を付加することができ、体調や健康等の変化に起因して、登録されたＮ人の各話者の音声の特徴が変化したときには、それに追従させて第二の閾値θminおよび第三の閾値θmaxの設定を徐々に変化させていくことができる。
【０１２７】
さらに、被識別音声の入力データ（例えば５〜１０秒）は、第一段階の判定処理用のサンプルデータ（例えば２０〜３０秒）よりも短時間のデータとされているため、話者識別に要する時間を短縮することができ、実時間応答性の向上を図ることができる。
【０１２８】
また、絶対値平均ベクトル誤差算出手段２７は、Ｎ人の各話者毎に複数（ｔ個）ずつ作成された第二のコードブックＣ_i ^n,vを用いて、各候補者毎に複数（ｔ個）ずつの絶対値平均ベクトル誤差ε_q,vを算出し、これらの絶対値平均ベクトル誤差ε_q,vの平均値ε_qを算出する処理を行うので、話者識別の精度を、より一層向上させることができる。
【０１２９】
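Further, because 16th-order mel-cepstra are used as the feature parameters, the system 10 can be realized with due consideration of the trade-off between shorter processing time (reduced computation) and assured identification accuracy.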
【０１３０】
また、音声データからスペクトル（本実施形態では、一例として１６次メルケプストラム）をとる処理を行うので、発話内容に影響されることなく話者識別を行うことができる。
【０１３１】
さらに、第一の閾値η、第二の閾値θmin、第三の閾値θmaxの各値を調整することにより、中国語や日本語等の言語の相違、携帯電話やおもちゃ等の用途の相違、録音環境や録音機器等の各種環境の差異に対応することができる。
【０１３２】
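To confirm the effect of the present invention, the following evaluation experiment was carried out on the system. The experimental data and speech analysis conditions were as follows.
【０１３３】
[Experimental data]
(1) Total number of registered speakers: 100
(2) Evaluation test data: 100 speakers × (3 utterances per speaker)
(3) Age and sex: 20 to 50 years old; 50 men and 50 women
(4) Recording environment: indoors, under ordinary background noise
(5) Utterance content: spontaneous speech in conference presentations
(6) Utterance duration: 20 to 30 seconds for the sample data and 5 to 10 seconds for the evaluation test data
【０１３４】
[Speech analysis conditions]
(1) Sampling frequency: 16 kHz
(2) Analysis window: 10 ms
(3) Analysis period: 21.3 ms
(4) Acoustic features: 16th-order mel-cepstrum
【０１３５】
Table 1 shows the system evaluation results obtained with evaluation test data of 5 to 10 seconds, separately for sample data of 10 to 15 seconds and sample data of 20 to 30 seconds.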
【０１３６】
【表１】

【０１３７】
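According to Table 1, the effectiveness of the system is clear when the evaluation test data (that is, the input data of the speech to be identified) is 5 to 10 seconds long and the sample data is 20 to 30 seconds long. In the evaluation experiment, which actually ran speaker identification with 100 registered speakers, when the absolute-value mean vector error was not calculated (min_q > η for every min_q), the average correct-answer rate was 99.33%, and the false recognition rate and false rejection rate were held to 0.1% or less. Even when the absolute-value mean vector error was calculated (min_q ≤ η for at least one min_q), the average correct-answer rate remained as high as 98.75%. These results clearly demonstrate the effect of the present invention.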
【０１３８】
なお、本発明は前記実施形態に限定されるものではなく、本発明の目的を達成できる範囲内での変形等は本発明に含まれるものである。
【０１３９】
すなわち、前記実施形態では、第二・第三閾値自動更新手段３３が設けられ、第二の閾値θminおよび第三の閾値θmaxの設定を徐々に変化させることができるようになっていたが、第二・第三閾値自動更新手段３３の設置を省略してもよい。しかし、前記実施形態のように第二・第三閾値自動更新手段３３を設けておくことが、システム１０に学習機能を付加することができるという点で好ましい。
【０１４０】
また、第二・第三閾値自動更新手段３３による第二の閾値θminおよび第三の閾値θmaxの自動更新処理は、システム１０により話者識別を実行した結果、受理された場合に行われるようになっているので、各登録話者の音声の特徴が徐々に変化していった場合にのみ対応することができる。従って、ある登録話者の音声の特徴が、急激に大きく変化したような場合には、サンプルデータを取り直して対応することが好ましい。
【０１４１】
【発明の効果】
以上に述べたように本発明によれば、話者識別を行うにあたって、判定を二つの段階で行うようにし、第一段階の判定結果に応じて第二段階の判定が簡易な処理になる場合とそれよりも複雑な処理になる場合とに分かれるようにしたので、実時間応答性が良好で、かつ、高精度な話者識別を行うことができるという効果がある。
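[Brief Description of the Drawings]
[FIG. 1] Overall configuration diagram of a speaker identification system according to an embodiment of the present invention.
[FIG. 2] Explanatory diagram of the processing flow when the speaker identification system of the embodiment determines which of the plural speakers produced the speech to be identified.
[FIG. 3] Explanatory diagram of the processing flow for the initial setting of the first threshold η in the embodiment.
[FIG. 4] Explanatory diagram of the processing flow for the initial setting of the second threshold θmin and the third threshold θmax in the embodiment.
[FIG. 5] Explanatory diagram of the processing flow for updating the second threshold θmin and the third threshold θmax in the embodiment.
[Explanation of Reference Signs]
10 Speaker identification system
21 Sample-data feature parameter creation means
22 Candidate-data feature parameter creation means
23 Input-data feature parameter creation means
24 First codebook creation means
25 Second codebook creation means
26 Speaker likelihood calculation means
27 Absolute-value mean vector error calculation means
28 First-stage determination means
29 Second-stage sole-candidate determination means
30 Second-stage plural-candidate determination means
33 Second/third threshold automatic update means
41 First codebook storage means
42 Second codebook storage means
X_k Feature parameter
C_i^n First codebook
C_i^{n,v} Second codebook
L^n, L^{n,h} Speaker likelihood
L^x Minimum speaker likelihood
min_q Difference between the minimum speaker likelihood and each other speaker likelihood
η First threshold
θmin^n Second threshold
θmax^n Third threshold
ε_{q,v}, ε_q Absolute-value mean vector error
ε_qmin Minimum absolute-value mean vector error
[0001]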
BACKGROUND OF THE INVENTION
The present invention relates to a speaker identification method and system, and a program for determining who is the voice of a plurality of speakers registered in advance, for example, a plurality of talks. When searching for the voice of a specific speaker (such as an announcer) from continuous voice (such as long-term data from a broadcasting station) by the subscriber, or classifying the continuous voice into the voice of each speaker, When immigration control is performed at ports, etc., when criminals or persons requiring attention are managed (registration, search, investigation, etc.) by the police or the Self-Defense Forces, when residents are managed at government offices, customers are managed at shops, etc. When managing employees at the company (for example, knowing when to leave or leave the company), when managing security for access to information resources using the Internet, when performing automatic classification of languages and dialects, robot communication (robot If you want to apply to the dialogue partner of understanding, etc.) due, can be used when applying to the selection of understanding and response of the dialogue partner by the toys, various security of the voice, crime prevention, such as when performing the supervision.
[0002]
[Background]
In general, for speaker recognition, speaker identification ASI (Automatic Speaker Identification) for determining who the input speech is in the pre-registered speech, and whether the input speech is the user's own speech There is speaker verification ASV (Automatic Speaker Verification).
[0003]
In the case of speaker verification ASV, for the purpose of merely determining the identity of a certain person, the sample data of the person registered in advance is matched with the input data of the identified person, and the system simply “Yes”. Or only the answer of "No" is output. Therefore, many researches and application systems have been reported so far.
[0004]
In speaker identification (ASI), on the other hand, the system must determine which of N pre-registered speakers the speaker to be identified is by performing N comparison processes and, if the speaker is not one of the N registered speakers, must ultimately make a rejection decision. System processing is therefore more complicated and difficult than in speaker verification (ASV). As a result, previous research has not found feature parameters that allow the speaker to be estimated efficiently, and the research reports can be said to concentrate excessively on basic theoretical studies.
[0005]
[Problems to be solved by the invention]
To realize a speaker recognition system with good real-time responsiveness and high accuracy, several problems must be solved, such as the variation of acoustic features caused by differences in utterance content, time of utterance, and utterance duration, and the conflict between improving real-time responsiveness and improving identification accuracy. This applies to both speaker verification (ASV) and speaker identification (ASI). In speaker identification (ASI), however, system processing is more complicated and difficult than in speaker verification (ASV), as described above, so the problems to be solved are greater still, and system development currently lags behind.
[0006]
An object of the present invention is to provide a speaker identification method, a system thereof, and a program that can perform speaker identification with high accuracy and good real-time responsiveness.
[0007]
[Means for Solving the Problems]
The present invention is a speaker identification method for determining which of a plurality of pre-registered speakers uttered an input speech to be identified. As sample data for a first-stage determination process, speech data acquired from each of the speakers is used to create speech feature parameters for each speaker; clustering is then performed using these feature parameters to create a first codebook for each speaker, and the first codebooks are stored in first codebook storage means. As candidate data for a second-stage determination process performed after narrowing down to a plurality of candidates, speech data acquired from each of the speakers in an environment different from that of the sample data is used to create speech feature parameters for each speaker; clustering is then performed using these feature parameters to create a second codebook for each speaker, and the second codebooks are stored in second codebook storage means. When determining which of the speakers uttered the speech to be identified, speech feature parameters are created from the input data of the speech to be identified, and speaker likelihood calculation means calculates a speaker likelihood for each of the speakers using these feature parameters and the first codebooks stored in the first codebook storage means. First-stage determination means then determines whether each difference between the minimum speaker likelihood and each of the other speaker likelihoods is greater than (or, alternatively, greater than or equal to) a preset first threshold η. If every difference is determined to be greater than (or greater than or equal to) the first threshold η, the speaker with the minimum speaker likelihood is taken as the sole candidate, and second-stage sole-candidate determination means determines whether the minimum speaker likelihood falls within the range between a second threshold θmin and a third threshold θmax preset for each speaker, accepting the candidate if it does and rejecting the candidate if it does not. If, on the other hand, at least one of the differences is determined to be less than or equal to (or less than) the first threshold η, the speakers whose speaker likelihoods gave such differences, together with the speaker with the minimum speaker likelihood, are taken as a plurality of candidates. Absolute value average vector error calculation means calculates an absolute value average vector error for each of the candidates using the feature parameters and the second codebooks stored in the second codebook storage means, and second-stage multiple-candidate determination means determines whether the minimum of these absolute value average vector errors falls within the range between the second threshold θmin and the third threshold θmax, accepting the candidate with the minimum absolute value average vector error if it does and rejecting the candidate if it does not.
[0008]
Here, "when all of the differences are determined to be greater than (or greater than or equal to) the first threshold η" means that every other speaker likelihood is determined to be separated from the minimum speaker likelihood by more than (or by at least) the first threshold η. In short, it is the case where the minimum speaker likelihood is determined to be markedly smaller than the other speaker likelihoods.
[0009]
On the other hand, "when at least one of the differences is determined to be less than or equal to (or less than) the first threshold η" means that, among the speaker likelihoods other than the minimum, there is one whose difference from the minimum speaker likelihood is determined to be less than or equal to (or less than) the first threshold η. In short, it is the case where at least one speaker likelihood with a value close to the minimum speaker likelihood is determined to exist.
[0010]
In this speaker identification method of the present invention, the speaker likelihoods are first calculated and the first-stage determination is made using the first threshold η; the content of the second-stage determination process is then changed according to the result. That is, if the first-stage determination finds that the minimum speaker likelihood is markedly smaller than the other speaker likelihoods, the speaker with the minimum speaker likelihood is taken as the sole candidate as is, and the second-stage determination process determines whether the minimum speaker likelihood falls within the range between the second threshold θmin and the third threshold θmax. If, on the other hand, the first-stage determination finds at least one speaker likelihood with a value close to the minimum speaker likelihood, the candidates include not only the speaker with the minimum speaker likelihood but also the speakers whose speaker likelihoods are close to it; the absolute value average vector error is calculated for these candidates, and the second-stage determination process is performed using the second threshold θmin and the third threshold θmax.
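The first-stage determination described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the invention: the function name `first_stage` and the dictionary representation of the speaker likelihoods are assumptions, and the comparison uses the "less than or equal to" variant of the two alternatives the text allows.

```python
def first_stage(likelihoods, eta):
    """Return the candidate speaker numbers after the first-stage determination.

    likelihoods: dict mapping speaker number n to speaker likelihood L^n
                 (a smaller value means a closer match).
    eta:         first threshold.
    """
    best = min(likelihoods, key=likelihoods.get)
    l_min = likelihoods[best]
    # Speakers whose likelihood lies within eta of the minimum remain candidates.
    close = [n for n, value in likelihoods.items()
             if n != best and value - l_min <= eta]
    return [best] + sorted(close)
```

A returned list of length one corresponds to the sole-candidate case; a longer list corresponds to the multiple-candidate case, for which the absolute value average vector errors are then calculated.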
[0011]
For this reason, speaker identification is performed in two stages, and the second-stage determination is either a simple process or a more complicated one, depending on the result of the first stage. Compared with performing a uniform determination process in every case, the processing time required for speaker identification can therefore be shortened and real-time responsiveness improved.
[0012]
When the first-stage determination process finds at least one speaker likelihood with a value close to the minimum speaker likelihood, a speaker whose speaker likelihood is close to the minimum may be the person who input the speech to be identified. In that case, the second-stage determination is made after calculating the absolute value average vector errors, so identification accuracy can be improved while real-time responsiveness is improved as described above, thereby achieving the object.
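The two branches of the second-stage determination can be sketched as follows. This is a hedged illustration under stated assumptions: the function name `second_stage`, the dictionary arguments, and the inclusive range check are hypothetical, and in the multiple-candidate case the thresholds of the candidate with the minimum error are assumed to be the ones applied. The absolute value average vector errors are taken as given inputs.

```python
def second_stage(candidates, likelihoods, avg_errors, theta):
    """Accept a speaker number or return None (rejection).

    candidates:  speaker numbers left after the first stage.
    likelihoods: dict n -> speaker likelihood L^n.
    avg_errors:  dict n -> absolute value average vector error
                 (only consulted when there are several candidates).
    theta:       dict n -> (theta_min, theta_max) for each speaker.
    """
    if len(candidates) == 1:
        # Sole candidate: test the minimum speaker likelihood itself.
        n = candidates[0]
        score = likelihoods[n]
    else:
        # Several candidates: test the smallest average vector error.
        n = min(candidates, key=lambda c: avg_errors[c])
        score = avg_errors[n]
    theta_min, theta_max = theta[n]
    return n if theta_min <= score <= theta_max else None
```

Only the multiple-candidate branch needs the second codebooks, which is why the simpler branch keeps the response time short.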
[0013]
In the speaker identification method described above, when setting the first threshold η, it is desirable to create speech feature parameters using, as first-threshold-setting test data, speech data acquired from any one of the plurality of speakers; to have the speaker likelihood calculation means calculate speaker likelihoods for the plurality of speakers using these feature parameters and the first codebooks stored in the first codebook storage means; and to set, as the first threshold η, the difference between the speaker likelihood for that one speaker and the speaker likelihood whose value is closest to it.
[0014]
When the first threshold η is set in this way, it is based on the difference in speaker likelihood between the one speaker who provided the speech data used as the first-threshold-setting test data and the speaker whose speech features are most similar to that speaker's, so an appropriate value can be set as the threshold used in the first-stage determination process.
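The setting of the first threshold η can be sketched as follows, assuming the speaker likelihoods for all registered speakers have already been calculated from the test data of one speaker. The function name `initial_eta` and its arguments are hypothetical.

```python
def initial_eta(likelihoods, test_speaker):
    """Return the first threshold eta.

    likelihoods:  dict n -> speaker likelihood L^n calculated from the
                  test data provided by `test_speaker`.
    test_speaker: speaker number of the person who provided the test data.
    """
    own = likelihoods[test_speaker]
    # Gap to the other speaker whose likelihood is closest to the speaker's own.
    return min(abs(value - own)
               for n, value in likelihoods.items() if n != test_speaker)
```

For example, with likelihoods 0.5 (the test speaker), 0.8, and 1.4, the closest other value is 0.8, so η is approximately 0.3.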
[0015]
Furthermore, in the speaker identification method described above, when setting the second threshold θmin and the third threshold θmax, it is desirable to create a plurality of speech feature parameters for each speaker using, as second- and third-threshold-setting test data, a plurality of pieces of speech data acquired from each of the plurality of speakers; to have the speaker likelihood calculation means calculate a plurality of speaker likelihoods for each speaker using these feature parameters and the first codebooks stored in the first codebook storage means; and to set, for each speaker, the minimum of the plurality of speaker likelihoods calculated for that speaker as the second threshold θmin and the maximum as the third threshold θmax.
[0016]
When the second threshold value θmin and the third threshold value θmax are set in this way, it is possible to set appropriate values as threshold values used in the second-stage determination process.
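The per-speaker setting of θmin and θmax reduces to taking a minimum and a maximum. The sketch below assumes that each speaker's test recordings have already been scored against that speaker's own first codebook, which is one reading of the text; the function name `initial_thetas` is hypothetical.

```python
def initial_thetas(test_likelihoods):
    """Return dict n -> (theta_min, theta_max).

    test_likelihoods: dict n -> list of speaker likelihoods, one per piece
                      of threshold-setting test data for speaker n.
    """
    return {n: (min(values), max(values))
            for n, values in test_likelihoods.items()}
```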
[0017]
When the second threshold θmin and the third threshold θmax are set as described above, and the process of determining which of the plurality of speakers uttered the speech to be identified results in the person who input the speech being accepted as one of the speakers, it is desirable to re-create the first codebook for the matching speaker from the combination of the input data of the speech to be identified and the sample data, and then to update the second threshold θmin and the third threshold θmax for that speaker using the re-created first codebook, the plurality of second- and third-threshold-setting test data for that speaker, and the input data of the speech to be identified.
[0018]
When the second threshold θmin and the third threshold θmax are updated in this way using input data of speech accepted as that of a registered speaker, the settings of θmin and θmax can be changed gradually to follow changes in the speech characteristics of the registered speakers caused by changes in physical condition, health, and the like.
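The update procedure for an accepted speaker can be outlined as follows. This is a rough sketch only: `make_codebook` and `likelihood` are hypothetical callables standing in for the codebook creation and speaker likelihood calculation, and the data are represented as plain lists of frames.

```python
def update_thetas(sample_frames, input_frames, test_frame_sets,
                  make_codebook, likelihood):
    """Re-create the first codebook and update (theta_min, theta_max).

    sample_frames:   feature vectors of the stored sample data.
    input_frames:    feature vectors of the accepted input speech.
    test_frame_sets: list of feature-vector lists, one per piece of
                     threshold-setting test data for this speaker.
    make_codebook:   callable(frames) -> codebook (hypothetical).
    likelihood:      callable(frames, codebook) -> float (hypothetical).
    """
    # Re-create the first codebook from the combined sample and input data.
    codebook = make_codebook(sample_frames + input_frames)
    # Score the stored test data and the accepted input against it.
    scores = [likelihood(frames, codebook)
              for frames in test_frame_sets + [input_frames]]
    return codebook, min(scores), max(scores)
```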
[0019]
In the speaker identification method described above, it is desirable that the input data of the identified speech is data shorter than the sample data for the first stage determination process.
[0020]
For example, it is desirable that the input data of the identified voice is data for 5 to 10 seconds, and the sample data for the determination process in the first stage is data for 20 to 30 seconds.
[0021]
Thus, when the input data of the identified voice is shortened, the time required for speaker identification can be shortened, and real-time responsiveness can be improved.
[0022]
In the speaker identification method described above, it is desirable to create a plurality of second codebooks for each speaker and, when calculating the absolute value average vector error for each of the plurality of candidates using the second codebooks, to calculate a plurality of absolute value average vector errors for each candidate and take the average of these errors.
[0023]
As described above, when a plurality of second codebooks are prepared for each speaker and the average of the absolute value average vector errors is calculated for each candidate, the accuracy of speaker identification can be improved further.
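The averaging over several second codebooks can be sketched as follows. The absolute value average vector errors themselves are taken as given inputs here, and the function name `averaged_errors` is hypothetical.

```python
def averaged_errors(errors_per_codebook):
    """Average the errors obtained with each candidate's second codebooks.

    errors_per_codebook: dict mapping candidate number n to the list of
                         absolute value average vector errors, one per
                         second codebook of that candidate.
    """
    return {n: sum(errors) / len(errors)
            for n, errors in errors_per_codebook.items()}
```

The candidate with the smallest averaged value is the one then tested against θmin and θmax.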
[0024]
Further, in the speaker identification method described above, a mel cepstrum or the like can suitably be used as the feature parameter. In the case of a mel cepstrum, a 16th-order mel cepstrum is preferable from the viewpoint of both shortening processing time (reducing the amount of calculation) and ensuring identification accuracy, but the feature parameter is not limited to this.
[0025]
Further, as a system for realizing the speaker identification method of the present invention described above, the following speaker identification system of the present invention can be cited.
[0026]
That is, the present invention is a speaker identification system that determines which of a plurality of pre-registered speakers uttered an input speech to be identified, the system comprising: sample data feature parameter creation means for creating speech feature parameters for each speaker using speech data acquired from each of the speakers as sample data for a first-stage determination process; first codebook creation means for creating a first codebook for each speaker by clustering using the feature parameters created by the sample data feature parameter creation means; first codebook storage means for storing the first codebook created for each speaker by the first codebook creation means; candidate data feature parameter creation means for creating speech feature parameters for each speaker using, as candidate data for a second-stage determination process performed after narrowing down to a plurality of candidates, speech data acquired from each of the speakers in an environment different from that of the sample data; second codebook creation means for creating a second codebook for each speaker by clustering using the feature parameters created by the candidate data feature parameter creation means; second codebook storage means for storing the second codebook created for each speaker by the second codebook creation means; input data feature parameter creation means for creating speech feature parameters using the input data of the speech to be identified when determining which of the speakers uttered it; speaker likelihood calculation means for calculating a speaker likelihood for each of the speakers using the feature parameters created by the input data feature parameter creation means and the first codebooks stored in the first codebook storage means; first-stage determination means for determining whether each difference between the minimum of the speaker likelihoods calculated by the speaker likelihood calculation means and each of the other speaker likelihoods is greater than (or greater than or equal to) a preset first threshold η; second-stage sole-candidate determination means that, when the first-stage determination means determines that every difference is greater than (or greater than or equal to) the first threshold η, takes the speaker with the minimum speaker likelihood as the sole candidate, determines whether the minimum speaker likelihood falls within the range between a second threshold θmin and a third threshold θmax preset for each speaker, and accepts the candidate if it does and rejects the candidate if it does not; absolute value average vector error calculation means that, when the first-stage determination means determines that at least one of the differences is less than or equal to (or less than) the first threshold η, takes the speakers whose speaker likelihoods gave such differences, together with the speaker with the minimum speaker likelihood, as a plurality of candidates, and calculates an absolute value average vector error for each of the candidates using the feature parameters created by the input data feature parameter creation means and the second codebooks stored in the second codebook storage means; and second-stage multiple-candidate determination means for determining whether the minimum of the absolute value average vector errors calculated by the absolute value average vector error calculation means falls within the range between the second threshold θmin and the third threshold θmax, accepting the candidate with the minimum absolute value average vector error if it does and rejecting the candidate if it does not.
[0027]
In such a speaker identification system of the present invention, the actions and effects obtained by the above-described speaker identification method of the present invention can be obtained as they are, thereby achieving the object.
[0028]
It is desirable that the speaker identification system described above further include second and third threshold automatic updating means for automatically updating the second threshold θmin and the third threshold θmax. When the process of determining which of the plurality of speakers uttered the speech to be identified results in the person who input the speech being accepted as one of the speakers, the second and third threshold automatic updating means re-creates the first codebook for the matching speaker from the combination of the input data of the speech to be identified and the sample data, and updates the second threshold θmin and the third threshold θmax for that speaker using the re-created first codebook, the plurality of second- and third-threshold-setting test data for that speaker, and the input data of the speech to be identified.
[0029]
When such second and third threshold automatic updating means is provided, the settings of the second threshold θmin and the third threshold θmax can be changed gradually and automatically to follow changes in the speech characteristics of each registered speaker caused by changes in physical condition, health, and the like.
[0030]
Furthermore, the present invention is a program for causing a computer to function as a speaker identification system that determines which of a plurality of pre-registered speakers uttered an input speech to be identified, the system comprising: sample data feature parameter creation means for creating speech feature parameters for each speaker using speech data acquired from each of the speakers as sample data for a first-stage determination process; first codebook creation means for creating a first codebook for each speaker by clustering using the feature parameters created by the sample data feature parameter creation means; first codebook storage means for storing the first codebook created for each speaker by the first codebook creation means; candidate data feature parameter creation means for creating speech feature parameters for each speaker using, as candidate data for a second-stage determination process performed after narrowing down to a plurality of candidates, speech data acquired from each of the speakers in an environment different from that of the sample data; second codebook creation means for creating a second codebook for each speaker by clustering using the feature parameters created by the candidate data feature parameter creation means; second codebook storage means for storing the second codebook created for each speaker by the second codebook creation means; input data feature parameter creation means for creating speech feature parameters using the input data of the speech to be identified when determining which of the speakers uttered it; speaker likelihood calculation means for calculating a speaker likelihood for each of the speakers using the feature parameters created by the input data feature parameter creation means and the first codebooks stored in the first codebook storage means; first-stage determination means for determining whether each difference between the minimum of the speaker likelihoods calculated by the speaker likelihood calculation means and each of the other speaker likelihoods is greater than (or greater than or equal to) a preset first threshold η; second-stage sole-candidate determination means that, when the first-stage determination means determines that every difference is greater than (or greater than or equal to) the first threshold η, takes the speaker with the minimum speaker likelihood as the sole candidate, determines whether the minimum speaker likelihood falls within the range between a second threshold θmin and a third threshold θmax preset for each speaker, and accepts the candidate if it does and rejects the candidate if it does not; absolute value average vector error calculation means that, when the first-stage determination means determines that at least one of the differences is less than or equal to (or less than) the first threshold η, takes the speakers whose speaker likelihoods gave such differences, together with the speaker with the minimum speaker likelihood, as a plurality of candidates, and calculates an absolute value average vector error for each of the candidates using the feature parameters created by the input data feature parameter creation means and the second codebooks stored in the second codebook storage means; and second-stage multiple-candidate determination means for determining whether the minimum of the absolute value average vector errors calculated by the absolute value average vector error calculation means falls within the range between the second threshold θmin and the third threshold θmax, accepting the candidate with the minimum absolute value average vector error if it does and rejecting the candidate if it does not.
[0031]
Note that the above-described program, or a part thereof, can be recorded on a recording medium such as a magneto-optical disk (MO), a compact disc read-only memory (CD-ROM), a CD recordable (CD-R), a CD rewritable (CD-RW), a digital versatile disc read-only memory (DVD-ROM), a DVD random access memory (DVD-RAM), a flexible disk (FD), magnetic tape, a hard disk, a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, or a random access memory (RAM), for storage, distribution, and the like. It can also be transmitted over a transmission medium such as a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), the Internet, an intranet, or an extranet, whether a wired network, a wireless communication network, or a combination of these, and it can be carried on a carrier wave. Furthermore, the program described above may be a part of another program, or may be recorded on a recording medium together with a separate program.
[0032]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 shows the overall configuration of a speaker identification system 10 of the present embodiment. FIG. 2 is an explanatory diagram of the flow of processing when the speaker identification system 10 determines which of a plurality of speakers uttered the speech to be identified. FIG. 3 is an explanatory diagram of the flow of processing when the first threshold η is initially set, and FIG. 4 is an explanatory diagram of the flow of processing when the second threshold θmin and the third threshold θmax are initially set. Furthermore, FIG. 5 is an explanatory diagram of the flow of processing when the second threshold θmin and the third threshold θmax are updated.
[0033]
In FIG. 1, the speaker identification system 10 is a system that determines which of a plurality of pre-registered speakers uttered the speech to be identified. It comprises processing means 20 for performing various processes relating to speaker identification, storage means 40 for storing various data necessary for speaker identification, voice input means 60 for inputting a speaker's speech, display means 70 for displaying the result of speaker identification, and output means 80 for outputting the result of speaker identification. The total number N of registered speakers is, for example, N = 100, and each speaker is assigned a speaker number n (n = 1, 2, 3, ..., N).
[0034]
In addition, a network may be interposed between the voice input means 60 and the processing means 20, or between the display means 70 or the output means 80 and the processing means 20. That is, the voice input means 60 may input the speech of a speaker at a remote location, and the display means 70 and the output means 80 may provide the identification result to a viewer or user at a remote location.
[0035]
Here, the network may take various forms, such as a LAN, a MAN, a WAN, the Internet, an intranet, an extranet, or a combination of these, and may be wired, wireless, or a mixture of the two. In short, any network will do as long as it can transmit information at a reasonable speed between a plurality of points, regardless of the distance between them.
[0036]
The processing means 20 comprises sample data feature parameter creation means 21, candidate data feature parameter creation means 22, input data feature parameter creation means 23, first codebook creation means 24, second codebook creation means 25, speaker likelihood calculation means 26, absolute value average vector error calculation means 27, first-stage determination means 28, second-stage sole-candidate determination means 29, second-stage multiple-candidate determination means 30, first threshold initial setting means 31, second and third threshold initial setting means 32, and second and third threshold automatic updating means 33.
[0037]
The storage means 40 comprises first codebook storage means 41, second codebook storage means 42, first threshold storage means 43, second threshold storage means 44, third threshold storage means 45, sample data storage means 46, candidate data storage means 47, identified voice input data storage means 48, first threshold setting test data storage means 49, and second and third threshold setting test data storage means 50.
[0038]
The sample data feature parameter creation means 21 extracts speech features from speech data acquired from each of the plurality (N) of speakers as sample data for the first-stage determination process, and creates speech feature parameters for each speaker. Since a 16th-order mel cepstrum is suitable as the feature parameter, this embodiment is described using a 16th-order mel cepstrum, but the feature parameter is not limited to this. The speech data acquired as sample data is, for example, 20 to 30 seconds long per speaker. If one frame is 10 milliseconds, 16th-order mel cepstra are therefore created from, for example, 2000 to 3000 frames.
[0039]
The candidate data feature parameter creation means 22 extracts speech features from speech data acquired from each of the plurality (N) of speakers in an environment different from that of the sample data, as candidate data for the second-stage determination process performed after narrowing down to a plurality of candidates, and creates speech feature parameters for each speaker. The feature parameter is a 16th-order mel cepstrum, as in the case of the sample data. The speech data acquired as candidate data is, for example, 5 to 10 seconds long per speaker. If one frame is 10 milliseconds, 16th-order mel cepstra are therefore created from, for example, 500 to 1000 frames. Further, t pieces of candidate data, each 5 to 10 seconds long, are prepared for each speaker (for example, t = 3 to 5). The candidate data is preferably shorter than the sample data.
[0040]
The input data feature parameter creation means 23 extracts speech features from the input data of the speech to be identified when determining which of the plurality of speakers uttered it, and creates speech feature parameters. The feature parameter is a 16th-order mel cepstrum, as in the case of the sample data and the candidate data. The speech data input for speaker identification is, for example, 5 to 10 seconds long per speaker. If one frame is 10 milliseconds, 16th-order mel cepstra are therefore created from, for example, 500 to 1000 frames. Each of these mel cepstra is a vector of 16 numerical values, and the vectors are denoted X_1, X_2, ..., X_k, ..., X_M. Here M is the number of frames, and X_k = (X_k(1), X_k(2), ..., X_k(16)) (k = 1, 2, ..., M). The input data of the speech to be identified is preferably shorter than the sample data.
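The frame counts quoted for the sample data, candidate data, and input data follow from dividing the recording length by the frame length. The small sketch below reproduces that arithmetic, assuming consecutive non-overlapping 10 ms frames, which is how the figures in the text read; the function name `frame_count` is hypothetical.

```python
def frame_count(duration_s, frame_ms=10):
    """Number of whole frames in a recording of `duration_s` seconds."""
    return duration_s * 1000 // frame_ms

# Sample data of 20 to 30 s gives 2000 to 3000 frames;
# candidate and input data of 5 to 10 s give 500 to 1000 frames.
sample_range = (frame_count(20), frame_count(30))
short_range = (frame_count(5), frame_count(10))
```

With 16 coefficients per frame, the input data for one identification therefore amounts to M vectors X_k with M between about 500 and 1000.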
[0041]
The first codebook creation means 24 creates a first codebook for each speaker by clustering using the feature parameters created by the sample data feature parameter creation means 21. The first codebook consists of centroid vectors C_i^n (n = 1, 2, ..., N; i = 1, 2, ..., L). Here N is the total number of registered speakers, for example N = 100, and L is the codebook size (the number of quantization points), for example L = 128 or L = 256. A first codebook of size L is thus prepared for each of the N speakers. Each centroid vector C_i^n consists of 16 numerical values: C_i^n = (C_i^n(1), C_i^n(2), ..., C_i^n(16)).
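The text specifies only that the codebook is obtained by clustering. As one possible illustration, the sketch below uses plain k-means with a simple deterministic initialization; the choice of k-means, the initialization, the iteration count, and the names `make_codebook` and `squared_distance` are all assumptions made for this example.

```python
def squared_distance(x, c):
    """Sum of squared component differences between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def make_codebook(frames, size, iterations=20):
    """Cluster feature vectors into `size` centroid vectors (k-means).

    frames: list of equal-length feature vectors (16 values each in the
            embodiment; any dimension works here).
    size:   codebook size L.
    """
    # Assumed initialization: evenly spaced frames serve as first centroids.
    centroids = [list(frames[i * len(frames) // size]) for i in range(size)]
    for _ in range(iterations):
        clusters = [[] for _ in range(size)]
        for x in frames:
            nearest = min(range(size),
                          key=lambda i: squared_distance(x, centroids[i]))
            clusters[nearest].append(x)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids
```

Calling `make_codebook(frames, 256)` on the 2000 to 3000 sample frames of one speaker would yield that speaker's L = 256 centroid vectors under these assumptions.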
[0042]
The second codebook creating means 25 creates second codebooks for each speaker by clustering the feature parameters created by the candidate data feature parameter creation means 22. The second codebook consists of the centroid vectors C_i^{n,v} (n = 1, 2, ..., N; i = 1, 2, ..., L; v = 1, 2, ..., t). Here, t is the number of second codebooks prepared for one candidate (speaker), for example, t = 3 to 5. Thus, t second codebooks of size L are prepared for each of the N speakers, that is, N × t codebooks in total. Each centroid vector C_i^{n,v} is composed of 16 numerical values: C_i^{n,v} = (C_i^{n,v}(1), C_i^{n,v}(2), ..., C_i^{n,v}(16)).
[0043]
The speaker likelihood calculation means 26 performs vector quantization (VQ) using the feature parameters X_k created by the input data feature parameter creation means 23 and the first codebooks C_i^n stored in the first codebook storage means 41, and calculates the speaker likelihood for each of the plurality of speakers. Specifically, processing based on the following formula (1) is performed.
[0044]
[Expression 1]
$$ L^{n} = \frac{1}{M} \sum_{k=1}^{M} \min_{1 \le i \le L} d(X_k, C_i^{n}) \qquad (1) $$
[0045]
In the above formula (1), L^n is the speaker likelihood, and the superscript n indicates the speaker number (n = 1, 2, ..., N), where N is the total number of registered speakers, for example, N = 100. M is the number of frames. d(X_k, C_i^n) is the distance between the vector X_k and the vector C_i^n, namely the sum of the squares of the differences between their components. With X_k = (X_k(1), X_k(2), ..., X_k(16)) and C_i^n = (C_i^n(1), C_i^n(2), ..., C_i^n(16)), it is as follows.
[0046]
$$ d(X_k, C_i^{n}) = (X_k(1) - C_i^{n}(1))^2 + (X_k(2) - C_i^{n}(2))^2 + \cdots + (X_k(16) - C_i^{n}(16))^2 $$
[0047]
The term min(1 ≦ i ≦ L)[d(X_k, C_i^n)] means the following. Since the codebook size is L (for example, L = 256), the first codebook of the nth speaker contains L centroid vectors, so d(X_k, C_i^n) is obtained for each of i = 1, 2, ..., L and the minimum of these values is taken.
[0048]
Obtaining min(1 ≦ i ≦ L)[d(X_k, C_i^n)] for all M frames gives M values (k = 1, 2, ..., M). Summing these M values and dividing by M gives their average. This average value is the speaker likelihood L^n for the nth speaker. The above calculation is performed for all N speakers to obtain L^1, L^2, ..., L^N.
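The calculation of formula (1) can be sketched as follows. This is a minimal illustration written from the description above, not the actual implementation; the names `squared_distance` and `speaker_likelihood` are assumptions, and vectors are plain Python lists.

```python
def squared_distance(x, c):
    """d(X_k, C_i^n): sum of squared component differences."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def speaker_likelihood(frames, codebook):
    """L^n of formula (1): for every frame X_k take the distance to the
    nearest centroid of the speaker's first codebook, then average the
    M minima over all frames."""
    total = sum(min(squared_distance(x, c) for c in codebook)
                for x in frames)
    return total / len(frames)


def all_speaker_likelihoods(frames, first_codebooks):
    """Evaluate L^n for every registered speaker.
    `first_codebooks` maps a speaker number n to that speaker's codebook."""
    return {n: speaker_likelihood(frames, cb)
            for n, cb in first_codebooks.items()}
```

Because the value is an average quantization distance, a smaller L^n corresponds to a closer match, which is why the later stages start from the minimum speaker likelihood.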
[0049]
The absolute value average vector error calculating means 27 uses the feature parameters X_k created by the input data feature parameter creation means 23 and the second codebooks C_i^{n,v} stored in the second codebook storage means 42 to calculate the absolute value average vector errors ε_{q,v}, ε_q, and ε_q min for each of the plurality of candidates (speakers) selected in the first-stage determination process. Specifically, processing based on the following formula (2) is performed.
[0050]
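[Expression 2]
$$ \varepsilon_{q,v} = \frac{1}{M} \sum_{k=1}^{M} \min_{1 \le i \le L} d'(X_k, C_i^{n,v}) \qquad (2) $$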
[0051]
In the above formula (2), ε_{q,v} is the absolute value average vector error for each of the plurality of candidates (speakers) selected in the first-stage determination process. The subscript q is the candidate number. The subscript v is the number (v = 1, 2, ..., t) assigned to the second codebooks: a plurality (t, for example, t = 3) of pieces of candidate data is prepared for each candidate (speaker), and the same number of second codebooks is created from them. M is the number of frames. Unlike d(X_k, C_i^n) in formula (1) for the speaker likelihood L^n, d'(X_k, C_i^{n,v}) means the sum of the absolute values of the differences between the components of the vector X_k and the vector C_i^{n,v}. With X_k = (X_k(1), X_k(2), ..., X_k(16)) and C_i^{n,v} = (C_i^{n,v}(1), C_i^{n,v}(2), ..., C_i^{n,v}(16)), it is as follows. The n in d'(X_k, C_i^{n,v}) is the speaker number corresponding to the candidate number q (the original speaker number of the speaker given the candidate number q in the first-stage determination process).
[0052]
$$ d'(X_k, C_i^{n,v}) = |X_k(1) - C_i^{n,v}(1)| + |X_k(2) - C_i^{n,v}(2)| + \cdots + |X_k(16) - C_i^{n,v}(16)| $$
[0053]
The term min(1 ≦ i ≦ L)[d'(X_k, C_i^{n,v})] means the following. Since the codebook size is L (for example, L = 256), the second codebook created from the vth candidate data of the nth speaker contains L centroid vectors, so d'(X_k, C_i^{n,v}) is obtained for each of i = 1, 2, ..., L and the minimum of these values is taken.
[0054]
Obtaining min(1 ≦ i ≦ L)[d'(X_k, C_i^{n,v})] for all M frames gives M values (k = 1, 2, ..., M). Summing these M values and dividing by M gives their average. This average value is the absolute value average vector error ε_{q,v} based on the vth candidate data of the nth speaker (the speaker corresponding to the candidate number q).
[0055]
Further, for the nth speaker (the speaker corresponding to the candidate number q), the average of the absolute value average vector errors ε_{q,v} based on each piece of candidate data (each second codebook) from v = 1 to v = t is denoted ε_q. That is, the absolute value average vector error ε_q for the nth speaker (the speaker corresponding to the candidate number q) is obtained by the following equation.
[0056]
$$ \varepsilon_q = (\varepsilon_{q,1} + \varepsilon_{q,2} + \cdots + \varepsilon_{q,t}) / t $$
[0057]
The above calculation is performed for all of the plurality of candidates (speakers) selected in the first-stage determination process to obtain the absolute value average vector error ε_q of every candidate. The minimum of these ε_q values is then found and denoted ε_q min.
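Formula (2) and the averaging over the t second codebooks can be sketched as below. This is an illustrative sketch only; the function names (`abs_distance`, `abs_mean_vector_error`, `candidate_error`, `best_candidate`) are assumptions, and vectors are plain Python lists.

```python
def abs_distance(x, c):
    """d'(X_k, C_i^{n,v}): sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(x, c))


def abs_mean_vector_error(frames, codebook):
    """epsilon_{q,v} of formula (2) for one second codebook."""
    total = sum(min(abs_distance(x, c) for c in codebook) for x in frames)
    return total / len(frames)


def candidate_error(frames, second_codebooks):
    """epsilon_q: mean of epsilon_{q,v} over the t second codebooks
    prepared for one candidate."""
    errors = [abs_mean_vector_error(frames, cb) for cb in second_codebooks]
    return sum(errors) / len(errors)


def best_candidate(frames, codebooks_by_candidate):
    """Return (candidate, epsilon_q min) over all selected candidates.
    `codebooks_by_candidate` maps a candidate to its list of t codebooks."""
    errors = {q: candidate_error(frames, cbs)
              for q, cbs in codebooks_by_candidate.items()}
    winner = min(errors, key=errors.get)
    return winner, errors[winner]
```

Only the candidates that survive the first stage need to be passed to `best_candidate`, so the cost of this step grows with the number of candidates rather than with N.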
[0058]
The first stage determination means 28 determines whether each difference min_q between the minimum speaker likelihood L^x among the N speaker likelihoods L^n (n = 1, 2, ..., N) calculated by the speaker likelihood calculation means 26 and each of the other speaker likelihoods is greater than the preset first threshold η.
[0059]
Here, min_q = L_q - L_1 (q = 2, 3, 4, ...). L_1 is the smallest of the N calculated speaker likelihoods L^n (n = 1, 2, ..., N), that is, L^x relabeled with the subscript 1, and L_2, L_3, L_4, ... are the second, third, fourth, ... smallest values relabeled with the subscripts 2, 3, 4, .... This relabeling is only for convenience of explanation. The calculation need not actually perform it, as long as a determination equivalent to the one described above results.
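The first-stage determination can be sketched without the relabeling, as the text notes. The following is a minimal sketch; the function name `first_stage` and the dictionary representation of the speaker likelihoods are assumptions for illustration.

```python
def first_stage(likelihoods, eta):
    """Select candidates from the speaker likelihoods.

    `likelihoods` maps a speaker number n to L^n. The speaker with the
    minimum likelihood is always a candidate. Every other speaker whose
    difference L^n - L^x is not greater than eta is added as well.
    A result of length 1 means a sole candidate.
    """
    best = min(likelihoods, key=likelihoods.get)
    close = [n for n, value in likelihoods.items()
             if n != best and value - likelihoods[best] <= eta]
    return [best] + close


candidates = first_stage({"a": 1.0, "b": 1.2, "c": 3.0}, eta=0.5)
```

With these illustrative values, speakers "a" and "b" are both retained because their likelihoods differ by less than η, while "c" is excluded.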
[0060]
When every difference min_q is determined to be greater than the first threshold η, the second stage unique candidate determination means 29 takes the speaker with the minimum speaker likelihood L^x (speaker number x) as the only candidate and determines whether L^x falls within the range between the second threshold θmin^x and the third threshold θmax^x set for the speaker with speaker number x. When L^x is determined to fall within the range between θmin^x and θmax^x, the only candidate, the speaker with speaker number x, is accepted as matching the person who input the identified speech. When L^x is determined not to fall within that range, the candidate is rejected.
[0061]
The second stage multiple candidate determination means 30 determines whether the smallest absolute value average vector error ε_q min among the absolute value average vector errors ε_q calculated by the absolute value average vector error calculating means 27 falls within the range between the second threshold θmin^y and the third threshold θmax^y set for the candidate of ε_q min (whose speaker number is y). When ε_q min is determined to fall within the range between θmin^y and θmax^y, the candidate of ε_q min (the speaker with speaker number y) is accepted as matching the person who input the identified speech. When ε_q min is determined not to fall within that range, the candidate is rejected.
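The two second-stage decisions can be sketched as below. This is a hedged illustration: the source does not state whether the range between θmin and θmax includes its end points, so the inclusive comparison is an assumption, as are the function names and the use of `None` to represent rejection.

```python
def in_range(value, theta_min, theta_max):
    """Range check. Inclusive bounds are an assumption of this sketch."""
    return theta_min <= value <= theta_max


def decide_sole(speaker, likelihood, thresholds):
    """Sole candidate: accept if L^x lies between that speaker's thresholds.
    `thresholds` maps a speaker to (theta_min, theta_max)."""
    theta_min, theta_max = thresholds[speaker]
    return speaker if in_range(likelihood, theta_min, theta_max) else None


def decide_among(errors, thresholds):
    """Multiple candidates: pick the candidate with the smallest epsilon_q
    and accept it only if that error lies between its own thresholds."""
    best = min(errors, key=errors.get)
    theta_min, theta_max = thresholds[best]
    return best if in_range(errors[best], theta_min, theta_max) else None
```

Both functions return the accepted speaker, or `None` when the input is rejected as not matching any registered speaker.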
[0062]
The first threshold initial setting means 31 performs an initial setting process of the first threshold η (see FIG. 3).
[0063]
The second and third threshold initial setting means 32 performs the initial setting of the second threshold θmin^n and the third threshold θmax^n (n = 1, 2, ..., N) (see FIG. 4).
[0064]
The second and third threshold automatic updating means 33 automatically updates the second threshold θmin^n and the third threshold θmax^n (n = 1, 2, ..., N) (see FIG. 5).
[0065]
The first codebook storage means 41 stores the first codebooks C_i^n. One first codebook of size L (for example, L = 256) is prepared for each of the N registered speakers.
[0066]
The second codebook storage means 42 stores the second codebooks C_i^{n,v}. A plurality (t, for example, t = 3) of second codebooks of size L (for example, L = 256) is prepared for each of the N registered speakers.
[0067]
The first threshold storage unit 43 stores the first threshold η. Only one first threshold value η is prepared in the system 10 and is not prepared individually for each registered speaker.
[0068]
The second threshold storage means 44 stores the second thresholds θmin^n. A second threshold θmin^n is prepared individually for each registered speaker (speaker numbers n = 1, 2, ..., N).
[0069]
The third threshold storage means 45 stores the third thresholds θmax^n. A third threshold θmax^n is prepared individually for each registered speaker (speaker numbers n = 1, 2, ..., N).
[0070]
The sample data storage means 46 stores voice data as sample data (for example, 20 to 30 seconds per speaker), saved in a file format such as WAV. One piece of sample data is prepared for each of the N registered speakers.
[0071]
The candidate data storage means 47 stores voice data as candidate data (for example, 5 to 10 seconds per speaker), saved in a file format such as WAV. A plurality (t, for example, t = 3) of pieces of candidate data is prepared for each of the N registered speakers.
[0072]
The identified voice input data storage means 48 stores the input data of the identified speech (for example, 5 to 10 seconds), saved in a file format such as WAV. The stored input data is used for the learning function of the system 10 and thus for speaker identification processing from the next time onward (see FIG. 5).
[0073]
The first threshold setting test data storage means 49 stores voice data as first threshold setting test data (for example, 5 to 10 seconds) used for the initial setting of the first threshold η, saved in a file format such as WAV. The first threshold setting test data is voice data of any one of the N registered speakers.
[0074]
The second and third threshold setting test data storage means 50 stores voice data as second and third threshold setting test data (for example, 5 to 10 seconds) used for the initial setting of the second threshold θmin^n and the third threshold θmax^n, saved in a file format such as WAV. The second and third threshold setting test data is voice data input a plurality of times (H times, for example, H = 10) by each of the N registered speakers.
[0075]
Each of the means 21 to 33 constituting the processing means 20 is realized by a central processing unit (CPU) provided inside a computer main body (not only a personal computer but also higher-end models) and by one or a plurality of programs that define the operation procedure of the CPU.
[0076]
The processing means 20 is not limited to one realized by one computer or one CPU, and may be realized by performing distributed processing with a plurality of computers or the like.
[0077]
Further, each means 41 to 50 constituting the storage means 40 is preferably realized by, for example, a hard disk or the like. However, as long as there is no problem in storage capacity, access speed, etc., ROM, EEPROM, flash memory, RAM, MO, CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, FD, magnetic tape, or a combination thereof may be employed.
[0078]
As the voice input means 60, various voice recording microphones and the like can be employed.
[0079]
As the display means 70, for example, a liquid crystal display, a CRT display, an organic EL (electroluminescence) display, an ECL (electrochemiluminescence) display, a projector and a screen, or a combination thereof can be employed.
[0080]
As the output means 80, a printer, a plotter, or a combination thereof can be employed.
[0081]
In this embodiment, speaker identification processing for the identified speech is performed by the speaker identification system 10 as follows.
[0082]
In FIG. 2, first, using voice input means 60, for example, voice data of 20 to 30 seconds per speaker is acquired as sample data from all N (for example, N = 100) registered speakers. These acquired sample data are stored and saved in the sample data storage means 46 (step S1).
[0083]
Subsequently, using the sample data of each speaker, an acoustic feature amount is extracted by the sample data feature parameter creation means 21 to create a feature parameter for each speaker's voice (step S2). The characteristic parameter is a 16th order mel cepstrum.
[0084]
Further, the first codebook creating means 24 performs clustering using the obtained feature parameters of each speaker to create a first codebook C_i^n for each of the N speakers (n = 1, 2, ..., N) (step S3), and the created first codebooks C_i^n are stored and saved in the first codebook storage means 41 (step S4). Since the codebook size is L, the first codebook C_i^n for one speaker (the nth speaker) consists of L (for example, L = 256) codebook vectors.
[0085]
Next, using the voice input means 60, a plurality (t, for example, t = 3) of pieces of voice data of, for example, 5 to 10 seconds each is acquired as candidate data from each of the N (for example, N = 100) registered speakers, and the acquired candidate data is stored and saved in the candidate data storage means 47 (step S11).
[0086]
Next, using the candidate data of each speaker, the candidate data feature parameter creation means 22 extracts acoustic feature amounts and creates feature parameters for each speaker's voice (step S12). The feature parameter is a 16th-order mel cepstrum.
[0087]
Further, the second codebook creating means 25 performs clustering using the obtained feature parameters of each speaker to create a plurality (t) of second codebooks C_i^{n,v} for each of the N speakers (n = 1, 2, ..., N) (step S13), and the created second codebooks C_i^{n,v} are stored and saved in the second codebook storage means 42 (step S14). A plurality (t) of second codebooks C_i^{n,v} is prepared for one speaker (the nth speaker), and since each of these second codebooks (v = 1, 2, ..., t) has a codebook size of L, each consists of L (for example, L = 256) codebook vectors.
[0088]
After the above preprocessing, the first threshold η, the second threshold θmin^n, and the third threshold θmax^n are initially set. The details are described later (steps S41 to S44 in FIG. 3 and steps S51 to S54 in FIG. 4).
[0089]
Then, after pre-processing and initial setting of each threshold value, when the speaker identification system 10 is activated and speaker identification is actually performed, the following processing is performed.
[0090]
In FIG. 2, first, the voice to be identified is input by using the voice input means 60 and is taken into the speaker identification system 10 (step S21). At this time, the input data to be captured is audio data of 5 to 10 seconds, for example.
[0091]
Subsequently, using the input data of the identified speech, the input data feature parameter creation means 23 extracts acoustic feature amounts and creates feature parameters for the identified speech (step S22). Since the feature parameter is a 16th-order mel cepstrum and the number of frames is M, M vectors X_1, X_2, ..., X_M are created.
[0092]
Furthermore, using the M feature parameter vectors X_1, X_2, ..., X_M thus obtained and the first codebooks C_i^n stored and saved in the first codebook storage means 41, the speaker likelihood calculation means 26 performs vector quantization (VQ) and calculates the speaker likelihood L^n for each speaker (n = 1, 2, ..., N) (step S23).
[0093]
Next, among the calculated speaker likelihoods L^n (the superscript n is the speaker number, n = 1, 2, ..., N), the minimum value L^x (the speaker likelihood for the speaker with speaker number x) is relabeled L_1 using the subscript 1. Likewise, the second smallest of the calculated speaker likelihoods L^n is relabeled L_2, and the third, fourth, ... smallest values are relabeled L_3, L_4, ....
[0094]
Then, the first stage determination means 28 determines whether each difference min_q between the minimum speaker likelihood L^x (= L_1) and the other speaker likelihoods L^n (= L_2, L_3, L_4, ...) is greater than the preset first threshold η (step S24). That is, with min_q = L_q - L_1 (q = 2, 3, 4, ...), each min_q is evaluated against the following condition.
[0095]
$$ \mathrm{min}_q > \eta $$
[0096]
Here, if all min_q are determined to be greater than the first threshold η, then L_2 - L_1, L_3 - L_1, L_4 - L_1, ... are all greater than η, which means that the values of L_2, L_3, L_4, ... are well separated from the value of L_1. The minimum speaker likelihood L^x (= L_1) is therefore clearly distinguished from the other speaker likelihoods L^n (= L_2, L_3, L_4, ...). In this case, the speaker with speaker likelihood L^x (= L_1) (speaker number x) is taken as the only candidate, and the process proceeds to the next second-stage determination process.
[0097]
Subsequently, after the speaker with speaker likelihood L^x has been made the only candidate, the second stage unique candidate determination means 29 determines whether this only candidate matches the person who input the identified speech (step S25). In this determination, it is determined whether the speaker likelihood L^x falls within the range between the second threshold θmin^x and the third threshold θmax^x set in advance for the speaker with speaker number x.
[0098]
When the speaker likelihood L^x is determined to fall within the range between the second threshold θmin^x and the third threshold θmax^x, the speaker with speaker likelihood L^x is accepted as matching the person who input the identified speech (step S26). When the speaker likelihood L^x is determined not to fall within that range, the speaker with speaker likelihood L^x is rejected as not matching the person who input the identified speech (step S27).
[0099]
On the other hand, if it is determined in step S24 that at least one of the differences min_q is less than or equal to the first threshold η, then among L_2 - L_1, L_3 - L_1, L_4 - L_1, ... there is a difference equal to or less than η, which means that a speaker likelihood close to L_1 exists among L_2, L_3, L_4, .... In this case, the speakers whose speaker likelihoods were determined to be close to the minimum speaker likelihood L^x (= L_1), together with the speaker having the minimum speaker likelihood L^x, are taken as the plurality of candidates, and the process proceeds to the next second-stage determination process. For example, when L_2 - L_1 and L_3 - L_1 are determined to be less than or equal to η and L_4 - L_1, L_5 - L_1, L_6 - L_1, ... are determined to be greater than η, L_2 and L_3 are close to L_1, so the three speakers with speaker likelihoods L_1, L_2, and L_3 become the plurality of candidates. The number of candidates therefore depends on the result of the determination min_q > η and may be as many as all N registered speakers. In the following, the speaker likelihoods L_q (q = 2, 3, ...) determined to satisfy min_q ≦ η in step S24, together with L_1, are written L_q (q = 1, 2, ...) with the subscript q indicating the candidate number. In the above example, L_1, L_2, and L_3 are the L_q.
[0100]
Next, with the speaker of L_1 and the speakers whose speaker likelihoods are close to L_1 as the plurality of candidates, the absolute value average vector error for each of these candidates is calculated by the absolute value average vector error calculating means 27 (step S28).
[0101]
In the processing by the absolute value average vector error calculating means 27, the feature parameters X_k created by the input data feature parameter creation means 23 and the second codebooks C_i^{n,v} stored in the second codebook storage means 42 are first used to calculate the absolute value average vector errors ε_{q,v} for each of the plurality of candidates (speakers) selected in the first-stage determination process of step S24. At this time, a plurality (t, for example, t = 3) of absolute value average vector errors is calculated for each candidate. For the candidate with candidate number q, ε_{q,1}, ε_{q,2}, ..., ε_{q,t} are calculated, and the average of these t absolute value average vector errors is obtained and denoted ε_q. The ε_q values are obtained for all candidates, and their minimum is denoted ε_q min.
[0102]
Thereafter, the second stage multiple candidate determination means 30 determines whether the smallest absolute value average vector error ε_q min falls within the range between the second threshold θmin^y and the third threshold θmax^y set in advance for the candidate of ε_q min (whose speaker number is y) (step S29).
[0103]
When the smallest absolute value average vector error ε_q min is determined to fall within the range between the second threshold θmin^y and the third threshold θmax^y, the speaker with speaker likelihood L^y (= L_q) who is the candidate of ε_q min is accepted as matching the person who input the identified speech (step S30). When ε_q min is determined not to fall within that range, the speaker with speaker likelihood L^y (= L_q) who is the candidate of ε_q min is rejected as not matching the person who input the identified speech (step S31).
[0104]
Note that the result of speaker identification by the above processing may be appropriately displayed on the screen by the display means 70, printed by the output means 80, or used for personal authentication processing of various systems.
[0105]
Next, a process when the first threshold initial setting unit 31 performs the initial setting of the first threshold η will be described in detail.
[0106]
In FIG. 3, using the voice input means 60, voice data of any one of the N registered speakers (the speaker with speaker number P) is captured as the first threshold setting test data, and the captured voice data is stored and saved in the first threshold setting test data storage means 49 (step S41). The voice data captured at this time is, for example, 5 to 10 seconds long.
[0107]
Subsequently, using the first threshold setting test data, the input data feature parameter creation means 23 extracts acoustic feature amounts and creates the feature parameters X_k (k = 1, 2, ..., M) for the voice (step S42). The feature parameter X_k is a 16th-order mel cepstrum.
[0108]
Furthermore, using the obtained feature parameters X_k and the first codebooks C_i^n stored in the first codebook storage means 41, the speaker likelihood calculation means 26 performs vector quantization (VQ) and calculates the speaker likelihood L^n (n = 1, 2, ..., N) for each of the N registered speakers (step S43). Then, the first threshold η is determined from the speaker likelihood L^P for that one speaker (speaker number P) and the speaker likelihoods L^n for the N speakers, with reference to the speaker likelihood L^P, and is stored and set in the first threshold storage means 43 (step S44).
[0109]
Next, the processing performed when the second and third threshold initial setting means 32 initially sets the second threshold θmin^n and the third threshold θmax^n will be described in detail.
[0110]
In FIG. 4, using the voice input means 60, a plurality (H, for example, H = 10) of pieces of voice data is captured from each of the N registered speakers as second and third threshold setting test data, and the captured voice data is stored and saved in the second and third threshold setting test data storage means 50 (step S51). Each piece of voice data captured at this time is, for example, 5 to 10 seconds long.
[0111]
Subsequently, using the second and third threshold setting test data, the input data feature parameter creation means 23 extracts acoustic feature amounts and creates a plurality (H) of sets of feature parameters X_k (k = 1, 2, ..., M) for each of the N speakers (step S52). The feature parameter X_k is a 16th-order mel cepstrum.
[0112]
Furthermore, using the obtained feature parameters X_k and the first codebooks C_i^n stored in the first codebook storage means 41, the speaker likelihood calculation means 26 performs vector quantization (VQ) and calculates a plurality (H) of speaker likelihoods L^{n,h} (h = 1, 2, ..., H) for each of the N registered speakers (n = 1, 2, ..., N) (step S53).
[0113]
Then, for the speaker with registration number n, the minimum of the H speaker likelihoods L^{n,h} (h = 1, 2, ..., H) is taken as the second threshold θmin^n for that speaker, and the maximum is taken as the third threshold θmax^n for that speaker. θmin^n and θmax^n are determined in this way for all N registered speakers (step S54). The determined θmin^n and θmax^n are stored and set in the second threshold storage means 44 and the third threshold storage means 45, respectively.
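Step S54 reduces to taking the minimum and maximum of each speaker's H test likelihoods. A minimal sketch follows, assuming the likelihoods L^{n,h} have already been calculated; the function name `initial_thresholds` and the numeric values are illustrative.

```python
def initial_thresholds(likelihoods_by_speaker):
    """Derive (theta_min^n, theta_max^n) for every registered speaker.

    `likelihoods_by_speaker` maps a speaker number n to the list of the
    H speaker likelihoods L^{n,h} obtained from that speaker's own
    threshold setting test data.
    """
    return {n: (min(values), max(values))
            for n, values in likelihoods_by_speaker.items()}


# Illustrative values only (H = 3 here for brevity).
thresholds = initial_thresholds({1: [0.9, 1.4, 1.1], 2: [2.0, 1.7, 2.3]})
```

Each speaker thus receives an individual acceptance range that spans the likelihoods observed for that speaker's own test utterances.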
[0114]
Next, the processing performed when the second and third threshold automatic updating means 33 automatically updates the second threshold θmin^n and the third threshold θmax^n will be described in detail.
[0115]
In FIG. 5, when the speaker of L^x (the speaker with speaker number x) is accepted in step S26 of FIG. 2, or the speaker of L^y (the speaker with speaker number y) is accepted in step S30 of FIG. 2, the sample data (for example, 20 to 30 seconds) of the accepted speaker is read from the sample data storage means 46 (step S61), and the input data (for example, 5 to 10 seconds) of the identified speech input during the identification process is read from the identified voice input data storage means 48 (step S62).
[0116]
Subsequently, using the combination of the read sample data and the input data of the identified speech, the sample data feature parameter creation means 21 extracts acoustic feature amounts and creates feature parameters for the speech (step S63). The feature parameter is a 16th-order mel cepstrum.
[0117]
Further, the first codebook creating means 24 performs clustering using the obtained feature parameters to re-create the first codebook C_i^n (n = x or y) for the accepted speaker (speaker number x or y) (step S64), and the re-created first codebook C_i^n is stored in the first codebook storage means 41 to update the data (step S65).
[0118]
Thereafter, the plurality (H, for example, H = 10) of pieces of second and third threshold setting test data (for example, 5 to 10 seconds each) prepared for the accepted speaker is read from the second and third threshold setting test data storage means 50 (step S66).
[0119]
Subsequently, using the read second and third threshold setting test data and the input data of the identified speech, the input data feature parameter creation means 23 extracts acoustic feature amounts and creates (H + 1) sets of feature parameters X_k (k = 1, 2, ..., M) for the accepted speaker (step S67). The feature parameter X_k is a 16th-order mel cepstrum. For example, when H = 10 pieces of second and third threshold setting test data were initially prepared for the accepted speaker, adding the newly acquired input data of the identified speech as learning data gives (H + 1) = 11 sets.
[0120]
Furthermore, using the obtained feature parameters X_k and the updated first codebook C_i^n (n = x or y) stored in the first codebook storage means 41, the speaker likelihood calculation means 26 performs vector quantization (VQ) and calculates (H + 1) speaker likelihoods L^{n,h} (h = 1, 2, ..., H, H + 1) for the accepted speaker (speaker number x or y) (step S68).
[0121]
Then, for the accepted speaker (speaker number n = x or y), the minimum of the (H + 1) speaker likelihoods L^{n,h} (h = 1, 2, ..., H, H + 1) is taken as the updated second threshold θmin^n for that speaker, and the maximum is taken as the updated third threshold θmax^n for that speaker. The updated θmin^n and θmax^n are stored in the second threshold storage means 44 and the third threshold storage means 45, respectively, to update the settings (step S69).
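Steps S66 to S69 can be sketched as below. This is a hedged illustration: it assumes the first codebook passed in has already been re-created in steps S63 to S65, and the function names and the one-dimensional toy vectors are assumptions made only to keep the example short.

```python
def speaker_likelihood(frames, codebook):
    """Formula (1): mean over frames of the squared distance to the
    nearest codebook vector."""
    def squared_distance(x, c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    total = sum(min(squared_distance(x, c) for c in codebook)
                for x in frames)
    return total / len(frames)


def updated_thresholds(test_utterances, accepted_utterance, codebook):
    """Recompute (theta_min, theta_max) for an accepted speaker.

    The H stored test utterances plus the newly accepted input give
    H + 1 likelihoods against the updated codebook. Their minimum and
    maximum become the new thresholds.
    """
    utterances = list(test_utterances) + [accepted_utterance]
    values = [speaker_likelihood(u, codebook) for u in utterances]
    return min(values), max(values)


# Toy data: each utterance is a list of frames, each frame a 1-D vector.
codebook = [[0.0], [10.0]]
stored = [[[1.0]], [[2.0]]]
new_range = updated_thresholds(stored, [[3.0]], codebook)
```

Because every likelihood is recomputed against the updated codebook, the thresholds can move in either direction as the accepted speaker's voice changes over time.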
[0122]
According to this embodiment, there are the following effects. That is, when performing speaker identification, the determination is performed in two stages, and the determination in the second stage is simplified according to the result of the determination by the first stage determination means 28 (step S24 in FIG. 2). Since it is divided into a case (step S25) and a case where the process is more complicated (steps S28 and S29), it is necessary for speaker identification as compared with a case where determination processing of uniform contents is performed for all. Processing time can be shortened and real-time responsiveness can be improved.
[0123]
In the determination process by the first stage determination means 28, the minimum speaker likelihood L^xIf it is determined that there is at least one speaker likelihood with a value close to, the minimum speaker likelihood L^xSince a speaker having a speaker likelihood with a value close to that may match the person who input the identified speech, in that case, the absolute average vector error ε_{q, v}Or ε_q, Ε_qSince the second stage determination process (step S29) is performed after the min calculation process (step S28 in FIG. 2), the real-time response is improved as described above, and at the same time the identification accuracy is improved. be able to.
[0124]
Further, first threshold initial setting means 31 is provided, and as described with reference to FIG. 3, the first threshold initial setting means 31 provides the speaker's own likelihood L of the first threshold setting test data.^PAnd speaker likelihood L^PIs set as the first threshold value η, so that an appropriate value can be set as the threshold value used in the first-stage determination process.
[0125]
Then, second / third threshold value initial setting means 32 is provided. As described with reference to FIG. 4, the second / third threshold value initial setting means 32 sets N persons as test data for second / third threshold value setting. Since the second threshold value θmin and the third threshold value θmax are set using a plurality of (H) pieces of speech data respectively acquired from each of the speakers, it is appropriate as a threshold value used for the determination process in the second stage. A value can be set.
[0126]
In addition, since the second and third threshold automatic updating means 33 is provided, a learning function can be added to the system 10, and each registered N person's story due to changes in physical condition, health, etc. When the characteristics of the person's voice change, the setting of the second threshold value θmin and the third threshold value θmax can be gradually changed following the change.
[0127]
Furthermore, the input data (for example, 5 to 10 seconds) of the identified voice is shorter than the sample data (for example, 20 to 30 seconds) for the determination process in the first stage. The time required can be shortened, and real-time responsiveness can be improved.
[0128]
Further, the absolute value average vector error calculating means 27 uses a plurality (t) of second codebooks C_i^{n,v} created for each of the N speakers to calculate a plurality (t) of absolute value average vector errors ε_{q,v} for each candidate, and then calculates the mean value ε_q of these absolute value average vector errors ε_{q,v}. Therefore, the accuracy of speaker identification can be further improved.
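The averaging step can be sketched in Python as follows. The per-codebook errors ε_{q,v} are treated as precomputed numbers; the function names `mean_errors` and `best_candidate` and the dictionary layout are hypothetical.

```python
from statistics import fmean


def mean_errors(errors_per_candidate):
    """Average the t per-codebook errors of each candidate (epsilon_q)."""
    return {q: fmean(values) for q, values in errors_per_candidate.items()}


def best_candidate(errors_per_candidate):
    """Return (candidate, epsilon_qmin): the candidate with the smallest mean error."""
    means = mean_errors(errors_per_candidate)
    q = min(means, key=means.get)
    return q, means[q]
```

Averaging over several second codebooks reduces the influence of any single recording condition on the candidate ranking, which is consistent with the accuracy benefit stated above.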
[0129]
Since the 16th-order mel cepstrum is used as the feature parameter, the system 10 can be realized with the trade-off between shorter processing time (reduced computation) and sufficient identification accuracy taken into account.
[0130]
In addition, since spectral features (in this embodiment, the 16th-order mel cepstrum) are extracted from the voice data, speaker identification can be performed without being influenced by the utterance content.
[0131]
Furthermore, by adjusting the first threshold value η, the second threshold value θmin, and the third threshold value θmax, it is possible to deal with various differences in conditions, such as differences in language (for example, Chinese and Japanese), differences in application (for example, mobile phones and toys), and differences in recording environment and recording equipment.
[0132]
In order to confirm the effect of the present invention, the following system evaluation experiment was conducted. Experimental data and speech analysis conditions are as follows.
[0133]
[Experiment data]
(1) Total number of registered speakers: 100
(2) Test data for evaluation: 100 speakers x (3 times / speaker)
(3) Age and sex: 20 to 50 years old, 50 men and 50 women
(4) Recording environment: ordinary room with general background noise
(5) Utterance content: free speech in conference presentations
(6) Speaking time: sample data is 20 to 30 seconds, and evaluation test data is 5 to 10 seconds.
[0134]
[Voice analysis conditions]
(1) Sampling frequency: 16 kHz
(2) Analysis window: 10 ms
(3) Analysis period: 21.3 ms
(4) Acoustic features: 16th order mel cepstrum
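For reference, the durations listed above can be converted to sample counts at the stated 16 kHz sampling frequency. The helper below is a hypothetical illustration, and rounding to the nearest sample is an assumption; the source does not state how fractional samples are handled.

```python
def ms_to_samples(duration_ms, sample_rate_hz=16000):
    """Convert a duration in milliseconds to a whole number of samples."""
    # Rounding to the nearest sample is an assumption made for illustration.
    return round(sample_rate_hz * duration_ms / 1000)
```

Under these assumptions, 10 ms corresponds to 160 samples and 21.3 ms to about 341 samples.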
[0135]
Table 1 shows the system evaluation results when the test data for evaluation is 5 to 10 seconds, divided into the case where the sample data is 10 to 15 seconds and the case where the sample data is 20 to 30 seconds.
[0136]
[Table 1]

[0137]
According to Table 1, the effectiveness of the system is clear when the test data for evaluation (that is, the input data of the identified voice) is 5 to 10 seconds and the sample data is 20 to 30 seconds. That is, in an evaluation experiment in which speaker identification was actually performed with 100 registered speakers, when the absolute value average vector error calculation was not performed (when it was determined that min_q > η for all min_q), the average accuracy rate was 99.33%, and it was confirmed that the misrecognition rate and the false rejection rate were suppressed to 0.1% or less. In addition, even when the absolute value average vector error was calculated (when it was determined that min_q ≦ η for at least one min_q), the average accuracy rate was as high as 98.75%. Thus, the effects of the present invention were clearly demonstrated.
[0138]
Note that the present invention is not limited to the above-described embodiment, and modifications and the like within a scope where the object of the present invention can be achieved are included in the present invention.
[0139]
That is, in the above embodiment, the second and third threshold automatic updating means 33 is provided so that the settings of the second threshold θmin and the third threshold θmax can be changed gradually, but the second and third threshold automatic updating means 33 may be omitted. However, it is preferable to provide the second and third threshold automatic updating means 33 as in the above-described embodiment, because a learning function can then be added to the system 10.
[0140]
Further, the automatic update processing of the second threshold value θmin and the third threshold value θmax by the second and third threshold automatic updating means 33 is performed only when the speaker identification by the system 10 results in acceptance. It can therefore cope only with cases where the voice characteristics of a registered speaker change gradually. Accordingly, when the voice characteristics of a registered speaker change drastically, it is preferable to acquire the sample data again.
[0141]
[Effect of the invention]
As described above, according to the present invention, when performing speaker identification, the determination is performed in two stages, and the determination in the second stage is a simple process according to the determination result in the first stage. Therefore, it is possible to perform speaker identification with good real-time responsiveness and high accuracy.
[Brief description of the drawings]
FIG. 1 is an overall configuration diagram of a speaker identification system according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram of a flow of processing when the speaker identification system according to the embodiment determines which of a plurality of speakers the identified voice belongs to.
FIG. 3 is an explanatory diagram of a flow of processing when initial setting of a first threshold value η is performed in the embodiment.
FIG. 4 is an explanatory diagram of a processing flow when performing the initial setting of the second threshold value θmin and the third threshold value θmax in the embodiment.
FIG. 5 is an explanatory diagram of a process flow when updating the second threshold value θmin and the third threshold value θmax in the embodiment.
[Explanation of symbols]
10 Speaker identification system
21 Sample data feature parameter creation means
22 Feature parameter creation means for candidate data
23 Feature parameter creation means for input data
24 First codebook creation means
25 Second codebook creation means
26 Speaker likelihood calculation means
27 Absolute value average vector error calculating means
28 First stage determination means
29 Second stage only candidate determination means
30 Second stage multiple candidate determination means
33 Second and third threshold automatic update means
41 First codebook storage means
42 Second codebook storage means
X_k  Feature parameters
C_i^n  First codebook
C_i^{n,v}  Second codebook
L^n, L^{n,h}  Speaker likelihood
L^x  Minimum speaker likelihood
min_q  Difference between the minimum speaker likelihood and another speaker likelihood
η  First threshold
θmin^n  Second threshold
θmax^n  Third threshold
ε_{q,v}, ε_q  Absolute value average vector error
ε_qmin  Minimum absolute value average vector error

Claims

A speaker identification method for determining which of a plurality of speakers registered in advance an input identified voice belongs to, wherein:
Speech feature parameters are created for each speaker using voice data acquired from each of the plurality of speakers as sample data for the first-stage determination process, clustering is then performed using these feature parameters to create a first codebook for each speaker, and these first codebooks are stored in first codebook storage means;
Speech feature parameters are created for each speaker using voice data acquired from each of the plurality of speakers in an environment different from that of the sample data as candidate data for the second-stage determination process after narrowing down to a plurality of candidates, clustering is then performed using these feature parameters to create a second codebook for each speaker, and these second codebooks are stored in second codebook storage means;
When determining which of the plurality of speakers the identified voice belongs to,
Speech feature parameters are created using the input data of the identified voice, and speaker likelihood calculation means then calculates a speaker likelihood for each of the plurality of speakers using these feature parameters and the first codebooks stored in the first codebook storage means;
First stage determination means determines whether each difference between the minimum speaker likelihood among these speaker likelihoods and each of the other speaker likelihoods is greater than, or is greater than or equal to, a preset first threshold η;
When all of the differences are determined to be greater than, or greater than or equal to, the first threshold η,
The speaker having the minimum speaker likelihood is taken as the sole candidate, and second stage only candidate determination means determines whether the minimum speaker likelihood falls within the range between a second threshold θmin and a third threshold θmax set in advance for each speaker, accepts the candidate when it is determined to fall within the range between these thresholds θmin and θmax, and rejects the candidate when it is determined not to fall within the range;
On the other hand, when at least one of the differences is determined to be less than or equal to, or less than, the first threshold η,
The speakers whose speaker likelihoods yield the differences so determined, together with the speaker having the minimum speaker likelihood, are taken as a plurality of candidates, absolute value average vector error calculation means calculates an absolute value average vector error for each of the plurality of candidates using the feature parameters and the second codebooks stored in the second codebook storage means, and second stage multiple candidate determination means then determines whether the minimum of these absolute value average vector errors falls within the range between the second threshold θmin and the third threshold θmax, accepts the candidate having the minimum absolute value average vector error when it is determined to fall within the range between these thresholds θmin and θmax, and rejects the candidate when it is determined not to fall within the range.

The speaker identification method according to claim 1, wherein:
When setting the first threshold η,
Speech feature parameters are created using voice data acquired from any one of the plurality of speakers as first threshold setting test data, the speaker likelihood calculation means then calculates a speaker likelihood for each of the plurality of speakers using these feature parameters and the first codebooks stored in the first codebook storage means, and the difference between the speaker likelihood for that one speaker and the speaker likelihood whose value is closest to it is set as the first threshold η.

The speaker identification method according to claim 1 or 2, wherein:
When setting the second threshold θmin and the third threshold θmax,
A plurality of speech feature parameters are created for each speaker using voice data acquired from each of the plurality of speakers as second and third threshold setting test data, the speaker likelihood calculation means then calculates a plurality of speaker likelihoods for each speaker using these feature parameters and the first codebooks stored in the first codebook storage means, the minimum value of the plurality of speaker likelihoods calculated for each speaker is set as the second threshold θmin for that speaker, and the maximum value of the plurality of speaker likelihoods calculated for each speaker is set as the third threshold θmax for that speaker.

The speaker identification method according to claim 3, wherein:
When, as a result of the process of determining which of the plurality of speakers the identified voice belongs to, the person who input the identified voice is accepted as being one of the plurality of speakers,
The first codebook for the speaker determined to match the person who input the identified voice is re-created by combining the input data of the identified voice with the sample data, and
The second threshold θmin and the third threshold θmax for the speaker determined to match the person who input the identified voice are updated using the re-created first codebook, the plurality of second and third threshold setting test data for that speaker, and the input data of the identified voice.

The speaker identification method according to any one of claims 1 to 4, wherein
The input data of the identified voice is shorter than the sample data for the first-stage determination process.

The speaker identification method according to claim 5, wherein
The input data of the identified voice is 5 to 10 seconds of data, and the sample data for the first-stage determination process is 20 to 30 seconds of data.

The speaker identification method according to any one of claims 1 to 6, wherein:
A plurality of the second codebooks are created for each speaker, and
When calculating the absolute value average vector error for each of the plurality of candidates using the second codebooks, a plurality of absolute value average vector errors are calculated for each candidate, and the average value of these absolute value average vector errors is then calculated.

8. The speaker identification method according to claim 1, wherein the feature parameter is a mel cepstrum.

A speaker identification system for determining which of a plurality of speakers registered in advance an input identified voice belongs to, comprising:
Sample data feature parameter creation means for creating speech feature parameters for each speaker using voice data acquired from each of the plurality of speakers as sample data for the first-stage determination process;
First codebook creation means for creating a first codebook for each speaker by performing clustering using the feature parameters created by the sample data feature parameter creation means;
First codebook storage means for storing the first codebook created for each speaker by the first codebook creation means;
Feature parameter creation means for candidate data for creating speech feature parameters for each speaker using voice data acquired from each of the plurality of speakers in an environment different from that of the sample data as candidate data for the second-stage determination process after narrowing down to a plurality of candidates;
Second codebook creation means for creating a second codebook for each speaker by performing clustering using the feature parameters created by the feature parameter creation means for candidate data;
Second codebook storage means for storing the second codebook created for each speaker by the second codebook creation means;
Feature parameter creation means for input data for creating speech feature parameters using the input data of the identified voice when determining which of the plurality of speakers the identified voice belongs to;
Speaker likelihood calculation means for calculating a speaker likelihood for each of the plurality of speakers using the feature parameters created by the feature parameter creation means for input data and the first codebooks stored in the first codebook storage means;
First stage determination means for determining whether each difference between the minimum speaker likelihood among the speaker likelihoods calculated by the speaker likelihood calculation means and each of the other speaker likelihoods is greater than, or is greater than or equal to, a preset first threshold η;
Second stage only candidate determination means for, when the first stage determination means determines that all of the differences are greater than, or greater than or equal to, the first threshold η, taking the speaker having the minimum speaker likelihood as the sole candidate, determining whether the minimum speaker likelihood falls within the range between a second threshold θmin and a third threshold θmax set in advance for each speaker, accepting the candidate when it is determined to fall within the range between these thresholds θmin and θmax, and rejecting the candidate when it is determined not to fall within the range;
Absolute value average vector error calculation means for, when the first stage determination means determines that at least one of the differences is less than or equal to, or less than, the first threshold η, taking as a plurality of candidates the speakers whose speaker likelihoods yield the differences so determined together with the speaker having the minimum speaker likelihood, and calculating an absolute value average vector error for each of the plurality of candidates using the feature parameters created by the feature parameter creation means for input data and the second codebooks stored in the second codebook storage means; and
Second stage multiple candidate determination means for determining whether the minimum absolute value average vector error among the absolute value average vector errors calculated by the absolute value average vector error calculation means falls within the range between the second threshold θmin and the third threshold θmax, accepting the candidate having the minimum absolute value average vector error when it is determined to fall within the range between these thresholds θmin and θmax, and rejecting the candidate when it is determined not to fall within the range.

The speaker identification system according to claim 9, further comprising:
Second and third threshold automatic update means for automatically updating the settings of the second threshold θmin and the third threshold θmax,
Wherein the second and third threshold automatic update means is configured so that, when the person who input the identified voice is accepted as being one of the plurality of speakers as a result of the process of determining which of the plurality of speakers the identified voice belongs to, it re-creates the first codebook for the speaker determined to match the person who input the identified voice by combining the input data of the identified voice with the sample data, and updates the settings of the second threshold θmin and the third threshold θmax for that speaker using the re-created first codebook, the plurality of second and third threshold setting test data for that speaker, and the input data of the identified voice.

A program for causing a computer to function as a speaker identification system that determines which of a plurality of speakers registered in advance an input identified voice belongs to, the system comprising:
Sample data feature parameter creation means for creating speech feature parameters for each speaker using voice data acquired from each of the plurality of speakers as sample data for the first-stage determination process;
First codebook creation means for creating a first codebook for each speaker by performing clustering using the feature parameters created by the sample data feature parameter creation means;
First codebook storage means for storing the first codebook created for each speaker by the first codebook creation means;
Feature parameter creation means for candidate data for creating speech feature parameters for each speaker using voice data acquired from each of the plurality of speakers in an environment different from that of the sample data as candidate data for the second-stage determination process after narrowing down to a plurality of candidates;
Second codebook creation means for creating a second codebook for each speaker by performing clustering using the feature parameters created by the feature parameter creation means for candidate data;
Second codebook storage means for storing the second codebook created for each speaker by the second codebook creation means;
Feature parameter creation means for input data for creating speech feature parameters using the input data of the identified voice when determining which of the plurality of speakers the identified voice belongs to;
Speaker likelihood calculation means for calculating a speaker likelihood for each of the plurality of speakers using the feature parameters created by the feature parameter creation means for input data and the first codebooks stored in the first codebook storage means;
First stage determination means for determining whether each difference between the minimum speaker likelihood among the speaker likelihoods calculated by the speaker likelihood calculation means and each of the other speaker likelihoods is greater than, or is greater than or equal to, a preset first threshold η;
Second stage only candidate determination means for, when the first stage determination means determines that all of the differences are greater than, or greater than or equal to, the first threshold η, taking the speaker having the minimum speaker likelihood as the sole candidate, determining whether the minimum speaker likelihood falls within the range between a second threshold θmin and a third threshold θmax set in advance for each speaker, accepting the candidate when it is determined to fall within the range between these thresholds θmin and θmax, and rejecting the candidate when it is determined not to fall within the range;
Absolute value average vector error calculation means for, when the first stage determination means determines that at least one of the differences is less than or equal to, or less than, the first threshold η, taking as a plurality of candidates the speakers whose speaker likelihoods yield the differences so determined together with the speaker having the minimum speaker likelihood, and calculating an absolute value average vector error for each of the plurality of candidates using the feature parameters created by the feature parameter creation means for input data and the second codebooks stored in the second codebook storage means; and
Second stage multiple candidate determination means for determining whether the minimum absolute value average vector error among the absolute value average vector errors calculated by the absolute value average vector error calculation means falls within the range between the second threshold θmin and the third threshold θmax, accepting the candidate having the minimum absolute value average vector error when it is determined to fall within the range between these thresholds θmin and θmax, and rejecting the candidate when it is determined not to fall within the range.