JP4734771B2

JP4734771B2 - Information extraction apparatus and method

Info

Publication number: JP4734771B2
Application number: JP2001177569A
Authority: JP
Inventors: 戸栗康裕 (Yasuhiro Toguri); 西口正之 (Masayuki Nishiguchi)
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2001-06-12
Filing date: 2001-06-12
Publication date: 2011-07-27
Anticipated expiration: 2021-06-12
Also published as: JP2002366176A

Description

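[0001]
[Technical Field of the Invention]
The present invention relates to an information extraction apparatus and method, and in particular to an information extraction apparatus and method for searching audio data or audio-visual data.
[0002]
[Prior Art]
With the spread of multimedia in recent years, there is a growing need to manage large amounts of AV (Audio Visual) data efficiently and to classify, search, and extract it. For example, it has become necessary to search a large amount of AV data for scenes in which a certain person appears or speaks, or to extract and play back only the conversation scenes of a certain person.
[0003]
Conventionally, to locate the positions on the time axis at which a specific speaker is talking in such AV data, a person had to watch and listen to the AV data directly and find those positions and sections by hand.
[0004]
Meanwhile, automatic speaker identification and verification have been studied as techniques for identifying the speaker of a voice. An outline of this prior art follows. Speaker recognition comprises speaker identification and speaker verification. Speaker identification determines which of the pre-registered speakers produced the input speech; speaker verification compares the input speech with the data of a pre-registered speaker and determines whether the speaker is that person. Speaker recognition is also divided into a text-dependent type, in which the words (keywords) uttered at recognition time are fixed in advance, and a text-independent type, in which recognition is performed on arbitrary utterances.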
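[0005]
The following approach is commonly used for recognition. First, features representing the individuality of a speaker's speech signal are extracted and stored in advance as training data. At identification or verification time, the input speech is analyzed, the features representing its individuality are extracted, and their similarity to the training data is evaluated to identify or verify the speaker. The cepstrum is often used as such a feature. The cepstrum is the inverse Fourier transform of the log spectrum, and its low-order coefficients represent the envelope of the speech spectrum. The polynomial expansion coefficients of the cepstrum time series are called the delta cepstrum and are also often used as features representing temporal change in the speech spectrum. Pitch and delta pitch (the polynomial expansion coefficients of pitch) are sometimes used as well.
[0006]
Training data is created with features such as the LPC (Linear Predictive Coding) cepstrum extracted in this way as the reference pattern. The two representative methods are the one based on vector quantization distortion and the one based on the hidden Markov model (HMM).
[0007]
In the vector quantization distortion method, the features of each speaker are clustered in advance and the cluster centroids are stored as the elements (code vectors) of a codebook. The features of the input speech are then vector-quantized with each speaker's codebook, and the average quantization distortion of each codebook over the whole input speech is obtained.
[0008]
For speaker identification, the speaker whose codebook gives the smallest average quantization distortion is selected. For speaker verification, the average quantization distortion of the claimed speaker's codebook is compared with a threshold to decide whether the speaker is that person.
[0009]
The following describes in detail a conventional speaker recognition method that extracts the LPC cepstrum of speech as the feature and identifies the speaker from the vector quantization distortion of that feature.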
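[0010]
First, LPC analysis (linear predictive analysis) is applied to the input speech signal block by block to obtain the linear prediction coefficients (LPC coefficients). An analysis block length of about 20 to 30 milliseconds is generally used for speech. The sample x_t of the input signal is predicted from the past P samples as in equation (1). A linear prediction order P of about 10 to 20 is generally used.
[0011]
[Equation 1]

[0012]
The linear prediction coefficients a_i that minimize the linear prediction error ε = x'_t − x_t are then obtained by the least-squares method. The covariance method and the autocorrelation method are available for solving the least-squares problem. The autocorrelation method is widely used because the positive definiteness of its coefficient matrix is guaranteed, so a solution can always be found, and because the solution can be computed efficiently by Durbin's recursion. The generating (transfer) function of the all-pole speech model estimated from the P linear prediction coefficients is given by equation (2).
[0013]
[Equation 2]

[0014]
Since the cepstrum is the inverse Fourier transform of the log spectrum of speech, the cepstrum of the speech model obtained by LPC analysis is expressed by equation (3), where C(ω) is the Fourier transform of the cepstrum.
[0015]
[Equation 3]

[0016]
Generalizing the Fourier transform to the Z-transform gives equation (4).
[0017]
[Equation 4]

[0018]
The inverse Z-transform c_i of C(z) is called the complex cepstrum. A method is known for converting the LPC coefficients a_i directly into the complex cepstrum c_i: the complex cepstrum can be obtained sequentially from the recurrence relations of equations (5), (6), and (7). The c_n obtained from LPC analysis in this way is called the LPC cepstrum.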
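[0019]
[Equation 5]

[0020]
Next, vector quantization is applied to the features extracted as above (the LPC cepstrum and so on), and the speaker is identified from the quantization distortion. In essence, the feature vectors are vector-quantized with the codebooks of several speakers, and the codebook that minimizes the average quantization distortion is selected. The details are as follows.
[0021]
First, let the feature vector x_i of the i-th LPC analysis block, with P elements, be given by equation (8). The elements of the feature vector are, for example, the LPC cepstrum coefficients of orders 1 to P described above.
[0022]
[Equation 6]

[0023]
Let the j-th centroid (code vector) r_j^k of codebook CB_k be given by equation (9).
[0024]
[Equation 7]

[0025]
The weighted distance between the feature vector x_i and the centroid r_j^k is defined by equation (10).
[0026]
[Equation 8]

[0027]
The vector quantization distortion d_k(i) of the i-th block under codebook CB_k is obtained by equation (11).
[0028]
[Equation 9]

[0029]
The vector quantization distortion d_k(i) is obtained for every block, and the average quantization distortion D_k of codebook CB_k over all blocks (i = 1, 2, ..., L) of the speaker recognition section is obtained by equation (12).
[0030]
[Equation 10]

[0031]
The codebook CB_k' that minimizes the average quantization distortion D_k is found, and the speaker corresponding to that codebook is selected as the speaker of the speaker evaluation section.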
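[0032]
In the HMM method, on the other hand, the speaker's features obtained in the same way are represented by the transition probabilities between the states of a hidden Markov model (HMM) and the output probabilities of the features in each state, and the decision is based on the average likelihood against the model over the whole input speech section.
[0033]
For speaker identification that includes unspecified speakers who are not registered in advance, the decision combines the speaker identification and speaker verification described above. That is, the most similar speaker in the registered speaker set is chosen as the candidate, and the candidate's quantization distortion or likelihood is compared with a threshold to decide whether the speaker is that person.
[0034]
In speaker verification, or in speaker identification that includes unspecified speakers, the speaker's likelihood or quantization distortion is compared with a threshold to decide identity. However, because of long-term variation of the features, differences in the uttered text, noise, and so on, these values vary widely between the input data and the training data (model) even for the same speaker, so a threshold on their absolute value generally does not give a stable and sufficient recognition rate.
[0035]
For this reason, the likelihood is usually normalized in HMM-based speaker recognition. One method, for example, uses the log-likelihood ratio LR of equation (13) for the decision.
[0036]
[Equation 11]

[0037]
In equation (13), L(X/S_c) is the likelihood of the claimed speaker S_c (the person in question) for the input speech X, and L(X/S_r) is the likelihood of a speaker S_r other than S_c for the input speech X. In effect the threshold is set dynamically according to the likelihood of the input speech X, which makes the decision robust against differences in utterance content and long-term variation.
[0038]
Alternatively, methods that base the decision on the posterior probability of equation (14) have been studied. Here P(S_c) and P(S_r) are the probabilities of occurrence of speakers S_c and S_r, respectively.
[0039]
[Equation 12]

[0040]
These HMM-based likelihood normalization methods are described in detail in reference [4] below, among others.
[0041]
The likelihood used in the HMM method above can also be obtained from the Mahalanobis distance between the reference pattern of the features and the features extracted from the input data.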
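[0042]
Using the feature vector x of the input data X and the reference pattern vector r of the features, the likelihood L(X/S) of the input data X for speaker S is obtained by equation (15).
[0043]
[Equation 13]

[0044]
The feature vector x and the reference pattern vector r are given by equations (16) and (17), respectively.
[0045]
[Equation 14]

[0046]
In equation (15), P is the vector dimension and Σ is the covariance matrix of the feature data of speaker S. The term (x−r)^T Σ^−1 (x−r) is called the Mahalanobis distance. Equation (15) shows that the likelihood of the input data X can be computed once the covariance of the speaker's features has been obtained in advance. It also follows that the log-likelihood ratio LR between speakers S_c and S_r is expressed by the difference between the Mahalanobis distances of the two speakers. In other words, in the likelihood normalization for speaker verification described above, setting a threshold on the log-likelihood ratio is equivalent to setting a threshold on the difference of the Mahalanobis distances. Details are given in reference [5] below, among others.
[0047]
The prior art on speaker recognition is described in detail in, for example, the following references.
[1] Furui: "Speaker recognition by statistical features of the cepstrum" (in Japanese), Trans. IEICE, Vol. J65-A, No. 2, pp. 183-190 (1982)
[2] F. K. Soong and A. E. Rosenberg: "On the Use of Instantaneous and Transitional Spectral Information in Speaker Recognition", IEEE Trans. ASSP, Vol. 36, No. 6, pp. 871-879 (1988)
[3] Furui: "On the individuality of voice" (in Japanese), Journal of the Acoustical Society of Japan, Vol. 51, No. 11, pp. 876-881 (1995)
[4] Matsui: "Speaker recognition using HMM" (in Japanese), IEICE Technical Report, Vol. 95, No. 467 (SP95 109-116), pp. 17-24 (1996)
[5] The Digital Signal Processing Handbook, IEEE Press (CRC Press), 1998
Separately, the specification and drawings of Japanese Patent Application No. 2000-363547 propose a technique that applies speaker recognition to detect, in AV data, the continuous conversation sections of the same speaker and the speaker change positions. In that application, the audio signal of the AV data is classified into speaker groups for each short section (about 1 to 2 seconds), the change in the discrimination frequency of each speaker group is obtained within several consecutive recognition sections (several seconds to about 10 seconds), and the positions where the frequency rises above or falls below a threshold are detected as speaker change positions. The section between two change positions is detected as a continuous conversation section of that speaker.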
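[0048]
[Problems to Be Solved by the Invention]
Conventional speaker recognition has been researched and developed mainly for identifying and verifying a single speaker, as in security systems. It could not be applied to real, complex conversation scenes in which several speakers in one piece of audio data talk alternately at short intervals, occasionally talk at the same time, or are accompanied by background music or noise. Consequently, if the search for a speaker's conversation sections in AV data is automated with conventional speaker recognition, identification performance drops markedly.
[0049]
The technique of Japanese Patent Application No. 2000-363547 detects speaker changes from the change in discrimination frequency over frequency evaluation sections of several seconds to about 10 seconds, so the same speaker must be talking almost continuously throughout an evaluation section. It is applicable when one speaker talks alone and continuously for about 10 seconds, the length of the evaluation section, and the discrimination error rate is sufficiently low. When several speakers alternate within a short time, for example within a few seconds, when they often talk simultaneously, or when background noise or music increases recognition errors, the speaker change positions cannot be detected accurately and the conversation sections cannot be detected properly.
[0050]
Furthermore, Japanese Patent Application No. 2000-363547 identifies the speaker by assigning the input data to one of the speaker groups. This allows even unregistered, unknown speakers to be classified, but because no speaker verification is performed, classification errors occur easily, and even non-speech input is assigned to one of the speaker groups.
[0051]
Speaker verification is one way to avoid misidentifying unregistered speakers or non-speech input in such speaker detection in AV data. However, the conventional verification method based on likelihood normalization applies when the likelihood is obtained with an HMM and cannot be applied as it is to the simpler identification method based on vector quantization distortion.
[0052]
Obtaining the likelihood from the Mahalanobis distance to the reference pattern requires the speaker's covariance to be known, and the computation is very complex. Applying this approach to vector quantization distortion would likewise require complex computation, such as obtaining the covariance in advance, and is not practical.
[0053]
The present invention has been proposed in view of these circumstances, and its object is to provide an information extraction apparatus and method that detect a speaker's conversation sections in AV data automatically and efficiently and allow them to be searched efficiently.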
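[0054]
[Means for Solving the Problems]
To achieve the above object, an information extraction apparatus according to the present invention extracts desired information from a given information source and comprises: speaker identification means that, for an audio signal serving as the information source, identifies and verifies the speaker in each evaluation section on the basis of the similarity of the speech in the audio signal; and speaker discrimination frequency calculation means that obtains discrimination frequency information for the speaker in a frequency section of the information source, the frequency section being the section over which the frequency of the speakers determined for each evaluation section is obtained. As the identification technique, the apparatus uses vector quantization of the features with a plurality of codebooks in one-to-one correspondence with the registered speakers, and as the identification measure it uses the vector quantization distortion. The speaker that minimizes the vector quantization distortion is identified as the speaker candidate. The distortion difference ΔD is the value obtained by subtracting the candidate's quantization distortion from the sum or average of the vector quantization distortions of a plurality of speakers other than the candidate. If the candidate's quantization distortion D_0 is smaller than a preset maximum quantization distortion threshold D_max and the distortion difference ΔD is larger than a preset minimum distortion difference threshold ΔD_min, the candidate is determined to be the person in question, and the appearance frequency information of the speaker in the frequency section is detected.
[0055]
In this information extraction apparatus, the LPC cepstrum obtained by LPC analysis is used as the feature for evaluating the similarity of the speech in the audio signal of the information source.
[0057]
Such an information extraction apparatus identifies the speaker in each evaluation section on the basis of the features of the speaker's voice in the audio signal, detects the appearance frequency of the speaker in the frequency section, which is the section over which the frequency of the speakers determined for each evaluation section is obtained, and generates speaker appearance frequency information.
[0058]
Likewise, to achieve the above object, an information extraction method according to the present invention searches for given information from a given information source and comprises: a speaker identification step of identifying and verifying, for an audio signal serving as the information source, the speaker in each evaluation section on the basis of the similarity of the speech in the audio signal; and a speaker discrimination frequency calculation step of obtaining discrimination frequency information for the speaker in a frequency section of the information source, the frequency section being the section over which the frequency of the speakers determined for each evaluation section is obtained. As the identification technique, the method uses vector quantization of the features with a plurality of codebooks in one-to-one correspondence with the registered speakers, and as the identification measure it uses the vector quantization distortion. The speaker that minimizes the vector quantization distortion is identified as the speaker candidate. The distortion difference ΔD is the value obtained by subtracting the candidate's quantization distortion from the sum or average of the vector quantization distortions of a plurality of speakers other than the candidate. If the candidate's quantization distortion D_0 is smaller than a preset maximum quantization distortion threshold D_max and the distortion difference ΔD is larger than a preset minimum distortion difference threshold ΔD_min, the candidate is determined to be the person in question, and the appearance frequency information of the speaker in the frequency section is detected.
[0059]
In this information extraction method, the LPC cepstrum obtained by LPC analysis is used as the feature for evaluating the similarity of the speech in the audio signal of the information source.
[0061]
Such an information extraction method identifies the speaker in each evaluation section on the basis of the features of the speaker's voice in the audio signal, detects the appearance frequency of the speaker in the frequency section, which is the section over which the frequency of the speakers determined for each evaluation section is obtained, and generates speaker appearance frequency information.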
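[0062]
To achieve the above object, an information search apparatus according to the present invention searches for desired information on a recording medium on which speaker appearance frequency information for frequency sections has been recorded in advance. That information is obtained by determining, for an audio signal serving as the information source, the speaker in each evaluation section on the basis of the similarity of the speech in the audio signal, and by obtaining discrimination frequency information for the speaker in a frequency section, which is the section over which the frequency of the speakers determined for each evaluation section is obtained. The apparatus is characterized by comprising: speaker appearance frequency reading means that reads the speaker appearance frequency information recorded on the recording medium; search condition input means that inputs a search condition for a desired speaker; and speaker appearance section output means that compares the input search condition with the information read from the recording medium and outputs the information matching the search condition as speaker appearance section information.
[0063]
Here, when the speaker discrimination frequency information is obtained, the LPC cepstrum obtained by LPC analysis is used as the feature for evaluating the similarity of the speech in the audio signal of the information source, vector quantization of the features with a plurality of codebooks is used as the identification technique, and the vector quantization distortion is used as the identification measure. The decision is made by comparing two values, each with its own preset threshold: the minimum quantization distortion, which is the smallest of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of the other vector quantization distortions.
[0064]
Such an information search apparatus compares the appearance frequency information of the desired speaker with the input search condition and thereby finds, for example, the parts in which the desired speaker talks with the desired frequency.
[0065]
Likewise, to achieve the above object, an information search method according to the present invention searches for desired information on a recording medium on which speaker appearance frequency information for frequency sections has been recorded in advance. That information is obtained by determining, for an audio signal serving as the information source, the speaker in each evaluation section on the basis of the similarity of the speech in the audio signal, and by obtaining discrimination frequency information for the speaker in a frequency section, which is the section over which the frequency of the speakers determined for each evaluation section is obtained. The method is characterized by comprising: a speaker appearance frequency reading step of reading the speaker appearance frequency information recorded on the recording medium; a search condition input step of inputting a search condition for a desired speaker; and a speaker appearance section output step of comparing the input search condition with the information read from the recording medium and outputting the information matching the search condition as speaker appearance section information.
[0066]
Here, when the speaker discrimination frequency information is obtained, the LPC cepstrum obtained by LPC analysis is used as the feature for evaluating the similarity of the speech in the audio signal of the information source, vector quantization of the features with a plurality of codebooks is used as the identification technique, and the vector quantization distortion is used as the identification measure. The decision is made by comparing two values, each with its own preset threshold: the minimum quantization distortion, which is the smallest of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of the other vector quantization distortions.
[0067]
Such an information search method compares the appearance frequency information of the desired speaker with the input search condition and thereby finds, for example, the parts in which the desired speaker talks with the desired frequency.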
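[0068]
[Embodiments of the Invention]
Specific embodiments of the present invention are described in detail below with reference to the drawings.
[0069]
Fig. 1 shows the conceptual configuration of the information extraction apparatus of this embodiment. As shown in Fig. 1, the audio signal serving as the information source is input to speaker identification means 1, which evaluates the vector quantization distortion and identifies the speaker.
[0070]
The speaker identified by speaker identification means 1 is input to speaker discrimination frequency calculation means 2, which calculates, for each predetermined evaluation section, the discrimination frequency of each speaker recognized in the section. The resulting discrimination frequency is output as speaker appearance frequency information.
[0071]
Fig. 2 shows a concrete configuration example of the information extraction apparatus of Fig. 1. As shown in Fig. 2, information extraction apparatus 10 comprises input unit 11, which inputs the audio signal of AV (Audio Visual) data; cepstrum extraction unit 12, which analyzes the audio signal and extracts LPC (Linear Predictive Coding) cepstrum coefficients; vector quantization unit 13, which vector-quantizes the LPC cepstrum coefficients; speaker identification unit 14, which evaluates the vector quantization distortion and identifies the speaker; and speaker discrimination frequency calculation unit 15, which obtains the appearance frequency of each speaker from the discrimination frequency of the recognized speakers.
[0072]
In Fig. 2, codebook group CB stores the codebook data of each speaker used for vector quantization, and threshold table file TF stores the threshold data used for speaker verification; both are recorded in a recording unit (not shown). Speaker frequency file SF records the frequency of each speaker for each section.
[0073]
Information extraction apparatus 10 operates as follows. The audio signal D11 of the AV data input from input unit 11 is fed block by block to cepstrum extraction unit 12, where LPC analysis is applied and the resulting LPC coefficients are converted into LPC cepstrum coefficients.
[0074]
A subset D12 of the LPC cepstrum coefficients is input to vector quantization unit 13 and vector-quantized with the codebook data D13 of each speaker from codebook group CB. The result of vector quantization with each codebook (the quantization distortion) D14 is input to speaker identification unit 14 and evaluated, and the speaker is identified and verified for each predetermined recognition block using the threshold data D15 read from threshold table file TF.
[0075]
The identified speaker D16 is input to speaker discrimination frequency calculation unit 15, which calculates, for each predetermined evaluation section, the discrimination frequency of each speaker recognized in the section. The resulting discrimination frequency is recorded in speaker frequency file SF as speaker appearance frequency information D17, for each item of AV data, each speaker, and each evaluation section, in a record format such as that of Fig. 3. Speaker frequency file SF may be transmitted over a communication line by a transceiver unit (not shown), or stored on a storage medium such as a magnetic disk, a magneto-optical disk, or semiconductor memory.
[0076]
The record format of Fig. 3 contains the name of the AV data whose audio signal was input from input unit 11, an identifier for each registered speaker, the start time of the frequency section, the end time of the section, and the discrimination frequency of the speaker in that frequency section of the AV data. This format is an example, and the information is not limited to that shown in Fig. 3.
[0077]
The processing for identifying the speaker and obtaining the discrimination frequency is described in more detail below with reference to Figs. 2 and 4.
[0078]
In cepstrum extraction unit 12, LPC analysis is applied to the audio signal of the input AV data in units of the LPC analysis block AB shown in Fig. 4, and the resulting LPC coefficients are converted to extract the LPC cepstrum coefficients. For speech signals, the block length a of LPC analysis block AB is usually about 20 to 30 milliseconds. Adjacent blocks are often overlapped slightly to improve analysis performance.
[0079]
The speaker recognition block RB of Fig. 4 is the smallest unit in which a speaker is identified, and speaker identification is performed block by block. The block length b of speaker recognition block RB is preferably a few seconds, so one speaker recognition block RB contains from about 50 to several hundred LPC analysis blocks AB. Speaker recognition blocks RB may also overlap adjacent blocks slightly; the overlap is usually about 10% to 50% of the block length.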
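The relationship between LPC analysis blocks AB and a speaker recognition block RB can be checked with a little arithmetic. The sketch below is illustrative only: the 16 kHz sampling rate, the 25 ms and 3 s block lengths, and the function name `block_starts` are assumptions, not values taken from the specification.

```python
def block_starts(total_len, block_len, overlap_ratio=0.0):
    """Start offsets (in samples) of fixed-length blocks that fit in total_len.

    overlap_ratio is the fraction of a block shared with its successor
    (0.0 means no overlap, 0.5 means half-overlapping blocks).
    """
    hop = max(1, int(round(block_len * (1.0 - overlap_ratio))))
    return list(range(0, total_len - block_len + 1, hop))


# Assumed example values: 16 kHz audio, 25 ms analysis block, 3 s recognition block.
SAMPLE_RATE = 16000
ANALYSIS_LEN = SAMPLE_RATE * 25 // 1000   # 400 samples
RECOGNITION_LEN = SAMPLE_RATE * 3         # 48000 samples

blocks_no_overlap = block_starts(RECOGNITION_LEN, ANALYSIS_LEN)
blocks_half_overlap = block_starts(RECOGNITION_LEN, ANALYSIS_LEN, overlap_ratio=0.5)
```

Under these assumed values one recognition block holds 120 analysis blocks without overlap, which falls inside the "about 50 to several hundred" range given in the text.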
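[0080]
The frequency section FI of Fig. 4 is the evaluation unit over which speaker appearance frequency is obtained; within the section, the appearance frequency of each speaker is obtained from the discrimination frequency of the speakers identified in the individual speaker recognition blocks RB. Frequency section FI_I starts at time S_I and ends at time E_I, and a section length (E_I − S_I) of several minutes to several tens of minutes is appropriate. Frequency sections may also overlap adjacent sections slightly.
[0081]
Fig. 5 is a flowchart of the operation of information extraction apparatus 10. First, in step S10, the section number I is initialized to 0. The section number I is a serial number assigned to the frequency sections FI over which speaker frequency is obtained.
[0082]
Next, in step S11, a speaker candidate is identified and selected for the current speaker recognition block RB. The method of selecting the candidate is described in detail later.
[0083]
In step S12, the selected candidate is verified. When an unknown, unregistered speaker or non-speech data is input, step S11 still selects the registered speaker most similar to the input as the candidate, but that candidate is not necessarily the actual speaker. Step S12 therefore evaluates the vector quantization distortion and compares it with the threshold data recorded in threshold table file TF of Fig. 2 to decide whether the data really belongs to the selected candidate. The decision method is described in detail later. If the candidate is judged in step S12 to be the person in question, the candidate is confirmed as the speaker of this speaker recognition block; otherwise the speaker of this block is set to "unknown speaker".
[0084]
Step S13 checks whether the last speaker recognition block RB of frequency section FI_I has been processed. If not, processing advances to the next speaker recognition block RB in step S14 and returns to step S11. If the last speaker recognition block RB has been processed, processing proceeds to step S15.
[0085]
In step S15, the discrimination frequency of each registered speaker in the current frequency section FI_I is obtained as appearance frequency information. Speaker recognition blocks RB judged to be an unknown speaker are excluded from the frequency calculation. The resulting speaker appearance frequency is recorded in speaker frequency file SF of Fig. 2 in a record format such as that of Fig. 3.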
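Step S15 can be sketched as a simple tally over the recognition blocks of one frequency section. This is a minimal illustration, not the patented implementation: the text does not say whether the "frequency" is a raw count or a ratio, so a raw count is assumed here, and the field names of the records (modeled loosely on the items listed for Fig. 3) are hypothetical.

```python
from collections import Counter


def speaker_frequencies(av_name, start, end, block_speakers):
    """Build one record per registered speaker for a single frequency section.

    block_speakers holds the verified speaker of each recognition block in the
    section; None marks a block judged "unknown speaker", which is skipped.
    The frequency is a raw block count (an assumption, see the lead-in).
    """
    counts = Counter(s for s in block_speakers if s is not None)
    return [
        {"av_name": av_name, "speaker": spk, "start": start, "end": end, "frequency": n}
        for spk, n in sorted(counts.items())
    ]


records = speaker_frequencies("news.wav", 0.0, 600.0, ["A", "B", None, "A"])
```

Keeping the unknown blocks out of the tally follows the statement that such blocks are excluded from the frequency calculation.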
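[0086]
Step S16 checks whether the end of the data has been reached. If so, processing ends; if not, processing proceeds to step S17.
[0087]
In step S17, the section number I is incremented by one, processing moves to the next frequency section, and returns to step S11.
[0088]
Fig. 6 shows the details of the speaker candidate identification of step S11 in Fig. 5. First, in step S20, the audio data is read from the input data one LPC analysis block AB at a time.
[0089]
Step S21 checks whether the last LPC analysis block AB of the speaker recognition block RB has been processed. If it has, processing proceeds to step S26; if not, to step S22.
[0090]
In step S22, the data of the LPC analysis block AB is evaluated to decide whether the block is a speech block. If the block is judged to be a silent or non-speech block, its analysis is skipped and processing goes to step S25, moves on to the next LPC analysis block AB, and continues from step S20. The simplest way to decide whether a block is speech is to evaluate its mean power and maximum value and detect only whether it is silent. There are also various methods that decide whether the data is speech from the mean power of the signal, the zero-crossing count, the presence of pitch, the spectral shape, and so on. This embodiment does not restrict the method, and the step may be omitted.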
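The "simplest" check mentioned for step S22, based on mean power and maximum value, might look like the sketch below. The threshold values and the function name are illustrative assumptions; the specification gives no concrete numbers, and samples are assumed to be floats normalized to the range −1 to 1.

```python
def is_silent_block(samples, power_threshold=1e-4, peak_threshold=0.01):
    """Return True when a block looks silent.

    A block is treated as silent when both its mean power and its peak
    absolute amplitude fall below the (assumed) thresholds.
    """
    if not samples:
        return True
    mean_power = sum(s * s for s in samples) / len(samples)
    peak = max(abs(s) for s in samples)
    return mean_power < power_threshold and peak < peak_threshold
```

A block for which this returns True would be skipped, and processing would move on to the next LPC analysis block.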
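[0091]
If the block is judged in step S22 to be a speech block, step S23 applies LPC analysis to it and converts the resulting LPC coefficients to extract the LPC cepstrum coefficients. Here, low-order cepstrum coefficients of about orders 1 to 14 are extracted.
[0092]
Next, in step S24, the LPC cepstrum coefficients obtained in step S23 are vector-quantized with each of a plurality of codebooks prepared in advance. Each codebook corresponds one-to-one to a registered speaker. Let d_k be the vector quantization distortion of the LPC cepstrum coefficients of this block under codebook CB_k.
[0093]
In step S25, processing advances to the next LPC analysis block AB and returns to step S20, and steps S20 to S25 are repeated in the same way.
[0094]
In step S26, the average quantization distortion D_k, the mean of the quantization distortions d_k of each codebook CB over the whole speaker recognition block RB, is obtained.
[0095]
In step S27, the codebook CB_k' corresponding to the speaker S_k' that minimizes the average quantization distortion D_k is selected, and in step S28 this speaker S_k' is output as the speaker candidate S_c.
[0096]
In this way, among the speakers whose codebooks are registered, the one whose voice is most similar to the input data is selected as the speaker candidate S_c of the speaker recognition block RB.
[0097]
Fig. 7 shows the details of the verification of speaker candidate S_c in step S12 of Fig. 5. First, in step S30, let D_0 be the average quantization distortion of speaker candidate S_c. Next, in step S31, the average quantization distortions of the codebooks other than that of S_c are sorted in ascending order, and the n smallest are denoted D_1, D_2, ..., D_n (D_0 < D_1 < D_2 < ... < D_n). The value of n can be chosen freely.
[0098]
In step S32, the distortion difference ΔD is obtained as the evaluation measure from the quantization distortion D_0 of speaker candidate S_c and the n other quantization distortions, using equation (18) or equation (19).
[0099]
[Equation 15]

[0100]
In equations (18) and (19), when n is 1, for example, the result is the difference between D_0 and D_1, the next smallest quantization distortion after that of speaker candidate S_c.
[0101]
In step S33, the threshold data for speaker candidate S_c is read from threshold table file TF of Fig. 2.
[0102]
Threshold table file TF holds a record for each registered speaker, for example in the format of Fig. 8. As shown in Fig. 8, the speaker identifier of each registered speaker and the threshold data, namely the maximum absolute quantization distortion D_max and the minimum distortion difference ΔD_min, are recorded in advance.
[0103]
Returning to Fig. 7, in step S34 the threshold data D_max and ΔD_min are compared with the obtained D_0 and ΔD. If the absolute quantization distortion D_0 is smaller than the threshold D_max and the distortion difference ΔD is larger than the threshold ΔD_min, processing proceeds to step S35, the candidate is judged to be the person in question, and the candidate is confirmed. Otherwise processing proceeds to step S36, the speaker is judged to be unknown, and the candidate is rejected. Comparing both the average quantization distortion D_0 of speaker candidate S_c and the distortion difference ΔD with thresholds in this way reduces identification errors for the speech of registered speakers and allows speech from anyone other than the registered speakers to be judged as an unknown speaker.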
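The verification of steps S30 to S36 can be sketched as follows. Equations (18) and (19) are not reproduced in this text, so the sketch assumes the reading given in the claims: ΔD is the sum or the average of the n next-smallest distortions minus D_0 (for n = 1 both reduce to D_1 − D_0, as the text states). The function name and the layout of the threshold table are hypothetical.

```python
def verify_candidate(avg_distortions, thresholds, n=1, use_mean=True):
    """Return the verified speaker, or None for "unknown speaker".

    avg_distortions: dict speaker -> average quantization distortion D_k.
    thresholds: dict speaker -> (d_max, delta_d_min), as in a table like Fig. 8.
    """
    ranked = sorted(avg_distortions.items(), key=lambda kv: kv[1])
    candidate, d0 = ranked[0]                    # S30: candidate and its D_0
    others = [d for _, d in ranked[1:1 + n]]     # S31: n next-smallest distortions
    if not others:
        return None  # at least one competing codebook is needed to form a difference
    pooled = sum(others) / len(others) if use_mean else sum(others)
    delta_d = pooled - d0                        # S32: assumed form of eq. (18)/(19)
    d_max, delta_d_min = thresholds[candidate]   # S33: read the candidate's thresholds
    if d0 < d_max and delta_d > delta_d_min:     # S34: both conditions must hold
        return candidate                         # S35: accept
    return None                                  # S36: reject as unknown
```

For example, with distortions `{"A": 0.2, "B": 0.9, "C": 1.1}`, thresholds `(0.5, 0.3)` for speaker A, and n = 2, the candidate A is accepted because D_0 = 0.2 is below 0.5 and ΔD = 1.0 − 0.2 = 0.8 exceeds 0.3.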
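[0104]
As described above, the information extraction apparatus of this embodiment identifies the speaker for each speaker recognition block on the basis of the features of the speaker's voice in the audio signal of the AV data, detects the appearance frequency of the speakers in a predetermined section, and generates speaker appearance frequency information. Supplying this information over a communication line or on a recording medium to the information search apparatus described below allows that apparatus to search for desired information effectively.
[0105]
The information search apparatus of this embodiment is described next. Fig. 9 shows its conceptual configuration. As shown in Fig. 9, speaker appearance frequency reading means 3 reads the speaker appearance frequency information generated by the information extraction apparatus described above, and a search condition is input to search condition input means 4. Speaker appearance section output means 5 outputs the speaker appearance section information found on the basis of the appearance frequency information and the search condition.
[0106]
Fig. 10 shows a concrete configuration example of the information search apparatus of Fig. 9. As shown in Fig. 10, information search apparatus 20 comprises condition input unit 21, which inputs search conditions; data search unit 22, which searches for information according to the input conditions; and output unit 23, which outputs the search result. Speaker frequency file SF holds the speaker appearance frequency information in the record format of Fig. 3 described above.
[0107]
Information search apparatus 20 operates as follows. Condition input unit 21 inputs the search conditions for searching the AV data. Examples of search conditions are the name or identification number of the desired speaker, that speaker's conversation frequency, and the AV data to be searched.
[0108]
The input search condition D22 is passed to data search unit 22, which searches for matching information. Data search unit 22 refers to speaker frequency file SF, reads the speaker appearance frequency information D23, compares it with search condition D22, and supplies the search result D24 to output unit 23. Output unit 23 outputs the search result D24 as speaker appearance section information D25.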
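The comparison performed by data search unit 22 can be illustrated with a small filter over frequency records. This is a sketch under assumptions: the record fields mirror the items listed for Fig. 3 but their names are hypothetical, and "desired frequency" is interpreted as a minimum count, which the text does not specify.

```python
def search_sections(records, speaker, min_frequency, av_name=None):
    """Return (av_name, start, end) for sections matching the search condition.

    records: iterable of dicts with keys av_name, speaker, start, end, frequency.
    av_name: optional restriction to one item of AV data.
    """
    return [
        (r["av_name"], r["start"], r["end"])
        for r in records
        if r["speaker"] == speaker
        and r["frequency"] >= min_frequency
        and (av_name is None or r["av_name"] == av_name)
    ]


sample_records = [
    {"av_name": "news.wav", "speaker": "A", "start": 0.0, "end": 600.0, "frequency": 42},
    {"av_name": "news.wav", "speaker": "A", "start": 600.0, "end": 1200.0, "frequency": 3},
    {"av_name": "talk.wav", "speaker": "B", "start": 0.0, "end": 600.0, "frequency": 50},
]
hits = search_sections(sample_records, speaker="A", min_frequency=10)
```

Here `hits` contains only the first ten-minute section of `news.wav`, the one in which speaker A was identified often enough.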
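[0109]
Thus the information search apparatus of this embodiment compares the appearance frequency information of a desired speaker in the audio signal of AV data, received over a communication line or on a recording medium, with the input search condition, and can thereby effectively find, for example, the parts in which the desired speaker talks with the desired frequency.
[0110]
The present invention is not limited to the embodiment described above, and various modifications are of course possible without departing from the gist of the invention.
[0111]
[Effects of the Invention]
As described in detail above, the information extraction apparatus according to the present invention extracts desired information from a given information source and is characterized by comprising speaker identification means that, for an audio signal serving as the information source, determines the speaker in each evaluation section on the basis of the similarity of the speech in the audio signal, and speaker discrimination frequency calculation means that obtains discrimination frequency information for the speaker in a frequency section of the information source, the frequency section being the section over which the frequency of the speakers determined for each evaluation section is obtained, and by detecting the appearance frequency information of the speaker in the frequency section.
[0112]
In this information extraction apparatus, the LPC cepstrum obtained by LPC analysis is used as the feature for evaluating the similarity of the speech in the audio signal of the information source, vector quantization of the features with a plurality of codebooks is used as the identification technique, and the vector quantization distortion is used as the identification measure.
[0113]
In the information extraction apparatus, the decision is made by comparing two values, each with its own preset threshold: the minimum quantization distortion, which is the smallest of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of the other vector quantization distortions.
[0114]
With such an information extraction apparatus, the speaker can be identified in each evaluation section on the basis of the features of the speaker's voice in the audio signal, the appearance frequency of the speaker can be detected in the frequency section, which is the section over which the frequency of the speakers determined for each evaluation section is obtained, and speaker appearance frequency information can be generated. Supplying this appearance frequency information to an information search apparatus over a communication line, on a recording medium, or the like allows the information search apparatus to search for desired information effectively.
[0115]
The information extraction method according to the present invention searches for given information from a given information source and is characterized by comprising a speaker identification step of determining, for an audio signal serving as the information source, the speaker in each evaluation section on the basis of the similarity of the speech in the audio signal, and a speaker discrimination frequency calculation step of obtaining discrimination frequency information for the speaker in a frequency section of the information source, the frequency section being the section over which the frequency of the speakers determined for each evaluation section is obtained, and by detecting the appearance frequency information of the speaker in the frequency section.
[0116]
In this information extraction method, the LPC cepstrum obtained by LPC analysis is used as the feature for evaluating the similarity of the speech in the audio signal of the information source, vector quantization of the features with a plurality of codebooks is used as the identification technique, and the vector quantization distortion is used as the identification measure.
[0117]
In the information extraction method, the decision is made by comparing two values, each with its own preset threshold: the minimum quantization distortion, which is the smallest of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of the other vector quantization distortions.
[0118]
With such an information extraction method, the speaker can be identified in each evaluation section on the basis of the features of the speaker's voice in the audio signal, the appearance frequency of the speaker can be detected in the frequency section, which is the section over which the frequency of the speakers determined for each evaluation section is obtained, and speaker appearance frequency information can be generated. Supplying this appearance frequency information to an information search apparatus over a communication line, on a recording medium, or the like allows the information search apparatus to search for desired information effectively.
[0119]
The information search apparatus according to the present invention searches for desired information on a recording medium on which speaker appearance frequency information for frequency sections has been recorded in advance. That information is obtained by determining, for an audio signal serving as the information source, the speaker in each evaluation section on the basis of the similarity of the speech in the audio signal, and by obtaining discrimination frequency information for the speaker in a frequency section, which is the section over which the frequency of the speakers determined for each evaluation section is obtained. The apparatus is characterized by comprising: speaker appearance frequency reading means that reads the speaker appearance frequency information recorded on the recording medium; search condition input means that inputs a search condition for a desired speaker; and speaker appearance section output means that compares the input search condition with the information read from the recording medium and outputs the information matching the search condition as speaker appearance section information.
[0120]
Here, when the speaker appearance frequency information is obtained, the LPC cepstrum obtained by LPC analysis is used as the feature for evaluating the similarity of the speech in the audio signal of the information source, vector quantization of the features with a plurality of codebooks is used as the identification technique, and the vector quantization distortion is used as the identification measure. The decision is made by comparing two values, each with its own preset threshold: the minimum quantization distortion, which is the smallest of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of the other vector quantization distortions.
[0121]
With such an information search apparatus, comparing the appearance frequency information of the desired speaker with the input search condition makes it possible to find effectively, for example, the parts in which the desired speaker talks with the desired frequency.
[0122]
The information search method according to the present invention searches for desired information on a recording medium on which speaker appearance frequency information for frequency sections has been recorded in advance. That information is obtained by determining, for an audio signal serving as the information source, the speaker in each evaluation section on the basis of the similarity of the speech in the audio signal, and by obtaining discrimination frequency information for the speaker in a frequency section, which is the section over which the frequency of the speakers determined for each evaluation section is obtained. The method is characterized by comprising: a speaker appearance frequency reading step of reading the speaker appearance frequency information recorded on the recording medium; a search condition input step of inputting a search condition for a desired speaker; and a speaker appearance section output step of comparing the input search condition with the information read from the recording medium and outputting the information matching the search condition as speaker appearance section information.
[0123]
Here, when the speaker discrimination frequency information is obtained, the LPC cepstrum obtained by LPC analysis is used as the feature for evaluating the similarity of the speech in the audio signal of the information source, vector quantization of the features with a plurality of codebooks is used as the identification technique, and the vector quantization distortion is used as the identification measure. The decision is made by comparing two values, each with its own preset threshold: the minimum quantization distortion, which is the smallest of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of the other vector quantization distortions.
[0124]
With such an information search method, comparing the appearance frequency information of the desired speaker with the input search condition makes it possible to find effectively, for example, the parts in which the desired speaker talks with the desired frequency.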
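[Brief Description of the Drawings]
[Fig. 1] A diagram illustrating the conceptual configuration of the information extraction apparatus of this embodiment.
[Fig. 2] A diagram illustrating a configuration example of the information extraction apparatus.
[Fig. 3] A diagram illustrating an example of the record format of the speaker appearance frequency information in the information extraction apparatus.
[Fig. 4] A diagram illustrating the relationship among the frequency evaluation section, the speaker recognition blocks, and the LPC analysis blocks in the information extraction apparatus.
[Fig. 5] A flowchart illustrating the operation of the information extraction apparatus.
[Fig. 6] A flowchart illustrating the speaker identification processing for each speaker recognition block in the information extraction apparatus.
[Fig. 7] A flowchart illustrating the speaker verification processing in the information extraction apparatus.
[Fig. 8] A diagram illustrating an example of the record format of the threshold data for speaker verification in the information extraction apparatus.
[Fig. 9] A diagram illustrating the conceptual configuration of the information search apparatus of this embodiment.
[Fig. 10] A diagram illustrating the configuration of the information search apparatus.
[Description of Reference Signs]
1 speaker identification means, 2 speaker discrimination frequency calculation means, 3 speaker appearance frequency reading means, 4 search condition input means, 5 speaker appearance section output means, 10 information extraction apparatus, 11 input unit, 12 cepstrum extraction unit, 13 vector quantization unit, 14 speaker identification unit, 15 speaker discrimination frequency calculation unit, 20 information search apparatus, 21 condition input unit, 22 data search unit, 23 output unit
[0001]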
BACKGROUND OF THE INVENTION
The present invention relates to an information extraction apparatus and method, and in particular to an information extraction apparatus and method for searching audio data or audio-visual data.
[0002]
[Prior art]
With the spread of multimedia in recent years, there has been an increasing need to efficiently manage a large amount of AV (Audio Visual) data and perform classification, search, extraction, and the like. For example, it is necessary to search a scene of a certain character or a conversation scene of the person from a large amount of AV data, or extract only a conversation scene of a certain person from the AV data and reproduce it.
[0003]
Conventionally, when searching such AV data for the position on the time axis where a specific speaker is talking, a person had to watch and listen to the AV data directly and look for that position or section on the time axis.
[0004]
On the other hand, automatic speaker identification and verification have been studied as techniques for identifying the speaker of a voice. An outline of this prior art follows. First, speaker recognition includes speaker identification and speaker verification. Speaker identification determines which of the pre-registered speakers produced the input speech, and speaker verification compares the input speech with the data of a pre-registered speaker to determine whether the speaker is that person. In addition, speaker recognition includes a text-dependent type, in which the words (keywords) uttered at the time of recognition are predetermined, and a text-independent type, in which arbitrary words are uttered and recognized.
[0005]
As a general recognition technique, for example, the following is often used. First, a feature representing the individuality of a speaker's speech signal is extracted and recorded in advance as training data. At identification or verification time, the input speech is analyzed, the feature representing its individuality is extracted, and its similarity to the training data is evaluated, thereby identifying or verifying the speaker. Here, the cepstrum is often used as the feature representing the individuality of speech. The cepstrum is the inverse Fourier transform of the log spectrum, and the envelope of the speech spectrum can be expressed by its low-order coefficients. The polynomial expansion coefficients of the cepstrum time series are called the delta cepstrum, which is also often used as a feature representing temporal change in the speech spectrum. In addition, pitch and delta pitch (the polynomial expansion coefficients of pitch) are sometimes used.
[0006]
Training data is created using features such as the LPC (Linear Predictive Coding) cepstrum extracted in this way as the reference pattern. The typical methods are the one based on vector quantization distortion and the one based on the hidden Markov model (HMM).
[0007]
In the method based on vector quantization distortion, the features of each speaker are grouped in advance, and the centroid of each group is stored as an element (code vector) of a codebook. The features of the input speech are then vector-quantized with the codebook of each speaker, and the average quantization distortion of each codebook over the entire input speech is obtained.
[0008]
In the case of speaker identification, the speaker whose codebook gives the smallest average quantization distortion is selected. In the case of speaker verification, the average quantization distortion of the claimed speaker's codebook is compared with a threshold to determine whether the speaker is that person.
[0009]
As described above, in the conventional speaker recognition technique, a method of extracting a speech LPC cepstrum as a feature quantity and performing speaker identification using the vector quantization distortion of the feature quantity will be described in detail.
[0010]
First, the input speech signal is subjected to LPC analysis (linear predictive analysis) in units of blocks to obtain the linear prediction coefficients (LPC coefficients). An analysis block length of about 20 to 30 milliseconds is generally used for speech. The sample x_t of the input signal is predicted from the past P samples as shown in equation (1). In general, a linear prediction order P of about 10 to 20 is used.
[0011]
[Expression 1]

[0012]
The linear prediction coefficients a_i that minimize the linear prediction error ε = x'_t − x_t are then obtained by the method of least squares. The covariance method and the autocorrelation method are available for finding the least-squares solution. The autocorrelation method in particular is widely used because the positive definiteness of its coefficient matrix is guaranteed, so a solution can always be found, and because the solution can be obtained efficiently by Durbin's recursion. The generating (transfer) function of the all-pole speech model estimated from the P linear prediction coefficients is expressed by equation (2).
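The autocorrelation method with Durbin's recursion can be sketched as below. This is a minimal illustration rather than the patent's own formulation: equations (1) and (2) are not reproduced in this text, so the sign convention A(z) = 1 + Σ a_i z^-i is an assumption, as are the function names.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis block (unnormalized)."""
    n = len(x)
    return [sum(x[t] * x[t - lag] for t in range(lag, n)) for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations by Durbin's recursion.

    Returns (a, err): a[0] == 1.0 and a[1..order] are the coefficients of
    A(z) = 1 + sum_i a[i] z^-i (assumed sign convention); err is the final
    prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate block: keep the remaining coefficients at zero
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err                      # reflection coefficient of stage m
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return a, err
```

As a quick check, the autocorrelation sequence `[1.0, 0.5, 0.25]` of a first-order process yields a_1 = −0.5 and a_2 = 0 under this convention, with a residual energy of 0.75.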
[0013]
[Expression 2]

[0014]
Since the cepstrum is the inverse Fourier transform of the logarithmic spectrum of speech, the cepstrum of the speech model by LPC analysis is expressed by the following equation (3), where C (ω) is the Fourier transform of the cepstrum.
[0015]
[Equation 3]

[0016]
Here, when the Fourier transform is extended to the Z transform and generalized, it can be expressed as in equation (4).
[0017]
[Expression 4]

[0018]
The inverse Z-transform c_i of C(z) is called the complex cepstrum. A method is known for converting the LPC coefficients a_i directly into the complex cepstrum c_i. That is, the complex cepstrum can be obtained sequentially from the recurrence relations of equations (5), (6), and (7). The c_n obtained from LPC analysis in this way is specifically called the LPC cepstrum.
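A commonly used recursion from LPC coefficients to LPC cepstrum coefficients is sketched below. Equations (5) to (7) are not reproduced in this text, so this follows the textbook form for an all-pole model 1/A(z) with A(z) = 1 + Σ a_i z^-i; the sign convention, the omission of the gain term c_0, and the function name are assumptions.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients to cepstrum coefficients c_1..c_n_ceps.

    a = [a_1, ..., a_P] for A(z) = 1 + sum_i a_i z^-i (assumed convention).
    For n <= P:  c_n = -a_n - sum_{k=1}^{n-1} (k/n) c_k a_{n-k}
    For n >  P:  c_n =       - sum_{k=n-P}^{n-1} (k/n) c_k a_{n-k}
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] is unused in this sketch
    for n in range(1, n_ceps + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc -= (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

For a single pole, `lpc_to_cepstrum([-0.5], 3)`, the result matches the series expansion of −ln(1 − 0.5 z^-1), whose n-th coefficient is 0.5^n / n.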
[0019]
[Equation 5]

[0020]
Next, vector quantization is performed on the feature amount (LPC cepstrum or the like) extracted as described above, and a speaker is identified using the quantization distortion. Basically, the obtained feature vector is subjected to vector quantization using a codebook of a plurality of speakers, and a codebook that minimizes the average quantization distortion is selected. This will be described in detail below.
[0021]
First, let the feature vector x_i of the i-th LPC analysis block, with P elements, be given by equation (8). As elements of the feature vector, for example, the LPC cepstrum coefficients of orders 1 to P described above are used.
[0022]
[Equation 6]

[0023]
Let the j-th centroid (code vector) r_j^k of codebook CB_k be given by equation (9).
[0024]
[Equation 7]

[0025]
Here, the weighted distance between the feature vector x_i and the centroid r_j^k is defined by equation (10).
[0026]
[Equation 8]

[0027]
The vector quantization distortion d_k(i) of the i-th block under codebook CB_k is obtained by equation (11).
[0028]
[Equation 9]

[0029]
The vector quantization distortion d_k(i) is obtained for each block, and the average quantization distortion D_k of codebook CB_k over all blocks (i = 1, 2, ..., L) of the speaker recognition section is obtained by equation (12).
[0030]
[Equation 10]

[0031]
The codebook CB_k' that minimizes this average quantization distortion D_k is found, and the speaker corresponding to that codebook is selected as the speaker of the speaker evaluation section.
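The identification procedure of equations (8) to (12) can be sketched as follows. The equations themselves are not reproduced in this text, so the details are assumptions: the weighted distance is taken to be a weighted squared Euclidean distance, the per-block distortion is the distance to the nearest code vector, and all names are hypothetical.

```python
def weighted_distance(x, r, w=None):
    """Weighted squared Euclidean distance (assumed form of equation (10))."""
    w = w or [1.0] * len(x)
    return sum(wi * (xi - ri) ** 2 for wi, xi, ri in zip(w, x, r))


def block_distortion(x, codebook, w=None):
    """d_k(i): distance from feature vector x to its nearest code vector."""
    return min(weighted_distance(x, r, w) for r in codebook)


def average_distortions(features, codebooks, w=None):
    """D_k for every speaker: mean of d_k(i) over all blocks of the section."""
    return {
        spk: sum(block_distortion(x, cb, w) for x in features) / len(features)
        for spk, cb in codebooks.items()
    }


def identify(features, codebooks, w=None):
    """Return the speaker with the smallest D_k together with all D_k values."""
    distortions = average_distortions(features, codebooks, w)
    return min(distortions, key=distortions.get), distortions


codebooks = {"A": [[0.0, 0.0], [1.0, 1.0]], "B": [[5.0, 5.0]]}
speaker, distortions = identify([[0.1, 0.0], [0.9, 1.0]], codebooks)
```

In this toy example the two feature vectors lie close to speaker A's code vectors, so `speaker` is `"A"` and A's average distortion is about 0.01.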
[0032]
On the other hand, in the method using the HMM, the feature amount of the speaker obtained in the same manner as described above is expressed by the transition probability between states of the Hidden Markov Model (HMM) and the appearance probability of the feature amount in each state. Judgment is made based on the average likelihood with the model over the entire section.
[0033]
Further, in the case of speaker identification including an unspecified speaker that is not registered in advance, the determination is made by a method in which speaker identification and speaker verification described above are combined. That is, the most similar speaker is selected as a candidate from the registered speaker set, and the quantization distortion or likelihood of the candidate is compared with a threshold value to determine whether or not the speaker is the person.
[0034]
In speaker verification, or in speaker identification that includes unspecified speakers, the speaker's likelihood or quantization distortion is compared with a threshold to decide identity. However, because of long-term variation of the features, differences in the uttered text, noise, and so on, these values vary widely between the input data and the training data (model) even for the same speaker. In general, therefore, a stable and sufficient recognition rate cannot be obtained by setting a threshold on their absolute value.
[0035]
Therefore, normalization of likelihood is generally performed in speaker recognition in the HMM. For example, there is a method of using a log likelihood ratio LR as shown in the following equation (13) for determination.
[0036]
[Expression 11]

[0037]
In equation (13), L(X/S_c) is the likelihood of the claimed speaker S_c (the person in question) for the input speech X, and L(X/S_r) is the likelihood of a speaker S_r other than speaker S_c for the input speech X. That is, the threshold is in effect set dynamically according to the likelihood of the input speech X, which makes the decision robust against differences in utterance content and long-term variation.
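Equation (13) is not reproduced in this text. As a hedged illustration only, a basic form of the log-likelihood ratio that is consistent with the symbols defined above would be the following; the patent's actual equation may differ, for example by pooling several competing speakers S_r.

```latex
LR = \log L(X / S_c) - \log L(X / S_r)
```

Under this form, the claimed speaker would be accepted when LR exceeds a threshold, so the score of the competing speaker acts as a moving reference for the score of S_c.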
[0038]
Alternatively, methods that base the decision on the posterior probability shown in equation (14) have also been studied. Here, P(S_c) and P(S_r) are the probabilities of occurrence of speakers S_c and S_r, respectively.
[0039]
[Expression 12]

[0040]
The likelihood normalization method using these HMMs is described in detail in the literature [4] described later.
[0041]
On the other hand, there is a method in which the likelihood described in the above-described HMM method is obtained from the standard pattern of feature quantity and the Mahalanobis distance of feature quantity extracted from input data.
[0042]
Using the feature vector x of the input data X and the reference pattern vector r of the features, the likelihood L(X/S) of the input data X for speaker S is obtained as in equation (15).
[0043]
[Formula 13]

[0044]
Here, the feature quantity vector x and the feature quantity standard pattern vector r are given by the following equations (16) and (17), respectively.
[0045]
[Expression 14]

[0046]
In equation (15), P is the vector dimension and Σ is the covariance matrix of the feature data of speaker S. The term (x−r)^T Σ^−1 (x−r) is called the Mahalanobis distance. From equation (15), the likelihood of the input data X can be computed if the covariance of the speaker's features is obtained in advance. It also follows that the log-likelihood ratio LR between speaker S_c and speaker S_r is expressed by the difference between the Mahalanobis distances of the two speakers. That is, in the likelihood normalization for speaker verification described above, setting a threshold on the log-likelihood ratio is equivalent to setting a threshold on the difference of the Mahalanobis distances. Details are given in reference [5] below, among others.
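Equation (15) is not reproduced in this text. The description (vector dimension P, covariance matrix Σ, Mahalanobis distance in the exponent) matches the standard multivariate Gaussian likelihood, so the following is offered as an assumed reconstruction rather than the patent's exact formula. The subscripted quantities r_c, r_r, Σ_c, and Σ_r are notation introduced here for the two speakers.

```latex
L(X / S) = \frac{1}{(2\pi)^{P/2}\,|\Sigma|^{1/2}}
           \exp\!\left( -\tfrac{1}{2}\,(x - r)^{T}\,\Sigma^{-1}\,(x - r) \right)

\log\frac{L(X / S_c)}{L(X / S_r)}
  = -\tfrac{1}{2}\left[ (x - r_c)^{T}\Sigma_c^{-1}(x - r_c)
                      - (x - r_r)^{T}\Sigma_r^{-1}(x - r_r) \right]
    - \tfrac{1}{2}\log\frac{|\Sigma_c|}{|\Sigma_r|}
```

Under this assumed form, the log-likelihood ratio differs from half the difference of the two Mahalanobis distances only by a term that does not depend on the input x, which is why a threshold on one corresponds to a threshold on the other.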
[0047]
Details of the prior art relating to speaker recognition are described in, for example, the following documents.
[1] Furui: "Speaker recognition by statistical features of the cepstrum" (in Japanese), Trans. IEICE, Vol. J65-A, No. 2, pp. 183-190 (1982)
[2] F. K. Soong and A. E. Rosenberg: "On the Use of Instantaneous and Transitional Spectral Information in Speaker Recognition", IEEE Trans. ASSP, Vol. 36, No. 6, pp. 871-879 (1988)
[3] Furui: "On the individuality of voice" (in Japanese), Journal of the Acoustical Society of Japan, Vol. 51, No. 11, pp. 876-881 (1995)
[4] Matsui: "Speaker recognition using HMM" (in Japanese), IEICE Technical Report, Vol. 95, No. 467 (SP95 109-116), pp. 17-24 (1996)
[5] The Digital Signal Processing Handbook, IEEE Press (CRC Press), 1998
On the other hand, the specification and drawings of Japanese Patent Application No. 2000-363547 propose a technique that applies speaker recognition to detect, in AV data, the continuous conversation sections of the same speaker and the speaker change positions. In that application, the audio signal of the AV data is classified into speaker groups for each short section (about 1 to 2 seconds), the change in the discrimination frequency of each speaker group is obtained within several consecutive recognition sections (several seconds to about 10 seconds), and the positions where the frequency rises above or falls below a threshold are detected as speaker change positions. The section between two change positions is detected as a continuous conversation section of that speaker.
[0048]
[Problems to be solved by the invention]
Conventional speaker recognition technology has been researched and developed mainly for identifying and verifying a single speaker, for example in security systems. It therefore could not be applied to actual, complicated conversation situations in which several speakers talk alternately at short intervals within a single piece of audio data, talk at the same time, or talk over background music or noise. Consequently, if conventional speaker recognition technology is used to search AV data automatically for the conversation sections of a speaker, the identification performance deteriorates markedly.
[0049]
In the technique described in the specification and drawings of Japanese Patent Application No. 2000-363547 mentioned above, a change of speaker is detected from the change in identification frequency over each frequency evaluation section of about several seconds to 10 seconds, so the same speaker must be talking almost continuously throughout the evaluation section. That is, the technique is applicable only when the same speaker talks continuously for about 10 seconds, which is the length of the evaluation section, and the identification error rate is sufficiently low. If speakers alternate within a short time, for example within a few seconds, if they frequently talk at the same time, or if background noise or music increases the speaker recognition errors, the speaker switching position cannot be determined accurately, and the conversation sections therefore cannot be detected properly.
[0050]
Further, in the specification and drawings of Japanese Patent Application No. 2000-363547, the speaker is identified by assigning the input data to one of the speaker groups. Because speaker verification is not performed, even an unregistered, unknown speaker is classified into one of the groups, classification errors are likely to occur, and even input data other than speech is classified into one of the speaker groups.
[0051]
Speaker verification is one technique for avoiding such misidentification of unregistered speakers and of input data other than speech when detecting speakers in AV data. The conventional speaker verification method based on likelihood normalization can be applied when the likelihood is obtained using an HMM, but it cannot be applied directly to the identification method based on vector quantization distortion, which allows identification to be performed more simply.
[0052]
Also, obtaining the likelihood from the Mahalanobis distance to the standard pattern requires the covariance matrix of the speaker to be known, and the calculation is very complicated. Applying this approach to the method using vector quantization distortion would require complicated calculations, such as obtaining the covariance matrix in advance, and is not practical.
[0053]
The present invention has been proposed in view of these conventional circumstances, and its object is to provide an information extraction apparatus and method for automatically detecting the conversation sections of a speaker in AV data so that they can be searched efficiently.
[0054]
[Means for Solving the Problems]
In order to achieve the above-described object, an information extraction device according to the present invention is an information extraction device for extracting desired information from a predetermined information source, comprising: speaker identifying means for identifying and determining, for a speech signal serving as the information source, the speaker for each evaluation section based on the similarity of speech in the speech signal; and speaker discrimination frequency calculating means for obtaining discrimination frequency information of the speaker in a frequency section of the information source, the frequency section being a section for obtaining the frequency of the speaker determined for each evaluation section. As the identification method, vector quantization of the feature amount by a plurality of codebooks in one-to-one correspondence with registered speakers is used, with the vector quantization distortion as the measure of identification. The speaker whose vector quantization distortion is minimum is identified as the speaker candidate, and the value obtained by subtracting the quantization distortion of the speaker candidate from the sum or average of a plurality of vector quantization distortions of speakers other than the speaker candidate is taken as the distortion difference ΔD. When the quantization distortion D₀ of the speaker candidate is smaller than preset threshold data D_max for the maximum quantization distortion and the distortion difference ΔD is larger than preset threshold data ΔD_min for the minimum distortion difference, the speaker candidate is determined to be the person in question, and the appearance frequency information of the speaker in the frequency section is detected.
[0055]
Here, in the information extraction device, an LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of speech in the speech signal of the information source.
[0057]
Such an information extraction device identifies the speaker for each evaluation section based on the feature amount of the speaker's voice in the voice signal, and generates the appearance frequency information of the speaker by detecting the appearance frequency of the speaker in a frequency section, which is a section for obtaining the frequency of the speaker determined for each evaluation section.
[0058]
In order to achieve the above-described object, an information extraction method according to the present invention is an information extraction method for extracting desired information from a predetermined information source, comprising: a speaker identification step of identifying and determining, for a speech signal serving as the information source, the speaker for each evaluation section based on the similarity of speech in the speech signal; and a speaker discrimination frequency calculation step of obtaining discrimination frequency information of the speaker in a frequency section of the information source, the frequency section being a section for obtaining the frequency of the speaker determined for each evaluation section. As the identification method, vector quantization of the feature amount by a plurality of codebooks in one-to-one correspondence with registered speakers is used, with the vector quantization distortion as the measure of identification. The speaker whose vector quantization distortion is minimum is identified as the speaker candidate, and the value obtained by subtracting the quantization distortion of the speaker candidate from the sum or average of a plurality of vector quantization distortions of speakers other than the speaker candidate is taken as the distortion difference ΔD. When the quantization distortion D₀ of the speaker candidate is smaller than preset threshold data D_max for the maximum quantization distortion and the distortion difference ΔD is larger than preset threshold data ΔD_min for the minimum distortion difference, the speaker candidate is determined to be the person in question, and the appearance frequency information of the speaker in the frequency section is detected.
[0059]
Here, in the information extraction method, an LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of speech in the speech signal of the information source.
[0061]
Such an information extraction method identifies the speaker for each evaluation section based on the feature amount of the speaker's voice in the speech signal, and generates the appearance frequency information of the speaker by detecting the appearance frequency of the speaker in a frequency section, which is a section for obtaining the frequency of the speaker determined for each evaluation section.
[0062]
In order to achieve the above-described object, an information search apparatus according to the present invention is an information search apparatus for searching for desired information from a recording medium on which appearance frequency information of speakers is recorded in advance. The appearance frequency information is obtained by determining, for a speech signal serving as an information source, the speaker for each evaluation section based on the similarity of speech in the speech signal, and by obtaining the discrimination frequency information of the speaker in a frequency section, which is a section for obtaining the frequency of the speaker determined for each evaluation section. The apparatus comprises: speaker appearance frequency reading means for reading the appearance frequency information of speakers recorded on the recording medium; search condition input means for inputting a search condition for a desired speaker; and speaker appearance section output means for comparing the input search condition with the information read from the recording medium and outputting the information that matches the search condition as speaker appearance section information.
[0063]
Here, when the speaker discrimination frequency information is obtained, an LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of speech in the speech signal of the information source. Vector quantization of the feature amount by a plurality of codebooks is used as the identification method, with the vector quantization distortion as the measure of identification. The speaker is identified and determined by comparing two values with their respective preset thresholds: the minimum quantization distortion, which is the minimum value of the vector quantization distortion, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion.
[0064]
Such an information search apparatus searches for a portion where the desired speaker is talking at a desired frequency by comparing the appearance frequency information of the desired speaker with the input search condition.
[0065]
In order to achieve the above-described object, an information search method according to the present invention is an information search method for searching for desired information from a recording medium on which appearance frequency information of speakers is recorded in advance. The appearance frequency information is obtained by determining, for a speech signal serving as an information source, the speaker for each evaluation section based on the similarity of speech in the speech signal, and by obtaining the discrimination frequency information of the speaker in a frequency section, which is a section for obtaining the frequency of the speaker determined for each evaluation section. The method comprises: a speaker appearance frequency reading step of reading the appearance frequency information of speakers recorded on the recording medium; a search condition input step of inputting a search condition for a desired speaker; and a speaker appearance section output step of comparing the input search condition with the information read from the recording medium and outputting the information that matches the search condition as speaker appearance section information.
[0066]
Here, when the speaker discrimination frequency information is obtained, an LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of speech in the speech signal of the information source. Vector quantization of the feature amount by a plurality of codebooks is used as the identification method, with the vector quantization distortion as the measure of identification. The speaker is identified and determined by comparing two values with their respective preset thresholds: the minimum quantization distortion, which is the minimum value of the vector quantization distortion, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion.
[0067]
In such an information search method, the appearance frequency information of a desired speaker is compared with an input search condition, thereby searching for a portion where the desired speaker is talking at a desired frequency.
[0068]
[Detailed Description of the Invention]
Hereinafter, specific embodiments to which the present invention is applied will be described in detail with reference to the drawings.
[0069]
First, FIG. 1 shows a conceptual configuration diagram of an information extraction device in the present embodiment. As shown in FIG. 1, in the information extraction apparatus, a speech signal serving as an information source is input to the speaker identifying means 1, and a speaker is identified by evaluating vector quantization distortion.
[0070]
The speaker identified by the speaker identifying means 1 is input to the speaker discrimination frequency calculating means 2, which calculates, for each predetermined evaluation section, the frequency with which each speaker was identified in that section. The obtained speaker discrimination frequency is output as the appearance frequency information of the speaker.
[0071]
FIG. 2 shows a specific configuration example of the information extraction apparatus shown in FIG. 1. As shown in FIG. 2, the information extraction apparatus 10 includes an input unit 11 that inputs an audio signal of AV (Audio Visual) data, a cepstrum extraction unit 12 that analyzes the audio signal and extracts LPC (Linear Predictive Coding) cepstrum coefficients, a vector quantization unit 13 that vector-quantizes the LPC cepstrum coefficients, a speaker identification unit 14 that evaluates the vector quantization distortion and identifies the speaker, and a speaker discrimination frequency calculation unit 15 that obtains the discrimination frequency of the identified speakers as the appearance frequency of each speaker.
[0072]
In FIG. 2, a codebook group CB stores codebook data of each speaker used for vector quantization, and a threshold table file TF stores threshold data for speaker determination. Each of them is recorded in a recording unit (not shown). The speaker frequency file SF is a record of the frequency of each speaker for each section.
[0073]
The operation of the information extraction apparatus 10 configured as described above will be described below. The audio data D11 of AV data input from the input unit 11 is input to the cepstrum extraction unit 12 in block units, subjected to LPC analysis, and the obtained LPC coefficients are converted into LPC cepstrum coefficients.
[0074]
Part of the obtained LPC cepstrum coefficients D12 is input to the vector quantization unit 13 and subjected to vector quantization using the codebook data D13 of each speaker from the codebook group CB. The result (quantization distortion) D14 of the vector quantization by each codebook is input to the speaker identification unit 14 and evaluated, and, using the threshold data D15 read from the threshold table file TF, the speaker is identified and determined for each predetermined recognition block.
[0075]
The identified speaker D16 is input to the speaker discrimination frequency calculation unit 15, which calculates, for each predetermined evaluation section, the frequency with which each speaker was identified in that section. The obtained speaker discrimination frequency is recorded as speaker appearance frequency information D17 in the speaker frequency file SF for each AV data item, each speaker, and each evaluation section, for example in the recording format shown in FIG. 3. The speaker frequency file SF may be transmitted over a communication line by a transmission/reception unit (not shown), or may be stored on a recording medium such as a magnetic disk or a magneto-optical disk, or in a storage medium such as a semiconductor memory.
[0076]
The recording format of FIG. 3 includes the AV data name of the audio signal input from the input unit 11, the identification name for identifying each registered speaker, the start time of the frequency section, the end time of the section, The information includes the speaker discrimination frequency in the frequency section of the AV data. This recording format is an example and is not limited to the information shown in FIG.
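As an illustration of such a record, the following minimal sketch models one entry of the speaker frequency file SF with the five items listed above. The field names, the use of seconds for the times, and the tab-separated line layout are assumptions made here for illustration; FIG. 3 itself is not reproduced in this text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakerFrequencyRecord:
    """One entry of the speaker frequency file SF (field names are hypothetical)."""
    av_data_name: str     # name of the AV data the audio signal came from
    speaker_id: str       # identification name of the registered speaker
    start_time_s: float   # start time of the frequency section (seconds, assumed unit)
    end_time_s: float     # end time of the frequency section (seconds, assumed unit)
    frequency: float      # speaker discrimination frequency in that section


def to_line(record: SpeakerFrequencyRecord) -> str:
    """Serialize a record as one tab-separated line (an assumed layout)."""
    return "\t".join([
        record.av_data_name,
        record.speaker_id,
        repr(record.start_time_s),
        repr(record.end_time_s),
        repr(record.frequency),
    ])


def from_line(line: str) -> SpeakerFrequencyRecord:
    """Parse a line written by to_line back into a record."""
    name, speaker, start, end, freq = line.rstrip("\n").split("\t")
    return SpeakerFrequencyRecord(name, speaker, float(start), float(end), float(freq))
```

A flat, line-oriented layout is used here only because it is easy to pass over a communication line or store on a recording medium; any format carrying the same items would serve.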
[0077]
Hereinafter, with reference to FIG. 2 and FIG. 4, a process for performing speaker identification and obtaining a speaker discrimination frequency will be described in more detail.
[0078]
In the cepstrum extraction unit 12, the audio signal of the input AV data is subjected to LPC analysis for each LPC analysis block AB shown in FIG. 4, and the obtained LPC coefficients are converted to extract the LPC cepstrum coefficients. For an audio signal, the block length a of the LPC analysis block AB is usually about 20 to 30 milliseconds. Further, to improve analysis performance, each block is often made to overlap slightly with the adjacent blocks.
[0079]
The speaker recognition block RB in FIG. 4 is a minimum unit for identifying a speaker, and the speaker is identified in this block unit. The block length b of the speaker recognition block RB is preferably about several seconds. Accordingly, one speaker recognition block RB includes about 50 to several hundred LPC analysis blocks AB. The speaker recognition block RB may also slightly overlap with the adjacent section. The overlap length is usually about 10% to 50% of the section length.
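The relation between the block lengths can be checked with a small sketch that lists the start offsets of fixed-length, optionally overlapping blocks. The function name and the concrete numbers below (16 kHz sampling, 25 ms analysis blocks, a 3 second recognition block) are assumptions chosen for illustration and are not values stated in the text.

```python
def block_starts(total_len: int, block_len: int, overlap: int = 0) -> list:
    """Return the start offsets (in samples) of all full blocks in a signal.

    total_len, block_len and overlap are sample counts; consecutive blocks
    share `overlap` samples, so the hop between starts is block_len - overlap.
    """
    if block_len <= 0 or not 0 <= overlap < block_len:
        raise ValueError("need block_len > 0 and 0 <= overlap < block_len")
    hop = block_len - overlap
    return list(range(0, total_len - block_len + 1, hop))


# Assumed example values: 16 kHz audio, 25 ms analysis block, 3 s recognition block.
SAMPLE_RATE = 16000
ANALYSIS_BLOCK = SAMPLE_RATE * 25 // 1000      # 400 samples
RECOGNITION_BLOCK = SAMPLE_RATE * 3            # 48000 samples

analysis_blocks = block_starts(RECOGNITION_BLOCK, ANALYSIS_BLOCK)
```

With these assumed values one recognition block holds 120 non-overlapping analysis blocks, which falls inside the range of about 50 to several hundred mentioned above.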
[0080]
The frequency section FI in FIG. 4 is the evaluation unit for determining the appearance frequency of a speaker. Within each section, the appearance frequency of each speaker is obtained from the determination frequency of the speakers identified in the individual speaker recognition blocks RB. Let the start time of the frequency section FI_I be S_I and its end time be E_I; the section length (E_I-S_I) is suitably from several minutes to several tens of minutes. This evaluation section may overlap slightly with the adjacent section.
[0081]
The operation of the information extracting apparatus 10 is described below with reference to its flowchart. First, in step S10, the section number I is set to 0 as an initialization process. The section number I is a serial number given to the frequency section FI for obtaining the speaker frequency.
[0082]
Next, in step S11, a speaker candidate is identified for each speaker recognition block RB described above. The method for selecting speaker candidates will be described in detail later.
[0083]
In step S12, it is verified whether the selected speaker candidate is the correct speaker. Step S11 simply selects, as the candidate, the registered speaker most similar to the input speech, so when the speech of an unknown, unregistered speaker or data other than speech is input, the selected candidate is not necessarily the actual speaker. Therefore, in step S12, the vector quantization distortion is evaluated and compared with the threshold data recorded in the threshold table file TF shown in FIG. 2 to determine whether the candidate is the speaker in question. The determination method will be described in detail later. If it is determined in step S12 that the candidate is the speaker in question, the speaker candidate is confirmed as the speaker of that speaker recognition block. Otherwise, the speaker of that speaker recognition block is treated as an unknown speaker.
[0084]
Subsequently, in step S13, it is determined whether processing has been completed up to the last speaker recognition block RB of the frequency section FI_I. If the block is not the last speaker recognition block RB, the process moves on to the next speaker recognition block RB in step S14 and returns to step S11. If it is determined in step S13 that the block is the last speaker recognition block RB, the process proceeds to step S15.
[0085]
In step S15, the determination frequency of each registered speaker in the current frequency section FI_I is obtained as appearance frequency information. Note that speaker recognition blocks RB determined to be an unknown speaker are not included in the frequency calculation. The obtained speaker appearance frequency is recorded in the speaker frequency file SF shown in FIG. 2 in a recording format such as that shown in FIG. 3.
[0086]
In step S16, it is determined whether or not the end of the data has been reached. If the end of the data has been reached, the process is terminated. If the end of the data has not been reached, the process proceeds to step S17.
[0087]
In step S17, the section number I is incremented by 1, the process proceeds to the next frequency section, and the process returns to step S11.
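The loop over frequency sections in steps S10 to S17 can be sketched as follows. The sketch assumes that the per-block result of steps S11 and S12 is already available as a list, with `None` standing for an unknown speaker, that each frequency section holds a fixed number of non-overlapping recognition blocks, and that the frequency is kept as a raw count. All of these are simplifications introduced here, and the function name is hypothetical.

```python
from collections import Counter

UNKNOWN = None  # marker for a block judged to be an unknown speaker (assumption)


def appearance_frequencies(decisions, blocks_per_section):
    """Count, per frequency section, how often each registered speaker was identified.

    decisions: one entry per speaker recognition block RB, either a speaker
    name or UNKNOWN, i.e. the outcome of steps S11 and S12.
    """
    sections = []
    # S10 / S17: walk through the frequency sections in order.
    for start in range(0, len(decisions), blocks_per_section):
        # S11 to S14: the recognition blocks that belong to this section.
        section = decisions[start:start + blocks_per_section]
        # S15: unknown-speaker blocks are left out of the frequency calculation.
        counts = Counter(s for s in section if s is not UNKNOWN)
        sections.append(dict(counts))
    return sections  # S16: the loop ends when the data ends
```

For example, eight block decisions split into sections of four blocks yield two dictionaries, one per frequency section, each mapping a speaker name to its count.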
[0088]
Next, FIG. 6 shows details of the speaker candidate identification method in step S11 described above. First, in step S20, voice data is read from the input data for each LPC analysis block AB described above.
[0089]
Next, in step S21, it is determined whether or not the process has been completed up to the last LPC analysis block AB of the speaker recognition block RB. If the process of the last LPC analysis block AB has been completed, the process proceeds to step S26. If it is not the last LPC analysis block AB in step S21, the process proceeds to step S22.
[0090]
In step S22, the data of the LPC analysis block AB is evaluated to determine whether the block is a speech block. If it is determined in step S22 that the LPC analysis block AB is a silent block or a non-speech block, the analysis of the block is skipped, the process proceeds to step S25 to move on to the next LPC analysis block AB, and processing resumes from step S20. The simplest way to decide whether a block is a speech block is to evaluate the average power and the maximum value of the block and thereby judge only whether the block is silent. There are also various other methods that analyze the average power of the signal, the number of zero crossings, the presence or absence of pitch, the spectral shape, and so on to determine whether the data is speech. In this embodiment, the method is not particularly limited, and this step may also be omitted.
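A speech/non-speech check of the kind mentioned for step S22 might look like the sketch below, which combines the average power with the zero-crossing rate. The function name and both threshold values are assumptions chosen for illustration; the text does not fix a particular method or particular values.

```python
def is_speech_block(samples, power_threshold=1e-4, max_zero_cross_rate=0.5):
    """Rough speech check for one analysis block of float samples in [-1, 1].

    A block is rejected as silent when its average power is below
    power_threshold, and as noise-like when its sign changes too often.
    Both thresholds are illustrative assumptions.
    """
    if not samples:
        return False
    power = sum(x * x for x in samples) / len(samples)
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    zero_cross_rate = crossings / max(len(samples) - 1, 1)
    return power >= power_threshold and zero_cross_rate <= max_zero_cross_rate
```

Pitch presence or spectral shape could be added as further conditions in the same way.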
[0091]
If it is determined in step S22 that the block is a speech block, then in step S23, LPC analysis of this block is performed, and the obtained LPC coefficients are converted to extract LPC cepstrum coefficients. Here, low-order cepstrum coefficients of the first to the 14th order are extracted.
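The conversion from LPC coefficients to LPC cepstrum coefficients can be sketched with the standard recursion for an all-pole model. The sketch assumes the predictor convention x_t ≈ Σ a_k x_{t-k}; if the opposite sign convention is used for the LPC coefficients, their signs must be flipped first. The function name is hypothetical, and the LPC analysis itself is assumed to have been done already.

```python
def lpc_to_cepstrum(lpc, n_cep=14):
    """Convert LPC coefficients a_1..a_P into cepstrum coefficients c_1..c_n_cep.

    lpc[k-1] holds a_k for the predictor x_t ~ sum_k a_k * x_(t-k) (assumed
    convention). The recursion used is
        c_n = a_n + sum_{k=1}^{n-1} (k / n) * c_k * a_(n-k),
    with a_n taken as 0 for n > P.
    """
    p = len(lpc)
    cep = []
    for n in range(1, n_cep + 1):
        acc = lpc[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * cep[k - 1] * lpc[n - k - 1]
        cep.append(acc)
    return cep
```

For a first-order predictor with a_1 = 0.5 the recursion gives c_n = 0.5^n / n, which matches the series expansion of -log(1 - 0.5 z^-1) and serves as a quick sanity check.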
[0092]
In step S24, vector quantization is performed on the LPC cepstrum coefficients obtained in step S23 using a plurality of codebooks created in advance. Each codebook has a one-to-one correspondence with a registered speaker. Here, the vector quantization distortion of the LPC cepstrum coefficients of this block by the codebook CB_k is denoted d_k.
[0093]
In step S25, the process proceeds to the next LPC analysis block AB, returns to step S20, and similarly repeats the processing from step S20 to step S25.
[0094]
In step S26, the average quantization distortion D_k, which is the average of the quantization distortion d_k by each codebook CB_k over the entire speaker recognition block RB, is obtained.
[0095]
Subsequently, in step S27, the codebook CB_{k'} corresponding to the speaker S_{k'} that minimizes the average quantization distortion D_k is selected, and in step S28, this speaker S_{k'} is output as the speaker candidate S_c.
[0096]
In this way, among the speakers whose codebooks are registered, the speaker whose voice is most similar to the input data is selected as the speaker candidate S_c for the speaker recognition block RB.
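Steps S24 to S28 can be sketched as follows. The sketch assumes squared Euclidean distance as the distortion measure and represents the codebook group as a dictionary from speaker name to a list of code vectors; both are illustrative choices, and the function names are hypothetical.

```python
def frame_distortion(vector, codebook):
    """d_k for one analysis block: distance to the nearest code vector.

    Squared Euclidean distance is assumed as the distortion measure.
    """
    return min(
        sum((v - c) ** 2 for v, c in zip(vector, code)) for code in codebook
    )


def average_distortions(frames, codebooks):
    """D_k for every speaker: mean of d_k over all frames of the recognition block."""
    return {
        speaker: sum(frame_distortion(f, cb) for f in frames) / len(frames)
        for speaker, cb in codebooks.items()
    }


def select_candidate(frames, codebooks):
    """Return (speaker candidate S_c, all average distortions D_k)."""
    avg = average_distortions(frames, codebooks)
    candidate = min(avg, key=avg.get)  # speaker with the smallest D_k
    return candidate, avg
```

The full dictionary of average distortions is returned alongside the candidate because the later verification step needs the distortions of the other speakers as well.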
[0097]
Next, FIG. 7 shows details of the method for verifying the speaker candidate S_c in step S12 described above. First, in step S30, the average quantization distortion of the speaker candidate S_c is denoted D₀. Next, in step S31, the average quantization distortions by the codebooks other than that of the speaker candidate S_c are sorted in ascending order, and the n smallest are denoted D₁, D₂, ..., D_n (D₀<D₁<D₂<...<D_n). The value of n can be selected arbitrarily.
[0098]
Subsequently, in step S32, as an evaluation measure, the distortion difference ΔD between the quantization distortion D₀ of the speaker candidate S_c and the other n quantization distortions is obtained using the following formula (18) or formula (19).
[0099]
[Expression 15]

[0100]
In formulas (18) and (19), for example, when n is 1, ΔD is the difference between D₀ and D₁, the smallest quantization distortion after that of the speaker candidate S_c.
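The distortion difference can be illustrated with the sketch below. Formulas (18) and (19) are not reproduced in this text, so the sketch uses only the averaged form implied by the description: the mean of the n smallest distortions of the other speakers minus D₀. That choice, the function name, and the dictionary input are assumptions made for illustration.

```python
def distortion_difference(avg_distortions, candidate, n=1):
    """Return an averaged distortion difference for the speaker candidate.

    avg_distortions maps each registered speaker to its average quantization
    distortion; candidate is the speaker with the smallest value (D0).
    The n smallest remaining distortions D1..Dn are averaged and D0 is
    subtracted (averaged form only; an assumption, see the lead-in).
    """
    d0 = avg_distortions[candidate]
    others = sorted(
        d for speaker, d in avg_distortions.items() if speaker != candidate
    )
    nearest = others[:n]
    if not nearest:
        raise ValueError("at least one other registered speaker is required")
    return sum(nearest) / len(nearest) - d0
```

With n set to 1 the result reduces to D₁ - D₀, in agreement with the case described above.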
[0101]
Subsequently, in step S33, the threshold data corresponding to the speaker candidate S_c is read from the threshold table file TF shown in FIG. 2.
[0102]
The threshold table file TF holds a record for each registered speaker, for example in the format shown in FIG. 8. That is, as shown in FIG. 8, the speaker identification name of each registered speaker is recorded in advance together with the threshold data: the maximum absolute value D_max of the quantization distortion and the minimum distortion difference ΔD_min.
[0103]
Returning to FIG. 7, in step S34, the obtained D₀ and ΔD are compared with the read threshold data D_max and ΔD_min to make the determination. That is, in step S34, if the absolute value D₀ of the quantization distortion is smaller than the threshold data D_max and the distortion difference ΔD is larger than the threshold data ΔD_min, the process proceeds to step S35, where the candidate is determined to be the person in question and is confirmed. Otherwise, the process proceeds to step S36, where the speaker is determined to be unknown and the candidate is rejected. By comparing the average quantization distortion D₀ of the speaker candidate S_c and the distortion difference ΔD with their respective thresholds in this way, identification errors for the voice data of registered speakers are reduced, and voice data from anyone other than the registered speakers can be determined to be from an unknown speaker.
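The decision in steps S33 to S36 can be sketched as a lookup followed by two comparisons. The table contents, the speaker names, and the function name below are hypothetical, and the numeric thresholds are placeholders rather than values from the text.

```python
# Hypothetical threshold table: speaker name -> (D_max, delta_D_min).
THRESHOLD_TABLE = {
    "speaker_A": (2.0, 1.0),
    "speaker_B": (1.5, 0.8),
}


def verify_candidate(candidate, d0, delta_d, table=THRESHOLD_TABLE):
    """Accept the candidate only if D0 < D_max and delta_D > delta_D_min.

    Returns the candidate name when accepted (step S35) and None when the
    block is judged to come from an unknown speaker (step S36).
    """
    d_max, delta_d_min = table[candidate]          # step S33: read thresholds
    if d0 < d_max and delta_d > delta_d_min:       # step S34: compare both
        return candidate
    return None
```

The first condition rejects input that is far from every registered speaker, and the second rejects input that is about equally close to several speakers, so both must hold for acceptance.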
[0104]
As described above, the information extraction apparatus according to the present embodiment identifies the speaker for each speaker recognition block based on the feature amount of the speaker's voice in the audio signal of the AV data, detects the appearance frequency of the speaker in each predetermined section, and generates the appearance frequency information of the speaker. This appearance frequency information is supplied to an information search device, which will be described later, via a communication line or a recording medium, so that desired information can be searched for effectively in the information search device.
[0105]
Next, the information search apparatus in the present embodiment will be described. First, FIG. 9 shows a conceptual configuration diagram of the information search apparatus. As shown in FIG. 9, in the information search device, the speaker appearance frequency reading means 3 reads the appearance frequency information of the speaker generated by the information extraction device described above, and a search condition is input to the search condition input means 4. The speaker appearance section output means 5 outputs the speaker appearance section information found based on the appearance frequency information and the search condition.
[0106]
FIG. 10 shows a specific configuration example of the information search apparatus shown in FIG. 9. As shown in FIG. 10, the information search apparatus 20 includes a condition input unit 21 that inputs search conditions, a data search unit 22 that searches for information based on the input conditions, and an output unit 23 that outputs search results. The appearance frequency information of the speakers is recorded in the speaker frequency file SF in a recording format such as that shown in FIG. 3.
[0107]
The operation of the information search apparatus 20 configured as described above will be described below. The condition input unit 21 inputs search conditions for searching AV data. Search conditions include, for example, the name and identification number of a desired speaker, the conversation frequency of the speaker, AV data to be searched, and the like.
[0108]
The input search condition D22 is input to the data search unit 22, and information that meets the search condition is searched. The data search unit 22 reads the speaker appearance frequency information D23 with reference to the speaker frequency information file SF, compares it with the search condition D22, and supplies the search result D24 to the output unit 23. The output unit 23 outputs the search result D24 as speaker appearance section information D25.
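The comparison performed by the data search unit 22 can be sketched as a filter over the frequency records. The record layout (a tuple of AV data name, speaker name, start time, end time, frequency), the form of the search condition (a speaker, a minimum frequency, and an optional AV data name), and the function name are all assumptions made here for illustration.

```python
def search_sections(records, speaker_id, min_frequency, av_data_name=None):
    """Return (av_data_name, start, end) for every section matching the condition.

    records: iterable of (av_data_name, speaker_id, start, end, frequency)
    tuples, an assumed in-memory form of the speaker frequency file SF.
    """
    hits = []
    for name, speaker, start, end, freq in records:
        if speaker != speaker_id:
            continue                      # other speaker
        if av_data_name is not None and name != av_data_name:
            continue                      # other AV data
        if freq >= min_frequency:
            hits.append((name, start, end))
    return hits
```

The returned list plays the role of the speaker appearance section information D25: each entry names a section in which the requested speaker appears at least as often as requested.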
[0109]
As described above, the information search apparatus in the present embodiment compares the appearance frequency information of a desired speaker in the audio signal of AV data input via a communication line or a recording medium with the input search condition. Thus, it is possible to effectively search a portion where a desired speaker is talking at a desired frequency.
[0110]
It should be noted that the present invention is not limited to the above-described embodiments, and various modifications can be made without departing from the scope of the present invention.
[0111]
[Effects of the Invention]
As described in detail above, the information extraction apparatus according to the present invention is an information extraction apparatus for extracting desired information from a predetermined information source, and comprises speaker identification means for determining, for a speech signal serving as the information source, the speaker for each evaluation section based on the similarity of speech in the speech signal, and speaker discrimination frequency calculation means for obtaining discrimination frequency information of the speaker in a frequency section of the information source, which is a section for obtaining the frequency of the speaker determined for each evaluation section. The apparatus thereby detects the appearance frequency information of the speaker in the frequency section.
[0112]
Here, in the information extraction apparatus, an LPC cepstrum obtained by LPC analysis is used as a feature amount for evaluating the similarity of speech in the speech signal of the information source, and feature amounts by a plurality of codebooks are used as identification methods. Vector quantization is used, and the vector quantization distortion is used as a discrimination measure.
[0113]
In the information extraction device, the speaker is identified and determined by comparing two values with their respective preset thresholds: the minimum quantization distortion, which is the minimum value of the vector quantization distortion, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion.
[0114]
According to such an information extraction device, the speaker is identified for each evaluation section based on the feature amount of the speaker's voice in the speech signal, and the appearance frequency information of the speaker can be generated by detecting the appearance frequency of the speaker in a frequency section, which is a section for obtaining the frequency of the speaker determined for each evaluation section. The appearance frequency information is supplied to the information search device via a communication line or a recording medium, so that the information search device can search for desired information effectively.
[0115]
The information extraction method according to the present invention is an information extraction method for extracting desired information from a predetermined information source, and comprises a speaker identification step of determining, for an audio signal serving as the information source, the speaker for each evaluation section based on the similarity of speech in the audio signal, and a speaker discrimination frequency calculation step of obtaining discrimination frequency information of the speaker in a frequency section of the information source, which is a section for obtaining the frequency of the speaker determined for each evaluation section. The method thereby detects the appearance frequency information of the speaker in the frequency section.
[0116]
Here, the information extraction method uses the LPC cepstrum obtained by LPC analysis as the feature amount for evaluating the similarity of speech in the speech signal of the information source, uses vector quantization of the feature amounts by a plurality of codebooks as the identification method, and uses the vector quantization distortion as the measure of identification.
[0117]
In the information extraction method, the identification decision is made by comparing each of two values with its own preset threshold: the minimum quantization distortion, which is the minimum value of the vector quantization distortion, and the value obtained by subtracting the minimum quantization distortion from the sum or average of the vector quantization distortions other than the minimum quantization distortion.
[0118]
With such an information extraction method, the speaker is identified for each evaluation section based on the feature amounts of the speaker's voice in the speech signal, and the frequency with which each speaker is identified is obtained over the frequency section. The appearance frequency of the speaker in the frequency section is thereby detected, and appearance frequency information of the speaker can be generated. When this appearance frequency information is supplied to an information search apparatus via a communication line or a recording medium, the information search apparatus can search for the desired information efficiently.
[0119]
Further, the information search apparatus according to the present invention is an information search apparatus for searching for desired information on a recording medium on which appearance frequency information of speakers has been recorded in advance, the appearance frequency information having been obtained by identifying the speaker for each evaluation section based on the similarity of the speech in the speech signal serving as the information source and obtaining discrimination frequency information of the speaker in a frequency section, which is the section over which the frequency of the speaker identified for each evaluation section is obtained. The apparatus comprises: speaker appearance frequency reading means for reading the appearance frequency information of the speakers recorded on the recording medium; search condition input means for inputting a search condition for a desired speaker; and speaker appearance section output means for comparing the search condition with the information read from the recording medium and outputting the information that matches the search condition as speaker appearance section information.
[0120]
Here, when the appearance frequency information of the speaker is obtained, the LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of speech in the speech signal of the information source, vector quantization of the feature amounts by a plurality of codebooks is used as the identification method, and the vector quantization distortion is used as the measure of identification. The identification decision is made by comparing each of two values with its own preset threshold: the minimum quantization distortion, which is the minimum value of the vector quantization distortion, and the value obtained by subtracting the minimum quantization distortion from the sum or average of the vector quantization distortions other than the minimum quantization distortion.
[0121]
With such an information search apparatus, by comparing the appearance frequency information of the desired speaker with the input search condition, the portions in which the desired speaker is speaking at the desired frequency can be searched for efficiently.
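The comparison between recorded appearance frequency information and a search condition might look like the following sketch. The record layout (start time, end time, speaker, frequency), the form of the search condition (a speaker name plus a minimum frequency), and every name in the code are hypothetical and chosen only to make the example self-contained.

```python
from typing import NamedTuple


class FrequencyRecord(NamedTuple):
    """One hypothetical entry of recorded appearance frequency information."""
    start: float      # start time of the frequency section, in seconds
    end: float        # end time of the frequency section, in seconds
    speaker: str      # identified speaker
    frequency: float  # appearance frequency of that speaker in the section


def search_appearance_sections(records, speaker, min_frequency):
    """Return (start, end) of every frequency section that meets the search condition."""
    return [
        (r.start, r.end)
        for r in records
        if r.speaker == speaker and r.frequency >= min_frequency
    ]


records = [
    FrequencyRecord(0.0, 10.0, "A", 0.5),
    FrequencyRecord(0.0, 10.0, "B", 0.25),
    FrequencyRecord(10.0, 20.0, "A", 0.25),
    FrequencyRecord(10.0, 20.0, "B", 0.5),
    FrequencyRecord(20.0, 30.0, "A", 0.75),
]
hits = search_appearance_sections(records, speaker="A", min_frequency=0.5)
```

Because the search works only on the compact frequency records, the speech signal itself does not need to be analyzed again at search time.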
[0122]
In addition, the information search method according to the present invention is an information search method for searching for desired information on a recording medium on which appearance frequency information of speakers has been recorded in advance, the appearance frequency information having been obtained by identifying the speaker for each evaluation section based on the similarity of the speech in the speech signal and obtaining discrimination frequency information of the speaker in a frequency section, which is the section over which the frequency of the speaker identified for each evaluation section is obtained. The method comprises: a speaker appearance frequency reading step of reading the appearance frequency information of the speakers recorded on the recording medium; a search condition input step of inputting a search condition for a desired speaker; and a speaker appearance section output step of comparing the search condition with the information read from the recording medium and outputting the information that matches the search condition as speaker appearance section information.
[0123]
Here, when the speaker discrimination frequency information is obtained, the LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of speech in the speech signal of the information source, vector quantization of the feature amounts by a plurality of codebooks is used as the identification method, and the vector quantization distortion is used as the measure of identification. The identification decision is made by comparing each of two values with its own preset threshold: the minimum quantization distortion, which is the minimum value of the vector quantization distortion, and the value obtained by subtracting the minimum quantization distortion from the sum or average of the vector quantization distortions other than the minimum quantization distortion.
[0124]
With such an information search method, by comparing the appearance frequency information of the desired speaker with the input search condition, the portions in which the desired speaker is speaking at the desired frequency can be searched for efficiently.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a conceptual configuration of an information extraction device according to an embodiment.
FIG. 2 is a diagram illustrating a configuration example of the information extraction device.
FIG. 3 is a diagram for explaining an example of a recording format of speaker appearance frequency information in the information extraction device.
FIG. 4 is a diagram for explaining a relationship among a frequency evaluation section, a speaker recognition block, and an LPC analysis block in the information extraction apparatus.
FIG. 5 is a flowchart for explaining the operation of the information extraction apparatus.
FIG. 6 is a flowchart illustrating speaker identification processing in units of speaker recognition blocks in the information extraction apparatus.
FIG. 7 is a flowchart illustrating speaker verification determination processing in the information extraction apparatus.
FIG. 8 is a diagram illustrating an example of a recording format of threshold data for speaker verification determination in the information extraction device.
FIG. 9 is a diagram illustrating a conceptual configuration of an information search apparatus according to the present embodiment.
FIG. 10 is a diagram for explaining the configuration of the information retrieval apparatus.
[Explanation of symbols]
1 speaker identification means, 2 speaker discrimination frequency calculation means, 3 speaker appearance frequency reading means, 4 search condition input means, 5 speaker appearance section output means, 10 information extraction device, 11 input unit, 12 cepstrum extraction unit, 13 vector quantization unit, 14 speaker identification unit, 15 speaker discrimination frequency calculation unit, 20 information search device, 21 condition input unit, 22 data search unit, 23 output unit

Claims

An information extraction apparatus for extracting desired information from a predetermined information source, comprising:
speaker identification means for identifying, for a speech signal serving as the information source, the speaker for each evaluation section based on the similarity of the speech in the speech signal; and
speaker discrimination frequency calculation means for obtaining discrimination frequency information of the speaker in a frequency section of the information source, the frequency section being a section over which the frequency of the speaker identified for each evaluation section is obtained,
wherein vector quantization of feature amounts by a plurality of codebooks corresponding one-to-one to registered speakers is used as the identification method,
the vector quantization distortion is used as the measure of identification,
the speaker that minimizes the vector quantization distortion is taken as a speaker candidate, the value obtained by subtracting the minimum quantization distortion of the speaker candidate from the sum or average of the vector quantization distortions of the speakers other than the speaker candidate is taken as a distortion difference ΔD, and, when the quantization distortion D₀ of the speaker candidate is smaller than preset threshold data D_max for the maximum quantization distortion and the distortion difference ΔD is larger than preset threshold data ΔD_min for the distortion difference, the speaker candidate is determined to be the person in question, and
appearance frequency information of the speaker in the frequency section is detected.

The information extraction apparatus according to claim 1, wherein an LPC cepstrum obtained by LPC analysis is used as a feature amount for evaluating the similarity of speech in the speech signal of the information source.

An information extraction method for extracting predetermined information from a predetermined information source, comprising:
a speaker identification step of identifying, for a speech signal serving as the information source, the speaker for each evaluation section based on the similarity of the speech in the speech signal; and
a speaker discrimination frequency calculation step of obtaining discrimination frequency information of the speaker in a frequency section of the information source, the frequency section being a section over which the frequency of the speaker identified for each evaluation section is obtained,
wherein vector quantization of feature amounts by a plurality of codebooks corresponding one-to-one to registered speakers is used as the identification method,
the vector quantization distortion is used as the measure of identification,
the speaker that minimizes the vector quantization distortion is taken as a speaker candidate, the value obtained by subtracting the minimum quantization distortion of the speaker candidate from the sum or average of the vector quantization distortions of the speakers other than the speaker candidate is taken as a distortion difference ΔD, and, when the quantization distortion D₀ of the speaker candidate is smaller than preset threshold data D_max for the maximum quantization distortion and the distortion difference ΔD is larger than preset threshold data ΔD_min for the distortion difference, the speaker candidate is determined to be the person in question, and
appearance frequency information of the speaker in the frequency section is detected.

The information extraction method according to claim 3, wherein an LPC cepstrum obtained by LPC analysis is used as a feature amount for evaluating similarity of speech in the speech signal of the information source.