JP2002366176A

JP2002366176A - Device and method for information extraction and device and method for information retrieval

Info

Publication number: JP2002366176A
Application number: JP2001177569A
Authority: JP
Inventors: Yasuhiro Tokuri; 康裕戸栗; Masayuki Nishiguchi; 正之西口
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2001-06-12
Filing date: 2001-06-12
Publication date: 2002-12-20
Anticipated expiration: 2021-06-12
Also published as: JP4734771B2

Abstract

PROBLEM TO BE SOLVED: To automatically and efficiently retrieve the sections of AV data in which speakers converse. SOLUTION: In an information extracting device 20, a voice signal D11 of AV data supplied from an input part 11 is fed to a cepstrum extraction part 12, which performs LPC analysis and converts the resulting LPC coefficients into LPC cepstrum coefficients. Part D12 of the LPC cepstrum coefficients is fed to a vector quantization part 13, which performs vector quantization. The resulting quantization distortion D14 is fed to and evaluated by a speaker identification part 14, which also uses threshold data D16 to identify and determine a speaker for each specified recognition block. The identified speaker D16 is fed to a speaker determination frequency calculation part 15, which calculates the determination frequency of each recognized speaker in each specified evaluation section and outputs it as speaker appearance frequency information D17. An information retrieving device uses the appearance frequency information to retrieve, for example, a part where a desired speaker converses at a desired frequency.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an information extraction apparatus and method and to an information retrieval apparatus and method, and more particularly to an information extraction apparatus and method, and an information retrieval apparatus and method, for retrieving audio data or audio-visual data.

[0002]

2. Description of the Related Art

With the spread of multimedia in recent years, there is a growing need to manage large amounts of AV (Audio Visual) data efficiently and to classify, search, and extract from it. For example, it is necessary to search a large amount of AV data for scenes in which a certain character appears or converses, or to extract only a certain person's conversation scenes from AV data and play them back.

[0003] Conventionally, to search such AV data for the position on the time axis at which a specific speaker is talking, a person had to view the AV data directly and look for that position or section on the time axis.

[0004] Meanwhile, automatic speaker identification and verification have been studied as techniques for identifying the speaker of a voice. An outline of the conventional techniques follows. First, speaker recognition comprises speaker identification and speaker verification. Speaker identification determines which of the pre-registered speakers produced the input voice, whereas speaker verification compares the input voice with the data of a pre-registered speaker to determine whether the speaker is the claimed person. Speaker recognition is also divided into an utterance-content-dependent type, in which the words (keywords) uttered at recognition time are fixed in advance, and an utterance-content-independent type, in which recognition is performed on arbitrary utterances.

[0005] As a general speech recognition technique, for example, the following is often used. First, features representing the individuality of a speaker's voice signal are extracted and recorded in advance as training data. At verification or identification time, the input speaker's voice is analyzed, features representing its individuality are extracted, and their similarity to the training data is evaluated to identify or verify the speaker. The cepstrum is often used as a feature representing the individuality of a voice. The cepstrum is the inverse Fourier transform of the logarithmic spectrum, and its low-order coefficients represent the envelope of the speech spectrum. The polynomial expansion coefficients of the cepstrum time series are called the delta cepstrum and are likewise often used as features representing temporal change in the speech spectrum. Pitch and delta pitch (the polynomial expansion coefficients of pitch) are also sometimes used.

[0006] Training data are created using features such as the LPC (Linear Predictive Coding) cepstrum extracted in this way as standard patterns. Two representative methods are the method based on vector quantization distortion and the method based on the hidden Markov model (HMM).

[0007] In the method based on vector quantization distortion, the features of each speaker are grouped in advance and the centroids of the groups are stored as the elements (code vectors) of a codebook. The features of the input voice are then vector-quantized with each speaker's codebook, and the average quantization distortion of each codebook over the entire input voice is obtained.

[0008] For speaker identification, the speaker whose codebook yields the smallest average quantization distortion is selected. For speaker verification, the average quantization distortion obtained with the claimed speaker's codebook is compared with a threshold to determine whether the speaker is that person.

[0009] Among such conventional speaker recognition techniques, the method that extracts the LPC cepstrum of the voice as the feature and identifies the speaker using the vector quantization distortion of that feature is described in detail below.

[0010] First, LPC analysis (linear prediction analysis) is performed on the input voice signal block by block to obtain linear prediction coefficients (LPC coefficients). For voice, an analysis block length of about 20 to 30 milliseconds is generally used. The sample x_t of the input signal is predicted from the past P samples as in equation (1) below. An order P of about 10 to 20 is generally used for linear prediction.

[0011]

(Equation 1)

[0012] The linear prediction coefficients a_i that minimize the linear prediction error ε = x'_t - x_t are then obtained by the least-squares method. The covariance method and the autocorrelation method are available for solving the least-squares problem. The autocorrelation method in particular is widely used because the positive definiteness of its coefficient matrix is guaranteed, so a solution can always be found, and because the solution can be obtained efficiently by Durbin's recursion. With the P linear prediction coefficients thus obtained, the generating function of the estimated all-pole speech model is expressed as equation (2) below.

[0013]

(Equation 2)
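The autocorrelation method with Durbin's recursion can be sketched as follows. This is a minimal illustration, not the patent's implementation: equations (1) and (2) are not reproduced in this text, so the sign convention used here (predictor x'_t = -(a_1 x_{t-1} + ... + a_P x_{t-P}), that is, A(z) = 1 + a_1 z^-1 + ... + a_P z^-P) is an assumption, and the function names `autocorrelation` and `levinson_durbin` are hypothetical.

```python
def autocorrelation(block, max_lag):
    """Return r[k] = sum_t block[t] * block[t + k] for k = 0..max_lag."""
    n = len(block)
    return [sum(block[t] * block[t + k] for t in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations by Durbin's recursion.

    Returns (a, err): a[0] is 1.0 and a[1..order] are the coefficients of
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order (assumed convention);
    err is the residual prediction error energy.
    A silent block (r[0] == 0) would divide by zero and must be skipped
    by the caller.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err                       # reflection coefficient
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return a, err
```

For an autocorrelation sequence of the form r[k] = 0.5^k, the recursion returns a_1 = -0.5 and a_2 = 0, that is, each sample is predicted as half of the previous one.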

[0014] Since the cepstrum is the inverse Fourier transform of the logarithmic spectrum of the speech, the cepstrum of the speech model obtained by LPC analysis is expressed by equation (3) below, where C(ω) denotes the Fourier transform of the cepstrum.

[0015]

(Equation 3)

[0016] Extending the Fourier transform to the Z-transform for generality, this can be written as equation (4).

[0017]

(Equation 4)

[0018] The inverse Z-transform c_i of C(z) is called the complex cepstrum. A method is known for converting the LPC coefficients a_i directly into the complex cepstrum c_i: the complex cepstrum can be obtained successively from recurrence formulas such as equations (5), (6), and (7) below. The c_n obtained from LPC analysis in this way is specifically called the LPC cepstrum.

[0019]

(Equation 5)
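The direct conversion from LPC coefficients to cepstrum coefficients can be sketched with the textbook recursion for an all-pole model 1/A(z). Because equations (5) to (7) are not reproduced in this text, the exact form and sign convention below are assumptions (A(z) = 1 + a_1 z^-1 + ... + a_P z^-P, gain term ignored), and `lpc_to_cepstrum` is a hypothetical name.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients to LPC cepstrum coefficients c_1..c_n_ceps.

    a is [1.0, a_1, ..., a_P] for A(z) = 1 + sum_k a_k z^-k (assumed
    convention). The recursion used is
        c_n = -a_n - sum_{k=1}^{n-1} (k/n) c_k a_{n-k}      for n <= P
        c_n =      - sum_{k=n-P}^{n-1} (k/n) c_k a_{n-k}    for n >  P
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)          # c[0] is unused in this sketch
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]
```

For a first-order model with a_1 = 0.5, the result matches the power series of -log(1 + 0.5 z^-1): c_1 = -0.5, c_2 = 0.125, c_3 = -0.125/3.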

[0020] Next, vector quantization is applied to the features (LPC cepstrum, etc.) extracted as described above, and the speaker is identified using the quantization distortion. Basically, the obtained feature vectors are vector-quantized with the codebooks of a plurality of speakers, and the codebook that minimizes the average quantization distortion is selected. Details follow.

[0021] First, let the feature vector x_i of P elements in the i-th LPC analysis block be given by equation (8) below. As the elements of the feature vector, for example, the LPC cepstrum of orders 1 to P described above is used.

[0022]

(Equation 6)

[0023] Further, let the j-th centroid (code vector) r_j^k of codebook CB_k be given by equation (9) below.

[0024]

(Equation 7)

[0025] Here, the weighted distance between the feature vector x_i and the centroid r_j^k is defined as in equation (10) below.

[0026]

(Equation 8)

[0027] The vector quantization distortion d_k(i) of the i-th block with codebook CB_k is obtained as in equation (11) below.

[0028]

(Equation 9)

[0029] The vector quantization distortion d_k(i) is obtained for each block, and the average quantization distortion D_k of codebook CB_k over all blocks (i = 1, 2, ..., L) of the speaker recognition section is then obtained as in equation (12) below.

[0030]

(Equation 10)

[0031] The codebook CB_k' that minimizes this average quantization distortion D_k is found, and the speaker corresponding to that codebook is selected as the speaker of the speaker evaluation section.
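The identification procedure of paragraphs [0021] to [0031] can be sketched as follows. This is an illustrative sketch under stated assumptions: the weights of the weighted distance in equation (10) are not given in this text, so an unweighted squared Euclidean distance is substituted, and all function names are hypothetical.

```python
def squared_distance(x, r):
    """Unweighted squared Euclidean distance (stand-in for equation (10))."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x, r))


def block_distortion(x, codebook):
    """d_k(i): distance from feature vector x to its nearest code vector."""
    return min(squared_distance(x, r) for r in codebook)


def average_distortions(features, codebooks):
    """D_k for every codebook: mean of d_k(i) over all blocks i = 1..L."""
    return {name: sum(block_distortion(x, cb) for x in features) / len(features)
            for name, cb in codebooks.items()}


def identify(features, codebooks):
    """Return (speaker with the smallest D_k, dict of all D_k)."""
    d = average_distortions(features, codebooks)
    return min(d, key=d.get), d


# Toy usage with two 2-dimensional codebooks (made-up numbers).
codebooks = {"A": [[0.0, 0.0], [1.0, 1.0]], "B": [[5.0, 5.0], [6.0, 6.0]]}
features = [[0.1, 0.0], [0.9, 1.0]]
speaker, distortions = identify(features, codebooks)
```

In the toy data both feature vectors lie close to the code vectors of codebook "A", so "A" has the smaller average distortion and is selected.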

[0032] In the HMM-based method, on the other hand, the speaker's features obtained in the same manner as above are represented by the transition probabilities between the states of a hidden Markov model (HMM) and the occurrence probabilities of the features in each state, and the decision is made according to the average likelihood with respect to the model over the entire input voice section.

[0033] For speaker identification that includes unspecified speakers who are not registered in advance, the decision is made by a method combining the speaker identification and speaker verification described above. That is, the most similar speaker is selected from the registered speaker set as a candidate, and the candidate's quantization distortion or likelihood is compared with a threshold to determine whether the speaker is that person.

[0034] In speaker verification, or in speaker identification that includes unspecified speakers, the speaker's likelihood or quantization distortion is compared with a threshold to decide whether the speaker is the claimed person. However, because of variation of the features over time, differences in the uttered text, noise, and so on, these values vary widely between the input data and the training data (model) even for the same speaker, and setting a threshold on their absolute values generally does not give a stable, sufficient recognition rate.

[0035] Therefore, in HMM-based speaker recognition, the likelihood is generally normalized. For example, there is a method that uses the log-likelihood ratio LR shown in equation (13) below for the decision.

[0036]

(Equation 11)

[0037] In equation (13), L(X/S_c) is the likelihood of the speaker S_c to be verified (the claimed speaker) for the input voice X, and L(X/S_r) is the likelihood of a speaker S_r other than S_c for the input voice X. In effect, the threshold is set dynamically according to the likelihood for the input voice X, which makes the decision robust against differences in utterance content and variation over time.
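The effect of this normalization can be illustrated with a short sketch. Equation (13) is not reproduced in this text, so the form LR = log L(X/S_c) - log L(X/S_r) is assumed from the definition of a log-likelihood ratio; the function names and the default threshold are hypothetical.

```python
import math


def log_likelihood_ratio(lik_claimed, lik_other):
    """LR = log L(X/S_c) - log L(X/S_r) (assumed form of equation (13))."""
    return math.log(lik_claimed) - math.log(lik_other)


def verify(lik_claimed, lik_other, threshold=0.0):
    """Accept the claimed speaker when LR exceeds the threshold."""
    return log_likelihood_ratio(lik_claimed, lik_other) > threshold


# A common scale factor on both likelihoods (for example, a noisier
# recording) leaves LR unchanged, which is why the ratio is more stable
# than a threshold on the raw likelihood.
lr_clean = log_likelihood_ratio(0.2, 0.1)
lr_noisy = log_likelihood_ratio(0.002, 0.001)
```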

[0038] Alternatively, a method that uses the concept of posterior probability and makes the decision using the posterior probability shown in equation (14) below has also been studied. Here, P(S_c) and P(S_r) are the occurrence probabilities of speakers S_c and S_r, respectively.

[0039]

(Equation 12)

[0040] These HMM-based likelihood normalization methods are described in detail in reference [4] and elsewhere, listed below.

[0041] There is also a method in which the likelihood described for the HMM-based method above is obtained from the Mahalanobis distance between the standard pattern of the features and the features extracted from the input data.

[0042] Using the feature vector x of input data X and the vector r of the standard pattern of the features, the likelihood L(X/S) of input data X for speaker S is obtained as in equation (15) below.

[0043]

(Equation 13)

[0044] Here, the feature vector x and the standard-pattern vector r are given by equations (16) and (17) below, respectively.

[0045]

(Equation 14)

[0046] In equation (15), P is the vector order and Σ is the covariance matrix of the feature data of speaker S. The quantity (x - r)^T Σ^-1 (x - r) is called the Mahalanobis distance. From equation (15), the likelihood of input data X can be obtained if the covariance coefficients of the speaker's features are obtained in advance. It also follows that the log-likelihood ratio LR of speaker S_c and speaker S_r described above is expressed by the difference between the Mahalanobis distances of the respective speakers. That is, in the likelihood normalization for speaker verification described above, setting a threshold on the log likelihood is equivalent to setting a threshold on the difference of Mahalanobis distances. Details are given in reference [5] and elsewhere, listed below.
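The relation between the log-likelihood ratio and the Mahalanobis distances can be checked numerically. The sketch below makes two assumptions that are not confirmed by this text: equation (15) is taken to be the usual multivariate Gaussian density, and the covariance matrix is taken to be diagonal and shared by both speakers so that no matrix inversion is needed. The function names are hypothetical.

```python
import math


def mahalanobis_sq(x, r, variances):
    """(x - r)^T Sigma^-1 (x - r) for a diagonal Sigma (assumption)."""
    return sum((xi - ri) ** 2 / v for xi, ri, v in zip(x, r, variances))


def log_likelihood(x, r, variances):
    """Log of a Gaussian density with mean r and diagonal covariance."""
    p = len(x)
    log_det = sum(math.log(v) for v in variances)
    return -0.5 * (p * math.log(2.0 * math.pi) + log_det
                   + mahalanobis_sq(x, r, variances))


# With the same covariance for both speakers, the normalization terms
# cancel and LR = -(d_c - d_r) / 2, where d_c and d_r are the distances
# to the claimed speaker's and the other speaker's standard patterns.
x = [1.0, 2.0]
r_claimed, r_other = [1.0, 2.0], [0.0, 0.0]
variances = [1.0, 4.0]
lr = log_likelihood(x, r_claimed, variances) - log_likelihood(x, r_other, variances)
d_c = mahalanobis_sq(x, r_claimed, variances)
d_r = mahalanobis_sq(x, r_other, variances)
```

When the two speakers have different covariance matrices, the determinant terms no longer cancel and appear as an additional constant offset in the comparison.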

[0047] The prior art relating to speaker recognition is described in detail in, for example, the following references.

[1] Furui: "Speaker Recognition by Statistical Features of the Cepstrum", IEICE Transactions, Vol. J65-A, No. 2, pp. 183-190 (1982)
[2] F. K. Soong and A. E. Rosenberg: "On the Use of Instantaneous and Transitional Spectral Information in Speaker Recognition", IEEE Trans. ASSP, Vol. 36, No. 6, pp. 871-879 (1988)
[3] Furui: "On the Individuality of Voice", Journal of the Acoustical Society of Japan, 51, 11, pp. 876-881 (1995)
[4] Matsui: "Speaker Recognition Using HMM", IEICE Technical Report, Vol. 95, No. 467 (SP95 109-116), pp. 17-24 (1996)
[5] The Digital Signal Processing Handbook, IEEE Press (CRC Press), 1998

Meanwhile, the specification and drawings of Japanese Patent Application No. 2000-363547 propose a technique that applies speaker recognition to detect, in AV data, the continuous conversation sections of a single speaker and the positions at which the speaker changes. In that application, the audio signal of the AV data is classified into speaker groups for each short section (about 1 to 2 seconds), the change in the discrimination frequency of each speaker group is obtained within several consecutive recognition sections (several seconds to about 10 seconds), and the positions where the frequency rises above or falls below a threshold are detected as speaker change positions. The section between speaker changes is detected as a continuous conversation section of that speaker.

[0048]

PROBLEMS TO BE SOLVED BY THE INVENTION

Conventional speaker recognition technology has been researched and developed mainly for identifying and verifying a single speaker, for example in security systems, and could not be applied to real, complex conversation scenes in which, within a single piece of audio data, several speakers speak alternately at short intervals, occasionally speak at the same time, or are accompanied by background music or noise. Consequently, if conventional speaker recognition is used to search AV data automatically for speakers' conversation sections, the identification performance degrades markedly.

[0049] In the technique described in the specification and drawings of Japanese Patent Application No. 2000-363547 mentioned above, speaker changes are detected from the change in discrimination frequency in each frequency evaluation section of several seconds to about 10 seconds, so the same speaker must be talking almost continuously throughout an evaluation section. The technique is therefore applicable when one speaker talks alone and continuously for about 10 seconds, the length of the evaluation section, and the discrimination error rate is sufficiently low. However, when several speakers speak alternately within a short time, for example within a few seconds, when speakers often speak simultaneously, or when background noise or music increases speaker recognition errors, the speaker change positions cannot be detected accurately and the conversation sections cannot be detected properly.

[0050] Furthermore, in the specification and drawings of Japanese Patent Application No. 2000-363547, speakers are identified by assigning the input data to one of the speaker groups. While this allows even unknown, unregistered speakers to be classified, no speaker verification is performed, so classification errors occur easily, and even non-speech input is classified into one of the speaker groups.

[0051] Speaker verification is one way to avoid misidentifying unregistered speakers or non-speech input in such speaker detection in AV data. However, the conventional speaker verification method based on likelihood normalization can be applied when the likelihood is obtained using an HMM, but cannot be applied as is to the identification method using vector quantization distortion, which allows simpler identification.

[0052] Moreover, obtaining the likelihood from the Mahalanobis distance to the standard pattern requires that the speaker's covariance coefficients and the like be known, and the computation is very complex. Applying this approach to the case where vector quantization distortion is used additionally requires complex computation, such as obtaining the covariance coefficients in advance, and is not practical.

[0053] The present invention has been proposed in view of these circumstances, and its object is to provide an information extraction apparatus and method, and an information retrieval apparatus and method, that automatically and efficiently detect speakers' conversation sections in AV data and search them efficiently.

[0054]

MEANS FOR SOLVING THE PROBLEMS

To achieve the above object, an information extraction apparatus according to the present invention is an apparatus for extracting desired information from a predetermined information source, comprising: speaker identification means for discriminating, for an audio signal that is the information source, the speaker in each evaluation section on the basis of the similarity of the voice in the audio signal; and speaker discrimination frequency calculation means for obtaining discrimination frequency information of the speaker in a frequency section of the information source, the frequency section being the section over which the frequency of the speakers discriminated in the evaluation sections is obtained, wherein appearance frequency information of the speaker in the frequency section is detected.

[0055] In this information extraction apparatus, the LPC cepstrum obtained by LPC analysis is used as the feature for evaluating the similarity of the voice in the audio signal of the information source, vector quantization of the features with a plurality of codebooks is used as the identification technique, and the vector quantization distortion is used as the measure for identification.

[0056] In the information extraction apparatus, the identification decision is made by comparing two values with respective preset thresholds: the minimum quantization distortion, which is the smallest of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion.
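The two-threshold decision described above can be sketched as follows. The text states only that the two values are compared with preset thresholds, so the directions of the comparisons (accept when the minimum distortion is small and the margin to the other codebooks is large), the use of the average over all other codebooks, and the names `decide_speaker`, `max_min_distortion`, and `min_margin` are assumptions made for illustration.

```python
def decide_speaker(distortions, max_min_distortion, min_margin):
    """Return the accepted speaker name, or None if the block is rejected.

    distortions maps each registered speaker to the vector quantization
    distortion obtained with that speaker's codebook.
    """
    if len(distortions) < 2:
        raise ValueError("at least two codebooks are needed for the margin")
    ranked = sorted(distortions.items(), key=lambda item: item[1])
    best_name, best = ranked[0]
    others = [d for _, d in ranked[1:]]
    margin = sum(others) / len(others) - best   # average of the rest minus minimum
    if best < max_min_distortion and margin > min_margin:
        return best_name
    return None


# Made-up distortion values for three registered speakers.
clear_case = decide_speaker({"A": 0.1, "B": 0.9, "C": 1.1}, 0.5, 0.3)
far_from_all = decide_speaker({"A": 0.8, "B": 0.9, "C": 1.0}, 0.5, 0.3)
ambiguous = decide_speaker({"A": 0.1, "B": 0.2, "C": 0.3}, 0.5, 0.3)
```

Under these assumptions the second test plays the role that likelihood normalization plays for HMMs: input that is roughly equally close to every codebook, such as an unregistered speaker or non-speech sound, produces a small margin and is rejected.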

[0057] Such an information extraction apparatus identifies the speaker in each evaluation section on the basis of the features of the speaker's voice in the audio signal, detects the appearance frequency of each speaker in the frequency section, which is the section over which the frequency of the speakers discriminated in the evaluation sections is obtained, and generates speaker appearance frequency information.
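Computing the appearance frequency information can be sketched as a simple count per frequency section. The sketch assumes non-overlapping frequency sections of a fixed number of evaluation sections and represents a rejected evaluation section as None; both choices, and the name `appearance_frequencies`, are illustrative assumptions.

```python
from collections import Counter


def appearance_frequencies(labels, sections_per_window):
    """Relative frequency of each speaker in each frequency section.

    labels holds one decision per evaluation section: a speaker name, or
    None when no registered speaker was accepted.
    """
    result = []
    for start in range(0, len(labels), sections_per_window):
        window = labels[start:start + sections_per_window]
        counts = Counter(label for label in window if label is not None)
        result.append({name: n / len(window) for name, n in counts.items()})
    return result


labels = ["A", "A", "B", None, "B", "B", "B", "A"]
frequency_info = appearance_frequencies(labels, 4)
```

Because the output is a frequency per speaker rather than a single label per section, short alternating turns and occasional misidentifications lower a speaker's frequency without discarding the section.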

[0058] To achieve the above object, an information extraction method according to the present invention is a method for retrieving predetermined information from a predetermined information source, comprising: a speaker identification step of discriminating, for an audio signal that is the information source, the speaker in each evaluation section on the basis of the similarity of the voice in the audio signal; and a speaker discrimination frequency calculation step of obtaining discrimination frequency information of the speaker in a frequency section of the information source, the frequency section being the section over which the frequency of the speakers discriminated in the evaluation sections is obtained, wherein appearance frequency information of the speaker in the frequency section is detected.

[0059] In this information extraction method, the LPC cepstrum obtained by LPC analysis is used as the feature for evaluating the similarity of the voice in the audio signal of the information source, vector quantization of the features with a plurality of codebooks is used as the identification technique, and the vector quantization distortion is used as the measure for identification.

[0060] In the information extraction method, the identification decision is made by comparing two values with respective preset thresholds: the minimum quantization distortion, which is the smallest of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion.

[0061] Such an information extraction method identifies the speaker in each evaluation section on the basis of the features of the speaker's voice in the audio signal, detects the appearance frequency of each speaker in the frequency section, which is the section over which the frequency of the speakers discriminated in the evaluation sections is obtained, and generates speaker appearance frequency information.

[0062] To achieve the above object, an information retrieval apparatus according to the present invention searches for desired information on a recording medium on which appearance frequency information of speakers has been recorded in advance, the appearance frequency information being obtained, for an audio signal that is an information source, by discriminating the speaker in each evaluation section on the basis of the similarity of the voice in the audio signal and obtaining discrimination frequency information of the speaker in a frequency section, which is the section over which the frequency of the speakers discriminated in the evaluation sections is obtained. The apparatus comprises: speaker appearance frequency reading means for reading the speaker appearance frequency information recorded on the recording medium; search condition input means for inputting a search condition for a desired speaker; and speaker appearance section output means for comparing the input search condition with the information read from the recording medium and outputting the information that satisfies the search condition as speaker appearance section information.

【００６３】ここで、上記話者の判別頻度情報を得る際
の上記情報源の音声信号中の音声の類似性を評価する特
徴量としては、ＬＰＣ分析によって得られるＬＰＣケプ
ストラムが用いられ、識別の手法としては、複数のコー
ドブックによる特徴量のベクトル量子化が用いられ、識
別の尺度として、そのベクトル量子化歪みが用いられ、
また、上記ベクトル量子化歪みの最小値である最小量子
化歪みと、上記最小量子化歪み以外の複数のベクトル量
子化歪みの和又は平均から最小量子化歪みを減算した値
とを、それぞれ予め設定された閾値と比較することで識
別判定される。Here, the LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of the voices in the audio signal of the information source when the speaker discrimination frequency information is obtained, vector quantization of the feature amount with a plurality of codebooks is used as the identification technique, and the vector quantization distortion is used as the identification measure. Identification is decided by comparing the minimum quantization distortion, which is the minimum value of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion, each with a preset threshold.

【００６４】このような情報検索装置は、所望の話者の
出現頻度情報と入力した検索条件とを比較することで、
所望の話者が所望の頻度で会話している部分等を検索す
る。Such an information retrieval apparatus compares the appearance frequency information of a desired speaker with the input search condition and thereby retrieves, for example, a portion in which the desired speaker is talking at a desired frequency.

【００６５】また、上述した目的を達成するために、本
発明に係る情報検索方法は、情報源である音声信号につ
いて、上記音声信号中の音声の類似性によって、ある評
価区間毎に話者を判別し、上記評価区間毎に判別された
話者の頻度を求める区間である頻度区間で上記話者の判
別頻度情報を求めることで得られた上記頻度区間におけ
る上記話者の出現頻度情報が予め記録された記録媒体か
ら、所望の情報の検索を行う情報検索方法であって、上
記記録媒体に記録された話者の出現頻度情報を読み込む
話者出現頻度読み込み工程と、所望の話者の検索条件を
入力する検索条件入力工程と、入力された上記検索条件
と上記記録媒体から読み出した情報とを比較して、検索
条件に該当する情報を話者出現区間情報として出力する
話者出現区間出力工程とを有することを特徴としてい
る。Further, in order to achieve the above-mentioned object, an information retrieval method according to the present invention retrieves desired information from a recording medium on which appearance frequency information of speakers has been recorded in advance. The appearance frequency information is obtained, for an audio signal serving as an information source, by determining a speaker for each evaluation section based on the similarity of the voices in the audio signal and then obtaining discrimination frequency information of the speakers in a frequency section, which is the section over which the frequency of the speakers determined for each evaluation section is obtained. The method comprises a speaker appearance frequency reading step of reading the appearance frequency information of the speakers recorded on the recording medium, a search condition input step of inputting a search condition for a desired speaker, and a speaker appearance section output step of comparing the input search condition with the information read from the recording medium and outputting the information that matches the search condition as speaker appearance section information.

【００６６】ここで、上記話者の判別頻度情報を得る際
の上記情報源の音声信号中の音声の類似性を評価する特
徴量としては、ＬＰＣ分析によって得られるＬＰＣケプ
ストラムが用いられ、識別の手法としては、複数のコー
ドブックによる特徴量のベクトル量子化が用いられ、識
別の尺度として、そのベクトル量子化歪みが用いられ、
また、上記ベクトル量子化歪みの最小値である最小量子
化歪みと、上記最小量子化歪み以外の複数のベクトル量
子化歪みの和又は平均から最小量子化歪みを減算した値
とを、それぞれ予め設定された閾値と比較することで識
別判定される。Here, the LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of the voices in the audio signal of the information source when the speaker discrimination frequency information is obtained, vector quantization of the feature amount with a plurality of codebooks is used as the identification technique, and the vector quantization distortion is used as the identification measure. Identification is decided by comparing the minimum quantization distortion, which is the minimum value of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion, each with a preset threshold.

【００６７】このような情報検索方法は、所望の話者の
出現頻度情報と入力した検索条件とを比較することで、
所望の話者が所望の頻度で会話している部分等を検索す
る。Such an information retrieval method compares the appearance frequency information of a desired speaker with the input search condition and thereby retrieves, for example, a portion in which the desired speaker is talking at a desired frequency.

【００６８】[0068]

【発明の実施の形態】以下、本発明を適用した具体的な
実施の形態について、図面を参照しながら詳細に説明す
る。Embodiments of the present invention will be described below in detail with reference to the drawings.

【００６９】先ず、本実施の形態における情報抽出装置
の概念構成図を図１に示す。図１に示すように、情報抽
出装置においては、情報源となる音声信号が話者識別手
段１に入力され、ベクトル量子化歪みを評価して話者が
識別される。First, FIG. 1 shows a conceptual configuration diagram of the information extraction device according to the present embodiment. As shown in FIG. 1, in the information extraction device, a speech signal serving as an information source is input to speaker identification means 1, and a speaker is identified by evaluating vector quantization distortion.

【００７０】話者識別手段１によって識別された話者
は、話者判別頻度計算手段２に入力され、所定の評価区
間毎に区間内の各話者の認識された話者判別頻度が計算
される。求められた話者判別頻度は、話者の出現頻度情
報として出力される。The speaker identified by the speaker identification means 1 is input to the speaker discrimination frequency calculation means 2, which calculates, for each predetermined evaluation section, the discrimination frequency of each speaker recognized within the section. The obtained speaker discrimination frequency is output as speaker appearance frequency information.

【００７１】この図１に示した情報抽出装置の具体的な
構成例を図２に示す。図２に示すように、情報抽出装置
１０は、ＡＶ（Audio Visual）データの音声信号を入力
する入力部１１と、音声信号を分析してＬＰＣ（Linear
Predictive Coding）ケプストラム係数を抽出するケプ
ストラム抽出部１２と、ＬＰＣケプストラム係数をベク
トル量子化するベクトル量子化部１３と、ベクトル量子
化歪みを評価して話者を識別する話者識別部１４と、認
識された話者の判別頻度を用いて話者の出現頻度を求め
る話者判別頻度計算部１５とを備える。FIG. 2 shows a specific configuration example of the information extraction apparatus shown in FIG. 1. As shown in FIG. 2, the information extraction apparatus 10 includes an input unit 11 that inputs the audio signal of AV (Audio Visual) data, a cepstrum extraction unit 12 that analyzes the audio signal and extracts LPC (Linear Predictive Coding) cepstrum coefficients, a vector quantization unit 13 that vector-quantizes the LPC cepstrum coefficients, a speaker identification unit 14 that evaluates the vector quantization distortion to identify a speaker, and a speaker discrimination frequency calculation unit 15 that obtains the appearance frequency of each speaker from the discrimination frequency of the recognized speakers.

【００７２】また、図２において、コードブック群ＣＢ
は、ベクトル量子化に用いる各話者のコードブックデー
タが格納されたものであり、閾値表ファイルＴＦは、話
者の判別を行うための閾値データが格納されたものであ
り、それぞれ図示しない記録部に記録されている。ま
た、話者頻度ファイルＳＦは、区間毎の各話者の頻度が
記録されたものである。In FIG. 2, the codebook group CB stores the codebook data of each speaker used for vector quantization, and the threshold table file TF stores the threshold data used to decide the speaker; both are recorded in a recording unit (not shown). The speaker frequency file SF records the frequency of each speaker for each section.

【００７３】このように構成された情報抽出装置１０の
動作を以下に説明する。入力部１１から入力されたＡＶ
データの音声信号Ｄ１１は、ブロック単位にケプストラ
ム抽出部１２に入力されて、ＬＰＣ分析が施され、得ら
れたＬＰＣ係数がＬＰＣケプストラム係数に変換され
る。The operation of the information extraction apparatus 10 configured as described above is as follows. The audio signal D11 of the AV data input from the input unit 11 is supplied block by block to the cepstrum extraction unit 12, where LPC analysis is performed and the resulting LPC coefficients are converted into LPC cepstrum coefficients.

【００７４】得られたＬＰＣケプストラム係数の一部Ｄ
１２は、ベクトル量子化部１３に入力されて、コードブ
ック群ＣＢからの各話者のコードブックデータＤ１３を
用いてそれぞれベクトル量子化が施される。それぞれの
コードブックでベクトル量子化された結果（量子化歪
み）Ｄ１４は、話者識別部１４に入力されて評価され、
さらに閾値表ファイルＴＦから読みこんだ閾値データＤ
１５を用いて、所定の認識ブロック毎に話者の識別及び
判定を行う。A part D12 of the obtained LPC cepstrum coefficients is input to the vector quantization unit 13 and vector-quantized with the codebook data D13 of each speaker from the codebook group CB. The result of vector quantization with each codebook (the quantization distortion) D14 is input to the speaker identification unit 14 and evaluated, and the speaker is identified and decided for each predetermined recognition block using the threshold data D15 read from the threshold table file TF.

【００７５】識別された話者Ｄ１６は、話者判別頻度計
算部１５に入力され、所定の評価区間毎に区間内の各話
者の認識された話者判別頻度が計算される。求められた
話者判別頻度は、話者の出現頻度情報Ｄ１７として、例
えば図３に示すような記録形式で各ＡＶデータ毎、各話
者毎、各評価区間毎に話者頻度ファイルＳＦに記録され
る。なお、話者頻度ファイルＳＦは、図示しない送受信
部により通信回線を介して通信されるものであってもよ
く、また、磁気ディスク、光磁気ディスク等の記録媒体
や半導体メモリ等の記憶媒体等の蓄積媒体に蓄積される
ものであってもよい。The identified speaker D16 is input to the speaker discrimination frequency calculation unit 15, which calculates, for each predetermined evaluation section, the discrimination frequency of each speaker recognized within the section. The obtained speaker discrimination frequency is recorded as speaker appearance frequency information D17 in the speaker frequency file SF for each item of AV data, each speaker, and each evaluation section, for example in the recording format shown in FIG. 3. The speaker frequency file SF may be transmitted over a communication line by a transmitting/receiving unit (not shown), or may be stored in a storage medium such as a magnetic disk, a magneto-optical disk, or a semiconductor memory.

【００７６】図３の記録形式は、入力部１１から入力さ
れた音声信号のＡＶデータ名と、登録された各話者を識
別する識別名と、頻度区間の開始時刻と、同区間の終了
時刻と、上記ＡＶデータの上記頻度区間における上記話
者の判別頻度とを情報として有する。この記録形式は、
一例であり、図３に示した情報に限定されるものではな
い。The recording format of FIG. 3 holds, as information, the AV data name of the audio signal input from the input unit 11, an identification name for identifying each registered speaker, the start time of a frequency section, the end time of that section, and the discrimination frequency of that speaker in that frequency section of the AV data. This recording format is only an example, and the recorded information is not limited to that shown in FIG. 3.
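The record layout described for FIG. 3 can be pictured as a small data structure. The following is a minimal sketch in Python; the class name `SpeakerFrequencyRecord`, the field names, the use of seconds for the times, and the sample values are illustrative assumptions and do not appear in the specification.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SpeakerFrequencyRecord:
    """One entry of the speaker frequency file SF (fields follow the FIG. 3 description)."""
    av_data_name: str   # name of the AV data the audio signal came from
    speaker_id: str     # identification name of a registered speaker
    start_time: float   # start time of the frequency section (seconds, assumed unit)
    end_time: float     # end time of the frequency section (seconds, assumed unit)
    frequency: float    # discrimination frequency of the speaker in this section

    def section_length(self) -> float:
        """Length of the frequency section, end time minus start time."""
        return self.end_time - self.start_time


# Hypothetical sample values, for illustration only.
record = SpeakerFrequencyRecord("av_data_001", "speaker_A", 0.0, 600.0, 42.0)
print(asdict(record))
```

Keeping one flat record per AV data item, speaker, and section matches the statement that the frequency is recorded for each of those three keys.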

【００７７】以下、図２及び図４を参照しながら、話者
識別を行い話者判別頻度を求める際の処理について、さ
らに詳しく説明する。The processing for identifying the speaker and obtaining the speaker discrimination frequency will be described in more detail with reference to FIGS. 2 and 4.

【００７８】入力されたＡＶデータの音声信号は、ケプ
ストラム抽出部１２において、図４に示すようなＬＰＣ
分析ブロックＡＢ単位にＬＰＣ分析が施されて、得られ
たＬＰＣ係数が変換されてＬＰＣケプストラム係数が抽
出される。ＬＰＣ分析ブロックＡＢのブロック長ａは、
音声信号の場合、通常２０ミリ秒〜３０ミリ秒程度がよ
く用いられる。また、分析性能を向上させるために隣接
ブロックと若干オーバーラップさせることが多い。In the cepstrum extraction unit 12, the audio signal of the input AV data undergoes LPC analysis in units of the LPC analysis blocks AB shown in FIG. 4, and the resulting LPC coefficients are converted to extract LPC cepstrum coefficients. For audio signals, a block length a of about 20 to 30 milliseconds is commonly used for the LPC analysis block AB. To improve analysis performance, each block is often made to overlap its neighbors slightly.

【００７９】図４の話者認識ブロックＲＢは、話者を識
別する最小単位であり、このブロック単位に、話者の識
別を行う。話者認識ブロックＲＢのブロック長ｂは、数
秒程度が望ましい。従って、１つの話者認識ブロックＲ
Ｂは、５０〜数百程度のＬＰＣ分析ブロックＡＢを含ん
でいる。話者認識ブロックＲＢも、隣接区間と若干オー
バーラップしていてもよい。オーバーラップ長は、通
常、区間長の１０％〜５０％程度である。The speaker recognition block RB in FIG. 4 is the minimum unit for identifying a speaker, and speaker identification is performed block by block. The block length b of the speaker recognition block RB is desirably about several seconds, so one speaker recognition block RB contains about 50 to several hundred LPC analysis blocks AB. The speaker recognition block RB may also slightly overlap its adjacent sections; the overlap length is usually about 10% to 50% of the section length.
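The relationship between block length and overlap can be illustrated with a short helper that lists block start offsets. This is only a sketch: the function name `block_starts`, the 16 kHz sampling rate, and the concrete lengths (25 ms analysis blocks, a 3 s recognition block) are assumptions chosen to fall inside the ranges given in the text.

```python
from typing import List


def block_starts(total_len: int, block_len: int, overlap: float = 0.0) -> List[int]:
    """Start offsets (in samples) of blocks of block_len samples within total_len samples.

    overlap is the fraction of a block shared with the following block
    (0 <= overlap < 1). A trailing partial block is dropped.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    hop = max(1, int(block_len * (1.0 - overlap)))
    return list(range(0, total_len - block_len + 1, hop))


SAMPLE_RATE = 16000                        # assumed sampling rate
analysis_len = SAMPLE_RATE * 25 // 1000    # 25 ms LPC analysis block AB
recognition_len = SAMPLE_RATE * 3          # 3 s speaker recognition block RB

# Number of non-overlapping analysis blocks inside one recognition block.
print(len(block_starts(recognition_len, analysis_len)))
```

With these assumed values one recognition block holds on the order of a hundred analysis blocks, in line with the "about 50 to several hundred" stated above.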

【００８０】図４の頻度区間ＦＩは、話者の出現頻度を
求める評価単位であり、同区間内において、各話者認識
ブロックＲＢで識別された話者の判別頻度に基づいて各
話者の出現頻度を求める。頻度区間ＦＩ_Ｉの区間開始時
刻はＳ_Ｉ、区間終了時刻はＥ _Ｉであり、区間長（Ｅ_Ｉ‐
Ｓ_Ｉ）は、数分〜数十分程度が適当である。また、評価
区間も隣接区間と若干オーバーラップしていてもよい。The frequency section FI in FIG. 4 is the evaluation unit for obtaining the appearance frequency of the speakers; within this section, the appearance frequency of each speaker is obtained from the discrimination frequency of the speakers identified in the individual speaker recognition blocks RB. The frequency section FI_I starts at time S_I and ends at time E_I, and a section length (E_I - S_I) of about several minutes to several tens of minutes is appropriate. This evaluation section may also slightly overlap its adjacent sections.

【００８１】情報抽出装置１０の動作を表すフローチャ
ートを図５に示す。先ずステップＳ１０において、初期
化処理として、区間番号Ｉを０とする。区間番号Ｉと
は、話者の頻度を求める頻度区間ＦＩにつけた連続番号
である。FIG. 5 is a flowchart showing the operation of the information extracting apparatus 10. First, in step S10, the section number I is set to 0 as initialization processing. The section number I is a serial number assigned to the frequency section FI for obtaining the frequency of the speaker.

【００８２】次にステップＳ１１において、上述した話
者認識ブロックＲＢ単位で話者候補を識別して話者候補
を選定する。話者候補の選定方法については、後で詳述
する。Next, in step S11, speaker candidates are identified and speaker candidates are selected in units of the above-described speaker recognition blocks RB. A method for selecting speaker candidates will be described later in detail.

【００８３】ステップＳ１２では、選定された話者候補
が正しい話者か否かを照合判定する。すなわち、未知の
不特定話者や、音声以外のデータが入力された場合、ス
テップＳ１１で選定された候補話者は、入力音声に一番
類似している話者を候補として選出するが、それが本当
にその話者本人とは限らない。そこで、ステップＳ１２
では、ベクトル量子化歪みを評価し、図２に示した閾値
表ファイルＴＦに記録された閾値データと比較すること
で、選定された話者候補本人のデータであるか否かの判
定を行う。判定方法については、後で詳述する。ステッ
プＳ１２において、本人であると判定されれば、その話
者候補をこの話者認識ブロックにおける話者として確定
し、本人ではないと判定されれば、この話者認識ブロッ
クにおける話者を未知話者として確定する。In step S12, it is verified whether the selected speaker candidate is the correct speaker. When an unknown, unregistered speaker or non-speech data is input, step S11 still selects the registered speaker most similar to the input audio as the candidate, but that candidate is not necessarily the actual speaker. Step S12 therefore evaluates the vector quantization distortion and compares it with the threshold data recorded in the threshold table file TF shown in FIG. 2 to decide whether the data really belongs to the selected speaker candidate. The decision method is described in detail later. If the candidate is judged in step S12 to be the actual speaker, the candidate is confirmed as the speaker of this speaker recognition block; otherwise, the speaker of this speaker recognition block is confirmed as an unknown speaker.

【００８４】続いてステップＳ１３では、頻度区間ＦＩ
_Ｉの最後の話者認識ブロックＲＢまで処理したか否かが
判定される。ステップＳ１３において、最後の話者認識
ブロックＲＢでなければ、ステップＳ１４において、次
の話者認識ブロックＲＢに進み、ステップＳ１１に戻
る。ステップＳ１３において、最後の話者認識ブロック
ＲＢであると判定されれば、ステップＳ１５に進む。Subsequently, in step S13, it is determined whether processing has reached the last speaker recognition block RB of the frequency section FI_I. If the block is not the last speaker recognition block RB, the process advances to the next speaker recognition block RB in step S14 and returns to step S11. If the block is determined in step S13 to be the last speaker recognition block RB, the process proceeds to step S15.

【００８５】ステップＳ１５では、現在の頻度区間ＦＩ
_Ｉにおける、それぞれの登録話者の判別頻度を出現頻度
情報として求める。なお、未知話者と判定された話者認
識ブロックＲＢは頻度の計算に含めない。求めた話者出
現頻度は、図２に示した話者頻度ファイルＳＦに、図３
のような記録形式で記録する。In step S15, the discrimination frequency of each registered speaker in the current frequency section FI_I is obtained as appearance frequency information. Speaker recognition blocks RB decided as an unknown speaker are not included in the frequency calculation. The obtained speaker appearance frequency is recorded in the speaker frequency file SF shown in FIG. 2 in a recording format such as that of FIG. 3.
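The counting in step S15 can be sketched as follows. The text does not say whether the frequency is stored as a raw count or as a ratio, so this sketch simply counts blocks; the function name `speaker_frequencies` and the use of `None` to mark an unknown speaker are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def speaker_frequencies(block_speakers: Iterable[Optional[str]]) -> Dict[str, int]:
    """Count the decided speaker of each recognition block in one frequency section.

    None marks a block decided as an unknown speaker; such blocks are
    left out of the frequency calculation.
    """
    return dict(Counter(s for s in block_speakers if s is not None))


# Hypothetical per-block decisions for one frequency section.
labels = ["A", "A", None, "B", "A", None]
print(speaker_frequencies(labels))
```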

【００８６】ステップＳ１６では、データの末尾に到達
したか否かが判定される。データの末尾に到達している
場合は、処理を終了し、データの末尾に到達していない
場合は、ステップＳ１７に進む。In step S16, it is determined whether the end of the data has been reached. If the end of the data has been reached, the process ends. If the end of the data has not been reached, the process proceeds to step S17.

【００８７】ステップＳ１７では、区間番号Ｉを１つ増
やし、次の頻度区間に進み、ステップＳ１１に戻る。In step S17, the section number I is incremented by one, the process proceeds to the next frequency section, and the process returns to step S11.
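The control flow of steps S10 to S17 amounts to two nested loops, one over frequency sections and one over the speaker recognition blocks inside each section. The sketch below shows only that structure; candidate selection (S11) and verification (S12) are passed in as callables, and the names `extract_appearance_frequencies`, `select_candidate`, and `verify` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, TypeVar

Block = TypeVar("Block")


def extract_appearance_frequencies(
    sections: Sequence[Sequence[Block]],
    select_candidate: Callable[[Block], str],
    verify: Callable[[str, Block], bool],
) -> List[Dict[str, int]]:
    """Return one {speaker: count} mapping per frequency section."""
    results: List[Dict[str, int]] = []
    for section in sections:                      # S10, S16, S17: walk the frequency sections
        counts: Counter = Counter()
        for block in section:                     # S13, S14: walk the recognition blocks
            candidate = select_candidate(block)   # S11: closest registered speaker
            if verify(candidate, block):          # S12: accept or treat as unknown
                counts[candidate] += 1            # unknown blocks are not counted
        results.append(dict(counts))              # S15: frequencies for this section
    return results


# Toy blocks: (closest speaker, whether verification succeeds). Illustration only.
toy_sections = [[("A", True), ("B", False), ("A", True)], [("B", True)]]
result = extract_appearance_frequencies(
    toy_sections,
    select_candidate=lambda block: block[0],
    verify=lambda candidate, block: block[1],
)
print(result)
```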

【００８８】続いて、図５のステップＳ１１における話
者候補の識別方法の詳細を図６に示す。先ず、ステップ
Ｓ２０において、上述したＬＰＣ分析ブロックＡＢごと
に音声データを入力データから読みこむ。FIG. 6 shows details of the method for identifying speaker candidates in step S11 of FIG. First, in step S20, audio data is read from input data for each of the above-described LPC analysis blocks AB.

【００８９】次にステップＳ２１において、話者認識ブ
ロックＲＢの最後のＬＰＣ分析ブロックＡＢまで処理を
終えたか否かを判定し、最後のＬＰＣ分析ブロックＡＢ
の処理を終えている場合は、ステップＳ２６に進む。ス
テップＳ２１において最後のＬＰＣ分析ブロックＡＢで
ない場合は、ステップＳ２２に進む。Next, in step S21, it is determined whether processing has finished up to the last LPC analysis block AB of the speaker recognition block RB. If the last LPC analysis block AB has been processed, the process proceeds to step S26; otherwise, it proceeds to step S22.

【００９０】ステップＳ２２では、得られたＬＰＣ分析
ブロックＡＢのデータを評価してこのブロックが音声ブ
ロックであるか否かを判定する。ステップＳ２２におい
て、このＬＰＣ分析ブロックＡＢが無音ブロック又は非
音声ブロックであると判定されれば、このブロックの分
析をスキップしてステップＳ２５に進み、次のＬＰＣ分
析ブロックＡＢに進んでステップＳ２０からの処理を行
う。音声ブロックであるか否かの判定方法は、例えば、
最も簡単な方法として、そのブロックのパワー平均及び
最大値を評価して無音ブロックであるか否かの検出を行
うだけでもよい。また、信号の平均パワー、ゼロ交差
数、ピッチの有無、スペクトル形状等から分析して音声
データであるか否かを判定する種々の方法があるが、本
実施の形態では、特にその手法は限定せず、或いはこの
ステップを省略してもよい。In step S22, the data of the obtained LPC analysis block AB is evaluated to determine whether the block is a speech block. If the LPC analysis block AB is determined to be a silent block or a non-speech block, analysis of the block is skipped and the process proceeds to step S25, advances to the next LPC analysis block AB, and continues from step S20. As the simplest way of deciding whether a block is a speech block, it is sufficient, for example, to evaluate the mean power and the maximum value of the block and detect only whether the block is silent. There are also various methods that decide whether the data is speech by analyzing the average power of the signal, the number of zero crossings, the presence or absence of pitch, the spectral shape, and so on; the present embodiment does not limit the method, and this step may even be omitted.
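The simplest check mentioned for step S22, evaluating the mean power and the maximum value of a block, might look like the sketch below. The function name `is_silent_block` and both threshold values are assumptions; the specification gives no numeric thresholds, and samples are assumed to be normalized to the range -1 to 1.

```python
from typing import Sequence


def is_silent_block(samples: Sequence[float],
                    mean_power_threshold: float = 1e-4,
                    peak_threshold: float = 0.02) -> bool:
    """Return True when both the mean power and the peak amplitude are small.

    Both thresholds are illustrative values for samples scaled to [-1, 1].
    """
    if len(samples) == 0:
        return True
    mean_power = sum(x * x for x in samples) / len(samples)
    peak = max(abs(x) for x in samples)
    return mean_power < mean_power_threshold and peak < peak_threshold


print(is_silent_block([0.0] * 400))
print(is_silent_block([0.5, -0.5] * 200))
```

A block for which this returns True would be skipped, so that no LPC analysis or vector quantization is spent on it.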

【００９１】ステップＳ２２において音声ブロックであ
ると判定された場合は、次にステップＳ２３において、
このブロックのＬＰＣ分析を行い、得られたＬＰＣ係数
を変換してＬＰＣケプストラム係数を抽出する。ここで
は、１次〜１４次程度の低次のケプストラム係数を抽出
する。If the block is determined in step S22 to be a speech block, LPC analysis is performed on the block in step S23, and the resulting LPC coefficients are converted to extract LPC cepstrum coefficients. Here, low-order cepstrum coefficients of about the 1st to 14th orders are extracted.
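Step S23 names LPC analysis followed by conversion to LPC cepstrum coefficients but does not spell out the procedure. The sketch below uses one common textbook route, stated here as an assumption: autocorrelation, the Levinson-Durbin recursion, and the standard recursion from predictor coefficients to the cepstrum of the all-pole model 1/A(z), with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p. Windowing and pre-emphasis are omitted, the LPC order of 14 is an assumed default, and all function names are hypothetical.

```python
from typing import List, Sequence


def autocorrelation(frame: Sequence[float], max_lag: int) -> List[float]:
    """Autocorrelation r[0..max_lag] of one analysis block."""
    n = len(frame)
    return [sum(frame[t] * frame[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r: Sequence[float], order: int) -> List[float]:
    """Predictor coefficients a_1..a_p of A(z) = 1 + sum_k a_k z^-k."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:                 # degenerate block (e.g. all zeros): stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                 # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:]


def lpc_to_cepstrum(a: Sequence[float], n_ceps: int) -> List[float]:
    """Cepstrum c_1..c_n of 1/A(z): c_n = -a_n - (1/n) * sum_k k * c_k * a_(n-k)."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        a_n = a[n - 1] if n <= p else 0.0
        acc = sum(k * c[k] * a[n - k - 1] for k in range(1, n) if n - k <= p)
        c[n] = -a_n - acc / n
    return c[1:]


def lpc_cepstrum(frame: Sequence[float], order: int = 14, n_ceps: int = 14) -> List[float]:
    """Low-order LPC cepstrum of one block (order 14 is an assumed default)."""
    return lpc_to_cepstrum(levinson_durbin(autocorrelation(frame, order), order), n_ceps)


print(lpc_to_cepstrum([-0.5], 3))
```

The sign convention of the predictor polynomial differs between references, so the signs here should be checked against whichever LPC routine is actually used.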

【００９２】次にステップＳ２４において、予め作成さ
れた複数のコードブックを用いて、ステップＳ２３で得
られたＬＰＣケプストラム係数にそれぞれベクトル量子
化を施す。それぞれのコードブックは登録された話者に
一対一に対応する。ここで、コードブックＣＢ_ｋによる
このブロックのＬＰＣケプストラム係数のベクトル量子
化歪みをｄ_ｋとする。Next, in step S24, the LPC cepstrum coefficients obtained in step S23 are vector-quantized with each of a plurality of codebooks created in advance. Each codebook corresponds one-to-one to a registered speaker. Here, d_k denotes the vector quantization distortion of this block's LPC cepstrum coefficients under the codebook CB_k.

【００９３】ステップＳ２５では、次のＬＰＣ分析ブロ
ックＡＢに進み、ステップＳ２０に戻り、同様にしてス
テップＳ２０からステップＳ２５の処理を繰り返す。In step S25, the process proceeds to the next LPC analysis block AB, returns to step S20, and similarly repeats the processing from step S20 to step S25.

【００９４】ステップＳ２６では、話者認識ブロックＲ
Ｂ全体にわたる各コードブックＣＢの量子化歪みｄ_ｋの
平均である平均量子化歪みＤ_ｋを求める。In step S26, the average quantization distortion D_k, which is the average of the quantization distortions d_k of each codebook CB over the entire speaker recognition block RB, is obtained.

【００９５】続いてステップＳ２７では、平均量子化歪
みＤ_ｋを最小にする話者Ｓ_ｋ’に対応するコードブック
ＣＢ_ｋ’を選出し、ステップＳ２８では、この話者Ｓ
_ｋ’を話者候補Ｓ_ｃとして出力する。Subsequently, in step S27, the codebook CB_k' corresponding to the speaker S_k' that minimizes the average quantization distortion D_k is selected, and in step S28 this speaker S_k' is output as the speaker candidate S_c.

【００９６】このようにして、コードブックが登録され
ている話者のうち、最も入力データの音声が類似してい
る話者を、その話者認識ブロックＲＢにおける話者候補
Ｓ_ｃとして選出する。In this way, among the speakers whose codebooks are registered, the speaker whose voice is most similar to the input data is selected as the speaker candidate S_c for that speaker recognition block RB.
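Steps S24 to S28 can be sketched as a nearest-codebook search. The distortion measure is not specified in the text, so squared Euclidean distance to the nearest code vector is assumed here; the names `vq_distortion` and `select_candidate` and the toy codebooks are likewise illustrative.

```python
from typing import Dict, Sequence, Tuple

Vector = Sequence[float]


def vq_distortion(x: Vector, codebook: Sequence[Vector]) -> float:
    """Distortion d_k of one feature vector: squared distance to the nearest code vector."""
    return min(sum((xi - ci) ** 2 for xi, ci in zip(x, code)) for code in codebook)


def select_candidate(features: Sequence[Vector],
                     codebooks: Dict[str, Sequence[Vector]]) -> Tuple[str, Dict[str, float]]:
    """Return the speaker with the smallest average distortion D_k, plus all D_k.

    features holds the cepstrum vectors of the speech blocks in one
    speaker recognition block; codebooks maps each registered speaker to its codebook.
    """
    avg = {
        speaker: sum(vq_distortion(x, cb) for x in features) / len(features)
        for speaker, cb in codebooks.items()
    }
    candidate = min(avg, key=avg.get)
    return candidate, avg


# Toy two-dimensional codebooks, for illustration only.
toy_codebooks = {"A": [(0.0, 0.0), (1.0, 1.0)], "B": [(5.0, 5.0), (6.0, 6.0)]}
toy_features = [(0.1, 0.0), (0.9, 1.0)]
candidate, avg = select_candidate(toy_features, toy_codebooks)
print(candidate, avg)
```

Returning the full table of average distortions alongside the candidate keeps the values available for the later verification, which also looks at the codebooks that were not selected.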

【００９７】次に、図５のステップＳ１２における話者
候補Ｓ_ｃの照合判定方法の詳細を図７に示す。先ずステ
ップＳ３０において、話者候補Ｓ_ｃの平均量子化歪みを
Ｄ_０とする。次にステップＳ３１において、話者候補Ｓ
_ｃ以外の各コードブックによる平均量子化歪みを小さい
順に並び替え、そのうち、小さいものから順にｎ個を、
Ｄ_１，Ｄ_２，・・・Ｄ_ｎ（Ｄ_０＜Ｄ_１＜Ｄ_２＜・・・＜
Ｄ_ｎ）とする。ｎの値は、任意に選択可能である。Next, FIG. 7 shows the details of the verification of the speaker candidate S_c in step S12 of FIG. 5. First, in step S30, the average quantization distortion of the speaker candidate S_c is denoted D_0. Next, in step S31, the average quantization distortions of the codebooks other than that of the speaker candidate S_c are sorted in ascending order, and the n smallest are denoted D_1, D_2, ..., D_n (D_0 < D_1 < D_2 < ... < D_n). The value of n can be chosen arbitrarily.

【００９８】続いてステップＳ３２において、評価の尺
度として、話者候補Ｓ_ｃの量子化歪みＤ_０とそれ以外の
ｎ個の量子化歪みについて、以下の式（１８）又は式
（１９）を用いて歪差分量ΔＤを求める。Subsequently, in step S32, as an evaluation measure, the distortion difference ΔD is obtained from the quantization distortion D_0 of the speaker candidate S_c and the other n quantization distortions using equation (18) or equation (19) below.

【００９９】[0099]

【数１５】 (Equation 15)

【０１００】式（１８）、式（１９）において、例えば
ｎが１の場合は、話者候補Ｓ_ｃに次いで量子化歪みが小
さいＤ_１とＤ_０との量子化歪みの差を求めることにな
る。In equations (18) and (19), when n is 1, for example, the difference is taken between D_0 and D_1, the smallest quantization distortion after that of the speaker candidate S_c.

【０１０１】続いてステップＳ３３において、図２に示
した閾値表ファイルＴＦから話者候補Ｓ_ｃに対応する閾
値データを読みこむ。Subsequently, in step S33, the threshold data corresponding to the speaker candidate S_c is read from the threshold table file TF shown in FIG. 2.

【０１０２】閾値表ファイルＴＦには、各登録話者ごと
に、例えば図８のような形式で記録されている。すなわ
ち、図８に示すように、各登録話者の話者識別名と、閾
値データである量子化歪みの最大歪み絶対値Ｄ_ｍａｘ及
び最小歪み差分ΔＤ_ｍｉｎが予め記録されている。The threshold table file TF holds an entry for each registered speaker, for example in the format shown in FIG. 8. That is, as shown in FIG. 8, the speaker identification name of each registered speaker and the threshold data, namely the maximum absolute quantization distortion D_max and the minimum distortion difference ΔD_min, are recorded in advance.

【０１０３】図７に戻り、ステップＳ３４では、読みこ
んだ閾値データＤ_ｍａｘ，ΔＤ_ｍｉ _ｎを、求めたＤ_０及
びΔＤと比較して判別する。すなわち、ステップＳ３４
において、量子化歪みの絶対値Ｄ_０が閾値データＤ
_ｍａｘよりも小さく、且つ、歪み差分ΔＤが閾値データ
ΔＤ_ｍｉｎより大きければ、ステップＳ３５に進み、本
人であると判定し、候補を確定する。そうでなければ、
ステップＳ３６に進み、未知話者と判定し、候補を棄却
する。このように、話者候補Ｓ_ｃの平均量子化歪みＤ_０
と歪差分量ΔＤとをそれぞれ閾値と比較することで、登
録話者の音声データの識別誤りが減少し、また、登録話
者以外の音声データを未知話者として判定することが可
能となる。Returning to FIG. 7, in step S34 the threshold data D_max and ΔD_min that were read are compared with the obtained D_0 and ΔD. That is, if the absolute quantization distortion D_0 is smaller than the threshold D_max and the distortion difference ΔD is larger than the threshold ΔD_min, the process proceeds to step S35, the candidate is judged to be the actual speaker, and the candidate is confirmed. Otherwise, the process proceeds to step S36, the speaker is judged to be unknown, and the candidate is rejected. Comparing the average quantization distortion D_0 of the speaker candidate S_c and the distortion difference ΔD with their respective thresholds in this way reduces identification errors for the voice data of registered speakers and makes it possible to classify voice data from anyone other than the registered speakers as an unknown speaker.
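The verification of steps S30 to S36 can be sketched as a two-threshold test. Equations (18) and (19) are not reproduced in this text, so the sketch implements only the "average" reading of the description, taking ΔD as the mean of the n smallest other distortions minus D_0; that reading, the function name `verify_candidate`, and the sample numbers are assumptions.

```python
from typing import Dict


def verify_candidate(avg_distortion: Dict[str, float], candidate: str,
                     d_max: float, delta_d_min: float, n: int = 1) -> bool:
    """Accept the candidate only if D_0 < D_max and delta D > delta D_min.

    avg_distortion maps every registered speaker to its average distortion D_k.
    delta D is taken as mean(D_1..D_n) - D_0 (assumed form).
    """
    d0 = avg_distortion[candidate]                                               # S30
    others = sorted(d for s, d in avg_distortion.items() if s != candidate)[:n]  # S31
    if not others:
        raise ValueError("at least one other codebook is needed to form delta D")
    delta_d = sum(others) / len(others) - d0                                     # S32
    return d0 < d_max and delta_d > delta_d_min                                  # S34


# Hypothetical average distortions and thresholds.
table = {"A": 0.2, "B": 1.0, "C": 1.4}
print(verify_candidate(table, "A", d_max=0.5, delta_d_min=0.5))
```

The first condition rejects input that is far from every registered speaker, while the second rejects input that is about equally close to several speakers; a `False` result corresponds to the unknown-speaker branch of step S36.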

【０１０４】以上説明したように、本実施の形態におけ
る情報抽出装置は、ＡＶデータの音声信号中の話者の音
声の特徴量に基づいて、話者認識ブロック毎に話者を識
別すると共に、所定の区間における話者の出現頻度を検
出し、話者の出現頻度情報を生成する。この出現頻度情
報が通信回線又は記録媒体を介して後述する情報検索装
置に供給されることで、情報検索装置において所望の情
報を効果的に検索することができる。As described above, the information extracting apparatus according to the present embodiment identifies the speaker for each speaker recognition block based on the feature amount of the speaker's voice in the audio signal of the AV data. The appearance frequency of the speaker in a predetermined section is detected, and the appearance frequency information of the speaker is generated. By supplying the appearance frequency information to an information search device described later via a communication line or a recording medium, desired information can be effectively searched in the information search device.

【０１０５】次に、本実施の形態における情報検索装置
について説明する。先ず、情報検索装置の概念構成図を
図９に示す。図９に示すように、情報検索装置において
は、話者出現頻度読み込み手段３によって、上述した情
報抽出装置にて生成された話者の出現頻度情報が読み込
まれ、また、検索条件が、検索条件入力手段４に入力さ
れる。話者出現区間出力手段５は、これらの出現頻度情
報と検索条件とに基づいて検索された話者出現区間情報
を出力する。Next, an information retrieval apparatus according to the present embodiment will be described. First, a conceptual configuration diagram of the information search device is shown in FIG. As shown in FIG. 9, in the information retrieval apparatus, the appearance frequency information of the speaker generated by the above-described information extraction apparatus is read by the speaker appearance frequency reading means 3, and the search condition is determined by the search condition. It is input to the input means 4. The speaker appearance section output means 5 outputs the speaker appearance section information searched based on the appearance frequency information and the search condition.

【０１０６】この図９に示した情報抽出装置の具体的な
構成例を図１０に示す。図１０に示すように、情報検索
装置２０は、検索条件を入力する条件入力部２１と、入
力された条件から情報を検索するデータ検索部２２と、
検索結果を出力する出力部２３とを備える。また、話者
頻度ファイルＳＦは、話者の出現頻度情報が記録された
ものであり、上述した図３に示すような記録形式で情報
が記録されている。FIG. 10 shows a specific configuration example of the information retrieval apparatus shown in FIG. 9. As shown in FIG. 10, the information retrieval apparatus 20 includes a condition input unit 21 for inputting search conditions, a data search unit 22 that searches for information according to the input conditions, and an output unit 23 that outputs the search result. The speaker frequency file SF holds the speaker appearance frequency information, recorded in the format shown in FIG. 3 described above.

【０１０７】このように構成された情報検索装置２０の
動作を以下に説明する。条件入力部２１はＡＶデータを
検索するための検索条件を入力する。検索条件として
は、例えば、所望の話者の名前や識別番号、その話者の
会話頻度、検索対象とするＡＶデータ等が挙げられる。The operation of the information retrieval apparatus 20 configured as described above is as follows. The condition input unit 21 receives the search conditions for searching AV data. Examples of search conditions include the name or identification number of a desired speaker, the conversation frequency of that speaker, and the AV data to be searched.

【０１０８】入力された検索条件Ｄ２２はデータ検索部
２２に入力されて、検索条件にあった情報が検索され
る。データ検索部２２は、話者頻度情報ファイルＳＦを
参照して話者の出現頻度情報Ｄ２３を読込み、これを検
索条件Ｄ２２と比較し、検索結果Ｄ２４を出力部２３に
供給する。出力部２３は、この検索結果Ｄ２４を話者出
現区間情報Ｄ２５として出力する。The input search condition D22 is input to the data search unit 22, and information matching the search condition is searched. The data search unit 22 reads the speaker appearance frequency information D23 with reference to the speaker frequency information file SF, compares it with the search condition D22, and supplies a search result D24 to the output unit 23. The output unit 23 outputs the search result D24 as speaker appearance section information D25.
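The comparison performed by the data search unit 22 can be sketched as a filter over the records of the speaker frequency file. The dictionary keys, the function name `search_sections`, and the interpretation of the frequency condition as "at least a given value" are assumptions; the text only says that the search condition is compared with the appearance frequency information.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Record = Dict[str, object]


def search_sections(records: Sequence[Record], speaker_id: str, min_frequency: float,
                    av_data_name: Optional[str] = None) -> List[Tuple[object, object, object]]:
    """Return (AV data name, start time, end time) for every matching section.

    A section matches when it belongs to the requested speaker, its frequency
    is at least min_frequency, and, if given, its AV data name matches.
    """
    return [
        (r["av_data_name"], r["start_time"], r["end_time"])
        for r in records
        if r["speaker_id"] == speaker_id
        and r["frequency"] >= min_frequency
        and (av_data_name is None or r["av_data_name"] == av_data_name)
    ]


# Hypothetical contents of a speaker frequency file.
sf_records = [
    {"av_data_name": "av1", "speaker_id": "A", "start_time": 0, "end_time": 600, "frequency": 40},
    {"av_data_name": "av1", "speaker_id": "B", "start_time": 0, "end_time": 600, "frequency": 5},
    {"av_data_name": "av1", "speaker_id": "A", "start_time": 600, "end_time": 1200, "frequency": 3},
    {"av_data_name": "av2", "speaker_id": "A", "start_time": 0, "end_time": 600, "frequency": 25},
]
print(search_sections(sf_records, "A", min_frequency=20))
```

The returned tuples play the role of the speaker appearance section information D25: they tell the user which sections of which AV data to look at.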

【０１０９】このように、本実施の形態における情報検
索装置は、通信回線又は記録媒体を介して入力したＡＶ
データの音声信号中の所望の話者の出現頻度情報と入力
した検索条件とを比較することにより、所望の話者が所
望の頻度で会話している部分等を効果的に検索すること
ができる。As described above, the information retrieval apparatus according to the present embodiment compares the appearance frequency information of a desired speaker in the audio signal of AV data, received via a communication line or a recording medium, with the input search condition, and can thereby effectively retrieve, for example, a portion in which the desired speaker is talking at a desired frequency.

【０１１０】なお、本発明は上述した実施の形態のみに
限定されるものではなく、本発明の要旨を逸脱しない範
囲において種々の変更が可能であることは勿論である。It should be noted that the present invention is not limited to only the above-described embodiment, and it is needless to say that various modifications can be made without departing from the spirit of the present invention.

【０１１１】[0111]

【発明の効果】以上詳細に説明したように本発明に係る
情報抽出装置は、所定の情報源から所望の情報を抽出す
るための情報抽出装置において、上記情報源である音声
信号について、上記音声信号中の音声の類似性によっ
て、ある評価区間毎に話者を判別する話者識別手段と、
上記評価区間毎に判別された話者の頻度を求める区間で
ある上記情報源における頻度区間での上記話者の判別頻
度情報を求める話者判別頻度計算手段とを備え、上記頻
度区間における上記話者の出現頻度情報を検出すること
を特徴としている。As described in detail above, the information extraction apparatus according to the present invention is an apparatus for extracting desired information from a predetermined information source, comprising speaker identification means for determining, for an audio signal serving as the information source, a speaker for each evaluation section based on the similarity of the voices in the audio signal, and speaker discrimination frequency calculation means for obtaining discrimination frequency information of the speakers in a frequency section of the information source, which is the section over which the frequency of the speakers determined for each evaluation section is obtained. The apparatus is characterized by detecting the appearance frequency information of the speakers in the frequency section.

【０１１２】ここで、情報抽出装置では、上記情報源の
音声信号中の音声の類似性を評価する特徴量として、Ｌ
ＰＣ分析によって得られるＬＰＣケプストラムが用いら
れ、識別の手法として、複数のコードブックによる特徴
量のベクトル量子化が用いられ、識別の尺度として、そ
のベクトル量子化歪みが用いられる。Here, in the information extraction apparatus, the LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of the voices in the audio signal of the information source, vector quantization of the feature amount with a plurality of codebooks is used as the identification technique, and the vector quantization distortion is used as the identification measure.

【０１１３】また、情報抽出装置では、上記ベクトル量
子化歪みの最小値である最小量子化歪みと、上記最小量
子化歪み以外の複数のベクトル量子化歪みの和又は平均
から最小量子化歪みを減算した値とを、それぞれ予め設
定された閾値と比較することで識別判定される。Further, in the information extraction apparatus, identification is decided by comparing the minimum quantization distortion, which is the minimum value of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion, each with a preset threshold.

【０１１４】このような情報抽出装置によっては、音声
信号中の話者の音声の特徴量に基づいて、ある評価区間
毎に話者を識別すると共に、評価区間毎に判別された話
者の頻度を求める区間である頻度区間における話者の出
現頻度を検出し、話者の出現頻度情報を生成することが
できる。この出現頻度情報が通信回線又は記録媒体等を
介して情報検索装置に供給されることで、情報検索装置
において所望の情報を効果的に検索することができる。Such an information extraction apparatus can identify a speaker for each evaluation section based on the feature amount of the speaker's voice in the audio signal, detect the appearance frequency of each speaker in a frequency section, which is the section over which the frequency of the speakers determined for each evaluation section is obtained, and generate appearance frequency information of the speakers. When this appearance frequency information is supplied to an information retrieval apparatus via a communication line, a recording medium, or the like, the information retrieval apparatus can effectively retrieve the desired information.

【０１１５】また、本発明に係る情報抽出方法は、所定
の情報源から所定の情報を検索するための情報抽出方法
において、上記情報源である音声信号について、上記音
声信号中の音声の類似性によって、ある評価区間毎に話
者を判別する話者識別工程と、上記評価区間毎に判別さ
れた話者の頻度を求める区間である上記情報源における
頻度区間での上記話者の判別頻度情報を求める話者判別
頻度計算工程とを有し、上記頻度区間における上記話者
の出現頻度情報を検出することを特徴としている。The information extraction method according to the present invention is a method for retrieving predetermined information from a predetermined information source, comprising a speaker identification step of determining, for an audio signal serving as the information source, a speaker for each evaluation section based on the similarity of the voices in the audio signal, and a speaker discrimination frequency calculation step of obtaining discrimination frequency information of the speakers in a frequency section of the information source, which is the section over which the frequency of the speakers determined for each evaluation section is obtained. The method is characterized by detecting the appearance frequency information of the speakers in the frequency section.

【０１１６】ここで、情報抽出方法では、上記情報源の
音声信号中の音声の類似性を評価する特徴量として、Ｌ
ＰＣ分析によって得られるＬＰＣケプストラムが用いら
れ、識別の手法として、複数のコードブックによる特徴
量のベクトル量子化が用いられ、識別の尺度として、そ
のベクトル量子化歪みが用いられる。Here, in the information extraction method, the LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of the voices in the audio signal of the information source, vector quantization of the feature amount with a plurality of codebooks is used as the identification technique, and the vector quantization distortion is used as the identification measure.

[0117] Further, in the information extraction method, the identification decision is made by comparing the minimum quantization distortion, which is the minimum value of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion, each with a preset threshold.
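The two-threshold decision can be sketched as follows. This passage does not state the direction of either comparison, so the sketch assumes that a speaker is accepted only when the minimum distortion is below its threshold and the margin (average of the other distortions minus the minimum) is above its threshold; the average is used rather than the sum. The names `identify_speaker`, `max_min_distortion`, and `min_margin` are hypothetical.

```python
from typing import Dict, Optional


def identify_speaker(
    distortions: Dict[str, float],
    max_min_distortion: float,
    min_margin: float,
) -> Optional[str]:
    """Return the accepted speaker, or None if the block is rejected."""
    if len(distortions) < 2:
        raise ValueError("need distortions from at least two codebooks")
    best = min(distortions, key=distortions.get)
    d_min = distortions[best]
    others = [d for name, d in distortions.items() if name != best]
    # Margin: average of the remaining distortions minus the minimum.
    margin = sum(others) / len(others) - d_min
    # Assumed comparison directions: small minimum AND large margin.
    if d_min < max_min_distortion and margin > min_margin:
        return best
    return None


accepted = identify_speaker({"A": 0.5, "B": 36.5, "C": 20.5}, 1.0, 10.0)
ambiguous = identify_speaker({"A": 0.5, "B": 0.6, "C": 0.7}, 1.0, 10.0)
too_far = identify_speaker({"A": 5.0, "B": 36.5}, 1.0, 10.0)
```

Under these assumptions the margin test rejects blocks in which several codebooks fit almost equally well, and the minimum-distortion test rejects blocks that fit no registered speaker closely.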

[0118] With such an information extraction method, a speaker is identified for each evaluation section based on the feature amount of the speaker's voice in the audio signal, the appearance frequency of the speaker is detected in a frequency section, which is the section over which the frequency of the speakers determined for each evaluation section is obtained, and appearance frequency information of the speaker can be generated. By supplying this appearance frequency information to an information retrieval apparatus via a communication line, a recording medium, or the like, the information retrieval apparatus can retrieve desired information effectively.
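As an illustration of how per-section decisions turn into appearance frequency information, the sketch below groups the decisions for consecutive evaluation sections into frequency sections and counts each speaker. It assumes fixed-length, non-overlapping frequency sections and expresses frequency as a fraction of the evaluation sections in each frequency section; both choices, and the name `appearance_frequencies`, are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Optional


def appearance_frequencies(
    decisions: List[Optional[str]], sections_per_frequency_section: int
) -> List[Dict[str, float]]:
    """Per frequency section, the fraction of evaluation sections won by each speaker.

    `decisions` holds one entry per evaluation section: a speaker label,
    or None when no speaker was accepted for that section.
    """
    result = []
    for start in range(0, len(decisions), sections_per_frequency_section):
        chunk = decisions[start:start + sections_per_frequency_section]
        counts = Counter(s for s in chunk if s is not None)
        result.append({speaker: n / len(chunk) for speaker, n in counts.items()})
    return result


decisions = ["A", "A", "B", None, "B", "B", "B", "A"]
frequency_info = appearance_frequencies(decisions, 4)
```

With four evaluation sections per frequency section, the first frequency section is dominated by `"A"` and the second by `"B"`; a list of this kind is what would be written to the recording medium or sent over the communication line.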

[0119] Further, an information retrieval apparatus according to the present invention retrieves desired information from a recording medium on which appearance frequency information of a speaker in a frequency section is recorded in advance, the appearance frequency information having been obtained by identifying, for an audio signal serving as an information source, a speaker for each evaluation section based on the similarity of the voice in the audio signal, and by obtaining determination frequency information for the speaker in the frequency section, which is the section over which the frequency of the speakers determined for each evaluation section is obtained. The apparatus comprises: speaker appearance frequency reading means for reading the appearance frequency information of the speaker recorded on the recording medium; search condition input means for inputting a search condition for a desired speaker; and speaker appearance section output means for comparing the input search condition with the information read from the recording medium and outputting the information that satisfies the search condition as speaker appearance section information.

[0120] Here, when the appearance frequency information of the speaker is obtained, the LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of the voice in the audio signal of the information source, vector quantization of the feature amount with a plurality of codebooks is used as the identification technique, and the resulting vector quantization distortion is used as the measure for identification. The identification decision is made by comparing the minimum quantization distortion, which is the minimum value of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion, each with a preset threshold.

[0121] With such an information retrieval apparatus, comparing the appearance frequency information of a desired speaker with the input search condition makes it possible to effectively retrieve, for example, a portion in which the desired speaker is speaking at a desired frequency.
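The comparison between recorded appearance frequency information and a search condition can be sketched as a simple filter. The sketch assumes that the recorded information is a list with one speaker-to-frequency mapping per frequency section and that the search condition is a speaker label plus a minimum frequency; the function name `find_sections` and this data layout are hypothetical, not the recording format of the apparatus.

```python
from typing import Dict, List


def find_sections(
    frequency_info: List[Dict[str, float]], speaker: str, min_frequency: float
) -> List[int]:
    """Indices of the frequency sections in which `speaker` reaches `min_frequency`."""
    return [
        index
        for index, frequencies in enumerate(frequency_info)
        if frequencies.get(speaker, 0.0) >= min_frequency
    ]


frequency_info = [
    {"A": 0.5, "B": 0.25},
    {"B": 0.75, "A": 0.25},
    {"A": 0.9},
]
sections_with_a = find_sections(frequency_info, "A", 0.5)
sections_with_b = find_sections(frequency_info, "B", 0.5)
```

The returned indices play the role of speaker appearance section information: they can be mapped back to time positions in the AV data once the length of a frequency section is known.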

[0122] Further, an information retrieval method according to the present invention retrieves desired information from a recording medium on which appearance frequency information of a speaker in a frequency section is recorded in advance, the appearance frequency information having been obtained by identifying, for an audio signal serving as an information source, a speaker for each evaluation section based on the similarity of the voice in the audio signal, and by obtaining determination frequency information for the speaker in the frequency section, which is the section over which the frequency of the speakers determined for each evaluation section is obtained. The method comprises: a speaker appearance frequency reading step of reading the appearance frequency information of the speaker recorded on the recording medium; a search condition input step of inputting a search condition for a desired speaker; and a speaker appearance section output step of comparing the input search condition with the information read from the recording medium and outputting the information that satisfies the search condition as speaker appearance section information.

[0123] Here, when the determination frequency information of the speaker is obtained, the LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of the voice in the audio signal of the information source, vector quantization of the feature amount with a plurality of codebooks is used as the identification technique, and the resulting vector quantization distortion is used as the measure for identification. The identification decision is made by comparing the minimum quantization distortion, which is the minimum value of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion, each with a preset threshold.

[0124] With such an information retrieval method, comparing the appearance frequency information of a desired speaker with the input search condition makes it possible to effectively retrieve, for example, a portion in which the desired speaker is speaking at a desired frequency.

[Brief description of the drawings]

FIG. 1 is a diagram illustrating the conceptual configuration of the information extraction apparatus according to the present embodiment.

FIG. 2 is a diagram illustrating a configuration example of the information extraction apparatus.

FIG. 3 is a diagram illustrating an example of the recording format of the speaker appearance frequency information in the information extraction apparatus.

FIG. 4 is a diagram illustrating the relationship among the frequency evaluation section, the speaker recognition block, and the LPC analysis block in the information extraction apparatus.

FIG. 5 is a flowchart illustrating the operation of the information extraction apparatus.

FIG. 6 is a flowchart illustrating the speaker identification processing performed for each speaker recognition block in the information extraction apparatus.

FIG. 7 is a flowchart illustrating the speaker verification decision processing in the information extraction apparatus.

FIG. 8 is a diagram illustrating an example of the recording format of the threshold data used for the speaker verification decision in the information extraction apparatus.

FIG. 9 is a diagram illustrating the conceptual configuration of the information retrieval apparatus according to the present embodiment.

FIG. 10 is a diagram illustrating the configuration of the information retrieval apparatus.

[Explanation of symbols]

1 speaker identification means, 2 speaker determination frequency calculation means, 3 speaker appearance frequency reading means, 4 search condition input means, 5 speaker appearance section output means, 10 information extraction apparatus, 11 input unit, 12 cepstrum extraction unit, 13 vector quantization unit, 14 speaker identification unit, 15 speaker determination frequency calculation unit, 20 information retrieval apparatus, 21 condition input unit, 22 data search unit, 23 output unit

Claims

[Claims]

1. An information extraction apparatus for extracting desired information from a predetermined information source, comprising: speaker identification means for identifying, for an audio signal serving as the information source, a speaker for each evaluation section based on the similarity of the voice in the audio signal; and speaker determination frequency calculation means for obtaining determination frequency information for the speaker in a frequency section of the information source, the frequency section being the section over which the frequency of the speakers determined for each evaluation section is obtained, wherein appearance frequency information of the speaker in the frequency section is detected.

2. The information extraction apparatus according to claim 1, wherein an LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of the voice in the audio signal of the information source, vector quantization of the feature amount with a plurality of codebooks is used as the identification technique, and the resulting vector quantization distortion is used as the measure for identification.

3. The information extraction apparatus according to claim 2, wherein the identification decision is made by comparing the minimum quantization distortion, which is the minimum value of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion, each with a preset threshold.

4. An information extraction method for retrieving predetermined information from a predetermined information source, comprising: a speaker identification step of identifying, for an audio signal serving as the information source, a speaker for each evaluation section based on the similarity of the voice in the audio signal; and a speaker determination frequency calculation step of obtaining determination frequency information for the speaker in a frequency section of the information source, the frequency section being the section over which the frequency of the speakers determined for each evaluation section is obtained, wherein appearance frequency information of the speaker in the frequency section is detected.

5. The information extraction method according to claim 4, wherein an LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of the voice in the audio signal of the information source, vector quantization of the feature amount with a plurality of codebooks is used as the identification technique, and the resulting vector quantization distortion is used as the measure for identification.

6. The information extraction method according to claim 5, wherein the identification decision is made by comparing the minimum quantization distortion, which is the minimum value of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion, each with a preset threshold.

7. An information retrieval apparatus for retrieving desired information from a recording medium on which appearance frequency information of a speaker in a frequency section is recorded in advance, the appearance frequency information having been obtained by identifying, for an audio signal serving as an information source, a speaker for each evaluation section based on the similarity of the voice in the audio signal, and by obtaining determination frequency information for the speaker in the frequency section of the information source, which is the section over which the frequency of the speakers determined for each evaluation section is obtained, the apparatus comprising: speaker appearance frequency reading means for reading the appearance frequency information of the speaker recorded on the recording medium; search condition input means for inputting a search condition for a desired speaker; and speaker appearance section output means for comparing the input search condition with the information read from the recording medium and outputting the information that satisfies the search condition as speaker appearance section information.

8. The information retrieval apparatus according to claim 7, wherein an LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of the voice in the audio signal of the information source, vector quantization of the feature amount with a plurality of codebooks is used as the identification technique, and the resulting vector quantization distortion is used as the measure for identification.

9. The information retrieval apparatus according to claim 8, wherein the identification decision is made by comparing the minimum quantization distortion, which is the minimum value of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion, each with a preset threshold.

10. An information retrieval method for retrieving desired information from a recording medium on which appearance frequency information of a speaker in a frequency section is recorded in advance, the appearance frequency information having been obtained by identifying, for an audio signal serving as an information source, a speaker for each evaluation section based on the similarity of the voice in the audio signal, and by obtaining determination frequency information for the speaker in the frequency section of the information source, which is the section over which the frequency of the speakers determined for each evaluation section is obtained, the method comprising: a speaker appearance frequency reading step of reading the appearance frequency information of the speaker recorded on the recording medium; a search condition input step of inputting a search condition for a desired speaker; and a speaker appearance section output step of comparing the input search condition with the information read from the recording medium and outputting the information that satisfies the search condition as speaker appearance section information.

11. The information retrieval method according to claim 10, wherein an LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of the voice in the audio signal of the information source, vector quantization of the feature amount with a plurality of codebooks is used as the identification technique, and the resulting vector quantization distortion is used as the measure for identification.

12. The information retrieval method according to claim 11, wherein the identification decision is made by comparing the minimum quantization distortion, which is the minimum value of the vector quantization distortions, and the value obtained by subtracting the minimum quantization distortion from the sum or average of a plurality of vector quantization distortions other than the minimum quantization distortion, each with a preset threshold.