JP2003036087A

JP2003036087A - Apparatus and method for detecting information

Info

Publication number: JP2003036087A
Application number: JP2001225050A
Authority: JP
Inventors: Yasuhiro Tokuri; 康裕戸栗; Masayuki Nishiguchi; 正之西口
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2001-07-25
Filing date: 2001-07-25
Publication date: 2003-02-07
Anticipated expiration: 2021-07-25
Also published as: JP4696418B2

Abstract

PROBLEM TO BE SOLVED: To reduce recognition errors in portions of AV data that contain much background noise or mixed speech from a plurality of speakers when detecting the statistical speech frequency of the speakers in the AV data. SOLUTION: In an information detecting apparatus 10, a voice signal D11 of AV data input from an inputting part 11 is LPC-analyzed by an LPC analyzing part 12. The LPC coefficient of each block determined to be a voiced sound block by a voiced sound determining part 14 is input to a cepstrum converting part 17 and converted into an LPC cepstrum coefficient. The LPC cepstrum coefficient D12 is vector-quantized by a vector quantizing part 18. The resulting quantization distortion D18 is input to a speaker identifying part 19, which evaluates it and identifies the speaker for each predetermined recognition block. The identified speaker D20 is input to a speaker discrimination frequency calculating part 20, which calculates, for each predetermined evaluation interval, how often each speaker was recognized within that interval and outputs the result as speaker appearance frequency information D21.

Description

Detailed Description of the Invention

【０００１】[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to an information detecting apparatus and method, and more particularly to an information detecting apparatus and method that detect features of audio data or audio-visual data so that the data can be searched.

【０００２】[0002]

2. Description of the Related Art With the recent spread of multimedia, there is a growing need to manage large amounts of AV (Audio Visual) data efficiently and to classify, search, and extract it. For example, it may be necessary to search a large amount of AV data for scenes in which a certain character appears or speaks, or to extract and play back only the conversation scenes of a certain person from the AV data.

[0003] Conventionally, to find the position on the time axis at which a specific speaker is talking in such AV data, a person had to view the AV data directly and search for the position or section on the time axis.

[0004] Meanwhile, automatic speaker identification and verification have been studied as techniques for identifying the speaker of a voice. An outline of the conventional art in this field follows.

[0005] Speaker recognition comprises speaker identification and speaker verification. Speaker identification determines which of a set of pre-registered speakers produced the input speech. Speaker verification compares the input speech with the data of a pre-registered speaker and determines whether the speaker is the claimed person. Speaker recognition is also divided into a text-dependent type, in which the words (keywords) uttered at recognition time are predetermined, and a text-independent type, in which recognition is performed on arbitrary utterances.

[0006] As a general recognition technique, the following approach is often used. First, features representing the individuality of a speaker's voice signal are extracted and stored in advance as training data. At identification or verification time, the input speech is analyzed, features representing its individuality are extracted, and their similarity to the training data is evaluated to identify or verify the speaker. The cepstrum is often used as a feature representing the individuality of a voice. The cepstrum is the inverse Fourier transform of the logarithmic spectrum, and its low-order coefficients represent the envelope of the speech spectrum. Alternatively, LPC (Linear Predictive Coding) analysis is applied to the voice signal to obtain LPC coefficients, and the LPC cepstrum coefficients obtained by converting those LPC coefficients are often used. The polynomial expansion coefficients of the time series of these cepstrum or LPC cepstrum coefficients are called the delta cepstrum, and they too are widely used as features representing temporal change in the speech spectrum. Pitch and delta pitch (polynomial expansion coefficients of pitch) are sometimes used as well.
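As an illustrative sketch only, the conversion from LPC coefficients to LPC cepstrum coefficients mentioned above can be written with the standard recursion for an all-pole model. The function name `lpc_to_cepstrum`, the sign convention (prediction x[n] ≈ Σ a_k x[n-k]), and the omission of the gain term c_0 are assumptions of this sketch and are not taken from the specification.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC predictor coefficients to LPC cepstrum coefficients.

    Assumed convention (not from the specification): a[k-1] holds a_k of the
    predictor x[n] ~ sum_k a_k * x[n-k], so the model is 1 / (1 - sum a_k z^-k).
    Returns [c_1, ..., c_n_ceps]; the gain term c_0 is not computed.
    """
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        # Direct term a_n exists only while n does not exceed the LPC order.
        acc = a[n - 1] if n <= p else 0.0
        # Recursive term: sum over k of (k/n) * c_k * a_(n-k), with n-k <= p.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c


if __name__ == "__main__":
    # Single-pole model 1 / (1 - 0.5 z^-1); its cepstrum is 0.5**n / n.
    print(lpc_to_cepstrum([0.5], 3))
```

For a single pole at 0.5 the recursion reproduces the closed-form cepstrum 0.5^n / n, which is a convenient sanity check for the sign convention.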

[0007] Training data is created using features such as the LPC cepstrum extracted in this way as standard patterns. Two representative methods are the method based on vector quantization distortion and the method based on the hidden Markov model (HMM).

[0008] In the vector quantization distortion method, the features of each speaker are grouped in advance and the centroids of the groups are stored as the elements (code vectors) of a codebook. The features of the input speech are then vector-quantized with each speaker's codebook, and the average quantization distortion of each codebook over the entire input speech is obtained.

[0009] For speaker identification, the speaker whose codebook gives the smallest average quantization distortion is selected. For speaker verification, the average quantization distortion given by the claimed speaker's codebook is compared with a threshold to determine whether the speaker is the claimed person.
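The vector quantization distortion method described above can be sketched as follows. This is a minimal illustration, not the implementation of the specification: the squared Euclidean distance, the function names, and the toy codebooks are all assumptions introduced here.

```python
def sq_dist(u, v):
    """Squared Euclidean distance (an assumed distortion measure)."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def average_distortion(features, codebook):
    """Quantize each feature vector to its nearest code vector and
    return the mean quantization distortion over the whole input."""
    total = sum(min(sq_dist(f, c) for c in codebook) for f in features)
    return total / len(features)


def identify(features, codebooks):
    """Speaker identification: pick the speaker whose codebook
    gives the smallest average quantization distortion."""
    return min(codebooks, key=lambda s: average_distortion(features, codebooks[s]))


def verify(features, codebook, threshold):
    """Speaker verification: accept if the claimed speaker's
    average quantization distortion does not exceed the threshold."""
    return average_distortion(features, codebook) <= threshold


if __name__ == "__main__":
    # Hypothetical two-dimensional codebooks for two registered speakers.
    codebooks = {"A": [(0.0, 0.0), (1.0, 1.0)], "B": [(5.0, 5.0), (6.0, 6.0)]}
    features = [(0.1, 0.0), (0.9, 1.1)]
    print(identify(features, codebooks))
    print(verify(features, codebooks["B"], threshold=0.1))
```

In practice the feature vectors would be LPC cepstrum vectors and the codebooks would be trained per speaker; the two-dimensional values here are placeholders.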

[0010] In the HMM method, the speaker features obtained as above are represented by the transition probabilities between the states of a hidden Markov model (HMM) and the probabilities of observing the features in each state, and the decision is made from the average likelihood against the model over the entire input speech section.

[0011] When speaker identification must also handle unspecified speakers who are not registered in advance, the decision is made by combining the speaker identification and speaker verification described above. That is, the most similar speaker in the registered speaker set is selected as a candidate, and the candidate's quantization distortion or likelihood is compared with a threshold to determine whether the input really is that speaker.
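The two-stage decision for inputs that may come from an unregistered speaker can be sketched as below. The sketch assumes that an average quantization distortion has already been computed for each registered speaker; the function name, the dictionary layout, and the use of `None` for "no registered speaker" are assumptions of this illustration.

```python
def identify_open_set(distortions, threshold):
    """Pick the closest registered speaker, then verify that candidate.

    distortions: mapping speaker name -> average quantization distortion
                 (smaller means more similar); assumed precomputed.
    Returns the speaker name, or None if the best candidate fails the
    threshold test and the input is treated as an unregistered speaker.
    """
    candidate = min(distortions, key=distortions.get)  # identification step
    if distortions[candidate] <= threshold:             # verification step
        return candidate
    return None


if __name__ == "__main__":
    print(identify_open_set({"A": 0.02, "B": 0.90}, threshold=0.1))
    print(identify_open_set({"A": 0.45, "B": 0.90}, threshold=0.1))
```

When likelihoods are used instead of distortions, the comparison directions are reversed (largest likelihood, accept when at or above the threshold).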

[0012] In speaker verification, and in speaker identification that includes unspecified speakers, the speaker's likelihood or quantization distortion is compared with a threshold to make the decision. However, because of variation of the features over time, differences in the uttered text, noise, and so on, these values vary widely between the input data and the training data (model) even for the same speaker. In general, therefore, setting a threshold on the absolute value does not give a stable and sufficient recognition rate.

[0013] For this reason, the likelihood is commonly normalized in HMM-based speaker recognition. One method, for example, uses the log-likelihood ratio LR given by Equation (1) below for the decision.

【００１４】[0014]

【数１】 [Equation 1]

[0015] In Equation (1), L(X/S_c) is the likelihood of the verification target speaker S_c (the claimed speaker) for the input speech X, and L(X/S_r) is the likelihood of a speaker S_r other than the speaker S_c for the input speech X. In effect, the threshold is set dynamically according to the likelihood for the input speech X, which makes the decision robust against differences in utterance content and variation over time.
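Equation (1) itself is not reproduced in this text. Purely as an illustration of why a likelihood ratio behaves like a dynamic threshold, the sketch below uses one form found in the speaker verification literature, in which the best competing log-likelihood is subtracted from the target's log-likelihood. This specific form, the function name, and the numeric values are assumptions and may differ from Equation (1).

```python
import math


def log_likelihood_ratio(likelihoods, target):
    """Assumed form: log L(X/S_c) minus the largest log L(X/S_r), r != c.

    likelihoods: mapping speaker name -> likelihood of the input speech X.
    """
    rivals = [math.log(v) for s, v in likelihoods.items() if s != target]
    return math.log(likelihoods[target]) - max(rivals)


if __name__ == "__main__":
    clean = {"c": 0.08, "r1": 0.02, "r2": 0.01}
    # The same utterance under conditions that scale every likelihood down.
    degraded = {s: v * 0.1 for s, v in clean.items()}
    print(log_likelihood_ratio(clean, "c"))
    print(log_likelihood_ratio(degraded, "c"))
```

Because a factor common to all speakers cancels in the ratio, a fixed threshold on LR remains usable even when the absolute likelihoods shift with utterance content or recording conditions.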

[0016] Alternatively, using the concept of posterior probability, a method that makes the decision from the posterior probability given by Equation (2) below has also been studied. Here, P(S_c) and P(S_r) are the appearance probabilities of the speakers S_c and S_r, respectively, and Σ denotes the sum over all speakers.
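Equation (2) is likewise not reproduced here. Given the symbols defined above, a posterior probability in the usual Bayes form can be sketched as follows; treating Equation (2) as this standard form is an assumption, as are the function name and the example values.

```python
def posterior(likelihoods, priors, target):
    """Assumed Bayes form: L(X/S_c) P(S_c) / sum over all speakers of L(X/S_r) P(S_r).

    likelihoods: mapping speaker -> likelihood of the input speech X.
    priors:      mapping speaker -> appearance probability P(S).
    """
    total = sum(likelihoods[s] * priors[s] for s in likelihoods)
    return likelihoods[target] * priors[target] / total


if __name__ == "__main__":
    likelihoods = {"c": 0.6, "r": 0.2}
    priors = {"c": 0.5, "r": 0.5}
    print(posterior(likelihoods, priors, "c"))
```

The posteriors over all speakers sum to one, so a threshold on this value is again independent of the absolute scale of the likelihoods.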

【００１７】[0017]

【数２】 [Equation 2]

[0018] These HMM-based likelihood normalization methods are described in detail in reference [4], listed below, and elsewhere.

[0019] In addition, conventional speaker recognition research includes methods that do not use every block of the voice signal for recognition: for example, methods that detect the voiced and unvoiced portions of the input voice signal and perform recognition using only the voiced portions, and methods that distinguish voiced from unvoiced sounds and perform recognition with separate trained models or codebooks. In an ideal noise-free environment, recognition using both unvoiced and voiced sounds is generally more effective.

[0020] The conventional speaker recognition techniques described above are detailed in, for example, the following references.
[1] Furui: "Speaker recognition based on statistical features of the cepstrum", Trans. IEICE (Shingakuron), Vol. J65-A, No. 2, pp. 183-190 (1982)
[2] F. K. Soong and A. E. Rosenberg: "On the Use of Instantaneous and Transitional Spectral Information in Speaker Recognition", IEEE Trans. ASSP, Vol. 36, No. 6, pp. 871-879 (1988)
[3] Furui: "On the individuality of voice", Journal of the Acoustical Society of Japan, Vol. 51, No. 11, pp. 876-881 (1995)
[4] Matsui: "Speaker recognition using HMM", IEICE Technical Report, Vol. 95, No. 467 (SP95 109-116), pp. 17-24 (1996)
[5] The Digital Signal Processing Handbook, IEEE Press (CRC Press), 1998
[6] F. K. Soong, A. E. Rosenberg, L. R. Rabiner and B. H. Juang: "A Vector Quantization Approach to Speaker Recognition", Proc. IEEE Int. Conf. on Acoust., Speech & Signal Processing, pp. 387-390 (1985)

【００２１】[0021]

Problems to Be Solved by the Invention: Conventional speaker recognition technology has been researched and developed mainly for identification and verification in security systems and the like, and assumes recognition of a single speaker in an environment with little background noise. Consequently, when, as in the audio of typical AV data, other speakers speak at the same time within one piece of audio data, several speakers alternate at short intervals, or there is music or noise in the background, recognition errors increase markedly and it is difficult to identify the speaker accurately. Unvoiced sounds in particular are hard to distinguish from noise, so recognition errors increase markedly in consonant portions. As a result, applying conventional speaker recognition techniques to detect a speaker's conversation sections in AV data leads to a marked drop in detection performance.

[0022] In actual AV data in particular, ideal portions in which a single speaker speaks in a low-noise environment are not very common; rather, recognition errors caused by background noise, simultaneous speech by several speakers, and the like occur frequently. Including such noisy portions and mixed-speech portions in recognition therefore markedly increases recognition errors, especially in unvoiced portions that are hard to distinguish from noise, and the statistical conversation frequency cannot be obtained correctly. As noted above, there are methods that perform speaker recognition while distinguishing unvoiced from voiced sounds, but the recognition method has not been switched according to, for example, the noise level of the environment.

[0023] The present invention has been proposed in view of these circumstances. Its object is to provide an information detecting apparatus and method that reduce recognition errors in portions containing much background noise or mixed speech from several speakers, detect the statistical conversation frequency of speakers in AV data effectively and efficiently, and use the detected information to detect the conversation sections of a desired speaker.

【００２４】[0024]

Means for Solving the Problems: To achieve the above object, an information detecting apparatus according to the present invention is an information detecting apparatus for detecting predetermined information from a predetermined information source, comprising: voiced sound block detecting means for analyzing a voice signal contained in the information source and detecting the voiced portions of a speaker's speech; speaker identifying means for identifying the speaker for each predetermined evaluation section from the similarity of features of the voice signal, using the detected voiced sound blocks; and speaker discrimination frequency calculating means for obtaining, for each predetermined frequency section, discrimination frequency information for the speakers identified in each evaluation section, whereby appearance frequency information for the speakers in the frequency section is detected.

[0025] The information detecting apparatus may further comprise recognition mode switching means for switching between a voiced sound mode, in which the unvoiced portions of the voice signal are not used for recognition, and a voiced/unvoiced sound mode, in which the unvoiced portions are used for recognition. In the voiced/unvoiced sound mode, the speaker identifying means identifies the speaker for each predetermined evaluation section from the similarity of features of the voice signal, using both voiced sound blocks and unvoiced sound blocks.

[0026] In the information detecting apparatus, the LPC cepstrum obtained by LPC analysis is used as the feature for evaluating the similarity of speech in the voice signal of the information source, vector quantization of the features with a plurality of codebooks is used as the identification method, and the resulting vector quantization distortion is used as the identification measure.

[0027] Such an information detecting apparatus identifies the speaker for each evaluation section on the basis of the features of the speaker's speech in the voice signal, detects for each frequency section how often each speaker was determined in the evaluation sections, and generates speaker appearance frequency information. In doing so, it detects the voiced portions of the speaker's speech and identifies the speaker using the detected voiced sound blocks. Where the apparatus can switch between the voiced sound mode, in which the unvoiced portions of the voice signal are not used for recognition, and the voiced/unvoiced sound mode, in which they are used, it may identify the speaker using both voiced and unvoiced sound blocks when in the voiced/unvoiced sound mode.

[0028] To achieve the above object, an information detecting method according to the present invention is an information detecting method for detecting predetermined information from a predetermined information source, comprising: a voiced sound block detecting step of analyzing a voice signal contained in the information source and detecting the voiced portions of a speaker's speech; a speaker identifying step of identifying the speaker for each predetermined evaluation section from the similarity of features of the voice signal, using the detected voiced sound blocks; and a speaker discrimination frequency calculating step of obtaining, for each predetermined frequency section, discrimination frequency information for the speakers identified in each evaluation section, whereby appearance frequency information for the speakers in the frequency section is detected.

[0029] The information detecting method may further comprise a recognition mode switching step of switching between a voiced sound mode, in which the unvoiced portions of the voice signal are not used for recognition, and a voiced/unvoiced sound mode, in which the unvoiced portions are used for recognition. In the voiced/unvoiced sound mode, the speaker identifying step identifies the speaker for each predetermined evaluation section from the similarity of features of the voice signal, using both voiced sound blocks and unvoiced sound blocks.

[0030] In the information detecting method, the LPC cepstrum obtained by LPC analysis is used as the feature for evaluating the similarity of speech in the voice signal of the information source, vector quantization of the features with a plurality of codebooks is used as the identification method, and the resulting vector quantization distortion is used as the identification measure.

[0031] In such an information detecting method, the speaker is identified for each evaluation section on the basis of the features of the speaker's speech in the voice signal, how often each speaker was determined in the evaluation sections is detected for each frequency section, and speaker appearance frequency information is generated. In doing so, the voiced portions of the speaker's speech are detected and the speaker is identified using the detected voiced sound blocks. Where it is possible to switch between the voiced sound mode, in which the unvoiced portions of the voice signal are not used for recognition, and the voiced/unvoiced sound mode, in which they are used, the speaker may be identified using both voiced and unvoiced sound blocks when in the voiced/unvoiced sound mode.

【００３２】[0032]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Specific embodiments to which the present invention is applied are described in detail below with reference to the drawings. Before describing the present embodiment, techniques previously proposed by the present inventors are described.

[0033] The specification and drawings of Japanese Patent Application No. 2000-363547, previously proposed by the present inventors, propose a technique that applies speaker recognition to detect, in AV data, continuous conversation sections of the same speaker and speaker switching positions. In that application, the voice signal of the AV data is classified into speaker groups for each short section (about 1 to 2 seconds), the change in the discrimination frequency of each speaker group is obtained within several consecutive recognition sections (several seconds to about 10 seconds), and the positions at which the frequency rises above or falls below a threshold are detected as speaker switching positions. The section between speaker switches is then detected as a continuous conversation section of that speaker.

[0034] Similarly, the specification and drawings of Japanese Patent Application No. 2001-177569, also previously proposed by the present inventors, improve on the technique of Japanese Patent Application No. 2000-363547 described above. Instead of determining speaker switching positions precisely, the technique obtains the conversation frequency of each speaker over a relatively long section (on the order of several minutes) and, from that conversation frequency information, detects sections in which a desired speaker speaks relatively often. The technique also performs speaker verification by setting a threshold on the normalized quantization distortion, thereby reducing identification errors.

[0035] The technique of Japanese Patent Application No. 2001-177569 reduces identification errors for unspecified speaker data by comparing the normalized quantization distortion with a threshold for verification. It also obtains the statistical frequency of each registered speaker over a relatively long section and detects the statistical frequency of occurrence, so it can cope to some extent with recognition errors caused by background noise, simultaneous speech by several speakers, and the like.

[0036] In actual AV data, however, ideal portions in which a single speaker speaks in a low-noise environment are not very common; rather, recognition errors caused by background noise, simultaneous speech by several speakers, and the like occur frequently.

[0037] There has therefore been a demand for a technique that reduces recognition errors in portions containing much background noise or mixed speech from several speakers, detects the statistical conversation frequency of speakers in AV data effectively and efficiently, and uses the detected information to detect the conversation sections of a desired speaker.

[0038] The information detecting apparatus of the present embodiment is described in detail below. FIG. 1 shows a conceptual block diagram of the information detecting apparatus of the present embodiment. As shown in FIG. 1, in the information detecting apparatus, the voice signal contained in the information source is input to voiced sound block detecting means 1, which determines for each block whether it is a voiced sound block or an unvoiced sound block.

[0039] Recognition mode switching means 2 detects the background noise level of the voice signal and switches the recognition mode according to that level. Specifically, it switches to the voiced sound mode in portions where the background noise level is equal to or higher than a predetermined value, and to the voiced/unvoiced sound mode in portions where the background noise level is equal to or lower than the predetermined value.

[0040] Speaker identifying means 3 identifies the speaker by evaluating the vector quantization distortion of the features of the voice signal. When recognition mode switching means 2 has selected the voiced sound mode, speaker identification is performed on the blocks that voiced sound block detecting means 1 determined to be voiced sound blocks. When recognition mode switching means 2 has selected the voiced/unvoiced sound mode, speaker identification is performed on all blocks.
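The interaction between the recognition mode and the blocks passed to speaker identification can be sketched as below. This is an illustration under stated assumptions: how the background noise level is measured is not described at this point in the text, so it is taken as a given number; the names and the choice to assign the boundary case (level exactly equal to the predetermined value) to the voiced sound mode are also assumptions.

```python
VOICED_MODE = "voiced"
VOICED_UNVOICED_MODE = "voiced/unvoiced"


def select_mode(noise_level, threshold):
    """High background noise -> use voiced blocks only.

    The equality case is assigned to the voiced sound mode here,
    which is a choice of this sketch.
    """
    return VOICED_MODE if noise_level >= threshold else VOICED_UNVOICED_MODE


def blocks_for_identification(blocks, is_voiced, mode):
    """Return the blocks that speaker identification should evaluate.

    blocks:    per-block data (for example feature vectors).
    is_voiced: per-block flags from the voiced sound block detection.
    """
    if mode == VOICED_MODE:
        return [b for b, v in zip(blocks, is_voiced) if v]
    return list(blocks)  # voiced/unvoiced sound mode: every block is used


if __name__ == "__main__":
    blocks = ["b0", "b1", "b2", "b3"]
    flags = [True, False, True, False]
    noisy = select_mode(noise_level=0.8, threshold=0.5)
    quiet = select_mode(noise_level=0.1, threshold=0.5)
    print(noisy, blocks_for_identification(blocks, flags, noisy))
    print(quiet, blocks_for_identification(blocks, flags, quiet))
```

Dropping unvoiced blocks only when the noise level is high keeps the benefit of unvoiced sounds in clean passages while avoiding their confusion with noise elsewhere.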

[0041] The speaker identified by speaker identifying means 3 is input to speaker discrimination frequency calculating means 4, which calculates, for each predetermined evaluation section, the frequency with which each speaker was recognized within that section. The resulting speaker discrimination frequency is output as speaker appearance frequency information.
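A minimal sketch of the frequency calculation follows, assuming the identification results arrive as a flat sequence with one entry per recognition unit and that the frequency is expressed as a fraction of the units in each section. The function name, the fixed-length sectioning, and the use of `None` for units in which no speaker was identified are assumptions of this illustration.

```python
from collections import Counter


def discrimination_frequency(identified, section_len):
    """Count how often each speaker was identified within each section.

    identified:  sequence of speaker names, one per recognition unit;
                 None marks a unit with no identified speaker.
    section_len: number of recognition units per section.
    Returns one dict per section mapping speaker -> fraction of units.
    """
    result = []
    for start in range(0, len(identified), section_len):
        window = identified[start:start + section_len]
        counts = Counter(s for s in window if s is not None)
        result.append({s: n / len(window) for s, n in counts.items()})
    return result


if __name__ == "__main__":
    labels = ["A", "A", "B", None, "B", "B"]
    for i, freq in enumerate(discrimination_frequency(labels, section_len=3)):
        print(i, freq)
```

Because the output is a statistic over a section rather than a per-unit decision, occasional misidentifications in noisy units have a limited effect on the resulting appearance frequency.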

[0042] FIG. 2 shows a specific configuration example of the information detecting apparatus shown in FIG. 1. As shown in FIG. 2, an information detecting apparatus 10 comprises: an input unit 11 that inputs the voice signal of AV (Audio Visual) data; an LPC analysis unit 12 that performs LPC (Linear Predictive Coding) analysis on the voice signal block by block to obtain LPC coefficients; an LPC inverse filter 13 that uses the obtained LPC coefficients to obtain the LPC residual of the input voice signal for each block; a voiced sound determination unit 14 that determines for each block whether the voice signal is a voiced sound block; a recognition mode switching unit 15 that detects the background noise level of the voice signal and switches the recognition mode according to that level; a switching unit 16 for switching the subsequent processing; a cepstrum conversion unit 17 that converts the LPC coefficients to obtain LPC cepstrum coefficients; a vector quantization unit 18 that vector-quantizes the LPC cepstrum coefficients; a speaker identification unit 19 that evaluates the vector quantization distortion and identifies the speaker; and a speaker discrimination frequency calculation unit 20 that obtains the appearance frequency of each speaker from the discrimination frequency of the recognized speakers.

【００４３】The input unit 11 receives the audio signal D11 of the AV data and supplies it, in fixed block units, to the LPC analysis unit 12, the LPC inverse filter 13, and the voiced sound determination unit 14. The input unit 11 also supplies the audio signal D11 to the recognition mode switching unit 15.

【００４４】The LPC analysis unit 12 performs LPC analysis on each supplied block of the audio signal D11 and supplies the resulting LPC coefficients D12 to the LPC inverse filter 13 and the switching unit 16.

【００４５】The LPC inverse filter 13 applies LPC inverse filtering to the audio signal D11 to obtain the LPC residual signal D13, and supplies the LPC residual signal D13 to the voiced sound determination unit 14.
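The LPC analysis and inverse filtering performed by units 12 and 13 can be sketched as follows. This is only an illustration: the source does not say how the LPC coefficients are estimated, so the sketch assumes the autocorrelation method with the Levinson-Durbin recursion and the sign convention A(z) = 1 + Σ a_k z^-k. The function names are hypothetical.

```python
def autocorr(x, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of one block."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(x, order):
    """Return [1, a_1, ..., a_p] for A(z) = 1 + sum_k a_k z^-k (Levinson-Durbin)."""
    r = autocorr(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate block: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_residual(x, a):
    """Inverse filter: e[n] = sum_k a[k] * x[n-k], taking x[n] = 0 for n < 0."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]
```

For a decaying exponential such as `x = [0.9 ** n for n in range(200)]`, a first-order analysis yields a coefficient close to -0.9 and a residual that is essentially a single impulse at the start of the block.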

【００４６】The voiced sound determination unit 14 determines, based on the audio signal D11 and the LPC residual signal D13, whether each block of the audio signal D11 is a voiced sound block, and supplies the determination result D14 to the switching unit 16. The voiced sound block determination is described in detail later.

【００４７】The recognition mode switching unit 15 detects the background noise level of the audio signal D11 and switches the recognition mode according to that level. Specifically, it switches to the voiced sound mode for portions where the background noise level is equal to or higher than a predetermined value, and to the voiced/unvoiced sound mode for portions where the background noise level is equal to or lower than the predetermined value. The recognition mode switching unit 15 also supplies this recognition mode information D15 to the switching unit 16.
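The mode switch described for unit 15 amounts to a single threshold comparison, sketched below. The names `RecognitionMode` and `select_mode` are hypothetical, and because the source assigns the boundary value to both modes, the sketch assumes that a level exactly equal to the threshold selects the voiced sound mode.

```python
from enum import Enum


class RecognitionMode(Enum):
    VOICED = "voiced"                    # unvoiced blocks are not used
    VOICED_UNVOICED = "voiced/unvoiced"  # all blocks are used


def select_mode(noise_level, threshold):
    """Pick the recognition mode from an estimated background noise level."""
    # Assumption: a level equal to the threshold counts as noisy.
    if noise_level >= threshold:
        return RecognitionMode.VOICED
    return RecognitionMode.VOICED_UNVOICED
```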

【００４８】Although it is desirable to switch the recognition mode in units of the speaker recognition block RB or the frequency interval FI, both described later, this embodiment places no restriction on the switching timing. Switching need not be automatic; the mode may also be switched manually according to the condition of the audio signal D11. As a method for detecting the presence and level of background noise, one can use, for example, the method described in Japanese Unexamined Patent Application Publication No. 5-297896, in which voiced/unvoiced discrimination is performed on the signal in each of several frequency bands and the background noise level is estimated from the number of bands judged to be unvoiced and the average power, or the method described in Japanese Unexamined Patent Application Publication No. 11-119796, in which the background noise level is estimated by tracking a reference level and a minimum level.

【００４９】The switching unit 16 switches the downstream processing based on the determination result D14 supplied from the voiced sound determination unit 14 and the recognition mode information D15 supplied from the recognition mode switching unit 15. Specifically, when the recognition mode information D15 indicates the voiced sound mode, the switching unit 16 supplies to the cepstrum conversion unit 17 only the LPC coefficients D12 of blocks whose determination result D14 is voiced; when the recognition mode information D15 indicates the voiced/unvoiced sound mode, it supplies the LPC coefficients D12 of all blocks to the cepstrum conversion unit 17.
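The routing performed by the switching unit 16 can be sketched as a filter over per-block results. The function name and the boolean `voiced_only` flag, which stands in for the recognition mode information D15, are assumptions made for illustration.

```python
def blocks_for_cepstrum(lpc_blocks, voiced_flags, voiced_only):
    """Select which blocks' LPC coefficients go on to cepstrum conversion.

    lpc_blocks:   per-block LPC coefficient lists (D12)
    voiced_flags: per-block voiced/unvoiced decisions (D14)
    voiced_only:  True in the voiced sound mode, False in the
                  voiced/unvoiced sound mode
    """
    if len(lpc_blocks) != len(voiced_flags):
        raise ValueError("one voiced flag is required per block")
    if not voiced_only:
        return list(lpc_blocks)  # voiced/unvoiced mode: pass every block
    return [c for c, v in zip(lpc_blocks, voiced_flags) if v]
```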

【００５０】The cepstrum conversion unit 17 converts the supplied LPC coefficients D12 into LPC cepstrum coefficients D16 and supplies the LPC cepstrum coefficients D16 to the vector quantization unit 18.
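The conversion from LPC coefficients to LPC cepstrum coefficients is commonly done with a short recursion, sketched here. The source does not give the conversion formula, so this assumes the standard recursion for the all-pole model 1/A(z) with A(z) = 1 + Σ a_k z^-k; the function name is hypothetical.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n_ceps of 1/A(z), where a = [1, a_1, ..., a_p]."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)  # c[0] is unused
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= p else 0.0
        # Only lags with 1 <= n - k <= p contribute.
        for k in range(max(1, n - p), n):
            acc -= (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]
```

For a first-order model with a_1 = -0.9 the cepstrum is 0.9^n / n, which gives a quick sanity check of the recursion.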

【００５１】The vector quantization unit 18 vector-quantizes the LPC cepstrum coefficients D16 with each speaker's codebook data D17 from the codebook group CB, and supplies the result of vector quantization with each codebook (the quantization distortion) D18 to the speaker identification unit 19.
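The per-codebook quantization distortion can be sketched as the distance from a feature vector to its nearest codeword. The source does not name the distance measure, so squared Euclidean distance is assumed here, and the function names are hypothetical.

```python
def vq_distortion(vector, codebook):
    """Distortion of one feature vector under one codebook.

    Assumption: squared Euclidean distance to the nearest codeword.
    """
    return min(sum((v - c) ** 2 for v, c in zip(vector, codeword))
               for codeword in codebook)


def distortions_per_speaker(vector, codebooks):
    """Map each registered speaker to the distortion under that speaker's codebook."""
    return {speaker: vq_distortion(vector, cb)
            for speaker, cb in codebooks.items()}
```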

【００５２】The speaker identification unit 19 evaluates the vector quantization distortion D18 and, using threshold data D19 read from the threshold table file TF, performs speaker identification and verification for each predetermined recognition block. The speaker identification and verification are described in detail later. The speaker identification unit 19 supplies the identified speaker D20 to the speaker discrimination frequency calculation unit 20.

【００５３】Based on the identified speaker D20, the speaker discrimination frequency calculation unit 20 calculates, for each predetermined evaluation interval, the discrimination frequency with which each speaker was recognized within that interval. The resulting speaker discrimination frequency is recorded as speaker appearance frequency information D21 in the speaker frequency file SF for each item of AV data, each speaker, and each evaluation interval, for example in the record format shown in FIG. 3. The speaker frequency file SF may be transmitted over a communication line by a transmitting/receiving unit (not shown), or may be stored on a storage medium such as a magnetic disk, a magneto-optical disk, or a semiconductor memory.

【００５４】The record format of FIG. 3 contains the following information: the AV data name of the audio signal input from the input unit 11, an identification name that identifies each registered speaker, the start time of the frequency interval, the end time of that interval, and the discrimination frequency of that speaker in that frequency interval of that AV data. This record format is only an example, and the recorded information is not limited to that shown in FIG. 3.
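One record of the kind listed above could be represented as a small data class. The field names, units, and types are assumptions chosen for illustration; the source only lists which items of information a record holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakerFrequencyRecord:
    """One entry of the speaker frequency file SF (hypothetical field names)."""
    av_data_name: str    # name of the AV data the audio came from
    speaker_id: str      # identification name of a registered speaker
    start_time_s: float  # start time of the frequency interval, in seconds
    end_time_s: float    # end time of the frequency interval, in seconds
    frequency: int       # discrimination frequency in this interval

    @property
    def interval_length_s(self):
        return self.end_time_s - self.start_time_s
```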

【００５５】The following describes, with reference to FIG. 4, the processing unit for LPC analysis, the processing unit for speaker identification, and the processing unit for obtaining the speaker discrimination frequency in the configuration described above.

【００５６】In the LPC analysis unit 12 shown in FIG. 2, the audio signal of the input AV data is LPC-analyzed in units of the LPC analysis block AB shown in FIG. 4, and the resulting LPC coefficients are converted to extract LPC cepstrum coefficients. For speech signals, the block length a of the LPC analysis block AB is typically about 20 to 30 milliseconds. To improve analysis performance, adjacent blocks are often made to overlap slightly. The determination of whether a block is a voiced sound block is also made in units of the LPC analysis block AB.

【００５７】The speaker recognition block RB is the minimum unit for identifying a speaker, and speaker identification is performed block by block. The block length b of the speaker recognition block RB is preferably on the order of several seconds. Accordingly, one speaker recognition block RB suitably contains about 50 to several hundred LPC analysis blocks AB. The speaker recognition block RB may also overlap slightly with adjacent blocks. The overlap length is typically about 10% to 50% of the block length.

【００５８】The frequency interval FI is the evaluation unit for obtaining the appearance frequency of speakers; within an interval, the appearance frequency of each speaker is obtained from the discrimination frequency of the speakers identified in each speaker recognition block RB. The frequency interval FI_I has start time S_I and end time E_I, and the interval length (E_I - S_I) is suitably on the order of several minutes to several tens of minutes. Accordingly, one frequency interval FI suitably contains about 30 to 300 speaker recognition blocks RB. The frequency interval FI may also overlap slightly with adjacent intervals.
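The three nested processing units (AB inside RB inside FI) can all be produced by the same overlapping segmentation, sketched below. The helper name and the use of integer milliseconds are assumptions made for illustration; any trailing partial block is simply dropped here, a choice the source does not address.

```python
def segment(total, length, overlap):
    """Split [0, total) into blocks of `length` that overlap by `overlap`.

    All arguments share one integer unit (for example milliseconds).
    Returns a list of (start, end) pairs; a trailing partial block is dropped.
    """
    hop = length - overlap
    if hop <= 0:
        raise ValueError("overlap must be smaller than the block length")
    blocks = []
    start = 0
    while start + length <= total:
        blocks.append((start, start + length))
        start += hop
    return blocks
```

As a worked example, a 3-second speaker recognition block split into non-overlapping 25 ms analysis blocks gives `len(segment(3000, 25, 0)) == 120`, which falls inside the range of 50 to several hundred blocks stated above.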

【００５９】FIG. 5 is a flowchart showing the operation of the information detecting apparatus 10. First, in step S10, the interval number I is set to 0 as initialization. The interval number I is a serial number assigned to each frequency interval FI for which the speaker frequency is obtained.

【００６０】Next, in step S11, the currently specified recognition mode is determined, that is, whether the recognition mode is the voiced sound mode or the voiced/unvoiced sound mode.

【００６１】Next, in step S12, a speaker candidate is identified and selected for each speaker recognition block RB described above. When the recognition mode is the voiced sound mode, only the LPC analysis blocks AB determined to be voiced sound blocks are used for speaker recognition. The method of selecting the speaker candidate is described in detail later.

【００６２】In step S13, it is verified whether the selected speaker candidate is the correct speaker. When speech from an unknown, unspecified speaker is input, or when non-speech data is input because of a voiced sound block determination error, the candidate selected in step S12 is merely the registered speaker most similar to the input speech and is not necessarily the actual speaker. Therefore, in step S13, the vector quantization distortion is evaluated and compared with the threshold data recorded in the threshold table file TF shown in FIG. 2 to determine whether the data really belongs to the selected speaker candidate. The determination method is described in detail later. If step S13 determines that the data belongs to the candidate, the candidate is confirmed as the speaker for this speaker recognition block RB; otherwise, the speaker for this speaker recognition block RB is confirmed as an unknown speaker. This verification need not always be performed.

【００６３】Next, in step S14, it is determined whether processing has reached the last speaker recognition block RB of the frequency interval FI_I. If it is not the last speaker recognition block RB, the process advances to the next speaker recognition block RB in step S15 and returns to step S11. If step S14 determines that it is the last speaker recognition block RB, the process proceeds to step S16.

【００６４】In step S16, the discrimination frequency of each registered speaker in the current frequency interval FI_I is obtained as appearance frequency information. Speaker recognition blocks RB judged to belong to an unknown speaker are not included in the frequency calculation. The obtained speaker appearance frequency is recorded in the speaker frequency file SF shown in FIG. 2 in a record format such as that of FIG. 3.
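The frequency calculation of step S16 can be sketched as a count over the per-block decisions of one frequency interval. The sketch assumes that the discrimination frequency is a plain count of speaker recognition blocks and that an unknown speaker is represented by `None`; both are illustrative choices, and a normalized ratio would work equally well.

```python
from collections import Counter

UNKNOWN = None  # assumed marker for a block judged to be an unknown speaker


def discrimination_frequency(block_speakers):
    """Count, per registered speaker, the blocks attributed to that speaker.

    block_speakers: one decision per speaker recognition block RB in the
    current frequency interval. Unknown-speaker blocks are left out.
    """
    return Counter(s for s in block_speakers if s is not UNKNOWN)
```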

【００６５】In step S17, it is determined whether the end of the data has been reached. If so, the process ends; if not, the process proceeds to step S18.

【００６６】In step S18, the interval number I is incremented by one, the process advances to the next frequency interval, and it returns to step S11. Steps S11 through S18 are then repeated in the same manner until the end of the data is reached.

【００６７】Next, FIG. 6 shows the details of the method for identifying the speaker candidate in step S12 of FIG. 5. First, in step S20, audio data is read from the input data one LPC analysis block AB at a time.

【００６８】Next, in step S21, it is determined whether processing has been completed through the last LPC analysis block AB of the speaker recognition block RB. If the last LPC analysis block AB has been processed, the process proceeds to step S30; otherwise, it proceeds to step S22.

【００６９】In step S22, LPC analysis is applied to the data of the LPC analysis block AB to obtain LPC coefficients. Next, in step S23, it is determined whether the recognition mode is the voiced sound mode. If it is, the process proceeds to step S24; if it is not, that is, if the recognition mode is the voiced/unvoiced sound mode, the process proceeds to step S27.

【００７０】Next, in step S24, LPC inverse filtering is applied to the audio signal using the LPC coefficients obtained in step S22 to obtain the LPC residual signal.

【００７１】Next, in step S25, it is determined, in the manner described later, whether this block is a voiced sound block, based on the input audio signal and the LPC residual signal obtained in step S24. The process then proceeds to step S26.

【００７２】In step S26, the result of the determination in step S25 is checked. If the block is a voiced sound block, the process proceeds to step S27; otherwise, it proceeds to step S29.

【００７３】Next, in step S27, the LPC coefficients obtained in step S22 are converted into LPC cepstrum coefficients. Here, low-order cepstrum coefficients of about the 1st to 16th order are obtained.

【００７４】Next, in step S28, the LPC cepstrum coefficients obtained in step S27 are vector-quantized with each of a plurality of codebooks prepared in advance. Each codebook corresponds one-to-one to a registered speaker. Let d_k denote the vector quantization distortion of this block's LPC cepstrum coefficients under codebook CB_k.

【００７５】In step S29, the process advances to the next LPC analysis block AB and returns to step S20; steps S20 through S29 are then repeated in the same manner.

【００７６】In step S30, the average quantization distortion D_k is obtained for each codebook CB_k as the mean of its quantization distortion d_k over the entire speaker recognition block RB.

【００７７】Next, in step S31, the codebook CB_k' corresponding to the speaker S_k' that minimizes the average quantization distortion D_k is selected, and in step S32 this speaker S_k' is output as the speaker candidate S_c.

【００７８】In this way, among the speakers whose codebooks are registered, the speaker whose voice is most similar to the input data is selected as the speaker candidate S_c for that speaker recognition block RB.
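Steps S30 to S32 reduce to averaging the per-block distortions for each codebook and taking the minimum, as sketched below. The function name and the dictionary layout (one mapping from speaker to d_k per analysis block) are assumptions made for illustration.

```python
def select_candidate(block_distortions):
    """Return (S_c, D) for one speaker recognition block RB.

    block_distortions: list with one {speaker: d_k} dict per LPC analysis
    block AB that was actually used for recognition.
    D maps each speaker to the average quantization distortion D_k.
    """
    if not block_distortions:
        return None, {}  # no usable block, so no candidate
    count = len(block_distortions)
    speakers = block_distortions[0].keys()
    avg = {s: sum(b[s] for b in block_distortions) / count for s in speakers}
    candidate = min(avg, key=avg.get)  # smallest average distortion wins
    return candidate, avg
```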

【００７９】The method for determining a voiced sound block in step S25 described above is now explained with reference to the flowchart of FIG. 7. This corresponds to the operation of the voiced sound determination unit 14 in FIG. 2. The method shown below is only an example, and the method for determining a voiced sound block is not limited to it.

【００８０】First, in step S40, the autocorrelation function r(t) of the LPC residual signal of the audio signal is obtained. The autocorrelation function r(t) may be computed directly from the LPC residual signal, or it may be obtained by passing the LPC residual signal through a low-pass filter, applying a fast Fourier transform, and taking the inverse Fourier transform of the power spectrum of the residual signal.

【００８１】Next, in step S41, the peaks of the autocorrelation function r(t) are found, and each peak is normalized by the zeroth-order autocorrelation r(0) to obtain normalized peaks. The n normalized peaks are denoted, in descending order, r_0, r_1, r_2, ..., r_n (1 = r_0 ≥ r_1 ≥ r_2 ≥ ... ≥ r_n).
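Steps S40 and S41 can be sketched as follows, using the direct computation of r(t) rather than the FFT route. The function name, the `max_lag` parameter, and the local-maximum test used to define a peak are assumptions; the source does not say how peaks are located.

```python
def normalized_peaks(residual, max_lag):
    """Normalized autocorrelation peaks of an LPC residual block.

    Returns [r_0, r_1, ...] in descending order with r_0 = 1.0, or an
    empty list for an all-zero block. Requires max_lag < len(residual).
    """
    n = len(residual)
    r = [sum(residual[i] * residual[i + t] for i in range(n - t))
         for t in range(max_lag + 1)]
    if r[0] <= 0.0:
        return []
    # Assumed peak definition: a local maximum of r(t) for 1 <= t < max_lag.
    peaks = [r[t] / r[0] for t in range(1, max_lag)
             if r[t] > r[t - 1] and r[t] >= r[t + 1]]
    return [1.0] + sorted(peaks, reverse=True)
```

A signal with a period of four samples produces its strongest non-zero-lag peak at lag 4, which is the behaviour expected of a voiced block with a clear pitch period.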

【００８２】Next, in step S42, the number of zero crossings Z within the block and the average energy E within the block are obtained for the input audio signal.
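The two block features of step S42 can be computed in one pass each, as in this sketch. The function name is hypothetical, a sample equal to zero is treated as non-negative when counting sign changes, and the average energy is taken as the mean squared sample value; the source does not fix either convention.

```python
def zero_crossings_and_energy(block):
    """Return (Z, E): zero-crossing count and mean squared amplitude."""
    # Count sign changes between neighbouring samples (zero counts as >= 0).
    z = sum(1 for p, q in zip(block, block[1:]) if (p >= 0) != (q >= 0))
    e = sum(v * v for v in block) / len(block) if block else 0.0
    return z, e
```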

【００８３】In step S43, the normalized peak values, the number of zero crossings Z, and the average energy E are evaluated together to obtain an overall voiced-sound likelihood R. Specifically, the autocorrelation peak, the number of zero crossings Z, and the average energy E are each first fitted to a suitable function expressing voiced-sound likelihood. In general, a block can be judged more likely to be voiced the larger its autocorrelation peak, the smaller its number of zero crossings Z, and the larger its average energy.

【００８４】As an example, the voiced-sound likelihood Prob of each feature is obtained as in equations (3), (4), and (5) below, where k_1 to k_6 are predetermined coefficients determined experimentally.

【００８５】[0085]

【数３】 [Equation 3]

【００８６】Further, the individual voiced-sound likelihoods are suitably weighted to obtain the overall voiced-sound likelihood R (0 ≤ R ≤ 1). For example, it can be obtained as in equation (6) below, with α, β, and γ as suitable coefficients.

【００８７】[0087]

【数４】 [Equation 4]

【００８８】In step S44, it is determined whether the overall voiced-sound likelihood R obtained in step S43 is equal to or greater than a predetermined threshold. If it is, the block is determined to be a voiced sound block in step S45 and the determination process ends. If it is not, the block is determined not to be a voiced sound block in step S46 and the determination process ends.
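The decision of steps S43 to S46 can be illustrated with the sketch below. Equations (3) to (6) are not reproduced in this text, so everything here is a stand-in rather than the patent's formulas: the logistic form of the per-feature scores, the numeric defaults for k_1 to k_6, the equal weights, the use of a zero-crossing rate (Z divided by block length) instead of the raw count, and the threshold of 0.5 are all assumptions. Only the qualitative behaviour follows the description: a larger peak, fewer zero crossings, and larger energy raise R.

```python
import math


def logistic(x, slope, midpoint):
    """Smooth score in (0, 1); rises with x if slope > 0, falls if slope < 0."""
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))


def voiced_likelihood(peak, zc_rate, energy,
                      k=(10.0, 0.5, -20.0, 0.3, 50.0, 0.05),
                      weights=(1.0, 1.0, 1.0)):
    """Overall likelihood R in [0, 1] as a normalized weighted sum (assumed form)."""
    k1, k2, k3, k4, k5, k6 = k
    scores = (
        logistic(peak, k1, k2),     # larger autocorrelation peak: more voiced
        logistic(zc_rate, k3, k4),  # fewer zero crossings: more voiced
        logistic(energy, k5, k6),   # larger average energy: more voiced
    )
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def is_voiced_block(peak, zc_rate, energy, threshold=0.5):
    """Voiced if R is at or above the threshold (steps S44 to S46)."""
    return voiced_likelihood(peak, zc_rate, energy) >= threshold
```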

【００８９】Next, FIG. 8 shows the details of the method for verifying the speaker candidate S_c in step S13 of FIG. 5. First, in step S50, the average quantization distortion of the speaker candidate S_c is denoted D_0. Next, in step S51, the average quantization distortions obtained with the codebooks other than that of the speaker candidate S_c are sorted in ascending order, and the n smallest are denoted D_1, D_2, ..., D_n (D_0 < D_1 < D_2 < ... < D_n). The value of n can be chosen arbitrarily.

【００９０】Next, in step S52, as the evaluation measure, the distortion difference ΔD is obtained from the quantization distortion D_0 of the speaker candidate S_c and the other n quantization distortions, using equation (7) or equation (8) below.

【００９１】[0091]

【数５】 [Equation 5]

【００９２】In equations (7) and (8), when n is 1, for example, the result is the difference in quantization distortion between D_0 and D_1, the next-smallest quantization distortion after that of the speaker candidate S_c.

【００９３】Next, in step S53, the threshold data corresponding to the speaker candidate S_c is read from the threshold table file TF shown in FIG. 2.

【００９４】The threshold table file TF holds a record for each registered speaker, for example in the format shown in FIG. 9. That is, as shown in FIG. 9, the speaker identification name of each registered speaker and the threshold data, namely the maximum absolute quantization distortion D_max and the minimum distortion difference ΔD_min, are recorded in advance.

【００９５】Returning to FIG. 8, in step S54 the threshold data D_max and ΔD_min that were read are compared with the obtained D_0 and ΔD. That is, if the absolute quantization distortion D_0 is smaller than the threshold D_max and the distortion difference ΔD is larger than the threshold ΔD_min, the process proceeds to step S55, where the candidate is judged to be the actual speaker and is confirmed. Otherwise, the process proceeds to step S56, where the speaker is judged to be unknown and the candidate is rejected. Comparing the average quantization distortion D_0 of the speaker candidate S_c and the distortion difference ΔD with their respective thresholds in this way reduces identification errors for the speech of registered speakers and makes it possible to classify speech from anyone other than the registered speakers as an unknown speaker.
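The verification of steps S50 to S56 can be sketched as two threshold tests. Equations (7) and (8) are not reproduced in this text, so the sketch assumes ΔD is the mean of (D_j - D_0) over the n nearest competitors; this reduces to D_1 - D_0 for n = 1, the one case the description spells out. The function and parameter names are hypothetical.

```python
def verify_candidate(avg_distortion, candidate, d_max, delta_d_min, n=1):
    """Accept the candidate only if D_0 < D_max and delta_D > delta_D_min.

    avg_distortion: {speaker: D_k} for one speaker recognition block RB
    d_max, delta_d_min: thresholds registered for the candidate speaker
    """
    d0 = avg_distortion[candidate]
    others = sorted(d for s, d in avg_distortion.items() if s != candidate)[:n]
    if not others:
        raise ValueError("at least one other codebook is required")
    # Assumed form of delta_D: mean gap to the n closest other codebooks.
    delta_d = sum(d - d0 for d in others) / len(others)
    return d0 < d_max and delta_d > delta_d_min
```

A return value of `False` corresponds to rejecting the candidate and labelling the block as an unknown speaker.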

【００９６】As described above, the information detecting apparatus 10 of this embodiment identifies the speaker for each speaker recognition block based on the features of the speaker's voice in the audio signal contained in the AV data, detects the appearance frequency of each speaker in a predetermined interval, and generates speaker appearance frequency information. By switching whether unvoiced sound blocks are used for recognition according to the background noise level, the apparatus identifies speakers effectively in noise-free portions by using both voiced and unvoiced portions, and reduces recognition errors in portions containing background noise by excluding unvoiced portions from recognition.

【００９７】Furthermore, by using the speaker appearance frequency information created in this way, the intervals in which a desired speaker is talking can be detected effectively and with a lower error rate, in a manner suited to the environment, both in portions of the audio signal without background noise and in portions containing multiple speakers or background noise.

【００９８】The present invention is not limited to the embodiment described above, and various modifications can of course be made without departing from the gist of the invention.

【００９９】For example, although the above description assumes that the recognition mode is switched, the invention is not limited to this, and speaker recognition may be performed using only voiced portions. This simplifies the configuration of the information detecting apparatus.

【０１００】[0100]

【Effects of the Invention】As described in detail above, the information detecting apparatus according to the present invention is an information detecting apparatus for detecting predetermined information from a predetermined information source, comprising: voiced sound block detection means for analyzing an audio signal contained in the information source and detecting voiced portions of a speaker; speaker identification means for identifying the speaker in each predetermined evaluation interval, using the detected voiced sound blocks, based on the similarity of features of the audio signal; and speaker discrimination frequency calculation means for obtaining, for each predetermined frequency interval, discrimination frequency information of the speakers identified in each evaluation interval. The apparatus thereby detects appearance frequency information of the speakers in the frequency interval.

【０１０１】Here, the information detecting apparatus may include recognition mode switching means for switching between a voiced sound mode, in which unvoiced portions of the audio signal are not used for recognition, and a voiced/unvoiced sound mode, in which unvoiced portions are used for recognition. In the voiced/unvoiced sound mode, the speaker identification means identifies the speaker in each predetermined evaluation interval, using both voiced sound blocks and unvoiced sound blocks, based on the similarity of features of the audio signal.

【０１０２】In the information detecting apparatus, the LPC cepstrum obtained by LPC analysis is used as the feature for evaluating the similarity of voices in the audio signal of the information source, vector quantization of the features with a plurality of codebooks is used as the identification technique, and the resulting vector quantization distortion is used as the identification measure.

[0103] Such an information detecting apparatus identifies the speaker in each evaluation section based on the feature amounts of the speaker's voice in the voice signal, and counts, for each frequency section, how often each speaker was determined in the evaluation sections, thereby generating the appearance frequency information of the speaker. Because the apparatus detects the voiced sound portions of the speaker and identifies the speaker using the detected voiced sound blocks, portions in which background noise is mixed with unvoiced sound, and portions in which the unvoiced sound of another speaker is mixed in, can be excluded from recognition, which reduces recognition errors.

[0104] As a result, the conversation sections of a desired speaker can be detected effectively, with a lower error rate, even in a voice signal that includes multiple speakers or background noise.

[0105] When the apparatus can switch between the voiced sound mode, in which the unvoiced sound portions of the voice signal are not used for recognition, and the voiced/unvoiced sound mode, in which the unvoiced sound portions are used for recognition, the voiced/unvoiced sound mode may be selected for portions without background noise, so that the speaker is identified using both voiced sound blocks and unvoiced sound blocks.

[0106] In this way, in portions without background noise, the speaker can be identified effectively by using both voiced sound blocks and unvoiced sound blocks.
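The mode switching described above can be sketched as follows. The sketch assumes that each analysis block already carries a voiced/unvoiced flag and that the presence of background noise is supplied as a boolean from elsewhere; how noise is detected is not stated here, and the names `RecognitionMode`, `choose_mode` and `select_blocks` are hypothetical.

```python
from enum import Enum


class RecognitionMode(Enum):
    VOICED = "voiced"                    # unvoiced blocks are not used
    VOICED_UNVOICED = "voiced_unvoiced"  # voiced and unvoiced blocks are used


def choose_mode(background_noise_present):
    """Use unvoiced blocks only where no background noise is present."""
    if background_noise_present:
        return RecognitionMode.VOICED
    return RecognitionMode.VOICED_UNVOICED


def select_blocks(blocks, mode):
    """Return the features handed to speaker identification.

    blocks : list of (feature, is_voiced) pairs, one per analysis block
    """
    if mode is RecognitionMode.VOICED_UNVOICED:
        return [feature for feature, _ in blocks]
    return [feature for feature, is_voiced in blocks if is_voiced]


if __name__ == "__main__":
    blocks = [("v1", True), ("u1", False), ("v2", True)]
    print(select_blocks(blocks, choose_mode(background_noise_present=True)))
    print(select_blocks(blocks, choose_mode(background_noise_present=False)))
```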

[0107] To achieve the object described above, the information detecting method according to the present invention detects predetermined information from a predetermined information source and comprises: a voiced sound block detecting step of analyzing a voice signal included in the information source to detect the voiced sound portions of a speaker; a speaker identifying step of identifying the speaker in each predetermined evaluation section, using the detected voiced sound blocks, based on the similarity of the feature amounts of the voice signal; and a speaker discrimination frequency calculating step of obtaining, for each predetermined frequency section, discrimination frequency information of the speaker identified in each evaluation section. The method thereby detects appearance frequency information of the speaker in the frequency section.
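The speaker discrimination frequency calculating step can be illustrated with a short sketch. It assumes that the speaker identifying step has already produced one decision per evaluation section (a speaker label, or `None` when no speaker was determined), that frequency sections are non-overlapping and contain a fixed number of evaluation sections, and that the frequency is normalized by that number. These are assumptions made for illustration, and `appearance_frequency` is a hypothetical name.

```python
from collections import Counter


def appearance_frequency(decisions, sections_per_frequency_section):
    """Count how often each speaker was determined in each frequency section.

    decisions : one speaker label (or None) per evaluation section, in time order
    Returns one dict per frequency section mapping speaker -> relative frequency.
    """
    results = []
    step = sections_per_frequency_section
    for start in range(0, len(decisions), step):
        window = decisions[start:start + step]
        counts = Counter(d for d in window if d is not None)
        results.append({spk: n / len(window) for spk, n in counts.items()})
    return results


if __name__ == "__main__":
    decisions = ["A", "A", "B", None, "B", "B", "B", "A"]
    for i, freq in enumerate(appearance_frequency(decisions, 4)):
        print(i, freq)
```

Sections in which no speaker was determined still count toward the denominator, so the frequencies of one frequency section need not sum to 1.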

[0108] The information detecting method may further comprise a recognition mode switching step of switching between a voiced sound mode, in which the unvoiced sound portions of the voice signal are not used for recognition, and a voiced/unvoiced sound mode, in which the unvoiced sound portions are used for recognition. In the voiced/unvoiced sound mode, the speaker identifying step identifies the speaker in each predetermined evaluation section, using both voiced sound blocks and unvoiced sound blocks, based on the similarity of the feature amounts of the voice signal.

[0109] In the information detecting method, the LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of voices in the voice signal of the information source, vector quantization of the feature amount with a plurality of codebooks is used as the identification method, and the resulting vector quantization distortion is used as the identification measure.
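As an illustration of how an LPC cepstrum can be derived from LPC coefficients, the sketch below uses the standard recursion for the cepstrum of an all-pole model 1/A(z). It assumes the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p and omits the gain term c_0; the convention, the number of cepstral coefficients and the name `lpc_to_cepstrum` are assumptions and are not taken from this document.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients to LPC cepstral coefficients.

    a      : [a_1, ..., a_p] with A(z) = 1 + sum_k a_k z^-k (assumed convention)
    n_ceps : number of cepstral coefficients c_1 .. c_n_ceps to return
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] (gain term) is left unused
    for n in range(1, n_ceps + 1):
        acc = 0.0
        # Only terms with 1 <= n - k <= p contribute to the sum.
        for k in range(max(1, n - p), n):
            acc += k * c[k] * a[n - k - 1]
        a_n = a[n - 1] if n <= p else 0.0
        c[n] = -a_n - acc / n
    return c[1:]


if __name__ == "__main__":
    # Single-pole model 1 / (1 - 0.5 z^-1); its cepstrum is 0.5**n / n.
    print(lpc_to_cepstrum([-0.5], 4))
```

The single-pole case is a convenient check because its cepstrum has the closed form alpha^n / n for a pole at alpha.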

[0110] Such an information detecting method identifies the speaker in each evaluation section based on the feature amounts of the speaker's voice in the voice signal, and counts, for each frequency section, how often each speaker was determined in the evaluation sections, thereby generating the appearance frequency information of the speaker. Because the method detects the voiced sound portions of the speaker and identifies the speaker using the detected voiced sound blocks, portions in which background noise is mixed with unvoiced sound, and portions in which the unvoiced sound of another speaker is mixed in, can be excluded from recognition, which reduces recognition errors.

[0111] As a result, the conversation sections of a desired speaker can be detected effectively, with a lower error rate, even in a voice signal that includes multiple speakers or background noise.

[0112] When the method can switch between the voiced sound mode, in which the unvoiced sound portions of the voice signal are not used for recognition, and the voiced/unvoiced sound mode, in which the unvoiced sound portions are used for recognition, the voiced/unvoiced sound mode may be selected for portions without background noise, so that the speaker is identified using both voiced sound blocks and unvoiced sound blocks.

[0113] In this way, in portions without background noise, the speaker can be identified effectively by using both voiced sound blocks and unvoiced sound blocks.

[Brief description of drawings]

FIG. 1 is a diagram illustrating the conceptual configuration of the information detecting apparatus according to the present embodiment.

FIG. 2 is a diagram illustrating a configuration example of the information detecting apparatus.

FIG. 3 is a diagram illustrating an example of the recording format of the speaker appearance frequency information in the information detecting apparatus.

FIG. 4 is a diagram illustrating the relationship among the frequency section, the speaker recognition block, and the LPC analysis block in the information detecting apparatus.

FIG. 5 is a flowchart illustrating the operation of the information detecting apparatus.

FIG. 6 is a flowchart illustrating the speaker identification process performed for each speaker recognition block in the information detecting apparatus.

FIG. 7 is a flowchart illustrating the voiced sound block determination process in the information detecting apparatus.

FIG. 8 is a flowchart illustrating the speaker verification determination process in the information detecting apparatus.

FIG. 9 is a diagram illustrating an example of the recording format of the threshold data for speaker verification determination in the information extraction device.

[Explanation of symbols]

1 voiced sound block detecting means, 2 recognition mode switching means, 3 speaker identifying means, 4 speaker discrimination frequency calculating means, 10 information detecting apparatus, 11 input unit, 12 LPC analysis unit, 13 LPC inverse filter, 14 voiced sound determination unit, 15 recognition mode switching unit, 16 switching unit, 17 cepstrum extraction unit, 18 vector quantization unit, 19 speaker identification unit, 20 speaker discrimination frequency calculation unit

Continuation of front page: (51) Int.Cl.⁷, identification code, FI, theme code (reference): G10L 17/00, G10L 3/00 545A, 9/08 301A, 9/12 301A

Claims

[Claims]

1. An information detecting apparatus for detecting predetermined information from a predetermined information source, comprising: voiced sound block detecting means for analyzing a voice signal included in the information source to detect a voiced sound portion of a speaker; speaker identifying means for identifying the speaker for each predetermined evaluation section, using the detected voiced sound block, based on the similarity of a feature amount of the voice signal; and speaker discrimination frequency calculating means for obtaining, for each predetermined frequency section, discrimination frequency information of the speaker identified for each evaluation section, wherein appearance frequency information of the speaker in the frequency section is detected.

2. The information detecting apparatus according to claim 1, further comprising recognition mode switching means for switching between a voiced sound mode, in which an unvoiced sound portion in the voice signal is not used for recognition, and a voiced/unvoiced sound mode, in which the unvoiced sound portion is used for recognition, wherein, in the voiced/unvoiced sound mode, the speaker identifying means identifies the speaker for each predetermined evaluation section, using the voiced sound block and an unvoiced sound block, based on the similarity of the feature amount of the voice signal.

3. The information detecting apparatus according to claim 1, wherein an LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of voices in the voice signal of the information source, vector quantization of the feature amount with a plurality of codebooks is used as the identification method, and the resulting vector quantization distortion is used as the identification measure.

4. An information detecting method for detecting predetermined information from a predetermined information source, comprising: a voiced sound block detecting step of analyzing a voice signal included in the information source to detect a voiced sound portion of a speaker; a speaker identifying step of identifying the speaker for each predetermined evaluation section, using the detected voiced sound block, based on the similarity of a feature amount of the voice signal; and a speaker discrimination frequency calculating step of obtaining, for each predetermined frequency section, discrimination frequency information of the speaker identified for each evaluation section, wherein appearance frequency information of the speaker in the frequency section is detected.

5. The information detecting method according to claim 4, further comprising a recognition mode switching step of switching between a voiced sound mode, in which an unvoiced sound portion in the voice signal is not used for recognition, and a voiced/unvoiced sound mode, in which the unvoiced sound portion is used for recognition, wherein, in the voiced/unvoiced sound mode, the speaker identifying step identifies the speaker for each predetermined evaluation section, using the voiced sound block and an unvoiced sound block, based on the similarity of the feature amount of the voice signal.

6. The information detecting method according to claim 4, wherein an LPC cepstrum obtained by LPC analysis is used as the feature amount for evaluating the similarity of voices in the voice signal of the information source, vector quantization of the feature amount with a plurality of codebooks is used as the identification method, and the resulting vector quantization distortion is used as the identification measure.