JP2003015684A

JP2003015684A - Method for extracting feature from acoustic signal generated from one sound source and method for extracting feature from acoustic signal generated from a plurality of sound sources

Info

Publication number: JP2003015684A
Application number: JP2002146685A
Authority: JP
Inventors: A Kasei Michael; マイケル・エー・カセイ
Original assignee: Mitsubishi Electric Research Laboratories Inc
Current assignee: Mitsubishi Electric Research Laboratories Inc
Priority date: 2001-05-21
Filing date: 2002-05-21
Publication date: 2003-01-17
Also published as: EP1260968A1; EP1260968B1; DE60203436D1; DE60203436T2; US20010044719A1

Abstract

PROBLEM TO BE SOLVED: To provide a computerized method for extracting features from acoustic signals generated from one or a plurality of sound sources. SOLUTION: The acoustic signal is first windowed and filtered to produce a spectral envelope for each source. The dimensionality of the spectral envelope is then reduced to produce a set of features for the acoustic signal. The features in the set are clustered to produce a group of features for each of the sources. The features in each group include spectral features and corresponding temporal features characterizing each source. Each group of features is a quantitative descriptor that is also associated with a qualitative descriptor. Hidden Markov models are trained with sets of known features and stored in a database. The database can then be indexed by sets of unknown features to select or recognize similar acoustic signals.

Description

Detailed Description of the Invention

FIELD OF THE INVENTION

[0001] The present invention relates generally to the field of acoustic signal processing, and more particularly to a method for recognizing, indexing, and searching acoustic signals.

DESCRIPTION OF THE RELATED ART

[0002] To date, very little work has been done on extracting features from environmental and ambient sounds. Most prior art acoustic signal representation methods have focused on human speech and music. However, no good representation exists for the many sound effects heard in films, television, video games, and virtual environments, such as footsteps, traffic, doors slamming, laser guns, tapping, smashing, thunder claps, rustling leaves, and running water. These environmental acoustic signals are generally much harder to characterize than speech and music, because they often contain many noisy, superimposed components as well as higher-order structural components such as repetition and scattering.

[0003] One particular application that could use such a representation scheme is video processing. Methods are available for extracting, compressing, searching, and classifying video objects; see, for example, the various MPEG standards. No such methods exist for "audio" objects, other than when the audio objects are speech. For example, it may be desirable to search a video library to locate all clips in which John Wayne is galloping on a horse while firing his six-shooter. Certainly it is possible to identify John Wayne or a horse visually. However, it is very difficult to pick out the rhythmic clip-clop of a galloping horse and the staccato report of a revolver. Recognizing audio events can delineate the action in a video.

[0004] Another application that could use the representation is sound synthesis. Only once the features of a sound have been identified does it become possible to generate the sound synthetically by means other than trial and error.

[0005] In the prior art, representations for non-speech sounds have usually focused on particular classes of non-speech sound, for example, reproducing the timbre of a specific musical instrument and identifying that instrument, distinguishing submarine sounds from ambient sea sounds, and recognizing underwater mammals by their calls. Each of these applications requires a particular arrangement of acoustic features that does not generalize beyond the specific application.

[0006] In addition to these specific applications, other work has focused on developing generalized representations for acoustic scene analysis. This research has become known as "computational auditory scene analysis." These systems require a great deal of computational effort because of the complexity of their algorithms. Typically, they use heuristic schemes drawn from artificial intelligence as well as various inference schemes.

[0007] While such systems provide valuable insight into the difficult problem of acoustic representation, their performance has never been shown to be satisfactory for classifying and synthesizing acoustic signals in a mixture.

[0008] In yet another application, sound representations can be used to index audio media covering a wide range of sound phenomena, including environmental sounds, background noise, sound effects, animal sounds, speech, non-speech utterances, and music. This would make it possible to design sound recognition tools that search audio media using automatically extracted indexes. With such tools, rich soundtracks, such as those of films or news programs, could be searched by semantic descriptions of their content or by similarity to a target audio query. For example, it may be desirable to locate all film clips in which lions roar or elephants trumpet.

[0009] There are many possible approaches to automatic classification and indexing. Wold et al. (IEEE Multimedia, pp. 27-36, 1996) and Martin et al., "Musical instrument identification: a pattern-recognition approach" (presented at the 136th Meeting of the Acoustic Society of America, Norfolk, VA, 1998), describe classification strictly for musical instruments. Zhang et al., "Content-based classification and retrieval of audio" (SPIE 43rd Annual Meeting, Conference on Advanced Signal Processing Algorithms, Architectures and Implementations VIII, 1998), describe a system that trains models with spectrogram data, and Boreczky et al., "A hidden Markov model framework for video segmentation using audio and image features" (Proceedings of ICASSP'98, pp. 3741-3744, 1998), use Markov models.

PROBLEMS TO BE SOLVED BY THE INVENTION

[0010] Indexing and searching audio media is particularly relevant to the newly emerging MPEG-7 standard for multimedia. The standard needs a unified interface for general sound classes. Encoder compatibility is one factor in the design. A "sound" database whose indexes are provided by one implementation can then be compared with a database extracted by a different implementation.

MEANS FOR SOLVING THE PROBLEMS

[0011] A computerized method extracts features from an acoustic signal generated from one or more sound sources. The acoustic signal is first windowed and filtered to produce a spectral envelope for each source. The dimensionality of the spectral envelope is then reduced to produce a set of features for the acoustic signal. The features in the set are clustered to produce a group of features for each source. The features in each group include spectral features and corresponding temporal features that characterize each source.

[0012] Each group of features is a quantitative descriptor that is also associated with a qualitative descriptor. Hidden Markov models are trained with sets of known features and stored in a database. The database can then be indexed by sets of unknown features to select or recognize similar acoustic signals.

EMBODIMENTS OF THE INVENTION

[0013] FIG. 1 shows a method 100 according to the invention for extracting spectral features 108 and temporal features 109 from a mixture of signals 101. The method 100 can be used to characterize and extract features from recorded sounds in order to classify the sound sources, and to re-purpose the features in structured multimedia applications such as parametric synthesis. The method can also be used to extract features from other linear mixtures, and even from multi-dimensional mixtures. The mixture can be obtained from a single source or from multiple sources, such as a stereo sound source.

[0014] To extract features from recorded signals, the method uses statistical techniques based on independent component analysis (ICA). Using a contrast function defined on cumulant expansions up to fourth order, the ICA transform generates a rotation of the basis of the time-frequency observation matrices 121.

[0015] The resulting basis components are as statistically independent as possible and reveal the individual features within the mixture source 101, for example the structural features of its sounds. These characteristic structures can be used to classify the signal, or to specify new signals with predictable features.

[0016] The representation according to the invention can synthesize many sound behaviors from a small set of features. It can synthesize complex acoustic event structures such as impacts, bounces, smashes, and scrapes, as well as properties of acoustic objects such as material, size, and shape.

[0017] In the method 100, the audio mixture 101 is first processed by a bank of logarithmic filters 110. Each filter produces a band-pass signal 111 for a predetermined frequency range. Typically, forty to fifty band-pass signals 111 are produced, with more signals in the lower frequency ranges than in the higher ranges, to mimic the frequency response characteristics of the human ear. Alternatively, the filters can be a constant-Q (CQ) or wavelet filterbank, or they can be linearly spaced as in a short-time fast Fourier transform (STFT) representation.
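The logarithmic spacing described above can be sketched in a few lines. This is only an illustration, not the filter design used in the patent: the function name `log_band_edges`, the 62.5 Hz to 16 kHz range, and the choice of exactly 40 bands are assumptions made for the example. The snippet computes band edges only and counts how many bands fall in the lower half of the frequency range.

```python
def log_band_edges(f_low, f_high, n_bands):
    """Return n_bands + 1 edges spaced evenly on a logarithmic frequency axis."""
    ratio = (f_high / f_low) ** (1.0 / n_bands)
    return [f_low * ratio ** k for k in range(n_bands + 1)]


# Hypothetical range: 62.5 Hz to 16 kHz spans 8 octaves, so 40 bands
# gives 5 bands per octave.
F_LOW, F_HIGH, N_BANDS = 62.5, 16000.0, 40
edges = log_band_edges(F_LOW, F_HIGH, N_BANDS)
bands = list(zip(edges[:-1], edges[1:]))

# Count the bands that lie entirely below the linear midpoint of the range.
midpoint = (F_LOW + F_HIGH) / 2.0
low_bands = sum(1 for _, hi in bands if hi <= midpoint)
print(len(bands), "bands in total,", low_bands, "below", midpoint, "Hz")
```

With this spacing most of the bands sit below the midpoint, which is the "more signals in the lower ranges" property the paragraph describes.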

[0018] In step 120, each band-pass signal is "windowed" into short segments, for example 20 ms long, to produce observation matrices. Each matrix can contain hundreds of samples. Steps 110 and 120 are shown in greater detail in FIGS. 2 and 3. Note that the windowing can be done before the filtering.
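As a rough sketch of the windowing step, the code below slices a signal into 20 ms frames and stacks them into a matrix. The 16 kHz sample rate, the 10 ms hop, the Hamming taper, and the helper name `frame_signal` are all assumptions for illustration; the text only specifies short segments of about 20 ms. Here the windowing is applied before any filtering, which the paragraph notes is permitted.

```python
import numpy as np


def frame_signal(x, sample_rate, frame_ms=20.0, hop_ms=10.0):
    """Slice a 1-D signal into overlapping Hamming-windowed frames, one per row.

    Assumes len(x) is at least one frame long.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds samples [i * hop, i * hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)


sample_rate = 16000                          # assumed sample rate
t = np.arange(sample_rate) / sample_rate     # one second of audio
x = np.sin(2.0 * np.pi * 440.0 * t)          # stand-in for a recorded signal

frames = frame_signal(x, sample_rate)
# One possible observation matrix: magnitude spectrum of every frame.
observation = np.abs(np.fft.rfft(frames, axis=1))
print(frames.shape, observation.shape)
```

At 16 kHz a 20 ms frame holds 320 samples, consistent with the remark that each matrix can contain hundreds of samples.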

[0019] In step 130, a singular value decomposition (SVD) is applied to the observation matrices 121 to produce reduced-dimensionality matrices 131. The SVD was first described by the Italian geometer Beltrami in 1873. The singular value decomposition is a well-defined generalization of principal component analysis (PCA). The singular value decomposition of an m × n matrix is any factorization of the form

[0020] $X = U \Sigma V^{T}$

[0021] where $U$ is an m × m orthogonal matrix (that is, $U$ has orthonormal columns), $V$ is an n × n orthogonal matrix, and $\Sigma$ is an m × n diagonal matrix of singular values whose entries satisfy $\sigma_{ij} = 0$ whenever $i \neq j$.
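The factorization and the stated shapes can be checked numerically. The sketch below uses NumPy's `numpy.linalg.svd` on a small random matrix; the 6 × 4 size and the random data are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
X = rng.standard_normal((m, n))      # stand-in for an observation matrix

# full_matrices=True returns U as m x m and V^T as n x n.
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Build the m x n matrix Sigma with the singular values on its diagonal.
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

X_rebuilt = U @ Sigma @ Vt
print(U.shape, Sigma.shape, Vt.shape)
print("max reconstruction error:", np.abs(X - X_rebuilt).max())
```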

[0022] As an advantage, and in contrast with PCA, the SVD can decompose a non-square matrix, so the observation matrix can be decomposed directly in either the spectral or the temporal orientation without computing a covariance matrix. Because the SVD decomposes a non-square matrix directly, without requiring a covariance matrix, the resulting basis is less susceptible to dynamic range problems than PCA.
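One way to read the dynamic-range remark is that forming a covariance (or Gram) matrix squares the singular values, and therefore squares the ratio between the largest and smallest component. The sketch below illustrates that reading on synthetic data; the column scales are arbitrary assumptions, and this is an interpretation rather than an argument taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic observations whose five columns differ widely in scale.
scales = np.array([1e3, 1e1, 1.0, 1e-1, 1e-3])
X = rng.standard_normal((200, 5)) * scales

# Direct route: singular values of the non-square matrix X.
s = np.linalg.svd(X, compute_uv=False)

# Covariance-style route: eigenvalues of X^T X, sorted in descending order.
evals = np.linalg.eigvalsh(X.T @ X)[::-1]

svd_range = s[0] / s[-1]
cov_range = evals[0] / evals[-1]
print("dynamic range via SVD:        %.3g" % svd_range)
print("dynamic range via covariance: %.3g" % cov_range)
```

The eigenvalues of the Gram matrix equal the squared singular values, so the covariance route has to represent roughly the square of the range that the SVD handles directly.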

[0023] In step 140, the method applies an optional independent component analysis (ICA) to the reduced-dimensionality matrices 131. ICA that uses an iterative on-line algorithm based on a neuro-mimetic architecture for blind signal separation is well known. Recently, many neural-network architectures have been proposed for solving the ICA problem; see, for example, U.S. Pat. No. 5,383,164, "Adaptive system for broadband multisignal discrimination in a channel with reverberation," issued to Sejnowski on Jan. 17, 1995.

[0024] The ICA produces the spectral features 108 and temporal features 109. The spectral features, expressed as vectors, correspond to estimates of the statistically most independent components within a segmentation window. The temporal features, also expressed as vectors, describe the evolution of the spectral components over the course of the segment.
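The patent does not spell out its ICA algorithm beyond a fourth-order cumulant contrast. As a stand-in, the sketch below uses a symmetric fixed-point iteration with the kurtosis nonlinearity (a FastICA-style update), which is also based on fourth-order statistics. The function names, the two synthetic sources, and the mixing matrix are all assumptions for illustration.

```python
import numpy as np


def whiten(X):
    """Center the rows of X (signals x samples) and give them identity covariance."""
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / Xc.shape[1]
    evals, evecs = np.linalg.eigh(cov)
    return evecs @ np.diag(evals ** -0.5) @ evecs.T @ Xc


def ica_kurtosis(Z, n_iter=200, seed=0):
    """Estimate an orthogonal unmixing matrix W for whitened data Z."""
    n, T = Z.shape
    rng = np.random.default_rng(seed)
    W = np.linalg.qr(rng.standard_normal((n, n)))[0]
    for _ in range(n_iter):
        Y = W @ Z
        # Fixed-point update for every row w: E[z (w.z)^3] - 3 w.
        W_new = (Y ** 3) @ Z.T / T - 3.0 * W
        # Symmetric decorrelation: W <- (W W^T)^(-1/2) W, computed via the SVD.
        u, _, vt = np.linalg.svd(W_new)
        W = u @ vt
    return W


rng = np.random.default_rng(0)
T = 5000
S = np.vstack([rng.laplace(size=T),           # super-Gaussian source
               rng.uniform(-1, 1, size=T)])   # sub-Gaussian source
A = np.array([[1.0, 0.6], [0.4, 1.0]])        # hypothetical mixing matrix
Z = whiten(A @ S)
W = ica_kurtosis(Z)
Y = W @ Z

# Absolute correlation between each true source and each estimate.
cross = np.abs(np.corrcoef(np.vstack([S, Y]))[:2, 2:])
print(np.round(cross, 3))
```

The recovered components match the sources only up to order and sign, which is why the check uses absolute correlations.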

[0025] Each pair of spectral and temporal vectors can be combined using a vector outer product to reconstruct a partial spectrum for the given input spectrum. If these spectra are invertible, as a filterbank representation would be, then the independent time-domain signals can be estimated. For each independent component described in the scheme, a matrix of compatibility scores for components in the prior segment is available. This allows components to be tracked through time by estimating the most likely successive correspondences. A forward compatibility matrix is the same as this backward compatibility matrix, except that it looks forward in time.
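The outer-product step can be illustrated with a toy pair of vectors. The numbers below are invented for the example; they stand for one spectral feature over four frequency bands and its temporal feature over three frames.

```python
import numpy as np

# Hypothetical spectral feature: energy concentrated in bands 1 and 2.
spectral = np.array([0.0, 1.0, 0.5, 0.0])
# Hypothetical temporal feature: the component decays over three frames.
temporal = np.array([1.0, 0.5, 0.25])

# Partial spectrogram (frames x bands) contributed by this one component.
partial = np.outer(temporal, spectral)
print(partial)
```

Each row of `partial` is the same spectral shape scaled by the temporal gain for that frame, so the matrix has rank one. Summing such matrices over all components rebuilds the full spectrum.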

[0026] An independent component decomposition of an audio track can be used to estimate the individual signal components within the track. The separation problem is intractable unless a full-rank signal matrix (N linear mixtures of N sources) is available, but using the independent components of short temporal sections of a frequency-domain representation can approximate the underlying sources. These approximations can be used for classification and recognition tasks, and for comparisons between sounds.

[0027] As shown in FIG. 3, the time-frequency distribution (TFD) can be normalized by the power spectral density (PSD) 115 to diminish the contribution of lower frequency components, which carry more energy in some acoustic domains.
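The text does not give a formula for the normalization. One plausible reading, sketched below, divides every frequency channel of a power TFD by that channel's time-averaged power. The helper name `normalize_by_psd`, the small `eps` guard, and the synthetic 1/f-tilted data are assumptions for the example.

```python
import numpy as np


def normalize_by_psd(tfd, eps=1e-12):
    """Divide each channel of a (frames x channels) power TFD by its mean power."""
    psd = tfd.mean(axis=0, keepdims=True)    # time-averaged power per channel
    return tfd / (psd + eps), psd


rng = np.random.default_rng(0)
n_frames, n_channels = 100, 32
tilt = 1.0 / np.arange(1, n_channels + 1)    # low channels carry more power
tfd = rng.random((n_frames, n_channels)) * tilt

flat, psd = normalize_by_psd(tfd)
print("mean power before, lowest vs highest channel:", psd[0, 0], psd[0, -1])
print("mean power after,  lowest vs highest channel:",
      flat[:, 0].mean(), flat[:, -1].mean())
```

After normalization every channel has the same average power, so the low-frequency channels no longer dominate the later decomposition.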

[0028] FIGS. 4 and 5 respectively show the temporal and spatial decompositions for a percussion instrument played with a regular rhythm. The observable structures reveal wide-band articulatory components corresponding to the shakes, and horizontal stratification corresponding to the ringing of the metal shell.

Applications for Acoustic Features of Sound

[0029] The invention can be used in a large number of applications. The extracted features can be regarded as separable components of an acoustic mixture that represent the inherent structure within the source mixture. The extracted features can be compared against a set of a-priori classes, determined by pattern-recognition techniques, in order to recognize or identify the components. These classifiers can be in the domain of speech phonemes, sound effects, musical instruments, animal sounds, or any other corpus-based analytic models. The extracted features can be re-synthesized independently using an inverse filterbank, thereby achieving a "purification" of the source acoustic mixture. One example use is to separate the singer, drums, and guitars from a recording, in order to re-purpose some of the components or to analyze the musical structure automatically. Another example is to separate an actor's voice from background noise and pass a clean speech signal to a speech recognizer for automatic subtitling of a film.

[0030] The spectral features and temporal features can be considered separately in order to identify various properties of the acoustic structure of individual sound objects within the mixture. Spectral features can delineate properties such as material, size, and shape, whereas temporal features can delineate behaviors such as bouncing, breaking, and smashing. Thus, a glass smashing can be distinguished from a glass bouncing, or from a clay pot smashing. The extracted features can be altered and re-synthesized to produce modified synthetic instances of the source sound. If the input sound is a single sound event comprising multiple acoustic features, such as a glass smashing, then the individual features can be controlled for re-synthesis. This is useful for model-based media applications such as generating sound in virtual environments.

Indexing and Searching

[0031] The invention can also be used to index and search large multimedia databases containing many different types of sounds, e.g., sound effects, animal sounds, musical instruments, voices, overlapping sounds, environmental sounds, and male and female sounds.

[0032] In this context, sound descriptions are generally divided into two types: qualitative text-based descriptions using category labels, and quantitative descriptions using probabilistic model states. Category labels provide qualitative information about sound content. Descriptions in this form are suitable for text-based query applications, such as Internet search engines or any processing tool that uses text fields.

[0033] In contrast, the quantitative descriptors contain compact information about an audio segment and can be used for numerical evaluation of sound similarity. For example, these descriptors can be used to identify specific instruments in a video or audio recording. The qualitative and quantitative descriptors are well suited to audio query-by-example search applications.

Sound Recognition Descriptors and Description Schemes

Qualitative Descriptors

[0034] While segmenting recorded audio into classes, it is desirable to obtain pertinent semantic information about the content. For example, recognizing a scream in a video soundtrack can indicate horror or danger, and laughter can indicate comedy. Furthermore, sounds can indicate the presence of a person, so the video segments to which these sounds belong can be used as candidates in a search for clips that contain people. Sound category and classification scheme descriptors provide a means for organizing category concepts into hierarchical structures that enable this type of complex relational search strategy.

Sound Categories

[0035] As shown in FIG. 6 for a simple taxonomy 600, a description scheme (DS) is used to name sound categories. As an example, the sound of a dog barking can be given the qualitative category label "dog" 610 with "bark" 611 as a sub-category. In addition, "growl" 612 or "howl" 613 can be desirable sub-categories of "dog." The first two sub-categories are closely related, but the third is an entirely different sound event. FIG. 6 therefore shows four categories organized into a taxonomy with "dog" 610 as the root node. Each category has at least one relation link 601 to another category in the taxonomy. By default, a contained category is considered a narrower category (NC) 601 than the category that contains it. In this example, however, "growl" is defined as nearly synonymous with, but less preferable than, "bark." To capture such structure, the following relations are defined as part of the description scheme.

[0036] The relations are:
BC - Broader category: the related category is more general in meaning than the containing category.
NC - Narrower category: the related category is more restricted in meaning than the containing category.
US - Use the related category, which is nearly synonymous with the current category, because it is preferred to the current category.
UF - Use of the current category is preferred to use of the nearly synonymous related category.
RC - Related category: the related category is not a synonym, quasi-synonym, broader category, or narrower category, but is associated with the containing category.
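The five relations can be modeled as labeled links between category names. The sketch below is a hypothetical in-memory structure, not the XML description scheme of the patent: the `Taxonomy` class, its method names, and the decision to record the inverse link automatically (BC with NC, US with UF) are assumptions made for illustration.

```python
# Assumed inverse of each relation; RC is treated as symmetric.
INVERSE = {"BC": "NC", "NC": "BC", "US": "UF", "UF": "US", "RC": "RC"}


class Taxonomy:
    """Category names connected by labeled relation links."""

    def __init__(self):
        self.links = {}   # category name -> list of (relation, other name)

    def relate(self, source, relation, target):
        if relation not in INVERSE:
            raise ValueError("unknown relation: " + relation)
        self.links.setdefault(source, []).append((relation, target))
        self.links.setdefault(target, []).append((INVERSE[relation], source))

    def related(self, source, relation):
        return [t for r, t in self.links.get(source, []) if r == relation]

    def preferred(self, name):
        """Follow a US link to the preferred near-synonym, if there is one."""
        use = self.related(name, "US")
        return use[0] if use else name


dog = Taxonomy()
for sound in ("bark", "growl", "howl"):
    dog.relate("dog", "NC", sound)      # contained categories are narrower
dog.relate("growl", "US", "bark")       # "bark" is preferred to "growl"

print(dog.related("dog", "NC"))
print(dog.preferred("growl"))
```

A text query for "growl" could then be redirected to the preferred label "bark" before searching.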

[0037] The following XML schema code shows how to instantiate the qualitative description scheme for the category taxonomy shown in FIG. 6 using a description definition language (DDL).

【００３８】[0038]

【数２】 [Equation 2]

[0039] The category and scheme attributes together provide unique identifiers that can be used to reference categories and taxonomies from quantitative description schemes, such as the probabilistic models described in greater detail below. The label descriptor gives a meaningful semantic label for each category, and the relation descriptor describes the relationships among the categories of the taxonomy.

Classification Scheme

[0040] As shown in FIG. 7, categories can be combined by relation links into a classification scheme 700 to make a richer taxonomy. For example, "bark" 611 is a sub-category of "dog" 610, which in turn is a sub-category of "pets" 701, as is the category "cat" 710. Cat 710 has the sound categories "meow" 711 and "purr" 712. The following is an example of a simple classification scheme for "pets" containing the two categories "dog" and "cat."

[0041] To implement this classification scheme by extending the previously defined scheme, a second scheme, named "cat," is instantiated as follows.

【００４２】[0042]

【数３】 [Equation 3]

[0043] To combine these categories, a classification scheme called "pets" is instantiated that references the previously defined schemes.

【００４４】[0044]

【数４】 [Equation 4]

[0045] The classification scheme called "pets" now includes all of the category components of "dog" and "cat," with the additional category "pets" as the root. A qualitative taxonomy, as described above, is sufficient for text indexing applications.
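Combining schemes under a new root can be sketched with plain dictionaries. This does not reproduce the XML instantiation from the patent; the representation (a root name paired with a mapping from each category to its narrower categories) and the helper names `merge_schemes` and `leaf_sounds` are assumptions for the example.

```python
def merge_schemes(root, *schemes):
    """Place existing schemes under a new root category.

    Each scheme is a pair (root_name, {category: [narrower categories]}).
    """
    merged = {root: [name for name, _ in schemes]}
    for _, tree in schemes:
        merged.update(tree)
    return merged


def leaf_sounds(scheme, category):
    """List every category below `category` that has no narrower categories."""
    children = scheme.get(category, [])
    if not children:
        return [category]
    found = []
    for child in children:
        found.extend(leaf_sounds(scheme, child))
    return found


dog = ("dog", {"dog": ["bark", "howl"]})
cat = ("cat", {"cat": ["meow", "purr"]})
pets = merge_schemes("pets", dog, cat)

print(pets["pets"])
print(leaf_sounds(pets, "pets"))
```

A relational query such as "all pet sounds" then reduces to walking the merged scheme from its root.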

[0046] The following sections describe quantitative descriptors for classification and indexing that can be used together with the qualitative descriptors to form a complete sound indexing and search engine.

Quantitative Descriptors

[0047] The sound recognition quantitative descriptors describe features of an audio signal to be used with statistical classifiers. They can be used for general sound recognition, including sound effects and musical instruments. In addition to the suggested descriptors, any other descriptor defined within the audio framework can be used for classification.

Audio Spectrum Basis Features

[0048] The features most widely used for sound classification are spectrum-based representations, such as power spectrum slices or frames. Typically, each spectrum slice is an n-dimensional vector, where n is the number of spectral channels, with up to 1024 channels of data. A logarithmic frequency spectrum, as represented by an audio framework descriptor, can reduce the dimensionality to about 32 channels. Even at that size, spectrum-derived features are generally incompatible with probability model classifiers because of their high dimensionality. Probability classifiers work best with fewer than 10 dimensions.

【００４９】それゆえ、上記および下記のような特異値
分解（ＳＶＤ）によって生成される低次元数の基底関数
が好ましい。その際、可聴周波数帯音スペクトル基底記
述子は、確率モデル分類器のために適した低次元の部分
空間にそのスペクトルを射影するために用いられる基底
関数のためのコンテナである。Therefore, low-dimensional basis functions produced by singular value decomposition (SVD), as described above and below, are preferred. The audio spectrum basis descriptor is then a container for the basis functions that are used to project the spectrum onto a low-dimensional subspace suitable for probabilistic model classifiers.

【００５０】本発明は、音の各クラス、およびサブクラ
スのための基底を決定する。その基底は、音の特徴空間
の統計的に最も規則的な特徴を獲得する。次元数の低減
は、上記のように、データから導出された基底関数の行
列に対してスペクトルベクトルを射影することにより行
われる。基底関数は、行の数がスペクトルベクトルの長
さに対応し、列の数が基底関数の数に対応する行列の列
に格納される。基底射影は、スペクトルと基底ベクトル
との行列積である。The present invention determines a basis for each class and subclass of sound. The basis captures the statistically most regular features of the sound feature space. Dimensionality reduction is performed, as described above, by projecting the spectrum vectors onto a matrix of basis functions derived from the data. The basis functions are stored in the columns of a matrix in which the number of rows corresponds to the length of the spectrum vector and the number of columns corresponds to the number of basis functions. The basis projection is the matrix product of the spectrum and the basis vectors.
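The projection described above can be sketched in a few lines of Python. This is only an illustration under assumed names (`project`, `toy_basis`) and a toy 4-channel, 2-dimensional basis; it is not the DDL instantiation or the data of the patent.

```python
def project(spectra, basis):
    """Multiply an (m x n) list of spectral frames by an (n x k) basis matrix.

    Each row of `spectra` is one spectral slice; each column of `basis`
    is one basis function. The result has one k-dimensional row per frame.
    """
    n = len(basis)
    k = len(basis[0])
    for frame in spectra:
        if len(frame) != n:
            raise ValueError("frame length must equal the number of basis rows")
    return [[sum(frame[i] * basis[i][j] for i in range(n)) for j in range(k)]
            for frame in spectra]


# Hypothetical toy data: two 4-channel frames reduced to 2 dimensions.
frames = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
toy_basis = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
features = project(frames, toy_basis)
```

With a real basis the same call would map, for example, 31-channel frames to 5-dimensional feature rows.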

【００５１】基底関数から再構成されるスペクトログラ
ム図８は、本発明による４つの基底関数から再構成される
スペクトログラム８００を示す。その具体的なスペクト
ログラムは「ポップ」音楽のためのものである。左側の
スペクトル基底ベクトル８０１は、ベクトルの外積を用
いて、基底射影ベクトル８０２と結合される。それぞれ
結果として生成される外積の行列は加算され、最終的な
再構成物が生成される。基底関数は、元のデータより少
ない次元数において情報を最大にするように選択され
る。たとえば、基底関数は、主成分分析（ＰＣＡ）ある
いはＫａｒｈｕｎｅｎ−Ｌｏｅｖｅ変換（ＫＬＴ）を用
いて抽出される無相関の特徴に対応するか、あるいは独
立成分分析（ＩＣＡ）によって抽出される統計的に独立
の成分に対応することができる。ＫＬＴあるいはホテリ
ング変換は、二次の統計値、すなわち共分散がわかって
いる際に好ましい逆相関変換である。この再構成は、図
１３を参照してさらに詳細に記載される。Spectrogram Reconstructed from Basis Functions: FIG. 8 shows a spectrogram 800 reconstructed from four basis functions according to the invention. The specific spectrogram is for "pop" music. The spectral basis vectors 801 on the left are combined with the basis projection vectors 802 using the vector outer product. The resulting outer-product matrices are summed to produce the final reconstruction. The basis functions are chosen to maximize the information retained in fewer dimensions than the original data. For example, the basis functions can correspond to uncorrelated features extracted using principal component analysis (PCA) or the Karhunen-Loeve transform (KLT), or to statistically independent components extracted by independent component analysis (ICA). The KLT, or Hotelling transform, is the preferred decorrelating transform when the second-order statistics, that is, the covariances, are known. This reconstruction is described in greater detail with reference to FIG. 13.

【００５２】分類の目的を果たすために、全クラスのた
めの基底が導出される。こうして、分類空間は、そのク
ラスの最も統計的に顕著な成分を含む。以下のＤＤＬ例
示化は、一連の３１チャネルの対数周波数スペクトルを
５次元に低減する基底射影行列を定義する。For classification purposes, a basis is derived for an entire class. The classification space thus consists of the most statistically salient components of the class. The following DDL instantiation defines a basis projection matrix that reduces a series of 31-channel logarithmic frequency spectra to five dimensions.

【００５３】[0053]

【数５】 [Equation 5]

【００５４】低エッジ、高エッジ、ならびに分解能属性
は、基底関数の下側周波数限界および上側周波数限界、
ならびにオクターブバンド表記法におけるスペクトルチ
ャネルの間隔を与える。本発明による分類構造では、音
の全クラスのための基底関数が、そのクラスのための確
率モデルとともに格納される。The low edge, high edge, and resolution attributes give the lower and upper frequency limits of the basis functions and the spacing of the spectral channels in octave-band notation. In the classification framework according to the invention, the basis functions for an entire class of sound are stored together with the probabilistic model for that class.

【００５５】音認識の特徴音認識のために用いられる特徴を集めて、種々の異なる
応用形態のために用いることができる１つの記述方式に
することができる。初期設定の可聴周波数帯音スペクト
ル射影記述子は、多くの音のタイプ、たとえば、音響効
果ライブラリから得られた音、および楽器のサンプルデ
ィスクの分類において良好に役割を果たす。Sound Recognition Features: The features used for sound recognition can be collected into a single description scheme that can be used for a variety of different applications. The default audio spectrum projection descriptor performs well in classifying many sound types, for example sounds taken from sound effect libraries and musical instrument sample disks.

【００５６】基底特徴は、上記のような可聴周波数帯音
スペクトル包絡線抽出プロセスから導出される。可聴周
波数帯音スペクトル射影記述子は、同じく上記のよう
に、１組の基底関数に対するスペクトル包絡線の射影に
よって得られる、次元数を低減した特徴のためのコンテ
ナである。たとえば、可聴周波数帯音スペクトル包絡線
は、対数で配置される周波数帯へのリサンプリングとと
もに、スライディングウインドウＦＦＴ解析によって抽
出される。好ましい実施形態では、解析フレーム周期は
１０ｍｓｅｃである。しかしながら、３０ｍｓｅｃ持続
時間のスライディング抽出ウインドウが、ハミングウイ
ンドウで用いられる。３０ｍｓｅｃ間隔は、十分なスペ
クトル分解能を提供し、オクターブバンドスペクトルの
６２．５Ｈｚ幅の最初のチャネルを概ね分解するように
選択される。ＦＦＴ解析ウインドウの大きさは、次に大
きな２の累乗のサンプル数である。これは３２ｋＨｚで
３０ｍｓｅｃの場合に、９６０サンプルが存在するが、
ＦＦＴは１０２４サンプルにおいて実行されることにな
ることを意味する。４４．１ｋＨｚで３０ｍｓｅｃの場
合、１３２３サンプルが存在するが、ＦＦＴは２０４８
サンプルにおいて実行されることになり、ウインドウ外
のサンプルは０に設定される。The basis features are derived from the audio spectrum envelope extraction process described above. The audio spectrum projection descriptor is a container for the reduced-dimension features obtained by projecting the spectral envelope onto a set of basis functions, also as described above. For example, the audio spectrum envelope is extracted by a sliding-window FFT analysis, with resampling to logarithmically spaced frequency bands. In the preferred embodiment, the analysis frame period is 10 ms. However, a sliding extraction window of 30 ms duration is used, with a Hamming window. The 30 ms interval is chosen to provide enough spectral resolution to roughly resolve the 62.5 Hz wide first channel of the octave-band spectrum. The size of the FFT analysis window is the next-larger power-of-two number of samples. This means that for 30 ms at 32 kHz there are 960 samples, but the FFT is performed on 1024 samples. For 30 ms at 44.1 kHz there are 1323 samples, but the FFT is performed on 2048 samples, with the samples outside the window set to zero.
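The window arithmetic in this paragraph can be checked with a short sketch. The function names `analysis_sizes` and `padded_hamming` are illustrative assumptions, and the Hamming coefficients 0.54 and 0.46 are the conventional ones rather than values stated in the text.

```python
import math


def analysis_sizes(sample_rate_hz, window_ms=30, hop_ms=10):
    """Return (window samples, hop samples, FFT size) for one sample rate."""
    window = round(sample_rate_hz * window_ms / 1000)
    hop = round(sample_rate_hz * hop_ms / 1000)
    fft_size = 1
    while fft_size < window:  # next power of two that holds the window
        fft_size *= 2
    return window, hop, fft_size


def padded_hamming(window, fft_size):
    """Hamming window of `window` taps, zero-padded up to `fft_size`."""
    taps = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (window - 1))
            for i in range(window)]
    return taps + [0.0] * (fft_size - window)


sizes_32k = analysis_sizes(32000)   # 960-sample window, 1024-point FFT
sizes_44k = analysis_sizes(44100)   # 1323-sample window, 2048-point FFT
```

The zero entries appended by `padded_hamming` correspond to the samples outside the window that are set to zero.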

【００５７】図９ａおよび図９ｂは、時間指数９１０の
場合の３つのスペクトル基底成分９０１〜９０３と、図
１０ａおよび図１０ｂにおける「笑い声」スペクトログ
ラム１０００のための周波数指数９２０の場合の生成さ
れる基底射影９１１〜９１３とを示す。ここでの形式
は、図４および図５に示される形式と類似である。図１
０ａは、笑い声の対数目盛のスペクトログラムを示して
おり、図１０ｂはスペクトログラムを再構成したものを
示す。いずれの図面とも、ｘ軸およびｙ軸上にそれぞれ
時間および周波数指数をプロットする。FIGS. 9a and 9b show three spectral basis components 901-903 for a time index 910, and the resulting basis projections 911-913 for a frequency index 920, for the "laughter" spectrogram 1000 of FIGS. 10a and 10b. The format here is similar to that shown in FIGS. 4 and 5. FIG. 10a shows a log-scale spectrogram of laughter, and FIG. 10b shows the reconstructed spectrogram. Both figures plot the time and frequency indices on the x and y axes, respectively.

【００５８】基底記述子に加えて、別の定量的記述子の
大きなシーケンスを用いて、楽器分類のための用いられ
る場合が多い調波包絡線および基本周波数特徴のよう
な、音のクラスの特別な特性を用いて分類器を定義する
ことができる。In addition to the basis descriptors, a large sequence of alternative quantitative descriptors can be used to define classifiers that exploit special properties of a sound class, such as the harmonic envelope and fundamental frequency features that are often used for musical instrument classification.

【００５９】本発明によってなされるような次元数低減
の１つの利便性は、拡大縮小可能な１組の記述子に基づ
く任意の記述子が、同じサンプリングレートでスペクト
ル記述子に付加できることである。さらに、適切な基底
を、スペクトルに基づく基底と同じようにして、拡張さ
れた特徴の組全体に対して計算することができる。One convenience of the dimensionality reduction performed by the invention is that any descriptor based on a scalable set of descriptors can be appended to the spectral descriptors at the same sampling rate. In addition, a suitable basis can be computed for the entire extended feature set in the same way as for a spectrum-based basis.

【００６０】基底関数を用いるスペクトログラム要約化本発明による音認識の特徴記述方式のための別の応用形
態は、効率的なスペクトログラム表現である。スペクト
ログラムを視覚化および要約化するために、可聴周波数
帯音スペクトル基底射影および可聴周波数帯音スペクト
ル基底特徴を、非常に効率のよい記憶機構として用いる
ことができる。Spectrogram Summarization Using Basis Functions: Another application of the sound recognition feature description scheme according to the invention is efficient spectrogram representation. For spectrogram visualization and summarization, the audio spectrum basis projections and the audio spectrum basis features can be used as a very efficient storage mechanism.

【００６１】スペクトログラムを再構成するために、本
発明は以下により詳細に記載される式２を用いる。式２
は、上記のように図８にも示される、各基底関数とその
対応するスペクトログラム基本射影とのクロス乗積から
２次元のスペクトログラムを構成する。To reconstruct a spectrogram, the invention uses Equation 2, described in detail below. Equation 2 constructs a two-dimensional spectrogram from the outer product of each basis function and its corresponding spectrogram basis projection, as also shown in FIG. 8 and described above.

【００６２】確率モデル記述方式有限状態モデルスペクトル的特徴は時間にわたって変動するので、音の
現象は動的である。この非常に大きな時間的変動が、音
響信号に、認識のための特徴的な「指紋」を与える。そ
れゆえ、本発明のモデルは、特定の音源あるいは音のク
ラスによって生成される音響信号を、有限の状態数に分
割する。その分割は、スペクトル的特徴に基づく。個々
の音は、この状態空間を通る、それらの音の軌跡によっ
て記述される。このモデルが、図１１ａおよび図１１ｂ
に関して、以下により詳細に記載される。各状態は、ガ
ウス分布のような連続確率分布によって表現されること
ができる。Probabilistic Model Description Scheme, Finite State Model: Sound phenomena are dynamic because their spectral features vary over time. This very large temporal variation gives acoustic signals their characteristic "fingerprints" for recognition. Hence, the model of the invention partitions the acoustic signal generated by a particular source or sound class into a finite number of states. The partitioning is based on the spectral features. Individual sounds are described by their trajectories through this state space. This model is described in greater detail below with respect to FIGS. 11a and 11b. Each state can be represented by a continuous probability distribution, such as a Gaussian distribution.

【００６３】状態空間を通る音のクラスの動的な振舞い
は、現在の状態を与えるときに、次の状態への推移の確
率を記述するｋ×ｋの推移行列によって表される。推移
行列Ｔは、時間ｔ−ｌにおける状態ｉから時間ｔにおけ
る状態ｊへの推移の確率をモデル化する。初期の状態分
布は、確率のｋ×１ベクトルであり、典型的には有限状
態モデルにおいても用いられる。このベクトルのｋ番目
の要素は、最初の観測フレームにおいて状態ｋにある確
率である。The dynamic behavior of a class of sounds passing through the state space is represented by a k × k transition matrix that describes the probability of transition to the next state given the current state. The transition matrix T models the probability of a transition from state i at time t-1 to state j at time t. The initial state distribution is a k × 1 vector of probabilities and is also typically used in finite state models. The kth element of this vector is the probability of being in state k in the first observation frame.
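As a small illustration of the transition matrix and initial state distribution described here, the following sketch propagates a state distribution one frame forward. The two-state matrix and the helper names `step` and `is_row_stochastic` are made up for the example.

```python
def is_row_stochastic(T, tol=1e-9):
    """Check that every row of T is a probability distribution."""
    return all(min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol for row in T)


def step(distribution, T):
    """Advance a state distribution by one frame.

    T[i][j] is the probability of moving from state i at time t-1
    to state j at time t.
    """
    k = len(T)
    return [sum(distribution[i] * T[i][j] for i in range(k)) for j in range(k)]


# Hypothetical two-state model: start in state 0 with certainty.
initial = [1.0, 0.0]
T = [[0.9, 0.1],
     [0.5, 0.5]]
after_one = step(initial, T)
after_two = step(after_one, T)
```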

【００６４】ガウス分布タイプ多次元ガウス分布は、音の分類中に状態をモデル化する
ために用いられる。ガウス分布は、平均値ｍの１×ｎベ
クトルと、ｎ×ｎの共分散行列Ｋとによってパラメータ
化される。ただしｎは各観測ベクトルにおける特徴の数
である。ガウスパラメータを与えると、特定のベクトル
ｘに対する確率の計算のための式は以下のようになる。Gaussian Distribution Type: Multidimensional Gaussian distributions are used to model the states during sound classification. A Gaussian distribution is parameterized by a 1 × n vector of means m and an n × n covariance matrix K, where n is the number of features in each observation vector. Given the Gaussian parameters, the probability of a particular vector x is computed as follows.

【００６５】[0065]

【数６】 [Equation 6]
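The equation image is not reproduced in this text. For reference, the standard multivariate Gaussian density with the parameters named above (row vectors x and m of length n, covariance K) reads as follows; this is the textbook form and is assumed, not confirmed, to match Equation 6.

```latex
f(x) = \frac{1}{(2\pi)^{n/2}\,\lvert K \rvert^{1/2}}
       \exp\!\left( -\tfrac{1}{2}\,(x - m)\,K^{-1}\,(x - m)^{\mathsf{T}} \right)
```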

【００６６】連続隠れマルコフモデルは、状態観測確率
のための連続確率分布モデルを有する有限状態モデルで
ある。以下のＤＤＬ例示化は、ガウス状態を有する連続
隠れマルコフモデルを表すための確率モデル記述方式の
使用の一例である。この例では、浮動小数点数が、表示
の目的のためにのみ、小数点以下２桁に丸められてい
る。A continuous hidden Markov model is a finite state model with a continuous probability distribution model for the state observation probabilities. The following DDL instantiation is an example of the use of the probabilistic model description scheme to represent a continuous hidden Markov model with Gaussian states. In this example, the floating-point numbers have been rounded to two decimal places for display purposes only.

【００６７】[0067]

【数７】 [Equation 7]

【００６８】この例では、「確率モデル」は、基底確率
モデルクラスから導出される、ガウス分布タイプとして
例示化される。In this example, the "probabilistic model" is instantiated as a Gaussian distribution type, which is derived from the base probabilistic model class.

【００６９】音認識モデル記述方式これまで、本発明による方法では、応用形態の構造を全
く用いることなくツールを分離してきた。以下のデータ
タイプは、上記の記述子および記述方式を結合して、音
の分類および指数化のための統合された構造にする。音
のセグメントは、分類器の出力に基づくカテゴリラベル
で指数化することができる。さらに、確率モデルパラメ
ータは、データベース内の音の指数化のために用いるこ
とができる。状態のようなモデルパラメータによって指
数化することは、照会カテゴリが未知であるとき、ある
いはカテゴリの範囲より狭い照合判定基準が必要とされ
るときに、例示照会応用形態によって必要とされる。Sound Recognition Model Description Schemes: So far, the tools have been presented in isolation, without any application structure. The following data types combine the descriptors and description schemes described above into a unified framework for sound classification and indexing. Sound segments can be indexed with a category label based on the output of a classifier. In addition, the probabilistic model parameters can be used to index sounds in a database. Indexing by model parameters, such as states, is needed for query-by-example applications when the query category is unknown, or when matching criteria narrower than the scope of a category are required.

【００７０】音認識モデル音認識モデル記述方式は、隠れマルコフモデルあるいは
ガウス混合モデルのような音のクラスの確率モデルを特
定する。以下の例は、図６の「ほえ声」音カテゴリ６１
１の隠れマルコフモデルの例示化である。その音のクラ
スのための確率モデルおよび関連する基底関数は、先に
記載された例の場合と同じように定義される。Sound Recognition Model: The sound recognition model description scheme specifies a probabilistic model of a sound class, such as a hidden Markov model or a Gaussian mixture model. The following example is an instantiation of a hidden Markov model of the "bark" sound category 611 of FIG. 6. The probabilistic model and associated basis functions for the sound class are defined in the same way as in the examples described above.

【００７１】[0071]

【数８】 [Equation 8]

【００７２】音モデル状態パスこの記述子は有限状態確率モデルを参照し、そのモデル
を通して音の動的な状態パスを記述する。音をモデル状
態にセグメント化することにより、あるいは規則的な間
隔で状態パスをサンプリングすることにより、２つの態
様で音を指数化することができる。第１の場合には、各
可聴周波数帯音セグメントは、１つの状態への参照を含
み、そのセグメントの持続時間は、その状態のための有
効持続時間を指示する。第２の場合には、音は、モデル
状態を参照する、サンプリングされた一連の指数によっ
て記述される。比較的長い状態持続時間を有する音カテ
ゴリは、１セグメント、１状態アプローチを用いて効率
的に記述される。比較的短い状態持続時間を有する音
は、サンプリングされた一連の状態指数を用いて、さら
に効率的に記述される。Sound Model State Path: This descriptor refers to a finite-state probabilistic model and describes the dynamic state path of a sound through that model. A sound can be indexed in two ways: by segmenting the sound into model states, or by sampling the state path at regular intervals. In the first case, each audio segment contains a reference to one state, and the duration of the segment indicates the effective duration for that state. In the second case, the sound is described by a sampled series of indices that reference the model states. Sound categories with relatively long state durations are described efficiently using the one-segment, one-state approach. Sounds with relatively short state durations are described more efficiently using the sampled series of state indices.
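The two indexing styles can be related with a short sketch: a sampled state path (one state index per frame) is collapsed into one segment per run of identical states. The tuple layout and the 10 ms default frame period are assumptions made for this example.

```python
from itertools import groupby


def to_segments(state_path, frame_period_ms=10):
    """Collapse a per-frame state path into (state, start_ms, duration_ms) segments."""
    segments = []
    start_frame = 0
    for state, run in groupby(state_path):
        length = len(list(run))
        segments.append((state,
                         start_frame * frame_period_ms,
                         length * frame_period_ms))
        start_frame += length
    return segments


# Hypothetical sampled path: three frames in state 0, two in state 1, one in state 2.
sampled_path = [0, 0, 0, 1, 1, 2]
segments = to_segments(sampled_path)
```

A path with long runs yields few segments, which is why the segment form suits sounds with long state durations, while rapidly changing paths are more compact in sampled form.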

【００７３】図１１ａは、図６のイヌほえ声音６１１の
対数スペクトログラム（周波数対時間）１１００を示
す。図１１ｂは、同じ時間間隔にわたって、図１１ａの
ほえ声モデルのための連続隠れマルコフモデルを通した
状態の音モデル状態パスシーケンスを示す。図１１ｂで
は、ｘ軸は時間指数であり、ｙ軸は状態指数である。FIG. 11a shows a log spectrogram (frequency versus time) 1100 of the dog bark sound 611 of FIG. 6. FIG. 11b shows the sound model state path, that is, the sequence of states through a continuous hidden Markov model for the bark of FIG. 11a, over the same time interval. In FIG. 11b, the x-axis is the time index and the y-axis is the state index.

【００７４】音認識分類器図１２は、分類器の全ての必要な成分のために１つのデ
ータベース１２００を用いる音認識分類器を示す。その
音認識分類器は、多数の確率モデル間の関係を記述し、
それにより分類器のオントロジを定義する。たとえば、
階層的レコグナイザは、図６および図７の場合に記載さ
れるように、ルートノードにおいて、動物のような広範
な音のクラスを、また葉ノードにおいて、イヌ：ほえ
声、およびネコ：鳴き声、のような、より細かいクラス
を分類することができる。この方式は、グラフの記述子
方式構造を用いて、分類器のオントロジと音のカテゴリ
の分類法との間の対応関係を定義し、階層的音モデル
が、所与の分類法の場合にカテゴリ記述を抽出するため
に用いられるようにする。Sound Recognition Classifier: FIG. 12 shows a sound recognition classifier that uses a single database 1200 for all of the required components of the classifier. The sound recognition classifier describes the relationships between a number of probabilistic models, thereby defining an ontology of classifiers. For example, a hierarchical recognizer can classify broad sound classes, such as animals, at the root node, and finer classes, such as dog: bark and cat: meow, at the leaf nodes, as described for FIGS. 6 and 7. This scheme uses the graph description scheme structure to define the correspondence between the ontology of classifiers and the taxonomy of sound categories, so that the hierarchical sound models can be used to extract category descriptions for a given taxonomy.

【００７５】図１３は、モデルのデータベースを構成す
るためのシステム１３００を示す。図１３に示されるシ
ステムは、図１に示されるシステムの拡張形である。こ
こでは、スペクトル包絡線を抽出するためにフィルタリ
ングする前に、入力音響信号がウインドウ処理される。
そのシステムは、たとえば、ＷＡＶ形式のオーディオフ
ァイルの形で、可聴周波数帯音入力１３０１を取り込む
ことができる。そのシステムは、ファイルから可聴周波
数帯音特徴を抽出し、これらの特徴で隠れマルコフモデ
ルをトレーニングする。またそのシステムは、各音のク
ラスの場合に音の標本のディレクトリを用いる。階層的
ディレクトリ構造は、所望の分類法に対応するオントロ
ジを定義する。１つの隠れマルコフモデルが、そのオン
トロジの各ディレクトリの場合にトレーニングされる。FIG. 13 shows a system 1300 for building a database of models. The system shown in FIG. 13 is an extension of the system shown in FIG. 1. Here, the input acoustic signal is windowed before being filtered to extract the spectral envelope. The system can take audio input 1301 in the form of, for example, WAV format audio files. The system extracts audio features from the files and trains hidden Markov models with these features. The system also uses a directory of sound examples for each sound class. The hierarchical directory structure defines an ontology that corresponds to the desired taxonomy. One hidden Markov model is trained for each directory in the ontology.

【００７６】可聴周波数帯音特徴抽出図１３のシステム１３００は、上記のように、音響信号
から可聴周波数帯音スペクトル基底関数および特徴を抽
出するための方法を示す。入力音響信号１３０１は、１
つの音源、たとえば人、動物、楽器によって、あるいは
多数の音源、たとえば人と動物、多数の楽器、または合
成音によって生成することができる。後者の場合に、音
響信号は混合物である。入力音響信号は最初に１０ｍｓ
ｅｃフレームにウインドウ処理される（１３１０）。図
１では、入力信号は、ウインドウ処理前に帯域通過フィ
ルタリングされることに留意されたい。ここでは、音響
信号は最初にウインドウ処理され、その後フィルタリン
グされ（１３２０）、短時間対数周波数スペクトル（sh
ort-time logarithmic-in-frequency spectrum）を抽出
する。フィルタリングは、大きさを二乗した（squared-
magnitude）短時間フーリエ変換のような、時間−周波
数電力スペクトル分析を実行する。その結果は、Ｍ個の
フレームとＮ個の周波数（frequency bins）とを有する
行列である。スペクトルベクトルｘは、この行列の行で
ある。Audio Feature Extraction: The system 1300 of FIG. 13 illustrates a method for extracting audio spectrum basis functions and features from an acoustic signal, as described above. The input acoustic signal 1301 can be generated either by a single source, for example a person, an animal, or a musical instrument, or by multiple sources, for example a person and an animal, multiple instruments, or synthetic sounds. In the latter case, the acoustic signal is a mixture. The input acoustic signal is first windowed 1310 into 10 ms frames. Note that in FIG. 1 the input signal is band-pass filtered before windowing. Here, the acoustic signal is first windowed and then filtered 1320 to extract a short-time logarithmic-in-frequency spectrum. The filtering performs a time-frequency power spectrum analysis, such as a squared-magnitude short-time Fourier transform. The result is a matrix with M frames and N frequency bins. The spectral vectors x are the rows of this matrix.
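A minimal sketch of the squared-magnitude short-time Fourier transform, using NumPy, is given below. It assumes a 30 ms Hamming window advanced in 10 ms steps and deliberately leaves out the resampling to logarithmic frequency bands; the function name `power_spectrogram` and the 1 kHz test tone are illustrative only.

```python
import numpy as np


def power_spectrogram(signal, frame_len, hop, n_fft):
    """Return an (M frames x N bins) matrix of squared STFT magnitudes."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    rows = []
    for t in range(n_frames):
        frame = signal[t * hop: t * hop + frame_len] * window
        spectrum = np.fft.rfft(frame, n=n_fft)  # zero-pads the frame to n_fft
        rows.append(np.abs(spectrum) ** 2)
    return np.array(rows)


# Hypothetical input: 0.1 s of a 1 kHz tone sampled at 32 kHz.
rate = 32000
tone = np.sin(2.0 * np.pi * 1000.0 * np.arange(3200) / rate)
X = power_spectrogram(tone, frame_len=960, hop=320, n_fft=1024)
peak_bins = X.argmax(axis=1)  # bin spacing is rate / n_fft = 31.25 Hz
```

Each row of `X` plays the role of one spectral vector x in the text.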

【００７７】ステップ１３３０は、対数目盛の正規化を
実行する。各スペクトルベクトルｘは、電力スペクトル
からデシベル目盛１３３１に、ｚ＝１０ｌｏｇ
_１０（ｘ）で変換される。ステップ１３３２は、以下の
ようにベクトル要素のＬ２ノルムを決定する。Step 1330 performs log-scale normalization. Each spectral vector x is converted from the power spectrum to a decibel scale 1331 by z = 10 log10(x). Step 1332 determines the L2 norm of the vector elements as follows.

【００７８】[0078]

【数９】 [Equation 9]

【００７９】その後、新しい単位ノルムスペクトルベク
トルは、各スライスｚをその電力ｒで割ったｚ／ｒによ
ってスペクトル包絡線（〜）Ｘを決定され、結果として
正規化されたスペクトル包絡線（〜）Ｘ１３４０は、基
底抽出プロセス１３６０に渡される。なお、（〜）Ｘ
は、〜がＸの上に付いていることを表す。The new unit-norm spectral vectors, which form the spectral envelope (~)X, are then determined by z/r, that is, each slice z divided by its power r, and the resulting normalized spectral envelope (~)X 1340 is passed to the basis extraction process 1360. Here, (~)X denotes X with a tilde (~) above it.
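The normalization steps can be sketched as follows: conversion to decibels, the L2 norm r = sqrt(sum of z_k squared), and division by r. The small `floor` that guards against log10(0) is an assumption added for the example and is not part of the described method.

```python
import math


def normalize_frame(power, floor=1e-12):
    """Convert one power-spectrum slice to dB and scale it to unit L2 norm.

    Returns (unit-norm vector, r), where r is the L2 norm of the dB vector.
    `floor` is a hypothetical guard that avoids taking the log of zero.
    """
    z = [10.0 * math.log10(max(p, floor)) for p in power]
    r = math.sqrt(sum(v * v for v in z))
    if r == 0.0:
        return z, r
    return [v / r for v in z], r


# Hypothetical slice with powers 10, 100, 1000, that is, 10, 20 and 30 dB.
unit, r = normalize_frame([10.0, 100.0, 1000.0])
```

Keeping r alongside the unit-norm vector preserves the overall level that the normalization removes.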

【００８０】スペクトル包絡線（〜）Ｘは、各ベクトル
を、観測行列の形の行のようにする。結果的な行列の大
きさはＭ×Ｎである。ただし、Ｍは時間フレームの数で
あり、Ｎは周波数（frequency bins）の数である。その
行列は以下の構造を有するであろう。The spectral envelope (~)X places each vector as a row of an observation matrix. The size of the resulting matrix is M × N, where M is the number of time frames and N is the number of frequency bins. The matrix has the following structure.

【００８１】[0081]

【数１０】 [Equation 10]

【００８２】基底抽出基底関数は、図１の特異値分解ＳＶＤ１３０を用いて抽
出される。ＳＶＤは、コマンド［Ｕ，Ｓ，Ｖ］＝ＳＶＤ
（Ｘ，０）を用いて実行される。「簡潔な」ＳＶＤを用
いることが好ましい。簡潔なＳＶＤは、ＳＶＤの因数分
解中に不要な行および列を省略する。本発明では、行の
基底関数は必要ないため、ＳＶＤの抽出効率は高くな
る。ＳＶＤは以下のように行列を因数分解する。（〜）
Ｘ＝ＵＳＶ^Ｔただし、（〜）Ｘは３つの行列の行列積に
分解され、Ｕは行基底、Ｓは対角特異値行列であり、Ｖ
は転置された列基底関数である。その基底は、最初のＫ
個の基底関数のみ、すなわちＶの最初のＫ個の列のみを
保有することにより低減される。Basis Extraction: The basis functions are extracted using the singular value decomposition SVD 130 of FIG. 1. The SVD is performed using the command [U, S, V] = SVD(X, 0). It is preferred to use the "economy" SVD, which omits unnecessary rows and columns during the factorization. Because the invention does not require the row basis functions, this increases the extraction efficiency of the SVD. The SVD factors the matrix as (~)X = U S V^T, where (~)X is factored into the product of three matrices: the row basis U, the diagonal singular value matrix S, and the transposed column basis functions V^T. The basis is reduced by retaining only the first K basis functions, that is, the first K columns of V.
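The economy SVD and the truncation to K basis functions can be sketched with NumPy, where `full_matrices=False` plays the role of the economy form. The helper name `extract_basis` and the small rank-2 matrix are assumptions for illustration.

```python
import numpy as np


def extract_basis(X, K):
    """Return (V_K, singular values) from the economy SVD of X.

    X is an (M x N) matrix with one spectral vector per row; V_K is the
    (N x K) matrix whose columns are the first K basis functions.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    V_K = Vt[:K].T  # rows of V^T are the basis functions; keep the first K
    return V_K, s


# Hypothetical rank-2 observation matrix (3 frames x 4 bins).
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [2.0, 1.0, 2.0, -1.0],
              [3.0, 0.0, 3.0, 0.0]])
V_K, s = extract_basis(X, K=2)
Y = X @ V_K            # reduced-dimension features, one row per frame
X_hat = Y @ V_K.T      # reconstruction from the retained basis
```

Because this toy matrix has rank 2, two basis functions reconstruct it exactly; real spectrograms are only approximated.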

【００８３】[0083]

【数１１】 [Equation 11]

【００８４】ただしＫは典型的には、音の特徴による応
用形態の場合に３〜１０の基底関数の範囲にある。Ｋ個
の基底関数のために保有される情報の割合を決定するた
めに、行列Ｓ内に含まれる特異値が用いられる。Here, K is typically in the range of 3 to 10 basis functions for sound feature applications. The singular values contained in the matrix S are used to determine the proportion of information retained for K basis functions.

【００８５】[0085]

【数１２】 [Equation 12]

【００８６】ただし、Ｉ（Ｋ）はＫ個の基底関数の場合
に保有される情報の割合であり、Ｎはスペクトル（spec
tral bins）の数にも等しい基底関数の全数である。Ｓ
ＶＤ基底関数は、その行列の列に格納される。Here, I(K) is the proportion of information retained for K basis functions, and N is the total number of basis functions, which is also equal to the number of spectral bins. The SVD basis functions are stored in the columns of the matrix.
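The equation for I(K) is not reproduced in this text. The sketch below assumes the plain ratio of the sum of the first K singular values to the sum of all N singular values; a squared (energy) variant is also common, so the exact form should be treated as an assumption. The helper `smallest_k` is hypothetical.

```python
def retained_information(singular_values, k):
    """Assumed form of I(K): share of the singular-value sum held by the first k."""
    total = sum(singular_values)
    if total == 0:
        raise ValueError("all singular values are zero")
    return sum(singular_values[:k]) / total


def smallest_k(singular_values, target):
    """Smallest k whose retained share reaches `target` (hypothetical helper)."""
    for k in range(1, len(singular_values) + 1):
        if retained_information(singular_values, k) >= target:
            return k
    return len(singular_values)


# Hypothetical singular values, sorted in decreasing order as an SVD returns them.
sv = [6.0, 3.0, 1.0]
share_one = retained_information(sv, 1)
share_two = retained_information(sv, 2)
k_needed = smallest_k(sv, 0.85)
```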

【００８７】応用形態間で最大限に互換性を持たせるた
めに、基底関数は、単位Ｌ２ノルムを有する列を含み、
その関数は、他の取り得る基底関数に対してｋ次元の情
報を最大にする。基底関数は、ＰＣＡ抽出によって与え
られるような直交性か、あるいはＩＣＡ抽出によって与
えられるような非直交性にすることができる。以下を参
照されたい。基本射影および再構成は、以下の分析−合
成式によって記述される。For maximum compatibility between applications, the basis functions have columns with unit L2 norm, and the functions maximize the information in k dimensions with respect to other possible basis functions. The basis functions can be orthogonal, as given by PCA extraction, or non-orthogonal, as given by ICA extraction (see below). Basis projection and reconstruction are described by the following analysis-synthesis equations.

【００８８】[0088]

【数１３】 [Equation 13]

【００８９】ただし、Ｘはスペクトル包絡線であり、Ｙ
はスペクトル的特徴であり、Ｖは時間的特徴である。ス
ペクト的特徴は、特徴のｍ×ｋ観測行列から抽出され、
Ｘはスペクトルベクトルが行として編成されたｍ×ｎの
スペクトルデータ行列であり、Ｖは列に編成される基底
関数のｎ×ｋ行列である。Here, X is the spectral envelope, Y contains the temporal features (the projections), and V contains the spectral features (the basis). Y is the m × k observation matrix of features, X is the m × n spectral data matrix with the spectral vectors organized as rows, and V is the n × k matrix of basis functions organized in columns.

【００９０】最初の式は特徴抽出に対応し、第２の式は
スペクトル再構成に対応する。図８を参照されたい。た
だし、Ｖ^＋は、非直交性の場合のＶの擬似逆行列を表
す。The first equation corresponds to feature extraction and the second to spectrum reconstruction (see FIG. 8), where V^+ denotes the pseudo-inverse of V for the non-orthogonal case.
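The equation image itself is not reproduced in this text. Based on the description above (feature extraction first, reconstruction second, with the pseudo-inverse V^+), the analysis-synthesis pair can be written out as follows; this is a reconstruction from the surrounding definitions, not a copy of the original equation.

```latex
Y = X V, \qquad X \approx Y V^{+}
```

For an orthonormal basis, such as one obtained from the SVD, V^+ reduces to the transpose V^T.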

【００９１】独立成分分析低減されたＳＶＤ基底Ｖが抽
出された後に、オプションのステップが、最大限に統計
的に独立な方向に、基底回転を実行することができる。
これは、スペクトログラムの独立成分を分離し、特徴の
最大の分離を必要とする全ての応用形態について有用で
ある。先に得られた基底関数を用いて、統計的に独立し
た基底を見つけ出すために、よく知られており、幅広く
紹介されている独立成分分析（ＩＣＡ）プロセスのうち
の任意のものを用いることができる。たとえば、ＪＡＤ
ＥあるいはＦａｓｔＩＣＡがあり、Ｃａｒｄｏｓｏ，
Ｊ．Ｆ．およびＬａｈｅｌｄ，Ｂ．Ｈ．による「Equiva
riant adaptive source separation」（IEEE Trans. On
Signal Processing, 4: 112- 114, 1996）あるいはＨ
ｙｖａｒｉｎｅｎ，Ａによる「Fast and robust fixed-
point algorithms for independent component analysi
s」（IEEE Trans. On Neural Networks, 10(3): 626- 6
34, 1999）を参照されたい。Independent Component Analysis: After the reduced SVD basis V has been extracted, an optional step can perform a basis rotation to directions of maximal statistical independence. This isolates the independent components of a spectrogram and is useful for any application that requires maximum separation of features. Any of the well-known, widely published independent component analysis (ICA) processes can be used to find a statistically independent basis from the basis functions obtained above, for example JADE or FastICA; see Cardoso, J. F. and Laheld, B. H., "Equivariant adaptive source separation" (IEEE Trans. on Signal Processing, 4: 112-114, 1996), or Hyvarinen, A., "Fast and robust fixed-point algorithms for independent component analysis" (IEEE Trans. on Neural Networks, 10(3): 626-634, 1999).

【００９２】以下のＩＣＡの使用は、１組のベクトル
を、統計的に独立したベクトル［（−）Ｖ^Ｔ _ｋ，Ａ］＝
ｉｃａ（Ｖ^Ｔ _ｋ）に分解する。ただし、新しい基底は、
ＳＶＤ入力ベクトルと、ＩＣＡプロセスによって与えら
れる推定された混合行列Ａの擬似逆行列との積として得
られる。ＩＣＡ基底は、ＳＶＤ基底と同じ大きさであ
り、基底行列の列に格納される。保有される情報の比Ｉ
（Ｋ）は、所与の抽出方法を用いる際にＳＶＤに同等で
ある。基底関数（−）Ｖ_Ｋ１３６１は、データベース１
２００に格納することができる。なお、（−）Ｖは、−
がＶの上に付いていることを表す。The following use of ICA factors a set of vectors into statistically independent vectors: [(−)V^T_K, A] = ica(V^T_K), where the new basis is obtained as the product of the SVD input vectors and the pseudo-inverse of the estimated mixing matrix A given by the ICA process. The ICA basis is the same size as the SVD basis and is stored in the columns of the basis matrix. The retained information ratio I(K) is equivalent to that of the SVD when the given extraction method is used. The basis functions (−)V_K 1361 can be stored in the database 1200. Here, (−)V denotes V with a bar (−) above it.

【００９３】入力音響信号が多数の音源から生成される
混合物である場合に、ＳＶＤによって生成される特徴の
組は、その特徴の次元数に等しい次元数を有する任意の
既知のクラスタ化技法によって、群としてクラスタ化す
ることができる。これにより、類似の特徴が同じ群とし
て集められる。したがって、各群は、１つの音源によっ
て生成される音響信号の特徴を含む。クラスタ化におい
て用いられることになる群の数は、所望の弁別のレベル
に応じて、手動あるいは自動で設定することができる。When the input acoustic signal is a mixture produced from multiple sources, the set of features produced by SVD is by any known clustering technique having a dimensionality equal to that of the features. It can be clustered as a group. This brings similar features together in the same group. Thus, each group contains features of the acoustic signal produced by one sound source. The number of groups to be used in clustering can be set manually or automatically, depending on the level of discrimination desired.

【００９４】スペクトル部分空間基底関数の利用射影あるいは時間的特徴Ｙを求めるために、スペクトル
包絡線行列Ｘは、スペクトル的特徴Ｖの基底ベクトルと
掛け合わされる。このステップは、ＳＶＤおよびＩＣＡ
基底関数のいずれの場合とも同じであり、すなわち
（〜）Ｙ_ｋ＝（〜）Ｘ（−）Ｖ_ｋである。ただし、Ｙ
は、基底Ｖに対するスペクトルの射影後の次元数を低減
された特徴からなる行列である。Use of Spectral Subspace Basis Functions: To obtain the projection, or temporal features Y, the spectral envelope matrix X is multiplied by the basis vectors of the spectral features V. This step is the same for both SVD and ICA basis functions, namely (~)Y_k = (~)X (−)V_k, where Y is a matrix of reduced-dimension features obtained by projecting the spectrum onto the basis V.

【００９５】独立したスペクトログラム再構成および視
覚化のために、本発明は、正規化ステップ１３３０抽出
を省略することにより、正規化されないスペクトル射影
を抽出する。すなわち、Ｙ_ｋ＝Ｘ（−）Ｖ_ｋである。こ
こで、独立したスペクトログラムを再構成するために、
図８に示されるようなＸ_ｋ成分は、Ｋ番目の射影ベクト
ルｙ_ｋおよびＫ番目の逆基底ベクトルｖ_ｋに対応する個
別のベクトル対を利用し、再構成式Ｘ_ｋ＝ｙ_ｋ（−）ｖ
^＋ _ｋを適用する。ただし、「＋」演算子は、ＳＶＤ基底
関数のための転置を示し、ＳＶＤ基底関数は直交性であ
るか、あるいはＩＣＡの場合の擬似逆行列であり、非直
交性である。For independent spectrogram reconstruction and visualization, the invention extracts the non-normalized spectrum projection by skipping the normalization step 1330, that is, Y_k = X (−)V_k. To reconstruct an independent spectrogram component X_k, as shown in FIG. 8, the individual vector pair corresponding to the k-th projection vector y_k and the k-th inverted basis vector v_k is used, and the reconstruction equation X_k = y_k (−)v_k^+ is applied, where the "+" operator denotes the transpose for SVD basis functions, which are orthogonal, or the pseudo-inverse for ICA basis functions, which are non-orthogonal.
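The per-component reconstruction can be illustrated with plain outer products. The sketch assumes an orthonormal basis, so that the "+" operator is simply a transpose; the function names and the toy matrices are hypothetical.

```python
def outer(y, v):
    """Outer product of a length-m column y and a length-n row v (m x n)."""
    return [[yi * vj for vj in v] for yi in y]


def reconstruct(Y, V):
    """Return (per-component spectrograms, their sum).

    Y is (m x k) projections, V is (n x k) basis functions with orthonormal
    columns (an assumption: for ICA bases a pseudo-inverse would be needed).
    """
    m, n, k = len(Y), len(V), len(V[0])
    total = [[0.0] * n for _ in range(m)]
    parts = []
    for c in range(k):
        y_c = [row[c] for row in Y]   # c-th projection vector
        v_c = [row[c] for row in V]   # c-th basis function
        part = outer(y_c, v_c)
        parts.append(part)
        for i in range(m):
            for j in range(n):
                total[i][j] += part[i][j]
    return parts, total


# Hypothetical data: 2 frames, 3 frequency bins, 2 basis functions.
Y = [[2.0, 3.0], [4.0, 5.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
parts, total = reconstruct(Y, V)
```

Each entry of `parts` corresponds to one component spectrogram X_k, and `total` is the summed reconstruction of the kind shown in FIG. 8.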

【００９６】独立成分によるスペクトログラム要約化これらの記述子のための使用形態の１つは、完全なスペ
クトログラムより少ないデータでスペクトログラムを効
率的に表すことである。独立成分基底を用いると、たと
えば図８に示されるような、個々のスペクトログラム再
構成物は一般に、スペクトログラム内の音源対象物に対
応する。Spectrogram Summarization with Independent Components: One use of these descriptors is to represent a spectrogram efficiently, with less data than the full spectrogram. With an independent component basis, the individual spectrogram reconstructions, for example as shown in FIG. 8, generally correspond to source objects in the spectrogram.

【００９７】モデル獲得およびトレーニング音分類器を設計する際の困難な作業の大部分は、トレー
ニングデータを収集し、準備することに費やされる。音
の範囲は、音のカテゴリの範囲を反映することになる。
たとえば、イヌのほえ声は、個々のほえ声、連続した多
数のほえ声、あるいは一度に多数のイヌがほえる声を含
むことができる。モデル抽出プロセスは、データの範囲
に適応し、それにより、より狭い範囲の標本が、より特
殊化した分類器を生成する。Model Acquisition and Training: Much of the effort in designing a sound classifier is spent collecting and preparing training data. The range of sounds should reflect the scope of the sound category. For example, dog barks can include individual barks, multiple barks in succession, or many dogs barking at once. The model extraction process adapts to the scope of the data, so that a narrower range of examples produces a more specialized classifier.

【００９８】図１４は、既知の音源１４０１によって生
成される音響信号から、上記のように、特徴１４１０お
よび基底関数１４２０を抽出するためのプロセス１４０
０を示す。その後、これらを用いて、隠れマルコフモデ
ルをトレーニングする（１４４０）。トレーニングされ
たモデルは、それらの対応する特徴とともにデータベー
ス１２００に格納される。トレーニング中に、監視され
ていないクラスタ化プロセスを用いて、ｎ次元の特徴空
間をｋ個の状態に分割する。特徴空間は、次元数を低減
された観測ベクトルによって占められる。そのプロセス
は、ｋのための初期の推測を与えるとき、推移行列を切
り詰めることにより、所与のデータの場合の状態の最適
な数を決定する。典型的には、良好な分類器性能として
は、５〜１０状態で十分である。FIG. 14 shows a process 1400 for extracting features 1410 and basis functions 1420, as described above, from acoustic signals generated by known sound sources 1401. These are then used to train 1440 hidden Markov models. The trained models are stored in the database 1200 together with their corresponding features. During training, an unsupervised clustering process is used to partition an n-dimensional feature space into k states. The feature space is populated by the reduced-dimension observation vectors. Given an initial guess for k, the process determines the optimal number of states for the given data by pruning the transition matrix. Typically, five to ten states are sufficient for good classifier performance.

【００９９】隠れマルコフモデルは、Ｆｏｒｗａｒｄ−
Ｂａｃｋｗａｒｄプロセスとしても知られる、よく知ら
れているＢａｕｍ−Ｗｅｌｃｈプロセスの変形プロセス
でトレーニングされる。これらのプロセスは、事前エン
トロピー（entropic prior）の使用、および期待最大
（ＥＭ）プロセスの決定論的アニーリングの実施によっ
て拡張される。The hidden Markov models are trained with a variant of the well-known Baum-Welch process, also known as the Forward-Backward process. The process is extended by the use of an entropic prior and by deterministic annealing of the expectation-maximization (EM) process.

【０１００】適切なＨＭＭトレーニングプロセス１４３
０に関する詳細については、Brandによる「Pattern dis
covery via entropy minimization」（Proceedings, Un
certainty'99. Society of Artificial intelligence a
nd Statistics #7, MorganKaufmann, 1999）およびBran
dによる「Structure discovery in conditional probab
ility models via an entropic prior and parameter e
xtinction」（Neural Computation, 1999）に記載され
る。Details of a suitable HMM training process 1430 are described in Brand, "Pattern discovery via entropy minimization" (Proceedings, Uncertainty'99, Society of Artificial Intelligence and Statistics #7, Morgan Kaufmann, 1999), and Brand, "Structure discovery in conditional probability models via an entropic prior and parameter extinction" (Neural Computation, 1999).

【０１０１】各既知の音源のための各ＨＭＭがトレーニ
ングされた後、そのモデルは、その基底関数、すなわ
ち、音の特徴の組とともに永続記憶装置１２００に保管
される。音のカテゴリの分類法全体に対応して、多数の
音のモデルがトレーニングされているとき、ＨＭＭはと
もに、より大きな音認識分類器データ構造に集められ、
それにより図１２に示されるようなモデルのオントロジ
が生成される。そのオントロジを用いて、定性的および
定量的記述子を有する新しい音を指数化する。After each HMM for each known source has been trained, the model is saved in persistent storage 1200 together with its basis functions, that is, its set of sound features. When a number of sound models have been trained, corresponding to an entire taxonomy of sound categories, the HMMs are collected together into a larger sound recognition classifier data structure, thereby generating an ontology of models as shown in FIG. 12. The ontology is used to index new sounds with qualitative and quantitative descriptors.

【０１０２】 Sound Descriptor

FIG. 15 shows an automatic extraction system 1500 for indexing sounds in a database using pre-trained classifiers stored as DDL files. An unknown sound is read from a media source format, such as a WAV file 1501. The unknown sound is spectrally projected 1520 as described above. The projection, i.e., the set of features, is then used to select 1530 one of the HMMs from the database 1200. A Viterbi decoder 1540 can be used to give both the best-fit model and the state path through that model for the unknown sound. That is, there is one model state for each windowed frame of the sound; see FIG. 11b. Each sound is then indexed by its category, model reference, and model state path, and its descriptors are written to the database in DDL format. The indexed database 1599 can then be searched for matching sounds using any of the stored descriptors described above, for example, all dog barks. The generally similar sounds can then be presented in a result list 1560.

【０１０３】 FIG. 16 shows classification performance for ten sound classes 1601 to 1610, respectively: bird calls, applause, dog barks, explosions, footsteps, breaking glass, gunshots, gym shoes, laughter, and telephones. The performance of the system was measured against a ground truth, using the sound-effect labels specified by a professional sound-effects library. The results shown are for novel sounds that were not used during training of the classifiers, and therefore demonstrate the generalization capability of the classifier. The average performance is about 95% correct.

【０１０４】 Sample Search Applications

The following sections give examples of how to use the description scheme to perform searches using both DDL-based queries and media source-format queries.

【０１０５】 Query by Example Using DDL

As shown in simplified form in FIG. 17, a sound query is presented to the system 1700 using a sound model state path description 1710 in DDL format. The system reads the query and populates internal data structures with the description information. This description is matched 1550 against descriptions taken from the sound database 1599 stored on disk. A sorted result list 1560 of the closest matches is returned.

【０１０６】 The matching step 1550 can use the sum of squared errors (SSE) between state path histograms. This matching procedure requires little computation and can be computed directly from the stored state path descriptors.

【０１０７】 A state path histogram is the total length of time a sound spends in each state divided by the total length of the sound, which gives a discrete probability density function with the state index as the random variable. The SSE between the query sound histogram and the histogram of each sound in the database is used as a distance metric. A distance of zero implies an identical match, and increasing non-zero distances indicate increasingly dissimilar matches. This distance metric is used to rank the sounds in the database by similarity, and the desired number of matches is then returned as a list with the closest match first.
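The histogram and SSE ranking described above can be sketched in a few lines. This is a minimal illustration rather than the patented implementation: it assumes every frame has the same duration (so frame counts stand in for time spent in a state), that state paths are non-empty lists of integer state indices, and the function names and the dictionary-based database are hypothetical.

```python
import numpy as np

def state_path_histogram(state_path, num_states):
    """Fraction of frames spent in each state; the entries sum to 1."""
    counts = np.bincount(np.asarray(state_path), minlength=num_states)
    return counts / counts.sum()

def sse(hist_a, hist_b):
    """Sum of squared errors between two histograms of equal length."""
    diff = np.asarray(hist_a) - np.asarray(hist_b)
    return float(np.dot(diff, diff))

def rank_by_similarity(query_path, database_paths, num_states, top_n):
    """Return up to top_n (name, distance) pairs, closest match first."""
    query_hist = state_path_histogram(query_path, num_states)
    scored = [
        (name, sse(query_hist, state_path_histogram(path, num_states)))
        for name, path in database_paths.items()
    ]
    scored.sort(key=lambda item: item[1])
    return scored[:top_n]
```

Because only the histograms are compared, two sounds that visit the same states in a different order are treated as identical under this metric.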

【０１０８】 FIG. 18a shows a state path, and FIG. 18b shows a state path histogram, for a laughter sound query. FIG. 19a shows the state paths, and FIG. 19b shows the histograms, for the five best matches to the query. All of the matches are from the same class as the query, which indicates that the system is performing correctly.

【０１０９】 To exploit the structure of the ontology, sounds within equivalent or narrower categories, as defined by the taxonomy, are returned as matches. Thus, a query on the "dog" category will return sounds belonging to all categories related to "dog" in the taxonomy.
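A category query that also returns narrower categories can be sketched as a simple traversal. The taxonomy layout (a dictionary mapping each category to its narrower categories), the category names, and both function names are assumptions made for illustration; the description does not specify a data structure.

```python
def narrower_categories(taxonomy, category):
    """Return the category itself plus every narrower category beneath it."""
    found = []
    stack = [category]
    while stack:
        current = stack.pop()
        if current in found:
            continue  # guard against cycles or repeated links
        found.append(current)
        stack.extend(taxonomy.get(current, []))
    return found

def query_by_category(taxonomy, indexed_sounds, category):
    """Return names of sounds whose category is equal to or narrower than the query."""
    wanted = set(narrower_categories(taxonomy, category))
    return [name for name, cat in indexed_sounds if cat in wanted]
```

For example, with a hypothetical taxonomy in which "dogs" has the narrower categories "dog barks" and "dog whines", a query on "dogs" returns sounds indexed under any of the three, but not sounds indexed under the broader "pets" category.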

【０１１０】 Query by Example Using Audio

The system can also perform a query using an audio signal as input. Here, the input to the query-by-example application is an audio query instead of a DDL description-based query. In this case, the audio feature extraction process is performed first, namely spectrogram and envelope extraction, followed by projection onto the stored set of basis functions for each model in the classifier.
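The extraction and projection steps for an audio query might look roughly as follows. This is a sketch under stated assumptions, not the method as claimed: it uses a Hann window and linear FFT bins, the frame length, hop size, and floor value are arbitrary, the decibel scaling and unit L2 normalization follow the normalization mentioned in the claims, and the basis matrix is assumed to have one row per frequency bin. The signal must be at least one frame long.

```python
import numpy as np

def spectral_envelope_db(signal, frame_len=512, hop=256, floor=1e-10):
    """Windowed power spectrum of each frame on a decibel scale."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(power + floor)  # floor avoids log of zero

def project_onto_basis(envelope_db, basis):
    """Normalize each frame to unit L2 norm, then project onto the basis.

    envelope_db: (frames, bins); basis: (bins, reduced_dims).
    """
    norms = np.linalg.norm(envelope_db, axis=1, keepdims=True)
    unit = envelope_db / np.maximum(norms, 1e-12)
    return unit @ basis
```

Each model in the classifier stores its own basis functions, so the same envelope would be projected once per model before decoding.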

【０１１１】 The resulting dimension-reduced features are passed to the Viterbi decoder for the given classifier, and the HMM with the maximum-likelihood score for the given features is selected. The Viterbi decoder essentially functions as a model-matching algorithm for the classification scheme. The model reference and state path are recorded, and the results are matched against a pre-computed database as in the first example.
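Model selection by Viterbi decoding can be sketched as below. Several points are assumptions for illustration: the score used to compare models is the log probability of the single best state path, each model is a dictionary with hypothetical keys `log_start`, `log_trans`, and `log_emit`, and the emission model is a unit-covariance Gaussian per state, which is simpler than the full-covariance Gaussians described for the HMM states.

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit):
    """Most likely state path and its log score.

    log_start: (k,) log initial-state probabilities
    log_trans: (k, k) log transition matrix, rows index the current state
    log_emit:  (t, k) log likelihood of each frame under each state
    """
    num_frames, num_states = log_emit.shape
    score = log_start + log_emit[0]
    backptr = np.zeros((num_frames, num_states), dtype=int)
    for t in range(1, num_frames):
        candidates = score[:, None] + log_trans  # indexed [previous, next]
        backptr[t] = np.argmax(candidates, axis=0)
        score = candidates[backptr[t], np.arange(num_states)] + log_emit[t]
    path = [int(np.argmax(score))]
    for t in range(num_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    path.reverse()
    return path, float(np.max(score))

def unit_gaussian_log_emit(means):
    """Build a per-state emission scorer with unit covariance (a simplification)."""
    means = np.asarray(means, dtype=float)
    def log_emit(features):
        x = np.asarray(features, dtype=float)
        diff = x[:, None, :] - means[None, :, :]
        n = means.shape[1]
        return -0.5 * np.sum(diff ** 2, axis=2) - 0.5 * n * np.log(2 * np.pi)
    return log_emit

def select_model(models, features):
    """Return (name, state_path, score) for the best-scoring model."""
    best = None
    for name, model in models.items():
        path, score = viterbi(model["log_start"], model["log_trans"],
                              model["log_emit"](features))
        if best is None or score > best[2]:
            best = (name, path, score)
    return best
```

The returned model name and state path correspond to the model reference and state path that are recorded and then matched against the pre-computed database.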

【０１１２】 It is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

[Brief description of drawings]

【図１】 FIG. 1 is a flow diagram of a method for extracting features from a mixture of signals according to the invention.

【図２】 FIG. 2 is a block diagram of the filtering and windowing steps.

【図３】 FIG. 3 is a block diagram of the normalizing, reducing, and extracting steps.

【図４】 FIG. 4 is a graph of features of a metallic percussion instrument.

【図５】 FIG. 5 is a graph of features of a metallic percussion instrument.

【図６】 FIG. 6 is a block diagram of a description model for dog barks.

【図７】 FIG. 7 is a block diagram of a description model for pet sounds.

【図８】 FIG. 8 is a spectrogram reconstructed from four spectral basis functions and basis projections.

【図９ａ】 FIG. 9a is a basis projection envelope for laughter.

【図９ｂ】 FIG. 9b is an audio spectrum for the laughter of FIG. 9a.

【図１０ａ】 FIG. 10a is a log-scale spectrogram for laughter.

【図１０ｂ】 FIG. 10b is a reconstructed spectrogram for laughter.

【図１１ａ】 FIG. 11a is a log-scale spectrogram for a dog bark.

【図１１ｂ】 FIG. 11b is a sound model state path sequence of states through a continuous hidden Markov model for the dog bark of FIG. 11a.

【図１２】 FIG. 12 is a block diagram of a sound recognition classifier.

【図１３】 FIG. 13 is a block diagram of a system for extracting sounds according to the invention.

【図１４】 FIG. 14 is a block diagram of a process for training a hidden Markov model according to the invention.

【図１５】 FIG. 15 is a block diagram of a system for identifying and classifying sounds according to the invention.

【図１６】 FIG. 16 is a graph of the performance of the system of FIG. 15.

【図１７】 FIG. 17 is a block diagram of a sound query system according to the invention.

【図１８ａ】 FIG. 18a is a block diagram of a state path for laughter.

【図１８ｂ】 FIG. 18b is a histogram of the state path for laughter.

【図１９ａ】 FIG. 19a is a diagram showing the state paths of matching laughter sounds.

【図１９ｂ】 FIG. 19b shows histograms of the state paths of matching laughter sounds.

Continuation of front page. (72) Inventor: Michael A. Kasei, 26 Chauncey Street, No. 9, Cambridge, Massachusetts, United States. F-term (reference): 5D015 AA06 HH23

Claims

[Claims]

1. A method for extracting features from an acoustic signal generated from one sound source, comprising the steps of: windowing and filtering the acoustic signal to generate a spectral envelope; and reducing the dimensionality of the spectral envelope to generate a set of features, the set including spectral features characterizing the sound source and corresponding temporal features.

2. The method for extracting features from an acoustic signal generated from one sound source according to claim 1, further comprising the step of multiplying the spectral features by the temporal features using an outer product to reconstruct a spectrogram of the acoustic signal.

3. The method for extracting features from an acoustic signal generated from one sound source according to claim 1, further comprising the step of applying independent component analysis to the set of features to separate the features in the set.

4. The method for extracting features from an acoustic signal generated from one sound source according to claim 1, further comprising the step of, prior to reducing the dimensionality of the spectral envelope, logarithmically scaling the spectral envelope to a decibel scale and normalizing it to a unit L2 norm.

5. A method for extracting features from an acoustic signal generated from a plurality of sound sources, comprising the steps of: windowing and filtering the acoustic signal to generate a spectral envelope; reducing the dimensionality of the spectral envelope to generate a set of features; and clustering the features in the set to generate a group of features for each of the plurality of sound sources, the features in each group including spectral features characterizing each sound source and corresponding temporal features.

6. The method for extracting features from an acoustic signal generated from a plurality of sound sources according to claim 5, wherein the features of each group are a quantitative descriptor of each sound source, the method further comprising the step of associating a qualitative descriptor with each quantitative descriptor to generate a category for each sound source.

7. The method for extracting features from an acoustic signal generated from a plurality of sound sources according to claim 6, further comprising the steps of: organizing the categories in a database as a taxonomy of classified sound sources; and relating each category in the database to at least one other category by an associative link.

8. The method for extracting features from an acoustic signal generated from a plurality of sound sources according to claim 7, wherein the categories are stored in the database using a description definition language (DDL).

9. The method for extracting features from an acoustic signal generated from a plurality of sound sources according to claim 8, wherein a particular category in a DDL instantiation defines a basis projection matrix that reduces a series of logarithmic frequency spectra of a particular sound source to a smaller number of dimensions.

10. The method for extracting features from an acoustic signal generated from a plurality of sound sources according to claim 6, wherein the categories include environmental sounds, background noise, sound effects, overlapping sounds, animal sounds, speech, non-speech utterances, and music.

11. The method for extracting features from an acoustic signal generated from a plurality of sound sources according to claim 7, further comprising the step of combining generally similar categories in the database as a hierarchy of classes.

12. The method for extracting features from an acoustic signal generated from a plurality of sound sources according to claim 6, wherein a particular quantitative descriptor further includes a harmonic envelope descriptor and a fundamental frequency descriptor.

13. The method for extracting features from an acoustic signal generated from a plurality of sound sources according to claim 5, wherein the temporal features describe a trajectory of the spectral features over time, the method further comprising the steps of: partitioning the acoustic signal generated by a particular sound source into a finite number of states based on the corresponding spectral features; representing each state by a continuous probability distribution; and representing the temporal features by a transition matrix to model the probability of a transition to a next state given the current state.

14. The method for extracting features from an acoustic signal generated from a plurality of sound sources according to claim 13, wherein the continuous probability distribution is a Gaussian distribution parameterized by a 1 × n vector of means m and an n × n covariance matrix K, where n is the number of spectral features in each spectral envelope, and the probability of a particular spectral envelope x is given by [Equation 1].

15. The method for extracting features from an acoustic signal generated from a plurality of sound sources according to claim 5, wherein each of the sound sources is known, the method further comprising the steps of: training, for each known sound source, a hidden Markov model with the set of features; and storing each trained hidden Markov model in a database together with the associated set of spectral features.

16. The method for extracting features from an acoustic signal generated from a plurality of sound sources according to claim 5, wherein a set of acoustic signals belongs to a known category, the method further comprising the steps of: extracting spectral basis functions for the acoustic signals; training a hidden Markov model using the temporal features of the acoustic signals; and storing each trained hidden Markov model together with the associated spectral basis functions.

17. The method for extracting features from an acoustic signal generated from a plurality of sound sources as described, further comprising the steps of: generating an unknown acoustic signal from an unknown sound source; windowing and filtering the unknown acoustic signal to generate an unknown spectral envelope; reducing the dimensionality of the unknown spectral envelope to generate a set of unknown features, the set including unknown spectral features characterizing the unknown sound source and corresponding unknown temporal features; and selecting one of the stored hidden Markov models that best fits the set of unknown features to identify the unknown sound source.

18. The method for extracting features from an acoustic signal generated from a plurality of sound sources according to claim 17, wherein a plurality of the stored hidden Markov models are selected to identify a plurality of unknown sound sources that are generally similar to the unknown sound source.