JP4219539B2 - Acoustic classification device

Info

Publication number: JP4219539B2
Application number: JP2000245388A
Authority: JP
Inventors: 隆司西; 靖茂中山; 哲夫梅田
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2000-08-11
Filing date: 2000-08-11
Publication date: 2009-02-04
Anticipated expiration: 2020-08-11
Also published as: JP2002062892A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声や音楽などの音響を分類するための音分類装置に関する。
【０００２】
【従来の技術】
音声や音楽などの音響を自動識別する技術を開示した文献としては、過去に、以下の文献が発表されている。
【０００３】
・文献１：Ｊ．ＡｕｄｉｏＥｎｇ．Ｓｏｃ．，Ｖｏｌ．４７，Ｎｏ９，１９９９Ｓｅｐ．、７２０−７２５
音響信号を周波数分析し、音楽・音声の周波数特性の違いを利用して音響信号を識別する方法が開示されている。この方法は音楽、音声の主要な周波数成分が一般に分離できないために特徴の検出精度が低いという欠点を有する。
【０００４】
・文献２：Ｐｒｏｃ．ＩＣＡＳＳＰ９６，９９３−９９６，著者名Ｊｏｈｎ
Ｓａｕｎｄｅｒｓ
音響信号のゼロクロス分布に基づいた４種類の統計量およびエネルギから抽出した物理量の全体で５種類の特徴量ベクトルを記述し、多変量判別関数を用いて、音声・音楽の分類を行なう方法が開示されている。この方法はゼロクロスから抽出した統計量が全体の８割を占めており、識別結果がゼロクロス分布に依存し、音響の種類によっては分類制度が落ちるという欠点を有する。
【０００５】
・文献３：Ｐｒｏｃ．ＩＣＡＳＳＰ９７，１３３１−１３３４，著者名ＥｒｉｃＳｃｈｅｉｄｅｒ，ＭａｌｃｏｍＳｌａｎｅｙ
音響信号を一定の窓内で分析して得られたゼロクロス数、周波数重心、低いパワーレベルの時間率等１３種類の特徴ベクトルを基に多変量判別関数を用いて音声・音楽の分類を行なう方法が開示されている。この方法は音響の特徴の種類が１３種類と多く、分類の結果を得るまでの時間が長くなり、高速での分類処理には不向きである。
【０００６】
【発明が解決しようとする課題】
上述したように上記文献１〜３に記載されている方法は、処理速度および分類精度の双方を満足することはできない。
【０００７】
そこで、本発明の目的は、高速かつ高い分類精度で分類を行なうことができる音響分類装置を提供することにある。
【０００８】
【課題を解決するための手段】
このような目的を達成するために、請求項１の発明は、音響信号を分類する音響分類装置であって、音響信号から音響の特徴として一定時間長ごとの音響の統計的性質を抽出する特徴抽出手段と、前記特徴抽出手段により分類内容が既知の学習用音響信号の音響の特徴を取得し、当該取得した音響の特徴が入力、対応する分類内容が出力となるようにニューラルネットワークに学習させる学習手段と、前記特徴抽出手段により分類対象の音響信号の音響の特徴を取得し、当該取得した音響の特徴を前記ニューラルネットワークに入力し、該ニューラルネットワークの出力を受け取ることにより、前記分類対象の音響信号を分類する識別手段とを具え、前記音響の統計的性質は、前記音響信号のゼロクロス分布、レベル分布および最大パワーを与える周波数の分布であり、前記特徴抽出手段は、予め音声信号および音楽信号に関する複数種類の前処理用音源信号から音響の特徴を示す統計分布形状の主成分分析を行い、分析の結果として得られる固有ベクトルを使用して、前記学習用音響信号の音響の特徴および前記分類対象の音響信号の特徴を取得することを特徴とする。
【０００９】
請求項２の発明は、請求項１に記載の音響分類装置において、前記レベル分布は、フレーム内の時間平均レベルに対する相対レベルの頻度分布として与えられ、前記最大パワーを与える周波数の分布は、１フレーム内を所定のＬ点ずつシフトし、それぞれの時間窓内で最大のパワーとなる周波数の出現頻度を求め、これを物理統計量として採用したものであることを特徴とする。
【００１１】
【発明の実施の形態】
以下、図面を参照して本発明の実施形態を詳細に説明する。
【００１２】
図１は本発明を適用した音響分類装置の機能構成を示す。音響分析装置としてはパーソナルコンピュータなどのプログラムを実行可能な汎用コンピュータを使用することができる。以下に述べる構成部は、後述のプログラムをＣＰＵ等が実行することにより実現される。
【００１３】
図１において、１００は音声信号および音楽信号の複数種類の音源信号から一定時間長ごとの音響の統計的性質を抽出し、その特徴的性質の主成分分析を行い、その分駅結果として得られる固有ベクトルを取得する前処理部である。
【００１４】
２００は分類内容が既知の学習用音響信号の音響の特徴を取得し、前処理部１００により得られた固有ベクトルを使用して学習用音響信号の特徴の主成分を取得し、当該取得した音響の特徴の主成分が入力、対応する分類内容が出力となるようにニューラルネットワークに学習させる学習部である。
【００１５】
３００は分類対象の前処理部１００により分類対象の音響信号の音響の特徴を取得し、前処理部１００により得られた固有ベクトルを使用して分類対象の音響信号の主成分を取得し、当該取得した音響の特徴の主成分をニューラルネットワークに入力し、ニューラルネットワークの出力を受け取ることにより、分類対象の音響信号を分類する識別部である。
【００１６】
図２は前処理部１００の機能を実現するためのプログラムの処理内容を示す。以下、図２の各処理について説明する。
【００１７】
（１−１）時間分割処理（ステップＳ１０）
音声信号および音楽信号などの複数種の前処理用音源信号はマイクロホンから入力され、汎用コンピュータにおいて、アナログ信号からデジタル信号に変換され、内部メモリに一時記憶された後ＣＰＵにより以下の処理が行なわれる。すなわち、所定時間の間で採取した音響信号の時間時間軸方向の統計特徴量を求める際、音響信号は図５（Ａ）に示すようにＭサンプルからなるブロックをＮブロック隣接して並べたフレームで時間分割される。なお、後述するが周波数分析のために時間軸上のＫ点の音響信号はそのフレーム内で図５（Ｂ）に示すようにＬ点ずつシフトされる。
【００１８】
（１−２）音響特徴量の抽出（ステップＳ２０）
音声、音楽の物理的な違いを反映した特徴量として本実施形態では、ゼロクロス分布、レベル分布および周波数分布の統計特徴量を使用する。これらの分布の特徴を以下に説明する。
【００１９】
（ａ）ゼロクロス分布
この物理量は一定時間（１ブロック）で信号がゼロレベルを横切る回数の時間軸上の分布である。ゼロクロス分布は大きなパワーを持つ音響信号の周波数と相関性が高い。たとえば、高い周波数成分が優勢な信号ではゼロクロス数が多くなる傾向を示す。
【００２０】
本実施形態では、隣接したＮブロックを１ブロックＭサンプルごとにゼロクロス数を求め、１フレーム分まとめてＮブロックで平均した頻度分布を得た。このセロクロス数の頻度分布パターンの形状を統計量として採用する。
【００２１】
音声信号の場合、母音、子音＋母音（ゼロクロス数：少）の他に、摩擦音や破裂音（ゼロクロス数：多）が混在するため、１フレームの平均頻度分布はゼロクロス数が多い値と少ない値の２極に分離した形状を示す。一方、音楽の場合には単一の山形のけ上を示す。
【００２２】
（ｂ）レベル分布
この物理量は、１ブロックＭサンプルサンプルの時間平均レベルの１フレームの音響信号の時間平均レベル（０ｄＢ）に対する相対レベルをもとに、この相対レベルのフレーム内で頻度分布を算出したものである。音声信号の場合、音声と音声の信号間に無音区間がある場合が多いため、音楽に比べて頻度分布の分散が大きくなる傾向を示す。
【００２３】
（ｃ）最大パワーを与える周波数の分布（以下、周波数分布）
窓掛したＫ点の時間波形をＦＦＴ（高速フーリエ解析）して求めた振幅周波数特性を、聴覚に対応するように対数周波数軸上に並び替え最大のパワーを与える周波数をその時間を代表する物理量とした。１フレーム内をＬ点ずつシフトし、それぞれの時間窓内で最大のパワーとなる周波数の出現頻度を求め、これを物理統計量として採用した。音声信号の場合、ゼロクロス分布同様、高い周波数と低い周波数の２極に分離した文献上でレベル分布を表す。この統計量を使用することにより、単一楽器のように音域が狭いものと、オーケストラのように音域が広いものを分布パターンの形状で分類できるため、音楽信号をさらに詳細に分類することができる。
【００２４】
（１−３）統計量算出および平均処理（ステップＳ３０、Ｓ４０）
上述の音響特徴量はフレームごとに、音響信号（学習用音響信号および分類対象の音響信号）から抽出されるので、音響信号の複数フレームから抽出した音響特徴量を使用して音響統計特徴量を算出し、また、音響信号ごとに統計量の平均化を行なう。
【００２５】
（１−４）主成分分析処理および固有ベクトルの保存（ステップ５０、Ｓ６０）
本実施形態では、１フレームの統計量は多次元のベクトルであるが、音響特徴量の次元数を減少させるために、主成分分析処理、より具体的には、統計分布形状の特徴を保ったまま、（たとえば、因子負荷量９０％以上）主成分を抽出する処理を行なう。また、統計量の主成分を求める際の変換マトリクスとして以後の処理で使用するために、上記主成分分析の結果得られた固有ベクトルを装置内部の記憶手段に保存する。
【００２６】
図３は図１の学習部２００の機能をプログラムで実現するための処理内容を示す。
【００２７】
マイクロホンなどから入力される内容が既知の学習用音響信号はアナログ信号からデジタル信号に変換された後汎用コンピュータにより処理される。
【００２８】
（２−１）時間分割処理〜統計量算出処理（ステップＳ１００〜Ｓ１２０）
時間分割処理、音響特徴量抽出処理および統計量算出処理は上述の前処理と処理内容が同一であり、これらの処理については共通のプログラム（サブルーチンプログラム）を使用することができる。
【００２９】
（２−２）主成分抽出処理（ステップＳ１３０）
固有ベクトルを使用して主成分を取得するには公知の技術である多変量解析を用いて行なう。多変量解析については、例えば、「石村貞夫著、すぐわかる多変量解析、第４章（東京図書）に詳述されている。
【００３０】
（２−３）ニューラルネットワークの学習
上述の処理ステップで得られた学習用音響信号の特徴の主成分と、学習音響信号の種類内容を示す既知の分類データとをニューラルネットワークに学習させる。
【００３１】
ニューラルネットワークおよびその学習方法については周知であるが、発明に係るので、簡単に説明しておく。
【００３２】
ニューラルネットワークの代表的な構成を図６に示す。図１が入力層、２が中間層、３が出力装置である。各層にニューロンを使用することができる。ニューロンは入力信号と出力信号の間の相関関係が予め定められた関数（本実施形態では数学的に取扱の便利なシグモイド関数を使用）で表される。したがって、入力層１にある値を有する入力信号を入力すると、出力層３からは所定の値を持つ出力信号が出力される。そこで、入力層１に対して、学習用音響信号から得られた特徴の主成分を入力し、その分類内容を示すデータが出力層３から出力されるように各ニューロンの相関関数（伝達関数とも呼ばれる）の係数を学習する。学習方法としては多数の提案があるが、代表的な例は、初期的にある係数を各ニューロンの相関関数に与え、入力信号をニューラルネットワークの入力層１へに与えて、出力層３からの出力信号を計算する。出力信号の値が、目標とする分類データの値からかけ離れている場合には上記初期値を少しずつ微小変更しながら、出力信号の値が目標となる分類データの値となるまで微小変更を試行錯誤的に繰り返す。出力信号の値が許容範囲内となったとき、その時に、使用された相関関数の係数が学習結果となる。以上の学習処理はコンピュータのプログラム実行で実現可能である。複数組の入力信号を与える場合には、それらの入力信号をそれぞれ与えたときに、出力信号が対応する分類データの値となるような相関関数の係数を検出することになる。
【００３３】
図４は図１の識別部３００の機能をプログラムで実現するための処理内容を示す。
【００３４】
（３−１）時間分割処理〜主成分抽出処理（ステップＳ２００〜Ｓ２３０）
時間分割処理〜統計処理の処理内容は学習処理と同様である。ただ、音響分類装置に入力される音響信号が種類が未知の音響信号である点が異なる。
【００３５】
（３−２）ニューラルネットワーク計算処理
上述の学習処理でニューラルネットワークの各ニューロンの相関関数（正確には係数）が決定されているので、これら相関関数を使用して、出力信号の値を計算する。ニューラルネットワークは周知のように、入力信号と出力信号の間の関係を学習させた後、そのニューラルネットワークに入力信号を与えると、その入力信号に類似する学習入力信号に対応する出力信号を出力するという性質がある。そのため、計算の結果として得られる分類データの値が分類の識別結果となる。分類結果は音響分類装置（汎用コンピュータ）のディスプレイに表示してもよいし、印刷出力してもよい。
【００３６】
以上述べた実施形態に限定されず、用途に応じて種々の変形が可能である。たとえば、ニューラルネットワークを構成するニューロンの個数は用途に応じて適宜、定めればよい。また、音響分類の用途としては音響データベースの高速検索、音響信号の自動インデキシングおよびこのインデックスをサイド情報として用いた画像検索、音声認識処理の前処理、放送音声の自動モニターが考えられる。
【００３７】
【発明の効果】
以上、説明したように、本発明によれば、音響信号の分類のためにニューラルネットワークを使用することにより分類処理を高速化することができ、また、音響の特徴として、音響のゼロクロス分布、レベル分布および周波数分布のような統計的性質を使用するので、単一の音楽楽器による音楽やオーケストラのような音楽を分類することも可能となり、人間の判断と同程度の識別性能も得られる。さらに、識別したい音響信号も多数種とすることができ、その結果、放送音声の自動モニター、音声データベースの自動インデキシングおよび検索の高速化、マルチメディアを使用した検索の効率化に寄与することができる。
【図面の簡単な説明】
【図１】本発明実施形態の機能構成を示すブロック図である。
【図２】前処理の内容を示すフローチャートである。
【図３】学習処理の内容を示すフローチャートである。
【図４】識別処理の処理内容を示すフローチャートである。
【図５】（Ａ）および（Ｂ）は時間軸上の信号分割処理を説明するための説明図である。
【図６】ニューラルネットワークを説明するための説明図である。
【符号の説明】
１００前処理部
２００学習部
３００識別部
[0001]
[Technical Field of the Invention]
The present invention relates to an acoustic classification device for classifying sounds such as speech and music.
[0002]
[Prior art]
The following documents have been published in the past as documents disclosing a technology for automatically identifying sounds such as voice and music.
[0003]
Reference 1: J. Audio Eng. Soc., Vol. 47, No. 9, Sep. 1999, pp. 720-725
A method is disclosed in which an acoustic signal is frequency-analyzed and the acoustic signal is identified using a difference in frequency characteristics between music and speech. This method has a disadvantage that the detection accuracy of features is low because the main frequency components of music and speech cannot generally be separated.
[0004]
Reference 2: Proc. ICASSP 96, pp. 993-996, author John Saunders
A method is disclosed in which a feature vector of five features in total is formed from four statistics based on the zero-cross distribution of the acoustic signal and one physical quantity extracted from the energy, and speech and music are classified using a multivariate discriminant function. This method has the disadvantage that the statistics extracted from zero crossings account for 80% of the features, so the identification result depends on the zero-cross distribution and the classification accuracy drops for some types of sound.
[0005]
Reference 3: Proc. ICASSP 97, pp. 1331-1334, authors Eric Scheirer, Malcolm Slaney
A method is disclosed in which speech and music are classified using a multivariate discriminant function based on a vector of 13 features obtained by analyzing the acoustic signal within a fixed window, such as the number of zero crossings, the frequency centroid, and the proportion of time spent at low power levels. Because this method uses as many as 13 acoustic features, it takes a long time to obtain a classification result and is not suited to high-speed classification.
[0006]
[Problems to be solved by the invention]
As described above, the methods described in the documents 1 to 3 cannot satisfy both the processing speed and the classification accuracy.
[0007]
Therefore, an object of the present invention is to provide an acoustic classification device that can perform classification at high speed and with high classification accuracy.
[0008]
[Means for Solving the Problems]
In order to achieve this object, the invention of claim 1 is an acoustic classification device for classifying acoustic signals, comprising: feature extraction means for extracting, from an acoustic signal, statistical properties of the sound for each fixed time length as acoustic features; learning means for acquiring, by the feature extraction means, the acoustic features of a learning acoustic signal whose classification is known, and training a neural network so that the acquired acoustic features are the input and the corresponding classification is the output; and identification means for acquiring, by the feature extraction means, the acoustic features of an acoustic signal to be classified, inputting the acquired acoustic features to the neural network, and classifying the acoustic signal to be classified by receiving the output of the neural network, wherein the statistical properties of the sound are the zero-cross distribution of the acoustic signal, the level distribution, and the distribution of the frequency giving the maximum power, and the feature extraction means performs, in advance, a principal component analysis of the statistical distribution shapes indicating acoustic features from a plurality of types of pre-processing sound source signals relating to speech signals and music signals, and uses the eigenvectors obtained as a result of the analysis to acquire the acoustic features of the learning acoustic signal and the features of the acoustic signal to be classified.
[0009]
The invention of claim 2 is the acoustic classification device according to claim 1, wherein the level distribution is given as a frequency-of-occurrence distribution of relative levels with respect to the time-average level within a frame, and the distribution of the frequency giving the maximum power is obtained by shifting through one frame by a predetermined L points at a time, determining how often each frequency has the maximum power within the respective time windows, and adopting this as a physical statistic.
[0011]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0012]
FIG. 1 shows the functional configuration of an acoustic classification device to which the present invention is applied. A general-purpose computer capable of executing programs, such as a personal computer, can be used as the acoustic analysis device. The components described below are realized by a CPU or the like executing the programs described later.
[0013]
In FIG. 1, reference numeral 100 denotes a preprocessing unit that extracts statistical properties of the sound for each fixed time length from a plurality of types of sound source signals, namely speech signals and music signals, performs a principal component analysis of those properties, and acquires the eigenvectors obtained as the result of the analysis.
[0014]
Reference numeral 200 denotes a learning unit that acquires the acoustic features of a learning acoustic signal whose classification is known, uses the eigenvectors obtained by the preprocessing unit 100 to obtain the principal components of those features, and trains the neural network so that the principal components of the acquired acoustic features are the input and the corresponding classification is the output.
[0015]
Reference numeral 300 denotes an identification unit that acquires the acoustic features of the acoustic signal to be classified, uses the eigenvectors obtained by the preprocessing unit 100 to obtain the principal components of those features, inputs the principal components to the neural network, and classifies the acoustic signal by receiving the output of the neural network.
[0016]
FIG. 2 shows the processing contents of a program for realizing the function of the preprocessing unit 100. Hereinafter, each process of FIG. 2 will be described.
[0017]
(1-1) Time division processing (step S10)
A plurality of types of pre-processing sound source signals, such as speech signals and music signals, are input from a microphone, converted from analog to digital in the general-purpose computer, and temporarily stored in internal memory, after which the CPU performs the following processing. That is, to obtain statistical feature quantities along the time axis of an acoustic signal collected over a predetermined time, the acoustic signal is time-divided into frames, each consisting of N adjacent blocks of M samples, as shown in FIG. 5A. As described later, for frequency analysis a K-point segment of the acoustic signal on the time axis is shifted by L points at a time within the frame, as shown in FIG. 5B.
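The time division described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the use of NumPy, and the choice to drop trailing samples that do not fill a whole frame are all assumptions. M, N, K and L correspond to the symbols in the text.

```python
import numpy as np

def split_into_frames(signal, m, n):
    """Split a 1-D signal into frames of N adjacent blocks of M samples.

    Returns an array of shape (num_frames, n, m). Trailing samples that do
    not fill a whole frame are dropped (an assumption of this sketch).
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = m * n
    num_frames = len(signal) // frame_len
    return signal[: num_frames * frame_len].reshape(num_frames, n, m)

def sliding_windows(frame, k, l):
    """Return the K-point windows obtained by shifting L points at a time."""
    flat = np.asarray(frame, dtype=float).ravel()
    starts = list(range(0, len(flat) - k + 1, l))
    if not starts:  # frame shorter than one window
        return np.empty((0, k))
    return np.stack([flat[s : s + k] for s in starts])
```

For example, a 25-sample signal with M = 4 and N = 3 yields two frames of shape (3, 4), and each 12-sample frame yields three 8-point windows when L = 2.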
[0018]
(1-2) Extraction of acoustic features (Step S20)
In the present embodiment, statistical feature amounts of zero cross distribution, level distribution, and frequency distribution are used as feature amounts reflecting physical differences between voice and music. The characteristics of these distributions will be described below.
[0019]
(A) Zero-cross distribution
This physical quantity is the distribution, along the time axis, of the number of times the signal crosses the zero level within a fixed time (one block). The zero-cross distribution is highly correlated with the frequency of the high-power components of the acoustic signal. For example, a signal in which high-frequency components dominate tends to have more zero crossings.
[0020]
In this embodiment, the number of zero crossings is obtained for each block of M samples across the N adjacent blocks, and the counts for one frame are gathered into a frequency-of-occurrence distribution (histogram) averaged over the N blocks. The shape of this distribution pattern of zero-cross counts is adopted as the statistic.
[0021]
In a speech signal, vowels and consonant-plus-vowel sounds (few zero crossings) are mixed with fricatives and plosives (many zero crossings), so the average distribution for one frame shows a shape separated into two poles, one at high and one at low zero-cross counts. Music, on the other hand, shows a single-peaked shape.
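One way to compute such a zero-cross histogram for a single frame is sketched below. The function names, the number of bins, and the normalisation by the number of blocks are assumptions made for illustration; the patent does not specify them.

```python
import numpy as np

def zero_cross_counts(frame):
    """Count sign changes in each block of a frame shaped (n_blocks, m)."""
    signs = np.signbit(np.asarray(frame, dtype=float))
    return np.count_nonzero(signs[:, 1:] != signs[:, :-1], axis=1)

def zero_cross_histogram(frame, num_bins=8):
    """Histogram of per-block zero-cross counts, averaged over the blocks."""
    n_blocks, m = np.asarray(frame).shape
    counts = zero_cross_counts(frame)
    # A block of m samples can contain at most m - 1 sign changes.
    hist, _ = np.histogram(counts, bins=num_bins, range=(0, m))
    return hist / n_blocks
```

A frame that mixes a block with no sign changes and a block that alternates in sign on every sample produces mass at both ends of the histogram, which is the two-pole shape the text attributes to speech.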
[0022]
(B) Level distribution
This physical quantity is the frequency-of-occurrence distribution, within a frame, of the relative level of each block, where the relative level is the time-average level of one block of M samples relative to the time-average level of the acoustic signal over one frame (0 dB). In a speech signal there are often silent intervals between utterances, so the variance of this distribution tends to be larger than for music.
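A rough sketch of the level distribution follows. It assumes that "level" means mean-square power expressed in decibels, and the bin edges, the silence floor `eps`, and the function name are illustrative choices that do not come from the patent.

```python
import numpy as np

def level_histogram(frame, edges_db=(-60, -40, -20, -10, 0, 10, 20), eps=1e-12):
    """Histogram of block levels relative to the frame average (0 dB).

    `frame` has shape (n_blocks, m). Levels are mean-square power in dB;
    values outside the bin range are clipped to the outermost bins.
    """
    frame = np.asarray(frame, dtype=float)
    block_power = np.mean(frame ** 2, axis=1)
    frame_power = np.mean(frame ** 2)
    rel_db = 10.0 * np.log10((block_power + eps) / (frame_power + eps))
    rel_db = np.clip(rel_db, edges_db[0], edges_db[-1])
    hist, _ = np.histogram(rel_db, bins=edges_db)
    return hist / len(rel_db)
```

With two loud blocks and two silent blocks in a frame, the histogram splits between the lowest bin and a bin just above 0 dB, illustrating the wide spread the text describes for speech with pauses.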
[0023]
(C) Distribution of the frequency giving the maximum power (hereinafter, frequency distribution)
The amplitude-frequency characteristic obtained by applying an FFT (fast Fourier transform) to the windowed K-point time waveform is rearranged onto a logarithmic frequency axis so as to correspond to hearing, and the frequency giving the maximum power is taken as the physical quantity representing that time. The window is shifted by L points at a time within one frame, the frequency of occurrence of the maximum-power frequency in each time window is obtained, and this is adopted as the physical statistic. For a speech signal, as with the zero-cross distribution, the distribution shows a shape separated into two poles at high and low frequencies. Because this statistic distinguishes, by the shape of the distribution pattern, sounds with a narrow range such as a single instrument from sounds with a wide range such as an orchestra, music signals can be classified in finer detail.
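The frequency distribution could be computed along the following lines. This is a simplified sketch under stated assumptions: a Hann window, the peak located on the linear FFT bins and then counted in logarithmically spaced bins, the DC bin ignored, and a hypothetical lower frequency limit `f_min`. None of these details are given in the patent.

```python
import numpy as np

def peak_frequency_histogram(frame, k, l, sample_rate, num_bins=6, f_min=50.0):
    """Histogram, on a log-frequency axis, of the peak-power frequency.

    A K-point Hann window is shifted L points at a time through the frame;
    for each window the frequency with maximum power is recorded.
    """
    flat = np.asarray(frame, dtype=float).ravel()
    window = np.hanning(k)
    freqs = np.fft.rfftfreq(k, d=1.0 / sample_rate)
    peaks = []
    for start in range(0, len(flat) - k + 1, l):
        power = np.abs(np.fft.rfft(flat[start : start + k] * window)) ** 2
        power[0] = 0.0  # ignore the DC component (assumption)
        peaks.append(freqs[np.argmax(power)])
    edges = np.logspace(np.log10(f_min), np.log10(sample_rate / 2), num_bins + 1)
    clipped = np.clip(peaks, edges[0], edges[-1])
    hist, _ = np.histogram(clipped, bins=edges)
    return hist / max(len(peaks), 1)
```

A pure tone concentrates all windows in one bin, whereas material with a wide pitch range would spread across several bins.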
[0024]
(1-3) Statistics calculation and averaging process (steps S30 and S40)
The acoustic feature quantities described above are extracted frame by frame from the acoustic signal (the learning acoustic signal and the acoustic signal to be classified). The acoustic statistical feature quantities are therefore calculated from the feature quantities extracted from a plurality of frames of the acoustic signal, and the statistics are averaged for each acoustic signal.
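As an illustration of this averaging step, the sketch below concatenates the three per-frame distributions into one vector and averages over the frames of a signal. Concatenation as the way to form the multidimensional vector, and the function name, are assumptions; the patent only states that the per-frame statistics are averaged per signal.

```python
import numpy as np

def signal_statistics(frame_histograms):
    """Average the per-frame statistic vectors of one acoustic signal.

    `frame_histograms` is a sequence with one entry per frame, each entry
    being a tuple (zero_cross_hist, level_hist, freq_hist).
    """
    vectors = [np.concatenate(parts) for parts in frame_histograms]
    return np.mean(vectors, axis=0)
```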
[0025]
(1-4) Principal component analysis processing and eigenvector storage (steps S50 and S60)
In this embodiment, the statistics of one frame form a multidimensional vector. To reduce the number of dimensions of the acoustic feature quantities, principal component analysis is performed; more specifically, principal components are extracted while the features of the statistical distribution shape are preserved (for example, with a factor loading of 90% or more). In addition, the eigenvectors obtained as a result of the principal component analysis are stored in a storage means inside the device so that they can be used in subsequent processing as the transformation matrix for obtaining the principal components of the statistics.
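A minimal sketch of this step is given below. It interprets the 90% criterion as a cumulative explained-variance ratio, which is an assumption; the patent speaks of factor loading without giving a formula. The function name and the use of `numpy.linalg.eigh` on the covariance matrix are likewise illustrative.

```python
import numpy as np

def fit_pca(stat_vectors, min_explained=0.90):
    """Return (mean, eigenvectors) keeping enough components to explain
    at least `min_explained` of the total variance.

    `stat_vectors` has one row per pre-processing sound source signal.
    """
    x = np.asarray(stat_vectors, dtype=float)
    mean = x.mean(axis=0)
    cov = np.cov(x - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    num_components = int(np.searchsorted(ratio, min_explained)) + 1
    return mean, eigvecs[:, :num_components]
```

The returned mean and eigenvector matrix are what would be stored for later use, for example with `numpy.savez`.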
[0026]
FIG. 3 shows processing contents for realizing the function of the learning unit 200 of FIG. 1 by a program.
[0027]
A learning acoustic signal whose content is known, input from a microphone or the like, is converted from analog to digital and then processed by the general-purpose computer.
[0028]
(2-1) Time division process to statistic calculation process (steps S100 to S120)
The time division process, the acoustic feature quantity extraction process, and the statistic calculation process have the same processing contents as the pre-process described above, and a common program (subroutine program) can be used for these processes.
[0029]
(2-2) Principal component extraction process (step S130)
The principal components are obtained from the eigenvectors using multivariate analysis, which is a known technique. Multivariate analysis is described in detail in, for example, Sadao Ishimura, "Sugu Wakaru Tahenryo Kaiseki" (Easy-to-Understand Multivariate Analysis), Chapter 4 (Tokyo Tosho).
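In the usual formulation of principal component analysis, obtaining the principal components from stored eigenvectors is a centred projection. The sketch below assumes the eigenvectors are stored as columns of a matrix together with the mean of the pre-processing data; the function name and storage layout are hypothetical.

```python
import numpy as np

def project(stat_vector, mean, eigvecs):
    """Project a statistic vector onto the stored eigenvectors.

    `eigvecs` holds one eigenvector per column, so the result has one
    principal-component score per retained eigenvector.
    """
    return (np.asarray(stat_vector, dtype=float) - mean) @ eigvecs
```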
[0030]
(2-3) Learning of the neural network
The neural network is trained on the principal components of the features of the learning acoustic signal obtained in the above processing steps, together with the known classification data indicating the type of the learning acoustic signal.
[0031]
The neural network and its learning method are well known, but since they relate to the invention, they will be briefly described.
[0032]
A typical configuration of a neural network is shown in FIG. 6. In the figure, 1 denotes the input layer, 2 the intermediate layer, and 3 the output layer. Neurons can be used in each layer. In each neuron, the relationship between the input signal and the output signal is expressed by a predetermined function (this embodiment uses the sigmoid function, which is mathematically convenient to handle). Therefore, when an input signal having a certain value is applied to the input layer 1, an output signal having a corresponding value is output from the output layer 3. Accordingly, the principal components of the features obtained from the learning acoustic signal are applied to the input layer 1, and the coefficients of the correlation function (also called the transfer function) of each neuron are learned so that data indicating the classification is output from the output layer 3. Many learning methods have been proposed. In a typical example, certain initial coefficients are given to the correlation function of each neuron, an input signal is applied to the input layer 1 of the neural network, and the output signal from the output layer 3 is calculated. If the value of the output signal is far from the value of the target classification data, the initial values are changed little by little, and these small changes are repeated by trial and error until the output signal reaches the value of the target classification data. When the value of the output signal falls within the allowable range, the coefficients of the correlation function in use at that time become the learning result. This learning process can be realized by executing a computer program. When a plurality of sets of input signals are given, coefficients of the correlation function are found such that, for each input signal, the output signal takes the value of the corresponding classification data.
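The trial-and-error learning described above can be illustrated with a small three-layer sigmoid network whose coefficients are perturbed at random, keeping a change only when the error shrinks. This is a sketch under assumptions: the patent does not name a specific algorithm, and the mean-squared error, the Gaussian perturbations, the bias terms, and all names and default values here are illustrative. Gradient-based backpropagation is the more common choice in practice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(weights, inputs):
    """Input layer -> intermediate layer -> output layer, with bias terms."""
    w_hidden, w_out = weights
    ones = np.ones((inputs.shape[0], 1))
    hidden = sigmoid(np.hstack([inputs, ones]) @ w_hidden)
    return sigmoid(np.hstack([hidden, ones]) @ w_out)

def train_by_perturbation(inputs, targets, num_hidden=4, step=0.1,
                          tolerance=0.01, max_iters=5000, seed=0):
    """Perturb the coefficients slightly; keep a change only if it helps."""
    rng = np.random.default_rng(seed)
    inputs = np.asarray(inputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    weights = [rng.normal(0.0, 0.5, (inputs.shape[1] + 1, num_hidden)),
               rng.normal(0.0, 0.5, (num_hidden + 1, targets.shape[1]))]
    error = np.mean((forward(weights, inputs) - targets) ** 2)
    history = [error]
    for _ in range(max_iters):
        if error <= tolerance:  # output within the allowable range
            break
        trial = [w + rng.normal(0.0, step, w.shape) for w in weights]
        trial_error = np.mean((forward(trial, inputs) - targets) ** 2)
        if trial_error < error:
            weights, error = trial, trial_error
        history.append(error)
    return weights, history
```

Because a perturbation is accepted only when it lowers the error, the recorded error never increases, although convergence to the tolerance within `max_iters` is not guaranteed.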
[0033]
FIG. 4 shows processing contents for realizing the function of the identification unit 300 of FIG. 1 by a program.
[0034]
(3-1) Time division processing to principal component extraction processing (steps S200 to S230)
The processing from time division through the statistical processing is the same as in the learning process. The only difference is that the acoustic signal input to the acoustic classification device is of unknown type.
[0035]
(3-2) Neural network calculation process
Since the correlation function (more precisely, its coefficients) of each neuron of the neural network has been determined by the learning process described above, the value of the output signal is calculated using these correlation functions. As is well known, once a neural network has learned the relationship between input signals and output signals, giving it an input signal causes it to output the output signal corresponding to the learning input signal that is similar to that input. The value of the classification data obtained by the calculation is therefore the identification result. The classification result may be shown on the display of the acoustic classification device (general-purpose computer) or printed out.
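The calculation at identification time amounts to a forward pass through the trained network. In the sketch below, choosing the class whose output neuron has the largest value is an assumed decision rule, and the weight layout (a bias row appended to each weight matrix), the label names, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classify(principal_components, w_hidden, w_out, labels):
    """Run one feature vector through the network and pick a label.

    The last row of each weight matrix multiplies a constant bias of 1.
    Returns the chosen label and the raw output values.
    """
    x = np.append(np.asarray(principal_components, dtype=float), 1.0)
    hidden = np.append(sigmoid(x @ w_hidden), 1.0)
    outputs = sigmoid(hidden @ w_out)
    return labels[int(np.argmax(outputs))], outputs
```

With hand-set coefficients for a one-input, one-hidden-neuron, two-output network, a positive input maps to the second label and a negative input to the first.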
[0036]
The invention is not limited to the embodiment described above, and various modifications are possible depending on the application. For example, the number of neurons constituting the neural network may be determined as appropriate for the application. Possible applications of acoustic classification include high-speed search of acoustic databases, automatic indexing of acoustic signals and image search using this index as side information, preprocessing for speech recognition, and automatic monitoring of broadcast audio.
[0037]
[Effects of the Invention]
As described above, according to the present invention, using a neural network to classify acoustic signals speeds up the classification process. In addition, because statistical properties such as the zero-cross distribution, level distribution, and frequency distribution of the sound are used as acoustic features, it also becomes possible to distinguish music played on a single instrument from music such as an orchestra, and identification performance comparable to human judgment is obtained. Furthermore, many types of acoustic signals can be identified, which contributes to automatic monitoring of broadcast audio, automatic indexing and faster searching of sound databases, and more efficient multimedia search.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a functional configuration of an embodiment of the present invention.
FIG. 2 is a flowchart showing the contents of preprocessing.
FIG. 3 is a flowchart showing the contents of a learning process.
FIG. 4 is a flowchart showing the contents of identification processing.
FIGS. 5A and 5B are explanatory diagrams for explaining signal division processing on a time axis.
FIG. 6 is an explanatory diagram for explaining a neural network.
[Explanation of symbols]
100 Preprocessing unit
200 Learning unit
300 Identification unit

Claims

1. An acoustic classification device for classifying acoustic signals, comprising:
feature extraction means for extracting, from an acoustic signal, statistical properties of the sound for each fixed time length as acoustic features;
learning means for acquiring, by the feature extraction means, the acoustic features of a learning acoustic signal whose classification is known, and training a neural network so that the acquired acoustic features are the input and the corresponding classification is the output; and
identification means for acquiring, by the feature extraction means, the acoustic features of an acoustic signal to be classified, inputting the acquired acoustic features to the neural network, and classifying the acoustic signal to be classified by receiving the output of the neural network,
wherein the statistical properties of the sound are the zero-cross distribution of the acoustic signal, the level distribution, and the distribution of the frequency giving the maximum power, and
the feature extraction means performs, in advance, a principal component analysis of the statistical distribution shapes indicating acoustic features from a plurality of types of pre-processing sound source signals relating to speech signals and music signals, and uses the eigenvectors obtained as a result of the analysis to acquire the acoustic features of the learning acoustic signal and the features of the acoustic signal to be classified.

2. The acoustic classification device according to claim 1, wherein the level distribution is given as a frequency-of-occurrence distribution of relative levels with respect to the time-average level within a frame, and the distribution of the frequency giving the maximum power is obtained by shifting through one frame by a predetermined L points at a time, determining how often each frequency has the maximum power within the respective time windows, and adopting this as a physical statistic.