JP4985134B2 - Scene classification device

Info

Publication number: JP4985134B2
Application number: JP2007158862A
Authority: JP
Inventors: 千加志杉浦
Original assignee: Fujitsu Toshiba Mobile Communication Ltd
Current assignee: Fujitsu Mobile Communications Ltd
Priority date: 2007-06-15
Filing date: 2007-06-15
Publication date: 2012-07-25
Anticipated expiration: 2027-06-15
Also published as: JP2008310138A

Abstract

PROBLEM TO BE SOLVED: To provide a scene classifier capable of improving the precision of section detection.

SOLUTION: The scene classifier includes a time peak detecting part 5 that detects time peaks in a spectrogram and a frequency-direction feature amount extracting part 6 that extracts the feature amount exhibited by those time peaks, and it uses this time-peak feature amount as an additional index for section detection. The scene classifier further includes a mutual feature amount extracting part 7, and classifies scenes using, as a further index, a new feature amount defined by the interaction between frequency peaks and time peaks.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a technique for classifying the scenes of multimedia content, mainly content that includes music scenes, on the basis of acoustic signals.

When watching recorded broadcast programs, viewers want to be able to see only the scenes they care about in a short time. For example, the scenes in which an artist is singing are detected as the highlights of a music program, and only those scenes are played back. Techniques for classifying the scenes of multimedia content have been provided to realize such functions.

Patent Document 1 discloses a technique that takes the ratio of [Lch power − Rch power] to [Lch power + Rch power] of a two-channel acoustic signal as a feature amount representing the stereo impression, and detects scenes in which this feature amount is large as music sections. This technique keeps the amount of computation needed to detect music sections relatively small.

The technique of Non-Patent Document 1 finds the continuity of spectral peaks by connecting spectral peaks in adjacent analysis frames when their normalized distance in the two-dimensional space of frequency value and logarithmic power value is small. Such a run of continuous spectral peaks is defined as a Sinusoidal Segment, and the feature amounts associated with it are classified using a dictionary that has been statistically trained in advance, thereby detecting speech sections and music sections.

Patent Document 1: JP 2006-301134 A
Non-Patent Document 1: Toru Taniguchi et al. (Waseda University), "Sound category detection in acoustic signals with mixed voice, musical instrument sound and singing voice using temporal characteristics of Sinusoidal Segment", Acoustical Society of Japan, September 2005

The technique of Patent Document 1 has the problem that, when an acoustic scene recorded on multiple channels has a stereo component, scenes that do not belong to a music section, such as laughter or applause, are also detected as music sections. In addition, for vocal-centered songs (a cappella, rap, and the like), the stereo component of the music section is small, so the accuracy of section detection drops.

In Non-Patent Document 1, because a Sinusoidal Segment is by nature a sequence of spectral peaks that is continuous in the time direction, the accuracy of section detection is likely to drop for music in which spectral peaks do not persist markedly in the time direction, such as rap or songs with a fast tempo. This leads to music detection errors.
The present invention has been made in view of the above circumstances, and its object is to provide a scene classification device with improved section detection accuracy.

To achieve the above object, one aspect of the present invention provides a scene classification device comprising: a spectrum calculation unit that divides multimedia content including an acoustic signal into a plurality of temporally consecutive sections and calculates the spectrum of the acoustic signal for each section; a frequency peak detection unit that detects frequency peaks, which are local maxima of the spectrum in the frequency direction; a time-direction feature amount extraction unit that extracts a first feature amount indicating the characteristics of the frequency peaks; a time peak detection unit that detects time peaks, which are local maxima in the time direction of a spectrogram formed by arranging the spectra consecutively in time; a frequency-direction feature amount extraction unit that extracts a second feature amount indicating the characteristics of the time peaks; and an acoustic classification unit that classifies the plurality of sections into first music sections and second music sections on the basis of the acoustic characteristics of each section indicated by the first feature amount and the second feature amount.

With this configuration, sections can be classified using not only frequency peaks but also a feature amount that appears in the spectrogram, namely time peaks. Time peaks make it possible to evaluate quantitatively acoustic characteristics that frequency peaks alone cannot capture. The accuracy of section detection can therefore be increased further.

According to the present invention, a scene classification device with improved section detection accuracy can be provided.

Embodiments of the present invention are described below with reference to the drawings. The description covers a device that classifies each scene in multimedia content on the basis of its acoustic signal and detects music scenes.
Multimedia content falls broadly into content that contains both a video signal and an acoustic signal and content that consists of an acoustic signal only. Examples of the former include television broadcast streams, recordings of them, and footage recorded with a video recording device such as a home video camera. Examples of the latter include radio broadcast streams, recordings of them, and audio recorded with a recording device such as an IC recorder.

FIG. 1 is a functional block diagram showing an embodiment of the scene classification device according to the present invention. In FIG. 1, multimedia content is input to a content input unit 1, which extracts the acoustic signal.
The content input unit 1 is the interface through which multimedia content that includes at least an acoustic signal enters the device. For example, if the medium is a DVD (Digital Versatile Disk), the unit has a DVD reader; if the content is recorded on an HD or similar storage, the unit has a data transfer bus. In short, the content input unit 1 is configured to suit the form of the incoming multimedia content.

The content input unit 1 extracts the acoustic signal from content supplied in various forms. In particular, an analog acoustic signal is converted to digital form and output as digital data. Keeping the sampling frequency constant regardless of the content form is convenient, because later stages then need not be aware of the format of the acoustic signal. The acoustic signal extracted by the content input unit 1 is passed to the spectrum calculation unit 2.

The spectrum calculation unit 2 divides the acoustic signal into frames of an arbitrary time length and calculates a spectrum for each frame (section or segment), using a method such as FFT (fast Fourier transform) or LPC analysis. This makes it possible to analyze the frequency content of the acoustic signal. Discarding the high-frequency information above roughly 4 kHz, to which hearing is not very sensitive, reduces the processing load of later stages. Alternatively, to take auditory characteristics into account, the linear spectrum may be converted to a mel-scale spectrum. For detecting music sections in particular, conversion to a cent (logarithmic) frequency scale, which corresponds to the musical scale, is advisable. The spectrum calculated by the spectrum calculation unit 2 is passed to the frequency peak detection unit 3 and the time peak detection unit 5.
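The framing and per-frame spectrum calculation can be sketched as follows. This is a minimal illustration, not the device's implementation: a naive DFT stands in for the FFT to keep the example dependency-free, and the function name `frame_spectra`, the Hann window, and the parameters `frame_len` and `hop` are assumptions that do not appear in the description.

```python
import cmath
import math

def frame_spectra(signal, frame_len, hop):
    """Split `signal` into frames and return one log-power spectrum (dB) per frame."""
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Hann window to reduce spectral leakage (assumed; the source names no window).
        windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1)))
                    for n, x in enumerate(frame)]
        spectrum = []
        for k in range(frame_len // 2 + 1):  # non-negative frequency bins only
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(windowed))
            spectrum.append(10 * math.log10(abs(acc) ** 2 + 1e-12))
        spectra.append(spectrum)
    return spectra

# Usage: an 8 Hz tone sampled at 64 Hz, analysed in 32-sample frames.
sample_rate = 64
tone = [math.sin(2 * math.pi * 8 * n / sample_rate) for n in range(128)]
spectrogram = frame_spectra(tone, frame_len=32, hop=16)
```

The returned list is indexed as `spectrogram[frame][bin]`, which is the layout assumed when the spectra are later arranged in time as a spectrogram.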

The frequency peak detection unit 3 detects, for each section (segment), the peaks of the calculated spectrum in the frequency direction. That is, it detects the peaks in the frequency direction of the acoustic signal at a given time as frequency peaks. A frequency peak is a local maximum in the frequency direction of the spectrum plotted as acoustic power versus frequency.

As preprocessing, the fine structure of the spectrum may be removed with a smoothing filter such as a moving average filter or a median filter. Removing the fine structure allows global frequency peaks, rather than local ones, to be detected, so that feature amounts can be extracted in a way that better reflects human auditory characteristics.
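A minimal sketch of frequency peak detection with moving-average preprocessing is given below. The helper names and the default window width of 3 are illustrative assumptions; the description specifies neither a filter width nor a tie-breaking rule for flat maxima.

```python
def moving_average(values, width=3):
    """Smooth `values` with a centred moving average (window shrinks at the edges)."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def local_maxima(values):
    """Indices whose value is strictly greater than both neighbours."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]

def frequency_peaks(spectrum, smooth_width=3):
    """Frequency-bin indices of the peaks of one frame's smoothed spectrum."""
    return local_maxima(moving_average(spectrum, smooth_width))

# Usage: two broad peaks, at bins 2 and 7.
peaks = frequency_peaks([0, 1, 5, 1, 0, 0, 2, 6, 2, 0])
```

A wider `smooth_width` suppresses more of the fine structure, so fewer and broader peaks survive.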

The detection data from the frequency peak detection unit 3 is passed to the time-direction feature amount extraction unit 4. The time-direction feature amount extraction unit 4 links frequency peaks that are continuous mainly in the time direction. It then extracts, as feature amounts, statistics such as the mean and variance of the number, length, continuity, direction, peak value statistics, and frequency value statistics of these time-continuous peak sequences. These feature amounts are called time-continuous feature amounts.

That is, the time-direction feature amount extraction unit 4 first counts the number of frequency peaks at each sampling time and extracts statistics such as their mean and variance. Next, where frequency peaks are continuous mainly in the time direction, it links them and extracts, as time-continuous feature amounts, statistics such as the mean and variance of the number of time-continuous peak sequences and of their time length, continuity, direction, peak value statistics, and frequency value statistics. The time-continuous feature amounts are passed to the acoustic classification unit 8 and used to detect speech sections, music sections, and the like in the multimedia content. The detected scenes are output from the output unit 9.

The number of time-continuous peak sequences is the number of such sequences in a section (segment). The time length is the length of a time-continuous peak sequence in the time direction. Continuity indicates the degree to which a time-continuous peak sequence has no gaps. Direction is a quantity indicating how far the sequence is tilted with respect to the time direction. The peak value statistics are the mean, variance, and so on of the power of the frequency peaks in the sequence, and the frequency value statistics are the mean, variance, and so on of the frequency values of those peaks.
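The linking of frequency peaks into time-continuous peak sequences, and two of the statistics defined above (number and time length), can be sketched as follows. The linking rule used here, joining a peak to the nearest peak in the next frame within `max_jump` bins, is a simplifying assumption chosen for illustration; it is not the criterion stated in the description, and all names are hypothetical.

```python
def link_time_continuous(peaks_per_frame, max_jump=1):
    """Link frequency peaks across adjacent frames.

    peaks_per_frame[t] lists the peak frequency bins found in frame t.
    Returns sequences as lists of (t, bin) points.
    """
    finished, active = [], []
    for t, bins in enumerate(peaks_per_frame):
        unused = set(bins)
        next_active = []
        for seq in active:
            last_bin = seq[-1][1]
            candidates = [b for b in unused if abs(b - last_bin) <= max_jump]
            if candidates:
                best = min(candidates, key=lambda b: (abs(b - last_bin), b))
                unused.discard(best)
                seq.append((t, best))
                next_active.append(seq)
            else:
                finished.append(seq)  # no continuation: the sequence ends here
        next_active.extend([[(t, b)] for b in sorted(unused)])
        active = next_active
    return finished + active

def sequence_stats(sequences):
    """Number of sequences plus mean and variance of their lengths in frames."""
    lengths = [len(s) for s in sequences]
    if not lengths:
        return {"count": 0, "mean_length": 0.0, "var_length": 0.0}
    mean = sum(lengths) / len(lengths)
    var = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return {"count": len(lengths), "mean_length": mean, "var_length": var}

# Usage: one long, slowly drifting sequence and two short ones.
sequences = link_time_continuous([[2, 10], [2, 11], [3], [3, 20]])
stats = sequence_stats(sequences)
```

Continuity, direction, and the peak value and frequency value statistics would be computed per sequence in the same manner; they are omitted to keep the sketch short.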

Extracting feature amounts related to the time continuity of frequency peaks is considered effective for detecting acoustic patterns in which an auditory stimulus persists in the time direction. For example, sounds from a guitar or synthesizer have a pronounced frequency peak structure, so continuity is high and the mean peak value is large. In a slow-tempo song such as a ballad the time length becomes long, and with vibrato (a wavering of pitch up and down) the frequency value variance becomes large. In speech where a vowel is drawn out, as in "ah" or "wow", the direction is tilted and the frequency value variance becomes large. Conversely, for acoustic signals that resemble stationary noise with little spectral contrast, such as applause or background sound, the number of frequency peaks is small. The time-continuous feature amounts are thus powerful for capturing the characteristics of acoustic signals such as music and speech, but they are not suited to capturing signals such as percussion accents or plosives. Moreover, because they presuppose that frequency peaks are continuous in the time direction, extracting the characteristics can be very difficult for some music genres.

The scene classification device of FIG. 1 also includes a time peak detection unit 5, a frequency-direction feature amount extraction unit 6, and a mutual feature amount extraction unit 7. The time peak detection unit 5 arranges the spectra calculated by the spectrum calculation unit 2 consecutively in time to generate a spectrogram, and detects the peaks in the time direction at each frequency of this spectrogram as time peaks. A time peak is a local maximum in the time direction of the spectrogram plotted as acoustic power versus time.
As preprocessing, the fine structure of the time-direction power sequence may be removed with a smoothing filter such as a moving average filter or a median filter. Removing it allows global time peaks, rather than local ones, to be detected, so that feature amounts can be extracted in a way that better reflects human auditory characteristics. The detection data from the time peak detection unit 5 is passed to the frequency-direction feature amount extraction unit 6.
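Time peak detection is the same local-maximum search applied along the other axis of the spectrogram. The sketch below assumes the spectrogram is stored as `spectrogram[t][f]` (frame index, then frequency bin) and omits the optional smoothing step; the function name is hypothetical.

```python
def time_peaks(spectrogram):
    """Return the set of (t, f) points that are local maxima along time.

    spectrogram[t][f] is the (log) power of frequency bin f in frame t.
    """
    n_frames = len(spectrogram)
    n_bins = len(spectrogram[0]) if n_frames else 0
    peaks = set()
    for f in range(n_bins):
        # Power of a single frequency bin as a function of time.
        track = [spectrogram[t][f] for t in range(n_frames)]
        for t in range(1, n_frames - 1):
            if track[t - 1] < track[t] > track[t + 1]:
                peaks.add((t, f))
    return peaks

# Usage: a burst in frame 1 that spans bins 0 and 1.
burst = [[0, 0, 0], [5, 4, 0], [0, 0, 0], [0, 0, 0]]
found = time_peaks(burst)
```

A broadband onset such as a drum hit shows up as time peaks at the same frame index in many neighbouring bins.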

The frequency-direction feature amount extraction unit 6 links time peaks of the spectrogram that are continuous mainly in the frequency direction. It then extracts, as feature amounts, statistics such as the mean and variance of the number, length, continuity, direction, peak value statistics, and time value statistics of these frequency-continuous peak sequences. These feature amounts are called frequency-continuous feature amounts.

That is, the frequency-direction feature amount extraction unit 6 first counts the number of time peaks at each frequency and extracts statistics such as their mean and variance. Next, where time peaks are continuous mainly in the frequency direction, it links them and extracts, as frequency-continuous feature amounts, statistics such as the mean and variance of the number of frequency-continuous peak sequences and of their band length, continuity, direction, peak value statistics, and time value statistics.

The number of frequency-continuous peak sequences is the number of such sequences in a section (segment). The band length is the length of a frequency-continuous peak sequence in the frequency direction, that is, the width of the band it spans. Continuity indicates the degree to which a frequency-continuous peak sequence has no gaps. Direction is a quantity indicating how far the sequence is tilted with respect to the frequency direction. The peak value statistics are the mean, variance, and so on of the power of the time peaks in the sequence, and the time value statistics are the mean, variance, and so on of the time values of those peaks.
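Linking time peaks into frequency-continuous peak sequences and measuring their band length can be sketched as below. As an illustrative assumption, a time peak is joined to a peak in the next frequency bin when their frame indices differ by at most `max_shift`; the description does not state the linking criterion in this passage, and the function names are hypothetical.

```python
def link_frequency_continuous(time_peak_set, n_bins, max_shift=1):
    """Group time peaks (t, f) into runs across adjacent frequency bins."""
    by_bin = {f: sorted(t for (t, ff) in time_peak_set if ff == f)
              for f in range(n_bins)}
    finished, active = [], []
    for f in range(n_bins):
        unused = list(by_bin[f])
        next_active = []
        for seq in active:
            last_t = seq[-1][0]
            near = [t for t in unused if abs(t - last_t) <= max_shift]
            if near:
                t = min(near, key=lambda u: (abs(u - last_t), u))
                unused.remove(t)
                seq.append((t, f))
                next_active.append(seq)
            else:
                finished.append(seq)  # the run does not reach this bin
        next_active.extend([[(t, f)] for t in unused])
        active = next_active
    return finished + active

def band_length(seq):
    """Number of frequency bins spanned by one frequency-continuous sequence."""
    return seq[-1][1] - seq[0][1] + 1

# Usage: a broadband onset near frame 5 plus an isolated peak at frame 20.
runs = link_frequency_continuous({(5, 0), (5, 1), (6, 2), (5, 3), (20, 1)}, n_bins=4)
widths = sorted(band_length(r) for r in runs)
```

A large band length indicates a power rise across a wide band at nearly the same instant, which is the behaviour the description associates with percussion.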

Extracting feature amounts related to the frequency continuity of time peaks is effective for capturing, in physical terms, a rise in power, that is, a steep increase in volume and changes in volume. For example, for punchy singing or playing such as rap, the number of frequency-continuous peak sequences increases, and when percussion instruments such as drums are played, the band length becomes long. Furthermore, for acoustic signals whose power rises momentarily across a wide frequency band, such as the utterance of the consonant /s/, the band length becomes extremely long and continuity becomes high. The frequency-continuous feature amounts extracted by the frequency-direction feature amount extraction unit 6 are thus effective for capturing characteristics of music and speech that the time-continuous feature amounts could not capture. The frequency-continuous feature amounts, together with the time-continuous feature amounts from the time-direction feature amount extraction unit 4, are passed to the mutual feature amount extraction unit 7 and the acoustic classification unit 8.

The mutual feature amount extraction unit 7 extracts, from the time-continuous feature amounts (continuous in the time direction) and the frequency-continuous feature amounts (continuous in the frequency direction), a mutual feature amount defined by the degree to which the two influence each other. The mutual feature amount is therefore a quantity that cannot exist unless both a time-continuous feature amount and a frequency-continuous feature amount are present. The mutual feature amount is passed to the acoustic classification unit 8 together with the time-continuous and frequency-continuous feature amounts.

The acoustic classification unit 8 classifies each frame using at least one of the time-continuous feature amounts, the frequency-continuous feature amounts, and the mutual feature amount. That is, on the basis of these feature amounts it classifies each section (segment) as a speech section, a music section, or something else.

A simple way to classify a frame is to apply to each feature amount X a weight W set for each classification pattern, compute the linear sum using equations (1) and (2), and, if the resulting value P exceeds the threshold for a classification pattern, treat the frame as belonging to that pattern.

For example, slow-tempo music has long time-continuous peak sequences, so the weights can be set by giving a large weight to "length of the time-continuous peak sequence" for the classification pattern "slow-tempo song".
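Equations (1) and (2) are not reproduced in this text, so the following sketch assumes the plain reading of the description: for each classification pattern, P is the sum of W times X over the feature amounts, and the frame belongs to the pattern when P exceeds that pattern's threshold. The feature names, pattern names, weights, and thresholds are all hypothetical values chosen for illustration.

```python
def classify_frame(features, weights, thresholds):
    """Return the classification patterns whose weighted sum exceeds its threshold.

    features:   {feature_name: value X}
    weights:    {pattern: {feature_name: weight W}}
    thresholds: {pattern: threshold for P}
    """
    matched = []
    for pattern, pattern_weights in weights.items():
        # Linear sum P of weight * feature value for this pattern.
        score = sum(w * features.get(name, 0.0)
                    for name, w in pattern_weights.items())
        if score > thresholds[pattern]:
            matched.append(pattern)
    return matched

# Usage: a frame with long time-continuous sequences and few broadband onsets.
features = {"time_seq_length": 8.0, "freq_seq_count": 1.0}
weights = {
    "slow_tempo_song": {"time_seq_length": 0.9, "freq_seq_count": 0.1},
    "rap": {"time_seq_length": 0.1, "freq_seq_count": 0.9},
}
thresholds = {"slow_tempo_song": 5.0, "rap": 5.0}
result = classify_frame(features, weights, thresholds)
```

Giving "time_seq_length" a large weight under "slow_tempo_song" mirrors the weighting example in the preceding paragraph.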

Alternatively, the weights may be optimized with a neural network or the like using training data prepared in advance, or frames may be classified with a statistical model such as a GMM (Gaussian mixture model), VQ (vector quantization), or an SVM (support vector machine). A statistical classification model makes effective use of feature amounts that do not contribute to classification on their own, or whose relation to each classification pattern is unclear, instead of wasting them.

The acoustic classification unit 8 may go beyond frame classification and also determine scenes. For example, a stretch in which sections (segments) classified as music sections appear frequently is indexed as a music scene.

This lets the user recognize the delimited parts of the multimedia content as meaningful scenes, which brings benefits such as being able to watch the content in a short time and edit it more easily. The classification results obtained in this way are passed to the output unit 9 and output in a form suited to the user's request.

The output unit 9 outputs the result from the acoustic classification unit 8 in an appropriate form. For example, the per-section (segment) classification results from the acoustic classification unit 8 may be output as they are to a video output device such as a display, output as text data, sent to a specific person as electronic data, or converted to a description language and then displayed or transmitted. If the output of the acoustic classification unit 8 is indexed information, the result may be output by any of these methods together with time information and indexed summary information. Furthermore, when a scene type such as a music scene has been specified, only the matching scenes may be played back as AV (Audio Visual) output. The user can then watch only the scenes of interest.

FIG. 2 illustrates the operation of the frequency peak detection unit 3 and the time peak detection unit 5. FIG. 2(a) is a spectrogram in which the spectra of multiple frames are arranged consecutively in time; it corresponds to one section (segment). This is for ease of explanation only, and the data structure is not necessarily limited to a two-dimensional one of frequency and time.

Slicing the graph of FIG. 2(a) at one frequency (horizontal dotted line) gives a graph of logarithmic power against time (FIG. 2(b)). The local maxima of this graph (marked ※ in the figure) are the time peaks, which are what the time peak detection unit 5 detects. Slicing the graph of FIG. 2(a) at one time (vertical dotted line) gives a graph of logarithmic power against frequency (FIG. 2(c)). The local maxima of this graph (marked × in the figure) are the frequency peaks, which are what the frequency peak detection unit 3 detects.

Existing techniques used only frequency peaks. This embodiment also uses time peaks and, in addition, uses the mutual feature amount, which can be newly defined by combining the two, to classify sections (segments). The mutual feature amount is described next.
FIG. 3 illustrates the processing in the mutual feature amount extraction unit 7. The mutual feature amount extraction unit 7 extracts, from the time-continuous feature amounts and the frequency-continuous feature amounts, the mutual feature amount through which they influence each other. As shown in FIG. 3, the mutual feature amount is defined where a time-continuous peak sequence (horizontal line) and a frequency-continuous peak sequence (vertical line) intersect, and consists of quantities such as the number of intersections (the total number of "○" and "□" marks in the figure) or the manner in which they intersect.

An intersection in which a frequency-continuous peak sequence lies at the end of a time-continuous peak sequence, as at the points marked "□" in the figure, indicates that an instrument with a steep rise in power and an instrument with a frequency peak structure sounded at the same moment. This means, for example, that a drum and a guitar very probably sounded together, and such a section (segment) is very likely to be a music section. Music scenes can be extracted by exploiting this.
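One way to count the intersections described above is sketched below. It assumes that both kinds of sequences are stored as lists of (t, f) points and that an intersection is a point shared by a time-continuous and a frequency-continuous sequence; this representation, the function name, and the dictionary keys are assumptions made for illustration, since the description defines the mutual feature amount only in terms of FIG. 3.

```python
def mutual_features(time_seqs, freq_seqs):
    """Count intersections between time- and frequency-continuous sequences.

    "intersections" counts every shared point (the "○" and "□" marks together);
    "end_intersections" counts those at either end of a time-continuous
    sequence (the "□" marks).
    """
    freq_points = set()
    for seq in freq_seqs:
        freq_points.update(seq)
    total = at_end = 0
    for seq in time_seqs:
        for idx, point in enumerate(seq):
            if point in freq_points:
                total += 1
                if idx == 0 or idx == len(seq) - 1:
                    at_end += 1
    return {"intersections": total, "end_intersections": at_end}

# Usage: one sustained tone crossed by two broadband onsets,
# the first of which coincides with the start of the tone.
tone = [[(0, 2), (1, 2), (2, 2), (3, 2)]]
onsets = [[(0, 1), (0, 2), (0, 3)], [(2, 1), (2, 2)]]
counts = mutual_features(tone, onsets)
```

A high share of end intersections would then serve as evidence for a music section, in line with the drum-and-guitar example.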

By having the mutual feature amount extraction unit 7 extract the mutual feature amount, an index of how the time-continuous and frequency-continuous feature amounts influence each other, information can be obtained that extracting the two kinds of feature amounts separately would not provide.

FIG. 4 shows that scenes can be determined on the basis of the section classification. As shown in FIG. 4, when sections (segments) classified as music sections follow one another in time at a high rate, the acoustic classification unit 8 can index those sections as a music scene. That is, the acoustic classification unit 8 detects, as a music scene in the multimedia content, a scene that contains at least a specified number of music sections within a specified time. The key points of the scene classification device of FIG. 1 are described next from a different viewpoint.
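The rule "at least a specified number of music sections within a specified time" can be sketched with a sliding window over the per-segment labels. Measuring the specified time in segments, the merging of overlapping windows, and the names `window` and `min_count` are assumptions made for this illustration.

```python
def music_scenes(labels, window, min_count):
    """Index music scenes from per-segment classification results.

    labels[i] is True when segment i was classified as a music section.
    Returns (start, end) segment index ranges, with end exclusive.
    """
    covered = [False] * len(labels)
    for start in range(0, len(labels) - window + 1):
        # A window qualifies when it holds at least min_count music segments.
        if sum(labels[start:start + window]) >= min_count:
            for i in range(start, start + window):
                covered[i] = True
    scenes, begin = [], None
    for i, flag in enumerate(covered + [False]):  # sentinel closes a trailing scene
        if flag and begin is None:
            begin = i
        elif not flag and begin is not None:
            scenes.append((begin, i))
            begin = None
    return scenes

# Usage: music segments cluster near the start, with two short interruptions.
labels = [False, True, True, False, True, True, False, False, False, False]
scenes = music_scenes(labels, window=4, min_count=3)
```

Because a window only needs a high proportion of music segments, an isolated non-music segment inside a song does not split the scene.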

FIG. 5 is a diagram for explaining the processing in the time direction feature quantity extraction unit 4. The feature quantities of the frequency peaks include statistics, including the mean or variance, of the number of frequency peaks, the power values of the frequency peaks, the frequency values of the frequency peaks, the number of time continuous peak sequences, the length of the time continuous peak sequences, the continuity of the time continuous peak sequences, the direction of the time continuous peak sequences, the statistics of each peak value, and the statistics of each frequency value. This is described below with reference to equations (3) and (4).

In equations (3) and (4), FPK(t,n) = (FPKf(t,n), FPKp(t,n)) is a vector representing the n-th peak at a time t, where FPKf(t,n) is the frequency value of the n-th peak at time t and FPKp(t,n) is its power value. NFPK(t,n) is the vector obtained by normalizing FPK(t,n) in each dimension. Normalization absorbs the difference in scale between the frequency and power dimensions.
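Equations (3) and (4) are not reproduced in this text, so the exact normalization is unknown. As one possible reading, the Python sketch below applies min-max scaling to each dimension of a set of (frequency, power) peaks; the choice of min-max scaling and the function name normalize_peaks are assumptions made only for illustration.

```python
from typing import List, Sequence, Tuple


def normalize_peaks(peaks: Sequence[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Scale each dimension of the peak vectors to the range [0, 1] (assumed min-max scheme)."""
    if not peaks:
        return []
    dims = list(zip(*peaks))  # one tuple of values per dimension
    lows = [min(d) for d in dims]
    # A constant dimension would give a zero span; fall back to 1.0 to avoid division by zero.
    spans = [(max(d) - min(d)) or 1.0 for d in dims]
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans)) for p in peaks]
```

After scaling, a difference in frequency and a difference in power are expressed on comparable scales, so angles between vectors built from the peaks are not dominated by one dimension.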

When connecting frequency peaks in the time direction, the time direction feature quantity extraction unit 4 calculates a first vector and a second vector. The first vector starts at NFPK(t-1,j), the j-th frequency peak at time t−1, and ends at NFPK(t,i), the i-th frequency peak at a time t. The second vector starts at this NFPK(t,i) and ends at NFPK(t+1,k), the k-th frequency peak at time t+1. Each vector thus spans its start point and its end point.

The time direction feature quantity extraction unit 4 then connects the two vectors according to the degree of similarity between the first vector and the second vector. The degree of similarity can be expressed, for example, by how small the angle between the first and second vectors is, or by how large their inner product is. The time direction feature quantity extraction unit 4 judges the similarity of the two vectors by comparing these quantities against a threshold.

In FIG. 5, the × marks are the frequency peaks detected at each time. For example, if the vector V11 obtained from P10 and P11 in FIG. 5 is taken as the first vector and the vector V12 obtained from P11 and P12 as the second vector, the angle between the two vectors is small, so they are connected. Conversely, if the vector V21 obtained from P20 and P21 is taken as the first vector and the vector V22 obtained from P21 and P22 as the second vector, the angle between them is large, so they are not connected.
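The connection test described for FIG. 5 can be sketched in Python as follows. Here the similarity is the cosine of the angle between the two vectors, which is one of the measures the description allows (a small angle corresponds to a cosine near 1). The function names, the threshold value min_cos = 0.9, and the use of normalized (time, frequency) coordinates for each peak are assumptions for this example only.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def should_link(prev_peak: Sequence[float], cur_peak: Sequence[float],
                next_peak: Sequence[float], min_cos: float = 0.9) -> bool:
    """Decide whether the first vector (prev -> cur) and second vector (cur -> next) are connected."""
    first = tuple(c - p for p, c in zip(prev_peak, cur_peak))
    second = tuple(n - c for c, n in zip(cur_peak, next_peak))
    return cosine_similarity(first, second) >= min_cos
```

For instance, three peaks drifting gently upward in frequency give nearly parallel vectors and are linked, while a peak that jumps up and then back down gives a sharp turn and is not.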

FIG. 6 is a diagram comparing the existing technique with the method of this embodiment for the frequency peak connection processing. FIG. 6(a) shows the existing technique, which connects the closest frequency peaks in adjacent time frames. As is apparent at a glance, this method produces discontinuous connections, and as a result the accuracy of the time continuous feature quantity extracted by the time direction feature quantity extraction unit 4 decreases.

In contrast, the time direction feature quantity extraction unit 4 of this embodiment introduces the concept of vectors and connects peaks while also taking the direction between peaks into account. As shown in FIG. 6(b), this avoids discontinuous connections that do not match subjective perception and improves the accuracy of the time continuous feature quantity. Considering direction when connecting peaks thus makes it possible to avoid connecting vectors that should not be connected. A vector that has already been connected may be used as the first vector, which simplifies the processing.

FIG. 7 is a diagram for explaining the processing in the frequency direction feature quantity extraction unit 6. The feature quantities of the time peaks include statistics, including the mean or variance, of at least one of the number of time peaks, the power values of the time peaks, the frequency values of the time peaks, the number of frequency continuous peak sequences, the length of the frequency continuous peak sequences, the continuity of the frequency continuous peak sequences, the direction of the frequency continuous peak sequences, the statistics of each peak value, and the statistics of each time value. This is described below with reference to equations (5) and (6).

In equations (5) and (6), TPK(f,n) = (TPKt(f,n), TPKp(f,n)) is a vector representing the n-th peak at a frequency f, where TPKt(f,n) is the time value of the n-th peak at frequency f and TPKp(f,n) is its power value. NTPK(f,n) is the vector obtained by normalizing TPK(f,n) in each dimension. Normalization absorbs the difference in scale between the time and power dimensions.

When connecting time peaks in the frequency direction, the frequency direction feature quantity extraction unit 6 calculates a third vector and a fourth vector. The third vector starts at NTPK(f-1,j), the j-th time peak at frequency f−1, and ends at NTPK(f,i), the i-th time peak at a frequency f. The fourth vector starts at this NTPK(f,i) and ends at NTPK(f+1,k), the k-th time peak at frequency f+1. Each vector thus spans its start point and its end point.

The frequency direction feature quantity extraction unit 6 then connects the two vectors according to the degree of similarity between the third vector and the fourth vector. The degree of similarity can be expressed, for example, by how small the angle between the third and fourth vectors is, or by how large their inner product is. The frequency direction feature quantity extraction unit 6 judges the similarity of the two vectors by comparing these quantities against a threshold.

In FIG. 7, the ※ marks are the time peaks detected at each frequency. For example, if the vector V31 formed by P30 and P31 in FIG. 7 is taken as the third vector and the vector V32 formed by P31 and P32 as the fourth vector, the angle between the two vectors is small, so they are connected. Conversely, if the vector V41 formed by P40 and P41 is taken as the third vector and the vector V42 formed by P41 and P42 as the fourth vector, the angle between the two vectors is large, so they are not connected.

By considering direction when connecting time peaks as well, the frequency direction feature quantity extraction unit 6, like the time direction feature quantity extraction unit 4, can avoid connecting vectors that should not be connected, which improves the accuracy of the extracted frequency continuous feature quantity. A vector that has already been connected may also be used as the third vector, which simplifies the processing.

As described above, this embodiment includes the time peak detection unit 5, which detects time peaks in the spectrogram, and the frequency direction feature quantity extraction unit 6, which extracts the feature quantities exhibited by the time peaks, so that feature quantities related to the time peaks are also used as indices for section detection. It further includes the mutual feature quantity extraction unit 7, so that a new feature quantity defined by the interaction between frequency peaks and time peaks is also used as an index for classifying scenes.

Specifically, this embodiment extracts not only the feature quantity (frequency peak) indicating that a characteristic state in the spectrogram, one with higher power than its surroundings, continues in the time direction, but also the feature quantity (time peak) indicating that such a state continues in the frequency direction, and uses both for scene classification. As a result, music genres in which the time continuity of spectral peaks and the like is hard to observe, such as rap-style songs and fast-tempo songs, which were difficult for existing techniques, can also be detected with high accuracy. Moreover, detection is not limited to music: arbitrary acoustic sections such as speech can also be detected, and improved accuracy can be expected in various classifications, such as determining the type of music or the gender of the speaker in a detected section.

For these reasons, a scene classification device with improved section detection accuracy can be provided. Applied to devices including multimedia content players, it enables highly accurate classification of various scenes, such as music scenes and talk scenes, and further classification by type of music or gender of the speaker. This allows the user to watch only the desired scenes in a short time and to find a scene to edit quickly with simple operations.

The present invention is not limited to the above embodiment. It is not limited to stationary devices such as DVD recorders and can also be applied to video viewing devices that use a mobile communication terminal, such as so-called One Seg receivers. Because the invention improves the accuracy of feature quantity extraction, using various classification models allows it to be applied not only to music but also to the classification of various scenes such as mechanical noise, male voices, female voices, and animal sounds.

Furthermore, the present invention is not limited to the above embodiment as it stands, and in the implementation stage the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiment. For example, some constituent elements may be removed from the full set of constituent elements shown in the embodiment.

FIG. 1 is a functional block diagram showing an embodiment of the scene classification device according to the present invention.
FIG. 2 is a diagram for explaining the operation of the frequency peak detection unit 3 and the time peak detection unit 5.
FIG. 3 is a diagram for explaining the processing in the mutual feature quantity extraction unit 7.
FIG. 4 is a diagram showing that a scene can be determined based on the section classification.
FIG. 5 is a diagram for explaining the processing in the time direction feature quantity extraction unit 4.
FIG. 6 is a diagram comparing the existing technique with the method of this embodiment for the frequency peak connection processing.
FIG. 7 is a diagram for explaining the processing in the frequency direction feature quantity extraction unit 6.

Explanation of symbols

1 ... content input unit, 2 ... spectrum calculation unit, 3 ... frequency peak detection unit, 4 ... time direction feature quantity extraction unit, 5 ... time peak detection unit, 6 ... frequency direction feature quantity extraction unit, 7 ... mutual feature quantity extraction unit, 8 ... acoustic classification unit, 9 ... output unit

Claims

A spectrum calculation unit that divides multimedia content including an acoustic signal into a plurality of temporally continuous sections and calculates a spectrum of the acoustic signal for each of the sections;
A frequency peak detection unit that detects a frequency peak that is a local maximum point in the frequency direction in the spectrum;
A time direction feature quantity extraction unit for extracting a first feature quantity indicating the feature of the frequency peak;
A time peak detection unit for detecting a time peak that is a local maximum point in a time direction in a spectrogram in which the spectrum is continuously arranged in time;
A frequency direction feature quantity extraction unit for extracting a second feature quantity indicating the feature of the time peak;
A third feature quantity extraction unit for extracting a third feature quantity indicating a degree of interaction between the first feature quantity and the second feature quantity;
An acoustic classification unit that classifies the plurality of sections into a first music section and a second music section based on the acoustic features of the sections indicated by the first feature quantity, the second feature quantity, and the third feature quantity;
Including
The acoustic classification unit detects, as a music scene in the multimedia content, a scene that includes at least a specified number of the first music sections within a specified time,
The time direction feature quantity extraction unit extracts the first feature quantity so as to include a feature of a time continuous peak sequence obtained by connecting the frequency peaks in the time direction,
The frequency direction feature quantity extraction unit extracts the second feature quantity so as to include a feature of a frequency continuous peak sequence obtained by connecting the time peaks in the frequency direction,
The scene classification device, wherein the third feature quantity represents the number of intersections between the time continuous peak sequence and the frequency continuous peak sequence, or the manner in which they intersect.

The scene classification device according to claim 1, wherein the first feature quantity includes a statistic including the mean or variance of at least one of the number of frequency peaks, the power values of the frequency peaks, the frequency values of the frequency peaks, the number of the time continuous peak sequences, the length of the time continuous peak sequences, the continuity of the time continuous peak sequences, the direction of the time continuous peak sequences, a statistic of each peak value, and a statistic of each frequency value.

The scene classification device according to claim 1, wherein the time direction feature quantity extraction unit connects the first and second frequency peaks when the similarity between a first vector, spanned between first and second frequency peaks adjacent in the time direction, and a second vector, spanned between the first frequency peak and a third frequency peak adjacent to the first frequency peak in the time direction, is equal to or greater than a specified threshold value.

The scene classification device according to claim 1, wherein the second feature quantity includes a statistic including the mean or variance of at least one of the number of time peaks, the power values of the time peaks, the frequency values of the time peaks, the number of the frequency continuous peak sequences, the length of the frequency continuous peak sequences, the continuity of the frequency continuous peak sequences, the direction of the frequency continuous peak sequences, a statistic of each peak value, and a statistic of each time value.

The scene classification device according to claim 1, wherein the frequency direction feature quantity extraction unit connects the first and second time peaks when the similarity between a third vector, spanned between first and second time peaks adjacent in the frequency direction, and a fourth vector, spanned between the first time peak and a third time peak adjacent to the first time peak in the frequency direction, is equal to or greater than a specified threshold value.