JP3475317B2

JP3475317B2 - Video classification method and apparatus

Info

Publication number: JP3475317B2
Application number: JP34029396A
Authority: JP
Inventors: 憲一南; 明人阿久津; 佳伸外村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1996-12-20
Filing date: 1996-12-20
Publication date: 2003-12-08
Anticipated expiration: 2016-12-20
Also published as: JPH10187182A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] To handle video efficiently, a technique for automatically assigning attribute information to video is required. Attribute information is used for editing, processing, classifying, and otherwise handling video in fields related to video production. The present invention relates to a technique for extracting features contained in a video and classifying the video according to those features.

[0002]

[Prior Art] Broadly classifying what kind of content a video contains is indispensable for efficiently handling the large volume of video used in systems such as video on demand. At present, video is mainly classified into news, sports, drama, movies, music, documentary, education, variety, animation, and so on, and methods that attempt to identify some of these automatically have been proposed. In "S. Fischer et al.: Automatic Recognition of Film Genres, ACM Multimedia '95, pp. 295-301", scene changes and camera motion are detected from the color information of the images and, combined with changes in the amplitude of the sound information, are used to classify news, sports (tennis and car racing), animation, and commercials. The method relies on typical features seen in each genre: little camera motion indicates news; periodic repetition of a sound (the sound of a tennis ball being hit) indicates sports; little noise where the speech pauses indicates animation (because the audio is post-recorded, there is little background sound); and a frame that turns entirely black at scene changes indicates commercials.

[0003]

[Problem to Be Solved by the Invention] In the conventional technique described above, video is classified mainly on the basis of image information, and the sound information is not analyzed in detail. In addition, because the genre-specific features that can be detected from image information are limited, the range that can be classified is narrow. Furthermore, with a top-down method such as the above, which looks for predefined features of each genre, some genres cannot be classified.

[0004] On the other hand, the sound information contained in a video reflects the content of the video well, and features specific to the type of content are easy to detect from it. By analyzing the sound information to detect characteristic sounds found in video in general, and classifying the video from their occurrence pattern, a classification method that incorporates bottom-up elements can be realized.

[0005] An object of the present invention is to provide a video classification method and apparatus that analyze the sound information contained in video information and classify the video into categories that are not bound to existing genres.

[0006]

[Means for Solving the Problem] To achieve the above object, the video classification method of the present invention comprises: a video input step of inputting digital video information, with A/D conversion when the video information is analog; a music detection step of frequency-analyzing the sound information contained in the video information, detecting the stability of the spectrum, and detecting music; a voice detection step of detecting the harmonic structure of the spectrum and detecting voice; a codebook generation step of vector-quantizing feature vectors of acoustic sounds as training data to generate a codebook; an acoustic detection step of comparing the generated codebook with the feature vectors of the sound information contained in the video information and detecting acoustic sounds at a short distance; an attribute information storage step of recording the positions of the sections of each type of detected sound information; and a video discrimination step of extracting one or more of the type of the detected sound information, the length of each section, the total length for each type, and the pattern of section positions, and discriminating the type of the video information. With these steps, sections in which at least one of music, voice, and acoustic sound is present are detected from the sound information contained in the input video information, and the type of video is discriminated from the occurrence pattern of the detected sections, so that video can be classified into a wide range of categories.

[0007] The video classification apparatus of the present invention comprises: a video input unit that inputs digital video information, with A/D conversion when the video information is analog; a music detection unit that frequency-analyzes the sound information contained in the video information, detects the stability of the spectrum, and detects music; a voice detection unit that detects the harmonic structure of the spectrum and detects voice; a codebook generation unit that vector-quantizes feature vectors of acoustic sounds as training data to generate a codebook; an acoustic detection unit that compares the generated codebook with the feature vectors of the sound information contained in the video information and detects acoustic sounds at a short distance; an attribute information storage unit that records the positions of the sections of each type of detected sound information; and a video discrimination unit that extracts one or more of the type of the detected sound information, the length of each section, the total length for each type, and the pattern of section positions, and discriminates the type of the video information. With these units, sections in which at least one of music, voice, and acoustic sound is present are detected from the sound information contained in the input video information, and the type of video is discriminated from the occurrence pattern of the detected sections, so that video can be classified into a wide range of categories.

[0008] In the above video classification method and apparatus, music can be detected easily by detecting the strength of edges in the time direction at a constant frequency of the spectrogram.

[0009] In addition, by detecting the harmonic structure with a comb filter after removing the strong-edge portions of the spectrogram, voice can be detected easily even when music overlaps it.

[0010] In addition, by using the distance between the centroid of the codebook and the feature vector of the sound information contained in the video information as the detection criterion, acoustic sounds can be detected easily.

[0011] Furthermore, by creating a codebook with the type of the detected sound information, the length of each section, the total length for each type, and the position of each section as a classification vector, and using the distance between the centroid of that codebook and the classification vector of the sound information contained in the video information as the discrimination criterion, video can be classified easily.

[0012]

[Embodiments of the Invention] Next, embodiments of the present invention are described in detail with reference to the drawings.

[0013] FIG. 1 is a block diagram showing the schematic configuration of a video classification apparatus according to one embodiment of the present invention.

[0014] The video classification apparatus of this embodiment comprises: a video input unit 101 that inputs video information, with A/D conversion when it is analog; an edge detection unit 102 that frequency-analyzes the sound information, detects edges in the sound spectrogram, and removes them as necessary; a music detection unit 103 that detects music from the sound information; a voice detection unit 104 that detects voice; a codebook generation unit 105 that generates a codebook from training data of acoustic sounds; an acoustic detection unit 106 that detects sounds of the same type as the learned acoustic sounds; an attribute information storage unit 107 that records the positions of the detected sound sections; and a video discrimination unit 108 that discriminates the type of the video information from the type of the detected sound information, the length of each section, the total length for each type, and the position of each section.

[0015] The sound data of the video input from the video input unit 101 is, on the one hand, input to the edge detection unit 102, where it undergoes FFT (fast Fourier transform) processing to generate a sound spectrogram a few seconds long. LPC (linear predictive analysis) may be used instead of the FFT. On the other hand, the sound data of the video input from the video input unit 101 is also input to the acoustic detection unit 106.

[0016] FIG. 2 is a flowchart showing the processing of the edge detection unit 102, the music detection unit 103, and the voice detection unit 104 in one embodiment of the present invention. An example of their operation is described below with reference to FIGS. 1 and 2.

[0017] A spectrogram is generated by FFT processing 201 of the edge detection unit 102. The frame length is several tens of milliseconds to about one hundred milliseconds, and the detection section is several seconds long.
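The framing described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `spectrogram`, the Hann window, the hop size, and the naive DFT (standing in for the FFT so that the example needs only the standard library) are not taken from the patent, and the frame length used in the usage note is a toy value rather than the tens of milliseconds named in the text.

```python
import cmath
import math


def spectrogram(samples, frame_len, hop):
    """Return magnitude spectra spec[t][i]: one list of bins per frame.

    Assumptions (not from the patent): Hann window, fixed hop size,
    and a naive O(n^2) DFT used in place of an FFT.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Taper the frame to reduce spectral leakage.
        windowed = [
            x * (0.5 - 0.5 * math.cos(2 * math.pi * k / (frame_len - 1)))
            for k, x in enumerate(frame)
        ]
        spectrum = []
        for i in range(frame_len // 2 + 1):  # non-negative frequencies only
            acc = sum(
                x * cmath.exp(-2j * math.pi * i * k / frame_len)
                for k, x in enumerate(windowed)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames
```

For example, a sine wave with four cycles per 32-sample frame produces its largest magnitude in bin 4 of every frame. A practical implementation would more likely call an FFT routine from a numerical library and use frame lengths matching the sampling rate.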

[0018] FIG. 3 shows the generated spectrogram in simplified form. In practice, the spectrogram is obtained as a grayscale image. Reference numeral 301 denotes the trajectory of the spectrum of a music component, and 302 denotes the trajectory of the spectrum of a voice component. Because a music component traces a trajectory that is stable in the frequency direction, this property is used for detection. First, the edge EDi in the time direction at frequency i is detected in edge detection processing 202 using a differential operator. The resulting value of EDi is compared with threshold TH1 in edge threshold processing 203. If EDi is greater than TH1, then, as preprocessing for voice detection, the spectrum at frequency i is set to 0 in edge erasure and interpolation processing 204, which erases the edge. The erased spectrum is then linearly interpolated from the values of the neighboring spectra. This processing is repeated for all bands. In repetition determination processing 205, the repetition ends when i equals n-1, where n is the number of points in the FFT frame length.

[0019] Next, the sum of the edge strengths is calculated in edge strength calculation processing 206, and in edge strength threshold processing 207, music is judged to be present if the calculated edge strength is greater than threshold TH2.
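The edge detection, erasure, and music decision described in paragraphs [0018] and [0019] might be sketched as below. The patent does not specify the differential operator, so the second difference along the frequency axis used here is an assumption, as are the function names `edge_strengths`, `erase_edges`, and `has_music` and the layout `spec[t][i]` (time frame t, frequency bin i).

```python
def edge_strengths(spec):
    """Return EDi for every frequency bin i of spec[t][i].

    Assumed operator: a second difference along the frequency axis,
    clipped at zero and averaged over time, so that a ridge that stays
    at one frequency for the whole window gets a large value.
    """
    n_bins = len(spec[0])
    ed = [0.0] * n_bins
    for i in range(1, n_bins - 1):
        ridge = [max(0.0, 2 * f[i] - f[i - 1] - f[i + 1]) for f in spec]
        ed[i] = sum(ridge) / len(spec)
    return ed


def erase_edges(spec, ed, th1):
    """Erase bins with EDi > th1 and fill them by linear interpolation
    between the nearest kept bins (preprocessing for voice detection)."""
    n_bins = len(spec[0])
    keep = [i for i in range(n_bins) if ed[i] <= th1]
    cleaned = []
    for frame in spec:
        new = list(frame)
        for i in range(n_bins):
            if ed[i] <= th1:
                continue
            left = max((k for k in keep if k < i), default=None)
            right = min((k for k in keep if k > i), default=None)
            if left is None and right is None:
                new[i] = 0.0  # nothing to interpolate from
            elif left is None:
                new[i] = frame[right]
            elif right is None:
                new[i] = frame[left]
            else:
                w = (i - left) / (right - left)
                new[i] = (1 - w) * frame[left] + w * frame[right]
        cleaned.append(new)
    return cleaned


def has_music(ed, th2):
    """Music is judged present when the summed edge strength exceeds th2."""
    return sum(ed) > th2
```

With a toy spectrogram in which bin 2 holds a constant peak, `edge_strengths` is large only at bin 2, `has_music` fires for a suitably low TH2, and `erase_edges` flattens the peak to the level of its neighbors. The thresholds TH1 and TH2 would have to be tuned on real data.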

[0020] As shown at 302 in FIG. 3, a voice component appears as a pattern of equally spaced stripes that varies over time. Therefore, in parallel with edge strength calculation processing 206, comb filter processing 208 is applied to the spectrogram, and in filter output threshold processing 209, voice is judged to be present if the filter output is greater than threshold TH3.
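One possible reading of the comb filter step is sketched below. The patent does not give the filter design, so this version is an assumption: for each candidate harmonic spacing it compares the mean magnitude on the comb teeth with the mean magnitude halfway between them, and it averages the best response over the frames of the detection section. The names `comb_response` and `has_voice` and the spacing range are hypothetical.

```python
def comb_response(spectrum, min_spacing, max_spacing):
    """Best comb-filter response of one magnitude spectrum.

    For each candidate spacing (in bins), take the mean magnitude at
    multiples of the spacing minus the mean magnitude at the midpoints
    between them. Equally spaced harmonics give a large value.
    """
    best = 0.0
    n = len(spectrum)
    for spacing in range(min_spacing, max_spacing + 1):
        teeth = list(range(spacing, n, spacing))
        if len(teeth) < 2:
            continue
        on = sum(spectrum[k] for k in teeth) / len(teeth)
        gaps = [k - spacing // 2 for k in teeth]
        off = sum(spectrum[k] for k in gaps) / len(gaps)
        best = max(best, on - off)
    return best


def has_voice(spec, th3, min_spacing=2, max_spacing=8):
    """Voice is judged present when the mean comb response exceeds th3."""
    responses = [comb_response(f, min_spacing, max_spacing) for f in spec]
    return sum(responses) / len(responses) > th3
```

A spectrum with peaks at every fourth bin yields a strong response, while a flat spectrum yields none. In the flow of FIG. 2, the input to this step would be the spectrogram after the strong edges have been erased.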

[0021] FIG. 4 is a flowchart showing the processing of the acoustic detection unit 106 of FIG. 1 in one embodiment of the present invention. Examples of types of acoustic sound include laughter, cheering, applause, crowd noise, and machine sounds. Laughter, cheering, and applause are used as examples here.

[0022] Acoustic sounds such as laughter, cheering, and applause show no clear structure in the spectrum, so they are detected using vector quantization. First, samples of each type of acoustic data are prepared, and the codebook generation unit 105 creates a codebook. As the feature values of the vectors, linear prediction coefficients of about 16 dimensions are used, with a frame length of several tens of milliseconds to about one hundred milliseconds. LPC cepstrum, FFT cepstrum, filter bank outputs, and the like can also be used. The more sample data there is, the better the results. To classify into the three categories of laughter, cheering, and applause, three or more clusters are generated from the coefficients of the sample data. The case of three clusters is described below as an example. Let the centroid vectors of the clusters be C1, C2, and C3. Which of laughter, cheering, and applause each of C1, C2, and C3 corresponds to can easily be determined by checking which centroid vector is closest to sample data whose category is known.
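The codebook generation above can be illustrated with a small clustering sketch. The patent does not name a clustering algorithm, so Lloyd's k-means with a deterministic, evenly spaced initialization is an assumption, and the functions `train_codebook` and `label_centroids` are hypothetical names. The feature vectors are taken as given; computing linear prediction coefficients is outside this sketch.

```python
import math
from collections import Counter


def nearest(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))


def train_codebook(samples, k, iterations=20):
    """Cluster feature vectors into k centroids with Lloyd's k-means.

    Assumes len(samples) >= k. Initial centroids are evenly spaced
    samples, so the result is deterministic.
    """
    step = len(samples) // k
    centroids = [list(samples[j * step]) for j in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for s in samples:
            groups[nearest(s, centroids)].append(s)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def label_centroids(centroids, samples, labels):
    """Name each centroid after the known category whose samples
    most often fall closest to it."""
    votes = [Counter() for _ in centroids]
    for s, lab in zip(samples, labels):
        votes[nearest(s, centroids)][lab] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]
```

With well-separated toy samples for laughter, cheering, and applause, `train_codebook(samples, 3)` returns one centroid per group and `label_centroids` recovers the category of each centroid.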

[0023] The linear prediction coefficients of the sound data of the input video are calculated in linear prediction coefficient calculation processing 401, and the distance Li to each centroid vector is calculated in vector distance calculation processing 402. Next, in minimum distance vector threshold processing 403, the magnitude of the distance Li to the centroid vectors is examined. If it is greater than threshold TH4, the data is judged not to belong to any of the three categories and is judged to be non-acoustic. If it is smaller than threshold TH4, minimum distance vector discrimination processing 404 and 405 select the shortest of the distances Li to the centroid vectors, and the data is judged to belong to the corresponding category. FIG. 4 shows the case where C1, C2, and C3 correspond to laughter, cheering, and applause, respectively.
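The decision in paragraph [0023] amounts to a nearest-centroid lookup with a rejection threshold. A minimal sketch follows, assuming Euclidean distance (the patent does not state the metric) and applying TH4 to the smallest distance; the function name `detect_sound` is hypothetical.

```python
import math


def detect_sound(feature, centroids, labels, th4):
    """Return the category of the nearest centroid, or None when even
    the nearest centroid is farther than th4 (judged non-acoustic)."""
    distances = [math.dist(feature, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    if distances[best] > th4:
        return None
    return labels[best]
```

For example, with centroids at (0, 0.5), (10, 10.5), and (20, 0.5) labeled laughter, cheering, and applause, a feature vector at (9.5, 10) is assigned to cheering, while one at (50, 50) is rejected for a threshold of 3.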

[0024] The positions of the start and end points of the sounds detected by the characteristic sound detection unit 102 are recorded in the attribute information storage unit 107 as part of the attribute information, in a format such as a time code or a byte count from the beginning.
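As an illustration of how start and end points might be derived and stored, the sketch below merges consecutive positive detections of one sound type into sections expressed in seconds. The per-window boolean input, the function name `to_sections`, and the dictionary layout in the usage note are assumptions; the patent only states that start and end positions are recorded as time codes or byte offsets.

```python
def to_sections(flags, window_seconds):
    """Merge runs of consecutive True detections into (start, end) pairs.

    flags[k] is True when the sound type was detected in window k, and
    each window is window_seconds long.
    """
    sections = []
    start = None
    # A trailing False closes a run that reaches the end of the input.
    for idx, flag in enumerate(list(flags) + [False]):
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            sections.append((start * window_seconds, idx * window_seconds))
            start = None
    return sections
```

Running each detector separately and storing the results per type, for example `{"music": to_sections(music_flags, 2.0), "voice": to_sections(voice_flags, 2.0)}`, keeps overlapping music and voice sections distinguishable.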

[0025] The video discrimination unit 106 reads the information from the attribute information storage unit 107, calculates the content rate of each sound over the entire video sequence, and obtains the classification vector V (v1, v2, v3, v4, v5, v6). Here, v1, v2, v3, v4, v5, and v6 are the content rates of music, voice, laughter, cheering, applause, and sections where music and voice overlap, respectively.
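The classification vector can be computed directly from stored sections. The sketch below assumes the attribute information is a dictionary mapping each sound type to a list of non-overlapping (start, end) pairs in seconds, and that v1 and v2 count music and voice time regardless of overlap; the patent does not settle either point, and the names `classification_vector`, `total_length`, and `overlap_length` are hypothetical.

```python
def total_length(sections):
    """Summed duration of a list of (start, end) sections."""
    return sum(end - start for start, end in sections)


def overlap_length(a, b):
    """Summed duration during which a section of a and one of b coincide."""
    return sum(
        max(0.0, min(e1, e2) - max(s1, s2))
        for s1, e1 in a
        for s2, e2 in b
    )


def classification_vector(attributes, duration):
    """Return V = (v1..v6): content rates of music, voice, laughter,
    cheering, applause, and music/voice overlap over the whole video."""
    kinds = ["music", "voice", "laughter", "cheering", "applause"]
    v = [total_length(attributes.get(k, [])) / duration for k in kinds]
    both = overlap_length(attributes.get("music", []), attributes.get("voice", []))
    v.append(both / duration)
    return v
```

For a 100-second video with music from 0 to 40 s, voice from 30 to 80 s, and laughter from 80 to 90 s, the vector is (0.4, 0.5, 0.1, 0.0, 0.0, 0.1).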

[0026] When video is classified using the classification vector, vector quantization is used, as in acoustic detection. Classification vectors are obtained from a variety of video samples, clustering is performed for the required number of genres, and the centroid vectors are obtained. The distance between the classification vector of the input video and each centroid vector is calculated, and the video is assigned to the nearest cluster. The clusters that are formed do not necessarily match commonly used genres, but classifications such as the following are possible: news or education when there is much voice and little music, music in the opposite case, and comedy when there is much laughter.

[0027] FIG. 5 is a flowchart showing the processing when the video classification apparatus of this embodiment is realized in software. First, in codebook generation step 500, a codebook is generated from training data of acoustic sounds. The video is then input in video input step 501, and frequency analysis and edge detection are performed in edge detection step 502. Edges are also erased and interpolated as necessary. In music detection step 503 and voice detection step 504, music and voice are detected using the edge strength and the comb filter, respectively. In acoustic detection step 505, laughter, cheering, and applause are detected using vector quantization. Information on the start and end points of the detected sounds is stored in attribute information detection step 506, and when the end of the video sequence is reached, the video is classified in video discrimination step 507.

[0028]

[Effects of the Invention] As described above, the present invention provides the following effects.

[0029] (1) Music, voice, laughter, cheering, and applause are detected from the sound information contained in the video information, and the type of the detected sound information, the length of each section, the total length for each type, and the pattern of section positions are compared, so video can be classified into a wide range of categories.

[0030] (2) When the strength of edges in the time direction at a constant frequency of the spectrogram is detected, music in particular can be detected easily.

[0031] (3) When the harmonic structure is detected with a comb filter after the strong-edge portions of the spectrogram have been removed, voice such as speech in particular can be detected easily.

[0032] (4) When the distance between the centroid of the codebook and the feature vector of the sound information contained in the video information is used as the detection criterion, acoustic sounds in particular can be detected easily.

[0033] (5) When a codebook is created with the type of the detected sound information, the length of each section, the total length for each type, and the position of each section as a classification vector, and the distance between the centroid of that codebook and the classification vector of the sound information contained in the video information is used as the discrimination criterion, video in particular can easily be classified into a wide range of categories.

[Brief description of drawings]

FIG. 1 is a block diagram showing the schematic configuration of a video classification apparatus according to one embodiment of the present invention.

FIG. 2 is a flowchart showing the music and voice detection processing in the characteristic sound detection portion of the above embodiment.

FIG. 3 is a conceptual diagram showing the sound spectrogram obtained by the edge detection unit of the above embodiment.

FIG. 4 is a flowchart showing the laughter, cheering, and applause detection processing in the characteristic sound detection portion of the above embodiment.

FIG. 5 is a flowchart showing the flow of processing when the video classification apparatus of the above embodiment is realized in software on a computer.

[Explanation of symbols]

101 ... Video input unit
102 ... Edge detection unit
103 ... Music detection unit
104 ... Voice detection unit
105 ... Codebook generation unit
106 ... Acoustic detection unit
107 ... Video discrimination unit
108 ... Attribute information storage unit
201 ... FFT (fast Fourier transform) processing
202 ... Edge detection processing
203 ... Edge threshold processing
204 ... Edge erasure and interpolation processing
205 ... Repetition determination processing
206 ... Edge strength calculation processing
207 ... Edge strength threshold processing
208 ... Comb filter processing
209 ... Filter output threshold processing
301 ... Music spectrum peak
302 ... Voice spectrum peak
401 ... Linear prediction coefficient calculation processing
402 ... Vector distance calculation processing
403 ... Minimum distance vector threshold processing
404 ... Minimum distance vector discrimination processing
405 ... Minimum distance vector discrimination processing
500 ... Codebook generation step
501 ... Video input step
502 ... Edge detection step
503 ... Music detection step
504 ... Voice detection step
505 ... Acoustic detection step
506 ... Attribute information storage step
507 ... Video discrimination step

Continuation of front page

(56) References: JP-A-8-179791 (JP, A); JP-A-7-105235 (JP, A); JP-A-3-80782 (JP, A); JP-A-2-121500 (JP, A); JP-A-5-88695 (JP, A); JP-A-62-70898 (JP, A); JP-A-4-127200 (JP, A)

(58) Fields investigated (Int. Cl.7, DB name): G10L 15/00 - 15/28; H04N 5/91; JICST file (JOIS)

Claims

(57) [Claims]

1. A video classification method of inputting video information, detecting, from sound information contained in the input video information, sections in which at least one of music, voice, and acoustic sound is present, and discriminating the type of video from the occurrence pattern of the detected sections, the method comprising: a video input step of inputting digital video information, with A/D conversion when the video information is analog; an edge detection step of frequency-analyzing the sound information contained in the video information and, in order to detect the stability of the spectrum, detecting edges with a differential operator in the frequency direction from a spectrogram formed by arranging the spectra in the time direction; a music detection step of detecting music from the strength of edges in the time direction at a constant frequency of the spectrogram, in order to detect music from the stability of the spectrum; a voice detection step of detecting the harmonic structure of the spectrum and detecting voice; a codebook generation step of vector-quantizing feature vectors of acoustic sounds as training data to generate a codebook; an acoustic detection step of comparing the generated codebook with the feature vectors of the sound information contained in the video information and detecting acoustic sounds at a short distance; an attribute information storage step of recording the positions of the sections of each type of detected sound information; and a video discrimination step of extracting one or more of the type of the detected sound information, the length of each section, the total length for each type, and the pattern of section positions, and discriminating the type of the video information.

2. The video classification method according to claim 1, wherein the voice detection step detects voice by detecting the harmonic structure with a comb filter after removing the strong-edge portions of the spectrogram.

3. The video classification method according to claim 1 or 2, wherein the acoustic detection step uses the distance between the centroid of the codebook and the feature vector of the sound information contained in the video information as the detection criterion.

4. The video classification method according to any one of claims 1, 2, and 3, wherein the video discrimination step creates a codebook using the type of the detected sound information, the length of each section, the total length for each type, and the position of each section as a classification vector, and uses the distance between the centroid of the codebook and the classification vector of the sound information contained in the video information as the discrimination criterion.

5. A video classification apparatus that inputs video information, detects, from sound information contained in the input video information, sections in which at least one of music, voice, and acoustic sound is present, and discriminates the type of video from the occurrence pattern of the detected sections, the apparatus comprising: a video input unit that inputs digital video information, with A/D conversion when the video information is analog; an edge detection unit that frequency-analyzes the sound information contained in the video information and, in order to detect the stability of the spectrum, detects edges with a differential operator in the frequency direction from a spectrogram formed by arranging the spectra in the time direction; a music detection unit that detects music from the strength of edges in the time direction at a constant frequency of the spectrogram, in order to detect music from the stability of the spectrum; a voice detection unit that detects the harmonic structure of the spectrum and detects voice; a codebook generation unit that vector-quantizes feature vectors of acoustic sounds as training data to generate a codebook; an acoustic detection unit that compares the generated codebook with the feature vectors of the sound information contained in the video information and detects acoustic sounds at a short distance; an attribute information storage unit that records the positions of the sections of each type of detected sound information; and a video discrimination unit that extracts one or more of the type of the detected sound information, the length of each section, the total length for each type, and the pattern of section positions, and discriminates the type of the video information.

6. The video classification apparatus according to claim 5, wherein the voice detection unit detects voice by detecting the harmonic structure with a comb filter after removing the strong-edge portions of the spectrogram.

7. The video classification apparatus according to claim 5 or 6, wherein the acoustic detection unit uses the distance between the centroid of the codebook and the feature vector of the sound information contained in the video information as the detection criterion.

8. The video classification apparatus according to any one of claims 5, 6, and 7, wherein the video discrimination unit creates a codebook using the type of the detected sound information, the length of each section, the total length for each type, and the position of each section as a classification vector, and uses the distance between the centroid of the codebook and the classification vector of the sound information contained in the video information as the discrimination criterion.