JP2010015588A - Apparatus for classifying dynamic image data

Info

Publication number: JP2010015588A
Application number: JP2009196734A
Authority: JP
Inventors: Masaru Sugano; 勝菅野; Yasuyuki Nakajima; 康之中島; Hiromasa Yanagihara; 広昌柳原
Original assignee: KDDI R&D Laboratories Inc
Current assignee: KDDI Research Inc
Priority date: 2009-08-27
Filing date: 2009-08-27
Publication date: 2010-01-21
Anticipated expiration: 2023-02-27
Also published as: JP4999015B2

Abstract

PROBLEM TO BE SOLVED: To provide an apparatus for classifying dynamic image data that classifies compressed or uncompressed dynamic image data into classes (shot classes) at low cost and with high accuracy, using features of the dynamic image data and, where required, features of the audio accompanying it.

SOLUTION: The apparatus includes: a dynamic image data partition element 1 that partitions the dynamic image data into shot units on the time axis; a feature value extraction element 11 that extracts, from an image within each shot unit, an image feature value such as the color layout descriptor defined by MPEG-7; and a similar unit detection element 12 that uses the image feature value to detect the similar shot units that appear most frequently. The similar unit detection element 12 builds histograms of a plurality of coefficients of the feature value, taken from the first frame of each shot or from a representative frame within the shot, and detects the most frequent similar shots by successively narrowing down to the bins that contain the most elements.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an apparatus for classifying moving image data, and in particular to an apparatus that classifies uncompressed or compressed moving image data into predefined classes and thereby enables efficient search, classification, or browsing of the moving image data.

As prior art for scene classification of moving image data, methods have been studied that take, for example, television broadcast video as input and classify it in relatively large units such as news, sports, and commercials. Methods that divide video into logical story units (Logical Story Unit) made up of several related video segments have also been studied. Some of these proposals use, in addition to the features of the moving image data itself, features of the audio data accompanying it.

For detecting highlight scenes as summary information, a technique has been proposed that extracts highlight scenes from sports video and the like in the compressed domain of compressed moving image data, using the characteristics of the accompanying audio.

Furthermore, Japanese Patent Application No. 2002-285667 by the present applicant proposes shot-level classification methods that range from scenes at a relatively low level of abstraction, such as dynamic/static scenes, slow scenes, and camera operations such as panning and zooming, to scenes at a relatively high level of abstraction, such as highlight scenes in sports video.

Japanese Patent Application No. 2002-285667

Much of the prior art analyzes moving image data and its accompanying audio data mainly in the uncompressed domain, so compressed moving image data must first be decoded and the processing cost is high. In addition, classification has mostly been performed in relatively large units such as whole programs or logical story units, so classification in finer units requires techniques such as the one disclosed in Japanese Patent Application No. 2002-285667. Classification in fine units is important and effective for, for example, browsing specific scenes in moving image data or classifying entries in a moving image database. However, the technique disclosed in Japanese Patent Application No. 2002-285667 mainly performs scene classification at a semantically low level, such as dynamic/static scenes and the extraction of camera operations, and therefore cannot support scene browsing or content filtering at a higher level of abstraction. For example, it cannot perform filtering such as excluding violent scenes from movie content.

The present invention has been made in view of the prior art described above. Its object is to provide an apparatus for classifying moving image data that classifies uncompressed or compressed moving image data into various classes (shot classes) at low cost and with high accuracy, using features of the moving image and, where necessary, features of the audio accompanying it.

To achieve this object, the present invention provides an apparatus for classifying uncompressed or compressed moving image data, comprising: moving image data dividing means for dividing the moving image data into shot units on the time axis; feature value extracting means for extracting an image feature value, such as the color layout descriptor defined in MPEG-7, obtained from an image in each shot unit; and similar unit detecting means for detecting, using the image feature value, the similar shot units with the highest appearance frequency. The apparatus is characterized in that the similar unit detecting means builds histograms of a plurality of coefficients of the feature value for the first frame of each shot or a representative frame within the shot, and detects the most frequent similar shots by successively narrowing down to the bins with the largest number of elements.

According to this feature, the most frequent similar shots can be detected efficiently in uncompressed or compressed moving image data. Moreover, by applying the present invention, highlight scenes can be detected with high accuracy from television sports video, and announcer shots can be detected with high accuracy from television news video.

As is apparent from the above description, according to the inventions of claims 1 to 7, classifying the shots of uncompressed or compressed moving image data into various types makes it possible to search for and browse desired scenes within the moving image data and to classify large amounts of moving image data effectively.

In particular, according to the inventions of claims 1 and 2, the most frequent similar shots can be detected efficiently in uncompressed or compressed moving image data.

According to the inventions of claims 3 to 5, highlight scenes can be detected with high accuracy from television sports video, and according to the inventions of claims 6 and 7, announcer shots can be detected with high accuracy from television news video.

FIG. 1 is a block diagram of the moving image data classification apparatus of one embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the action class discrimination unit of FIG. 1.
FIG. 3 is a flowchart showing the operation of the dramatic class discrimination unit of FIG. 1.
FIG. 4 is a flowchart showing the operation of the conversation class discrimination unit of FIG. 1.
FIG. 5 is a block diagram of the moving image data classification apparatus of a second embodiment of the present invention.
FIG. 6 is a flowchart showing the operation of the most frequent shot detection unit of FIG. 5.
FIG. 7 is an explanatory diagram of the most frequent shot detection process using the color layout descriptor.
FIG. 8 is a flowchart showing another operation of the most frequent shot detection unit.
FIG. 9 is a block diagram of the moving image data classification apparatus of a third embodiment of the present invention.

The present invention is described in detail below with reference to the drawings. First, an embodiment of the present invention is described with reference to FIG. 1, which is a block diagram showing the main components of the moving image data classification apparatus. The description uses an example in which the shot dividing unit 1 divides the input moving image data into shots, but any unit of division may be used as long as it is consistent with respect to the camera operation during capture or the content of the captured video. For example, the video from when the camera's start button is pressed until its stop button is pressed (during which a zoom operation, for example, may be performed) can be treated as one unit of division.

First, when uncompressed or compressed moving image data is input together with audio data that accompanies it or is multiplexed with it, the shot dividing unit 1 divides the moving image data into shots. The shot dividing unit 1 holds the shot length Ls of each shot of the input moving image data. The moving image data of each shot is passed to the motion information analysis unit 2.

In parallel, the audio data accompanying the moving image data, or the audio data obtained by demultiplexing audio that is multiplexed with the moving image data, is passed to the audio analysis unit 3.

For compressed moving image data, the motion information analysis unit 2 calculates the motion intensity Is of a shot from the motion vectors of the predictive-coded pictures in the shot. For the motion intensity Is, the "Motion Intensity" element of the "Motion Activity Descriptor" defined in MPEG-7 can be used, among others. The "Motion Intensity" element is expressed as an integer from 1 to 5 (1 is the lowest, 5 the highest).

For uncompressed moving image data, on the other hand, the motion from the previous frame is estimated using a block matching method or the like, the result is expressed as motion vectors, and the motion intensity Is is calculated in the same way as above. The motion intensity Is of a shot can be the average over the shot of the motion intensity Ip of the predictive-coded pictures considered, or their maximum, their intermediate value, and so on. The predictive-coded pictures and motion vectors subjected to motion information analysis can be any combination of forward predictive-coded pictures, bidirectionally predictive-coded pictures, and the forward and backward motion vectors of bidirectionally predictive-coded pictures.
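The reduction from per-picture motion intensity Ip to a shot-level Is can be sketched as follows. This is only an illustration: the function name and the `mode` parameter are hypothetical, and the "intermediate value" is assumed here to mean the median.

```python
from statistics import mean, median

def shot_motion_intensity(frame_intensities, mode="mean"):
    """Reduce per-picture motion intensities Ip to one shot-level value Is.

    mode selects the reduction named in the text: the average, the maximum,
    or the intermediate value (assumed here to be the median).
    """
    reducers = {"mean": mean, "max": max, "median": median}
    return reducers[mode](frame_intensities)

# Ip values of the predictive-coded pictures in one shot (made-up numbers).
ip_values = [1, 3, 5]
is_mean = shot_motion_intensity(ip_values, "mean")
is_max = shot_motion_intensity(ip_values, "max")
```

Which reduction works best likely depends on the content; the maximum reacts to a single burst of motion, while the mean and median describe the shot as a whole.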

The audio analysis unit 3 basically includes an audio power calculation unit 31, and may also include an audio type analysis unit 32. The audio power calculation unit 31 calculates the power Ps of the audio signal in the input shot, or the audio power Psb of each band. When the per-band audio power Psb is calculated, any bandwidth can be selected and each band can be weighted; the sum over the bands is taken as the audio power Ps of the shot. Ps is therefore expressed as follows.

$$\mathrm{Ps} = \sum_{i=\mathrm{lsb}}^{\mathrm{hsb}} w[i] \cdot \mathrm{Psb}[i]$$

Here, lsb is the band number of the lowest band for which audio power is calculated, hsb is the band number of the highest band, and w[i] is the weight applied to the audio power Psb[i] of band i.
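As a small worked example of this weighted sum, the sketch below computes Ps from per-band powers. The function name and the sample numbers are made up for illustration; only lsb, hsb, w[i], and Psb[i] come from the text.

```python
def shot_audio_power(band_power, weights, lsb, hsb):
    """Return Ps as the sum of w[i] * Psb[i] for bands lsb..hsb (inclusive)."""
    return sum(weights[i] * band_power[i] for i in range(lsb, hsb + 1))

# Per-band power Psb[i] and weights w[i] for four bands (made-up numbers).
psb = [0.5, 2.0, 4.0, 1.0]
w = [1.0, 0.5, 0.25, 1.0]

# Only bands 1 and 2 contribute: 0.5 * 2.0 + 0.25 * 4.0
ps = shot_audio_power(psb, w, lsb=1, hsb=2)
```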

Furthermore, when the audio type analysis unit 32 is available, the audio data in the input shot is classified into audio types such as "silence", "voice", "music", and "cheer". For the processing of the audio type analysis unit 32, the method described in Japanese Unexamined Patent Publication No. H10-247093 can be used, among others. When the audio type is determined per unit of time, the most frequent class within the shot is taken as the representative audio type Cs of the shot.
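Choosing the representative audio type Cs amounts to a majority vote over the per-unit-time labels of a shot. A minimal sketch, assuming the labels are already available as strings (the function name is hypothetical, and ties are resolved in favor of the label seen first):

```python
from collections import Counter

def representative_audio_type(labels):
    """Return the most frequent audio type among the labels of one shot."""
    return Counter(labels).most_common(1)[0][0]

# One label per unit of time within a single shot (made-up data).
cs = representative_audio_type(["voice", "voice", "music", "silence"])
```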

Here, the shot classes handled by the classification apparatus of the present invention are defined.
"Action" class: in movies and the like, a shot such as a gunfight or an explosion, with high audio volume, strong motion, and a short shot length.
"Dramatic" class: in movies and the like, a shot in which some important event occurs or is about to occur, often preceding an "action" class shot.
"Conversation" class: in movies and the like, a shot in which two or more characters are conversing.
"Highlight" class: in television sports video, a shot that contains an important event such as a scoring scene.
"Announcer" class: in television news video, a shot in which the announcer is reading the news.

The action class discrimination unit 4 and the dramatic class discrimination unit 5 take as input the shot length Ls from the shot dividing unit 1, the in-shot motion intensity Is from the motion information analysis unit 2, and the in-shot audio power Ps. The conversation class discrimination unit 6 additionally takes the representative audio type Cs of the shot as input.

Next, the functions of the action class discrimination unit 4, the dramatic class discrimination unit 5, and the conversation class discrimination unit 6 shown in FIG. 1 are described in detail.

The determination process in the action class discrimination unit 4 is performed as shown in FIG. 2. In step S1, a shot is determined to belong to the "action" class when its shot length Ls is smaller than a threshold THL1 (for example, 2 seconds) (Ls < THL1), its in-shot motion intensity Is is larger than a threshold THI1 (for example, 2.3) (Is > THI1), and its in-shot audio power Ps is larger than a threshold THP1 (Ps > THP1). In step S2, "action" is assigned as the shot class.

The determination process in the dramatic class discrimination unit 5 is performed as shown in FIG. 3. In step S3, a shot is determined to belong to the "dramatic" class when its shot length Ls is larger than the threshold THL1 (Ls > THL1) and smaller than another threshold THL2 (for example, 5 seconds) (Ls < THL2), its in-shot motion intensity Is is smaller than the threshold THI1 (Is < THI1) and larger than another threshold THI2 (for example, 1.2) (Is > THI2), and its in-shot audio power Ps is larger than a threshold THP2 (Ps > THP2). In step S4, "dramatic" is assigned as the shot class.

The determination process in the conversation class discrimination unit 6 is performed as shown in FIG. 4. In step S5, a shot is determined to belong to the "conversation" class when its shot length Ls is larger than a threshold THL3 (THL3 > THL2; for example, 6 seconds) (Ls > THL3), its in-shot motion intensity Is is smaller than a threshold THI3 (preferably THI3 ≥ THI2, for example 1.5, although THI3 < THI2 is also possible in some cases) (Is < THI3), its in-shot audio power Ps is smaller than a threshold THP3 (THP3 < THP2) (Ps < THP3), and its representative audio type Cs is "voice". In step S6, "conversation" is assigned as the shot class.

A shot that the action class discrimination unit 4, the dramatic class discrimination unit 5, and the conversation class discrimination unit 6 determine not to belong to any of these shot classes is determined to belong to the "general-purpose" class, and "general-purpose" is assigned as its shot class.
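The threshold rules of steps S1 to S6 and the "general-purpose" fallback can be summarized in a short sketch. The length and motion thresholds below are the example values given in the text; the audio power thresholds THP1, THP2, and THP3 have no example values in the text, so the numbers used here are hypothetical placeholders that respect only the stated constraint THP3 < THP2. The names `Thresholds` and `classify_shot` are likewise illustrative.

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    thl1: float = 2.0   # seconds (example from the text)
    thl2: float = 5.0   # seconds (example from the text)
    thl3: float = 6.0   # seconds (example from the text), THL3 > THL2
    thi1: float = 2.3   # example from the text
    thi2: float = 1.2   # example from the text
    thi3: float = 1.5   # example from the text
    thp1: float = 0.6   # hypothetical value, not given in the text
    thp2: float = 0.4   # hypothetical value, not given in the text
    thp3: float = 0.2   # hypothetical value; only THP3 < THP2 is stated

def classify_shot(ls, is_, ps, cs, th=Thresholds()):
    """Apply the action, dramatic, and conversation rules in that order."""
    if ls < th.thl1 and is_ > th.thi1 and ps > th.thp1:
        return "action"
    if th.thl1 < ls < th.thl2 and th.thi2 < is_ < th.thi1 and ps > th.thp2:
        return "dramatic"
    if ls > th.thl3 and is_ < th.thi3 and ps < th.thp3 and cs == "voice":
        return "conversation"
    return "general-purpose"

# A short, loud, fast-moving shot.
label = classify_shot(ls=1.0, is_=3.0, ps=0.9, cs="music")
```

With the example thresholds the three rules cannot fire for the same shot, because their shot length ranges do not overlap, so the order of the checks does not change the result.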

In the embodiment above, the action class discrimination unit 4, the dramatic class discrimination unit 5, and the conversation class discrimination unit 6 discriminate each class using the division section length Ls, the motion information Is, and the audio data Ps. The present invention is not limited to this, however, and the discrimination may use at least one of these.

Next, a second embodiment of the present invention is described with reference to FIG. 5. In FIG. 5, components identical or equivalent to those in FIG. 1 carry the same reference numerals. In this embodiment, the moving image data divided into shots by the shot dividing unit 1 is sent to the feature value extraction unit 11, which extracts the image feature value of each shot. The most frequent shot detection unit 12 then detects the most frequent shot on the basis of these image feature values.

As the image feature value, the image judged by the shot dividing unit to be a shot division point, that is, the image data of the first frame of the shot itself, may be held. Alternatively, the image data of a reduced version of that image, or the "Color Layout Descriptor" defined in MPEG-7 obtained from that image, can be used. The target image need not be the first frame of the shot: the center frame of the shot or a frame that represents the shot (key frame) can also be used.

An example using the color layout descriptor is described here. The color layout descriptor is obtained by applying an 8×8 DCT to the luminance and color difference components of a reduced version (8×8 pixels) of the original image, and its values are the DCT coefficients of each component.

The operation of the most frequent shot detection unit 12 (most frequent shot detection process 1) is described with reference to the flowchart of FIG. 6. Here, the most frequent shot means the similar shot Sf that appears most frequently in the data. First, all of the input moving image data 21 shown in FIG. 7 is read, and in step S11 an image feature value, for example the color layout descriptor (a1, a2, a3, ..., an), is extracted from the first frame of each shot (1, 2, 3, ..., n). In step S12, a counter m is set to 1, and in step S13 a histogram is created from the m-th coefficient of the color layout descriptor, for example the m-th coefficient of the 8×8 DCT of the luminance component of the reduced image. In the example of FIG. 7, the histogram is first created from the first coefficients (m = 1) Y1(1), Y1(2), Y1(3), ..., Y1(n).

In step S14, a counter n (the bin rank) is set to 2, and in step S15 it is judged whether the difference between the number of elements in the first most frequent bin and the number of elements in the second most frequent bin (n = 2) is smaller than a predetermined criterion. For example, it is judged whether (number of elements in the first most frequent bin) × 0.85 < (number of elements in the second most frequent bin) holds. The elements of the first most frequent bin correspond to the similar shots that appear most frequently in the data. Step S15 therefore judges whether the gap between the most frequently appearing shots and the next most frequently appearing shots is small.

If this judgment is affirmative, the process proceeds to step S16, where n is incremented by 1, and step S15 then judges whether the difference between the number of elements in the first most frequent bin and the number of elements in the (n+1)-th most frequent bin is smaller than the predetermined criterion. If this judgment is affirmative, the shots in the (n+1)-th most frequent bin are also treated as frequently appearing shots.

When the judgment in step S15 becomes negative after this processing, the process proceeds to step S17, and the first to (n−1)-th most frequent bins are adopted as the most frequent shots. This completes the narrowing down of the most frequent shots by the first coefficient. Next, in step S18, m is incremented by 1. In step S19, it is judged whether the first to (n−1)-th most frequent bins have converged, that is, whether the similar shots that appear most frequently in the data have been narrowed down sufficiently.

If they have not converged, the process returns to step S13, a histogram is created from the (m+1)-th coefficient of the shots in the first to (n−1)-th most frequent bins, and the same processing as above is performed to narrow down the similar shots. This narrows down the most frequent shots by the second coefficient Y2. The same processing is then repeated with the third coefficient Y3 and so on, and when the similar shots are judged to have been narrowed down sufficiently (affirmative judgment in step S19), the most frequent shot detection process ends.

The ordering of the first, second, third, ... coefficients Y1, Y2, Y3, ... is not limited to that of FIG. 7 and may be another ordering. The components used may be the luminance component only, the color difference components only, or both, and any coefficients of each component may be used. The processing of step S15 prevents a shot from being missed in similar shot detection because of a slight difference in its color layout descriptor values. Similar shots are narrowed down in this way using the color layout descriptor, and the shots that belong to the bin with the largest number of elements are finally determined to be the most frequent shot Sf.
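The narrowing procedure of steps S11 to S19 can be sketched as below. Several details are assumptions made only for this illustration: the histogram bin width (`bin_width`) is not given in the text, the convergence test of step S19 is replaced by a fixed number of coefficients (`num_coeffs`), and the function returns every shot that survives the last round. The 0.85 ratio is the example value from step S15.

```python
from collections import defaultdict

def narrow_most_frequent(descriptors, num_coeffs=3, bin_width=4, ratio=0.85):
    """descriptors: one list of integer DCT coefficients per shot (e.g. luminance)."""
    candidates = list(range(len(descriptors)))
    for m in range(num_coeffs):
        # Step S13: histogram of the m-th coefficient over the current candidates.
        bins = defaultdict(list)
        for shot in candidates:
            bins[descriptors[shot][m] // bin_width].append(shot)
        ranked = sorted(bins.values(), key=len, reverse=True)
        # Steps S14 to S17: keep the fullest bin plus following bins nearly as full.
        top = len(ranked[0])
        kept = [ranked[0]]
        for members in ranked[1:]:
            if top * ratio < len(members):
                kept.append(members)
            else:
                break
        candidates = sorted(s for members in kept for s in members)
    return candidates

# Six shots with three coefficients each (made-up values); shots 0, 1 and 4 are alike.
shots = [[32, 16, 8], [33, 17, 8], [10, 2, 0], [34, 40, 9], [32, 18, 9], [50, 5, 1]]
most_frequent = narrow_most_frequent(shots)
```

In this data the first coefficient removes shots 2 and 5, the second removes shot 3, and the third leaves the remaining shots in a single bin.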

Next, processing that further improves the accuracy of similar shot detection (process 2) is described with reference to the flowchart of FIG. 8. In step S20 of FIG. 8, a representative value (or reference value) of the color layout descriptor values of the shots determined to be the most frequent shot is obtained, and in step S21 this value is used to calculate the distance D to the color layout descriptor of every shot. The mean or the intermediate value of each coefficient of each component can be used as the representative value. Shots whose distance D is at or below a sufficiently small threshold THD can then also be detected as most frequent shots.

The distance D can be calculated using, for example, the following formula recommended in the MPEG-7 verification model.

Here, Yr[i], Cbr[i], and Crr[i] are the representative values of the i-th coefficients of the luminance Y, color difference Cb, and color difference Cr components, respectively; Y[i], Cb[i], and Cr[i] are the i-th coefficients of the respective components counted from the low-frequency side; and NY, NCb, and NCr are the numbers of coefficients of each component used in calculating the distance D.
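The formula itself does not survive in this text, so the sketch below assumes a plausible unweighted form built from the symbols just defined: for each of the Y, Cb, and Cr components, the square root of the summed squared differences between the representative and the shot coefficients, with the three roots added together. Per-coefficient weights, which such distances sometimes include, are omitted because the text defines none. All function names are hypothetical.

```python
import math

def component_distance(ref, coeffs, n):
    # Euclidean distance over the first n (lowest-frequency) coefficients.
    return math.sqrt(sum((ref[i] - coeffs[i]) ** 2 for i in range(n)))

def cld_distance(ref, shot, n_y, n_cb, n_cr):
    """Assumed form of D: sum of the per-component Euclidean distances."""
    return (component_distance(ref["Y"], shot["Y"], n_y)
            + component_distance(ref["Cb"], shot["Cb"], n_cb)
            + component_distance(ref["Cr"], shot["Cr"], n_cr))

def shots_within(ref, shots, thd, n_y, n_cb, n_cr):
    """Indices of the shots whose distance D to the representative is <= THD."""
    return [k for k, s in enumerate(shots)
            if cld_distance(ref, s, n_y, n_cb, n_cr) <= thd]

# Representative values and two shots (made-up numbers).
ref = {"Y": [0, 0], "Cb": [0], "Cr": [0]}
far = {"Y": [3, 4], "Cb": [2], "Cr": [1]}    # D = 5 + 2 + 1
near = {"Y": [0, 1], "Cb": [0], "Cr": [0]}   # D = 1
d_far = cld_distance(ref, far, 2, 1, 1)
close_shots = shots_within(ref, [far, near], thd=2.0, n_y=2, n_cb=1, n_cr=1)
```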

The highlight scene discrimination unit 13 shown in FIG. 5 takes television sports video, such as a baseball broadcast, as input and detects highlight scenes such as hits and home runs. Here, a "scene" is a segment that consists of one or more semantically continuous "shots".

For the most frequent shots Sf obtained by the processing of FIG. 6 and FIG. 8 from, for example, television sports video, the highlight scene discrimination unit 13 obtains the number of shots Nsf and the time Tsf between adjacent most frequent shots Sf. In a baseball broadcast, for example, the shot in which the pitcher throws the ball to the batter (hereinafter, pitch shot) is considered to be the most frequent shot. When the result of a pitch is a strike, ball, foul, or the like that cannot be regarded as a highlight scene, the number of shots Nsf or the time Tsf until the next pitch shot is expected to be small or short, respectively. When the result of a pitch is a hit, home run, or the like that is recognized as a highlight scene, the number of shots Nsf or the time Tsf until the next pitch shot is expected to be at least a certain value.

Therefore, when either or both of these values are at or above their thresholds THNsf (for example, 30 shots) and THTsf (for example, 60 seconds) (Nsf ≥ THNsf, Tsf ≥ THTsf), it is determined that a highlight scene exists in the segment between the adjacent most frequent shots Sf. In baseball broadcasts, however, commercials (CMs) are often inserted, mainly at changes between offense and defense, so highlight scenes can be extracted effectively by using the number of shots Nsf between pitch shots and the time Tsf together. In addition, the accuracy of judging the segment to be a highlight scene can be improved by exploiting the fact that "cheer" is dominant among the representative audio types Cs of the shots in the segment.

All shots in the relevant section can be determined to be the highlight scene; alternatively, an arbitrary number of shots before and after the shot whose audio type Cs is “cheer” and whose audio power Ps is maximum can be determined to be the highlight scene. This suppresses false detections in cases where the section is not a highlight scene but Nsf or Tsf until the next pitch shot has nevertheless become large, for example when a pitch shot was not detected correctly or when the result of the pitch was an out. The shot class “highlight” is assigned to each shot in the group determined to be a highlight scene.
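Selecting the shots around the loudest “cheer” shot can be sketched as below. The function name `highlight_window`, the parameter `k` (the number of shots kept on each side), and the clamping of the window to the section boundaries are assumptions made for illustration; the text only says that an arbitrary number of preceding and following shots is taken.

```python
from typing import List, Sequence, Tuple


def highlight_window(audio_types: Sequence[str],
                     audio_powers: Sequence[float],
                     section: Tuple[int, int],
                     k: int = 2) -> List[int]:
    """Return the indices of the shots labelled as the highlight scene.

    audio_types[i]  : representative audio type Cs of shot i
    audio_powers[i] : audio power Ps of shot i
    section         : (first, last) shot indices of the candidate section
    """
    first, last = section
    cheer = [i for i in range(first, last + 1) if audio_types[i] == "cheer"]
    if not cheer:
        return []  # no cheering shot: treat the section as a non-highlight
    center = max(cheer, key=lambda i: audio_powers[i])
    lo = max(first, center - k)   # assumed: do not extend past the section
    hi = min(last, center + k)
    return list(range(lo, hi + 1))
```

Returning an empty list when no shot in the section is classified as “cheer” is one way to reject sections that became long only because a pitch shot was missed.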

In the announcer class discrimination unit 14 shown in FIG. 5, the most frequent shots obtained by the most frequent shot detection unit 12 are used to detect the announcer class, for example from a television news video. In a typical news video, an announcer shot is followed by inserted footage such as on-site reports, archive footage, press conferences, and commentary, and this pattern repeats for each news item. Because an announcer shot usually appears at least once per news item, the announcer shot is considered to be the most frequent shot in the news program as a whole.

However, screens used for commentary and the like may share the same background color and so may be misrecognized as the most frequent shot. To prevent this, the high-frequency coefficients Y_n of the color layout descriptor, in particular of its luminance component, are analyzed (for example, n > 6). Texture in a commentary screen becomes inconspicuous, especially once the image is reduced, and the screen is expected to be relatively flat, so the values of the high-frequency components Y_n are small. In an announcer shot, by contrast, the presence of the announcer introduces texture, so the values are not expected to be small even in the high-frequency components. Using this property, only announcer shots can be extracted as the most frequent shots. The shot class “announcer” is assigned to each shot determined to be an announcer shot.
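The high-frequency check can be sketched as follows. The text does not state how the coefficients Y_n are combined or what threshold is used, so the sum of absolute values, the `threshold` default, and the function names are all assumptions. The sketch also assumes that the luminance coefficients are given in zigzag order, indexed from 0 with Y_0 as the DC term, and already dequantized so that 0 means no energy.

```python
from typing import List, Sequence


def has_texture(y_coeffs: Sequence[float],
                n_min: int = 7,
                threshold: float = 10.0) -> bool:
    """Return True when the luminance coefficients Y_n with n >= n_min
    (i.e. n > 6 by default) carry enough energy to indicate texture."""
    energy = sum(abs(c) for c in y_coeffs[n_min:])  # assumed energy measure
    return energy >= threshold                      # assumed threshold


def announcer_shots(candidates: Sequence[Sequence[float]],
                    n_min: int = 7,
                    threshold: float = 10.0) -> List[int]:
    """Return the indices of the candidate most frequent shots that are
    kept as announcer shots, discarding flat commentary screens."""
    return [i for i, y in enumerate(candidates)
            if has_texture(y, n_min, threshold)]
```

A flat commentary screen yields near-zero high-frequency coefficients and is dropped, while a shot showing the announcer keeps noticeable values and is retained.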

By collecting and playing back the shots of the “highlight” class or the “announcer” class described above, highlights of television sports video, digests of television news video, and the like can be composed.

Next, FIG. 9 shows a third embodiment of the present invention. Here, the input moving image data is divided into shots by the shot division unit 1 and then undergoes the processing of FIG. 1 and FIG. 5. The processing of FIG. 1 performs shot genre discrimination 41, that is, the action class, dramatic class, and conversation class discrimination described above. The processing of FIG. 5 performs summary shot discrimination 42, that is, highlight scene discrimination and announcer class discrimination.

The shot class determined by the shot genre discrimination unit 41 can be described by the shot genre description unit 43 as auxiliary information of each shot, for example as a shot genre defined with the “Classification Scheme” specified in MPEG-7.

Shots determined by the summary shot discrimination unit 42 to belong to a sports video highlight or a news video digest can have their time information and the like described by the summary shot description unit 44. As the format of the summary shot description, for example, the “hierarchical summary description scheme” defined in MPEG-7 can be used. The described information is output as an MPEG-7 description file.

Description of reference numerals: 1 ... shot division unit, 2 ... motion information analysis unit, 3 ... audio analysis unit, 4 ... action class discrimination unit, 5 ... dramatic class discrimination unit, 6 ... conversation class discrimination unit, 11 ... feature value extraction unit, 12 ... most frequent shot detection unit, 13 ... highlight scene discrimination unit, 14 ... announcer class discrimination unit, 31 ... audio power calculation unit, 32 ... audio type analysis unit, 41 ... shot genre discrimination unit, 42 ... summary shot discrimination unit, 43 ... shot genre description unit, 44 ... summary shot description unit.

Claims

An apparatus for classifying uncompressed or compressed moving image data, comprising:
moving image data dividing means for dividing the moving image data into shot units on the time axis;
feature value extracting means for extracting image feature values, such as the color layout descriptor defined in MPEG-7, obtained from images in the shot units; and
similar unit detection means for detecting the similar shot units having the highest appearance frequency using the image feature values,
wherein the similar unit detection means detects the most frequent similar shots by building histograms of a plurality of coefficients of the feature value in the first frame or a representative frame of each shot, and successively narrowing down to the bin containing the largest number of elements.

The apparatus for classifying moving image data according to claim 1,
wherein coefficients of each component obtained by applying a discrete cosine transform to at least one of the luminance and color difference components of a reduced image are used as the plurality of coefficients of the image feature value.

The apparatus for classifying moving image data according to claim 1 or 2, wherein the moving image data is known in advance to be a television sports video,
the apparatus further comprising highlight scene discriminating means,
wherein the highlight scene discriminating means determines that a highlight scene exists in a section between the most frequent similar shots extracted by the similar unit detection means when at least one of the number of shots and the duration of that section exceeds a predetermined threshold.

The apparatus for classifying moving image data according to claim 3,
wherein the highlight scene discriminating means determines that a commercial of the television sports video exists in a section between the most frequent similar shots extracted by the similar unit detection means when the number of shots in that section is equal to or greater than a predetermined threshold, and excludes the section from the highlight scene candidates.

The apparatus for classifying moving image data according to claim 3 or 4,
wherein the highlight scene discriminating means further uses, for the determination of the highlight scene, the shot whose audio type is “cheer” and whose audio power is maximum, and determines that shot, or a group of preceding and following shots including that shot, to be the highlight scene.

The apparatus for classifying moving image data according to claim 1 or 2, wherein the moving image data is known in advance to be a television news video,
the apparatus further comprising announcer class discriminating means,
wherein the announcer class discriminating means determines the most frequent similar shot extracted by the similar unit detection means to be an announcer shot of the television news video.

The apparatus for classifying moving image data according to claim 6,
wherein the announcer class discriminating means determines that texture exists in the image when a high frequency component of the color layout descriptor is higher than a predetermined frequency, and determines the most frequent similar shot to be an announcer shot when the texture exists in that shot.