JP2007518303A

JP2007518303A - Processing method and apparatus using scene change detection

Info

Publication number: JP2007518303A
Application number: JP2006546403A
Authority: JP
Inventors: Burazerovic, Dzevdet; Barbieri, Mauro
Original assignee: Koninklijke Philips NV; Koninklijke Philips Electronics NV
Current assignee: Koninklijke Philips NV
Priority date: 2004-01-05
Filing date: 2004-12-28
Publication date: 2007-07-05
Also published as: WO2005074297A1; EP1704722A1; KR20060127024A; CN1902938A

Abstract

The present invention relates to a method of processing digitally encoded video data available in the form of a video stream of consecutive frames. These frames, which are divided into macroblocks, include at least I frames (intra), P frames (predicted), and B frames that are temporally located between I and P frames and are bidirectionally predicted from at least the two frames between which they are located. According to the invention, the processing method comprises the steps of: determining, for each successive macroblock of the current frame, the related coding parameters, if any, that characterize the weighted prediction; collecting these parameters for all the successive macroblocks of the current frame in order to provide statistics related to the parameters; analyzing the statistics to determine a change of preference for the direction of prediction; and detecting the occurrence of a gradual scene change in a sequence of frames each time such a change of preference is determined.

Description

The present invention relates to a method that makes it possible to detect gradual scene transitions in H.264/AVC video streams automatically. The method relies on new coding parameters introduced by H.264, which makes it very efficient and cost-effective.

In recent years, international video coding standards have played a key role in facilitating the adoption of digital video in a variety of professional and consumer applications. The most influential standards have been developed by two organizations, ITU-T and ISO/IEC MPEG, sometimes jointly (for example, MPEG-2/H.262). The most recent joint standard is H.264/AVC, which was expected to be formally approved in 2003 by ITU-T as Recommendation H.264/AVC and by ISO/IEC as International Standard 14496-10 (MPEG-4 Part 10) Advanced Video Coding (AVC). The main goals of the H.264/AVC standardization were to achieve a significant gain in compression performance and to provide a “network-friendly” video representation addressing both “conversational” (telephony) and “non-conversational” (storage, broadcast, streaming) applications. H.264/AVC is now widely recognized as providing significantly improved rate-distortion efficiency relative to existing standards, and H.264/AVC-based solutions are also being considered by other standardization bodies such as the DVB and the DVD Forum. Implementations of H.264/AVC encoders and decoders are already available, as described for example in the white paper “Emerging H.264 Standard: Overview and TMS320C64x Digital Media Platform Implementation” (see http://www.ubvideo.com/public). In addition, the number of Internet sites offering information about H.264/AVC has grown very large. Among them, the official database of the ITU-T/MPEG JVT [Joint Video Team] (official H.264 documents and JVT software, see ftp://ftp.imtc-files.org/jvt-experts/) provides free access to documents reflecting the status and progress of H.264/AVC, including draft updates.

The H.264/AVC syntax and coding tools can now be recalled. First, H.264/AVC employs the same principles of block-based motion-compensated transform coding that are known from established standards such as MPEG-2. The H.264 syntax is therefore organized as the usual hierarchy of headers (such as picture, slice and macroblock headers) and data (such as motion vectors, block transform coefficients and quantizer scale). While most of the known concepts relating to data structuring (for example I, P or B pictures, intra and inter macroblocks) are retained, some new concepts are also introduced at both the header level and the data level. Chiefly, H.264/AVC separates the Video Coding Layer, which is defined to represent the content of the video data efficiently, from the Network Abstraction Layer, which formats the data and provides header information in a manner appropriate for conveyance by the higher-level (transport) system.

One of the main characteristics of H.264/AVC at the data level is its more elaborate partitioning and manipulation of 16x16 macroblocks (a macroblock, MB, includes both a 16x16 block of luminance and the corresponding 8x8 blocks of chrominance, but many operations, such as motion estimation, actually use only the luminance and project the results onto the chrominance). The motion compensation process can thus form segmentations of an MB as small as 4x4 in size, using a motion vector accuracy of up to one quarter of the sample grid. Also, the selection process for the motion-compensated prediction of a sample block can involve a number of stored, previously decoded pictures instead of only the adjacent ones. Even with intra coding, it is possible to form a prediction of a block using previously decoded samples from neighboring blocks (the rules for such spatial prediction are described by the so-called intra prediction modes). After either motion-compensated or spatial prediction, the resulting prediction error is normally transformed and quantized based on a 4x4 block size instead of the traditional 8x8 size. This feature is particularly relevant to the invention defined below and will be highlighted later in the description. H.264/AVC further uses other specific tools (for example, entropy coding), most of which are fixed or can be changed only at or above the picture level.

As regards motion compensation, some general concepts and features of H.264/AVC also need to be recalled. Most conventional video coding standards, such as MPEG-2, essentially use block-based motion compensation as a practical way of exploiting the correlation between subsequent pictures in video. This method attempts to predict each macroblock in a given picture by its “best match” in an adjacent, previously decoded reference picture. If the pixel-wise difference between the macroblock and its prediction is small enough, this difference (the residual) is encoded rather than the macroblock itself. The displacement of the predicting block relative to the grid position of the actual MB is indicated by a motion vector, which is coded separately. FIG. 1 illustrates this for the case of bidirectional prediction, where two reference pictures P_i and P_{i+1} are used, one in the past and one in the future (in display order). Pictures predicted in this way (such as B_i in FIG. 1) are called B pictures. Pictures predicted by referring only to the past are called P pictures.
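The “best match” search described above can be sketched as an exhaustive block-matching loop. This is only an illustration under stated assumptions: the sum of absolute differences (SAD) as the matching cost, the full search, and the function names `sad`, `crop` and `best_match` are all choices made here, not something the description prescribes. Pictures are plain lists of rows of luminance values.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def crop(picture, top, left, size):
    """Cut a size x size block out of a picture (a list of rows)."""
    return [row[left:left + size] for row in picture[top:top + size]]


def best_match(current, reference, top, left, size, search_range):
    """Find the displacement (dy, dx) in the reference picture that best
    predicts the block of `current` located at (top, left).

    Returns (motion_vector, cost). SAD and full search are assumptions of
    this sketch; real encoders use faster searches and other cost terms.
    """
    target = crop(current, top, left, size)
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference picture.
            if 0 <= y <= height - size and 0 <= x <= width - size:
                cost = sad(target, crop(reference, y, x, size))
                if best is None or cost < best[0]:
                    best = (cost, (dy, dx))
    cost, motion_vector = best
    return motion_vector, cost
```

When the returned cost is small, an encoder would code the residual and the motion vector rather than the block itself.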

With H.264/AVC, these basic concepts are elaborated further. First, motion compensation in H.264/AVC is based on multiple reference picture prediction: the match for a given block can be sought in more distant past or future pictures instead of only in the adjacent ones. Second, H.264/AVC allows an MB to be divided into smaller blocks and each of these blocks to be predicted separately. This means that the prediction for a given MB can in principle be composed of different blocks, each obtained with a different motion vector and from a different reference picture. The number, size and orientation of the prediction blocks are uniquely determined by the choice of an inter mode. Several such modes are specified, with block sizes of 16x8, 8x8 and so on, down to 4x4.

Another innovation in H.264/AVC is that the motion-compensated prediction signal can be weighted and offset by amounts specified by the encoder. This means that, in the case of bidirectional prediction of a frame B(i) from previous frames P(i-n) and P(i-1) and subsequent frames P(i+j) and P(i+m), the encoder can choose unequal amounts by which the prediction block from the past and the prediction block from the future contribute to the overall prediction. This feature can greatly improve coding efficiency for scenes containing fades.
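To make the idea of unequal contributions concrete, the following minimal sketch blends a prediction block from the past with one from the future. It assumes floating-point weights and 8-bit samples purely for readability; the standard itself works with integer weights and a shared denominator, and the function name `weighted_biprediction` is hypothetical.

```python
def weighted_biprediction(past_block, future_block, w_past, w_future, offset=0):
    """Blend two prediction blocks sample by sample.

    Floating-point weights are an assumption of this sketch. The result
    is rounded and clipped to the 8-bit range 0..255.
    """
    return [[max(0, min(255, round(w_past * p + w_future * f + offset)))
             for p, f in zip(row_p, row_f)]
            for row_p, row_f in zip(past_block, future_block)]


# A dark block from the past and a bright block from the future.
past = [[40, 40], [40, 40]]
future = [[200, 200], [200, 200]]

early = weighted_biprediction(past, future, 0.75, 0.25)    # early in the fade
equal = weighted_biprediction(past, future, 0.5, 0.5)      # equal weighting
late = weighted_biprediction(past, future, 0.25, 0.75)     # late in the fade
```

During a fade or dissolve, an encoder would be expected to move from weights like the first call toward weights like the last one, which is the drift the detection method looks for.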

The problem, however, is the following. Recent advances in computing, communications and digital data storage have led to a tremendous growth of large digital archives, characterized by steadily increasing capacity and content variety. Efficient methods for quickly retrieving stored information of interest are therefore very important. Since manually searching through terabytes of unorganized stored data is tedious and very time-consuming, there is a growing need to transfer information search and retrieval tasks to automated systems. Search and retrieval in large archives of unstructured video content is usually performed after the content has been indexed using content analysis techniques. These techniques rely on algorithms from, for example, image processing, pattern recognition and artificial intelligence, and aim to generate automatically annotations that describe the video material in terms of its content (such annotations range from low-level signal-related features such as color and texture to higher-level information such as the presence and location of faces).

One of the most important content descriptors is the shot boundary indicator, as described for example in the international patent application WO 01/03429. A shot is a video segment taken continuously with a single camera, and shots are generally considered the elementary units that make up a video. Detecting shot boundaries thus means recovering these elementary video units, which in turn provides the basis for nearly all video abstraction and high-level video segmentation algorithms (see, for example, “Video Abstracting”, by R. Lienhart et al., Communications of the ACM, 40(12), 1997, pp. 55-62).

During video editing, shots are joined using shot transitions, which can be classified into at least two classes: abrupt transitions and gradual transitions. Abrupt transitions, also called hard cuts, are obtained without any modification of the two shots. They are fairly easy to detect and make up the majority of transitions in all kinds of video production. Gradual transitions, such as fades, dissolves and wipes, are obtained by applying a specific transformation to the two shots involved. During video production, each transition type is chosen carefully to support the content and context of the video sequence. Automatically recovering the positions and types of all transitions therefore helps a machine to deduce high-level semantics. For instance, in feature films dissolves are often used to convey a passage of time. Dissolves also occur much more often in feature films, documentaries, biographical and scientific video material than in newscasts, sports, comedy and shows. The opposite is true for wipes. Automatic detection of transitions and of their type can therefore be used for automatic recognition of the video genre.

Because of the large application area expected for the upcoming H.264/AVC standard, there is a growing demand for efficient solutions for H.264/AVC video content analysis. In recent years, several efficient content analysis algorithms and methods that operate almost exclusively in the compressed domain have been demonstrated for MPEG-2 video. Since, as noted above, H.264/AVC in a sense specifies a superset of the MPEG-2 syntax, most of these methods could be extended to H.264/AVC. However, owing to the limitations of MPEG-2, these existing methods may fail to give adequate or reliable performance, a deficiency that is typically addressed by adding further, and often costly, methods operating in the pixel or audio domain.
“Video Abstracting”, by R. Lienhart et al., Communications of the ACM, 40(12), 1997, pp. 55-62

It is therefore an object of the invention to propose a method that avoids this drawback in all situations in which frames are predicted by weighted prediction with unequal amounts of prediction from the past and the future of the frame to be predicted.

To this end, the invention relates to a method of processing digitally encoded video data available in the form of a video stream of consecutive frames divided into macroblocks, said frames including at least I frames, coded independently; P frames, temporally located between said I frames and predicted from at least a previous I or P frame; and B frames, temporally located between an I frame and a P frame or between two P frames and bidirectionally predicted from at least the two frames between which they are located, said prediction being performed by means of a weighted prediction with unequal amounts of prediction from the past and the future, the method comprising the steps of:
- determining, for each successive macroblock of the current frame, the related coding parameters, if any, that characterize said weighted prediction;
- collecting said parameters for all the successive macroblocks of the current frame, in order to provide statistics related to said parameters;
- analyzing said statistics to determine a change of preference for the direction of prediction; and
- detecting the occurrence of a gradual scene change in the sequence of frames each time a change of preference has been determined.

More particularly, the analyzing step according to the invention compares the number of macroblocks having the same direction preference and similar weights against a given threshold derived in relation to the total number of macroblocks in the current frame. Preferably, information about the location and duration of each scene change is generated and stored in a file.
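The comparison against a threshold can be sketched as follows. The description does not say how “similar weights” are measured or how the threshold is derived from the macroblock count, so this sketch assumes a tolerance around the median dominant weight and a threshold equal to a fixed fraction of all macroblocks. The input format (one `(w_past, w_future)` pair per macroblock, or `None` when weighted prediction is not used) and the name `direction_preference` are likewise hypothetical.

```python
from statistics import median


def direction_preference(mb_weights, fraction=0.5, tolerance=0.1):
    """Return "past", "future" or None for one frame.

    A direction is reported when the number of macroblocks that prefer it
    with similar weights exceeds `fraction` of all macroblocks in the
    frame. Both `fraction` and `tolerance` are assumed tuning values.
    """
    total = len(mb_weights)
    groups = {"past": [], "future": []}
    for weights in mb_weights:
        if weights is None:              # macroblock without weighted prediction
            continue
        w_past, w_future = weights
        if w_past > w_future:
            groups["past"].append(w_past)
        elif w_future > w_past:
            groups["future"].append(w_future)
    threshold = fraction * total         # derived from the total macroblock count
    for direction, dominant in groups.items():
        if not dominant:
            continue
        centre = median(dominant)
        # Count only macroblocks whose dominant weight is close to the median.
        similar = sum(1 for w in dominant if abs(w - centre) <= tolerance)
        if similar > threshold:
            return direction
    return None
```

With `fraction` at 0.5 or above, at most one direction can pass the test, which keeps the result unambiguous.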

Another object of the invention is to propose a processing device for carrying out the above method.

To this end, the invention relates to a device for processing digitally encoded video data available in the form of a video stream of consecutive frames divided into macroblocks, said frames including at least I frames, coded independently; P frames, temporally located between said I frames and predicted from at least a previous I or P frame; and B frames, temporally located between an I frame and a P frame or between two P frames and bidirectionally predicted from at least the two frames between which they are located, said prediction being performed by means of a weighted prediction with unequal amounts of prediction from the past and the future, the device comprising:
- determining means, provided for determining, for each successive macroblock of the current frame, the related coding parameters, if any, that characterize said weighted prediction;
- collecting means, provided for collecting said parameters for all the successive macroblocks of the current frame, in order to provide statistics related to said parameters;
- analyzing means, provided for analyzing said statistics to determine a change of preference for the direction of prediction; and
- detecting means, provided for detecting the occurrence of a gradual scene change in the sequence of frames each time a change of preference has been determined.

As recalled above in connection with the general concepts and features of motion compensation in H.264/AVC, the motion-compensated prediction signal can be weighted by amounts specified by the encoder. Weighted prediction can be used to obtain bidirectional predictions (B pictures) in which the prediction blocks from the past and from the future are present in the overall prediction in unequal amounts (with MPEG-2, this is restricted to the single possibility of weighting both prediction signals by 1/2).

The principle of the invention is that, because of this inequality, the presence of a gradual shot transition is indicated by a gradual change of the preference for prediction from one direction to the other. Such a change of preference for the direction of prediction is detected by analyzing statistics relating to the coding parameters that characterize the weighted prediction. For example, this analysis can include comparing the number of macroblocks having the same direction preference and similar weights against a given threshold. In addition, the (local) uniformity of the distribution of such macroblocks can be examined to confirm that the change in direction preference is indeed the result of a gradual scene transition. Some additional analysis can also be performed to take into account the possible use of sub-macroblock motion prediction, including in weighted prediction, as is possible in H.264/AVC, for example.
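One way to picture the drift of preference over a series of frames is sketched below. It is a simplification under several assumptions that are not in the description: each frame is reduced to a single “balance” (mean of future weight minus past weight over its weighted macroblocks), a transition is reported when the balance rises without reversing from clearly negative to clearly positive, and very short runs are rejected so that an abrupt flip is not counted. The names `frame_balance` and `find_gradual_transitions`, and the parameters `margin` and `min_length`, are hypothetical.

```python
def frame_balance(mb_weights):
    """Mean of (w_future - w_past) over weighted macroblocks, or None."""
    diffs = [wf - wp for wp, wf in (w for w in mb_weights if w is not None)]
    return sum(diffs) / len(diffs) if diffs else None


def find_gradual_transitions(balances, margin=0.3, min_length=3):
    """Return (start_frame, length) pairs where the preference drifts
    from the past (balance <= -margin) to the future (balance >= margin)
    without reversing. Thresholds are assumed tuning values.
    """
    transitions = []
    start = None          # last frame that clearly preferred the past
    previous = None
    for index, value in enumerate(balances):
        if value is None or (previous is not None and value < previous):
            start = None  # no weighted prediction, or the drift reversed
        if value is not None:
            if value <= -margin:
                start = index
            elif value >= margin and start is not None:
                length = index - start + 1
                if length >= min_length:
                    transitions.append((start, length))
                start = None
        previous = value
    return transitions
```

A fuller analysis would also check the spatial uniformity of the macroblocks involved, as the paragraph above notes. That check is left out of this sketch.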

An example of bidirectional prediction in H.264/AVC, showing the prediction of picture B_i from previous pictures P_{i-n}, P_{i-1} and subsequent pictures P_{i+j}, P_{i+m}, is shown in FIG. 2. The prediction for a macroblock MB, denoted MB_Pred, is composed of three prediction blocks: MB_Pred = B_1 “+” B_2 “+” B_3, with B_1 = alpha1·b_1 + alpha2·b_2, where alpha1 and alpha2 are coefficients. The lower half of the macroblock MB_Pred is predicted by the two 8x8 blocks B_2 and B_3, and the upper half by the single 8x16 block B_1. Each of these prediction blocks can, as H.264 allows, refer to a different reference picture and have a distinct motion vector. Unlike B_2 and B_3, block B_1 is obtained using weighted prediction: it is obtained by adding two blocks b_1 and b_2, which are present in the sum in unequal amounts controlled by the corresponding weighting parameters alpha1 and alpha2. Statistics of these weighting parameters (absolute values and signs) are collected for all macroblocks, and the distribution of these statistics over multiple macroblocks is analyzed to detect gradual scene transitions.
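Collecting the absolute values and signs of the weighting parameters over a frame might look like the sketch below. The data layout is an assumption made for illustration: each macroblock is a list with one `(alpha1, alpha2)` pair per weighted prediction block, and an empty list when none of its blocks uses weighted prediction. The function name and the choice of summary figures are also hypothetical.

```python
def collect_weight_statistics(macroblocks):
    """Summarize alpha1/alpha2 over all weighted prediction blocks of a frame.

    Returns the number of weighted blocks, the mean absolute value of each
    weight, and how many times each weight was negative.
    """
    count = 0
    abs_sum = [0.0, 0.0]
    negatives = [0, 0]
    for prediction_blocks in macroblocks:
        for alphas in prediction_blocks:
            count += 1
            for position, alpha in enumerate(alphas):
                abs_sum[position] += abs(alpha)
                if alpha < 0:
                    negatives[position] += 1
    return {
        "weighted_blocks": count,
        "mean_abs_alpha1": abs_sum[0] / count if count else None,
        "mean_abs_alpha2": abs_sum[1] / count if count else None,
        "negative_alpha1": negatives[0],
        "negative_alpha2": negatives[1],
    }
```

For the macroblock of FIG. 2, only B_1 would contribute a pair, since B_2 and B_3 are not weighted.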

An implementation of the processing method according to the invention is shown in the block diagram of FIG. 3, which illustrates the above concepts for the case of, for example, an H.264/AVC bitstream. This example does not, however, limit the scope of the invention. In the decoding device shown, a demultiplexer 21 receives a transport stream TS and produces demultiplexed audio and video streams AS and VS. The video stream is received by an H.264/AVC decoder 22, which delivers a decoded video stream DVS in the usual way. The decoder 22 mainly comprises an inverse quantization circuit 221 (Q^-1), an inverse transform circuit 222 (T^-1), which in this case is an inverse DCT circuit, and a motion compensation circuit 223. It also comprises a so-called Network Abstraction Layer Unit (NALU) 224, provided for collecting the received coding parameters that characterize the weighted prediction performed (some relevant coding parameters are, for example, “luma_weight”, “luma_offset” and “luma_log2_weight_denom”, which are used in the formulas characterizing the weighting and offsetting of prediction samples). The output signals of unit 224 are the weighted prediction parameter statistics WPPS, which are received by an analysis circuit 23 for appropriate processing. The processing performed in circuit 23 then generates information about the location and duration of gradual scene changes in the originally received stream, and this information is stored in a file 24, for example in the form of the commonly used CPI (Characteristic Point Information) table. This output information is then available to applications such as video summarization and automatic chaptering.
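The final storage step can be sketched as follows. The layout of a CPI table is not given in the description, so this sketch does not attempt to reproduce it. Instead it writes a plain JSON list as a stand-in, and the field names, the default frame rate and the function name `save_transitions` are all assumptions.

```python
import json


def save_transitions(path, transitions, frames_per_second=25.0):
    """Write the location and duration of each detected transition to a file.

    `transitions` is a list of (start_frame, length_in_frames) pairs.
    JSON is used here only as a placeholder for the CPI table format.
    """
    records = [
        {
            "start_frame": start,
            "duration_frames": length,
            "start_seconds": start / frames_per_second,
            "duration_seconds": length / frames_per_second,
        }
        for start, length in transitions
    ]
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)
    return records
```

An application such as automatic chaptering could then read this file back and place chapter marks at the recorded positions.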

It should be added that there are numerous ways of implementing functions by means of items of hardware or software, or both (the method of the invention can therefore be carried out by a computer program product for a processing unit, comprising a set of instructions which, when loaded into the processing unit, causes it to carry out the method described above). In this respect, the drawings are diagrammatic and represent only one possible embodiment of the invention. Thus, although a drawing (FIG. 3 in this case) shows different functions as different blocks, this by no means excludes that a single item of hardware or software carries out several functions. Nor does it exclude that an assembly of items of hardware or software, or both, carries out a single function. These remarks are intended to recall that the detailed description, with reference to the drawings, illustrates rather than limits the invention, and that numerous alternatives fall within the scope of the appended claims. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim. The singular form of an element or step does not exclude the presence of a plurality of such elements or steps.

FIG. 1 shows a conventional example of bidirectional prediction.
FIG. 2 shows the basic principle of weighted prediction for B frames in the case of the H.264/AVC standard.
FIG. 3 is a block diagram of an implementation of the processing method according to the invention.

Claims

1. A method of processing digitally encoded video data available in the form of a video stream of consecutive frames divided into macroblocks, the frames comprising at least I frames, coded independently, P frames, temporally located between said I frames and predicted from at least a previous I or P frame, and B frames, temporally located between an I frame and a P frame or between two P frames and bidirectionally predicted from at least the two frames between which they are located, the prediction of the P and B frames being performed by means of a weighted prediction with unequal amounts of prediction from the past and the future, the processing method being characterized in that it comprises the steps of:
determining, for each successive macroblock of the current frame, the related coding parameters, if any, that characterize the weighted prediction;
collecting the parameters for all the successive macroblocks of the current frame, in order to provide statistics related to the parameters;
analyzing the statistics to determine a change of preference for the direction of prediction; and
detecting the occurrence of a gradual scene change in the sequence of frames each time a change of preference has been determined.

2. The processing method according to claim 1, characterized in that the analyzing step compares the number of macroblocks having the same direction preference and similar weights against a given threshold derived in relation to the total number of macroblocks in the current frame.

3. A processing method according to claim 2, wherein information about the position and duration of each scene change is generated and stored in a file.

4. The processing method according to claim 1, wherein the syntax and semantics of the processed video stream are those of the H.264/AVC standard.

5. A device for processing digitally encoded video data available in the form of a video stream of consecutive frames divided into macroblocks, the frames comprising at least I frames, coded independently, P frames, temporally located between said I frames and predicted from at least a previous I or P frame, and B frames, temporally located between an I frame and a P frame or between two P frames and bidirectionally predicted from at least the two frames between which they are located, the prediction of the P and B frames being performed by means of a weighted prediction with unequal amounts of prediction from the past and the future, the device being characterized in that it comprises:
determining means for determining, for each successive macroblock of the current frame, the related coding parameters, if any, that characterize the weighted prediction;
collecting means for collecting the parameters for all the successive macroblocks of the current frame, in order to provide statistics related to the parameters;
analyzing means for analyzing the statistics to determine a change of preference for the direction of prediction; and
detecting means for detecting the occurrence of a gradual scene change in the sequence of frames each time a change of preference has been determined.

6. A computer program for a digital video data decoding device, characterized in that it comprises a set of instructions which, when loaded into the decoding device, causes the decoding device to carry out the steps of the processing method according to any one of claims 1 to 4.