JP2015515806A

JP2015515806A - Context-based video encoding and decoding

Info

Publication number: JP2015515806A
Application number: JP2015503204A
Authority: JP
Inventors: Nigel Lee; Renato Pizzorni; Darin DeForest; Charles P. Pace
Original assignee: Euclid Discoveries LLC
Current assignee: Euclid Discoveries LLC
Priority date: 2012-03-26
Filing date: 2013-02-07
Publication date: 2015-05-28
Also published as: WO2013148002A3; TW201342926A; EP2815572A2; CA2868448A1; WO2013148002A2

Abstract

[Problem] To provide improved inter-prediction by applying higher-level modeling to overcome the fundamental limitations of the inter-prediction process of conventional codecs, while maintaining the same general processing flow and framework as conventional encoders. [Solution] A method of processing video data uses a detection algorithm to detect at least one of a feature and an object in a region of interest within a frame; models the detected feature or object using a set of parameters; associates all instances of the feature or object across multiple frames; forms a track of the associated instances; relates the track to a specific block of video data to be encoded; and uses the track information to produce a model-based prediction for that block. The model-based prediction is stored as processed video data. [Selected drawing] FIG. 1A

Description

Related applications

This application claims the benefit of U.S. Provisional Application No. 61/615,795, filed March 26, 2012, and U.S. Provisional Application No. 61/707,650, filed September 28, 2012. This application also claims the benefit of U.S. Application No. 13/725,940, filed December 21, 2012. U.S. Application No. 13/725,940 is a continuation-in-part of U.S. Application No. 13/121,904, filed October 6, 2009, which is the U.S. national stage entry of PCT/US2009/059653, filed October 6, 2009, which in turn claims the benefit of U.S. Provisional Application No. 61/103,362, filed October 7, 2008. U.S. Application No. 13/121,904 is a continuation-in-part of U.S. Application No. 12/522,322, filed January 4, 2008. U.S. Application No. 12/522,322 claims the benefit of U.S. Provisional Application No. 60/881,966, filed January 23, 2007, is related to U.S. Provisional Application No. 60/811,890, filed June 8, 2006, and is a continuation-in-part of U.S. Application No. 11/396,010, filed March 31, 2006. U.S. Application No. 11/396,010, now U.S. Patent No. 7,457,472, is a continuation-in-part of U.S. Application No. 11/336,366, filed January 20, 2006. U.S. Application No. 11/336,366, now U.S. Patent No. 7,436,981, is a continuation-in-part of U.S. Application No. 11/280,625, filed November 16, 2005. U.S. Application No. 11/280,625, now U.S. Patent No. 7,457,435, is a continuation-in-part of U.S. Application No. 11/230,686, filed September 20, 2005. U.S. Application No. 11/230,686, now U.S. Patent No. 7,426,285, is a continuation-in-part of U.S. Application No. 11/191,562, filed July 28, 2005, now U.S. Patent No. 7,158,680. U.S. Application No. 11/396,010 also claims the benefit of U.S. Provisional Application No. 60/667,532, filed March 31, 2005, and U.S. Provisional Application No. 60/670,951, filed April 13, 2005. This application is also related to U.S. Provisional Application No. 61/616,334, filed March 27, 2012.

The entire teachings of the above applications and patents are incorporated herein by reference.

Video compression is the process of representing digital video data in a form that uses fewer bits when stored or transmitted. Video compression algorithms achieve compression by exploiting spatial, temporal, or color-space redundancies and irrelevancies in the video data. Typically, a video compression algorithm segments the video data into portions, such as groups of frames and groups of pels, identifies areas of redundancy within the video, and represents those areas with fewer bits than the original video data. Reducing more of this redundancy yields greater compression. An encoder transforms the video data into an encoded format, and a decoder transforms the encoded video back into a form comparable to the original video data. An implementation of the encoder/decoder pair is called a codec.

When encoding a video frame, a standard encoder divides the frame into non-overlapping coding units, or macroblocks (rectangular blocks of contiguous pels). Macroblocks (MBs) are typically processed in a traversal order of left to right and top to bottom within the frame. Compression is achieved when macroblocks are predicted and encoded using previously coded data. The process of encoding a macroblock using spatially neighboring samples of previously coded macroblocks within the same frame is referred to as intra-prediction. Intra-prediction attempts to exploit spatial redundancy in the data. The process of encoding a macroblock using similar regions from previously coded frames, together with a motion estimation model, is referred to as inter-prediction. Inter-prediction attempts to exploit temporal redundancy in the data.

The encoder may generate a residual by measuring the difference between the data to be encoded and the prediction. The residual provides the difference between a predicted macroblock and the original macroblock. The encoder may also generate motion vector information, for example specifying the location of a macroblock in a reference frame relative to the macroblock being encoded or decoded. The predictions, motion vectors (for inter-prediction), residuals, and related data can be combined with other processes such as spatial transforms, quantization, entropy coding, and loop filters to create an efficient encoding of the video data. The residual that has been quantized and transformed is processed and added back to the prediction, assembled into a decoded frame, and stored in a frame store. Details of such encoding techniques are well known to those skilled in the art.
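
As a minimal sketch of the residual step described above (not the H.264 procedure itself), the following Python computes a residual block as the pel-by-pel difference between an original and a predicted macroblock, then reconstructs the block by adding the residual back to the prediction. The function names and the 2×2 sample values are illustrative assumptions. Transform, quantization, and entropy coding are omitted, so reconstruction here is lossless.

```python
from typing import List

Block = List[List[int]]  # rows of pel values (hypothetical representation)


def residual(original: Block, prediction: Block) -> Block:
    """Return original - prediction, pel by pel."""
    return [[o - p for o, p in zip(orow, prow)]
            for orow, prow in zip(original, prediction)]


def reconstruct(prediction: Block, res: Block) -> Block:
    """Add the residual back to the prediction."""
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(prediction, res)]


original = [[52, 55], [61, 59]]
prediction = [[50, 54], [60, 60]]
res = residual(original, prediction)
print(res)  # [[2, 1], [1, -1]]
```

In a real codec the residual would be transformed and quantized before being added back, so the reconstructed block would only approximate the original.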

H.264/MPEG-4 AVC (Advanced Video Coding), hereafter referred to as H.264, is a codec standard for video compression that uses block-based motion estimation and compensation and achieves high-quality video representation at relatively low bitrates. H.264 is one of the encoding options for Blu-ray Disc as well as for major video distribution channels, including video streaming on the Internet, video conferencing, cable television, and direct-broadcast satellite television. The basic coding unit of H.264 is the 16×16 macroblock. H.264 is the most recent widely adopted standard for video compression.

The basic MPEG standard defines three types of frames (or pictures), based on how the macroblocks in the frame are encoded. An I-frame (intra-coded picture) is encoded using only data present in that frame. Generally, when the encoder receives video signal data, it first creates an I-frame, segments the video frame data into macroblocks, and encodes each macroblock using intra-prediction. An I-frame thus consists only of intra-predicted macroblocks (or "intra macroblocks"). I-frames are costly to encode because they are encoded without using information from previously coded frames. A P-frame (predicted picture) is encoded by forward prediction, using data from previously coded I-frames or P-frames, also known as reference frames. P-frames can contain either intra macroblocks or (forward-)predicted macroblocks. A B-frame (bi-predictive picture) is encoded by bidirectional prediction, using data from both previous and subsequent frames. B-frames can contain intra, (forward-)predicted, or bi-predicted macroblocks.

As noted above, conventional inter-prediction is based on block-based motion estimation and compensation (BBMEC). The BBMEC process searches for the best match between the target macroblock (the current macroblock being encoded) and regions of the same size within previously coded reference frames. When a best match is found, the encoder may transmit a motion vector. The motion vector may include a pointer to the position of the best match within the frame, as well as information about the difference between the best match and the corresponding target macroblock. This search could in principle be carried out exhaustively over the video "datacube" (height × width × frame index) to find the best possible match for every macroblock, but an exhaustive search is usually computationally prohibitive. The BBMEC search is therefore limited, both temporally in the reference frames searched and spatially in the neighboring regions searched. As a result, the "best possible" match is not always found, especially with rapidly changing data.
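
The limited search described above can be sketched as follows. This is an illustrative block-matching routine under stated assumptions, not the search used by any particular encoder: it uses the sum of absolute differences (SAD) as the matching cost, a single reference frame, integer-pel displacements, and a square window of radius `search_range`. The names `sad`, `best_match`, and `search_range` are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]  # rows of pel values (hypothetical representation)


def sad(target: Frame, ref: Frame, ty: int, tx: int,
        ry: int, rx: int, size: int) -> int:
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(target[ty + i][tx + j] - ref[ry + i][rx + j])
               for i in range(size) for j in range(size))


def best_match(target: Frame, ref: Frame, ty: int, tx: int,
               size: int, search_range: int) -> Tuple[Tuple[int, int], int]:
    """Search a window of radius search_range in ref around (ty, tx).

    Returns ((dy, dx), cost) for the lowest-SAD candidate block.
    """
    height, width = len(ref), len(ref[0])
    best = ((0, 0), sad(target, ref, ty, tx, ty, tx, size))
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = ty + dy, tx + dx
            # Skip candidates that fall outside the reference frame.
            if 0 <= ry <= height - size and 0 <= rx <= width - size:
                cost = sad(target, ref, ty, tx, ry, rx, size)
                if cost < best[1]:
                    best = ((dy, dx), cost)
    return best


# A 2x2 pattern sits at (row 1, col 1) in the reference frame
# and at (row 2, col 3) in the target frame.
ref = [[0] * 6 for _ in range(6)]
target = [[0] * 6 for _ in range(6)]
for (i, j), v in {(0, 0): 9, (0, 1): 8, (1, 0): 7, (1, 1): 6}.items():
    ref[1 + i][1 + j] = v
    target[2 + i][3 + j] = v

motion_vector, cost = best_match(target, ref, 2, 3, size=2, search_range=2)
print(motion_vector, cost)  # (-1, -2) 0
```

Shrinking `search_range` to 1 in this example would exclude the true match, which illustrates why a restricted search does not always find the best possible match.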

A particular set of reference frames is termed a Group of Pictures (GOP). The GOP contains only the decoded pels within each reference frame and does not include information about how the macroblocks or frames themselves were originally encoded (I-frame, B-frame, or P-frame). Older video compression standards, such as MPEG-2, use one reference frame (a past frame) to predict P-frames and two reference frames (one past, one future) to predict B-frames. The H.264 standard, by contrast, allows multiple reference frames for both P-frame and B-frame prediction. While the reference frames are typically temporally adjacent to the current frame, frames outside the set of temporally adjacent frames may also be designated as reference frames.

Conventional compression allows multiple matches from multiple frames to be blended to predict regions of the current frame. The blending is often linear, or a log-scaled linear combination of the matches. Such bi-prediction is effective, for example, when there is a fade transition from one image to another over time. A fade is a linear blend of two images and can sometimes be modeled efficiently by bi-prediction. Some conventional encoders, such as the MPEG-2 interpolative mode, allow linear parameters to be interpolated over many frames to synthesize the bi-prediction model.
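
The linear blending mentioned above can be illustrated with a short sketch. It assumes two equally sized matched blocks and a single scalar weight. The function name `blend` and the parameter `weight_a` are hypothetical, and the log-scaled variant is not shown.

```python
from typing import List

Block = List[List[int]]  # rows of pel values (hypothetical representation)


def blend(match_a: Block, match_b: Block, weight_a: float = 0.5) -> Block:
    """Linear blend of two blocks; match_b receives weight 1 - weight_a."""
    weight_b = 1.0 - weight_a
    return [[round(weight_a * a + weight_b * b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(match_a, match_b)]


# A fade from a block of value 100 to a block of value 200.
start = [[100, 100]]
end = [[200, 200]]
print(blend(start, end))                 # [[150, 150]]
print(blend(start, end, weight_a=0.75))  # [[125, 125]]
```

For a fade, the weight would vary from frame to frame, so each intermediate frame is predicted from the same two reference blocks with a different mix.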

The H.264 standard provides additional encoding flexibility by dividing frames into regions of one or more contiguous macroblocks, called slices, that are spatially independent of one another. Each slice in a frame is encoded, and can therefore be decoded, independently of the other slices. I-slices, P-slices, and B-slices are defined analogously to the three frame types described above, and a frame can consist of multiple slice types. In addition, the encoder is typically free to choose the order of the processed slices, so a decoder can process slices in whatever order they arrive.

The H.264 standard allows a codec to provide better-quality video at smaller file sizes than older standards such as MPEG-2 and MPEG-4 ASP (Advanced Simple Profile). However, "conventional" compression codecs implementing H.264 have generally struggled to keep up with the demand for higher-quality, higher-resolution video on devices with limited memory operating on bandwidth-limited networks, such as smartphones and other mobile devices. Video quality and resolution are often compromised to achieve adequate playback on such devices. Moreover, as video resolution increases, file sizes increase, making storage of video on and off the device a challenge.

The present invention recognizes fundamental limitations in the inter-prediction process of conventional codecs and applies higher-level modeling to overcome those limitations and provide improved inter-prediction, while maintaining the same general processing flow and framework as conventional encoders.

In the present invention, higher-level modeling provides an efficient way of navigating more of the prediction search space (the video datacube) to produce better predictions than can be found through conventional block-based motion estimation and compensation. First, computer-vision-based feature and object detection algorithms identify regions of interest throughout the video datacube. The detection algorithm may be chosen from the class of nonparametric feature detection algorithms. Next, the detected features and objects are modeled with a compact set of parameters, and similar feature/object instances are associated across frames. The invention then forms tracks out of the associated features/objects, relates the tracks to specific blocks of video data to be encoded, and uses this tracking information to produce model-based predictions for those blocks of data.

In embodiments, the specific blocks of data to be encoded may be macroblocks. The formed tracks may relate features to their respective macroblocks.

Feature/object tracking provides additional context to the conventional encoding/decoding process. In addition, because features/objects are modeled with a compact set of parameters, information about them can be stored efficiently in memory, unlike reference frames, whose entire pel data is expensive to store. Feature/object models can therefore be used to search more of the video datacube without requiring a prohibitive amount of computation or memory. The resulting model-based predictions are superior to conventional inter-predictions because they are derived from more of the prediction search space.

In some embodiments, the compact set of parameters includes information about the feature/object and may be stored in memory. For a feature, the respective parameters may include a feature descriptor vector and the location of the feature. The respective parameters may be generated upon detection of the feature.
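
One way to picture the compact parameter set and the tracks formed from associated instances is the sketch below. It is an assumption-laden illustration, not the association method of the invention: instances are linked greedily by Euclidean distance between descriptor vectors, and the names `FeatureInstance`, `Track`, `extend_tracks`, and `max_distance` are hypothetical.

```python
from dataclasses import dataclass, field
from math import dist
from typing import List, Tuple


@dataclass
class FeatureInstance:
    frame_index: int
    position: Tuple[int, int]      # location of the feature in the frame
    descriptor: Tuple[float, ...]  # feature descriptor vector


@dataclass
class Track:
    instances: List[FeatureInstance] = field(default_factory=list)


def extend_tracks(tracks: List[Track], detections: List[FeatureInstance],
                  max_distance: float) -> None:
    """Attach each detection to the closest earlier track, or start a new one."""
    for det in detections:
        # Only tracks whose latest instance is from an earlier frame qualify.
        candidates = [t for t in tracks
                      if t.instances[-1].frame_index < det.frame_index]
        best = min(candidates,
                   key=lambda t: dist(t.instances[-1].descriptor, det.descriptor),
                   default=None)
        if best is not None and dist(best.instances[-1].descriptor,
                                     det.descriptor) <= max_distance:
            best.instances.append(det)
        else:
            tracks.append(Track([det]))


tracks: List[Track] = []
extend_tracks(tracks, [FeatureInstance(0, (10, 10), (1.0, 0.0)),
                       FeatureInstance(0, (40, 40), (0.0, 1.0))], 0.5)
extend_tracks(tracks, [FeatureInstance(1, (12, 10), (0.9, 0.1))], 0.5)
print([len(t.instances) for t in tracks])  # [2, 1]
```

Each stored instance holds only a short descriptor and a position, which is far smaller than the pel data of a full reference frame.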

After feature/object instances are associated across frames, the associated instances may also be gathered into an ensemble matrix (instead of forming feature/object tracks). In this case, the invention forms such an ensemble matrix, summarizes the matrix using a subspace of important vectors, and uses this vector subspace as a parametric model of the associated features/objects. This can result in especially efficient encodings when those particular features/objects appear in the data.
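
The text above does not say how the subspace of important vectors is computed. As one common possibility, offered here purely as an assumption, a truncated singular value decomposition could summarize the ensemble matrix. The symbols $A$, $a_i$, $U_k$, $m$, $n$, and $k$ are introduced for illustration only.

```latex
% Ensemble matrix: each column a_i is one vectorized feature instance
% (m pels per instance, n associated instances)
A = [\, a_1 \;\; a_2 \;\; \cdots \;\; a_n \,] \in \mathbb{R}^{m \times n}

% Truncated SVD keeps the k most significant left singular vectors
A \approx U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
U_k \in \mathbb{R}^{m \times k}, \quad k \ll \min(m, n)

% A new instance a is then described by k coefficients
c = U_k^{\mathsf{T}} a, \qquad \hat{a} = U_k c
```

Under this assumption, an instance that lies close to the subspace is encoded by the $k$ coefficients in $c$ rather than by its $m$ pel values.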

Computer-based methods, codecs, and other computer systems and apparatus for processing video data may embody the foregoing principles of the present invention.

The foregoing will be apparent from the following more particular description of example embodiments of the invention, as illustrated in the accompanying drawings, in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed on illustrating embodiments of the present invention.

A block diagram depicting feature modeling according to an embodiment of the invention.
A block diagram depicting feature tracking according to an embodiment of the invention.
A block diagram illustrating the steps of relating features to nearby macroblocks and using the tracks of those features to generate good predictions for those macroblocks, according to an embodiment of the invention.
A block diagram illustrating the modeling of data at multiple fidelities to provide efficient encodings, according to an embodiment of the invention.
A block diagram illustrating the identification of objects through the correlation and aggregation of feature models, according to an embodiment of the invention.
A block diagram illustrating the identification of objects via aggregation of both nearby features and nearby macroblocks, according to an embodiment of the invention.
A schematic diagram of an example configuration of a transform-based codec, according to an embodiment of the invention.
A block diagram of an example decoder for intra-predicted macroblocks, according to an embodiment of the invention.
A block diagram of an example decoder for inter-predicted macroblocks, according to an embodiment of the invention.
A schematic diagram of an example configuration of a transform-based codec employing feature-based prediction, according to an embodiment of the invention.
A block diagram of an example codec within a feature-based prediction framework, according to an embodiment of the invention.
A block diagram depicting the state extraction process for feature instances, according to an embodiment of the invention.
A block diagram illustrating example components of a codec employing parametric modeling, according to an embodiment of the invention.
A block diagram illustrating example components of a parametric model-based adaptive encoder, according to an embodiment of the invention.
A block diagram illustrating motion-compensated prediction of features via interpolation of feature model parameters, according to an embodiment of the invention.
A block diagram illustrating an overview of an example cache architecture, according to an embodiment of the invention.
A block diagram illustrating the processing involved in using the local (short-term) cache data, according to an embodiment of the invention.
A block diagram illustrating the processing involved in using the long-term cache data, according to an embodiment of the invention.
A schematic diagram of a computer network environment in which embodiments are deployed.
A block diagram of the computer nodes in the network of FIG. 8A.
A screenshot of a feature-based compression tool in an example implementation.
A screenshot in which faces and non-facial features are labeled with numbers, according to an embodiment of the invention.
A screenshot showing a face identified by the face tracker of FIG. 8D, according to an embodiment of the invention.

The teachings of all patents, published applications, and references cited herein are incorporated by reference in their entirety. A description of example embodiments of the invention follows.

The invention can be applied to various standard encodings and coding units. In the following, unless otherwise noted, the terms "conventional" and "standard" (often used together with "compression," "codecs," "encodings," or "encoders") refer to H.264, and "macroblocks" refers, without loss of generality, to the basic H.264 coding unit.

<Feature-based modeling>

<Definition of features>
Example elements of the invention may include video compression and decompression processes that can optimally represent digital video data when stored or transmitted. The processes may include or interface with one or more video compression/encoding algorithms to exploit spatial, temporal, or spectral redundancies and irrelevancies in the video data. This exploitation may be achieved through the use and retention of feature-based models/parameters. In the following, the terms "feature" and "object" are used interchangeably. Objects can be defined, without loss of generality, as "large features." Both features and objects can be used to model the data.

Features are groups of pels in close proximity that exhibit data complexity. Data complexity can be detected via various criteria, as detailed below. From a compression standpoint, however, the ultimate characteristic of data complexity is "costly encoding": encoding the pels with conventional video compression exceeds a threshold that would be considered "efficient encoding." When conventional encoders allocate a disproportionate amount of bandwidth to certain regions (because conventional inter-frame search cannot find good matches for them within conventional reference frames), it becomes more likely that the region is "feature-rich" and that a feature-model-based compression method will improve compression significantly in those regions.

<Feature detection>
FIG. 1A depicts instances 10-1, 10-2, ..., 10-n of a feature detected in one or more video frames 20-1, 20-2, ..., 20-n. Typically, such a feature can be detected using several criteria based on both structural information derived from the pels and complexity criteria indicating that conventional compression uses a disproportionate amount of bandwidth to encode the feature region. Each feature instance can be further identified spatially in its frame 20-1, 20-2, ..., 20-n by a corresponding spatial extent or perimeter, shown in FIG. 1A as a "region" 30-1, 30-2, ..., 30-n. These feature regions 30-1, 30-2, ..., 30-n can be extracted, for instance, as simple rectangular regions of pel data. In one embodiment of the invention, the feature regions are of size 16×16, the same size as H.264 macroblocks.

The literature proposes many algorithms for detecting features based on the structure of the pels themselves, including a class of nonparametric feature detection algorithms that are robust to various transformations of the pel data. For example, the scale-invariant feature transform (SIFT) [Lowe, David, 2004, "Distinctive image features from scale-invariant keypoints," Int. J. of Computer Vision, 60(2):91-110] detects blob-like features by convolving the image with a difference-of-Gaussians function. The speeded-up robust features (SURF) algorithm [Bay, Herbert et al., 2008, "SURF: Speeded up robust features," Computer Vision and Image Understanding, 110(3):346-359] also detects blob-like features, using the determinant of the Hessian operator. In one embodiment of the invention, features are detected using the SURF algorithm.

In another embodiment, features can be detected based on their encoding complexity (bandwidth) under a conventional encoder, as described in full in U.S. Patent Application No. 13/121,904, filed October 6, 2009, the entire teachings of which are incorporated herein by reference. As one example, encoding complexity can be determined by analyzing the bandwidth (number of bits) that a conventional compression method (e.g., H.264) requires to encode the regions in which features appear. Different detection algorithms operate differently, but in each embodiment the chosen detection algorithm is applied to the entire sequence of video frames across the whole of the video data. As a non-limiting example, a first encoding pass with an H.264 encoder is performed to produce a "bandwidth map," which defines or otherwise determines where in each frame the H.264 encoding cost is highest.

Typically, conventional encoders such as H.264 partition a video frame into uniform, non-overlapping tiles (e.g., 16×16 macroblocks and their sub-tiles). In one embodiment, each tile can be analyzed as a potential feature based on the relative bandwidth H.264 requires to encode it. As one example, the bandwidth required to encode a tile with H.264 can be compared with a fixed threshold, and the tile can be declared a "feature" if the bandwidth exceeds that threshold. The threshold may be a preset value, in which case it can be stored in a database for easy access during feature detection. The threshold may instead be set to the average bandwidth allocated to previously encoded features, or likewise to the median bandwidth allocated to previously encoded features. Alternatively, the cumulative distribution function of tile bandwidths can be computed over an entire frame (or an entire video), and every tile whose bandwidth falls in the top percentiles of all tile bandwidths can be declared a "feature."
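The percentile variant of this test can be sketched as follows. This is a minimal illustration, not the encoder's actual implementation: the function name `detect_feature_tiles`, the `top_percent` parameter, and the sample bit counts are all hypothetical, and the bandwidth map is assumed to be a flat list with one bit count per tile.

```python
def detect_feature_tiles(tile_bits, top_percent=10):
    """Return indices of tiles whose bit cost lies in the top `top_percent` percent.

    tile_bits: per-tile bit counts from a first encoding pass (hypothetical input).
    """
    if not tile_bits:
        return []
    ranked = sorted(tile_bits)
    # Position in the sorted costs where the upper percentile begins
    # (an empirical cumulative-distribution cut).
    cut_index = min(len(ranked) - 1, len(ranked) * (100 - top_percent) // 100)
    threshold = ranked[cut_index]
    # Tiles tied with the threshold are all kept, so slightly more than
    # top_percent percent of the tiles may be returned.
    return [i for i, bits in enumerate(tile_bits) if bits >= threshold]


# Hypothetical bandwidth map for a ten-tile frame.
bandwidth_map = [120, 80, 950, 60, 300, 1010, 75, 90, 110, 70]
print(detect_feature_tiles(bandwidth_map, top_percent=20))
```

A fixed, mean, or median threshold would replace only the computation of `threshold`; the final comparison stays the same.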

In another embodiment, video frames can be partitioned into overlapping tiles. The overlapping sampling can be offset so that the center of each tile lies at the intersection of the corners of the four tiles that overlap it. This over-complete partitioning increases the likelihood that an initial sampling position will yield a detected feature. Other, more complex partitioning methods, such as topological partitioning, are also possible.

Small spatial regions detected as features can be analyzed to determine whether, based on some coherency criteria, they can be combined into larger spatial regions. Spatial regions can vary in size from small groups of pels to larger areas that may correspond to actual objects or parts of objects. However, the detected features need not correspond to unique, separable entities such as objects and sub-objects. A single feature may contain elements of two or more objects, or no object elements at all. The critical characteristic of a feature in the present invention is that feature model-based compression can compress the set of pels that make up the feature more efficiently than conventional methods can.

The coherency criteria for combining small regions into larger ones can include similarity of motion, similarity of appearance after motion compensation, and similarity of encoding complexity. Coherent motion can be discovered through higher-order motion models. In one embodiment, the translational motion of each small region can be integrated into an affine motion model that approximates the motion of each of the small regions. If the motion of a set of small regions can be integrated into such an aggregate model on a consistent basis, this implies a dependency among the regions, which indicates a coherency that an aggregate feature model can exploit.

<Formation of feature model>
After features have been detected in multiple video frames, it is important to correlate the multiple instances of the same feature. This process is called "feature correlation" and, as described below, it is the basis for feature tracking (determining the location of a particular feature over time). To be effective, however, the feature correlation process must first define a "feature model" that can be used to distinguish similar feature instances from dissimilar ones.

In one embodiment, the feature pels themselves can be used to model a feature. Feature pel regions are two-dimensional and can be vectorized. Similar features can then be identified by minimizing the mean squared error (MSE), or maximizing the inner product, between the pel vectors of different features. The problem with this approach is that feature pel vectors are sensitive to small changes in the feature, such as translation, rotation, and scaling, as well as to changes in the feature's illumination. Features undergo such changes frequently throughout a video, so any scheme that models and correlates features using feature pel vectors must account for them. One embodiment of the invention accounts for such changes in the simplest way, by applying the standard motion estimation and compensation algorithms found in conventional codecs (e.g., H.264), which account for the translational motion of features. In other embodiments, more complex techniques can be used to account for rotation, scaling, and illumination changes of features between frames.

In an alternative embodiment, the feature model is a compact representation of the feature that is "invariant" to small rotations, translations, scalings, and possibly illumination changes of the feature (that is, it remains unchanged when transformations of a given type are applied). Here, "compact" means of lower dimension than the original feature pel vector. In other words, even if the feature changes slightly from frame to frame, the feature model remains relatively constant. A compact feature model of this kind is often termed a "descriptor." As one example, in one embodiment of the invention the SURF feature descriptor, which is based on sums of Haar wavelet transform responses, has length 64 (compared with length 256 for the feature pel vector). In another embodiment, a five-bin color histogram is constructed from a color map of the feature pels, and this five-component histogram serves as the feature descriptor. In yet another embodiment, the feature region is transformed by a two-dimensional DCT. The two-dimensional DCT coefficients are then summed over the upper-triangular and lower-triangular portions of the coefficient matrix. These sums constitute an edge feature space and can serve as the feature descriptor.

When feature descriptors are used to model features, similar features can be identified by minimizing the MSE, or maximizing the inner product, between the feature descriptors (rather than between the feature pel vectors).
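The MSE-based similarity test applies equally to descriptors and to pel vectors. The sketch below assumes each feature model is a plain list of numbers of equal length; the function names `mse` and `most_similar` are illustrative and do not come from the source.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def most_similar(target, candidates):
    """Index of the candidate vector with the smallest MSE to `target`."""
    return min(range(len(candidates)), key=lambda i: mse(target, candidates[i]))


# Toy three-component descriptors (real SURF descriptors have length 64).
target = [1.0, 2.0, 3.0]
candidates = [[5.0, 5.0, 5.0], [1.0, 2.0, 4.0], [0.0, 0.0, 0.0]]
print(most_similar(target, candidates))
```

Maximizing the inner product instead would swap `min` for `max` and `mse` for a dot product, typically after normalizing the vectors.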

<Feature correlation (feature association)>
Once features have been detected and modeled, the next step is to correlate (associate) similar features across multiple frames. Each feature instance that appears in a frame is a sample of the appearance of that feature. Feature instances that are correlated across multiple frames are considered to "belong" to the same feature. Feature instances correlated as belonging to the same feature can either be aggregated to form a feature track or gathered into an ensemble matrix 40 (FIG. 1A).

A "feature track" is defined as the (x, y) location of a feature with respect to the video frames, that is, its location from frame to frame. One embodiment associates newly detected feature instances with previously tracked features (or, for the first frame of the video, with previously detected features). This association is the basis for determining which previously established feature track each feature instance in the current frame extends. Identifying a feature instance in the current frame with a previously established feature track (or, for the first frame of the video, with a previously detected feature) constitutes tracking of the feature.

FIG. 1B shows a feature tracker 70 being used to track features 60-1, 60-2, ..., 60-n. A feature detector 80 (e.g., SIFT or SURF) identifies features in the current frame. The feature instances detected in the current frame 90 are matched against the previously detected (or tracked) features 50. In one embodiment, before the correlation step described above, the set of candidate feature detections in the current frame can be ranked using an auto-correlation analysis (ACA) metric. This metric measures feature strength based on the autocorrelation matrix of the feature, with the image gradients in the autocorrelation matrix computed using derivative-of-Gaussian filters, as in the Harris and Stephens corner detection algorithm [Harris, Chris and Mike Stephens, 1988, "A combined corner and edge detector," in Proc. of the 4th Alvey Vision Conference, pp. 147-151]. Feature instances with high ACA values are given priority as candidates for track extension. In one embodiment, a feature instance that ranks lower in the ACA-ordered list is removed from the candidate set if it lies within a given distance (e.g., one pel) of a feature instance that ranks higher in the list.
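The ranking-and-pruning step can be sketched as a greedy suppression pass. This is an illustration under stated assumptions: candidates are hypothetical `(x, y, aca)` tuples whose ACA values have already been computed, and each candidate is compared only against higher-ranked candidates that survived pruning (the source does not say whether already-removed candidates also suppress their neighbors).

```python
import math


def rank_and_suppress(candidates, min_distance=1.0):
    """Rank (x, y, aca) candidates by ACA value and drop weaker near-duplicates.

    A candidate is removed when it lies within `min_distance` pels of a
    stronger candidate that has already been kept.
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        x, y, _ = cand
        if all(math.hypot(x - kx, y - ky) > min_distance for kx, ky, _ in kept):
            kept.append(cand)
    return kept


detections = [(10, 10, 5.0), (10, 11, 3.0), (20, 20, 4.0)]
print(rank_and_suppress(detections))
```

The returned list is ordered strongest first, so it can be consumed directly as the priority order for track extension.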

In various embodiments, either feature descriptors (e.g., SURF descriptors) or feature pel vectors can serve as the feature model. In one embodiment, the previously tracked features (regions 60-1, 60-2, ..., 60-n in FIG. 1B) are tested one track at a time for track extensions from among the features newly detected in the current frame 90. In one embodiment, the most recent feature instance of each feature track serves as the focus of the search for a track extension in the current frame (the "target feature"). All candidate feature detections in the current frame that lie within a given distance (e.g., 16 pels) of the location of the target feature are tested, and the candidate with the lowest MSE relative to the target feature is chosen as the extension of that feature track. In another embodiment, a candidate whose MSE relative to the target feature exceeds a given threshold is disqualified from extending the track.
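The search for a track extension can be sketched as below. This assumes hypothetical `(x, y, model_vector)` tuples for the target feature and each candidate detection; the name `extend_track` and its parameters are illustrative, and the optional `max_mse` argument corresponds to the disqualification threshold of the second embodiment.

```python
import math


def extend_track(target, candidates, max_distance=16.0, max_mse=None):
    """Pick the candidate that extends a feature track, or None if none qualifies.

    target, candidates: (x, y, model_vector) tuples (hypothetical layout).
    """
    tx, ty, tvec = target
    best, best_err = None, None
    for cand in candidates:
        cx, cy, cvec = cand
        # Only detections near the target feature's last position are tested.
        if math.hypot(cx - tx, cy - ty) > max_distance:
            continue
        err = sum((a - b) ** 2 for a, b in zip(tvec, cvec)) / len(tvec)
        # Optionally disqualify candidates whose MSE is too large.
        if max_mse is not None and err > max_mse:
            continue
        if best_err is None or err < best_err:
            best, best_err = cand, err
    return best


target = (100, 100, [10, 20, 30])
candidates = [
    (104, 103, [11, 19, 30]),  # near, small MSE
    (130, 100, [10, 20, 30]),  # perfect match but 30 pels away
    (98, 101, [50, 20, 30]),   # near, large MSE
]
print(extend_track(target, candidates))
```

A return value of `None` is the case in which a fallback search, or termination of the track, would be needed.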

In a further embodiment, if no candidate feature detection in the current frame qualifies as an extension of a given feature track, a limited search for a matching region in the current frame is conducted using either the motion compensated prediction (MCP) of H.264 or generic motion estimation and compensation (MEC). Both MCP and MEC conduct a gradient descent search for a matching region in the current frame that minimizes the MSE (and satisfies the MSE threshold) relative to the target feature in the previous frame. If no match for the target feature is found in the current frame, either among the candidate feature detections or by the MCP/MEC search process, the corresponding feature track is declared "dead" or "terminated."

In a further embodiment, if the feature instances of two or more feature tracks in the current frame coincide by more than a given threshold (e.g., 70% overlap), all but one of those feature tracks are pruned, that is, dropped from further consideration. The pruning process keeps the feature track that has the longest history and the largest total ACA, summed over all of its feature instances.
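One way the pruning rule might look in code is sketched below. Several details are assumptions rather than facts from the source: overlap is measured as intersection area divided by region area for equal-sized square regions, tracks are hypothetical dictionaries with `pos` (top-left corner in the current frame), `length`, and `total_aca` keys, and history length is taken as the primary sort key with total ACA as the tie-breaker, since the text does not say how the two criteria are combined.

```python
def overlap_fraction(a, b, size=16):
    """Fraction of a size x size region at `a` covered by the same-sized region at `b`."""
    w = max(0, size - abs(a[0] - b[0]))
    h = max(0, size - abs(a[1] - b[1]))
    return (w * h) / (size * size)


def prune_duplicate_tracks(tracks, threshold=0.7, size=16):
    """Keep one track from each group whose current instances overlap too much."""
    ordered = sorted(tracks, key=lambda t: (t["length"], t["total_aca"]), reverse=True)
    kept = []
    for track in ordered:
        # A track survives only if it does not coincide with a preferred track.
        if all(overlap_fraction(track["pos"], k["pos"], size) <= threshold for k in kept):
            kept.append(track)
    return kept


tracks = [
    {"name": "a", "pos": (0, 0), "length": 5, "total_aca": 10.0},
    {"name": "b", "pos": (2, 1), "length": 3, "total_aca": 50.0},
    {"name": "c", "pos": (40, 40), "length": 1, "total_aca": 1.0},
]
print([t["name"] for t in prune_duplicate_tracks(tracks)])
```

Here tracks "a" and "b" overlap by about 82%, so only the longer track "a" is kept alongside the unrelated track "c".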

One embodiment of the invention applies a combination of the above steps: SURF feature detection, ACA-based ranking of candidate features, and feature correlation by minimizing the MSE over candidate features, supplemented by MCP/MEC searching. This combination is referred to below as the feature point analysis (FPA) tracker.

In another embodiment of the invention, the macroblocks in a video frame are regarded as features, the features/macroblocks are registered using the MCP engine of H.264, and the features/macroblocks are correlated using the inter-frame prediction metrics of H.264, such as the sum of absolute transformed differences (SATD). This combination is referred to below as the macroblock cache (MBC) tracker. The MBC tracker differs from standard inter-frame prediction in that certain parameters are different (for example, search boundaries are disabled, so that the MBC tracker conducts a wider search for matches) and in that certain aspects of the matching process are different. In a third embodiment, SURF detections are associated with nearby macroblocks, and those macroblocks are correlated and tracked using the MCP and inter-frame prediction engines of H.264. This combination is referred to below as the SURF tracker.

In an alternative embodiment, multiple feature instances are gathered into an ensemble matrix for further modeling. Feature instances in the form of regions 30-1, 30-2, ..., 30-n, as shown in FIG. 1A, are correlated and identified as representing the same feature. The pel data from these regions can then be vectorized and placed into the ensemble matrix 40, so that the ensemble matrix 40 as a whole represents the feature. When enough samples have been gathered into the ensemble, they can be used to model the appearance of the feature not only in the frames where the feature was sampled but also in frames where it was not. The dimension of this "feature appearance model" is the same as that of the feature itself, which distinguishes it from the feature descriptor models described above.

The ensemble of regions can be spatially normalized (brought into conformity with a given standard by removing sources of variation) around a single key region within the ensemble. In one embodiment, the region closest to the geometric centroid of the ensemble is selected as the key region. In another embodiment, the earliest feature instance in the ensemble (the one that has been present in the ensemble longest) is selected as the key region. As described in U.S. Patent Nos. 7,508,990, 7,457,472, 7,457,435, 7,426,285, 7,158,680, 7,424,157, and 7,436,981 and in U.S. Patent Application Nos. 12/522,322 and 12/121,904, the deformations required to perform this normalization are collected into a deformation ensemble, and the resulting normalized images are collected into a modified appearance ensemble. The entire teachings of these patents and patent applications are incorporated herein by reference.

In this embodiment, the appearance ensemble is processed to yield an appearance model, and the deformation ensemble is processed to yield a deformation model. Together, the appearance and deformation models constitute the feature model for the feature. The feature model allows the feature to be represented by a compact set of parameters. In one embodiment, the model is formed by computing the singular value decomposition (SVD) of the ensemble matrix and then applying a rank reduction in which only a subset of the singular vectors and their corresponding singular values is retained. In a further embodiment, the criterion for rank reduction is to retain just enough principal singular vectors (and corresponding singular values) that the reduced-rank reconstruction of the ensemble matrix approximates the full ensemble matrix to within an error threshold based on the 2-norm of the ensemble matrix. In an alternative embodiment, the model is formed by orthogonal matching pursuit (OMP) [Pati, Y.C. et al., 1993, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in Proc. of the 27th Asilomar Conference, pp. 40-44], in which the ensemble is treated as a pattern dictionary that is searched repeatedly to maximize reconstruction accuracy. Here too, just enough ensemble vectors (and corresponding OMP weights) are retained that the OMP reconstruction satisfies an error threshold based on the 2-norm of the ensemble matrix. As described later, the appearance and deformation models of the feature formed in this way can be used in feature-based compression.
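The rank-reduction criterion can be stated compactly. The notation below ($A$ for the ensemble matrix, $r$ for the retained rank, $\tau$ for the threshold) is introduced here for illustration, and expressing the threshold as a fraction $\tau$ of the 2-norm of $A$ is an assumption; the source says only that the threshold is based on that norm.

```latex
% SVD of the ensemble matrix, singular values in non-increasing order
A = U \Sigma V^{\mathsf T} = \sum_{i=1}^{n} \sigma_i \, u_i v_i^{\mathsf T},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0

% Rank-r reconstruction and its 2-norm error
A_r = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\mathsf T},
\qquad \lVert A - A_r \rVert_2 = \sigma_{r+1}

% Smallest rank meeting a relative threshold (note \lVert A \rVert_2 = \sigma_1)
r^{*} = \min \bigl\{\, r : \sigma_{r+1} \le \tau \, \sigma_1 \,\bigr\}
```

Because the 2-norm error of the truncated SVD equals the first discarded singular value, the retained rank can be read directly from the singular-value spectrum without forming any reconstruction.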

A feature ensemble can be refined by comparing its members with one another. In one embodiment, the ensemble is refined by exhaustively comparing each sampled region (each vector of the ensemble) with every other sampled region. Each comparison consists of two tile registrations: in the first, the first region is compared against the second region; in the second, the second region is compared against the first. Each registration is performed at the positions of the regions in their respective images. The resulting registration offsets are retained together with the corresponding positional offsets, and are referred to as correlations. The correlations are analyzed to determine whether the multiple registrations indicate that the position of a sampled region should be refined. If a refined position in the source frame would yield a lower-error match for one or more regions in other frames, the region position is adjusted to the refined position. The refined position of a region in the source frame is determined by linearly interpolating the positions of the regions in other frames that correspond to the temporal extension of the region in the source frame.

<Feature-based compression>
Feature modeling (or data modeling in general) can be used to improve compression relative to conventional codecs. Standard inter-frame prediction uses block-based motion estimation and compensation to find a prediction for each coding unit (macroblock) within a limited search space in previously decoded reference frames. An exhaustive search for good predictions across all past reference frames would be computationally prohibitive. By contrast, detecting and tracking features throughout the video makes it possible to navigate a much larger prediction search space without excessive computation, and thereby to produce better predictions. Because features are themselves a kind of model, the terms "feature-based" and "model-based" are used interchangeably below.

In one embodiment of the invention, feature tracks are used to associate features with macroblocks. FIG. 1C shows the general process. A given feature track indicates the location of a feature across multiple frames, and the feature moves from frame to frame. The location of the feature in the current frame can be projected from its locations in the two most recent frames before the current frame. The projected location of the feature then has a corresponding nearest macroblock, defined as the macroblock that overlaps most with the projected location. This macroblock (the target macroblock now being encoded) is thereby associated with a specific feature track, namely the track whose projected location in the current frame is near the macroblock (step 100 in FIG. 1C).
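The projection and association steps can be sketched as follows. This is a minimal sketch under several assumptions: positions are integer top-left corners of feature regions, the projection is a linear extrapolation with equal frame spacing, the feature region has the same size as a macroblock, and macroblocks lie on a regular grid starting at (0, 0). The function names are illustrative, and clamping to the frame boundary is omitted.

```python
def project_position(prev2, prev1):
    """Linearly extrapolate the position in the current frame from the two
    most recent positions: p_t = p_{t-1} + (p_{t-1} - p_{t-2})."""
    return (2 * prev1[0] - prev2[0], 2 * prev1[1] - prev2[1])


def nearest_macroblock(pos, mb_size=16):
    """Top-left corner of the grid-aligned macroblock that overlaps most with
    an mb_size x mb_size region whose top-left corner is at `pos`.

    For equal-sized regions this is the grid origin nearest to `pos` on each axis.
    """
    half = mb_size // 2
    return (
        ((pos[0] + half) // mb_size) * mb_size,
        ((pos[1] + half) // mb_size) * mb_size,
    )


projected = project_position((30, 40), (36, 42))
print(projected, nearest_macroblock(projected))
```

With the sample positions, the feature moves by (6, 2) pels per frame, so it is projected to (42, 44) and associated with the macroblock whose corner is at (48, 48).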

The next step is to compute the offset between the target macroblock and the projected feature location in the current frame (step 110). This offset, together with previous feature instances in the associated feature track, can then be used to generate predictions for the target macroblock. These previous instances reside either in a local cache 120, made up of recent reference frames in which the feature appears, or in a distant cache, made up of "older" reference frames 150 in which the feature appears. A prediction for the target macroblock is generated by finding the region in a reference frame that has the same offset from the previous feature instance in that frame as the target macroblock has from the projected feature location in the current frame (steps 130 and 160).
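The offset rule amounts to two vector operations, sketched below. All positions are assumed to be (x, y) top-left corners in pels, and the function name `prediction_origin` is hypothetical.

```python
def prediction_origin(target_mb, projected_feature, past_feature):
    """Top-left corner of the prediction region in a reference frame.

    The region keeps the same offset from the past feature instance as the
    target macroblock has from the projected feature location.
    """
    offset = (target_mb[0] - projected_feature[0], target_mb[1] - projected_feature[1])
    return (past_feature[0] + offset[0], past_feature[1] + offset[1])


# Target macroblock at (48, 48); feature projected to (42, 44) in the current
# frame and previously observed at (36, 42) in a reference frame.
print(prediction_origin((48, 48), (42, 44), (36, 42)))
```

The same call can be repeated for every cached reference frame in which the feature appears, yielding one candidate prediction per past feature instance.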

<Generation of model-based primary and secondary predictions>
In one embodiment of the invention, feature-based prediction is carried out as follows: (1) detect the features in each frame; (2) model the detected features; (3) correlate features in different frames to create feature tracks; (4) use the feature tracks to predict the locations of features in the "current" frame being encoded; (5) associate macroblocks in the current frame that lie near the predicted feature locations; and (6) generate predictions for the macroblocks in step (5) based on the past locations along the feature tracks of their associated features.

In one embodiment, features are detected using the SURF algorithm and correlated and tracked using the FPA algorithm, both described above. After detection, correlation, and tracking, each feature track can be associated with the nearest macroblocks, as described above. In one embodiment, when multiple features could be associated with one macroblock, the feature that overlaps the macroblock most is selected as the feature associated with that macroblock.
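The largest-overlap selection rule can be sketched as follows. This is a minimal example under stated assumptions: features and macroblocks are treated as axis-aligned boxes `(x, y, width, height)`, overlap is measured as intersection area, and the names `overlap_area` and `select_feature` are hypothetical.

```python
def overlap_area(a, b):
    """Intersection area of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(0, w) * max(0, h)

def select_feature(macroblock, features):
    """Return the id of the feature that overlaps the macroblock most,
    or None if no feature overlaps it at all."""
    best_id, best_area = None, 0
    for feature_id, box in features.items():
        area = overlap_area(macroblock, box)
        if area > best_area:
            best_id, best_area = feature_id, area
    return best_id
```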

Given a target macroblock (the current macroblock being encoded), its associated feature, and the feature track of that feature, a primary prediction (or key prediction) for the target macroblock can be generated. The data (pels) for the key prediction are taken from the most recent frame, relative to the current frame, in which the feature appears; this frame is hereinafter called the key frame. The key prediction is generated after selecting a motion model and a pel sampling scheme. In one embodiment of the invention, the motion model can be "zero-order", which assumes that the feature is stationary between the key frame and the current frame, or "first-order", which assumes that the feature's motion is linear across the second-most-recent reference frame, the key frame, and the current frame. In either case, the feature motion is applied (backward in time) to the macroblock in the current frame associated with the feature to obtain the prediction for that macroblock in the key frame. In one embodiment of the invention, the pel sampling scheme can be "direct", in which the motion vector is rounded to integer values and the pels for the key prediction are taken directly from the key frame, or "indirect", in which the interpolation scheme of a conventional compression method such as H.264 is used to derive a motion-compensated key prediction. Thus, the invention can produce four different key predictions, depending on the motion model (zero-order or first-order) and the sampling scheme (direct or indirect).
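The two motion models and the "direct" sampling scheme can be illustrated with the short sketch below. It is not the disclosed implementation: the function names, the constant-velocity extrapolation used for the first-order model, the frame-gap parameters, the round-half-up rule, and the border clamping are all assumptions. The "indirect" scheme (sub-pel interpolation as in H.264) is not shown.

```python
import math

def round_half_up(value):
    """Round to the nearest integer, with halves rounded up."""
    return int(math.floor(value + 0.5))

def feature_motion(pos_prev, pos_key, gap_prev_key, gap_key_cur, model="zero"):
    """Estimated displacement of the feature from the key frame to the current frame.

    pos_prev: feature position in the second-most-recent reference frame
    pos_key:  feature position in the key frame
    gap_*:    frame distances (assumed parameters)
    """
    if model == "zero":
        return (0.0, 0.0)            # feature assumed stationary
    if model != "first":
        raise ValueError("model must be 'zero' or 'first'")
    # First-order: linear motion, extrapolated at constant velocity.
    vx = (pos_key[0] - pos_prev[0]) / gap_prev_key
    vy = (pos_key[1] - pos_prev[1]) / gap_prev_key
    return (vx * gap_key_cur, vy * gap_key_cur)

def direct_key_prediction(key_frame, mb_xy, motion, size=16):
    """'Direct' sampling: round the motion vector to integers and copy pels
    straight from the key frame. The motion is applied backward in time."""
    height, width = len(key_frame), len(key_frame[0])
    x0 = mb_xy[0] - round_half_up(motion[0])
    y0 = mb_xy[1] - round_half_up(motion[1])
    x0 = max(0, min(x0, width - size))    # assumption: clamp at frame borders
    y0 = max(0, min(y0, height - size))
    return [row[x0:x0 + size] for row in key_frame[y0:y0 + size]]
```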

The key prediction can be refined by modeling local deformation with a subtiling process, in which separate motion vectors are calculated for different local parts of the macroblock. In one embodiment, the subtiling process divides a 16×16 macroblock into four 8×8 quadrants and calculates a prediction for each separately. In another embodiment, the subtiling process is carried out in the Y/U/V color space domain by calculating predictions for the Y, U, and V color channels separately.
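A minimal sketch of the quadrant form of subtiling follows. It assumes integer motion vectors that have already been estimated for each quadrant, a single-channel frame stored as a list of pel rows, and clamping at frame borders; the names `quadrants` and `subtiled_prediction` are hypothetical.

```python
def quadrants(mb_xy, size=16):
    """Top-left corners of the four quadrants, in the order
    top-left, top-right, bottom-left, bottom-right."""
    half = size // 2
    return [(mb_xy[0] + dx, mb_xy[1] + dy) for dy in (0, half) for dx in (0, half)]

def subtiled_prediction(key_frame, mb_xy, motions, size=16):
    """Assemble a macroblock prediction from four quadrant predictions,
    each fetched from the key frame with its own integer motion vector."""
    half = size // 2
    height, width = len(key_frame), len(key_frame[0])
    prediction = [[0] * size for _ in range(size)]
    for (qx, qy), (mvx, mvy) in zip(quadrants(mb_xy, size), motions):
        # Motion is applied backward in time; clamp to stay inside the frame.
        sx = max(0, min(qx - mvx, width - half))
        sy = max(0, min(qy - mvy, height - half))
        for j in range(half):
            for i in range(half):
                prediction[qy - mb_xy[1] + j][qx - mb_xy[0] + i] = key_frame[sy + j][sx + i]
    return prediction
```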

In addition to the primary (key) prediction for the target macroblock, secondary predictions may be generated based on the positions of the associated feature in reference frames prior to the key frame. In one embodiment, the offset from the target macroblock to the (estimated) position of the associated feature in the current frame is used as a motion vector to locate secondary predictions relative to the feature's position in past reference frames. In this way, multiple secondary predictions (one for each frame in which the feature appeared) can be generated for a given target macroblock that has an associated feature. In one embodiment, the number of secondary predictions may be limited by restricting the number of past reference frames searched (for example, to 25).
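The generation of secondary predictions along a feature track might be sketched as below. The data layout is an assumption for illustration: `track` maps a frame index to the feature position in that frame, `ref_frames` maps a frame index to its pels, larger indices are more recent, and `max_refs` plays the role of the limit on searched reference frames.

```python
def secondary_predictions(ref_frames, track, key_index, target_xy,
                          feature_xy_cur, size=16, max_refs=25):
    """Return {frame_index: predicted block} for frames older than the key frame."""
    # Offset from the estimated feature position to the target macroblock.
    dx = target_xy[0] - feature_xy_cur[0]
    dy = target_xy[1] - feature_xy_cur[1]
    # Frames older than the key frame in which the feature appears, newest first.
    older = sorted((i for i in track if i < key_index), reverse=True)[:max_refs]
    predictions = {}
    for index in older:
        frame = ref_frames[index]
        height, width = len(frame), len(frame[0])
        fx, fy = track[index]
        x0 = max(0, min(fx + dx, width - size))    # assumption: clamp at borders
        y0 = max(0, min(fy + dy, height - size))
        predictions[index] = [row[x0:x0 + size] for row in frame[y0:y0 + size]]
    return predictions
```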

<Composite prediction>
Once the primary (key) prediction and the secondary predictions have been generated for a target macroblock, an overall reconstruction of the target macroblock can be calculated from these predictions. In one embodiment, the reconstruction follows conventional codecs and is based on the key prediction alone. Such a reconstruction is hereinafter called a key-only (KO) reconstruction.

In another embodiment, the reconstruction is based on a composite prediction that sums the key prediction and a weighted version of one of the secondary predictions. This algorithm is hereinafter called PCA-Lite (PCA-L). PCA-Lite comprises the following steps:
1. Form a (one-dimensional) vector from the target macroblock, called the target vector $t$, and a (one-dimensional) vector from the key prediction, called the key vector $k$;
2. Calculate the residual vector $r = t - k$ by subtracting the key vector from the target vector;
3. Vectorize the set of secondary predictions to form vectors $s_i$ (without loss of generality, these secondary vectors are assumed to have unit norm). Then subtract the key vector from every secondary vector to form the key-subtracted set $s_i - k$. This is akin to subtracting the projection of the key vector from the secondary vectors;
4. For each secondary vector, calculate the weighting factor $c = r^{T}(s_i - k)$; and
5. For each secondary vector, calculate the composite prediction $\hat{t} = k + c\,(s_i - k)$.
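The five PCA-Lite steps can be followed literally in a few lines of Python. The sketch below is illustrative only: vectors are plain lists of numbers, the function name `pca_lite` is hypothetical, and no normalization is applied beyond what the steps state (the unit-norm assumption on the secondary vectors is left to the caller).

```python
def pca_lite(target, key, secondaries):
    """Return one composite prediction per secondary vector.

    target, key: vectorized target macroblock t and key prediction k
    secondaries: list of vectorized secondary predictions s_i
    """
    # Step 2: residual r = t - k.
    residual = [t - k for t, k in zip(target, key)]
    composites = []
    for secondary in secondaries:
        # Step 3: key-subtracted vector s_i - k.
        diff = [s - k for s, k in zip(secondary, key)]
        # Step 4: weighting factor c = r^T (s_i - k).
        weight = sum(r * d for r, d in zip(residual, diff))
        # Step 5: composite prediction k + c (s_i - k).
        composites.append([k + weight * d for k, d in zip(key, diff)])
    return composites
```

When a secondary vector is orthogonal to the residual after key subtraction, the weight is zero and the composite prediction reduces to the key prediction.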

In general, the steps of the PCA-Lite algorithm are similar to those of the well-known orthogonal matching pursuit algorithm [Pati, 1993], while the composite prediction above is intended to exclude redundant contributions from the primary and secondary predictions. In another embodiment, the PCA-Lite algorithm is modified by replacing the key vector in steps 3 to 5 above with the mean of the key vector and the secondary vector. This modified algorithm is hereinafter called PCA-Lite-Mean.

The PCA-Lite algorithm provides a different type of composite prediction from the bi-prediction algorithms found in some standard codecs (described in the "Background" section above). Standard bi-prediction algorithms blend multiple predictions based on the temporal distance between the current frame and the reference frame used for each prediction. By contrast, PCA-Lite blends multiple predictions into a composite prediction based on the content of each prediction.

Note that the composite prediction described above does not require feature-based modeling: any set of predictions can be used to generate a composite prediction for a given target macroblock. With feature-based modeling, however, the set of predictions for a given target macroblock is naturally interrelated, and composite prediction provides an efficient way to combine the information from those predictions.

<Modeling data at multiple fidelities>
In the present invention, data can be modeled at multiple fidelities for model-based compression. FIG. 2A illustrates one such embodiment and shows four levels of modeling. The following table summarizes these four levels, which are described in detail below.

The bottom level in FIG. 2A, called the "macroblock" (MB) level, corresponds to conventional compression, which partitions frames into non-overlapping macroblocks (tiles of size 16×16) or a finite set of subtiles. Conventional compression (for example, H.264) essentially performs no modeling; instead, it uses block-based motion estimation and compensation (BBMEC) to find a prediction 212 for each tile from a limited search space within the decoded reference frames. At the decoder, the prediction 212 is combined with the residual encoding of the macroblock (or subtile) to synthesize a reconstruction of the original data (step 210).

The second level 202 in FIG. 2A, called the "macroblocks as features" (MBF) level, corresponds to compression based on the MBC tracker described above (216 in FIG. 2A). At this level, macroblocks (or macroblock subtiles) are treated as features by applying the conventional BBMEC search repeatedly across multiple encoded frames. The first application of BBMEC is the same as at the MB level: it finds a conventional prediction for the target macroblock in the most recent reference frame in 216. The second application of BBMEC, however, finds a conventional prediction for that first prediction by searching the second-most-recent reference frame in 216. Applying BBMEC repeatedly through progressively older frames in 216 generates a "track" for the target macroblock, even though the macroblock has not been identified as a feature. The MBC tracks produce a model 214, which generates a prediction 212. At the decoder, this prediction 212 is combined with the residual encoding of the macroblock (or subtile) to synthesize a reconstruction of the original data (step 210).
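The repeated application of a block search through progressively older frames can be sketched as follows. This is a simplified stand-in rather than the BBMEC of any real encoder: it uses an exhaustive integer-pel search with a sum-of-absolute-differences (SAD) cost, and the names `sad`, `best_match`, and `mbc_track` and the `radius` parameter are assumptions.

```python
def sad(frame, x0, y0, block):
    """Sum of absolute differences between block and the frame region at (x0, y0)."""
    return sum(abs(frame[y0 + j][x0 + i] - block[j][i])
               for j in range(len(block)) for i in range(len(block[0])))

def best_match(frame, block, center, radius):
    """Exhaustive search within +/- radius of center (assumed to lie inside
    the frame); returns the top-left corner of the lowest-SAD region."""
    size = len(block)
    height, width = len(frame), len(frame[0])
    best_xy, best_cost = None, None
    for y0 in range(max(0, center[1] - radius), min(height - size, center[1] + radius) + 1):
        for x0 in range(max(0, center[0] - radius), min(width - size, center[0] + radius) + 1):
            cost = sad(frame, x0, y0, block)
            if best_cost is None or cost < best_cost:
                best_xy, best_cost = (x0, y0), cost
    return best_xy

def mbc_track(ref_frames, target_block, target_xy, radius=4):
    """ref_frames is ordered most recent first. Each match becomes the
    block searched for in the next (older) frame, producing a track."""
    track, block, xy = [], target_block, target_xy
    size = len(target_block)
    for frame in ref_frames:
        xy = best_match(frame, block, xy, radius)
        block = [row[xy[0]:xy[0] + size] for row in frame[xy[1]:xy[1] + size]]
        track.append(xy)
    return track
```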

The third level 204 in FIG. 2A, called the "features" level, corresponds to the feature-based compression described above. As described above, features are detected and tracked independently of the macroblock grid but are associated with the macroblocks they overlap, and the feature tracks are used to navigate the decoded reference frames 216 to find good matches for those overlapping macroblocks. In an alternative embodiment, the codec encodes and decodes the features directly, without associating them with macroblocks, and processes the "non-feature" background separately, for example using the conventional compression of the MB level. The feature-based model 214 generates a prediction 212. At the decoder, this prediction 212 is combined with the residual encoding of the corresponding macroblock (or subtile) to synthesize a reconstruction of the original data (step 210).

The top level 206 in FIG. 2A, called the "objects" level, corresponds to object-based compression. Objects are essentially large features that may encompass multiple macroblocks and may represent something with physical meaning (for example, a face, a ball, or a cell phone) or a complex phenomenon 208. Object modeling can be parametric, because an object that is expected to be of a particular type (for example, a face) can be modeled with specialized basis functions (step 214). When an object encompasses or overlaps multiple macroblocks, a single motion vector 212 can be calculated for all the macroblocks associated with the object 216, which saves both computation and encoding size. The object-based model 214 generates a prediction 212. At the decoder, this prediction 212 is combined with the residual encoding of the corresponding macroblock (or subtile) to synthesize a reconstruction of the original data (step 210).

In an alternative embodiment, objects may be identified by correlating and aggregating nearby feature models 214. FIG. 2B is a block diagram illustrating this type of nonparametric or empirical object detection through aggregation of feature models. A particular type of object 220 is detected by identifying features that have the characteristics of that type of object, that is, features that display "object bias" (step 222). Next, it is determined whether the set of features 222 displays rigidity of the model states 224, that is, a tendency for the features and their states to be correlated over time (step 224). If the individual feature models are determined to be correlated (in which case an object is deemed detected (step 226)), a composite appearance model 228 with accompanying parameters and a composite deformation model 230 with accompanying parameters can be formed. Forming the composite appearance and deformation models naturally reduces the number of parameters relative to the individual appearance and deformation models (step 232).

FIG. 2C illustrates a third embodiment of the "objects" level 206 of FIG. 2A, which uses both parametric and nonparametric object-based modeling. An object is detected with a parametric model (step 240). The detected object 240 is processed to determine whether any features overlap it (step 250). The set of overlapping features may then be tested to determine whether the features can be aggregated as described above (step 260). If the overlapping features cannot be aggregated, the macroblocks overlapping the detected object 240 may be examined to determine whether they can be efficiently aggregated to share a single common motion vector, as described above (step 270).

In a multiple-fidelity processing architecture, levels 200, 202, 204, and 206 may be combined as appropriate to achieve the best processing. In one embodiment, all the levels in FIG. 2A are examined in "competition" fashion to determine which level produces the best (smallest) encoding for each macroblock to be encoded. This "competition" is described in detail below.

In another embodiment, the levels in FIG. 2A may be examined sequentially, from the bottom (simplest) level to the top (most complex) level. If a lower-level solution is sufficient, higher-level solutions need not be examined. The criteria for deciding whether a given solution is "good enough" are described in detail below.
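The sequential examination of levels can be sketched as an early-exit loop. In this hypothetical example, each level is represented by a callable that returns an encoding size in bits, and `good_enough` is a caller-supplied predicate; falling back to the smallest encoding seen when no level is sufficient is an assumption, since that case is not specified here.

```python
def encode_sequentially(levels, macroblock, good_enough):
    """levels: list of (name, encode_fn) ordered from simplest to most complex.
    Returns (name, size_bits) of the first sufficient level, or of the
    smallest encoding seen if none is sufficient (assumed fallback)."""
    best = None
    for name, encode in levels:
        size_bits = encode(macroblock)
        if good_enough(size_bits):
            return (name, size_bits)      # higher levels are not examined
        if best is None or size_bits < best[1]:
            best = (name, size_bits)
    return best
```

Examining all levels in "competition" fashion corresponds to the same loop without the early return, keeping only the smallest encoding.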

<Model-based compression codec>

<Conventional codec processing>
An encoding process may convert video data into a compressed, or encoded, format. Likewise, a decompression process may convert compressed video back into its uncompressed (that is, original) format. The video compression and decompression processes may be implemented as an encoder/decoder pair, commonly called a codec.

FIG. 3A is a block diagram of a standard encoder 312. The encoder of FIG. 3A may be implemented in software, in hardware, or in a combination of the two. As an example, the components of such an encoder may be implemented as code stored on a storage medium and executable by at least one processor 820, as in FIG. 8A or FIG. 8B. The encoder 312 may include any combination of components, including, but not limited to, an intra-prediction unit 314, an inter-prediction unit 316, a transform unit 324, a quantization unit 326, an entropy encoding unit 328, and a loop filter 334. The inter-prediction unit 316 may include a motion compensation unit 318, a frame store 320, and a motion estimation unit 322. The encoder 312 may further include an inverse quantization unit 330 and an inverse transform unit 332. The function of each component of the encoder 312 in FIG. 3A is well known to those skilled in the art.

The entropy encoding algorithm 328 in FIG. 3A may be based on a probability distribution that quantifies the likelihood of the various values of the quantized transform coefficients. The encoding size of the current coding unit (for example, a macroblock) depends on the current encoding state (the values of the various quantities to be encoded) and on how well that state matches the probability distribution. As described below, any change to this encoding state may affect the encoding sizes of coding units in subsequent frames. To fully optimize the encoding of a video, one could exhaustively search all possible encoding paths (that is, all possible encoding states), but the computational cost would be prohibitive. In one embodiment of the invention, the encoder 312 optimizes locally by focusing on the current (target) macroblock alone rather than considering a larger scope (for example, a slice, a frame, or a set of frames).

FIG. 3B is a block diagram of a standard decoder 340 decoding intra-predicted data 336, and FIG. 3C is a block diagram of a standard decoder 340 decoding inter-predicted data 338. The decoder 340 may be implemented in software, in hardware, or in a combination of the two. Referring to FIGS. 3A, 3B, and 3C, a typical encoder 312 receives a video input 310 from an internal or external source, encodes the data, and stores the encoded data in the decoder cache/buffer 348. The decoder 340 retrieves the encoded data from the cache/buffer 348 for decoding or transmission. The decoder may obtain access to the decoded data by any available means, such as a system bus or a network interface. The decoder 340 may decode the video data to decompress the key frames and predicted frames described above (generally 210 in FIG. 2A). The cache/buffer 348 may receive the data related to the video sequence/bitstream and supply information to the entropy decoder 346. The entropy decoder 346 processes the bitstream to generate quantized estimates of the transform coefficients of the intra prediction in FIG. 3B or of the residual signal in FIG. 3C. The inverse quantizer 344 performs a rescaling operation to produce estimates of the transform coefficients. Applying an inverse transform to these estimates (step 342) synthesizes the intra prediction of the original video data pels in FIG. 3B and the intra prediction of the residual signal in FIG. 3C. In FIG. 3C, the synthesized residual signal is added to the inter prediction of the target macroblock to generate the full reconstruction of the target macroblock. The decoder's inter-prediction unit 350 replicates the inter prediction generated by the encoder by applying motion estimation (step 356) and motion compensation (step 354) to the reference frames held in the frame store 352. The decoder's inter-prediction unit 350 is analogous to the inter-prediction unit 316 of FIG. 3A, including its constituent motion estimation unit 322, motion compensation unit 318, and frame store 320.
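The inter-predicted decoding path (rescaling, inverse transform, and addition of the residual to the inter prediction) can be illustrated with the sketch below. It is a simplified stand-in: it uses a floating-point inverse DCT and a single uniform quantization step `qstep`, whereas H.264 uses an integer transform and its own scaling rules, and the function names are hypothetical.

```python
import math

def idct_2d(coeffs):
    """Orthonormal 2-D inverse DCT of an n x n coefficient block."""
    n = len(coeffs)
    def alpha(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
    return [[sum(alpha(u) * alpha(v) * coeffs[u][v]
                 * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                 * math.cos((2 * x + 1) * v * math.pi / (2 * n))
                 for u in range(n) for v in range(n))
             for x in range(n)] for y in range(n)]

def reconstruct_block(quantized, prediction, qstep):
    """Rescale the quantized coefficients, inverse transform them to a
    residual, and add the residual to the inter prediction."""
    coeffs = [[q * qstep for q in row] for row in quantized]   # rescaling
    residual = idct_2d(coeffs)                                  # inverse transform
    return [[min(255, max(0, p + int(round(r))))                # clip to 8-bit pels
             for p, r in zip(pred_row, res_row)]
            for pred_row, res_row in zip(prediction, residual)]
```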

<Hybrid codec implementing model-based prediction>
FIG. 3D is a diagram of an encoder according to an embodiment of the invention that implements model-based prediction. The codec 360 may encode the current (target) frame (step 362) and then each macroblock within that frame (step 364). A standard H.264 encoding process is used to define a base (first) encoding that yields an H.264 encoding solution (step 366). In one preferred embodiment, the encoder 366 is an H.264 encoder capable of encoding a GOP (a set of reference frames). Preferably, the H.264 encoder is configurable so that it can apply different methods to encode the pels within each frame, including intra-frame prediction and inter-frame prediction; the inter-frame prediction can search multiple reference frames to find a good match for the macroblock being encoded. Preferably, the error between the original macroblock data and the prediction is transformed, quantized, and entropy encoded.

Preferably, the encoder 360 uses the CABAC entropy encoding algorithm to provide a context-sensitive, adaptive mechanism for context modeling (step 382). Context modeling may be applied to binarized sequences of the syntax elements of the video data (for example, block types, motion vectors, and quantized coefficients), produced by a predefined binarization process. Each element is then coded using either adaptive or fixed probability models. Context values may be used to adjust the probability models as appropriate.

<Competition mode>
In FIG. 3D, the H.264 macroblock encoding is analyzed (step 368). If, at step 368, the H.264 encoding of the macroblock is judged to be "efficient", the H.264 solution is deemed close to ideal, no further analysis is performed, and the H.264 encoding solution is selected for the target macroblock. In one embodiment, the efficiency of the H.264 encoding may be judged by comparing the H.264 encoding size (in bits) with a threshold. Such a threshold may be derived from percentile statistics of previously encoded videos or from earlier percentile statistics of the same video. In another embodiment, the efficiency of the H.264 encoding may be judged by whether the H.264 encoder has classified the target macroblock as a "skip" macroblock, that is, a macroblock whose data, inside and in its surroundings, are uniform enough that essentially no additional encoding is required.
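The threshold test can be sketched as follows. This is an assumption-laden illustration: the nearest-rank percentile rule, the integer `pct` parameter, and the function names are chosen for the example, and combining the size test and the skip test in one function is a convenience here, since they are described as separate embodiments.

```python
def percentile_threshold(past_sizes, pct):
    """Nearest-rank percentile (pct is an integer from 1 to 100) of
    previously observed macroblock encoding sizes, in bits."""
    ordered = sorted(past_sizes)
    rank = max(1, (pct * len(ordered) + 99) // 100)   # ceiling division
    return ordered[rank - 1]

def is_h264_efficient(size_bits, threshold, is_skip=False):
    """True if the H.264 encoding is judged 'efficient', in which case
    competition mode would not be entered."""
    return is_skip or size_bits <= threshold
```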

If, at step 368, the H.264 macroblock solution is not judged to be efficient, additional analysis is performed and the encoder enters competition mode 380. In this mode, several different predictions are generated for the target macroblock based on multiple models 378. The models 378 are created by identifying (step 376) features detected and tracked in prior frames 374. Each time a new frame 362 is processed (encoded, decoded, and placed in the frame store), the feature models must be updated to account for new feature detections and the corresponding feature-track extensions in the new frame 362. The model-based solutions 382 are ranked by encoding size 384 together with the H.264 solution obtained earlier. Because a given macroblock can be encoded with either the base encoding (the H.264 solution) or a model-based encoding, the codec of the present invention can be called a hybrid codec.

For example, in competition mode, an H.264 encoding is generated for the target macroblock, and its compression efficiency (the ability to encode the data with fewer bits) is compared with that of the other models. For each encoding algorithm used in competition mode, the following steps are carried out: (1) generate a prediction based on the codec mode/algorithm in use; (2) subtract the prediction from the target macroblock to produce a residual signal; (3) transform the residual (target minus prediction) using an approximation of a block-based two-dimensional DCT; and (4) encode the transform coefficients with an entropy encoder.
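The per-candidate steps and the ranking by encoding size can be sketched as below. Several simplifications are assumed and are not part of the disclosure: a floating-point DCT stands in for the transform approximation, a uniform quantizer with step `qstep` is inserted before coding, and the length of a signed Exp-Golomb code is used as a rough proxy for the entropy coder's output size. All function names are hypothetical.

```python
import math

def dct_2d(block):
    """Orthonormal 2-D DCT-II of an n x n block."""
    n = len(block)
    def alpha(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
    return [[alpha(u) * alpha(v) * sum(
                 block[y][x]
                 * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                 * math.cos((2 * x + 1) * v * math.pi / (2 * n))
                 for y in range(n) for x in range(n))
             for v in range(n)] for u in range(n)]

def exp_golomb_bits(value):
    """Length in bits of the signed Exp-Golomb code for an integer."""
    code_num = 2 * abs(value) - (1 if value > 0 else 0)
    return 2 * (code_num + 1).bit_length() - 1

def encoding_cost(target, prediction, qstep=8):
    """Steps (2) to (4): residual, transform, then a proxy for coded size."""
    residual = [[t - p for t, p in zip(t_row, p_row)]
                for t_row, p_row in zip(target, prediction)]
    coeffs = dct_2d(residual)
    quantized = [int(round(c / qstep)) for row in coeffs for c in row]
    return sum(exp_golomb_bits(q) for q in quantized)

def run_competition(target, candidates, qstep=8):
    """candidates maps an entry name to its prediction (step (1) is assumed
    done by the caller). Returns the smallest-cost entry and all costs."""
    costs = {name: encoding_cost(target, pred, qstep)
             for name, pred in candidates.items()}
    return min(costs, key=costs.get), costs
```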

In some respects, the H.264 (inter-frame) baseline prediction can be regarded as a prediction based on a relatively simple, limited model (H.264 being one of the algorithms used in competition mode). The predictions of the encoder 360, however, may also be based on more complex models (feature-based or object-based) and the corresponding tracking. The encoder 360 operates on the assumption that, when a macroblock exhibiting data complexity is detected, feature-based compression will do better than conventional compression.

<Use of feature-based prediction in competition mode>
As described above, it is first determined, for each target macroblock, whether the H.264 solution (prediction) is efficient ("good enough") for that macroblock. If the answer is negative, competition mode is entered.

In competition mode 380 of FIG. 3D, the "entries" into the competition are determined by the choices made among the processing options for feature-based prediction (see the description above). Each entry produces a different prediction for the target macroblock. Feature-based prediction according to the present invention allows the following processing options to be specified:
- tracker type (FPA, MBC, SURF);
- motion model used for the key prediction (zeroth order or first order);
- sampling scheme used for the key prediction (direct or indirect);
- subtiling scheme used for the key prediction (no subtiling, quadrants, Y/U/V);
- reconstruction algorithm (KO or PCA-L); and
- reference frames used for the secondary prediction (for PCA-L).

The solution search space for a given target macroblock may include, in addition to the H.264 solution (the "best" inter-frame prediction from H.264), all of the types of feature-based prediction described above. In one embodiment, competition mode includes every combination of the processing options above (tracker type, motion model for the key prediction, sampling scheme for the key prediction, subtiling scheme, and reconstruction algorithm). In another embodiment, the processing options in competition mode are configurable and can be limited to a subset of combinations small enough to save computation.
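By way of illustration only, the enumeration of competition entries can be sketched as a Cartesian product over the processing options. The option values are those listed in the text; the dictionary keys, the function name `enumerate_entries`, and the `allowed` restriction mechanism are hypothetical and are not taken from the specification. The reference-frame option for PCA-L is omitted for brevity.

```python
from itertools import product

# Option values as listed in the text; the key names are illustrative only.
OPTIONS = {
    "tracker": ["FPA", "MBC", "SURF"],
    "motion_model": ["zeroth_order", "first_order"],
    "sampling": ["direct", "indirect"],
    "subtiling": ["none", "quadrants", "YUV"],
    "reconstruction": ["KO", "PCA-L"],
}


def enumerate_entries(options, allowed=None):
    """Return every option combination, optionally restricted to a subset.

    `allowed` maps an option name to the values that remain enabled;
    options not mentioned in `allowed` keep all of their values.
    """
    allowed = allowed or {}
    names = list(options)
    value_lists = [
        [v for v in options[name] if v in allowed.get(name, options[name])]
        for name in names
    ]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Full search space versus a reduced, cheaper configuration.
all_entries = enumerate_entries(OPTIONS)
reduced = enumerate_entries(OPTIONS, {"tracker": ["FPA"], "subtiling": ["none"]})
print(len(all_entries), len(reduced))
```

Restricting even two options shrinks the number of entries substantially, which is the computational saving that the configurable embodiment is aiming at.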

The candidate solutions in the competition are evaluated one at a time using the same four steps as before: (1) generate the prediction; (2) subtract the prediction from the target macroblock to produce a residual signal; (3) transform the residual; and (4) encode the transform coefficients with an entropy encoder. The output of step 382 in FIG. 3D is the number of bits associated with a given solution 384. After each solution is evaluated, the encoder is rolled back to its state before that evaluation so that the next solution can be evaluated. In one embodiment, once all solutions have been evaluated, the "winner" of the competition is chosen by selecting the solution with the smallest encoding size (step 370). The winning solution is then sent to the encoder once more as the final encoding of the target macroblock (step 372). As noted above, the winning solution is optimized only for the target macroblock and is therefore a locally optimal solution. In an alternative embodiment, the optimal solution is chosen according to whether it mitigates larger-scale trade-offs. Such trade-offs may include, but are not limited to, the effect of contextual intra-frame prediction feedback and the effect of residual errors on subsequent frames.
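The evaluate, roll back, and select loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the encoder of the specification: the transform step is skipped, the signed Exp-Golomb code length is used as a stand-in for the entropy coder, and the names `evaluate`, `run_competition`, and the `bits_written` state field are hypothetical.

```python
import copy


def se_golomb_bits(v):
    """Bit length of the signed Exp-Golomb code for the integer v."""
    k = 2 * v - 1 if v > 0 else -2 * v
    return 2 * (k + 1).bit_length() - 1


def evaluate(target, prediction, state):
    """Steps (2) and (4) of the text; step (3), the transform, is omitted."""
    residual = [t - p for t, p in zip(target, prediction)]
    bits = sum(se_golomb_bits(r) for r in residual)
    state["bits_written"] += bits  # evaluation mutates the encoder state
    return bits


def run_competition(target, candidates, state):
    """Evaluate each candidate, roll the state back, and keep the smallest."""
    best_name, best_bits = None, None
    for name, prediction in candidates.items():
        snapshot = copy.deepcopy(state)
        bits = evaluate(target, prediction, state)
        state.clear()
        state.update(snapshot)  # roll back before the next candidate
        if best_bits is None or bits < best_bits:
            best_name, best_bits = name, bits
    return best_name, best_bits


# Toy one-dimensional "macroblock" with two competing predictions.
target = [10, 12, 11, 9]
candidates = {"h264": [8, 8, 8, 8], "feature_FPA": [10, 11, 11, 10]}
state = {"bits_written": 0}
print(run_competition(target, candidates, state), state)
```

Only after the winner is known would the encoder encode the macroblock again for real, which corresponds to step 372 in the text.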

Information about the winning solution is saved in the encoded stream (step 386) and transmitted or stored for later decoding. This information may include, but is not limited to, the processing options used for the feature-based prediction (for example, tracker type, key calculation, subtiling scheme, and reconstruction algorithm).

In some cases, encoder 360 may determine not only that the target macroblock is not efficiently encoded by H.264 but also that no detected feature overlaps that macroblock. In such a case, the encoder falls back on H.264 to encode the macroblock as a last resort. In an alternative embodiment, a feature-based prediction may still be generated by extending the tracks of the feature tracker so as to create a pseudo-feature that overlaps the macroblock.

In one embodiment, movement among the four levels of FIG. 2A is governed by competition mode.

<Decoding using feature-based prediction>
FIG. 4 shows an example of a decoder according to an embodiment of the invention that can implement model-based prediction within the applicant's EuclidVision codec. Decoder 400 decodes the encoded video bitstream to synthesize an approximation of the input video frame from which the frame encoding 420 was produced. The frame encoding 420 may include a set of parameters that decoder 400 uses to reconstruct the corresponding video frame 418.

Decoder 400 traverses each frame in the same slice order used by the encoder, and traverses each slice in the same macroblock order used by the encoder. Following the encoder's process, the decoder determines for each macroblock 404 whether to decode it conventionally (step 408) or to decode it using feature models and parameters (step 416). If the macroblock was encoded with model-based prediction according to the invention, decoder 400 extracts all of the feature information (feature tracks, feature reference frames [GOP], feature motion vectors) needed to reproduce the prediction for that solution (step 418). The decoder also updates the feature models during decoding (steps 410, 412, 414) so that they remain synchronized with the encoder-side feature state for the frame/slice/macroblock being processed.

In conventional codecs, memory constraints generally prevent the full prediction context of decoded frames from being kept in the frame store 352 and cache 348 of FIG. 3C, so only the frames themselves (pels) are retained. In contrast, the present invention can extend the prediction context held in the frame store 352 and cache 348 of FIG. 3C by giving priority to retaining feature-based models and parameters.

The full set of parameters that describes a feature model is called the feature state. To retain feature models efficiently, this feature state must be isolated. FIG. 5 is a block diagram of the state isolation process 500 for feature instances in an embodiment of the invention. The state isolation information may be associated with a target macroblock, and may include the parameters of the relevant feature instances 502. Such parameters can be useful in encoding the target macroblock. The state isolation information can also be used to interpolate predicted features in subsequent video frames. Each feature instance has a corresponding GOP 504, and each GOP includes corresponding state information (for example, boundary information). The state isolation information for each feature instance may further include state information for any objects associated with the feature instance, state information 506 for the corresponding slice parameters, and state information 508 for the corresponding entropy state. The state information thus describes the boundaries of the GOP/slice/entropy parameters of a feature instance and their extension into new states and new contexts. Using the state information 506, 508, the state of a predicted feature can be predicted and interpolated in subsequent frames.

The macroblock data (pels), together with the state isolation information of the features associated with that macroblock data, form an extended prediction context. Extended contexts from multiple feature instances may be combined with decoded neighbors. The extended prediction context used by encoder 312 of FIG. 3A and decoder 340 of FIGS. 3B and 3C may include, but is not limited to: (1) one or more macroblocks; (2) one or more neighboring macroblocks; (3) slice information; (4) reference frames [GOP]; (5) one or more feature instances; and (6) object/texture information.

<Parametric model-based compression>

<Integration of parametric modeling into the codec framework>
In the hybrid codec described above, feature models are used implicitly to give the encoder hints about good predictions for macroblocks. In contrast, feature models can also be used explicitly in the codec framework. When a particular region of the target frame is represented by a model of a given type (for example, a face model), the representation depends on the parameters of that model. This kind of explicit modeling is referred to below as parametric modeling, whereas the hybrid codec described above uses non-parametric, or empirical, modeling. Because parametric modeling anticipates the presence of a particular type of feature or object (for example, a face), the model usually consists of a set of basis vectors that span the space of all features or objects of that type. The model parameters are then the projections of the target region onto the basis functions.
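As a minimal sketch of what "projection of the target region onto the basis functions" means, the example below computes model parameters as inner products with the basis vectors and resynthesizes the region from those parameters. It assumes an orthonormal basis, which the specification does not require; the two-vector basis and the names `project` and `reconstruct` are hypothetical.

```python
def project(region, basis):
    """Model parameters: inner product of the region with each basis vector."""
    return [sum(r * b for r, b in zip(region, vec)) for vec in basis]


def reconstruct(params, basis):
    """Resynthesize the region as a weighted sum of the basis vectors."""
    n = len(basis[0])
    return [sum(p * vec[i] for p, vec in zip(params, basis)) for i in range(n)]


# Toy orthonormal basis for a 4-pel region (an assumption for illustration).
basis = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
]
region = [4, 4, 2, 2]

params = project(region, basis)
approx = reconstruct(params, basis)
print(params, approx)
```

Here the region lies in the span of the basis, so two parameters reproduce four pels exactly; for a region outside the span, the reconstruction would be an approximation and the remainder would have to be coded as a residual.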

FIG. 6A is a block diagram showing example components of a codec 600 that implements parametric modeling in an alternative embodiment of the invention. As shown in FIG. 6A, codec 600 may include any combination of means 610 for adaptive motion-compensated prediction, means 612 for adaptive motion vector prediction, means 614 for adaptive transform processing, and means 616 for adaptive entropy encoding.

The adaptive motion-compensated prediction means 610 may select reference frames 618 according to whether they contain instances of a feature. If modeling the feature improves compression efficiency, the frames from which the model was derived may be selected as reference frames, and a corresponding GOP may be generated. Interpolation of the motion vector offsets 626 may be performed based on the parameters of the detected feature. In this way, new data pels for the predicted feature instance can be constructed within the range of the discrete set of known data points provided by the detected features. The results of the subtile partitioning process 612 used in conventional encoders are supplemented by the constraints of the deformation variation model 620. The transform process 614 may be carried out with appearance variation modeling 622 constraining the appearance variation parameters. The entropy encoding process 616 may be supplemented by the parameter range/scale analysis 624 and adaptive quantization 628 of codec 600. The resulting macroblock supplementary data 630 is output by codec 600.

<Improving the hybrid codec by adaptive encoding with parametric modeling>
In one variation, parametric modeling can be used to improve the predictions produced by the hybrid codec described above. In one embodiment, elements of the parametric model are applied to a prediction already obtained for the target macroblock (for example, the output of competition mode) to determine whether that prediction can be improved.

FIG. 6B shows an example application of a parametric model-based adaptive encoder 634. The adaptive encoder 634-1 may supplement the encoding performed by a conventional codec (such as H.264) or by the hybrid codec described above. The pel residual 636 produced by the conventional motion-compensated prediction process is analyzed (step 638) to determine whether the deformation and appearance variations in the residual can be modeled more efficiently by a parametric feature model (step 642). In one embodiment, the relative efficiency of the parametric model may be determined by whether the sum of absolute transformed differences (SATD) 640 between the prediction residual 636 and the parametric model 638 decreases. If the parametric model is judged to be an efficient representation, the target region (macroblock) is projected onto the feature model (appearance basis and deformation basis), yielding feature parameters that serve as the encoding of the residual signal.
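For illustration, the SATD test can be sketched with a 4x4 Hadamard transform, which is a common way to compute SATD in block-based encoders. The specification does not state the block size, the normalization, or exactly which two signals are compared; the sketch assumes that the parametric model is considered helpful when subtracting its approximation from the residual lowers the unnormalized SATD. The names `satd4x4` and `model_helps` are hypothetical.

```python
# 4x4 Hadamard matrix (symmetric, entries +1/-1).
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def satd4x4(block):
    """Sum of absolute values of H * block * H^T (no normalization)."""
    h_t = [list(col) for col in zip(*H4)]
    transformed = matmul(matmul(H4, block), h_t)
    return sum(abs(v) for row in transformed for v in row)


def model_helps(residual, model_approx):
    """Assumed criterion: the model helps if it lowers the SATD of the residual."""
    remaining = [
        [r - m for r, m in zip(r_row, m_row)]
        for r_row, m_row in zip(residual, model_approx)
    ]
    return satd4x4(remaining) < satd4x4(residual)


flat_residual = [[1] * 4 for _ in range(4)]
print(satd4x4(flat_residual), model_helps(flat_residual, flat_residual))
```

A constant residual concentrates all of its energy in the first Hadamard coefficient, so a model that captures that constant offset removes the entire SATD in this toy case.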

This embodiment further provides an additional rollback capability for testing whether alternative residual modeling can be applied within the current GOP, slice, and entropy states. For example, reference frames, GOPs, and features (slices) 646 that lie far from the current frame being encoded in the sequence of video frames can be considered as the basis for prediction, which is impractical in conventional encoding. It is even possible to roll back to other video data, such as another video file, if the feature models from that file improve compression.

<Feature-based prediction by interpolation of parametric model parameters>
When multiple instances of the same feature appear in a video stream, it is desirable to preserve the invariant components of the feature model (the components that do not change from frame to frame). In parametric feature modeling, the invariant components are particular parameters of the feature model (for example, the coefficients that weight the various basis functions). In non-parametric (empirical) feature modeling, the invariant components are generally the feature pels themselves. Preserving the invariant components of the model may serve as the guiding principle for feature motion estimation and compensation (referred to below as the "invariance principle").

FIG. 6C is a block diagram showing how, in an embodiment of the invention, motion-compensated prediction of a feature is performed by interpolating feature model parameters under the guidance of the invariance principle. As shown in FIG. 6C, the motion-compensated prediction process 668 begins with a normalization process that adjusts the model parameters of multiple feature instances about an invariant instance of those parameters. A collection 670 of feature instances ("matched macroblocks") can be used to generate several types of interpolation functions (674, 676, 678, 680) for normalizing the instances about the invariant instance. The invariant instance 682 of the model parameters may be defined as the set of model parameter values at a key frame. Such an invariant instance can represent most, if not all, of the predictions/patterns in the feature-based model. The invariant instance is conceptually similar to the centroid of the vector space formed by the vectors of the instances' appearance parameters.

The invariant instance 682 can serve as the key pattern when the target position 684 is extrapolated using one of the interpolation functions (674, 676, 678, 680). This interpolation/extrapolation process can be used to predict the position within the frame, the appearance variation, and the deformation variation of the feature in the target frame. Combining this invariant representation of a feature with the compact parametric form of the feature instances dramatically reduces, compared with conventional compression, the amount of memory needed to cache the appearance and deformation of the features contained in the source reference frames. In other words, the feature model concisely captures the data in the frame that is important and useful for compression.

In an alternative embodiment, the feature model parameters of two or more feature instances can be used to predict the state of the target region, given the temporal distance between the reference frames in which those instances appeared and the current (target) frame. In this case, the feature parameters of the target region are predicted by extrapolating the two or more sets of feature parameters under the invariance principle, based on a given state model and time step. The state model may be linear or of higher order (for example, an extended Kalman filter).
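For the simplest case of a linear state model and exactly two feature instances, the extrapolation described above reduces to the sketch below. The function name `extrapolate_params` and the example parameter values are hypothetical; higher-order state models such as an extended Kalman filter are not shown.

```python
def extrapolate_params(p1, t1, p2, t2, t_target):
    """Linearly extrapolate feature parameters to frame t_target.

    p1 and p2 are the parameter vectors of the same feature observed in
    reference frames t1 and t2; each parameter is assumed to evolve
    linearly in time.
    """
    if t1 == t2:
        raise ValueError("the two reference frames must differ in time")
    w = (t_target - t1) / (t2 - t1)
    return [a + w * (b - a) for a, b in zip(p1, p2)]


# Feature observed at frames 0 and 4; predict its parameters at frame 6.
predicted = extrapolate_params([1.0, 10.0], 0, [2.0, 8.0], 4, 6)
print(predicted)
```

With `t_target` between `t1` and `t2` the same function interpolates instead of extrapolating, so one routine covers both uses mentioned in the text.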

<Cache organization of and access to feature model information>
While feature models are being generated, multiple instances of the same feature are often found in the video. Organizing the feature model information before it is cached allows that information to be stored or cached efficiently. This technique applies to both parametric and non-parametric model-based compression schemes.

For example, in FIG. 3C the cache 348 (including the frame store 352) can be configured to hold feature-based modeling prediction context information whenever that information is judged to improve compression efficiency. Attempting to access feature-based prediction context information that is not in the cache incurs overhead and can degrade the responsiveness and decision-making performance of the system. Keeping the prediction context of previously processed feature-based encodings in the cache limits that overhead and reduces how often the data underlying the feature-based prediction context must be accessed.

As an example, the cache of encoder 312/decoder 340 (FIGS. 3A, 3C) may be one configured to improve the speed and efficiency of video processing. Video processing performance can depend on whether the feature-based encoding prediction data can be stored in the cache close to the encoded video data, even when that encoded video data is not spatially close to the frames from which the prediction data was derived. Cache proximity can affect access latency, operational delay, and data transfer time. For example, storing feature data from many frames in a small amount of physical memory and accessing it in that form is far more efficient than storing the frames from which those features were derived in permanent storage and accessing them there. Encoder 312/decoder 340 (FIGS. 3A, 3C) may also include a configurator that stores the prediction data in the cache so that the feature-based prediction context information in the cache/buffer/frame store is readily accessible when a macroblock or frame is decoded.

In certain embodiments of the invention, the cache is first extended by defining two types of feature correlation for the decoded frames, that is, two kinds of cached data: local decoded data and non-local decoded data. The local cache may be a set of decoded frames that are accessible in batches (groups of frames), where the detected features determine which frames make up each group. The local cache is driven by the features detected in the current frame, and it is used more heavily when the current frame/macroblock contains few "strong" feature models (models with a long history). Local cache processing is based on batch motion-compensated prediction, with the groups of frames stored in reference frame buffers. FIG. 7A is a block diagram outlining an example cache architecture 710-1 according to an embodiment of the invention. The cache access architecture 710-1 includes a decision process 710 that chooses between local cache access 712 (716, 718, 720, 722, 724) and long-term (non-local) cache access 714 (726, 728, 730, 732). If most of the features are local (step 712), for example when the current frame/macroblock contains few "strong" feature models, local cache processing is performed (step 718).

FIG. 7B is a block diagram showing the processing involved in using local (short-term) cache data 734. The local cache may be a set of decoded frames that are accessible in batches (groups of frames), where the detected features determine which frames make up each group. The local cache 734 of FIG. 7B groups only "short history" features, that is, features whose tracks span only a small number of frames. The aggregate set of frames covered by these short-history features defines the joint frame set 738 for those features. The priority of the frames within the joint frame set 738 may be determined by the complexity of the tracks associated with each frame. In one embodiment, that complexity may be measured by the cost of encoding the features with a base encoding process such as H.264. In FIGS. 3B, 3C, 7A, and 7B, the local cache may be held in the frame store 352 or the cache buffer 348. The locally cached frames are used in step 720. Next, a GOP/batch 742 based on the detected feature instances is used at 722. That GOP/batch 742 can then be tested as reference frames for the motion-compensated prediction process (step 724). Motion-compensated prediction performed in this way can be regarded as "biased" toward the feature tracking information, because motion compensation uses the frames in which the feature instances were detected as its reference frames. An additional rollback is also provided to test whether residual modeling is possible within the GOP/batch, slice, and entropy states (step 746). This allows reference frames that lie far from the current frame being encoded in the video frame sequence to be evaluated efficiently.
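The construction of the joint frame set from short-history feature tracks can be sketched as follows. The threshold `max_track_len`, the per-frame cost table, and the rule that costlier frames are tried first are assumptions made for illustration; the specification says only that priority may depend on track complexity, measured for example by H.264 encoding cost.

```python
def joint_frame_set(tracks, max_track_len):
    """Union of the frames covered by short-history feature tracks.

    `tracks` maps a feature id to the list of frame numbers it spans.
    Tracks longer than `max_track_len` are treated as long-history and
    left to the non-local cache.
    """
    frames = set()
    for track in tracks.values():
        if len(track) <= max_track_len:
            frames.update(track)
    return frames


def prioritize(frames, cost):
    """Order frames by assumed encoding cost, highest first (ties by frame number)."""
    return sorted(frames, key=lambda f: (-cost.get(f, 0), f))


tracks = {
    "a": [3, 4, 5],           # short history
    "b": [5, 6],              # short history
    "c": list(range(0, 40)),  # long history, excluded here
}
cost = {3: 120, 4: 300, 5: 300, 6: 80}  # hypothetical per-frame feature cost

frames = joint_frame_set(tracks, max_track_len=5)
print(prioritize(frames, cost))
```

The ordered list would then be the batch of candidate reference frames tested by motion-compensated prediction, as in steps 722 and 724.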

Thus, in certain embodiments of the invention, past frames can be analyzed to determine which frames have the highest probability of providing a match for the current frame. Moreover, the number of reference frames can be far greater than the typical limit of 1 to 16 reference frames in conventional compression. Depending on system resources, the number of reference frames may even reach the limit of system memory, provided that enough of them contain useful matches. In addition, the intermediate form of the data generated by the invention can reduce the amount of memory needed to store the same number of reference frames.

Referring again to FIG. 7A, most features with a long history 726 are stored in the non-local (long-term) cache. The non-local cache is based on two cache access methods, "frame" and "retained". In "frame" access, the frames are accessed directly to generate the feature models used to encode the current frame. In "retained" mode, the decoded data is not accessed directly; instead, use is made of feature models retained as data previously derived from the decoded frames (the feature models in those decoded frames and the parameters of the instances in those models). "Retained" mode can therefore synthesize the same data as "frame" mode. Specifically, the models of the feature instances are accessed (step 728), the reference frames are accessed (step 730), and the optimal combination of reference frames and models is marked (step 732). The criterion for optimality may use intermediate feature information of the feature model in each reference frame, including feature strength and feature bandwidth.

The long-term cache 714 may consist of any decoded (or encoded) data, preferably data that is accessible in the decoder state. The long-term cache 714 may include, for example, reference frames/GOPs, which are generally multiple frames preceding the current frame being encoded. Beyond such frame combinations, the decoder-side long-term cache can store any combination of decoded frames that is available for decoding the current frame.

FIG. 7C is a block diagram illustrating the processing involved in using long-term cache data. The long-term (non-local) cache 748 has a longer-range cache architecture. The long-term cache is initialized from the local cache (step 750) when a feature is determined to have a long history (step 752), that is, when instances of the detected feature have occurred repeatedly and the feature's correspondence model can therefore be applied repeatedly. The process then determines which retention mode to use (step 754). The non-local cache has two modes, "retained" 760 and "non-retained" 756. In "non-retained" mode 756, the conventional motion compensated prediction process is augmented with predictions based on the feature model (similar to the use of implicit modeling in the hybrid codec described above). "Non-retained" mode 756 therefore obtains valid predictions by accessing the reference frames (reference numeral 758). "Retained" mode differs from "non-retained" mode in that it uses predictions obtained explicitly from the feature model (steps 762, 766). In "retained" mode, the prediction space is therefore necessarily limited to the feature data that can be synthesized using the feature model. The feature model may also contain the instance parameters of feature instances in past frames (equivalent to the pels contained in those past frames). Interpolating the functions that describe these parameters provides predictions to the motion compensated prediction process and supports frame synthesis (step 764).
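The interpolation of instance parameters in step 764 might be sketched as follows. This is a minimal sketch that assumes each instance is described by a tuple of numeric parameters (here a hypothetical x, y position) and that linear interpolation between the two nearest tracked frames is adequate; the specification does not prescribe a particular interpolating function.

```python
def interpolate_parameters(track, frame_index):
    """Linearly interpolate instance parameters for an untracked frame.

    track maps a frame index to a tuple of numeric instance parameters.
    Linear interpolation is an assumption made for this example.
    """
    if frame_index in track:
        return tuple(track[frame_index])
    frames = sorted(track)
    earlier = [f for f in frames if f < frame_index]
    later = [f for f in frames if f > frame_index]
    if not earlier or not later:
        raise ValueError("frame_index lies outside the tracked range")
    f0, f1 = earlier[-1], later[0]
    t = (frame_index - f0) / (f1 - f0)
    return tuple(a + t * (b - a) for a, b in zip(track[f0], track[f1]))


# Hypothetical track: the feature was observed in frames 10 and 14.
track = {10: (100.0, 40.0), 14: (108.0, 48.0)}
midpoint = interpolate_parameters(track, 12)
```

The interpolated parameters could then be handed to the prediction process as a starting point, for example as the expected position of the feature in the frame being synthesized.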

Some embodiments of the present invention that use feature ensembles encode using feature information stored in the cache. In such embodiments, a subset of the feature ensemble is used to represent (model) the entire ensemble. As described above, such a subset can be selected using, for example, SVD. The selected subset of feature instances then serves as a basis for the ensemble and can be stored in the cache and used to encode the corresponding feature whenever it appears in subsequent frames of the same video (or of other videos). Such a subset of feature instances models the feature compactly and accurately.
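The idea of representing an ensemble by a compact basis can be illustrated with a deliberately simplified sketch. Instead of a full SVD, the code below uses power iteration to approximate only the dominant right singular vector of the matrix whose rows are vectorized feature instances; this substitution, the one-dimensional basis, and all function names are assumptions made for the example. Power iteration can also fail to converge to the dominant direction if the starting vector happens to be orthogonal to it.

```python
import math


def dominant_basis_vector(instances, iterations=100):
    """Approximate the first right singular vector of the instance matrix.

    instances is a list of equal-length lists, one vectorized feature
    instance per row. Power iteration stands in for a full SVD here.
    """
    dim = len(instances[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        # Compute w = A^T (A v) without forming A^T A explicitly.
        projections = [sum(a * b for a, b in zip(row, v)) for row in instances]
        w = [sum(p * row[j] for p, row in zip(projections, instances))
             for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("instances carry no energy along the start vector")
        v = [x / norm for x in w]
    return v


def encode(instance, basis_vector):
    """Project an instance onto the basis vector, giving one coefficient."""
    return sum(a * b for a, b in zip(instance, basis_vector))


def reconstruct(coefficient, basis_vector):
    """Rebuild an approximation of the instance from its coefficient."""
    return [coefficient * b for b in basis_vector]


# Hypothetical ensemble: three instances that differ only in scale.
ensemble = [[3.0, 4.0], [6.0, 8.0], [1.5, 2.0]]
basis = dominant_basis_vector(ensemble)
coefficient = encode([3.0, 4.0], basis)
```

Once the basis is cached, a new instance of the feature is described by its coefficient alone, which is the source of the compactness. A practical system would keep several basis vectors rather than one.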

<Digital processing environment and communication network>
Embodiments of the present invention can be implemented in a software, firmware, or hardware environment. FIG. 8A shows one such environment as an embodiment. One or more client computers/devices 810 and a cloud (or a server computer or cluster thereof) 812 provide the processing, storage, and input/output devices for executing application programs. The client computers/devices 810 can connect through a communication network 816 to other computing devices, including other client devices/processes 810 and one or more other server computers 812. The communication network 816 can be part of a remote access network, a global network (e.g., the Internet), a worldwide collection of computers, a local area network, a wide area network, or gateways that communicate with one another using various protocols (TCP/IP, Bluetooth (registered trademark), etc.). Other electronic device/computer network architectures can also be used.

FIG. 8B is a diagram of the internal structure of a given computer/computing node (e.g., client processor/device 810 or server computer 812) in the processing environment of FIG. 8A. Each computer 810, 812 contains a system bus 834, which is a set of real or virtual hardware lines used for data transfer among the components of a computer or processing system. The bus 834 is essentially a shared conduit that connects the different elements of a computer system (e.g., processor, disk storage, memory, input/output ports) and enables them to exchange information. Attached to the system bus 834 is an I/O device interface 818 for connecting various input and output devices (e.g., keyboard, mouse, displays, printers, speakers) to the computer 810, 812. A network interface 822 allows the computer 810, 812 to connect to various other devices attached to a network (e.g., network 816 of FIG. 8A). Memory 830 provides volatile storage for the computer software instructions 824 and data 828 used to implement an embodiment of the present invention (e.g., a codec or video encoder/decoder). Disk storage 832 provides non-volatile storage for the computer software instructions 824 (equivalently, the "OS program" 826) and data 828 used to implement an embodiment of the present invention; it can also be used to store video in compressed format for the long term. A central processing unit 820, which executes computer instructions, is also attached to the system bus 834. Throughout this specification, "computer software instructions" and "OS program" are equivalent.

In one embodiment, the processor routines 824 and data 828 are a computer program product (generally referenced 824) that provides at least a portion of the software instructions for the system of the invention. The computer program product 824 includes a computer readable medium that can be stored on a storage device 828. The computer program product 824 can be installed by any suitable software installation procedure known in the art. In another embodiment, at least a portion of the software instructions may be downloaded over a cable, communication, and/or wireless connection. In yet other embodiments, the programs of the invention are a computer program propagated signal product 814 (FIG. 8A) embodied on a propagated signal on a propagation medium (e.g., a radio wave, an infrared wave, a laser wave, a sound wave, or an electrical wave propagated over a global network such as the Internet or over other networks). Such carrier media or signals provide at least a portion of the software instructions for the routines/programs 824, 826 of the invention.

In alternative embodiments, the propagated signal is an analog carrier wave or a digital signal carried on the propagation medium. For example, the propagated signal may be a digital signal carried over a global network (e.g., the Internet), a telecommunications network, or another network. In one embodiment, the propagated signal is transmitted over the propagation medium for a period of time, such as the instructions for a software application sent in packets over a network over a period of milliseconds, seconds, minutes, or longer. In another embodiment, the computer readable medium of the computer program product 824 is a propagation medium that the computer system 810 can receive and read, for example by receiving the propagation medium and identifying a propagated signal embodied in it, as described above for the computer program propagated signal product.

<Feature-based display tool>
FIG. 8C is a screenshot 840 of a feature-based display tool in one implementation. Screenshot 840 shows a frame of video with features identified by boxes 842. The video frame sequence context for the frame is identified at 844. The features 842 are tracked through the frames 844, creating several feature sets that are displayed in section 846 of the display. Each feature set 846 contains a plurality of feature members (feature instances). A data area displays the feature bandwidth 852, which is the number of bits required to encode a given feature with conventional compression. The same data area also displays an indication of the feature detection process (reference numeral 850). The tool can display all features and feature tracks identified in the subject video.

A face tracker that is biased toward faces may be used to facilitate face detection, and face detection may in turn be used to group features together. FIG. 8E is a screenshot 860-02 in which a face 864 has been designated by the face tracker. FIG. 8D is a screenshot 860-01 in which both face and non-face features are marked with numbers 862. In this example, the numbers in FIG. 8D represent the length of tracking of each feature across frames. Grouping features based on the face bias yields a model that can be used to encode the multiple macroblocks that overlap each face.
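Grouping features by a detected face might look like the following sketch. It assumes, purely for illustration, that each feature is reduced to a single (x, y) position and that the face tracker reports an axis-aligned box; neither representation is taken from the specification.

```python
def group_features_by_face(features, face_box):
    """Split feature ids into those inside and outside a face box.

    features maps a feature id to an (x, y) position in pixels.
    face_box is (left, top, right, bottom), inclusive on all sides.
    Both representations are assumptions made for this example.
    """
    left, top, right, bottom = face_box
    inside, outside = [], []
    for feature_id, (x, y) in sorted(features.items()):
        if left <= x <= right and top <= y <= bottom:
            inside.append(feature_id)
        else:
            outside.append(feature_id)
    return inside, outside


# Hypothetical feature positions and face box.
features = {1: (50, 60), 2: (200, 40), 3: (70, 90)}
face_features, other_features = group_features_by_face(features, (40, 50, 100, 120))
```

The features collected in `face_features` would then be modeled together, so that one model serves all macroblocks that overlap the face.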

Instead of strictly following the H.264 encoder process, the face model described above may be used to encode all pels/pixels in the region of interest. Applying the face model directly eliminates the need for additional biasing and allows past reference frames to be selected without using H.264. The face is generated based on the feature correspondence models, and lower-level processing is then used to encode the residual.

<Digital rights management>
In some embodiments, the models of the present invention can be used to control access to the encoded digital video. For example, without the relevant models, a user cannot play back the video file. One implementation of this approach is described in US patent application Ser. No. 12/522,357, filed January 4, 2008, the entire teachings of which are incorporated herein by reference. The models can be used to "lock" the video and can serve as a key for accessing the video data. Playback of the encoded video data can depend on the models. With this approach, the encoded video data cannot be read without access to the models.

Controlling access to the models controls access to playback of the content. This scheme can provide a user-friendly, developer-friendly, and efficient solution for restricting access to video content.

The models can also be used to unlock the content progressively. With a given version of the models, the encoding can be decoded only up to a certain level. As progressively more complete models are obtained, the whole video is eventually unlocked. The initial unlocked state might include only thumbnails of the video, giving the user the opportunity to decide whether they want the full video. A user who wants a standard-definition version obtains the next version of the models. A user who wants high-definition or cinema quality downloads a more complete version of the models. The models are encoded without redundancy so that video quality can be realized progressively, commensurate with encoding size and quality.
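The progressive unlocking described above can be pictured with a small sketch. The tier names, their order, and the integer model version are hypothetical; the specification names thumbnail, standard-definition, high-definition, and cinema quality but does not define how model versions are numbered.

```python
# Hypothetical quality tiers, ordered from least to most complete model.
QUALITY_TIERS = ["thumbnail", "standard", "high_definition", "cinema"]


def decodable_qualities(model_version):
    """Return the tiers that a model of the given version can decode.

    Version 0 stands for "no model": nothing can be played back.
    The mapping of versions to tiers is an assumption for this example.
    """
    if model_version < 0:
        raise ValueError("model_version must be non-negative")
    return QUALITY_TIERS[:min(model_version, len(QUALITY_TIERS))]


def can_play(model_version, requested_quality):
    """Check whether playback at the requested quality is unlocked."""
    return requested_quality in decodable_qualities(model_version)
```

A player built along these lines would call `can_play` before decoding and prompt the user to obtain a more complete model when the requested quality is still locked.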

<Flexible macroblock ordering and scalable video coding>
Exemplary embodiments of the invention may extend conventional encoding/decoding processes to improve the encoding process and realize compression benefits. In one embodiment, the present invention may be applied with flexible macroblock ordering (FMO) and scalable video coding (SVC), which are themselves extensions to the basic H.264 standard.

FMO allocates the macroblocks of a coded frame to one of several types of slice groups. The allocation is determined by a macroblock allocation map, and macroblocks within a slice group need not be contiguous. FMO is advantageous for error resilience because slice groups are decoded independently of one another: if one slice group is lost during transmission of the bitstream, the macroblocks in that slice group can be reconstructed from neighboring macroblocks in other slices. In one embodiment of the invention, feature-based compression is integrated into the "foreground and background" macroblock allocation map type of FMO. Macroblocks associated with features make up the foreground slice groups, and all other macroblocks (those not associated with features) make up the background slice groups.
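A foreground/background macroblock allocation map of the kind described above might be built as in the following sketch. The box representation of a feature, the slice group ids (0 for foreground, 1 for background), and the function name are assumptions for illustration; the sketch does not reproduce the H.264 syntax for signaling the map.

```python
MB_SIZE = 16  # H.264 macroblocks cover 16x16 luma samples


def foreground_background_map(width, height, feature_boxes):
    """Build a per-macroblock map of slice group ids.

    feature_boxes is a list of (left, top, right, bottom) pixel boxes
    with exclusive right and bottom edges. A macroblock that overlaps
    any box is marked 0 (foreground); all others are 1 (background).
    The ids and the box format are assumptions made for this example.
    """
    cols = (width + MB_SIZE - 1) // MB_SIZE
    rows = (height + MB_SIZE - 1) // MB_SIZE
    mb_map = [[1] * cols for _ in range(rows)]
    for left, top, right, bottom in feature_boxes:
        first_row = max(top // MB_SIZE, 0)
        last_row = min((bottom - 1) // MB_SIZE, rows - 1)
        first_col = max(left // MB_SIZE, 0)
        last_col = min((right - 1) // MB_SIZE, cols - 1)
        for r in range(first_row, last_row + 1):
            for c in range(first_col, last_col + 1):
                mb_map[r][c] = 0
    return mb_map


# Hypothetical 64x48 frame with one feature box.
mb_map = foreground_background_map(64, 48, [(20, 10, 40, 30)])
```

For this frame the feature box touches the two middle macroblock columns of the top two macroblock rows, so those four macroblocks form the foreground and the remaining eight form the background.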

SVC provides multiple encodings of video data at different bitrates. A base layer is encoded at a low bitrate, and one or more enhancement layers are encoded at higher bitrates. Decoding an SVC bitstream may involve only the base layer (for low-bitrate/low-quality applications) or some or all of the enhancement layers as well (for higher-bitrate/higher-quality applications). Because the substreams of an SVC bitstream are themselves valid bitstreams, SVC increases flexibility across application scenarios, including decoding of the SVC bitstream by multiple devices (at different qualities depending on device capabilities) and decoding in environments with varying channel throughput, such as Internet streaming.
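How a receiver might decide which layers of a scalable stream to decode can be sketched as follows. The per-layer bitrates, the throughput figure, and the greedy rule are assumptions for illustration only; real SVC extraction operates on the bitstream's layer identifiers rather than on a simple list.

```python
def select_layers(layer_bitrates_kbps, available_kbps):
    """Return how many layers fit into the available throughput.

    layer_bitrates_kbps lists the incremental bitrate of each layer,
    base layer first. Layers must be taken in order because each
    enhancement layer depends on the layers below it. A result of 0
    means that even the base layer does not fit.
    """
    total = 0
    count = 0
    for rate in layer_bitrates_kbps:
        if total + rate > available_kbps:
            break
        total += rate
        count += 1
    return count


# Hypothetical stream: base layer plus two enhancement layers.
layers = [500, 1000, 2500]
layers_for_mobile = select_layers(layers, 1600)
```

With these example figures a 1600 kbps channel carries the base layer and the first enhancement layer, while the second enhancement layer is dropped.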

SVC processing generally offers three types of scalability: temporal, spatial, and quality. In one embodiment of the invention, feature-based compression is integrated into a quality-scalability implementation by including the feature-based primary predictions in the base layer (see the section above entitled "Generation of model-based primary and secondary predictions"). The coded frames in the base layer can then serve as reference frames for coding in the enhancement layer, where feature-based secondary predictions can be used. In this way, information from feature-based predictions is added to the encoding incrementally rather than all at once. In an alternative embodiment, all feature-based predictions (primary and secondary) are moved to the enhancement layers, and only conventional predictions are used in the base layer.

Those skilled in the art will understand that the illustrated data paths/execution paths and components are merely exemplary, and that the operation and arrangement of the components, and the flow of data to and from them, may vary depending on the implementation and the type of video data being compressed. Accordingly, any arrangement of data modules/data paths may be used.

While the invention has been particularly shown and described with reference to example embodiments, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the invention encompassed by the appended claims.

Claims

1. A method of processing video data, comprising:
detecting at least one of features and objects in a region of interest within at least one frame using a detection algorithm;
modeling the detected at least one of features and objects using a set of parameters;
correlating every instance of the detected at least one of features and objects across a plurality of frames;
forming at least one track of the correlated instances;
associating the at least one track with at least one block of video data to be encoded; and
generating a model-based prediction for the at least one block of video data using the associated track information, including storing the model-based prediction as processed video data.

2. The method of processing video data according to claim 1, wherein the detection algorithm belongs to a class of non-parametric feature detection algorithms.

3. The method of processing video data according to claim 1, wherein the set of parameters includes information about the at least one of features and objects and is stored in a memory.

4. The method of processing video data according to claim 3, wherein the parameters of a feature include a feature descriptor vector and a position of the feature.

5. The method of processing video data according to claim 4, wherein the parameters are generated when the feature is detected.

6. The method of processing video data according to claim 1, wherein the at least one block of video data is a macroblock, and the at least one track associates a feature with the macroblock.

7. A method of processing video data, comprising:
detecting at least one of features and objects in a region of interest;
modeling the at least one of features and objects using a set of parameters;
correlating every instance of the at least one of features and objects across a plurality of frames;
forming at least one matrix of the correlated instances;
associating the at least one matrix with at least one block of video data to be encoded; and
generating a model-based prediction for the at least one block of video data using the associated matrix information, including storing the model-based prediction as processed video data.

8. The method of processing video data according to claim 7, wherein the set of parameters includes information about the at least one of features and objects and is stored in a memory.

9. The method of processing video data according to claim 8, wherein the parameters of a feature include a feature descriptor vector and a position of the feature.

10. The method of processing video data according to claim 9, wherein the parameters are generated when the feature is detected.

11. The method of processing video data according to claim 7, further comprising:
using at least one subspace of a vector space to organize the at least one matrix as a parametric model of the correlated at least one of features and objects.

12. A codec for processing video data, comprising:
feature-based detection means for identifying instances of a feature in at least two video frames, each identified instance of the feature having a plurality of pixels that exhibit greater data complexity than other pixels in the one or more video frames;
modeling means, operatively coupled to the feature-based detection means, for generating a feature-based correspondence model that models the correspondence of the instances of the feature in two or more video frames; and
a cache that prioritizes use of the feature-based correspondence model when encoding the instances of the feature using the feature-based correspondence model is determined to provide better compression efficiency than encoding the instances of the feature using a first video encoding process.

13. The codec according to claim 12, wherein the data complexity is determined when the encoding of the pixels by a conventional video compression technique exceeds a predetermined threshold.

14. The codec according to claim 12, wherein the data complexity is determined when the bandwidth allocated to encoding the feature by a conventional video compression technique exceeds a predetermined threshold.

15. The codec according to claim 14, wherein the predetermined threshold is at least one of: a preset value, a preset value stored in a database, a value set as the average bandwidth allocated to previously encoded features, and a value set as the median bandwidth allocated to previously encoded features.

16. The codec according to claim 12, wherein the first video encoding process includes a motion compensated prediction process.

17. The codec according to claim 12, wherein the prioritization of use is determined by comparing the encoding costs of candidate solutions in a competition mode, each candidate solution including a tracker, a primary prediction motion model, a primary prediction sampling scheme, a subtiling scheme, a reconstruction algorithm, and (possibly) a secondary prediction scheme.

18. The codec according to claim 17, wherein, when use of the feature-based modeling is prioritized, the level of data complexity of the instance of the feature is used as the threshold, such that the encoder automatically determines to initiate and use feature-based compression for a subsequent instance of the feature when that subsequent instance exhibits a level of data complexity above the threshold.

19. The codec according to claim 12, wherein the feature-based detection means uses one of an FPA tracker, an MBC tracker, and a SURF tracker.

20. A codec for processing video data, comprising:
feature-based detection means for identifying instances of a feature in at least two video frames, each identified instance of the feature having a plurality of pixels that exhibit greater data complexity than other pixels in at least one of the at least two video frames;
modeling means, operatively coupled to the feature-based detection means, for generating feature-based correspondence models that model the correspondence of the identified instances of the feature in the at least two video frames; and
a memory that prioritizes use of a given one of a plurality of the feature-based correspondence models when that model is determined to improve the compression efficiency of the identified instances of the feature.

21. The codec according to claim 20, wherein the improvement in compression efficiency of the identified instances of the feature is determined by comparing the compression efficiency of the identified instances of the feature with one of: the compression efficiency of encoding the instances of the feature using a first video encoding process, and a predetermined compression efficiency value stored in a database.

22. A method of processing video data, comprising:
modeling a feature by vectorizing at least one of feature pels and feature descriptors;
identifying similar features by at least one of (a) minimizing the mean squared error (MSE) between different feature pel vectors or feature descriptors and (b) maximizing the inner product between different feature pel vectors or feature descriptors; and
applying a standard motion estimation and compensation algorithm to account for translational motion of the feature, thereby obtaining processed video data.

23. A method of processing video data, comprising:
implementing model-based prediction by configuring a codec to encode a target frame;
encoding a macroblock in the target frame using a conventional encoding process;
analyzing the encoding of the macroblock, wherein the conventional encoding of the macroblock is determined to be at least one of efficient and inefficient, and wherein, when the conventional encoding is determined to be inefficient, the encoder is analyzed by generating a plurality of predictions for the macroblock based on a plurality of models, the plurality of predictions for the macroblock being evaluated based on encoding size; and
ranking the predictions for the macroblock together with the conventionally encoded macroblock.

24. The method of processing video data according to claim 23, wherein the conventional encoding of the macroblock is efficient when the encoding size is smaller than a predetermined size threshold.

25. The method of processing video data according to claim 23, wherein the conventional encoding of the macroblock is efficient when the target macroblock is a skip macroblock.

26. The method of processing video data according to claim 23, wherein the conventional encoding of the macroblock is inefficient when the encoding size is larger than a threshold.

27. The method of processing video data according to claim 23, wherein, when the conventional encoding of the macroblock is determined to be inefficient, a plurality of encodings for the macroblock are generated in a competition mode and their compression efficiencies are compared with one another.

28. The method of processing video data according to claim 27, wherein the competition mode encoding algorithm comprises:
subtracting the prediction from the macroblock to generate a residual signal;
transforming the residual signal using an approximation of a block-based two-dimensional DCT; and
encoding the transform coefficients using an entropy encoder.

29. The method of processing video data according to claim 23, wherein the encoder analyzed by generating a plurality of predictions generates a composite prediction that sums a primary prediction and a weighted secondary prediction.

A method of processing video data, comprising:
a process of modeling data at a plurality of fidelities for model-based compression, the plurality of fidelities including at least one of a macroblock hierarchy, a macroblocks-as-features hierarchy, a feature hierarchy, and an object hierarchy,
wherein:
the macroblock hierarchy uses a block-based motion estimation and compensation (BBMEC) application to find a prediction for each tile from a limited search space in a decoded reference frame;
the macroblocks-as-features hierarchy (i) finds a first prediction of the target macroblock from the most recent reference frame using the same first BBMEC application as the macroblock hierarchy, (ii) finds a second prediction for the first prediction by searching the second most recent reference frame using a second BBMEC application, and (iii) generates a track of the target macroblock by applying the BBMEC application progressively through past frames;
the feature hierarchy detects and tracks features independently of the macroblock grid, associates each feature with the macroblocks that overlap the feature, and uses the feature track to navigate the decoded reference frames and find a good match for the overlapping macroblock, and, when a plurality of features overlap a given target macroblock, the feature with the greatest overlap is selected to model the target macroblock; and
in the object hierarchy, when an object contains or overlaps a plurality of macroblocks, a single motion vector can be calculated for all macroblocks corresponding to the object, thereby saving encoding size.
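Two of the rules in claim 30 lend themselves to a short sketch: selecting the feature with the greatest overlap for a target macroblock (feature hierarchy), and listing the macroblocks that would share a single motion vector because they overlap one object (object hierarchy). Axis-aligned rectangles, the 16-pixel macroblock size, and all names below are illustrative assumptions.

```python
from typing import Dict, List, NamedTuple, Optional


class Rect(NamedTuple):
    x: int
    y: int
    w: int
    h: int


def overlap_area(a: Rect, b: Rect) -> int:
    """Area of the intersection of two axis-aligned rectangles."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return dx * dy if dx > 0 and dy > 0 else 0


def select_feature(macroblock: Rect, features: Dict[str, Rect]) -> Optional[str]:
    """Return the name of the feature with the greatest overlap, or None."""
    best = max(features, key=lambda n: overlap_area(macroblock, features[n]), default=None)
    if best is None or overlap_area(macroblock, features[best]) == 0:
        return None
    return best


def macroblocks_of_object(obj: Rect, frame_w: int, frame_h: int, mb: int = 16) -> List[Rect]:
    """Macroblocks on the grid that the object contains or overlaps."""
    return [
        Rect(x, y, mb, mb)
        for y in range(0, frame_h, mb)
        for x in range(0, frame_w, mb)
        if overlap_area(Rect(x, y, mb, mb), obj) > 0
    ]


target = Rect(16, 16, 16, 16)
features = {"a": Rect(0, 0, 20, 20), "b": Rect(20, 20, 30, 30)}
chosen = select_feature(target, features)

# One motion vector (hypothetical value) shared by every macroblock of the object.
object_mv = (3, -1)
shared = {mb: object_mv for mb in macroblocks_of_object(Rect(10, 10, 20, 20), 64, 64)}
```

Storing one vector per object rather than one per macroblock is what yields the saving in encoding size that the claim describes.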

The video data processing method according to claim 30, wherein the plurality of fidelities are sequentially examined.

The video data processing method according to claim 30, wherein the plurality of fidelities are examined in a competition mode.

A computer program product comprising program code means,
wherein the program code means, when loaded into a computer, controls the computer to execute the processing method according to claim 1.

A computer program product comprising program code means,
wherein the program code means, when loaded into a computer, controls the computer to execute the processing method according to claim 7.

A computer program product comprising program code means,
wherein the program code means, when loaded into a computer, controls the computer to execute the processing method according to claim 22.

A computer program product comprising program code means,
wherein the program code means, when loaded into a computer, controls the computer to execute the processing method according to claim 23.

A computer program product comprising program code means,
wherein the program code means, when loaded into a computer, controls the computer to execute the processing method according to claim 30.