JP6698077B2 - Perceptual optimization for model-based video coding

Info

Publication number: JP6698077B2
Application number: JP2017513750A
Authority: JP
Inventors: リー・ニゲル; パーク・サンソク; トゥン・ミョー; コッケ・デーン・ピー; リー・ジェユン; ウィード・クリストファー
Original assignee: Euclid Discoveries LLC
Current assignee: Euclid Discoveries LLC
Priority date: 2014-09-11
Filing date: 2015-09-03
Publication date: 2020-05-27
Anticipated expiration: 2035-09-03
Also published as: WO2016040116A1; EP3175618A1; CA2960617A1; JP2017532858A; CN106688232A

Description

Related application

This application claims the benefit of U.S. Provisional Patent Application No. 62/158,523, filed May 7, 2015, and U.S. Provisional Patent Application No. 62/078,181, filed November 11, 2014. This application is also a continuation-in-part (CIP) of U.S. Patent Application No. 14/532,947, filed November 4, 2014, which claims the benefit of U.S. Provisional Patent Application No. 61/950,784, filed March 10, 2014, and U.S. Provisional Patent Application No. 62/049,342, filed September 11, 2014. The entire teachings of the above-referenced applications are incorporated herein by reference.

Video compression can be regarded as the process of representing digital video data in a form that uses fewer bits when stored or transmitted. Video coding can achieve compression by exploiting spatial, temporal, or color-space redundancy in the video data. Typically, a video compression process divides the video data into portions, such as groups of frames and groups of pels, identifies regions within the video that are redundant, and represents those regions with fewer bits than the original video data requires. Exploiting more of this redundancy yields greater compression. An encoder can be used to convert the video data into an encoded format, and a decoder can be used to convert the encoded video back into a form comparable to the original video data. An implementation of the encoder/decoder is referred to as a codec.

For encoding, a standard encoder divides a given video frame into non-overlapping coding units, or macroblocks (rectangular regions composed of multiple contiguous pels). Macroblocks (referred to herein more generally as "input blocks" or "data blocks") are typically processed in a left-to-right, top-to-bottom scan order within the video frame. Compression can be achieved when input blocks are predicted and encoded using previously encoded data. The process of encoding an input block using spatially neighboring samples from previously encoded blocks in the same frame is called intra-prediction. Intra-prediction attempts to exploit spatial redundancy in the data. Encoding an input block using similar regions from previously encoded frames, found using a motion estimation process, is called inter-prediction. Inter-prediction attempts to exploit temporal redundancy in the data. The motion estimation process can generate a motion vector that specifies, for example, the location of a matching region in a reference frame relative to the input block being encoded. Most motion estimation processes consist of two main steps: initial motion estimation, which provides a first, coarse estimate of the motion vector (and corresponding temporal prediction) for a given input block, and fine motion estimation, which performs a local search in the neighborhood of the initial estimate to determine a more precise estimate of the motion vector (and corresponding prediction) for that input block.

The encoder can generate a residual by measuring the difference between the data to be encoded and the prediction. The residual provides the difference between a predicted block and the original input block. The predictions, motion vectors (for inter-prediction), residuals, and related data can be combined with other processes such as spatial transform, quantization, entropy coding, and loop filtering to create an efficient encoding of the video data. The residual, after being quantized and transformed, is processed and added back to the prediction, assembled into a decoded frame, and stored in a frame store. The details of such video coding techniques are well known to those skilled in the art.

MPEG-2 (H.262) and H.264 (MPEG-4 Part 10, Advanced Video Coding (AVC)), hereafter referred to as MPEG-2 and H.264 respectively, are two codec standards for video compression that achieve high-quality video representation at relatively low bitrates. The basic coding unit of MPEG-2 and H.264 is the 16x16 macroblock. H.264 is a recent, widely adopted video compression standard and is generally considered to be twice as efficient as MPEG-2 at compressing video data.

The basic MPEG standard defines three types of frames (or pictures) according to how the input blocks in the frame are encoded. An I-frame (intra-coded picture) is encoded using only data present in the frame itself and thus consists only of intra-predicted blocks. A P-frame (predicted picture) is encoded via forward prediction, using data from previously decoded I-frames or P-frames, also known as reference frames. P-frames can contain both intra blocks and (forward-)predicted blocks. A B-frame (bi-predicted picture) is encoded via bi-directional prediction, using data from both previous and subsequent frames. B-frames can contain intra, (forward-)predicted, and bi-predicted blocks.

A particular set of reference frames is termed a Group of Pictures (GOP). The GOP contains only the decoded pels within each reference frame and does not include information about how the input blocks or frames themselves were originally encoded (I-frame, B-frame, or P-frame). Older video compression standards such as MPEG-2 use one reference frame (in the past) to predict P-frames and two reference frames (one past, one future) to predict B-frames. By contrast, more recent compression standards such as H.264 and HEVC (High Efficiency Video Coding) allow the use of multiple reference frames for P-frame and B-frame prediction. While reference frames are typically temporally adjacent to the current frame, these standards also allow reference frames that are not temporally adjacent.

Conventional inter-prediction is based on block-based motion estimation and compensation (BBMEC). The BBMEC process searches for the best match between the target block (the current input block being encoded) and same-sized regions within previously decoded reference frames. When such a match is found, the encoder may transmit a motion vector, which serves as a pointer to the position of the best match in the reference frame. For computational reasons, however, the BBMEC search process is limited both temporally, in terms of the reference frames that can be searched, and spatially, in terms of the neighboring regions that can be searched. This means that the "best possible" match is not always found, especially with rapidly changing data.

The simplest form of the BBMEC process initializes the motion estimation with a (0, 0) motion vector. In other words, the initial estimate of a target block is the co-located block in the reference frame. Fine motion estimation is then performed by searching in a local neighborhood of that region for the region that best matches (i.e., has the lowest error relative to) the target block. The local search may be performed by exhaustively querying the local neighborhood or by any of several "fast search" methods, such as a diamond search or a hexagonal search.
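By way of illustration only, the simplest BBMEC form described above might be sketched as follows. The function names `sad` and `full_search`, the list-of-rows frame layout, and the choice of the sum of absolute differences as the error measure are assumptions made for this sketch and are not taken from the specification.

```python
from typing import List, Tuple

Frame = List[List[int]]  # rows of luma samples (assumed layout)


def sad(target: Frame, ref: Frame, tx: int, ty: int,
        rx: int, ry: int, size: int) -> int:
    """Sum of absolute differences between a target block and a reference region."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(target[ty + dy][tx + dx] - ref[ry + dy][rx + dx])
    return total


def full_search(target: Frame, ref: Frame, tx: int, ty: int,
                size: int, search_range: int) -> Tuple[Tuple[int, int], int]:
    """Return ((mvx, mvy), error) for the best match near the co-located block."""
    height, width = len(ref), len(ref[0])
    # Initial estimate: the (0, 0) motion vector, i.e. the co-located block.
    best_mv = (0, 0)
    best_err = sad(target, ref, tx, ty, tx, ty, size)
    # Fine estimation: exhaustively query the local neighborhood.
    for mvy in range(-search_range, search_range + 1):
        for mvx in range(-search_range, search_range + 1):
            rx, ry = tx + mvx, ty + mvy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate region falls outside the reference frame
            err = sad(target, ref, tx, ty, rx, ry, size)
            if err < best_err:
                best_mv, best_err = (mvx, mvy), err
    return best_mv, best_err
```

A fast-search variant such as a diamond search would replace the two nested loops with a sparser visiting pattern while keeping the same initialization.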

An improvement on the BBMEC process, present in standard codecs since later versions of MPEG-2, is the enhanced predictive zonal search (EPZS) method (Non-Patent Document 1: Tourapis et al., "Enhanced predictive zonal search for single and multiple frame motion estimation"). For the initial estimate of a target block, the EPZS method considers a set of motion vector candidates based on the motion vectors of neighboring blocks that have already been encoded, as well as the motion vectors of the co-located block (and its neighbors) in the previous reference frame. The EPZS method posits that the motion vector field of the video has some spatial and temporal redundancy, so it is reasonable to initialize motion estimation for a target block with the motion vectors of neighboring blocks, or with motion vectors from nearby blocks in already-encoded frames. Once the set of initial estimates has been gathered, the EPZS method narrows the set via approximate rate-distortion analysis, after which fine motion estimation is performed.

For any given target block, the encoder may generate multiple inter-predictions to choose from. The predictions may result from multiple prediction processes (e.g., BBMEC, EPZS, or model-based schemes). The predictions may also differ based on the subpartitioning of the target block, where different motion vectors are associated with different subpartitions of the target block and the respective motion vectors each point to a subpartition-sized region in a reference frame. The predictions may also differ based on the reference frames to which the motion vectors point because, as noted above, recent compression standards allow the use of multiple reference frames. Selection of the best prediction for a given target block is usually accomplished through rate-distortion optimization, in which the best prediction is the one that minimizes the rate-distortion metric D + λR, where the distortion D measures the error between the target block and the prediction, the rate R quantifies the cost (in bits) to encode the prediction, and λ is a scalar weighting factor.
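As a minimal sketch of the selection rule above, the following chooses, among candidate predictions, the one that minimizes D + λR. The `Candidate` type, its field names, and the numeric values used to exercise it are hypothetical and serve only to illustrate the metric.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str          # hypothetical label, e.g. "BBMEC" or "EPZS"
    distortion: float  # D: error between the target block and the prediction
    rate_bits: float   # R: cost in bits to encode the prediction


def rd_cost(c: Candidate, lam: float) -> float:
    """Rate-distortion metric D + lambda * R."""
    return c.distortion + lam * c.rate_bits


def select_best(candidates: List[Candidate], lam: float) -> Candidate:
    """Pick the candidate prediction with the lowest rate-distortion cost."""
    return min(candidates, key=lambda c: rd_cost(c, lam))
```

A larger λ penalizes rate more heavily and so favors cheaper predictions; a smaller λ favors predictions with lower error.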

Non-Patent Document 1: Tourapis, A., 2002, "Enhanced predictive zonal search for single and multiple frame motion estimation," Proc. SPIE 4671, Visual Communications and Image Processing, pp. 1069-1078

Historically, a number of model-based compression schemes have been proposed to avoid the limitations of BBMEC prediction. Model-based compression schemes of this kind (of which the MPEG-4 Part 2 standard is perhaps the best known) rely on detecting and tracking objects or features in the video (generally defined as "components of interest") and on a method for encoding those features/objects separately from the rest of the video frame. Because feature/object detection and tracking occur independently of the spatial search in standard motion estimation processes, feature/object tracks can give rise to a different set of predictions than those obtained through standard motion estimation.

Such feature/object-based model-based compression schemes, however, suffer from the challenges of segmenting video frames into object versus non-object (or feature versus non-feature) regions. First, because objects can vary widely in size, their shapes need to be encoded in addition to their texture (color content). Second, tracking multiple moving objects can be difficult, and inaccurate tracking causes incorrect segmentation, usually resulting in poor compression performance. Third, not all video content is composed of objects or features, so a fallback encoding scheme is needed when objects/features are not present.

Co-pending U.S. Provisional Patent Application No. 61/950,784, filed November 4, 2014 (referred to herein as the "'784 application"), presents a model-based compression scheme that avoids the segmentation problem described above. The continuous block tracker (CBT) of the '784 application does not detect objects or features, eliminating the need to segment objects or features from the non-object/non-feature background. Instead, the CBT combines frame-to-frame motion estimates into continuous tracks, thereby tracking all input blocks ("macroblocks") in the video frame as if they were regions of interest. In doing so, the CBT models motion in the video in a way that gains the benefit of higher-level modeling of the data, namely improved inter-prediction, while avoiding the segmentation problem.

Other model-based compression approaches model the response of the human visual system (HVS) to the content of the video data as importance maps that indicate which parts of a video frame are most noticeable to human perception. Importance maps take on a value for each input block or data block in a video frame, and the importance map value for any given block may change from frame to frame throughout the video. Generally, importance maps are defined such that higher values indicate more important data blocks.

One type of importance map is the temporal contrast sensitivity function (TCSF) (de Lange, H., 1954, "Relationship between critical flicker frequency and a set of low frequency characteristics of the eye," J. Opt. Soc. Am., 44:380-389). The TCSF measures the response of the HVS to temporally periodic stimuli and reveals that certain temporal characteristics in the data are noticeable to human observers. When these temporal characteristics are related to motion in the data, the TCSF predicts that the most noticeable type of motion in the data is "moderate" motion, corresponding to neither very high nor very low temporal frequencies.

It is important to note that the TCSF requires accurate measurement of the velocities of moving content in the video in order to generate accurate temporal contrast values. These velocities can be approximated by computing optical flow, which describes the apparent motion of video content due to camera motion and/or object motion. However, most standard video encoders employ motion estimation processes that optimize compression efficiency rather than compute optical flow accurately.

Another type of importance map is based on spatial contrast sensitivity and measures the response of the HVS to spatial characteristics such as brightness, edges, spatial frequencies, and color. The spatial contrast sensitivity function (SCSF) (see, e.g., Barten, P., 1999, Contrast Sensitivity of the Human Eye and Its Effects on Image Quality, SPIE Press), also known simply as the contrast sensitivity function (CSF), measures the spatial contrast that is noticeable to the HVS. The SCSF has been applied successfully in the JPEG 2000 image compression standard to reduce image compression artifacts. Objects and features are also typically detected with the aid of spatial contrast cues (e.g., the presence of edges as indicated by spatial frequency gradients). While spatial contrast sensitivity has been studied and exploited in image compression (e.g., in the JPEG 2000 codec), and while many video compression processes based on object and feature detection have been proposed, temporal contrast sensitivity as represented by the TCSF has not previously been applied to video compression.

Several of the disclosed inventive embodiments apply importance maps to video compression to improve the quality of video coding. In one example embodiment, temporal frequency is computed within a standard video encoding processing stream by using structural similarity (SSIM) in the color-space domain to approximate wavelength and by using the encoder's motion vectors to approximate velocity. The temporal frequency then serves as the input to the temporal contrast sensitivity function (TCSF). The TCSF can be computed for every data block, generating a temporal importance map that indicates which regions of the video frame are most noticeable to human observers.
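The relationship among velocity, wavelength, and temporal frequency can be sketched as follows, assuming the usual relation that frequency equals velocity divided by wavelength. How the wavelength is derived from SSIM is not specified in this passage, so the sketch takes the wavelength as a given input; the function names, the units, and the use of the frame rate to convert a per-frame motion vector into a speed are all assumptions.

```python
import math


def block_speed(mv_x: float, mv_y: float, frame_rate: float) -> float:
    """Approximate content speed in pixels per second from an encoder motion
    vector expressed in pixels per frame (assumed units)."""
    return math.hypot(mv_x, mv_y) * frame_rate


def temporal_frequency(speed: float, wavelength: float) -> float:
    """Temporal frequency in cycles per second: speed / wavelength.

    `wavelength` (pixels per cycle) is assumed to have been approximated
    elsewhere, e.g. from SSIM as the text describes.
    """
    if wavelength <= 0:
        raise ValueError("wavelength must be positive")
    return speed / wavelength
```

The resulting value would then be passed to a TCSF model, which is outside the scope of this sketch.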

In a further example embodiment, information about the relative quality of the motion vectors generated by the encoder can be computed at various points in the encoding process and used to generate a true motion vector map. The true motion vector map indicates, for each target block, how reliable its motion vector is. Taking on values of 0 or 1, the true motion vector map can be used as a mask that refines the TCSF, so that the TCSF is not applied to target blocks whose motion vectors are inaccurate (i.e., target blocks for which the true motion vector map is 0).
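One way to picture the masking step is an element-wise product of the per-block TCSF values with the 0/1 true motion vector map. Representing "not applied" as a value of zero is an assumption of this sketch, as are the function name and the nested-list layout.

```python
from typing import List

BlockMap = List[List[float]]  # one value per target block (assumed layout)


def mask_tcsf(tcsf: BlockMap, tmvm: List[List[int]]) -> BlockMap:
    """Keep TCSF values only where the true motion vector map is 1."""
    return [
        [t * m for t, m in zip(tcsf_row, tmvm_row)]
        for tcsf_row, tmvm_row in zip(tcsf, tmvm)
    ]
```

Blocks with a map value of 0 then contribute no temporal importance, leaving any spatial information to govern them.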

In a further embodiment, a spatial complexity map (SCM) can be computed from metrics such as block variance, block luminance, and edge detection to determine the spatial contrast of a given target block relative to its neighbors. In another embodiment, information from the SCM can be combined with the TCSF to obtain a composite, unified importance map. The combination of spatial and temporal contrast information in the unified importance map effectively balances both aspects of human visual response.
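Of the metrics named above, block variance is the simplest to illustrate. The following sketch computes a per-block variance map over a frame of luma samples; the function names, the list-of-rows frame layout, and the use of population variance are assumptions.

```python
from typing import List

Frame = List[List[int]]  # rows of luma samples (assumed layout)


def block_variance(frame: Frame, x0: int, y0: int, size: int) -> float:
    """Population variance of the samples in one size-by-size block."""
    samples = [frame[y][x]
               for y in range(y0, y0 + size)
               for x in range(x0, x0 + size)]
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


def variance_map(frame: Frame, size: int) -> List[List[float]]:
    """Variance of every non-overlapping block, in raster order."""
    rows, cols = len(frame) // size, len(frame[0]) // size
    return [[block_variance(frame, bx * size, by * size, size)
             for bx in range(cols)]
            for by in range(rows)]
```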

In one example embodiment, the unified importance map (including information from both the TCSF and the SCM) is used to weight the distortion part of the standard rate-distortion metric, D + λR. This results in a modified rate-distortion optimization that is weighted toward solutions that fit the relative perceptual importance of each target block: low-distortion solutions when the importance map is near its maximum and low-rate solutions when the importance map is near its minimum. In an alternative embodiment, the TCSF or the SCM may be used independently for this purpose.
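The effect of weighting the distortion term can be sketched as below. Using the importance value directly as a multiplier on D, so that the cost becomes w·D + λR, is an assumption of this sketch; the passage states only that the distortion part is weighted. The function name and sample values are hypothetical.

```python
from typing import List, Tuple


def weighted_rd_choice(candidates: List[Tuple[float, float]],
                       importance: float, lam: float) -> int:
    """Return the index of the (distortion, rate) pair minimizing
    importance * D + lam * R."""
    costs = [importance * d + lam * r for d, r in candidates]
    return costs.index(min(costs))
```

With candidates (D=100, R=10) and (D=80, R=40) and λ = 1, an importance of 2.0 selects the lower-distortion candidate, whereas an importance of 0.5 selects the lower-rate candidate.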

In another example embodiment, the TCSF (with true motion vector refinement) and the SCM can be used to modify the block-level quantization of the encoder. In target blocks where the importance map takes on high values, the quantization parameter is reduced relative to the frame quantization parameter, resulting in higher quality for those blocks. In target blocks where the importance map takes on low values, the quantization parameter is increased relative to the frame quantization parameter, resulting in lower quality for those blocks. In an alternative embodiment, the TCSF or the SCM may be used independently for this purpose.
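A minimal sketch of such a block-level adjustment follows. The mapping itself is an assumption: the importance value is taken to lie in [0, 1] with 0.5 treated as neutral, the offset is linear and bounded by a hypothetical `max_delta`, and the result is clamped to 0..51, the QP range of 8-bit H.264.

```python
def block_qp(qp_frame: int, importance: float, max_delta: int = 4,
             qp_min: int = 0, qp_max: int = 51) -> int:
    """Lower the block QP for important blocks and raise it for unimportant ones."""
    # importance 1.0 -> -max_delta, 0.5 -> 0, 0.0 -> +max_delta (assumed mapping)
    offset = round((0.5 - importance) * 2 * max_delta)
    return max(qp_min, min(qp_max, qp_frame + offset))
```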

While the TCSF can be computed for any encoder that incorporates inter-prediction and generates motion vectors (which the TCSF uses to approximate the velocity of content in the video), applying the TCSF to video compression is most effective within a model-based compression framework, such as the continuous block tracker (CBT) of the '784 application, that can determine accurately which motion vectors are true motion vectors. As noted above, most standard video encoders compute motion vectors that optimize compression efficiency rather than reflect true motion. By contrast, the CBT provides both motion vectors suitable for high compression efficiency and modeling information that maximizes the effectiveness of the TCSF.

Some example inventive embodiments are structured so that the resulting bitstream is compliant with any video compression standard that employs block-based motion estimation followed by transform, quantization, and entropy encoding of residual signals. Such video compression standards include, but are not limited to, MPEG-2, H.264, and HEVC. The present invention can also be applied to non-standard video encoders that are not block-based, as long as the encoder incorporates inter-prediction and generates motion vectors.

Some example embodiments may include methods and systems for encoding video data, as well as any codec (encoder and decoder) for implementing them. A plurality of video frames having non-overlapping target blocks may be processed by an encoder. The plurality of video frames may be encoded by the encoder using importance maps, such that the importance maps modify the quantization and thereby affect the encoding quality of each target block to be encoded in each video frame.

The importance maps may be formed using at least one of temporal information and spatial information. If both temporal and spatial information are used, the importance map is considered a unified importance map. The importance maps may be configured to indicate (identify or represent) the parts of a video frame in the plurality of video frames that are most noticeable to human perception. Specifically, in blocks where the importance maps take on high values, the block quantization parameter (QP) is reduced relative to the frame quantization parameter QP_frame, resulting in higher quality for those blocks. In target blocks where the importance maps take on low values, the block quantization parameter is increased relative to the frame quantization parameter QP_frame, resulting in lower quality for those blocks.

The spatial information may be provided by a rule-based spatial complexity map (SCM), whose first step is to determine which target blocks in the frame have a variance greater than the average block variance var_frame of the frame. Blocks having a variance greater than var_frame may be assigned a QP value higher than the frame quantization parameter QP_frame, with the block QP assignment QP_block scaled linearly between QP_frame and the quantization parameter upper bound QP_max according to how much the block variance var_block exceeds var_frame.
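The linear scaling rule for high-variance blocks might be sketched as follows. The passage does not state what variance maps to QP_max, so the normalizer `var_max` (for instance, the largest block variance in the frame) is a hypothetical parameter introduced for this sketch, as is the rounding to an integer QP.

```python
def scm_qp(var_block: float, var_frame: float, var_max: float,
           qp_frame: int, qp_max: int) -> int:
    """Scale QP_block linearly from QP_frame to QP_max for high-variance blocks."""
    if var_block <= var_frame or var_max <= var_frame:
        return qp_frame  # not a high-variance block: leave the frame QP in place
    # Fraction of the way from var_frame to var_max, capped at 1.0.
    t = min(1.0, (var_block - var_frame) / (var_max - var_frame))
    return round(qp_frame + t * (qp_max - qp_frame))
```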

Preferably, the temporal information may be provided by a temporal contrast sensitivity function (TCSF), which indicates which target blocks are most temporally noticeable to a human observer, and a true motion vector map (TMVM), which indicates which target blocks correspond to foreground data. The TCSF may be considered valid only for those target blocks identified as foreground data.

The QP assignment QP_block of a high-variance block may be further refined by the TCSF and the TMVM: if the TMVM identifies the target block as foreground data and the log contrast sensitivity value of the TCSF for that block is less than 0.5, QP_block is increased by 2.

The SCM may include luminance masking, in which the block quantization parameter QP_block of target blocks that are very bright (luminance above 170) or very dark (luminance below 60) is reset to QP_max. The SCM may also include dynamic determination of QP_max based on the quality level of the encoded video. In this dynamic determination, quality is measured using the average structural similarity (SSIM) calculation of target blocks in intra (I) frames together with the average block variance var_frame of those frames; when the measured quality is low, the value of the upper limit QP_max is lowered to bring it somewhat closer to the frame quantization parameter QP_frame.
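The temporal refinement and the luminance masking rule for high-variance blocks might be combined as in the sketch below. The thresholds (0.5, 170, 60) and the increment of 2 come from the text. The order in which the two rules are applied, the final clamp to QP_max, the use of a mean 8-bit luma value per block, and all names are assumptions made for illustration.

```python
def adjust_high_variance_qp(qp_block, qp_max, is_foreground, log_tcsf, mean_luma):
    """Sketch: refine a high-variance block's QP (rule order is assumed)."""
    # Temporal refinement: foreground block (per the TMVM) with low
    # temporal contrast sensitivity gets a coarser quantizer.
    if is_foreground and log_tcsf < 0.5:
        qp_block += 2
    # Luminance masking: very bright or very dark blocks go to the ceiling.
    if mean_luma > 170 or mean_luma < 60:
        qp_block = qp_max
    # Assumed clamp so the refinement never exceeds the upper limit.
    return min(qp_block, qp_max)
```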

Very-low-variance blocks may be assigned a low QP value as the assignment QP_block, determined so that the smaller the block variance, the lower the value of QP_block (and the higher the quality), to ensure high-quality encoding in these regions. This low QP assignment for very-low-variance blocks may be determined first for I-frames and then for P-frames and B-frames using the ipratio and pbratio parameters. Blocks with low variance that are not considered very-low-variance are examined to determine whether quality enhancement is needed for the block: an initial estimate of the block QP, QP_block, is calculated by averaging the QP values of the already-encoded neighboring blocks to the left, top left, right, and top right of the current block. An estimate SSIM_est of the SSIM of the current block may be calculated from the SSIM values of the same already-encoded neighboring blocks. If SSIM_est is less than 0.9, the value of QP_block may be decreased by 2.
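The quality check for low-variance blocks can be illustrated with a short sketch. The averaging of neighbor QP values, the 0.9 threshold, and the decrement of 2 follow the text; the text does not say how SSIM_est is derived from the neighbors' SSIM values, so a plain mean is assumed here, and the function and parameter names are hypothetical.

```python
def low_variance_block_qp(neighbor_qps, neighbor_ssims,
                          ssim_threshold=0.9, qp_step=2):
    """Sketch: initial QP estimate and quality check for a low-variance block.

    Both lists hold values from already-encoded neighboring blocks.
    """
    qp_est = round(sum(neighbor_qps) / len(neighbor_qps))
    # Assumption: SSIM_est is the plain mean of the neighbors' SSIM values.
    ssim_est = sum(neighbor_ssims) / len(neighbor_ssims)
    if ssim_est < ssim_threshold:
        qp_est -= qp_step  # lower QP means higher quality
    return qp_est, ssim_est
```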

In some embodiments, the quality enhancement is applied only to blocks that the TMVM identifies as foreground data and for which the TCSF log contrast sensitivity value is greater than 0.8. The TMVM may be set to 1 only for foreground data.

In some embodiments, the temporal frequency of the TCSF is calculated by approximating wavelength using SSIM in the color space domain between the target block and its reference block, and by approximating velocity using the motion vector magnitude and the frame rate.

The TCSF may be calculated over multiple frames, such that the TCSF for the current frame is a weighted average of the TCSF maps over recent frames, with more recent frames receiving greater weight.
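One way to picture this multi-frame averaging is the sketch below. The text only requires that more recent frames receive greater weight; the geometric weighting and the `decay` parameter are assumptions, as are the function name and the representation of a TCSF map as a list of rows.

```python
def temporal_tcsf_average(tcsf_maps, decay=0.5):
    """Sketch: weighted average of per-block TCSF maps, oldest first.

    Weights decay geometrically with age (an assumed scheme), so the
    newest map has weight 1 before normalization.
    """
    n = len(tcsf_maps)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    rows, cols = len(tcsf_maps[0]), len(tcsf_maps[0][0])
    return [[sum(w * m[r][c] for w, m in zip(weights, tcsf_maps)) / total
             for c in range(cols)]
            for r in range(rows)]
```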

Foreground data may be identified by calculating the difference between the encoder motion vector for a given target block and the global motion vector for that block; blocks with a sufficiently large difference are determined to be foreground data.

For data blocks identified as foreground data, the encoder motion vector may be subtracted from the global motion vector to obtain a differential motion vector, and the magnitude of this differential motion vector is used to calculate the temporal frequency of the TCSF.
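The foreground test and the differential motion vector can be sketched together. The subtraction order (global minus encoder) follows the text; the text does not quantify "sufficiently large", so the `threshold` value of 1.0 pixel is a placeholder assumption, and the function name is hypothetical.

```python
import math

def differential_motion(encoder_mv, global_mv, threshold=1.0):
    """Sketch: classify a block as foreground from its motion difference.

    Returns (is_foreground, differential_mv, magnitude). The threshold
    is an assumed placeholder, not a value from the source.
    """
    dx = global_mv[0] - encoder_mv[0]
    dy = global_mv[1] - encoder_mv[1]
    magnitude = math.hypot(dx, dy)
    return magnitude > threshold, (dx, dy), magnitude
```

For a block flagged as foreground, the returned magnitude would take the place of the raw motion vector magnitude when the temporal frequency is computed.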

Computer-based methods for processing video data, codecs (encoders and decoders) for processing video data, and other computer systems and apparatus for processing video data may embody the foregoing principles of the present invention.

The foregoing will be apparent from the following more detailed description of example embodiments of the invention, as illustrated in the accompanying drawings, in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale; emphasis is instead placed on illustrating embodiments of the present invention.

A block diagram showing the configuration of a standard encoder.
A block diagram showing the steps involved in inter-prediction for a general encoder.
A block diagram showing the steps involved in initial motion estimation via continuous block tracking.
A block diagram showing unified motion estimation via a combination of continuous block tracking and enhanced predictive zonal search.
A plot showing a recent measurement (2010) of the temporal contrast sensitivity function by Wooten et al.
A block diagram showing the calculation of structural similarity (SSIM) in the CIE 1976 Lab color space, in an embodiment of the invention.
A block diagram showing the general application of perceptual statistics to improve the perceptual quality of video encodings, in an embodiment of the invention.
A block diagram showing the application of perceptual statistics to modify inter-prediction via continuous block tracking and thereby improve the perceptual quality of video encodings, in an embodiment of the invention.
A block diagram showing an example process of encoding using importance maps to modify block quantization.
A schematic diagram of a computer network environment in which embodiments are deployed.
A block diagram of a computer node in the network of FIG. 9A.

The entire teachings of all patents, published applications, and references cited herein are incorporated herein by reference. Example embodiments of the invention are described below.

The invention can be applied to various standard encodings. In the following, unless otherwise noted, the terms "conventional" and "standard" (often used together with "compression", "codec", "encoding", or "encoder") can refer to MPEG-2, MPEG-4, H.264, or HEVC. "Input block" refers, without loss of generality, to the basic coding unit of the encoder and may often be used interchangeably with "data block" or "macroblock". The input block currently being encoded is referred to as the "target block".

＜Video encoding and inter-prediction via continuous block tracking＞
The encoding process may convert video data into a compressed, or encoded, format. Likewise, the decompression, or decoding, process may convert compressed video back into an uncompressed, or raw, format. The video compression and decompression processes may be implemented as an encoder/decoder pair commonly referred to as a codec.

FIG. 1 is a block diagram of a standard transform-based, motion-compensated encoder. The encoder of FIG. 1 may be implemented in a software environment, a hardware environment, or a combination thereof. The encoder may include any combination of components, including, but not limited to, a motion estimation means 15 that feeds into an inter-prediction means 20, an intra-prediction means 30, a transform and quantization means 60, an inverse transform and quantization means 70, an in-loop filter 80, a frame store 85, and an entropy encoding means 90. The purpose of the prediction means (both inter-prediction and intra-prediction) is to generate the best prediction signal 40 for a given input video block 10 ("input block" for short, or "macroblock" or "data block"). The prediction signal 40 is subtracted from the input block 10 to create a prediction residual 50, which undergoes transform and quantization 60. The quantized coefficients 65 of the residual are then passed to the entropy encoding means 90, which encodes them into the compressed bitstream. The quantized coefficients 65 are also passed to the inverse transform and quantization means 70, and the resulting signal (an approximation of the prediction residual) is added back to the prediction signal 40 to create a reconstructed signal 75 for the input block 10. The reconstructed signal 75 may be passed through an in-loop filter 80, such as a deblocking filter, and the (possibly filtered) reconstructed signal becomes part of the frame store 85, which aids prediction of future input blocks. The function of each component of the encoder shown in FIG. 1 is well known to one of ordinary skill in the art.
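The residual and reconstruction path of FIG. 1 can be illustrated with a deliberately simplified sketch. It omits the transform, the entropy coder, and the in-loop filter, and uses a uniform scalar quantizer with step `qstep`; those simplifications, the flat-list block representation, and the function name are all assumptions made for illustration only.

```python
def encode_block(block, prediction, qstep):
    """Sketch of the residual/quantize/reconstruct loop (transform omitted)."""
    # Prediction residual: input block minus prediction signal.
    residual = [x - p for x, p in zip(block, prediction)]
    # Uniform scalar quantization stands in for transform + quantization.
    coeffs = [round(r / qstep) for r in residual]
    # Inverse quantization, then add the prediction back.
    recon = [p + c * qstep for p, c in zip(prediction, coeffs)]
    return coeffs, recon
```

The reconstructed samples, not the original ones, are what a frame store would keep, so that the encoder predicts from the same data the decoder will have.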

FIG. 2 depicts the steps in standard inter-prediction (reference numeral 20 in FIG. 1). The purpose of inter-prediction is to encode new data using previously decoded data from earlier frames, exploiting the temporal redundancy in the data. In inter-prediction, an input block 10 from the frame currently being encoded (also called the target frame) is "predicted" from a region of the same size within a previously decoded reference frame stored in the frame store 85 of FIG. 1. The two-component vector indicating the (x, y) displacement between the location of the input block in the frame being encoded and the location of its matching region in the reference frame is termed a motion vector. The process of motion estimation thus involves determining the motion vector that best links an input block to be encoded with its matching region in a reference frame.

Most inter-prediction processes begin with initial motion estimation (reference numeral 110 in FIG. 2), which generates one or more rough estimates of "good" motion vectors 115 for a given input block. This is optionally followed by a motion vector candidate filtering step 120, in which multiple motion vector candidates can be reduced to a single candidate using an approximate rate-distortion metric. In rate-distortion analysis, the best motion vector candidate (prediction) is chosen as the one that minimizes the rate-distortion metric D+λR, where the distortion D is the error between the input block and its matching region, the rate R quantifies the cost (in bits) of encoding the prediction, and λ is a scalar weighting factor. The actual rate cost contains two components: texture bits and motion vector bits. Texture bits are the number of bits required to encode the quantized transform coefficients of the residual signal (the input block minus the prediction), and motion vector bits are the number of bits required to encode the motion vector. Motion vectors are usually encoded differentially relative to already-encoded motion vectors. Because texture bits are not available in the early stages of the encoder, the rate portion of the rate-distortion metric is approximated by the motion vector bits, which are in turn approximated as a motion vector penalty factor that depends on the magnitude of the differential motion vector. In the motion vector candidate filtering step 120, therefore, this approximate rate-distortion metric is used to select either a single "best" initial motion vector or a smaller set of "best" initial motion vectors 125. The initial motion vectors 125 are then refined by fine motion estimation 130, which performs a local search in the neighborhood of each initial estimate to determine a more precise estimate of the motion vector (and corresponding prediction) for the input block. The local search is usually followed by subpixel refinement, in which integer-valued motion vectors are refined to 1/2 or 1/4 pixel precision via interpolation. The fine motion estimation block 130 produces a set of refined motion vectors 135.
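The selection rule based on D+λR can be written compactly. The sketch below assumes each candidate is supplied as a (label, distortion, rate_bits) tuple, which is a hypothetical representation; in the early filtering step the rate term would be the motion vector penalty described above, and in the final analysis it would be the actual bit count.

```python
def pick_best_candidate(candidates, lam):
    """Sketch: choose the candidate minimizing D + lambda * R.

    candidates: iterable of (label, distortion, rate_bits) tuples
    (an assumed representation).
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])
```

A larger λ favors cheap-to-encode candidates even at higher distortion, while a smaller λ favors accuracy over bit cost.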

Next, given the refined motion vectors 135, a mode generation means 140 generates a set of candidate predictions 145 based on the coding modes the encoder can follow. These modes vary by codec. Different coding modes are associated with, but not limited to, interlaced versus progressive (field versus frame) motion estimation, the direction of the reference frame (forward-predicted, backward-predicted, bi-predicted), the index of the reference frame (for codecs such as H.264 and HEVC that allow multiple reference frames), inter-prediction versus intra-prediction (certain scenarios allowing reversion to intra-prediction when no good inter-prediction exists), different quantization parameters, and different subpartitions of the input block. The full set of prediction candidates 145 undergoes "final" rate-distortion analysis 150 to determine the single best candidate. In "final" rate-distortion analysis, a precise rate-distortion metric D+λR is used, computing the prediction error D for the distortion portion (usually calculated as the sum of squared errors, SSE) and the actual encoding bits R (from the entropy encoding 90 in FIG. 1) for the rate portion. The final prediction 160 (that is, reference numeral 40 in FIG. 1) is the one with the lowest rate-distortion score D+λR among all candidates, and this final prediction is passed to the subsequent steps in the encoder, along with its motion vector and other encoding parameters.

FIG. 3 depicts how initial motion estimation can be performed during inter-prediction via continuous block tracking (CBT). CBT is useful when there is a gap of multiple frames between the target frame and the reference frame from which the temporal prediction is derived. For MPEG-2, a typical GOP structure of IBBPBBP (consisting of intra-predicted I-frames, bi-predicted B-frames, and forward-predicted P-frames) allows reference frames as many as three frames away from the current frame, because B-frames cannot serve as reference frames in MPEG-2. In H.264 and HEVC, which allow multiple reference frames for each frame to be encoded, the same GOP structure allows reference frames located six or more frames away from the current frame. For longer GOP structures (for example, seven B-frames between each P-frame), reference frames can be even further away from the target frame. When there is a gap of multiple frames between the current frame and the reference frame, continuous tracking enables the encoder to capture motion in the data that standard temporal prediction methods cannot, allowing CBT to produce superior temporal predictions.

The first step in CBT is to perform frame-to-frame tracking (reference numeral 210 in FIG. 3). For each input block 10 in a given frame, motion vectors are calculated both backward to the previous frame in the frame buffer 205 and forward to the next frame in the frame buffer. In one embodiment, frame-to-frame tracking operates on frames from the original source video rather than on reconstructed reference frames. This is advantageous because source video frames are not degraded by quantization or other coding artifacts, so tracking based on source video frames represents the true motion field in the video more accurately. Frame-to-frame tracking may be carried out using either conventional block-based motion estimation (BBME) or hierarchical motion estimation (HME).

The result of frame-to-frame tracking is a set of frame-to-frame motion vectors 215 that signify, for each input block in a frame, the best matching region in the most recent frame in the frame buffer 205, and, for each block of the most recent frame in the frame buffer 205, the best matching region in the current frame. Continuous tracking 220 then aggregates the available frame-to-frame tracking information to create continuous tracks for each input block across multiple reference frames. Details of how to perform continuous tracking are found in the '784 application, which is incorporated herein by reference in its entirety. The output of continuous tracking 220 is a set of continuous block tracking (CBT) motion vectors 225 that track all input blocks in the current frame being encoded to their matching regions in past reference frames. For CBT, these CBT motion vectors serve as the initial motion vectors (reference numeral 125 in FIG. 2) and can be refined by fine motion estimation (reference numeral 130 in FIG. 2) as described above.

FIG. 4 depicts how, in an embodiment of the invention, CBT can be combined with the EPZS method to create a unified motion estimation process. In FIG. 4, CBT generates motion vectors through frame-to-frame tracking 210 and continuous tracking 220 for initial motion estimation 110, followed by local search and subpixel refinement 250 for fine motion estimation 130. EPZS generates its initial motion vectors through a candidate generation means 230, followed by a candidate filtering means 240 that filters via approximate rate-distortion analysis as detailed above. This is followed in turn by fine motion estimation 130 via local search and subpixel refinement 260. The resulting CBT motion vectors 255 and EPZS motion vectors 265 are both passed to the remaining inter-prediction steps (mode generation 140 and final rate-distortion analysis 150 in FIG. 2) to determine the overall "best" inter-prediction.

In an alternative embodiment, additional candidates may be added to the CBT motion vector candidates 255 and EPZS motion vector candidates 265 of FIG. 4. These candidates include, but are not limited to, random motion vectors, the (0, 0) motion vector, and the so-called "median predictor". Fine motion estimation 130 may be applied to a random motion vector to find the best candidate in its local neighborhood. The (0, 0) motion vector is one of the initial candidates in EPZS, but it is not always selected after EPZS candidate filtering (reference numeral 240 in FIG. 4), and even if it is selected after candidate filtering, fine motion estimation 130 may output a motion vector other than (0, 0). Explicitly including the (0, 0) motion vector (with no accompanying fine motion estimation) as a candidate for final rate-distortion analysis ensures that at least one low-magnitude, "low-motion" candidate is considered. Similarly, the "median predictor" is also one of the initial candidates in EPZS, but it is not always selected after EPZS candidate filtering (reference numeral 240 in FIG. 4). The median predictor is defined as the median of the motion vectors previously calculated for the data blocks to the left, top, and top right of the data block currently being encoded. Explicitly including the median predictor (with no accompanying fine motion estimation) as a candidate for final rate-distortion analysis can be especially beneficial for encoding spatially homogeneous ("flat") regions of the video frame. In this alternative embodiment, then, five or more types of motion vector candidates may be passed to the remaining inter-prediction steps (mode generation 140 and final rate-distortion analysis 150 in FIG. 2), including, but not limited to, a CBT-derived motion vector, an EPZS-derived motion vector, a motion vector derived from a random motion vector, the (0, 0) motion vector, and the median predictor.
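The median predictor can be sketched in a few lines. The text defines it as the median of the three neighboring motion vectors but does not say how the median of two-component vectors is taken; the sketch assumes a component-wise median, and the function name is hypothetical.

```python
from statistics import median

def median_predictor(mv_left, mv_top, mv_top_right):
    """Sketch: component-wise median of three neighboring motion vectors.

    Taking the median separately per component is an assumption.
    """
    xs = [mv_left[0], mv_top[0], mv_top_right[0]]
    ys = [mv_left[1], mv_top[1], mv_top_right[1]]
    return (median(xs), median(ys))
```

In a flat region where the neighbors share nearly the same motion, this candidate costs few motion vector bits, which is consistent with the benefit the text describes for homogeneous areas.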

＜Calculation of importance maps for video encoding＞
Perceptual statistics may be used to compute importance maps that indicate which regions of a video frame are important to the human visual system (HVS).

One example of a perceptual statistic is the so-called temporal contrast sensitivity function (TCSF), which models the response of the human visual system (HVS) to temporally periodic stimuli. As noted in the background section, the concept of the TCSF has existed since the 1950s (when it was introduced as a "temporal modulation transfer function"), but it has not previously been applied to video compression. FIG. 5 shows a recent measurement of the TCSF (Wooten, B. et al., 2010, "A practical method of measuring the temporal contrast sensitivity function," Biomedical Optics Express, 1(1):47-58), plotted as the log of temporal contrast sensitivity as a function of the log of frequency (log frequency on the horizontal axis, log temporal contrast sensitivity on the vertical axis). The measured data points (circles in FIG. 5) are fitted with a third-order polynomial (solid line in FIG. 5), and this fit is used for all TCSF calculations described below. The TCSF predicts that the HVS responds most strongly to moderate frequencies, with a slight drop in response at low frequencies and a rapid drop at high frequencies.

Applying the TCSF to video compression requires a method of calculating temporal frequency, the input to the TCSF (horizontal axis in FIG. 5). One method of calculating frequency, according to an embodiment of the invention, is described next. Frequency f is given by f=v/λ, where v is velocity and λ is wavelength. In one embodiment, the velocity v (in pixels per second) of the content of any data block can be calculated from the magnitude of the motion vectors generated by the encoder (for example, reference numeral 135 in FIG. 2, 215 and 225 in FIG. 3, or 255 and 265 in FIG. 4) as v=|MV|×frame rate/N, where |MV| is the magnitude of the motion vector of the data block, the frame rate is the number of frames per second at which the video was generated, and N is the number of frames between the reference frame pointed to by the motion vector and the current frame.
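The velocity formula translates directly into code. The sketch below assumes the motion vector is given as an (x, y) pair in pixels; the function and parameter names are hypothetical.

```python
import math

def block_velocity(mv, frame_rate, n_frames):
    """Sketch: v = |MV| * frame_rate / N, in pixels per second."""
    return math.hypot(mv[0], mv[1]) * frame_rate / n_frames
```

For example, a motion vector of (3, 4) pixels that points to a reference frame two frames away in 30 frames-per-second video corresponds to 5 × 30 / 2 = 75 pixels per second.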

波長λの適切な近似は、ＣＩＥ１９７６Ｌａｂ色空間（www://en.wikipedia.org/wiki/Lab_color_space）において算出される構造的類似度（ＳＳＩＭ）（Wang, Z. 達による2004, "Image quality assessment: From error visibility to structural similarity（画像品質評価：誤差可視度から構造的類似度まで）," IEEE Trans, on Image Processing, 13(4):600-612）の算出結果から導き出され得る。図６に、Ｌａｂ色空間におけるＳＳＩＭの算出の様子を示す。ＳＳＩＭは、ターゲットブロック３００（符号化すべき現在のデータブロック）とその動きベクトルが指し示す参照ブロック３１０との間で算出される。通常、エンコーダにより処理される映像データはＹＵＶ４２０などの標準の空間で表現されるので、次のステップは、それらターゲットブロック（符号３２０）および参照ブロック（符号３３０）の両方を一般的に文献に記載されている任意の手法を用いてＣＩＥ１９７６Ｌａｂ空間に変換することである。次に、Ｌａｂ空間におけるこれらのターゲットブロックと参照ブロックとの間の誤差ΔＥ（符号３４０）が、 A suitable approximation of the wavelength λ is the Structural Similarity (SSIM) calculated in the CIE 1976 Lab color space (www://en.wikipedia.org/wiki/Lab_color_space) (Wang, Z. et al. 2004, "Image. Quality assessment: From error visibility to structural similarity, "IEEE Trans, on Image Processing, 13(4):600-612". FIG. 6 shows how SSIM is calculated in the Lab color space. The SSIM is calculated between the target block 300 (current data block to be encoded) and the reference block 310 pointed to by its motion vector. Since the video data processed by the encoder is usually represented in a standard space such as YUV420, the next step is to describe both the target block (reference numeral 320) and the reference block (reference numeral 330) generally in the literature. Conversion to the CIE 1976 Lab space using any of the techniques described. Then, the error ΔE (reference numeral 340) between these target blocks and reference blocks in Lab space is

where the subscript T denotes the target block and the subscript R denotes the reference block. Finally, the SSIM 360 between the error ΔE and a zero matrix of the same dimensions is computed as a measure of the color-space variation of the data. SSIM as originally defined takes values between -1 and 1, with a value of 1 indicating perfect similarity (no spatial difference). To convert SSIM into a wavelength λ, one can use the spatial dissimilarity DSSIM = (1 - SSIM)/2, which takes values between 0 and 1, where 0 corresponds to short wavelengths (maximum spatial similarity) and 1 corresponds to long wavelengths (minimum spatial similarity). To express the result in units of pixels, the value can be multiplied by the number of pixels in the block over which it is computed. In one embodiment, the SSIM block size is 8×8, so the DSSIM value is multiplied by 64. In this case, the final frequency is given by

$$f = \frac{|MV| \times \text{frame rate}}{N \times 64 \times (1 - \mathrm{SSIM})/2}$$
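
As a rough illustration of the frequency formula above, the following Python sketch computes f for one block. The function name, the argument names, and the small `eps` guard against division by zero when SSIM equals 1 are assumptions made for this example; they are not part of the described method.

```python
import math

def temporal_frequency(mv, frame_rate, n_frames, ssim, block_pixels=64, eps=1e-6):
    """Sketch of f = |MV| * frame_rate / (N * block_pixels * DSSIM)."""
    mv_mag = math.hypot(mv[0], mv[1])            # |MV| in pixels
    velocity = mv_mag * frame_rate / n_frames    # v in pixels per second
    dssim = (1.0 - ssim) / 2.0                   # spatial dissimilarity in [0, 1]
    # eps is an assumed guard: DSSIM is 0 for perfectly similar blocks.
    wavelength = block_pixels * max(dssim, eps)  # lambda in pixels
    return velocity / wavelength
```

For example, a motion vector of (3, 4) pixels at 30 frames per second, with N = 1 and SSIM = 0, gives a velocity of 150 pixels per second and a wavelength of 32 pixels.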

Once the frequency has been calculated for a given target block, the TCSF value for that block can be determined from the curve fit (solid line) in FIG. 5. The TCSF takes values from 0 to 1.08 on a log10 scale, or from 1 to 11.97 on an absolute scale. Because different blocks in a frame take different TCSF values, the aggregate set of TCSF values over all blocks in the frame forms an importance map, in which high values indicate blocks that are perceptually important in terms of temporal contrast and low values indicate blocks that are perceptually unimportant.

In a further embodiment, TCSF values from recent frames can be averaged for each data block so that the TCSF-based importance map does not fluctuate too much from frame to frame. One such calculation of the average, TCSF_avg, is TCSF_avg = 0.7 × TCSF_cur + 0.3 × TCSF_prev, where TCSF_cur is the TCSF value from the current frame and TCSF_prev is the TCSF value from the most recently encoded previous frame. Averaging in this way makes the TCSF calculation more robust.
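
The averaging step can be sketched per block as follows. This is a minimal example that assumes the two maps are stored as flat lists in the same block order; the function name and the `w_cur` parameter are illustrative.

```python
def smooth_tcsf(tcsf_cur, tcsf_prev, w_cur=0.7):
    """Blend per-block TCSF values: w_cur * current + (1 - w_cur) * previous."""
    if len(tcsf_cur) != len(tcsf_prev):
        raise ValueError("both maps must cover the same blocks")
    return [w_cur * c + (1.0 - w_cur) * p for c, p in zip(tcsf_cur, tcsf_prev)]
```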

In a further embodiment, information about the relative quality of the motion vectors generated by the encoder can be computed at various points in the encoding process and used to generate a true motion vector map (TMVM). The TMVM indicates, for each data block, how reliable its motion vector is. The TMVM, which takes values of 0 or 1, can be used as a mask to refine the TCSF so that the TCSF is not applied to data blocks whose motion vectors are inaccurate (i.e., data blocks whose TMVM value is 0).

In one embodiment, motion vector accuracy can be determined by estimating a global motion model for a given video frame, applying that motion model to each data block in the frame to determine a global motion vector for each data block, and then comparing this global motion vector with the encoder's motion vector for that data block. Global motion can be estimated from the aggregate set of encoding motion vectors from the frame, fitted to an affine motion model with six or eight parameters. If the global motion vector and the encoder motion vector for a given data block are the same (or similar), the encoder motion vector is deemed accurate (and TMVM = 1 for that data block). If the two vectors are not the same, their prediction errors, measured as sum of squared errors (SSE) or sum of absolute differences (SAD), can be compared. If one error is small and the other is large, the motion vector with the smaller error is used for encoding and is deemed accurate (TMVM = 1).
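
The comparison logic for a single block might look like the sketch below. It assumes the global motion vector and both prediction errors have already been computed. The thresholds `mv_tol` and `err_ratio`, and the fallback of TMVM = 0 when neither vector is clearly better, are assumptions for illustration; the text does not specify them.

```python
import math

def tmvm_value(enc_mv, global_mv, enc_err, global_err, mv_tol=1.0, err_ratio=2.0):
    """Return (tmvm, chosen_mv) for one block.

    mv_tol (pixels) and err_ratio are illustrative thresholds, not source values.
    """
    diff = math.hypot(enc_mv[0] - global_mv[0], enc_mv[1] - global_mv[1])
    if diff <= mv_tol:
        return 1, enc_mv                      # vectors agree: encoder MV is trusted
    lo, hi = min(enc_err, global_err), max(enc_err, global_err)
    if hi >= err_ratio * max(lo, 1e-9):       # one error small, the other large
        chosen = enc_mv if enc_err <= global_err else global_mv
        return 1, chosen
    return 0, enc_mv                          # assumed fallback: no clear winner
```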

In an alternative embodiment, the magnitude of the difference between the global motion vector and the encoder motion vector for a given data block is used to identify the data block as foreground data, meaning that the content of the data block is moving differently from the rest of the frame (the background). In this embodiment, the TMVM is set to 1, and the TCSF is applied, only for foreground data. In a further embodiment, for data blocks identified as foreground data, the encoder motion vector is subtracted from the global motion vector to obtain a differential motion vector, and the magnitude of the differential motion vector (rather than that of the encoder motion vector) is used to calculate the frequency for the TCSF; that is, |MV| is replaced with |DMV| in the equation above, where DMV is the differential motion vector.

In another embodiment, motion vector symmetry can be used to refine the TMVM. Motion vector symmetry (Bartels, C. and de Haan, G., 2009, "Temporal symmetry constraints in block matching," Proc. IEEE 13th Int'l. Symposium on Consumer Electronics, pp. 749-752) is defined as the relative similarity of a pair of motion vectors that are counterparts of each other when the temporal direction of motion estimation is switched, and it is a measure of the quality of the calculated motion vectors: the higher the symmetry, the better the motion vector quality. The "symmetry error vector" is defined as the difference between the motion vector obtained by forward motion estimation and the motion vector obtained by backward motion estimation. Low motion vector symmetry (a large symmetry error vector) is often an indicator of complex phenomena such as occlusion (one object moving in front of another, thereby covering or uncovering the background object), motion of objects onto or off the video frame, and illumination changes, all of which make it difficult to derive accurate motion vectors.

In one embodiment, symmetry is judged to be low when the magnitude of the symmetry error vector is greater than half the extent of the data block being encoded (e.g., greater than the magnitude of an (8, 8) vector for a 16×16 macroblock). In another embodiment, symmetry is judged to be low when the magnitude of the symmetry error vector is greater than a threshold based on motion vector statistics derived during the tracking process, such as the mean motion vector magnitude plus a multiple of the standard deviation of the motion vector magnitude in the current frame or in some combination of recent frames. In one embodiment, data blocks whose motion vectors have low symmetry as defined above are automatically assigned a TMVM value of 0, while other data blocks retain their previous TMVM value from the comparison of the global motion vector with the encoder motion vector.
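
The first low-symmetry test (half the block extent) can be sketched as below. The sketch assumes the backward vector is already expressed in the same sign convention as the forward vector, so that a perfectly symmetric pair gives a zero difference; that convention and the function name are assumptions, not details stated in the text.

```python
import math

def is_low_symmetry(mv_fwd, mv_bwd, block_size=16):
    """True when the symmetry error vector exceeds half the block extent."""
    err_x = mv_fwd[0] - mv_bwd[0]
    err_y = mv_fwd[1] - mv_bwd[1]
    half = block_size / 2.0
    limit = math.hypot(half, half)   # magnitude of the (8, 8) vector for 16x16 blocks
    return math.hypot(err_x, err_y) > limit
```

A block for which this returns True would be assigned TMVM = 0; other blocks keep their earlier TMVM value.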

Flat blocks have high spatial contrast sensitivity, but they tend to produce unreliable motion vectors because of the well-known aperture problem in motion vector calculation. Flat blocks can be detected, for example, using an edge detection process (a block is judged to be flat if no edges are detected within it) or by comparing the variance of the data block with a threshold (a variance below the threshold indicates a flat block). In one embodiment, block flatness can be used to modify the TMVM calculated as described above. For example, a block detected as flat can be reassigned a TMVM value of 0.

In one embodiment, the TMVM can be used as a mask to refine the TCSF, which depends on having reliable motion vectors. Because TMVM values are 0 or 1, multiplying the TMVM value for a block by the TCSF value for that block, block by block, has the effect of masking the TCSF. For blocks whose TMVM value is 0, the TCSF is "disabled," because the motion vector needed to calculate the TCSF is unreliable. For blocks whose TMVM value is 1, the TCSF calculation is deemed reliable and is used with confidence in any of the ways described above.
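
The masking step is an element-wise product. A minimal sketch, assuming both maps are stored as lists of rows with one entry per block (the storage layout and function name are illustrative):

```python
def mask_tcsf(tcsf_map, tmvm_map):
    """Multiply each block's TCSF value by its 0/1 TMVM value."""
    return [
        [t * m for t, m in zip(tcsf_row, tmvm_row)]
        for tcsf_row, tmvm_row in zip(tcsf_map, tmvm_map)
    ]
```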

In another embodiment, a spatial contrast map can be generated instead of, or in addition to, the temporal contrast map (the TCSF described above). In the present invention, simple measures are used to gauge spatial contrast, the opposite of which is termed "spatial complexity." In one embodiment, block variance, measured for both the luma and chroma components of the data, is used to measure the spatial complexity of a given input block. An input block with high variance is considered spatially complex and less noticeable to the HVS, and thus its spatial contrast is low.

In another embodiment, block luminance, measured on the luma component of the data, is used to refine the variance-based measure of spatial complexity. An input block that has low variance (low spatial complexity, high spatial contrast) but is either very bright or very dark is automatically deemed to have low spatial contrast, overriding its earlier classification as having high spatial contrast. The reason is that very dark and very bright regions are less noticeable to the HVS. The luminance thresholds for classifying a given block as very bright or very dark are application-specific, but typical values for 8-bit video are above 170 for very bright and below 60 for very dark.

The block variance, modified by block luminance as described above, can be calculated for all input blocks of a video frame to form a spatial contrast map (SCM) that indicates regions of high and low noticeability to the HVS in terms of spatial contrast.
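
One way to picture the SCM rules is the sketch below. It makes several simplifying assumptions that the text does not fix: the map is binary (1 for high spatial contrast, 0 for low), a single combined variance is supplied per block, block luminance is taken as the mean luma, and `var_threshold` is an arbitrary illustrative value. Only the 60 and 170 luminance thresholds come from the text.

```python
def spatial_contrast(block_var, mean_luma, var_threshold=100.0, dark=60, bright=170):
    """Return 1 for high spatial contrast (noticeable), 0 for low."""
    if mean_luma < dark or mean_luma > bright:
        return 0                      # luminance overrides the variance result
    return 1 if block_var < var_threshold else 0

def spatial_contrast_map(variances, lumas, **kwargs):
    """Apply the per-block rule to every block of a frame."""
    return [spatial_contrast(v, l, **kwargs) for v, l in zip(variances, lumas)]
```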

In one embodiment, the SCM can be combined with the TCSF (as refined by the TMVM) to form a unified importance map. The unified map can be formed, for example, by normalizing both the SCM and the TCSF as appropriate and then multiplying the SCM value for a given block by the TCSF value for that block, block by block. In another embodiment, the SCM can be used in place of the TCSF. In another embodiment, the SCM can be used to refine the TCSF. For example, in a block of high complexity the SCM value can override the TCSF value for that block, whereas in a block of low complexity the TCSF value for that block can be used directly.
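
The multiplicative combination can be sketched as follows. The text leaves the normalization open ("as appropriate"); dividing by the frame maximum is an assumption chosen here for simplicity, and the function names are illustrative.

```python
def normalize(values):
    """Scale values to [0, 1] by the frame maximum (an assumed normalization)."""
    peak = max(values)
    if peak <= 0:
        return [0.0 for _ in values]
    return [v / peak for v in values]

def unified_importance(tcsf, tmvm, scm):
    """Mask the TCSF with the TMVM, normalize, then multiply by the SCM per block."""
    masked = [t * m for t, m in zip(tcsf, tmvm)]
    return [a * b for a, b in zip(normalize(masked), normalize(scm))]
```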

<Application of importance maps to video coding>
The importance maps described above can be applied to the video encoding process of either a general encoder (FIG. 2) or a CBT encoder (FIG. 3) to improve the quality of the encoded bitstream.

FIG. 7 shows the general application of importance maps to video coding. The input video frame 5 and the frame store 85 are used to generate perceptual statistics 390. The perceptual statistics 390 are then applied to form the TCSF (refined by the TMVM) and/or SCM importance maps 400 described above. The perceptual statistics 390 may include, but are not limited to, motion vector magnitudes, block variance, block luminance, edge detection, and global motion model parameters. The input video frame 5 and the frame store 85 are also input as usual to the encoding of the video frame at 450. The encoding includes the usual encoding steps (motion estimation 15, inter prediction 20, intra prediction 30, transform and quantization 60, and entropy coding 90 in FIG. 1). In FIG. 7, however, the encoding 450 is enhanced by the importance maps 400 in the manner described below.

FIG. 8A shows the specific application of importance maps to improve video coding using the CBT. FIG. 8A shows initial motion estimation (110 in FIG. 2) by means of the frame-to-frame tracking 210 and continuous tracking 220 steps from the CBT. Fine motion estimation 130 is then applied to the global CBT motion vectors 225, with the same fine motion estimation steps of local search and subpixel refinement described previously (250 in FIG. 4). This is again followed by a mode generation module 140 that generates a set of candidate predictions 145 based on the coding modes available to the encoder. As in FIG. 4, EPZS and other non-model-based candidates, such as the (0, 0) motion vector and the median predictor, can also be generated in parallel as part of a unified motion estimation framework (these other candidates are omitted from FIG. 8A to simplify the drawing). In FIG. 8A as well, the full set of candidate predictions 145, including all coding modes for the CBT candidates and possibly all coding modes for other non-model-based candidates, undergoes "final" rate-distortion analysis 155 to determine the single best candidate. In the "final" rate-distortion analysis, an exact rate-distortion metric D + λR is used, computing the prediction error D for the distortion portion and the actual encoded bits R (from the entropy coding 90 in FIG. 1) for the rate portion. The final prediction 160 (or 40 in FIG. 1) is passed to the subsequent steps of the encoder together with its motion vector and other encoding parameters.

In FIG. 8A, the perceptual statistics 390 can be calculated from the motion vectors derived from the frame-to-frame motion tracking 210 and then applied to form the importance maps 400 described above. These importance maps 400 are then input to the final rate-distortion analysis 155. Again, the perceptual statistics 390 may include, but are not limited to, motion vector magnitudes, block variance, block luminance, edge detection, and global motion model parameters.

In one embodiment, the importance maps are used to modify the rate-distortion optimization criterion. In a standard encoder (see FIG. 2), the full set of candidate predictions 145 for a given input block 10 undergoes "final" rate-distortion analysis 150 to determine the single best candidate. In the "final" rate-distortion analysis, an exact rate-distortion metric D + λR is used, computing the prediction error D for the distortion portion and the actual encoded bits R (from the entropy coding 90 in FIG. 1) for the rate portion. The candidate with the lowest score for the rate-distortion metric D + λR becomes the final prediction 160 for the given input block 10. In one embodiment of the invention, for the perceptually optimized encoder of FIG. 7 or FIG. 8, the importance map IM is calculated at 400 and the final rate-distortion analysis 155 uses a modified rate-distortion metric D × IM + λR. In the modified metric, the distortion term is multiplied by the IM value for the given input block, so that the higher the IM value, the greater the importance assigned to low-distortion solutions, because a high IM value indicates that the corresponding input block is perceptually important. The importance map may include the TCSF (possibly refined by the TMVM), the SCM, or a composite of both.
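
The effect of the modified metric on candidate selection can be seen in a small sketch. The tuple layout `(label, distortion, rate_bits)` and the function name are assumptions for this example; a real encoder would obtain D and R from prediction and entropy coding.

```python
def select_candidate(candidates, im_value, lam):
    """Pick the candidate minimizing D * IM + lambda * R.

    candidates: iterable of (label, distortion, rate_bits) tuples (assumed layout).
    """
    return min(candidates, key=lambda c: c[1] * im_value + lam * c[2])
```

With candidates A (D = 100, R = 10) and B (D = 60, R = 40) and λ = 2, an IM value of 1 selects A (score 120 versus 140), whereas an IM value of 4 selects the lower-distortion candidate B (score 320 versus 420).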

In a further embodiment, in addition to the above, the distortion D in the rate-distortion metric can be computed as a weighted sum of SSE (sum of squared errors, the "standard" way of calculating distortion) and SSIM calculated in YUV space. The weighting γ can be computed adaptively so that the average SSIM value, SSIM_avg, over the first few (or most recent few) frames of the video, when weighted, equals the average SSE value, SSE_avg, over the same frames: γ × SSIM_avg = SSE_avg. For each input block, the modified rate-distortion metric is thus (SSE + γ × SSIM) × IM + 2λR, where the factor of 2 in front of the λR term accounts for the fact that there are two distortion terms. Including SSIM in the distortion measurement gives HVS perception even greater weight in the rate-distortion optimization, because SSIM accounts for structural information in the data.

In another embodiment, the importance maps (e.g., the TCSF with TMVM refinement, the SCM) can be used to modify the block-level quantization of the encoder, in addition to (or instead of) modifying the rate-distortion optimization. Quantization controls the relative quality at which a given input block is encoded: heavily quantized data results in lower-quality encoded output, while lightly quantized data results in higher-quality encoded output. The amount of quantization is controlled by a quantization parameter, QP. Standard encoders assign different QP values, QP_frame, to different frame types: I-frames are encoded with the smallest QP (highest quality), B-frames with the largest QP (lowest quality), and P-frames with an intermediate QP (intermediate quality).

The technique above thus presents a method of encoding a plurality of video frames having non-overlapping target blocks by using importance maps to modify the quantization of each target block in each video frame, and thereby to affect its encoding quality. The importance maps can be configured using temporal information (the TCSF with TMVM refinement), spatial information, or a combination of the two (i.e., a unified importance map). Because the importance maps indicate which parts of each video frame are most noticeable to human perception, the importance map values should modulate the QP for each target block such that (i) for blocks where the importance map takes high values, the block QP is reduced relative to QP_frame, resulting in higher quality for those blocks, and (ii) for blocks where the importance map takes low values, the block QP is increased relative to the frame quantization parameter QP_frame, resulting in lower quality for those blocks.

FIG. 8B shows an example of a process that uses importance maps 400 to modify quantization during encoding. At 400, importance maps can be configured using temporal and/or spatial information derived from the perceptual statistics 390. The temporal information can be provided, for example, by a temporal contrast sensitivity function (TCSF) that indicates which target blocks are most temporally noticeable to a human observer and by a true motion vector map (TMVM) that indicates which target blocks correspond to foreground data, with the TCSF considered valid only for target blocks identified as foreground data. The spatial information can be provided, for example, by a rule-based spatial complexity map (SCM).

The importance maps 400 are then used to modify the quantization step 430 within the encoding 450, as described above. For blocks where the importance map takes high values, the block quantization parameter (QP) is reduced relative to the frame quantization parameter QP_frame, resulting in higher encoding quality for those blocks. For blocks where the importance map takes low values, the block quantization parameter is increased relative to QP_frame, resulting in lower encoding quality for those blocks. By using the information from the importance maps, quantization can be modified in a way that improves the encoding quality of each target block to be encoded in each video frame.

In one embodiment, the TCSF map for a given frame can be used to adjust the frame QP on a block-by-block basis. One method of calculating the block QP, QP_block, is to relate the adjustment to the full TCSF map in the frame, following the method of Li, Z. et al., 2011, "Visual attention guided bit allocation in video compression," J. of Image and Vision Computing, 29(1):1-14. The resulting equation is

$$QP_{block} = \frac{TCSF_{frame}}{TCSF_{block} \times M} \times QP_{frame}$$

where TCSF_frame is the sum of the TCSF values for all blocks in the frame, TCSF_block is the TCSF value for the given block, QP_frame is the frame QP, and M is the number of blocks in the frame. In a further embodiment, the multiplication factor TCSF_frame / (TCSF_block × M) can be scaled to prevent the final values of QP_block from becoming too high or too low relative to QP_frame.
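
A minimal sketch of this frame-relative adjustment follows. It assumes the TCSF values are on the absolute scale (so every value is positive) and are stored as a flat list. The optional `scale_limits` clamp is one possible reading of "scaled"; the text does not say how the factor is limited, so both the clamp and its bounds are assumptions.

```python
def qp_from_tcsf_map(tcsf_blocks, qp_frame, scale_limits=None):
    """QP_block = (TCSF_frame / (TCSF_block * M)) * QP_frame for each block."""
    m = len(tcsf_blocks)
    tcsf_frame = sum(tcsf_blocks)          # sum of TCSF values over the frame
    qps = []
    for tcsf_block in tcsf_blocks:
        factor = tcsf_frame / (tcsf_block * m)
        if scale_limits is not None:       # assumed form of the optional scaling
            lo, hi = scale_limits
            factor = min(max(factor, lo), hi)
        qps.append(factor * qp_frame)
    return qps
```

Because the factor equals the frame's mean TCSF divided by the block's TCSF, blocks with above-average TCSF receive a QP below QP_frame, and blocks with below-average TCSF receive a QP above it.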

In an alternative embodiment, the block-by-block adjustment of the QP by the TCSF map can be accomplished without reference to the full TCSF map for the frame. In this embodiment, the calculation of QP_block is simpler: QP_block = QP_frame / TCSF_block. In one embodiment, the resulting value of QP_block is clipped so that it does not exceed a predetermined maximum QP value or fall below a predetermined minimum QP value for the frame: QP_min ≤ QP_block ≤ QP_max.
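
The simpler per-block rule with clipping can be sketched as below. It assumes a positive `tcsf_block` on the absolute scale; the function name is illustrative, and rounding to an integer QP is deliberately left out because the text does not describe it.

```python
def qp_block_simple(qp_frame, tcsf_block, qp_min, qp_max):
    """QP_block = QP_frame / TCSF_block, clipped to [qp_min, qp_max]."""
    qp = qp_frame / tcsf_block
    return min(max(qp, qp_min), qp_max)
```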

In another embodiment, the outputs of the SCM can be used to modify the quantization parameter on a block-by-block basis using a rule-based approach. This embodiment begins by assigning high QP values (low quality) to blocks with high variance, because highly complex regions are less noticeable to the HVS. Blocks with low variance are assigned low QP values (high quality), because less complex regions are more noticeable to the HVS. In one embodiment, the QP assignment for a given block is bounded by the frame's maximum and minimum QP values, QP_max and QP_min, and is scaled linearly based on the block's variance relative to the variance of the other blocks in the frame. In an alternative embodiment, only blocks having a variance greater than the average variance of the entire frame are assigned QP values between the frame QP, QP_frame, and QP_max, with the assignment scaled linearly such that

$$QP_{block} = \frac{var_{block} - var_{frame}}{var_{block}} \times (QP_{max} - QP_{frame}) + QP_{frame}$$

In this alternative embodiment, the QP assignment for high-variance blocks may be further refined by the TCSF. For example, if the block is deemed foreground data by the TMVM and the TCSF has a log contrast sensitivity value (the vertical axis in FIG. 5) of less than 0.5, meaning that the block is temporally unimportant, QP_block is increased by 2. In an alternative embodiment, an edge detection process can be applied, and the QP of blocks containing edges can be adjusted to QP_min, overriding the QP previously assigned from spatial complexity, because edges are highly noticeable to the HVS. In a further embodiment, the QP of blocks that are very bright or very dark can be readjusted to QP_max, overriding the QP previously assigned from variance and (possibly) edge detection, because very dark and very bright regions are less noticeable to the HVS. This process is known as luminance masking.
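
The rule ordering in the alternative embodiment (variance first, then the edge override, then luminance masking) can be sketched as follows. Keeping QP_frame for blocks at or below the average variance, using mean luma as the block luminance, and the function and argument names are assumptions for illustration; the TCSF refinement step is omitted for brevity.

```python
def qp_from_variance(var_block, var_frame, qp_frame, qp_min, qp_max,
                     has_edge=False, mean_luma=None, dark=60, bright=170):
    """Rule-based block QP: variance scaling, edge override, luminance masking."""
    qp = float(qp_frame)   # assumed default for blocks not above the mean variance
    if var_block > var_frame:
        # Linear scaling between QP_frame and QP_max for high-variance blocks.
        qp = (var_block - var_frame) / var_block * (qp_max - qp_frame) + qp_frame
    if has_edge:
        qp = float(qp_min)  # edges are highly noticeable: best quality
    if mean_luma is not None and (mean_luma < dark or mean_luma > bright):
        qp = float(qp_max)  # luminance masking overrides the earlier rules
    return qp
```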

さらなる実施形態では、上記に加えて、分散の大きいブロックについてのＱＰ_ｍａｘの数値が、符号化された映像の品質レベルに基づいて動的に決定され得る。その思想は、低品質の符号化では分散の大きいブロックにおける品質低下を許容できないのでＱＰ_ｍａｘはＱＰ_{ｆｒａｍｅ}により近づけるのが望ましい一方、高品質の符号化ではビットを節約するために分散の大きいブロックについてのＱＰ_ｍａｘを増やすことを許容できるというものである。符号化の品質は、各Ｉ（イントラ）フレーム毎に、平均フレーム分散の±５％以内の分散を有するブロックの平均ＳＳＩＭを算出することによって更新され得て、かつ、ＳＳＩＭ値が高ければ高いほどＱＰ_ｍａｘのより高い数値に対応するようにされる。代替的な一実施形態では、品質指標が平均ＳＳＩＭと平均フレーム分散との積として算出されるように、平均ＳＳＩＭがそのフレームの平均分散によって調節される。 In a further embodiment, in addition to the above, the value of QP_max for high-variance blocks may be determined dynamically based on the quality level of the encoded video. The idea is that low-quality encodings cannot afford any quality drop in high-variance blocks, so QP_max should be closer to QP_frame, whereas high-quality encodings can afford an increased QP_max for high-variance blocks in order to save bits. The quality of the encoding may be updated at each I (intra) frame by calculating the average SSIM of blocks having a variance within ±5% of the average frame variance, with higher SSIM values corresponding to higher values of QP_max. In an alternative embodiment, the average SSIM is adjusted by the average variance of the frame, so that the quality indicator is calculated as the product of the average SSIM and the average frame variance.
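By way of a hedged sketch, the quality measurement described above could be computed as shown below. The function names are hypothetical. The text states only that higher SSIM corresponds to higher QP_max and does not give the mapping, so the linear mapping in `dynamic_qp_max` (including its `max_offset` parameter) is purely an assumption for illustration, and it is meaningful only for the plain average-SSIM measure, which lies in the range 0 to 1.

```python
def frame_quality(block_vars, block_ssims, tol=0.05, scale_by_variance=False):
    """Average SSIM over blocks whose variance is within +/-tol of the frame mean.

    Returns None when no block falls inside the band (an assumption; the
    source does not say what happens in that case).
    """
    mean_var = sum(block_vars) / len(block_vars)
    lo, hi = (1 - tol) * mean_var, (1 + tol) * mean_var
    selected = [s for v, s in zip(block_vars, block_ssims) if lo <= v <= hi]
    if not selected:
        return None
    mean_ssim = sum(selected) / len(selected)
    # Alternative embodiment: quality = average SSIM x average frame variance.
    return mean_ssim * mean_var if scale_by_variance else mean_ssim


def dynamic_qp_max(quality, qp_frame, max_offset=10):
    """Hypothetical linear mapping from quality in [0, 1] to QP_max."""
    q = min(max(quality, 0.0), 1.0)
    # Low quality keeps QP_max near QP_frame; high quality allows a larger QP_max.
    return qp_frame + round(max_offset * q)
```

The measurement would be refreshed at each I frame, and the resulting QP_max would then bound the variance-based assignment for subsequent high-variance blocks.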

さらなる実施形態では、上記に加えて、分散の極めて小さいブロック（ＨＶＳにとって特に可視的であるフラットな領域に相当）に対して、これらの領域における高品質符号化を確実にするために、決まった低いＱＰ値が振り当てられ得る。例えば、Ｉ（イントラ）フレームの場合、０〜１０の分散を有するブロックにＱＰ＝２８が振り当てられ得て、１０〜３０の分散を有するブロックにＱＰ＝３０が振り当てられ得て、３０〜６０の分散を有するブロックにＱＰ＝３２が振り当てられ得る。それから、Ｐ及びＢフレーム内のブロックに対するＱＰ振当量が、上記のＱＰからそれぞれｉｐｒａｔｉｏ（ｉｐ率）パラメータ及びｐｂｒａｔｉｏ（ｐｂ率）パラメータを用いて導き出され得る。 In a further embodiment, in addition to the above, blocks with very low variance (corresponding to flat regions that are especially visible to the HVS) may be assigned fixed low QP values to ensure high-quality encoding in those regions. For example, for I (intra) frames, blocks with a variance between 0 and 10 may be assigned QP = 28, blocks with a variance between 10 and 30 may be assigned QP = 30, and blocks with a variance between 30 and 60 may be assigned QP = 32. QP assignments for blocks in P and B frames may then be derived from the above QPs using the ipratio and pbratio parameters, respectively.
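The fixed assignment for very-low-variance blocks can be illustrated with a small lookup, sketched below. The half-open interval boundaries are an assumption, since the text does not say which range a variance of exactly 10 or 30 falls into. The text also does not state how ipratio and pbratio are applied; the conversion shown here (an offset of 6·log2(ratio), in the style of x264 constant-QP rate control) and the ratio values 1.4 and 1.3 are assumptions used only for illustration.

```python
import math

# (exclusive upper variance bound, I-frame QP), taken from the example above.
I_FRAME_LOW_VARIANCE_QP = [(10, 28), (30, 30), (60, 32)]


def low_variance_qp_iframe(var_block):
    """Fixed I-frame QP for a very-low-variance block, or None above 60."""
    for upper, qp in I_FRAME_LOW_VARIANCE_QP:
        if var_block < upper:  # assumption: ranges are [lower, upper)
            return qp
    return None


def derive_p_b_qp(qp_i, ipratio=1.4, pbratio=1.3):
    """Hypothetical P/B QPs from an I-frame QP via 6*log2(ratio) offsets."""
    qp_p = qp_i + 6.0 * math.log2(ipratio)
    qp_b = qp_p + 6.0 * math.log2(pbratio)
    return round(qp_p), round(qp_b)
```

Under these assumptions, a flat block assigned QP = 28 in an I frame would receive roughly QP = 31 in a P frame and QP = 33 in a B frame.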

さらなる実施形態では、上記に加えて、分散の小さいブロック（例えば、６０〜平均フレーム分散の分散を有するブロック等）に対してフレームＱＰＱＰ_{ｆｒａｍｅ}が振り当てられて、それから、当該分散の小さいブロックが、さらなる品質向上が必要なのか否かを判定するように調べられる。一実施形態では、ブロックノイズ（blockiness）アーチファクトを、符号化中の現在の（ターゲット）ブロックからの再構成されたピクセル及び元々のピクセルの空間的複雑度及び輝度を符号化済みの周囲のブロック（例えば、左、左上、上、右上（これらが存在する場合）のブロック等）の空間的複雑度及び輝度と比較することによって検出し得る。仮に、ターゲットブロックの再構成されたピクセルの空間的複雑度尺度及び輝度尺度と近傍ブロックの対応する尺度との間には大きな違いがあるものの、そのターゲットブロックの元々のピクセルとその近傍ブロックの元々のピクセルとの間には空間的複雑度及び輝度にそのような違いがない場合には、そのターゲットブロックが「ブロックノイズ（blocky）」であると見なされる。この場合、そのブロックのＱＰ値が、当該ブロックの符号化品質を向上させるように減らされる（例えば、２だけ減らされる）。他の実施形態では、ターゲットブロックの推定品質が、符号化済みの周囲のブロック（例えば、左、左上、右、右上（これらが存在する場合）のブロック等）のＳＳＩＭ値及びＱＰ値を平均化することによって算出される。その平均ＱＰ値ＱＰ_ａｖｇが、そのターゲットブロックについての推定ＱＰＱＰ_{ｂｌｏｃｋ}とされる。平均ＳＳＩＭ値ＳＳＩＭ_ｅｓｔが０．９未満であると、ＱＰ_{ｂｌｏｃｋ}＝ＱＰ_ａｖｇが２だけ減らされてその品質を向上させる。さらなる実施形態において、ＴＭＶＭにより前景データとして特定されたターゲットブロックは、そのＴＣＳＦのコントラスト感度対数値（図５の縦軸）が０．８超である（そのブロックが時間的に重要であることを意味する）場合にのみ、ＱＰ_{ｂｌｏｃｋ}が２だけ減らされる。 In a further embodiment, in addition to the above, low-variance blocks (for example, blocks with a variance between 60 and the average frame variance) are assigned the frame QP, QP_frame, and are then examined to determine whether further quality enhancement is needed. In one embodiment, blockiness artifacts may be detected by comparing the spatial complexity and luminance of the reconstructed pixels and of the original pixels of the current (target) block being encoded with the spatial complexity and luminance of previously encoded surrounding blocks (for example, the blocks to the left, top-left, top, and top-right, where they exist). If there is a large difference between the spatial complexity and luminance measures of the reconstructed pixels of the target block and the corresponding measures of the neighboring blocks, but no such difference in spatial complexity and luminance between the original pixels of the target block and those of the neighboring blocks, the target block is considered "blocky." In this case, the QP value of the block is decreased (for example, by 2) to improve the encoding quality of the block. In another embodiment, the estimated quality of the target block is calculated by averaging the SSIM and QP values of previously encoded surrounding blocks (for example, the blocks to the left, top-left, right, and top-right, where they exist). The average QP value, QP_avg, is taken as the estimated QP, QP_block, for the target block. If the average SSIM value, SSIM_est, is less than 0.9, QP_block = QP_avg is decreased by 2 to improve its quality. In a further embodiment, if the target block is identified as foreground data by the TMVM, QP_block is decreased by 2 only if the TCSF log contrast sensitivity value (vertical axis in FIG. 5) is greater than 0.8 (meaning that the block is temporally important).
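The neighbor-based quality estimate described above may be sketched as follows. The function name and the representation of neighbors as a list of (QP, SSIM) pairs are assumptions for this sketch; the thresholds 0.9 and 0.8 and the step of 2 come from the text. The optional `temporal_gate` flag corresponds to the further embodiment in which the reduction is applied only to temporally important foreground blocks.

```python
def estimate_block_qp(neighbors, ssim_threshold=0.9, qp_step=2,
                      temporal_gate=False, is_foreground=False, log_tcsf=None):
    """Estimate QP_block from already-encoded neighbors.

    neighbors: list of (qp, ssim) pairs for whichever neighbors exist.
    Returns (qp_block, ssim_est), or None if no neighbor is available
    (an assumption; the source does not cover that case).
    """
    if not neighbors:
        return None
    qp_avg = sum(qp for qp, _ in neighbors) / len(neighbors)
    ssim_est = sum(ssim for _, ssim in neighbors) / len(neighbors)
    needs_boost = ssim_est < ssim_threshold
    if temporal_gate:
        # Further embodiment: only temporally important foreground blocks.
        needs_boost = (needs_boost and is_foreground
                       and log_tcsf is not None and log_tcsf > 0.8)
    qp_block = qp_avg - qp_step if needs_boost else qp_avg
    return qp_block, ssim_est
```

For instance, three neighbors with QPs 30, 32 and 31 and SSIM values 0.95, 0.85 and 0.80 give QP_avg = 31 and an SSIM estimate below 0.9, so QP_block becomes 29.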

これまでに説明した方法は、時間的重要度マップ（ＴＭＶＭによる洗練化を伴うか又は伴わないＴＣＳＦ）、空間的重要度マップまたは両方を使用し得る。時間的重要度マップと空間的重要度マップとの両方が用いられた場合、その結果生じるものは、統合化された重要度マップと称される。 The methods described so far may use temporal importance maps (TCSF with or without TMVM refinement), spatial importance maps, or both. If both temporal and spatial importance maps are used, the result is called the integrated importance map.

前述したような知覚的統計量から生成された重要度マップは、動き補償を用いて動きベクトルを生成する映像圧縮フレームであればどのような映像圧縮フレームにも適用可能であり得て、これにより、同じ符号化サイズで視覚的により優れた符号化を作り出すようにレート歪み解析と量子化との両方が向上される。映像圧縮への重要度マップの適用は、既に詳述した連続的ブロックトラッカー（ＣＢＴ）に適用するうえで特殊な適用を必要としない。しかも、ＣＢＴは、どの動きベクトルが真の動きベクトルであるのかを正確に決定するという追加の能力を提供するので、重要度マップはＣＢＴベースの符号化フレームワークにおいてより効果的である。その具体的な理由として、ＣＢＴのフレーム−フレーム間動きベクトル（図８Ａのフレーム−フレーム間追跡２１０からのもの）が、映像の元々のフレームから生成されたものであって再構成されたフレームから生成されたものではない点が挙げられる。一般的なエンコーダの場合の図２及び図７のフレームストア８５は符号化プロセスから生成された再構成後のフレームを含むのに対し、図３、図４及び図８Ａのフレームストア２０５は元々の映像フレームを含んでいる。そのため、ＣＢＴのフレーム−フレーム間追跡（図３、図４及び図８の符号２１０）は映像の真の動きをより良好に追跡することが可能であり、かつ、そのフレーム−フレーム間動きベクトルはより正確な真の動きベクトルマップを生成する。対照的に、一般的なエンコーダの動きベクトルは、レート歪み（圧縮）性能を最適化するように選択されており、映像の真の動きを反映しない可能性がある。 The importance map generated from perceptual statistics as described above may be applied to any video compression framework that uses motion compensation to generate motion vectors, improving both rate-distortion analysis and quantization so as to produce visually superior encodings at the same encoding size. Applying importance maps to video compression does not require any special adaptation for use with the continuous block tracker (CBT) detailed above. Moreover, the importance map is more effective in a CBT-based encoding framework, because the CBT provides the additional capability of accurately determining which motion vectors are true motion vectors. Specifically, the CBT frame-to-frame motion vectors (from frame-to-frame tracking 210 in FIG. 8A) are generated from the original frames of the video and not from reconstructed frames. The frame store 85 of FIGS. 2 and 7 for a typical encoder contains reconstructed frames generated by the encoding process, whereas the frame store 205 of FIGS. 3, 4 and 8A contains the original video frames. As a result, the CBT frame-to-frame tracking (210 in FIGS. 3, 4 and 8) can better track the true motion of the video, and its frame-to-frame motion vectors produce a more accurate true motion vector map. In contrast, the motion vectors of a typical encoder are selected to optimize rate-distortion (compression) performance and may not reflect the true motion of the video.

なお、生成された重要度マップは、イントラ予測フレームにも、これまでに述べた手法に従ってイントラ予測モード間のレート歪み最適化を改変するか又はブロックレベル量子化を改変することによって適用可能であり得る。ただし、オールイントラエンコーダ（全イントラエンコーダ）の場合には、ＴＣＳＦを算出するうえで、映像フレーム内のそれぞれのデータブロックについての動きベクトルを生成するための別個の符号化手段（例えば、図８Ａのフレーム−フレーム間追跡２１０等）が必要となる。 Note that the generated importance map may also be applied to intra-predicted frames, by modifying the rate-distortion optimization among intra-prediction modes or by modifying the block-level quantization, following the techniques described above. For all-intra encoders, however, computing the TCSF requires a separate encoding stage (for example, frame-to-frame tracking 210 in FIG. 8A) to generate motion vectors for each data block in the video frame.

＜デジタル処理環境＞
本発明の例示的な実装は、ソフトウェア環境でもファームウェア環境でもハードウェア環境でも実現可能であり得る。図９Ａに、そのような環境の一つを示す。少なくとも１つのクライアントコンピュータ／デバイス９５０（例えば、携帯電話、コンピューティングデバイス等）およびクラウド９６０（またはサーバコンピュータもしくはサーバコンピュータのクラスタ）は、アプリケーションプログラムを実行する処理機能、記憶機能、符号化機能、復号化機能および入出力装置などを提供する。 <Digital processing environment>
An exemplary implementation of the invention may be realized in a software, firmware, or hardware environment. FIG. 9A shows one such environment. At least one client computer/device 950 (e.g., a mobile phone or computing device) and a cloud 960 (or a server computer or cluster of server computers) provide processing, storage, encoding, and decoding functions and input/output devices for executing application programs.

また、少なくとも１つのクライアントコンピュータ／デバイス９５０は、通信ネットワーク９７０を介して、他のクライアントデバイス／プロセス９５０および少なくとも１つのサーバコンピュータ９６０を含む他のコンピューティングデバイスと接続可能であり得る。通信ネットワーク９７０は、リモートアクセスネットワークの一部、グローバルネットワーク（例えば、インターネット等）の一部、世界規模のコンピュータの集まりの一部、ローカルエリアネットワークの一部、ワイドエリアネットワークの一部、あるいは、現在それぞれのプロトコル（ＴＣＰ／ＩＰ、Ｂｌｕｅｔｏｏｔｈ（登録商標）など）を用いて相互通信するゲートウェイの一部であり得る。それ以外の電子デバイス／コンピュータネットワークアーキテクチャも使用可能である。 Also, at least one client computer/device 950 may be connectable via a communication network 970 to other client devices/processes 950 and other computing devices including at least one server computer 960. The communications network 970 may be part of a remote access network, part of a global network (eg, the Internet, etc.), part of a worldwide collection of computers, part of a local area network, part of a wide area network, or It may be part of a gateway that currently communicates with each other using respective protocols (TCP/IP, Bluetooth®, etc.). Other electronic device/computer network architectures can be used.

本発明の実施形態は、映像又はデータ信号情報を符号化、追跡、モデル化、フィルタリング、調整、復号化又は表示する手段を含み得る。図９Ｂは、そのような映像又はデータ信号情報の符号化を促進するのに用いられ得る、図９Ａの処理環境における所与のコンピュータ／コンピューティングノード（例えば、クライアントプロセッサ／デバイス／携帯電話デバイス／タブレット９５０、サーバコンピュータ９６０等）の内部構造の図である。各コンピュータ９５０，９６０は、コンピュータ又は処理システムの構成要素間のデータ転送に用いられる実在する又は仮想的なハードウェアラインのセットであるシステムバス９７９を備える。バス９７９は、コンピュータシステムの相異なる構成要素（例えば、プロセッサ、エンコーダチップ、デコーダチップ、ディスクストレージ、メモリ、入力／出力ポート等）同士を接続する共有の配管のようなものであり、それら構成要素間のデータのやり取りを可能にする。システムバス９７９には、様々な入出力装置（例えば、キーボード、マウス、ディスプレイ、プリンタ、スピーカ等）をコンピュータ９５０，９６０に接続するための入出力装置インターフェース９８２が取り付けられている。ネットワークインターフェース９８６は、コンピュータがネットワーク（例えば、図９Ａの符号９７０で示されるネットワーク等）に取り付けられた他の様々なデバイスと接続することを可能にする。メモリ９９０は、本発明のソフトウェア実装を実現するのに用いられるコンピュータソフトウェア命令９９２及びデータ９９４を記憶する揮発性メモリである。 Embodiments of the invention may include means for encoding, tracking, modeling, filtering, tuning, decoding, or displaying video or data signal information. FIG. 9B is a diagram of the internal structure of a given computer/computing node (e.g., client processor/device/mobile phone device/tablet 950, server computer 960) in the processing environment of FIG. 9A, which may be used to facilitate the encoding of such video or data signal information. Each computer 950, 960 contains a system bus 979, which is a set of real or virtual hardware lines used for data transfer among the components of a computer or processing system. The bus 979 is essentially a shared conduit that connects the different components of a computer system (e.g., processor, encoder chip, decoder chip, disk storage, memory, input/output ports) and enables the exchange of data between those components. Attached to the system bus 979 is an I/O device interface 982 for connecting various input and output devices (e.g., keyboard, mouse, display, printer, speakers) to the computer 950, 960. A network interface 986 allows the computer to connect to various other devices attached to a network (e.g., the network shown at 970 in FIG. 9A). Memory 990 is a volatile memory that stores the computer software instructions 992 and data 994 used to realize a software implementation of the present invention.

ディスクストレージ９９５は、本発明の一実施形態を実現するのに用いられるコンピュータソフトウェア命令９９８（等価的には「ＯＳプログラム」）及びデータ９９４を記憶する不揮発性ストレージである。また、ディスクストレージ９９５は、映像を圧縮フォーマットで長期的に記憶するのにも使用され得る。システムバス９７９には、さらに、コンピュータ命令を実行する中央演算処理装置９８４も取り付けられている。なお、本明細書をとおして「コンピュータソフトウェア命令」と「ＯＳプログラム」は互いに等価物とする。 Disk storage 995 is a non-volatile storage that stores computer software instructions 998 (equivalently “OS programs”) and data 994 used to implement one embodiment of the present invention. The disk storage 995 can also be used for long term storage of video in a compressed format. A central processing unit 984 that executes computer instructions is also attached to the system bus 979. Note that the “computer software instruction” and the “OS program” are equivalent to each other throughout the present specification.

一例として、エンコーダは、時間的情報や空間的情報から形成された重要度マップを用いて映像データを符号化するためのコンピュータ読取り可能な命令９９２により構成され得る。これらの重要度マップは、映像データの符号化／復号化を最適化するための、エンコーダ（又はエンコーダの構成要素）へのフィードバックループを提供するように構成され得る。 As an example, the encoder can be configured with computer readable instructions 992 for encoding video data using a significance map formed from temporal and spatial information. These importance maps may be configured to provide a feedback loop to the encoder (or components of the encoder) to optimize the encoding/decoding of video data.

一実施形態において、プロセッサルーチン９９２及びデータ９９４は、エンコーダ（概して符号９９２で示す）を備えるコンピュータプログラムプロダクトである。このようなコンピュータプログラムプロダクトは、そのエンコーダ用のソフトウェア命令の少なくとも一部を提供する、ストレージ装置９９４に記憶可能なコンピュータ読取り可能な媒体を含む。 In one embodiment, processor routines 992 and data 994 are computer program products that include an encoder (generally indicated at 992). Such computer program product includes a computer-readable medium readable by storage device 994 that provides at least some of the software instructions for the encoder.

コンピュータプログラムプロダクト９９２は、当該技術分野において周知である任意の適切なソフトウェアインストール方法によってインストール可能なものであり得る。また、他の実施形態において、前記エンコーダの前記ソフトウェア命令の少なくとも一部は、ケーブルおよび／または通信および／または無線接続を介してダウンロード可能なものであり得る。他の実施形態において、エンコーダシステムソフトウェアは、非過渡的なコンピュータ読取り可能な媒体に組み込まれたコンピュータプログラム伝播信号プロダクト９０７（図９Ａ）であり、当該コンピュータプログラム伝播信号プロダクト９０７は、実行されると、伝播媒体上の伝播信号（例えば、電波、赤外線波、レーザ波、音波、インターネットなどのグローバルネットワークや他の少なくとも１つのネットワークによって伝播される電気波など）として実現され得る。このような搬送媒体又は搬送信号が、本発明にかかるルーチン／プログラム９９２用のソフトウェア命令の少なくとも一部を提供する。 Computer program product 992 may be installable by any suitable software installation method known in the art. Also, in other embodiments, at least some of the software instructions of the encoder may be downloadable via a cable and/or communication and/or wireless connection. In another embodiment, the encoder system software is a computer program propagated signal product 907 (FIG. 9A) embedded in a non-transitory computer readable medium, the computer program propagated signal product 907 being executed. Can be realized as a propagation signal on a propagation medium (for example, electric wave, infrared wave, laser wave, sound wave, electric wave propagated by a global network such as the Internet or at least one other network). Such carrier media or signals provide at least some of the software instructions for routines/programs 992 of the present invention.

代替的な実施形態において、前記伝播信号は、伝播媒体によって搬送されるアナログ搬送波またはデジタル信号である。例えば、前記伝播信号は、グローバルネットワーク（例えば、インターネット等）、電気通信網または他のネットワークによって搬送されるデジタル信号であり得る。一実施形態において、前記伝播信号は、所与の期間のあいだ伝播媒体によって送信されるものであり、例えば、数ミリ秒、数秒、数分またはそれ以上の期間のあいだネットワークによってパケットで送信される、ソフトウェアアプリケーション用の命令等であり得る。他の実施形態において、コンピュータプログラムプロダクト９９２の前記コンピュータ読取り可能な媒体は、コンピュータシステム９５０が受け取って読み取りし得る伝播媒体である。例えば、コンピュータシステム９５０は、前述したコンピュータプログラム伝播信号プロダクトの場合のように、伝播媒体を受け取ってその伝播媒体内に組み込まれた伝播信号を特定する。 In an alternative embodiment, the propagating signal is an analog carrier or digital signal carried by a propagating medium. For example, the propagated signal may be a digital signal carried by a global network (eg, the Internet, etc.), a telecommunication network, or other network. In one embodiment, the propagating signal is sent by a propagating medium for a given period of time, eg, packets by the network for a period of milliseconds, seconds, minutes or more. , Instructions for software applications, etc. In another embodiment, the computer readable medium of computer program product 992 is a propagation medium readable and readable by computer system 950. For example, computer system 950 receives a propagation medium and identifies the propagated signal embedded within the propagation medium, as is the case for the computer program propagated signal product described above.

本発明を例示的な実施形態を参照しながら具体的に図示・説明したが、当業者であれば、添付の特許請求の範囲に包含された本発明の範囲を逸脱しない範疇で形態や細部に様々な変更を施せることを理解するであろう。
なお、本発明は、実施の態様として以下の内容を含む。
〔態様１〕
複数の映像フレームを符号化する方法であって、
前記映像フレームは、互いに重なり合わないターゲットブロックを有しており、
当該方法は、
重要度マップが量子化を調整することによって各映像フレーム内の符号化すべき各ターゲットブロックの符号化品質に影響を与えるように、前記重要度マップを用いて前記複数の映像フレームを符号化する過程、
を備え、前記重要度マップが：
時間的情報及び空間的情報を用いて当該重要度マップを設定すること；ならびに、
（ｉ）当該重要度マップが高い数値をとるブロックでは、ブロック量子化パラメータ（ＱＰ）がフレーム量子化パラメータＱＰ _{ｆｒａｍｅ} に比べて小さくされることで、これらのブロックについては高い品質となるように、かつ、（ｉｉ）当該重要度マップが低い数値をとるターゲットブロックでは、前記ブロック量子化パラメータが前記フレーム量子化パラメータＱＰ _{ｆｒａｍｅ} に比べて大きくされることで、これらのブロックについては低い品質となるように、計算によって、前記複数の映像フレームのうちのある映像フレームのどの部分が人間の知覚にとって最も気付き易いのかを当該重要度マップに示させること；
によって構成されている、方法。
〔態様２〕
態様１に記載の方法において、前記空間的情報が、ルールに基づく空間的複雑度マップ（ＳＣＭ）により提供されて、その最初のステップが、前記フレーム内のどのターゲットブロックが当該フレーム内の平均ブロック分散ｖａｒ _{ｆｒａｍｅ} よりも大きい分散を有するかを決定することであり、
前記平均ブロック分散ｖａｒ _{ｆｒａｍｅ} よりも大きい分散を有するブロックに対して、前記フレーム量子化パラメータＱＰ _{ｆｒａｍｅ} よりも高い量子化パラメータ（ＱＰ）値を振り当て、このブロック量子化パラメータ（ＱＰ）の振当量ＱＰ _{ｂｌｏｃｋ} は、そのブロック分散ｖａｒ _{ｂｌｏｃｋ} が前記平均ブロック分散ｖａｒ _{ｆｒａｍｅ} よりもいかなる程度大きいかに従って、前記フレーム量子化パラメータＱＰ _{ｆｒａｍｅ} と量子化パラメータ上限ＱＰ _ｍａｘとの間で線形的に増減される、方法。
〔態様３〕
態様１に記載の方法において、前記時間的情報が、
どのターゲットブロックが観測者である人間にとって時間的に最も気付き易いかを示す時間的コントラスト感度関数（ＴＣＳＦ）、および、
どのターゲットブロックが前景データに相当するかを示す真の動きベクトルマップ（ＴＭＶＭ）
により提供されて、前記ＴＣＳＦは、前景データとして特定されたターゲットブロックについてのみ有効とされる、方法。
〔態様４〕
態様２に記載の方法において、分散の大きいブロックは、そのブロック量子化パラメータ（ＱＰ）である前記振当量ＱＰ _{ｂｌｏｃｋ} が、前記ＴＭＶＭがターゲットブロックを前景データとして特定し且つ前記ＴＣＳＦのこのブロックについてのコントラスト感度対数値が０．５未満である場合には前記振当量ＱＰ _{ｂｌｏｃｋ} が２増加するように、前記ＴＣＳＦ及び前記ＴＭＶＭによりさらに洗練化される、方法。
〔態様５〕
態様２に記載の方法において、前記ＳＣＭは、さらに、極めて明るい（１７０超の輝度）か又は極めて暗い（６０未満の輝度）ターゲットブロックのブロック量子化パラメータである前記振当量ＱＰ _{ｂｌｏｃｋ} がＱＰ _ｍａｘに調節し直される輝度マスキングを含む、方法。
〔態様６〕
態様２に記載の方法において、前記ＳＣＭは、さらに、前記符号化された映像の品質レベルに基づく前記量子化パラメータ上限ＱＰ _ｍａｘの動的な決定を含み、
この動的な決定では、イントラ（Ｉ）フレーム内のターゲットブロックの平均構造的類似度（ＳＳＩＭ）算出結果をこれらフレームの平均ブロック分散ｖａｒ _{ｆｒａｍｅ} と共に用いて、品質が測定され、
前記測定された品質が低いと、前記量子化パラメータ上限ＱＰ _ｍａｘの数値が前記フレーム量子化パラメータＱＰ _{ｆｒａｍｅ} に近づくように減らされる、方法。
〔態様７〕
態様２に記載の方法において、分散の極めて小さいブロックに対して、これらの領域における高品質符号化を確実にするために、前記ブロック分散が小さいほど前記振当量ＱＰ _{ｂｌｏｃｋ} の数値が低くなるように（、かつ、品質が高くなるように）、決められた低い量子化パラメータ（ＱＰ）の値である前記振当量ＱＰ _{ｂｌｏｃｋ} が振り当てられる、方法。
〔態様８〕
態様７に記載の方法において、分散の極めて小さいブロックに対する前記低い量子化パラメータ（ＱＰ）の値である前記振当量ＱＰ _{ｂｌｏｃｋ} は、最初に、Ｉフレームについて決められ、その後、Ｐフレーム及びＢフレームについてはｉｐｒａｔｉｏパラメータ及びｐｂｒａｔｉｏパラメータを用いて決められる、方法。
〔態様９〕
態様７に記載の方法において、分散は小さいが、分散が極めて小さいとは見なさないブロックは、当該ブロックについて品質向上が必要か否かを判定するために、
前記ブロック量子化パラメータ（ＱＰ）の初めの推定値である前記振当量ＱＰ _{ｂｌｏｃｋ} が現在のブロックの左、左上、右および右上の既に符号化済みの近傍ブロックの量子化パラメータ（ＱＰ）の値を平均することによって算出されて、且つ、
前記現在のブロックの前記ＳＳＩＭの推定ＳＳＩＭ _ｅｓｔが前記現在のブロックの左、左上、右および右上の既に符号化済みの近傍ブロックのＳＳＩＭ値から算出されて、且つ、
ＳＳＩＭ _ｅｓｔが０．９未満の場合、前記振当量ＱＰ _{ｂｌｏｃｋ} の数値が２減少されるように、
調べられる、方法。
〔態様１０〕
態様９に記載の方法において、前記品質向上は、前記ＴＭＶＭにより前景データとして特定されて且つ前記ＴＣＳＦのコントラスト感度対数値が０．８超であるブロックにのみ適用される、方法。
〔態様１１〕
態様３に記載の方法において、前記ＴＣＳＦの時間的周波数は、前記ターゲットブロックとその参照ブロックとの間の色空間領域におけるＳＳＩＭを用いて波長の近似を求めて且つ動きベクトルの大きさとフレームレートとを用いて速度の近似を求めることによって算出される、方法。
〔態様１２〕
態様３に記載の方法において、前記ＴＣＳＦは、現在のフレームについての当該ＴＣＳＦが最近のフレームにおけるＴＣＳＦマップの重み付き平均であるように且つより最近のフレームがより大きい重み付けを受けるように、複数のフレームにわたって算出される、方法。
〔態様１３〕
態様３に記載の方法において、前記ＴＭＶＭは、前景データの場合にのみ１に設定される、方法。
〔態様１４〕
態様１３に記載の方法において、前景データは、所与のターゲットブロックについてのエンコーダ動きベクトルと当該ブロックについてのグローバル動きベクトルとの差分を算出し、十分に大きい差分を有するブロックが前景データであると判断されることによって特定される、方法。
〔態様１５〕
態様１４に記載の方法において、前景データとして特定されたデータブロックについて、前記グローバル動きベクトルから前記エンコーダ動きベクトルが減算されることによって差分動きベクトルを得て、この差分動きベクトルの大きさが前記ＴＣＳＦの時間的周波数を算出するのに用いられる、方法。
〔態様１６〕
態様３に記載の方法において、前記ＴＣＳＦは、エンコーダからの動きベクトルから算出される、方法。
〔態様１７〕
態様１に記載の方法において、前記重要度マップが前記時間的情報及び前記空間的情報で設定されたものである場合、当該重要度マップは統合化された重要度マップである、方法。
〔態様１８〕
映像データを符号化するシステムであって、
重要度マップを用いて複数の映像フレームを符号化するコーデックであって、当該映像フレームは、互いに重なり合わないターゲットブロックを有している、コーデック、
を備え、前記重要度マップは、量子化を調整することによって各映像フレーム内の符号化すべき各ターゲットブロックの符号化品質に影響を与えるように構成されており、
前記重要度マップが：
時間的情報及び空間的情報を用いて当該重要度マップを設定することであって、これら時間的情報と空間的情報とにより設定された重要度マップは、統合化された重要素マップであること；ならびに、
（ｉ）当該重要度マップが高い数値をとるブロックでは、ブロック量子化パラメータ（ＱＰ）がフレーム量子化パラメータＱＰ _{ｆｒａｍｅ} に比べて小さくされることで、これらのブロックについては高い品質となるように、かつ、（ｉｉ）当該重要度マップが低い数値をとるターゲットブロックでは、前記ブロック量子化パラメータが前記フレーム量子化パラメータＱＰ _{ｆｒａｍｅ} に比べて大きくされることで、これらのブロックについては低い品質となるように、計算によって、前記複数の映像フレームのうちのある映像フレームの、人間の知覚にとって最も気付き易い部分を当該重要度マップに示させること；
によって構成されている、システム。
〔態様１９〕
態様１８に記載のエンコーダにおいて、前記空間的情報が、ルールに基づく空間的複雑度マップ（ＳＣＭ）により提供されて、その最初のステップが、前記フレーム内のどのターゲットブロックが当該フレーム内の平均ブロック分散ｖａｒ _{ｆｒａｍｅ} よりも大きい分散を有するかを決定することであり、
前記平均ブロック分散ｖａｒ _{ｆｒａｍｅ} よりも大きい分散を有するブロックに対して、前記フレーム量子化パラメータＱＰ _{ｆｒａｍｅ} よりも高い量子化パラメータ（ＱＰ）値を振り当て、このブロック量子化パラメータ（ＱＰ）の振当量ＱＰ _{ｂｌｏｃｋ} は、そのブロック分散ｖａｒ _{ｂｌｏｃｋ} が前記平均ブロック分散ｖａｒ _{ｆｒａｍｅ} よりもいかなる程度大きいかに従って、前記フレーム量子化パラメータＱＰ _{ｆｒａｍｅ} と量子化パラメータ上限ＱＰ _ｍａｘとの間で線形的に増減される、エンコーダ。
〔態様２０〕
態様１８に記載のエンコーダにおいて、前記時間的情報が、
どのターゲットブロックが観測者である人間にとって時間的に最も気付き易いかを示す時間的コントラスト感度関数（ＴＣＳＦ）、および、
どのターゲットブロックが前景データに相当するかを示す真の動きベクトルマップ（ＴＭＶＭ）
により提供されて、前記ＴＣＳＦは、前景データとして特定されたターゲットブロックについてのみ有効とされる、エンコーダ。
〔態様２１〕
態様１９に記載のエンコーダにおいて、分散の大きいブロックは、そのブロック量子化パラメータ（ＱＰ）である前記振当量ＱＰ _{ｂｌｏｃｋ} が、前記ＴＭＶＭがターゲットブロックを前景データとして特定し且つ前記ＴＣＳＦのこのブロックについてのコントラスト感度対数値が０．５未満である場合には前記振当量ＱＰ _{ｂｌｏｃｋ} が２増加するように、前記ＴＣＳＦ及び前記ＴＭＶＭによりさらに洗練化される、エンコーダ。
〔態様２２〕
態様１９に記載のエンコーダにおいて、前記ＳＣＭは、さらに、極めて明るい（１７０超の輝度）か又は極めて暗い（６０未満の輝度）ターゲットブロックのブロック量子化パラメータである前記振当量ＱＰ _{ｂｌｏｃｋ} がＱＰ _ｍａｘに調節し直される輝度マスキングを含む、エンコーダ。
〔態様２３〕
態様１９に記載のエンコーダにおいて、前記ＳＣＭは、さらに、符号化された映像の品質レベルに前記量子化パラメータ上限基づくＱＰ _ｍａｘの動的な決定を含み、
この動的な決定では、イントラ（Ｉ）フレーム内のターゲットブロックの平均構造的類似度（ＳＳＩＭ）算出結果をこれらフレームの平均ブロック分散ｖａｒ _{ｆｒａｍｅ} と共に用いて、品質が測定され、
測定された品質が低いと、前記量子化パラメータ上限ＱＰ _ｍａｘの数値が前記フレーム量子化パラメータＱＰ _{ｆｒａｍｅ} 近づくように減らされる、エンコーダ。
〔態様２４〕
態様１９に記載のエンコーダにおいて、分散の極めて小さいブロックに対して、これらの領域における高品質符号化を確実にするために、前記ブロック分散が小さいほど前記振当量ＱＰ _{ｂｌｏｃｋ} の数値が低くなるように（、かつ、品質が高くなるように）、決められた低い量子化パラメータ（ＱＰ）の値である前記振当量ＱＰ _{ｂｌｏｃｋ} が振り当てられる、エンコーダ。
〔態様２５〕
態様２４に記載のエンコーダにおいて、分散の極めて小さいブロックに対する前記低い量子化パラメータ（ＱＰ）の値である前記振当量ＱＰ _{ｂｌｏｃｋ} は、最初に、Ｉフレームについては決められ、その後、Ｐフレーム及びＢフレームについてはｉｐｒａｔｉｏパラメータ及びｐｂｒａｔｉｏパラメータを用いて決められる、エンコーダ。
〔態様２６〕
態様１９に記載のシステムにおいて、分散は小さいが、分散が極めて小さいとは見なさないブロックは、当該ブロックについて品質向上が必要か否かを判定するために、
前記ブロック量子化パラメータ（ＱＰ）の初めの推定値である前記振当量ＱＰ _{ｂｌｏｃｋ} が現在のブロックの左、左上、右および右上の既に符号化済みの近傍ブロックの量子化パラメータ（ＱＰ）の値を平均することによって算出されて、且つ、
前記現在のブロックの前記ＳＳＩＭの推定ＳＳＩＭ _ｅｓｔが前記現在のブロックの左、左上、右および右上の既に符号化済みの近傍ブロックのＳＳＩＭ値から算出されて、且つ、
ＳＳＩＭ _ｅｓｔが０．９未満の場合、前記振当量ＱＰ _{ｂｌｏｃｋ} の数値が２減少されるように、
調べられる、システム。
〔態様２７〕
態様２６に記載のシステムにおいて、前記品質向上は、前記ＴＭＶＭにより前景データとして特定されて且つ前記ＴＣＳＦのコントラスト感度対数値が０．８超であるブロックにのみ適用される、システム。
〔態様２８〕
態様２０に記載のシステムにおいて、前記ＴＣＳＦの時間的周波数は、前記ターゲットブロックとその参照ブロックとの間の色空間領域におけるＳＳＩＭを用いて波長の近似を求めて且つ動きベクトルの大きさとフレームレートとを用いて速度の近似を求めることによって算出される、システム。
〔態様２９〕
態様２０に記載のシステムにおいて、前記ＴＣＳＦは、現在のフレームについての当該ＴＣＳＦが最近のフレームにおけるＴＣＳＦマップの重み付き平均であるように且つより最近のフレームがより大きい重み付けを受けるように、複数のフレームにわたって算出される、システム。
〔態様３０〕
態様２０に記載のシステムにおいて、前記ＴＭＶＭは、前景データの場合にのみ１に設定される、システム。
〔態様３１〕
態様３０に記載のシステムにおいて、前景データは、所与のターゲットブロックについてのエンコーダ動きベクトルと当該ブロックについてのグローバル動きベクトルとの差分を算出し、十分に大きい差分を有するブロックが前景データであると判断されることによって特定される、システム。
〔態様３２〕
態様２０に記載のシステムにおいて、前景データとして特定されたデータブロックについて、前記グローバル動きベクトルから前記エンコーダ動きベクトルが減算されることによって差分動きベクトルを得て、この差分動きベクトルの大きさが前記ＴＣＳＦの時間的周波数を算出するのに用いられる、システム。
〔態様３３〕
態様２０に記載のシステムにおいて、前記ＴＣＳＦは、前記エンコーダからの動きベクトルから算出される、システム。
〔態様３４〕
態様１８に記載のシステムにおいて、前記重要度マップが前記時間的情報と前記空間的情報で設定されたものである場合、当該重要度マップは統合化された重要度マップである、システム。
Although the present invention has been specifically shown and described with reference to exemplary embodiments, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the invention encompassed by the appended claims.
The present invention includes the following as aspects of implementation.
[Aspect 1]
A method of encoding a plurality of video frames,
the video frames having target blocks that do not overlap one another,
the method comprising:
encoding the plurality of video frames using an importance map, such that the importance map affects the encoding quality of each target block to be encoded in each video frame by adjusting the quantization,
wherein the importance map is configured by:
setting the importance map using temporal information and spatial information; and
causing the importance map to indicate, by computation, which parts of a video frame of the plurality of video frames are most noticeable to human perception, such that (i) in blocks where the importance map takes a high value, the block quantization parameter (QP) is made smaller than the frame quantization parameter QP_frame, resulting in higher quality for those blocks, and (ii) in target blocks where the importance map takes a low value, the block quantization parameter is made larger than the frame quantization parameter QP_frame, resulting in lower quality for those blocks.
[Aspect 2]
The method according to aspect 1, wherein the spatial information is provided by a rule-based spatial complexity map (SCM) whose first step is to determine which target blocks in the frame have a variance greater than the average block variance var_frame in the frame, and
wherein blocks having a variance greater than the average block variance var_frame are assigned a quantization parameter (QP) value higher than the frame quantization parameter QP_frame, the block QP assignment QP_block being scaled linearly between the frame quantization parameter QP_frame and a quantization parameter upper limit QP_max according to how much the block variance var_block exceeds the average block variance var_frame.
[Aspect 3]
The method according to aspect 1, wherein the temporal information is provided by:
a temporal contrast sensitivity function (TCSF) that indicates which target blocks are temporally most noticeable to a human observer; and
a true motion vector map (TMVM) that indicates which target blocks correspond to foreground data,
the TCSF being considered valid only for target blocks identified as foreground data.
[Aspect 4]
The method according to aspect 2, wherein the block QP assignment QP_block of high-variance blocks is further refined by the TCSF and the TMVM, such that QP_block is increased by 2 if the TMVM identifies the target block as foreground data and the TCSF log contrast sensitivity value for the block is less than 0.5.
[Aspect 5]
The method according to aspect 2, wherein the SCM further includes luminance masking, in which the block QP assignment QP_block of target blocks that are either very bright (luminance above 170) or very dark (luminance below 60) is readjusted to QP_max.
[Aspect 6]
The method according to aspect 2, wherein the SCM further includes a dynamic determination of the quantization parameter upper limit QP_max based on the quality level of the encoded video,
wherein, in the dynamic determination, quality is measured using an average structural similarity (SSIM) calculation over target blocks in intra (I) frames together with the average block variance var_frame of those frames, and
wherein, when the measured quality is low, the value of the quantization parameter upper limit QP_max is decreased toward the frame quantization parameter QP_frame.
[Aspect 7]
The method according to aspect 2, wherein blocks with very low variance are assigned fixed low quantization parameter (QP) values QP_block to ensure high-quality encoding in those regions, such that the lower the block variance, the lower the value of QP_block (and the higher the quality).
[Aspect 8]
The method according to aspect 7, wherein the low quantization parameter (QP) values QP_block for blocks with very low variance are first fixed for I frames and then determined for P frames and B frames using the ipratio and pbratio parameters.
[Aspect 9]
The method according to aspect 7, wherein blocks whose variance is low but not considered very low are examined to determine whether quality enhancement is needed for the block, such that:
an initial estimate of the block quantization parameter (QP), QP_block, is calculated by averaging the QP values of the previously encoded neighboring blocks to the left, top-left, right, and top-right of the current block;
an estimate SSIM_est of the SSIM of the current block is calculated from the SSIM values of the previously encoded neighboring blocks to the left, top-left, right, and top-right of the current block; and
the value of QP_block is decreased by 2 if SSIM_est is less than 0.9.
[Aspect 10]
The method according to aspect 9, wherein the quality enhancement is applied only to blocks that are identified as foreground data by the TMVM and for which the TCSF log contrast sensitivity value is greater than 0.8.
[Aspect 11]
The method according to aspect 3, wherein the temporal frequency of the TCSF is calculated by using the SSIM in the color space domain between the target block and its reference block to approximate wavelength, and by using the motion vector magnitude and the frame rate to approximate velocity.
[Aspect 12]
The method according to aspect 3, wherein the TCSF is calculated over multiple frames, such that the TCSF for the current frame is a weighted average of the TCSF maps over recent frames, with more recent frames receiving higher weighting.
[Aspect 13]
The method according to aspect 3, wherein the TMVM is set to 1 only for foreground data.
[Aspect 14]
The method according to aspect 13, wherein foreground data is identified by calculating the difference between the encoder motion vector for a given target block and the global motion vector for that block, blocks with a sufficiently large difference being determined to be foreground data.
[Aspect 15]
The method according to aspect 14, wherein, for data blocks identified as foreground data, the encoder motion vector is subtracted from the global motion vector to obtain a differential motion vector, and the magnitude of the differential motion vector is used to calculate the temporal frequency of the TCSF.
[Aspect 16]
The method according to aspect 3, wherein the TCSF is calculated from motion vectors from an encoder.
[Aspect 17]
The method according to aspect 1, wherein, when the importance map is set using the temporal information and the spatial information, the importance map is an integrated importance map.
[Aspect 18]
A system for encoding video data, comprising:
a codec that encodes a plurality of video frames using an importance map, the video frames having target blocks that do not overlap one another,
wherein the importance map is configured to affect the encoding quality of each target block to be encoded in each video frame by adjusting the quantization, and
wherein the importance map is configured by:
setting the importance map using temporal information and spatial information, the importance map set by the temporal information and the spatial information being an integrated importance map; and
causing the importance map to indicate, by computation, the parts of a video frame of the plurality of video frames that are most noticeable to human perception, such that (i) in blocks where the importance map takes a high value, the block quantization parameter (QP) is made smaller than the frame quantization parameter QP_frame, resulting in higher quality for those blocks, and (ii) in target blocks where the importance map takes a low value, the block quantization parameter is made larger than the frame quantization parameter QP_frame, resulting in lower quality for those blocks.
[Aspect 19]
The encoder according to aspect 18, wherein the spatial information is provided by a rule-based spatial complexity map (SCM), the first step of which is to determine which target blocks in the frame have a variance greater than the average block variance var_frame in the frame,
a quantization parameter (QP) value higher than the frame quantization parameter QP_frame is assigned to a block having a variance larger than the average block variance var_frame, and the assigned block quantization parameter QP_block is linearly scaled between the frame quantization parameter QP_frame and a quantization parameter upper limit QP_max according to how much larger the block variance var_block is than the average block variance var_frame.
[Aspect 20]
The encoder according to aspect 18, wherein the temporal information is provided by:
a temporal contrast sensitivity function (TCSF) indicating which target blocks are temporally most noticeable to a human observer; and
a true motion vector map (TMVM) indicating which target blocks correspond to foreground data,
wherein the TCSF is valid only for target blocks identified as foreground data.
[Aspect 21]
The encoder according to aspect 19, wherein, for a block with a large variance, the assigned block quantization parameter QP_block is further refined by the TCSF and the TMVM such that QP_block is increased by 2 when the TMVM identifies the target block as foreground data and the log contrast sensitivity value of the TCSF for that block is less than 0.5.
[Aspect 22]
The encoder according to aspect 19, wherein the SCM further includes brightness masking in which the assigned block quantization parameter QP_block of an extremely bright (brightness above 170) or extremely dark (brightness below 60) target block is readjusted to QP_max.
[Aspect 23]
The encoder according to aspect 19, wherein the SCM further comprises a dynamic determination of the quantization parameter upper limit QP_max based on a quality level of the encoded video,
in this dynamic determination, the quality is measured using the average structural similarity (SSIM) calculated for the target blocks in intra (I) frames together with the average block variance var_frame of those frames, and
when the measured quality is low, the value of the quantization parameter upper limit QP_max is reduced toward the frame quantization parameter QP_frame.
[Aspect 24]
The encoder according to aspect 19, wherein blocks with extremely small variance are assigned a determined low quantization parameter (QP) value as QP_block to ensure high-quality coding in those regions, the value of QP_block being lower (and the quality correspondingly higher) the smaller the block variance is.
[Aspect 25]
The encoder according to aspect 24, wherein the low quantization parameter (QP) value QP_block for a block with extremely small variance is first determined for I frames and is then determined for P frames and B frames using the ipratio and pbratio parameters.
[Aspect 26]
The system according to aspect 19, wherein a block whose variance is small but not extremely small is examined to determine whether quality enhancement is required for the block, as follows:
a first estimate of the block quantization parameter QP_block is calculated by averaging the quantization parameter (QP) values of the already encoded neighboring blocks to the left, upper left, right, and upper right of the current block;
an estimate SSIM_est of the SSIM of the current block is calculated from the SSIM values of the already encoded neighboring blocks to the left, upper left, right, and upper right of the current block; and
when SSIM_est is less than 0.9, the value of QP_block is decreased by 2.
[Aspect 27]
The system according to aspect 26, wherein the quality enhancement is applied only to blocks identified by the TMVM as foreground data and having a TCSF log contrast sensitivity value greater than 0.8.
[Aspect 28]
The system according to aspect 20, wherein the temporal frequency of the TCSF is calculated by using the SSIM in the color space domain between the target block and its reference block to approximate the wavelength, and by using the magnitude of the motion vector and the frame rate to approximate the velocity.
[Aspect 29]
The system according to aspect 20, wherein the TCSF is calculated over a plurality of frames such that the TCSF for the current frame is a weighted average of the TCSF maps of recent frames, with more recent frames given greater weight.
[Aspect 30]
The system according to aspect 20, wherein the TMVM is set to 1 only for foreground data.
[Aspect 31]
The system according to aspect 30, wherein the foreground data is identified by calculating a difference between an encoder motion vector for a given target block and a global motion vector for that block, a block with a sufficiently large difference being judged to be foreground data.
[Aspect 32]
The system according to aspect 20, wherein, for a data block identified as foreground data, a difference motion vector is obtained by subtracting the encoder motion vector from the global motion vector, and the magnitude of the difference motion vector is used to calculate the temporal frequency of the TCSF.
[Aspect 33]
The system according to aspect 20, wherein the TCSF is calculated from motion vectors from the encoder.
[Aspect 34]
A system according to aspect 18, wherein when the importance map is set by the temporal information and the spatial information, the importance map is an integrated importance map.

Claims

A method of encoding a plurality of video frames, the video frames having target blocks that do not overlap one another, the method comprising:
encoding the plurality of video frames using an importance map such that the importance map affects the encoding quality of each target block to be encoded in each video frame by adjusting the quantization,
wherein the importance map is constructed by:
setting the importance map using temporal information and spatial information; and
causing the importance map to indicate, by calculation, which parts of a video frame of the plurality of video frames are most noticeable to human perception, such that (i) in blocks where the importance map takes a high value, the block quantization parameter (QP) is made smaller than the frame quantization parameter QP_frame, resulting in higher quality for those blocks, and (ii) in target blocks where the importance map takes a low value, the block quantization parameter is made larger than the frame quantization parameter QP_frame, resulting in lower quality for those blocks;
wherein the temporal information is provided by:
a temporal contrast sensitivity function (TCSF) that measures the response of the human visual system to a temporally periodic stimulus and indicates which target blocks are temporally most noticeable to a human observer; and
a true motion vector map (TMVM) that indicates which target blocks correspond to foreground data,
wherein the TCSF is valid only for target blocks identified as foreground data, and
wherein the temporal frequency of the TCSF is calculated by dividing an approximation of the velocity, obtained using the magnitude of the motion vector and the frame rate, by an approximation of the wavelength, obtained using the average structural similarity (SSIM) in the color space domain between the target block and its reference block.
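
As a non-limiting illustration, the division recited above can be sketched as follows. The function name `temporal_frequency` and its units (pixels per second for velocity, pixels per cycle for wavelength) are assumptions made for this sketch. The text does not specify how the wavelength approximation is derived from SSIM, so the wavelength is taken here as an already computed input.

```python
def temporal_frequency(velocity: float, wavelength: float) -> float:
    """Return temporal frequency as velocity divided by wavelength.

    velocity   -- approximate speed of the block content (assumed pixels/second)
    wavelength -- approximate spatial wavelength (assumed pixels/cycle); how it
                  is obtained from SSIM is not specified in the text
    """
    if wavelength <= 0:
        raise ValueError("wavelength must be positive")
    return velocity / wavelength


if __name__ == "__main__":
    # A block moving at 150 pixels/second with a 10-pixel wavelength.
    print(temporal_frequency(150.0, 10.0))
```

Under these assumed units the result is in cycles per second, which is the quantity a TCSF lookup would take as input.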

The method of claim 1, wherein the spatial information is provided by a rule-based spatial complexity map (SCM), the first step of which is to determine which target blocks in the frame have a variance greater than the average block variance var_frame in the frame,
a quantization parameter (QP) value higher than the frame quantization parameter QP_frame is assigned to a block having a variance larger than the average block variance var_frame, and the assigned block quantization parameter QP_block is linearly scaled using the block variance var_block of the block, the average block variance var_frame, the quantization parameter upper limit QP_max, and the frame quantization parameter QP_frame, such that QP_block = ((var_block - var_frame) / var_block) × (QP_max - QP_frame) + QP_frame.
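
As a non-limiting illustration, the scaling formula above can be sketched as follows. The function name `scm_qp_block` is hypothetical, and the choice to reject blocks at or below the average variance (rather than handle them here) is an assumption of this sketch, since those blocks are addressed by other rules.

```python
def scm_qp_block(var_block: float, var_frame: float,
                 qp_frame: float, qp_max: float) -> float:
    """QP_block = ((var_block - var_frame) / var_block) * (QP_max - QP_frame) + QP_frame.

    Applies only to blocks whose variance exceeds the frame's average
    block variance; other blocks are outside the scope of this sketch.
    """
    if var_block <= var_frame:
        raise ValueError("rule applies only when var_block > var_frame")
    scale = (var_block - var_frame) / var_block  # in (0, 1)
    return scale * (qp_max - qp_frame) + qp_frame


if __name__ == "__main__":
    # Block variance twice the frame average: halfway between QP_frame and QP_max.
    print(scm_qp_block(var_block=200.0, var_frame=100.0, qp_frame=30.0, qp_max=40.0))
```

The result stays strictly between QP_frame and QP_max and approaches QP_max as the block variance grows. Rounding to an integer QP is left to the caller.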

The method according to claim 2, wherein, for a block having a variance larger than the average block variance var_frame in the frame, the assigned block quantization parameter QP_block is further refined by the TCSF and the TMVM such that QP_block is increased by 2 when the TMVM identifies the target block as foreground data and the log contrast sensitivity value of the TCSF for that block is less than 0.5.
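
As a non-limiting illustration, this refinement can be sketched as follows. The function and parameter names are hypothetical. The thresholds (0.5) and the increment (2) are taken from the text.

```python
def refine_high_variance_qp(qp_block: float, is_foreground: bool,
                            log_contrast_sensitivity: float) -> float:
    """Raise QP_block by 2 for foreground blocks with low temporal sensitivity.

    is_foreground            -- TMVM value for the block (True if foreground)
    log_contrast_sensitivity -- TCSF log contrast sensitivity for the block
    """
    if is_foreground and log_contrast_sensitivity < 0.5:
        return qp_block + 2
    return qp_block


if __name__ == "__main__":
    print(refine_high_variance_qp(35, True, 0.3))   # foreground, low sensitivity
    print(refine_high_variance_qp(35, False, 0.3))  # background block, unchanged
```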

The method according to claim 2, wherein the SCM further comprises brightness masking in which the assigned block quantization parameter QP_block of a target block whose average block brightness is above 170 or below 60 is readjusted to QP_max.
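
As a non-limiting illustration, the brightness masking rule can be sketched as follows. The function name is hypothetical, and the sketch assumes 8-bit brightness values (0 to 255), which the thresholds of 170 and 60 suggest but the text does not state.

```python
def apply_brightness_masking(qp_block: float, mean_brightness: float,
                             qp_max: float) -> float:
    """Set QP_block to QP_max for very bright or very dark blocks.

    mean_brightness -- average brightness of the block (assumed 8-bit scale)
    """
    if mean_brightness > 170 or mean_brightness < 60:
        return qp_max
    return qp_block


if __name__ == "__main__":
    print(apply_brightness_masking(33, 200, 40))  # very bright block
    print(apply_brightness_masking(33, 120, 40))  # mid-range block, unchanged
```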

The method according to claim 2, wherein the SCM further comprises a dynamic determination of the quantization parameter upper limit QP_max based on a quality level of the encoded video,
in this dynamic determination, the quality is measured using the average structural similarity (SSIM) calculated for the target blocks in intra (I) frames together with the average block variance var_frame of those I frames, and
the lower the measured quality, the more the value of the quantization parameter upper limit QP_max is reduced toward the frame quantization parameter QP_frame.

The method according to claim 2, wherein blocks having a variance of less than 60 are assigned a determined low quantization parameter (QP) value as QP_block to ensure high-quality coding in those regions, the value of QP_block being lower the smaller the block variance is.

The method of claim 6, wherein QP_block, the quantization parameter (QP) value for blocks having a variance of less than 60, is first determined for blocks of I frames and is then determined for the corresponding blocks in P frames and B frames using an ipratio parameter, which is the ratio of the frame QP value of the I frame to the frame QP value of the P frame, and a pbratio parameter, which is the ratio of the QP value of the P frame to the QP value of the B frame.

The method according to claim 6, wherein, for a block having a variance of 60 or more and no greater than the average block variance:
a first estimate of the block quantization parameter QP_block is calculated by averaging the quantization parameter (QP) values of the already coded neighboring blocks to the left, upper left, right, and upper right of the current block;
an estimate SSIM_est of the SSIM of the current block is calculated from the SSIM values of the previously coded neighboring blocks to the left, upper left, right, and upper right of the current block; and
when SSIM_est is less than 0.9, it is determined that quality enhancement is necessary, and quality enhancement processing is performed such that the calculated value of QP_block is decreased by 2.

The method of claim 8, wherein the quality enhancement is applied only to blocks identified by the TMVM as foreground data and having a TCSF log contrast sensitivity value greater than 0.8.
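
As a non-limiting illustration, the neighbor-based estimate and the gated quality enhancement can be sketched together as follows. The function name is hypothetical. The text says only that SSIM_est is calculated "from" the neighboring SSIM values, so using a plain average for SSIM_est is an assumption of this sketch.

```python
from statistics import mean


def estimate_low_variance_qp(neighbor_qps, neighbor_ssims,
                             is_foreground, log_contrast_sensitivity):
    """Return (QP_block, SSIM_est) for a moderately low-variance block.

    neighbor_qps   -- QP values of the already coded neighboring blocks
    neighbor_ssims -- SSIM values of the same neighboring blocks
    """
    qp_block = mean(neighbor_qps)    # first estimate: average of neighbor QPs
    ssim_est = mean(neighbor_ssims)  # ASSUMPTION: plain average of neighbor SSIMs

    needs_enhancement = ssim_est < 0.9
    # Enhancement is restricted to perceptually important foreground blocks.
    eligible = is_foreground and log_contrast_sensitivity > 0.8
    if needs_enhancement and eligible:
        qp_block -= 2
    return qp_block, ssim_est


if __name__ == "__main__":
    qp, ssim = estimate_low_variance_qp([30, 32, 34, 32],
                                        [0.85, 0.90, 0.80, 0.85],
                                        is_foreground=True,
                                        log_contrast_sensitivity=0.9)
    print(qp, ssim)
```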

The method of claim 1, wherein the TCSF is calculated over a plurality of frames such that the TCSF for the current frame is a weighted average of the TCSF maps of previously encoded frames, with temporally closer past frames given greater weight.
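
As a non-limiting illustration, one way to form such a weighted average is sketched below. The text does not specify the weights, so the geometric decay used here (parameter `decay`) is purely an assumption, as are the function name and the representation of each TCSF map as a flat list of per-block values.

```python
def temporal_average_tcsf(tcsf_maps, decay=0.5):
    """Weighted average of per-frame TCSF maps, most recent frame LAST.

    tcsf_maps -- list of maps; each map is a list of per-block TCSF values
    decay     -- ASSUMED weight ratio between consecutive frames (0 < decay <= 1);
                 a frame that is `age` frames old gets weight decay**age
    """
    ages = range(len(tcsf_maps) - 1, -1, -1)  # oldest frame has the largest age
    weights = [decay ** age for age in ages]
    total = sum(weights)
    n_blocks = len(tcsf_maps[0])
    return [
        sum(w * m[i] for w, m in zip(weights, tcsf_maps)) / total
        for i in range(n_blocks)
    ]


if __name__ == "__main__":
    older = [0.0, 1.0]
    newer = [1.0, 1.0]
    print(temporal_average_tcsf([older, newer]))
```

With `decay` below 1 the most recent map dominates, which matches the requirement that temporally closer frames receive greater weight.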

The method of claim 1, wherein the TMVM is set to 1 only for foreground data.

The method according to claim 11, wherein the foreground data is identified by calculating a difference between an encoder motion vector for a given target block and a global motion vector for that block, a block whose difference is larger than a specific value being determined to be foreground data.
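
As a non-limiting illustration, the foreground test and the resulting TMVM can be sketched as follows. The function names are hypothetical, the threshold is left as a parameter because the text gives no value, and measuring the "difference" as the Euclidean magnitude of the vector difference is an assumption of this sketch.

```python
import math


def is_foreground(encoder_mv, global_mv, threshold):
    """True if the block's motion departs from global motion by more than threshold.

    encoder_mv, global_mv -- (x, y) motion vectors for the same block
    threshold             -- the "specific value"; not given in the text
    """
    dx = global_mv[0] - encoder_mv[0]
    dy = global_mv[1] - encoder_mv[1]
    return math.hypot(dx, dy) > threshold  # ASSUMPTION: Euclidean magnitude


def build_tmvm(encoder_mvs, global_mvs, threshold):
    """TMVM: 1 for blocks judged to be foreground, 0 otherwise."""
    return [
        1 if is_foreground(e, g, threshold) else 0
        for e, g in zip(encoder_mvs, global_mvs)
    ]


if __name__ == "__main__":
    enc = [(5.0, 0.0), (1.0, 0.0)]
    glob = [(1.0, 0.0), (1.0, 0.0)]
    print(build_tmvm(enc, glob, threshold=2.0))
```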

The method according to claim 12, wherein, for a data block identified as foreground data, a difference motion vector is obtained by subtracting the encoder motion vector from the global motion vector, and the velocity v used to calculate the temporal frequency of the TCSF is calculated as v = |DMV| × frame rate / N, using the magnitude |DMV| of the difference motion vector, the number N of frames between the reference frame pointed to by the difference motion vector and the current frame, and the frame rate.
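
As a non-limiting illustration, the velocity calculation v = |DMV| × frame rate / N can be sketched as follows. The function name and the representation of motion vectors as (x, y) tuples in pixels are assumptions of this sketch.

```python
import math


def dmv_velocity(encoder_mv, global_mv, frame_rate, n_frames):
    """Velocity v = |DMV| * frame_rate / N for a foreground block.

    encoder_mv, global_mv -- (x, y) motion vectors (assumed in pixels)
    frame_rate            -- frames per second
    n_frames              -- N, frames between the reference and current frame
    """
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    # Difference motion vector: global motion minus encoder motion.
    dmv = (global_mv[0] - encoder_mv[0], global_mv[1] - encoder_mv[1])
    return math.hypot(*dmv) * frame_rate / n_frames


if __name__ == "__main__":
    # |DMV| = 5 pixels, 30 frames per second, reference is the previous frame.
    print(dmv_velocity((0.0, 0.0), (3.0, 4.0), frame_rate=30.0, n_frames=1))
```

Dividing by N normalizes the displacement to a per-frame value, so a motion vector that points two frames back is not counted as twice the speed.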

The method according to claim 1, wherein the TCSF uses a motion vector MV from the encoder to calculate the velocity v, used to compute the temporal frequency that is the input to the TCSF, as v = |MV| × frame rate / N (where the frame rate is the number of frames per second at which the video was generated, and N is the number of frames between the reference frame pointed to by the motion vector and the current frame).

The method according to claim 2, wherein, when the importance map is set using the temporal information and the spatial information, the importance map is an integrated importance map (an importance map containing information from both the TCSF and the SCM).

A system for encoding video data, comprising:
a codec that encodes a plurality of video frames using an importance map, the video frames having target blocks that do not overlap one another,
wherein the importance map is configured to affect the coding quality of each target block to be coded in each video frame by adjusting the quantization, and
wherein the importance map is constructed by:
setting the importance map using temporal information and spatial information; and
causing the importance map to indicate, by calculation, which parts of a video frame of the plurality of video frames are most noticeable to human perception, such that (i) in blocks where the importance map takes a high value, the block quantization parameter (QP) is made smaller than the frame quantization parameter QP_frame, resulting in higher quality for those blocks, and (ii) in target blocks where the importance map takes a low value, the block quantization parameter is made larger than the frame quantization parameter QP_frame, resulting in lower quality for those blocks;
wherein the temporal information is provided by:
a temporal contrast sensitivity function (TCSF) that measures the response of the human visual system to a temporally periodic stimulus and indicates which target blocks are temporally most noticeable to a human observer; and
a true motion vector map (TMVM) that indicates which target blocks correspond to foreground data,
wherein the TCSF is valid only for target blocks identified as foreground data, and
wherein the temporal frequency of the TCSF is calculated by dividing an approximation of the velocity, obtained using the magnitude of the motion vector and the frame rate, by an approximation of the wavelength, obtained using the average structural similarity (SSIM) in the color space domain between the target block and its reference block.

The system of claim 16, wherein the spatial information is provided by a rule-based spatial complexity map (SCM), the first step of which is to determine which target blocks in the frame have a variance greater than the average block variance var_frame in the frame,
a quantization parameter (QP) value higher than the frame quantization parameter QP_frame is assigned to a block having a variance larger than the average block variance var_frame, and the assigned block quantization parameter QP_block is linearly scaled using the block variance var_block of the block, the average block variance var_frame, the quantization parameter upper limit QP_max, and the frame quantization parameter QP_frame, such that QP_block = ((var_block - var_frame) / var_block) × (QP_max - QP_frame) + QP_frame.

The system according to claim 17, wherein, for a block having a variance larger than the average block variance var_frame in the frame, the assigned block quantization parameter QP_block is further refined by the TCSF and the TMVM such that QP_block is increased by 2 when the TMVM identifies the target block as foreground data and the log contrast sensitivity value of the TCSF for that block is less than 0.5.

The system according to claim 17, wherein the SCM further comprises brightness masking in which the assigned block quantization parameter QP_block of a target block whose average block brightness is above 170 or below 60 is readjusted to QP_max.

The system of claim 17, wherein the SCM further comprises a dynamic determination of the quantization parameter upper limit QP_max based on a quality level of the encoded video,
in this dynamic determination, the quality is measured using the average structural similarity (SSIM) calculated for the target blocks in intra (I) frames together with the average block variance var_frame of those I frames, and
the lower the measured quality, the more the value of the quantization parameter upper limit QP_max is reduced toward the frame quantization parameter QP_frame.

The system of claim 17, wherein blocks having a variance of less than 60 are assigned a determined low quantization parameter (QP) value as QP_block to ensure high-quality coding in those regions, the value of QP_block being lower the smaller the block variance is.

The system of claim 21, wherein QP_block, the low quantization parameter (QP) value for blocks having a variance of less than 60, is first determined for I frames and is then determined for the corresponding blocks in P frames and B frames using an ipratio parameter, which is the ratio of the frame QP value of the I frame to the frame QP value of the P frame, and a pbratio parameter, which is the ratio of the QP value of the P frame to the QP value of the B frame.

The system according to claim 17, wherein, for a block having a variance of 60 or more and no greater than the average block variance:
a first estimate of the block quantization parameter QP_block is calculated by averaging the quantization parameter (QP) values of the already coded neighboring blocks to the left, upper left, right, and upper right of the current block;
an estimate SSIM_est of the SSIM of the current block is calculated from the SSIM values of the previously coded neighboring blocks to the left, upper left, right, and upper right of the current block; and
when SSIM_est is less than 0.9, it is determined that quality enhancement is necessary, and quality enhancement processing is performed such that the calculated value of QP_block is decreased by 2.

The system of claim 23, wherein the quality enhancement is applied only to blocks identified by the TMVM as foreground data and having a TCSF log contrast sensitivity value greater than 0.8.

The system of claim 16, wherein the TCSF is calculated over a plurality of frames such that the TCSF for the current frame is a weighted average of the TCSF maps of previously encoded frames, with temporally closer past frames given greater weight.

The system according to claim 16, wherein the TMVM is set to 1 only for foreground data.

The system according to claim 26, wherein the foreground data is identified by calculating a difference between an encoder motion vector for a given target block and a global motion vector for that block, a block whose difference is larger than a specific value being determined to be foreground data.

The system of claim 16, wherein, for a data block identified as foreground data, a difference motion vector is obtained by subtracting the encoder motion vector for the data block from the global motion vector for the data block, and the velocity v used to calculate the temporal frequency of the TCSF is calculated as v = |DMV| × frame rate / N, using the magnitude |DMV| of the difference motion vector, the number N of frames between the reference frame pointed to by the difference motion vector and the current frame, and the frame rate.

The system according to claim 16, wherein the TCSF uses a motion vector MV from the encoder to calculate the velocity v, used to compute the temporal frequency that is the input to the TCSF, as v = |MV| × frame rate / N (where the frame rate is the number of frames per second at which the video was generated, and N is the number of frames between the reference frame pointed to by the motion vector and the current frame).

The system according to claim 17, wherein, when the importance map is set using the temporal information and the spatial information, the importance map is an integrated importance map (an importance map containing information from both the TCSF and the SCM).