JP2017532858A

JP2017532858A - Perceptual optimization for model-based video coding

Info

Publication number: JP2017532858A
Application number: JP2017513750A
Authority: JP
Inventors: リー・ニゲル; パーク・サンソク; トゥン・ミョー; コッケ・デーン・ピー; リー・ジェユン; ウィード・クリストファー
Original assignee: Euclid Discoveries LLC
Current assignee: Euclid Discoveries LLC
Priority date: 2014-09-11
Filing date: 2015-09-03
Publication date: 2017-11-02
Anticipated expiration: 2035-09-03
Also published as: JP6698077B2; WO2016040116A1; EP3175618A1; CN106688232A; CA2960617A1

Abstract

[Problem] To provide a computer-based method of processing video data that applies importance maps to video compression so as to improve the quality of video encoding. [Solution] Perceptual statistics are used to compute importance maps that indicate which regions of a video frame are important to the human visual system. The importance maps are applied to the video encoding process to improve the quality of the encoded bitstream. A temporal contrast sensitivity function (TCSF) is computed from the encoder's motion vectors. Motion vector quality metrics are used to construct a true motion vector map (TMVM), which is used to refine the TCSF. A spatial complexity map (SCM) is computed and combined with the TCSF to obtain a unified importance map. The importance maps are used to improve encoding. [Selected drawing] FIG. 8B

Description

Related applications

This application claims the benefit of U.S. Provisional Patent Application No. 62/158,523, filed May 7, 2015, and U.S. Provisional Patent Application No. 62/078,181, filed November 11, 2014. This application is also a continuation-in-part (CIP) of U.S. Patent Application No. 14/532,947, filed November 4, 2014, which claims the benefit of U.S. Provisional Patent Application No. 61/950,784, filed March 10, 2014, and U.S. Provisional Patent Application No. 62/049,342, filed September 11, 2014. The entire teachings of the above-referenced applications are incorporated herein by reference.

Video compression can be considered the process of representing digital video data in a form that uses fewer bits when stored or transmitted. Video encoding can achieve compression by exploiting spatial, temporal, or color-space redundancy in the video data. Typically, a video compression process divides the video data into portions, such as groups of frames and groups of pels, identifies areas of redundancy within the video, and represents those redundant areas with fewer bits than the original video data requires. Greater compression can be achieved by exploiting these redundancies in the data. An encoder can be used to transform the video data into an encoded format, and a decoder can be used to transform the encoded video back into a form comparable to the original video data. An implementation of an encoder/decoder is referred to as a codec.

For encoding, a standard encoder divides a given video frame into a number of non-overlapping coding units, or macroblocks (rectangular regions of contiguous pels). Macroblocks (referred to herein more generally as "input blocks" or "data blocks") are typically processed in a traversal order of left to right and top to bottom in the video frame. Compression can be achieved by predicting and encoding input blocks using previously coded data. The process of encoding an input block using spatially neighboring samples of previously coded blocks within the same frame is referred to as intra prediction. Intra prediction attempts to exploit spatial redundancy in the data. Encoding an input block using similar regions from previously coded frames, found by a motion estimation process, is referred to as inter prediction. Inter prediction attempts to exploit temporal redundancy in the data. The motion estimation process can generate a motion vector that specifies, for example, the location of a matching region in a reference frame relative to the input block being encoded. Most motion estimation processes consist of two main steps: initial motion estimation, which provides a first, coarse estimate of the motion vector (and the corresponding temporal prediction) for a given input block, and fine motion estimation, which performs a local search in the neighborhood of the initial estimate to determine a more accurate estimate of the motion vector (and the corresponding prediction) for that input block.

The encoder can generate a residual by measuring the difference between the data to be encoded and the prediction. The residual provides the difference between the predicted block and the original input block. The predictions, motion vectors (for inter prediction), residuals, and related data can be combined with other processes such as spatial transforms, quantization, entropy encoding, and loop filtering to create an efficient encoding of the video data. The residual, after quantization and transformation, is processed and added back to the prediction, assembled into a decoded frame, and stored in a frame store. The details of such video encoding techniques are well known to those skilled in the art.

MPEG-2 (H.262) and H.264 (MPEG-4 Part 10 Advanced Video Coding (AVC)), hereafter referred to as MPEG-2 and H.264, respectively, are two codec standards for video compression that achieve high-quality video representation at relatively low bitrates. The basic coding units for MPEG-2 and H.264 are 16×16 macroblocks. H.264 is a recent, widely adopted video compression standard and is generally considered to be twice as efficient as MPEG-2 at compressing video data.

The basic MPEG standard defines three types of frames (or pictures), based on how the input blocks in the frame are encoded. An I-frame (intra-coded picture) is encoded using only data present in the frame itself and thus consists only of intra-predicted blocks. A P-frame (predicted picture) is encoded via forward prediction, using data from previously decoded I-frames or P-frames, also known as reference frames. P-frames can contain either intra blocks or (forward-)predicted blocks. A B-frame (bi-predicted picture) is encoded via bidirectional prediction, using data from both previous and subsequent frames. B-frames can contain intra, (forward-)predicted, or bi-predicted blocks.

A particular set of reference frames is termed a Group of Pictures (GOP). The GOP contains only the decoded pels within each reference frame and does not include information about how the input blocks or frames themselves were originally encoded (I-frame, B-frame, or P-frame). Older video compression standards such as MPEG-2 use one reference frame (in the past) to predict P-frames and two reference frames (one preceding and one following) to predict B-frames. By contrast, more recent compression standards such as H.264 and HEVC (High Efficiency Video Coding) allow the use of multiple reference frames for P-frame and B-frame prediction. While reference frames are typically temporally adjacent to the current frame, these standards also allow reference frames that are not temporally adjacent.

Conventional inter prediction is based on block-based motion estimation and compensation (BBMEC). The BBMEC process searches for the best match between the target block (the current input block being encoded) and same-sized regions within previously decoded reference frames. When such a match is found, the encoder may transmit a motion vector that serves as a pointer to the position of the best match in the reference frame. For computational reasons, however, the BBMEC search process is limited both temporally, in terms of the reference frames that can be searched, and spatially, in terms of the neighboring regions that can be searched. This means that the "best possible" match is not always found, especially with rapidly changing data.

The simplest form of the BBMEC process initializes motion estimation with a (0, 0) motion vector, meaning that the initial estimate of the target block is the co-located block in the reference frame. Fine motion estimation is then performed by searching in a local neighborhood of that region for the region that best matches the target block (i.e., has the lowest error with respect to it). The local search may be performed by exhaustive query of the local neighborhood or by any of several "fast search" methods, such as diamond search or hexagonal search.
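The simplest BBMEC procedure described above can be sketched as follows. This is a minimal illustration, not the method of any particular codec: it assumes frames are plain 2-D lists of luma values, uses the sum of absolute differences (SAD) as the error measure, and performs the exhaustive variant of the local search. The names `sad`, `crop`, and `full_search` are hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def crop(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def full_search(target, ref, top, left, size, radius):
    """Exhaustive local search around the (0, 0) initial estimate.

    (top, left) is the target block's position, so the co-located block in
    `ref` is the starting point. Returns ((dy, dx), error) for the best match
    within +/- radius pels.
    """
    height, width = len(ref), len(ref[0])
    best_mv = (0, 0)
    best_err = sad(target, crop(ref, top, left, size))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            # Skip candidate regions that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            err = sad(target, crop(ref, y, x, size))
            if err < best_err:
                best_mv, best_err = (dy, dx), err
    return best_mv, best_err
```

A fast search such as diamond search would visit only a subset of these candidate positions instead of all of them.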

One improvement to the BBMEC process, present in standard codecs since later versions of MPEG-2, is the enhanced predictive zonal search (EPZS) method (Non-Patent Document 1: Tourapis, "Enhanced predictive zonal search for single and multiple frame motion estimation"). For the initial estimate of the target block, the EPZS method considers a set of motion vector candidates based on the motion vectors of already-encoded neighboring blocks and the motion vectors of the co-located block (and its neighbors) in the previous reference frame. The EPZS method assumes that the motion vector field of a video has some spatial and temporal redundancy, so that it is reasonable to initialize motion estimation for a target block with the motion vectors of neighboring blocks or of nearby blocks in already-encoded frames. Once the set of initial estimates has been gathered, the EPZS method narrows the set via approximate rate-distortion analysis, after which fine motion estimation is performed.

For any given target block, the encoder may generate multiple candidate inter predictions to choose from. The predictions may result from multiple prediction processes (e.g., BBMEC, EPZS, or model-based schemes). The predictions may also differ based on the subpartitioning of the target block, in which different motion vectors are associated with different subpartitions of the target block and each motion vector points to a subpartition-sized region in a reference frame. The predictions may further differ based on the reference frames to which the motion vectors point since, as noted above, recent compression standards allow the use of multiple reference frames. The best prediction for a given target block is usually selected through rate-distortion optimization, in which the best prediction is the one that minimizes the rate-distortion metric D + λR, where the distortion D is the error between the target block and the prediction, the rate R quantifies the cost (in bits) of encoding the prediction, and λ is a scalar weighting factor.
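The selection rule above amounts to taking the minimum of D + λR over the candidate predictions. The sketch below illustrates this under assumed inputs: each candidate is a hypothetical `(name, distortion, rate_bits)` tuple, and how D and R are actually measured is left to the encoder.

```python
def select_prediction(candidates, lam):
    """Return the candidate minimizing the rate-distortion metric D + lam * R.

    candidates: iterable of (name, distortion, rate_bits) tuples (illustrative).
    lam: scalar weighting factor (lambda in the text).
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])
```

A larger λ penalizes bits more heavily, so cheaper predictions win; a smaller λ favors predictions with lower error.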

Tourapis, A., 2002, "Enhanced predictive zonal search for single and multiple frame motion estimation," Proc. SPIE 4671, Visual Communications and Image Processing, pp. 1069-1078

Historically, many model-based compression schemes have been proposed to avoid the limitations of BBMEC prediction. These model-based compression schemes (of which the MPEG-4 Part 2 standard is probably the best known) rely on detecting and tracking objects or features in the video (defined generally as "components of interest") and on a method for encoding those features/objects separately from the rest of the video frame. Because feature/object detection and tracking occur independently of the spatial search in standard motion estimation processes, feature/object tracks can give rise to a different set of predictions than those obtained through standard motion estimation.

However, such feature/object-based model-based compression schemes face problems that stem from dividing video frames into object and non-object (or feature and non-feature) regions. First, because objects can vary widely in size, their shapes must be encoded in addition to their texture (color content). Second, tracking multiple moving objects can be difficult, and inaccurate tracking causes incorrect segmentation, usually resulting in poor compression performance. Third, not all video content is composed of objects or features, so an alternative encoding scheme is needed when objects/features are not present.

Co-pending U.S. Provisional Patent Application No. 61/950,784, filed November 4, 2014 (referred to herein as "the '784 application"), presents a model-based compression scheme that avoids the segmentation problem noted above. The continuous block tracker (CBT) of the '784 application does not detect objects or features, eliminating the need to segment objects and features from the non-object/non-feature background. Instead, the CBT treats all input blocks ("macroblocks") in the video frame as if they were regions of interest by combining frame-to-frame motion estimates into continuous tracks. In so doing, the CBT models motion in the video, obtaining the benefits of higher-level modeling of the data to improve inter prediction while avoiding the segmentation problem.

Other model-based compression approaches model the response of the human visual system (HVS) to the content of the video data as importance maps that indicate which parts of a video frame are most noticeable to human perception. An importance map takes a value for each input block or data block in a video frame, and the importance map value for any given block may change from frame to frame throughout the video. Importance maps are generally defined such that higher values indicate more important data blocks.

One type of importance map is the temporal contrast sensitivity function (TCSF) (de Lange, H., 1954, "Relationship between critical flicker frequency and a set of low frequency characteristics of the eye," J. Opt. Soc. Am., 44:380-389). The TCSF measures the response of the HVS to temporally periodic stimuli and reveals that certain temporal characteristics in the data are noticeable to human observers. These temporal characteristics are related to motion in the data, and the TCSF predicts that the most noticeable type of motion in the data is "moderate" motion that corresponds to neither very high nor very low temporal frequencies.

It is important to note that the TCSF requires accurate measurement of the velocities of moving content in the video in order to generate accurate temporal contrast values. These velocities can be approximated by computing optical flow, which describes the net (apparent) motion of video content due to camera motion and/or object motion. However, most standard video encoders employ motion estimation processes that optimize compression efficiency rather than computing optical flow accurately.

Another type of importance map is based on spatial contrast sensitivity and measures the response of the HVS to spatial characteristics such as brightness, edges, spatial frequencies, and color. The spatial contrast sensitivity function (SCSF) (see, e.g., Barten, P., 1999, Contrast Sensitivity of the Human Eye and Its Effects on Image Quality, SPIE Press), also known simply as the contrast sensitivity function (CSF), measures the spatial contrast that is noticeable to the HVS. The SCSF has been applied successfully in the JPEG 2000 image compression standard to reduce image compression artifacts. Objects and features are also typically detected with the aid of spatial contrast techniques (e.g., the presence of edges as indicated by spatial frequency gradients). While spatial contrast sensitivity has been studied and exploited in image compression (e.g., the JPEG 2000 codec), and many video compression processes based on object and feature detection have been proposed, temporal contrast sensitivity as represented by the TCSF has not previously been applied to video compression.

Some disclosed inventive embodiments apply importance maps to video compression to improve the quality of video encoding. In one example embodiment, temporal frequency is computed within a standard video encoding processing stream by using structural similarity (SSIM) in the color space domain to approximate wavelength and the encoder's motion vectors to approximate velocity. The temporal frequency then serves as an input to the temporal contrast sensitivity function (TCSF). The TCSF can be computed for every data block, producing a temporal importance map that indicates which regions of the video frame are most noticeable to human observers.
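The relationship between velocity, wavelength, and temporal frequency can be illustrated with the standard identity frequency = velocity / wavelength. The sketch below is only an illustration under stated assumptions: the motion vector is taken to be a displacement in pels per frame relative to an adjacent reference frame, the wavelength (in pels per cycle) is supplied directly rather than approximated from SSIM, and the function name and parameters are hypothetical.

```python
import math


def temporal_frequency(mv, wavelength_px, fps):
    """Approximate temporal frequency in cycles per second.

    mv: (dx, dy) encoder motion vector, assumed to be in pels per frame.
    wavelength_px: approximate spatial wavelength in pels per cycle
        (assumed given; the text approximates it via SSIM).
    fps: frame rate in frames per second.
    """
    speed_px_per_s = math.hypot(mv[0], mv[1]) * fps
    return speed_px_per_s / wavelength_px
```

The resulting frequency would then be passed to whatever TCSF curve the encoder uses; that curve is not specified here.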

In a further example embodiment, information about the relative quality of the motion vectors generated by the encoder can be computed at various points in the encoding process and used to generate a true motion vector map. The true motion vector map outputs, for each target block, how reliable its motion vector is. The true motion vector map, which takes values of 0 or 1, can be used as a mask to refine the TCSF, such that the TCSF is not applied to target blocks whose motion vectors are inaccurate (i.e., target blocks for which the true motion vector map is 0).
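The masking step can be sketched as an elementwise operation over per-block maps. In this illustration both maps are assumed to be 2-D lists with one entry per target block, and `neutral` is a hypothetical placeholder value for blocks where the TCSF is not applied; the text does not specify what value such blocks should receive.

```python
def refine_tcsf(tcsf, tmvm, neutral=0.0):
    """Keep the TCSF value where the true motion vector map is 1.

    Blocks whose TMVM entry is 0 (unreliable motion vector) receive
    `neutral` instead, which is an assumed placeholder.
    """
    return [
        [t if m == 1 else neutral for t, m in zip(t_row, m_row)]
        for t_row, m_row in zip(tcsf, tmvm)
    ]
```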

In a further embodiment, a spatial complexity map (SCM) can be computed from metrics such as block variance, block luminance, and edge detection to determine the spatial contrast of a given target block relative to its neighbors. In another embodiment, information from the SCM can be combined with the TCSF to obtain a composite, unified importance map. The combination of spatial and temporal contrast information in the unified importance map effectively balances both aspects of human visual response.

In one example embodiment, the unified importance map (which includes information from both the TCSF and the SCM) is used to weight the distortion part of the standard rate-distortion metric, D + λR. This results in a modified rate-distortion optimization that is weighted toward solutions that fit the relative perceptual importance of each target block: low-distortion solutions when the importance map is close to its maximum and low-rate solutions when the importance map is close to its minimum. In an alternative embodiment, either the TCSF or the SCM may be used independently for this purpose.
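One way to picture the modified optimization is to multiply the distortion term by a per-block weight derived from the importance map. The sketch below assumes the weight is a positive number that grows with importance and that candidates are hypothetical `(name, distortion, rate_bits)` tuples; the exact mapping from importance map value to weight is not given in the text.

```python
def select_weighted(candidates, lam, weight):
    """Return the candidate minimizing weight * D + lam * R.

    weight: positive importance-derived factor for this target block
        (assumed mapping; larger means perceptually more important).
    """
    return min(candidates, key=lambda c: weight * c[1] + lam * c[2])
```

With a large weight the distortion term dominates and the low-distortion candidate is chosen; with a small weight the rate term dominates and the low-rate candidate is chosen.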

In another example embodiment, the TCSF (with true motion vector refinement) and the SCM can be used to adjust the block-level quantization of the encoder. In target blocks where the importance map takes high values, the quantization parameter is reduced relative to the frame quantization parameter, resulting in higher quality for those blocks. In target blocks where the importance map takes low values, the quantization parameter is increased relative to the frame quantization parameter, resulting in lower quality for those blocks. In an alternative embodiment, either the TCSF or the SCM may be used independently for this purpose.

While the TCSF can be computed for any encoder that incorporates inter prediction and generates motion vectors (which the TCSF uses to approximate the velocity of the content in the video), application of the TCSF to video compression is most effective within a model-based compression framework, such as the continuous block tracker (CBT) of the '784 application, that can accurately determine which motion vectors are true motion vectors. As noted above, most standard video encoders compute motion vectors that optimize compression efficiency rather than reflecting true motion. By contrast, the CBT provides both motion vectors suited to high compression efficiency and modeling information that maximizes the effect of the TCSF.

Some example inventive embodiments are structured so that the resulting bitstream is compliant with any video compression standard that employs block-based motion estimation followed by transform, quantization, and entropy encoding of residual signals. Such video compression standards include, but are not limited to, MPEG-2, H.264, and HEVC. The present invention can also be applied to non-standard video encoders that are not block-based, as long as they incorporate inter prediction and generate motion vectors.

Some example embodiments may include methods and systems of encoding video data, as well as any codecs (encoders and decoders) for implementing them. A plurality of video frames having non-overlapping target blocks may be processed by an encoder. The plurality of video frames may be encoded by the encoder using importance maps, such that the importance maps modify the quantization and thereby affect the encoding quality of each target block to be encoded in each video frame.

The importance maps may be formed using at least one of temporal information and spatial information. If both temporal and spatial information are used, the importance map is considered a unified importance map. The importance maps may be configured to indicate (identify or represent) the parts of a video frame in the plurality of video frames that are most noticeable to human perception. Specifically, in blocks where the importance maps take high values, the block quantization parameter (QP) is reduced relative to the frame quantization parameter QP_frame, resulting in higher quality for those blocks. In target blocks where the importance maps take low values, the block quantization parameter is increased relative to the frame quantization parameter QP_frame, resulting in lower quality for those blocks.

The spatial information may be provided by a rule-based spatial complexity map (SCM), in which the initial step is to determine which target blocks in the frame have a variance greater than the average block variance in the frame, var_frame. Blocks with variance greater than var_frame may be assigned a QP value higher than the frame quantization parameter QP_frame, with the block QP assignment QP_block scaled linearly between QP_frame and the quantization parameter upper limit QP_max according to how much the block variance var_block exceeds var_frame.
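The linear scaling rule can be sketched as below. The text does not state which variance corresponds to QP_max, so this illustration assumes a hypothetical `var_max` (for example, the largest block variance in the frame) as the point at which QP_block reaches QP_max; the function name and the rounding to an integer QP are also assumptions.

```python
def scm_qp(var_block, var_frame, var_max, qp_frame, qp_max):
    """Assign QP_block by linear scaling between QP_frame and QP_max.

    Blocks at or below the average variance keep QP_frame. `var_max` is an
    assumed normalization point at which the block QP reaches QP_max.
    """
    if var_block <= var_frame or var_max <= var_frame:
        return qp_frame
    # Fraction of the way from var_frame to var_max, capped at 1.
    frac = min(1.0, (var_block - var_frame) / (var_max - var_frame))
    return round(qp_frame + frac * (qp_max - qp_frame))
```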

Preferably, the temporal information may be provided by a temporal contrast sensitivity function (TCSF), which indicates which target blocks are temporally most noticeable to a human observer, and a true motion vector map (TMVM), which indicates which target blocks correspond to foreground data. The TCSF may be considered valid only for target blocks identified as foreground data.

The QP assignment QP_block of a high-variance block may be further refined by the TCSF and the TMVM: if the TMVM identifies the target block as foreground data and the TCSF log contrast sensitivity value for the block is less than 0.5, QP_block is increased by 2.

The SCM may include luminance masking, in which the block quantization parameter QP_block of target blocks that are very bright (luminance above 170) or very dark (luminance below 60) is reset to QP_max. The SCM may also include dynamic determination of QP_max based on the quality level of the encoded video: quality is measured using the average structural similarity (SSIM) of target blocks in intra (I) frames, together with the average block variance var_frame of those frames, and when the measured quality is low, the value of QP_max is lowered to be somewhat closer to QP_frame.

Very-low-variance blocks may be assigned a low QP value QP_block, determined such that the lower the block variance, the lower the value of QP_block (and the higher the quality), to ensure high-quality encoding in these regions. The low QP assignment for very-low-variance blocks may be fixed first for I-frames and then determined for P-frames and B-frames using the ipratio and pbratio parameters. Blocks that have low variance but are not considered very-low-variance are examined to determine whether quality enhancement is needed for the block: an initial estimate of the block QP, QP_block, is calculated by averaging the QP values of the already-encoded neighboring blocks to the left, upper left, right, and upper right of the current block. An estimate SSIM_est of the SSIM of the current block may be calculated from the SSIM values of the same already-encoded neighboring blocks. If SSIM_est is less than 0.9, the value of QP_block may be lowered by 2.
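The neighbor-based check for low-variance blocks might be sketched as follows. The function name is hypothetical, and taking SSIM_est as the plain mean of the neighbor SSIM values is an assumption; the description says only that SSIM_est is calculated from those values.

```python
def refine_low_variance_qp(neighbor_qps, neighbor_ssims,
                           ssim_threshold=0.9, qp_step=2):
    """Estimate QP_block for a low-variance block from encoded neighbors.

    neighbor_qps and neighbor_ssims hold the QP and SSIM values of the
    already-encoded neighboring blocks named in the description.
    """
    qp_est = sum(neighbor_qps) / len(neighbor_qps)
    # Assumption: SSIM_est is the plain mean of the neighbor SSIM values.
    ssim_est = sum(neighbor_ssims) / len(neighbor_ssims)
    if ssim_est < ssim_threshold:
        qp_est -= qp_step  # neighborhood quality looks low, so spend more bits here
    return qp_est, ssim_est
```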

In some embodiments, the quality enhancement is applied only to blocks that are identified as foreground data by the TMVM and for which the TCSF log contrast sensitivity value is greater than 0.8. The TMVM may be set to 1 only for foreground data.

In some embodiments, the temporal frequency of the TCSF is calculated by using the SSIM in color space between the target block and its reference block to approximate wavelength, and by using the motion vector magnitude and the frame rate to approximate velocity.

The TCSF may be calculated over multiple frames, such that the TCSF for the current frame is a weighted average of the TCSF maps over recent frames, with more recent frames receiving higher weighting.
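One possible form of such a weighted average is sketched below. The exponential weighting and the `decay` parameter are assumptions chosen for illustration; the description requires only that more recent frames receive higher weights.

```python
def temporal_average_tcsf(tcsf_maps, decay=0.5):
    """Blend per-block TCSF maps from recent frames into one map.

    tcsf_maps: list of per-block TCSF lists, oldest first, newest last.
    decay: hypothetical factor; each older frame weighs `decay` times the next.
    """
    n = len(tcsf_maps)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # newest frame has weight 1
    total = sum(weights)
    num_blocks = len(tcsf_maps[0])
    return [
        sum(w * m[b] for w, m in zip(weights, tcsf_maps)) / total
        for b in range(num_blocks)
    ]
```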

Foreground data may be identified by computing the difference between the encoder motion vector for a given target block and the global motion vector for that block; blocks with a sufficiently large difference are determined to be foreground data.

For data blocks identified as foreground data, the encoder motion vector may be subtracted from the global motion vector to obtain a differential motion vector, and the magnitude of the differential motion vector is used in calculating the temporal frequency of the TCSF.
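The foreground test and the differential motion vector might be computed together, as in the following sketch. The function name and the `threshold` value (in pixels) are hypothetical; the description does not specify what counts as a sufficiently large difference.

```python
import math

def classify_foreground(encoder_mv, global_mv, threshold=2.0):
    """Return (is_foreground, diff_magnitude) for one target block.

    encoder_mv, global_mv: (x, y) motion vectors in pixels.
    threshold: hypothetical cutoff for a "sufficiently large" difference.
    """
    # Differential motion vector: global motion minus encoder motion.
    dx = global_mv[0] - encoder_mv[0]
    dy = global_mv[1] - encoder_mv[1]
    magnitude = math.hypot(dx, dy)
    return magnitude > threshold, magnitude
```

For blocks classified as foreground, the returned magnitude would take the place of the raw motion vector magnitude when the temporal frequency is computed.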

Computer-based methods for processing video data, codecs (encoders and decoders) for processing video data, and other computer systems and apparatus for processing video data may embody the foregoing principles of the present invention.

The foregoing will be apparent from the following more particular description of example embodiments of the invention, as illustrated in the accompanying drawings, in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments of the present invention.

A block diagram depicting the configuration of a standard encoder.
A block diagram depicting the steps involved in inter-prediction for a general encoder.
A block diagram depicting the steps involved in initial motion estimation via continuous block tracking.
A block diagram depicting unified motion estimation via a combination of continuous block tracking and enhanced predictive zonal search.
A plot depicting a recent measurement (2010) of the temporal contrast sensitivity function by Wooten et al.
A block diagram depicting the calculation of structural similarity (SSIM) in CIE 1976 Lab color space, according to an embodiment of the invention.
A block diagram depicting the general application of perceptual statistics to improve the perceptual quality of video encoding, according to an embodiment of the invention.
A block diagram depicting the application of perceptual statistics to modify inter-prediction via continuous block tracking to improve the perceptual quality of video encoding, according to an embodiment of the invention.
A block diagram depicting an example of a process for encoding with block quantization modified using an importance map.
A schematic diagram of a computer network environment in which embodiments are deployed.
A block diagram of a computer node in the network of FIG. 9A.

The entire teachings of all patents, published applications, and references cited herein are incorporated by reference. Example embodiments of the invention are described below.

The invention can be applied to various standard encodings. In the following, unless otherwise noted, the terms "conventional" and "standard" (often used together with "compression," "codec," "encoding," or "encoder") can refer to MPEG-2, MPEG-4, H.264, or HEVC. "Input block" refers, without loss of generality, to the basic coding unit of the encoder and may often be used interchangeably with "data block" or "macroblock." The current input block being encoded is referred to as the "target block."

<Video Encoding and Inter-Prediction via Continuous Block Tracking>
The encoding process may convert video data into a compressed, or encoded, format. Likewise, the decompression, or decoding, process may convert compressed video back into an uncompressed, or raw, format. The video compression and decompression processes may be implemented as an encoder/decoder pair, commonly referred to as a codec.

FIG. 1 is a block diagram of a standard transform-based, motion-compensated encoder. The encoder of FIG. 1 may be implemented in a software environment, a hardware environment, or a combination thereof. The encoder may include any combination of components, including, but not limited to, a motion estimation module 15 that feeds into an inter-prediction module 20, an intra-prediction module 30, a transform and quantization module 60, an inverse transform and quantization module 70, an in-loop filter 80, a frame store 85, and an entropy encoding module 90. The purpose of the prediction modules (both inter-prediction and intra-prediction) is to generate the best prediction signal 40 for a given input video block 10 ("input block" for short, or "macroblock" or "data block"). The prediction signal 40 is subtracted from the input block 10 to create a prediction residual 50, which undergoes transform and quantization 60. The quantized coefficients 65 of the residual then pass to the entropy encoding module 90, which encodes them into the compressed bitstream. The quantized coefficients 65 also pass through the inverse transform and quantization module 70, and the resulting signal (an approximation of the prediction residual) is added back to the prediction signal 40 to create a reconstructed signal 75 for the input block 10. The reconstructed signal 75 may be passed through an in-loop filter 80, such as a deblocking filter, and the (possibly filtered) reconstructed signal becomes part of the frame store 85, which aids prediction of future input blocks. The function of each component of the encoder shown in FIG. 1 is well known to one of ordinary skill in the art.
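The residual and reconstruction path of FIG. 1 can be illustrated with a toy sketch. The transform, the in-loop filter, and entropy coding are omitted, and the uniform scalar quantizer with step `qstep` is an assumption made for brevity, not the quantizer of any particular codec.

```python
def encode_block(input_block, prediction, qstep):
    """Toy version of the FIG. 1 loop for one block (transform omitted).

    Returns the quantized coefficients (what would be entropy coded) and
    the reconstruction (what would enter the frame store).
    """
    residual = [x - p for x, p in zip(input_block, prediction)]
    coeffs = [round(r / qstep) for r in residual]   # quantization (lossy)
    recon_residual = [c * qstep for c in coeffs]    # inverse quantization
    reconstruction = [p + r for p, r in zip(prediction, recon_residual)]
    return coeffs, reconstruction
```

The reconstruction generally differs from the input block, which is why the encoder predicts from reconstructed data, the same data the decoder will have.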

FIG. 2 depicts the steps in standard inter-prediction (30 in FIG. 1). The goal of inter-prediction is to encode new data using previously decoded data from earlier frames, taking advantage of temporal redundancy in the data. In inter-prediction, an input block 10 from the frame currently being encoded (also called the target frame) is "predicted" from a region of the same size within a previously decoded reference frame stored in the frame store 85 of FIG. 1. The two-component vector indicating the (x, y) displacement between the location of the input block in the frame being encoded and the location of its matching region in the reference frame is termed a motion vector. The process of motion estimation thus involves determining the motion vector that best links an input block to be encoded with its matching region in a reference frame.

Most inter-prediction processes begin with initial motion estimation (110 in FIG. 2), which generates one or more rough estimates of "good" motion vectors 115 for a given input block. This is followed by an optional motion vector candidate filtering step 120, in which multiple motion vector candidates can be reduced to a single candidate using an approximate rate-distortion metric. In rate-distortion analysis, the best motion vector candidate (prediction) is chosen as the one that minimizes the rate-distortion metric D + λR, where the distortion D is the error between the input block and its matching region, the rate R quantifies the cost (in bits) of encoding the prediction, and λ is a scalar weighting factor. The actual rate cost has two components: texture bits, the number of bits needed to encode the quantized transform coefficients of the residual signal (the input block minus the prediction), and motion vector bits, the number of bits needed to encode the motion vector. Motion vectors are usually encoded differentially, relative to already-encoded motion vectors. Because texture bits are not available at this early stage of the encoder, the rate portion of the rate-distortion metric is approximated by the motion vector bits, which are in turn approximated as a motion vector penalty factor dependent on the magnitude of the differential motion vector. In the motion vector candidate filtering step 120, this approximate rate-distortion metric is therefore used to select either a single "best" initial motion vector or a smaller set of "best" initial motion vectors 125. The initial motion vectors 125 are then refined by fine motion estimation 130, which performs a local search in the neighborhood of each initial estimate to determine a more precise estimate of the motion vector (and corresponding prediction) for the input block. The local search is usually followed by subpixel refinement, in which integer-valued motion vectors are refined to half-pixel or quarter-pixel precision via interpolation. The fine motion estimation block 130 produces a set of refined motion vectors 135.
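Candidate filtering with the approximate rate-distortion metric might be sketched as follows. The function names, the L1 norm for the differential motion vector, and the `penalty_per_unit` scale are assumptions; the description says only that the rate is approximated by a penalty that depends on the magnitude of the differential motion vector.

```python
def approx_rd_cost(distortion, mv, predicted_mv, lam, penalty_per_unit=1.0):
    """Approximate D + lambda * R, with R replaced by a motion vector penalty."""
    # Assumption: penalty grows with the L1 magnitude of the differential MV.
    diff = abs(mv[0] - predicted_mv[0]) + abs(mv[1] - predicted_mv[1])
    return distortion + lam * penalty_per_unit * diff

def best_candidate(candidates, predicted_mv, lam):
    """Pick the (mv, distortion) pair with the lowest approximate RD cost."""
    return min(
        candidates,
        key=lambda c: approx_rd_cost(c[1], c[0], predicted_mv, lam),
    )
```

A larger λ favors candidates close to the predicted motion vector even when their distortion is somewhat higher.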

Next, for a given fine motion vector 135, a mode generation module 140 generates a set of candidate predictions 145 based on the possible encoding modes that the encoder can use. These modes vary depending on the codec. Different encoding modes may account for (but are not limited to) interlaced versus progressive (field versus frame) motion estimation, direction of the reference frame (forward-predicted, backward-predicted, bi-predicted), index of the reference frame (for codecs such as H.264 and HEVC that allow multiple reference frames), inter-prediction versus intra-prediction (in certain scenarios, allowing a fallback to intra-prediction when no good inter-prediction exists), different quantization parameters, and various subpartitions of the input block. The full set of prediction candidates 145 undergoes "final" rate-distortion analysis 150 to determine the single best candidate. In "final" rate-distortion analysis, a precise rate-distortion metric D + λR is used, computing the prediction error D for the distortion portion (usually calculated as the sum of squared errors, SSE) and the actual encoding bits R (from the entropy encoding 90 in FIG. 1) for the rate portion. The final prediction 160 (or 40 in FIG. 1) is the one with the lowest rate-distortion score D + λR among all candidates, and this final prediction is passed to the subsequent steps in the encoder, along with its motion vector and other encoding parameters.

FIG. 3 depicts how initial motion estimation can be performed during inter-prediction via continuous block tracking (CBT). CBT is useful when there are gaps of multiple frames between the target frame and the reference frame from which temporal predictions are derived. For MPEG-2, a typical GOP structure of IBBPBBP (consisting of intra-predicted I-frames, bi-predicted B-frames, and forward-predicted P-frames) allows reference frames as many as three frames away from the current frame, because B-frames cannot act as reference frames in MPEG-2. In H.264 and HEVC, which allow multiple reference frames for each frame to be encoded, the same GOP structure allows reference frames six or more frames away from the current frame. For longer GOP structures (e.g., seven B-frames between P-frames), reference frames can be even farther from the target frame. When there are gaps of multiple frames between the current frame and the reference frame, CBT can generate better temporal predictions, because continuous tracking enables the encoder to capture motion in the data that standard temporal prediction methods miss.

The first step in CBT is to perform frame-to-frame tracking (210 in FIG. 3). For each input block 10 in a given frame, motion vectors are calculated both backward to the previous frame in the frame buffer 205 and forward to the next frame in the frame buffer. In one embodiment, frame-to-frame tracking operates on frames from the original source video rather than on reconstructed reference frames. This is advantageous because source video frames are not corrupted by quantization and other coding artifacts, so tracking based on source video frames more accurately represents the true motion field in the video. Frame-to-frame tracking may be carried out using either conventional block-based motion estimation (BBME) or hierarchical motion estimation (HME).

The result of frame-to-frame tracking is a set of frame-to-frame motion vectors 215 that signify, for each input block in a frame, the best matching region in the most recent frame in the frame buffer 205, and, for each block of the most recent frame in the frame buffer 205, the best matching region in the current frame. Continuous tracking 220 then aggregates the available frame-to-frame tracking information to create continuous tracks across multiple reference frames for each input block. Details of how to perform continuous tracking are found in the '784 application, which is incorporated herein by reference in its entirety. The output of continuous tracking 220 is the set of continuous block tracking (CBT) motion vectors 225 that track all input blocks in the current frame being encoded to their matching regions in past reference frames. In CBT, these CBT motion vectors become the initial motion vectors (125 in FIG. 2) and may be refined with fine motion estimation (130 in FIG. 2) as noted above.

FIG. 4 depicts how CBT can be combined with the EPZS method to create a unified motion estimation process, according to an embodiment of the invention. In FIG. 4, CBT generates its motion vectors through frame-to-frame tracking 210 and continuous tracking 220 for initial motion estimation 110, followed by local search and subpixel refinement 250 for fine motion estimation 130. EPZS generates its initial motion vectors through a candidate generation module 230, followed by a candidate filtering module 240 that filters via approximate rate-distortion analysis as detailed above. This is followed in turn by fine motion estimation 130 via local search and subpixel refinement 260. The resulting CBT motion vectors 255 and EPZS motion vectors 265 are both passed to the remaining inter-prediction steps (mode generation 140 and final rate-distortion analysis 150 in FIG. 2) to determine the overall "best" inter-prediction.

In an alternative embodiment, additional candidates may be added to the CBT motion vector candidates 255 and EPZS motion vector candidates 265 of FIG. 4. These candidates include (but are not limited to) random motion vectors, the (0, 0) motion vector, and the so-called "median predictor." Fine motion estimation 130 may be applied to a random motion vector to find the best candidate in its local neighborhood. The (0, 0) motion vector is one of the initial candidates in EPZS, but it is not always selected after EPZS candidate filtering (240 in FIG. 4), and even if it is selected after candidate filtering, fine motion estimation 130 may output a motion vector other than (0, 0). Explicitly including the (0, 0) motion vector (with no accompanying fine motion estimation) as a candidate for final rate-distortion analysis ensures that at least one low-magnitude, "low-motion" candidate is considered. Similarly, the "median predictor" is one of the initial candidates in EPZS but is not always selected after EPZS candidate filtering (240 in FIG. 4). The median predictor is defined as the median of the motion vectors previously calculated for the data blocks to the left, top, and top right of the data block currently being encoded. Explicitly including the median predictor (with no accompanying fine motion estimation) as a candidate for final rate-distortion analysis can be especially beneficial for encoding spatially homogeneous ("flat") regions of the video frame. In this alternative embodiment, then, five or more types of motion vector candidates (including, but not limited to, a CBT-derived motion vector, an EPZS-derived motion vector, a motion vector derived from a random motion vector, the (0, 0) motion vector, and the median predictor) may be passed to the remaining inter-prediction steps (mode generation 140 and final rate-distortion analysis 150 in FIG. 2).
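A minimal sketch of the median predictor follows. Taking the median separately for the x and y components is an assumption borrowed from common codec practice (H.264 motion vector prediction works this way); the description states only that the predictor is the median of the three neighboring motion vectors.

```python
def median_predictor(mv_left, mv_top, mv_top_right):
    """Component-wise median of three neighboring (x, y) motion vectors."""
    neighbors = (mv_left, mv_top, mv_top_right)
    xs = sorted(v[0] for v in neighbors)
    ys = sorted(v[1] for v in neighbors)
    return (xs[1], ys[1])  # middle element of each sorted triple
```

In a flat region the three neighbors tend to agree, so the predictor reproduces their common motion at almost no rate cost.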

<Computation of Importance Maps for Video Encoding>
Perceptual statistics may be used to compute importance maps that indicate which regions of a video frame are important to the human visual system (HVS).

One example of a perceptual statistic is the so-called temporal contrast sensitivity function (TCSF), which models the response of the human visual system (HVS) to temporally periodic stimuli. As noted in the Background section, the concept of the TCSF has existed since the 1950s (when it was introduced as a "temporal modulation transfer function"), but it has not previously been applied to video compression. FIG. 5 shows a recent measurement of the TCSF (Wooten, B. et al., 2010, "A practical method of measuring the temporal contrast sensitivity function," Biomedical Optics Express, 1(1):47-58), in the form of the log of the temporal contrast sensitivity as a function of the log of frequency (log frequency on the horizontal axis, log temporal contrast sensitivity on the vertical axis). The measured data points (circles in FIG. 5) are fitted with a third-order polynomial (solid line in FIG. 5), and this fit is used for all TCSF calculations below. The TCSF predicts that the HVS responds most strongly to moderate frequencies, with its response falling off slightly at low frequencies and rapidly at high frequencies.

Applying the TCSF to video compression requires a method of calculating temporal frequency, the input to the TCSF (the horizontal axis in FIG. 5). One way of calculating frequency, according to an embodiment of the invention, is described below. Frequency f is given by f = v/λ, where v is velocity and λ is wavelength. In one embodiment, the velocity v (in units of pixels/s) of the content of any data block can be calculated from the magnitude of the motion vectors generated by the encoder (e.g., 135 in FIG. 2, 215 or 225 in FIG. 3, or 255 or 265 in FIG. 4) as v = |MV| × frame rate / N, where |MV| is the magnitude of the motion vector of the data block, frame rate is the number of frames per second at which the video was generated, and N is the number of frames between the reference frame pointed to by the motion vector and the current frame.
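The velocity formula can be written out directly, as in the sketch below. The function name is hypothetical, and the Euclidean norm is assumed for |MV|.

```python
import math

def block_velocity(mv, frame_rate, n_frames):
    """v = |MV| * frame_rate / N, in pixels per second.

    mv: (x, y) motion vector in pixels; |MV| is taken as its Euclidean norm.
    n_frames: N, the frame distance between reference and current frame.
    """
    return math.hypot(mv[0], mv[1]) * frame_rate / n_frames
```

For example, a motion vector of (3, 4) pixels pointing to a reference three frames away in 30 frame/s video gives 5 × 30 / 3 = 50 pixels/s.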

波長λの適切な近似は、ＣＩＥ１９７６Ｌａｂ色空間（www://en.wikipedia.org/wiki/Lab_color_space）において算出される構造的類似度（ＳＳＩＭ）（Wang, Z. 達による2004, "Image quality assessment: From error visibility to structural similarity（画像品質評価：誤差可視度から構造的類似度まで）," IEEE Trans, on Image Processing, 13(4):600-612）の算出結果から導き出され得る。図６に、Ｌａｂ色空間におけるＳＳＩＭの算出の様子を示す。ＳＳＩＭは、ターゲットブロック３００（符号化すべき現在のデータブロック）とその動きベクトルが指し示す参照ブロック３１０との間で算出される。通常、エンコーダにより処理される映像データはＹＵＶ４２０などの標準の空間で表現されるので、次のステップは、それらターゲットブロック（符号３２０）および参照ブロック（符号３３０）の両方を一般的に文献に記載されている任意の手法を用いてＣＩＥ１９７６Ｌａｂ空間に変換することである。次に、Ｌａｂ空間におけるこれらのターゲットブロックと参照ブロックとの間の誤差ΔＥ（符号３４０）が、 A suitable approximation of the wavelength λ is the structural similarity (SSIM) calculated in the CIE 1976 Lab color space (www: //en.wikipedia.org/wiki/Lab_color_space) (Wang, Z. et al. 2004, “Image quality assessment: From error visibility to structural similarity, "IEEE Trans, on Image Processing, 13 (4): 600-612). FIG. 6 shows how the SSIM is calculated in the Lab color space. The SSIM is calculated between the target block 300 (the current data block to be encoded) and the reference block 310 pointed to by its motion vector. Usually, video data processed by the encoder is represented in a standard space such as YUV420, so the next step is to generally describe both the target block (reference numeral 320) and the reference block (reference numeral 330) in the literature. It is to convert to CIE 1976 Lab space using an arbitrary method. Next, the error ΔE (reference numeral 340) between these target block and reference block in Lab space is

Here, the subscript T denotes "target block" and the subscript R denotes "reference block." Finally, the SSIM 360 between the error ΔE and a zero matrix of the same dimension is computed as a measure of the color-space change in the data. SSIM as originally defined takes values between −1 and 1, with a value of 1 indicating perfect similarity (no spatial difference). To convert SSIM to the wavelength λ, one can use the spatial dissimilarity DSSIM = (1 − SSIM)/2, which takes values between 0 and 1, where 0 corresponds to short wavelengths (maximum spatial similarity) and 1 corresponds to long wavelengths (minimum spatial similarity). To convert to units of pixels, the dissimilarity value can be multiplied by the number of pixels in the block for which it is computed. In one embodiment, the SSIM block size is 8×8, so the DSSIM value is multiplied by 64. In this case, the final computation of frequency is given by
f = |MV| × frame rate / (N × 64 × (1 − SSIM)/2).
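
By way of illustration only, the frequency computation above might be sketched in Python as follows. The function name, the argument names, and the small `eps` guard against a zero wavelength (which arises when SSIM equals 1) are assumptions made for this sketch and do not appear in the description.

```python
import math

def temporal_frequency(mv, frame_rate, n_frames, ssim, block_pixels=64, eps=1e-6):
    """Temporal frequency f = v / wavelength for one data block.

    mv           -- (dx, dy) encoder motion vector, in pixels
    frame_rate   -- frames per second of the video
    n_frames     -- N, frames between the reference frame and the current frame
    ssim         -- SSIM in [-1, 1] between the Lab-space error and a zero matrix
    block_pixels -- pixels in the SSIM block (64 for the 8x8 embodiment)
    eps          -- hypothetical guard, not in the text, avoiding division by zero
    """
    speed = math.hypot(mv[0], mv[1]) * frame_rate / n_frames  # v, in pixels/second
    dssim = (1.0 - ssim) / 2.0                                # dissimilarity in [0, 1]
    wavelength = max(block_pixels * dssim, eps)               # wavelength in pixels
    return speed / wavelength
```

For example, a motion vector of (3, 4) at 30 frames per second with N = 1 and SSIM = 0.5 gives v = 150 pixels/second and a wavelength of 16 pixels.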

Once the frequency has been computed for a given target block, the TCSF value for that block can be determined from the curve fit (solid line) in FIG. 5. The TCSF takes values between 0 and 1.08 on a log10 scale, or between 1 and 11.97 on an absolute scale. Because different blocks in a frame take different TCSF values, the aggregate set of TCSF values over all blocks in the frame forms an importance map, in which high values indicate blocks that are perceptually important in terms of temporal contrast and low values indicate blocks that are perceptually unimportant.

In a further embodiment, the TCSF values from recent frames may be averaged for each data block so that the TCSF-based importance map does not fluctuate too much from frame to frame. One such calculation of the average TCSF, TCSF_avg, is TCSF_avg = 0.7 × TCSF_cur + 0.3 × TCSF_prev, where TCSF_cur is the TCSF value from the current frame and TCSF_prev is the TCSF value from the most recently encoded previous frame. Averaging in this way makes the TCSF calculation more robust.
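
As a minimal sketch of this per-block averaging, assuming the TCSF maps are held as flat Python lists with one value per data block (a layout chosen here only for illustration):

```python
def smooth_tcsf(tcsf_cur, tcsf_prev, w_cur=0.7, w_prev=0.3):
    """Blend the current and previous TCSF maps block by block.

    tcsf_cur, tcsf_prev -- lists of per-block TCSF values (same length)
    w_cur, w_prev       -- the 0.7 / 0.3 weights given in the text
    """
    return [w_cur * c + w_prev * p for c, p in zip(tcsf_cur, tcsf_prev)]
```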

In a further embodiment, information about the relative quality of the motion vectors generated by the encoder can be computed at various points in the encoding process and used to generate a true motion vector map (TMVM). The TMVM indicates, for each data block, how reliable its motion vector is. The TMVM, which takes values of 0 or 1, can be used as a mask to refine the TCSF, so that the TCSF is not applied to data blocks whose motion vectors are inaccurate (i.e., data blocks whose TMVM value is 0).

In one embodiment, the accuracy of a motion vector can be determined by estimating a global motion model for a given video frame, applying that motion model to each data block in the frame to determine a global motion vector for each data block, and then comparing the global motion vector with the encoder's motion vector (the encoder motion vector) for that data block. Global motion can be estimated from the aggregate set of encoding motion vectors from the frame, fitted to a six-parameter or eight-parameter affine motion model. If the global motion vector and the encoder motion vector are the same (or similar) for a given data block, the encoder motion vector is considered accurate (and TMVM = 1 for that data block). If the two vectors are not the same, their prediction errors (measured as sum of squared errors (SSE) or sum of absolute differences (SAD)) may be compared. If one error is small and the other is large, the motion vector with the smaller error is used for encoding and is considered accurate (TMVM = 1).
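
The comparison described above might be sketched as follows. The description does not quantify "similar," "small," or "large," so the tolerance `mv_tol` and the ratio `err_ratio` are hypothetical parameters, and returning TMVM = 0 when neither error is clearly smaller is likewise an assumption of this sketch.

```python
import math

def tmvm_value(encoder_mv, global_mv, encoder_err, global_err,
               mv_tol=1.0, err_ratio=2.0):
    """Return (tmvm, chosen_mv) for one data block.

    encoder_mv, global_mv   -- (dx, dy) motion vectors for the block
    encoder_err, global_err -- prediction errors (SSE or SAD) of each vector
    mv_tol                  -- hypothetical "similar vectors" tolerance, in pixels
    err_ratio               -- hypothetical factor separating "small" from "large"
    """
    diff = math.hypot(encoder_mv[0] - global_mv[0], encoder_mv[1] - global_mv[1])
    if diff <= mv_tol:
        return 1, encoder_mv            # vectors agree: encoder vector is accurate
    if encoder_err * err_ratio <= global_err:
        return 1, encoder_mv            # encoder vector predicts clearly better
    if global_err * err_ratio <= encoder_err:
        return 1, global_mv             # global vector predicts clearly better
    return 0, encoder_mv                # assumption: no clear winner, so unreliable
```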

In an alternative embodiment, the magnitude of the difference between the global motion vector and the encoder motion vector for a given data block is used to identify that the data block is foreground data (meaning that the content in the data block is moving differently from the rest of the frame, i.e., the background). In this embodiment, the TMVM is set to 1, and the TCSF is applied, only for foreground data. In a further embodiment, for data blocks identified as foreground data, the encoder motion vector is subtracted from the global motion vector to obtain a differential motion vector, and the magnitude of this differential motion vector (rather than that of the encoder motion vector) is used to compute the frequency for the TCSF (replacing |MV| in the formula above with |DMV|, where DMV is the differential motion vector).

In another embodiment, motion vector symmetry may be used to refine the TMVM. Motion vector symmetry (Bartels, C. and de Haan, G., 2009, "Temporal symmetry constraints in block matching," Proc. IEEE 13th Int'l. Symposium on Consumer Electronics, pp. 749-752) is defined as the relative symmetry of the pair of counterpart motion vectors obtained when the temporal direction of motion estimation is switched, and it serves as a measure of the quality of the computed motion vectors (the higher the symmetry, the better the quality of the motion vector). The "symmetry error vector" is defined as the difference between the motion vector obtained by forward motion estimation and the motion vector obtained by backward motion estimation. Low motion vector symmetry (a large symmetry error vector) is often an indicator of complex phenomena such as occlusion (one object moving in front of another, thereby covering or uncovering the background object), motion of objects onto or off the video frame, and illumination changes, all of which make it difficult to derive accurate motion vectors.

In one embodiment, low symmetry is declared when the magnitude of the symmetry error vector is greater than half the extent of the data block being encoded (e.g., for a 16×16 macroblock, greater than the magnitude of an (8, 8) vector). In another embodiment, low symmetry is declared when the magnitude of the symmetry error vector is greater than a threshold based on motion vector statistics derived during the tracking process, such as the mean motion vector magnitude plus a multiple of the standard deviation of the motion vector magnitude in the current frame or in some combination of recent frames. In one embodiment, data blocks whose motion vectors have low symmetry as defined above are automatically assigned a TMVM value of 0, while other data blocks retain their previous TMVM value from the comparison of the global motion vector with the encoder motion vector.
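
The two low-symmetry tests can be sketched as below. The symmetry error vector is taken as an input, so no sign convention for the forward and backward vectors is assumed. The multiplier `k` and the use of the population standard deviation are assumptions of this sketch, since the description says only "a multiple of the standard deviation."

```python
import math
import statistics

def low_symmetry_by_block(sym_err, block_size=16):
    """True if |sym_err| exceeds the magnitude of (block_size/2, block_size/2)."""
    half = block_size / 2
    return math.hypot(sym_err[0], sym_err[1]) > math.hypot(half, half)

def low_symmetry_by_stats(sym_err, mv_magnitudes, k=2.0):
    """True if |sym_err| exceeds mean + k * std of recent motion vector magnitudes.

    mv_magnitudes -- magnitudes gathered from the current or recent frames
    k             -- hypothetical multiplier for the standard deviation
    """
    threshold = statistics.mean(mv_magnitudes) + k * statistics.pstdev(mv_magnitudes)
    return math.hypot(sym_err[0], sym_err[1]) > threshold

def refine_tmvm(tmvm, is_low_symmetry):
    """Low-symmetry blocks get TMVM = 0; other blocks keep their earlier value."""
    return 0 if is_low_symmetry else tmvm
```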

Flat blocks have high spatial contrast sensitivity, but they tend to yield unreliable motion vectors because of the well-known aperture problem in computing motion vectors. Flat blocks can be detected, for example, using an edge detection process (where a block is declared flat if no edges are detected in it) or by comparing the variance of the data block to a threshold (where a variance below the threshold indicates a flat block). In one embodiment, block flatness may be used to modify the TMVM computed as above. For example, a block detected as flat may be reassigned a TMVM value of 0.
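
A minimal sketch of the variance-based flatness test follows. The threshold value of 10 is purely illustrative, as the description does not give one.

```python
import statistics

def is_flat_block(pixels, var_threshold=10.0):
    """True if the block's pixel variance falls below the threshold.

    pixels        -- flat list of sample values for one data block
    var_threshold -- hypothetical threshold; the text does not specify a value
    """
    return statistics.pvariance(pixels) < var_threshold

def apply_flatness(tmvm, flat):
    """Flat blocks are reassigned TMVM = 0; other blocks keep their value."""
    return 0 if flat else tmvm
```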

In one embodiment, the TMVM may be used as a mask to refine the TCSF, which depends on having reliable motion vectors. Since TMVM values are 0 or 1, multiplying, block by block, the TMVM value for a block by the TCSF value for that block has the effect of masking the TCSF. For blocks whose TMVM value is 0, the TCSF is "turned off," since the motion vector needed to compute the TCSF is unreliable. For blocks whose TMVM value is 1, the TCSF computation is considered reliable and is used with confidence in any of the ways described above.

In another embodiment, a spatial contrast map may be generated instead of, or in addition to, the temporal contrast map (the TCSF described above). In the present invention, simple metrics are used to measure spatial contrast, the opposite of which is termed "spatial complexity." In one embodiment, block variance, measured for both the luma and chroma components of the data, is used to measure the spatial complexity of a given input block. An input block with high variance is considered spatially complex and less noticeable to the HVS, and thus its spatial contrast is low.

In another embodiment, block luminance, measured for the luma component of the data, is used to refine the variance-based measurement of spatial complexity. An input block that has low variance (low spatial complexity, high spatial contrast) but is either very bright or very dark is automatically considered to have low spatial contrast, overriding its earlier classification as having high spatial contrast. The reason is that very dark and very bright regions are less noticeable to the HVS. The luminance thresholds for classifying a given block as very bright or very dark are application specific, but typical values for 8-bit video are above 170 for very bright and below 60 for very dark.
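
The variance test with the luminance override might be sketched as follows. For brevity the sketch uses luma samples only, whereas the description also measures variance on the chroma components. The variance threshold is a hypothetical value, while the 60 and 170 luminance thresholds are the typical 8-bit values given above.

```python
import statistics

def spatial_contrast(luma, var_threshold=100.0, dark=60, bright=170):
    """Classify a block's spatial contrast as "high" or "low".

    luma          -- flat list of 8-bit luma samples for one block
    var_threshold -- hypothetical variance threshold, not given in the text
    dark, bright  -- typical 8-bit luminance thresholds from the text
    """
    mean_luma = statistics.mean(luma)
    if mean_luma < dark or mean_luma > bright:
        return "low"                      # luminance override: hard to notice
    if statistics.pvariance(luma) < var_threshold:
        return "high"                     # low variance: low complexity
    return "low"                          # high variance: spatially complex
```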

The block variance, modified by block luminance as described above, may be computed for all input blocks of a video frame to form a spatial contrast map (SCM) that indicates regions of high and low noticeability to the HVS in terms of spatial contrast.

In one embodiment, the SCM may be combined with the TCSF (refined by the TMVM) to form a unified importance map. The unified map may be formed, for example, by multiplying, block by block, the SCM value for a given block by the TCSF value for that block, with both the SCM and the TCSF appropriately normalized. In another embodiment, the SCM may be used in place of the TCSF. In another embodiment, the SCM may be used to refine the TCSF. For example, in a block of high complexity the SCM value may override the TCSF value for that block, whereas in a block of low complexity the TCSF value for that block may be used directly.
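
One possible reading of the multiplicative combination is sketched below. The TCSF is first masked by the TMVM, then both maps are normalized and multiplied block by block. Dividing by the maximum value is an assumed form of normalization, since the description says only that the maps are normalized appropriately, and the flat-list layout is likewise an assumption.

```python
def normalize(values):
    """Scale a list of non-negative values so that its maximum is 1 (assumed scheme)."""
    peak = max(values)
    return [v / peak if peak > 0 else 0.0 for v in values]

def unified_importance_map(tcsf, tmvm, scm):
    """Combine per-block TCSF, TMVM and SCM values into one importance map.

    tcsf -- per-block TCSF values
    tmvm -- per-block 0/1 mask; 0 turns the TCSF off for that block
    scm  -- per-block spatial contrast values (higher means more noticeable)
    """
    masked_tcsf = [t * m for t, m in zip(tcsf, tmvm)]
    return [s * t for s, t in zip(normalize(scm), normalize(masked_tcsf))]
```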

<Application of importance maps to video encoding>
The importance maps described above can be applied to the video encoding process of either a standard encoder (FIG. 2) or a CBT encoder (FIG. 3) to improve the quality of the encoded bitstream.

FIG. 7 shows the general application of importance maps to video encoding. The input video frame 5 and frame store 85 are used to generate perceptual statistics 390. The perceptual statistics 390 are then applied to form importance maps 400 of the TCSF (refined by the TMVM) and/or the SCM, as described above. The perceptual statistics 390 may include (but are not limited to) motion vector magnitudes, block variance, block luminance, edge detection, and global motion model parameters. The input video frame 5 and frame store 85 are also input, as usual, to the encoding of the video frame at 450. The encoding includes the usual encoding steps (motion estimation 15, inter-prediction 20, intra-prediction 30, transform and quantization 60, and entropy encoding 90 in FIG. 1). In FIG. 7, however, the encoding 450 is enhanced by the importance maps 400, as described below.

FIG. 8A shows a specific application of importance maps to enhance video encoding using the CBT. FIG. 8A shows initial motion estimation (110 in FIG. 2) via the frame-to-frame tracking 210 and continuous tracking 220 steps from the CBT. Fine motion estimation 130 is then applied to the global CBT motion vectors 225, with the same fine motion estimation steps of local search and subpixel refinement described previously (250 in FIG. 4). This is again followed by a mode generation module 140 that generates a set of candidate predictions 145 based on the possible encoding modes of the encoder. As in FIG. 4, EPZS and other non-model-based candidates, such as the (0, 0) motion vector and the median predictor, may also be generated in parallel as part of a unified motion estimation framework (these other candidates are omitted from FIG. 8A to simplify the diagram). Again, the full set of candidate predictions 145, including all encoding modes for the CBT candidates and possibly all encoding modes for other, non-model-based candidates, undergoes "final" rate-distortion analysis 155 to determine the single best candidate. In "final" rate-distortion analysis, a precise rate-distortion metric D + λR is used, computing the prediction error D for the distortion portion and the actual encoding bits R (from the entropy encoding 90 in FIG. 1) for the rate portion. The final prediction 160 (or 40 in FIG. 1) is passed to the subsequent steps of the encoder, along with its motion vector and other encoding parameters.

In FIG. 8A, perceptual statistics 390 can be computed from the motion vectors derived from frame-to-frame motion tracking 210 and then applied to form importance maps 400 as described above. These importance maps 400 are then input to the final rate-distortion analysis 155. Again, the perceptual statistics 390 may include (but are not limited to) motion vector magnitudes, block variance, block luminance, edge detection, and global motion model parameters.

In one embodiment, importance maps are used to modify the rate-distortion optimization conditions accordingly. In a standard encoder (see FIG. 2), the full set of candidate predictions 145 for a given input block 10 undergoes "final" rate-distortion analysis 150 to determine the single best candidate. In "final" rate-distortion analysis, a precise rate-distortion metric D + λR is used, computing the prediction error D for the distortion portion and the actual encoding bits R (from the entropy encoding 90 in FIG. 1) for the rate portion. The candidate with the lowest score for the rate-distortion metric D + λR becomes the final prediction 160 for the given input block 10. In one embodiment of the invention, for the perceptually optimized encoder of FIG. 7 or FIG. 8, the importance map IM is computed at 400 and the final rate-distortion analysis 155 uses a modified rate-distortion metric D × IM + λR. In the modified metric, the IM value for a given input block multiplies the distortion term, so that the higher the IM value, the more importance is assigned to low-distortion solutions, since a high IM value indicates that the corresponding input block is perceptually important. The importance map may include the TCSF (possibly refined by the TMVM), the SCM, or a composite of both.
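
The effect of the modified metric D × IM + λR on candidate selection can be illustrated with a short sketch. The dictionary layout of a candidate and the numeric values in the example are assumptions made for illustration.

```python
def perceptual_rd_cost(distortion, rate_bits, lam, importance):
    """Modified rate-distortion score D * IM + lambda * R for one candidate."""
    return distortion * importance + lam * rate_bits

def best_candidate(candidates, lam, importance):
    """Pick the candidate prediction with the lowest modified score.

    candidates -- list of dicts with keys "D" (prediction error) and "R" (bits)
    importance -- IM value of the input block being encoded
    """
    return min(candidates,
               key=lambda c: perceptual_rd_cost(c["D"], c["R"], lam, importance))
```

With candidates `{"name": "a", "D": 100, "R": 10}` and `{"name": "b", "D": 60, "R": 40}` and λ = 2, candidate "a" wins when IM = 1 (120 versus 140), whereas the lower-distortion candidate "b" wins when IM = 3 (320 versus 260).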

In a further embodiment, in addition to the above, the distortion D in the rate-distortion metric may be computed as a weighted sum of SSE (sum of squared errors, the "standard" method of computing distortion) and SSIM computed in YUV space. The weighting γ can be computed adaptively so that the weighted average SSIM value over the first few (or most recent few) frames of the video equals the average SSE value over those same frames: γ × SSIM_avg = SSE_avg. For each input block, the modified rate-distortion metric is then (SSE + γ × SSIM) × IM + 2λR, where the multiplicative factor of 2 in front of the λR term accounts for the fact that there are two distortion terms. Including SSIM in the distortion measurement gives HVS perception an even larger role in the rate-distortion optimization, since SSIM accounts for structural information in the data.
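
The adaptive weight and the combined metric translate directly into code, as in the following sketch. The function names are illustrative, and the averages SSE_avg and SSIM_avg are assumed to have been gathered elsewhere over the chosen frames.

```python
def gamma_weight(sse_avg, ssim_avg):
    """Weight gamma chosen so that gamma * SSIM_avg equals SSE_avg."""
    return sse_avg / ssim_avg

def combined_rd_cost(sse, ssim, rate_bits, lam, importance, gamma):
    """Score (SSE + gamma * SSIM) * IM + 2 * lambda * R for one candidate."""
    return (sse + gamma * ssim) * importance + 2 * lam * rate_bits
```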

In another embodiment, importance maps (e.g., the TCSF with TMVM refinement, the SCM) may be used to modify the block-level quantization of the encoder, in addition to (or instead of) modifying the rate-distortion optimization. Quantization controls the relative quality at which a given input block is encoded: heavily quantized data results in lower-quality encoded output, while lightly quantized data results in higher-quality encoded output. The amount of quantization is controlled by a quantization parameter, QP. Standard encoders assign different QP values, QP_frame, to different frame types, with I-frames encoded with the smallest QP (highest quality), B-frames with the largest QP (lowest quality), and P-frames with an intermediate QP (intermediate quality).

The above technique thus represents a method of encoding a plurality of video frames, the video frames having non-overlapping target blocks, by using importance maps to modify the quantization of (and thus affect the encoding quality of) each target block in each video frame. The importance maps may be configured using temporal information (the TCSF with TMVM refinement), spatial information, or a combination of the two (i.e., a unified importance map). Because the importance maps indicate which parts of each video frame are most noticeable to human perception, the importance map values should modulate the QP for each target block as follows: (i) for blocks where the importance maps take on high values, the block QP is reduced relative to QP_frame, resulting in higher quality for those blocks; and (ii) for blocks where the importance maps take on low values, the block QP is increased relative to the frame quantization parameter QP_frame, resulting in lower quality for those blocks.

FIG. 8B shows an example process for using importance maps 400 to modify quantization during encoding. At 400, importance maps may be configured/created using temporal and/or spatial information derived from the perceptual statistics 390. Temporal information may be provided, for example, by a temporal contrast sensitivity function (TCSF), which indicates which target blocks are most temporally noticeable to a human observer, and a true motion vector map (TMVM), which indicates which target blocks correspond to foreground data, with the TCSF considered valid only for those target blocks identified as foreground data. Spatial information may be provided, for example, by a rule-based spatial complexity map (SCM).

The importance maps 400 are then used to modify the quantization step 430 within the encoding 450, as described above. In blocks where the importance maps take on high values, the block quantization parameter (QP) is reduced relative to the frame quantization parameter QP_frame, resulting in higher encoding quality for those blocks. In blocks where the importance maps take on low values, the block quantization parameter is increased relative to QP_frame, resulting in lower encoding quality for those blocks. By using the information from the importance maps, quantization can be modified in a way that improves the encoding quality of each target block to be encoded in each video frame.

In one embodiment, the TCSF map for a given frame can be used to adjust the frame QP on a block-by-block basis. One method of calculating the block QP, QP_block, is to relate the adjustment to the full TCSF map in the frame, following the method of Li, Z. et al., 2011, "Visual attention guided bit allocation in video compression," J. of Image and Vision Computing, 29(1):1-14. The resulting equation is QP_block = (TCSF_frame / (TCSF_block × M)) × QP_frame, where TCSF_frame is the sum of the TCSF values for all blocks in the frame, TCSF_block is the TCSF value for the block in question, QP_frame is the frame QP, and M is the number of blocks in the frame. In a further embodiment, the multiplicative factor (TCSF_frame / (TCSF_block × M)) may be scaled to prevent the final values of QP_block from becoming too high or too low relative to QP_frame.

In an alternative embodiment, the block-by-block adjustment of the QP via the TCSF map can be accomplished without reference to the full TCSF map for the frame. In this embodiment, the calculation of QP_block is simpler: QP_block = QP_frame / TCSF_block. In one embodiment, the resulting value of QP_block is range-limited (clipped) so that it does not exceed a predetermined maximum QP value or fall below a predetermined minimum QP value for the frame: QP_min ≤ QP_block ≤ QP_max.
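
Both TCSF-based block QP calculations can be sketched as follows. The results are left as floating-point values, and rounding to the integer QP values that a codec expects is not specified in the description and is left to the caller. The function names are illustrative.

```python
def qp_block_frame_relative(qp_frame, tcsf_block, tcsf_values):
    """QP_block = (TCSF_frame / (TCSF_block * M)) * QP_frame.

    tcsf_values -- TCSF values of all M blocks in the frame
    """
    tcsf_frame = sum(tcsf_values)   # sum over all blocks in the frame
    m = len(tcsf_values)            # number of blocks in the frame
    return (tcsf_frame / (tcsf_block * m)) * qp_frame

def qp_block_simple(qp_frame, tcsf_block, qp_min, qp_max):
    """QP_block = QP_frame / TCSF_block, clipped to [QP_min, QP_max]."""
    return min(max(qp_frame / tcsf_block, qp_min), qp_max)
```

In the frame-relative form, a block whose TCSF value equals the frame average keeps QP_frame, a block above the average receives a lower QP, and a block below the average receives a higher QP.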

In another embodiment, the outputs of the SCM may be used to modify the quantization parameter on a block-by-block basis using a rule-based approach. This embodiment begins by assigning high QP values (low quality) to blocks with high variance, because highly complex regions are less noticeable to the HVS. Blocks with low variance are assigned low QP values (high quality), because less complex regions are more noticeable to the HVS. In one embodiment, the QP assignment for a given block is bounded by the frame's maximum and minimum QP values, QP_max and QP_min, and is scaled linearly based on the block's own variance relative to the variances of the other blocks in the frame. In an alternative embodiment, only those blocks whose variance is higher than the average variance of the entire frame are assigned QP values between the frame QP, QP_frame, and QP_max, with the assignment scaled linearly such that QP_block = ((var_block − var_frame) / var_block) × (QP_max − QP_frame) + QP_frame. In this alternative embodiment, the QP assignment for high-variance blocks may be further refined by the TCSF. For example, if the block is considered foreground data in the TMVM and the log contrast sensitivity value of the TCSF (the vertical axis in FIG. 5) is less than 0.5, meaning that the block is temporally unimportant, QP_block is raised by 2. In an alternative embodiment, an edge detection process can be applied, and the QP of blocks containing edges can be adjusted to QP_min, overriding the QP previously assigned from spatial complexity, because edges are highly noticeable to the HVS. In a further embodiment, the QP of blocks that are either very bright or very dark can be readjusted to QP_max, overriding the QP previously assigned from variance and (possibly) edge detection, because very dark and very bright regions are less noticeable to the HVS. This process is known as luminance masking.
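
The rules above might be chained as in the following sketch. Applying the variance rule, the TCSF refinement, the edge override, and the luminance override in a single function is a choice made here for illustration, since the description presents them as separate embodiments. Leaving blocks at or below the frame's average variance at QP_frame is also an assumption, as is leaving the raised QP unclipped.

```python
def qp_from_variance(var_block, var_frame, qp_frame, qp_max):
    """Linear QP assignment for blocks above the frame's average variance."""
    if var_block <= var_frame:
        return float(qp_frame)      # assumption: other blocks keep QP_frame
    scale = (var_block - var_frame) / var_block
    return scale * (qp_max - qp_frame) + qp_frame

def rule_based_qp(var_block, var_frame, qp_frame, qp_min, qp_max,
                  is_foreground=False, log_tcsf=None,
                  has_edge=False, mean_luma=None, dark=60, bright=170):
    """Apply the rules in the order they are described in the text."""
    qp = qp_from_variance(var_block, var_frame, qp_frame, qp_max)
    # TCSF refinement for high-variance, temporally unimportant foreground blocks
    if (var_block > var_frame and is_foreground
            and log_tcsf is not None and log_tcsf < 0.5):
        qp += 2                     # not clipped here; the text gives no bound
    if has_edge:
        qp = qp_min                 # edges are highly noticeable
    if mean_luma is not None and (mean_luma < dark or mean_luma > bright):
        qp = qp_max                 # luminance masking
    return qp
```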

In a further embodiment, in addition to the above, the value of QP_max for high-variance blocks may be determined dynamically based on the quality level of the encoded video. The idea is that low-quality encodings cannot afford quality degradation in high-variance blocks, so QP_max should be closer to QP_frame, whereas high-quality encodings can afford a higher QP_max for high-variance blocks in order to save bits. The quality of the encoding may be updated at each I (intra) frame by calculating the average SSIM of blocks having a variance within ±5% of the average frame variance, with higher SSIM values corresponding to higher values of QP_max. In an alternative embodiment, the average SSIM is adjusted by the average variance of the frame, so that the quality indicator is calculated as the product of the average SSIM and the average frame variance.
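As a rough sketch of the quality measure described above, the following Python computes the average SSIM over blocks whose variance lies within ±5% of the average frame variance, optionally multiplied by the average frame variance. The text states only that higher SSIM corresponds to higher QP_max; the linear mapping in `dynamic_qp_max`, together with its `ssim_lo`, `ssim_hi` and `max_offset` parameters, is purely an assumption chosen for illustration, and all function names are hypothetical.

```python
def encoding_quality(block_vars, block_ssims, tolerance=0.05, scale_by_variance=False):
    """Average SSIM of blocks whose variance is near the frame average."""
    mean_var = sum(block_vars) / len(block_vars)
    lo, hi = mean_var * (1 - tolerance), mean_var * (1 + tolerance)
    selected = [s for v, s in zip(block_vars, block_ssims) if lo <= v <= hi]
    if not selected:
        return None  # no block close enough to the average variance
    mean_ssim = sum(selected) / len(selected)
    # Alternative embodiment: product of average SSIM and average variance.
    return mean_ssim * mean_var if scale_by_variance else mean_ssim


def dynamic_qp_max(qp_frame, mean_ssim, max_offset=10, ssim_lo=0.8, ssim_hi=1.0):
    """Hypothetical monotone mapping: higher SSIM gives a higher QP_max.

    Assumption: the mapping is linear in SSIM and saturates outside
    [ssim_lo, ssim_hi]; the source does not specify the mapping.
    """
    t = (mean_ssim - ssim_lo) / (ssim_hi - ssim_lo)
    t = min(1.0, max(0.0, t))
    return qp_frame + int(round(t * max_offset))
```

With this mapping, a low measured quality pulls QP_max down toward QP_frame, which matches the stated intent that low-quality encodings should not degrade high-variance blocks further.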

In a further embodiment, in addition to the above, blocks with very low variance (corresponding to flat regions, which are especially visible to the HVS) may be assigned fixed, low QP values to ensure high-quality encoding in those regions. For example, for I (intra) frames, blocks with variances from 0 to 10 may be assigned QP = 28, blocks with variances from 10 to 30 may be assigned QP = 30, and blocks with variances from 30 to 60 may be assigned QP = 32. QP assignments for blocks in P and B frames may then be derived from the above QPs using the ipratio and pbratio parameters, respectively.
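The example thresholds above amount to a small lookup table, sketched below in Python. The function names are hypothetical, and two points are assumptions rather than statements from the text: the handling of the boundary values 10, 30 and 60 (each boundary is placed in the higher bucket here), and the way P-frame and B-frame QPs are derived. For the latter, the sketch treats ipratio and pbratio as ratios of quantizer step size and uses the fact that, in H.264-style quantization, the step size doubles every 6 QP units; the default values 1.4 and 1.3 are illustrative.

```python
import math


def flat_block_qp_iframe(var_block):
    """Fixed QP for very-low-variance blocks in I frames (example values)."""
    if var_block < 0:
        raise ValueError("variance must be non-negative")
    if var_block < 10:
        return 28
    if var_block < 30:
        return 30
    if var_block < 60:
        return 32
    return None  # not a very-low-variance block; handled by other rules


def derive_pb_qp(qp_i, ipratio=1.4, pbratio=1.3):
    """Assumed derivation: ratios scale the quantizer step size,
    which doubles every 6 QP units, so each ratio adds 6 * log2(ratio)."""
    qp_p = qp_i + 6 * math.log2(ipratio)
    qp_b = qp_p + 6 * math.log2(pbratio)
    return int(round(qp_p)), int(round(qp_b))
```

Under these assumptions, an I-frame block assigned QP = 28 would receive roughly QP = 31 in a P frame and QP = 33 in a B frame.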

In a further embodiment, in addition to the above, low-variance blocks (for example, blocks having a variance between 60 and the average frame variance) are assigned the frame QP, QP_frame, and are then examined to determine whether further quality enhancement is needed. In one embodiment, blockiness artifacts may be detected by comparing the spatial complexity and luminance of the reconstructed pixels and the original pixels of the current (target) block being encoded with the spatial complexity and luminance of previously encoded surrounding blocks (for example, the blocks to the left, upper left, above, and upper right, where these exist). If there is a large difference between the spatial complexity and luminance measures of the reconstructed pixels of the target block and the corresponding measures of the neighboring blocks, but no such difference in spatial complexity and luminance between the original pixels of the target block and those of the neighboring blocks, the target block is considered "blocky." In this case, the QP value of the block is decreased (for example, by 2) to improve the encoding quality of the block. In another embodiment, the estimated quality of the target block is calculated by averaging the SSIM and QP values of previously encoded surrounding blocks (for example, the blocks to the left, upper left, right, and upper right, where these exist). The average QP value, QP_avg, is taken as the estimated QP, QP_block, for the target block. If the average SSIM value, SSIM_est, is less than 0.9, QP_block = QP_avg is decreased by 2 to improve quality. In a further embodiment, if the target block is identified as foreground data by the TMVM, QP_block is decreased by 2 only if the log contrast sensitivity value of the TCSF (the vertical axis in FIG. 5) is greater than 0.8 (meaning that the block is temporally important).
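The neighbor-based quality estimate can be illustrated with a short Python sketch. It assumes each available, already-encoded neighbor is supplied as a (QP, SSIM) pair; the function name, the `use_temporal_gate` flag, and the return convention are hypothetical. When the gate is enabled, the sketch follows the reading in the claims, applying the reduction only to blocks that are foreground data and have a TCSF log contrast sensitivity above 0.8.

```python
def estimate_block_qp(neighbors, use_temporal_gate=False,
                      is_foreground=False, log_tcsf=None):
    """Estimate QP_block from already-encoded neighbors.

    neighbors: list of (qp, ssim) pairs for the neighbors that exist.
    Returns (qp_block, ssim_est), or None if there are no neighbors.
    """
    if not neighbors:
        return None

    # QP_avg and SSIM_est are plain averages over the available neighbors.
    qp_avg = sum(qp for qp, _ in neighbors) / len(neighbors)
    ssim_est = sum(ssim for _, ssim in neighbors) / len(neighbors)

    qp_block = qp_avg
    if ssim_est < 0.9:
        if use_temporal_gate:
            # Further embodiment: enhance only temporally important
            # foreground blocks (log contrast sensitivity above 0.8).
            if is_foreground and log_tcsf is not None and log_tcsf > 0.8:
                qp_block -= 2
        else:
            qp_block -= 2

    return qp_block, ssim_est
```

Returning the averaged QP as a float leaves the rounding policy to the caller, which the text does not specify.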

The methods described above may use temporal importance maps (the TCSF, with or without TMVM refinement), spatial importance maps, or both. When both temporal and spatial importance maps are used, the result is referred to as an integrated importance map.

Importance maps generated from perceptual statistics as described above may be applied to any video compression framework that uses motion compensation to generate motion vectors, such that both rate-distortion analysis and quantization are enhanced to produce visually superior encodings at the same encoding size. Applying importance maps to video compression requires no special adaptation for use with the continuous block tracker (CBT) detailed above. Moreover, because the CBT provides the additional ability to determine accurately which motion vectors are true motion vectors, importance maps are more effective in a CBT-based encoding framework. The specific reason is that the CBT's frame-to-frame motion vectors (from frame-to-frame tracking 210 in FIG. 8A) are generated from the original frames of the video and not from reconstructed frames. The frame store 85 in FIGS. 2 and 7 for a general encoder contains reconstructed frames generated from the encoding process, whereas the frame store 205 in FIGS. 3, 4, and 8A contains the original video frames. Consequently, the CBT's frame-to-frame tracking (210 in FIGS. 3, 4, and 8) is better able to track the true motion of the video, and its frame-to-frame motion vectors generate more accurate true motion vector maps. By contrast, a general encoder's motion vectors are selected to optimize rate-distortion (compression) performance and may not reflect the true motion of the video.

Note that the generated importance maps may also be applied to intra-predicted frames, either by modifying the rate-distortion optimization among intra-prediction modes or by modifying the block-level quantization, following the techniques described above. For all-intra encoders, however, computing the TCSF requires a separate encoding stage (such as frame-to-frame tracking 210 in FIG. 8A) to generate motion vectors for each data block in the video frame.

<Digital processing environment>
Example implementations of the present invention may be realized in a software, firmware, or hardware environment. FIG. 9A illustrates one such environment. At least one client computer/device 950 (for example, a mobile phone or other computing device) and a cloud 960 (or a server computer or cluster of server computers) provide processing, storage, encoding, and decoding capabilities, input/output devices, and the like for executing application programs.

At least one client computer/device 950 may also be linked through a communications network 970 to other computing devices, including other client devices/processes 950 and at least one server computer 960. The communications network 970 may be part of a remote access network, a global network (for example, the Internet), a worldwide collection of computers, a local area network, a wide area network, or gateways that currently use respective protocols (TCP/IP, Bluetooth (registered trademark), etc.) to communicate with one another. Other electronic device/computer network architectures may also be used.

Embodiments of the invention may include means for encoding, tracking, modeling, filtering, tuning, decoding, or displaying video or data signal information. FIG. 9B is a diagram of the internal structure of a given computer/computing node (for example, a client processor/device/mobile phone device/tablet 950 or a server computer 960) in the processing environment of FIG. 9A, which may be used to facilitate encoding of such video or data signal information. Each computer 950, 960 contains a system bus 979, which is a set of actual or virtual hardware lines used for data transfer among the components of a computer or processing system. The bus 979 is essentially a shared conduit that connects the different elements of a computer system (for example, processor, encoder chip, decoder chip, disk storage, memory, and input/output ports) and enables the transfer of data between those elements. Attached to the system bus 979 is an input/output device interface 982 for connecting various input and output devices (for example, keyboard, mouse, displays, printers, and speakers) to the computer 950, 960. A network interface 986 allows the computer to connect to various other devices attached to a network (for example, the network illustrated at 970 in FIG. 9A). Memory 990 provides volatile storage for the computer software instructions 992 and data 994 used to implement a software implementation of the present invention.

Disk storage 995 provides non-volatile storage for the computer software instructions 998 (equivalently, "OS program") and data 994 used to implement an embodiment of the present invention. The disk storage 995 may also be used to store video in compressed format for the long term. A central processor unit 984, which executes computer instructions, is also attached to the system bus 979. Note that throughout this specification, "computer software instructions" and "OS program" are equivalent.

In one example, an encoder may be configured with computer-readable instructions 992 to encode video data using importance maps formed from temporal and spatial information. These importance maps may be configured to provide a feedback loop to an encoder (or elements thereof) to optimize the encoding/decoding of video data.

In one embodiment, the processor routines 992 and data 994 are a computer program product comprising an encoder (generally referenced 992). Such a computer program product includes a computer-readable medium, storable on a storage device 994, that provides at least a portion of the software instructions for the encoder.

The computer program product 992 may be installed by any suitable software installation procedure, as is well known in the art. In another embodiment, at least a portion of the software instructions of the encoder may also be downloaded over a cable, communication, and/or wireless connection. In other embodiments, the encoder system software is a computer program propagated signal product 907 (FIG. 9A) embodied on a non-transitory computer-readable medium, which, when executed, may be implemented as a propagated signal on a propagation medium (for example, a radio wave, an infrared wave, a laser wave, a sound wave, or an electrical wave propagated over a global network such as the Internet or over at least one other network). Such carrier media or signals provide at least a portion of the software instructions for the routines/program 992 of the present invention.

In alternative embodiments, the propagated signal is an analog carrier wave or digital signal carried on the propagation medium. For example, the propagated signal may be a digitized signal propagated over a global network (for example, the Internet), a telecommunications network, or another network. In one embodiment, the propagated signal is transmitted over the propagation medium over a period of time, such as the instructions for a software application sent in packets over a network over a period of milliseconds, seconds, minutes, or longer. In another embodiment, the computer-readable medium of the computer program product 992 is a propagation medium that the computer system 950 may receive and read. For example, the computer system 950 receives the propagation medium and identifies a propagated signal embodied in the propagation medium, as described above for the computer program propagated signal product.

While this invention has been particularly shown and described with reference to example embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.

Claims

A method of encoding a plurality of video frames,
the video frames having non-overlapping target blocks,
the method comprising:
encoding the plurality of video frames using an importance map, such that the importance map modifies quantization to affect the encoding quality of each target block to be encoded in each video frame,
the importance map being formed by:
configuring the importance map using temporal information and spatial information; and
computationally causing the importance map to indicate which parts of a video frame in the plurality of video frames are most noticeable to human perception, such that (i) in blocks where the importance map takes on high values, the block quantization parameter (QP) is reduced relative to the frame quantization parameter QP_frame, resulting in higher quality for those blocks, and (ii) in target blocks where the importance map takes on low values, the block quantization parameter is increased relative to the frame quantization parameter QP_frame, resulting in lower quality for those blocks.

The method of claim 1, wherein the spatial information is provided by a rule-based spatial complexity map (SCM), whose first step is to determine which target blocks in the frame have a variance greater than the average block variance var_frame in the frame, and
wherein blocks having a variance greater than the average block variance var_frame are assigned a quantization parameter (QP) value higher than the frame quantization parameter QP_frame, the assigned block quantization parameter (QP) value, QP_block, being scaled linearly between the frame quantization parameter QP_frame and the quantization parameter upper limit QP_max according to how much the block variance var_block exceeds the average block variance var_frame.

The method of claim 1, wherein the temporal information is provided by:
a temporal contrast sensitivity function (TCSF) that indicates which target blocks are most temporally noticeable to a human observer; and
a true motion vector map (TMVM) that indicates which target blocks correspond to foreground data,
the TCSF being considered valid only for target blocks identified as foreground data.

The method of claim 2, wherein the assigned block quantization parameter (QP) value, QP_block, of high-variance blocks is further refined by the TCSF and the TMVM, such that QP_block is increased by 2 when the TMVM identifies a target block as foreground data and the log contrast sensitivity value of the TCSF for that block is less than 0.5.

The method of claim 2, wherein the SCM further includes luminance masking, in which the assigned block quantization parameter value, QP_block, of target blocks that are very bright (luminance above 170) or very dark (luminance below 60) is readjusted to QP_max.

The method of claim 2, wherein the SCM further includes dynamic determination of the quantization parameter upper limit QP_max based on the quality level of the encoded video,
wherein, in this dynamic determination, the quality is measured using an average structural similarity (SSIM) calculation over target blocks in intra (I) frames, together with the average block variance var_frame of those frames, and
wherein, when the measured quality is low, the value of the quantization parameter upper limit QP_max is reduced toward the frame quantization parameter QP_frame.

The method of claim 2, wherein blocks with very low variance are assigned a fixed, low quantization parameter (QP) value as QP_block, to ensure high-quality encoding in those regions, such that the lower the block variance, the lower the value of QP_block (and thus the higher the quality).

The method of claim 7, wherein the low quantization parameter (QP) value assigned as QP_block to very-low-variance blocks is first fixed for I frames and is then determined for P and B frames using the ipratio and pbratio parameters.

The method of claim 7, wherein blocks whose variance is low but not considered very low are examined to determine whether quality enhancement is needed for those blocks, the examination comprising:
calculating an initial estimate of the block quantization parameter (QP), QP_block, by averaging the quantization parameter (QP) values of the already-encoded neighboring blocks to the left, upper left, right, and upper right of the current block;
calculating an estimate SSIM_est of the SSIM of the current block from the SSIM values of the already-encoded neighboring blocks to the left, upper left, right, and upper right of the current block; and
decreasing the value of QP_block by 2 when SSIM_est is less than 0.9.

The method of claim 9, wherein the quality enhancement is applied only to blocks that are identified as foreground data by the TMVM and for which the log contrast sensitivity value of the TCSF is greater than 0.8.

The method of claim 3, wherein the temporal frequency of the TCSF is calculated by using the SSIM in the color space domain between the target block and its reference block to approximate wavelength, and by using motion vector magnitude and frame rate to approximate velocity.

The method of claim 3, wherein the TCSF is calculated over multiple frames, such that the TCSF for the current frame is a weighted average of the TCSF maps of recent frames, with more recent frames receiving greater weight.

The method of claim 3, wherein the TMVM is set to 1 only for foreground data.

The method of claim 13, wherein foreground data is identified by calculating the difference between the encoder motion vector for a given target block and the global motion vector for that block, blocks with a sufficiently large difference being determined to be foreground data.

The method of claim 14, wherein, for data blocks identified as foreground data, the encoder motion vector is subtracted from the global motion vector to obtain a differential motion vector, and the magnitude of the differential motion vector is used in calculating the temporal frequency of the TCSF.

The method of claim 3, wherein the TCSF is calculated from motion vectors from an encoder.

The method of claim 1, wherein the importance map is an integrated importance map when it is configured using both the temporal information and the spatial information.

A system for encoding video data, comprising:
a codec that encodes a plurality of video frames using an importance map, the video frames having non-overlapping target blocks,
the importance map being configured to modify quantization to affect the encoding quality of each target block to be encoded in each video frame,
the importance map being formed by:
configuring the importance map using temporal information and spatial information, the importance map so configured being an integrated importance map; and
computationally causing the importance map to indicate which parts of a video frame in the plurality of video frames are most noticeable to human perception, such that (i) in blocks where the importance map takes on high values, the block quantization parameter (QP) is reduced relative to the frame quantization parameter QP_frame, resulting in higher quality for those blocks, and (ii) in target blocks where the importance map takes on low values, the block quantization parameter is increased relative to the frame quantization parameter QP_frame, resulting in lower quality for those blocks.

The encoder of claim 18, wherein the spatial information is provided by a rule-based spatial complexity map (SCM), whose first step is to determine which target blocks in the frame have a variance greater than the average block variance var_frame in the frame, and
wherein blocks having a variance greater than the average block variance var_frame are assigned a quantization parameter (QP) value higher than the frame quantization parameter QP_frame, the assigned block quantization parameter (QP) value, QP_block, being scaled linearly between the frame quantization parameter QP_frame and the quantization parameter upper limit QP_max according to how much the block variance var_block exceeds the average block variance var_frame.

The encoder of claim 18, wherein the temporal information is provided by:
a temporal contrast sensitivity function (TCSF) that indicates which target blocks are most temporally noticeable to a human observer; and
a true motion vector map (TMVM) that indicates which target blocks correspond to foreground data,
the TCSF being considered valid only for target blocks identified as foreground data.

The encoder of claim 19, wherein the assigned block quantization parameter (QP) value, QP_block, of high-variance blocks is further refined by the TCSF and the TMVM, such that QP_block is increased by 2 when the TMVM identifies a target block as foreground data and the log contrast sensitivity value of the TCSF for that block is less than 0.5.

The encoder of claim 19, wherein the SCM further includes luminance masking, in which the assigned block quantization parameter value, QP_block, of target blocks that are very bright (luminance above 170) or very dark (luminance below 60) is readjusted to QP_max.

The encoder of claim 19, wherein the SCM further includes dynamic determination of the quantization parameter upper limit QP_max based on the quality level of the encoded video,
wherein, in this dynamic determination, the quality is measured using an average structural similarity (SSIM) calculation over target blocks in intra (I) frames, together with the average block variance var_frame of those frames, and
wherein, when the measured quality is low, the value of the quantization parameter upper limit QP_max is reduced toward the frame quantization parameter QP_frame.

The encoder of claim 19, wherein blocks with very low variance are assigned a fixed, low quantization parameter (QP) value as QP_block, to ensure high-quality encoding in those regions, such that the lower the block variance, the lower the value of QP_block (and thus the higher the quality).

The encoder of claim 24, wherein the low quantization parameter (QP) value assigned as QP_block to very-low-variance blocks is first fixed for I frames and is then determined for P and B frames using the ipratio and pbratio parameters.

The system of claim 19, wherein blocks whose variance is low but not considered very low are examined to determine whether quality enhancement is needed for those blocks, the examination comprising:
calculating an initial estimate of the block quantization parameter (QP), QP_block, by averaging the quantization parameter (QP) values of the already-encoded neighboring blocks to the left, upper left, right, and upper right of the current block;
calculating an estimate SSIM_est of the SSIM of the current block from the SSIM values of the already-encoded neighboring blocks to the left, upper left, right, and upper right of the current block; and
decreasing the value of QP_block by 2 when SSIM_est is less than 0.9.

The system of claim 26, wherein the quality enhancement is applied only to blocks that are identified as foreground data by the TMVM and for which the log contrast sensitivity value of the TCSF is greater than 0.8.

The system of claim 20, wherein the temporal frequency of the TCSF is calculated by using the SSIM in the color space domain between the target block and its reference block to approximate wavelength, and by using motion vector magnitude and frame rate to approximate velocity.

The system of claim 20, wherein the TCSF is calculated over multiple frames, such that the TCSF for the current frame is a weighted average of the TCSF maps of recent frames, with more recent frames receiving greater weight.

The system of claim 20, wherein the TMVM is set to 1 only for foreground data.

The system of claim 30, wherein foreground data is identified by calculating the difference between the encoder motion vector for a given target block and the global motion vector for that block, blocks with a sufficiently large difference being determined to be foreground data.

The system of claim 20, wherein, for data blocks identified as foreground data, the encoder motion vector is subtracted from the global motion vector to obtain a differential motion vector, and the magnitude of the differential motion vector is used in calculating the temporal frequency of the TCSF.

The system of claim 20, wherein the TCSF is calculated from motion vectors from the encoder.

The system of claim 18, wherein the importance map is an integrated importance map when it is configured using both the temporal information and the spatial information.